solid versus daring play styles
A win against a stronger player is an upset. A loss against a weaker player is an upset in the opposite direction: the weaker player upset you. Two players of the same strength may have different rates of upsets in their games: Maybe one often beats stronger players but loses to weaker ones, while the other has more consistent results. It’s a difference in play style. I call the inconsistent, risk-taking player daring and the consistent, risk-avoiding player solid.
It should be possible to measure solidity from a tournament crosstable. But what is a mathematically correct way to do it? You don’t want to simply count upsets, because you expect to win a lot of games against a player who is only slightly better than you. You want to somehow take the severity of the upset into account. For example, a win rate that falls far from its expected value should count for more. But what is the expected win rate if, say, one player scored 80% in the tournament and the other scored 50%?
Here’s one way. Finding expected win rates is what Elo is for, so compute an Elo rating for each player in the tournament. You could use a program like bayeselo to take all the information into account, or you could simply use the tournament win rates to impute Elo values, essentially running the Elo function in reverse. The two methods will give slightly different answers, but not by much. Then you can apply the Elo function in the forward direction to the differences between Elo values to find an expected win rate for each pairing.
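As a sketch of that computation (the function names are mine; the logistic curve with a 400-point scale is the standard Elo expectation formula):

```python
import math

def elo_diff_from_rate(p):
    """Run the Elo function in reverse: the rating gap implied by win rate p."""
    return 400.0 * math.log10(p / (1.0 - p))

def expected_rate(d):
    """Run the Elo function forward: the win rate implied by a rating gap d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

# Impute ratings from tournament scores (relative to an arbitrary baseline),
# then predict each head-to-head rate from the difference in imputed ratings.
r80 = elo_diff_from_rate(0.80)   # roughly +241 points
r50 = elo_diff_from_rate(0.50)   # 0 points
print(expected_rate(r80 - r50))  # expected rate for the 80% scorer vs the 50% scorer
```

For the 80%-versus-50% example above, the imputed gap of about 241 points translates back into an expected head-to-head win rate of about 80% for the stronger player.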
Then for each pairing you have an actual win rate from the tournament results and a calculated expected win rate. By construction, the two agree in an average sense, but not individually. All that is left is to turn these numbers into a metric of upset-proneness, or daring. I haven’t tried to work out the math of what the metric should be, but the outline is obvious. For each opponent, pick out the pairings that are upsets: either a higher-than-expected win rate against a stronger opponent, or a lower-than-expected one against a weaker opponent. You might ignore the other pairings, on the theory that they are symmetrical anti-upsets, or you might try to refine the metric by assigning them upset values too (I think the results would be a little different but probably close). You want a difference function f(actualRate, expectedRate) that says how big the difference is; you might choose linear distance (subtract, then take the absolute value). Then you want a combining function g() that accumulates the difference values into a final metric; if f is a distance, you might choose the arithmetic mean.
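A minimal sketch of one such metric, assuming f is linear distance and g is the arithmetic mean as suggested above (the function name and input format are mine):

```python
def daring_metric(pairings):
    """Mean absolute upset size across a player's pairings.

    pairings: list of (actual_rate, expected_rate) pairs, one per opponent.
    A pairing counts as an upset if the result is better than expected
    against a stronger opponent (expected rate below 0.5) or worse than
    expected against a weaker one.
    """
    diffs = []
    for actual, expected in pairings:
        stronger_opponent = expected < 0.5
        upset = actual > expected if stronger_opponent else actual < expected
        if upset:
            diffs.append(abs(actual - expected))    # f: linear distance
    return sum(diffs) / len(diffs) if diffs else 0.0  # g: arithmetic mean
```

This takes the first option above and ignores non-upset pairings; assigning them upset values too would mean accumulating over every pairing instead of only the filtered ones.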
I’ve never seen a metric like this, but it seems like an easy idea. Has anyone seen it? Can you point it out?
Next: I want to try this for AIIDE 2020. If it works smoothly I may extend it to other past tournaments, to see whether bots retain a measurably consistent daring/solidity style over time.
Comments
Bytekeeper on :
Jay Scott on :
krasi0 on :
Joseph Huang on :
MicroDK on :
McRave on :
Hao Pan on :
1) When the matchmaking is shaped. Right now the go-to ladder, BASIL, picks opponents with similar ELO ratings much more often. I would argue that a matchmaking system which picks opponents evenly is better for the purposes here, as I believe the final result does depend on the number of matches played between the two bots.
2) The ELO is changing constantly. This problem is further magnified by fast-rising or fast-falling opponents. For example, imagine the time when Stardust first entered the ladder and beat the then #1: would this be an upset? Alternatively, say the author of the #1 bot unknowingly introduced a bug recently and caused the bot's ELO to drop straight to the bottom; do we count those games in this process?
So here's what I am thinking if I were to approach this problem:
0) Screen out bots that exhibit abnormal ELO behavior (rapidly decreasing, increasing, or fluctuating).
1) Establish a corridor which captures most of the fluctuation of a bot's ELO.
2) Once 1) is done, for every win against an opponent with an ELO above the upper bound of the corridor, we count an upset. Similarly, we count games lost to an opponent with an ELO below the lower bound as "being upset" (it makes sense to me to have the two sub-measures).
3) Almost there, but we also need to adjust the measure based on the number of games played between any two specific bots. The result from only 1 game is much less reliable than that from 100 games. Bayesian statistics could be helpful here.
There's not much to be done about BASIL's matchmaking system, unfortunately. The final result may still be biased.
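Steps 1) and 2) could be sketched like this (the names are hypothetical, and a simple trimmed min/max band is only one possible reading of the corridor):

```python
def elo_corridor(history, trim=0.1):
    """Step 1: a band that captures most of a bot's ELO fluctuation.

    history: list of ELO snapshots over time; trim drops the extreme
    fraction of snapshots at each end (a quantile-style corridor).
    """
    s = sorted(history)
    k = int(len(s) * trim)
    return s[k], s[len(s) - 1 - k]   # (lower bound, upper bound)

def count_upsets(my_history, games):
    """Step 2: count wins above the corridor and losses below it.

    games: list of (opponent_elo_at_game_time, won) pairs.
    """
    lo, hi = elo_corridor(my_history)
    wins_vs_stronger = sum(1 for elo, won in games if won and elo > hi)
    losses_vs_weaker = sum(1 for elo, won in games if not won and elo < lo)
    return wins_vs_stronger, losses_vs_weaker
```

The game-count adjustment from step 3) would then weight or shrink these raw counts by the number of games behind each pairing.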
Jay Scott on :
My first worry is that there may not be enough data to get good answers. The stability measure might be unstable because there is not enough information to estimate it accurately.
MicroDK on :