solid versus daring play styles
A win against a stronger player is an upset. A loss against a weaker player is an upset in the opposite direction: the weaker player upset you. Two players of the same strength may have different rates of upsets in their games: Maybe one often beats stronger players but loses to weaker ones, while the other has more consistent results. It’s a difference in play style. I call the inconsistent, risk-taking player daring and the consistent, risk-avoiding player solid.
It should be possible to measure solidity from a tournament crosstable. But what is a mathematically correct way to do it? You don’t want to simply count upsets, because you expect to win a lot of games against a player who is only slightly better than you. You want to somehow take the severity of the upset into account. For example, a win rate that falls far from its expected value should count for more. But what is the expected win rate if, say, one player scored 80% in the tournament and the other scored 50%?
Here’s one way. Finding expected win rates is what Elo is for, so compute an Elo rating for each player in the tournament. You could use a program like bayeselo to take all the information into account, or you could simply use the tournament win rates to impute Elo values, essentially running the Elo function in reverse. The two methods will give slightly different answers, but not by much. Then you can apply the Elo function in the forward direction to the differences between Elo values to find an expected win rate for each pairing.
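As a sketch of that computation (the function names are mine; the logistic curve with a 400-point scale is the standard Elo expectation formula):

```python
import math

def elo_diff_from_rate(p):
    """Run the Elo function in reverse: the rating gap implied by win rate p."""
    return 400.0 * math.log10(p / (1.0 - p))

def expected_rate(d):
    """Run the Elo function forward: the win rate implied by a rating gap d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

# Impute ratings from tournament scores (relative to an arbitrary baseline),
# then predict each head-to-head rate from the difference in imputed ratings.
r80 = elo_diff_from_rate(0.80)   # roughly +241 points
r50 = elo_diff_from_rate(0.50)   # 0 points
print(expected_rate(r80 - r50))  # expected rate for the 80% scorer vs the 50% scorer
```

For the 80%-versus-50% example above, the imputed gap of about 241 points translates back into an expected head-to-head win rate of about 80% for the stronger player.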
Then for each pairing you have an actual win rate from the tournament results and a calculated expected win rate. By construction, the two agree in an average sense, but not individually. All that is left is to turn these numbers into a metric of upset-proneness, or daring. I haven’t tried to work out the math of what the metric should be, but the outline is obvious. For each opponent, pick out the pairings that are upsets: either a higher-than-expected win rate against a stronger opponent, or a lower-than-expected one against a weaker opponent. You might ignore the other pairings, on the theory that they are symmetrical anti-upsets, or you might try to refine the metric by assigning them upset values too (I think the results would be a little different but probably close). You want a difference function f(actualRate, expectedRate) that says how big the difference is; you might choose linear distance (subtract, then take the absolute value). Then you want a combining function g() that accumulates the difference values into a final metric; if f is a distance, you might choose the arithmetic mean.
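A minimal sketch of one such metric, assuming f is linear distance and g is the arithmetic mean as suggested above (the function name and input format are mine):

```python
def daring_metric(pairings):
    """Mean absolute upset size across a player's pairings.

    pairings: list of (actual_rate, expected_rate) pairs, one per opponent.
    A pairing counts as an upset if the result is better than expected
    against a stronger opponent (expected rate below 0.5) or worse than
    expected against a weaker one.
    """
    diffs = []
    for actual, expected in pairings:
        stronger_opponent = expected < 0.5
        upset = actual > expected if stronger_opponent else actual < expected
        if upset:
            diffs.append(abs(actual - expected))    # f: linear distance
    return sum(diffs) / len(diffs) if diffs else 0.0  # g: arithmetic mean
```

This takes the first option above and ignores non-upset pairings; assigning them upset values too would mean accumulating over every pairing instead of only the filtered ones.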
I’ve never seen a metric like this, but it seems like an easy idea. Has anyone seen it? Can you point it out?
Next: I want to try this for AIIDE 2020. If it works smoothly I may extend it to other past tournaments, to see whether bots retain a measurably consistent daring/solidity style over time.
Comments
Bytekeeper on :
Jay Scott on :
krasi0 on :
Joseph Huang on :
MicroDK on :
McRave on :
Hao Pan on :
1) When the matchmaking is shaped. Right now the go-to ladder, BASIL, picks opponents with similar ELO ratings much more often. I would argue that a matchmaking system which picks opponents evenly is better for the purposes here, as I believe the final result does depend on the number of matches played between the two bots.
2) The ELO is changing constantly. This problem is further magnified by fast-rising or fast-falling opponents. For example, imagine the time when Stardust first entered the ladder and beat the then #1: would this be an upset? Alternatively, say the author of the #1 bot unknowingly introduced a bug recently and caused the bot's ELO to drop straight to the bottom; do we count those games in this process?
So here's what I am thinking if I were to approach this problem:
0) Screen out bots that exhibit abnormal ELO behavior (rapidly decreasing, increasing, or fluctuating).
1) Establish a corridor which captures most of the fluctuation of a bot's ELO.
2) Once 1) is done, for every win against an opponent with an ELO above the upper bound of the corridor, we count an upset. Similarly, we count games lost to an opponent with an ELO below the lower bound as "being upset" (it makes sense to me to have the two sub-measures).
3) Almost there, but we also need to adjust the measure based on the number of games played between any two specific bots. The result from only 1 game is much less reliable than that from 100 games. Bayesian statistics could be helpful here.
There's not much to be done about BASIL's matchmaking system, unfortunately. The final result may still be biased.
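Steps 1) and 2) could be sketched like this (the names are hypothetical, and a simple trimmed min/max band is only one possible reading of the corridor):

```python
def elo_corridor(history, trim=0.1):
    """Step 1: a band that captures most of a bot's ELO fluctuation.

    history: list of ELO snapshots over time; trim drops the extreme
    fraction of snapshots at each end (a quantile-style corridor).
    """
    s = sorted(history)
    k = int(len(s) * trim)
    return s[k], s[len(s) - 1 - k]   # (lower bound, upper bound)

def count_upsets(my_history, games):
    """Step 2: count wins above the corridor and losses below it.

    games: list of (opponent_elo_at_game_time, won) pairs.
    """
    lo, hi = elo_corridor(my_history)
    wins_vs_stronger = sum(1 for elo, won in games if won and elo > hi)
    losses_vs_weaker = sum(1 for elo, won in games if not won and elo < lo)
    return wins_vs_stronger, losses_vs_weaker
```

The game-count adjustment from step 3) would then weight or shrink these raw counts by the number of games behind each pairing.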
Jay Scott on :
My first worry is that there may not be enough data to get good answers. The stability measure might be unstable because there is not enough information to estimate it accurately.
MicroDK on :