solidity in AIIDE 2020 - part 3
I tried to be clever and make a solidity metric that was also a statistical test, so that it was easy to tell when the numbers were meaningful and when they were just noise. It didn’t work the way I wanted. Then I tried to be clever differently, and converted the win rates to elo differences, so that the metric came out in elo points. It’s easy to understand, easy to work with, and mathematically sound, because elo is linear: differences on the elo scale add. But the small change from a win rate of 98% to a win rate of 99% corresponds to the big elo difference of 122 points, so the measure was dominated by opponents at the extremes, blowing up the statistical uncertainty. OK, enough! Enough cleverness! Do it the easy way!
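To see where the 122 points come from, here is a minimal sketch of the win rate to elo conversion. It assumes the standard logistic elo curve (400 points per factor of 10 in the odds); the function name `elo_diff` is made up for illustration and is not taken from any tournament tooling.

```python
import math


def elo_diff(win_rate: float) -> float:
    """Elo difference implied by a win rate, assuming the standard logistic curve.

    Hypothetical helper for illustration. Win rates of exactly 0 or 1
    have no finite elo difference, so they are rejected.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# The same one-point change in win rate, at the extreme and in the middle.
extreme_gap = elo_diff(0.99) - elo_diff(0.98)
middle_gap = elo_diff(0.51) - elo_diff(0.50)

print(f"98% -> 99%: {extreme_gap:.1f} elo")
print(f"50% -> 51%: {middle_gap:.1f} elo")
```

The step from 98% to 99% comes out at about 122 elo, while the step from 50% to 51% is only about 7. That is why a handful of lopsided matchups can swamp everything else in an elo-based measure.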
Here is a simple measure of goodness of fit: the root-mean-square (rms) deviation of actual from expected win rate, taken over a bot's opponents. This is not solidity, it is more like consistency or predictability: A small number means that the blue and green curves are close together. The numbers are slightly lower than they should be, because I used a spreadsheet and it was easier to include the self-matchups with a pretend 50% win rate and 0 deviation.
| bot | rms deviation |
|---|---|
| stardust | 7.6% |
| purplewave | 11.3% |
| bananabrain | 9.9% |
| dragon | 15.5% |
| mcrave | 13.6% |
| microwave | 15.5% |
| steamhammer | 16.9% |
| daqin | 21.2% |
| zzzkbot | 23.7% |
| ualbertabot | 9.0% |
| willyt | 15.7% |
| ecgberht | 14.7% |
| eggbot | 6.8% |
You can eyeball the graphs and compare these numbers, and you should see that the numbers are a fair summary of how well the blue and green lines match. The bot that mostly won and the bot that mostly lost are good fits, DaQin and ZZZKBot are pretty wild, and UAlbertaBot stands out as unusually consistent. In fact, Stardust, UAlbertaBot, and EggBot all play fixed strategies (one per race for random UAlbertaBot), so it should be no surprise that they are consistent. The next most consistent by this measure is BananaBrain, which plays a wide range of strategies very unpredictably, so it is a surprise.
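For anyone who wants to redo the calculation outside a spreadsheet, here is a rough sketch of the rms deviation. The win rates below are invented placeholders, not tournament data, and `rms_deviation` is a hypothetical name. The second call mimics the spreadsheet shortcut of including a self-matchup with a pretend 50% win rate and 0 deviation.

```python
import math


def rms_deviation(actual, expected):
    """Root-mean-square deviation of actual from expected win rates.

    Both arguments are sequences of win rates as fractions (0.0 to 1.0),
    one entry per opponent, in the same order.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    if not actual:
        raise ValueError("need at least one matchup")
    squared = [(a - e) ** 2 for a, e in zip(actual, expected)]
    return math.sqrt(sum(squared) / len(squared))


# Invented example values for one bot against three opponents.
actual = [0.90, 0.60, 0.30]
expected = [0.80, 0.65, 0.40]

without_self = rms_deviation(actual, expected)

# Spreadsheet shortcut: add the self-matchup as 50% actual, 50% expected.
with_self = rms_deviation(actual + [0.50], expected + [0.50])

print(f"without self-matchup: {without_self:.1%}")
print(f"with self-matchup:    {with_self:.1%}")
```

The self-matchup adds a zero to the sum of squares but still counts in the denominator, so the second figure is always a little smaller. That is the "slightly lower than correct" effect mentioned above.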
Next: To turn this into a solidity metric is a matter of extracting the portion of deviation which is due to upsets. It will take a bit of detail work with the spreadsheet. I’m out of time today, so I’ll do that tomorrow. It will be interesting to judge whether consistency or solidity is the more useful metric.
Comments
MicroDK on :
To me, it is a surprise that ZZZKBot is so inconsistent, because it plays only a few strategies. But if you think about it, they are all rushy / risky openings: 4pool, 9PoolMuta, 9PoolHydra etc... Some opponents are good against these strategies, some are not, and some have learned the answers to these strategies.
Jay Scott on :
Dan on :
I think the challenge of nailing down the metric lies partly in not yet having found the word which describes what you're looking for.
Jay Scott on :