solidity in AIIDE 2020 - part 3

I tried to be clever and make a solidity metric that was also a statistical test, so that it was easy to tell when the numbers were meaningful and when they were just noise. It didn’t work the way I wanted. Then I tried to be clever differently, and converted the results to Elo differences, so that the metric ended up as a difference in Elo. It’s easy to understand, easy to work with, and mathematically sound, because Elo is linear. But the small change from a win rate of 98% to a win rate of 99% corresponds to the big Elo difference of 122 points, so the measure was dominated by opponents at the extremes, blowing up the statistical uncertainty. OK, enough! Enough cleverness! Do it the easy way!
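To see why the extremes dominate, here is a minimal sketch of the win rate to Elo conversion. It assumes the standard logistic Elo model with a scale of 400 points; the post does not say which formula was used, and the function name `elo_diff` is just for illustration.

```python
import math

def elo_diff(win_rate):
    """Elo difference implied by a win rate, assuming the standard
    logistic Elo model with a 400-point scale (an assumption here)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# One percentage point near the extreme is a large Elo step...
extreme_step = elo_diff(0.99) - elo_diff(0.98)
# ...while the same percentage point near 50% is a small one.
middle_step = elo_diff(0.51) - elo_diff(0.50)

print(round(extreme_step))  # 122
print(round(middle_step))   # 7
```

Under this model, a one-point sampling wobble against a very weak or very strong opponent moves the metric far more than the same wobble against an even opponent, which is the blow-up in uncertainty described above.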

Here is a simple measure of goodness of fit: the rms deviation of actual from expected win rate. This is not solidity, it is more like consistency or predictability: A small number means that the blue and green curves are close together. The numbers are slightly too low, because I used a spreadsheet and it was easier to include the self-matchups with a pretend 50% win rate and 0 deviation.

bot	rms deviation
stardust	7.6%
purplewave	11.3%
bananabrain	9.9%
dragon	15.5%
mcrave	13.6%
microwave	15.5%
steamhammer	16.9%
daqin	21.2%
zzzkbot	23.7%
ualbertabot	9.0%
willyt	15.7%
ecgberht	14.7%
eggbot	6.8%
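The calculation behind the table can be sketched as follows. This is an illustration, not the spreadsheet itself: the function names and the sample win rates are made up, and the correction assumes the only distortion is one extra zero-deviation self-matchup per bot.

```python
import math

def rms_deviation(actual, expected):
    """Root-mean-square difference between actual and expected win rates,
    one pair of values per opponent."""
    if len(actual) != len(expected) or not actual:
        raise ValueError("need two equal-length, non-empty lists")
    total = sum((a - e) ** 2 for a, e in zip(actual, expected))
    return math.sqrt(total / len(actual))

def without_self_matchup(rms_with_self, n_bots):
    """Undo the effect of one zero-deviation self-matchup: the sum of
    squares was divided by n_bots instead of n_bots - 1."""
    return rms_with_self * math.sqrt(n_bots / (n_bots - 1))

# Hypothetical win rates against three opponents (not tournament data).
actual = [0.9, 0.6, 0.3]
expected = [0.8, 0.6, 0.5]

true_rms = rms_deviation(actual, expected)
# Same data with a pretend 50% self-matchup appended, as in the spreadsheet.
padded_rms = rms_deviation(actual + [0.5], expected + [0.5])

print(padded_rms < true_rms)
print(abs(without_self_matchup(padded_rms, 4) - true_rms) < 1e-12)
```

If the 13 bots in the table are the whole field, the correction factor is sqrt(13/12), so each figure would rise by about 4% of its value. The ordering of the bots would not change.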

You can eyeball the graphs and compare these numbers, and you should see that the numbers are a fair summary of how well the blue and green lines match. The bot that mostly won and the bot that mostly lost are good fits, DaQin and ZZZKBot are pretty wild, and UAlbertaBot stands out as unusually consistent. In fact, Stardust, UAlbertaBot, and EggBot all play fixed strategies (one per race for random UAlbertaBot), so it should be no surprise that they are consistent. The next most consistent by this measure is BananaBrain, which plays a wide range of strategies very unpredictably, so it is a surprise.

Next: To turn this into a solidity metric is a matter of extracting the portion of deviation which is due to upsets. It will take a bit of detail work with the spreadsheet. I’m out of time today, so I’ll do that tomorrow. It will be interesting to judge whether consistency or solidity is the more useful metric.

Comments

MicroDK on Thursday, January 14. 2021:

These numbers are aligned with what I eyeballed: Stardust and BananaBrain being very consistent and DaQin and ZZZKBot not being consistent. I actually overlooked UAlbertaBot and EggBot as being very consistent... maybe because I was looking at the top... the graphs, to me, look less aligned than Stardust and BananaBrain.

To me, it is a surprise that ZZZKBot is so inconsistent because it plays a small number of strategies. But if you think about it, they are all rushy / risky openings: 4pool, 9PoolMuta, 9PoolHydra etc... Some opponents are good against these strategies, some are not and some have learned the answers to these strategies.

Jay Scott on Thursday, January 14. 2021:

It’s also possible that consistency of strategy might not be related to consistency of results across opponents. It’s intuitive that it should be, but is it truly?

Dan on Thursday, January 14. 2021:

Arguably the opposite: Having only one strategy means that you can more easily 0% someone who plays an answer to it.

I think the challenge of nailing down the metric is based partly in not yet finding the word which describes what you're looking for.

Jay Scott on Thursday, January 14. 2021:

Ha ha, you mean not yet changing the meaning of the word to what I want, like Humpty Dumpty! Technical terms are often existing words given new meanings.
