
solidity in AIIDE 2020 - part 5

A little more on daring/solid before I post about Steamhammer’s bugs.

Are the numbers reliable? Are results repeatable? If I measure another competition, will the solidity measure of the same bots come out similarly? If I measure A as more solid than B, is it true? Does solidity mean anything?

Statistically, there are two parts to the question. One part is, given a fixed set of bots, what is the spread of the solidity numbers? How many games do you need to feel sure you can tell A is more solid than B? In principle, that can be answered mathematically by running probability distributions forward through the calculation. That would be a useful exercise anyway, since it could suggest better calculations. But it turns out that I am not a statistician, and I don’t want to do it. It might be easier to answer by Monte Carlo analysis: Simulate a large number of tournaments, and see the spreads that come out.
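To make the Monte Carlo idea concrete, here is a minimal sketch of the "fixed set of bots" question: draw a set of assumed true pairwise win probabilities, simulate many round-robin tournaments from them, and look at the spread of a per-bot number across simulations. The measure in the sketch (root-mean-square loss rate against opponents the bot beats overall) is a stand-in I made up for illustration, not the actual solidity/upset deviation calculation from the earlier parts; the bot count, game counts, and probability ranges are likewise assumptions.

```python
# Monte Carlo sketch: how much does a solidity-like number vary between
# tournaments played under identical conditions? (stand-in measure only)

import random

def simulate_tournament(p, games_per_pair, rng):
    """Return a win-count matrix w[i][j] = games bot i won against bot j."""
    n = len(p)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            for _ in range(games_per_pair):
                if rng.random() < p[i][j]:
                    w[i][j] += 1
                else:
                    w[j][i] += 1
    return w

def stand_in_solidity(w, i):
    """Hypothetical measure: RMS loss rate against opponents bot i beats overall."""
    shortfalls = []
    for j in range(len(w)):
        if j == i:
            continue
        games = w[i][j] + w[j][i]
        rate = w[i][j] / games
        if rate > 0.5:                      # bot i is the favorite against j
            shortfalls.append((1.0 - rate) ** 2)
    if not shortfalls:
        return 0.0
    return (sum(shortfalls) / len(shortfalls)) ** 0.5

rng = random.Random(1)
n_bots, games_per_pair, n_sims = 10, 80, 500

# Assumed "true" pairwise win probabilities, drawn once and held fixed.
p = [[0.5] * n_bots for _ in range(n_bots)]
for i in range(n_bots):
    for j in range(i + 1, n_bots):
        p[i][j] = rng.uniform(0.1, 0.9)
        p[j][i] = 1.0 - p[i][j]

samples = [[] for _ in range(n_bots)]
for _ in range(n_sims):
    w = simulate_tournament(p, games_per_pair, rng)
    for i in range(n_bots):
        samples[i].append(stand_in_solidity(w, i))

for i, s in enumerate(samples):
    mean = sum(s) / len(s)
    sd = (sum((x - mean) ** 2 for x in s) / len(s)) ** 0.5
    print(f"bot {i}: mean {mean:.3f}, sd {sd:.3f}")
```

The standard deviations that come out of a run like this give a rough sense of how many games per pairing you need before two bots' numbers can be told apart.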

The other part is, how does repeatability vary as the participants in the tournament vary? Will the same bot get a similar solidity number in a tournament where many of its opponents are different? What if it is a tournament where the average bot is (say) more solid than those in the original tournament? Are there other player characteristics that might make a difference? Does repeatability improve as the number of participants increases, as you would expect? That can also be answered by Monte Carlo analysis, but you’ll have to make more assumptions about how players behave. I don’t see any substitute for analyzing actual past tournaments, at least as a first step to understand the important factors.
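For the second part, any simulation has to assume something about how players behave. One possible assumption, sketched below, is to give each bot an Elo-like strength and derive win probabilities from the strength gap, then measure the same bot's upset rate against two fields of opponents that differ in average strength. Every name and parameter here is illustrative, not taken from the actual tournament analysis software.

```python
# Sketch of the "varying participants" question: the same bot, measured
# against a weaker field and a stronger field, under an assumed Elo-like
# model of player behavior.

import random

rng = random.Random(2)

def win_prob(a, b):
    """Logistic win probability from Elo-like strengths (an assumption)."""
    return 1.0 / (1.0 + 10 ** ((b - a) / 400.0))

def upset_rate(strengths, focus, games_per_pair):
    """Fraction of games the focus bot loses to weaker opponents."""
    losses = games = 0
    for j, s in enumerate(strengths):
        if j == focus or s >= strengths[focus]:
            continue                        # only count weaker opponents
        p = win_prob(strengths[focus], s)
        for _ in range(games_per_pair):
            games += 1
            if rng.random() >= p:
                losses += 1
    return losses / games if games else 0.0

focus_strength = 1600.0
# Two fields of opponents: one weaker on average, one stronger on average.
field_a = [rng.gauss(1450, 150) for _ in range(12)]
field_b = [rng.gauss(1650, 150) for _ in range(12)]

for name, field in (("weaker field", field_a), ("stronger field", field_b)):
    strengths = [focus_strength] + field
    print(f"{name}: upset rate {upset_rate(strengths, 0, 100):.3f}")
```

Even this toy version shows the same bot getting different numbers depending on who else is in the tournament, which is exactly why past real tournaments need to be analyzed too.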

I will analyze past tournaments, but not now. For the moment, I think my intuitive answers to both parts of the question are good enough: AIIDE 2020 does have enough games that the bots can likely be ordered by solidity without big mistakes, and it does not have enough varied participants to be sure that a solidity measurement from one tournament is useful for predicting the next one. In time, I want to automate the whole calculation and include it in my suite of tournament analysis software, so I can report on it as a matter of course. Right now I’d rather let the ideas simmer for a while and see if something better can be cooked up.

But mainly I want to get back to Steamhammer and make it great!


Comments

MicroDK on :

3.6% upset deviation for Microwave is quite low. I would have thought it was higher, but I also think the number is highly dependent on the opponents in the tournament.
If we look at BASIL, Microwave is upset much more often. But the tournament is also very different, in that the bots get more games against roughly equal opponents and fewer games against much higher- and much lower-ranked opponents. Looking at the Elo for the last 30 days on BASIL, Microwave has a high deviation compared to other bots, e.g. Steamhammer, which means that it loses a lot more games to lower-ranked bots than Steamhammer does. Steamhammer looks much more stable/solid.

Jay Scott on :

Yes, the opponents and the distribution of games among opponents are definitely important.

How would the SSCAIT round robin look, with many opponents and only 2 games each? How would BASIL look, with many opponents and a skewed distribution of games? I hope to find out in time. The numbers might look different between the two, even though their sets of players are so similar.
