comparing strength across time
We don’t get many tournaments of bots versus humans. I don’t think there have been any with conditions controlled well enough that we can judge how strong bots are and how they are improving: Enough human participants, of known strength, with known levels of familiarity with computer play, finishing enough games. Then hold events across years so we can compare. We have to make do with seeing how bots are improving against other bots. Here is my best idea so far for comparing strength across tournaments.
1. We need 2 tournaments, preferably round robin, that share some participants—exactly identical bots, the more the better. We can’t do it with humans, because we can’t get exactly identical people across time. Ideally the maps should be the same too. AIIDE has more games, and SSCAIT has more shared participants; either should work, but I think SSCAIT may work better for this purpose despite being short by comparison. You could also compare between AIIDE and SSCAIT, but it would not work as well: It would take extra effort to make sure you know which players are exactly identical, the different lengths of the tournaments mean that each provides a different amount of evidence to support the ratings, and you could get confusing results for learning bots.
2. Pool all the games from both tournaments and compute elo ratings. If some participants which are not identical have the same names, distinguish them somehow—Steamhammer 2017 versus Steamhammer 2018, or whatever.
3. The identical players have identical strength in both tournaments, so consider their elo ratings as fixed. For each tournament separately, compute the elo ratings of the remaining players while keeping the ratings of the identical players fixed. The fixed ratings are benchmarks that keep the elo comparison stable for the remaining players (the idea has been used before). There is a small code sketch of steps 2 and 3 after this list.
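Here is a minimal Python sketch of steps 2 and 3, under some assumptions of my own: each game is recorded as a (winner, loser) pair, draws are ignored, and the game lists, anchor bots, and toy data are hypothetical. Repeated small Elo updates with a shrinking K factor stand in for a proper maximum-likelihood fit.

    def expected_score(r_a, r_b):
        """Expected win rate of a player rated r_a against one rated r_b."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def fit_elo(games, ratings=None, fixed=frozenset(), k=16.0, passes=100, base=1500.0):
        """Fit Elo ratings from a list of (winner, loser) games.

        Players named in `fixed` keep their incoming rating untouched;
        everyone else is updated. Many passes over the game list with a
        shrinking update size give rough convergence.
        """
        ratings = dict(ratings or {})
        for winner, loser in games:
            ratings.setdefault(winner, base)
            ratings.setdefault(loser, base)
        for p in range(passes):
            step = k / (1 + p)      # shrink the update each pass
            for winner, loser in games:
                e = expected_score(ratings[winner], ratings[loser])
                if winner not in fixed:
                    ratings[winner] += step * (1.0 - e)
                if loser not in fixed:
                    ratings[loser] -= step * (1.0 - e)
        return ratings

    # Toy results, made up only so the sketch runs end to end.
    games_2017 = [("Iron", "LetaBot"), ("Iron", "Steamhammer 2017"),
                  ("Steamhammer 2017", "LetaBot")]
    games_2018 = [("Iron", "LetaBot"), ("Steamhammer 2018", "Iron"),
                  ("Steamhammer 2018", "LetaBot")]

    # Step 2: pool both tournaments and rate everyone together.
    pooled_ratings = fit_elo(games_2017 + games_2018)

    # Step 3: hold the identical bots at their pooled ratings and
    # re-rate each tournament on its own against those anchors.
    anchors = {"Iron", "LetaBot"}
    anchor_ratings = {bot: pooled_ratings[bot] for bot in anchors}
    ratings_2017 = fit_elo(games_2017, ratings=anchor_ratings, fixed=anchors)
    ratings_2018 = fit_elo(games_2018, ratings=anchor_ratings, fixed=anchors)

Because both years are measured against the same fixed anchors, Steamhammer 2017 and Steamhammer 2018 end up on the same scale, and their difference is the improvement figure we’re after.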
It’s the best way I’ve thought of to get strength comparisons across time. We can get a pretty accurate measure of how individual bots have improved—Steamhammer 2018 is this much above Steamhammer 2017. We can treat elo as a linear measure of strength (a given elo difference always corresponds to the same expected win rate), so we can simply average together the ratings of any set of bots to compare: The top 16 are x points stronger this year, the protoss are y points stronger, the spread between best and worst has widened to....
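Continuing the sketch above, here is a quick check of the linearity claim and a helper for group averages; the group membership lists are hypothetical.

    # The expected win rate depends only on the rating gap, not on where
    # the gap sits on the scale: a 100-point edge is about 64% either way.
    assert abs(expected_score(1600, 1500) - expected_score(1100, 1000)) < 1e-12

    def group_mean(ratings, bots):
        """Mean anchored Elo of a group, e.g. the top 16 or all the protoss."""
        return sum(ratings[bot] for bot in bots) / len(bots)

    # Hypothetical group lists; the difference of means is the
    # "x points stronger this year" figure.
    top_2017 = ["Iron", "LetaBot", "Steamhammer 2017"]
    top_2018 = ["Iron", "LetaBot", "Steamhammer 2018"]
    gain = group_mean(ratings_2018, top_2018) - group_mean(ratings_2017, top_2017)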
I may do this analysis for SSCAIT once it finishes. It’s a bit elaborate, but I’m interested.
Comments
Antiga / Iruian on :
Jay Scott on :
Joseph Huang on :
Jay Scott on :
Edmund Nelson on :
1. SSCAIT 2016 vs SSCAIT 2017
(good) Bots that were unchanged
1. KillerBot
2. Bereaver
3. Ximp
4. Skynet
5. UAlbertaBot
Since KillerBot was changed in 2018, we lose a very useful piece of data for this experiment. But Iron and LetaBot being unchanged in 2018 gives us 2 top bots that remained unchanged.
So there are only 4 (top) bots that remained unchanged in the 3 consecutive years of the Student StarCraft AI Tournament.
Jay Scott on :
MicroDK on :
https://github.com/Bytekeeper/sc-docker
https://github.com/Bytekeeper/sc-docker/blob/master/docker/Setup%20in%20PowerShell%20--%20instructions.md