AIIDE 2015 Bayesian Elo ratings
Krasi0 asked me to calculate ratings for tournaments using Rémi Coulom’s excellent bayeselo program. Here are ratings for AIIDE 2015.
bayeselo does not calculate basic Elo ratings the way my little code snippets do, and it can't calculate an Elo curve over time. It assumes that the players are fixed in strength, each with one true rating, and it crunches a full-on Bayesian statistical analysis to find not only the rating as accurately as possible, but also a 95% confidence interval so you can see how accurate the rating is. For the bots that learn, which aren't fixed in strength as bayeselo assumes, the rating can be read as average strength over the tournament; the tournament score is no different in that respect.
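The underlying model can be sketched in a few lines. This is not bayeselo itself (which adds draw and side-advantage parameters and the full Bayesian machinery); it is a minimal maximum-likelihood fit of the plain logistic Elo model, on made-up game results:

```python
def expected(ra, rb):
    """Logistic Elo model: predicted score of a player rated ra vs one rated rb."""
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

# Made-up head-to-head results as (winner, loser) pairs -- not AIIDE data.
games = ([("A", "B")] * 8 + [("B", "A")] * 2 +
         [("B", "C")] * 7 + [("C", "B")] * 3 +
         [("A", "C")] * 9 + [("C", "A")] * 1)

# Gradient ascent on the log-likelihood: repeatedly nudge the winner up
# and the loser down until predicted scores match observed scores.
players = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
K = 5.0
for _ in range(1000):
    for w, l in games:
        surprise = 1.0 - expected(players[w], players[l])
        players[w] += K * surprise
        players[l] -= K * surprise

print(sorted(players, key=players.get, reverse=True))  # ['A', 'B', 'C']
```

Because every player's true rating is assumed fixed, all the games count equally, no matter when they were played; that is the assumption the learning bots violate.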
The last column of the table is the probability of superiority, bayeselo’s calculated probability that the bot truly is better than the bot ranked immediately below it. The last bot doesn’t get one, of course. (bayeselo calculates this for all pairs, but in a tournament this long it rounds off to 100% for most.)
| # | bot | score | Elo | 95% conf. | better? |
|---|---|---|---|---|---|
| 1 | tscmoo | 89% | 2026 | 2002-2050 | 81.0% |
| 2 | ZZZKBot | 88% | 2011 | 1988-2035 | 99.9% |
| 3 | UAlbertaBot | 80% | 1895 | 1874-1916 | 61.2% |
| 4 | Overkill | 81% | 1890 | 1870-1911 | 99.9% |
| 5 | Aiur | 73% | 1784 | 1765-1803 | 99.9% |
| 6 | Ximp | 68% | 1712 | 1694-1731 | 99.9% |
| 7 | Skynet | 64% | 1666 | 1648-1684 | 50.7% |
| 8 | IceBot | 64% | 1666 | 1648-1684 | 88.4% |
| 9 | Xelnaga | 63% | 1650 | 1632-1668 | 81.4% |
| 10 | LetaBot | 61% | 1638 | 1620-1656 | 99.9% |
| 11 | Tyr | 54% | 1553 | 1534-1572 | 96.0% |
| 12 | GarmBot | 52% | 1531 | 1513-1549 | 100% |
| 13 | NUSBot | 39% | 1380 | 1362-1398 | 73.1% |
| 14 | TerranUAB | 38% | 1372 | 1354-1390 | 99.8% |
| 15 | Cimex | 36% | 1335 | 1316-1353 | 99.6% |
| 16 | CruzBot | 32% | 1299 | 1280-1317 | 99.9% |
| 17 | OpprimoBot | 28% | 1231 | 1211-1250 | 96.7% |
| 18 | Oritaka | 26% | 1205 | 1185-1225 | 84.0% |
| 19 | Stone | 25% | 1190 | 1170-1210 | 91.3% |
| 20 | Bonjwa | 23% | 1171 | 1151-1191 | 100% |
| 21 | Yarmouk | 9% | 913 | 885-939 | 95.0% |
| 22 | SusanooTricks | 8% | 882 | 853-910 | - |
In the official results, Overkill came in ahead of UAlbertaBot with a higher tournament score. bayeselo ratings are more accurate than raw score because they take more information into account, and bayeselo says UAlbertaBot > Overkill with probability 61%. As explained in the original results, it's a statistical tie, but bayeselo says the tie is not even: it tilts slightly in a counterintuitive direction.
Skynet looks dead even with IceBot in the rounded-off numbers above. bayeselo says that Skynet > IceBot with probability 50.7%, a hair off dead even. Even the large number of games in this tournament could not rank all the bots accurately.
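As a rough check on intuition, the rating gaps in the table translate into per-game win probabilities via the standard logistic Elo formula (bayeselo's own model also has draw and advantage terms, so this is only an approximation). Note that this is a different number from the "better?" column: 52% below is the predicted score per game, while the 81% in the table is bayeselo's confidence that tscmoo's true rating is the higher one.

```python
def win_prob(ra, rb):
    """Predicted per-game score of a player rated ra against one rated rb."""
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

print(round(win_prob(2026, 2011), 3))  # tscmoo vs ZZZKBot: 0.522
print(round(win_prob(1666, 1666), 3))  # Skynet vs IceBot: 0.5
```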
Tomorrow: The same for CIG 2016.
Comments
imp on :
Still it would make a lot of sense to use a tournament ranking system borrowed from chess.
imp on :
https://www.fide.com/component/handbook/?id=187&view=article