CIG 2016 Bayesian Elo ratings
Same as yesterday, Bayesian Elo ratings calculated by bayeselo, this time for CIG 2016. I included both the qualifier and the final, of course. Including all games gives the best possible ratings, so confidence is higher for the 8 finalists. But the “score” column becomes difficult to interpret, because part of the score of the top 8 bots comes from the final, where they faced tougher opposition. You can’t directly compare the scores of bots 1-8 with the scores of bots 9-16; only the ratings are comparable.
Also, with this analysis it doesn’t make sense to compare the rating values between tournaments. Each tournament is independently scaled to have an average rating of 1500. Only the relative ratings of bots in the same tournament can be compared. Ratings are relative.
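To make “ratings are relative” concrete, here is a minimal sketch of the standard logistic Elo curve, which turns a rating *difference* into an expected score. Note that bayeselo’s actual model is more elaborate (it also fits draw and advantage parameters), so this is only an approximation of how to read the table:

```python
def elo_win_prob(d):
    """Expected score for a player rated d Elo points above the
    opponent, under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

# Only the difference matters, not the absolute numbers.
# tscmoo (1888) vs Ziabot (1500) is a 388-point gap:
print(round(elo_win_prob(388), 3))  # about 0.903
```

Shifting every rating in a tournament up or down by the same constant changes nothing here, which is why the per-tournament scaling to an average of 1500 is harmless within a tournament but makes cross-tournament comparisons meaningless.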
| # | bot | score | Elo | 95% conf. | better? |
|---|---|---|---|---|---|
| 1 | tscmoo | 73% | 1888 | 1872-1904 | 98.5% |
| 2 | Iron | 71% | 1864 | 1848-1880 | 99.9% |
| 3 | LetaBot | 68% | 1827 | 1811-1843 | 99.7% |
| 4 | Overkill | 65% | 1796 | 1781-1812 | 70.9% |
| 5 | ZZZKBot | 64% | 1790 | 1775-1805 | 86.8% |
| 6 | UAlbertaBot | 63% | 1778 | 1763-1793 | 99.8% |
| 7 | MegaBot | 60% | 1746 | 1731-1761 | 99.9% |
| 8 | Aiur | 54% | 1687 | 1671-1702 | 72.7% |
| 9 | Tyr | 62% | 1679 | 1659-1699 | 100% |
| 10 | Ziabot | 46% | 1500 | 1479-1521 | 100% |
| 11 | TerranUAB | 34% | 1338 | 1316-1360 | 100% |
| 12 | SRbotOne | 22% | 1158 | 1133-1183 | 59.1% |
| 13 | OpprimoBot | 22% | 1154 | 1128-1179 | 97.1% |
| 14 | XelnagaII | 21% | 1119 | 1092-1145 | 86.3% |
| 15 | Bonjwa | 19% | 1099 | 1072-1125 | 100% |
| 16 | Salsa | 1% | 579 | 510-636 | - |
The official results have LetaBot a hair ahead of ZZZKBot, with Overkill following. bayeselo has ZZZKBot and Overkill reversed, saying that LetaBot is clearly superior to Overkill, which is fairly likely to be superior to ZZZKBot. The difference comes about because, of course, the official results include only the final. Martin Rooijackers was justified after all in saying that ZZZKBot had fallen from the top 3. All other results agree with the official ranking. The trailing finalist Aiur is 72.7% likely to be superior to Tyr, so there is some doubt that the best bots made the final (in general that doubt can’t be avoided, though).
The tail-ender Salsa has a wide and asymmetrical confidence interval. It takes more evidence to pin down an extreme rating than a middle-of-the-road rating.
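A rough way to see why is to invert the logistic Elo curve: near a 50% score, a small change in score moves the implied rating only a little, while near a 1% score the same change moves it enormously, so each game carries much less information about where the rating sits. This sketch is illustrative only and is not bayeselo’s actual estimator:

```python
import math

def elo_from_score(p):
    """Rating difference implied by an expected score p,
    inverting the standard logistic Elo curve."""
    return -400.0 * math.log10(1.0 / p - 1.0)

# A one-percentage-point change in score, near the middle vs the tail:
print(round(elo_from_score(0.51) - elo_from_score(0.50)))  # about 7 Elo
print(round(elo_from_score(0.02) - elo_from_score(0.01)))  # about 122 Elo
```

The same uncertainty in score therefore translates into a far wider, and lopsided, rating interval at the extremes, which matches Salsa’s 510-636 band.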
Tomorrow: I’ll try an analysis in which the ratings of unchanged bots are carried over from AIIDE 2015 to CIG 2016, so that we can compare between tournaments. I’m not sure how well it will work, or even if I can get it to work at all, but it will be interesting to try.