AIIDE 2015 Bayesian Elo ratings

Krasi0 asked me to calculate ratings for tournaments using Rémi Coulom’s excellent bayeselo program. Here are ratings for AIIDE 2015.

bayeselo does not calculate basic Elo ratings the way my little code snippets do, and it can’t calculate an Elo curve over time. It assumes that the players are fixed and have one true rating each, and it crunches a full-on Bayesian statistical analysis to find not only the rating as accurately as possible but also a 95% confidence interval, so you can see how accurate the rating is. For the bots that learn, which aren’t fixed in strength as bayeselo assumes, the ratings can be seen as measuring average strength over the tournament—the tournament score is no different in that respect.
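For reference, the one-game update those snippets perform looks roughly like this (a minimal sketch; the K-factor of 32 and the function names are illustrative choices, not lifted from my actual code):

    # One-game Elo update: each player's rating moves toward its result.
    # K controls how fast ratings react; 32 is an illustrative choice.
    def expected_score(r_a, r_b):
        """Elo model: probability that player A beats player B."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def update(r_a, r_b, score_a, k=32.0):
        """score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
        delta = k * (score_a - expected_score(r_a, r_b))
        return r_a + delta, r_b - delta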

The last column of the table is the probability of superiority, bayeselo’s calculated probability that the bot truly is better than the bot ranked immediately below it. The last bot doesn’t get one, of course. (bayeselo calculates this for all pairs, but in a tournament this long it rounds off to 100% for most.)

 #  bot            score   Elo   95% conf.   better?
 1  tscmoo          89%   2026   2002-2050   81.0%
 2  ZZZKBot         88%   2011   1988-2035   99.9%
 3  UAlbertaBot     80%   1895   1874-1916   61.2%
 4  Overkill        81%   1890   1870-1911   99.9%
 5  Aiur            73%   1784   1765-1803   99.9%
 6  Ximp            68%   1712   1694-1731   99.9%
 7  Skynet          64%   1666   1648-1684   50.7%
 8  IceBot          64%   1666   1648-1684   88.4%
 9  Xelnaga         63%   1650   1632-1668   81.4%
10  LetaBot         61%   1638   1620-1656   99.9%
11  Tyr             54%   1553   1534-1572   96.0%
12  GarmBot         52%   1531   1513-1549   100%
13  NUSBot          39%   1380   1362-1398   73.1%
14  TerranUAB       38%   1372   1354-1390   99.8%
15  Cimex           36%   1335   1316-1353   99.6%
16  CruzBot         32%   1299   1280-1317   99.9%
17  OpprimoBot      28%   1231   1211-1250   96.7%
18  Oritaka         26%   1205   1185-1225   84.0%
19  Stone           25%   1190   1170-1210   91.3%
20  Bonjwa          23%   1171   1151-1191   100%
21  Yarmouk          9%    913    885-939    95.0%
22  SusanooTricks    8%    882    853-910    -
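You can roughly sanity-check the better? column yourself. Reading each 95% interval as about ±1.96 standard deviations of an independent Gaussian (an approximation; bayeselo estimates all the ratings jointly), the normal CDF of the rating gap comes out close to the table’s numbers:

    # Approximate probability of superiority from two rating intervals,
    # assuming independent Gaussian rating estimates. bayeselo's own
    # calculation is joint, so expect small differences.
    from math import erf, sqrt

    def prob_superiority(elo_a, lo_a, hi_a, elo_b, lo_b, hi_b):
        sigma_a = (hi_a - lo_a) / (2 * 1.96)  # 95% interval -> std. dev.
        sigma_b = (hi_b - lo_b) / (2 * 1.96)
        z = (elo_a - elo_b) / sqrt(sigma_a ** 2 + sigma_b ** 2)
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF

    # tscmoo vs. ZZZKBot from the table: prints about 0.81, matching 81.0%.
    print(prob_superiority(2026, 2002, 2050, 2011, 1988, 2035))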

In the official results, Overkill came in ahead of UAlbertaBot with a higher tournament score. bayeselo ratings are more accurate than the score because they take more information into account, and bayeselo says UAlbertaBot > Overkill with probability 61%. As explained in the original results, it’s a statistical tie, but bayeselo says the tie is not dead even; it leans a little in a counterintuitive direction.

Skynet looks dead even with IceBot in the rounded-off numbers above. bayeselo says that Skynet > IceBot with probability 50.7%, a hair off dead even. Even the large number of games in this tournament could not rank all the bots accurately.

Tomorrow: The same for CIG 2016.

Comments

tscmoo:

This is cool. Maybe this is how future tournaments should be ranked?

Jay Scott:

It would be more accurate, but only for deep technical reasons that are hard to understand. So... I think it would be a good idea, but it might be hard to justify.

Jay Scott:

People could reasonably object: The program is complicated and poorly documented. If it has a bug, or the tournament organizers misunderstand how to use it, then the results might be wrong.

imp:

To my knowledge, hardly any program (maybe not a single one) gets the tournament ranking calculation right for chess tournaments. The official ranking algorithm is described in prose, and the implementation is surprisingly difficult if you consider every single corner case.
Still, it would make a lot of sense to use a tournament ranking system borrowed from chess.

Jay Scott:

The bayeselo program is actually specialized for chess. It corrects for an empirical 32 Elo advantage for the white player and has specific knowledge about the draw rate in chess. The only point that makes it unsuitable for an ongoing rating system is that it can’t take prior ratings into account, because it doesn’t know how much evidence they represent. Luckily it also has commands to change its settings, so we can use it for other games!
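(For the curious, a bayeselo session for a tournament like this looks roughly like the following. I am reconstructing from memory, so the exact commands and values may differ; the file name is made up, and the annotations after # are mine, not part of the input.)

    readpgn aiide2015.pgn   # game results converted to PGN format
    elo                     # enter the rating interpreter
    advantage 0             # no analogue of white's first-move advantage
    drawelo 0.01            # draws are vanishingly rare in StarCraft
    mm                      # find the maximum-likelihood ratings
    exactdist               # compute the confidence intervals
    ratings                 # print the rating table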

imp:

It is one thing to calculate the Elo used to seed players in a Swiss-system tournament. It is quite a different thing to calculate the "correct" ranking of players in a tournament. If you're interested in the topic, check out e.g. Wikipedia for "Swiss System tie-breaking" and "Buchholz" as well as:

https://www.fide.com/component/handbook/?id=187&view=article

Jay Scott:

None of these tiebreak systems can equal the accuracy and fairness of bayeselo performance ratings. Bayeselo makes mathematically sound maximum-likelihood estimates based on all the tournament information. If anything, the use of maximum likelihood rather than a fully Bayesian estimate is its weakest point, and even there it is hard to suggest an improvement.

Jay Scott:

So, in other words, bayeselo should be excellent for ranking one tournament. It can’t support an ongoing rating system by itself.

krasi0:

Agreed. I think we should use it for next year's SSCAIT tournament as well.
