
Steamhammer’s performance over time

Many will have missed it since the original post was almost a year ago, but today Tully Elliston commented on the Steamhammer 3.1 change list from August 2020:

Tully Elliston: Looking at BASIL win rates, it looks like SH competitive performance dropped visibly after this version.

It does look that way. Here is BASIL’s graph of Steamhammer’s elo for 2020. BASIL throws in the ratings of top bots, which by coincidence is exactly what I want here. The version in question is the red dot on 20 August (delayed from the posting of the change list due to downtime).

graph of rating over 2020

Steamhammer improved slowly but steadily up until around the time that version hit the server, then more or less held steady while the top bots gradually lifted away. The cause might be the sudden ascendance of Stardust, pushing everyone else down; the theory would be that the other bots on the graph coped better with the killer dragoons. It seems plausible to me, but Stardust is only one opponent and should not have much effect. The cause might be that I had spent a year distracted by other things and worked slowly on Steamhammer. That seems more likely to me. Or it could truly be that a weakness was introduced in this version.

Notice that Steamhammer’s improvement on the graph occurred in between widely-spaced updates. In principle, there are 3 ways that can happen:

1. By chance.

2. By artifacts of the rating system as implemented, because of bots arriving and leaving. You can get elo inflation if bots arrive, lose games and fall in elo to push everybody else up, then are dropped (and BASIL has dropped a lot of bots).

3. By Steamhammer’s opening learning.

I think the opening learning is most likely. That opens another hypothesis for why improvement stopped around this version: Maybe, due to weaknesses already inherent in Steamhammer from earlier versions, the learning reached a ceiling and could no longer contribute. This suggests that there may be a bottleneck weakness somewhere, and that to make big progress I have to break the bottleneck.

Wah, that is a lot of hypotheses. I looked at the long-term elo graphs for a number of bots that have not been updated in all that time, and they all show elo increases. BASIL has elo inflation, which explains some proportion of the elo rise of all bots. It also means that if your elo does not increase, maybe your bot is not staying the same, but getting worse! (We could take an average of the non-updated bots and subtract out their elo inflation to get an estimate of true strength over time. There is no reason to expect that the inflation is constant over time.)
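
For concreteness, here is roughly what that correction would look like in code. It is only a sketch, not a script I have run: it assumes a per-day rating table %elo{$date}{$bot} with dates in a sortable format, and a hand-picked list of bots believed to be unchanged over the whole period (the names below are placeholders).

# Sketch of the inflation correction.
my @frozen = ('UnchangedBot1', 'UnchangedBot2');    # placeholder names

sub frozen_average {
  my $day = shift;                 # hashref of bot => rating for one day
  my $sum = 0;
  $sum += $day->{$_} for @frozen;
  return $sum / scalar @frozen;
}

my @dates = sort keys %elo;
my $baseline = frozen_average($elo{$dates[0]});
foreach my $date (@dates) {
  my $inflation = frozen_average($elo{$date}) - $baseline;
  printf "%s  raw %4.0f  adjusted %4.0f\n",
    $date, $elo{$date}{'Steamhammer'}, $elo{$date}{'Steamhammer'} - $inflation;
}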

Here is the same graph starting from 1 January 2019 and continuing until today. BASIL began a little before the start of the graph, but the early period shows startup transients as the initial elos are established, so I left it out.

graph of rating over all time

When I compare Steamhammer to Hao Pan and BananaBrain on this graph, I can make out 3 periods. From the start until about October 2019, Steamhammer was neck-and-neck with them. From then until August 2020 or so, Steamhammer remained behind them; a gap had been opened, and the gap stayed roughly constant over that time. And since that time, Steamhammer has gained elo extremely slowly if at all, and has fallen further behind. Despite bug fixes and demonstrable improvements in some points of play, Steamhammer does not seem to be improving and (accounting for elo inflation) may be deteriorating. It is consistent with the distraction hypothesis, if you assume that I still haven’t recovered... but I think I have.

I suspect that the bottleneck weakness hypothesis is true. After watching many SCHNAIL games, I’ve concluded that Steamhammer’s tactical weaknesses in the midgame are critical. It loses too many units due to bad tactical decisions, must replace the lost combat units to stay safe, and (spending on combat units instead of drones) reaches its lategame economy too late. I suspect that if I fix the bottleneck tactical weaknesses, the other improvements I’ve made will start to show.

It’s hard to be sure, though! Gotta try it and find out.

By the way, I think the big point in these graphs is the relative decline of Krasi0. Krasi0 gained slightly over time, but lost its dominance and now is only another top bot. Subtracting elo inflation, perhaps Krasi0 is no longer improving at all.

comparing strength across time

We don’t get many tournaments of bots versus humans. I don’t think there have been any with conditions controlled well enough that we can judge how strong bots are and how they are improving: Enough human participants, of known strength, with known levels of familiarity with computer play, finishing enough games. Then hold events across years so we can compare. We have to make do with seeing how bots are improving against other bots. Here is my best idea so far for comparing strength across tournaments.

1. We need 2 tournaments, preferably round robin, that share some participants—exactly identical bots, the more the better. We can’t do it with humans, because we can’t get exactly identical people across time. Ideally the maps should be the same too. AIIDE has more games, and SSCAIT has more shared participants; either should work, but I think SSCAIT may work better for this purpose despite being short by comparison. You could also compare between AIIDE and SSCAIT, but it would not work as well. It would take extra effort to make sure you know which players are exactly identical, and the different lengths of the tournaments mean that each provides a different amount of evidence to support the ratings, plus you could get confusing results for learning bots.

2. Pool all the games from both tournaments and compute elo ratings. If some participants which are not identical have the same names, distinguish them somehow—Steamhammer 2017 versus Steamhammer 2018, or whatever.

3. The identical players have identical strength in both tournaments, so consider their elo ratings as fixed. For each tournament separately, compute the elo ratings of the remaining players while keeping the ratings of the identical players fixed. The fixed ratings are benchmarks that keep the elo comparison stable for the remaining players (the idea has been used before).

It’s the best way I’ve thought of to get strength comparisons across time. We can get a pretty accurate measure of how individual bots have improved—Steamhammer 2018 is this much above Steamhammer 2017. We can treat elo as a linear measure of strength (a given elo interval always represents the same win rate difference), so we can simply average together the ratings of any set of bots to compare: The top 16 are x points stronger this year, the protoss are y points stronger, the spread between best and worst has widened to....
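
To make step 3 concrete, here is a sketch in perl of how it could be carried out; the game list format, the benchmark names and ratings, and the number of passes are all invented for illustration. The idea is simply to run the usual Elo updates over one tournament's games while refusing to move the ratings of the benchmark bots; repeated passes let the other ratings settle around the fixed ones.

# Sketch: Elo over one tournament with the shared bots held fixed.
# Each game in @games is a [winner, loser] pair; %fixed maps the identical
# benchmark bots to their ratings from the pooled calculation (placeholders here).
my %fixed = ('BenchmarkBotA' => 1820, 'BenchmarkBotB' => 1650);
my %rating = %fixed;
foreach my $game (@games) {
  foreach my $bot (@$game) {
    $rating{$bot} = 1500 unless exists $rating{$bot};
  }
}

my $elo_k = 16;
foreach my $pass (1 .. 50) {         # repeated passes let the free ratings settle
  foreach my $game (@games) {
    my ($winner, $loser) = @$game;
    my $expected = 1.0 / (1.0 + 10.0 ** (($rating{$loser} - $rating{$winner}) / 400.0));
    my $delta = $elo_k * (1.0 - $expected);
    $rating{$winner} += $delta unless exists $fixed{$winner};   # benchmarks stay put
    $rating{$loser}  -= $delta unless exists $fixed{$loser};
  }
}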

I may do this analysis for SSCAIT once it finishes. It’s a bit elaborate, but I’m interested.

CIG Elo ratings

Elo as calculated by Remi Coulom’s bayeselo program. The # column gives the official ranking, so you can see how it differs from the rank by Elo. The bayeselo ranking should be slightly more accurate because it takes into account all the information in the tournament results, but unfortunately there are missing games so the Elo is computed from slightly less data than the official results. The “better?” column tells how likely each bot is to be superior to the one below it.

 #  bot          score  Elo   better?
 1  ZZZKBot      75%    1749  98.8%
 2  tscmoo       74%    1731  100%
 3  PurpleWave   67%    1660  99.9%
 4  LetaBot      63%    1626  97.9%
 5  UAlbertaBot  62%    1611  85.1%
 6  MegaBot      61%    1604  96.7%
 7  Overkill     60%    1591  89.3%
 8  CasiaBot     59%    1582  55.2%
 9  Ziabot       59%    1581  58.8%
10  Iron         58%    1579  97.0%
11  AIUR         57%    1566  100%
12  McRave       47%    1476  97.6%
13  Tyr          45%    1462  79.9%
14  SRbotOne     45%    1456  99.9%
15  TerranUAB    38%    1397  99.9%
16  Bonjwa       33%    1347  94.9%
18  OpprimoBot   32%    1335  69.0%
17  Bigeyes      32%    1331  99.9%
19  Sling        26%    1275  100%
20  Salsa         9%    1041  -

Looking at the better? column, we see that the top 3 are well separated; the places are virtually sure to be accurate. ZZZKBot and tscmoo are close, but bayeselo thinks they are separated enough. Farther down, CasiaBot, Ziabot, and Iron are statistically hard to distinguish; there is not strong evidence that they finished in the correct order. Also OpprimoBot and Bigeyes are not well separated—as you might guess, since their order here is reversed from the official ranking.

Is this all the analysis we want of CIG 2017? I also have a script for the map balance, to check whether any race is favored. But it tells more about who competed than about the maps or bot skills.

simple SSCAIT rating statistics

I pulled down a rating table from today and calculated a few simple statistics.

         count  mean elo  median elo
terran   24     1965      1884
protoss  18     2007      2050
zerg     16     2013      2017
random    4     1901      1909

1. Terran is the most popular.

2. The fact that the mean is higher than the median for terran implies that a few terran bots stand out (like #1 Iron and #2 Krasi0), but most terran bots are weaker. Terran seems more difficult than protoss or zerg.

3. The higher median for protoss over zerg suggests that protoss may be easier to get strong with (contrary to my opinion).

4. Random struggles, as you might expect, but still has a higher median elo than terran.

If you look over the colors on the rating table, you’ll see that terran bunches up toward the bottom and zerg bunches more toward the middle, while protoss is spread throughout. In the upper part of the table, protoss and zerg seem pretty equally represented, though. There is not much difference between them.

Also, 62 bots is a lot!

AIIDE 2016 Bayesian Elo ratings

Again I have Elo as calculated by Remi Coulom’s bayeselo program. The # column gives the official ranking, so you can see how it differs from the rank by Elo (the bayeselo ranking is slightly more accurate because it takes into account all the information in the tournament results, not only the raw winning rate). I left out the 95% confidence interval column as relatively uninteresting, since the “better?” column tells us how likely each bot is to be superior to the one below it.

 #  bot          score  Elo   better?
 1  Iron         87%    2016  99.4%
 2  ZZZKBot      85%    1974  99.6%
 3  tscmoo zerg  83%    1932  99.9%
 4  LetaBot      74%    1815  99.8%
 5  UAlbertaBot  70%    1774  99.9%
 6  Ximp         65%    1699  99.6%
 8  Aiur         61%    1663  51.6%
 7  Overkill     62%    1663  99.6%
 9  MegaBot      58%    1627  88.5%
10  IceBot       57%    1611  57.5%
12  Xelnaga      57%    1608  50.0%
11  JiaBot       57%    1608  98.1%
13  Skynet       55%    1581  100%
14  GarmBot      43%    1441  100%
16  TerranUAB    27%    1250  74.7%
15  NUSBot       27%    1240  99.9%
17  SRbotOne     22%    1167  99.0%
18  Cimex        21%    1130  92.6%
19  Oritaka      20%    1106  99.3%
20  CruzBot      17%    1064  100%
21  Tyr           1%     533  -

There are some switches from the official ranking, due to bots being statistically indistinguishable. Overkill and Aiur are in a dead heat. IceBot (terran), Xelnaga (protoss) and JiaBot (zerg) are also virtually even. bayeselo gives IceBot a 57.6% chance of being better than JiaBot two ranks down, essentially the same as its 57.5% chance of being better than Xelnaga one rank down.

Tomorrow: The per-map crosstables.

how bots like maps

I decided to analyze the maps to see how bots feel about them overall. This data is derived from yesterday’s big table of how much bots like each SSCAIT map. The “spread” column is the mean of the absolute value of the Elo deviation numbers for a given map, across all the bots. I thought of calling it “controversy”; it measures how much bots like or dislike the map. Maps that all bots do OK on get low numbers; maps that some bots love and others hate get high numbers.

The “RMS” column is the root mean square of the same data. Statistically, it’s a fairer measure of the differences. It’s bigger because it puts more weight on outliers. The two measures don’t agree closely.
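
In code the two columns are trivial; for one map, given its column of per-bot Elo deviation numbers:

# spread = mean of the absolute deviations, RMS = root mean square of the same numbers
sub spread_and_rms {
  my @deviations = @_;     # one map's Elo deviation numbers, one per bot
  my ($abs_sum, $sq_sum) = (0, 0);
  foreach my $d (@deviations) {
    $abs_sum += abs($d);
    $sq_sum  += $d * $d;
  }
  my $n = scalar @deviations;
  return ($abs_sum / $n, sqrt($sq_sum / $n));
}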

Destination is the most “controversial” map, with 60 Elo spread. If you pick one bot that likes Destination and one bot that dislikes it, on average the bot that likes it will have a 60 Elo advantage, which means a 59% win rate if the bots are otherwise even—nothing devastating. Neo Moon Glaive has Elo spread 41 or about 56% advantage, not much different. Even if you go with the RMS number, the peak 81 Elo RMS difference means a 61% win rate, still not much different.

map              spread  RMS
Benzene          45      57
Destination      60      78
HeartbreakRidge  53      81
NeoMoonGlaive    41      53
TauCross         51      74
Andromeda        49      70
CircuitBreaker   54      69
EmpireoftheSun   50      69
FightingSpirit   51      72
Icarus           46      60
Jade             50      64
LaMancha1.1      49      65
Python           47      60
Roadrunner       46      63

Bottom line: On this analysis, the maps don’t seem to be distorting the competition. No highly “controversial” maps are introducing widespread unfairness.

Elo rating variations by map

From the SSCAIT data, I calculated the Elo advantage or disadvantage that each bot sees on each map. If it played all its games on that map, its Elo would change by that much. More or less; there isn’t as much data, so the advantage numbers are less accurate than the original Elo. I increased the Elo K factor to account for the smaller amount of data.
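
For the record, here is one plausible way such per-map advantages can be computed. It is a sketch of the general approach, not necessarily the exact calculation behind the table: rerun the Elo updates over only the games on one map, starting every bot from its overall rating and using a bigger K, and report how far each bot's rating drifts from its starting point.

# Sketch of a per-map advantage calculation (an illustration, not the actual script).
# %overall holds each bot's overall Elo; @map_games holds the [winner, loser]
# pairs for games played on a single map.
my $elo_k = 40;                      # larger K, since there is less data per map
my %per_map = %overall;              # every bot starts from its overall rating
foreach my $game (@map_games) {
  my ($winner, $loser) = @$game;
  my $expected = 1.0 / (1.0 + 10.0 ** (($per_map{$loser} - $per_map{$winner}) / 400.0));
  my $delta = $elo_k * (1.0 - $expected);
  $per_map{$winner} += $delta;
  $per_map{$loser}  -= $delta;
}
# advantage = per-map rating minus overall rating, in Elo points
my %advantage;
foreach my $bot (keys %per_map) {
  $advantage{$bot} = $per_map{$bot} - $overall{$bot};
}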

The 14 active maps:

  • (2)Benzene.scx
  • (2)Destination.scx
  • (2)HeartbreakRidge.scx
  • (3)NeoMoonGlaive.scx
  • (3)TauCross.scx
  • (4)Andromeda.scx
  • (4)CircuitBreaker.scx
  • (4)EmpireoftheSun.scm
  • (4)FightingSpirit.scx
  • (4)Icarus.scm
  • (4)Jade.scx
  • (4)LaMancha1.1.scx
  • (4)Python.scx
  • (4)Roadrunner.scx

The Elo ratings are repeated from the original post. I dropped 5 bots of the 103 for lack of data. The top number in each colored cell is the advantage or disadvantage that bot sees when playing on that map, in Elo points. You can look up winning rates for a given advantage in the Elo table. The bottom number is the count of games. Some bots have few games, like the new ZerGreenBot. A few bots have not played on every map, and get “-” instead of numbers.

bot | Elo | Benzene | Destination | Heartbreak Ridge | Neo Moon Glaive | Tau Cross | Andromeda | Circuit Breaker | Empire of the Sun | Fighting Spirit | Icarus | Jade | La Mancha 1.1 | Python | Roadrunner | earliest | latest
krasi02163
2128
53
164
57
169
26
146
7
159
18
143
26
126
-76
144
-35
158
-73
140
-31
158
23
156
-63
155
-6
156
72
154
2015 Nov 302016 Sep 27
Iron bot2081
1990
139
141
-15
144
47
134
-35
157
53
140
-34
133
-79
129
-36
142
17
145
-12
142
17
148
1
150
-78
137
15
148
2015 Nov 272016 Sep 26
Marian Devecka2065
4117
-57
320
-18
255
70
303
-12
287
-37
322
104
273
45
285
-22
297
23
301
-41
289
-72
309
28
310
26
284
-38
282
2014 Oct 292016 Sep 27
Martin Rooijackers2011
6449
-25
462
56
473
-67
478
0
450
40
477
44
449
88
478
-37
458
-15
463
42
480
-156
446
0
429
31
475
-2
431
2014 Oct 292016 Sep 27
tscmooz1991
4972
-23
380
120
354
-20
323
-52
316
50
354
-50
393
21
354
-9
341
-78
370
43
364
-6
354
29
377
-54
328
29
364
2015 Feb 272016 Sep 27
tscmoo1978
5682
-8
359
44
355
-15
438
39
389
27
445
30
402
12
447
-14
420
-97
410
53
408
-70
385
29
397
41
396
-71
431
2015 Jan 222016 Sep 27
LetaBot CIG 20161932
444
120
26
33
38
-126
35
5
29
-81
30
40
30
-12
33
-64
26
100
23
2
29
-92
35
14
37
35
45
27
28
2016 Aug 012016 Sep 27
WuliBot1871
984
-9
73
-58
83
66
56
77
77
-32
68
14
59
-22
71
36
84
12
64
-21
64
54
80
-37
67
-55
68
-26
70
2016 Apr 192016 Sep 26
Simon Prins1867
5400
-20
388
-25
410
51
381
67
432
19
356
-18
412
-110
374
97
376
-21
396
-25
385
95
375
-54
324
17
373
-73
418
2015 Jan 252016 Sep 27
ICELab1865
6078
-21
441
-69
442
73
458
-127
377
-52
447
-17
398
-2
455
3
440
-13
425
-25
471
49
421
34
437
72
435
96
431
2014 Oct 292016 Sep 27
FlashTest1863
204
-71
13
-106
12
-117
13
149
22
94
18
31
14
-121
15
2
20
78
15
-57
10
79
13
59
13
4
12
-23
14
2016 Mar 222016 Jul 27
Sijia Xu1849
2313
22
171
30
155
19
182
-8
166
-2
148
-49
164
19
166
-25
165
-45
138
24
171
36
161
-72
153
51
201
0
172
2015 Oct 102016 Sep 27
LetaBot SSCAI 2015 Final1813
416
-10
28
83
31
-41
22
31
34
15
27
-223
37
199
28
-44
29
-96
33
-44
36
-32
27
1
29
79
25
81
30
2016 Aug 042016 Sep 27
Dave Churchill1804
6023
-29
473
200
428
-82
436
18
413
19
433
-74
412
-50
417
72
429
47
445
-68
396
-53
408
100
438
-46
455
-53
440
2014 Oct 292016 Sep 27
Chris Coxe1800
2195
31
153
188
169
66
153
-106
165
-25
149
-79
149
-34
166
-7
167
-18
153
115
146
-124
153
-114
154
114
156
-8
162
2015 Sep 032016 Sep 27
Tomas Vajda1790
6088
-2
441
9
439
21
441
-77
449
-14
443
-18
421
34
398
-7
458
55
422
17
425
61
440
-80
424
-6
439
7
448
2014 Oct 292016 Sep 27
Flash1777
991
-19
70
-163
65
16
59
7
68
-70
59
-12
85
3
76
72
69
23
69
64
87
71
84
-25
61
30
75
3
64
2016 Apr 182016 Sep 27
LetaBot IM noMCTS1766
1226
61
79
127
88
-116
83
39
106
85
87
-114
93
-106
89
-24
93
-28
89
58
93
-66
70
18
86
-2
92
68
78
2016 May 182016 Aug 01
Zia bot1757
536
54
36
-63
40
15
39
-22
41
50
39
-63
43
93
42
-7
34
20
36
-78
33
-58
30
-55
40
66
39
48
44
2016 Jul 072016 Sep 27
A Jarocki1741
932
20
63
121
66
125
62
-42
72
-48
74
18
53
-104
78
74
81
25
76
-93
59
-110
64
-39
55
52
67
3
62
2015 Oct 042016 Jan 26
PeregrineBot1728
1262
-21
104
94
89
-28
95
123
76
-133
90
77
95
-4
77
10
85
-67
94
-32
83
-51
94
11
90
14
103
9
87
2016 Feb 092016 Sep 10
tscmoop1721
1982
155
139
56
145
47
140
50
127
-7
130
53
154
-68
127
13
141
-17
140
-54
171
-59
128
-44
158
-51
148
-74
134
2015 Nov 112016 Sep 26
Andrew Smith1718
6160
63
460
-37
387
46
445
-10
457
-11
452
33
449
-21
463
89
480
22
441
-58
455
-25
445
-29
418
-58
400
-3
408
2014 Oct 292016 Sep 27
Florian Richoux1716
5970
-35
408
-107
425
120
441
42
411
-43
446
-46
443
37
397
152
468
66
416
-25
451
46
433
-90
410
-29
391
-88
430
2014 Oct 292016 Sep 27
Carsten Nielsen1695
4683
-19
361
27
341
-65
334
33
315
-21
330
-1
319
-27
353
53
336
88
369
-35
318
6
318
-78
282
26
340
12
367
2015 Mar 172016 Sep 27
Soeren Klett1687
6002
39
459
35
426
-12
420
63
396
-108
440
39
481
-20
402
-25
451
-4
454
36
433
-63
407
-19
406
71
410
-32
417
2014 Oct 292016 Sep 27
Vaclav Horazny1686
4140
33
291
78
291
-17
302
18
295
-22
274
-46
332
33
290
-31
308
4
284
-38
270
58
313
-96
315
25
296
2
279
2014 Oct 292015 Nov 18
La Nuee1662
558
-66
36
41
24
81
40
100
35
57
44
-11
41
-82
44
-21
36
50
52
-7
34
-78
43
-11
36
68
55
-121
38
2015 Dec 132016 Mar 18
Jakub Trancik1657
6136
102
427
79
454
1
434
-32
443
-64
445
-10
459
-16
438
-106
424
18
439
-35
413
-23
404
-40
463
63
445
63
448
2014 Oct 292016 Sep 27
Marek Suppa1655
4397
61
322
78
334
1
330
14
321
-48
304
-59
328
-88
315
-6
291
25
336
56
326
6
292
-25
300
-41
302
25
296
2015 Jan 052016 Mar 18
Krasimir Krystev1653
4292
-24
322
-59
283
92
304
49
298
8
327
-56
295
-51
296
-43
318
-76
326
-9
318
-24
280
132
318
-58
299
117
308
2014 Oct 292016 Mar 10
ASPbot20111652
222
-22
21
-178
17
-27
12
-13
17
98
20
154
11
-29
13
87
17
17
10
75
14
-59
17
-135
15
21
18
11
20
2015 Jan 292016 Feb 25
Marcin Bartnicki1633
1377
59
97
43
106
63
88
-54
110
-33
109
-20
110
-21
92
-46
96
-20
97
-12
92
53
92
-38
99
75
102
-48
87
2014 Nov 282016 Mar 18
Tomas Cere1631
6131
-27
446
-41
419
-2
444
-46
443
109
459
14
448
2
429
2
424
48
433
-24
442
-69
417
49
480
-12
426
-2
421
2014 Oct 292016 Sep 27
MegaBot1630
419
-22
29
27
27
6
34
36
27
-27
26
-21
25
-28
21
-1
28
44
41
64
28
-18
37
-2
37
-66
34
9
25
2016 Aug 012016 Sep 27
Aurelien Lermant1622
3673
-30
258
-8
249
15
268
-32
247
30
260
-37
287
-46
250
-73
266
87
262
38
253
51
255
80
285
-20
261
-53
272
2015 Jun 222016 Sep 27
Matej Kravjar1619
983
-29
69
22
67
-96
75
16
71
-266
75
12
65
78
73
74
79
-25
59
20
75
82
75
68
58
105
78
-60
64
2014 Oct 292015 Feb 18
Daniel Blackburn1605
4591
-61
327
-83
332
74
337
8
326
-8
306
66
330
-56
354
-26
350
74
327
-23
312
17
311
-3
319
-10
345
31
315
2014 Oct 292016 Jan 26
Gabriel Synnaeve1584
18
-3
1
-38
3
31
1
24
1
-26
2
-
-
19
1
-
-5
2
13
1
13
2
-9
2
-19
2
2015 Jan 302015 Nov 24
David Milec1566
49
-
-11
5
48
2
-47
4
-73
4
9
4
-2
3
-26
6
-13
2
16
4
25
3
31
5
70
5
-26
2
2015 Jan 132015 Jan 20
Odin20141565
5602
-60
416
-40
377
16
396
-40
399
-38
400
42
402
-45
380
-68
401
-29
424
58
416
52
393
66
411
14
365
73
422
2014 Dec 212016 Sep 11
Gaoyuan Chen1559
5106
-18
387
-40
348
-20
357
-64
387
59
379
48
352
-18
330
75
359
2
391
-24
362
-53
383
33
371
18
345
1
355
2015 Feb 102016 Sep 27
Henri Kumpulainen1553
881
62
75
-128
64
37
68
51
64
-23
69
33
69
-48
58
15
67
-36
62
22
74
27
48
1
58
-11
54
-1
51
2016 Jan 132016 May 31
Martin Dekar1533
2627
-58
195
14
186
25
189
5
196
-105
168
-27
189
73
178
-3
206
30
174
18
207
72
202
-9
166
-53
201
17
170
2014 Oct 292016 Jan 25
Serega1505
3802
-35
280
104
257
131
262
51
260
-3
278
26
261
-23
275
-114
285
-76
276
6
270
-4
268
-1
279
-100
291
38
260
2015 Jan 312016 Jan 26
Chris Ayers1481
1520
-34
115
-40
124
11
112
-28
106
0
106
-72
111
65
88
69
105
38
113
-52
117
49
109
-44
89
-44
102
82
123
2015 Aug 102016 Jan 26
Nathan a David1481
991
13
57
0
61
-91
77
31
88
-7
65
-34
75
-23
72
103
61
68
65
43
72
-26
78
-110
70
-34
64
68
86
2016 Feb 232016 Aug 08
DAIDOES1471
485
131
36
39
32
47
31
-27
30
1
28
-25
42
123
39
-159
28
29
44
32
34
-64
34
-80
36
-51
30
5
41
2016 Jun 132016 Sep 08
Igor Lacik1454
5852
19
420
48
399
13
447
-28
408
-38
375
31
418
-95
461
111
418
-28
386
-9
442
12
415
-54
405
26
429
-7
429
2014 Oct 292016 Sep 08
Matej Istenik1449
6054
-63
412
18
457
44
458
12
421
-20
429
-9
435
-109
417
-4
472
8
458
88
382
-3
414
8
426
-2
456
33
417
2014 Oct 292016 Sep 27
EradicatumXVR1443
4539
-41
340
18
324
62
309
66
319
-14
322
-5
315
60
322
-90
303
-24
331
-39
330
11
340
24
311
19
347
-47
326
2014 Nov 042016 Jan 23
Tomasz Michalski1432
433
6
29
34
30
-109
34
-35
23
142
27
30
40
-16
27
-92
31
174
32
-54
44
-5
23
12
31
-24
28
-63
34
2015 Dec 222016 Mar 18
Oleg Ostroumov1431
1345
10
92
38
83
63
69
-33
116
74
96
22
98
60
96
-1
116
101
109
-82
80
-104
97
-89
90
80
102
-138
101
2014 Oct 292016 Jan 26
NUS Bot1426
3333
78
216
-20
210
-525
233
64
236
29
249
25
221
-19
257
100
241
23
257
-39
254
133
240
104
232
36
257
11
230
2015 May 192016 Sep 06
Martin Pinter1425
1580
53
114
-54
110
69
108
7
110
-44
122
0
123
8
101
77
108
-60
97
26
131
-31
139
61
112
-53
95
-59
110
2014 Oct 292015 Dec 11
Roman Danielis1417
2945
-82
222
35
209
23
198
18
206
-9
202
13
221
89
202
-5
206
38
207
10
200
-18
225
-32
211
-29
226
-50
210
2014 Oct 292016 Sep 26
ZerGreenBot1416
36
-
-21
3
-13
3
-6
2
13
3
42
2
-29
3
8
1
6
1
-73
5
45
4
-7
4
36
4
0
1
2016 Sep 222016 Sep 27
Marek Kadek1413
5246
41
406
75
382
8
392
5
385
-28
383
-24
357
-12
369
-62
394
-49
359
-24
367
18
379
37
350
18
348
-3
375
2014 Oct 292016 May 22
Ian Nicholas DaCosta1404
2928
-38
217
-31
210
55
214
29
213
-134
222
-62
192
5
193
10
218
-17
205
-18
213
73
233
-47
199
90
186
84
213
2015 Apr 272016 Sep 08
AwesomeBot1403
473
-19
32
-24
31
-11
32
-23
49
20
29
149
29
-18
26
-3
30
9
53
-67
36
4
36
33
30
-12
32
-39
28
2016 Jun 162016 Sep 08
Radim Bobek1390
1151
-85
70
11
96
-36
68
54
77
-39
94
-95
81
-11
78
0
89
184
87
43
91
61
70
-58
75
-72
100
44
75
2015 Oct 012016 Mar 06
Adrian Sternmuller1375
4379
8
316
78
313
-85
330
-74
321
73
325
79
335
73
287
-42
315
-59
330
69
293
-115
288
33
331
-109
305
72
290
2014 Oct 302016 Jul 22
Martin Strapko1366
1144
-55
98
-2
103
-21
82
12
65
-6
83
88
89
20
73
-51
76
23
72
69
80
-57
81
-108
80
50
81
38
81
2014 Oct 292016 Jan 26
Maja Nemsilajova1363
4117
73
301
-73
292
73
322
7
309
31
298
-51
302
-77
269
-10
290
76
264
-36
302
-82
299
14
309
44
270
10
290
2014 Nov 042015 Nov 29
Johan Kayser1361
413
-58
24
153
19
22
33
-50
29
-2
29
27
36
-156
19
-150
28
-59
40
66
28
21
27
128
28
26
42
34
31
2016 Jul 292016 Sep 27
UPStarcraftAI1360
600
-16
43
56
48
6
41
45
48
11
36
12
43
-76
56
22
40
-28
44
52
37
37
43
23
32
-47
40
-98
49
2015 Dec 242016 Apr 13
Martin Vlcak1353
1210
-49
68
-83
97
-51
81
-26
86
-25
98
-31
87
88
60
119
94
35
79
-24
102
16
90
45
90
33
83
-46
95
2016 Feb 162016 Sep 07
Johannes Holzfuss1351
674
-46
52
98
39
24
36
59
51
-42
47
-27
54
-73
57
92
64
-82
49
-6
33
-4
57
-82
52
80
44
8
39
2016 Mar 052016 Jun 15
Vojtech Jirsa1350
2759
-87
212
-79
182
32
200
-8
207
-127
205
6
174
49
184
83
211
27
199
-30
186
79
207
117
193
31
214
-93
185
2015 Jan 122015 Sep 05
JompaBot1349
1043
-151
67
107
78
-180
70
-44
67
-52
83
81
61
95
88
1
83
71
75
82
80
-9
70
-72
77
68
73
2
71
2016 Feb 042016 Aug 13
Rob Bogie1346
651
42
48
-313
54
-193
38
135
45
365
49
-361
47
246
38
-333
43
-418
52
291
55
273
45
298
43
-306
40
274
54
2016 May 142016 Sep 06
Christoffer Artmann1344
395
30
25
-123
23
14
22
57
29
-45
31
-155
22
-36
27
-52
34
143
30
109
32
28
41
-17
31
-51
26
99
22
2016 Aug 072016 Sep 27
Marek Gajdos1331
1370
2
91
-102
100
90
95
-13
102
-139
99
93
87
42
81
39
107
-78
107
-30
106
-7
97
9
94
79
101
15
103
2016 Jan 302016 Sep 11
Travis Shelton1314
1212
38
72
-4
78
77
84
-31
80
-1
70
47
90
-25
105
2
92
-27
100
-30
104
-9
93
44
86
-87
82
5
76
2016 Feb 282016 Sep 06
Peter Dobsa1307
3015
27
213
26
205
-45
218
-71
199
-19
215
82
228
-4
232
81
197
58
212
-28
224
-32
207
42
204
-65
237
-54
224
2015 Jan 112015 Oct 02
VeRLab1304
888
-75
65
-5
52
-16
64
25
75
-20
56
-27
63
84
51
-6
79
42
71
57
77
99
51
6
52
-32
54
-131
78
2016 Feb 282016 Aug 01
Bjorn P Mattsson1295
4432
18
303
75
328
48
340
-4
303
-20
307
39
302
6
333
-26
317
12
304
-58
315
-96
326
-50
345
113
280
-57
329
2015 Apr 052016 Sep 27
Lukas Sedlacek1293
63
48
3
-80
10
74
3
-78
5
2
4
-14
6
-53
9
39
3
77
3
29
4
-15
2
7
3
-18
5
-20
3
2015 Jan 122015 Jan 20
Sergei Lebedinskij1293
1083
-16
59
11
71
-19
77
69
77
47
72
-97
83
70
70
46
69
3
86
-85
85
55
71
36
106
-51
74
-68
83
2015 May 282015 Sep 03
Vladimir Jurenka1278
6041
-87
429
-66
435
-33
454
-28
432
-15
402
11
438
44
453
77
435
-96
443
33
406
15
467
77
435
57
413
10
399
2014 Nov 042016 Sep 27
neverdieTRX1272
334
65
29
-3
27
56
21
-9
26
-38
19
13
20
-82
27
-50
28
44
26
-159
27
67
21
35
25
-3
24
62
14
2016 Jul 192016 Sep 10
OpprimoBot1256
1994
8
131
-23
138
14
144
122
146
70
143
-8
160
33
131
30
153
-70
135
-88
134
38
140
1
149
-93
139
-35
151
2015 Nov 182016 Sep 27
Marek Kruzliak1255
399
-1
31
-92
35
66
23
-46
27
146
28
111
27
-123
23
112
25
-43
36
-99
32
80
26
15
30
-25
25
-101
31
2014 Nov 282015 Jan 20
Sungguk Cha1250
697
-9
47
-38
46
34
54
-148
40
50
52
-62
51
130
48
-23
41
-3
65
-63
51
117
43
76
48
-44
61
-17
50
2016 Jun 052016 Sep 27
Jacob Knudsen1247
1244
-13
79
-32
89
92
99
-20
89
36
81
34
75
70
87
-87
88
-77
81
-5
99
55
87
-7
103
3
94
-47
93
2016 Feb 232016 Sep 10
Ludmila Nemsilajova1228
409
52
27
15
21
13
22
74
25
-90
27
-9
27
65
46
-71
37
-51
32
-75
28
1
30
-24
29
58
30
43
28
2014 Nov 282015 Jan 21
Karin Valisova1226
1067
159
84
-42
86
-11
71
-2
72
-28
74
0
79
-62
68
-40
76
6
72
-16
66
4
68
-29
85
31
82
31
84
2014 Nov 042016 Jan 26
HoangPhuc1209
300
-54
28
-46
32
25
22
3
16
83
21
-88
24
-21
16
51
20
-52
24
56
17
-70
25
-27
15
-37
25
178
15
2016 Jul 182016 Sep 07
Sebastian Mahr1182
1191
-2
67
-61
89
-18
83
-34
96
14
71
118
74
95
87
25
81
25
79
-13
97
-63
94
-40
89
17
97
-61
87
2016 Jan 132016 Aug 08
Jan Pajan1179
997
-15
85
58
71
14
64
-49
77
14
67
37
72
-61
78
2
61
-29
67
-19
70
-77
91
52
72
21
62
53
60
2014 Nov 042016 Jan 05
Pablo Garcia Sanchez1174
579
26
33
11
53
-28
42
-74
33
3
45
35
39
84
34
-49
43
52
39
26
37
24
50
-21
46
-37
51
-51
34
2015 Dec 242016 Apr 13
Ivana Kellyerova1131
1499
-59
115
-89
113
2
99
-20
113
71
111
38
108
-39
95
-6
99
19
125
30
92
53
106
-54
110
-5
97
60
116
2014 Nov 042015 Apr 01
Lucia Pivackova1090
717
-69
50
-1
53
-32
55
20
50
94
41
-13
58
-25
47
23
42
-57
47
39
49
-42
55
-32
50
48
59
46
61
2014 Oct 302015 Jan 20
Tae Jun Oh1036
138
43
11
21
8
-7
10
-40
7
96
8
-18
9
95
6
-35
9
-90
17
129
7
-40
14
-95
11
-21
9
-38
12
2016 Mar 222016 Apr 11
Denis Ivancik1022
418
-65
37
46
30
-23
25
-34
26
109
29
78
21
-26
36
96
26
-90
43
4
28
-35
27
-28
18
-43
34
11
38
2014 Nov 282015 Jan 20
ButcherBoy970
422
38
21
38
23
-68
32
-30
35
-43
34
100
31
46
29
-40
29
-6
35
-40
31
-49
34
128
23
7
30
-81
35
2016 Jun 212016 Sep 06
Jon W964
790
-30
58
47
66
-100
62
6
59
83
45
27
67
4
52
-44
50
7
60
35
47
-9
57
79
49
-66
62
-39
56
2015 Apr 302015 Jul 09
Matyas Novy885
1693
77
103
-76
132
-60
133
-69
122
-2
110
5
120
1
107
67
119
110
104
-59
145
-20
122
66
126
-83
124
44
126
2015 Feb 042015 Jul 09

There are some interesting things to see in the chart, but first look at Rob Bogie! That’s the bot MaasCraft. All the bots have preferences, some have strong preferences, but MaasCraft loves some maps and hates others. Why is that? If it could be made to love all the maps....

comparing AIIDE 2015 and CIG 2016 Elo ratings

The cool technique I had in mind to compare ratings across tournaments turned out not to work. Not cool after all. But 6 bots played unchanged in both AIIDE 2015 and CIG 2016, and we can compare their relative ratings. In this table the subtract column gives the AIIDE 2015 rating minus the CIG 2016 rating, and the normalize column gives the same difference with the average offset taken out.

bot          AIIDE Elo  CIG Elo  subtract  normalize
UAlbertaBot  1895       1778     117        35
Overkill     1890       1796      94        12
Aiur         1784       1687      97        15
TerranUAB    1372       1338      34       -48
OpprimoBot   1231       1154      77        -5
Bonjwa       1171       1099      72       -10
average                           82         0

As you might expect, two tournaments with different maps and different opponents give different ratings. UAlbertaBot and Overkill swapped ranks among the 6. But after correcting for the 82 point offset (since only rating differences matter), the ratings turn out to be quite close between the tournaments. The biggest difference is for TerranUAB. Look up 48 points in the Elo table—it says that TerranUAB has a 57% probability of beating itself, not a drastic error.
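
For anyone checking the arithmetic, the offset and the normalize column come out like this:

# subtract = AIIDE Elo minus CIG Elo for each of the 6 shared bots
my @subtract = (117, 94, 97, 34, 77, 72);
my $offset = 0;
$offset += $_ for @subtract;
$offset /= scalar @subtract;                      # comes to about 82
my @normalize = map { $_ - $offset } @subtract;   # the normalize column, near 0 on average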

You can try to convert a CIG 2016 rating into a rough estimate of an AIIDE 2015 rating by adding 82. For example, tscmoo terran earned a CIG rating of 1888, which corresponds to an AIIDE rating of 1888+82 = 1970, whereas the tscmoo zerg that played in AIIDE earned a rating there of 2026. So the estimate appears to be way off. But estimates made this way are likely to be closer for bots near the middle of the pack.

Next: Another mass of colorful crosstables.

CIG 2016 Bayesian Elo ratings

Same as yesterday, Bayesian Elo ratings calculated by bayeselo, this time for CIG 2016. I included both the qualifier and the final, of course. That gives the best possible ratings, so that confidence is higher for the 8 finalists. But the “score” column becomes difficult to interpret, because part of the score of the top 8 bots comes from the final when they faced tougher opposition. You can’t directly compare the scores of bots 1-8 with the scores of 9-16, only the ratings.

Also, with this analysis it doesn’t make sense to compare the rating values between tournaments. Each tournament is independently scaled to have an average rating of 1500. Only the relative ratings of bots in the same tournament can be compared. Ratings are relative.

 #  bot          score  Elo   95% conf.  better?
 1  tscmoo       73%    1888  1872-1904  98.5%
 2  Iron         71%    1864  1848-1880  99.9%
 3  LetaBot      68%    1827  1811-1843  99.7%
 4  Overkill     65%    1796  1781-1812  70.9%
 5  ZZZKBot      64%    1790  1775-1805  86.8%
 6  UAlbertaBot  63%    1778  1763-1793  99.8%
 7  MegaBot      60%    1746  1731-1761  99.9%
 8  Aiur         54%    1687  1671-1702  72.7%
 9  Tyr          62%    1679  1659-1699  100%
10  Ziabot       46%    1500  1479-1521  100%
11  TerranUAB    34%    1338  1316-1360  100%
12  SRbotOne     22%    1158  1133-1183  59.1%
13  OpprimoBot   22%    1154  1128-1179  97.1%
14  XelnagaII    21%    1119  1092-1145  86.3%
15  Bonjwa       19%    1099  1072-1125  100%
16  Salsa         1%     579  510-636    -

The official results have LetaBot a hair ahead of ZZZKBot, then Overkill following. bayeselo has ZZZKBot and Overkill reversed, saying that LetaBot is clearly superior to Overkill, which is fairly likely to be superior to ZZZKBot. The difference comes about because, of course, the official results include only the final. Martin Rooijackers was justified after all in saying that ZZZKBot had fallen from the top 3. All other results agree with the official ranking. The last of the finalists, Aiur, is 72.7% likely to be superior to Tyr, so there is some doubt that the best finalists won through (in general that doubt can’t be avoided, though).

The tail-ender Salsa has a wide and asymmetrical confidence interval. It takes more evidence to pin down an extreme rating than a middle-of-the-road rating.

Tomorrow: I’ll try an analysis in which the ratings of unchanged bots are carried over from AIIDE 2015 to CIG 2016, so that we can compare between tournaments. I’m not sure how well it will work, or even if I can get it to work at all, but it will be interesting to try.

AIIDE 2015 Bayesian Elo ratings

Krasi0 asked me to calculate ratings for tournaments using Rémi Coulom’s excellent bayeselo program. Here are ratings for AIIDE 2015.

bayeselo does not calculate basic Elo ratings like my little code snippets. It can’t calculate an Elo curve over time. It assumes that the players are fixed and have one true rating, and it crunches a full-on Bayesian statistical analysis to find not only the rating as accurately as possible, but also a 95% confidence interval so you can see how accurate the rating is. The ratings for the bots that learn, which aren’t fixed in strength as bayeselo assumes, can be seen as measuring the average strength over the tournament—the tournament score is no different in that respect.

The last column of the table is the probability of superiority, bayeselo’s calculated probability that the bot truly is better than the bot ranked immediately below it. The last bot doesn’t get one, of course. (bayeselo calculates this for all pairs, but in a tournament this long it rounds off to 100% for most.)

 #  bot            score  Elo   95% conf.  better?
 1  tscmoo         89%    2026  2002-2050  81.0%
 2  ZZZKBot        88%    2011  1988-2035  99.9%
 3  UAlbertaBot    80%    1895  1874-1916  61.2%
 4  Overkill       81%    1890  1870-1911  99.9%
 5  Aiur           73%    1784  1765-1803  99.9%
 6  Ximp           68%    1712  1694-1731  99.9%
 7  Skynet         64%    1666  1648-1684  50.7%
 8  IceBot         64%    1666  1648-1684  88.4%
 9  Xelnaga        63%    1650  1632-1668  81.4%
10  LetaBot        61%    1638  1620-1656  99.9%
11  Tyr            54%    1553  1534-1572  96.0%
12  GarmBot        52%    1531  1513-1549  100%
13  NUSBot         39%    1380  1362-1398  73.1%
14  TerranUAB      38%    1372  1354-1390  99.8%
15  Cimex          36%    1335  1316-1353  99.6%
16  CruzBot        32%    1299  1280-1317  99.9%
17  OpprimoBot     28%    1231  1211-1250  96.7%
18  Oritaka        26%    1205  1185-1225  84.0%
19  Stone          25%    1190  1170-1210  91.3%
20  Bonjwa         23%    1171  1151-1191  100%
21  Yarmouk         9%     913  885-939    95.0%
22  SusanooTricks   8%     882  853-910    -

In the official results, Overkill came in ahead of UAlbertaBot with a higher tournament score. bayeselo ratings are more accurate than score because they take into account more information, and bayeselo says UAlbertaBot > Overkill with probability 61%. As explained in the original results, it’s a statistical tie, but bayeselo says it’s not an even tie but a little tilted in a counterintuitive way.

Skynet looks dead even with IceBot in the rounded-off numbers above. bayeselo says that Skynet > IceBot with probability 50.7%, a hair off dead even. Even the large number of games in this tournament could not rank all the bots accurately.

Tomorrow: The same for CIG 2016.

SSCAIT Elo ratings over time

Here it is, the great chart of SSCAIT Elo ratings over time. Well, not here actually, I put it on a separate page so that not every blog visitor has to load the mass of Javascript and data.

SSCAIT interactive ratings chart for 100 bots

The chart is generated from this csv file. Spreadsheet software or stats software should open it right up, if you want to poke the data yourself. It’s 950 lines of 101 columns each, a date and ratings for 100 bots.
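
If you would rather script it than spreadsheet it, here is a tiny sketch that reads the file and prints the last recorded rating in each column. The filename is a placeholder, and the layout is assumed to be exactly as described above: date first, then one field per bot, empty where a bot has no rating that day.

# Sketch: print the last non-empty rating in each column of the csv.
open my $fh, '<', 'sscait_elo.csv' or die "can't open csv: $!";   # placeholder filename
my @last;
while (my $line = <$fh>) {
  chomp $line;
  my @fields = split /,/, $line, -1;     # -1 keeps trailing empty fields
  for my $i (1 .. $#fields) {
    $last[$i] = $fields[$i] if defined $fields[$i] and $fields[$i] ne '';
  }
}
close $fh;
printf "column %3d  last value %s\n", $_, $last[$_] for 1 .. $#last;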

Data in the csv file is filled in for each day from the bot’s first to its last game in the original raw data file (which is just a list of games), and left blank on other days. There may be an off-by-one error causing some bots to miss their last day of data; I didn’t bother to verify it since it’s hardly visible. Some bots have short lifetimes and only appear on the graph as a brief squiggle. Some bots have inactive periods in between their first and last games; the inactive periods with no games graph as flat lines. In excluding the 3 bots with insufficient games, I also removed them from the rating calculation, which improves the ratings to a tiny degree. The rankings stay exactly the same for all 100 bots, though.

Elo ratings are easy to calculate

Elo ratings may seem mysterious and complicated, but basic Elo ratings are super easy to calculate. The entire code is in these 2 yellow boxes. It’s perl, a particularly ugly language, but any coder should be able to read it.

First, given two ratings, here’s how you find out the probability that the player with the first rating beats the player with the second. I broke it out because it’s useful on its own. It’s literally 1 line of calculation. This is the logistic function, which has been shown empirically to be a good fit for the job. 400 is a scaling constant which is standard for Elo.

# Probability that the player rated $rating1 defeats the player rated $rating2.
sub expected_win_rate ($, $) {
  my $rating1 = shift;
  my $rating2 = shift;

  # Logistic curve on the rating difference; 400 is the standard Elo scale.
  return 1.0 / (1.0 + 10.0 ** (($rating2 - $rating1) / 400.0));
}

Second, given two ratings and a game result, here’s how you figure out the two new ratings after the game. This time it’s 2 lines of calculation, and it calls the expected win rate function above. $actual is the game result, 0.0 if the first player lost and 1.0 if the first player won. You can use 0.5 for a draw, but I skipped over draws because I’m not sure what ‘draw’ means in the file I have (there are 1211 draws in the 141,164 games, a negligible number, so it shouldn’t make much difference). $elo_k is the K constant for the Elo formula, which is the maximum rating change per game. Setting $elo_k high means that ratings react quickly to changes, and setting $elo_k low means ratings are more accurate if changes are slow. I have a lot of games, so I set $elo_k to a low value, 16. Other common values are 24 and 32.

# Given the two ratings and the game result, return the two updated ratings.
sub update_elo ($, $, $) {
  my $rating1 = shift;
  my $rating2 = shift;
  my $actual = shift;          # 1.0 if the first player won, 0.0 if the first player lost

  # Move toward the observed result; $elo_k caps the change per game.
  my $delta = $elo_k * ($actual - &expected_win_rate ($rating1, $rating2));

  return ($rating1 + $delta, $rating2 - $delta);
}
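
A minimal driver to turn those two subroutines into a rating list looks something like the following; the @games list of winner/loser name pairs stands in for however you read your game records.

$elo_k = 16;

# @games holds the games in order, each as a [winner, loser] pair of bot names.
my %rating;
foreach my $game (@games) {
  my ($winner, $loser) = @$game;
  $rating{$winner} = 1500 unless exists $rating{$winner};
  $rating{$loser}  = 1500 unless exists $rating{$loser};
  ($rating{$winner}, $rating{$loser}) =
    &update_elo ($rating{$winner}, $rating{$loser}, 1.0);
}

# print the bots in rating order
foreach my $bot (sort { $rating{$b} <=> $rating{$a} } keys %rating) {
  printf "%-30s %4.0f\n", $bot, $rating{$bot};
}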

Third and last, you need to already have an Elo rating before you can use the Elo formula. How do you set a player’s initial rating? If you don’t have any better idea, it’s standard to set it to 1500. It will take a while to become accurate. I used the calculate-backward trick to get accurate initial ratings, but that only works if you have the whole dataset ahead of time. Sometimes players are rated by a different “provisional” system for some small number of early games, before Elo kicks in.

And that’s the story! There are a bunch of fancy variations of Elo which try to do a little better. And though I think they mostly do do a little better, they’re more complicated and not very much better.

The bottom line: Calculating Elo ratings is easy and works well, so you should do it. If you care about playing strength and have the data, ratings are your answer.

a few preliminary Elo charts

The SSCAIT data includes 103 bots, and 3 of them have 10 or fewer games, leaving exactly 100 with useful rating curves. I’ve crunched and formatted the data, and now all I have to do is draw it. I hope to create a humongalicious zoomable graph of daily rating data for all 100 bots—if I can find a way to draw that many lines on a graph in a way that’s usable. Well, I’ll think of something. I chose powerful graphing software that’s fully capable of doing the job, but it’s complicated and my skill and patience may be less than fully capable....

Anyway, another appetizer. Here are static rating graphs for 2016 for the top 3 CIG finishers, all of which had many updates this year. The graphs run from 1 January 2016 to 27 September 2016. The authors may be interested in comparing their updates with movements in their graph. Krasi0 shows steady improvement since April, while the other two look more irregular.

graph of Krasi0’s rating in 2016

graph of Iron’s rating in 2016

graph of tscmoo terran’s rating in 2016

Elo rating table

Here’s a table that explains what Elo ratings mean. To find out the chance that one bot will beat another, subtract their Elo ratings and look up the difference in the table. Iron is rated 2081 and WuliBot is rated 1871. The difference is 210—look it up in the table!

The probability estimate is not perfect, but it is good on average.
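
The table is just the standard logistic Elo formula evaluated at round numbers. For the Iron versus WuliBot example you can compute the entry directly:

# the 210 point difference from the example above, computed directly
my $diff = 2081 - 1871;                                  # Iron minus WuliBot
my $win_rate = 1.0 / (1.0 + 10.0 ** (-$diff / 400.0));
printf "%.0f%%\n", 100.0 * $win_rate;                    # prints 77%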

rating diff  win %    rating diff  win %    rating diff  win %    rating diff  win %
     0       50%          200      76%          400      91%          600      97%
    10       51%          210      77%          410      91%          610      97%
    20       53%          220      78%          420      92%          620      97%
    30       54%          230      79%          430      92%          630      97%
    40       56%          240      80%          440      93%          640      98%
    50       57%          250      81%          450      93%          650      98%
    60       59%          260      82%          460      93%          660      98%
    70       60%          270      83%          470      94%          670      98%
    80       61%          280      83%          480      94%          680      98%
    90       63%          290      84%          490      94%          690      98%
   100       64%          300      85%          500      95%          700      98%
   110       65%          310      86%          510      95%          710      98%
   120       67%          320      86%          520      95%          720      98%
   130       68%          330      87%          530      95%          730      99%
   140       69%          340      88%          540      96%          740      99%
   150       70%          350      88%          550      96%          750      99%
   160       72%          360      89%          560      96%          760      99%
   170       73%          370      89%          570      96%          770      99%
   180       74%          380      90%          580      97%          780      99%
   190       75%          390      90%          590      97%          790      99%
   200       76%          400      91%          600      97%          800      99%

SSCAIT initial and current Elo ratings

I’m still working on Elo curves over time, but today I have Elo ratings for each bot in the SSCAIT data at the beginning and end of its career. Here is yesterday’s table plus the new info, now sorted by decreasing current rating—the bot’s real strength yesterday as best we can measure. The topmost ratings are, to my surprise, exactly in the order I expected!

To make the ratings easier to interpret, I added two columns labeled “expect”. These are the expected winning rate of the bot against the average opponent. The rating system is designed so that the average Elo rating is constant at 1500, and it’s easy to compute the expected winning rate against an opponent rated 1500. The constant average rating, by the way, means that a bot which remains the same can see its rating decline over time if its opponents improve.
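
Computing it is one line with the standard Elo formula; as a check against the table, krasi0’s current rating of 2163 gives 97.85%.

# expected winning rate against an average opponent, that is, one rated 1500
sub expect_vs_average {
  my $elo = shift;
  return 1.0 / (1.0 + 10.0 ** ((1500.0 - $elo) / 400.0));
}

printf "%.2f%%\n", 100.0 * expect_vs_average(2163);      # krasi0: prints 97.85%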

Ratings are not accurate for bots with a very small number of games. I plan to exclude those bots from the curves over time.

bot | win % | initial Elo | initial expect | current Elo | current expect | games | earliest | latest
krasi068.77%159363.07%216397.85%21422015 Nov 302016 Sep 27
Iron bot77.74%158061.31%208196.59%19992015 Nov 272016 Sep 26
Marian Devecka58.66%179084.15%206596.28%62892013 Dec 252016 Sep 27
Martin Rooijackers68.50%184087.62%201194.99%72902014 Jul 282016 Sep 27
tscmooz79.80%182386.52%199194.41%50062015 Feb 272016 Sep 27
tscmoo72.06%183887.50%197894.00%57192015 Jan 222016 Sep 27
LetaBot CIG 201675.68%174880.65%193292.32%4442016 Aug 012016 Sep 27
WuliBot72.76%177382.80%187189.43%9842016 Apr 192016 Sep 26
Simon Prins55.48%151351.87%186789.21%54312015 Jan 252016 Sep 27
ICELab81.12%218998.14%186589.10%83442013 Dec 252016 Sep 27
FlashTest69.44%174480.29%186388.99%2162016 Mar 222016 Jul 27
Sijia Xu71.65%185088.23%184988.17%23282015 Oct 102016 Sep 27
LetaBot SSCAI 2015 Final65.87%171077.01%181385.84%4162016 Aug 042016 Sep 27
Dave Churchill75.48%198594.22%180485.19%82752013 Dec 252016 Sep 27
Chris Coxe73.10%175481.19%180084.90%22012015 Sep 032016 Sep 27
Tomas Vajda79.37%216997.92%179084.15%83722013 Dec 252016 Sep 27
Flash65.69%145843.98%177783.13%9912016 Apr 182016 Sep 27
LetaBot IM noMCTS60.93%164569.73%176682.22%12262016 May 182016 Aug 01
Zia bot52.24%156859.66%175781.45%5362016 Jul 072016 Sep 27
A Jarocki62.77%171177.11%174180.02%9322015 Oct 042016 Jan 26
PeregrineBot57.29%169275.12%172878.79%12762016 Feb 092016 Sep 10
tscmoop78.16%189590.67%172178.11%19922015 Nov 112016 Sep 26
Andrew Smith65.00%170576.50%171877.81%83912013 Dec 252016 Sep 27
Florian Richoux62.11%177082.55%171677.62%82032013 Dec 252016 Sep 27
Carsten Nielsen66.08%170876.81%169575.45%47112015 Mar 172016 Sep 27
Soeren Klett63.62%206896.34%168774.58%82772013 Dec 252016 Sep 27
Vaclav Horazny37.35%10667.60%168674.47%64552013 Dec 252015 Nov 18
La Nuee51.61%149949.86%166271.76%5582015 Dec 132016 Mar 18
Jakub Trancik45.08%175581.27%165771.17%84162013 Dec 252016 Sep 27
Marek Suppa51.85%174680.47%165570.94%44132015 Jan 052016 Mar 18
Krasimir Krystev70.52%203395.56%165370.70%65102013 Dec 252016 Mar 10
ASPbot201149.78%167172.80%165270.58%2272015 Jan 292016 Feb 25
Marcin Bartnicki60.42%185588.53%163368.26%14352014 Nov 282016 Mar 18
Tomas Cere61.11%188890.32%163168.01%83732013 Dec 252016 Sep 27
MegaBot49.40%157660.77%163067.88%4192016 Aug 012016 Sep 27
Aurelien Lermant58.26%168874.69%162266.87%36872015 Jun 222016 Sep 27
Matej Kravjar49.57%172378.31%161966.49%32342013 Dec 252015 Feb 18
Daniel Blackburn43.79%165170.46%160564.67%68832013 Dec 252016 Jan 26
Gabriel Synnaeve45.96%173779.65%158461.86%16582013 Dec 252015 Nov 24
David Milec49.09%155257.43%156659.39%552015 Jan 132015 Jan 20
Odin201455.65%165971.41%156559.25%56482014 Dec 212016 Sep 11
Gaoyuan Chen48.05%158261.59%155958.41%51182015 Feb 102016 Sep 27
Henri Kumpulainen38.81%144742.43%155357.57%8942016 Jan 132016 May 31
Martin Dekar33.14%142939.92%153354.73%49102013 Dec 252016 Jan 25
Serega48.20%177182.64%150550.72%38032015 Jan 312016 Jan 26
Chris Ayers35.53%161065.32%148147.27%15202015 Aug 102016 Jan 26
Nathan a David39.34%144642.29%148147.27%10042016 Feb 232016 Aug 08
DAIDOES34.02%137032.12%147145.84%4852016 Jun 132016 Sep 08
FlashZerg0.00%147446.27%145944.13%72016 Apr 242016 May 12
Igor Lacik39.32%160865.06%145443.42%80732013 Dec 252016 Sep 08
Matej Istenik44.74%170976.91%144942.71%82972013 Dec 252016 Sep 27
EradicatumXVR40.88%153755.30%144341.87%46872013 Dec 252016 Jan 23
Ibrahim Awwal30.57%151051.44%143741.03%5302013 Dec 252014 Mar 24
Tomasz Michalski27.02%131425.53%143240.34%4332015 Dec 222016 Mar 18
Oleg Ostroumov48.75%171477.41%143140.20%36412013 Dec 252016 Jan 26
NUS Bot35.72%148247.41%142639.51%33372015 May 192016 Sep 06
Martin Pinter28.98%140937.20%142539.37%37402013 Dec 252015 Dec 11
Roman Danielis45.63%168874.69%141738.28%51552013 Dec 252016 Sep 26
ZerGreenBot22.22%140436.53%141638.14%362016 Sep 222016 Sep 27
Rafael Bocquet0.00%145042.85%141538.01%102015 Jun 232015 Jun 26
Flashrelease0.00%144942.71%141337.73%82016 Apr 242016 Apr 24
Marek Kadek37.29%155758.13%141337.73%76412013 Dec 252016 May 22
Ian Nicholas DaCosta37.12%139435.20%140436.53%29282015 Apr 272016 Sep 08
AwesomeBot29.81%132626.86%140336.39%4732016 Jun 162016 Sep 08
Radim Bobek23.37%131525.64%139034.68%11512015 Oct 012016 Mar 06
Adrian Sternmuller26.89%143640.89%137532.75%45292013 Dec 252016 Jul 22
Martin Strapko19.76%138834.42%136631.62%33862013 Dec 252016 Jan 26
Maja Nemsilajova23.81%136531.49%136331.25%42462013 Dec 252015 Nov 29
Johan Kayser24.46%129423.40%136131.00%4132016 Jul 292016 Sep 27
UPStarcraftAI24.75%134629.18%136030.88%6102015 Dec 242016 Apr 13
Martin Vlcak28.92%137032.12%135330.02%12242016 Feb 162016 Sep 07
Johannes Holzfuss35.04%153154.45%135129.78%6852016 Mar 052016 Jun 15
Vojtech Jirsa14.14%118614.09%135029.66%27862015 Jan 122015 Sep 05
JompaBot21.99%131625.75%134929.54%10552016 Feb 042016 Aug 13
Rob Bogie31.34%133527.89%134629.18%6512016 May 142016 Sep 06
Christoffer Artmann20.51%128922.89%134428.95%3952016 Aug 072016 Sep 27
Marek Gajdos22.69%125119.26%133127.43%13842016 Jan 302016 Sep 11
Travis Shelton23.59%139034.68%131425.53%12212016 Feb 282016 Sep 06
Peter Dobsa13.25%122717.20%130724.77%30272015 Jan 112015 Oct 02
VeRLab17.06%124118.38%130424.45%8972016 Feb 282016 Aug 01
Andrej Sekac11.76%135930.75%129623.61%682013 Dec 252014 Jan 04
Bjorn P Mattsson22.22%135129.78%129523.50%44422015 Apr 052016 Sep 27
Lukas Sedlacek22.86%134428.95%129323.30%702015 Jan 122015 Jan 20
Sergei Lebedinskij13.30%117813.55%129323.30%10832015 May 282015 Sep 03
Vladimir Jurenka38.45%163568.51%127821.79%61672013 Dec 252016 Sep 27
neverdieTRX20.66%126520.54%127221.21%3342016 Jul 192016 Sep 10
OpprimoBot21.85%132126.30%125619.71%20092015 Nov 182016 Sep 27
Marek Kruzliak14.45%115111.83%125519.62%9342013 Dec 252015 Jan 20
Sungguk Cha18.65%120715.62%125019.17%6972016 Jun 052016 Sep 27
Jacob Knudsen20.53%10838.31%124718.90%12572016 Feb 232016 Sep 10
Ludmila Nemsilajova16.04%113310.79%122817.28%5052013 Dec 252015 Jan 21
Karin Valisova17.68%123818.12%122617.12%11712013 Dec 252016 Jan 26
HoangPhuc15.67%113210.73%120915.77%3002016 Jul 182016 Sep 07
Sebastian Mahr15.06%120515.47%118213.82%12022016 Jan 132016 Aug 08
Jan Pajan14.48%121015.85%117913.61%11192013 Dec 252016 Jan 05
Pablo Garcia Sanchez12.20%112310.25%117413.28%5902015 Dec 242016 Apr 13
Ivana Kellyerova11.47%112910.57%113110.68%16302013 Dec 252015 Apr 01
Lucia Pivackova13.29%11119.63%10908.63%8352013 Dec 252015 Jan 20
Tae Jun Oh4.55%10697.72%10366.47%1542016 Mar 222016 Apr 11
Denis Ivancik10.76%11029.19%10226.00%5022013 Dec 252015 Jan 20
ButcherBoy4.74%9213.45%9704.52%4222016 Jun 212016 Sep 06
Jon W5.06%9203.43%9644.37%7902015 Apr 302015 Jul 09
Matyas Novy6.32%113010.62%8852.82%16932015 Feb 042015 Jul 09

How did I get the initial ratings? I had a cute idea. One of the issues with computing Elo ratings over time is: How do you initialize the ratings? Most systems either start everybody with the same rating, which makes an ugly graph, or use a different and less accurate method to estimate the rating in early games. But in this case I have the whole data set in hand. I set the final rating of every bot to the same rating and computed ratings backwards in time to find an initial rating. Then I threw away everything except the initial rating, and calculated the real ratings forward in time to find the ratings over time and the final ratings. That way every data point is equally good, from beginning to end. I doubt I’m the first to think of it, but it’s a cute idea and I’m pleased.
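
In code the trick is just two passes over the same game list, the first one in reverse order, something like this sketch (update_elo is the usual 2-line Elo update shown elsewhere on this page, and @games is the full game list as winner/loser pairs in chronological order):

# Sketch of the backward-then-forward initialization.
# Start by giving every bot the same final rating.
my %rating;
foreach my $game (@games) {
  $rating{$_} = 1500 for @$game;
}

# Pass 1: run the normal Elo updates over the games in reverse order.
# What is left in %rating afterward is the estimated initial rating of each bot.
foreach my $game (reverse @games) {
  my ($winner, $loser) = @$game;
  ($rating{$winner}, $rating{$loser}) =
    &update_elo ($rating{$winner}, $rating{$loser}, 1.0);
}

# Pass 2: the real calculation, forward in time from those initial ratings.
# This pass produces the ratings over time and the final ratings.
foreach my $game (@games) {
  my ($winner, $loser) = @$game;
  ($rating{$winner}, $rating{$loser}) =
    &update_elo ($rating{$winner}, $rating{$loser}, 1.0);
}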

Next: I’ll find some sensible way to plot the curves. Stand by!

tournament design

If you design a tournament differently, a different bot may be favored to win.

AIIDE 2015 was an example. As pointed out in the tournament results, UAlbertaBot finished fourth even though it had a plus score against every other bot, because compared to the top 3 it was less consistent in defeating weaker bots. AIIDE runs on a round-robin design, all-play-all, so UAlbertaBot could defeat the top finishers and still be ranked behind them. In a progressive elimination tournament in which weaker competitors were dropped over time, UAlbertaBot would likely have finished first.

If you’ve seen the math of tournament design, or of related stuff like voting system design, then you know there’s no such thing as a fair tournament in which the best competitor always has the best chance to win, because there isn’t always such a thing as a best competitor. If A > B and B > C but C > A, then which is the “best”? That’s called intransitivity. A more complicated kind of intransitivity happened in AIIDE 2015.

Rating systems in the Elo tradition have the same issue (and their designers know all about it). They assume—they have to assume, to be what they are—that players have a “true skill” in a mathematical sense, putting players into a smooth mathematical model that doesn’t correspond exactly with bumpy reality. It’s a good approximation; Elo ratings are mostly accurate in predicting future results. (The small mismatch with reality has inspired a lot of variations of Elo ratings, Glicko and TrueSkill and so on, that try to do a little better.)

Given any big enough set of games (games that link up the competitors into a connected graph), you can find Elo ratings for the players. The ratings may have big uncertainties, but you can rank the players. You can use virtually any tournament design with almost any kind of random or biased pairings, and get a ranking.

To me this is an intuitive way to think about tournament design: Players play games which we take as evidence of skill, and the key question is: With a given amount of time to play games, how do you want to distribute the evidence? If you want to rank all the competitors as well as possible, then distribute the evidence equally in a round-robin. That’s the idea behind AIIDE’s design—I approve. If you want to pick out the one winner, or the top few winners, as clearly as possible, then let potential winners play more games. If Loser1 and Loser2 are out of the running, then games between them produce little evidence of who the top winner will be. A game between Winner1 and Loser1 produces less evidence than a game between Winner1 and Winner2. Because of intransitivity you may get a different winner than the round robin, but you have more evidence that your winner is the “best.” It’s a tradeoff, ask and it shall be given you.

You might also care about entertaining the spectators. That’s the idea behind SSCAIT’s elimination format for the “mixed competition.” I approve of that too; it’s poor evidence but good fun.

As a corollary, the kind of tournament you want to win could make a difference in what you want to work on. In a round robin, beating the weak enemies more consistently like ZZZKBot counts as much as clawing extra points from the strong enemies like UAlbertaBot.