
Steamhammer’s performance over time

Many will have missed it since the original post was almost a year ago, but today Tully Elliston commented on the Steamhammer 3.1 change list from August 2020:

Tully Elliston: Looking at BASIL win rates, it looks like SH competitive performance dropped visibly after this version.

It does look that way. Here is BASIL’s graph of Steamhammer’s elo for 2020. BASIL throws in the ratings of top bots, which by coincidence is exactly what I want here. The version in question is the red dot on 20 August (delayed from the posting of the change list due to downtime).

graph of rating over 2020

Steamhammer improved slowly but steadily up until around the time that version hit the server, then more or less held steady while the top bots gradually lifted away. The cause might be the sudden ascendance of Stardust, pushing everyone else down; the theory would be that the other bots on the graph coped better with the killer dragoons. It seems plausible to me, but Stardust is only one opponent and should not have much effect. The cause might be that I had spent a year distracted by other things and worked slowly on Steamhammer. That seems more likely to me. Or it could truly be that a weakness was introduced in this version.

Notice that Steamhammer’s improvement on the graph occurred in between widely-spaced updates. In principle, there are 3 ways that can happen:

1. By chance.

2. By artifacts of the rating system as implemented, because of bots arriving and leaving. You can get elo inflation if bots arrive, lose games and fall in elo to push everybody else up, then are dropped (and BASIL has dropped a lot of bots).

3. By Steamhammer’s opening learning.

I think the opening learning is most likely. That opens another hypothesis for why improvement stopped around this version: Maybe, due to weaknesses already inherent in Steamhammer from earlier versions, the learning reached a ceiling and could no longer contribute. This suggests that there may be a bottleneck weakness somewhere, and that to make big progress I have to break the bottleneck.

Wah, that is a lot of hypotheses. I looked at the long-term elo graphs for a number of bots that have not been updated in all that time, and they all show elo increases. BASIL has elo inflation, which explains some proportion of the elo rise of all bots. It also means that if your elo does not increase, maybe your bot is not staying the same, but getting worse! (We could take an average of the non-updated bots and subtract out their elo inflation to get an estimate of true strength over time. There is no reason to expect that the inflation is constant over time.)
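
For concreteness, here is roughly what that correction would look like in code. It is only a sketch, not a script I have run: it assumes a per-day rating table %elo{$date}{$bot} with dates in a sortable format, and a hand-picked list of bots believed to be unchanged over the whole period (the names below are placeholders).

# Sketch of the inflation correction.
my @frozen = ('UnchangedBot1', 'UnchangedBot2');    # placeholder names

sub frozen_average {
  my $day = shift;                 # hashref of bot => rating for one day
  my $sum = 0;
  $sum += $day->{$_} for @frozen;
  return $sum / scalar @frozen;
}

my @dates = sort keys %elo;
my $baseline = frozen_average($elo{$dates[0]});
foreach my $date (@dates) {
  my $inflation = frozen_average($elo{$date}) - $baseline;
  printf "%s  raw %4.0f  adjusted %4.0f\n",
    $date, $elo{$date}{'Steamhammer'}, $elo{$date}{'Steamhammer'} - $inflation;
}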

Here is the same graph starting from 1 January 2019 and continuing until today. BASIL began a little before the start of the graph, but the early period shows startup transients as the initial elos are established, so I left it out.

graph of rating over all time

When I compare Steamhammer to Hao Pan and BananaBrain on this graph, I can make out 3 periods. From the start until about October 2019, Steamhammer was neck-and-neck with them. From then until August 2020 or so, Steamhammer remained behind them; a gap had been opened, and the gap stayed roughly constant over that time. And since that time, Steamhammer has gained elo extremely slowly if at all, and has fallen further behind. Despite bug fixes and demonstrable improvements in some points of play, Steamhammer does not seem to be improving and (accounting for elo inflation) may be deteriorating. It is consistent with the distraction hypothesis, if you assume that I still haven’t recovered... but I think I have.

I suspect that the bottleneck weakness hypothesis is true. After watching many SCHNAIL games, I’ve concluded that Steamhammer’s tactical weaknesses in the midgame are critical. It loses too many units due to bad tactical decisions, must replace the lost combat units to stay safe, and (spending on combat units instead of drones) reaches its lategame economy too late. I suspect that if I fix the bottleneck tactical weaknesses, the other improvements I’ve made will start to show.

It’s hard to be sure, though! Gotta try it and find out.

By the way, I think the big point in these graphs is the relative decline of Krasi0. Krasi0 gained slightly over time, but lost its dominance and now is only another top bot. Subtracting elo inflation, perhaps Krasi0 is no longer improving at all.

comparing strength across time

We don’t get many tournaments of bots versus humans. I don’t think there have been any with conditions controlled well enough that we can judge how strong bots are and how they are improving: Enough human participants, of known strength, with known levels of familiarity with computer play, finishing enough games. Then hold events across years so we can compare. We have to make do with seeing how bots are improving against other bots. Here is my best idea so far for comparing strength across tournaments.

1. We need 2 tournaments, preferably round robin, that share some participants—exactly identical bots, the more the better. We can’t do it with humans, because we can’t get exactly identical people across time. Ideally the maps should be the same too. AIIDE has more games, and SSCAIT has more shared participants; either should work, but I think SSCAIT may work better for this purpose despite being short by comparison. You could also compare between AIIDE and SSCAIT, but it would not work as well. It would take extra effort to make sure you know which players are exactly identical, and the different lengths of the tournaments mean that each provides a different amount of evidence to support the ratings, plus you could get confusing results for learning bots.

2. Pool all the games from both tournaments and compute elo ratings. If some participants which are not identical have the same names, distinguish them somehow—Steamhammer 2017 versus Steamhammer 2018, or whatever.

3. The identical players have identical strength in both tournaments, so consider their elo ratings as fixed. For each tournament separately, compute the elo ratings of the remaining players while keeping the ratings of the identical players fixed. The fixed ratings are benchmarks that keep the elo comparison stable for the remaining players (the idea has been used before).

It’s the best way I’ve thought of to get strength comparisons across time. We can get a pretty accurate measure of how individual bots have improved—Steamhammer 2018 is this much above Steamhammer 2017. We can treat elo as a linear measure of strength (a given elo interval always represents the same win rate difference), so we can simply average together the ratings of any set of bots to compare: The top 16 are x points stronger this year, the protoss are y points stronger, the spread between best and worst has widened to....
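
To make step 3 concrete, here is a sketch in perl of how it could be carried out; the game list format, the benchmark names and ratings, and the number of passes are all invented for illustration. The idea is simply to run the usual Elo updates over one tournament's games while refusing to move the ratings of the benchmark bots; repeated passes let the other ratings settle around the fixed ones.

# Sketch: Elo over one tournament with the shared bots held fixed.
# Each game in @games is a [winner, loser] pair; %fixed maps the identical
# benchmark bots to their ratings from the pooled calculation (placeholders here).
my %fixed = ('BenchmarkBotA' => 1820, 'BenchmarkBotB' => 1650);
my %rating = %fixed;
foreach my $game (@games) {
  foreach my $bot (@$game) {
    $rating{$bot} = 1500 unless exists $rating{$bot};
  }
}

my $elo_k = 16;
foreach my $pass (1 .. 50) {         # repeated passes let the free ratings settle
  foreach my $game (@games) {
    my ($winner, $loser) = @$game;
    my $expected = 1.0 / (1.0 + 10.0 ** (($rating{$loser} - $rating{$winner}) / 400.0));
    my $delta = $elo_k * (1.0 - $expected);
    $rating{$winner} += $delta unless exists $fixed{$winner};   # benchmarks stay put
    $rating{$loser}  -= $delta unless exists $fixed{$loser};
  }
}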

I may do this analysis for SSCAIT once it finishes. It’s a bit elaborate, but I’m interested.

CIG Elo ratings

Elo as calculated by Remi Coulom’s bayeselo program. The # column gives the official ranking, so you can see how it differs from the rank by Elo. The bayeselo ranking should be slightly more accurate because it takes into account all the information in the tournament results, but unfortunately there are missing games so the Elo is computed from slightly less data than the official results. The “better?” column tells how likely each bot is to be superior to the one below it.

 #  bot          score  Elo   better?
 1  ZZZKBot      75%    1749  98.8%
 2  tscmoo       74%    1731  100%
 3  PurpleWave   67%    1660  99.9%
 4  LetaBot      63%    1626  97.9%
 5  UAlbertaBot  62%    1611  85.1%
 6  MegaBot      61%    1604  96.7%
 7  Overkill     60%    1591  89.3%
 8  CasiaBot     59%    1582  55.2%
 9  Ziabot       59%    1581  58.8%
10  Iron         58%    1579  97.0%
11  AIUR         57%    1566  100%
12  McRave       47%    1476  97.6%
13  Tyr          45%    1462  79.9%
14  SRbotOne     45%    1456  99.9%
15  TerranUAB    38%    1397  99.9%
16  Bonjwa       33%    1347  94.9%
18  OpprimoBot   32%    1335  69.0%
17  Bigeyes      32%    1331  99.9%
19  Sling        26%    1275  100%
20  Salsa         9%    1041  -

Looking at the better? column, we see that the top 3 are well separated; the places are virtually sure to be accurate. ZZZKBot and tscmoo are close, but bayeselo thinks they are separated enough. Farther down, CasiaBot, Ziabot, and Iron are statistically hard to distinguish; there is not strong evidence that they finished in the correct order. Also OpprimoBot and Bigeyes are not well separated—as you might guess, since their order here is reversed from the official ranking.

Is this all the analysis we want of CIG 2017? I also have a script for the map balance, to check whether any race is favored. But it tells more about who competed than about the maps or bot skills.

simple SSCAIT rating statistics

I pulled down a rating table from today and calculated a few simple statistics.

         count  mean elo  median elo
terran   24     1965      1884
protoss  18     2007      2050
zerg     16     2013      2017
random    4     1901      1909

1. Terran is the most popular.

2. The fact that the mean is higher than the median for terran implies that a few terran bots stand out (like #1 Iron and #2 Krasi0), but most terran bots are weaker. Terran seems more difficult than protoss or zerg.

3. The higher median for protoss over zerg suggests that protoss may be easier to get strong with (contrary to my opinion).

4. Random struggles, as you might expect, but still has a higher median elo than terran.

If you look over the colors on the rating table, you’ll see that terran bunches up toward the bottom and zerg bunches more toward the middle, while protoss is spread throughout. In the upper part of the table, protoss and zerg seem pretty equally represented, though. There is not much difference between them.

Also, 62 bots is a lot!

AIIDE 2016 Bayesian Elo ratings

Again I have Elo as calculated by Remi Coulom’s bayeselo program. The # column gives the official ranking, so you can see how it differs from the rank by Elo (the bayeselo ranking is slightly more accurate because it takes into account all the information in the tournament results, not only the raw winning rate). I left out the 95% confidence interval column as relatively uninteresting, since the “better?” column tells us how likely each bot is to be superior to the one below it.

 #  bot          score  Elo   better?
 1  Iron         87%    2016  99.4%
 2  ZZZKBot      85%    1974  99.6%
 3  tscmoo zerg  83%    1932  99.9%
 4  LetaBot      74%    1815  99.8%
 5  UAlbertaBot  70%    1774  99.9%
 6  Ximp         65%    1699  99.6%
 8  Aiur         61%    1663  51.6%
 7  Overkill     62%    1663  99.6%
 9  MegaBot      58%    1627  88.5%
10  IceBot       57%    1611  57.5%
12  Xelnaga      57%    1608  50.0%
11  JiaBot       57%    1608  98.1%
13  Skynet       55%    1581  100%
14  GarmBot      43%    1441  100%
16  TerranUAB    27%    1250  74.7%
15  NUSBot       27%    1240  99.9%
17  SRbotOne     22%    1167  99.0%
18  Cimex        21%    1130  92.6%
19  Oritaka      20%    1106  99.3%
20  CruzBot      17%    1064  100%
21  Tyr           1%     533  -

There are some switches from the official ranking, due to bots being statistically indistinguishable. Overkill and Aiur are in a dead heat. IceBot (terran), Xelnaga (protoss) and JiaBot (zerg) are also virtually even. bayeselo gives IceBot a 57.6% chance of being better than JiaBot two ranks down, essentially the same as its 57.5% chance of being better than Xelnaga one rank down.

Tomorrow: The per-map crosstables.

how bots like maps

I decided to analyze the maps to see how bots feel about them overall. This data is derived from yesterday’s big table of how much bots like each SSCAIT map. The “spread” column is the mean of the absolute value of the Elo deviation numbers for a given map, across all the bots. I thought of calling it “controversy”; it measures how much bots like or dislike the map. Maps that all bots do OK on get low numbers; maps that some bots love and others hate get high numbers.

The “RMS” column is the root mean square of the same data. Statistically, it’s a fairer measure of the differences. It’s bigger because it puts more weight on outliers. The two measures don’t agree closely.
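
In code the two columns are trivial; for one map, given its column of per-bot Elo deviation numbers:

# spread = mean of the absolute deviations, RMS = root mean square of the same numbers
sub spread_and_rms {
  my @deviations = @_;     # one map's Elo deviation numbers, one per bot
  my ($abs_sum, $sq_sum) = (0, 0);
  foreach my $d (@deviations) {
    $abs_sum += abs($d);
    $sq_sum  += $d * $d;
  }
  my $n = scalar @deviations;
  return ($abs_sum / $n, sqrt($sq_sum / $n));
}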

Destination is the most “controversial” map, with 60 Elo spread. If you pick one bot that likes Destination and one bot that dislikes it, on average the bot that likes it will have a 60 Elo advantage, which means a 59% win rate if the bots are otherwise even—nothing devastating. Neo Moon Glaive has Elo spread 41 or about 56% advantage, not much different. Even if you go with the RMS number, the peak 81 Elo RMS difference means a 61% win rate, still not much different.

map              spread  RMS
Benzene          45      57
Destination      60      78
HeartbreakRidge  53      81
NeoMoonGlaive    41      53
TauCross         51      74
Andromeda        49      70
CircuitBreaker   54      69
EmpireoftheSun   50      69
FightingSpirit   51      72
Icarus           46      60
Jade             50      64
LaMancha1.1      49      65
Python           47      60
Roadrunner       46      63

Bottom line: On this analysis, the maps don’t seem to be distorting the competition. No highly “controversial” maps are introducing widespread unfairness.

Elo rating variations by map

From the SSCAIT data, I calculated the Elo advantage or disadvantage that each bot sees on each map. If it played all its games on that map, its Elo would change by that much. More or less; there isn’t as much data, so the advantage numbers are less accurate than the original Elo. I increased the Elo K factor to account for the smaller amount of data.
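
For the record, here is one plausible way such per-map advantages can be computed. It is a sketch of the general approach, not necessarily the exact calculation behind the table: rerun the Elo updates over only the games on one map, starting every bot from its overall rating and using a bigger K, and report how far each bot's rating drifts from its starting point.

# Sketch of a per-map advantage calculation (an illustration, not the actual script).
# %overall holds each bot's overall Elo; @map_games holds the [winner, loser]
# pairs for games played on a single map.
my $elo_k = 40;                      # larger K, since there is less data per map
my %per_map = %overall;              # every bot starts from its overall rating
foreach my $game (@map_games) {
  my ($winner, $loser) = @$game;
  my $expected = 1.0 / (1.0 + 10.0 ** (($per_map{$loser} - $per_map{$winner}) / 400.0));
  my $delta = $elo_k * (1.0 - $expected);
  $per_map{$winner} += $delta;
  $per_map{$loser}  -= $delta;
}
# advantage = per-map rating minus overall rating, in Elo points
my %advantage;
foreach my $bot (keys %per_map) {
  $advantage{$bot} = $per_map{$bot} - $overall{$bot};
}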

The 14 active maps:

  • (2)Benzene.scx
  • (2)Destination.scx
  • (2)HeartbreakRidge.scx
  • (3)NeoMoonGlaive.scx
  • (3)TauCross.scx
  • (4)Andromeda.scx
  • (4)CircuitBreaker.scx
  • (4)EmpireoftheSun.scm
  • (4)FightingSpirit.scx
  • (4)Icarus.scm
  • (4)Jade.scx
  • (4)LaMancha1.1.scx
  • (4)Python.scx
  • (4)Roadrunner.scx

The Elo ratings are repeated from the original post. I dropped 5 bots of the 103 for lack of data. The top number in each colored cell is the advantage or disadvantage that bot sees when playing on that map, in Elo points. You can look up winning rates for a given advantage in the Elo table. The bottom number is the count of games. Some bots have few games, like the new ZerGreenBot. A few bots have not played on every map, and get “-” instead of numbers.

bot | Elo | Benzene | Destination | Heartbreak Ridge | Neo Moon Glaive | Tau Cross | Andromeda | Circuit Breaker | Empire of the Sun | Fighting Spirit | Icarus | Jade | La Mancha 1.1 | Python | Roadrunner | earliest | latest
krasi02163
2128
53
164
57
169
26
146
7
159
18
143
26
126
-76
144
-35
158
-73
140
-31
158
23
156
-63
155
-6
156
72
154
2015 Nov 302016 Sep 27
Iron bot2081
1990
139
141
-15
144
47
134
-35
157
53
140
-34
133
-79
129
-36
142
17
145
-12
142
17
148
1
150
-78
137
15
148
2015 Nov 272016 Sep 26
Marian Devecka2065
4117
-57
320
-18
255
70
303
-12
287
-37
322
104
273
45
285
-22
297
23
301
-41
289
-72
309
28
310
26
284
-38
282
2014 Oct 292016 Sep 27
Martin Rooijackers2011
6449
-25
462
56
473
-67
478
0
450
40
477
44
449
88
478
-37
458
-15
463
42
480
-156
446
0
429
31
475
-2
431
2014 Oct 292016 Sep 27
tscmooz1991
4972
-23
380
120
354
-20
323
-52
316
50
354
-50
393
21
354
-9
341
-78
370
43
364
-6
354
29
377
-54
328
29
364
2015 Feb 272016 Sep 27
tscmoo1978
5682
-8
359
44
355
-15
438
39
389
27
445
30
402
12
447
-14
420
-97
410
53
408
-70
385
29
397
41
396
-71
431
2015 Jan 222016 Sep 27
LetaBot CIG 20161932
444
120
26
33
38
-126
35
5
29
-81
30
40
30
-12
33
-64
26
100
23
2
29
-92
35
14
37
35
45
27
28
2016 Aug 012016 Sep 27
WuliBot1871
984
-9
73
-58
83
66
56
77
77
-32
68
14
59
-22
71
36
84
12
64
-21
64
54
80
-37
67
-55
68
-26
70
2016 Apr 192016 Sep 26
Simon Prins1867
5400
-20
388
-25
410
51
381
67
432
19
356
-18
412
-110
374
97
376
-21
396
-25
385
95
375
-54
324
17
373
-73
418
2015 Jan 252016 Sep 27
ICELab1865
6078
-21
441
-69
442
73
458
-127
377
-52
447
-17
398
-2
455
3
440
-13
425
-25
471
49
421
34
437
72
435
96
431
2014 Oct 292016 Sep 27
FlashTest1863
204
-71
13
-106
12
-117
13
149
22
94
18
31
14
-121
15
2
20
78
15
-57
10
79
13
59
13
4
12
-23
14
2016 Mar 222016 Jul 27
Sijia Xu1849
2313
22
171
30
155
19
182
-8
166
-2
148
-49
164
19
166
-25
165
-45
138
24
171
36
161
-72
153
51
201
0
172
2015 Oct 102016 Sep 27
LetaBot SSCAI 2015 Final1813
416
-10
28
83
31
-41
22
31
34
15
27
-223
37
199
28
-44
29
-96
33
-44
36
-32
27
1
29
79
25
81
30
2016 Aug 042016 Sep 27
Dave Churchill1804
6023
-29
473
200
428
-82
436
18
413
19
433
-74
412
-50
417
72
429
47
445
-68
396
-53
408
100
438
-46
455
-53
440
2014 Oct 292016 Sep 27
Chris Coxe1800
2195
31
153
188
169
66
153
-106
165
-25
149
-79
149
-34
166
-7
167
-18
153
115
146
-124
153
-114
154
114
156
-8
162
2015 Sep 032016 Sep 27
Tomas Vajda1790
6088
-2
441
9
439
21
441
-77
449
-14
443
-18
421
34
398
-7
458
55
422
17
425
61
440
-80
424
-6
439
7
448
2014 Oct 292016 Sep 27
Flash1777
991
-19
70
-163
65
16
59
7
68
-70
59
-12
85
3
76
72
69
23
69
64
87
71
84
-25
61
30
75
3
64
2016 Apr 182016 Sep 27
LetaBot IM noMCTS1766
1226
61
79
127
88
-116
83
39
106
85
87
-114
93
-106
89
-24
93
-28
89
58
93
-66
70
18
86
-2
92
68
78
2016 May 182016 Aug 01
Zia bot1757
536
54
36
-63
40
15
39
-22
41
50
39
-63
43
93
42
-7
34
20
36
-78
33
-58
30
-55
40
66
39
48
44
2016 Jul 072016 Sep 27
A Jarocki1741
932
20
63
121
66
125
62
-42
72
-48
74
18
53
-104
78
74
81
25
76
-93
59
-110
64
-39
55
52
67
3
62
2015 Oct 042016 Jan 26
PeregrineBot1728
1262
-21
104
94
89
-28
95
123
76
-133
90
77
95
-4
77
10
85
-67
94
-32
83
-51
94
11
90
14
103
9
87
2016 Feb 092016 Sep 10
tscmoop1721
1982
155
139
56
145
47
140
50
127
-7
130
53
154
-68
127
13
141
-17
140
-54
171
-59
128
-44
158
-51
148
-74
134
2015 Nov 112016 Sep 26
Andrew Smith1718
6160
63
460
-37
387
46
445
-10
457
-11
452
33
449
-21
463
89
480
22
441
-58
455
-25
445
-29
418
-58
400
-3
408
2014 Oct 292016 Sep 27
Florian Richoux1716
5970
-35
408
-107
425
120
441
42
411
-43
446
-46
443
37
397
152
468
66
416
-25
451
46
433
-90
410
-29
391
-88
430
2014 Oct 292016 Sep 27
Carsten Nielsen1695
4683
-19
361
27
341
-65
334
33
315
-21
330
-1
319
-27
353
53
336
88
369
-35
318
6
318
-78
282
26
340
12
367
2015 Mar 172016 Sep 27
Soeren Klett1687
6002
39
459
35
426
-12
420
63
396
-108
440
39
481
-20
402
-25
451
-4
454
36
433
-63
407
-19
406
71
410
-32
417
2014 Oct 292016 Sep 27
Vaclav Horazny1686
4140
33
291
78
291
-17
302
18
295
-22
274
-46
332
33
290
-31
308
4
284
-38
270
58
313
-96
315
25
296
2
279
2014 Oct 292015 Nov 18
La Nuee1662
558
-66
36
41
24
81
40
100
35
57
44
-11
41
-82
44
-21
36
50
52
-7
34
-78
43
-11
36
68
55
-121
38
2015 Dec 132016 Mar 18
Jakub Trancik1657
6136
102
427
79
454
1
434
-32
443
-64
445
-10
459
-16
438
-106
424
18
439
-35
413
-23
404
-40
463
63
445
63
448
2014 Oct 292016 Sep 27
Marek Suppa1655
4397
61
322
78
334
1
330
14
321
-48
304
-59
328
-88
315
-6
291
25
336
56
326
6
292
-25
300
-41
302
25
296
2015 Jan 052016 Mar 18
Krasimir Krystev1653
4292
-24
322
-59
283
92
304
49
298
8
327
-56
295
-51
296
-43
318
-76
326
-9
318
-24
280
132
318
-58
299
117
308
2014 Oct 292016 Mar 10
ASPbot20111652
222
-22
21
-178
17
-27
12
-13
17
98
20
154
11
-29
13
87
17
17
10
75
14
-59
17
-135
15
21
18
11
20
2015 Jan 292016 Feb 25
Marcin Bartnicki1633
1377
59
97
43
106
63
88
-54
110
-33
109
-20
110
-21
92
-46
96
-20
97
-12
92
53
92
-38
99
75
102
-48
87
2014 Nov 282016 Mar 18
Tomas Cere1631
6131
-27
446
-41
419
-2
444
-46
443
109
459
14
448
2
429
2
424
48
433
-24
442
-69
417
49
480
-12
426
-2
421
2014 Oct 292016 Sep 27
MegaBot1630
419
-22
29
27
27
6
34
36
27
-27
26
-21
25
-28
21
-1
28
44
41
64
28
-18
37
-2
37
-66
34
9
25
2016 Aug 012016 Sep 27
Aurelien Lermant1622
3673
-30
258
-8
249
15
268
-32
247
30
260
-37
287
-46
250
-73
266
87
262
38
253
51
255
80
285
-20
261
-53
272
2015 Jun 222016 Sep 27
Matej Kravjar1619
983
-29
69
22
67
-96
75
16
71
-266
75
12
65
78
73
74
79
-25
59
20
75
82
75
68
58
105
78
-60
64
2014 Oct 292015 Feb 18
Daniel Blackburn1605
4591
-61
327
-83
332
74
337
8
326
-8
306
66
330
-56
354
-26
350
74
327
-23
312
17
311
-3
319
-10
345
31
315
2014 Oct 292016 Jan 26
Gabriel Synnaeve1584
18
-3
1
-38
3
31
1
24
1
-26
2
-
-
19
1
-
-5
2
13
1
13
2
-9
2
-19
2
2015 Jan 302015 Nov 24
David Milec1566
49
-
-11
5
48
2
-47
4
-73
4
9
4
-2
3
-26
6
-13
2
16
4
25
3
31
5
70
5
-26
2
2015 Jan 132015 Jan 20
Odin20141565
5602
-60
416
-40
377
16
396
-40
399
-38
400
42
402
-45
380
-68
401
-29
424
58
416
52
393
66
411
14
365
73
422
2014 Dec 212016 Sep 11
Gaoyuan Chen1559
5106
-18
387
-40
348
-20
357
-64
387
59
379
48
352
-18
330
75
359
2
391
-24
362
-53
383
33
371
18
345
1
355
2015 Feb 102016 Sep 27
Henri Kumpulainen1553
881
62
75
-128
64
37
68
51
64
-23
69
33
69
-48
58
15
67
-36
62
22
74
27
48
1
58
-11
54
-1
51
2016 Jan 132016 May 31
Martin Dekar1533
2627
-58
195
14
186
25
189
5
196
-105
168
-27
189
73
178
-3
206
30
174
18
207
72
202
-9
166
-53
201
17
170
2014 Oct 292016 Jan 25
Serega1505
3802
-35
280
104
257
131
262
51
260
-3
278
26
261
-23
275
-114
285
-76
276
6
270
-4
268
-1
279
-100
291
38
260
2015 Jan 312016 Jan 26
Chris Ayers1481
1520
-34
115
-40
124
11
112
-28
106
0
106
-72
111
65
88
69
105
38
113
-52
117
49
109
-44
89
-44
102
82
123
2015 Aug 102016 Jan 26
Nathan a David1481
991
13
57
0
61
-91
77
31
88
-7
65
-34
75
-23
72
103
61
68
65
43
72
-26
78
-110
70
-34
64
68
86
2016 Feb 232016 Aug 08
DAIDOES1471
485
131
36
39
32
47
31
-27
30
1
28
-25
42
123
39
-159
28
29
44
32
34
-64
34
-80
36
-51
30
5
41
2016 Jun 132016 Sep 08
Igor Lacik1454
5852
19
420
48
399
13
447
-28
408
-38
375
31
418
-95
461
111
418
-28
386
-9
442
12
415
-54
405
26
429
-7
429
2014 Oct 292016 Sep 08
Matej Istenik1449
6054
-63
412
18
457
44
458
12
421
-20
429
-9
435
-109
417
-4
472
8
458
88
382
-3
414
8
426
-2
456
33
417
2014 Oct 292016 Sep 27
EradicatumXVR1443
4539
-41
340
18
324
62
309
66
319
-14
322
-5
315
60
322
-90
303
-24
331
-39
330
11
340
24
311
19
347
-47
326
2014 Nov 042016 Jan 23
Tomasz Michalski1432
433
6
29
34
30
-109
34
-35
23
142
27
30
40
-16
27
-92
31
174
32
-54
44
-5
23
12
31
-24
28
-63
34
2015 Dec 222016 Mar 18
Oleg Ostroumov1431
1345
10
92
38
83
63
69
-33
116
74
96
22
98
60
96
-1
116
101
109
-82
80
-104
97
-89
90
80
102
-138
101
2014 Oct 292016 Jan 26
NUS Bot1426
3333
78
216
-20
210
-525
233
64
236
29
249
25
221
-19
257
100
241
23
257
-39
254
133
240
104
232
36
257
11
230
2015 May 192016 Sep 06
Martin Pinter1425
1580
53
114
-54
110
69
108
7
110
-44
122
0
123
8
101
77
108
-60
97
26
131
-31
139
61
112
-53
95
-59
110
2014 Oct 292015 Dec 11
Roman Danielis1417
2945
-82
222
35
209
23
198
18
206
-9
202
13
221
89
202
-5
206
38
207
10
200
-18
225
-32
211
-29
226
-50
210
2014 Oct 292016 Sep 26
ZerGreenBot1416
36
-
-21
3
-13
3
-6
2
13
3
42
2
-29
3
8
1
6
1
-73
5
45
4
-7
4
36
4
0
1
2016 Sep 222016 Sep 27
Marek Kadek1413
5246
41
406
75
382
8
392
5
385
-28
383
-24
357
-12
369
-62
394
-49
359
-24
367
18
379
37
350
18
348
-3
375
2014 Oct 292016 May 22
Ian Nicholas DaCosta1404
2928
-38
217
-31
210
55
214
29
213
-134
222
-62
192
5
193
10
218
-17
205
-18
213
73
233
-47
199
90
186
84
213
2015 Apr 272016 Sep 08
AwesomeBot1403
473
-19
32
-24
31
-11
32
-23
49
20
29
149
29
-18
26
-3
30
9
53
-67
36
4
36
33
30
-12
32
-39
28
2016 Jun 162016 Sep 08
Radim Bobek1390
1151
-85
70
11
96
-36
68
54
77
-39
94
-95
81
-11
78
0
89
184
87
43
91
61
70
-58
75
-72
100
44
75
2015 Oct 012016 Mar 06
Adrian Sternmuller1375
4379
8
316
78
313
-85
330
-74
321
73
325
79
335
73
287
-42
315
-59
330
69
293
-115
288
33
331
-109
305
72
290
2014 Oct 302016 Jul 22
Martin Strapko1366
1144
-55
98
-2
103
-21
82
12
65
-6
83
88
89
20
73
-51
76
23
72
69
80
-57
81
-108
80
50
81
38
81
2014 Oct 292016 Jan 26
Maja Nemsilajova1363
4117
73
301
-73
292
73
322
7
309
31
298
-51
302
-77
269
-10
290
76
264
-36
302
-82
299
14
309
44
270
10
290
2014 Nov 042015 Nov 29
Johan Kayser1361
413
-58
24
153
19
22
33
-50
29
-2
29
27
36
-156
19
-150
28
-59
40
66
28
21
27
128
28
26
42
34
31
2016 Jul 292016 Sep 27
UPStarcraftAI1360
600
-16
43
56
48
6
41
45
48
11
36
12
43
-76
56
22
40
-28
44
52
37
37
43
23
32
-47
40
-98
49
2015 Dec 242016 Apr 13
Martin Vlcak1353
1210
-49
68
-83
97
-51
81
-26
86
-25
98
-31
87
88
60
119
94
35
79
-24
102
16
90
45
90
33
83
-46
95
2016 Feb 162016 Sep 07
Johannes Holzfuss1351
674
-46
52
98
39
24
36
59
51
-42
47
-27
54
-73
57
92
64
-82
49
-6
33
-4
57
-82
52
80
44
8
39
2016 Mar 052016 Jun 15
Vojtech Jirsa1350
2759
-87
212
-79
182
32
200
-8
207
-127
205
6
174
49
184
83
211
27
199
-30
186
79
207
117
193
31
214
-93
185
2015 Jan 122015 Sep 05
JompaBot1349
1043
-151
67
107
78
-180
70
-44
67
-52
83
81
61
95
88
1
83
71
75
82
80
-9
70
-72
77
68
73
2
71
2016 Feb 042016 Aug 13
Rob Bogie1346
651
42
48
-313
54
-193
38
135
45
365
49
-361
47
246
38
-333
43
-418
52
291
55
273
45
298
43
-306
40
274
54
2016 May 142016 Sep 06
Christoffer Artmann1344
395
30
25
-123
23
14
22
57
29
-45
31
-155
22
-36
27
-52
34
143
30
109
32
28
41
-17
31
-51
26
99
22
2016 Aug 072016 Sep 27
Marek Gajdos1331
1370
2
91
-102
100
90
95
-13
102
-139
99
93
87
42
81
39
107
-78
107
-30
106
-7
97
9
94
79
101
15
103
2016 Jan 302016 Sep 11
Travis Shelton1314
1212
38
72
-4
78
77
84
-31
80
-1
70
47
90
-25
105
2
92
-27
100
-30
104
-9
93
44
86
-87
82
5
76
2016 Feb 282016 Sep 06
Peter Dobsa1307
3015
27
213
26
205
-45
218
-71
199
-19
215
82
228
-4
232
81
197
58
212
-28
224
-32
207
42
204
-65
237
-54
224
2015 Jan 112015 Oct 02
VeRLab1304
888
-75
65
-5
52
-16
64
25
75
-20
56
-27
63
84
51
-6
79
42
71
57
77
99
51
6
52
-32
54
-131
78
2016 Feb 282016 Aug 01
Bjorn P Mattsson1295
4432
18
303
75
328
48
340
-4
303
-20
307
39
302
6
333
-26
317
12
304
-58
315
-96
326
-50
345
113
280
-57
329
2015 Apr 052016 Sep 27
Lukas Sedlacek1293
63
48
3
-80
10
74
3
-78
5
2
4
-14
6
-53
9
39
3
77
3
29
4
-15
2
7
3
-18
5
-20
3
2015 Jan 122015 Jan 20
Sergei Lebedinskij1293
1083
-16
59
11
71
-19
77
69
77
47
72
-97
83
70
70
46
69
3
86
-85
85
55
71
36
106
-51
74
-68
83
2015 May 282015 Sep 03
Vladimir Jurenka1278
6041
-87
429
-66
435
-33
454
-28
432
-15
402
11
438
44
453
77
435
-96
443
33
406
15
467
77
435
57
413
10
399
2014 Nov 042016 Sep 27
neverdieTRX1272
334
65
29
-3
27
56
21
-9
26
-38
19
13
20
-82
27
-50
28
44
26
-159
27
67
21
35
25
-3
24
62
14
2016 Jul 192016 Sep 10
OpprimoBot1256
1994
8
131
-23
138
14
144
122
146
70
143
-8
160
33
131
30
153
-70
135
-88
134
38
140
1
149
-93
139
-35
151
2015 Nov 182016 Sep 27
Marek Kruzliak1255
399
-1
31
-92
35
66
23
-46
27
146
28
111
27
-123
23
112
25
-43
36
-99
32
80
26
15
30
-25
25
-101
31
2014 Nov 282015 Jan 20
Sungguk Cha1250
697
-9
47
-38
46
34
54
-148
40
50
52
-62
51
130
48
-23
41
-3
65
-63
51
117
43
76
48
-44
61
-17
50
2016 Jun 052016 Sep 27
Jacob Knudsen1247
1244
-13
79
-32
89
92
99
-20
89
36
81
34
75
70
87
-87
88
-77
81
-5
99
55
87
-7
103
3
94
-47
93
2016 Feb 232016 Sep 10
Ludmila Nemsilajova1228
409
52
27
15
21
13
22
74
25
-90
27
-9
27
65
46
-71
37
-51
32
-75
28
1
30
-24
29
58
30
43
28
2014 Nov 282015 Jan 21
Karin Valisova1226
1067
159
84
-42
86
-11
71
-2
72
-28
74
0
79
-62
68
-40
76
6
72
-16
66
4
68
-29
85
31
82
31
84
2014 Nov 042016 Jan 26
HoangPhuc1209
300
-54
28
-46
32
25
22
3
16
83
21
-88
24
-21
16
51
20
-52
24
56
17
-70
25
-27
15
-37
25
178
15
2016 Jul 182016 Sep 07
Sebastian Mahr1182
1191
-2
67
-61
89
-18
83
-34
96
14
71
118
74
95
87
25
81
25
79
-13
97
-63
94
-40
89
17
97
-61
87
2016 Jan 132016 Aug 08
Jan Pajan1179
997
-15
85
58
71
14
64
-49
77
14
67
37
72
-61
78
2
61
-29
67
-19
70
-77
91
52
72
21
62
53
60
2014 Nov 042016 Jan 05
Pablo Garcia Sanchez1174
579
26
33
11
53
-28
42
-74
33
3
45
35
39
84
34
-49
43
52
39
26
37
24
50
-21
46
-37
51
-51
34
2015 Dec 242016 Apr 13
Ivana Kellyerova1131
1499
-59
115
-89
113
2
99
-20
113
71
111
38
108
-39
95
-6
99
19
125
30
92
53
106
-54
110
-5
97
60
116
2014 Nov 042015 Apr 01
Lucia Pivackova1090
717
-69
50
-1
53
-32
55
20
50
94
41
-13
58
-25
47
23
42
-57
47
39
49
-42
55
-32
50
48
59
46
61
2014 Oct 302015 Jan 20
Tae Jun Oh1036
138
43
11
21
8
-7
10
-40
7
96
8
-18
9
95
6
-35
9
-90
17
129
7
-40
14
-95
11
-21
9
-38
12
2016 Mar 222016 Apr 11
Denis Ivancik1022
418
-65
37
46
30
-23
25
-34
26
109
29
78
21
-26
36
96
26
-90
43
4
28
-35
27
-28
18
-43
34
11
38
2014 Nov 282015 Jan 20
ButcherBoy970
422
38
21
38
23
-68
32
-30
35
-43
34
100
31
46
29
-40
29
-6
35
-40
31
-49
34
128
23
7
30
-81
35
2016 Jun 212016 Sep 06
Jon W964
790
-30
58
47
66
-100
62
6
59
83
45
27
67
4
52
-44
50
7
60
35
47
-9
57
79
49
-66
62
-39
56
2015 Apr 302015 Jul 09
Matyas Novy885
1693
77
103
-76
132
-60
133
-69
122
-2
110
5
120
1
107
67
119
110
104
-59
145
-20
122
66
126
-83
124
44
126
2015 Feb 042015 Jul 09

There are some interesting things to see in the chart, but first look at Rob Bogie! That’s the bot MaasCraft. All the bots have preferences, some have strong preferences, but MaasCraft loves some maps and hates others. Why is that? If it could be made to love all the maps....

comparing AIIDE 2015 and CIG 2016 Elo ratings

The cool technique I had in mind to compare ratings across tournaments turned out not to work. Not cool after all. But 6 bots played unchanged in both AIIDE 2015 and CIG 2016, and we can compare their relative ratings. In this table the subtract column gives the AIIDE 2015 rating minus the CIG 2016 rating, and the normalize column gives the same difference with the average offset taken out.

bot          AIIDE Elo  CIG Elo  subtract  normalize
UAlbertaBot  1895       1778     117        35
Overkill     1890       1796      94        12
Aiur         1784       1687      97        15
TerranUAB    1372       1338      34       -48
OpprimoBot   1231       1154      77        -5
Bonjwa       1171       1099      72       -10
average                           82         0

As you might expect, two tournaments with different maps and different opponents give different ratings. UAlbertaBot and Overkill swapped ranks among the 6. But after correcting for the 82 point offset (since only rating differences matter), the ratings turn out to be quite close between the tournaments. The biggest difference is for TerranUAB. Look up 48 points in the Elo table—it says that TerranUAB has a 57% probability of beating itself, not a drastic error.
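
For anyone checking the arithmetic, the offset and the normalize column come out like this:

# subtract = AIIDE Elo minus CIG Elo for each of the 6 shared bots
my @subtract = (117, 94, 97, 34, 77, 72);
my $offset = 0;
$offset += $_ for @subtract;
$offset /= scalar @subtract;                      # comes to about 82
my @normalize = map { $_ - $offset } @subtract;   # the normalize column, near 0 on average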

You can try to convert a CIG 2016 rating into a rough estimate of an AIIDE 2015 rating by adding 82. For example, tscmoo terran earned a CIG rating of 1888, which corresponds to an AIIDE rating of 1888+82 = 1970, whereas the tscmoo zerg that played in AIIDE earned a rating there of 2026. So the estimate appears to be way off. But estimates made this way are likely to be closer for bots near the middle of the pack.

Next: Another mass of colorful crosstables.

CIG 2016 Bayesian Elo ratings

Same as yesterday, Bayesian Elo ratings calculated by bayeselo, this time for CIG 2016. I included both the qualifier and the final, of course. That gives the best possible ratings, so that confidence is higher for the 8 finalists. But the “score” column becomes difficult to interpret, because part of the score of the top 8 bots comes from the final when they faced tougher opposition. You can’t directly compare the scores of bots 1-8 with the scores of 9-16, only the ratings.

Also, with this analysis it doesn’t make sense to compare the rating values between tournaments. Each tournament is independently scaled to have an average rating of 1500. Only the relative ratings of bots in the same tournament can be compared. Ratings are relative.

 #  bot          score  Elo   95% conf.  better?
 1  tscmoo       73%    1888  1872-1904  98.5%
 2  Iron         71%    1864  1848-1880  99.9%
 3  LetaBot      68%    1827  1811-1843  99.7%
 4  Overkill     65%    1796  1781-1812  70.9%
 5  ZZZKBot      64%    1790  1775-1805  86.8%
 6  UAlbertaBot  63%    1778  1763-1793  99.8%
 7  MegaBot      60%    1746  1731-1761  99.9%
 8  Aiur         54%    1687  1671-1702  72.7%
 9  Tyr          62%    1679  1659-1699  100%
10  Ziabot       46%    1500  1479-1521  100%
11  TerranUAB    34%    1338  1316-1360  100%
12  SRbotOne     22%    1158  1133-1183  59.1%
13  OpprimoBot   22%    1154  1128-1179  97.1%
14  XelnagaII    21%    1119  1092-1145  86.3%
15  Bonjwa       19%    1099  1072-1125  100%
16  Salsa         1%     579  510-636    -

The official results have LetaBot a hair ahead of ZZZKBot, then Overkill following. bayeselo has ZZZKBot and Overkill reversed, saying that LetaBot is clearly superior to Overkill, which is fairly likely to be superior to ZZZKBot. The difference comes about because, of course, the official results include only the final. Martin Rooijackers was justified after all in saying that ZZZKBot had fallen from the top 3. All other results agree with the official ranking. The last of the finalists, Aiur, is 72.7% likely to be superior to Tyr, so there is some doubt that the best finalists won through (in general that doubt can’t be avoided, though).

The tail-ender Salsa has a wide and asymmetrical confidence interval. It takes more evidence to pin down an extreme rating than a middle-of-the-road rating.

Tomorrow: I’ll try an analysis in which the ratings of unchanged bots are carried over from AIIDE 2015 to CIG 2016, so that we can compare between tournaments. I’m not sure how well it will work, or even if I can get it to work at all, but it will be interesting to try.

AIIDE 2015 Bayesian Elo ratings

Krasi0 asked me to calculate ratings for tournaments using Rémi Coulom’s excellent bayeselo program. Here are ratings for AIIDE 2015.

bayeselo does not calculate basic Elo ratings like my little code snippets. It can’t calculate an Elo curve over time. It assumes that the players are fixed and have one true rating, and it crunches a full-on Bayesian statistical analysis to find not only the rating as accurately as possible, but also a 95% confidence interval so you can see how accurate the rating is. The ratings for the bots that learn, which aren’t fixed in strength as bayeselo assumes, can be seen as measuring the average strength over the tournament—the tournament score is no different in that respect.

The last column of the table is the probability of superiority, bayeselo’s calculated probability that the bot truly is better than the bot ranked immediately below it. The last bot doesn’t get one, of course. (bayeselo calculates this for all pairs, but in a tournament this long it rounds off to 100% for most.)

 #  bot            score  Elo   95% conf.  better?
 1  tscmoo         89%    2026  2002-2050  81.0%
 2  ZZZKBot        88%    2011  1988-2035  99.9%
 3  UAlbertaBot    80%    1895  1874-1916  61.2%
 4  Overkill       81%    1890  1870-1911  99.9%
 5  Aiur           73%    1784  1765-1803  99.9%
 6  Ximp           68%    1712  1694-1731  99.9%
 7  Skynet         64%    1666  1648-1684  50.7%
 8  IceBot         64%    1666  1648-1684  88.4%
 9  Xelnaga        63%    1650  1632-1668  81.4%
10  LetaBot        61%    1638  1620-1656  99.9%
11  Tyr            54%    1553  1534-1572  96.0%
12  GarmBot        52%    1531  1513-1549  100%
13  NUSBot         39%    1380  1362-1398  73.1%
14  TerranUAB      38%    1372  1354-1390  99.8%
15  Cimex          36%    1335  1316-1353  99.6%
16  CruzBot        32%    1299  1280-1317  99.9%
17  OpprimoBot     28%    1231  1211-1250  96.7%
18  Oritaka        26%    1205  1185-1225  84.0%
19  Stone          25%    1190  1170-1210  91.3%
20  Bonjwa         23%    1171  1151-1191  100%
21  Yarmouk         9%     913  885-939    95.0%
22  SusanooTricks   8%     882  853-910    -

In the official results, Overkill came in ahead of UAlbertaBot with a higher tournament score. bayeselo ratings are more accurate than score because they take into account more information, and bayeselo says UAlbertaBot > Overkill with probability 61%. As explained in the original results, it’s a statistical tie, but bayeselo says it’s not an even tie but a little tilted in a counterintuitive way.

Skynet looks dead even with IceBot in the rounded-off numbers above. bayeselo says that Skynet > IceBot with probability 50.7%, a hair off dead even. Even the large number of games in this tournament could not rank all the bots accurately.

Tomorrow: The same for CIG 2016.

SSCAIT Elo ratings over time

Here it is, the great chart of SSCAIT Elo ratings over time. Well, not here actually, I put it on a separate page so that not every blog visitor has to load the mass of Javascript and data.

SSCAIT interactive ratings chart for 100 bots

The chart is generated from this csv file. Spreadsheet software or stats software should open it right up, if you want to poke the data yourself. It’s 950 lines of 101 columns each, a date and ratings for 100 bots.
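
If you would rather script it than spreadsheet it, here is a tiny sketch that reads the file and prints the last recorded rating in each column. The filename is a placeholder, and the layout is assumed to be exactly as described above: date first, then one field per bot, empty where a bot has no rating that day.

# Sketch: print the last non-empty rating in each column of the csv.
open my $fh, '<', 'sscait_elo.csv' or die "can't open csv: $!";   # placeholder filename
my @last;
while (my $line = <$fh>) {
  chomp $line;
  my @fields = split /,/, $line, -1;     # -1 keeps trailing empty fields
  for my $i (1 .. $#fields) {
    $last[$i] = $fields[$i] if defined $fields[$i] and $fields[$i] ne '';
  }
}
close $fh;
printf "column %3d  last value %s\n", $_, $last[$_] for 1 .. $#last;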

Data in the csv file is filled in for each day from the bot’s first to its last game in the original raw data file (which is just a list of games), and left blank on other days. There may be an off-by-one error causing some bots to miss their last day of data; I didn’t bother to verify it since it’s hardly visible. Some bots have short lifetimes and only appear on the graph as a brief squiggle. Some bots have inactive periods in between their first and last games; the inactive periods with no games graph as flat lines. In excluding the 3 bots with insufficient games, I also removed them from the rating calculation, which improves the ratings to a tiny degree. The rankings stay exactly the same for all 100 bots, though.

Elo ratings are easy to calculate

Elo ratings may seem mysterious and complicated, but basic Elo ratings are super easy to calculate. The entire code is in these 2 yellow boxes. It’s perl, a particularly ugly language, but any coder should be able to read it.

First, given two ratings, here’s how you find out the probability that the player with the first rating beats the player with the second. I broke it out because it’s useful on its own. It’s literally 1 line of calculation. This is the logistic function, which has been shown empirically to be a good fit for the job. 400 is a scaling constant which is standard for Elo.

# Probability that the player rated $rating1 defeats the player rated $rating2.
sub expected_win_rate ($, $) {
  my $rating1 = shift;
  my $rating2 = shift;

  # Logistic curve on the rating difference; 400 is the standard Elo scale.
  return 1.0 / (1.0 + 10.0 ** (($rating2 - $rating1) / 400.0));
}

Second, given two ratings and a game result, here’s how you figure out the two new ratings after the game. This time it’s 2 lines of calculation, and it calls the expected win rate function above. $actual is the game result, 0.0 if the first player lost and 1.0 if the first player won. You can use 0.5 for a draw, but I skipped over draws because I’m not sure what ‘draw’ means in the file I have (there are 1211 draws in the 141,164 games, a negligible number, so it shouldn’t make much difference). $elo_k is the K constant for the Elo formula, which is the maximum rating change per game. Setting $elo_k high means that ratings react quickly to changes, and setting $elo_k low means ratings are more accurate if changes are slow. I have a lot of games, so I set $elo_k to a low value, 16. Other common values are 24 and 32.

# Given the two ratings and the game result, return the two updated ratings.
sub update_elo ($, $, $) {
  my $rating1 = shift;
  my $rating2 = shift;
  my $actual = shift;          # 1.0 if the first player won, 0.0 if the first player lost

  # Move toward the observed result; $elo_k caps the change per game.
  my $delta = $elo_k * ($actual - &expected_win_rate ($rating1, $rating2));

  return ($rating1 + $delta, $rating2 - $delta);
}
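
A minimal driver to turn those two subroutines into a rating list looks something like the following; the @games list of winner/loser name pairs stands in for however you read your game records.

$elo_k = 16;

# @games holds the games in order, each as a [winner, loser] pair of bot names.
my %rating;
foreach my $game (@games) {
  my ($winner, $loser) = @$game;
  $rating{$winner} = 1500 unless exists $rating{$winner};
  $rating{$loser}  = 1500 unless exists $rating{$loser};
  ($rating{$winner}, $rating{$loser}) =
    &update_elo ($rating{$winner}, $rating{$loser}, 1.0);
}

# print the bots in rating order
foreach my $bot (sort { $rating{$b} <=> $rating{$a} } keys %rating) {
  printf "%-30s %4.0f\n", $bot, $rating{$bot};
}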

Third and last, you need to already have an Elo rating before you can use the Elo formula. How do you set a player’s initial rating? If you don’t have any better idea, it’s standard to set it to 1500. It will take a while to become accurate. I used the calculate-backward trick to get accurate initial ratings, but that only works if you have the whole dataset ahead of time. Sometimes players are rated by a different “provisional” system for some small number of early games, before Elo kicks in.

And that’s the story! There are a bunch of fancy variations of Elo which try to do a little better. And though I think they mostly do do a little better, they’re more complicated and not very much better.

The bottom line: Calculating Elo ratings is easy and works well, so you should do it. If you care about playing strength and have the data, ratings are your answer.

a few preliminary Elo charts

The SSCAIT data includes 103 bots, and 3 of them have 10 or fewer games, leaving exactly 100 with useful rating curves. I’ve crunched and formatted the data, and now all I have to do is draw it. I hope to create a humongalicious zoomable graph of daily rating data for all 100 bots—if I can find a way to draw that many lines on a graph in a way that’s usable. Well, I’ll think of something. I chose powerful graphing software that’s fully capable of doing the job, but it’s complicated and my skill and patience may be less than fully capable....

Anyway, another appetizer. Here are static rating graphs for 2016 for the top 3 CIG finishers, all of which had many updates this year. The graphs run from 1 January 2016 to 27 September 2016. The authors may be interested in comparing their updates with movements in their graph. Krasi0 shows steady improvement since April, while the other two look more irregular.

graph of Krasi0’s rating in 2016

graph of Iron’s rating in 2016

graph of tscmoo terran’s rating in 2016

Elo rating table

Here’s a table that explains what Elo ratings mean. To find out the chance that one bot will beat another, subtract their Elo ratings and look up the difference in the table. Iron is rated 2081 and WuliBot is rated 1871. The difference is 210—look it up in the table!

The probability estimate is not perfect, but it is good on average.
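
The table is just the standard logistic Elo formula evaluated at round numbers. For the Iron versus WuliBot example you can compute the entry directly:

# the 210 point difference from the example above, computed directly
my $diff = 2081 - 1871;                                  # Iron minus WuliBot
my $win_rate = 1.0 / (1.0 + 10.0 ** (-$diff / 400.0));
printf "%.0f%%\n", 100.0 * $win_rate;                    # prints 77%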

rating diff  win %    rating diff  win %    rating diff  win %    rating diff  win %
     0       50%          200      76%          400      91%          600      97%
    10       51%          210      77%          410      91%          610      97%
    20       53%          220      78%          420      92%          620      97%
    30       54%          230      79%          430      92%          630      97%
    40       56%          240      80%          440      93%          640      98%
    50       57%          250      81%          450      93%          650      98%
    60       59%          260      82%          460      93%          660      98%
    70       60%          270      83%          470      94%          670      98%
    80       61%          280      83%          480      94%          680      98%
    90       63%          290      84%          490      94%          690      98%
   100       64%          300      85%          500      95%          700      98%
   110       65%          310      86%          510      95%          710      98%
   120       67%          320      86%          520      95%          720      98%
   130       68%          330      87%          530      95%          730      99%
   140       69%          340      88%          540      96%          740      99%
   150       70%          350      88%          550      96%          750      99%
   160       72%          360      89%          560      96%          760      99%
   170       73%          370      89%          570      96%          770      99%
   180       74%          380      90%          580      97%          780      99%
   190       75%          390      90%          590      97%          790      99%
   200       76%          400      91%          600      97%          800      99%

SSCAIT initial and current Elo ratings

I’m still working on Elo curves over time, but today I have Elo ratings for each bot in the SSCAIT data at the beginning and end of its career. Here is yesterday’s table plus the new info, now sorted by decreasing current rating—the bot’s real strength yesterday as best we can measure. The topmost ratings are, to my surprise, exactly in the order I expected!

To make the ratings easier to interpret, I added two columns labeled “expect”. These are the expected winning rate of the bot against the average opponent. The rating system is designed so that the average Elo rating is constant at 1500, and it’s easy to compute the expected winning rate against an opponent rated 1500. The constant average rating, by the way, means that a bot which remains the same can see its rating decline over time if its opponents improve.
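
Computing it is one line with the standard Elo formula; as a check against the table, krasi0’s current rating of 2163 gives 97.85%.

# expected winning rate against an average opponent, that is, one rated 1500
sub expect_vs_average {
  my $elo = shift;
  return 1.0 / (1.0 + 10.0 ** ((1500.0 - $elo) / 400.0));
}

printf "%.2f%%\n", 100.0 * expect_vs_average(2163);      # krasi0: prints 97.85%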

Ratings are not accurate for bots with a very small number of games. I plan to exclude those bots from the curves over time.

bot | win % | initial Elo | initial expect | current Elo | current expect | games | earliest | latest
krasi068.77%159363.07%216397.85%21422015 Nov 302016 Sep 27
Iron bot77.74%158061.31%208196.59%19992015 Nov 272016 Sep 26
Marian Devecka58.66%179084.15%206596.28%62892013 Dec 252016 Sep 27
Martin Rooijackers68.50%184087.62%201194.99%72902014 Jul 282016 Sep 27
tscmooz79.80%182386.52%199194.41%50062015 Feb 272016 Sep 27
tscmoo72.06%183887.50%197894.00%57192015 Jan 222016 Sep 27
LetaBot CIG 201675.68%174880.65%193292.32%4442016 Aug 012016 Sep 27
WuliBot72.76%177382.80%187189.43%9842016 Apr 192016 Sep 26
Simon Prins55.48%151351.87%186789.21%54312015 Jan 252016 Sep 27
ICELab81.12%218998.14%186589.10%83442013 Dec 252016 Sep 27
FlashTest69.44%174480.29%186388.99%2162016 Mar 222016 Jul 27
Sijia Xu71.65%185088.23%184988.17%23282015 Oct 102016 Sep 27
LetaBot SSCAI 2015 Final65.87%171077.01%181385.84%4162016 Aug 042016 Sep 27
Dave Churchill75.48%198594.22%180485.19%82752013 Dec 252016 Sep 27
Chris Coxe73.10%175481.19%180084.90%22012015 Sep 032016 Sep 27
Tomas Vajda79.37%216997.92%179084.15%83722013 Dec 252016 Sep 27
Flash65.69%145843.98%177783.13%9912016 Apr 182016 Sep 27
LetaBot IM noMCTS60.93%164569.73%176682.22%12262016 May 182016 Aug 01
Zia bot52.24%156859.66%175781.45%5362016 Jul 072016 Sep 27
A Jarocki62.77%171177.11%174180.02%9322015 Oct 042016 Jan 26
PeregrineBot57.29%169275.12%172878.79%12762016 Feb 092016 Sep 10
tscmoop78.16%189590.67%172178.11%19922015 Nov 112016 Sep 26
Andrew Smith65.00%170576.50%171877.81%83912013 Dec 252016 Sep 27
Florian Richoux62.11%177082.55%171677.62%82032013 Dec 252016 Sep 27
Carsten Nielsen66.08%170876.81%169575.45%47112015 Mar 172016 Sep 27
Soeren Klett63.62%206896.34%168774.58%82772013 Dec 252016 Sep 27
Vaclav Horazny37.35%10667.60%168674.47%64552013 Dec 252015 Nov 18
La Nuee51.61%149949.86%166271.76%5582015 Dec 132016 Mar 18
Jakub Trancik45.08%175581.27%165771.17%84162013 Dec 252016 Sep 27
Marek Suppa51.85%174680.47%165570.94%44132015 Jan 052016 Mar 18
Krasimir Krystev70.52%203395.56%165370.70%65102013 Dec 252016 Mar 10
ASPbot201149.78%167172.80%165270.58%2272015 Jan 292016 Feb 25
Marcin Bartnicki60.42%185588.53%163368.26%14352014 Nov 282016 Mar 18
Tomas Cere61.11%188890.32%163168.01%83732013 Dec 252016 Sep 27
MegaBot49.40%157660.77%163067.88%4192016 Aug 012016 Sep 27
Aurelien Lermant58.26%168874.69%162266.87%36872015 Jun 222016 Sep 27
Matej Kravjar49.57%172378.31%161966.49%32342013 Dec 252015 Feb 18
Daniel Blackburn43.79%165170.46%160564.67%68832013 Dec 252016 Jan 26
Gabriel Synnaeve45.96%173779.65%158461.86%16582013 Dec 252015 Nov 24
David Milec49.09%155257.43%156659.39%552015 Jan 132015 Jan 20
Odin201455.65%165971.41%156559.25%56482014 Dec 212016 Sep 11
Gaoyuan Chen48.05%158261.59%155958.41%51182015 Feb 102016 Sep 27
Henri Kumpulainen38.81%144742.43%155357.57%8942016 Jan 132016 May 31
Martin Dekar33.14%142939.92%153354.73%49102013 Dec 252016 Jan 25
Serega48.20%177182.64%150550.72%38032015 Jan 312016 Jan 26
Chris Ayers35.53%161065.32%148147.27%15202015 Aug 102016 Jan 26
Nathan a David39.34%144642.29%148147.27%10042016 Feb 232016 Aug 08
DAIDOES34.02%137032.12%147145.84%4852016 Jun 132016 Sep 08
FlashZerg0.00%147446.27%145944.13%72016 Apr 242016 May 12
Igor Lacik39.32%160865.06%145443.42%80732013 Dec 252016 Sep 08
Matej Istenik44.74%170976.91%144942.71%82972013 Dec 252016 Sep 27
EradicatumXVR40.88%153755.30%144341.87%46872013 Dec 252016 Jan 23
Ibrahim Awwal30.57%151051.44%143741.03%5302013 Dec 252014 Mar 24
Tomasz Michalski27.02%131425.53%143240.34%4332015 Dec 222016 Mar 18
Oleg Ostroumov48.75%171477.41%143140.20%36412013 Dec 252016 Jan 26
NUS Bot35.72%148247.41%142639.51%33372015 May 192016 Sep 06
Martin Pinter28.98%140937.20%142539.37%37402013 Dec 252015 Dec 11
Roman Danielis45.63%168874.69%141738.28%51552013 Dec 252016 Sep 26
ZerGreenBot22.22%140436.53%141638.14%362016 Sep 222016 Sep 27
Rafael Bocquet0.00%145042.85%141538.01%102015 Jun 232015 Jun 26
Flashrelease0.00%144942.71%141337.73%82016 Apr 242016 Apr 24
Marek Kadek37.29%155758.13%141337.73%76412013 Dec 252016 May 22
Ian Nicholas DaCosta37.12%139435.20%140436.53%29282015 Apr 272016 Sep 08
AwesomeBot29.81%132626.86%140336.39%4732016 Jun 162016 Sep 08
Radim Bobek23.37%131525.64%139034.68%11512015 Oct 012016 Mar 06
Adrian Sternmuller26.89%143640.89%137532.75%45292013 Dec 252016 Jul 22
Martin Strapko19.76%138834.42%136631.62%33862013 Dec 252016 Jan 26
Maja Nemsilajova23.81%136531.49%136331.25%42462013 Dec 252015 Nov 29
Johan Kayser24.46%129423.40%136131.00%4132016 Jul 292016 Sep 27
UPStarcraftAI24.75%134629.18%136030.88%6102015 Dec 242016 Apr 13
Martin Vlcak28.92%137032.12%135330.02%12242016 Feb 162016 Sep 07
Johannes Holzfuss35.04%153154.45%135129.78%6852016 Mar 052016 Jun 15
Vojtech Jirsa14.14%118614.09%135029.66%27862015 Jan 122015 Sep 05
JompaBot21.99%131625.75%134929.54%10552016 Feb 042016 Aug 13
Rob Bogie31.34%133527.89%134629.18%6512016 May 142016 Sep 06
Christoffer Artmann20.51%128922.89%134428.95%3952016 Aug 072016 Sep 27
Marek Gajdos22.69%125119.26%133127.43%13842016 Jan 302016 Sep 11
Travis Shelton23.59%139034.68%131425.53%12212016 Feb 282016 Sep 06
Peter Dobsa13.25%122717.20%130724.77%30272015 Jan 112015 Oct 02
VeRLab17.06%124118.38%130424.45%8972016 Feb 282016 Aug 01
Andrej Sekac11.76%135930.75%129623.61%682013 Dec 252014 Jan 04
Bjorn P Mattsson22.22%135129.78%129523.50%44422015 Apr 052016 Sep 27
Lukas Sedlacek22.86%134428.95%129323.30%702015 Jan 122015 Jan 20
Sergei Lebedinskij13.30%117813.55%129323.30%10832015 May 282015 Sep 03
Vladimir Jurenka38.45%163568.51%127821.79%61672013 Dec 252016 Sep 27
neverdieTRX20.66%126520.54%127221.21%3342016 Jul 192016 Sep 10
OpprimoBot21.85%132126.30%125619.71%20092015 Nov 182016 Sep 27
Marek Kruzliak14.45%115111.83%125519.62%9342013 Dec 252015 Jan 20
Sungguk Cha18.65%120715.62%125019.17%6972016 Jun 052016 Sep 27
Jacob Knudsen20.53%10838.31%124718.90%12572016 Feb 232016 Sep 10
Ludmila Nemsilajova16.04%113310.79%122817.28%5052013 Dec 252015 Jan 21
Karin Valisova17.68%123818.12%122617.12%11712013 Dec 252016 Jan 26
HoangPhuc15.67%113210.73%120915.77%3002016 Jul 182016 Sep 07
Sebastian Mahr15.06%120515.47%118213.82%12022016 Jan 132016 Aug 08
Jan Pajan14.48%121015.85%117913.61%11192013 Dec 252016 Jan 05
Pablo Garcia Sanchez12.20%112310.25%117413.28%5902015 Dec 242016 Apr 13
Ivana Kellyerova11.47%112910.57%113110.68%16302013 Dec 252015 Apr 01
Lucia Pivackova13.29%11119.63%10908.63%8352013 Dec 252015 Jan 20
Tae Jun Oh4.55%10697.72%10366.47%1542016 Mar 222016 Apr 11
Denis Ivancik10.76%11029.19%10226.00%5022013 Dec 252015 Jan 20
ButcherBoy4.74%9213.45%9704.52%4222016 Jun 212016 Sep 06
Jon W5.06%9203.43%9644.37%7902015 Apr 302015 Jul 09
Matyas Novy6.32%113010.62%8852.82%16932015 Feb 042015 Jul 09

How did I get the initial ratings? I had a cute idea. One of the issues with computing Elo ratings over time is: How do you initialize the ratings? Most systems either start everybody with the same rating, which makes an ugly graph, or use a different and less accurate method to estimate the rating in early games. But in this case I have the whole data set in hand. I set the final rating of every bot to the same rating and computed ratings backwards in time to find an initial rating. Then I threw away everything except the initial rating, and calculated the real ratings forward in time to find the ratings over time and the final ratings. That way every data point is equally good, from beginning to end. I doubt I’m the first to think of it, but it’s a cute idea and I’m pleased.
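
In code the trick is just two passes over the same game list, the first one in reverse order, something like this sketch (update_elo is the usual 2-line Elo update shown elsewhere on this page, and @games is the full game list as winner/loser pairs in chronological order):

# Sketch of the backward-then-forward initialization.
# Start by giving every bot the same final rating.
my %rating;
foreach my $game (@games) {
  $rating{$_} = 1500 for @$game;
}

# Pass 1: run the normal Elo updates over the games in reverse order.
# What is left in %rating afterward is the estimated initial rating of each bot.
foreach my $game (reverse @games) {
  my ($winner, $loser) = @$game;
  ($rating{$winner}, $rating{$loser}) =
    &update_elo ($rating{$winner}, $rating{$loser}, 1.0);
}

# Pass 2: the real calculation, forward in time from those initial ratings.
# This pass produces the ratings over time and the final ratings.
foreach my $game (@games) {
  my ($winner, $loser) = @$game;
  ($rating{$winner}, $rating{$loser}) =
    &update_elo ($rating{$winner}, $rating{$loser}, 1.0);
}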

Next: I’ll find some sensible way to plot the curves. Stand by!

tournament design

If you design a tournament differently, a different bot may be favored to win.

AIIDE 2015 was an example. As pointed out in the tournament results, UAlbertaBot finished fourth even though it had a plus score against every other bot, because compared to the top 3 it was less consistent in defeating weaker bots. AIIDE runs on a round-robin design, all-play-all, so UAlbertaBot could defeat the top finishers and still be ranked behind them. In a progressive elimination tournament in which weaker competitors were dropped over time, UAlbertaBot would likely have finished first.

If you’ve seen the math of tournament design, or of related stuff like voting system design, then you know there’s no such thing as a fair tournament in which the best competitor always has the best chance to win, because there isn’t always such a thing as a best competitor. If A > B and B > C but C > A, then which is the “best”? That’s called intransitivity. A more complicated kind of intransitivity happened in AIIDE 2015.

Rating systems in the Elo tradition have the same issue (and their designers know all about it). They assume—they have to assume, to be what they are—that players have a “true skill” in a mathematical sense, putting players into a smooth mathematical model that doesn’t correspond exactly with bumpy reality. It’s a good approximation; Elo ratings are mostly accurate in predicting future results. (The small mismatch with reality has inspired a lot of variations of Elo ratings, Glicko and TrueSkill and so on, that try to do a little better.)

Given any big enough set of games (games that link up the competitors into a connected graph), you can find Elo ratings for the players. The ratings may have big uncertainties, but you can rank the players. You can use virtually any tournament design with almost any kind of random or biased pairings, and get a ranking.

To me this is an intuitive way to think about tournament design: Players play games which we take as evidence of skill, and the key question is: With a given amount of time to play games, how do you want to distribute the evidence? If you want to rank all the competitors as well as possible, then distribute the evidence equally in a round-robin. That’s the idea behind AIIDE’s design—I approve. If you want to pick out the one winner, or the top few winners, as clearly as possible, then let potential winners play more games. If Loser1 and Loser2 are out of the running, then games between them produce little evidence of who the top winner will be. A game between Winner1 and Loser1 produces less evidence than a game between Winner1 and Winner2. Because of intransitivity you may get a different winner than the round robin, but you have more evidence that your winner is the “best.” It’s a tradeoff, ask and it shall be given you.

You might also care about entertaining the spectators. That’s the idea behind SSCAIT’s elimination format for the “mixed competition.” I approve of that too; it’s poor evidence but good fun.

As a corollary, the kind of tournament you want to win could make a difference in what you want to work on. In a round robin, beating the weak enemies more consistently like ZZZKBot counts as much as clawing extra points from the strong enemies like UAlbertaBot.