
AIIDE 2019 second first look

The AIIDE 2019 tournament has been rerun to correct an error. The results are official, different from before, and hopefully final. In the original run of the tournament, we’re told, a hardware error corrupted a file and caused McRave to crash every game against Locutus. In the corrected rerun, McRave was able to score 1 win against Locutus in 100 games, but ironically ended up with a slightly lower overall winning rate. Bugs in McRave were more important for its result than bugs in the tournament.

#1 Locutus and #2 PurpleWave maintain their positions, but Locutus no longer has plus results against every opponent: PurpleWave edged it out 55-45 in their matchup. #3 BananaBrain gained a rank, and #4 DaQin lost one. From my point of view, the most important result is that #5 Steamhammer moved ahead of #7 Microwave and #8 Iron. These competitors were tightly grouped, and small changes in the results were enough to shuffle their finishing order thoroughly.

shifts in the results

The order of finishers looks different, but most winning rates in the final results are within a few percent of the deprecated original results. The exceptions are #4 DaQin at 63.33%, formerly #3 DaQin at 69.39%, a shift of 6 points down; and #6 ZZZKBot at 52.08%, formerly #9 ZZZKBot at 43.04%, a shift of 9 points up. What accounts for these two bots having such different results? To my eye, it doesn’t look like typical statistical variation.

I looked at the scores of specific matchups. Surprise result one: formerly ZZZKBot scored 18-82 versus DaQin, but this time ZZZKBot 90-10 DaQin. This one difference accounts for the entire shift in DaQin’s winning rate, moving it down a rank, and for much of ZZZKBot’s shift. Surprise result two: formerly ZZZKBot 34-66 McRave, but this time ZZZKBot 67-33. That accounts for McRave performing worse overall, and for ZZZKBot jumping up the ranks. In other matchups, ZZZKBot performed similarly in both runs of the tournament.

Why did ZZZKBot perform so differently in these 2 matchups alone? I’ll dig in later, but I can speculate. Here are 3 possible reasons, though it could be something else entirely. There is some smell of software error: 18-82 -> 90-10 and 34-66 -> 67-33 look as though the results for the players were swapped (exact swaps would read 82-18 and 66-34, close to what we see). Or perhaps ZZZKBot was affected by the hardware error in these 2 matchups. Another possibility is unstable learning. I know that Steamhammer can perform very differently in two runs of the same matchup depending on what openings it happens to randomly try (does it hit on a winner early?). ZZZKBot’s learning is complicated and hard to analyze, but maybe it is susceptible to some effect like that.

AIIDE 2019 results first look

Important update on Friday 11 October: The results are invalid due to an error and the tournament will be repeated from scratch. See Dave Churchill’s tweet “The 2019 AIIDE StarCraft AI Competition will have to be re-run due to an error on our part causing a corrupted file which caused McRave to crash a lot of games.” The same error might have caused other problems. Even if McRave was the only bot directly affected, the competition was round robin so every bot’s score was potentially affected.

The AIIDE 2019 results were announced today at the conference. The AIIDE conference stream includes Dave Churchill’s presentation starting at about 52:30. The results come with a video of Locutus versus PurpleWave, with commentary by Dan Gant focusing not on the game itself but on the AI techniques.

The standings: #1 Locutus edged out #2 PurpleWave. #3 DaQin and #4 BananaBrain were far behind, but finished out the dominant protoss bloc. (The win rate over time graph strangely omits #4 BananaBrain.) #5 Iron, #6 Microwave, #7 XiaoYi, and #8 Steamhammer were closely grouped around a 50% win rate. As in CoG, Iron was the top terran and the top returning bot, and Microwave the top zerg.

#10 McRave did surprisingly poorly. It must be suffering from new bugs. I notice that McRave’s army has become strangely passive; it sometimes seems unwilling to fight even with a large advantage. That seems like a symptom of an important bug.

#8 Steamhammer did about as I expected, or at least as I expected after I noticed the combat sim bug that I had just added. Without that bug I think it would have finished slightly ahead of Microwave. I’m bothered by the 59% win rate against Iron, though; I expected over 90%. I tested on every map with the correct version of Iron, but must have made a mistake somewhere.

Last year, Bruce Nielsen provided diffs from Locutus for bots derived from it. This year, Dan Gant has provided diffs of a few other bots.

Stormbreaker derived from SAIDA - Stormbreaker was disqualified because its behavior was nearly identical to SAIDA’s, though there are big code differences. According to the presentation, Stormbreaker adds a neural network but does not use it.

XiaoYi derived from SAIDA - According to the presentation, SAIDA would likely have finished 3rd if it had played. XiaoYi placed 7th behind Microwave.

DaQin this year versus last year. I see a great many detailed changes.

We were promised a second competition on “unknown” maps, for those bots which did not opt out. I count 8 participants for the second competition. I don’t see a sign of its results. Perhaps it has not been run yet.

As always, I will analyze both CoG and AIIDE. But CoG is showing evidence of sloppiness, so AIIDE deserves more attention. With fewer entrants in AIIDE this year, it won’t take as long to dig into them. But I think I have almost managed to interpret the CoG result file, so I’ll start there.

Steamhammer can’t finish the game

Finishing off the enemy just means destroying all their buildings. It sounds simple, but it is a sophisticated skill, and there are a lot of ways to go wrong. Steamhammer has a number of special provisions for quickly finding the last enemy remnants, but small loopholes persist and occasionally a game slips through one.

PurpleSpirit-Steamhammer on La Mancha is an example. It’s an entertaining game, thanks to the purple habit of playing all over the map, but I want to focus on the end, after PurpleSpirit has lost, when Steamhammer fails to destroy the floating terran buildings that are right over its head. That part is entertaining too, but not for the same reason.

the beginning of the end

Everything terran on the ground is destroyed, except one command center which was infested instead. The remaining terran force is 2 full-strength battlecruisers, and the remaining terran infrastructure is 2 floating ebays. Steamhammer is maxed and banking resources, but its only anti-air units are 8 scourge, plus a defiler with plague. Notice how much game time is left?

One of Steamhammer’s special game-finishing skills is that it makes mutalisks to chase down the residue of the enemy. The condition is, if the enemy has no known bases and no known anti-air units, then Steamhammer will tech to mutalisks and make mutas its primary unit. The mutas scout faster than ground units, and can find floating buildings and island bases that ground units can’t reach. But here terran still has battlecruisers, so the mutalisk rule does not kick in. First the terran army, such as it is, must be defeated.
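A minimal sketch of how such a trigger might look, in hypothetical C++ with made-up names, not Steamhammer’s actual code:

    #include <string>
    #include <vector>

    // Hypothetical snapshot of enemy intelligence (not Steamhammer's real data structures).
    struct EnemyInfo {
        std::vector<std::string> knownBases;     // known enemy base locations
        std::vector<std::string> knownAntiAir;   // known enemy units that can shoot up
    };

    // The game-finishing rule described above: with no known enemy bases and
    // no known anti-air, tech to mutalisks and make them the primary unit.
    bool shouldHuntWithMutalisks(const EnemyInfo & enemy)
    {
        return enemy.knownBases.empty() && enemy.knownAntiAir.empty();
    }

In this game the battlecruisers count as known anti-air, so the rule stays off until they are dead.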

swarm all over

Some scourge have hit, and the battlecruisers are no longer at full HP. But Steamhammer has been replacing losses primarily with more zerglings and ultralisks, which are of no use. Oops, the unit mix is wrong. Now there are only 2 scourge, and a battlecruiser can kill a scourge in one shot; 2 battlecruisers, if not distracted by other targets, are safe from 2 scourge. The ebays choose to park over the terran natural, and zerg units have congregated there. The battlecruisers seek zerg stuff to shoot, and the defiler responds by blanketing the area in defensive swarm, consuming zerglings like crazy.
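The arithmetic, assuming I have the standard unit stats right: a scourge has 25 HP and deals 110 damage when it connects, while a battlecruiser deals 25 damage per shot and has 500 HP plus 3 armor. 25 damage against 25 HP is exactly a one-shot kill, and 500 / (110 - 3) comes to about 4.7, so it takes about 5 scourge hits to bring down a full-strength battlecruiser, or roughly 10 connecting scourge for the pair.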

some damage has been done

Well, the 2 scourge hit one of the battlecruisers, which was distracted seeking zerg units that strayed from under dark swarm. And Steamhammer is now making 3 new pairs of scourge to replace various losses; if it can keep this up, the battlecruisers will eventually fall. Best of all, the defiler has plagued the terran buildings, despite the zerg units underneath. That will put the ebays into the yellow. One more plague should put them into the red, after which they will burn down.
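The plague arithmetic, if I have the numbers right: plague deals about 300 damage over time and cannot kill outright, and an engineering bay has 850 HP. A terran building shows yellow below about two thirds of its HP and red below one third, and in the red it burns down on its own. One plague leaves the ebays around 850 - 300 = 550 HP, under the yellow line near 567; a second leaves them around 250, under the red line near 283, and the fire finishes the job.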

nothing happens after this

Whew, finally the battlecruisers are shot down by scourge.

But that’s all she wrote. The swarms wore off and there was no need to renew them. The defiler did not plague again because it thought the zerg units underneath were more valuable. Scourge are coded to avoid floating buildings, because it is usually wasteful to spend gas destroying them. The mutalisk rule is engaged, and there is supply to build 1 mutalisk, but Steamhammer happened to choose to spawn zerglings first, and after that there was no supply to make a mutalisk. The game timed out with no more progress.
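The supply arithmetic behind the mutalisk failure: a mutalisk costs 2 supply while a pair of zerglings costs 1. Presumably 2 supply had come free, enough for 1 mutalisk; morphing the zerglings first spent 1 of it, leaving 1 free supply, too little for a mutalisk.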

Finishing off the enemy can be hard. In this case, Steamhammer had the wrong unit mix; making zerglings and ultralisks when every remaining enemy was in the air did no good. The mutalisk rule should make mutalisks only, not mix them with other units. The scourge might have understood that when only floating buildings are left, they are good targets (see the sketch below). Also the zerg ground units that can’t shoot up might have known better than to chase floating buildings (though chasing can be useful when they track a building trying to escape), and the defiler might have realized that damage to its own units was irrelevant when it could eliminate the enemy. That is a lot of flaws, and yet Steamhammer rarely fails to finish a game!
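A minimal sketch of the scourge fix, again with made-up names rather than Steamhammer’s actual code:

    // Hypothetical version of the scourge targeting rule suggested above.
    // Normally scourge skip floating buildings, because spending gas to kill
    // them is usually wasteful. But when floating buildings are the only
    // enemy things left, they become legitimate targets.
    bool scourgeMayTarget(bool targetIsFloatingBuilding,
                          bool onlyFloatingBuildingsRemain)
    {
        if (!targetIsFloatingBuilding) {
            return true;                     // a normal air target; fire away
        }
        return onlyFloatingBuildingsRemain;  // time to finish the game
    }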

And fixing all the problems would only narrow the loophole, not eliminate it. In the worst case, Steamhammer would need to be able to destroy some of its own units to clear supply to make mutalisks to finish off the enemy. And that’s a high-end skill that I am in no hurry to add.

Next: The start of CoG 2019 analysis.

CoG 2019 results first look

Dan Gant let me know yesterday that the CoG 2019 (formerly CIG 2019) full results are out. They finally got a new web site up. I grabbed everything, but found that replays_04.zip is corrupt, so we are missing replays from the final 10 rounds. The SOURCE_CODE download does not contain source code.

There were 27 participants, a good number, but only 9 were new entrants, not such a good number. The remaining 18 were holdovers from previous years (this assumes that LetaBot was a new submission as registered, not a holdover as stated in the slide show; I don’t know which is correct). 40 rounds were played, numbered 0 to 39. 27*26*40/2 gives 14,040 games ideally, and they claim that 14,027 successfully made it into the results.

The five maps were a version of Heartbreak Ridge with 2 starting locations; Alchemist and Great Barrier Reef, a version of El Niño, with 3 starts; and Neo Sniper Ridge and Python with 4 starts. Alchemist is badly designed, but the others are good choices. Heartbreak Ridge and Sniper Ridge have layout similarities; I would not have included both with only 5 maps total (of course, the maps were chosen randomly). I still maintain that 5 maps are not enough to smooth out differences; if a bot does particularly well or poorly on one map, it introduces an element of luck into the results.

The result chart in the slide show does not agree with the result crosstable. The table gives #4 Iron 75.96%, #5 BananaBrain 74.81%, #6 XiaoYi 72.21%. The slide show gives #5 BananaBrain 72.21% from the next entry down in the table, #6 XiaoYi 70.38% from its next entry down, and so on. The error seems to be in the slide show: from #5 BananaBrain on down, each bot is assigned the value of the bot one rank below it, until #21 Ziabot, which in the slide show shares the same 29.33% win rate as #20 Bonjwa. I’ll be careful to use the win rates from the crosstable.
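To make the off-by-one concrete, here is a toy illustration; the bot names and win rates are the crosstable values quoted above and below, and the pairing logic is my guess at what went wrong:

    #include <cstdio>

    int main()
    {
        // Crosstable values, in rank order from #4 Iron down.
        const char * bots[]    = { "Iron", "BananaBrain", "XiaoYi", "Microwave" };
        const double correct[] = { 75.96, 74.81, 72.21, 70.38 };

        // The slide show appears to pair each bot from #5 on down with the
        // value of the bot one rank below it.
        for (int i = 1; i <= 2; ++i) {
            std::printf("%s shown as %.2f%% instead of %.2f%%\n",
                        bots[i], correct[i + 1], correct[i]);
        }
        return 0;
    }

It prints BananaBrain shown as 72.21% instead of 74.81%, and XiaoYi shown as 70.38% instead of 72.21%, matching the slide show’s numbers.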

I get the impression that the organizers are overburdened. Running a tournament is a ton of work. They do not seem to have the resources to verify details and get everything right. I hope they have time to go back and clean up the errors.

The participants fall into fairly neat score groups. The protoss leaders #1 PurpleWave, #2 McRave, and #3 Locutus (all independently written, by the way, with no shared code history) are at 88.56-84.90%. Then a gap, and the next group is #4 Iron to #9 BetaStar at 75.96-67.41%. Another gap, and the next group is #10 MetaBot to #13 TitanIron at 59.04-56.35%.

I can verify from its learning files that terran #6 XiaoYi at 72.21% wins is a fork of SAIDA. By the way, it is listed as registering under the name XiaoYiAI, but played under the name XIAOYI. It was brand new, so probably no bot had special preparation for it by name. Nevertheless, other bots seem to have played under their registered names, so XiaoYi potentially gained an advantage of anonymity.

#4 Iron at 75.96% win rate is the top terran and the top holdover bot from the previous year. #7 Microwave did well at 70.38%, making it the top zerg. (#22 Steamhammer is the buggy holdover from the previous year and performed miserably, as expected.)

The biggest upset by far is #24 OpprimoBot, 23.22% win rate, 28-12 versus #1 PurpleWave with 88.56% win rate. I watched a few replays and found that, in those games, PurpleWave made one probe and then stopped all production. I can only guess that OpprimoBot tickled a bug, and the bug must be triggered by the name “OpprimoBot”, since PurpleWave went wrong before learning anything else about its opponent. I imagine that the famously thorough Purple tournament preparation hit a glitch in this one case. Maybe it is related to the fact that OpprimoBot plays random on SSCAIT, but played terran here. The object file Opponents.class does mention OpprimoBot by name, along with many other potential opponents.

I will analyze the results as usual, with the colorful crosstables and stuff, but may be slower than in the past.