AIIDE 2019 second first look
The AIIDE 2019 tournament has been rerun to correct an error. The results are official, different from before, and hopefully final. In the original run of the tournament, we’re told, a hardware error corrupted a file and caused McRave to crash every game against Locutus. In the corrected rerun, McRave was able to score 1 win against Locutus in 100 games, but ironically ended up with a slightly lower overall winning rate. Bugs in McRave were more important for its result than bugs in the tournament.
#1 Locutus and #2 PurpleWave kept their positions, but Locutus no longer has a plus record against every opponent: PurpleWave edged it out 55-45 in their matchup. #3 BananaBrain gained a rank, and #4 DaQin lost one. From my point of view, the most important result is that #5 Steamhammer moved ahead of #7 Microwave and #8 Iron. These competitors were tightly grouped, and it took only small changes in the results to reshuffle their finishing order.
shifts in the results
The order of finishers looks different, but most winning rates in the final results are within a few percentage points of the superseded originals. The exceptions are #4 DaQin at 63.33%, formerly #3 DaQin at 69.39%, a shift of about 6 points down, and #6 ZZZKBot at 52.08%, formerly #9 ZZZKBot at 43.04%, a shift of about 9 points up. What accounts for these two bots having such different results? To my eye, it doesn’t look like typical statistical variation.
I looked at the scores of specific matchups. Surprise result one: formerly ZZZKBot scored 18-82 versus DaQin, but this time it was ZZZKBot 90-10 DaQin. That one difference accounts for the entire shift in DaQin’s winning rate, moving it down a rank, and for much of ZZZKBot’s shift. Surprise result two: formerly ZZZKBot 34-66 McRave, but this time ZZZKBot 67-33. That accounts for McRave performing worse overall, and for the rest of ZZZKBot’s jump up the ranks. In its other matchups, ZZZKBot performed similarly in both runs of the tournament.
Why did ZZZKBot perform so differently in these two matchups alone? I’ll dig in later, but I can speculate; here are three possible reasons, though it could be something else entirely. There is some smell of software error: 18-82 -> 90-10 and 34-66 -> 67-33 are close to what we would see if the players’ results had been swapped (82-18 and 66-34). Or perhaps ZZZKBot was affected by the hardware error in these two matchups. Another possibility is unstable learning. I know that Steamhammer can perform very differently in two runs of the same matchup depending on which openings it happens to try at random (does it hit on a winner early?). ZZZKBot’s learning is complicated and hard to analyze, but maybe it is susceptible to some effect like that.
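To show the flavor of that last possibility, here is a minimal sketch with invented win rates and a naive greedy opening selector of my own devising, not ZZZKBot’s or Steamhammer’s actual learning code. The selector always replays the opening with the best smoothed empirical record, so whether it locks onto the one good opening early in the series is largely luck:

```python
import random

# Hypothetical illustration only: invented win rates and a naive
# greedy selector, not any real bot's learning code.
TRUE_WIN_RATE = {"good": 0.60, "bad1": 0.10, "bad2": 0.10, "bad3": 0.10}

def run_series(games=100, seed=0):
    rng = random.Random(seed)
    record = {o: [0, 0] for o in TRUE_WIN_RATE}   # opening -> [wins, games]
    total_wins = 0
    for _ in range(games):
        # Greedy pick by smoothed empirical win rate, random tie-break.
        pick = max(record, key=lambda o: ((record[o][0] + 1) / (record[o][1] + 2),
                                          rng.random()))
        won = rng.random() < TRUE_WIN_RATE[pick]
        record[pick][0] += won
        record[pick][1] += 1
        total_wins += won
    return total_wins

# Twenty reruns of the "same" matchup. Much of the spread comes from
# whether the good opening happens to win its early games.
print(sorted(run_series(seed=s) for s in range(20)))
```

Even this tame selector shows a noticeable spread of 100-game scores between reruns; a selector with more state could plausibly swing further.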
Comments
Dave Churchill on :
The initial hardware error was caused by me swapping some RAM into a new machine right before the run of the initial tournament. The second tournament was run on different hardware, so we’re 100% sure that didn’t happen again. As for the matchup swings, this is just one of the dangers of using strategy selection. The extreme case is if both bots have some sort of Rock-Paper-Scissors and chase each other in a cycle trying to find a winning strategy. In that case, if you chose the winner first (possibly randomly), you could end up 100-0 instead of 0-100.
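One toy reading of the cycle Dave describes, under assumptions of my own (both bots step through the Rock -> Paper -> Scissors cycle in lockstep every game; this is not taken from any real bot):

```python
import random

# Toy model: both bots advance one step through the R -> P -> S cycle
# every game, the loser hunting for a counter and the winner anticipating
# one. Stepping in lockstep keeps their relative offset fixed, so the
# random first picks decide every game of the series.
CYCLE = ["R", "P", "S"]
BEATS = {"R": "S", "P": "R", "S": "P"}   # key beats value

def series(games=100, seed=0):
    rng = random.Random(seed)
    a, b = rng.choice(CYCLE), rng.choice(CYCLE)
    score = [0, 0]
    for _ in range(games):
        if BEATS[a] == b:
            score[0] += 1
        elif BEATS[b] == a:
            score[1] += 1
        a = CYCLE[(CYCLE.index(a) + 1) % 3]   # both advance one step
        b = CYCLE[(CYCLE.index(b) + 1) % 3]
    return score

# Depending only on the opening coinflip, the same pairing ends
# 100-0, 0-100, or all ties.
for s in range(5):
    print(series(seed=s))
```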
Dan on :
Bot learning dynamics make for huge variances in matchup results. I've observed this on lots of 50-100 game test runs. If your bot has one good strategy for the matchup, and it loses the first time around, you're likely going to take another N (= number of strategies) losses in a row before you get a second shot.
If you've got one strategy that wins 60-40, and ten strategies that lose 10-90, the odds that you get fooled into trying a succession of bad builds are very high. [See the sketch below this comment.]
A prominent example of that is the most recent SSCAIT finals. PurpleWave went 7-1 against Locutus, but the results on BASIL immediately thereafter were 50-50. Locutus had strategies with greater than 50% winrate against PurpleWave's DT-expand strategy, but didn't settle in on them in time. That match could just as easily have gone 7-1 Locutus as it did 7-1 PurpleWave.
That high-variance dynamic is a major driver for the amount of pre-training and strategy filtering that I put into PurpleWave before each contest. Not only is exploration totally unaffordable in win% formats (PurpleWave alone accounted for over half of Locutus' losses -- meaning you really need to go near-100% against everyone else to win), but a few bad early-round coinflips can quickly cost you an extra 20+ games.
I continue to argue that the need to have strong opponent priors is one of a few good reasons to move away from win% formats. Priors help in all formats, but are overweighted in formats where exploration is unaffordable.
Congrats again to Locutus. And thanks to Dave Churchill and Rick Kelly for running this year's throwdown and for ensuring a smooth and accurate operation. Looking forward to the results on the alternative map pool.
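Dan’s numbers make the trap easy to quantify. Here is a minimal sketch, assuming a naive lose-shift selector (keep a strategy while it wins, move to the next on any loss); this is not PurpleWave’s actual learning:

```python
import random

# Dan's hypothetical: one strategy wins 60%, ten win 10%. A naive
# lose-shift selector keeps the current strategy while it wins and
# cycles to the next after any loss.
def lose_shift_series(games=100, seed=0):
    rng = random.Random(seed)
    win_rates = [0.60] + [0.10] * 10
    current, wins = 0, 0
    for _ in range(games):
        if rng.random() < win_rates[current]:
            wins += 1                                  # win: stay
        else:
            current = (current + 1) % len(win_rates)   # loss: shift
    return wins

results = [lose_shift_series(seed=s) for s in range(50)]
print(min(results), sum(results) / len(results), max(results))
# Averages around 20 wins per 100 games, far below the 60% strategy's
# potential: each loss on the good strategy costs about a full lap
# through the ten bad ones.
```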
MicroDK on :
I didn't believe that a big swing like 18-82 -> 90-10 was possible without some sort of bug. And I am still sceptical...
Quatari on :