tournaments - 21 | Starcraft AI blog

map balance - bot balance in AIIDE 2015

I wrote Ye Usualle Little Perl Script to calculate map balance in AIIDE 2015, based on the detailed game results (the “plaintext” link on that page). The results do not tell us what race random UAlbertaBot got in each game, so its results don’t count in the analysis. UAlbertaBot was the only random bot.

map	TvZ		ZvP		PvT
	wins	n	wins	n	wins	n
(2)Benzene.scx	18%	405	72%	315	64%	567
(2)Destination.scx	19%	405	73%	315	64%	567
(2)HeartbreakRidge.scx	24%	405	70%	315	61%	567
(3)Aztec.scx	19%	405	72%	315	64%	567
(3)TauCross.scx	20%	405	70%	315	64%	567
(4)Andromeda.scx	22%	405	69%	315	60%	567
(4)CircuitBreaker.scx	20%	405	69%	315	67%	567
(4)EmpireoftheSun.scm	16%	405	69%	315	65%	567
(4)Fortress.scx	19%	405	70%	315	69%	567
(4)Python.scx	22%	405	73%	315	66%	567
overall	20%	4050	71%	3150	64%	5670

In the table, wins is the win rate of the first race named in the matchup, and n is the total number of games in the matchup, one of several crosschecks to make sure the analysis is right. The tournament had 5 zerg, 7 protoss, and 9 terran bots (plus random UAlbertaBot, which was not counted, making 22 participants). There were 90 rounds, each on one map (which over 10 maps means 9 times through the map pool). So for TvZ there should be 5*9*90 = 4050 games; for ZvP 5*7*90 = 3150 games; for PvT 7*9*90 = 5670 games. Good.
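The crosscheck above is easy to reproduce. Here is a minimal Python sketch of the arithmetic (not the Perl script itself; the names `BOTS`, `ROUNDS`, and `expected_games` are made up for illustration). It assumes, as the totals imply, that every pair of bots plays once per round.

```python
# Bot counts per race as stated in the post (random UAlbertaBot excluded).
BOTS = {"Z": 5, "P": 7, "T": 9}
ROUNDS = 90  # assumption: each pair of bots meets once per round
MAPS = 10    # size of the AIIDE 2015 map pool


def expected_games(race_a: str, race_b: str) -> int:
    """Total games between all bots of two different races."""
    return BOTS[race_a] * BOTS[race_b] * ROUNDS


expected = {
    "TvZ": expected_games("T", "Z"),
    "ZvP": expected_games("Z", "P"),
    "PvT": expected_games("P", "T"),
}

# Each map is used in an equal share of the rounds, so the per-map n
# should be one tenth of the matchup total.
per_map = {matchup: total // MAPS for matchup, total in expected.items()}

print(expected)  # {'TvZ': 4050, 'ZvP': 3150, 'PvT': 5670}
print(per_map)   # {'TvZ': 405, 'ZvP': 315, 'PvT': 567}
```

The per-map values agree with the n columns in the table.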

OK, from this exercise I learned more about race balance in this tournament than about map balance. Zerg came out on top because zerg bots won. Meanwhile terran bots were concentrated toward the bottom of the crosstable, while protoss were scattered throughout. Zerg crushed protoss by more than 2:1 and annihilated terran 4:1. I had not realized that it was so extreme. The maps made small differences; the bots made big differences.

Bots analyze maps shallowly and try to play about the same on different maps. I had expected that that lack of adaptivity would cause maps to affect results strongly: Adapting means that the bot matters more; failing to adapt means that the map matters more. But if so, it’s not visible in this table. Maybe the maps are standardized enough that adaptation doesn’t matter at this level of play. Or maybe my original thinking is wrong, and adaptation is what allows the map to matter—Heartbreak Ridge has a narrow base entrance, so that you can easily block your enemy in or out, and high ground over the natural to proxy on, and I haven’t seen any bot take advantage of those features.

You can download the AIIDE 2015 map balance analysis script in a .zip file. I ran it on a *nix but it can probably be adapted to run under Windows with no more than a tweak or two.

Next: I’ll try to normalize the results and compare human map balance to bot map balance in relative terms. Though you can get an idea already by eyeballing the tables.

map balance - AIIDE 2015

Here’s the map balance table for AIIDE 2015. As yesterday, these per-matchup statistics are for pro players and are copied from TLPD.

map	TvZ	ZvP	PvT
Benzene	64.1%	49.1%	48.7%
Destination	52.3%	57%	54.5%
Heartbreak Ridge	48.6%	56.6%	59.1%
Aztec	39%	50%	65.4%
Tau Cross	50%	50%	52%
Andromeda	42.7%	58.8%	57.6%
Circuit Breaker	52.9%	51.8%	53%
Empire of the Sun	64.2%	50%	51.1%
Fortress	64.3%	66.7%	51.2%
Python	55.2%	53.9%	45.8%
overall	53.3%	54.4%	53.8%

With 10 maps to average over, the balance looks close enough to be fair. Some individual maps have large imbalances, but they mostly even out over the map pool. They don’t completely even out, though, because imbalances are too consistent across maps; there aren’t enough counterbalancing maps.
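To make "not enough counterbalancing maps" concrete, here is a small Python sketch that counts, for each matchup, how many maps lean each way. The numbers are copied from the table; the names `AIIDE_2015`, `MATCHUPS`, and `lean_counts` are invented for this illustration, and treating exactly 50% as "even" is my assumption.

```python
# Pro win rates (%) of the first race in each matchup, copied from the table.
# Columns are TvZ, ZvP, PvT.
AIIDE_2015 = {
    "Benzene": (64.1, 49.1, 48.7),
    "Destination": (52.3, 57.0, 54.5),
    "Heartbreak Ridge": (48.6, 56.6, 59.1),
    "Aztec": (39.0, 50.0, 65.4),
    "Tau Cross": (50.0, 50.0, 52.0),
    "Andromeda": (42.7, 58.8, 57.6),
    "Circuit Breaker": (52.9, 51.8, 53.0),
    "Empire of the Sun": (64.2, 50.0, 51.1),
    "Fortress": (64.3, 66.7, 51.2),
    "Python": (55.2, 53.9, 45.8),
}
MATCHUPS = ("TvZ", "ZvP", "PvT")


def lean_counts(pool):
    """Count maps favoring the first race, the second race, or neither."""
    counts = {}
    for column, matchup in enumerate(MATCHUPS):
        rates = [row[column] for row in pool.values()]
        counts[matchup] = {
            "first": sum(rate > 50 for rate in rates),
            "second": sum(rate < 50 for rate in rates),
            "even": sum(rate == 50 for rate in rates),
        }
    return counts


for matchup, tally in lean_counts(AIIDE_2015).items():
    print(matchup, tally)
```

On these numbers, 6 maps favor the first race against 3 for the second in TvZ, 6 against 1 in ZvP, and 8 against 2 in PvT.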

Of these 10 maps, only 2 (Tau Cross and Python) overlap with the 5 CIG 2016 maps.

Human balance and bot balance should be different. Next: I’ll try to investigate the bot balance in practice, using the AIIDE 2015 game results. Per-matchup numbers can’t be deduced from any of the summary tables, so I’ll have to go back to the raw game results. Will human and bot balance be somewhat similar, or all different?

map balance - CIG 2016

Map balance is hard.

Only about 5 competition maps have stats showing balance within a few percent of equal for all matchups. Seriously! That’s less than 2% of maps ever used in pro play! (Though to be fair, the total includes maps without enough games for us to know the balance.) The closest are the popular Fighting Spirit, Circuit Breaker, and Tau Cross, and the less-popular Arcadia 2 and Neo Aztec. If you want a balanced map pool beyond these 5 maps, you have to balance the maps against each other: “This one is T>P by 10%, so the rest should add up to P>T by 10%.” Of course those are human stats, and bot balance should be different, so you might want to balance using bot data.

The AIIDE and CIG rules both say that maps will be chosen at random from a larger pool. SSCAIT says its maps are selected from popular recent pro maps, and doesn’t mention balance. So I decided to look into it.

For today I calculated the balance of the CIG 2016 map pool, 5 maps randomly selected from a larger collection. Think of this as a first check to see how balance may come out when you’re not paying attention.

(2)RideofValkyries1.0
(3)Alchemist1.0
(3)TauCross1.1
(4)LunaTheFinal2.3
(4)Python1.3

I used balance numbers from the TLPD map database, which gives statistics for pro games played from 1999 to 2012. It’s not a definitive current pro balance, but it should be pretty good and it was complete and easy to use. Alchemist is not often played (presumably because it is grossly Z>P; also, according to Liquipedia “Alchemist is mostly noted for being a poor attempt at an asymmetrical three-player map”) and its stats are based on only 53 games. The percentage in each cell is the win rate of the first race named in the matchup at the top of the column.

map	TvZ	ZvP	PvT
Ride of Valkyries	48.5%	67.1%	54.4%
Alchemist	55.6%	80%	62.5%
Tau Cross	50%	50%	52%
Luna the Final	53.2%	60.2%	60%
Python	55.2%	53.9%	45.8%
overall	52.5%	62.2%	54.9%

I’d say that’s a substantial Z>P imbalance.
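The overall row appears to be a plain unweighted mean of the per-map percentages; at least, that reproduces the numbers. A short Python sketch under that assumption follows (the names `CIG_2016` and `pool_mean` are hypothetical). The optional `exclude` argument is my own addition, to see how much one small-sample map such as Alchemist moves the average.

```python
# Pro win rates (%) of the first race in each matchup, copied from the table.
# Columns are TvZ, ZvP, PvT.
CIG_2016 = {
    "Ride of Valkyries": (48.5, 67.1, 54.4),
    "Alchemist": (55.6, 80.0, 62.5),
    "Tau Cross": (50.0, 50.0, 52.0),
    "Luna the Final": (53.2, 60.2, 60.0),
    "Python": (55.2, 53.9, 45.8),
}


def pool_mean(pool, exclude=()):
    """Unweighted mean per matchup, each map counting equally."""
    rows = [rates for name, rates in pool.items() if name not in exclude]
    return tuple(round(sum(col) / len(col), 1) for col in zip(*rows))


print(pool_mean(CIG_2016))  # (52.5, 62.2, 54.9), matching the overall row
print(pool_mean(CIG_2016, exclude=("Alchemist",)))
```

Leaving Alchemist out drops the ZvP mean from 62.2% to 57.8%, so on these numbers the pool still leans Z>P even without its most lopsided map. Weighting maps by games played would be another reasonable choice, but the table does not include game counts.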

The numbers from TLPD are raw outcomes, with no attempt to adjust for the strength of the players. That’s likely good enough; it should average out over the large number of games played on most of these maps. But if we want to compare the pro balance with the bot balance after the tournament is over, we may want to do some normalization of both data sets. I’m predicting that this tournament will be dominated by terran bots. A comparison might give the impression that the maps are T>P and T>Z for bots, when in fact the terran bots were playing better.

Tomorrow: AIIDE 2015 map balance.

tournament map selection as a prod

I will never run a tournament. I don’t have the stomach for that much administrative work (and hats off to those who do!). So it’s perfectly safe for me to offer advice—I know I’ll never have to listen to it myself.

The way I see it, one goal of tournaments is to prod bots to improve; tournaments motivate. Another goal is to measure progress; tournament organizers are happy to include older bots that have competed in past tournaments, to see how they do against newer competition. There’s some tension between the two goals, but you don’t want to compromise either of them too much.

Earlier I suggested changing timeout rules to prod the winner to finish the game. Another way to prod bots is to make them play on new maps that present different challenges. Unfortunately, most of the concept maps that I talked about seem too hard for current bots (and the novelty maps are not suitable for competitions). Exception: The map Fantasy is not too hard, but it’s too subtle. Stepping down a level, I don’t know any current bot that can play on an island map. Even ignoring balance issues, a tournament would not want to include an island map like Charity, or even a semi-island map like Indian Lament, because it would break the goal of measuring progress. Bots that were made able to play the maps would likely score 100% against bots that could not.

There is a compromise. I suggest the map Namja Iyagi, a land map with 4 mineral-only islands (one in the corner behind each main base) and 2 mineral-and-gas islands. A bot with island skills would have a large advantage over a bot without island skills (the prod)—but not necessarily a decisive advantage. Two bots with no island skills could still play sound games against each other. If Namja Iyagi is only one map out of several, the tournament results remain a fair measure of progress.

The map Return of the King has 4 islands, so it might be a gentler prod.

Another prod that would be good is a map that promotes (but does not require) pushing through minerals or mineral-walking through obstacles, as in some of the concept maps. I’m not sure what a good choice would be, though.

A Team Liquid thread RFC: BW AI Bot Ladder proposes a much fancier attempt to encourage progress.

Tomorrow: Map balance.

CIG 2016 and the Terran Renaissance

Looking at the entrants to CIG 2016, I think the Terran Renaissance is confirmed. The authors of terran bots have been pushing hard to get into the forefront, and I think they’ve passed the other races and succeeded. Protoss and zerg have not been putting in the same effort to reach the serrated leading edge.

I think terrans Iron and Letabot have the best chances to come in #1. Tscmoo terran can never be counted out, especially since it seems to have gotten a last-minute update (the neural network diagram got bigger; apparently it can remember more in its long short-term memory). I judge that zerg 4-pooler ZZZKBot still has a chance to make it into the top 3. Random UAlbertaBot and zerg Overkill haven’t been updated and seem a cut below. If Krasi0 were playing, I would forecast a terran sweep, though not with full confidence.

Protoss XelnagaII boldly gives itself a new version number, so I consider it an unknown. Protoss MegaBot has mixed results in early going on SSCAIT, but its description emphasizes strategy so maybe it’s an opponent-modeling bot that will do better in a long tournament (I can hope, anyway). And there’s no telling what other bots may have updates that I don’t know about.

We learned that Sungguk Cha’s bot is called Navinad, and Johan Kayser’s bot is called SRbotOne. OpprimoBot is (as in past tourneys) listed as playing terran, not random—I assume that’s its best race, though I would have guessed zerg.

Salsa has the best shot at the bottom of the score chart, with Bonjwa runner-up for the caboose. Not that I would discourage either of them. Salsa learned to play on its own from scratch—that’s an achievement in itself, just not the kind that the tournament is trying to measure.

yet another human-vs-bot tournament

The last two are not even completed and LetaBot is starting another tournament. The announcement post on Team Liquid: StarApple D/D+/C- man vs machine tournament.

I like these tournaments. Human players and bots both face game play that they haven’t seen before, which is a great way to get new ideas. I hope lots of bots will sign up.

Update from Martin Rooijackers, aka LetaBot, who runs the tournaments. He points out two things. First, in one of the still-running tourneys all bots have been eliminated, so it is no longer a man-machine event, and in the other, one bot team is already eliminated. Bots have it tough. Second, the bots are under active development and several have updates already. Play tourney, get smacked, fix weaknesses: it sure is nice when the next event is soon!

Hey, as long as people keep signing up.

tournament design

If you design a tournament differently, a different bot may be favored to win.

AIIDE 2015 was an example. As pointed out in the tournament results, UAlbertaBot finished fourth even though it had a plus score against every other bot, because compared to the top 3 it was less consistent in defeating weaker bots. AIIDE runs on a round-robin design, all-play-all, so UAlbertaBot could defeat the top finishers and still be ranked behind them. In a progressive elimination tournament in which weaker competitors were dropped over time, UAlbertaBot would likely have finished first.
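Here is a toy Python illustration of how that can happen. The three bots and all the scores are invented for the example (they are not AIIDE 2015 data): bot U has a plus score against every opponent, yet bot X finishes ahead of it in the round robin by beating the weak bot W more consistently.

```python
# Invented head-to-head results: wins for the first bot out of 10 games.
GAMES_PER_PAIR = 10
wins = {
    ("U", "X"): 6,   # U beats X 6-4
    ("U", "W"): 6,   # U beats W only 6-4
    ("X", "W"): 10,  # X sweeps W 10-0
}

# Fill in both directions of every pairing.
table = {}
for (a, b), w in wins.items():
    table[(a, b)] = w
    table[(b, a)] = GAMES_PER_PAIR - w

bots = sorted({a for a, _ in table})

# Round-robin ranking: total wins over all opponents.
totals = {b: sum(table[(b, o)] for o in bots if o != b) for b in bots}
ranking = sorted(bots, key=lambda b: totals[b], reverse=True)

# Bots with a plus score against every other bot.
beats_everyone = [
    b for b in bots
    if all(table[(b, o)] > table[(o, b)] for o in bots if o != b)
]

print(totals)          # {'U': 12, 'W': 4, 'X': 14}
print(ranking)         # ['X', 'U', 'W']
print(beats_everyone)  # ['U']
```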

If you’ve seen the math of tournament design, or of related stuff like voting system design, then you know there’s no such thing as a fair tournament in which the best competitor always has the best chance to win, because there isn’t always such a thing as a best competitor. If A > B and B > C but C > A, then which is the “best”? That’s called intransitivity. A more complicated kind of intransitivity happened in AIIDE 2015.

Rating systems in the Elo tradition have the same issue (and their designers know all about it). They assume—they have to assume, to be what they are—that players have a “true skill” in a mathematical sense, putting players into a smooth mathematical model that doesn’t correspond exactly with bumpy reality. It’s a good approximation; Elo ratings are mostly accurate in predicting future results. (The small mismatch with reality has inspired a lot of variations of Elo ratings, Glicko and TrueSkill and so on, that try to do a little better.)
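For reference, the core of the standard Elo model fits in a few lines. This is a minimal Python sketch of the usual logistic formula with the conventional 400-point scale; the function names are mine, and the K-factor of 32 is just a common choice, not something any bot tournament mentioned here is known to use.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Modeled probability that A beats B, given only the two ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one game; score_a is 1 for a win, 0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(expected_score(1500, 1500))  # 0.5
print(update(1500, 1500, 1))       # (1516.0, 1484.0)
```

The single number per player is the "true skill" assumption: the model has no way to say that A beats B, B beats C, and C beats A.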

Given any big enough set of games (games that link up the competitors into a connected graph), you can find Elo ratings for the players. The ratings may have big uncertainties, but you can rank the players. You can use virtually any tournament design with almost any kind of random or biased pairings, and get a ranking.

To me this is an intuitive way to think about tournament design: Players play games which we take as evidence of skill, and the key question is: With a given amount of time to play games, how do you want to distribute the evidence? If you want to rank all the competitors as well as possible, then distribute the evidence equally in a round-robin. That’s the idea behind AIIDE’s design, and I approve. If you want to pick out the one winner, or the top few winners, as clearly as possible, then let potential winners play more games. If Loser1 and Loser2 are out of the running, then games between them produce little evidence of who the top winner will be. A game between Winner1 and Loser1 produces less evidence than a game between Winner1 and Winner2. Because of intransitivity you may get a different winner than the round robin, but you have more evidence that your winner is the “best.” It’s a tradeoff: ask and it shall be given you.

You might also care about entertaining the spectators. That’s the idea behind SSCAIT’s elimination format for the “mixed competition.” I approve of that too; it’s poor evidence but good fun.

As a corollary, the kind of tournament you want to win could make a difference in what you want to work on. In a round robin, beating the weak enemies more consistently, as ZZZKBot did, counts as much as clawing extra points from the strong enemies, as UAlbertaBot did.

LetaBot man-vs-machine team tournament

Martin Rooijackers aka LetaBot is organizing another man-vs-machine tournament, this time a team tournament. He will go so far as to accept your bot in compiled form and put it on a team and operate it for you, so that your only commitment is to send in your bot.

He wrote to me: “The thing that will make this interesting is that unlike other man vs machine tournaments, this one will have the all-kill format. So since some bots are better at certain match-ups like TvZ, these specialized bots can thus still win the tournament.” I agree it’s an entertaining format and offers chances you don’t get otherwise.

It’s cool that he keeps running new competitions in different formats. It’s certainly not going to get stale. Kudos for the hard work!

If you have a bot, then I have a suggestion for what point in your development path is a good time to participate in a man-machine competition: Before you are ready. You can’t be ready until you’ve done it once before!