
CIG 2018 - Overkill was broken

Did Overkill actually perform much worse in CIG 2018 than in past years? Here are the bots carried over from 2017 to 2018, with their win rates in both years, taken from the official results. We see that Overkill collapsed in win rate from 2017 to 2018, a far bigger change than any other bot. Iron performed poorly in 2017 because it failed on the map Hitchhiker. Other bots mostly had modestly lower win rates in this year’s stronger field. My 2017 crosstable was calculated from the detailed results, which included some corrupted data and differ a little from the official results—except for Sling, which was a lot different: 26.07% in 2017 versus its official 18.08%, which shrinks its year-over-year difference.

bot             2017     2018
UAlbertaBot     65.59%   60.58%
Overkill        62.75%   34.68%
Ziabot          61.75%   51.08%
Iron            61.62%   74.31%
Aiur            59.83%   51.54%
TerranUAB       36.78%   34.40%
SRbotOne        34.14%   24.37%
OpprimoBot      30.69%   27.11%
Bonjwa          30.67%   23.57%
Sling           18.08%   26.52%
Salsa            4.64%    1.54%

Was the difference due to the maps? No. In 2017, Overkill scored 57% or more on every map (CIG 2017 bots x maps). In 2018, Overkill scored 38% or below on every map (official results). And 3 of the 5 maps were the same: Tau Cross, Andromeda, and Python.

Did they run different versions of Overkill? The source that they distributed for Overkill is identical in both years. Theoretically they might have run something different by mistake—but it produced the expected files in the write directory, so it would be a surprise.

Finally I downloaded the Overkill replays and watched some. The poor bot’s build orders were severely distorted, skipping over drones and buildings. It would do things like take gas on 7 and then stop all construction, or follow a normal-ish build but drop many drones so that its economy was anemic. Sometimes drones moved erratically instead of mining. It looked similar to play I’ve seen from Steamhammer when latency stuff is way out of whack. Of the games I looked at, some were hopelessly muddled, some were close to normal with only occasional dropped drones, and none were 100% good. I don’t know what the problem was, something corrupted or a server setting that Overkill could not cope with, but whatever it was, Overkill was badly broken and far short of its normal strength.

43864-OVER_ZIAB.REP (Overkill’s last game of the tournament) is an example replay that shows the problems.

It’s possible that some other bots were affected too. If the difference was in a server setting that Overkill was not ready for, then it would be surprising if every other bot was ready.

CIG 2018 - what Overkill learned

After analyzing AIUR yesterday, I ran a similar (but much simpler) analysis for the classic zerg #18 Overkill. The version in CIG 2018 has not been updated since 2015 and is the same version that still plays on SSCAIT. In 2015 it was a sensation, placing 3rd in both CIG and AIIDE—its 18th place in this tournament, with about a 35% win rate, suggests how much progress the rest of the field has made over the past 3 years. But keep reading; Overkill appears to have been broken in this tournament. I did this analysis once before: See what Overkill learned in AIIDE 2015.

Classic Overkill knows 3 openings: a 9 pool opening which stays on one base for a good while, and 10- and 12-hatch openings to get mutalisks first. When it chooses 9 pool, that means that the opponent is either rushing (so the 9 pool is necessary to defend) or is being too greedy (which the 9 pool can exploit). Overkill counts some games twice in an attempt to learn faster, so sometimes its total game count is larger than the number of rounds in the tournament (125).

                    NinePoolling    TenHatchMuta    TwelveHatchMuta    total
opponent            n     win       n     win       n     win          n     win
#1 Locutus          42    0%        42    0%        41    0%           125   0%
#2 PurpleWave       43    0%        43    0%        42    0%           128   0%
#3 McRave           44    0%        44    0%        43    0%           131   0%
#4 tscmoo           40    0%        40    0%        47    2%           127   1%
#5 ISAMind          42    0%        42    0%        41    0%           125   0%
#6 Iron             54    7%        32    0%        39    3%           125   4%
#7 ZZZKBot          47    2%        39    0%        47    2%           133   2%
#8 Microwave        54    6%        35    0%        42    2%           131   3%
#9 LetaBot          52    6%        33    0%        40    2%           125   3%
#10 MegaBot         60    12%       24    0%        41    7%           125   8%
#11 UAlbertaBot     41    0%        41    0%        48    2%           130   1%
#12 Tyr             40    0%        39    0%        47    2%           126   1%
#13 Ecgberht        57    16%       24    4%        42    12%          123   12%
#14 Aiur            94    34%       14    7%        17    12%          125   28%
#15 TitanIron       36    11%       20    0%        69    16%          125   12%
#16 Ziabot          16    0%        16    0%        93    23%          125   17%
#17 Steamhammer     107   48%       7     0%        10    10%          124   42%
#19 TerranUAB       24    67%       3     0%        98    83%          125   78%
#20 CUNYbot         18    44%       6     17%       101   66%          125   61%
#21 OpprimoBot      36    67%       3     0%        86    76%          125   71%
#22 Sling           67    46%       6     0%        52    42%          125   42%
#23 SRbotOne        23    74%       4     25%       95    89%          122   84%
#24 Bonjwa          75    92%       4     25%       46    87%          125   88%
#25 Stormbreaker    70    91%       2     0%        53    87%          125   88%
#26 Korean          77    99%       2     0%        46    93%          125   95%
#27 Salsa           46    100%      32    94%       46    100%         124   98%
total               1305  36%       597   6%        1372  40%          3274  32%

The 10 hatch opening was useless in this tournament—against every opponent, 10 hatch was the worst choice, at best tying for 0. In 2015, 10 hatch was about as successful as the other openings.

Signs are that something was wrong with Overkill in this tournament. In AIIDE 2015, then #3 Overkill scored 23% against then #4 UAlbertaBot, 68% against #5 AIUR, and 99% against #17 OpprimoBot. In CIG 2018, it was 1.6% against UAlbertaBot, 28% against AIUR, 71% against OpprimoBot. All versions appear to be the same in both tournaments—I didn’t look closely, but I did unpack the sources and check dates (in particular, Overkill has file change dates up to 8 October 2015 in both tournaments). Overkill had 14 crash games in CIG 2018, not enough to account for the difference. It’s hard to believe that the maps could have shifted results that much.

Tomorrow: What went wrong with Overkill?

too many devourers

I’ve been watching the replays of Steamhammer’s losses in AIIDE 2017 against its lowest-scoring opponents. I reason that if A beats B nearly every time, then a weakness which causes A to lose to B must be a severe weakness that I should fix.

Steamhammer scored 105-5 against Overkill in this tournament. The losses were all due to the same basic strategy blunder, teching too fast and not making enough mutalisks. One loss was an interesting one that taught me something: It involved making too many devourers.

Here’s the game. Steamhammer played its turtle opening, which it was set to do 1% of the time in ZvZ. (It’s a weak opening, but it is played rarely and it poses challenges that some bots can’t meet, so I don’t mind.) Meanwhile, Overkill also turtled, on 2 bases instead of 1 but with fewer drones.

As in the other losses, Steamhammer spent on zerglings and on teching to hive when it should have been making mutalisks. Overkill flew over, killed a lot of drones—then left for no apparent reason. Overkill was ahead, but Steamhammer could recover. Mistakes like that may explain how Overkill scored so poorly. Steamhammer still didn’t feel like making many mutas, but it did add scourge and morphed a greater spire, which (since it was behind and had already invested in a hive) was reasonable.

Steamhammer started morphing its mutas into devourers. It can’t catch up in mutalisks, so it was a good try to turn the game around. But as I watched, I realized there was a flaw in the strategy logic: It wanted to morph all the mutalisks and go straight devourer. The unit mix is declared as mutalisk + devourer, but because of how gas accounting works, the strategy boss preferred to turn gas-heavy mutas into mineral-heavy devourers.

Overkill finally moved out when it was maxed, with 72 mutalisks. Steamhammer had 20 devourers and 4 mutalisks, plus a fleet of scourge. Now, devourers are support units. They don’t do much damage; their benefit is that they splash acid spores which reduce the enemy’s attack rate and increase the damage taken, and to get the benefit you need fast-attacking units to deal the damage. Because acid spores only stack up to 9, it is not likely that 20 devourers will ever be useful, and definitely not if you fly them around in a group. I guess that about 6 devourers would have been best, certainly not more than 9.

massive purple goop

The battle was hilarious. The devourers fought well enough, apparently not harmed by the known bugs, and instantly stacked their purple goop up to the limit of 9 on every Overkill mutalisk that appeared. Devourers are tough, and Overkill’s mutas had their attacks slowed, so the devourers took a long time to die. But with only 4 mutalisks to deal damage, even though the damage per shot was 9+9 + 3+9 + 1+9 = 40 (spread over 3 enemies with mutalisk bounce), Overkill’s losses were minor.

The ultralisk cavern you can see in the picture was useless. Steamhammer never made an ultra. It shouldn’t have wanted to.

In the next version I’ve taken measures. Steamhammer tries to make more units, techs more moderately, and limits the number of devourers.
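
Here is a minimal sketch of the kind of devourer limit I have in mind; the helper name and thresholds are my own invention (the cap of 6 comes from the acid spore reasoning above), not Steamhammer’s actual code.

	#include <algorithm>

	// Hypothetical sketch: cap devourers at a handful, and only morph them when
	// enough mutalisks remain to deliver the damage under the acid spores.
	int devourersWanted(int myMutalisks, int myDevourers)
	{
		const int hardCap = 6;                      // acid spores stack to 9; 6 is plenty
		const int supportCap = myMutalisks / 3;     // keep most mutas as damage dealers
		const int target = std::min(hardCap, supportCap);
		return std::max(0, target - myDevourers);   // how many more to morph, if any
	}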

Overkill’s games

In September I looked into the source of Overkill AIIDE 2017. It has much more extensive and ambitious learning than Overkill 2016. But at the time, no replays were available, so I didn’t know how well the learning worked.

Overkill 2017 has 2 kinds of learning. One is online opening learning, which learns which of 3 openings to play against each opponent. This is the same simple method that many other bots use. The other is strategy learning: what to build, research, and so on next; I think this learning is offline, done ahead of time. The graph of win rate over time shows a modest increase in score throughout the tournament, consistent with the opening learning. The modest increase was enough to pull Overkill ahead of 5 different opponents that started out better.

Here is Overkill’s tournament history. I believe these tournaments represent 3 different versions of Overkill. The 2017 version played only in AIIDE 2017.

tournament            place   win rate
CIG 2015              3       81%
AIIDE 2015            3       81%
CIG 2016 qualifier    4       71%
CIG 2016 final        5       51%
AIIDE 2016            7       62%
CIG 2017              7       60%
AIIDE 2017            21      33%

It’s safe to say that the learning was not successful.

What did the games look like? In every game I looked at, Overkill played sunken defense and then seemed hesitant and uncertain what to do next. It researched tech that it did not use, or did not use until much later; it made scourge when the opponent had shown no sign of air units; it added more hatcheries than its income could feed. The sunken defense worked especially poorly when the opponent went air. The basic mistakes caused weak results. I don’t know the training procedure; I can only guess that maybe the neural networks were not trained enough, or were trained on crummy data, or were misled by bugs. Or something.

Oh well. :-( Learning is hard.

Overkill-McRave is one of Overkill’s better games. #21 Overkill scored 50% against #7 McRave, which you could count as an upset. Overkill-IceBot is a more typical game with some OK action. Overkill-Juno is a sad, sad game that ran the full hour due to suicidal play by both sides.

Next: Catalog of bots which wrote learning files.

Overkill’s CIG 2017 replays

Well, I looked at Overkill’s replays from CIG 2017 as I said I would, but I was disappointed. Not only was the CIG 2017 Overkill carried over from the previous year (which I forgot), but the CIG 2016 version was also carried over from the year before (which I also forgot). Overkill from CIG 2017 is the same as Overkill from AIIDE 2015, from before the big learning feature was added. It plays the same as the Overkill version on SSCAIT.

Nothing to see here. :-(

Of the tournaments in yesterday’s table, AIIDE 2016 has the least-old version of Overkill. Last year’s version is the first version with unit choice learning, not only opening learning. The other tournaments all played the AIIDE 2015 version which has only simple opening learning like many bots. The AIIDE 2016 result is not far out of line with the other results, so we can guess that the new learning feature worked adequately but not great.

Overkill AIIDE 2017 updates

PurpleWaveJadian pointed out in a comment that Sijia Xu had updated Overkill’s repository from last year’s version to the AIIDE 2017 version of Overkill. There are extensive changes. The bot looks substantially more capable.

Here’s what caught my eye. This is based on a quick read through a long list of diffs, so I may have missed or misunderstood a lot.

• New machine learning library with a couple different learning methods.

• Support for lurkers, scourge, guardians, devourers, ultralisks—all the regular zerg combat units. Only queens and defilers are omitted, the spell units. Last year it supported only zerglings, hydralisks, and mutalisks.

• Major changes in tactical calculations. Well, there would have to be to support new units.

• Changes to building construction. I’m not sure what the idea is here.

• It detects production deadlocks, to avoid freezes. The method is written to be simple and general-purpose rather than extensive and tweakable like Steamhammer’s, so it may be worth a look; a rough sketch of the general idea follows this list. See ProductionManager::checkProductionDeadLock().

• Scouting seems to be completely rewritten. Overlords scout around using an influence map. It looks like 1 zergling can be singled out to be a scout zergling.

• Overkill 2016 learned a model of when to make zerglings, hydralisks, or mutalisks. The actions in this Overkill version include the combat units except for guardians and devourers, sunken colonies, tech buildings to construct, upgrades and tech to research, expanding, attacking versus harassing with mutalisks, and doing nothing. So it not only has a wider scope for strategy with more unit types, it looks as though it learns more of its strategy. I see declarations for 4 neural networks, which apparently cooperate to do different parts of the job.

• It chooses from among the same 3 openings as before: 9 pool, 10 hatch, 12 hatch. The details of the build orders have been revised.
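
About the deadlock detection mentioned above: here is a rough sketch of the general idea as I read it. The function name and the threshold are my own inventions; this is not Overkill’s code.

	#include <BWAPI.h>

	// If the item at the head of the production queue cannot be started (missing
	// prerequisite, no larva, not enough resources) and nothing has been produced
	// for a long time, report a deadlock so the caller can clear the queue.
	bool looksDeadlocked(int lastProductionFrame, bool headItemBlocked)
	{
		const int stuckFrames = 24 * 60;   // roughly one minute of game time
		return headItemBlocked &&
			BWAPI::Broodwar->getFrameCount() - lastProductionFrame > stuckFrames;
	}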

How does it play? Well, I only looked at the source, and not too closely. Overkill first got a learning model in the CIG 2016 tournament. Here are its historical tournament results:

tournament            place   win rate
CIG 2015              3       81%
AIIDE 2015            3       81%
CIG 2016 qualifier    4       71%
CIG 2016 final        5       51%
AIIDE 2016            7       62%
CIG 2017              7       60%

The CIG 2016 qualifier is a better comparison to the other tournaments than the final, since it includes all entrants. The CIG 2017 entry is probably not much different from the AIIDE 2017 one. [Update: Not so. CIG says that the CIG 2017 was the same as the CIG 2016 version. See comments.] It seems that Overkill’s win rate has been falling as it learns more on its own. Is that a sign of how hard it is to do strategy learning? I would like to see it play before I draw conclusions!

Next: I’ll look at Overkill’s CIG 2017 replays and see what I can see.

Overkill - wrapup discussion

As usual, I have a mass of final comments.

unused features

After a few days to mull it over, I think the unused features are left over from experiments. Sijia Xu had to try stuff out to see what would work, and not everything was worth including. More features mean that potentially more knowledge can be learned, but also that learning is slower. Leaving unused features in the code makes it easier to switch them in and out, even if you end up computing and saving features that never make a difference. I take it as a sign of the academic attitude.

choice of actions

Overkill uses its Q-learning to choose among only 3 possible actions, which combat units to build. It’s an important decision, but only one decision in a sky full of them.

Bots could use similar methods to learn more complex decisions, like tactical decisions: How to group units into squads and what orders to give the squads. Or complex build order choices. Or unit micro decisions. In those cases you can’t list the decisions ahead of time.

Overkill independently learns the Q values for each action. In effect, it learns 3 separate evaluations, Q(situation, zergling action), Q(situation, hydra action) and Q(situation, muta action). If zerglings and hydras are similar in some ways because they’re both ground units, or hydras and mutas are similar because they’re both ranged, then Overkill’s learning can’t take advantage of that; it can’t generalize across actions.

With many actions which are unknown in advance, you need a model which generalizes over the actions. In effect, you want to fold the actions (or their predicted results) into the situation. You can add your potential decisions as primitive features of the situation, treating “these guys are going over there to attack” the same way you treat “the enemy has 3 bases”. Or you can create a separate model of the effects your decisions have on the game, and use that model’s results as the situation.

Something along these lines is what DeepMind wants to do. They’ll use a neural network with deep learning for their model, but the idea is closely related to what Overkill does. From the AI point of view, it’s a fancier version of the same underlying idea, with a closely related top-level learning algorithm.

the linear model

Overkill’s model is a linear function, in the math sense. It sums up a lot of weights. But because of how the weighted features are made, the evaluation of actions is not linear overall with respect to the primitive game features: Overkill’s features themselves are nonlinear, in 2 ways. First, they are not primitive features, they are combination features. To simplify, it’s not “I have zerglings, that’s worth weight w[z], you have goliaths, that’s worth w[g], result w[z] + w[g].” The 2 features are (I’m still simplifying) combined into 4, “I have no zerglings, you have no goliaths, weight w[-z, -g]; I have zerglings, you have no goliaths, w [+z, -g], ....” Second, if the underlying primitive feature like “I have n zerglings” takes a range of values, then Overkill breaks it into a set of binary features, “I have 0 to 6 zerglings, w[0 to 6]; I have 7 to 12 zerglings....” If the value of having n zerglings (with respect to taking a given action) does not increase linearly with n, then the weights will be different to reflect that. (So after combination you get a lot of weights, w[0..6 z, 0..6 g], w[7..12 z, 0..6 g], ....)
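
Here is a minimal sketch of that construction in C++; the bucket boundaries and feature names are invented (they mimic Overkill’s naming style), and it is not Overkill’s code.

	#include <map>
	#include <string>
	#include <vector>

	// Bucket a count into a coarse binary feature name, e.g. "ourZergling_12".
	std::string bucket(const std::string& name, int count)
	{
		static const std::vector<int> bounds = {6, 12, 18, 24};   // coarse pieces
		for (int b : bounds)
			if (count <= b)
				return name + "_" + std::to_string(b);
		return name + "_max";
	}

	// Cross two bucketed counts into one combination feature that is "present".
	std::map<std::string, int> crossFeatures(int myZerglings, int enemyGoliaths)
	{
		std::map<std::string, int> features;
		// e.g. "ourZergling_12*enemyGoliath_24" -> 1; only the buckets that apply
		// get an entry, and each present feature later contributes its weight.
		features[bucket("ourZergling", myZerglings) + "*" +
		         bucket("enemyGoliath", enemyGoliaths)] = 1;
		return features;
	}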

I should pause to explain terminology. A “primitive” feature is one that comes directly from the game, like “I have n zerglings.” Other features can be called “derived” features. Overkill’s model is linear with respect to Overkill’s derived features, but not linear with respect to the primitive game features. That’s what I mean by “nonlinear features.”

Starcraft strategy is a highly nonlinear domain (with respect to the primitive features). But if you have nonlinear features, and you have a large enough number of them, then you can re-encode the nonlinear domain as a linear model—not exactly, but as accurately as you like.

Is the linear model with nonlinear binary features a good model? For many games, it is proven good by experience. For Starcraft, I don’t know. Can you find the features you need to play well? Are there few enough that you can learn them in a reasonable time? Sijia Xu has made a start on finding the answer.

offline versus online learning

Overkill’s Q-learning seems to make sense for offline learning, “how to play Starcraft.” It seems to be too slow for online learning of opponent models, “how to beat this opponent.” There are ways to do both at once.

One idea is to use different models for the 2 cases: Learn one big offline model with a huge number of features, to play well on average against unknown opponents. And also learn a small model with few features for each opponent. The small model won’t be able to learn as much knowledge, but it will converge faster. The small opponent model tells you, “here is how this opponent differs from the average opponent.” To make decisions during play, add together the results of the models.
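
As a sketch of the “add together the results” step, here is what it might look like with two linear models over the same binary features; the type and function names are mine, not from any bot.

	#include <map>
	#include <string>

	using Features = std::map<std::string, int>;      // feature name -> 0 or 1
	using Weights  = std::map<std::string, double>;   // feature name -> weight

	// Weighted sum over the features that are present.
	double score(const Weights& w, const Features& f)
	{
		double q = 0.0;
		for (const auto& kv : f)
		{
			auto it = w.find(kv.first);
			if (it != w.end())
				q += it->second * kv.second;
		}
		return q;
	}

	// The big offline model says how good the action looks on average; the small
	// per-opponent model says how this opponent differs from the average.
	double combinedQ(const Weights& offline, const Weights& opponent, const Features& f)
	{
		return score(offline, f) + score(opponent, f);
	}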

Another idea is to use a hierarchical model. I don’t want to go into detail, but the basic idea is have high-level features with low-level features under them. The high-level features are few and can be learned quickly, so that the model’s learning starts to be useful quickly. The low-level features are many and take more time, so that the model can eventually learn a lot of knowledge if you keep feeding it data.

To learn an opponent model empirically, you have to explore. It’s mathematically required! And exploring an unfamiliar action will sometimes be bad play. We already have proof from opening strategy learning that opponent modeling can be worth it overall; good decisions later can make up for bad play caused by exploring early. But how far can we push it? We can learn to choose from among a few opening strategies and it helps, but can we make a sophisticated opponent model or is the cost of exploring too high? Humans can learn powerful opponent models, but humans don’t rely purely on the statistics—we also reason about our models in a way that is beyond the state of the art in AI.

judgment

As Martin Rooijackers put it, humans have better judgment than bots—so far. And how can we code judgment? There are only 2 fundamental techniques, knowledge and search. To match human judgment we’ll need some combination of knowledge and search.

In a realtime domain like Starcraft, I believe it makes sense to go for knowledge first. And if you go for knowledge first, then I believe hand-coding will not get you there. You’ll fail to hand-code enough knowledge for good judgment. Machine learning is the path, and Sijia Xu has taken a step down it ahead of everyone else.

Next: More on offline versus online learning.

other changes to Overkill

Overkill AIIDE 2016 is not Overkill 2015 plus Q-learning. Overkill is extensively updated. Scanning quickly through the diffs, I spotted changes affecting:

  • pathfinding
  • influence mapping
  • handling of enemy cloaked units
  • reactions to enemy flying units
  • choke finding
  • scouting
  • assignment of squads to targets (I think)
  • behavior of irradiated units
  • behavior of overlords
  • prioritization of targets to attack
  • static defense, both ground and air
  • reactions to enemy bunkers and cannons

My list is not complete. I really want to see some games!

Next: Overkill wrapup.

Overkill’s new learning 6 - how it fits with the rest of the program

The learning is done by StrategyManager::strategyChange(). During the game, the production manager is in one of two states, depending on whether Overkill is still in its opening build order.

	enum ProductionState {fixBuildingOrder, goalOriented};

	ProductionState				productionState;

When productionState is goalOriented, ProductionManager::update() calls onGoalProduction() to check whether the goal (zerglings, hydralisks, mutalisks) should be changed. The key parts:

	bool isAllUpgrade = true;
	for (int i = 0; i < int(queue.size()); i++)
	{
		//if remain unit type is just worker/upgrade/tech, check the strategy
		if (... a long condition here ...)
			isAllUpgrade = false;
	}
	if (isAllUpgrade)
	{
		if (BWAPI::Broodwar->getFrameCount() > nextStrategyCheckTime)
		{
			std::string action = StrategyManager::Instance().strategyChange(0);
			BWAPI::Broodwar->printf("change strategy to %s", action.c_str());
			setBuildOrder(StrategyManager::Instance().getStrategyBuildingOrder(action), false);

			nextStrategyCheckTime = BWAPI::Broodwar->getFrameCount() + 24 * 10;
			curStrategyAction = action;
		}
		...
	}

In other words, if the production queue is empty of combat units, and at least 10 seconds have passed since the last strategyChange(), then call strategyChange() again (with reward 0 because we’re in the middle of the game). Overkill changes its choice at most once every 10 seconds.

At the end of the game, Overkill calls strategyChange() one last time, giving a reward of 100 for a win and -100 for a loss. From Overkill::onEnd():

	if (frameElapse >= 86000 && BWAPI::Broodwar->self()->supplyUsed() >= 150 * 2)
	{
		win = true;
	}
	else
	{
		win = isWinner;
	}

	int reward = win == true ? 100 : -100;
	
	StrategyManager::Instance().strategyChange(reward);

There’s something curious here. isWinner is the flag passed into onEnd(). If Overkill makes it to the late game, it sets win = true no matter what; otherwise it believes the isWinner flag. For purposes of its learning algorithm, Overkill tells itself that making it to the late game counts as a win in itself! Isn’t that a pessimistic way to think?

Every call to strategyChange() produces 1 data point for the learning algorithm. The reward comes in only at the end of the game, but Overkill needs the other data points to fill in the Q values for states during the game. The model has a lot of values to learn, so the more data points the better.

exploration

When Overkill is in Release mode, it explores 10% of the time. The code is in StrategyManager::strategyChange(). So, 1 time in 10 when Overkill makes a decision, it’s a random exploratory decision instead of the best decision it knows.

On the one hand, if Overkill wants to learn an opponent model, it has to explore. On the other hand, making that many random decisions can’t be good for its play! Since the AIIDE 2016 tournament was not long enough for the opponent model to become accurate, Overkill might have done better to skip the opponent modeling.

Next: Other changes to Overkill.

Overkill’s new learning 5 - the full list of features

I changed my mind about what to cover today. Here is an overview of the features in Overkill’s model. I decided it was interesting to see what they are.

Yesterday we saw that features are kept in a map of maps, so names have two levels. The variable featureNames declares the top-level groups of feature names and some of the lower-level names. It looks like this (leaving out most of it):

	featureNames = decltype(featureNames) {
		//state feature
		{"state_general_feature", { 
			{ "time_", { "6", "12", "20", "30", "max" } },
			{ "enemyRace_", { "z", "t", "p", "unknow" } },
			{ "ourSupplyUsed_", { "9", "18", "30", "50", "80", "120", "150", "max" } },
		...
		{ "state_raw_combine_feature", {} },
		{ "action_battle_combine_state_battle_feature", {} }
	};

Features that originally take on a range of values have to be made into binary features. Overkill does it by breaking the range into coarse pieces. time_ looks to be game time in minutes. state_raw_combine_feature and action_battle_combine_state_battle_feature have their lower-level names filled in by code rather than declared directly. Those last two are the majority of features.

Here are top-level names and what it looks like they cover. Don’t trust my interpretations too much. Not all features in the code end up in the I/O files. I wrote down the number of features that the code includes, but apparently some of the features are never present in actual games.

name                                          #      my interpretation
state_general_feature                         53     Game time, map name, distance between bases, enemy race, etc.
state_tech_feature                            7      Overkill’s tech level.
state_building_feature                        30     Overkill’s tech buildings, the enemy’s tech and production buildings.
state_economy_feature                         46     Minerals and gas for Overkill, expansions and workers for both sides.
state_our_army                                40     Counts of Overkill’s static defense and units of each kind.
state_battle_feature                          168    Counts of enemy units of each kind.
state_raw_combine_feature                     6723   state_our_army crossed with state_battle_feature, that is, every combination of our units and the enemy’s, plus 3 extra features.
action_battle_combine_state_battle_feature    6754   Copies of state_raw_combine_feature and state_building_feature and the one state_tech_feature feature ourKeyUpgrade_zerglingsAttackSpeed.

We know from yesterday that the features in action_battle_combine_state_battle_feature are exactly the ones that matter in calculating the Q value—they are the ones which get the action appended to their names. The others are only along for the ride; it’s an implementation quirk that they are kept in the same data structure. A number of features seem to be declared, evaluated, learned and remembered, but never effectively used.

So if I counted right (which I question), then there are 6754 binary features in total, though it would be strange for all of them to be useful against any one opponent.

Next: How it fits into the rest of the program, for real.

Overkill’s new learning 4 - the model and its features

What does it mean to have a linear model with binary features? “Linear” means that each feature comes with a number, its weight, so that with binary features you find Q(s,a) by adding up the weights for each feature that is present. Usually only a small proportion of all the features are present, so it’s not as crazy as it may sound.
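
In symbols (my notation, not Overkill’s): write w_i for the weight of feature i, and f_i(s, a) = 1 if the feature is present and 0 otherwise. Then

	Q(s, a) = sum_i w_i * f_i(s, a)

so the sum effectively runs only over the features that are present.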

Overkill gives its features long multi-part names, which it implements throughout as strings accessed via maps. (I was surprised to see that in a real-time program, but it’s probably easier.) The feature names are written out plainly in the I/O files. Here are a few scattered samples from the file feature_valueAiur, which lists 9638 features altogether:

action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*hydraBuild:0.13396
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*mutaBuild:0.07588
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*zerglingBuild:0.06963
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*hydraBuild:0.05439
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*mutaBuild:0.10049
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*zerglingBuild:0.26210

state_raw_combine_feature:enemyP_cannon_1*ourHydra_6:-0.21410
state_raw_combine_feature:enemyP_cannon_1*ourHydra_12:-0.43786
state_raw_combine_feature:enemyP_cannon_1*ourHydra_18:-0.08806
state_raw_combine_feature:enemyP_cannon_1*ourHydra_24:0.24174
state_raw_combine_feature:enemyP_cannon_1*ourHydra_36:0.42465
state_raw_combine_feature:enemyP_cannon_1*ourHydra_48:0.39939
state_raw_combine_feature:enemyP_cannon_1*ourHydra_60:0.52629
state_raw_combine_feature:enemyP_cannon_1*ourHydra_max:0.59403

state_tech_feature:ourKeyUpgrade_zerglingsAttackSpeed:2.33542
state_tech_feature:ourTechLevel_hatchery:2.28803
state_tech_feature:ourTechLevel_lair:0.25170
state_tech_feature:ourTechLevel_hive:1.48611

You can guess what the feature names mean: Enemy has 1 cannon and we have up to 6 hydralisks, for example. That’s how it got so many features!

Each opponent’s file seems to list a different number of features, probably leaving out features that never came up, so 9638 is not the total number of features. But there’s something here I don’t understand. 9638 is not divisible by 3. Each line gives one weight—shouldn’t there be 3 weights for each state, so that the 3 actions can all be evaluated?

Here’s the routine that calculates Q(s,a). Its arguments are reversed—it puts the action before the state.

double StrategyManager::calActionFeature(std::string curAction, std::map<std::string, std::map<std::string, int>>& features)
{
	for (auto categoryStateFeature : features)
	{
		if (categoryStateFeature.first == "state_raw_combine_feature" || categoryStateFeature.first == "state_building_feature")
		{
			for (auto stateFeature : categoryStateFeature.second)
			{
				std::string combineFeatureName = stateFeature.first + "*" + curAction;
				features["action_battle_combine_state_battle_feature"][combineFeatureName] = 1;
			}
		}
	}

	if (features["state_tech_feature"].find("ourKeyUpgrade_zerglingsAttackSpeed") != features["state_tech_feature"].end())
	{
		std::string combineFeatureName = std::string("ourKeyUpgrade_zerglingsAttackSpeed") + "*" + curAction;
		features["action_battle_combine_state_battle_feature"][combineFeatureName] = 1;
	}

	double curQValue = 0;
	for (auto categoryFeature : features)
	{
		for (auto curfeature : categoryFeature.second)
		{
			int curfeatureValue = curfeature.second;
			if (parameterValue.find(categoryFeature.first) != parameterValue.end() && parameterValue[categoryFeature.first].find(curfeature.first) != parameterValue[categoryFeature.first].end())
			{
				double curParameterValue = parameterValue[categoryFeature.first][curfeature.first];
				curQValue += curParameterValue * curfeatureValue;
			}
		}
	}
	return curQValue;

}

parameterValue holds the model. curAction is the action and the features map with its nested type is the state. Having read this, I still don’t understand. The action name is coded into some feature names and not others, which we see above as + curAction. The list of actions:

	stateActions = {"zerglingBuild", "hydraBuild", "mutaBuild"};

Here’s the call, the bit of code which chooses the action with the highest Q value. (Below this is another bit where it changes the action if it feels like exploring.)

		for (auto action : stateActions)
		{
			std::map<std::string, std::map<std::string, int>> actionFeatureValue = featureValue;
			double curQValue = calActionFeature(action, actionFeatureValue);

			if (curQValue > maxQValue)
			{
				maxQValue = curQValue;
				maxAction = action;
				maxFeatureValue = actionFeatureValue;
			}
		}

The call does nothing to differentiate actions. As far as I can tell, only the features which include the action in their names can be used to tell actions apart, and the other features are irrelevant constants that happen to be added in.

$ grep hydraBuild feature_valueAiur | wc -l
    2176
$ grep mutaBuild feature_valueAiur | wc -l
    2267
$ grep zerglingBuild feature_valueAiur | wc -l
    2403

So 2176+2267+2403 = 6846 features out of 9638 encode the build name in the I/O file for AIUR. As far as I can tell, the other 2792 features are irrelevant. And those 2792 features include some that look important. Surely you want to pay attention to what upgrades you have when you choose which units to make!

The number of features is different for each action. That means two things. 1. The fact that the total number of features is not divisible by 3 is meaningless. 2. Not all actions have been explored in the different states. As expected, the games played against AIUR were not enough to fill in the model.

Either I’ve misunderstood something, or Overkill’s learning has flaws (I wouldn’t go so far as to say bugs, it is only a loss of effectiveness, not an error). Can anybody correct me? I’ll contact Sijia Xu.

Next: How it fits into the rest of the program.

Overkill’s new learning 3 - one model or many?

My first question about Overkill’s model is: One model for all opponents, or one model for each opponent? It turns out that the answer is: It depends. It checks curMode.

enum developMode { Develop, Release };
extern developMode	curMode;

Here is an example. Bits of code like this show up for each of the 3 data files used to store the model and learning information.

	if (curMode == Develop)
	{
		filePath = "./bwapi-data/write/RL_data";
	}
	else
	{
		string enemyName = BWAPI::Broodwar->enemy()->getName();
		filePath = "./bwapi-data/write/RL_data";
		filePath += enemyName;
	}

In the AIIDE 2016 competition, curMode is set to Release. It looks as though each opponent gets its own model, learned independently. But not learned from scratch!

My idea that Overkill has a general model turned out true. (I may have read it somewhere and forgotten where.) When it plays an opponent for the first time, it uses a model defined in file ModelWeightInit.h as the initial model, and learns starting from there. I don’t see any information about how the initial model was created. It may have been trained by playing against a variety of opponents, in Develop mode.

You could say that the initial model is “how to play Starcraft” and the refined model made for each opponent is “how to beat this opponent.” The same learning system can be used both offline to learn the game and online to model the opponent.

How well did opponent modeling work? We can look at Overkill’s graph of win rate over time in the AIIDE 2016 results. Its winning rate after 20 rounds was 0.59 and after 90 rounds was 0.62. The curve shows a fairly steady rise, and visually it’s convincing that Overkill learned something about its opponents, but it improved its win rate only a little. The unsteady learning curve we saw in the description suggests that Overkill might have eventually learned more if the tournament had gone on long enough—maybe.

Next: The features of Overkill’s model.

Overkill’s new learning 2 - Q-learning

This post is for people who want an easy introduction to an important AI technique. Q-learning is 100% worth knowing if you care about AI. It’s easy to understand, it’s a fundamental technique, it’s used in high-performance learning systems (DeepMind used Q-learning in their famous demo of playing Atari video games), and it remains the subject of serious research.

Q-learning is a form of reinforcement learning. If you want to understand reinforcement learning, I like the book Reinforcement Learning: An Introduction by Sutton and Barto (it’s all online at the link). You do have to feel comfy with the math, but it’s nothing difficult.

The problem. Q-learning solves the control problem, which means: You’re in a situation. You have a set of actions available. How do you learn from experience what action to take? In computer speak, being in a situation means you’re in some state from a set of states. At time t you’re in state[t]. Overkill (we saw last time) represents its state as a collection of about 4000 binary features. You have a set of actions to choose from, and you have to choose one, so we can say that at time t you choose action[t]. In principle the actions might be different in each state, but Overkill always has the same three actions (hatch zerglings, hatch hydralisks, hatch mutalisks), so we can ignore that.

The framework. To decide what to do, you have to figure out how good the different actions are. Q-learning is direct. It says: Let’s call that Q, the utility of each action in each state: Q [state[i], action[j]]. And every time we find something out, let’s update Q to more closely match the new information. The idea of updating bit by bit is what makes it reinforcement learning.

I had a reason to write Q[] as if it were an array lookup. If you do implement it as a table lookup (so that you’re using tabular learning), then (given certain conditions) Q-learning is mathematically guaranteed to converge to the optimal policy, where “policy” is the technical term for what you do when. But usually you can’t implement it that way. Overkill has 2^4000 states times 3 actions and can’t store an array that big, much less fill in all its values. Instead you plug in a model that stores less information and generalizes over states which are similar: “enemy has 20 goliaths, make hydras” should generalize to “enemy has 30 goliaths, make hydras”.

Here is a subtle point: To find out how good different actions are, you have to try them. You have to explore the space of states and actions. The current Q value tells you how good you think each action is, but only if you follow it up with other good actions. A good action you take now can be wiped out by a bad action you take later. And an exploratory action by definition may be bad. “Enemy has 20 goliaths, I know making hydras is good, but this time let’s explore the option of making zerglings.” You’re doing one thing, which includes exploring, and learning something else, the best policy which doesn’t include exploring. Since you’re not following the policy that you’re learning, Q-learning is called an off-policy learning algorithm. It’s not quite obvious how to do that, is it!

The solution. It may not be obvious, but the solution is simple. You believe your Q values, and update them based on your belief. I’ll write a version as pseudocode—slightly different choices are possible at pretty much every point.

You need two numbers, a learning rate alpha between 0 and 1 that says how much you update Q, and a discount factor gamma between 0 and 1 that says how far into the future to trust your Q values. It makes sense to decrease the learning rate as you go on, but in practice it’s common to set a fixed value. Normally alpha should be not too far from 0.

You also need to plug in a function approximation model that can update Q(s, a) by a given amount. Neural nets work. Many other models also work. Overkill uses a linear fit to its large number of features.

initialize Q (note 1)
for each game {
  for each game state s where you make a decision {
    r <- 1 if the game was just won (note 2)
      or 0 if the game was lost or you’re still playing
    a <- choose an action (note 3)
    update model Q(s, a) += alpha * (r + gamma * nextQ - Q(s, a)) (note 4)
  }
}

1. You can initialize Q any way you want. Constant values or random values work. If you have an idea what values may be good, you can try to give the algorithm a head start.

2. r stands for “reward”, the standard terminology in reinforcement learning. The values are arbitrary. You could use -1 for losing, as long as the value is 0 except at the end of the game. That’s because in Starcraft, the only reward is at the end of the game—in another domain you might get rewards along the way.

3. How do you choose an action? You want some combination of exploring and following the policy. If you always explore you won’t learn much, because you’ll mostly explore stupid actions. If you always follow the policy you won’t learn much because you’ll do the same things over and over. Do both and you get traction! One combination is epsilon-greedy, where you explore at random epsilon of the time (say, 10% of the time) and otherwise follow the policy. There are other combinations.

4. Updating is the key step. You tell your model: For this state and action, adjust Q by this increment and generalize as you will. What is nextQ? It is the best Q value in the following state. You chose an action in state s[t] and ended up in a new state s[t+1]. The best available estimate of Q for your action is: The Q for the best action available in the next state. Using the Q values for the next state is what makes off-policy learning work. In math, nextQ = max_a Q(s[t+1], a). Or in pseudocode:

nextQ = -infinity
for x (actions) {
  nextQ = max (nextQ, Q(s[t+1], x))
}

The version I wrote is the most basic version for people who’re learning it for the first time. There are a ton of methods for speeding it up and making it more accurate. Overkill uses an up-to-date selection of some of those methods, so its version is more complicated.
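
For readers who want something that compiles, here is a minimal, self-contained C++ sketch of the same loop body: tabular Q values, epsilon-greedy action choice (note 3), and the update with nextQ (note 4). It borrows Overkill’s three action names as placeholders, and it is an illustration of the pseudocode, not anyone’s tournament code.

	#include <algorithm>
	#include <cstdlib>
	#include <map>
	#include <string>
	#include <utility>
	#include <vector>

	using State  = std::string;
	using Action = std::string;

	std::map<std::pair<State, Action>, double> Q;   // note 1: missing entries start at 0

	const std::vector<Action> actions = {"zerglingBuild", "hydraBuild", "mutaBuild"};
	const double alpha   = 0.1;    // learning rate
	const double gamma_  = 0.9;    // discount factor (the pseudocode's gamma)
	const double epsilon = 0.1;    // exploration rate (note 3)

	double bestQ(const State& s)                    // max over actions; this is nextQ (note 4)
	{
		double best = -1e18;
		for (const Action& a : actions)
			best = std::max(best, Q[{s, a}]);
		return best;
	}

	Action chooseAction(const State& s)             // epsilon-greedy (note 3)
	{
		if (double(std::rand()) / RAND_MAX < epsilon)
			return actions[std::rand() % actions.size()];
		Action best = actions[0];
		for (const Action& a : actions)
			if (Q[{s, a}] > Q[{s, best}])
				best = a;
		return best;
	}

	// One update per decision point. terminal means the game just ended, so there
	// is no next state to look ahead to. With a function approximation model
	// (Overkill's case) the same increment is applied to the weights of the
	// features present in (s, a) instead of to a table entry.
	void update(const State& s, const Action& a, double r, const State& next, bool terminal)
	{
		double nextQ = terminal ? 0.0 : bestQ(next);
		Q[{s, a}] += alpha * (r + gamma_ * nextQ - Q[{s, a}]);
	}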

Next: Overkill’s model.

Overkill’s new learning 1 - the description

Overkill has new strategy learning unlike what we’ve seen before. I’m excited to look into it.

Author Sijia Xu updated Overkill for AIIDE 2016 with a machine learning method to learn what units to build after the opening, depending on the game situation. As far as I know, no other bot has anything similar. So far, this version of Overkill played only in AIIDE 2016, where it finished 7th; it’s not playing on SSCAIT or anywhere else that I know of. But the code is in Sijia Xu's github repository along with a description.

overview

I’ll start with the description. To quote:

Overkill currently contain two main AI system:

  • building unit selection. Using reinforcement learning model to choose the most proper unit to build(currently only three types of unit to choose: zerglings,mutalisk,hydralisk) according to current game status, and saving the match info to continually improve the RL model itself.
  • opening strategy selection. Using UCB1 to select the best opening strategy at the the beginning of the game according to the previous match result against this opponent.

In other words, the opening strategy learning that I went over before remains the same, though according to the readme the openings have been trimmed back to let the unit selection model make more decisions.
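
For reference, here is a minimal sketch of UCB1 opening selection as the readme describes it; the data layout and names are mine, not Overkill’s.

	#include <cmath>
	#include <vector>

	struct OpeningRecord { double wins = 0; double games = 0; };

	// Pick the opening with the highest UCB1 score: observed win rate plus an
	// exploration bonus that shrinks as that opening accumulates games.
	int chooseOpening(const std::vector<OpeningRecord>& openings, double totalGames)
	{
		int best = 0;
		double bestScore = -1.0;
		for (int i = 0; i < int(openings.size()); ++i)
		{
			const OpeningRecord& r = openings[i];
			if (r.games == 0)
				return i;   // try every opening at least once
			double score = r.wins / r.games
			             + std::sqrt(2.0 * std::log(totalGames) / r.games);
			if (score > bestScore)
			{
				bestScore = score;
				best = i;
			}
		}
		return best;
	}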

It sounds as though Overkill’s tactical abilities are not changed. Overkill 2015 plays with zerglings, mutalisks, and hydralisks and nothing else, switching from one to the next in a fixed order when simple heuristics are triggered. Overkill 2016 plays with the same units but can change between them however it likes depending on the way the game develops. Its play should be more flexible and fun.

My thought: Unit selection is an important skill, but it is only one skill. Unit selection among only 3 units is an interesting experiment, not a way in itself to score a lot of wins. The most basic zerg skill isn’t knowing when to make which fighting unit, it’s knowing when to make drones, and that is not learned here. With no ability to use scourge, lurkers, or any hive tech, and hand-coded tactics that maneuver each type of unit independently of the other types, Overkill does not have a high ceiling on its skill level, no matter how smart the unit selection model grows. That’s why it only came in 7th, I guess. It’s the academic attitude at work, tackling one problem at a time and not worrying about full generality until later. And I’m intensely interested to see what lessons we can draw from the experiment.

the method

Now I’ll cover the “model”, “feature” and “learning methods” sections from the description. They are written to be understandable to AI people, and in fact I can understand the outline clearly enough. But it’s probably incomprehensible unless you know the jargon, so I’ll try to explain step by step in followup posts.

The basic algorithm is Q-learning, a fundamental reinforcement learning algorithm. Q-learning is well worth knowing if you care about AI at all, so I’ll write a post to explain it. Q-learning is a plug-in algorithm where you can plug in different kinds of models which are learned by another algorithm.

Overkill plugs in a model of a linear fit to a large number of binary game features, which is exactly the same kind of model that I briefly mentioned in breakthrough game programs as used in the evaluation function of the othello program Logistello. The description says that Overkill has about 4000 features, which (believe it or not) is a small number for this kind of model! Having more features might make it too slow. The features are things like which units and tech each side has, and so on. I’ll read the code and find out details.

Overkill does its linear fit to the features not using regression like Logistello but using stochastic gradient descent or SGD. Versions of SGD are state of the art and heavily used in deep learning (and more). SGD is important, but it’s not fundamental so I don’t intend to look at it.

The description does not tell us whether Overkill includes “who is my opponent?” in its model. Looking at the I/O files from AIIDE 2016, I see files like “feature_gradientAiur”, “feature_valueAiur”, and “RL_dataAiur” for each opponent. It looks as though Overkill learns a different model for each opponent, although it could combine the opponent data into one model. I’ll read the code and find out.

the results

The description shows us 3 graphs from one experimental run of Overkill playing 600 games versus IceBot. That’s a lot of games. The graphs show Overkill’s model appearing to converge, slowly and unsteadily, and improve its already-high winning rate.

My thoughts: 1. One experiment does not tell us whether this works. It seemed to work against IceBot. 2. The learning rate was slow. It took hundreds of games. As I pointed out in the generalization for strategy learning post, reinforcement learning doesn’t stand out as a good choice when you have limited data. 3. Therefore Overkill’s learning seems suitable only for a very long tournament. If Overkill learns a model for each opponent, then the 90 games against each opponent in AIIDE 2016 were not enough. If it learns one overall model, then there may be enough games but an experiment against one opponent doesn’t tell us.

I don’t see where this is written, but somewhere I got the idea that for AIIDE 2016, Overkill started with a general model trained against some range of opponents, its prior knowledge. Then it refined the prior knowledge with experience during the tournament. It seems like a good idea. On the other hand, Overkill’s AIIDE 2016 winning rate against IceBot of 69/87 (79%) doesn’t match the expected >90% win rate from the graph if Overkill had been pre-trained against IceBot. So I’m not sure. A general model plus 90 games for each opponent could produce the results we see—it’s plausible.

Next: Q-learning.

both sides take the same base

Funny picture: Until just before this screenshot of a game between Igor Lacik's bot and Overkill, both sides were mining from this base. As you can see, terran ended up winning it... with a displaced command center.

[image: zerg and terran taking the same base]

what Overkill learned

I wrote a little Perl script to read Overkill’s learning files from AIIDE 2015 and add up the numbers. The three strategy names are as Overkill spells them. The opponents are listed in tournament order, so the strongest are at the top.

                   NinePoolling    TenHatchMuta    TwelveHatchMuta    total
opponent           n     win       n     win       n     win          n     win
tscmoo             57    26%       19    11%       18    11%          94    20%
ZZZKBot            80    46%       8     0%        8     0%           96    39%
UAlbertaBot        61    30%       20    15%       10    0%           91    23%
Aiur               13    54%       66    80%       3     0%           82    73%
Ximp               2     0%        30    83%       57    93%          89    88%
IceBot             4     25%       72    83%       14    57%          90    77%
Skynet             13    62%       19    68%       58    84%          90    78%
Xelnaga            75    81%       12    50%       3     0%           90    74%
LetaBot            78    100%      10    70%       2     0%           90    94%
Tyr                6     33%       25    64%       53    77%          84    70%
GarmBot            27    96%       27    96%       36    100%         90    98%
NUSBot             66    100%      13    77%       11    73%          90    93%
TerranUAB          30    100%      30    100%      30    100%         90    100%
Cimex              56    100%      33    94%       2     0%           91    96%
CruzBot            30    100%      30    100%      29    100%         89    100%
OpprimoBot         24    96%       33    100%      33    100%         90    99%
Oritaka            56    98%       10    70%       24    88%          90    92%
Stone              56    93%       12    67%       21    81%          89    87%
Bonjwa             30    100%      30    100%      30    100%         90    100%
Yarmouk            30    100%      30    100%      30    100%         90    100%
SusanooTricks      32    100%      23    96%       32    100%         87    99%
total              826   80%       552   80%       504   83%          1882  81%

The number n here is not the number of games played. There were 90 rounds. Some games were perhaps not recorded due to crashes or other errors, which could explain why some opponents have n < 90. Also, when the 10-hatch mutalisk strategy failed, Overkill assumes it must have lost due to a rush that would also kill the 12-hatch muta strategy. In that case Overkill records 2 game records, a 10-hatch muta loss and a 12-hatch muta loss, explaining why some opponents have n > 90. At least that’s what the code says; some of the data in the table doesn’t seem to match up (see the Xelnaga row). What did I miss?

Some of the strategy choices make sense intuitively. Overkill learned to get early zerglings against ZZZKBot and UAlbertaBot which play rushes, and learned that a more economy-oriented strategy worked against XIMP with its later carriers. These are examples of learning as a substitute for scouting and adapting.

Look at the bottom row. Each strategy ended up with virtually the same winning rate; the UCB algorithm evened them out accurately. But it didn’t use the strategies equally often; the 9-pool was more successful on average against this set of opponents. The early zerglings are important against many opponents, for whatever reason.

Look at the individual lines. Except for weaker opponents that Overkill defeats no matter what, for most opponents one or two strategies were clearly better and were played more often. How much did Overkill learn? If it had played strategies randomly, then the winning rate would be the average of the strategy winning rates. The gain can be estimated as the total winning rate minus the mean of the strategy winning rates—how far did you rise above ignorance? The number varies from zero to huge for different opponents. Because of sampling effects, the estimate will statistically tend to be higher than the truth.

This learning method has to play weak strategies to find out that they’re weak, so it can’t be perfect. The regret for each opponent can be estimated as the difference between the total winning rate and the winning rate of the best strategy if you’d known to play it from the start—how far did you fall short of omniscience? For many of the opponents, the regret estimated that way is 6% to 7%. If the learning algorithm converges to an exact solution, then in an infinitely long tournament the regret will fall to 0. Thinking about numbers like this can give you an idea of when learning makes sense.
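
For a worked example, take the Aiur row above: the three strategy winning rates of 54%, 80%, and 0% average to about 45%, while the total winning rate was 73%, so the estimated gain is roughly 28 points. The best strategy scored 80%, so the estimated regret is about 7 points, in line with the 6% to 7% figure.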

The Perl script and related files are available as a zip archive.