
Killerbot has been updated!

Yesterday was American Thanksgiving. I want to give thanks that Marian Devecka updated Killerbot on that day!

The new binary is about 5% larger, so I expect some improvement in its play but nothing dramatic. I’ve only seen a few games so far. In a game against Bereaver I thought Killerbot put up a tougher fight before losing, with better use of hydralisks, but the 2015 version already put up a pretty tough fight so I’m not sure. In a game against Iron there was a definite change, though: Killerbot went mutalisks and batted away Iron’s early pressure to gain an advantage. I’ve never seen Killerbot go air against terran before. Before this update, Iron had been rolling up Killerbot like a rug, but this game was another tough fight. Iron came out on top after a while, with good use of repair and steady aggression.

The new Killerbot version has not quite caught up with the current leaders, but it shows progress and there’s still time before the 18 December SSCAIT deadline.

The Little Tailor

These 2 pictures are from a game between Bereaver and the AIIDE 2016 version of LetaBot. It was a great game—Bereaver lost its natural and narrowly held its main over and over against LetaBot’s powerful attacks, until finally LetaBot ran out of steam and lost. LetaBot didn’t scout Bereaver’s third base. But I want to pick out a funny detail:

reaver shot in motion...

I dub this reaver The Little Tailor.

... 7 at 1 blow

Seven at one blow! One marine lived and kept shooting until the next scarab got it. It’s good reaver targeting.

Bereaver’s curious weaknesses

Bereaver was reuploaded today, and apparently the new version introduces bugs because its play is worse in obvious ways. Until today Bereaver had been holding the #1 rank with seeming ease. I expect the problems will be ironed out before SSCAIT closes submissions for the tourney!

Even without bugs, Bereaver’s play is curious because it is a top bot, yet some of its weaknesses look simple to fix. Usually the weaknesses of top bots look difficult or time-consuming or at least troublesome to correct. I’ll give a couple examples.

1. Bereaver gets the reaver capacity upgrade, so that a reaver can hold 10 scarabs. Bereaver eventually makes a lot of reavers, so an upgrade makes sense—but the capacity upgrade? The way I see it, the main purpose of the capacity upgrade is to make reavers usable when your hands are too slow. Expert players don’t even use up the normal capacity of 5 scarabs, at least not until later in the game when micro becomes more demanding. It’s normal to send a reaver into the fight with 3 or even 2 scarabs in the magazine, because every scarab is a sunk cost and you never know when the reaver will die. Bereaver invests 200/200 for the upgrade and then builds 5 extra scarabs for 75 minerals more per reaver, and I don’t think it earns any return.

I imagine it would be a 1-line change to order the reaver damage upgrade instead. Instead of becoming more expensive, the reavers would become more lethal.

2. Bereaver makes corsairs against terran, apparently to counter science vessels or dropships. There are times when it could make sense, but mainly on island maps. I have never seen Bereaver’s corsairs pay off in a game against a terran.

Again, it seems like a 1-line change to skip corsairs. It would be a different story if the corsairs used disruption web, but that’s for later in the game and it’s not half as simple.

offline and online learning

Offline learning means learning at home; you do it before the tournament. Online learning means learning from games as you play them; you do it during the tournament. For good play, you want to learn as much as you can at home—humans are the same. But some things come up only during live play and can’t be prepared ahead of time. If you face an unfamiliar opponent or an unfamiliar map and need to adapt, you have to adapt on the spot.

Reinforcement learning algorithms are online algorithms, which means that you can use them either online or offline. Overkill tried to do both, and it looks as though the offline learning was successful but the online learning could not get enough data to bite. I love this part: Batch learning algorithms, which are considered offline algorithms, are more appropriate for the small amounts of data we can get in online learning. The names turned out to be reversed!

deep learning and memory

Here’s a thought experiment. Suppose you are DeepMind, and you know from your name that you’re going to do deep learning and no other kind. You decide that you want to learn opponent models, which has to be done online in real time. Deep learning is powerful, but it is also data-hungry to the max. It can’t work online. What do you do?

There are ways.

The simplest way is to make the past part of the input. Your deep learning network can’t remember things from game to game, but you can remember for it and let it know. For opponent modeling, you remember details about the opponent’s past behavior and present them as input to the network. You can collect a lot of details; deep learning is able to cope with large amounts of undigested input data. If you train your network thoroughly at home, it will learn how to adapt to different kinds of opponents.
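
As a sketch of what “remember for it” might look like (everything here is invented for illustration, not any bot’s actual code), you could keep a few per-opponent statistics on disk between games and append them to the network’s input:

#include <vector>

// Hypothetical per-opponent record, saved to disk between games.
struct OpponentHistory
{
	int gamesPlayed = 0;
	int gamesTheyRushed = 0;
	int gamesTheyWentAir = 0;
};

// Append the remembered history to whatever features describe the current game.
std::vector<float> buildNetworkInput(std::vector<float> currentGameFeatures,
                                     const OpponentHistory& h)
{
	float games = float(h.gamesPlayed);
	currentGameFeatures.push_back(games);
	currentGameFeatures.push_back(games > 0 ? h.gamesTheyRushed / games : 0.0f);
	currentGameFeatures.push_back(games > 0 ? h.gamesTheyWentAir / games : 0.0f);
	return currentGameFeatures;
}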

Another way is to give your deep learning network a memory under its own control—let it decide what information to store for next time. “What should I remember?” becomes part of the network output alongside “what should I do?” In the next game against the same opponent, you feed in “what should I remember?” from the last output as part of the current network input. In training, the deep learning model learns what it should best remember. Similar experiments have been done (not in Starcraft), so the idea is known to work. Figuring out what to remember is trivial next to playing go!
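
Here’s a rough sketch of the shape of that interface, with invented names and sizes; it’s not DeepMind’s design, just the idea that the memory vector is the only thing you persist between games:

#include <vector>

struct NetworkOutput
{
	std::vector<float> action;   // "what should I do?"
	std::vector<float> memory;   // "what should I remember?" for next time
};

// Stand-in for the real model. The point is only that last game's memory
// comes back in as part of this game's input.
NetworkOutput runNetwork(const std::vector<float>& observation,
                         const std::vector<float>& memoryFromLastGame)
{
	std::vector<float> input = observation;
	input.insert(input.end(), memoryFromLastGame.begin(), memoryFromLastGame.end());
	// ... the actual network would run on input here ...
	return NetworkOutput{ std::vector<float>(3, 0.0f), std::vector<float>(16, 0.0f) };
}

Training is what fills the memory with something useful; the code only has to carry it across games.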

More generally, if you have a powerful learning method, then it can learn more than how to play on its own. It can learn how to play plus how to operate whatever kind of analysis machine will help it play better. A memory is only one example of an analysis machine that deep learning could learn to operate. It could also operate a build order planner or a tactics planner or whatever, request exactly the plans best for the situation as a whole, and take those plans as part of its input.

Summary: 1. Offline learning is usually preferable. 2. Online learning can be done offline. It’s at least good to know you have a choice!

Overkill - wrapup discussion

As usual, I have a mass of final comments.

unused features

After a few days to mull it over, I think the unused features are left over from experiments. Sijia Xu had to try stuff out to see what would work, and not everything was worth including. More features mean that potentially more knowledge can be learned, but also that learning is slower. Leaving unused features in the code makes it easier to switch them in and out, even if you end up computing and saving features that never make a difference. I take it as a sign of the academic attitude.

choice of actions

Overkill uses its Q-learning to choose among only 3 possible actions, which combat units to build. It’s an important decision, but only one decision in a sky full of them.

Bots could use similar methods to learn more complex decisions, like tactical decisions: How to group units into squads and what orders to give the squads. Or complex build order choices. Or unit micro decisions. In those cases you can’t list the decisions ahead of time.

Overkill independently learns the Q values for each action. In effect, it learns 3 separate evaluations, Q(situation, zergling action), Q(situation, hydra action) and Q(situation, muta action). If zerglings and hydras are similar in some ways because they’re both ground units, or hydras and mutas are similar because they’re both ranged, then Overkill’s learning can’t take advantage of that; it can’t generalize across actions.

With many actions which are unknown in advance, you need a model which generalizes over the actions. In effect, you want to fold the actions (or their predicted results) into the situation. You can add your potential decisions as primitive features of the situation, treating “these guys are going over there to attack” the same way you treat “the enemy has 3 bases”. Or you can create a separate model of the effects your decisions have on the game, and use that model’s results as the situation.
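
Here is a minimal sketch of the first option, with invented names and a stubbed-out model, just to show the shape of folding a decision into the situation:

#include <map>
#include <string>

// Stand-in for whatever learned evaluation you plug in (linear fit, network, ...).
double modelValue(const std::map<std::string, int>& features)
{
	(void)features;
	return 0.0;
}

// Score a candidate decision by folding its description into the situation.
double scoreDecision(std::map<std::string, int> situation,
                     const std::map<std::string, int>& decisionFeatures)
{
	// e.g. decisionFeatures might hold {"squadSize_8", 1} and {"target_enemyNatural", 1}
	for (const auto& kv : decisionFeatures)
		situation[kv.first] = kv.second;   // the decision becomes part of the situation
	return modelValue(situation);          // one model generalizes across all decisions
}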

Something along these lines is what DeepMind wants to do. They’ll use a neural network with deep learning for their model, but the idea is closely related to what Overkill does. From the AI point of view, it’s a fancier version of the same underlying idea, with a closely related top-level learning algorithm.

the linear model

Overkill’s model is a linear function, in the math sense. It sums up a lot of weights. But because of how the weighted features are made, the evaluation of actions is not linear overall with respect to the primitive game features: Overkill’s features themselves are nonlinear, in 2 ways. First, they are not primitive features, they are combination features. To simplify, it’s not “I have zerglings, that’s worth weight w[z], you have goliaths, that’s worth w[g], result w[z] + w[g].” The 2 features are (I’m still simplifying) combined into 4, “I have no zerglings, you have no goliaths, weight w[-z, -g]; I have zerglings, you have no goliaths, w [+z, -g], ....” Second, if the underlying primitive feature like “I have n zerglings” takes a range of values, then Overkill breaks it into a set of binary features, “I have 0 to 6 zerglings, w[0 to 6]; I have 7 to 12 zerglings....” If the value of having n zerglings (with respect to taking a given action) does not increase linearly with n, then the weights will be different to reflect that. (So after combination you get a lot of weights, w[0..6 z, 0..6 g], w[7..12 z, 0..6 g], ....)
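
To make the bucketing and combining concrete, here is a minimal sketch; the bucket boundaries and feature names are made up, not Overkill’s exact ones:

#include <map>
#include <string>

// Coarse bucket for a raw count, e.g. 9 zerglings -> "7..12".
std::string bucket(int n)
{
	if (n <= 6)  return "0..6";
	if (n <= 12) return "7..12";
	if (n <= 24) return "13..24";
	return "max";
}

// With binary features, a linear model's value is just the sum of the weights
// of the features that are present. Here only one combination feature is on.
double evaluate(int ourZerglings, int enemyGoliaths,
                const std::map<std::string, double>& weights)
{
	std::string feature = "ourZergling_" + bucket(ourZerglings) +
	                      "*enemyGoliath_" + bucket(enemyGoliaths);
	auto it = weights.find(feature);
	return it != weights.end() ? it->second : 0.0;
}

Cross every unit count on our side with every unit count on the enemy side this way and the feature count multiplies fast, which is how a model like Overkill’s ends up with thousands of binary features.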

I should pause to explain terminology. A “primitive” feature is one that comes directly from the game, like “I have n zerglings.” Other features can be called “derived” features. Overkill’s model is linear with respect to Overkill’s derived features, but not linear with respect to the primitive game features. That’s what I mean by “nonlinear features.”

Starcraft strategy is a highly nonlinear domain (with respect to the primitive features). But if you have nonlinear features, and you have a large enough number of them, then you can re-encode the nonlinear domain as a linear model—not exactly, but as accurately as you like.

Is the linear model with nonlinear binary features a good model? For many games, it is proven good by experience. For Starcraft, I don’t know. Can you find the features you need to play well? Are there few enough that you can learn them in a reasonable time? Sijia Xu has made a start on finding the answer.

offline versus online learning

Overkill’s Q-learning seems to make sense for offline learning, “how to play Starcraft.” It seems to be too slow for online learning of opponent models, “how to beat this opponent.” There are ways to do both at once.

One idea is to use different models for the 2 cases: Learn one big offline model with a huge number of features, to play well on average against unknown opponents. And also learn a small model with few features for each opponent. The small model won’t be able to learn as much knowledge, but it will converge faster. The small opponent model tells you, “here is how this opponent differs from the average opponent.” To make decisions during play, add together the results of the models.
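
The decision rule itself is one line; here is a sketch with invented types, assuming both models expose the same kind of evaluation call:

#include <functional>
#include <vector>

// A model maps (situation features, action) to an estimated value.
using Model = std::function<double(const std::vector<double>&, int)>;

// The big offline model says how good the action is on average; the small
// per-opponent model only has to learn how this opponent differs from average.
double combinedQ(const Model& offlineGeneral, const Model& onlineOpponent,
                 const std::vector<double>& situation, int action)
{
	return offlineGeneral(situation, action) + onlineOpponent(situation, action);
}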

Another idea is to use a hierarchical model. I don’t want to go into detail, but the basic idea is to have high-level features with low-level features under them. The high-level features are few and can be learned quickly, so that the model’s learning starts to be useful quickly. The low-level features are many and take more time, so that the model can eventually learn a lot of knowledge if you keep feeding it data.

To learn an opponent model empirically, you have to explore. It’s mathematically required! And exploring an unfamiliar action will sometimes be bad play. We already have proof from opening strategy learning that opponent modeling can be worth it overall; good decisions later can make up for bad play caused by exploring early. But how far can we push it? We can learn to choose from among a few opening strategies and it helps, but can we make a sophisticated opponent model or is the cost of exploring too high? Humans can learn powerful opponent models, but humans don’t rely purely on the statistics—we also reason about our models in a way that is beyond the state of the art in AI.

judgment

As Martin Rooijackers put it, humans have better judgment than bots—so far. And how can we code judgment? There are only 2 fundamental techniques, knowledge and search. To match human judgment we’ll need some combination of knowledge and search.

In a realtime domain like Starcraft, I believe it makes sense to go for knowledge first. And if you go for knowledge first, then I believe hand-coding will not get you there. You’ll fail to hand-code enough knowledge for good judgment. Machine learning is the path, and Sijia Xu has taken a step down it ahead of everyone else.

Next: More on offline versus online learning.

other changes to Overkill

Overkill AIIDE 2016 is not Overkill 2015 plus Q-learning. Overkill is extensively updated. Scanning quickly through the diffs, I spotted changes affecting:

  • pathfinding
  • influence mapping
  • handling of enemy cloaked units
  • reactions to enemy flying units
  • choke finding
  • scouting
  • assignment of squads to targets (I think)
  • behavior of irradiated units
  • behavior of overlords
  • prioritization of targets to attack
  • static defense, both ground and air
  • reactions to enemy bunkers and cannons

My list is not complete. I really want to see some games!

Next: Overkill wrapup.

Overkill’s new learning 6 - how it fits with the rest of the program

The learning is done by StrategyManager::strategyChange(). During the game, the production manager is in one of two states, depending on whether Overkill is still in its opening build order.

	enum ProductionState {fixBuildingOrder, goalOriented};

	ProductionState				productionState;

When productionState is goalOriented, ProductionManager::update() calls onGoalProduction() to check whether the goal (zerglings, hydralisks, mutalisks) should be changed. The key parts:

	bool isAllUpgrade = true;
	for (int i = 0; i < int(queue.size()); i++)
	{
		//if remain unit type is just worker/upgrade/tech, check the strategy
		if (... a long condition here ...)
			isAllUpgrade = false;
	}
	if (isAllUpgrade)
	{
		if (BWAPI::Broodwar->getFrameCount() > nextStrategyCheckTime)
		{
			std::string action = StrategyManager::Instance().strategyChange(0);
			BWAPI::Broodwar->printf("change strategy to %s", action.c_str());
			setBuildOrder(StrategyManager::Instance().getStrategyBuildingOrder(action), false);

			nextStrategyCheckTime = BWAPI::Broodwar->getFrameCount() + 24 * 10;
			curStrategyAction = action;
		}
		...
	}

In other words, if the production queue is empty of combat units, and at least 10 seconds have passed since the last strategyChange(), then call strategyChange() again (with reward 0 because we’re in the middle of the game). Overkill changes its choice at most once every 10 seconds.

At the end of the game, Overkill calls strategyChange() one last time, giving a reward of 100 for a win and -100 for a loss. From Overkill::onEnd():

	if (frameElapse >= 86000 && BWAPI::Broodwar->self()->supplyUsed() >= 150 * 2)
	{
		win = true;
	}
	else
	{
		win = isWinner;
	}

	int reward = win == true ? 100 : -100;
	
	StrategyManager::Instance().strategyChange(reward);

There’s something curious here. isWinner is the flag passed into onEnd(). If Overkill makes it to the late game, it sets win = true no matter what; otherwise it believes the isWinner flag. For purposes of its learning algorithm, Overkill tells itself that making it to the late game counts as a win in itself! Isn’t that a pessimistic way to think?

Every call to strategyChange() produces 1 data point for the learning algorithm. The reward comes in only at the end of the game, but Overkill needs the other data points to fill in the Q values for states during the game. The model has a lot of values to learn, so the more data points the better.

exploration

When Overkill is in Release mode, it explores 10% of the time. The code is in StrategyManager::strategyChange(). So, 1 time in 10 when Overkill makes a decision, it’s a random exploratory decision instead of the best decision it knows.

On the one hand, if Overkill wants to learn an opponent model, it has to explore. On the other hand, making that many random decisions can’t be good for its play! Since the AIIDE 2016 tournament was not long enough for the opponent model to become accurate, Overkill might have done better to skip the opponent modeling.

Next: Other changes to Overkill.

Overkill’s new learning 5 - the full list of features

I changed my mind about what to cover today. Here is an overview of the features in Overkill’s model. I decided it was interesting to see what they are.

Yesterday we saw that features are kept in a map of maps, so names have two levels. The variable featureNames declares the top-level groups of feature names and some of the lower-level names. It looks like this (leaving out most of it):

	featureNames = decltype(featureNames) {
		//state feature
		{"state_general_feature", { 
			{ "time_", { "6", "12", "20", "30", "max" } },
			{ "enemyRace_", { "z", "t", "p", "unknow" } },
			{ "ourSupplyUsed_", { "9", "18", "30", "50", "80", "120", "150", "max" } },
		...
		{ "state_raw_combine_feature", {} },
		{ "action_battle_combine_state_battle_feature", {} }
	};

Features that originally take on a range of values have to be made into binary features. Overkill does it by breaking the range into coarse pieces. time_ looks to be game time in minutes. state_raw_combine_feature and action_battle_combine_state_battle_feature have their lower-level names filled in by code rather than declared directly. Those last two are the majority of features.

Here are top-level names and what it looks like they cover. Don’t trust my interpretations too much. Not all features in the code end up in the I/O files. I wrote down the number of features that the code includes, but apparently some of the features are never present in actual games.

Feature counts are in parentheses.

  • state_general_feature (53): Game time, map name, distance between bases, enemy race, etc.
  • state_tech_feature (7): Overkill’s tech level.
  • state_building_feature (30): Overkill’s tech buildings, the enemy’s tech and production buildings.
  • state_economy_feature (46): Minerals and gas for Overkill, expansions and workers for both sides.
  • state_our_army (40): Counts of Overkill’s static defense and units of each kind.
  • state_battle_feature (168): Counts of enemy units of each kind.
  • state_raw_combine_feature (6723): state_our_army crossed with state_battle_feature, that is, every combination of our units and the enemy’s, plus 3 extra features.
  • action_battle_combine_state_battle_feature (6754): Copies of state_raw_combine_feature and state_building_feature and the one state_tech_feature feature ourKeyUpgrade_zerglingsAttackSpeed.

We know from yesterday that the features in action_battle_combine_state_battle_feature are exactly the ones that matter in calculating the Q value—they are the ones which get the action appended to their names. The others are only along for the ride; it’s an implementation quirk that they are kept in the same data structure. A number of features seem to be declared, evaluated, learned and remembered, but never effectively used.

So if I counted right (which I question), then there are 6754 binary features in total, though it would be strange for all of them to be useful against any one opponent.

Next: How it fits into the rest of the program, for real.

Overkill’s new learning 4 - the model and its features

What does it mean to have a linear model with binary features? “Linear” means that each feature comes with a number, its weight, so that with binary features you find Q(s,a) by adding up the weights for each feature that is present. Usually only a small proportion of all the features are present, so it’s not as crazy as it may sound.

Overkill gives its features long multi-part names, which it implements throughout as strings accessed via maps. (I was surprised to see that in a real-time program, but it’s probably easier.) The feature names are written out plainly in the I/O files. Here are a few scattered samples from the file feature_valueAiur, which lists 9638 features altogether:

action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*hydraBuild:0.13396
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*mutaBuild:0.07588
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*zerglingBuild:0.06963
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*hydraBuild:0.05439
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*mutaBuild:0.10049
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*zerglingBuild:0.26210

state_raw_combine_feature:enemyP_cannon_1*ourHydra_6:-0.21410
state_raw_combine_feature:enemyP_cannon_1*ourHydra_12:-0.43786
state_raw_combine_feature:enemyP_cannon_1*ourHydra_18:-0.08806
state_raw_combine_feature:enemyP_cannon_1*ourHydra_24:0.24174
state_raw_combine_feature:enemyP_cannon_1*ourHydra_36:0.42465
state_raw_combine_feature:enemyP_cannon_1*ourHydra_48:0.39939
state_raw_combine_feature:enemyP_cannon_1*ourHydra_60:0.52629
state_raw_combine_feature:enemyP_cannon_1*ourHydra_max:0.59403

state_tech_feature:ourKeyUpgrade_zerglingsAttackSpeed:2.33542
state_tech_feature:ourTechLevel_hatchery:2.28803
state_tech_feature:ourTechLevel_lair:0.25170
state_tech_feature:ourTechLevel_hive:1.48611

You can guess what the feature names mean: Enemy has 1 cannon and we have up to 6 hydralisks, for example. That’s how it got so many features!

Each opponent’s file seems to list a different number of features, probably leaving out features that never came up, so 9638 is not the total number of features. But there’s something here I don’t understand. 9638 is not divisible by 3. Each line gives one weight—shouldn’t there be 3 weights for each state, so that the 3 actions can all be evaluated?

Here’s the routine that calculates Q(s,a). Its arguments are reversed—it puts the action before the state.

double StrategyManager::calActionFeature(std::string curAction, std::map<std::string, std::map<std::string, int>>& features)
{
	for (auto categoryStateFeature : features)
	{
		if (categoryStateFeature.first == "state_raw_combine_feature" || categoryStateFeature.first == "state_building_feature")
		{
			for (auto stateFeature : categoryStateFeature.second)
			{
				std::string combineFeatureName = stateFeature.first + "*" + curAction;
				features["action_battle_combine_state_battle_feature"][combineFeatureName] = 1;
			}
		}
	}

	if (features["state_tech_feature"].find("ourKeyUpgrade_zerglingsAttackSpeed") != features["state_tech_feature"].end())
	{
		std::string combineFeatureName = std::string("ourKeyUpgrade_zerglingsAttackSpeed") + "*" + curAction;
		features["action_battle_combine_state_battle_feature"][combineFeatureName] = 1;
	}

	double curQValue = 0;
	for (auto categoryFeature : features)
	{
		for (auto curfeature : categoryFeature.second)
		{
			int curfeatureValue = curfeature.second;
			if (parameterValue.find(categoryFeature.first) != parameterValue.end() && parameterValue[categoryFeature.first].find(curfeature.first) != parameterValue[categoryFeature.first].end())
			{
				double curParameterValue = parameterValue[categoryFeature.first][curfeature.first];
				curQValue += curParameterValue * curfeatureValue;
			}
		}
	}
	return curQValue;

}

parameterValue holds the model. curAction is the action and the features map with its nested type is the state. Having read this, I still don’t understand. The action name is coded into some feature names and not others, which we see above as + curAction. The list of actions:

	stateActions = {"zerglingBuild", "hydraBuild", "mutaBuild"};

Here’s the call, the bit of code which chooses the action with the highest Q value. (Below this is another bit where it changes the action if it feels like exploring.)

		for (auto action : stateActions)
		{
			std::map<std::string, std::map<std::string, int>> actionFeatureValue = featureValue;
			double curQValue = calActionFeature(action, actionFeatureValue);

			if (curQValue > maxQValue)
			{
				maxQValue = curQValue;
				maxAction = action;
				maxFeatureValue = actionFeatureValue;
			}
		}

The call does nothing to differentiate actions. As far as I can tell, only the features which include the action in their names can be used to tell actions apart, and the other features are irrelevant constants that happen to be added in.

$ grep hydraBuild feature_valueAiur | wc -l
    2176
$ grep mutaBuild feature_valueAiur | wc -l
    2267
$ grep zerglingBuild feature_valueAiur | wc -l
    2403

So 2176+2267+2403 = 6846 features out of 9638 encode the build name in the I/O file for AIUR. As far as I can tell, the other 2792 features are irrelevant. And those 2792 features include some that look important. Surely you want to pay attention to what upgrades you have when you choose which units to make!

The number of features is different for each action. That means two things. 1. The fact that the total number of features is not divisible by 3 is meaningless. 2. Not all actions have been explored in the different states. As expected, the games played against AIUR were not enough to fill in the model.

Either I’ve misunderstood something, or Overkill’s learning has flaws (I wouldn’t go so far as to say bugs, it is only a loss of effectiveness, not an error). Can anybody correct me? I’ll contact Sijia Xu.

Next: How it fits into the rest of the program.

Overkill’s new learning 3 - one model or many?

My first question about Overkill’s model is: One model for all opponents, or one model for each opponent? It turns out that the answer is: It depends. It checks curMode.

enum developMode { Develop, Release };
extern developMode	curMode;

Here is an example. Bits of code like this show up for each of the 3 data files used to store the model and learning information.

	if (curMode == Develop)
	{
		filePath = "./bwapi-data/write/RL_data";
	}
	else
	{
		string enemyName = BWAPI::Broodwar->enemy()->getName();
		filePath = "./bwapi-data/write/RL_data";
		filePath += enemyName;
	}

In the AIIDE 2016 competition, curMode is set to Release. It looks as though each opponent gets its own model, learned independently. But not learned from scratch!

My idea that Overkill has a general model turned out true. (I may have read it somewhere and forgotten where.) When it plays an opponent for the first time, it uses a model defined in file ModelWeightInit.h as the initial model, and learns starting from there. I don’t see any information about how the initial model was created. It may have been trained by playing against a variety of opponents, in Develop mode.

You could say that the initial model is “how to play Starcraft” and the refined model made for each opponent is “how to beat this opponent.” The same learning system can be used both offline to learn the game and online to model the opponent.

How well did opponent modeling work? We can look at Overkill’s graph of win rate over time in the AIIDE 2016 results. Its winning rate after 20 rounds was 0.59 and after 90 rounds was 0.62. The curve shows a fairly steady rise, and visually it’s convincing that Overkill learned something about its opponents, but it improved its win rate only a little. The unsteady learning curve we saw in the description suggests that Overkill might have eventually learned more if the tournament had gone on long enough—maybe.

Next: The features of Overkill’s model.

Overkill’s new learning 2 - Q-learning

This post is for people who want an easy introduction to an important AI technique. Q-learning is 100% worth knowing if you care about AI. It’s easy to understand, it’s a fundamental technique, it’s used in high-performance learning systems (DeepMind used Q-learning in their famous demo of playing Atari video games), and it remains the subject of serious research.

Q-learning is a form of reinforcement learning. If you want to understand reinforcement learning, I like the book Reinforcement Learning: An Introduction by Sutton and Barto (it’s all online at the link). You do have to feel comfy with the math, but it’s nothing difficult.

The problem. Q-learning solves the control problem, which means: You’re in a situation. You have a set of actions available. How do you learn from experience what action to take? In computer speak, being in a situation means you’re in some state from a set of states. At time t you’re in state[t]. Overkill (we saw last time) represents its state as a collection of about 4000 binary features. You have a set of actions to choose from, and you have to choose one, so we can say that at time t you choose action[t]. In principle the actions might be different in each state, but Overkill always has the same three actions (hatch zerglings, hatch hydralisks, hatch mutalisks), so we can ignore that.

The framework. To decide what to do, you have to figure out how good the different actions are. Q-learning is direct. It says: Let’s call that Q, the utility of each action in each state: Q [state[i], action[j]]. And every time we find something out, let’s update Q to more closely match the new information. The idea of updating bit by bit is what makes it reinforcement learning.

I had a reason to write Q[] as if it were an array lookup. If you do implement it as a table lookup (so that you’re using tabular learning), then (given certain conditions) Q-learning is mathematically guaranteed to converge to the optimal policy, where “policy” is the technical term for what you do when. But usually you can’t implement it that way. Overkill has 2^4000 states times 3 actions and can’t store an array that big, much less fill in all its values. Instead you plug in a model that stores less information and generalizes over states which are similar: “enemy has 20 goliaths, make hydras” should generalize to “enemy has 30 goliaths, make hydras”.

Here is a subtle point: To find out how good different actions are, you have to try them. You have to explore the space of states and actions. The current Q value tells you how good you think each action is, but only if you follow it up with other good actions. A good action you take now can be wiped out by a bad action you take later. And an exploratory action by definition may be bad. “Enemy has 20 goliaths, I know making hydras is good, but this time let’s explore the option of making zerglings.” You’re doing one thing, which includes exploring, and learning something else, the best policy which doesn’t include exploring. Since you’re not following the policy that you’re learning, Q-learning is called an off-policy learning algorithm. It’s not quite obvious how to do that, is it!

The solution. It may not be obvious, but the solution is simple. You believe your Q values, and update them based on your belief. I’ll write a version as pseudocode—slightly different choices are possible at pretty much every point.

You need two numbers, a learning rate alpha between 0 and 1 that says how much you update Q, and a discount factor gamma between 0 and 1 that says how far into the future to trust your Q values. It makes sense to decrease the learning rate as you go on, but in practice it’s common to set a fixed value. Normally alpha should be not too far from 0.

You also need to plug in a function approximation model that can update Q(s, a) by a given amount. Neural nets work. Many other models also work. Overkill uses a linear fit to its large number of features.

initialize Q (note 1)
for each game {
  for each game state s where you make a decision {
    r <- 1 if the game was just won (note 2)
      or 0 if the game was lost or you’re still playing
    a <- choose an action (note 3)
    update model Q(s, a) += alpha * (r + gamma * nextQ - Q(s, a)) (note 4)
  }
}

1. You can initialize Q any way you want. Constant values or random values work. If you have an idea what values may be good, you can try to give the algorithm a head start.

2. r stands for “reward”, the standard terminology in reinforcement learning. The values are arbitrary. You could use -1 for losing, as long as the value is 0 except at the end of the game. That’s because in Starcraft, the only reward is at the end of the game—in another domain you might get rewards along the way.

3. How do you choose an action? You want some combination of exploring and following the policy. If you always explore you won’t learn much, because you’ll mostly explore stupid actions. If you always follow the policy you won’t learn much because you’ll do the same things over and over. Do both and you get traction! One combination is epsilon-greedy, where you explore at random epsilon of the time (say, 10% of the time) and otherwise follow the policy. There are other combinations.

4. Updating is the key step. You tell your model: For this state and action, adjust Q by this increment and generalize as you will. What is nextQ? It is the best Q value in the following state. You chose an action in state s[t] and ended up in a new state s[t+1]. The best available estimate of Q for your action is: The Q for the best action available in the next state. Using the Q values for the next state is what makes off-policy learning work. In math, nextQ = max_a Q(s[t+1], a). (At the very end of the game there is no next state, so nextQ counts as 0 for the final update.) Or in pseudocode:

nextQ = -infinity
for x (actions) {
  nextQ = max (nextQ, Q(s[t+1], x))
}

The version I wrote is the most basic version for people who’re learning it for the first time. There are a ton of methods for speeding it up and making it more accurate. Overkill uses an up-to-date selection of some of those methods, so its version is more complicated.
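
For concreteness, here is the same basic version written out as a tabular learner in C++ with epsilon-greedy exploration. It’s a teaching sketch under the simplifications above, not Overkill’s code; Overkill swaps the table for its linear model.

#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

struct QLearner
{
	std::map<std::pair<int, int>, double> Q;   // (state, action) -> estimated value
	std::vector<int> actions;                  // the same actions in every state
	double alpha = 0.1;                        // learning rate
	double gamma = 0.95;                       // discount factor
	double epsilon = 0.1;                      // exploration rate

	int bestAction(int s)
	{
		int best = actions[0];
		for (int a : actions)
			if (Q[{s, a}] > Q[{s, best}]) best = a;
		return best;
	}

	// Epsilon-greedy: explore at random epsilon of the time, otherwise follow the policy.
	int chooseAction(int s)
	{
		if (std::rand() < epsilon * RAND_MAX)
			return actions[std::rand() % actions.size()];
		return bestAction(s);
	}

	// Called after taking action a in state s, seeing reward r and next state s2.
	// At the end of the game there is no next state, so pass terminal = true.
	void update(int s, int a, double r, int s2, bool terminal)
	{
		double nextQ = terminal ? 0.0 : Q[{s2, bestAction(s2)}];
		Q[{s, a}] += alpha * (r + gamma * nextQ - Q[{s, a}]);
	}
};

For Overkill’s case the actions would simply be {0, 1, 2}, standing for zerglings, hydralisks, and mutalisks.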

Next: Overkill’s model.

updates on Iron and Bereaver

Here are a couple of updates on top bots before I return to Overkill’s reinforcement learning.

Iron

For a while now, Iron has been playing a different strategy against protoss, going 2 barracks plus academy and attacking early with marines and medics rather than its traditional vultures. I’m a bit slow, but I finally realized why: It was having trouble against dragoons. Dragoons, adding observers in time to detect mines, are a strategically sound counter to Iron’s early vulture play, where tanks come late. Bereaver was handing Iron a string of losses with dragoon play. Time to switch it up!

Bereaver pokes with 2 zealots

Bereaver poked with 2 zealots but was unable to land a single hit against accurate micro.

Iron kills the protoss natural with infantry

Bereaver expanded early. I thought its force of 4 dragoons plus probes could have held off the marines, but it would have needed coordination and micro beyond the state of the art. Instead Bereaver lost its natural and was set back.

Iron is unable to get up the ramp

Iron tried to force the ramp while expanding, but didn’t have a big enough edge and was pushed back. Bereaver counterattacked and tried to get ahead with a double expansion, but tanks came out and Bereaver was never able to catch up. Bereaver strangely made a robo support bay early but did not build any reavers until much later, adding to its disadvantage—this game was played before the reupload below.

The dragoon opening is still sound against barracks play. Dragoons are versatile. But fighting the infantry calls for excellent micro with focus fire, and even Bereaver is lacking. Skynet is fairly skilled, but Iron’s new opening smashes it.

Bereaver

Bereaver was reuploaded yesterday. The new version’s first game was against Tyr by Simon Prins, and I immediately saw plays I hadn’t seen it make before. Bereaver went double robo.

Bereaver drops a reaver...

Bereaver’s shuttle-reaver micro is not fluid like ZerGreenBot’s, but sharply goal-directed. Tyr vainly struggled to save itself, not realizing that it had to cover the reaver drop zones and come up with some kind of air defense. Every time Tyr unsieged, the reaver dropped, fired, was picked up instantly, the scarab skirted around the minerals—boom. By the time the second shuttle appeared with 1 zealot inside, it was already too late for terran.

Bereaver’s reaver fires...

The tanks at the top are approaching, too late to defend. As for reaver shots, sometimes they dud out. And sometimes they do this.

mass destruction in the terran mineral line

Reaver drop is crazy complicated. In time, I expect Bereaver to learn how to drop in the teeth of sieged tanks: Drop a zealot first. Tanks fire. Zealot says “ow!” Reaver drops and has time to fire before the tanks can reload. Terran defense has to be on the ball!

Overkill’s new learning 1 - the description

Overkill has new strategy learning unlike what we’ve seen before. I’m excited to look into it.

Author Sijia Xu updated Overkill for AIIDE 2016 with a machine learning method to learn what units to build after the opening, depending on the game situation. As far as I know, no other bot has anything similar. So far, this version of Overkill played only in AIIDE 2016, where it finished 7th; it’s not playing on SSCAIT or anywhere else that I know of. But the code is in Sijia Xu's github repository along with a description.

overview

I’ll start with the description. To quote:

Overkill currently contain two main AI system:

  • building unit selection. Using reinforcement learning model to choose the most proper unit to build(currently only three types of unit to choose: zerglings,mutalisk,hydralisk) according to current game status, and saving the match info to continually improve the RL model itself.
  • opening strategy selection. Using UCB1 to select the best opening strategy at the the beginning of the game according to the previous match result against this opponent.

In other words, the opening strategy learning that I went over before remains the same, though according to the readme the openings have been trimmed back to let the unit selection model make more decisions.

It sounds as though Overkill’s tactical abilities are not changed. Overkill 2015 plays with zerglings, mutalisks, and hydralisks and nothing else, switching from one to the next in a fixed order when simple heuristics are triggered. Overkill 2016 plays with the same units but can change between them however it likes depending on the way the game develops. Its play should be more flexible and fun.

My thought: Unit selection is an important skill, but it is only one skill. Unit selection among only 3 units is an interesting experiment, not a way in itself to score a lot of wins. The most basic zerg skill isn’t knowing when to make which fighting unit, it’s knowing when to make drones, and that is not learned here. With no ability to use scourge, lurkers, or any hive tech, and hand-coded tactics that maneuver each type of unit independently of the other types, Overkill does not have a high ceiling on its skill level, no matter how smart the unit selection model grows. That’s why it only came in 7th, I guess. It’s the academic attitude at work, tackling one problem at a time and not worrying about full generality until later. And I’m intensely interested to see what lessons we can draw from the experiment.

the method

Now I’ll cover the “model”, “feature” and “learning methods” sections from the description. They are written to be understandable to AI people, and in fact I can understand the outline clearly enough. But it’s probably incomprehensible unless you know the jargon, so I’ll try to explain step by step in followup posts.

The basic algorithm is Q-learning, a fundamental reinforcement learning algorithm. Q-learning is well worth knowing if you care about AI at all, so I’ll write a post to explain it. Q-learning is a plug-in algorithm where you can plug in different kinds of models which are learned by another algorithm.

Overkill plugs in a model of a linear fit to a large number of binary game features, which is exactly the same kind of model that I briefly mentioned in breakthrough game programs as used in the evaluation function of the Othello program Logistello. The description says that Overkill has about 4000 features, which (believe it or not) is a small number for this kind of model! Having more features might make it too slow. The features are things like which units and tech each side has, and so on. I’ll read the code and find out details.

Overkill does its linear fit to the features not using regression like Logistello but using stochastic gradient descent or SGD. Versions of SGD are state of the art and heavily used in deep learning (and more). SGD is important, but it’s not fundamental so I don’t intend to look at it.

The description does not tell us whether Overkill includes “who is my opponent?” in its model. Looking at the I/O files from AIIDE 2016, I see files like “feature_gradientAiur”, “feature_valueAiur”, and “RL_dataAiur” for each opponent. It looks as though Overkill learns a different model for each opponent, although it could combine the opponent data into one model. I’ll read the code and find out.

the results

The description shows us 3 graphs from one experimental run of Overkill playing 600 games versus IceBot. That’s a lot of games. The graphs show Overkill’s model appearing to converge, slowly and unsteadily, and improve its already-high winning rate.

My thoughts: 1. One experiment does not tell us whether this works. It seemed to work against IceBot. 2. The learning rate was slow. It took hundreds of games. As I pointed out in the generalization for strategy learning post, reinforcement learning doesn’t stand out as a good choice when you have limited data. 3. Therefore Overkill’s learning seems suitable only for a very long tournament. If Overkill learns a model for each opponent, then the 90 games against each opponent in AIIDE 2016 were not enough. If it learns one overall model, then there may be enough games but an experiment against one opponent doesn’t tell us.

I don’t see where this is written, but somewhere I got the idea that for AIIDE 2016, Overkill started with a general model trained against some range of opponents, its prior knowledge. Then it refined the prior knowledge with experience during the tournament. It seems like a good idea. On the other hand, Overkill’s AIIDE 2016 winning rate of 69/87 (79%) doesn’t match the expected >90% win rate from the graph if Overkill had been pre-trained against IceBot. So I’m not sure. A general model plus 90 games for each opponent could produce the results we see—it’s plausible.

Next: Q-learning.

LetaBot’s wall 5 - discussion

I’m done looking at code, but I have a bunch of final observations.

limitations

As written, LetaBot’s wall code can only plan walls with a barracks and two supply depots. In theory it might create a wall with a barracks and one depot, or with only the barracks, but the map would have to have a strange layout of unbuildable spots so that the later buildings fail to be placed during the search. Normally it will squeeze all buildings in, all in contact with each other, even if some are unneeded, because it places all buildings before checking whether it has a wall.

It looks easy to reconfigure to use another fixed set of buildings. SpaceArray already includes data for the command center (wall until you lift it off) and the academy. Walls including a factory are more common, and it looks simple to add the data for that. If you have the time for the extra calls to WalledOffAStar, you could do a more extensive search that considered different possible buildings and stopped as soon as it found a wall, so that no unnecessary buildings are included. Of course the choice of buildings interacts with the build order. If your build order has the factory late but you need the wall early....

If you know the size and form of the choke, maybe there’s a way to figure out the right set of buildings to wall it off, so that you don’t have to try different sets until you find one that works. It’s an idea, anyway. For protoss, simple heuristics would do pretty well: To close a wide choke as tightly as possible, start with gateway + forge. To narrow a wide choke without closing it, use pylons. To wall a ramp, use pylons. Of course LetaBot doesn’t have code for protoss building restrictions.

reusing the wall code

The wall code is complicated. It’s also fairly messy, showing the scars of a lot of surgery. Some of the rework may be due to changing goals as the bot’s strategy shifted, but I suspect that most of it is from the sheer difficulty of getting the complex calculation to work properly in real play. If I wanted the code for a project of my own, I would first take time to clean and tighten it, comb for bugs, and make the efficiency improvements that belong in a second draft. It’s unpolished, but I judge the code quality overall as within the normal range for an academic project done under the usual heavy time constraints.

Legalities. I didn’t find any copyright license info in the CIG 2016 version of LetaBot. Of course it’s not required! But authors who actively want their code to be reused should think about conditions. For example, academic authors probably want credit and might choose a license that allows reuse on condition that the author is credited. Creative Commons offers choices. Copyright laws are different everywhere, and international copyright is even more confusing, so it’s helpful to make things as easy as possible for those who want to get by within the legal system.

more uses for WalledOffAStar

WalledOffAStar takes a start and end location and looks for a path between them for a unit of a given size, staying within a bounding box, taking into account known buildings. If there is no path, then the way is blocked by buildings or terrain (there may exist a path that goes outside the bounding box). It is coded to work only for the buildings that LetaBot uses in its wall, but it doesn’t seem difficult to generalize. You could use it to check for walls that you suspect or fear might exist.

If one of your units is trying to move somewhere, and it hasn’t made expected progress, and it is near your own buildings, then you might want to use code like this to check whether you have accidentally walled it in. If yes, you can also use it to pick a building to lift or destroy to break the wall.

If you see enemy buildings near a choke point that you want to attack through, you can use code like this to see which of your units fit through, or how much the choke is narrowed. The result can inform whether you should try to destroy buildings as the first stage in the attack.

spawning inside or outside

I don’t see any attempt to judge whether units will spawn inside or outside the wall. Terran and protoss units as well as zerg larvae spawn in predictable locations relative to the building that creates them, so I think it would be a comparatively easy extension. For a wall that includes a production building, it seems important to know. On the other hand, there’s usually not much you can do about it when units spawn outside.

how I expected the wall code to work

When I heard that LetaBot uses A* to plan its wall, I imagined a different algorithm than it actually uses. LetaBot uses brute force: It places buildings in different arrangements and figures out whether the buildings form a wall by looking for a path through them. Humans plan a wall by spatial reasoning, not brute force.

I imagined an algorithm that’s sort of closer to the human method. First, identify a flat area that you want to wall and find the two edges that the wall should run between. Let’s say the edges are the top and bottom. We’ll use A* to place buildings in a best-first way. When you use A* to find a path, the cost of a partial path is g(x), the length of the path so far, plus h(x), a heuristic estimate of the remaining path length. The wall we want is basically a path of buildings from the top edge to the bottom edge! Our h(x) will be the distance from the bottommost point of the bottommost building in the wall to the bottom edge. g(x) is a measure of how long the path is so far: It will be the sum of the heights of the buildings placed.

As LetaBot does, decree a maximum gap size and don’t allow any larger gaps between buildings. Place the first building against the top edge, respecting the gap size. The A* formula of g(x) + h(x) tells you which placement to try first. Place the second building against the first one, and A* again tells you which placement to try first, etc. You’re done when the distance to the bottom respects the gap size.
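
Here is a heavily simplified sketch of that search, assuming a 1D abstraction of the choke: only the vertical span matters, each building contributes its tile height, and gaps are counted in whole tiles rather than pixels. All names are invented; this illustrates ordering placements by g(x) + h(x), it is not LetaBot code.

#include <queue>
#include <vector>

struct Node
{
	int covered;              // rows covered downward from the top edge so far
	int g;                    // g(x): sum of the heights of the buildings placed
	unsigned used;            // bitmask of which buildings have been used
	std::vector<int> order;   // indexes of the buildings placed, in order
};

// heights: tile heights of the available buildings; span: rows from top edge to
// bottom edge; maxGap: the largest gap (in rows) allowed anywhere in the wall.
std::vector<int> planWall(const std::vector<int>& heights, int span, int maxGap)
{
	auto f = [&](const Node& n) { return n.g + (span - n.covered); };   // g(x) + h(x)
	auto worse = [&](const Node& a, const Node& b) { return f(a) > f(b); };
	std::priority_queue<Node, std::vector<Node>, decltype(worse)> open(worse);
	open.push(Node{0, 0, 0u, {}});

	while (!open.empty()) {
		Node cur = open.top();
		open.pop();
		if (span - cur.covered <= maxGap)         // remaining distance respects the gap size
			return cur.order;                     // the wall is closed
		for (size_t i = 0; i < heights.size(); ++i) {
			if (cur.used & (1u << i)) continue;   // each building can be placed once
			for (int gap = 0; gap <= maxGap; ++gap) {
				Node next = cur;                  // place building i, leaving gap rows open
				next.covered += gap + heights[i];
				next.g += heights[i];
				next.used |= (1u << i);
				next.order.push_back(int(i));
				open.push(next);
			}
		}
	}
	return {};                                    // the available buildings can't close the choke
}

A call like planWall({3, 2, 2}, 10, 1) would stand in for a barracks and two depots across a 10-row choke, allowing gaps of at most 1 row.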

This is a more expensive version of A* than WalledOffAStar, but you only run it once so apparently wall-finding would take less work overall. Though you shouldn’t trust human intuition when it comes to run times! It’s likely that g(x) and h(x) would have to be tweaked for best results—why should my first idea be the best idea? As always, it’s just an idea.

LetaBot’s wall 4 - placing buildings to form the wall

Continuing from yesterday, here is how RecursiveWall places buildings to try to create a wall. It is of course recursive. It wants a vector of building types and a current search depth.

void BuildingManager::RecursiveWall(std::vector<int> Buildings, int depth){

The call is like this. (The three bits of code here are in different places in the source.) The numbers representing buildings are used as indexes into the SpaceArray mentioned yesterday, which has the information needed to compute the gaps between buildings. I’m not sure why the two depots get different indexes into SpaceArray (which has the same data for both). Is it so they can be told apart in maps, so that building adjacency can be found? Anyway, depth is initially 0.

	  WallSound = false;
	  
	  std::vector<int> Buildings;
	  Buildings.push_back(2);//Barracks
	  Buildings.push_back(3);//Supply depot
	  Buildings.push_back(4);//Supply depot

	  RecursiveWall(Buildings,0);

The code is full of implementation details that don’t matter at this level of discussion (and includes some redundant code), so here is pseudocode. CanWall() is a utility function to figure out whether a building can be placed at a given location. mapWall() maintains a 2D map of building locations. I should also mention, though I didn’t realize it until today, that WalledOffAStar() returns false if a wall exists and true if you can walk through, despite its name and some comments. It’s confusing.

RecursiveWall (vector Buildings, int depth) :
  if (WallSound) {                    // we found a wall
    return;                           // so stop the search
  }
  if (depth == Buildings.size()) {    // all buildings are placed
    walkable = WalledOffAStar (...)   // can we walk through?
    if (not walkable) {               // no, try lifting the barracks
      gate = WalledOffAStar (..., ignoring barracks)
      if (gate) {                     // can we walk through now?
        WallSound = true;             // yes, we found a wall
        BarracksWall = ...;           // record the building placements
      }
    }  
    return;                           // either way, we're done
  } else {
    int BuildType = Buildings[depth]; // next building to place
    for (int x, y inside the bounding box) {
      if (CanWall (BuildType, x, y)) {
        mapWall (x, y, BuildType, BuildType); // update data structures
        BWAPI::TilePosition thisTile = BWAPI::TilePosition (x, y);
        BuildingPlace.push_back (thisTile);
        RecursiveWall (Buildings, depth+1);   // recursive call
        BuildingPlace.pop_back();             // roll back updates
        mapWall (x, y, BuildType, BUILDABLE);
      }
    }
  }

Wow, it’s a little hard to follow even boiled down this much. WallSound is initialized to false before the call. So WallSound ends up true only if both the wall is sound and lifting the barracks opens it (so that the barracks is a functional part of the wall and not an appendix to a wall of supply depots). Older versions of LetaBot sometimes got that wrong and walled themselves permanently into their bases.

RecursiveWall does not know which of the buildings in its vector are the same, so if one building fails to place at any location it simply continues on to the next. As far as it knows, a good wall might still be possible with the remaining buildings. An optimization might be to prune the search appropriately when building placements fail. Is it too rare to matter?

Here’s what CanWall does, again in pseudocode. It is given the building’s type and proposed location and returns whether the location is good to build on. Buildings are required to be adjacent to an already placed building. It knows that the barracks is first and lets it be “adjacent to itself.”

CanWall (buildingType, buildingX, buildingY) :
  nextToB = false;       // adjacent to an existing building?
  for (int x, y in the building's footprint from SpaceArray) {
    if (outside the map) return false;
    if (not buildable) return false;
    if (covers Closest or Furthest) return false;
    if (useHighGround and this is not high ground) return false;
    if (buildingType == barracks) nextToB = true;
    if (adjacent to an already placed building) nextToB = true;
  }
  return nextToB;

The code runs the “adjacent to an existing building” loop even for a barracks. For efficiency, it could skip adjacency checks (the inner loop and thus the most expensive part of the code) once nextToB turns true, since it never goes false again. The compiler may be smart enough to lift the “is this building a barracks?” test above the outer loop, since it doesn’t depend on x and y, but it would be tidier to code it that way from the start.

Next: Discussion of walling code.

LetaBot’s wall 3 - low-level utilities

Today, a few low-level parts that RecursiveWall() relies on. As mentioned yesterday, LetaBot plans its wall to fit inside a bounding box around the center of the choke to be walled off. It finds the Closest and Furthest tiles from the original command center that are in the box.

First, the SpaceArray that stores the sizes of buildings. The information can be found under List of Unit and Building Sizes in Liquipedia. The top, bottom, left, and right numbers are pixel gaps that the building leaves along each side. It’s the basic data you need to figure out how tight a wall is. With the gap sizes of two adjacent buildings, you can find the total gap between the buildings and see which units fit through the gap. Here is a sample of the initialization code:

ExtraSpace SpaceArray[10];

	ExtraSpace Barracks;
	Barracks.Top = 8;
	Barracks.Bottom = 15;
	Barracks.Left = 16;
	Barracks.Right = 7;
	Barracks.width = 4;
	Barracks.heigth = 3;

	SpaceArray[2] = Barracks;
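
As a rough illustration of what those side values buy you (the numbers in the comment below are placeholders I made up, except the barracks value, which comes from the snippet above):

// Cut-down stand-in for LetaBot's ExtraSpace: just the pixel margins needed here.
struct Margins { int Left, Right; };

// Two buildings whose footprints touch side by side leave a pixel gap equal to
// the right margin of the left building plus the left margin of the right one.
// E.g. a barracks on the left (Right = 7, from the snippet above) next to a
// hypothetical building with Left = 11 would leave an 18-pixel gap; compare
// that against the pixel width of the unit you want to keep out.
int horizontalGap(const Margins& leftBuilding, const Margins& rightBuilding)
{
	return leftBuilding.Right + rightBuilding.Left;
}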

MaxGap() is the function that uses the SpaceArray to find the gap. You pass in x and y and it looks to see what’s there. If the location is at the edge of a building, it looks for a neighboring building, figures out the relative locations of the buildings, and does a bogglingly complex calculation to find the horizontal and vertical gaps between the buildings, in pixels. I didn’t try to understand the details.

// a 2x2 part of the walkable array
//check for gaps between each of the tiles
int BuildingManager::MaxGap(int x, int y){

And WalledOffAstar() is the function that determines whether a wall has been created inside the bounding box. For some reason you pass in Startx and Starty, which are always at the Closest location. WalledOffAstar() uses the A* algorithm to look for a path between Closest and Furthest, a path which a unit with horizontal size unitH and vertical size unitV can follow. If no such path exists, then there’s a wall across the choke!

bool BuildingManager::WalledOffAstar(int Startx, int Starty, int Ignore, int unitH, int unitW){

When I heard that LetaBot used A* in its wall planning, I imagined that it used A* directly to search for building placements. But now we can see the real strategy. RecursiveWall() tries a building placement and then calls WalledOffAstar() to see whether that placement constitutes a wall. No? OK, try the next building placement.

Next: Finally, RecursiveWall() proper, finding the wall.