
offline and online learning

Offline learning means learning at home; you do it before the tournament. Online learning means learning from games as you play them; you do it during the tournament. For good play, you want to learn as much as you can at home—humans are the same. But some things come up only during live play and can’t be prepared ahead of time. If you face an unfamiliar opponent or an unfamiliar map and need to adapt, you have to adapt on the spot.

Reinforcement learning algorithms are online algorithms in the technical sense: they update from one data point at a time, which means that you can use them either online or offline. Overkill tried to do both, and it looks as though the offline learning was successful but the online learning could not get enough data to bite. I love this part: Batch learning algorithms, which are considered offline algorithms, are more appropriate for the small amounts of data we can get in online learning. The names turned out to be reversed!

deep learning and memory

Here’s a thought experiment. Suppose you are DeepMind, and you know from your name that you’re going to do deep learning and no other kind. You decide that you want to learn opponent models, which has to be done online in real time. Deep learning is powerful, but it is also data-hungry to the max. It can’t work online. What do you do?

There are ways.

The simplest way is to make the past part of the input. Your deep learning network can’t remember things from game to game, but you can remember for it and let it know. For opponent modeling, you remember details about the opponent’s past behavior and present them as input to the network. You can collect a lot of details; deep learning is able to cope with large amounts of undigested input data. If you train your network thoroughly at home, it will learn how to adapt to different kinds of opponent.
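
To make that concrete, here is a tiny sketch of what “remembering for the network” might look like in code. Everything in it (the struct, the field names, the normalization constants) is made up for illustration; it is not DeepMind’s or anybody’s actual code.

	// Hypothetical sketch of "make the past part of the input": summarize the
	// opponent's past behavior and append it to the network's input vector.
	// The struct and field names are illustrative, not from any real bot.
	#include <vector>

	struct OpponentHistory
	{
		int    gamesPlayed = 0;
		double winRateVsUs = 0.0;        // fraction of past games we won
		double avgFirstAttackFrame = 0.0;
		double airUnitFrequency = 0.0;   // how often the opponent went air
	};

	void appendHistoryFeatures(std::vector<double>& networkInput, const OpponentHistory& h)
	{
		networkInput.push_back(h.gamesPlayed / 100.0);            // crude normalization
		networkInput.push_back(h.winRateVsUs);
		networkInput.push_back(h.avgFirstAttackFrame / 20000.0);
		networkInput.push_back(h.airUnitFrequency);
	}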

Another way is to give your deep learning network a memory under its own control—let it decide what information to store for next time. “What should I remember?” becomes part of the network output alongside “what should I do?” In the next game against the same opponent, you feed in “what should I remember?” from the last output as part of the current network input. In training, the deep learning model learns what it should best remember. Similar experiments have been done (not in Starcraft), so the idea is known to work. Figuring out what to remember is trivial next to playing go!

More generally, if you have a powerful learning method, then it can learn more than how to play on its own. It can learn how to play plus how to operate whatever kind of analysis machine will help it play better. A memory is only one example of an analysis machine that deep learning could learn to operate. It could also operate a build order planner or a tactics planner or whatever, request exactly the plans best for the situation as a whole, and take those plans as part of its input.

Summary: 1. Offline learning is usually preferable. 2. Online learning can be done offline. It’s at least good to know you have a choice!

Overkill - wrapup discussion

As usual, I have a mass of final comments.

unused features

After a few days to mull it over, I think the unused features are left over from experiments. Sijia Xu had to try stuff out to see what would work, and not everything was worth including. More features mean that potentially more knowledge can be learned, but also that learning is slower. Leaving unused features in the code makes it easier to switch them in and out, even if you end up computing and saving features that never make a difference. I take it as a sign of the academic attitude.

choice of actions

Overkill uses its Q-learning to choose among only 3 possible actions: which combat unit to build. It’s an important decision, but only one decision in a sky full of them.

Bots could use similar methods to learn more complex decisions, like tactical decisions: How to group units into squads and what orders to give the squads. Or complex build order choices. Or unit micro decisions. In those cases you can’t list the decisions ahead of time.

Overkill independently learns the Q values for each action. In effect, it learns 3 separate evaluations, Q(situation, zergling action), Q(situation, hydra action) and Q(situation, muta action). If zerglings and hydras are similar in some ways because they’re both ground units, or hydras and mutas are similar because they’re both ranged, then Overkill’s learning can’t take advantage of that; it can’t generalize across actions.

With many actions which are unknown in advance, you need a model which generalizes over the actions. In effect, you want to fold the actions (or their predicted results) into the situation. You can add your potential decisions as primitive features of the situation, treating “these guys are going over there to attack” the same way you treat “the enemy has 3 bases”. Or you can create a separate model of the effects your decisions have on the game, and use that model’s results as the situation.

Something along these lines is what DeepMind wants to do. They’ll use a neural network with deep learning for their model, but the idea is closely related to what Overkill does. From the AI point of view, it’s a fancier version of the same underlying idea, with a closely related top-level learning algorithm.

the linear model

Overkill’s model is a linear function, in the math sense. It sums up a lot of weights. But because of how the weighted features are made, the evaluation of actions is not linear overall with respect to the primitive game features: Overkill’s features themselves are nonlinear, in 2 ways.

First, they are not primitive features, they are combination features. To simplify, it’s not “I have zerglings, that’s worth weight w[z], you have goliaths, that’s worth w[g], result w[z] + w[g].” The 2 features are (I’m still simplifying) combined into 4: “I have no zerglings, you have no goliaths, weight w[-z, -g]; I have zerglings, you have no goliaths, w[+z, -g], ....”

Second, if the underlying primitive feature like “I have n zerglings” takes a range of values, then Overkill breaks it into a set of binary features, “I have 0 to 6 zerglings, w[0 to 6]; I have 7 to 12 zerglings....” If the value of having n zerglings (with respect to taking a given action) does not increase linearly with n, then the weights will be different to reflect that. (So after combination you get a lot of weights, w[0..6 z, 0..6 g], w[7..12 z, 0..6 g], ....)
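
If the prose is hard to follow, here is a rough sketch of the two constructions. The bucket edges and function names are mine, chosen to resemble the feature names we’ll see later; Overkill’s actual code is organized differently.

	// Illustrative sketch of the two feature constructions described above.
	// Bucket edges and names are mine; Overkill's real code differs.
	#include <map>
	#include <string>
	#include <vector>

	// Break a count like "I have n zerglings" into a coarse binary bucket,
	// so the model can give each range of values its own weight.
	std::string bucketFeature(const std::string& name, int count)
	{
		static const std::vector<int> edges = { 6, 12, 18, 24, 36, 48, 60 };
		for (int e : edges)
			if (count <= e)
				return name + "_" + std::to_string(e);
		return name + "_max";
	}

	// Cross two bucketed features into one combined binary feature,
	// e.g. "enemyP_cannon_1*ourHydra_6".
	void addCombinedFeature(std::map<std::string, int>& features,
	                        const std::string& a, const std::string& b)
	{
		features[a + "*" + b] = 1;   // a present binary feature has value 1
	}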

I should pause to explain terminology. A “primitive” feature is one that comes directly from the game, like “I have n zerglings.” Other features can be called “derived” features. Overkill’s model is linear with respect to Overkill’s derived features, but not linear with respect to the primitive game features. That’s what I mean by “nonlinear features.”

Starcraft strategy is a highly nonlinear domain (with respect to the primitive features). But if you have nonlinear features, and you have a large enough number of them, then you can re-encode the nonlinear domain as a linear model—not exactly, but as accurately as you like.

Is the linear model with nonlinear binary features a good model? For many games, it is proven good by experience. For Starcraft, I don’t know. Can you find the features you need to play well? Are there few enough that you can learn them in a reasonable time? Sijia Xu has made a start on finding the answer.

offline versus online learning

Overkill’s Q-learning seems to make sense for offline learning, “how to play Starcraft.” It seems to be too slow for online learning of opponent models, “how to beat this opponent.” There are ways to do both at once.

One idea is to use different models for the 2 cases: Learn one big offline model with a huge number of features, to play well on average against unknown opponents. And also learn a small model with few features for each opponent. The small model won’t be able to learn as much knowledge, but it will converge faster. The small opponent model tells you, “here is how this opponent differs from the average opponent.” To make decisions during play, add together the results of the models.
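
Here is a sketch of what “add together the results of the models” could look like at decision time. The scoring functions are stand-ins, not anyone’s real code.

	// Hypothetical sketch of the two-model idea: score each action with the big
	// offline model plus the small per-opponent correction, and take the best.
	#include <functional>
	#include <string>
	#include <vector>

	std::string chooseAction(const std::vector<std::string>& actions,
	                         const std::function<double(const std::string&)>& offlineQ,
	                         const std::function<double(const std::string&)>& opponentQ)
	{
		std::string best;
		double bestQ = -1e18;
		for (const std::string& a : actions)
		{
			// offline model: how good on average; opponent model: how this opponent differs
			double q = offlineQ(a) + opponentQ(a);
			if (q > bestQ) { bestQ = q; best = a; }
		}
		return best;
	}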

Another idea is to use a hierarchical model. I don’t want to go into detail, but the basic idea is have high-level features with low-level features under them. The high-level features are few and can be learned quickly, so that the model’s learning starts to be useful quickly. The low-level features are many and take more time, so that the model can eventually learn a lot of knowledge if you keep feeding it data.

To learn an opponent model empirically, you have to explore. It’s mathematically required! And exploring an unfamiliar action will sometimes be bad play. We already have proof from opening strategy learning that opponent modeling can be worth it overall; good decisions later can make up for bad play caused by exploring early. But how far can we push it? We can learn to choose from among a few opening strategies and it helps, but can we make a sophisticated opponent model or is the cost of exploring too high? Humans can learn powerful opponent models, but humans don’t rely purely on the statistics—we also reason about our models in a way that is beyond the state of the art in AI.

judgment

As Martin Rooijackers put it, humans have better judgment than bots—so far. And how can we code judgment? There are only 2 fundamental techniques, knowledge and search. To match human judgment we’ll need some combination of knowledge and search.

In a realtime domain like Starcraft, I believe it makes sense to go for knowledge first. And if you go for knowledge first, then I believe hand-coding will not get you there. You’ll fail to hand-code enough knowledge for good judgment. Machine learning is the path, and Sijia Xu has taken a step down it ahead of everyone else.

Next: More on offline versus online learning.

Overkill’s new learning 6 - how it fits with the rest of the program

The learning is done by StrategyManager::strategyChange(). During the game, the production manager is in one of two states, depending on whether Overkill is still in its opening build order.

	enum ProductionState {fixBuildingOrder, goalOriented};

	ProductionState				productionState;

When productionState is goalOriented, ProductionManager::update() calls onGoalProduction() to check whether the goal (zerglings, hydralisks, mutalisks) should be changed. The key parts:

	bool isAllUpgrade = true;
	for (int i = 0; i < int(queue.size()); i++)
	{
		//if remain unit type is just worker/upgrade/tech, check the strategy
		if (... a long condition here ...)
			isAllUpgrade = false;
	}
	if (isAllUpgrade)
	{
		if (BWAPI::Broodwar->getFrameCount() > nextStrategyCheckTime)
		{
			std::string action = StrategyManager::Instance().strategyChange(0);
			BWAPI::Broodwar->printf("change strategy to %s", action.c_str());
			setBuildOrder(StrategyManager::Instance().getStrategyBuildingOrder(action), false);

			nextStrategyCheckTime = BWAPI::Broodwar->getFrameCount() + 24 * 10;
			curStrategyAction = action;
		}
		...
	}

In other words, if the production queue is empty of combat units, and at least 10 seconds have passed since the last strategyChange(), then call strategyChange() again (with reward 0 because we’re in the middle of the game). Overkill changes its choice at most once every 10 seconds.

At the end of the game, Overkill calls strategyChange() one last time, giving a reward of 100 for a win and -100 for a loss. From Overkill::onEnd():

	if (frameElapse >= 86000 && BWAPI::Broodwar->self()->supplyUsed() >= 150 * 2)
	{
		win = true;
	}
	else
	{
		win = isWinner;
	}

	int reward = win == true ? 100 : -100;
	
	StrategyManager::Instance().strategyChange(reward);

There’s something curious here. isWinner is the flag passed into onEnd(). If Overkill makes it to the late game, it sets win = true no matter what; otherwise it believes the isWinner flag. For purposes of its learning algorithm, Overkill tells itself that making it to the late game counts as a win in itself! Isn’t that a pessimistic way to think?

Every call to strategyChange() produces 1 data point for the learning algorithm. The reward comes in only at the end of the game, but Overkill needs the other data points to fill in the Q values for states during the game. The model has a lot of values to learn, so the more data points the better.

exploration

When Overkill is in Release mode, it explores 10% of the time. The code is in StrategyManager::strategyChange(). So, 1 time in 10 when Overkill makes a decision, it’s a random exploratory decision instead of the best decision it knows.

On the one hand, if Overkill wants to learn an opponent model, it has to explore. On the other hand, making that many random decisions can’t be good for its play! Since the AIIDE 2016 tournament was not long enough for the opponent model to become accurate, Overkill might have done better to skip the opponent modeling.

Next: Other changes to Overkill.

Overkill’s new learning 5 - the full list of features

I changed my mind about what to cover today. Here is an overview of the features in Overkill’s model. I decided it was interesting to see what they are.

Yesterday we saw that features are kept in a map of maps, so names have two levels. The variable featureNames declares the top-level groups of feature names and some of the lower-level names. It looks like this (leaving out most of it):

	featureNames = decltype(featureNames) {
		//state feature
		{"state_general_feature", { 
			{ "time_", { "6", "12", "20", "30", "max" } },
			{ "enemyRace_", { "z", "t", "p", "unknow" } },
			{ "ourSupplyUsed_", { "9", "18", "30", "50", "80", "120", "150", "max" } },
		...
		{ "state_raw_combine_feature", {} },
		{ "action_battle_combine_state_battle_feature", {} }
	};

Features that originally take on a range of values have to be made into binary features. Overkill does it by breaking the range into coarse pieces. time_ looks to be game time in minutes. state_raw_combine_feature and action_battle_combine_state_battle_feature have their lower-level names filled in by code rather than declared directly. Those last two are the majority of features.

Here are top-level names and what it looks like they cover. Don’t trust my interpretations too much. Not all features in the code end up in the I/O files. I wrote down the number of features that the code includes, but apparently some of the features are never present in actual games.

state_general_feature (53 features): Game time, map name, distance between bases, enemy race, etc.
state_tech_feature (7 features): Overkill’s tech level.
state_building_feature (30 features): Overkill’s tech buildings, the enemy’s tech and production buildings.
state_economy_feature (46 features): Minerals and gas for Overkill, expansions and workers for both sides.
state_our_army (40 features): Counts of Overkill’s static defense and units of each kind.
state_battle_feature (168 features): Counts of enemy units of each kind.
state_raw_combine_feature (6723 features): state_our_army crossed with state_battle_feature, that is, every combination of our units and the enemy’s, plus 3 extra features.
action_battle_combine_state_battle_feature (6754 features): Copies of state_raw_combine_feature and state_building_feature and the one state_tech_feature feature ourKeyUpgrade_zerglingsAttackSpeed.

We know from yesterday that the features in action_battle_combine_state_battle_feature are exactly the ones that matter in calculating the Q value—they are the ones which get the action appended to their names. The others are only along for the ride; it’s an implementation quirk that they are kept in the same data structure. A number of features seem to be declared, evaluated, learned and remembered, but never effectively used.

So if I counted right (which I question), then there are 6754 binary features in total, though it would be strange for all of them to be useful against any one opponent.

Next: How it fits into the rest of the program, for real.

Overkill’s new learning 4 - the model and its features

What does it mean to have a linear model with binary features? “Linear” means that each feature comes with a number, its weight, so that with binary features you find Q(s,a) by adding up the weights for each feature that is present. Usually only a small proportion of all the features are present, so it’s not as crazy as it may sound.
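
In code, the whole idea fits in a few lines. This is a simplified sketch with a flat feature set; Overkill’s real implementation, which we’ll see below, keeps its features in nested maps of strings.

	// Minimal sketch of "sum the weights of the features that are present".
	// The flat layout is a simplification of Overkill's nested maps.
	#include <set>
	#include <string>
	#include <unordered_map>

	double linearQ(const std::unordered_map<std::string, double>& weights,
	               const std::set<std::string>& presentFeatures)
	{
		double q = 0.0;
		for (const std::string& f : presentFeatures)
		{
			auto it = weights.find(f);
			if (it != weights.end())
				q += it->second;   // binary feature: value 1 times its weight
		}
		return q;
	}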

Overkill gives its features long multi-part names, which it implements throughout as strings accessed via maps. (I was surprised to see that in a real-time program, but it’s probably easier.) The feature names are written out plainly in the I/O files. Here are a few scattered samples from the file feature_valueAiur, which lists 9638 features altogether:

action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*hydraBuild:0.13396
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*mutaBuild:0.07588
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_robotics_facility*zerglingBuild:0.06963
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*hydraBuild:0.05439
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*mutaBuild:0.10049
action_battle_combine_state_battle_feature:enemyKeyBuilding_hasP_stargate*zerglingBuild:0.26210

state_raw_combine_feature:enemyP_cannon_1*ourHydra_6:-0.21410
state_raw_combine_feature:enemyP_cannon_1*ourHydra_12:-0.43786
state_raw_combine_feature:enemyP_cannon_1*ourHydra_18:-0.08806
state_raw_combine_feature:enemyP_cannon_1*ourHydra_24:0.24174
state_raw_combine_feature:enemyP_cannon_1*ourHydra_36:0.42465
state_raw_combine_feature:enemyP_cannon_1*ourHydra_48:0.39939
state_raw_combine_feature:enemyP_cannon_1*ourHydra_60:0.52629
state_raw_combine_feature:enemyP_cannon_1*ourHydra_max:0.59403

state_tech_feature:ourKeyUpgrade_zerglingsAttackSpeed:2.33542
state_tech_feature:ourTechLevel_hatchery:2.28803
state_tech_feature:ourTechLevel_lair:0.25170
state_tech_feature:ourTechLevel_hive:1.48611

You can guess what the feature names mean: Enemy has 1 cannon and we have up to 6 hydralisks, for example. That’s how it got so many features!

Each opponent’s file seems to list a different number of features, probably leaving out features that never came up, so 9638 is not the total number of features. But there’s something here I don’t understand. 9638 is not divisible by 3. Each line gives one weight—shouldn’t there be 3 weights for each state, so that the 3 actions can all be evaluated?

Here’s the routine that calculates Q(s,a). Its arguments are reversed—it puts the action before the state.

double StrategyManager::calActionFeature(std::string curAction, std::map<std::string, std::map<std::string, int>>& features)
{
	// Cross each state_raw_combine_feature and state_building_feature with the
	// action name, e.g. "enemyKeyBuilding_hasP_stargate*zerglingBuild".
	for (auto categoryStateFeature : features)
	{
		if (categoryStateFeature.first == "state_raw_combine_feature" || categoryStateFeature.first == "state_building_feature")
		{
			for (auto stateFeature : categoryStateFeature.second)
			{
				std::string combineFeatureName = stateFeature.first + "*" + curAction;
				features["action_battle_combine_state_battle_feature"][combineFeatureName] = 1;
			}
		}
	}

	// One tech feature, zergling attack speed, is also crossed with the action.
	if (features["state_tech_feature"].find("ourKeyUpgrade_zerglingsAttackSpeed") != features["state_tech_feature"].end())
	{
		std::string combineFeatureName = std::string("ourKeyUpgrade_zerglingsAttackSpeed") + "*" + curAction;
		features["action_battle_combine_state_battle_feature"][combineFeatureName] = 1;
	}

	// Linear model: sum the weights of every feature that is present and that
	// has an entry in parameterValue.
	double curQValue = 0;
	for (auto categoryFeature : features)
	{
		for (auto curfeature : categoryFeature.second)
		{
			int curfeatureValue = curfeature.second;
			if (parameterValue.find(categoryFeature.first) != parameterValue.end() && parameterValue[categoryFeature.first].find(curfeature.first) != parameterValue[categoryFeature.first].end())
			{
				double curParameterValue = parameterValue[categoryFeature.first][curfeature.first];
				curQValue += curParameterValue * curfeatureValue;
			}
		}
	}
	return curQValue;
}

parameterValue holds the model. curAction is the action and the features map with its nested type is the state. Having read this, I still don’t understand. The action name is coded into some feature names and not others, which we see above as + curAction. The list of actions:

	stateActions = {"zerglingBuild", "hydraBuild", "mutaBuild"};

Here’s the call, the bit of code which chooses the action with the highest Q value. (Below this is another bit where it changes the action if it feels like exploring.)

		for (auto action : stateActions)
		{
			std::map<std::string, std::map<std::string, int>> actionFeatureValue = featureValue;
			double curQValue = calActionFeature(action, actionFeatureValue);

			if (curQValue > maxQValue)
			{
				maxQValue = curQValue;
				maxAction = action;
				maxFeatureValue = actionFeatureValue;
			}
		}

The call does nothing to differentiate actions. As far as I can tell, only the features which include the action in their names can be used to tell actions apart, and the other features are irrelevant constants that happen to be added in.

$ grep hydraBuild feature_valueAiur | wc -l
    2176
$ grep mutaBuild feature_valueAiur | wc -l
    2267
$ grep zerglingBuild feature_valueAiur | wc -l
    2403

So 2176+2267+2403 = 6846 features out of 9638 encode the build name in the I/O file for AIUR. As far as I can tell, the other 2792 features are irrelevant. And those 2792 features include some that look important. Surely you want to pay attention to what upgrades you have when you choose which units to make!

The number of features is different for each action. That means two things. 1. The fact that the total number of features is not divisible by 3 is meaningless. 2. Not all actions have been explored in the different states. As expected, the games played against AIUR were not enough to fill in the model.

Either I’ve misunderstood something, or Overkill’s learning has flaws (I wouldn’t go so far as to say bugs; it’s a loss of effectiveness, not an error). Can anybody correct me? I’ll contact Sijia Xu.

Next: How it fits into the rest of the program.

Overkill’s new learning 2 - Q-learning

This post is for people who want an easy introduction to an important AI technique. Q-learning is 100% worth knowing if you care about AI. It’s easy to understand, it’s a fundamental technique, it’s used in high-performance learning systems (DeepMind used Q-learning in their famous demo of playing Atari video games), and it remains the subject of serious research.

Q-learning is a form of reinforcement learning. If you want to understand reinforcement learning, I like the book Reinforcement Learning: An Introduction by Sutton and Barto (it’s all online at the link). You do have to feel comfy with the math, but it’s nothing difficult.

The problem. Q-learning solves the control problem, which means: You’re in a situation. You have a set of actions available. How do you learn from experience what action to take? In computer speak, being in a situation means you’re in some state from a set of states. At time t you’re in state[t]. Overkill (we saw last time) represents its state as a collection of about 4000 binary features. You have a set of actions to choose from, and you have to choose one, so we can say that at time t you choose action[t]. In principle the actions might be different in each state, but Overkill always has the same three actions (hatch zerglings, hatch hydralisks, hatch mutalisks), so we can ignore that.

The framework. To decide what to do, you have to figure out how good the different actions are. Q-learning is direct. It says: Let’s call that Q, the utility of each action in each state: Q [state[i], action[j]]. And every time we find something out, let’s update Q to more closely match the new information. The idea of updating bit by bit is what makes it reinforcement learning.

I had a reason to write Q[] as if it were an array lookup. If you do implement it as a table lookup (so that you’re using tabular learning), then (given certain conditions) Q-learning is mathematically guaranteed to converge to the optimal policy, where “policy” is the technical term for what you do when. But usually you can’t implement it that way. Overkill has 2^4000 states times 3 actions and can’t store an array that big, much less fill in all its values. Instead you plug in a model that stores less information and generalizes over states which are similar: “enemy has 20 goliaths, make hydras” should generalize to “enemy has 30 goliaths, make hydras”.

Here is a subtle point: To find out how good different actions are, you have to try them. You have to explore the space of states and actions. The current Q value tells you how good you think each action is, but only if you follow it up with other good actions. A good action you take now can be wiped out by a bad action you take later. And an exploratory action by definition may be bad. “Enemy has 20 goliaths, I know making hydras is good, but this time let’s explore the option of making zerglings.” You’re doing one thing, which includes exploring, and learning something else, the best policy which doesn’t include exploring. Since you’re not following the policy that you’re learning, Q-learning is called an off-policy learning algorithm. It’s not quite obvious how to do that, is it!

The solution. It may not be obvious, but the solution is simple. You believe your Q values, and update them based on your belief. I’ll write a version as pseudocode—slightly different choices are possible at pretty much every point.

You need two numbers, a learning rate alpha between 0 and 1 that says how much you update Q, and a discount factor gamma between 0 and 1 that says how far into the future to trust your Q values. It makes sense to decrease the learning rate as you go on, but in practice it’s common to set a fixed value. Normally alpha should be not too far from 0.

You also need to plug in a function approximation model that can update Q(s, a) by a given amount. Neural nets work. Many other models also work. Overkill uses a linear fit to its large number of features.

initialize Q (note 1)
for each game {
  for each game state s where you make a decision {
    r <- 1 if the game was just won (note 2)
      or 0 if the game was lost or you’re still playing
    a <- choose an action (note 3)
    update model Q(s, a) += alpha * (r + gamma * nextQ - Q(s, a)) (note 4)
  }
}

1. You can initialize Q any way you want. Constant values or random values work. If you have an idea what values may be good, you can try to give the algorithm a head start.

2. r stands for “reward”, the standard terminology in reinforcement learning. The values are arbitrary. You could use -1 for losing, as long as the value is 0 except at the end of the game. That’s because in Starcraft, the only reward is at the end of the game—in another domain you might get rewards along the way.

3. How do you choose an action? You want some combination of exploring and following the policy. If you always explore you won’t learn much, because you’ll mostly explore stupid actions. If you always follow the policy you won’t learn much because you’ll do the same things over and over. Do both and you get traction! One combination is epsilon-greedy, where you explore at random epsilon of the time (say, 10% of the time) and otherwise follow the policy. There are other combinations.

4. Updating is the key step. You tell your model: For this state and action, adjust Q by this increment and generalize as you will. What is nextQ? It is the best Q value in the following state. You chose an action in state s[t] and ended up in a new state s[t+1]. The best available estimate of Q for your action is: The Q for the best action available in the next state. Using the Q values for the next state is what makes off-policy learning work. In math, nextQ = max_a Q(s[t+1], a). Or in pseudocode:

nextQ = -infinity
for x (actions) {
  nextQ = max (nextQ, Q(s[t+1], x))
}

The version I wrote is the most basic version for people who’re learning it for the first time. There are a ton of methods for speeding it up and making it more accurate. Overkill uses an up-to-date selection of some of those methods, so its version is more complicated.
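
For readers who would rather see something that compiles, here is a bare-bones tabular version of the same update, with the epsilon-greedy choice from note 3 thrown in. It is a toy sketch, not Overkill’s code; Overkill swaps the table for its linear model and adds refinements on top.

	// Bare-bones tabular Q-learning sketch. The table maps (state, action) to Q.
	#include <algorithm>
	#include <cstdlib>
	#include <map>
	#include <utility>
	#include <vector>

	using State  = int;
	using Action = int;

	std::map<std::pair<State, Action>, double> Q;   // implicitly initialized to 0 (note 1)

	double maxQ(State s, const std::vector<Action>& actions)
	{
		double best = -1e18;
		for (Action a : actions)
			best = std::max(best, Q[{ s, a }]);
		return best;                                 // this is nextQ in the pseudocode
	}

	// One update step (note 4). At the terminal state there is no next state,
	// so pass gamma = 0 there (or skip the nextQ term).
	void update(State s, Action a, double r, State sNext,
	            const std::vector<Action>& actions,
	            double alpha = 0.1, double gamma = 0.9)
	{
		double nextQ = maxQ(sNext, actions);         // best Q in the next state
		Q[{ s, a }] += alpha * (r + gamma * nextQ - Q[{ s, a }]);
	}

	// Epsilon-greedy choice (note 3): explore at random epsilon of the time.
	Action choose(State s, const std::vector<Action>& actions, double epsilon = 0.1)
	{
		if ((double)std::rand() / RAND_MAX < epsilon)
			return actions[std::rand() % actions.size()];
		Action best = actions[0];
		for (Action a : actions)
			if (Q[{ s, a }] > Q[{ s, best }]) best = a;
		return best;
	}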

Next: Overkill’s model.

generalization for strategy learning

This post is for people who want to do strategy learning better than we have seen so far, but who haven’t married AI and are looking for a few hints on what’s good to try. I assume the simplest case: The bot has a fixed set of strategies and wants to choose one based on experience (but possibly influenced by scouting info). Similar ideas work in more complicated cases, too.

In past posts I looked at the strategy learning done by Overkill and AIUR. Overkill learns (strategy, opponent) and AIUR learns (strategy, opponent, map size). I found out that, on the one hand, AIUR learned more by including the map size, but on the other hand, AIUR learned more slowly and didn’t have time to explore the possibilities thoroughly and find the best. It would be nice to learn (strategy, opponent, opponent race, map, player positions, any other results of early scouting), but how could a bot possibly learn so much?

Overkill and AIUR learn tables of outcomes. Tabular learning is slow learning because it does not generalize. AIUR may win with its cannon rush on a 2-player map against opponent A and opponent B, but when it faces opponent C on a 2-player map it starts with a blank slate. It doesn’t try cannon rush first because that worked against other opponents, it says “well gosh darn I don’t know a thing yet, I’ll pick randomly.” And again, when nexus-first wins against opponent D on 2-player and 3-player maps and AIUR faces opponent D on a 4-player map for the first time, it’s “well gosh darn.”

Tabular learning is, well, it’s the only kind of learning which does not generalize. Tabular learning is a form of rote memorization, and all the countless other learning algorithms try to generalize in one way or another. That doesn’t mean you should learn strategies using any random algorithm you have lying around, though. You can, but it’s best to look for one that suits the problem.

The problem requirements are not too complicated.

1. Our algorithm’s input will be a set of past observations like (strategy, opponent, any other data you want to include, game result). The output will be the strategy to play this game, where you don’t know the game result yet. Or at least the output will have enough information to let you decide on a strategy. Estimated-probability-to-win for each strategy choice is one idea.

2. Some of the inputs, like the opponent, are categorical (as opposed to numerical). We need an algorithm that likes categorical inputs. Some work best with numerical inputs. One way to look at it is: Fitting a curve from opponent A to opponent C doesn’t tell you anything about opponent B, so you don’t want an algorithm that’s always trying that.

3. The algorithm should work well with small to moderate amounts of data. In the first game of the tournament, with no observations made yet, you’ll pick a strategy from prior knowledge (pick randomly, or pick one that did well in testing, or a combination). In the second game, you want to consider your prior knowledge plus 1 data point. The prior knowledge stops some algorithms from saying “we lost the first game, by generalization all strategies always lose.” You want the 1 data point to be important enough to make some difference, and not so important that it immediately overrides prior knowledge. And so on to thousands or tens of thousands of data points if the tournament is that long (it’s hardly likely to be longer); by then, prior knowledge should not make much difference.

4. You also want to consider exploration. If you always play the strategy that looks best (a “greedy algorithm”), then you may be overlooking a strategy that plays better but happened to lose its first game, or that never got tried. You have to explore to learn well.

My suggestions. First, exploration is not hard. Epsilon-greedy (see multi-armed bandit) should always work for exploration. There may be better choices in particular cases, but you have a fallback. You can do better if the algorithm outputs not only an estimated win rate but also its confidence in the estimate: Preferentially explore options which have low confidence.

Second, prior knowledge is not too hard either. You can always encode your prior knowledge as a set of fictional data points, fish story style. Again, there may be better ways, especially if you go with a Bayesian algorithm which by definition includes priors.
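
Here is one way the fish story could look in code: before training, pad the record list with fictional games. The GameRecord struct and its fields are hypothetical.

	// Sketch of "prior knowledge as fictional data points": seed the record
	// list with made-up games that reflect what you believe about a strategy.
	#include <string>
	#include <vector>

	struct GameRecord { std::string opponent, map, strategy; bool won; };

	void addPrior(std::vector<GameRecord>& records, const std::string& strategy,
	              int fakeWins, int fakeLosses)
	{
		for (int i = 0; i < fakeWins; ++i)
			records.push_back({ "prior", "any", strategy, true });
		for (int i = 0; i < fakeLosses; ++i)
			records.push_back({ "prior", "any", strategy, false });
	}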

The requirement to work with varying but mostly modest amounts of data means that batch algorithms that analyze the dataset as a whole are preferred. Incremental algorithms that analyze one data point at a time, like the huge family of reinforcement learning algorithms that includes most neural networks, are by and large less suitable; they have a harder time controlling the level of generalization as the amount of data increases, to learn fast enough without overfitting. It’s not that reinforcement learning won’t work, or even that it can’t be made to work just as well, but without extra knowledge and care you can expect it to be less effective or less efficient. I was surprised to see the new version of Overkill use reinforcement learning for unit production decisions—it may be a good choice, but if so it’s not obvious why.

I suggest boosted decision trees. Decision trees have good generalization properties with small and modest amounts of data, and adding a boosting algorithm increases their accuracy. Since there’s not too much data and strategy learning happens once per game, speed should not be a problem. (If it does get too slow, then discard the oldest data points.) Go look up code to implement it and check the license, you know the drill.

It’s just a suggestion. Other choices may be better.

In a little more detail, at the end of each game the bot records the result with whatever other information it wants to learn from: Opponent, race, map, etc. At the start of each game it reads the records and runs its learning algorithm from scratch (it doesn’t have to or want to remember what it thought it knew last game). You may want to vary this depending on tournament rules about when learning data becomes available.
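
A sketch of the bookkeeping, with made-up file names and a made-up whitespace format; adjust it to however your tournament exposes the read and write folders.

	// Sketch of the record keeping: append one line per game at the end,
	// read them all back at the start of the next game.
	#include <fstream>
	#include <string>
	#include <vector>

	struct GameRecord { std::string opponent, map, strategy; bool won; };

	void appendRecord(const GameRecord& r)
	{
		std::ofstream out("bwapi-data/write/results.txt", std::ios::app);
		out << r.opponent << ' ' << r.map << ' ' << r.strategy << ' ' << r.won << '\n';
	}

	std::vector<GameRecord> readRecords()
	{
		std::vector<GameRecord> records;
		std::ifstream in("bwapi-data/read/results.txt");
		GameRecord r;
		int won = 0;
		// assumes space-free tokens for opponent, map, and strategy names
		while (in >> r.opponent >> r.map >> r.strategy >> won)
		{
			r.won = (won != 0);
			records.push_back(r);
		}
		return records;
	}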

With the learned model in hand, the bot can look at the game situation, run it through to find out what strategies seem best, and combine that with the exploration policy to decide what strategy to play.

What if some inputs are not known yet? Say the opponent is random and your scout didn’t find out the enemy race before it’s time to decide on the initial strategy. If the learning algorithm estimates win rates, here’s one way: Run the game situation through three times, once with each race, and combine the results. There are different ways to combine the results, but averaging works. The same for other information that you don’t know yet; run through each possibility that hasn’t been excluded (“I know they’re not at that base, but then my scout died”). If there’s too much unknown info to test all possibilities against your learned model, then limit it to a statistical sample.
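
A sketch of the run-it-through-three-times-and-average idea. The model’s scoring function is passed in as a stand-in for whatever your learning algorithm actually provides.

	// Sketch of handling an unknown (random) opponent race: run the learned
	// model once per possible race and average the estimated win rates.
	#include <functional>
	#include <string>
	#include <vector>

	double winRateUnknownRace(const std::string& strategy,
	                          const std::function<double(const std::string&, const std::string&)>& estimate)
	{
		const std::vector<std::string> races = { "Zerg", "Terran", "Protoss" };
		double total = 0.0;
		for (const std::string& r : races)
			total += estimate(strategy, r);   // estimate(strategy, race)
		return total / races.size();
	}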

Generalizing across opponents. If you have an opponent model, you can do better. If you’re able to recognize characteristics of your opponents, then you can remember the information in an opponent model and use the models to generalize across opponents. It’s a way of learning counter-strategies alongside counter-this-opponent strategies. I think opponent modeling should make strategy learning more effective. “Oh, opponent X went dark templar and I won with strategy A. Now I’m fighting opponent Y, which has been known to go dark templar too.” Some characteristics worth recording:

  • opponent random?
  • opponent race
  • how rushy/all-in? (consider the earliest attack, or the early economy)
  • when (if ever) did opponent make unit X (for each X)?
  • when did opponent get upgrade Y (for each Y)?
  • when did opponent first use spell Z (for each Z)?
  • or in less detail: when did opponent get air units/detection/etc.?
  • how soon/often did opponent expand?
  • did the opponent scout my whole base?
  • was the opponent seen to take island bases?
  • was the opponent seen to attack island bases?

Or whatever you think might help. Since there’s never a ton of data, the huge number of inputs in the list might be too much.
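
For what it’s worth, the list could boil down to a record something like this; the field names are invented, and you’d keep only what your bot can reliably observe.

	// One possible shape for an opponent model record, following the list above.
	#include <map>
	#include <string>

	struct OpponentModel
	{
		bool        isRandom = false;
		std::string race;                                  // observed race
		int         earliestAttackFrame = -1;              // rushy / all-in indicator
		std::map<std::string, int> firstUnitSeenFrame;     // per unit type
		std::map<std::string, int> firstUpgradeSeenFrame;  // per upgrade
		int         firstExpansionFrame = -1;
		int         expansionCount = 0;
		bool        scoutedOurWholeBase = false;
		bool        tookIslandBases = false;
		bool        attackedIslandBases = false;
	};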

Generalizing across maps can follow the same kind of idea: Number of players on the map, air distance and ground distance between bases, and so on. Adapting your strategy to the map is basic to Starcraft strategy, and bots are weak at it.

means-end analysis

Here’s the secret connection between prioritizing your goals and novelty maps: Once you have explicit goals, another thing you can do is reason about them.

You’re playing on Crystallis. Human players somehow realize, without having to think about it, “I’ll need gas, so I should set workers to mine in the direction of a geyser.” How could a bot figure out the same thing? Imagine that you want to be able to play on any map, even a novelty map with strange features.

The bot should already know “I’ll need gas,” so the question is how it can figure out the way to get it. The bot also needs a model of the game, so that it can understand the effects of its actions and plan a sequence of actions to reach its goal. It needs to know that minerals can block movement and that workers can mine out minerals.

Means-end analysis was introduced into AI sometime around 1960 in the famous General Problem Solver program. It says, given a goal, seek means (aka actions) to reduce differences between the current situation and the goal. One way to put means-end analysis to work is in “plan repair,” which means finding workarounds for problems. It’s part of the AI topic of planning. If you like academic papers, here’s a list: Citeseer on plan repair.

Suppose the bot is terran. It already knows a basic plan to get gas: 1. Send an SCV to a geyser. 2. Build a refinery. 3. Transfer gas from the refinery to a command center with SCVs. If the bot uses its model of the game to see whether the basic plan works, it will find that the first step fails; the SCV can’t get to the geyser. Can the plan be repaired? Working backward from the goal of being at the geyser, the bot might reason: If I can mine out this mineral block next to the geyser, then I can reach it. Can I mine out that mineral block? No, but these other mineral blocks are next to it, can I mine one of them out? If the bot keeps searching backward (using a means-end heuristic to speed things up: the mineral block closest to the command center is the one that minimizes the difference between the current situation and the goal), it will eventually find a path of mineral blocks to mine out to reach the gas. Problem solved.
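
Here is a stripped-down sketch of that backward search as a plain breadth-first search over blocking mineral patches, leaving out the means-end heuristic (closest to the command center first) that would speed it up. The data structures are hypothetical, not BWAPI.

	// Start from the patch next to the geyser and expand outward until we hit a
	// patch a worker can already reach; the parent chain is the mining order.
	#include <queue>
	#include <set>
	#include <vector>

	struct Patch { int id; std::vector<int> neighbors; bool reachableByWorker; };

	std::vector<int> patchesToMine(const std::vector<Patch>& patches, int geyserAdjacentPatch)
	{
		std::vector<int> parent(patches.size(), -1);
		std::set<int> visited { geyserAdjacentPatch };
		std::queue<int> frontier;
		frontier.push(geyserAdjacentPatch);
		while (!frontier.empty())
		{
			int cur = frontier.front(); frontier.pop();
			if (patches[cur].reachableByWorker)
			{
				// walk back toward the geyser; mine from the reachable end inward
				std::vector<int> chain;
				for (int p = cur; p != -1; p = parent[p]) chain.push_back(p);
				return chain;
			}
			for (int nb : patches[cur].neighbors)
				if (visited.insert(nb).second) { parent[nb] = cur; frontier.push(nb); }
		}
		return {};   // no repair found
	}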

The bot might also ask: Can I fly an SCV to the geyser? To fly it by dropship, I would need a dropship, which costs gas; no can do. Or I could float a command center and make an SCV, but there’s no place to land it; no can do. Or the bot might figure out how to push SCVs through the minerals enough times to reach the geyser, which probably is part of best play. If it’s clever enough, it might be able to compute an efficient combination of mining out minerals and pushing SCVs through the last minerals to start getting gas a little earlier.

Plan repair can work to solve many kinds of goals in many kinds of situations. A bot with that skill could play novelty maps at least passably, and could solve planning problems that come up in normal games. Obviously it’s a very complicated skill, and one that bot authors will not be in a hurry to implement. But I think it’s a destination we should aim to reach eventually. And the first step, of course, is to make your bot’s goals explicit.

Next: After the map is mined out.