
LastOrder and its macro model - technical info

Time to dig into the details! I read the paper and some of the code to find stuff out.

LastOrder’s “macro” decisions are made by a neural network whose data size is close to 8MB—much larger than LastOrder’s code (but much smaller than CherryPi’s model data). There is room for a lot of smarts in that many bytes. The network takes in a summary description of the game state as a vector of feature values, and returns a macro action, what to build or upgrade or whatever next. The code to marshal data to and from the network is in StrategyManager.cpp.

network input

The list of network input features is initialized in the StrategyManager constructor and filled in by StrategyManager::triggerModel(). There are a lot of features. I didn’t dig into the details, but it looks as though some of the features are binary, some are counts, some are X or Y values that together give a position on the map, and a few are other numbers. They fall into these groups:

• State features. Basic information about the map and the opponent, our upgrades and economy, our own and enemy tech buildings.

• Waiting to build features. I’m not sure what these mean, but it’s something to do with production.

• “Our battle basic features” and “enemy battle basic features.” Combat units.

• “Our defend building features” and “enemy defend building features.” Static defense.

• “Killed” and “kill” features. What units of ours or the enemy’s are destroyed.

• A mass of features related to our current attack action and what the enemy has available to defend against it.

• “Our defend info” looks like places we are defending and what the enemy is attacking with.

• “Enemy defend info” looks like it locates the enemy’s static defense relative to the places we are interested in attacking.

• “Visible” gives the locations of the currently visible enemy unit types. I’m not quite sure what this means. A unit type doesn’t have an (x,y) position, and it seems as though LastOrder is making one up. It could be the location of the largest group of each unit type, or the closest unit of each type, or something. Have to read more code.

With this much information available, sophisticated strategies are possible in principle. It’s not clear how much of this the network successfully understands and makes use of. The games I watched did not give the impression of deep understanding, but then again, we have to remember that LastOrder learned to play against 20 specific opponents. Its results against those opponents suggest that it does understand them deeply.
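Mechanically, feature groups like the ones above are typically flattened into one numeric vector before they reach the network. Here is a minimal sketch of that step; the group names and values are made up for illustration (the real feature list lives in StrategyManager.cpp):

```python
# Sketch of flattening grouped game-state features into one input vector.
# All names and numbers here are illustrative, not LastOrder's.

def flatten_features(groups):
    """groups: dict mapping group name -> list of numeric feature values."""
    vector = []
    for name in sorted(groups):              # fixed order keeps indices stable
        vector.extend(float(v) for v in groups[name])
    return vector

example = {
    "state": [1, 0, 3],                      # e.g. map/opponent flags and counts
    "battle_basic": [12, 4],                 # e.g. combat unit counts
    "visible": [64.0, 96.0],                 # e.g. an (x, y) per enemy unit type
}
vec = flatten_features(example)              # one flat vector for the network
```

The point of the fixed ordering is that the network learns by position: feature number 17 must mean the same thing in every game.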

network output

It looks like the network output is a single macro action. Code checks whether the action is valid in the current situation and, if so, calls on the appropriate manager to carry it out. The code is full of I/O details and validation and error handling, so I might have missed something in the clutter. Also the code shows signs of having been modified over time without tying up loose ends. I imagine they experimented actively.

By the way, the 9 pool/10 hatch muta/12 hatch muta opening choices and learning code from Overkill are still there, though Overkill’s opening learning is not used.

learning setup

The learning setup uses Ape-X DQN. The term is as dense as a neutron star! Ape-X is a way to organize deep reinforcement learning; see the paper Distributed Prioritized Experience Replay by Horgan et al. of Google’s DeepMind. In “DQN”, D stands for deep and as far as I’m concerned is a term of hype and means “we’re doing the cool stuff.” Q is for Q-learning, the form of reinforcement learning you use when you know what’s good (winning the game) and you have to figure out from experience a policy (that’s a technical term) to achieve it in a series of steps over time. The policy is in effect a box where you feed in the situation and it tells you what to do in that situation. What’s good is represented by a reward (given as a number) that you may receive long after the actions that earned it; that can make it hard to figure out a good policy, which is why you end up training on a cluster of 1000 machines. Finally, “N” is for the neural network that acts as the box that knows the policy.
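To make the “Q” concrete, here is a minimal tabular Q-learning sketch on a toy chain world. Everything about it—the world, the learning rate, the exploration rate—is illustrative, not LastOrder’s setup; it only shows the update rule that the neural network version approximates:

```python
import random

# Minimal tabular Q-learning sketch of the "Q" in DQN. The tiny chain world
# and all parameters are illustrative, not LastOrder's.

random.seed(0)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q[s_next]) if s_next is not None else 0.0  # terminal: no future
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# States 0..3 in a chain; action 0 stays put, action 1 advances.
# Reaching state 3 pays reward 1 and ends the episode.
Q = [[0.0, 0.0] for _ in range(4)]
for _ in range(500):
    s = 0
    while s < 3:
        if random.random() < 0.2:                  # explore occasionally
            a = random.randrange(2)
        else:                                      # otherwise act greedily
            a = max((0, 1), key=lambda act: Q[s][act])
        s_next = min(s + a, 3)
        r = 1.0 if s_next == 3 else 0.0
        q_update(Q, s, a, r, s_next if s_next < 3 else None)
        s = s_next
```

After training, the table prefers “advance” in every state, even though only the last step pays a reward—the discounted value propagates backward. DQN replaces the table with a network so the same idea scales to inputs like LastOrder’s feature vector.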

In Ape-X, the learning system consists of a set of Actors that (in the case of LastOrder) play Brood War and record the input features and reward for each time step, plus a Learner (the paper suggests that 1 learner is enough, though you could have more) that feeds the data to a reinforcement learning algorithm. The Actors are responsible for exploring, that is, trying out variations from the current best known policy to see if any of them are improvements. The Ape-X paper suggests having different Actors explore differently so you don’t get stuck in a rut. In the case of LastOrder, the Actors play against a range of opponents. The Learner keeps track of which data points are more important to learn and feeds those in more often to speed up learning. If you hit a surprise, meaning the reward is much different than you expected (“I thought I was winning, then a nuke hit”), that’s something important to learn.
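The priority idea can be sketched in a few lines: sample transitions with probability proportional to their surprise. This is a simplified stand-in for what the Ape-X paper describes—the real implementation uses a sum-tree for speed and importance-sampling corrections, neither of which appears here:

```python
import random

# Sketch of prioritized replay: transitions with larger "surprise"
# (absolute TD error) are sampled more often. Simplified stand-in for
# the Ape-X mechanism, which uses a sum-tree and importance weights.

class ReplayBuffer:
    def __init__(self):
        self.items = []        # stored transitions
        self.priorities = []   # |TD error| + epsilon, so nothing has zero chance

    def add(self, transition, td_error, eps=0.01):
        self.items.append(transition)
        self.priorities.append(abs(td_error) + eps)

    def sample(self):
        """Pick one transition with probability proportional to its priority."""
        total = sum(self.priorities)
        r = random.uniform(0, total)
        acc = 0.0
        for item, p in zip(self.items, self.priorities):
            acc += p
            if r <= acc:
                return item
        return self.items[-1]
```

A transition recorded just before an unexpected nuke hit would carry a huge TD error and dominate the sampling until the network learns to expect it.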

LastOrder seems to have closely followed the Ape-X DQN formula from the Ape-X paper. They name the exact same set of techniques, although many other choices are possible. Presumably DeepMind knows what they’re doing.

LastOrder does not train with a reward “I won/I lost.” That’s very little information and appears long after the actions that cause it, and it would leave learning glacially slow. They use reward shaping, which means giving a more informative reward number that offers the learning system more clues about whether it is going in the right direction. They use a reward based on the current game score.
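One common form of score-based shaping is to reward each step with the change in the running game score, so the agent gets a signal immediately instead of at game end. The paper does not give LastOrder’s exact formula, so the scale and toy scores below are invented:

```python
# Sketch of score-based reward shaping: instead of a single win/lose reward
# at the end of the game, each step is rewarded with the change in a running
# game score. The scale and score values are illustrative, not LastOrder's.

def shaped_reward(prev_score, cur_score, scale=0.001):
    """Positive when the score went up since the last step, negative when down."""
    return (cur_score - prev_score) * scale

# A toy trace of game scores over five time steps; the dip could be an army lost.
scores = [0, 200, 250, 150, 600]
rewards = [shaped_reward(a, b) for a, b in zip(scores, scores[1:])]
```

Using the delta rather than the raw score means the per-step rewards telescope: their sum over the game equals the scaled final score, so shaping adds guidance without changing which games look good overall.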

the network itself

Following an idea from the 2015 paper Deep Recurrent Q-Learning for Partially Observable MDPs by Hausknecht and Stone, the LastOrder team layered a Long Short-Term Memory network in front of the DQN. We’ve seen LSTM before in Tscmoo (at least at one point; is it still there?). The point of the LSTM network is to remember what’s going on and more fully represent the game state, because in Brood War there is fog of war. So inputs go through the LSTM to expand the currently observed game state into some encoded approximation of all the game state that has been seen so far, then through the DQN to turn that into an action.
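Conceptually the pipeline is: observation → LSTM memory → Q-values. Here is a hand-rolled single-cell sketch in NumPy with random weights and illustrative sizes; it shows the data flow only, not LastOrder’s actual architecture, which the paper does not specify:

```python
import numpy as np

# Conceptual sketch of "LSTM in front of DQN": one hand-rolled LSTM step
# encodes the current observation into a running memory, and a linear head
# maps the memory to Q-values. Sizes and weights are illustrative only.

rng = np.random.default_rng(0)
n_in, n_hid, n_act = 8, 16, 5      # feature size, memory size, macro actions

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate (input, forget, output, candidate),
# each acting on the concatenation [observation; previous hidden state].
W = {g: rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for g in "ifoc"}
b = {g: np.zeros(n_hid) for g in "ifoc"}
W_q = rng.normal(0, 0.1, (n_act, n_hid))   # the "DQN" head on the memory

def lstm_step(x, h, c):
    z = np.concatenate([x, h])
    i = sigmoid(W["i"] @ z + b["i"])       # how much new info to let in
    f = sigmoid(W["f"] @ z + b["f"])       # how much old memory to keep
    o = sigmoid(W["o"] @ z + b["o"])       # how much memory to expose
    g = np.tanh(W["c"] @ z + b["c"])       # candidate new memory
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(3):                          # a few observed frames
    x = rng.normal(size=n_in)               # this frame's feature vector
    h, c = lstm_step(x, h, c)
q_values = W_q @ h                          # one Q-value per macro action
action = int(np.argmax(q_values))
```

The memory vector h is what makes fog of war tolerable: the Q-values at each step depend on everything observed so far, not just the current frame.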

The LastOrder paper does not go into detail. There is not enough information in it to reproduce their network design. The Actor and Learner code is in the repo. I haven’t read it to see if it tells us everything.

Taken together it’s a little complicated, isn’t it? Not something for one hobbyist to try on their own. I think you need a team and a budget to put together something like this.

LastOrder and its macro model - general info

LastOrder (github) now has a 15-page academic paper out, Macro action selection with deep reinforcement learning in StarCraft by 6 authors including Sijia Xu as lead author. The paper does not go into great detail, but it reveals new information. It also uses a lot of technical terms without explanation, so it may be hard to follow if you don’t have the background. Also see my recent post how LastOrder played for a concrete look at its games.

I want to break my discussion into 2 parts. Today I’ll go over general information, tomorrow I’ll work through technical stuff, the network input and output and training and so on.

The name LastOrder turns out to be an ingenious reference to the character Last Order from the A Certain Magical Index fictional universe, the final clone sister. The machine learning process produces a long string of similar models which go into combat for experimental purposes, and you keep the last one. Good name!

LastOrder divides its work into 2 halves, “macro” handled by the machine learning model and “micro” handled by the rule-based code derived from Overkill. It’s a broad distinction; in Steamhammer’s 4-level abstraction terms, I would say that “macro” more or less covers strategy and operations, while “micro” covers tactics and micro. The macro model has a set of actions to build stuff, research tech, and expand to a new base, and a set of 18 attack actions which call for 3 different types of attack in each of 5 different places plus 3 “add army” actions which apparently assign units to the 3 types of attack. (The paper says 17 though it lists 18. It looks as though the mutalisk add army action is unused, maybe because mutas are added automatically.) There is also an action to do nothing.
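The attack-action arithmetic can be spelled out in a few lines. The names below are my paraphrase of the paper’s action list, not identifiers from the code:

```python
# Sketch of the attack-action space as the paper describes it: 3 attack
# types in each of 5 places, plus 3 "add army" actions that assign units
# to the attack types. Names are paraphrased, not LastOrder's identifiers.

ATTACK_TYPES = ["ground", "mutalisk", "airdrop"]
PLACES = ["enemy_start", "enemy_natural", "other1", "other2", "other3"]

attack_actions = [f"Attack_{t}_{p}" for t in ATTACK_TYPES for p in PLACES]
add_army_actions = [f"AddArmy_{t}" for t in ATTACK_TYPES]
actions = attack_actions + add_army_actions   # 15 + 3 = 18 in total
```

Counting it out gives 18, which matches the paper’s list rather than its stated 17—consistent with the guess that one add army action goes unused.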

The paper includes a table on the last page, results of a test tournament where each of the 28 AIIDE 2017 participants played 303 games against LastOrder. We get to see how LastOrder scored its advertised 83% win rate: #2 PurpleWave and #3 Iron (rankings from AIIDE 2017) won nearly all games, no doubt overwhelming the rule-based part of LastOrder so that the macro model could not help. Next Microwave scored just under 50%, XIMP scored about 32%, and all others performed worse, including #1 ZZZKBot at 1.64% win rate—9 bots scored under 2%. When LastOrder’s micro part is good enough, the macro part is highly effective.

In AIIDE 2018, #13 LastOrder scored 49%, ranking in the top half. The paper has a brief discussion on page 10. LastOrder was rolled by top finishers because the micro part could not keep up with #9 Iron and above (according to me) or #8 McRave and above (according to the authors, who know things I don’t). Learning can’t help if you’re too burned to learn. LastOrder was also put down by terrans Ecgberht and WillyT, whose play styles are not represented in the 2017 training group, which has only 4 terrans (one of which is Iron that LastOrder cannot touch).

In the discussion of future work (a mandatory part of an academic paper; the work is required to be unending), they talk briefly about how to fix the weaknesses that showed in AIIDE 2018. They mention improving the rule-based part and learning unit-level micro to address the too-burned-to-learn problem, and self-play training to address the limitations of the training opponents. Self-play is the right idea in principle, but it’s not easy. You have to play all 3 races and support all the behaviors you might face, and that’s only the starting point before making it work.

I’d like to suggest another simple idea for future work: Train each matchup separately. You lose generalization, but how much do production and attack decisions generalize between matchups? I could be wrong, but I think not much. Instead, a zerg player could train 3 models, ZvT ZvP and ZvZ, each of which takes fewer inputs and is solving an easier problem. A disadvantage is that protoss becomes relatively more difficult if you allow for mind control.

LastOrder has skills that I did not see in the games I watched. There is code for them, at least; whether it can carry out the skills successfully is a separate question. It can use hydralisks and lurkers. Most interestingly, it knows how to drop. The macro model includes an action to research drop (UpgradeOverlordLoad), an action to assign units and presumably load up for a drop (AirDropAddArmy), and actions to carry out drops in different places (AttackAirDropStart for the enemy starting base, AttackAirDropNatural, AttackAirDropOther1, AttackAirDropOther2, AttackAirDropOther3). The code to carry out drops is AirdropTactic.cpp; it seems to expect to drop either all zerglings, all hydralisks, or all lurkers, no mixed unit types. Does LastOrder use these skills at all? If anybody can point out a game, I’m interested.

Learning when to make hydras and lurkers should not be too hard. If LastOrder rarely or never uses hydras, it must be because it found another plan more effective—in games you make hydras first and then get the upgrades, so it’s easy to figure out. If it doesn’t use lurkers, maybe they didn’t help, or maybe it didn’t have any hydras around to morph after it tried researching the upgrade, because hydras were seen as useless. But still, it’s only 2 steps, so it should be doable. Learning to drop is not as easy, though. To earn a reward, the agent has to select the upgrade action, the load action, and the drop action in order, each at a time when it makes sense. Doing only part of the sequence sets you back, and so does doing the whole sequence if you leave too much time between the steps, or drop straight into the enemy army, or make any severe mistake. You have to carry through accurately to get the payoff. It should be learnable, but it may take a long time and trainloads of data. I would be tempted to explicitly represent dependencies like this in some way or another, to tell the model up front the required order of events.
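One way to explicitly represent dependencies like this is an action mask that only exposes an action once its prerequisites are done, so the model never wastes exploration on out-of-order sequences. A sketch using the drop sequence named above—the mask logic is my own, not LastOrder’s, and real attacks would of course stay repeatable:

```python
# Sketch of a prerequisite mask for ordered action sequences. The action
# names follow the drop sequence described above; the mask logic is an
# illustration, not anything from LastOrder's code.

PREREQS = {
    "UpgradeOverlordLoad": [],
    "AirDropAddArmy": ["UpgradeOverlordLoad"],
    "AttackAirDropStart": ["UpgradeOverlordLoad", "AirDropAddArmy"],
}

def valid_actions(done):
    """Return the actions whose prerequisites are all in the `done` set."""
    return [a for a, reqs in PREREQS.items()
            if a not in done and all(r in done for r in reqs)]
```

With a mask like this the learner only ever chooses among legal next steps, which shrinks the search space from every ordering of the sequence down to the one correct order.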

AIIDE 2018 - how CherryPi played

Overall, the play of the AIIDE 2018 CherryPi version looks similar to last year’s CherryPi which is still playing on SSCAIT. It still has the devastating ling micro, and it still prefers to win games with a flood of low-level units. It still gets melee +1 attack even when +1 carapace seems better. (Do CherryPi’s micro skills make +1 attack better, and if so, how?) Mutalisk micro looks very similar to Tscmoo’s, with mutas individually cautious and clever and collectively lazy and uncoordinated. It can use lurkers, guardians, and ultralisks. I didn’t see defilers, even when they would have been useful.

CherryPi scouts extremely aggressively with its first 2 overlords. They stick near the enemy base and try to poke into every corner, even if the enemy is terran and can shoot them down early. It gets a clear view, which must be useful for its build order switcher. The drawback is that the overlords often die young.

I think this CherryPi looks beatable. It doesn’t have SAIDA’s wide knowledge of action and reaction. It doesn’t have Steamhammer’s knowledge of how to react to LastOrder’s excessive static defense (but usually wins anyway with a zergling flood). It sometimes ignores undefended enemy bases, preferring to attack into the enemy’s strength—or even to wait idly. Game 31245 versus Iron shows it sticking with gas units and failing at macro; it forgot its love of zerglings. It doesn’t know whether it is ahead or behind, and it doesn’t realize that when it is maxed and owns the map, it ought to attack regardless of losses. It’s strong and tricky, but it also makes mistakes. I think next year’s version had better be improved if they don’t want to be overtaken.

Here are the names of the build orders that CherryPi recorded itself as playing in its opponent learning files. One of CherryPi’s major advertised features is a learned build order switcher that can switch to a new build order on the fly. It recorded 103 build order wins/losses for each opponent (except a couple with fewer), and 103 rounds were played, so these appear to be opening build orders only rather than all build orders tried throughout each game. Presumably the openings reflect CherryPi’s intentions when it started the game. It may not have followed the initial build order to its end.

  • 10hatchling
  • 2hatchmuta
  • 3basepoollings
  • 9poolspeedlingmuta
  • hydracheese
  • zve9poolspeed
  • zvp10hatch
  • zvp3hatchhydra
  • zvp6hatchhydra
  • zvpohydras
  • zvpomutas
  • zvt2baseguardian
  • zvt2baseultra
  • zvt3hatchlurker
  • zvtmacro
  • zvz12poolhydras
  • zvz9gas10pool
  • zvz9poolspeed
  • zvzoverpool

CherryPi tried between 1 and 4 openings against each opponent. CherryPi sometimes switched away from its initial try even if it won all games (for example, against CDBot and Hellbot), so I’m not sure what the switching criterion is. But opponents that it tried 4 openings against are all ones that gave it a touch of trouble.

grep -c key *.json

AILien.json:1
Aiur.json:1
Arrakhammer.json:1
BlueBlueSky.json:3
CDBot.json:2
CSE.json:4
CUNYBot.json:1
DaQin.json:1
Ecgberht.json:1
Hellbot.json:2
ISAMind.json:2
Iron.json:1
KillAll.json:2
LastOrder.json:3
LetaBot.json:4
Locutus.json:4
McRave.json:4
MetaBot.json:1
Microwave.json:3
SAIDA.json:4
Steamhammer.json:4
Tyr.json:1
UAlbertaBot.json:2
WillyT.json:1
Ximp.json:2
ZZZKBot.json:4

The other machine learning feature advertised for CherryPi is a building placer. It was trained against human building placements and apparently takes into account some of the bot’s intentions. I recommend against training on human play (or at least exclusively on human play), because machines play differently. Teaching a bot to blindly imitate human decisions that it doesn’t understand will lead to mistakes. It’s worse than teaching a human to imitate without understanding, because the bot won’t figure things out on its own. Nevertheless, CherryPi’s building placement does seem cleaner than other bots’.

To me the building placement looks simple and logical, but not sophisticated like a strong human player’s. Here’s an example from a ZvZ game, game 1755. The sunken colony does not interfere with gas mining, and it is somewhat protected from zergling surrounds by the geyser, the spawning pool, and the lair itself, while remaining open for drone drills on the drone side. The spire is curiously far away; I would have fit it into the gap next to the sunken. It looks OK but a little loose, not quite optimized. (By the way, game 14742 against the same opponent has the same building layout, except that the spire is placed close.)

ZvZ building placement in game 1755

CherryPi has gained new tactical tricks. I mentioned the burrow trick where it burrows zerglings at expansion locations. So far, I haven’t seen a game where the opponent was ready for the trick; I imagine it contributed to a lot of wins, even though CherryPi sometimes researches burrow and then never uses it. (And I’m disappointed. I thought of using this trick in Steamhammer, and didn’t because I expected that bots which knew how to clear spider mines would also know how to clear burrowed zerglings. I think I was wrong!) As far as I’ve seen, CherryPi doesn’t use burrow for any other purpose (though I wouldn’t be surprised if it did, since there are so many possible uses). CherryPi also does zergling runbys; an example is game 1406 versus SAIDA where CherryPi played an unusual and not entirely efficient gas-first 3 hatch zergling build.

CherryPi doesn’t have as many complex skills as SAIDA, but it has a good number. I doubt I saw everything it can do.

Steamhammer 2.1.1 status

Posts take time, but I am also making progress on Steamhammer. The next version will have no big features, so I’ll call it version 2.1.1. Big stuff has to wait until next year, after the tournament season. I found a second serious bug in defiler control, and now defilers consistently move to where they are wanted. It helps them swarm and plague more actively, though they still don’t do as much work as I would like. So far, 2.1.1 development version keeps track of burrowed units accurately, recognizes enemy proxies better, wards off the enemy scout worker more reliably, and has some improved macro decisions and emergency reactions, a new opening (of course), and a variety of other fixes.

There are a bunch of debilitating bugs in squad control and micro. For upcoming work, I have my eyes on 2 bugs in particular that I think cause the most frequent setbacks (rather than the most glaring blunders), the suicide pokes and the stuck units. If I get those fixed in time, I have a priority list of more stuff. If I succeed, Steamhammer will play better in nearly every game, which should show in the SSCAIT round robin phase.

note on CherryPi

I’ve been watching CherryPi’s AIIDE games. No conclusions yet, but I noticed that CherryPi likes to research burrow (not the first bot to do so) and to burrow scouting lings at expansions to watch for the enemy (I think it’s the first to do that). SAIDA appeared unready for the trick. When an SCV showed up, CherryPi did not unburrow the ling, but sent another to prevent the expansion.