LastOrder | Starcraft AI blog

LastOrder and its macro model - technical info

Time to dig into the details! I read the paper and some of the code to find stuff out.

LastOrder’s “macro” decisions are made by a neural network whose data size is close to 8MB—much larger than LastOrder’s code (but much smaller than CherryPi’s model data). There is room for a lot of smarts in that many bytes. The network takes in a summary description of the game state as a vector of feature values, and returns a macro action, what to build or upgrade or whatever next. The code to marshal data to and from the network is in StrategyManager.cpp.

network input

The list of network input features is initialized in the StrategyManager constructor and filled in in StrategyManager::triggerModel(). There are a lot of features. I didn’t dig into the details, but it looks as though some of the features are binary, some are counts, some are X or Y values that together give a position on the map, and a few are other numbers. They fall into these groups:

• State features. Basic information about the map and the opponent, our upgrades and economy, our own and enemy tech buildings.

• Waiting to build features. I’m not sure what these mean, but it’s something to do with production.

• “Our battle basic features” and “enemy battle basic features.” Combat units.

• “Our defend building features” and “enemy defend building features.” Static defense.

• “Killed” and “kill” features, what units of ours or the enemy’s are destroyed.

• A mass of features related to our current attack action and what the enemy has available to defend against it.

• “Our defend info” looks like places we are defending and what the enemy is attacking with.

• “Enemy defend info” looks like it locates the enemy’s static defense relative to the places we are interested in attacking.

• “Visible” gives the locations of the currently visible enemy unit types. I’m not quite sure what this means. A unit type doesn’t have an (x,y) position, and it seems as though LastOrder is making one up. It could be the location of the largest group of each unit type, or the closest unit of each type, or something. Have to read more code.

With this much information available, sophisticated strategies are possible in principle. It’s not clear how much of this the network successfully understands and makes use of. The games I watched did not give the impression of deep understanding, but then again, we have to remember that LastOrder learned to play against 20 specific opponents. Its results against those opponents suggest that it does understand them deeply.

network output

It looks like the network output is a single macro action. Code checks whether the action is valid in the current situation and, if so, calls on the appropriate manager to carry it out. The code is full of I/O details and validation and error handling, so I might have missed something in the clutter. Also the code shows signs of having been modified over time without tying up loose ends. I imagine they experimented actively.

By the way, the 9 pool/10 hatch muta/12 hatch muta opening choices and learning code from Overkill are still there, though Overkill’s opening learning is not used.

learning setup

The learning setup uses Ape-X DQN. The term is as dense as a neutron star! Ape-X is a way to organize deep reinforcement learning; see the paper Distributed Prioritized Experience Replay by Horgan et al of Google’s DeepMind. In “DQN”, D stands for deep and as far as I’m concerned is a term of hype and means “we’re doing the cool stuff.” Q is for Q-learning, the form of reinforcement learning you use when you know what’s good (winning the game) and you have to figure out from experience a policy (that’s a technical term) to achieve it in a series of steps over time. The policy is in effect a box where you feed in the situation and it tells you what to do in that situation. What’s good is represented by a reward (given as a number) that you may receive long after the actions that earned it; that can make it hard to figure out a good policy, which is why you end up training on a cluster of 1000 machines. Finally, “N” is for the neural network that acts as the box that knows the policy.

In Ape-X, the learning system consists of a set of Actors that (in the case of LastOrder) play Brood War and record the input features and reward for each time step, plus a Learner (the paper suggests that 1 learner is enough, though you could have more) that feeds the data to a reinforcement learning algorithm. The Actors are responsible for exploring, that is, trying out variations from the current best known policy to see if any of them are improvements. The Ape-X paper suggests having different Actors explore differently so you don’t get stuck in a rut. In the case of LastOrder, the Actors play against a range of opponents. The Learner keeps track of which which data points are more important to learn and feeds those in more often to speed up learning. If you hit a surprise, meaning the reward is much different than you expected (“I thought I was winning, then a nuke hit”), that’s something important to learn.

LastOrder seems to have closely followed the Ape-X DQN formula from the Ape-X paper. They name the exact same set of techniques, although many other choices are possible. Presumably DeepMind knows what they’re doing.

LastOrder does not train with a reward “I won/I lost.” That’s very little information and appears long after the actions that cause it, and it would leave learning glacially slow. They use reward shaping, which means giving a more informative reward number that offers the learning system more clues about whether it is going in the right direction. They use a reward based on the current game score.

the network itself

Following an idea from the 2015 paper Deep Recurrent Q-Learning for Partially Observable MDPs by Hausknecht and Stone, the LastOrder team layered a Long Short-Term Memory network in front of the DQN. We’ve seen LSTM before in Tscmoo (at least at one point; is it still there?). The point of the LSTM network is to remember what’s going on and more fully represent the game state, because in Brood War there is fog of war. So inputs go through the LSTM to expand the currently observed game state into some encoded approximation of all the game state that has been seen so far, then through the DQN to turn that into an action.

The LastOrder paper does not go into detail. There is not enough information in it to reproduce their network design. The Actor and Learner code is in the repo. I haven’t read it to see if it tells us everything.

Taken together it’s a little complicated, isn’t it? Not something for one hobbyist to try on their own. I think you need a team and a budget to put together something like this.

LastOrder and its macro model - general info

LastOrder (github) now has a 15-page academic paper out, Macro action selection with deep reinforcement learning in StarCraft by 6 authors including Sijia Xu as lead author. The paper does not go into great detail, but it reveals new information. It also uses a lot of technical terms without explanation, so it may be hard to follow if you don’t have the background. Also see my recent post how LastOrder played for a concrete look at its games.

I want to break my discussion into 2 parts. Today I’ll go over general information, tomorrow I’ll work through technical stuff, the network input and output and training and so on.

The name LastOrder turns out to be an ingenious reference to the character Last Order from the A Certain Magical Index fictional universe, the final clone sister. The machine learning process produces a long string of similar models which go into combat for experimental purposes, and you keep the last one. Good name!

LastOrder divides its work into 2 halves, “macro” handled by the machine learning model and “micro” handled by the rule-based code derived from Overkill. It’s a broad distinction; in Steamhammer’s 4-level abstraction terms, I would say that “macro” more or less covers strategy and operations, while “micro” covers tactics and micro. The macro model has a set of actions to build stuff, research tech, and expand to a new base, and a set of 18 attack actions which call for 3 different types of attack in each of 5 different places plus 3 “add army” actions which apparently assign units to the 3 types of attack. (The paper says 17 though it lists 18. It looks as though the mutalisk add army action is unused, maybe because mutas are added automatically.) There is also an action to do nothing.

The paper includes a table on the last page, results of a test tournament where each of the 28 AIIDE 2017 participants played 303 games against LastOrder. We get to see how LastOrder scored its advertised 83% win rate: #2 PurpleWave and #3 Iron (rankings from AIIDE 2017) won nearly all games, no doubt overwhelming the rule-based part of LastOrder so that the macro model could not help. Next Microwave scored just under 50%, XIMP scored about 32%, and all others performed worse, including #1 ZZZKBot at 1.64% win rate—9 bots scored under 2%. When LastOrder’s micro part is good enough, the macro part is highly effective.

In AIIDE 2018, #13 LastOrder scored 49%, ranking in the top half. The paper has a brief discussion on page 10. LastOrder was rolled by top finishers because the micro part could not keep up with #9 Iron and above (according to me) or #8 McRave and above (according to the authors, who know things I don’t). Learning can’t help if you’re too burned to learn. LastOrder was also put down by terrans Ecgberht and WillyT, whose play styles are not represented in the 2017 training group, which has only 4 terrans (one of which is Iron that LastOrder cannot touch).

In the discussion of future work (a mandatory part of an academic paper; the work is required to be unending), they talk briefly about how to fix the weaknesses that showed in AIIDE 2018. They mention improving the rule-based part and learning unit-level micro to address the too-burned-to-learn problem, and self-play training to address the limitations of the training opponents. Self-play is the right idea in principle, but it’s not easy. You have to play all 3 races and support all the behaviors you might face, and that’s only the starting point before making it work.

I’d like to suggest another simple idea for future work: Train each matchup separately. You lose generalization, but how much do production and attack decisions generalize between matchups? I could be wrong, but I think not much. Instead, a zerg player could train 3 models, ZvT ZvP and ZvZ, each of which takes fewer inputs and is solving an easier problem. A disadvantage is that protoss becomes relatively more difficult if you allow for mind control.

LastOrder has skills that I did not see in the games I watched. There is code for them, at least; whether it can carry out the skills successfully is a separate question. It can use hydralisks and lurkers. Most interestingly, it knows how to drop. The macro model includes an action to research drop (UpgradeOverlordLoad), an action to assign units and presumably load up for a drop (AirDropAddArmy), and actions to carry out drops in different places (AttackAirDropStart for the enemy starting base, AttackAirDropNatural, AttackAirDropOther1, AttackAirDropOther2, AttackAirDropOther3). The code to carry out drops is AirdropTactic.cpp; it seems to expect to drop either all zerglings, all hydralisks, or all lurkers, no mixed unit types. Does LastOrder use these skills at all? If anybody can point out a game, I’m interested.

Learning to when to make hydras and lurkers should not be too hard. If LastOrder rarely or never uses hydras, it must be because it found another plan more effective—in games you make hydras first and then get the upgrades, so it’s easy to figure out. If it doesn’t use lurkers, maybe they didn’t help, or maybe it didn’t have any hydras around to morph after it tried researching the upgrade, because hydras were seen as useless. But still, it’s only 2 steps, it should be doable. Learning to drop is not as easy, though. To earn a reward, the agent has to select the upgrade action, the load action, and the drop action in order, each at a time when it makes sense. Doing only part of the sequence sets you back, and so does doing the whole sequence if you leave too much time between the steps, or drop straight into the enemy army, or make any severe mistake. You have to carry through accurately to get the payoff. It should be learnable, but it may take a long time and trainloads of data. I would be tempted to explicitly represent dependencies like this in some way or another, to tell the model up front the required order of events.

AIIDE 2018 - how LastOrder played

The new bot #13 LastOrder is related to the classic Overkill by Sijia Xu, but uses a machine learning model to make certain decisions: According to the description, “all the production of unit (excluding overlord), building, upgrade, tech and trigger of attack.” The learning is entirely offline; LastOrder does not store information about its opponents between games. Tactical and micro decisions, and I think building placement, are decided by rule-based code. One survey answer says,

we train the proposed macro model against about 20 AIIDE 2017 bots on a 1000 machines cluster scheduled by Tournament manager. the final model achieve about 83% win rate on all AIIDE 2017 bots

Against the stronger AIIDE 2018 bots, LastOrder scored about 49%, good enough to land in the top half of the ranking. I think the 83% win rate is too high for LastOrder’s underlying strength; I suspect that it overfitted to its 20 opponents. I think it learned to recognize some of its training opponents by their play style, and when it sees similar signs from different bots that play differently, it reacts incorrectly to a game plan that the different bot does not follow.

I watched a bunch of games to see what kind of play LastOrder figured out for itself. LastOrder’s units are mutalisks and zerglings, sometimes with scourge; I did not see it make other units (though Overkill has hydralisk skills that it might have chosen). LastOrder’s game plan is to open safely with 9 pool, sit back for a while, watch the opponent, lay down massive static defenses when danger seems to loom, macro up lots of drones, zerglings, and mutalisks behind its ramparts, and eventually burst into action and overwhelm the opponent. Details vary, but the overall game plan seemed consistent in all the games I watched.

It’s not an objectively strong game plan, but it seems effective against many bots. LastOrder had trouble touching stronger bots, upsetting only Steamhammer, and was itself upset by Ecgberht and WillyT, which as terrans had no difficulty steamrolling static defenses. But it scored highly against most lower-ranked opponents, including LetaBot (which may have been on its panel of 20 with little change).

Game 39, LastOrder-Steamhammer (replay file), was a good example of the game plan. LastOrder countered zergling with zergling for a while, then seemed to grow bored and made 4 sunkens to hide behind—far more than necessary or useful. A little later, it seemed to predict Steamhammer’s spire timing, adding excessive spores too. Steamhammer understands in principle how to react: It makes extra drones and gets ahead in both army and economy. Steamhammer could not safely attack the heavy defenses, but it could prevent LastOrder from expanding beyond its natural and win slowly. Sure enough, LastOrder tried to expand to a third, Steamhammer caught it and sent the entire army to erase the attempt—and LastOrder exploited the play, which was strategically correct but tactically wrong, hitting Steamhammer’s natural while its forces were out of position. Steamhammer’s tactical analysis is minimal; it doesn’t realize that it should destroy the expansion attempt with a small detachment.

Game 33041, LastOrder-Tyr (replay file), is one of the games that makes me suspect that LastOrder overfitted. Watch what happens after 7:00. LastOrder scouts Tyr’s turtle cannons with a zergling. LastOrder immediately reacts by building... many spore colonies, a nonsensical action. I think LastOrder saw the cannons and concluded, “I’ve seen this play before, and I know what is coming: Carriers!” It believes it is playing against XIMP. It plays similarly in games against XIMP.

LastOrder is a super interesting experiment. It did not score high like CherryPi, but it applied reinforcement learning to a more difficult problem, and it is far more successful than Sijia Xu’s past experiments with machine learning in Overkill. Its middling result is worth something, and yet its play remains somewhat disappointing. LastOrder’s play is highly reactive, but the reactions are often poor and the bot’s range of play is narrow (a wider pool of training opponents should help). I didn’t give examples, but many games show dishearteningly weak macro and mistaken tech decisions (possibly a better training methodology is needed). The problem is not solved yet!