LastOrder and its macro model - general info
LastOrder (github) now has a 15-page academic paper out, Macro action selection with deep reinforcement learning in StarCraft by 6 authors including Sijia Xu as lead author. The paper does not go into great detail, but it reveals new information. It also uses a lot of technical terms without explanation, so it may be hard to follow if you don’t have the background. Also see my recent post how LastOrder played for a concrete look at its games.
I want to break my discussion into 2 parts. Today I’ll go over general information; tomorrow I’ll work through the technical stuff: the network input and output, training, and so on.
The name LastOrder turns out to be an ingenious reference to the character Last Order from the A Certain Magical Index fictional universe, the final clone sister. The machine learning process produces a long string of similar models which go into combat for experimental purposes, and you keep the last one. Good name!
LastOrder divides its work into 2 halves, “macro” handled by the machine learning model and “micro” handled by rule-based code derived from Overkill. It’s a broad distinction; in Steamhammer’s 4-level abstraction terms, I would say that “macro” more or less covers strategy and operations, while “micro” covers tactics and micro. The macro model has a set of actions to build stuff, research tech, and expand to a new base, plus a set of 18 attack-related actions: 3 different types of attack in each of 5 different places (15 actions), plus 3 “add army” actions which apparently assign units to the 3 types of attack. (The paper says 17 though it lists 18. It looks as though the mutalisk add army action is unused, maybe because mutas are added automatically.) There is also an action to do nothing.
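To make the shape of the action space concrete, here is a rough sketch of how it might be enumerated. The counts follow the paper’s description; the names are my own illustrations except where they match identifiers from the LastOrder source that come up later in this post.

```python
# Illustrative sketch of the macro action space as described in the paper.
# 3 attack types x 5 locations = 15 attack actions, plus 3 add-army actions,
# plus build/tech/expand actions and a do-nothing action. Names are mostly
# made up; only the drop-related ones match the LastOrder source.

ATTACK_TYPES = ["Ground", "Mutalisk", "AirDrop"]                  # 3 kinds of attack
LOCATIONS = ["Start", "Natural", "Other1", "Other2", "Other3"]    # 5 target areas

attack_actions = [f"Attack{t}{loc}" for t in ATTACK_TYPES for loc in LOCATIONS]   # 15
add_army_actions = [f"{t}AddArmy" for t in ATTACK_TYPES]   # 3; the mutalisk one looks unused
production_actions = ["MorphZergling", "UpgradeOverlordLoad", "Expand"]            # examples only
macro_actions = attack_actions + add_army_actions + production_actions + ["DoNothing"]

print(len(attack_actions) + len(add_army_actions))   # 18 attack-related actions (the paper says 17)
```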
The paper includes a table on the last page, results of a test tournament where each of the 28 AIIDE 2017 participants played 303 games against LastOrder. We get to see how LastOrder scored its advertised 83% win rate: #2 PurpleWave and #3 Iron (rankings from AIIDE 2017) won nearly all games, no doubt overwhelming the rule-based part of LastOrder so that the macro model could not help. Next Microwave scored just under 50%, XIMP scored about 32%, and all others performed worse, including #1 ZZZKBot at 1.64% win rate—9 bots scored under 2%. When LastOrder’s micro part is good enough, the macro part is highly effective.
In AIIDE 2018, #13 LastOrder scored 49%, ranking in the top half. The paper has a brief discussion on page 10. LastOrder was rolled by top finishers because the micro part could not keep up with #9 Iron and above (according to me) or #8 McRave and above (according to the authors, who know things I don’t). Learning can’t help if you’re too burned to learn. LastOrder was also put down by terrans Ecgberht and WillyT, whose play styles are not represented in the 2017 training group, which has only 4 terrans (one of them Iron, which LastOrder cannot touch).
In the discussion of future work (a mandatory part of an academic paper; the work is required to be unending), they talk briefly about how to fix the weaknesses that showed in AIIDE 2018. They mention improving the rule-based part and learning unit-level micro to address the too-burned-to-learn problem, and self-play training to address the limitations of the training opponents. Self-play is the right idea in principle, but it’s not easy. You have to play all 3 races and support all the behaviors you might face, and that’s only the starting point before making it work.
I’d like to suggest another simple idea for future work: train each matchup separately. You lose generalization, but how much do production and attack decisions generalize between matchups? I could be wrong, but I think not much. Instead, a zerg player could train 3 models, ZvT, ZvP, and ZvZ, each of which takes fewer inputs and is solving an easier problem. A disadvantage is that protoss becomes relatively more difficult if you allow for mind control, since mind control can bring other races’ units into a ZvP game.
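The mechanical part of the idea is simple: keep one trained model per matchup and pick it by the opponent’s race at the start of the game. A minimal sketch, with made-up file names that are not LastOrder’s:

```python
# Hypothetical sketch: one separately trained macro model per matchup,
# chosen by the opponent's race at game start. File names are placeholders.

MATCHUP_MODELS = {
    "Terran":  "zvt_model.pt",
    "Protoss": "zvp_model.pt",
    "Zerg":    "zvz_model.pt",
}

def model_path_for(enemy_race: str) -> str:
    # Against Random the enemy race is unknown until scouted; fall back to
    # whatever default the bot prefers (here, arbitrarily, the ZvZ model).
    return MATCHUP_MODELS.get(enemy_race, MATCHUP_MODELS["Zerg"])
```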
LastOrder has skills that I did not see in the games I watched. There is code for them, at least; whether it can carry out the skills successfully is a separate question. It can use hydralisks and lurkers. Most interestingly, it knows how to drop. The macro model includes an action to research drop (UpgradeOverlordLoad), an action to assign units and presumably load up for a drop (AirDropAddArmy), and actions to carry out drops in different places (AttackAirDropStart for the enemy starting base, AttackAirDropNatural, AttackAirDropOther1, AttackAirDropOther2, AttackAirDropOther3). The code to carry out drops is AirdropTactic.cpp; it seems to expect to drop either all zerglings, all hydralisks, or all lurkers, no mixed unit types. Does LastOrder use these skills at all? If anybody can point out a game, I’m interested.
Learning when to make hydras and lurkers should not be too hard. If LastOrder rarely or never uses hydras, it must be because it found another plan more effective; in games you make hydras first and then get the upgrades, so it’s easy to figure out. If it doesn’t use lurkers, maybe they didn’t help, or maybe it never had hydras around to morph after it tried researching the upgrade, because hydras were seen as useless. But still, it’s only 2 steps; it should be doable. Learning to drop is not as easy, though. To earn a reward, the agent has to select the upgrade action, the load action, and the drop action in order, each at a time when it makes sense. Doing only part of the sequence sets you back, and so does doing the whole sequence if you leave too much time between the steps, or drop straight into the enemy army, or make any other severe mistake. You have to carry through accurately to get the payoff. It should be learnable, but it may take a long time and trainloads of data. I would be tempted to explicitly represent dependencies like this in some way or another, to tell the model up front the required order of events.
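One cheap way to tell the model the required order is an action mask over prerequisites: at each decision point, only expose a drop-related action when the previous step in the chain is done. That doesn’t teach the model when a drop is wise, but it stops it from wasting exploration on impossible orderings. A minimal sketch, assuming a game-state object with the obvious fields; none of this is LastOrder’s actual interface, only the action names are theirs.

```python
# Hypothetical sketch of prerequisite masking for the drop sequence.
# The GameState fields are assumptions, not LastOrder's real interface.

from dataclasses import dataclass

@dataclass
class GameState:
    drop_upgrade_done: bool   # Ventral Sacs (overlord transport) researched
    drop_army_loaded: bool    # units assigned/loaded by the add-army action

def allowed(action: str, s: GameState) -> bool:
    """Only expose drop actions whose prerequisites are already met."""
    if action == "UpgradeOverlordLoad":
        return not s.drop_upgrade_done
    if action == "AirDropAddArmy":
        return s.drop_upgrade_done
    if action.startswith("AttackAirDrop"):
        return s.drop_upgrade_done and s.drop_army_loaded
    return True   # all other macro actions are unaffected

# At selection time, zero out the policy's probability for masked actions and
# renormalize before sampling, so the agent never tries an impossible step.
```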
Comments
Sijia:
Generally, the bugs in each action’s execution code significantly influence the expected reward of each selected action, which leads to a policy that only suits the current micro ability and avoids using the other actions.
Fixing these bugs and optimizing the micro is really a headache of a task...
Sijia:
For the air drop actions, they are valid only when the two overlord upgrades are ready; compared with the other two kinds of attack action, insufficient training may be another problem.