timeout issues

Bruce @ Stardust commented on yesterday’s post. The indented paragraphs are quoted from the comment.

There is some AIIDE news from Discord that relates a bit to your last point about timeouts: PurpleWave has withdrawn from the tournament as it somewhat mysteriously was timing out in most of its games. I say mysteriously since Dan has invested a huge amount of effort into making PurpleWave run asynchronously, which should make timeouts impossible.

Executive summary: That sucks. PurpleWave author Dan Gant dug deep into low-level particulars that bot authors really should not have to know, and yet it wasn’t enough.

the problem

Last year when the BWAPI client timing bug was being investigated, some other issues were discovered, like problems relating to the timer not being high-enough resolution and problems with the bot process being pre-empted by the OS and therefore appearing to have spikes that were actually nothing to do with the bot.

BWAPI’s timer. I read the source and saw that BWAPI 4.4.0 times bots using the real-time counter GetTickCount(). The documentation that I just linked says that the timer’s resolution is “typically in the range of 10 milliseconds to 16 milliseconds.” That’s very crude for verifying that an interval does not exceed 55 milliseconds. A measurement “it took 55ms” means “it probably sort of maybe took between 40ms and 70ms, though it depends on your system.” One solution would be to use a high-resolution timer in a new BWAPI version. That’s how Steamhammer times itself, with code inherited from UAlbertaBot. Another solution might be to find a way to time accurately from outside BWAPI, somehow accounting for overheads.
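For comparison, self-timing with the standard library’s monotonic clock gives sub-millisecond resolution with no platform-specific code. A minimal sketch (the names are mine, not Steamhammer’s):

```cpp
#include <chrono>

// Measure one frame's elapsed wall-clock time with sub-millisecond
// resolution, using the monotonic steady_clock rather than a coarse
// tick counter like GetTickCount().
class FrameTimer
{
    std::chrono::steady_clock::time_point start_;
public:
    void start() { start_ = std::chrono::steady_clock::now(); }

    // Elapsed milliseconds since start(), as a double.
    double elapsedMs() const
    {
        auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(now - start_).count();
    }
};
```

The steady clock is guaranteed monotonic, so it cannot jump backward when the system clock is adjusted, which is the property you want when measuring frame durations.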

BWAPI reports the time through a call to BWAPI::Broodwar->getLastEventTime(). A comment in ExampleTournamentModule.cpp explains a workaround in the code to cope with peculiarities that are hard to understand. It’s a code smell, as the authors are well aware, or the comment would not be there. I don’t want to try to figure out if and when the code works as intended.

Both these points appear in the BWAPI issue getLastEventTime() has different behavior for client/module bots, linked by Dan in another comment. Confusion among BWAPI developers in the comment thread shows how hard it is to understand!

Being pre-empted. As I understand it, in the tournament the timer is provided by a multitasking virtual machine which itself runs under a multitasking operating system. Looks like ample opportunity for slippage in every aspect of timing. I don’t know the solution for that. Is it possible to measure something like CPU time + I/O time instead of real time? Surely every operating system keeps track. Would it work better, even when running under a VM that might itself be pre-empted by the host OS? I can think of other potential problems, but that’s a good start!
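For what it’s worth, standard C++ can at least read process CPU time via std::clock(), which counts time the process actually spent executing rather than wall time. A sketch; whether it behaves sanely inside a VM is exactly the open question:

```cpp
#include <ctime>

// Process CPU time in milliseconds, via the C standard library.
// std::clock() counts processor time charged to this process, so time
// spent pre-empted by the OS (or by the hypervisor) is not included.
double cpuTimeMs()
{
    return 1000.0 * static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
}
```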

Experiments with timers in one environment might not tell us about timers in another environment. And yet if we want to hold bots to time limits, then bots need a reliable way to measure their time usage.

a proposed solution

All of this has got me wondering if we should change the approach to timeouts. I think the 1- and 10-second limits are fine, but perhaps the 55ms rule should be an average over the entire game instead of a frame count limit. I’m a bit worried that the current rules will result in more unfair disqualifications or force more bot authors to spend a lot of time working around single-frame spikes, both of which are bad for our already-quite-small community.

That worries me too, and I like your suggestion, especially for tournaments, because tournaments care more about total time needed than time per frame. (For playing against humans, or for streaming, consistent speed counts.) My first thought is that if there is a mean frame time limit, then the limit should be lower than 55ms, perhaps 42ms. Averages are easier to keep low than occasional peaks are. Maybe histograms of frame time for a bunch of bots would help us understand what is best. I’m imagining that the tournament would allow a startup transient, then keep track of the frame count and total frame time, and verify the average periodically, perhaps once per second. Fail and earn an immediate loss.

Dan suggested a rolling average (aka moving average) as a possible alternative. That’s more complicated to implement, but not by much.

There are other averages than the mean. The mean has the advantage of simplicity, and the advantage that the total time allowed is proportional to the length of the game. I think the mean is the right choice. But if the goal is to limit spikes above 55ms (or whatever), then we could choose an averaging function that penalizes those more. Choose appropriately, and the 1-second and 10-second rules could be eliminated because the averaging function takes over their roles.
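A mean-over-the-game check is only a few lines for a tournament manager. A sketch of how I imagine it, with illustrative numbers (a 42ms limit, one check roughly per second) rather than anything agreed on:

```cpp
// Sketch of a tournament-side mean frame time check: skip a startup
// transient, accumulate total frame time, and verify the running mean
// periodically. All thresholds are illustrative, not a proposal
// anyone has agreed to.
class MeanFrameTimeCheck
{
    int    frames_  = 0;
    double totalMs_ = 0.0;
    const int    startupFrames_;   // frames to ignore at game start
    const double limitMs_;         // allowed mean, e.g. 42 ms
    const int    checkEvery_;      // frames between checks, ~1 second
public:
    MeanFrameTimeCheck(int startup, double limit, int every)
        : startupFrames_(startup), limitMs_(limit), checkEvery_(every) {}

    // Record one frame; return false if the bot has failed the check.
    bool recordFrame(int frame, double frameMs)
    {
        if (frame < startupFrames_) return true;   // startup transient
        ++frames_;
        totalMs_ += frameMs;
        if (frames_ % checkEvery_ == 0 && totalMs_ / frames_ > limitMs_)
            return false;                           // immediate loss
        return true;
    }
};
```

A rolling average would replace the running totals with a fixed-size window of recent frame times; the rest of the structure stays the same.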

real-time systems

I favor making life easy for bot authors, but there’s only so easy it can be made.

Stepping back for a bigger picture, a BWAPI bot is a complex real-time system. If the bot does little work per frame, it is easy to hold it to its real-time promises, no matter the details of the promises. Don’t worry, just run it, it’ll be fine (Steamhammer’s approach so far). If it does a lot of work per frame and risks breaking its promises, then in general it has to decide what work to skip to save time. It needs some way, preferably a smart way, to divide its work into pieces and to choose which pieces to drop or delay (PurpleWave’s approach). It’s much harder. The difficulty is intrinsic to real-time systems: If you want to play as well as possible, and playing your best may take more time than you have, then the bot needs a baked-in system to cut corners.

I can imagine that somebody might provide a real-time framework for bots, but even then not everybody would or should use it. With more to learn, starting a bot would be harder. Maybe it would be good to have a framework with optional real-time features.

I remember BeeBot, interesting but eventually disabled for being too slow. I can at least offer advice for authors whose bots are slow, or in danger of becoming slow. Many of these bots, I think, are by less-experienced programmers who haven’t yet mastered the art of efficient algorithms and structuring their code to avoid unnecessary work. Over-optimization that obfuscates code is an anti-skill for long-term development, but clear and efficient structure is good. Skip computations that you don’t need, calculate on demand data that you may or may not use, cache data that you may reuse, tolerate some out-of-date data if it doesn’t need to be the latest—all easy ideas, but not so easy to become expert at. And that means that the expertise is valuable.

A little more for those who do have the experience. If you’re not familiar with real-time systems, you may not realize: Code with a predictable runtime is often better than fast but unpredictable code. If you know how long it will take, then you can schedule it to safely meet your real-time promises. If it’s faster on average but occasionally takes longer, you may risk breaking your promises. Better yet is code where you decide how long it takes: See anytime algorithms, which offer some answer whenever you stop them, and a better answer if you let them run longer. Many search algorithms have the anytime property.

plan then execute

I notice that in coding Steamhammer features, I increasingly employ a pattern of separating a planning phase and an execution phase. In the Steamhammer change list a couple days ago, I described new code for ground upgrades. It’s only 78 lines, including comments and blank lines. The planning phase looks at the game and decides on a priority order for melee attack, missile attack, and carapace upgrades. The execution phase carries out the top-priority upgrades when everything needed is available. By separating the concerns, each phase has a smaller job that is easier to understand. The only cost is that you need a data structure to carry the plan between phases.
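The shape of the pattern, with made-up names and scoring rather than Steamhammer’s actual code: the planner produces a priority order, and the executor consumes it.

```cpp
#include <algorithm>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Plan-then-execute sketch for ground upgrades. The planning phase
// scores each upgrade and sorts them into a priority order; the
// execution phase picks the first upgrade in that order that is not
// already done. Names and scoring are invented for illustration.
enum class Upgrade { MeleeAttack, MissileAttack, Carapace };

struct UpgradePlan { std::vector<Upgrade> priorityOrder; };

// Planning phase: score each upgrade by how many units would benefit.
UpgradePlan planUpgrades(int meleeCount, int rangedCount)
{
    std::vector<std::pair<int, Upgrade>> scored = {
        { meleeCount,               Upgrade::MeleeAttack },
        { rangedCount,              Upgrade::MissileAttack },
        { meleeCount + rangedCount, Upgrade::Carapace },
    };
    std::sort(scored.begin(), scored.end(),
        [](const std::pair<int, Upgrade> & a, const std::pair<int, Upgrade> & b)
        { return a.first > b.first; });
    UpgradePlan plan;
    for (const auto & s : scored) plan.priorityOrder.push_back(s.second);
    return plan;
}

// Execution phase: the first planned upgrade not already completed.
std::optional<Upgrade> nextToStart(const UpgradePlan & plan,
                                   const std::set<Upgrade> & done)
{
    for (Upgrade u : plan.priorityOrder)
        if (done.count(u) == 0) return u;
    return std::nullopt;
}
```

The data structure carried between the phases is just `UpgradePlan`; each phase sees a smaller, easier job.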

On a larger scale, the static defense controller works the same. The planning phase does not go into detail about each base, but figures out how much defense is needed for each category of base: The front line needs this many sunkens, exposed outer bases need that many, and so on. The execution phase runs the following frame, and works out the details of which specific bases need more defensive buildings, and where exactly they should be placed, and how fast they should be made. Compared to most of Steamhammer, the code is straightforward and easy to understand, and I give part of the credit to the separation of planning and execution.

On a larger scale yet is the Micro module. It accepts orders for individual units, remembers the orders, and carries them out over however many frames they take. It figures out how to kite hydras and tries to solve problems like stuck units. Micro constitutes the execution phase for individual unit micro; its job is to make life easier for the rest of the bot. It is incomplete and not as pretty as the static defense controller, but I see it as benefiting from the same general idea.

as an architectural principle

It seems to me that completely separating planning from execution at the top level of the frame loop could be a good architectural choice. onFrame() might look like this:

void PerfectBot::onFrame()
{
  GameState state = collectGameState();   // input: one consistent snapshot
  GamePlan plan = analyzeAndPlan(state);  // decision: all planning up front
  execute(state, plan);                   // output: carry out the plan
}

The planner would presumably be made up of many different modules, each planning a different aspect of play: Production, scouting, army movement, and so on. A minor advantage of doing all planning up front, before any execution, is that the execution phase then always sees a consistent view of the game; nothing is out of date because that module hasn’t run yet this frame. The major advantage is that each aspect of the planner has access to the others, so that (at least in principle) resources can be allocated well, conflicting goals can be reconciled, and tradeoffs can be resolved using all information. All this happens before the bot takes any action, so it should be easier to arrange for it to take good actions. For example, if the planner assigns each unit one job, then the bot should never have bugs where two modules both think they control the same unit (which has happened to Steamhammer).

The execution phase would presumably have many modules too, one for each executable aspect of the plan. They might be parallel to the analysis modules, but I don’t see that they have to be.

Compare CherryPi’s blackboard architecture. The blackboard is a global data structure which lets program modules communicate with each other. A blackboard is a good foundation for separating planning from execution, whether at the frame loop level or otherwise, and CherryPi uses it that way.
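In miniature, a blackboard is just a shared store that planning modules write and execution modules read. A toy sketch of the shape, nothing like CherryPi’s actual implementation:

```cpp
#include <map>
#include <string>

// Toy blackboard: modules communicate by posting values under keys
// rather than calling each other directly. A planner posts its
// decisions; an executor later reads them. Real blackboards
// (CherryPi's included) are typed and far richer; this only shows
// the shape of the idea.
class Blackboard
{
    std::map<std::string, int> data_;
public:
    void post(const std::string & key, int value) { data_[key] = value; }
    bool has(const std::string & key) const { return data_.count(key) > 0; }
    int  get(const std::string & key) const { return data_.at(key); }
};
```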

low level bugs

In its first draft, a new feature in Steamhammer caused a crashing bug. After several runs to collect information, it looked like a bad pointer or a memory corruption bug. Thank you C++, we enjoy unsafe languages above all others. With more poking, I identified a memory safety issue and realized that it would take some rejiggering to fix. It took me several hours to do minor rewrites up and down long call chains and verify that I hadn’t missed any links. Finally the code was tight, everything was double-checked, it compiled correctly, and I ran the first test—and it failed with the exact same symptoms. It happens to everyone, right?

Well, the show must go on. I configured tests to all run on one map for efficient debugging, and brought out the loupe for a close look. It took me three days, I wrote self-test code to catch errors as early as possible and extract their secrets, I had to learn (what seems to me) Lovecraftian C++ arcana that I would prefer not to know the existence of much less to understand, but I solved it, and kept the code nice and efficient too, and my new feature came to work perfectly on the test map in a wide variety of situations. That took a lot longer than writing the first draft. OK, now to test more widely. On the very next map—the self-test immediately caught errors. It happens to everyone, right? Right? Right? Cue “They’re Coming to Take Me Away, Ha Ha!”

This bug turned out to be trivial, though.

I once worked with an intern who wrote the worst code I have ever seen. Not only was it tangled and barely decipherable, it was ingeniously self-modifying, and not with address variables but with hardcoded address constants so that it depended on running at a fixed address... even though the software was planned to be burned into ROM. Unbelievable but true. My first step in fixing it was to throw it out. I guess I should be proud of myself as a stable and careful programmer, since I still have some hair and I rarely lose days of time to low-level bugs.

But I have to say, if I wanted to have fun in an unsafe low-level language, I would use a fun unsafe low-level language, like FORTH.

stealth bugs

Early on in Steamhammer’s history, one version introduced a bug that caused small delays in starting the construction of buildings. You couldn’t really tell by watching games unless you knew what to look for, but results were worse and it was clear that something was wrong. It was a stealth bug: The visible symptom was that play was worse, and the cause was hard to discern. I was only able to narrow it down and solve it by knowing what I had changed recently.

Another version had a problem with zergling micro that dragged down results for months. I did not see the cause. I had to watch a great many games to train my eyes to see that zerglings were not as effective as they had been earlier, to tell which fights were going the wrong way, and then I could finally ferret it out. Zerglings are numerous; I remember focusing on the actions of a few specific lings in a fight and being satisfied, but it was an illusion that didn’t hold up when I learned to better follow the movement of the whole fight.

Most recently the latency compensation issue that caused Steamhammer to try to re-use larvas was hard to notice. I noted that Steamhammer was opening with 11 hatchery strangely often, but it can do that so I didn’t smell a bug. And the BASIL ranking was a little disappointing, but that might be because of Monster and other improving opponents. It was only when I was writing a new opening and watching production closely that I saw the blip and could start tracing it. After the fix and its friends were uploaded, Steamhammer’s ranking quickly climbed back to the range I had originally expected.

How many stealth bugs have I never noticed? I have no shortage of known bugs and severe weaknesses on my to-do list, so in a way it doesn’t matter. And yet surely I could do better if I knew more.

What is a stealth bug to one person might be an obvious problem to another. A game has more going on than a mere human can absorb in one viewing, and people pay attention to different aspects. (Flash is presumably not a mere human.) I remember a lurker target-fixation bug that Steamhammer shared with Microwave. I thought the bug was noticeable, since I pay attention to lurker micro, while Microwave’s author MicroDK had not realized it was happening.

In the end I guess there’s no moral to the story, other than to test rigorously, vigorously, and variously, which is standard software advice anyway. Pay attention to performance regressions and dig up their roots. It can never hurt to run tournaments at home.

apparent latency compensation bug

The just-played Simplicity vs BananaBrain is a fine game by Simplicity. The early defense against zealots was especially well done, and Simplicity’s tech and attack decisions were good. Recommended.

In the meantime, I’ve hit a bug that’s slowing me down. I found a reproducible case where production fails because it tries to use the same larva to produce two drones. It looks like slippage in BWAPI’s latency compensation: The production system picks a larva to produce a drone. Ask for the larva’s type during the same frame, after giving the morph order, and you get egg; that is latency comp at work. Ask again a couple frames later, and the egg has turned back into a larva; the production system picks it a second time, and the second morph can only fail. I think it should be easy to work around, but can it be fixed? Latency compensation is not expected to be perfect.
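The workaround I have in mind is bookkeeping on the bot’s side: remember when each unit was last given an order, and refuse to pick it again until enough frames have passed for the order to be visibly in effect. A sketch, with plain unit IDs standing in for BWAPI unit pointers and the 8-frame margin a guess:

```cpp
#include <map>

// Guard against re-using a larva when latency compensation "forgets"
// the morph order for a few frames. Record the frame on which each
// unit was ordered, and treat it as unavailable until a margin of
// frames has passed. Unit IDs stand in for BWAPI unit pointers; the
// default 8-frame margin is a guess, not a measured value.
class OrderGuard
{
    std::map<int, int> orderedOnFrame_;
    const int marginFrames_;
public:
    explicit OrderGuard(int margin = 8) : marginFrames_(margin) {}

    void recordOrder(int unitID, int frame) { orderedOnFrame_[unitID] = frame; }

    bool isAvailable(int unitID, int frame) const
    {
        auto it = orderedOnFrame_.find(unitID);
        return it == orderedOnFrame_.end()
            || frame - it->second >= marginFrames_;
    }
};
```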

It makes me wonder what other slippages may be hiding under the rug.

unexpected infrastructure work in Steamhammer

To fix a bug, to make new features easier, and to improve usability, I updated Steamhammer so that the building manager carries out both steps of constructing a sunken colony or a spore colony, laying down the creep colony and then morphing the creep. Openings that used to say "creep colony", [other steps while we wait for the creep to complete], "sunken colony" now go "sunken colony", [other steps]. If you ask for a creep colony, then that is all you get; other code will have to take care of morphing it, because the building manager no longer does that step by itself, only the two-step process as a unit.
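Internally the change amounts to a small state machine per building job: place the creep colony, wait for it to complete, then issue the morph. A simplified sketch of the idea, not Steamhammer’s actual code:

```cpp
// Simplified state machine for the two-step sunken/spore build, as a
// building manager might track it. Each step advances only when the
// previous one is observed to be complete, so the creep colony cannot
// be "lost" between the two steps. A sketch of the idea only.
enum class BuildStep { PlaceCreepColony, WaitForCreep, MorphToSunken, Done };

class SunkenBuildTask
{
    BuildStep step_ = BuildStep::PlaceCreepColony;
public:
    BuildStep step() const { return step_; }

    // Called each frame with what the bot currently observes.
    void update(bool creepPlaced, bool creepCompleted, bool morphStarted)
    {
        switch (step_)
        {
        case BuildStep::PlaceCreepColony:
            if (creepPlaced) step_ = BuildStep::WaitForCreep;
            break;
        case BuildStep::WaitForCreep:
            if (creepCompleted) step_ = BuildStep::MorphToSunken;
            break;
        case BuildStep::MorphToSunken:
            if (morphStarted) step_ = BuildStep::Done;
            break;
        case BuildStep::Done:
            break;
        }
    }
};
```

Because each task remembers what it is for, a sunken and a spore started at the same time can no longer trade places.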

The bug, by the way, is that there was no tracking of what was supposed to happen to each creep colony. When making a sunken and a spore at the same time, they often got swapped, so that instead of (say) a spore in the main and a spore and sunk in the natural, there might be 2 spores in the natural and a useless sunk in the main. If you’ve watched many Steamhammer games, you may have wondered why spores are so often oddly placed—now you know one reason.

I didn’t intend to do any serious infrastructure work, but this piece turned into it. It was worth it, because the bug was causing more trouble after every update, but there was more to it than I realized. The assumption that you build a creep first and later morph it as a separate step turned out to be baked into the codebase. Besides the openings and the building manager, I had to closely analyze parts of the strategy boss, the production manager, and the macro acts themselves, and make changes that risk introducing new bugs. It has worked perfectly in tests so far, but I do not have confidence that reactions and corner cases will work in every situation.

I haven’t dealt with the issue of canceling and restarting sunkens due to the curious hit point change of the sunken colony, which Steamhammer has supported since version 1.4. Changes are needed. After that I think it’s not much extra work to implement delayed morphing of sunkens, like Arrakhammer, so I may do that too. But I could be wrong about the amount of work....

How do these subtle assumptions become so deeply threaded through a codebase, and so hard to change? On the one hand, it’s a failure to separate concerns, so you feel you could have done it better. On the other, how are you supposed to know what concerns you’ll have in the future? Software is hard.

software maintenance and the decision cycle

Whether in your brain or in a Starcraft bot, to act in the world you first collect information, evaluate the information to make decisions, and execute your decisions. The steps may not be as neatly separated as the words that describe them, but they are always there. Think of the psychology concepts of perception, cognition, and motor control, or the military OODA loop (observe, orient, decide, act), and other decision cycles.

When you write a big piece of software, it matters how you organize the steps. In general terms, Steamhammer follows its parent UAlbertaBot, and many other bots, in the way it organizes them: By the decisions. The code that makes a decision is responsible for collecting whatever information it needs, by whatever combination of calling BWAPI directly and calling on the rest of the program, and responsible for executing its decisions, again sometimes calling BWAPI directly to issue orders and sometimes passing internal orders to the rest of the program. So one module makes spending decisions (“a hydralisk next”), one module controls mining workers (“send it to that patch”), and so on.

To a certain extent, that organization is inevitable. Decisions of different kinds have to be made by different code (absent super-powerful machine learning or some other extreme abstraction technique), and the code has to have inputs and outputs. But the haphazard way of collecting inputs, and of passing along outputs, is not so good. I noticed long ago, and over time I’ve seen more clearly, that it is error prone.

On the input side, the data a module sees depends on the order that modules run in: They are not independent. I sorted modules so that, on each frame, information-gathering ones like InformationManager run before decision-making ones like CombatCommander, but in the full program the dependencies are not that simple. Read closely and you’ll find comments like “this must happen before that,” and comments like “eh, the data is one frame out of date but in this case it doesn’t matter,” and special cases to work around backward dependencies. I have fixed bugs, and I feel 100% certain that there are undiscovered bugs due to computing information only after it is needed.

On the output side, it’s difficult to coordinate decisions. A common error is double commanding, where a unit is given contradictory orders: One bit says “Look out, drone, the enemy is near, run away,” then the rest of the code doesn’t remember that the decision is made and says “Hey drone, you’re not mining, get back to work.” Most orders (not all) go through the Micro module for execution, and Micro knows not to issue two BWAPI commands for a unit on the same frame, so a frequent result is that the drone is told to run away one frame, then to mine the next frame, and so on back and forth. It’s a common cause of bugs where units vibrate in place instead of doing anything useful, and the worker manager (which makes a lot of special case decisions) has a particularly elaborate internal system to try to prevent it. Literal double commanding at the BWAPI level is only one issue; the same kind of thing can also happen at higher levels of abstraction, causing problems like indecisive squads.
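The guard at the lowest level is simple: an arbiter that remembers the frame each unit was last commanded and refuses a second command on the same frame. A sketch of that kind of check, with unit IDs standing in for unit pointers:

```cpp
#include <map>

// One-command-per-unit-per-frame arbiter, the kind of check a Micro
// module performs before issuing BWAPI orders. Returns true if the
// command may be issued; a second command for the same unit on the
// same frame is refused. Unit IDs stand in for unit pointers.
class CommandArbiter
{
    std::map<int, int> lastCommandFrame_;
public:
    bool tryCommand(int unitID, int frame)
    {
        auto it = lastCommandFrame_.find(unitID);
        if (it != lastCommandFrame_.end() && it->second == frame)
            return false;            // double command: refuse (or assert)
        lastCommandFrame_[unitID] = frame;
        return true;
    }
};
```

Note that this stops only literal double commanding. The drone told to flee one frame and to mine the next sails right through it, which is why the decisions themselves have to be coordinated at a higher level.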

The logical fix is to add architectural barriers between input, decision, and output. In principle, each module collects all its inputs and puts them into a data structure, then draws a line under it, done. Then it makes its decisions on that basis, records the decisions in another data structure (with the idea of forcing it to resolve any conflicting decisions up front), and draws a line under that. Then it executes the recorded decisions. Input, decision, and output become separate phases of execution.

In real life the dependencies are complicated and it’s not that simple. I’m thinking that the ideal architecture for input data is a fixed declarative representation of everything that might be wanted during a given frame, which is evaluated on demand, in the style of lazy functional programming. That way dependencies are explicit, dependency loops will make themselves evident, and only the information you need is computed each frame.
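The smallest useful piece of that idea is a per-frame lazily cached value: computed at most once per frame, on first demand, so modules that never ask pay nothing and modules that ask twice see the same snapshot. A sketch; a real version would make the dependency graph explicit:

```cpp
#include <functional>
#include <utility>

// Lazily evaluated, per-frame cached input value: the computation runs
// at most once per frame, on first demand. Modules that never ask cost
// nothing, and modules that ask twice in one frame see a consistent
// snapshot. A sketch of the idea; dependencies stay implicit here.
template <typename T>
class PerFrameLazy
{
    std::function<T()> compute_;
    T   value_{};
    int computedOnFrame_ = -1;
public:
    explicit PerFrameLazy(std::function<T()> f) : compute_(std::move(f)) {}

    const T & get(int frame)
    {
        if (computedOnFrame_ != frame)
        {
            value_ = compute_();
            computedOnFrame_ = frame;
        }
        return value_;
    }
};
```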

I don’t have such a beautiful solution for output. The Micro module is a partially implemented attempt to separate some decisions from their execution. It does help, but as we’ve seen above, even if it were a complete implementation it would not solve the problem. The decisions themselves have to be good, and though architecture can aid good decisions it can’t require them. Maybe there’s nothing for it but to be clear about exactly what you’re deciding, at what level of abstraction, and be careful to do it right.

a bug and its antimatter twin

Steamhammer has a special case reaction in the building manager to ensure that it builds enough macro hatcheries when it is contained: If a drone sent to build an expansion hatchery did not make it there, and zerg suffers from a larva shortage, then the expansion is converted to a macro hatchery instead. A drone is assigned to build the hatchery inside the base.

The feature seems to have bit-decayed and it is not working reliably. I’ve seen a few games where Steamhammer was contained and desperately kept sending drones to try to expand, while it had a larva shortage and its mineral bank was building up.

At the same time I see the diametrically opposite bug in Steamhammer’s terran play: When terran is contained for a long time, it starts to build “macro command centers” inside its base, which essentially act as oversize 400 mineral supply depots. It doesn’t know how to lift them off and land them elsewhere, and it doesn’t need more SCV production, so it’s pure loss.

What the what?!?

Logically, the bugs must be unrelated. The zerg special case explicitly checks that the building is a zerg hatchery. But it behaves exactly as though it had the races mixed up. If the two bugs annihilated like matter and antimatter, they would create energetic play.

Both bugs are on my list to solve soon. The terran bug is as serious as the zerg one in terms of wasted resources.

Steamhammer and bugs

I’m pleased with Steamhammer’s reliability this tournament. I watched all 88 games, and for the first time ever I saw no losses caused by crippling bugs. There were 2 games with terrible play due to bugs, but in both cases Steamhammer recovered and won anyway. There was also a bug that occasionally caused multiple drones to be sent to fail to scout the enemy: Lose one, send another, repeat. That would lose games for sure against an even opponent, but in practice it occurs in games that Steamhammer will lose no matter what. I think I know the cause, and it will be fixed soon.

I surmise that Steamhammer is especially prone to reliability problems because it aspires to do everything. It plays over 150 opening builds of all kinds; failing to adapt to all the misadventures of the opening causes a lot of snags. I count only 3 zerg abilities yet to be implemented: Nydus canals, infested terrans, and lifting off an infested command center, all on my list for coming months. (Drop is technically implemented even though not yet used by zerg. I guess overlord sight range is not implemented either, but it’s trivial if I ever want it.) More features means more bugs; fixing bugs in defiler play was a big time sink last year.

Another source of reliability hitches is my habit of swapping in new plans as fast as I can make them. A number of Steamhammer’s internal modules are half-rewritten, stuck in the middle of a transition from an old design that is not flexible enough to a new design that is not finished enough. It would be more efficient to complete one task before moving on to the next. But then I would be making progress on only one front at a time, and it wouldn’t be as much fun. I like the variety.

Anyway, after burnishing the 2.x versions for over a year, I’ve remedied almost all of the worst bugs. It is past time to get back to major features and structural work, so that I can add a new round of bugs. To symbolize that, I’ll be calling the next version 3.0 even though it doesn’t in reality have any new feature worthy of a major version number. It will in time.

more Steamhammer 2.3 bugs

Earlier, I pointed out 2 rare game-over bugs in Steamhammer 2.3. It turns out that there are 2 other serious bugs in this version, which are not rare; they are causing a good proportion of the bot’s losses.

One is a bug in turning gas collection on and off. The bug can turn off gas collection even when the bot needs gas to continue production, causing a production freeze that can last for the rest of the game (not likely to be long with no production). It may have been introduced when I refactored the code for gas decisions. The other is a bug that cancels vital hatcheries (and perhaps other buildings) for no apparent reason. I suspect an error in one of the emergency “uh oh, no drones!” or “panic! I need those resources back NOW!” tests. There are other possibilities.

Critical bugs go straight to the top of my list.

The bugs also point out a weakness in Steamhammer’s design: Decentralized decision making. Decision rules that take direct action are scattered through the code. Of course a bug in any rule can cause bad behavior; that’s inevitable. The weak point of the design is that the separate rules, sometimes far away from each other in the code, have to cooperate so they don’t override each other’s decisions. The gas collection bug is clearly a coordination bug; one rule says “I need gas for the next production item” and some other rule is incorrectly overriding the decision, “nah, I wanna turn off gas.”

Compare the unit mix decision, which is made using a more indirect method: Rules collect evidence, then in a second stage the evidence is weighed and the decision is made. The evidence is explicit, so bugs are easier to trace, and evidence-collecting rules do not need to cooperate with each other, they only need to weight their evidence consistently. Machine learning systems conventionally follow this collect-evidence-then-decide paradigm as a matter of course.
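In code, the pattern is two stages with an explicit evidence table between them. A toy sketch of the idea, not Steamhammer’s unit mix code:

```cpp
#include <map>
#include <string>

// Evidence-then-decide sketch: rules post weighted evidence for each
// candidate, and a separate second stage sums the weights and picks
// the winner. The rules never see or override each other; only the
// explicit evidence table connects them. Names are invented.
class UnitMixDecision
{
    std::map<std::string, double> evidence_;
public:
    // Stage 1: any rule may add weight for a candidate.
    void addEvidence(const std::string & unitType, double weight)
    {
        evidence_[unitType] += weight;
    }

    // Stage 2: weigh the evidence and decide.
    std::string decide() const
    {
        std::string best;
        double bestWeight = -1.0;
        for (const auto & e : evidence_)
            if (e.second > bestWeight) { best = e.first; bestWeight = e.second; }
        return best;
    }
};
```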

six months of bug fixing

Today I fixed what I think is the last serious bug that I introduced in version 2.0 last September. It was the one that causes innocent lurkers which are trying to walk across the map to move stutteringly or to vibrate in place. The bug turned out to also affect medics and defilers, though in a minor way that’s not easy to see.

How can it take half a year to fix serious bugs? I can list several reasons. I think that is a bad sign in itself.

1. Some of the bugs, including this lurker bug, were introduced in 2.0 but did not have visible bad effects until later—in this case, when I updated the micro system and taught it a stricter attitude toward executing commands. Experience says that Steamhammer has many invisible bugs that rarely bite or that have hard-to-see effects or that are complex and hard to trace back to a cause, but still need to be fixed. In any case, bugs that visibly hurt play get priority.

2. Some parts of Steamhammer inherited a UAlbertaBot habit that close enough is close enough, if some expected prerequisite is not there after all... eh, ignore this part of the job. As time goes on I’ve been making Steamhammer more strict about checking its arguments and preconditions, and complaining if anything looks wrong. But today I ran a test: How often does Steamhammer try to issue the same unit different commands during the same frame? That should only happen by mistake, and yet the UAlbertaBot code ignores it. The answer was that it happens many times per game—not often enough to break micro regularly, but there are sure to be bad effects. At some point I have to find time to chase down the causes and fix them, and then add an assertion so it can’t happen again without drawing attention. Bugs must be made visible!

3. Version 2.0 was a major reworking of squad-level tactics. Of course a big ship needs a long shakedown cruise. It didn’t help that I was hurrying to prepare for AIIDE.

4. The squad data structure and code structure are not designed to support the more complicated behaviors that I implemented in 2.0. I added a bunch of awkward bug-prone special cases. In Software Development Utopia, I would have redesigned the Squad class first, and then had a much easier time with the squad behavior. That would be more efficient in the long run, but it would also stunt my short-term progress. Sometimes we prefer to accrue technical debt.

And the lesson is... um... take a systems view? That sounds good.

There are still a bunch of unfixed weaknesses (as opposed to outright bugs) that were introduced in version 2.0.... And yet the advances in 2.0 were necessary to keep progress going.

funny map analysis picture

It turns out there are a lot of ways to calculate regions and chokes. In the course of putting one together, I’ve been doing some other map analysis that should be useful for micro and pathfinding. Here is one that doesn’t work yet, a color-coded debug drawing which is supposed to show the room available around each walk tile: How much space is there for a unit or army to fit into? If you know a path, you can check the tiles to find out which units are small enough to travel the path. Or you can figure out how many of your units fit behind the enemy mineral line—should you go there?

incorrect and funny-looking dot diagram

Unfortunately, it’s no good as it stands. Among other mistakes, it claims there is no room in places where there obviously is. It makes a funny picture, though.

Like BWEM, I also calculate the distance from the nearest unwalkable tile. Iron makes good use of that information. That code worked correctly on the first try....
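The distance calculation is a standard multi-source BFS: seed the queue with every unwalkable tile at distance 0 and flood outward. Here is a minimal sketch with a 4-connected grid; the grid layout and names are illustrative, not Steamhammer's.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Multi-source BFS: each tile gets its distance (in tiles, 4-connected)
// from the nearest unwalkable tile. Unwalkable tiles themselves are 0.
std::vector<std::vector<int>> distanceFromUnwalkable(const std::vector<std::vector<bool>> & walkable)
{
    const int h = walkable.size();
    const int w = walkable[0].size();
    std::vector<std::vector<int>> dist(h, std::vector<int>(w, -1));
    std::queue<std::pair<int, int>> q;

    // Seed the BFS with every unwalkable tile at distance 0.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (!walkable[y][x]) { dist[y][x] = 0; q.push({x, y}); }

    const int dx[] = { 1, -1, 0, 0 };
    const int dy[] = { 0, 0, 1, -1 };
    while (!q.empty())
    {
        auto [x, y] = q.front(); q.pop();
        for (int i = 0; i < 4; ++i)
        {
            int nx = x + dx[i], ny = y + dy[i];
            if (nx >= 0 && nx < w && ny >= 0 && ny < h && dist[ny][nx] == -1)
            {
                dist[ny][nx] = dist[y][x] + 1;
                q.push({nx, ny});
            }
        }
    }
    return dist;
}
```

The same flood fill, seeded differently, is the basis for the room-around-each-tile drawing above.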

which weaknesses are critical?

The tournament version, Steamhammer 2.1.4, suffers from a command jam bug which reissues commands far too often, causing many to be dropped. The effects are devastating: units ignore their orders, freezing in place, or wandering past the enemy taking fire without noticing, and so on. The bug often starts to bite before the zerg supply reaches 50, and the effect grows worse as supply increases. By the late game, large groups of units are sitting uselessly around the map doing nothing. It's a critical bug, intolerably severe.
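One common guard against this class of bug, sketched below under assumed names (this is not Steamhammer's actual fix), is to reissue an order only when the target has changed or enough frames have passed since the last command.

```cpp
#include <cassert>

// Illustrative guard against command jam: only reissue an order when the
// target changed, or when enough frames have passed since the last command.
// Names and the threshold scheme are assumptions, not Steamhammer's code.
bool okToReissue(int currentFrame, int lastCommandFrame, int latencyFrames, bool targetChanged)
{
    return targetChanged || currentFrame - lastCommandFrame >= latencyFrames;
}
```

The point is rate limiting: commands sent faster than the game can process them are wasted at best and dropped at worst.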

But how critical is it really? I look at every game that Steamhammer plays. Based on tournament losses, I estimate that if I had fixed the bug before SSCAIT started, Steamhammer’s rank would not be #10 as now, but #7—not much gain considering how closely the ranks are spaced, only a few percentage points up in win rate.

How can such a calamitous bug have so little practical effect? In Steamhammer’s early days, one version had a bug that subtly caused building construction to be delayed. I doubt any stream viewer noticed; I didn’t notice either, until surprised by unexpected losses. Experience and test games proved that it was a critical bug that caused a high rate of losses against early aggression. By comparison, the command jam bug is identifiable as the cause of loss in only a few games, like the one loss against XIMP by Tomas Vajda. In other games where the bug struck hard, as against ICEbot and MadMix, Steamhammer struggled more than it should have but won regardless.

Apparently the bug causes losses only against a narrow range of opponents which play macro games and are strong enough to exploit the weak play that the bug causes. There is no effect against a strong opponent like SAIDA, or against the weakest opponents which lose to Steamhammer’s first 6 zerglings. One explanation is that most opponents either prefer early aggression, or else fall to Steamhammer’s early aggression. Another explanation is that I may underestimate the damage the bug causes; maybe it leads to losses that are not clearly attributable.

forge expand reaction

Why is it still called “forge fast expand”? It was a fast expansion when invented, but by today’s standards it’s not fast at all. That’s why I say “forge expand.” (It has the same number of syllables as FFE).

Though I still have region work to do, today I decided to make an important improvement to how Steamhammer reacts to forge expand and other safe macro openings. The tournament Steamhammer makes 3 attempts to adapt: 1. If the enemy’s opening plan was predicted, it tries to select a good counter opening. 2. Otherwise, having missed the prediction and gone down a poor path, it cancels any planned static defense which is now unnecessary, and 3. makes extra drones to catch up in economy. If it’s still in its opening book, it stays the course and tries to minimize the disruption by changing planned zerglings into drones, which cost the same.

Its plans are still disrupted, though, because the extra drones and the omitted static defense cause minerals to build up. Steamhammer waits until the opening is over before it makes macro hatcheries and otherwise spends down its excess resources, and that is often too late. Zerg can’t keep up with the enemy’s economy and falls far behind.

Today I added 2 new reactions that happen when we want extra drones so that resources threaten to build up: 4. If possible, take gas early (or take another gas early). Putting drones on gas slows down mineral accumulation and may speed up tech openings, so that mutas or lurkers come out sooner. If gas accumulates too much, Steamhammer will stop gas collection, so there’s little downside. 5. Make extra hatcheries as conditions seem ripe. One or all of the extra hatcheries may be placed at expansions, depending on the situation. The rules are more cautious than the macro hatchery rules that apply once the opening is over, because they’re still trying not to disrupt the opening line. The overall effect of the new reactions is that Steamhammer pursues the tech of its opening line, sometimes faster since it has more income, gas, and larvas than expected, and transitions into the middle game in a stronger position. It’s making a big difference in test games, including wins from positions that were sure losses otherwise.
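As a toy decision sketch of reactions 4 and 5 above: when we want extra drones and minerals threaten to pile up, prefer taking another gas first, then an extra hatchery. The thresholds and names are invented for illustration; the real rules are more situational.

```cpp
#include <cassert>
#include <string>

// Toy sketch of the two new reactions. Thresholds and names are invented.
std::string excessMineralsReaction(bool wantExtraDrones, int minerals, int gasesTaken, int basesWithGas)
{
    if (!wantExtraDrones)
    {
        return "none";
    }
    if (gasesTaken < basesWithGas)
    {
        return "take gas";          // reaction 4: drones on gas slow mineral buildup
    }
    if (minerals > 300)
    {
        return "extra hatchery";    // reaction 5: spend the bank on production
    }
    return "none";
}
```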

The fix is inspired by recent losses, especially the 2 losses to Skynet by Andrew Smith when it unexpectedly (to Steamhammer) switched from zealot rushes and DT rushes to forge expand. To my intuition, the forge expand reaction seems much less important than the command jam fix, which is a critical bug fix that affects far more games. And yet, taking into account test games and the rate at which Steamhammer was surprised by macro openings in the tournament, I estimate that it will save about 2/3 as many losses—in terms of improving elo, both seem almost equal. How does that happen?

Apparently you have to measure the severity of weaknesses, because intuition does not seem accurate. Unfortunately, to measure with an A/B test, first you need to fix the weakness. Maybe that is an advantage of machine learning, which does its entire job by measuring weaknesses and correcting them.

a mildly funny bug

Today, trying to track down a mysterious regression, I noticed this condition in an if in CombatCommander::getAttackOrder():

enemy.type == enemy.type == BWAPI::UnitTypes::Protoss_High_Templar

Hmm, it looks a little different than I intended. Since == is left-associative, this parses as (enemy.type == enemy.type) == ..., comparing a constant true against the unit type, so the result has nothing to do with the enemy unit. The effect is that Steamhammer might make a poor decision of which enemy base to attack. The decision is pretty rough anyway, though, so the buggy decision is not much worse.

Gotta love edit slip bugs, they are so creative. In this case, the original intention is also wrong, because high templar are excluded by an earlier check—the condition would have always been false.
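The parse is easy to demonstrate with plain ints: a chained a == a == b does not test a against b at all.

```cpp
#include <cassert>

// == is left-associative, so a == a == b means (a == a) == b,
// which compares the bool true against b rather than testing a against b.
bool chained(int a, int b)
{
    return a == a == b;     // parsed as (a == a) == b, i.e. true == b
}
```

Most compilers will warn about this with something like -Wparentheses, which is one more reason to build with warnings on.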

If only I could find the bugs that matter, instead of the bugs that don’t hurt much....

new features

I’ve been revising the internal structure of Steamhammer’s squads to make them more capable. It’s not nearly as big a rewrite as needed, but it adds new capability and in theory it should be a big win. After several days of work, yesterday I was able to try it out for the first time—it’s not completed, but it’s finished enough to work as designed in common situations, so I can test it in real games.

After fixing one severe bug, it worked as intended. And—it played horribly!

Adding features decreases strength, fixing bugs increases strength. I believe it more than ever. It’s not that new features bring new bugs, at least not necessarily. A new feature is not well tuned yet, not integrated with other features of the bot. If it’s a big feature, it tends to disrupt successful patterns of play that arise from the interaction of the existing features. When a new feature is a good idea, it takes time to tune it up and make it successful.

I’m convinced my new feature is a good idea, or I wouldn’t be working on it. It’s an open question whether it will be successful in time for AIIDE.

a parable of optimizing the software development process

I habitually take a top-down view of things (I’m a lumper, not a splitter), and I want to use an example to draw a comparison. The example is: Steamhammer’s production queue data structure is inherited from UAlbertaBot. It looks like this (I left out the constructors).

struct BuildOrderItem
{
    MacroAct macroAct;  // the thing we want to produce
    bool isGasSteal;
};

Steamhammer renames UAlbertaBot’s MetaType (which might mean anything) to MacroAct, and extends it with a MacroLocation to specify where things are to be built (and other changes).

I want to simplify this, and it should be easy to understand why. Stealing gas amounts to building a refinery at a given location—it should be specified as a MacroLocation, not as a separate flag outside the MacroAct. I can drop the wrapping BuildOrderItem data structure and keep a queue of straight MacroAct values. There will be one less level to the data structure, a distinct improvement.
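The simplified design might look like this. The enum values and fields are illustrative stand-ins, not Steamhammer's actual code: the point is that gas steal becomes one more MacroLocation and the wrapper struct disappears.

```cpp
#include <cassert>
#include <queue>

// Sketch of the simplification: fold the gas steal flag into MacroLocation
// so the production queue holds plain MacroAct values with no wrapper.
enum class MacroLocation { Anywhere, Main, Natural, Expo, GasSteal };

struct MacroAct
{
    int unitType;               // stand-in for BWAPI::UnitType
    MacroLocation location;     // where to build it; GasSteal covers the old flag
};

// One level flatter than std::queue<BuildOrderItem>.
using ProductionQueue = std::queue<MacroAct>;

bool isGasSteal(const MacroAct & act)
{
    return act.location == MacroLocation::GasSteal;
}
```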

On the one hand, it simplifies the code and it’s a straightforward, low-risk refactoring. There are many uses of BuildOrderItem and it will take time to handle them all, but the compiler will tell me if I forget to rewrite one of the uses. On the other hand, it doesn’t affect play in the least. The production queue will behave the same, and it is not a heavily used data structure that affects performance.

So from a project management perspective, what’s the best time to carry out the work? The sooner you do a code simplification, the longer you benefit from it. But it will take time, and I want to concentrate on play improvements for AIIDE. What’s the best timing?

Well, it’s not an important question. I might do it before AIIDE or after, and it won’t make a big difference either way. But I find it interesting to think about. Optimizing the development process is similar in concept to optimizing Starcraft play: It is about taking actions, sometimes tiny actions, that improve efficiency over time. It requires similar subtle judgment calls. Take actions with big enough cumulative effect, and you pull ahead; you have to decide which actions are worth your limited time. To me, taking the abstract view, the big difference is that in software development you get more time to think about it—coding is a slower-moving game.