
next for Steamhammer

I’ll be making another release soon, to fix a few of the most irritating of the longstanding bugs that affect terran and protoss. After that I’ll prepare for AIST S4. I want to have fun implementing secret new skills, and I have thought of a suite of them. By AIST submission time, BASIL should have collected a moderate amount of opening data for me (as I write, it has collected a total of only 112 records, covering a small fraction of Steamhammer’s openings). I’ll be able to compare the wild BASIL data with my tame systematically-collected local data. What happens next depends on how that looks, but I may release another data collection version to fill in spots without enough data; that could harm elo, but it’s in a good cause.

This should be a productive year.

Next: Steamhammer’s experience fighting cannon rushes.

Steamhammer’s opening data

As promised, more on the opening timing data. Here’s the data I decided to collect. I hope I chose the most important data, because I expect that it will take a long time to collect a full set and I’d rather not have to redo it from scratch. I will have to update it over time as Steamhammer evolves, though.

I plan to play local games to systematically collect complete data, but that can’t be the only approach. Steamhammer needs real world experience with games against all kinds of opponents, so it can see the ways that its openings break down under stress. I think that will be key for playing well. That’s why I uploaded this data collection version of Steamhammer. I’ll see how it goes, but I may later upload a version that (heedless of its rank) chooses little-played openings to fill out the dataset.

By the way, this sounds like a lot of data when I describe it here. But it’s only a few lines of numbers. It’s not much.

• Opening timings: For each game, info about opening events, on Steamhammer’s side only. Ten numbers giving key info about the results of the build as of the time the opening line ends, plus the frame of completion of each tech building, each upgrade, and each tech researched. The completion times are recorded until 1500 frames after the end of the build, to give opening stuff time to finish and to see some events from the start of the middle game.

number of the first frame out of book
the number of workers alive (how strong is the economy?)
mineral cost of all combat units produced (how strong is the army?)
gas cost of all combat units produced
unspent minerals (did the opening build use resources efficiently?)
unspent gas
count of barracks, or gateways, or hatcheries/lairs/hives (what is the production capacity?)
count of factories, or robo facilities, or 0 for zerg
count of starports, or stargates, or 0 for zerg
the number of bases, including incomplete bases
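
As a concrete illustration, one such record could be held in a struct along these lines. This is only a sketch with names I made up; it is not Steamhammer’s actual code.

#include <map>
#include <string>

// Hypothetical sketch of one opening timing record; field names are mine.
struct OpeningTimingRecord
{
    int frameOutOfBook;        // first frame out of book
    int workerCount;           // economy
    int combatMineralCost;     // army: mineral cost of combat units produced
    int combatGasCost;         // army: gas cost of combat units produced
    int unspentMinerals;       // did the opening use resources efficiently?
    int unspentGas;
    int tier1Production;       // barracks / gateways / hatcheries+lairs+hives
    int tier2Production;       // factories / robo facilities / 0 for zerg
    int tier3Production;       // starports / stargates / 0 for zerg
    int baseCount;             // including incomplete bases

    // completion frame of each tech building, upgrade, and research,
    // recorded until 1500 frames after the end of the build
    std::map<std::string, int> completionFrame;
};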

The 10 numbers outline the strategic state of the game at the end of the book line, and the completion times say what our tech is and when we got it. The idea is to save info about how the opening actually went, rather than how the book said it was supposed to go. Events will vary depending on the map, our reactions to the enemy strategy, the effects of fighting, random bobbles, and who knows what. My plan is to collect all the records for each opening (across all opponents), and reduce them to summaries for each matchup that show how the opening typically goes and the range of variation. The summaries can be stored in a static file read at initialization time. Add in win rates for each build and matchup, and Steamhammer will have a sound basis to choose openings against unfamiliar opponents. Choosing based on data will surely be better than choosing based on numbers I made up and typed into the config file.
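
To sketch the kind of reduction I mean (illustrative only, not the planned implementation): for each opening and matchup, each of the 10 numbers can be boiled down to something like a median and a range across the collected games.

#include <algorithm>
#include <vector>

// Hypothetical summary of one field of the record across many games.
struct FieldSummary
{
    int median;
    int low;
    int high;
};

// Assumes at least one value has been collected.
FieldSummary summarizeField(std::vector<int> values)
{
    std::sort(values.begin(), values.end());
    FieldSummary s;
    s.low = values.front();
    s.high = values.back();
    s.median = values[values.size() / 2];
    return s;
}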

Steamhammer also keeps data about enemy play, so that it can choose openings based on the current opponent. This info is specific to the opponent, so it doesn’t need to be summarized into a separate file; it lives on in the game records.

• Record the frame that each enemy unit type was first spotted, for the entire game. This summarizes a fair amount of the scouting info. It doesn’t tell how many production facilities of each type the enemy has when, but Steamhammer can’t scout that properly anyway. Info about proxies and contains is already kept in another field of the game records.
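
As a sketch of the idea in BWAPI terms (my names, not the actual skill’s code): record the frame each enemy unit type is first seen, and never overwrite it.

#include <BWAPI.h>
#include <map>

// Hypothetical: first frame each enemy unit type was spotted, for the whole game.
std::map<BWAPI::UnitType, int> firstSeenFrame;

void noteEnemyUnit(BWAPI::Unit enemyUnit)
{
    // insert() does nothing if the type is already recorded,
    // so only the earliest sighting is kept.
    firstSeenFrame.insert({ enemyUnit->getType(), BWAPI::Broodwar->getFrameCount() });
}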

• Record the first two “significant” battles of the game. I modified the combat sim to save a count of the number of enemy units of each type for the largest battle that it has been asked to resolve so far. The first significant battle is one with at least 8 supply of enemy units. Since BWAPI doubles the supply, that means 4 marines, 2 zealots, or 8 zerglings, or the equivalent. Static defense takes 0 supply and is not counted, which may be a mistake. The second significant battle is one with at least 20 supply of enemy units. When we first notice a significant battle, we don’t record it right away, but wait 3 seconds to see if more enemy units appear. (3 seconds may be too low. Maybe 5 to 10 would be better.) Since the combat sim remembers the biggest battle, no information can be lost.

For each of the 2 battles, we keep the frame the battle started (saved by the combat simulator; it’s the actual frame, not 3 seconds delayed), and the distance of the center of the battle from our starting base in tiles, and the same distance from the enemy base (or -1 if the enemy base is not located). And we keep the count of unit types. Since these are very early battles, usually there are only 1 or 2 enemy unit types present. If the enemy is overrun without putting up much fight, then only one battle or no battle may be recorded.
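
For illustration, the kept data amounts to something like this (a sketch; the names are mine):

#include <BWAPI.h>
#include <map>

// Hypothetical record of one early "significant" battle.
struct BattleRecord
{
    int startFrame;                                   // actual start frame, from the combat sim
    int tilesFromOurBase;                             // distance of the battle center from our start
    int tilesFromEnemyBase;                           // -1 if the enemy base is not located
    std::map<BWAPI::UnitType, int> enemyUnitCounts;   // usually only 1 or 2 types this early
};

// Thresholds, in BWAPI's doubled supply units.
const int firstBattleSupply  = 8;    // 4 marines, 2 zealots, or 8 zerglings
const int secondBattleSupply = 20;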

The unit type timings tell us when we have to be ready for each enemy capability to come online, if the enemy repeats the same strategy. And the battles tell us something about how the enemy behaves, including whether the first fights were near our base or the enemy’s. To choose openings against a familiar opponent, we can check the timings of enemy play from the game records against the timings of our openings from the summary file for the matchup. If we can’t predict the enemy’s strategy, then we can check the range of play and hopefully find safe options. If we can predict, maybe we can find a direct counter.

Info about enemy timings is useful for reactions throughout the game, not only for choosing the opening. “XIMP’s carriers should be arriving around frame x, start the ranged attack upgrade in time to answer that.” And it can be used to automatically develop new opening lines to counter specific enemy builds. That feature is probably far in the future, but I’m looking forward to it.

In overview, the idea is to compare a model of our actual play against a model of the enemy’s actual play to figure out what is likely to work. That means understanding the game, at least for a limited value of “understanding”. It should be way more effective than trying stuff at random until you hit a winner.

Steamhammer 3.4 uploaded

Steamhammer 3.4 is uploaded. The previous version was uploaded only 6 days ago, so changes are few. Nothing here should have any big effect on playing strength. Most of the work went to opening data, but I also made some other tweaks, mostly affecting queens. I turned on the zerg strategy boss info display for stream watchers.

opening timing data

Steamhammer records data useful for choosing openings in its game files, in a way similar to my original post. In later games it parses the saved data into a form that the rest of the program can use, but nothing uses the data yet. I have to collect enough first. I’ll post details later.

• New skill “battles” added using the skill kit. It records summary info about the first 2 “significant” battles of the game, by its own definition.

• New skill “opening timing” added using the skill kit. It records summary info about the results of the opening played.

• The “unit timings” skill, which records scouting data about the enemy, is updated slightly to mark enemy buildings whose completion time is known, because the building was seen under construction. Most buildings are scouted after completion, so it can only record the scouting time. Known times are negated; the minus sign is the flag.
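
As a tiny sketch of the encoding (illustrative names, not the actual code):

// A negative frame means the completion time is known because the building
// was seen under construction; a positive frame is only the scouting time.
int encodeFrame(int frame, bool completionKnown)
{
    return completionKnown ? -frame : frame;
}

int  decodeFrame(int stored)        { return stored < 0 ? -stored : stored; }
bool completionIsKnown(int stored)  { return stored < 0; }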

scouting

• As I predicted, I have further relaxed the condition to release the scout worker early. The location of the enemy base need no longer be known. If the worker still sticks around at the front beyond when it will see anything new, I may loosen the condition one last time.

• An early game overlord—usually the second one—now scouts the main and natural, instead of only the main. Steamhammer should notice cannon contains and some proxies earlier, at least on average.

queens

• Steamhammer formerly limited its queen count until queen research was started, either ensnare or broodling. I was afraid that it might spawn a bunch of queens and then change its mind and have no use for them.

• I dropped the mechanism that used to leave a time gap (30 seconds) before remaking queens when all of them were destroyed. I decided that if it overmakes queens and needs to learn better, then I’d prefer a more general mechanism.

• Be more willing to get ensnare versus protoss.

• If terran doesn’t have any bases left, don’t make a queen in hopes of infesting a command center. It looked silly.

• Steamhammer is slower to research queen energy. I thought it was spending on that too often. It now insists on either having at least 4 queens, or having a target of at least 4 queens. I doubt it’s worth it even then. I’ve seen analyses which try to figure out the number of queens before it is economically worthwhile to buy +50 queen energy, and they came up with 8 to 12 as the minimum.

• The strategy boss debug display shows Steamhammer’s target queen counts for each purpose, parasite, ensnare, and broodling. If queens are desired, the line that used to say something like “0/1 queen” (zero queens made of a target of one queen) now says “0/1 queen (1 0 0)”, with the three numbers in parentheses. The overall target number of queens is the maximum of those three numbers. Steamhammer will try to research both ensnare and broodling if it wants queens for both purposes.

SSCAIT Steamhammer-PurpleWave games

Those who saw today’s stream of the second half of the SSCAIT round of 16 in the elimination bracket may be wondering why Steamhammer played the same losing opening three times in a row against PurpleWave. It’s easy to explain.

Before the tournament, PurpleWave was playing forge expand game after game against Steamhammer. Steamhammer experimented and found that it could win with 9 gas 9 pool, which is a strong ZvZ one-hatch mutalisk build. It is not a strong ZvP build, but before the tournament, the fast mutas scored 13-0 against the fast expands, not a single loss. PurpleWave was not ready to defend against air.

In the round robin phase, PurpleWave again opened the first game with forge expand, and lost to the one-hatch mutas. The score was 14-0. But PurpleWave had been updated for the tournament, and forgot whatever bug or other fixation had caused it to stick to the same strategy. In the second game, protoss varied, and the score went to 14-1. PurpleWave is ranked higher than Steamhammer and usually wins. 14-1 was statistically far and away the best available opening, so Steamhammer stuck with it. It saw during play that it had gotten into trouble and tried desperately to save itself, but the mischief came in the early opening and there was nothing for it; at best it could have lost more slowly.

An RPS analyzer that better predicts the enemy’s opening plan could have helped. Steamhammer would also have to pay more attention to the enemy’s predicted strategy; it deliberately ignores the prediction against long-familiar opponents. The only general way that I know for a bot to be sure that it’s time to switch plans is for it to model the game events and understand why it lost (“oh yeah, you can beat that every time”), which is much more powerful than comparing statistics. I hope to do that eventually, but it’s beyond the state of the art for now. Anyway, opening timing is first, and that feeds in too.

stealth bugs

Early on in Steamhammer’s history, one version introduced a bug that caused small delays in starting the construction of buildings. You couldn’t really tell by watching games unless you knew what to look for, but results were worse and it was clear that something was wrong. It was a stealth bug: The visible symptom was that play was worse, and the cause was hard to discern. I was only able to narrow it down and solve it by knowing what I had changed recently.

Another version had a problem with zergling micro that dragged down results for months. I did not see the cause. I had to watch a great many games to train my eyes to see that zerglings were not as effective as they had been earlier, to tell which fights were going the wrong way, and then I could finally ferret it out. Zerglings are numerous; I remember focusing on the actions of a few specific lings in a fight and being satisfied, but it was an illusion that didn’t hold up when I learned to better follow the movement of the whole fight.

Most recently the latency compensation issue that caused Steamhammer to try to re-use larvas was hard to notice. I noted that Steamhammer was opening with 11 hatchery strangely often, but it can do that so I didn’t smell a bug. And the BASIL ranking was a little disappointing, but that might be because of Monster and other improving opponents. It was only when I was writing a new opening and watching production closely that I saw the blip and could start tracing it. After the fix and its friends were uploaded, Steamhammer’s ranking quickly climbed back to the range I had originally expected.

How many stealth bugs have I never noticed? I have no shortage of known bugs and severe weaknesses on my to-do list, so in a way it doesn’t matter. And yet surely I could do better if I knew more.

What is a stealth bug to one person might be an obvious problem to another. A game has more going on than a mere human can absorb in one viewing, and people pay attention to different aspects. (Flash is presumably not a mere human.) I remember a lurker target-fixation bug that Steamhammer shared with Microwave. I thought the bug was noticeable, since I pay attention to lurker micro, while Microwave’s author MicroDK had not realized it was happening.

In the end I guess there’s no moral to the story, other than to test rigorously, vigorously, and variously, which is standard software advice anyway. Pay attention to performance regressions and dig up their roots. It can never hurt to run tournaments at home.

12 gas 11 pool build orders

I counted Steamhammer’s zerg openings, and found 222. Most of those I wrote as variations of standard builds, or when inspired to copy somebody else’s build ideas, or to systematically fill out a range of possibilities. A few are novel ideas that I had myself. Steamhammer’s new 12 gas 11 pool builds I have not seen elsewhere, though Steamhammer did already have a build named ZvZ_12Gas11Pool which is similar (it was an example of filling out a range of possibilities).

Why did I write over 200 opening builds? It’s obvious that Steamhammer can’t effectively use that many, because it doesn’t know how to choose between them without trying each one repeatedly. Against a difficult opponent, Steamhammer would have to play thousands of games to be likely to make the best choices, and even then it would risk being wrong—the opponent is learning too. I wrote them all anyway, because I have been planning from the beginning to make learning smarter. If the opening timing project works as well as I hope, the bot will be able to quickly zero in on promising builds from its large library, perhaps turning up surprising but effective tries. Compare my post Precomputing a catalog of build orders from April 2016, before I started work on Steamhammer.

11 gas 10 spawning pool is a standard zerg opening stem. It continues into fast one-hatchery mutalisks for ZvZ (implemented in Steamhammer in version 1.0 in January 2017 and for a long time Steamhammer’s favorite ZvZ build), or into one-hatchery lurkers for other matchups (mainly useful against terran). The timings have a good feeling, because when the spawning pool finishes, you can start your lair immediately and simultaneously start zerglings. You’re spending a lot to get fast tech, so the second hatchery comes much later.

go to 11 drones
extractor
spawning pool
2 more drones
when the pool finishes:
lair
3 pairs of zerglings
(continue with spire or lurkers)

I noticed that if you squeeze in a couple drones and drop one of the zergling pairs, you can get a second hatchery right after without delaying your tech. The only delay is in getting one extra drone before the extractor (you do have to wait a bit for the larva to spawn); after that, the build is every bit as smooth as the classic version. For the cost of that delay plus a lower number of zerglings, you get a stronger economy and a second hatchery, with a much wider selection of follow-up plans.

go to 12 drones
extractor
spawning pool
3 more drones
when the pool finishes:
lair
2 pairs of zerglings
hatchery
(continue with spire or lurkers)

The build shares ideas with 12 pool builds, where you use the high number of early drones to get a second hatchery, but gets the lair before the second hatch. It also shares ideas with 2.5 hatch muta builds, where you get your tech first and add the extra hatchery along the way. You could think of it as a 1.5 hatchery lurker/muta build.

I prefer the lurker version. Steamhammer wins a lot of games versus terran with 1 hatch lurker builds, but it is not skilled at using its lings along with the fast lurkers. And many terrans stay safe at home anyway, so that lings have little value before lurkers; all they can do is keep an eye on the front line. Getting an extra hatchery and stronger economy seems like a good tradeoff.

Here are the timings of some of Steamhammer’s lurker builds. The lurker column is the number of lurkers made in the first wave, for an initial attack (Steamhammer’s lurkers always attack). The time given is the frame that the first lurker hatches from its egg, median of 3 runs.

build                    second hatch    lurkers             frame
11 gas 10 pool lurker    after lurkers   3 then 1 more       7405
12 gas 11 pool lurker    after lair      4 at once           7551
12 pool lurker           before lair     5, as gas allows    7785
12 hatch lurker          before pool     5 at once           8513

Steamhammer 3.3.7 fixes crashing bug

Oops, yesterday’s upload included a crashing bug in the changes to base defense, ignoring proxy pylons when appropriate but not ignoring all the side effects of ignoring pylons. I’ve uploaded a new version 3.3.7 to fix the bug. Let’s see if there are any more... occasionally I have whole strings of new crashing bugs.

Steamhammer 3.3.6 uploaded

Steamhammer 3.3.6 is uploaded, along with matching Randomhammer. It is primarily a bug-fix release, and many of the bugs are noted here. It includes some work on opening timing, but nothing finished, so no writeup yet.

plan recognition

• I instrumented the code that incorrectly decided that Wuli was doing a “Fast rush”, and did not find a logic error. It behaved as designed. I tightened up the thresholds so that a rush has to be faster to be recognized as “fast”.

tactical analysis

• Don’t assign defenders to destroy proxy pylons, or proxy supply depots, or proxy engineering bays, unless there are no other enemies to defend against. As an example, if there’s a pylon and a zealot, call in 4 zerglings to defeat the zealot, and ignore the pylon for the moment. (Formerly, Steamhammer called in 6 zerglings.) If there is a pylon and no other enemy, call in 1 zergling to destroy the pylon. This more frugal defense is also an important improvement to cannon rush defense: Steamhammer does not assign defenders to defeat the cannons, but relies on its anti-cannon sunken to hold them at bay, so that mobile units can instead go attack the enemy base. (Units made outside any cannon containment should have a clear path to the enemy, and another part of cannon rush defense is to build an outside base, diverting the scout drone if necessary.) This change means that Steamhammer also does not assign defenders to defeat the pylons powering the cannons, which was allowing the enemy base to survive too often. To be sure, blunders in unit movement count more.
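
A rough sketch of the rule (simplified, with made-up names; the real logic weighs more factors):

// Hypothetical: how many zerglings to call in against a proxy.
// Ignores proxy buildings whenever there are combat units to fight.
int zerglingsToCall(int enemyCombatUnits, int harmlessProxyBuildings)
{
    if (enemyCombatUnits > 0)
        return 4 * enemyCombatUnits;     // e.g. 4 zerglings for a lone zealot; ignore the pylon
    if (harmlessProxyBuildings > 0)
        return 1;                        // 1 zergling is enough to clear an undefended pylon
    return 0;
}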

• The base defense hysteresis is slightly increased, and the in-zone defense radius is increased significantly. This should ameliorate some chasing misbehaviors, such as when mutalisks seem to forget that vultures are in the base.

• base->inOverlordDanger() added: it says whether any overlords at the given base are at risk. Used in deciding where to spawn fresh overlords (see below).

• base->inWorkerDanger() tightened up. The enemy must be 2 tiles closer before the workers are considered to be in danger. Workers have been panicking too early, losing too much mining time. More improvements are needed, but this simple tweak should help.

production

• The bug of trying to morph the same larva twice is fixed. This is a big one. Steamhammer now does its own tracking of morph orders, bypassing BWAPI. The fix brought a side-effect bug that made it impossible to morph a lair into a hive, because the lair remembered its previous morph order from hatchery to lair and refused to repeat it to become a hive. I fixed that by clearing the morph orders of buildings when they complete. (Morph orders of units change immediately after they complete, because units are given orders right away.)
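
A sketch of the kind of bookkeeping involved (my names and simplification, not the actual code): remember the last morph ordered for each unit, refuse to repeat it, and clear a building’s entry when it completes so that a lair can later become a hive.

#include <BWAPI.h>
#include <map>

// Hypothetical morph-order tracking, independent of BWAPI's own order state.
std::map<BWAPI::Unit, BWAPI::UnitType> lastMorphOrder;

bool tryMorph(BWAPI::Unit u, BWAPI::UnitType target)
{
    auto it = lastMorphOrder.find(u);
    if (it != lastMorphOrder.end() && it->second == target)
        return false;                  // never issue the same morph order twice
    if (u->morph(target))
    {
        lastMorphOrder[u] = target;
        return true;
    }
    return false;
}

void onBuildingComplete(BWAPI::Unit building)
{
    lastMorphOrder.erase(building);    // a finished lair may later be ordered to morph into a hive
}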

• Selecting which larva to morph is smarter. The routine had to be rewritten anyway to solve the morph-twice bug. Formerly, Steamhammer did an ad hoc analysis of the main hatchery of each base when deciding where to make drones, so it could make drones at a base which needed them, and otherwise preferred to spawn units at whatever hatchery had the most larvas (a hatchery with 3 larvas cannot spawn more, so you want to use those first). Now it follows a more general 2-step procedure. First, it decides which bases are best for the production, sorting them into priority order. Second, it scores every larva for nearness to a priority base and for hatchery larva count. A larva that does not belong to any hatchery (because its hatchery was destroyed) gets a higher score (it is not likely to live long), provided the enemy is not near; the enemy just destroyed the hatchery, so they will likely hit any stray units too.
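
As an illustration of the scoring step (a sketch only, with invented names and weights):

#include <BWAPI.h>
#include <vector>

// Hypothetical larva score: higher is better. priorityBases is the result of
// step 1, bases sorted best first. The weights are made up for illustration.
int scoreLarva(BWAPI::Unit larva,
               const std::vector<BWAPI::Position> & priorityBases,
               bool enemyNearby)
{
    BWAPI::Unit hatchery = larva->getHatchery();
    if (!hatchery)
        return enemyNearby ? 0 : 100;            // orphan larva: use it before it dies

    // Prefer hatcheries that already hold more larvas (a full hatchery can't make more).
    int score = 10 * int(hatchery->getLarva().size());

    // Prefer larvas near a high-priority base.
    for (size_t i = 0; i < priorityBases.size(); ++i)
        if (hatchery->getDistance(priorityBases[i]) < 10 * 32)   // within ~10 tiles
            score += 50 - 10 * int(i);

    return score;
}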

• Base priority is implemented for drones and overlords. For drones, all base hatcheries are considered, not only the main hatchery, and a base that is under attack is avoided when possible. This is a small improvement over the former behavior, and will become more important when Steamhammer distributes hatcheries better, as I plan. For overlords, if the enemy has wraiths or corsairs or other overlord hunters and we have spores to defend, bases with a spore colony get priority. A base under active attack by anti-air units is avoided when possible. This is a substantial improvement and will save games.

micro

• Drones no longer burrow in reaction to cannons, or to a nearby bunker or sunken colony. It was an oversight in the original implementation. The drones would burrow just out of range and remain burrowed unless and until the enemy static defense was destroyed, which is not a useful behavior.

• Fixed a bug causing double commanding of overlords.

• Fixed bugs causing double commanding of drones to burrow or unburrow. It’s not harmful (the commanding infrastructure safely swallows the extra command), but I fixed it anyway. Or mostly fixed it: double commanding to unburrow can still happen.

scouting

• In the last version I added a feature to release the scouting worker under tight conditions that ensured that the scout was only released if it was unlikely to be able to scout anything more. A drone stalled from getting into the enemy base by cannons would be released when zerglings arrived to keep watch, for example. I always expected that the conditions would turn out to be too tight, and I was right. I dropped the condition on the distance to the enemy base, so that (for example) a drone chased across the map by enemies could be released if zerglings showed up to save it. I will probably loosen the conditions further in the future.

zerg

• Fixed ZvZ emergency spores, which were broken by an integer overflow bug.

• In an emergency, build sunken colonies or spore colonies even if the drone count is low. Keeping the last 5 drones alive is more important than making another drone because you only have 6.

• Don’t break out of the opening versus cannons unless it is necessary to make a faster spawning pool (to make a sunk to hold the cannons at bay). This is a bug fix. It was always intended to work that way.

• Place any anti-cannon sunken colony near the ramp to the main when possible. If the base has a corresponding main base (it is the natural of some main), then prefer sunken positions near the entrance to the main. This is another advance in cannon rush defense. It will be harder for the cannons to creep past the natural and into the main, as MadMixP and Juno both like to do, and it is generally good to defend the ramp anyway. Steamhammer’s cannon rush defense has grown complex, with many necessary skills that add resilience, and yet is still easy to overcome. I should post about that.

• Limit scourge production more tightly yet. I’ve improved this over and over, and it still makes trouble.

• Tune down the scourge priority for valuable targets, except for helpless guardians and cocoons. Scourge were too often trying to fly past the corsairs to reach the carriers, and not making it.

• Spawn all desired queens at once, instead of waiting to order them until resources are available. The way the production rules were ordered meant that more than 2 queens were almost never made, even when more were on order. This should make broodling more viable, but it’s mostly for fun.

openings

• I added new openings 12Gas11PoolLurker and 12Gas11PoolMuta. They’re more interesting than you probably guess, so I’ll post separately about them.

Steamhammer’s new bugs

From watching tournament games, I wrote down about 20 to-do items. A couple are ideas to try, but most are bugs, and 8 are critical bugs that I want to fix fast. I thought I had left Steamhammer in a more reliable state. :-(

Steamhammer lost its tournament record of no losses due to crippling bugs when it lost both games to Wuli. In both games, it misdiagnosed the protoss opening as “Fast rush” instead of “Heavy rush” and broke out of an opening that would have been strong, reacting as though it were facing a surprise 6 gate rather than 9-9 gates. You can’t survive a blunder like that against an aggressive opponent. I don’t see the error in the code, but I’ll work it until I find it.

Two winning games were even worse, against cannon bot Juno by Yuanheng Zhu. Steamhammer started out playing a build that had been 100% successful forever, then panicked and broke out of the opening when it saw the cannons. I think that somehow a bug got into the code that decides whether the current build is good for the situation and should not be interrupted. Having broken out of the opening, Steamhammer relied on the strategy boss, which made choices much inferior to the planned opening. In one of the 2 games it played well enough to eventually bust the cannons and win, but the other game, on Python, was plagued by further severe bugs. If you know Starcraft at all, this game will offend your sense of esthetics, and possibly your sense of ethics; it is so bad it is immoral. Steamhammer smuggled a drone to a hidden expansion, one of few successful actions in the game. It built the expansion after a long delay—there is probably another bug there—then sent nearly every drone made at the outside hatchery back to the main so that it died to the containing cannons—definitely a bug, and not one I’ve seen before. Steamhammer also burrowed drones that came near cannons, so that it eventually stopped mining (even gas mining) because every surviving drone was burrowed. And other bugs. So many bugs. :-( Zerg won the game on points because it mined less and because Juno had its own misbehavior, repeatedly building cannons in sunken range where they died instantly.

The rich player usually loses if the game times out, because the rich player made more stuff and it died. And you want to be the rich player. So win your winning games, don’t let them time out!

Steamhammer won its games versus XIMP by Tomas Vajda with its usual seeming ease, but there was a bug there too. This bug I’d seen before, but I only realized the cause in watching the tournament. Steamhammer had a partial production freeze, where it built nothing but scourge and zerglings for a period, losing ground compared to its regular production. The cause is that the reactive behavior of building scourge and the fallback behavior of adding zerglings when the situation allows were jointly blocking the regular behavior of refilling the production queue. Only special case units were being made, not regular units.

The bugs above are severe, but only affected a few games each. Together, they may have cost Steamhammer one rank in the tournament, if that much. I think the latency compensation bug is more severe, even though I can’t point to a game that Steamhammer definitely lost due to it. It affects far more games, and one of the effects is to drop one drone in 12 pool or 12 hatchery builds. Being one drone short of plan, starting early in the opening, is a serious handicap. It also sometimes drops planned zerglings or other units. Many openings have their timing disrupted, so that research is not accomplished or unit counts are reached late, causing further misplays.

I fixed the latency compensation bug yesterday. To do it I had to completely rewrite the routine that chooses which larva to morph into the next unit, so I took extra time and added features. Now it is more general: it is divided into one step that decides which bases are better locations for the next unit, and a second step that tries to find a larva near one of those bases. The old version already knew that it’s better to make a drone at a base which needs more drones. I taught it in addition that if there are corsairs or wraiths flying about, and we have spores to defend, it’s better to make overlords at a base with a spore.

As mentioned here, I plan to release a bugfix version Steamhammer 3.3.6 first, then get back to work on my real project of 3.4. I’m often wrong in strength estimates, but even so, if I can fix the critical bugs I think 3.3.6 should be stronger by 50 elo points.

solidity in AIIDE 2020 - part 5

A little more on daring/solid before I post about Steamhammer’s bugs.

Are the numbers reliable? Are results repeatable? If I measure another competition, will the solidity measure of the same bots come out similarly? If I measure A as more solid than B, is it true? Does solidity mean anything?

Statistically, there are two parts to the question. One part is, given a fixed set of bots, what is the spread of the solidity numbers? How many games do you need to feel sure you can tell A is more solid than B? In principle, that can be answered mathematically by running probability distributions forward through the calculation. That would be a useful exercise anyway, since it could suggest better calculations. But it turns out that I am not a statistician, and I don’t want to do it. It might be easier to answer by Monte Carlo analysis: Simulate a large number of tournaments, and see the spreads that come out.

The other part is, how does repeatability vary as the participants in the tournament vary? Will the same bot get a similar solidity number in a tournament where many of its opponents are different? What if it is a tournament where the average bot is (say) more solid than those in the original tournament? Are there other player characteristics that might make a difference? Does repeatability improve as the number of participants increases, as you would expect? That can also be answered by Monte Carlo analysis, but you’ll have to make more assumptions about how players behave. I don’t see any substitute for analyzing actual past tournaments, at least as a first step to understand the important factors.

I will analyze past tournaments, but not now. For the moment, I think my intuitive answers to both parts of the question are good enough: AIIDE 2020 does have enough games that the bots can likely be ordered by solidity without big mistakes, and it does not have enough varied participants to be sure that a solidity measurement from one tournament is useful for predicting the next one. In time, I want to automate the whole calculation and include it in my suite of tournament analysis software, so I can report on it as a matter of course. Right now I’d rather let the ideas simmer for a while and see if something better can be cooked up.

But mainly I want to get back to Steamhammer and make it great!

solidity in AIIDE 2020 - part 4

I computed what I decided to call the upset deviation, which you can take as the average deviation of actual from expected win rate due to upsets. An upset pairing I defined as one where you outscore a stronger opponent (do better than expected) or underscore a weaker opponent (do worse than expected). Theoretically, smaller numbers are more “solid” and bigger numbers are more “daring”. The table also carries over the rms deviation from yesterday.

To summarize the procedure I followed:

1. Compute elo ratings for each participant in the tournament.
2. Using the elo ratings, compute the expected win rate for each pairing.
3. For each pairing, the difference between the actual tournament result and the expected win rate is the deviation.
4. Square each deviation and calculate the sum of the squares.
5. Extract the pairings which are upsets and calculate the sum of those squares.
6. The upset ratio is the sum of the upset squares as a fraction of the entire sum of squares.
7. For each participant, compute the rms deviation, a kind of average of the deviations. (RMS stands for root mean square: square each number, find the arithmetic mean of the collection, then restore the original scale by taking the square root of the result.)
8. Multiply the upset ratio by the rms deviation to get the upset deviation.
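
Written out as formulas (my notation: d_i is the deviation of actual from expected win rate for pairing i out of n pairings, and U is the set of upset pairings):

\[
\text{rms deviation} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}, \qquad
\text{upset ratio} = \frac{\sum_{i \in U} d_i^2}{\sum_{i=1}^{n} d_i^2}, \qquad
\text{upset deviation} = \text{upset ratio} \times \text{rms deviation}
\]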

bot            rms deviation   upset ratio   upset deviation
Stardust       7.6%            74.9%         5.7%
PurpleWave     11.3%           39.6%         4.5%
BananaBrain    9.9%            65.2%         6.5%
Dragon         15.5%           72.9%         11.3%
McRave         13.6%           10.1%         1.4%
Microwave      15.5%           23.5%         3.6%
Steamhammer    16.9%           71.2%         12.0%
DaQin          21.2%           63.8%         13.5%
ZZZKBot        23.7%           65.7%         15.6%
UAlbertaBot    9.0%            27.8%         2.5%
WillyT         15.7%           33.8%         5.3%
Ecgberht       14.7%           52.7%         7.8%
EggBot         6.8%            69.6%         4.8%

The upset ratio has some interest in itself, so I included it. It doesn’t say how big the upsets were, it says what proportion of the (squared) deviations were due to upsets. You have to interpret the percentage as a ratio. The upset deviation then also recognizes how big the upsets were. In this case, you interpret the percentage as the average deviation from expected win rate due to upsets. The whole procedure is ad hoc and of questionable rigor but all the steps are logical and the results make sense to me. Can anybody suggest an improved method?

By this metric, Dragon, Steamhammer, DaQin, and especially ZZZKBot are the “daring” players in this group. McRave and UAlbertaBot are the most “solid”.

Next: Steamhammer’s bugs.

Randomhammer reuploaded

I noticed that Randomhammer was briefly re-enabled on SSCAIT. I had disabled it before the tournament by uploading an empty zip file. So when Randomhammer played a game versus Feint... it wasn’t much of a game. The bot can only play when it exists.

So I reuploaded Randomhammer 3.3.5. It’s exactly the tournament version of Steamhammer, playing random.

solidity in AIIDE 2020 - part 3

I tried to be clever and make a solidity metric that was also a statistical test, so that it was easy to tell when the numbers were meaningful and when they were just noise. It didn’t work the way I wanted. Then I tried to be clever differently, and converted the results to elo differences, so that the metric ended up as a difference in elo. It’s easy to understand, easy to work with, and mathematically sound, because elo is linear. But the small change from a win rate of 98% to a win rate of 99% corresponds to the big elo difference of 122 points, so the measure was dominated by opponents at the extremes, blowing up the statistical uncertainty. OK, enough! Enough cleverness! Do it the easy way!
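
For the record, the arithmetic behind that example, using the standard elo relation between rating difference and expected win rate p, namely difference = 400 log10(p / (1 - p)):

\[
400\log_{10}\frac{0.99}{0.01} - 400\log_{10}\frac{0.98}{0.02} \approx 798 - 676 \approx 122 \text{ elo points}
\]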

Here is a simple measure of goodness of fit, rms deviation of actual from expected win rate. This is not solidity, it is more like consistency or predictability: A small number means that the blue and green curves are close together. The numbers are slightly lower than correct, because I used a spreadsheet and it was easier to include the self-matchups with pretend 50% win rate and 0 deviation.

bot            rms deviation
Stardust       7.6%
PurpleWave     11.3%
BananaBrain    9.9%
Dragon         15.5%
McRave         13.6%
Microwave      15.5%
Steamhammer    16.9%
DaQin          21.2%
ZZZKBot        23.7%
UAlbertaBot    9.0%
WillyT         15.7%
Ecgberht       14.7%
EggBot         6.8%

You can eyeball the graphs and compare these numbers, and you should see that the numbers are a fair summary of how well the blue and green lines match. The bot that mostly won and the bot that mostly lost are good fits, DaQin and ZZZKBot are pretty wild, and UAlbertaBot stands out as unusually consistent. In fact, Stardust, UAlbertaBot, and EggBot all play fixed strategies (one per race for random UAlbertaBot), so it should be no surprise that they are consistent. The next most consistent by this measure is BananaBrain, which plays a wide range of strategies very unpredictably, so it is a surprise.

Next: To turn this into a solidity metric is a matter of extracting the portion of deviation which is due to upsets. It will take a bit of detail work with the spreadsheet. I’m out of time today, so I’ll do that tomorrow. It will be interesting to judge whether consistency or solidity is the more useful metric.

SSCAIT 2020 round robin is over

My first try at the solidity metric did not work well enough. The flaw is easy to fix; expect numbers tomorrow if there are no more flaws. For today, a few notes instead.

The SSCAIT 2020 round robin phase has just finished. The top ranks are no surprise. #1 is Stardust with 104-6 for 94.5%, a dominating performance after a slow start when most of the losses were front-loaded. Tied at #2-#3 are BananaBrain and BetaStar with 99-11, and they scored 1-1 against each other. There is a gap below #4 Monster with 98-12. Tied at #5-#6 are Halo by Hao Pan and PurpleWave with 91-19. This time the head-to-head score is Halo-PurpleWave 2-0. Compared to past expectations, that’s a good result for Halo and a poor result for PurpleWave. #7 Iron landed higher than I predicted.

#16 TyrProtoss, the bottom of the top, is notable for losing every game against others of the top 16, except for one game versus #15 MadMixP. It made up for it with solidity against the rest. The next bot to do at least as poorly against the top 16 is #36 Flash. #17 McRaveZ, the top of the bottom, narrowly missed the top 16, and is notable for a string of 1-1 scores against higher-ranked opponents. McRaveZ also lost 0-2 to #54 Marine Hell, the biggest 0-2 upset. Those 2 points left it 2 games behind #16 TyrProtoss, so it hurt. #18 Skynet by Andrew Smith had even more 1-1 scores against higher opponents. Skynet is old but still tough.

The biggest upsets are cannonbot #50 Jakub Trancik > #2-#3 BetaStar, #52 GarmBot by Aurelien Lermant > #8 Dragon, #54 Marine Hell > #11 Steamhammer, and #55 JumpyDoggoBot > #15 MadMixP. That’s not many extreme upsets; the top 16 are fairly solid. I think another notable upset is the old school champion #29 ICEbot > #1 Stardust.

#11 Steamhammer scored 78-32 for 70.9%. I had forecast rank #9 or #10, and I was optimistic. In the past I’ve predicted more accurately. Looking at every Steamhammer game, I learned about a half dozen bugs and weaknesses that I hadn’t seen before, some severe (I’ll post more later). I guess my prediction was off because of the unexpected weaknesses that I didn’t take into account. But in any case, #11 is the same rank that Steamhammer earned last year, and the year before too, and its win percentage varied neatly with the number of bots in the tournament. In the big picture, Steamhammer had a startup transient (weak the first year because it was barely started, extra strong the next year because other bots had not yet adapted to its new skills), and since then has been holding its level, not surpassing its neighbors but not falling behind either. That’s not bad. But this year I’m putting effort into skills no other bot has, so stand back!

Next I expect a wait while the elimination phase is run behind the scenes, then they’ll turn on bot submission. I will prioritize fixing some of the surprise bugs ahead of my bigger project of opening timing, so I’m thinking I’ll upload Steamhammer 3.3.6 sooner (the tournament version is 3.3.5) and hold 3.4 for later. Then I expect the elimination phase results will come out slowly, week by week. Steamhammer is likely to fall to the losers’ bracket in the first round of the elimination phase, and may visit E-Lemon Nation early on.

Steamhammer-Microwave rivalry

The round robin phase of the annual SSCAIT tournament is nearly over.

Steamhammer was ahead of Xiao Yi by one loss when Steamhammer’s final game came up—versus Microwave. Even with bugs that keep it out of contention, Microwave is still Steamhammer’s rival. Xiao Yi had unplayed games left, and in any case Steamhammer defeated Xiao Yi 2-0 so in the worst case it would place ahead on tiebreak. Nevertheless it was a tense pairing. The game was a long and difficult hive tech ZvZ that neither bot could play particularly well. Notice Microwave’s use of overlords to discover and eliminate burrowed drones. The mutas and devourers were all plagued, but too late....

Another difficult Steamhammer game was versus Ecgberht.

Addendum: BetaStar and BananaBrain will likely end up tied for #2-#3. They scored 1-1 against each other. Will the seeding order for the elimination phase be decided arbitrarily, or what?

solidity in AIIDE 2020 - part 2

Here are the graphs I promised. There is one for each bot in AIIDE 2020. Opponents are not labeled, but are arranged along the x-axis in order of strength. The green line shows the expected win rates against the opponents, based on the elos of the two bots (from yesterday). The blue line shows the actual win rate in the tournament. For purposes of charting, each bot has a fictional win rate of 50% against itself, on both the green and blue lines. So every chart shows green and blue crossing at 50%.

The green line has fundamentally the same shape in every graph, since it is based on fixed elo ratings. It’s just stretched a little differently each time. The blue line must roughly follow the green line; by construction, it can’t deviate in one direction without also deviating in the other. Notice that the scale is different on a couple of the later graphs. In particular, EggBot’s graph only goes to 50%.

[Charts omitted: one win-rate graph each for Stardust, PurpleWave, BananaBrain, Dragon, McRave, Microwave, Steamhammer, DaQin, ZZZKBot, UAlbertaBot, WillyT, Ecgberht, and EggBot.]

It’s not easy to eyeball the upset rate as such. You have to align your eyes to the 50% point where the green and blue cross. I should have drawn vertical lines on the graphs there, but my software was not fun. Nevertheless, the general goodness of fit is easy to see, and I guess that might be just as informative. For example, you can easily tell by eye that Dragon tends to upset the strong and suffer against the weaker. The strongest players want to be solid to avoid losing, since most losses are upsets, and the weaker players want to be daring to win more, and some of that shows too. To me, it’s telling that Microwave is visibly more consistent than Steamhammer, since MicroDK aims for defensive play that avoids risk while Steamhammer aims for aggressive play and seeks risk. That is exactly the kind of difference that a solidity metric is supposed to measure.

I judge the experiment a success so far.

Next: I’ll look for a good way to turn the data into single numbers, completing the solidity metric.