Overkill’s new learning 6 - how it fits with the rest of the program
The learning is done by StrategyManager::strategyChange(). During the game, the production manager is in one of two states, depending on whether Overkill is still in its opening build order.
enum ProductionState {fixBuildingOrder, goalOriented};
ProductionState productionState;
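Here is a minimal sketch of how the flag might gate update() (my reconstruction, not Overkill's actual code; executeFixedBuildOrder() is a hypothetical name):

void ProductionManager::update()
{
	if (productionState == fixBuildingOrder)
	{
		// Still in the scripted opening: follow it step by step.
		executeFixedBuildOrder();   // hypothetical name
	}
	else   // goalOriented
	{
		// Opening finished: reconsider the strategy goal when appropriate.
		onGoalProduction();
	}
}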
When productionState is goalOriented, ProductionManager::update() calls onGoalProduction() to check whether the goal (zerglings, hydralisks, or mutalisks) should be changed. The key parts:
bool isAllUpgrade = true;
for (int i = 0; i < int(queue.size()); i++)
{
	//if remain unit type is just worker/upgrade/tech, check the strategy
	if (... a long condition here ...)
		isAllUpgrade = false;
}
if (isAllUpgrade)
{
	if (BWAPI::Broodwar->getFrameCount() > nextStrategyCheckTime)
	{
		std::string action = StrategyManager::Instance().strategyChange(0);
		BWAPI::Broodwar->printf("change strategy to %s", action.c_str());
		setBuildOrder(StrategyManager::Instance().getStrategyBuildingOrder(action), false);
		nextStrategyCheckTime = BWAPI::Broodwar->getFrameCount() + 24 * 10;   // 24 frames/second * 10 seconds
		curStrategyAction = action;
	}
	...
}
In other words, if the production queue holds no combat units, and at least 10 seconds have passed since the last strategyChange() (Brood War runs at about 24 frames per second, so 24 * 10 frames is 10 seconds), then call strategyChange() again, with reward 0 because we’re in the middle of the game. Overkill changes its choice at most once every 10 seconds.
At the end of the game, Overkill calls strategyChange() one last time, giving a reward of 100 for a win and -100 for a loss. From Overkill::onEnd():
// Late game reached with a big army and economy? Count it as a win.
// 86,000 frames is roughly an hour at 24 frames per second, and
// supplyUsed() is doubled in BWAPI, so 150 * 2 means 150 supply.
if (frameElapse >= 86000 && BWAPI::Broodwar->self()->supplyUsed() >= 150 * 2)
{
	win = true;
}
else
{
	win = isWinner;
}
int reward = win ? 100 : -100;
StrategyManager::Instance().strategyChange(reward);
There’s something curious here. isWinner is the flag passed into onEnd(). If the game lasts at least 86,000 frames and Overkill still has 150 supply in play, it sets win = true no matter what; otherwise it trusts the isWinner flag. For purposes of its learning algorithm, Overkill tells itself that making it to the late game in strength counts as a win in itself! Isn’t that an optimistic way to think?
Every call to strategyChange() produces one data point for the learning algorithm. The reward comes in only at the end of the game, but Overkill needs the other data points to fill in the Q values for states during the game. The model has a lot of values to learn, so the more data points the better.
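To see why the mid-game data points matter even with reward 0, here is the textbook one-step Q-learning update in sketch form (a generic illustration, not Overkill’s actual model, which approximates Q with learned features rather than a table; the state and action counts here are made up):

#include <algorithm>

const int kNumStates = 64;    // made-up state count, for illustration
const int kNumActions = 3;    // zergling / hydralisk / mutalisk

double Q[kNumStates][kNumActions] = {};

// One-step Q-learning update. Terminal calls carry the +100/-100 reward;
// mid-game calls carry reward 0, but the discounted value of the best
// next action still flows backward, so every strategyChange() call
// teaches the model something.
void qUpdate(int s, int a, double reward, int sNext, bool terminal,
             double alpha = 0.1, double gamma = 0.95)
{
	double target = reward;
	if (!terminal)
		target += gamma * *std::max_element(Q[sNext], Q[sNext] + kNumActions);
	Q[s][a] += alpha * (target - Q[s][a]);
}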
exploration
When Overkill is in Release mode, it explores 10% of the time. The code is in StrategyManager::strategyChange(). So, 1 time in 10 when Overkill makes a decision, it’s a random exploratory decision instead of the best decision it knows.
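Epsilon-greedy selection of that kind looks roughly like this (a generic sketch, not Overkill’s code; qValue() is a hypothetical stand-in for the model’s learned estimate, and actions 0 through 2 stand for the three unit choices):

#include <random>

// Hypothetical stand-in for the model's current Q estimate.
double qValue(int action) { return 0.0; }   // placeholder body

int chooseAction(double epsilon = 0.1)
{
	static std::mt19937 rng{std::random_device{}()};
	std::uniform_real_distribution<double> coin(0.0, 1.0);
	if (coin(rng) < epsilon)
	{
		// Explore: 1 decision in 10 is uniformly random.
		std::uniform_int_distribution<int> pick(0, 2);
		return pick(rng);
	}
	// Exploit: take the action with the best estimated Q value.
	int best = 0;
	for (int a = 1; a < 3; ++a)
		if (qValue(a) > qValue(best))
			best = a;
	return best;
}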
On the one hand, if Overkill wants to learn an opponent model, it has to explore. On the other hand, making that many random decisions can’t be good for its play! Since the AIIDE 2016 tournament was not long enough for the opponent model to become accurate, Overkill might have done better to skip the opponent modeling.
Next: Other changes to Overkill.