learning for tournaments
Most learning bots use a multi-armed bandit algorithm to choose opening builds, such as a UCB variant or an epsilon-greedy algorithm. The algorithms trade off exploration—trying things out to find what’s good—against exploitation—getting wins using the good stuff found so far. If you’re playing on a ladder that runs indefinitely, you want an algorithm tuned to keep on exploring. Unless the algorithm hits on a build that wins every game, it will always find some value in occasionally trying another idea that might turn out to be better than what you’ve found so far.
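To make the trade-off concrete, here is a minimal sketch of epsilon-greedy opening selection. The build names, the win/loss counts, and the epsilon value are all illustrative assumptions, not taken from any actual bot:

```python
import random

EPSILON = 0.1  # hypothetical: fraction of games spent exploring

def choose_opening(stats, epsilon=EPSILON):
    """stats maps opening name -> (wins, games).
    Explore with probability epsilon by picking any opening at random;
    otherwise exploit the opening with the best observed win rate."""
    if random.random() < epsilon:
        return random.choice(list(stats))  # explore
    # exploit: highest win rate so far (max(1, games) avoids divide by zero)
    return max(stats, key=lambda o: stats[o][0] / max(1, stats[o][1]))

# Illustrative record against one opponent.
stats = {"4-pool": (6, 10), "9-pool": (3, 10), "12-hatch": (5, 10)}
```

With `epsilon=0.0` this always returns the best build found so far; with `epsilon=1.0` it picks uniformly at random. A real UCB variant would replace the coin flip with a per-build confidence bonus, but the exploration-versus-exploitation shape is the same.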
If you’re playing a tournament, it’s different. The tournament will end, and if you’re near the end of the tournament then it won’t help to explore, because if you do find a better strategy you won’t have time to play it and earn wins. The cost of exploring starts to exceed the benefit. In the very last game of the tournament, if you know that it’s the last, the benefit of exploring is exactly zero and you should always play your best idea so far.
It’s an idea I thought of for AIIDE this year, but ended up not implementing in Steamhammer. I don’t know how long the tournament will be, but I can guess that it will probably not be much longer or shorter than last year, which was 100 rounds—100 games against each opponent. I could add a configuration parameter for the expected number of games, and reduce the rate of exploration slowly so that it reaches zero around then.
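The idea above could be sketched as a simple linear decay schedule. The parameter names and the initial rate are assumptions for illustration; the only number from the post is the expected tournament length of around 100 games per opponent:

```python
def exploration_rate(games_played, expected_games, initial_epsilon=0.2):
    """Decay exploration linearly so it reaches zero at the configured
    expected tournament length (e.g. 100 games per opponent).
    initial_epsilon is a hypothetical starting exploration rate."""
    remaining = max(0, expected_games - games_played)
    return initial_epsilon * remaining / expected_games
```

Halfway through a 100-game tournament the rate is half its starting value, and from game 100 onward it is zero, so the very last games are pure exploitation. A slower-than-linear decay would also work; the essential property is that exploration vanishes by the time a newly discovered strategy could no longer earn wins.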
I may yet do it; it is likely a good idea. Bandit algorithms generally assume that the environment—the opponent—is steady over time, or becomes steady given enough time. Assuming an unchanging opponent, reducing exploration when exploration cannot help is always a gain. In reality the opponent may itself learn, so that bot + opponent form a hard-to-predict dynamic system. Reducing exploration against a learning opponent might make you more predictable, so that the opponent’s exploitation works better. But the opponent will have to explore a little to do that exploiting, so... my guess is that the idea is still likely to help.
Does any bot use this technique of reducing exploration as the tournament continues? Do you have data showing how well it works?