learning for tournaments
Most learning bots use a multi-armed bandit algorithm to choose opening builds, such as a UCB variant or an epsilon-greedy algorithm. The algorithms trade off exploration—trying things out to find what’s good—against exploitation—getting wins using the good stuff found so far. If you’re playing on a ladder that runs indefinitely, you want an algorithm tuned to keep on exploring. Unless the algorithm hits on a build that wins every game, it will always find some value in occasionally trying another idea that might turn out to be better than what you’ve found so far.
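To make the trade-off concrete, here is a minimal sketch of epsilon-greedy opening selection. The build names, the win/loss counts, and the epsilon value are all illustrative assumptions, not taken from any actual bot:

```python
import random

EPSILON = 0.1  # hypothetical: fraction of games spent exploring

def choose_opening(stats, epsilon=EPSILON):
    """stats maps opening name -> (wins, games).
    Explore with probability epsilon by picking any opening at random;
    otherwise exploit the opening with the best observed win rate."""
    if random.random() < epsilon:
        return random.choice(list(stats))  # explore
    # exploit: highest win rate so far (max(1, games) avoids divide by zero)
    return max(stats, key=lambda o: stats[o][0] / max(1, stats[o][1]))

# Illustrative record against one opponent.
stats = {"4-pool": (6, 10), "9-pool": (3, 10), "12-hatch": (5, 10)}
```

With `epsilon=0.0` this always returns the best build found so far; with `epsilon=1.0` it picks uniformly at random. A real UCB variant would replace the coin flip with a per-build confidence bonus, but the exploration-versus-exploitation shape is the same.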
If you’re playing a tournament, it’s different. The tournament will end, and if you’re near the end of the tournament then it won’t help to explore, because if you do find a better strategy you won’t have time to play it and earn wins. The cost of exploring starts to exceed the benefit. In the very last game of the tournament, if you know that it’s the last, the benefit of exploring is exactly zero and you should always play your best idea so far.
It’s an idea I thought of for AIIDE this year, but ended up not implementing in Steamhammer. I don’t know how long the tournament will be, but I can guess that it will probably not be much longer or shorter than last year, which was 100 rounds—100 games against each opponent. I could add a configuration parameter for the expected number of games, and reduce the rate of exploration slowly so that it reaches zero around then.
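The idea above could be sketched as a simple linear decay schedule. The parameter names and the initial rate are assumptions for illustration; the only number from the post is the expected tournament length of around 100 games per opponent:

```python
def exploration_rate(games_played, expected_games, initial_epsilon=0.2):
    """Decay exploration linearly so it reaches zero at the configured
    expected tournament length (e.g. 100 games per opponent).
    initial_epsilon is a hypothetical starting exploration rate."""
    remaining = max(0, expected_games - games_played)
    return initial_epsilon * remaining / expected_games
```

Halfway through a 100-game tournament the rate is half its starting value, and from game 100 onward it is zero, so the very last games are pure exploitation. A slower-than-linear decay would also work; the essential property is that exploration vanishes by the time a newly discovered strategy could no longer earn wins.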
I may yet do it; it is likely a good idea. Bandit algorithms generally assume that the environment—the opponent—is steady over time, or becomes steady given enough time. Assuming an unchanging opponent, reducing exploration when exploration cannot help is always a gain. In reality the opponent may itself learn, so that bot + opponent form a hard-to-predict dynamic system. Reducing exploration against a learning opponent might make you more predictable, so that the opponent’s exploitation works better. But the opponent will have to explore a little to do that exploiting, so... my guess is that the idea is still likely to help.
Does any bot use this technique of reducing exploration as the tournament continues? Do you have data showing how well it works?