A group associated with the DEMO lab at Brandeis University wrote HC-Gammon, a neural network program which learns using a simple hillclimbing algorithm rather than temporal difference learning. It’s not as strong as the world-class programs, but it is somewhat strong, and its learning method is more straightforward than a temporal difference method.
The people involved are Jordan Pollack, Alan Blair and Mark Land.
The authors started with a neural network of about 4000 connections, patterned on the “knowledge-free” network that Tesauro tried in early TD-Gammon experiments. The initial network had all zero weights.
The learning method keeps track of a “champion” network. At each training step, small random numbers are added to the champion’s weights, creating a “challenger” network that is similar to the champion but might be better. The two networks play a match; if the challenger wins, each of the champion’s weights is moved a small distance toward the corresponding challenger weight, and the resulting network becomes the new champion. The authors ran 100,000 training steps to produce the final champion.
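To make the loop concrete, here is a minimal sketch of the champion/challenger procedure in Python. This is my own paraphrase rather than the authors’ code: the noise level, the blending fraction, the number of steps and the play_match function (which would run an actual backgammon match between the two networks) are all placeholders.

    import random

    def mutate(weights, sigma=0.05):
        # Challenger = champion plus small Gaussian noise on every weight.
        # sigma is an illustrative value, not taken from the paper.
        return [w + random.gauss(0.0, sigma) for w in weights]

    def hillclimb(initial_weights, play_match, steps=100_000, blend=0.05):
        # play_match(champion, challenger) should return True if the
        # challenger wins the match; blend controls how far the champion
        # moves toward a winning challenger.  Both are stand-ins here.
        champion = list(initial_weights)
        for _ in range(steps):
            challenger = mutate(champion)
            if play_match(champion, challenger):
                # Nudge each champion weight toward the challenger's.
                champion = [c + blend * (x - c)
                            for c, x in zip(champion, challenger)]
        return champion

Moving only partway toward the challenger, rather than adopting it outright, is the weighted recombination mentioned below, and it keeps one lucky match result from dragging the weights too far.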
Some attention to detail was necessary to keep learning on track. For example, as the champion became stronger, the number of games in each match was increased so that smaller differences in strength could be detected.
The program’s strength was measured against several benchmarks, but the authors give no estimate in human terms. I imagine that it may be an intermediate player; if that’s true, it’s fairly impressive for only 100,000 training steps.
I found it hard to get excited about this, because the results are pretty much what I would have expected. Still, it’s a good reminder that there’s nothing magic about temporal difference learning; even utterly simple learning methods can work, at least for this problem. The authors’ discussion of their results is informative and to the point.
HC-Gammon’s hillclimbing algorithm can be seen as a degenerate genetic algorithm with a population size of two, using weighted recombination instead of crossover. It’s simpler and might be more robust than temporal difference learning, and it’s easy to try variations on the basic algorithm. Since it’s not constrained to follow the path of the steepest gradient, there’s a chance that it could be more “creative” in finding solutions.
On the other hand, TD learning should be faster, at least once the value function has started to converge, because it uses more information: with the usual TD algorithm you get one update for every move of a game, rather than one per game or per several games. If the goal is to tune a network for the very best play, this could be a huge advantage. Top TD backgammon programs train for millions of games; if hillclimbing needs orders of magnitude more games than that, the training becomes infeasible.
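For contrast, here is an equally rough sketch of the updates a TD(0) learner gets from a single self-play game. Again this is only illustrative: V (the network’s value estimate), its update method, alpha, and play_one_game are hypothetical stand-ins, not anything from the papers.

    def td0_game(V, play_one_game, alpha=0.1):
        # V(position) returns an estimated probability of winning, and
        # V.update(position, delta) nudges that estimate; both are
        # hypothetical interfaces used only for illustration.
        positions, result = play_one_game()        # result: 1.0 win, 0.0 loss
        for s, s_next in zip(positions, positions[1:]):
            target = V(s_next)                     # bootstrap from the next position
            V.update(s, alpha * (target - V(s)))
        last = positions[-1]
        V.update(last, alpha * (result - V(last))) # terminal position learns the outcome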
In search of the electronic brain (1997)
Michael Gruber
A popular article about Jordan Pollack and connectionism, with emphasis on the backgammon program.
Coevolution of a backgammon player (1996, PDF)
Jordan Pollack, Alan Blair and Mark Land
With online abstract. The details.
Co-evolution in the successful learning of backgammon strategy (1997, PDF)
Jordan Pollack and Alan Blair
With online abstract.
A longer, more recent version of the above paper, with excellent additional analysis. It’s also double-spaced and has the pages in reverse order. :-(
DEMO Lab’s HC-Gammon
Play against an evolved player. This page surprisingly still exists, but unsurprisingly did not work when I tried it.