Gerry Tesauro of IBM’s Thomas J. Watson Research Center wrote the first expert-strength backgammon program, TD-Gammon, which learned by self-play. TD-Gammon set the pattern for all strong backgammon programs since then, so it’s historically important.
The program was released for OS/2 Warp as part of a product called “Family FunPak”, but of course it is long discontinued. You can read about it in the popular article TD-Gammon: From neural network to OS/2 game.
Neurogammon, Tesauro’s earlier program, was a supervised-learning neural network with specialized backgammon input features to measure such things as the racing lead and the strength of blockades. It was trained on positions that Tesauro hand-labeled with good and bad moves.
Neurogammon reached a high intermediate level of play, but did not become an expert.
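To make the idea concrete, here is a very rough sketch of supervised training on hand-labeled positions. Everything in it is invented for illustration: the features() helper standing in for the hand-crafted inputs, the labels, and the single-layer scorer standing in for the network. It is not Neurogammon’s actual design, just the general shape of learning from labeled examples.

    import math

    def train(examples, weights, alpha=0.05, epochs=100):
        """examples: (board, label) pairs, label 1.0 for a good move, 0.0 for a bad one.
        features() is a hypothetical helper computing hand-crafted inputs such as the
        racing lead and blockade strength; weights has one entry per feature."""
        for _ in range(epochs):
            for board, label in examples:
                x = features(board)
                # Single-layer stand-in for the network: a logistic score in [0, 1].
                score = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(weights, x))))
                error = label - score                    # supervised error signal
                for i, xi in enumerate(x):
                    weights[i] += alpha * error * xi     # gradient step toward the label
        return weights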
TD-Gammon came into being partly by serendipity. Tesauro experimented with “knowledge-free” networks that had no backgammon features, only a raw representation of the board position, which he trained by self-play with the TD(λ) algorithm. With no knowledge, the networks’ initial games against themselves were random. But after tens of thousands of games, they became about as strong as Neurogammon, with its hand-crafted features and training examples.
Networks that combined the same self-play training with Neurogammon’s hand-crafted input features reached expert strength, far stronger than any previous program.
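Here is a rough sketch of what one game of the self-play training just described looks like. The helpers (initial_board(), game_over(), final_result(), legal_plays(), evaluate(), and gradient()) are hypothetical stand-ins, and the constants are illustrative; this is the shape of a TD(λ) update, not TD-Gammon’s actual code.

    import random

    ALPHA, LAMBDA = 0.1, 0.7      # learning rate and trace decay (illustrative values)

    def self_play_game(weights):
        board = initial_board()                    # hypothetical helper
        traces = [0.0] * len(weights)              # one eligibility trace per weight
        while not game_over(board):
            value = evaluate(board, weights)       # predicted probability of winning
            grad = gradient(board, weights)        # d(value)/d(weights)
            for i in range(len(weights)):
                traces[i] = LAMBDA * traces[i] + grad[i]
            # Greedy self-play: roll the dice and pick the play whose resulting
            # position evaluates best. (A full version would flip perspective on
            # the opponent's turns.)
            roll = (random.randint(1, 6), random.randint(1, 6))
            board = max(legal_plays(board, roll), key=lambda b: evaluate(b, weights))
            target = final_result(board) if game_over(board) else evaluate(board, weights)
            error = target - value                 # temporal-difference error
            for i in range(len(weights)):
                weights[i] += ALPHA * error * traces[i]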
Training is time-consuming. TD-Gammon 2.1 played 1,500,000 games against itself before it stopped improving.
Versions 2.0 and 2.1 used a two-ply search to further sharpen their decisions. Search is expensive in backgammon, because you have to take dice rolls into account. To search two plies, you consider, for each of your plays, each of the rolls your opponent could get, and each of the plays your opponent could make with that roll (the chance nodes mean that standard alpha-beta pruning does not apply directly).
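In outline, and again only as a sketch with hypothetical legal_plays() and evaluate() helpers rather than TD-Gammon’s actual search code, a two-ply lookahead averages over the 21 distinct rolls at the opponent’s chance node and assumes the opponent makes the reply that is worst for you:

    from itertools import combinations_with_replacement

    # The 21 distinct rolls with their probabilities: doubles 1/36, the rest 2/36.
    ROLLS = [((a, b), (1 if a == b else 2) / 36.0)
             for a, b in combinations_with_replacement(range(1, 7), 2)]

    def two_ply_value(board, weights):
        total = 0.0
        for roll, prob in ROLLS:
            # If the opponent has no legal reply, the position stands.
            replies = legal_plays(board, roll, opponent=True) or [board]
            # The opponent's best reply is the one that evaluates worst for us.
            total += prob * min(evaluate(b, weights) for b in replies)
        return total

    def choose_play(board, roll, weights):
        # Score each of our candidate plays by the expected value two plies ahead.
        return max(legal_plays(board, roll), key=lambda b: two_ply_value(b, weights))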
Gerry Tesauro shared this information with me: TD-Gammon 3.0 searches three plies. Tesauro estimates that the three-ply search program would score +0.02 to +0.04 points per game against the two-ply version. Considering its already high level of play, that’s a substantial improvement. Kit Woolsey confirmed that three-ply TD-Gammon played “considerably better”.
A supervised-learning program can do only as well as the examples it learns from. More generally, though adding knowledge to a program by hand is a tempting way to get it quickly up to speed, it also puts a limit on the program’s ultimate achievement. If a program is capable enough, the more it learns on its own, the greater its chance of making new discoveries and surpassing its creators.
However, self-play has never been as successful with other games as with backgammon. I suspect that the main reason is exploration. In backgammon, the random dice rolls and the shape of the game are such that a program playing against itself is forced to encounter every kind of position that matters, so it eventually learns everything it needs to know to play well. In most games, if the program never hits on an important line of play or strategy by itself, it will never learn that way of playing.
This ties in with the issue of exploration versus exploitation. In backgammon, apparently you get exploration for free, and training with a simple temporal-difference algorithm can be optimized for pure exploitation. When that is not so, I suspect that self-play can succeed only with a more complicated training method that accounts for the tradeoff. Starting with random play to explore and gradually reducing the randomness, simulated annealing style, is one obvious idea and has been tried.
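As a purely illustrative sketch of that idea (the schedule, constants, and helpers are made up, not taken from any published training run), you might make a fraction of the moves at random and shrink that fraction as training goes on:

    import random

    EPSILON_START, EPSILON_END, DECAY_GAMES = 1.0, 0.01, 100_000  # illustrative schedule

    def epsilon(game_number):
        """Exploration rate that decays linearly over the first DECAY_GAMES games."""
        frac = min(game_number / DECAY_GAMES, 1.0)
        return EPSILON_START + frac * (EPSILON_END - EPSILON_START)

    def choose_play(board, roll, weights, game_number):
        plays = legal_plays(board, roll)               # hypothetical helper
        if random.random() < epsilon(game_number):
            return random.choice(plays)                # explore: play at random
        return max(plays, key=lambda b: evaluate(b, weights))  # exploit: play greedily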
Some of Tesauro’s publications are available online as abstracts or as preprints from IBM.
The best articles to start with are the first one, from the book by Sutton and Barto, and the one in Communications of the ACM, listed last.
TD-Gammon
Richard Sutton and Andrew Barto
A section from the book Reinforcement Learning: An Introduction.
This is a textbook, so it’s more readable than most of the other references here.
Neurogammon wins Computer Olympiad (1989, paywalled)
Gerald Tesauro
Neural Computation, volume 1, pages 321-323.
A parallel network that learns to play backgammon (1989, PDF)
Gerald Tesauro and Terrence J. Sejnowski
Artificial Intelligence, volume 39, pages 357-390.
Neurogammon. This paper includes a brief description of its backgammon features.
"Neurogammon: A neural network backgammon program" (1990)
Gerald Tesauro
IJCNN Proceedings (International Joint Conference on Neural Networks), volume 3, pages 33-40.
Practical issues in temporal difference learning (1992)
Gerald Tesauro
Machine Learning, volume 8, pages 257-277.
The earliest TD-Gammon information.
TD-Gammon, a self-teaching backgammon program, achieves master-level play (1993, PDF)
Gerald Tesauro
The longer 1994 tech report version is paywalled.
Temporal difference learning and TD-Gammon (March 1995)
Gerald Tesauro
Web page. Originally published in Communications of the ACM, volume 38, number 3, pages 58-68.
J. Tepandi wrote a brief review (paywalled, though it was once freely available).
The matches below were played by other people, but TD-Gammon’s suggested plays give an idea of its view of the world. Also, the program has a rather smug sense of humor. :-) What I noticed in going through these old matches is how much better human play is today than it was then: we learned because TD-Gammon and its followers taught us.
Jeremy Bagai vs. Kit Woolsey (1994)
Seven games annotated by the players and by a two-ply version of TD-Gammon.
Casey Forrest vs. David Eggert (PDF)
Thirteen games annotated by Kit Woolsey, TD-Gammon 3.0, and JellyFish (a commercial program by Fredrik Dahl). I’m not sure of the date. It’s interesting to see how sharply the programs sometimes disagree.