Gerry Tesauro <tesauro@watson.ibm.com> of IBM's Thomas J. Watson Research Center wrote the first expert-strength backgammon program, TD-Gammon, which learns by self-play.
| Jay : game learning : backgammon : TD-Gammon |
The program was released for OS/2 Warp as part of a product called "Family FunPak", but as far as I can tell this product seems to have been discontinued. You can read about it in the article TD-Gammon: From neural network to OS/2 game.
Neurogammon was a supervised-learning neural network with specialized backgammon input features to measure such things as the racing lead and the strength of blockades. It was trained on positions that Tesauro hand-labeled with good and bad moves.
Neurogammon reached a high intermediate level of play, but did not become an expert.
TD-Gammon came into being partly by serendipity. Tesauro experimented with "knowledge-free" networks that had no backgammon features, only a raw representation of the board position, which he trained by self-play with the TD(lambda) algorithm. With no knowledge, the networks' initial games against themselves were random. But after tens of thousands of games, they became about as strong as Neurogammon, with its hand-crafted features and training examples!
Networks using Neurogammon's features became expert strength, far stronger than any previous program.
Training is time-consuming. TD-Gammon 2.1 played 1,500,000 games against itself before it stopped improving.
Versions 2.0 and 2.1 used a two-ply search to further sharpen their decisions. Search is expensive in backgammon, because you have to take dice rolls into account: to search two plies, you consider, for each of your plays, each of the rolls your opponent could make, and each of the plays your opponent could make with that roll.
Gerry Tesauro shared with me some information about the latest version, TD-Gammon 3.0, which hasn't yet made it into print. (The version numbers of commercial OS/2 TD-Gammon don't correspond to the version numbers of research TD-Gammon.)
TD-Gammon 3.0 searches three plies. Tesauro estimates that the three-ply search program would score +0.02 to +0.04 points per game against the two-ply version. Considering its already high level of play, that's a substantial improvement. Kit Woolsey confirms that three-ply TD-Gammon plays "considerably better".
A supervised-learning program can do only as well as the examples it learns from. More generally, though adding knowledge to a program by hand is a tempting way to get it quickly up to speed, it also puts a limit on the program's ultimate achievement. If a program is capable enough, the more it learns on its own, the greater its chance of making new discoveries and surpassing its creators.
Few of the TD-Gammon articles are online, as far as I know. However, some of Tesauro's publications can be found in the IBM Research CyberJournal; you can order hardcopies from IBM by e-mail.
The best articles to start with are the first one, from the book by Sutton and Barto, and the one in Communications of the ACM, listed last.
TD-Gammon
Richard Sutton and
Andrew Barto
A section from the book
Reinforcement Learning: An Introduction.
This is a textbook, so it's more readable than most of
the other references here.
"Neurogammon wins Computer Olympiad" (1989)
Gerald Tesauro
Neural Computation, volume 1, pages 321-323.
"A parallel network that learns to play backgammon" (1989)
Gerald Tesauro and
Terrence J. Sejnowski
Artificial Intelligence, volume 39, pages 357-390.
There is an
online abstract for a tech report of the same title.
Neurogammon. This paper includes a brief description of its backgammon
features.
"Neural network defeats creator in backgammon match" (1988)
Gerald Tesauro
A tech report. There is an
online abstract with ordering instructions to get a paper copy.
An annotated match with Neurogammon.
"Neurogammon: A neural network backgammon program" (1990)
Gerald Tesauro
IJCNN Proceedings (International Joint Conference on
Neural Networks), volume 3, pages 33-40.
"Practical issues in temporal difference learning" (1992)
Gerald Tesauro
Machine Learning, volume 8, pages 257-277.
There is an
online abstract, and you can order the full article.
The earliest TD-Gammon information.
TD-Gammon, a self-teaching backgammon program, achieves
master-level play
(1994, 5 pages)
Gerald Tesauro
Compressed postscript.
There is an
online abstract.
Temporal difference learning and TD-Gammon (March 1995)
Gerald Tesauro
Web page. Originally published in
Communications of the ACM, volume 38 number 3, pages 58-68.
J. Tepandi wrote a
brief review.
The matches were played by other people, but TD-Gammon's suggested plays give an idea of its view of the world. Also, the program has a rather smug sense of humor. :-)
Jeremy Bagai vs. Kit Woolsey
Ftp directory of compressed postscript files. Also available as an
ftp directory of text files, and as
web pages
on another site.
Seven games annotated by the players and by a two-ply version of TD-Gammon.
Casey Forrest vs. David Eggert
Compressed postscript. Also available as
compressed text, and as
a web page,
with additional annotations, on another site.
Thirteen games annotated by Kit Woolsey, TD-Gammon 3.0, and JellyFish.
It's interesting to see how sharply the programs sometimes disagree.