Nicol Schraudolph, Peter Dayan, and Terrence J. Sejnowski of the Computational Neurobiology Lab at the Salk Institute wrote a paper about neural networks which learn to evaluate 9x9 go positions using temporal differences.
The authors tried a variety of networks. The paper diagrams one sample network, but the authors experimented with many others, and the sample network doesn't use all the techniques mentioned in the paper.
The networks were trained by the temporal difference algorithm TD(0) to predict, not the overall result of the game, but rather the owner of each point on the board at the end of the game. (The winner of a go game is the player who controls more points at the end of the game, after a small adjustment, the komi, that offsets the first player's advantage.) That gives the networks more information to learn from.
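The flavor of the TD(0) update can be sketched as follows. This is only an illustration: it uses a toy linear-plus-tanh predictor in my own notation, not the paper's multilayer networks, but the update rule is the same idea: move each point's ownership prediction toward the prediction for the next position, or toward the true final ownership once the game ends.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 9 * 9        # board points
ALPHA = 0.1      # learning rate

# Toy linear model: one weight row per board point, each predicting that
# point's final owner (+1 black, -1 white) from a feature vector of the
# current position.  (Illustrative only; the paper used multilayer
# networks with shared feature detectors.)
W = np.zeros((N, N))

def predict(features):
    return np.tanh(W @ features)      # one ownership estimate per point

def td0_update(features_t, features_t1, terminal_owners=None):
    """TD(0): nudge each point's prediction toward the next position's
    prediction, or toward the true final ownership at game end."""
    v_t = predict(features_t)
    target = terminal_owners if terminal_owners is not None else predict(features_t1)
    delta = target - v_t              # per-point TD error
    # Gradient of tanh(w.x) w.r.t. w is (1 - v^2) * x
    W[:] += ALPHA * ((delta * (1 - v_t ** 2))[:, None] * features_t[None, :])
    return delta
```

Predicting the fate of each of the 81 points, rather than the single win/loss outcome, gives the network 81 training signals per position instead of one.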
Features of a go position are approximately translation-invariant. In other words, a configuration of stones is about as valuable in one place as another, all else being equal, although it may depend on how close it is to the edge of the board. A network which does not take this into account will have to learn the value of each configuration at each location it can appear.
Therefore the networks learned feature detectors which were scanned over their inputs. The layers of the network were explicit feature maps. The sample network has two hidden layers, connected in parallel rather than in series, which were added at different points during the network’s training. Each layer is a feature map produced by scanning feature detectors over its inputs.
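Scanning a feature detector over the board amounts to a 2D cross-correlation with shared weights, which is what makes the learned feature translation-invariant. A minimal sketch (the `scan_feature` helper is my own; the paper's detectors also saw off-board points, so edge effects could be learned, and the feature maps were passed through a nonlinearity):

```python
import numpy as np

def scan_feature(board, detector):
    """Slide one square detector over a zero-padded board, producing a
    same-sized feature map.  Every board location is scored by the same
    weights, so a configuration of stones gets the same response
    wherever it appears."""
    k = detector.shape[0]
    pad = k // 2
    padded = np.pad(board, pad, constant_values=0)
    out = np.empty_like(board, dtype=float)
    for i in range(board.shape[0]):
        for j in range(board.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * detector)
    return out
```

With a 3x3 detector of all ones, for example, each output value is simply the stone count in that point's neighborhood; a learned detector instead responds to whatever local pattern training has made useful.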
Some networks (but not the sample network) were forced to obey the symmetry of the go board by being constrained to learn symmetrical features. The paper says, “Although this is clearly beneficial during the evaluation of the network against its opponents, it appears to impede the course of learning.”
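One way to impose the board's symmetry, sketched below, is to average a detector over the eight symmetries of the square (four rotations and their reflections), so that the constrained detector responds identically to a pattern and any of its rotations or mirror images. This is my own illustration of such a constraint, not necessarily the mechanism the paper used:

```python
import numpy as np

def symmetrize(detector):
    """Average a square detector over the eight symmetries of the board
    (the dihedral group of the square).  The result is invariant under
    rotation and reflection, which is the symmetry constraint described
    in the text."""
    variants = []
    d = detector
    for _ in range(4):
        d = np.rot90(d)               # quarter-turn rotations
        variants.append(d)
        variants.append(np.fliplr(d)) # and their reflections
    return np.mean(variants, axis=0)
```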
The networks played by making a one-ply search, evaluating every possible move. Nici Schraudolph wrote me that, although he used incremental techniques (not mentioned in the paper) to evaluate networks quickly, they were still too slow for full 19x19 go.
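A one-ply search is simply: try every legal move, score the resulting position with the network, keep the best. A sketch, where `play` and `evaluate` are placeholders for the engine's move executor and the trained evaluation network:

```python
import numpy as np

def one_ply_search(board, legal_moves, play, evaluate):
    """Greedy one-ply search: evaluate the position after every legal
    move and return the highest-scoring move.  With dozens of legal
    moves per position, the cost of one full network evaluation per
    move is what made 19x19 play too slow."""
    best_move, best_score = None, -np.inf
    for move in legal_moves:
        score = evaluate(play(board, move))
        if score > best_score:
            best_move, best_score = move, score
    return best_move, best_score
```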
Training by self-play alone was found to be slow. Training against a skilled opponent was faster because the networks could learn from their opponents. The opponents used were a random player, useful to start off training; Wally, a weak public domain program; and Many Faces of Go, a comparatively strong commercial program.
Like any learner, the networks learned best from opponents not too far from their own strength. Networks started out knowing nothing, and so needed weak opponents. Wally was modified to play a certain proportion of random moves, and the proportion was reduced as the network improved. Against Many Faces, games were played with standard go handicaps.
Because go is a deterministic game, there was a risk that a network might never explore some options, because it falsely thought it “knew better”. That would leave blind spots in its understanding. The problem was solved by introducing randomness into the network’s play, using Gibbs sampling.
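Gibbs (Boltzmann) sampling here means choosing a move with probability proportional to exp(value / T) rather than always taking the best one. A minimal sketch (my own parameterization): at high temperature T play is nearly random, and as T falls play approaches the greedy one-ply choice, so the network keeps exploring early on without being crippled late in training.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_move(move_values, temperature=1.0):
    """Sample a move index with probability proportional to
    exp(value / temperature)."""
    v = np.asarray(move_values, dtype=float)
    logits = (v - v.max()) / temperature   # subtract max for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(v), p=p)
```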
Networks that played too much against one opponent risked over-fitting to that opponent, hurting their results against other opponents.
The results are for 9x9 go, a simpler problem than the traditional 19x19 go. Unfortunately, no sample games are available.
A conventional, fully-connected, two-layer backpropagation network learned to defeat Wally after 659,000 training games.
The sample network learned to defeat Many Faces of Go set to a low level of play after about 3,000 training games. The custom network played better and learned much more quickly than the undifferentiated backprop net.
Nici Schraudolph has a lot of ideas about how to improve the networks, but the project had to be shelved for lack of time.
I reviewed this because it’s the most interesting neural net go paper that I’ve seen, though it’s short and some parts are hard to understand. The network design is creative. It reminds me how many ideas haven’t been had yet.
It’s remarkable how much improvement was possible with specially-designed rather than vanilla networks. I suspect that the greatest part of the improvement was due to using feature detectors. In effect, it’s factoring the problem so the networks can solve the factors rather than their product.
As the paper mentions, the networks played relatively well in the opening and less well in the mid-game and ending. Intuitively, neural networks seem better for heuristic evaluations than for the detailed calculations that are also necessary for strong play after the opening. As the paper says, “hybrid approaches might be rewarding.”
Temporal difference learning of position evaluation
in the game of go (1994, 8 pages)
Nicol N. Schraudolph, Peter Dayan and Terrence J. Sejnowski
Compressed postscript.
Wally
various authors, last updated 2002
Compressed unix tar archive. Source of a C++ port of the original 1988 weak public-domain Wally program by Bill Newman.