Brian Sheppard (author of backgammon and Scrabble programs) sent me e-mail about a combination of a genetic algorithm and temporal difference learning. He uses TD training as a “mutation” to create new members of the population. This is a directed mutation technique, a class of ideas that I think is highly promising.
He tried creating a backgammon player using temporal difference learning with a neural net, and noticed that the network did not always improve: sometimes more training made it worse. The method below is what he came up with to overcome that problem.
Start with a random population. At each generation, play a tournament among the population members to establish their relative fitness. Then do some TD training on a copy of the fittest player to produce a new player, and replace the least-fit player with it.
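In rough Python, the loop might look like the sketch below. `run_tournament` and `td_train` are hypothetical stand-ins for the tournament and TD-training steps; Sheppard's actual code is not available.

```python
def evolve(population, run_tournament, td_train, generations):
    """Directed-mutation GA: TD training is the mutation operator.

    `run_tournament` returns the players ranked best-to-worst;
    `td_train` returns a new player derived by TD-training a copy.
    """
    for _ in range(generations):
        ranked = run_tournament(population)   # relative fitness
        best, worst = ranked[0], ranked[-1]
        child = td_train(best)                # directed mutation
        population.remove(worst)              # cull the weakest
        population.append(child)
    return population
```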
It’s important to make sure that the “fittest” player really is one of the best. Otherwise, if all the players are about equal, the tournament ranking is essentially random. Brian Sheppard doubles the tournament length if the winner is not significantly better than the others according to a statistical test.
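The exact statistical test isn't stated, so the sketch below assumes a simple normal approximation to the binomial: keep doubling the tournament until the winner's score is significantly above the 50% mean you'd expect if all players were equal. `play_games(n)` is a caller-supplied function returning the top player's wins over n games.

```python
import math

def tournament_until_significant(play_games, games=64, z_threshold=2.0):
    """Double the tournament length until the winner stands out.

    A z-score test against a 50% win rate is assumed here; it is one
    plausible choice, not necessarily the test Sheppard used.
    """
    while True:
        wins = play_games(games)
        mean = games / 2.0
        sd = math.sqrt(games * 0.25)   # binomial sd with p = 0.5
        if (wins - mean) / sd > z_threshold:
            return games               # winner is genuinely better
        games *= 2                     # inconclusive: play more games
```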
He also adjusts the neural network's learning rate and the amount of training for each new player: a new player gets half the learning rate of the player it was derived from, and is trained for a number of games equal to the reciprocal of its learning rate.
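As a concrete illustration of that schedule (the starting rate of 0.1 below is arbitrary, chosen only to make the numbers readable):

```python
def child_hyperparams(parent_lr):
    """Sheppard's schedule as described: halve the parent's learning
    rate, and train for 1/learning_rate games (using the new rate)."""
    lr = parent_lr / 2.0
    return lr, round(1.0 / lr)

lr = 0.1   # arbitrary starting rate for illustration
for gen in range(4):
    lr, games = child_hyperparams(lr)
    print(f"gen {gen}: lr={lr}, games={games}")
# gen 0: lr=0.05, games=20
# gen 1: lr=0.025, games=40
# gen 2: lr=0.0125, games=80
# gen 3: lr=0.00625, games=160
```

Note that the training length doubles each generation, so the total number of training games is dominated by the last few players.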
Brian Sheppard reports that his system produces a final player roughly as good as TD-Gammon 2.0 with about the same number of total training games, even though many of the training games were for players that were discarded along the way.
The method has advantages over plain TD learning. You can be fairly confident that the final player is one of the strongest to evolve, even without doing separate benchmarking. If you have multiple cores, you can distribute the workload. And you have enormous flexibility to try out variant methods.
Possible variations on the theme are endless. Almost any learning technique can be used to direct mutations. If the learning technique is powerful enough, then random mutations won’t be needed, but recombination may still be useful.
If you have several learning techniques, you can use all of them to direct mutations. This might work particularly well if the techniques have different abilities, and can cooperate in building up stronger population members. For example, a feature constructor and a feature value tuner might work well together; the algorithm would choose which features were best to include and fine-tune their weights.
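As a sketch of that idea (entirely hypothetical; nothing here is from Sheppard's system), the mutation step could simply pick among several learning techniques each generation:

```python
import random

def directed_mutate(player, operators):
    """Choose one of several learning techniques as this generation's
    mutation operator.  Each operator takes a player and returns an
    improved copy; e.g. one might add a promising new input feature,
    another might TD-tune the existing weights.
    """
    return random.choice(operators)(player)

# e.g. directed_mutate(best, [construct_feature, tune_weights])
# where both helpers are hypothetical.
```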