tournament design
If you design a tournament differently, a different bot may be favored to win.
AIIDE 2015 was an example. As pointed out in the tournament results, UAlbertaBot finished fourth even though it had a plus score against every other bot, because compared to the top 3 it was less consistent in defeating weaker bots. AIIDE runs on a round-robin design, all-play-all, so UAlbertaBot could defeat the top finishers and still be ranked behind them. In a progressive elimination tournament in which weaker competitors were dropped over time, UAlbertaBot would likely have finished first.
If you’ve seen the math of tournament design, or of related stuff like voting system design, then you know there’s no such thing as a fair tournament in which the best competitor always has the best chance to win, because there isn’t always such a thing as a best competitor. If A > B and B > C but C > A, then which is the “best”? That’s called intransitivity. A more complicated kind of intransitivity happened in AIIDE 2015.
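Here's a toy illustration of that kind of intransitivity, with made-up numbers (these are not the real AIIDE 2015 results; A, B, C, D stand in for strong bots and W for a weak one). A bot that is favored head-to-head against each of the top three can still finish fourth in an all-play-all, purely because it is less reliable against the weak bot:

```python
# Hypothetical win rates (first bot beats second), NOT real AIIDE 2015 data.
# D is favored head-to-head against A, B, and C (60%, 55%, 55%), yet finishes
# fourth in the round-robin totals because it drops more games to the weak bot W.
win_rate = {
    ("A", "B"): 0.50, ("A", "C"): 0.50, ("A", "D"): 0.40,
    ("B", "C"): 0.50, ("B", "D"): 0.45,
    ("C", "D"): 0.45,
    ("A", "W"): 0.95, ("B", "W"): 0.95, ("C", "W"): 0.95,
    ("D", "W"): 0.60,
}

def expected_score(bot, bots, games_per_pair=100):
    """Expected wins for `bot` in an all-play-all, games_per_pair games per pairing."""
    total = 0.0
    for other in bots:
        if other == bot:
            continue
        p = win_rate[(bot, other)] if (bot, other) in win_rate else 1.0 - win_rate[(other, bot)]
        total += p * games_per_pair
    return total

bots = ["A", "B", "C", "D", "W"]
for bot in sorted(bots, key=lambda b: -expected_score(b, bots)):
    print(bot, round(expected_score(bot, bots), 1))
```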
Rating systems in the Elo tradition have the same issue (and their designers know all about it). They assume—they have to assume, to be what they are—that players have a “true skill” in a mathematical sense, putting players into a smooth mathematical model that doesn’t correspond exactly with bumpy reality. It’s a good approximation; Elo ratings are mostly accurate in predicting future results. (The small mismatch with reality has inspired a lot of variations of Elo ratings, Glicko and TrueSkill and so on, that try to do a little better.)
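For concreteness, the whole Elo model fits in a few lines. The K-factor of 32 here is a common convention, not a law:

```python
def elo_expected(r_a, r_b):
    """Elo's smooth model: the predicted probability that A beats B depends
    only on the rating difference, through a logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    """Update both ratings after one game.
    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k controls how fast ratings move; 32 is a common choice."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta
```

Because each player is boiled down to one number, the model can't represent a cycle like A > B > C > A; the best it can do is average the cycle away. That's the mismatch with bumpy reality.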
Given any big enough set of games (games that link up the competitors into a connected graph), you can find Elo ratings for the players. The ratings may have big uncertainties, but you can rank the players. You can use virtually any tournament design with almost any kind of random or biased pairings, and get a ranking.
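Here's a rough sketch of that idea, not any tournament's actual rating code. The fitting method, many shuffled passes of small online updates, is just one simple way to settle on ratings; fancier methods exist:

```python
import random

def elo_expected(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def fit_ratings(games, players, passes=200, k=8, seed=0):
    """Estimate Elo-style ratings from any list of (winner, loser) games, in any
    order and with any pairings, as long as the games connect all the players.
    Repeated small online updates over shuffled passes settle near a fit of the
    one-number-per-player model."""
    rng = random.Random(seed)
    ratings = {p: 1500.0 for p in players}
    for _ in range(passes):
        shuffled = list(games)
        rng.shuffle(shuffled)
        for winner, loser in shuffled:
            surprise = 1.0 - elo_expected(ratings[winner], ratings[loser])
            ratings[winner] += k * surprise
            ratings[loser] -= k * surprise
    return ratings

# Lopsided, non-round-robin pairings still give a ranking:
games = [("A", "B")] * 6 + [("B", "A")] * 4 + [("B", "C")] * 9 + [("C", "B")] * 1
print(sorted(fit_ratings(games, "ABC").items(), key=lambda kv: -kv[1]))
```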
To me this is an intuitive way to think about tournament design: players play games, which we take as evidence of skill. The key question is: with a given amount of time to play games, how do you want to distribute the evidence? If you want to rank all the competitors as well as possible, then distribute the evidence equally in a round-robin. That's the idea behind AIIDE's design, and I approve. If you want to pick out the one winner, or the top few winners, as clearly as possible, then let the potential winners play more games. If Loser1 and Loser2 are out of the running, then games between them produce little evidence about who the top winner will be. A game between Winner1 and Loser1 produces less evidence than a game between Winner1 and Winner2. Because of intransitivity you may get a different winner than the round robin would have given, but you have more evidence that your winner is the “best.” It's a tradeoff; ask and it shall be given you.
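Here's a toy Monte Carlo of that tradeoff, with made-up bots and made-up win probabilities: the same 120-game budget spread evenly over a round-robin, versus spent mostly on a long match between the two leading contenders. The concentrated format picks out the truly best bot more reliably; the round-robin ranks everybody.

```python
import random
from itertools import combinations

# Hypothetical "true" win probabilities, invented for illustration:
# T1 and T2 are close contenders, L1 and L2 are clearly weaker.
P = {
    ("T1", "T2"): 0.55,
    ("T1", "L1"): 0.80, ("T1", "L2"): 0.80,
    ("T2", "L1"): 0.80, ("T2", "L2"): 0.80,
    ("L1", "L2"): 0.50,
}
BOTS = ["T1", "T2", "L1", "L2"]

def p_win(a, b):
    return P[(a, b)] if (a, b) in P else 1.0 - P[(b, a)]

def play(a, b, n, rng, wins):
    """Play n games between a and b, adding the results to the wins tally."""
    for _ in range(n):
        winner = a if rng.random() < p_win(a, b) else b
        wins[winner] = wins.get(winner, 0) + 1

def round_robin(rng, games_per_pair=20):
    """120-game budget spread evenly: best total record wins."""
    wins = {}
    for a, b in combinations(BOTS, 2):
        play(a, b, games_per_pair, rng, wins)
    return max(BOTS, key=lambda bot: (wins.get(bot, 0), rng.random()))

def focus_on_contenders(rng, seed_games=4, final_games=96):
    """Same 120-game budget: a short all-play-all seeding (24 games),
    then the top two play a 96-game head-to-head for the title."""
    wins = {}
    for a, b in combinations(BOTS, 2):
        play(a, b, seed_games, rng, wins)
    finalists = sorted(BOTS, key=lambda bot: (wins.get(bot, 0), rng.random()))[-2:]
    final = {}
    play(finalists[0], finalists[1], final_games, rng, final)
    return max(finalists, key=lambda bot: (final.get(bot, 0), rng.random()))

rng = random.Random(0)
trials = 2000
for fmt in (round_robin, focus_on_contenders):
    top1 = sum(fmt(rng) == "T1" for _ in range(trials))
    print(fmt.__name__, "crowned the truly best bot in",
          round(100 * top1 / trials), "% of trials")
```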
You might also care about entertaining the spectators. That’s the idea behind SSCAIT’s elimination format for the “mixed competition.” I approve of that too; it’s poor evidence but good fun.
As a corollary, the kind of tournament you want to win could make a difference in what you want to work on. In a round robin, beating the weak enemies more consistently, as ZZZKBot did, counts as much as clawing extra points from the strong enemies, as UAlbertaBot did.