DeepMind takes on poker & Scotland Yard

Read an article the other day (DeepMind makes bet on AI system that can play poker, chess, Go, and more) about a new DeepMind game-playing program that takes on perfect and imperfect information games with the same algorithms.

As you may recall, DeepMind's prior game-playing programs, AlphaZero and MuZero, played the perfect information games chess, shogi, and Go, and achieved top rankings in all of them. These were all based on reinforcement learning and advanced search.

Perfect information games have no hidden information; that is, all the information needed to play the game is visible to all players (see the Wikipedia Perfect information article). Imperfect information games have private or hidden information, visible only to one player or a select set of players. In card games, any card that's not shown to all players represents hidden information. The other difference in imperfect information games is that players attempt to keep their private information hidden as long as possible.
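To make the distinction concrete, here's a minimal Python sketch (the names and structure are my own, not PoG's): each player's observation is the public state plus only that player's own private state.

```python
from dataclasses import dataclass, field

# Minimal sketch (names are my own): in a perfect information game every
# player sees the whole state; in an imperfect information game each player
# sees the public state plus only their own private state.
@dataclass
class GameState:
    public: dict = field(default_factory=dict)    # board, pot, revealed cards
    private: dict = field(default_factory=dict)   # player_id -> hidden info

    def observation(self, player_id):
        """Everything `player_id` is allowed to see."""
        return {"public": self.public,
                "private": self.private.get(player_id)}

# Texas hold'em flavor: hole cards are private, community cards are public.
state = GameState(
    public={"community_cards": ["Ah", "Kd", "7s"], "pot": 120},
    private={0: ["Qh", "Qc"], 1: ["2d", "9s"]},
)
print(state.observation(0))   # player 0 never sees player 1's hole cards
```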

The latest DeepMind paper (see: Player of Games arXiv paper) discusses a new approach to automated game playing that works for both perfect and imperfect information games. DeepMind's latest game-playing program is called Player of Games (PoG).

As many may know, Texas hold'em is a form of poker where everyone is dealt two cards down, and five cards are dealt face up for everyone to share (see the Texas hold'em and Betting in poker articles on Wikipedia). Betting happens after the two down cards are dealt, after the next 3 up cards (called the "flop") are dealt, and then after each of the remaining 2 up cards is dealt. Players select any 5 of their 2 down cards and the 5 up cards to create the best 5-card poker hand. Betting is based on a blind (a sort of minimum forced bet). PoG plays as a single player, performing all the betting as well as the card playing for Texas hold'em. No-limit betting means there's no limit (maximum) on the amount of a bet in the game.
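For readers who want the flow of a hand spelled out, here's a minimal Python sketch of the deal and betting-round structure just described (the betting logic and hand evaluator are stubs of my own, not a real poker engine):

```python
import itertools
import random

RANKS = "23456789TJQKA"
SUITS = "shdc"
deck = [r + s for r, s in itertools.product(RANKS, SUITS)]
random.shuffle(deck)

num_players = 4
hole = {p: [deck.pop(), deck.pop()] for p in range(num_players)}  # 2 down cards
community = []

def betting_round(name):
    # Stub: real no-limit betting (blinds, raises, folds) would go here.
    print(f"betting round: {name}, community = {community}")

betting_round("pre-flop")
community += [deck.pop() for _ in range(3)]   # the "flop"
betting_round("flop")
community.append(deck.pop())                  # the "turn"
betting_round("turn")
community.append(deck.pop())                  # the "river"
betting_round("river")

# Each player picks their best 5-card hand from 2 hole + 5 community cards.
# A real evaluator would rank all 21 combinations; here we just count them.
for p, cards in hole.items():
    n = sum(1 for _ in itertools.combinations(cards + community, 5))
    print(f"player {p}: hole {cards}, {n} candidate 5-card hands")
```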

Scotland Yard is a board game where detectives chase down a criminal (Mr. X) on the run across the city of London (see the Wikipedia Scotland Yard (board game) article). Detectives each get 23 transportation tickets: taxis (11), buses (8), and underground trains (4). The game takes place on a board layout of London and starts with each detective and the criminal selecting a card with their hidden starting position on the board. The criminal gets an almost unlimited supply of transportation tickets plus 5 universal tickets (in the US edition), which can be used to take ferries as well as any other form of transport. Every time the criminal moves (except when using a universal ticket), he reveals his form of transportation. And 5 times during the game the criminal also reveals his current location. The detective that finds the criminal wins.
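The hidden-information twist is that each revealed ticket type narrows down where Mr. X could be. Here's a toy Python sketch of how a belief over Mr. X's possible locations can be tracked (the 4-node map is invented for illustration; the real board has about 200 numbered locations):

```python
# Toy sketch: detectives maintain the set of nodes Mr. X could occupy, given
# only the transport type he reveals each move.
edges = {
    ("taxi", 1): {2}, ("taxi", 2): {1, 3}, ("taxi", 3): {2},
    ("bus", 1): {3},  ("bus", 3): {1, 4},  ("bus", 4): {3},
}

def update_belief(possible, transport):
    """All nodes reachable from any currently possible node via `transport`."""
    nxt = set()
    for node in possible:
        nxt |= edges.get((transport, node), set())
    return nxt

possible = {1, 2, 3, 4}            # after Mr. X picks his secret starting card
for transport in ["taxi", "bus"]:  # ticket types Mr. X reveals as he moves
    possible = update_belief(possible, transport)
    print(f"Mr. X used a {transport}; he could now be at {sorted(possible)}")
```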

I assume all my readers know how to play chess and Go (or at least understand them).

While MuZero and AlphaZero used reinforcement learning for training and sophisticated search during game play, PoG needed to do something different due to the imperfect (or hidden) information present in hold'em and Scotland Yard.

How PoG is different

In imperfect information games, it's important to hide private information. In poker, when I got a great hand, I raised my bets substantially. But this often caused my opponents to withdraw from betting unless they had a great hand as well. I sometimes think that if I bet more consistently, and only bet big in the last betting round when I had a good hand, I might win more money. Which is no doubt why I don't play poker anymore.

Like AlphaZero and MuZero, PoG uses reinforcement learning through self-play, but it adds something called counterfactual regret (CFR) minimization to its game trees.

In addition to computing a value (reward) for the move actually taken, as in reinforcement learning, PoG uses CFR minimization to compute values for all the moves not taken at every stage of a game, for each player. The difference between the value of one of these counterfactual moves and the value of the move actually taken is that move's regret. CFR minimization adjusts each player's strategy over repeated play so as to shrink accumulated regret at every decision point, driving play toward a strategy that no single deviation could improve upon.
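To make the regret idea concrete, here's a minimal regret-matching sketch in Python/NumPy for a one-shot matrix game. Regret matching is the building block inside CFR; this toy (rock-paper-scissors against a fixed, exploitable opponent mix) is my own illustration, not PoG's implementation:

```python
import numpy as np

# Rock-paper-scissors payoffs to the row player. Regret matching against a
# fixed opponent mix shows how accumulated regret reshapes the strategy.
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

rng = np.random.default_rng(0)
regret_sum = np.zeros(3)
strategy_sum = np.zeros(3)
opponent = np.array([0.4, 0.3, 0.3])   # a fixed, slightly exploitable mix

def strategy_from_regrets(r):
    pos = np.maximum(r, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1 / 3)

for _ in range(10_000):
    strat = strategy_from_regrets(regret_sum)
    strategy_sum += strat
    action = rng.choice(3, p=strat)
    # Counterfactual values: expected payoff of *every* action vs. the mix.
    cf_values = payoff @ opponent
    # Regret of each action = its value minus the value of the action taken.
    regret_sum += cf_values - cf_values[action]

print(strategy_sum / strategy_sum.sum())   # shifts toward the best response
```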

CFR minimization is used during self-play training as well as during actual game play, to generate sub-trees from wherever the game happens to be. PoG uses depth-limited CFR minimization to generate these sub-trees during game play, which reduces the time it takes to determine the best move for all players. Read the arXiv paper to learn more.
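Depth-limiting means the sub-tree expansion stops after a few plies and a learned value estimate stands in for everything below the cut-off. Here's a hypothetical sketch of that idea on a toy game (the value_net stub and the game itself are mine, not PoG's):

```python
# Toy game to keep the sketch runnable: a pile of n stones, each move removes
# 1 or 2 stones, and whoever takes the last stone wins (+1 for the first
# player, -1 for the second). Everything here is illustrative, not PoG's code.
def legal_moves(n):
    return [m for m in (1, 2) if m <= n]

def value_net(n, player):
    # Stand-in for a learned value network: returns a neutral guess.
    return 0.0

def depth_limited_value(n, player, depth):
    if n == 0:
        return -player               # the previous player took the last stone
    if depth == 0:
        return value_net(n, player)  # stop expanding; trust the estimate
    children = [depth_limited_value(n - m, -player, depth - 1)
                for m in legal_moves(n)]
    return max(children) if player == 1 else min(children)

# Search only 4 plies deep from a 10-stone pile instead of the full game tree.
print(depth_limited_value(10, player=1, depth=4))
```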

The challenge with this approach is that it will never be as good as pure reinforcement learning + advanced search on perfect information games such as chess and Go. For example, below we show exploitability ratings at various PoG training levels for Leduc poker and Scotland Yard. Exploitability is one way to measure how well a player is playing. Lower is better.

Perfect play (in an imperfect information game) would have an exploitability of 0. The charts show that the more training done, the better PoG plays (Leduc) poker and Scotland Yard. (Leduc poker is a simplified poker game with a 6-card deck and limited betting.)
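For a two-player zero-sum game, exploitability measures how much a best-responding opponent could gain against a strategy profile; it hits 0 exactly at a Nash equilibrium. A minimal Python/NumPy sketch on matching pennies (my own example, not the paper's computation):

```python
import numpy as np

# Matching pennies: row player wants the coins to match, column player wants
# a mismatch. Payoffs are to the row player; the equilibrium is 50/50 for both.
A = np.array([[ 1, -1],
              [-1,  1]], dtype=float)

def exploitability(x, y):
    br_row = (A @ y).max()         # row player's best-response value vs. y
    br_col = (x @ A).min()         # column's best response pushes this down
    return (br_row - br_col) / 2   # 0 exactly at a Nash equilibrium

uniform = np.array([0.5, 0.5])
biased = np.array([0.8, 0.2])      # over-plays heads, hence exploitable

print(exploitability(uniform, uniform))   # 0.0
print(exploitability(biased, uniform))    # 0.3 -- positive means exploitable
```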

On the other hand, for perfect information games the results were okay, but not stellar. Stockfish is the current non-reinforcement-learning chess playing champion. GnuGo and Pachi are non-reinforcement-learning Go playing programs. The tables below use a relative ranking based on a 0 baseline for chess (Stockfish with 1 thread and 100 msec think time) and for Go (GnuGo). Higher is better.

~~~~

So yes, PoG can do well in imperfect information games given decent training, and okay (but much, much better than I and probably the vast majority of humans) in perfect information games.

Why concern ourselves with imperfect information games? The world is chock full of them. They seem to occur everywhere: military strategy, sports play, finance, etc. In fact, perfect information games are the exception in real-world situations. Thus, any advance that plays multiple imperfect information games better is yet another small step towards AGI.

Photo Credit(s):