How AlphaGo Works
There are tons of AlphaGo thinkpieces, so I'll refer you to Google for those. Here are two I like:
DeepMind videos
Atlantic
Briefly, suppose you didn't know anything about poker but still wanted to play.
If you were shown the possible moves at each turn, all you would need to know is how each move affects your chance of winning.
You've turned a poker game into a math problem.
Great, but that raises the question: How do you get these probabilities?
This, of course, is the million-dollar question, but the answer is basically:
simulate a large number of games and count how often each move leads to a win.
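To make that concrete, here's a minimal sketch of the brute-force version of this idea: estimate each candidate move's win probability by simulating many games from the position it leads to. The `Game` interface (`legal_moves`, `copy`, `play`, `random_playout_winner`) is a hypothetical placeholder, not a real library.

```python
def estimate_win_probabilities(game, player, n_simulations=1000):
    """Return {move: estimated probability that `player` wins} for each legal move."""
    probs = {}
    for move in game.legal_moves():
        wins = 0
        for _ in range(n_simulations):
            # Copy the position, make the move, then finish the game with
            # random moves for both sides (the hypothetical Game does this).
            sim = game.copy()
            sim.play(move)
            if sim.random_playout_winner() == player:
                wins += 1
        probs[move] = wins / n_simulations
    return probs

# The "math problem" is then trivial: pick the move with the highest estimate.
# best_move = max(probs, key=probs.get)
```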
How AlphaGo Works in too much detail
The paradigm for AlphaGo isn't deep learning, it's Monte Carlo Tree Search (MCTS).
MCTS is a smart way of gathering statistics for games of perfect information where
you're playing against an opponent. I'll refer you to other resources since I
don't know more than that.
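For a rough picture of what that statistics-gathering looks like, here is a minimal sketch of plain (pre-AlphaGo) MCTS with UCB1 selection, reusing the hypothetical `Game` interface from the snippet above. For brevity it scores every playout from a single player's perspective; a real two-player implementation alternates the sign at each ply.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move
        self.children = []
        self.visits = 0   # N: how many simulations passed through this node
        self.wins = 0     # how many of those simulations ended in a win

    def ucb1(self, c=1.4):
        if self.visits == 0:
            return float("inf")              # try every child at least once
        exploit = self.wins / self.visits    # empirical win rate
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(root_state, player, n_iterations=10_000):
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1. Selection: walk down the tree, always taking the child with
        #    the largest UCB1 score, until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add children for the leaf's legal moves.
        if not node.state.is_terminal():
            for m in node.state.legal_moves():
                child_state = node.state.copy()
                child_state.play(m)
                node.children.append(Node(child_state, parent=node, move=m))
            node = random.choice(node.children)
        # 3. Simulation: finish the game with random play.
        won = 1 if node.state.random_playout_winner() == player else 0
        # 4. Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.wins += won
            node = node.parent
    # Recommend the most-visited move (more stable than picking by win rate).
    return max(root.children, key=lambda n: n.visits).move
```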
Monte Carlo Tree Search
Interpreting Fig. 3 (black to move)
a. Select the move with the maximum action value Q plus an exploration bonus, and repeat until you reach a leaf.
b. If a position hasn't been previously explored, the policy network evaluates its possible moves and stores them as prior probabilities.
c. The leaf position is evaluated with both the fast rollout policy and the value network.
d. Action values and visit counts are backpropagated up the tree.
The policy network limits the breadth of the search (fewer moves to consider at each position), while the value network
and fast rollout policy limit its depth by approximating the result of playing out to the end of the game.
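Putting the four steps together, here is a rough sketch of the per-edge bookkeeping described in Fig. 3, following the formulas reported for AlphaGo: selection by Q plus a prior-weighted exploration bonus, leaf evaluation as a λ-weighted mix of the value network and a fast rollout, and backup of mean action values. The `Edge` fields and the `value_net` / `rollout_policy` callables are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass
import math

@dataclass
class Edge:
    P: float           # prior probability from the policy network (step b)
    N: int = 0         # visit count
    W: float = 0.0     # total value backed up through this edge
    Q: float = 0.0     # mean action value, W / N

def select(edges, c_puct=5.0):
    # (a) Selection: pick argmax_a [ Q(s, a) + u(s, a) ], where the bonus u
    # is proportional to the prior P and decays as the visit count N grows,
    # so search shifts from the policy network's opinion toward measured Q.
    total_visits = sum(e.N for e in edges)
    def score(e):
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)
        return e.Q + u
    return max(edges, key=score)

def evaluate_leaf(state, value_net, rollout_policy, lam=0.5):
    # (c) Evaluation: mix the value network's estimate with the outcome z_L
    # of a fast rollout played to the end of the game:
    #     V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L
    v = value_net(state)                   # scalar estimate in [-1, 1]
    z = rollout_policy.play_to_end(state)  # +1 for a win, -1 for a loss
    return (1 - lam) * v + lam * z

def backup(path_edges, leaf_value):
    # (d) Backup: every edge on the selected path gets one more visit, and
    # Q becomes the running mean of the leaf values that passed through it.
    for edge in path_edges:
        edge.N += 1
        edge.W += leaf_value
        edge.Q = edge.W / edge.N
```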
Technical Questions
- Why don't they use the Q value to choose the final move? The visit count N is more stable – why?
- How could AlphaGo be made more efficient?
- Why is the SL policy used in lieu of the RL-trained policy? It makes better, more diverse move selections. In other words, why is the RL policy myopic?
- Why didn't DQN work? Or rather, why couldn't it be made stronger without MCTS?