How AlphaGo Works
There are tons of AlphaGo thinkpieces, so I'll refer you to Google for those. Here are two I like:
DeepMind videos
Atlantic
Briefly, suppose you didn't know anything about poker but still wanted to play.
If you were shown the possible moves at each turn, all you would need to know is how each move affects your chance of winning.
You've turned a poker game into a math problem.
Great, but that raises the question: How do you get these probabilities?
This, of course, is the million-dollar question, but the answer is basically:
simulate a large number of games and count how often each move leads to a win.
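To make that concrete, here's a minimal sketch of the brute-force version of this idea: estimate each candidate move's win probability by simulating many games from the position it leads to. The `Game` interface (`legal_moves`, `copy`, `play`, `random_playout_winner`) is a hypothetical placeholder, not a real library.

```python
def estimate_win_probabilities(game, player, n_simulations=1000):
    """Return {move: estimated probability that `player` wins} for each legal move."""
    probs = {}
    for move in game.legal_moves():
        wins = 0
        for _ in range(n_simulations):
            # Copy the position, make the move, then finish the game with
            # random moves for both sides (the hypothetical Game does this).
            sim = game.copy()
            sim.play(move)
            if sim.random_playout_winner() == player:
                wins += 1
        probs[move] = wins / n_simulations
    return probs

# The "math problem" is then trivial: pick the move with the highest estimate.
# best_move = max(probs, key=probs.get)
```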
How AlphaGo Works in too much detail
The paradigm for AlphaGo isn't deep learning, it's Monte Carlo Tree Search (MCTS).
MCTS is a smart way of gathering statistics for games of perfect information where
you're playing against an opponent. I'll refer you to other resources since I
don't know more than that.
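For a rough picture of what that statistics-gathering looks like, here is a minimal sketch of plain (pre-AlphaGo) MCTS with UCB1 selection, reusing the hypothetical `Game` interface from the snippet above. For brevity it scores every playout from a single player's perspective; a real two-player implementation alternates the sign at each ply.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move
        self.children = []
        self.visits = 0   # N: how many simulations passed through this node
        self.wins = 0     # how many of those simulations ended in a win

    def ucb1(self, c=1.4):
        if self.visits == 0:
            return float("inf")              # try every child at least once
        exploit = self.wins / self.visits    # empirical win rate
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(root_state, player, n_iterations=10_000):
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1. Selection: walk down the tree, always taking the child with
        #    the largest UCB1 score, until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add children for the leaf's legal moves.
        if not node.state.is_terminal():
            for m in node.state.legal_moves():
                child_state = node.state.copy()
                child_state.play(m)
                node.children.append(Node(child_state, parent=node, move=m))
            node = random.choice(node.children)
        # 3. Simulation: finish the game with random play.
        won = 1 if node.state.random_playout_winner() == player else 0
        # 4. Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.wins += won
            node = node.parent
    # Recommend the most-visited move (more stable than picking by win rate).
    return max(root.children, key=lambda n: n.visits).move
```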
Monte Carlo Tree Search
Interpreting Fig. 3 (black to move)
a. Select the move with the maximum action value Q plus an exploration bonus, and repeat until you reach a leaf.
b. If a position hasn't been previously explored, the policy network evaluates its possible moves and stores them as prior probabilities.
c. The leaf position is evaluated with both the fast rollout policy and the value network.
d. Action values and visit counts are backpropagated up the tree.
The policy network limits the breadth of the search (fewer moves to consider at each position), while the value network
and fast rollout policy limit its depth by approximating the result of playing out to the end of the game.
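Putting the four steps together, here is a rough sketch of the per-edge bookkeeping described in Fig. 3, following the formulas reported for AlphaGo: selection by Q plus a prior-weighted exploration bonus, leaf evaluation as a λ-weighted mix of the value network and a fast rollout, and backup of mean action values. The `Edge` fields and the `value_net` / `rollout_policy` callables are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass
import math

@dataclass
class Edge:
    P: float           # prior probability from the policy network (step b)
    N: int = 0         # visit count
    W: float = 0.0     # total value backed up through this edge
    Q: float = 0.0     # mean action value, W / N

def select(edges, c_puct=5.0):
    # (a) Selection: pick argmax_a [ Q(s, a) + u(s, a) ], where the bonus u
    # is proportional to the prior P and decays as the visit count N grows,
    # so search shifts from the policy network's opinion toward measured Q.
    total_visits = sum(e.N for e in edges)
    def score(e):
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)
        return e.Q + u
    return max(edges, key=score)

def evaluate_leaf(state, value_net, rollout_policy, lam=0.5):
    # (c) Evaluation: mix the value network's estimate with the outcome z_L
    # of a fast rollout played to the end of the game:
    #     V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L
    v = value_net(state)                   # scalar estimate in [-1, 1]
    z = rollout_policy.play_to_end(state)  # +1 for a win, -1 for a loss
    return (1 - lam) * v + lam * z

def backup(path_edges, leaf_value):
    # (d) Backup: every edge on the selected path gets one more visit, and
    # Q becomes the running mean of the leaf values that passed through it.
    for edge in path_edges:
        edge.N += 1
        edge.W += leaf_value
        edge.Q = edge.W / edge.N
```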
Technical Questions
- Why don't they use the Q value to choose the final move? The visit count N is more stable – why?
- How could AlphaGo be made more efficient?
- Why is the SL policy used in lieu of the RL-trained policy? It makes better, more diverse move selections. In other words, why is the RL policy myopic?
- Why didn't DQN work? Or rather, why couldn't it be made stronger without MCTS?