
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm #13

Open
mokemokechicken opened this issue Dec 6, 2017 · 7 comments


@mokemokechicken
Owner

FYI: https://arxiv.org/abs/1712.01815

@mokemokechicken
Owner Author

The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo and AlphaGo Zero in two ways. First, training data was augmented by generating 8 symmetries for each position. Second, during MCTS, board positions were transformed using a randomly selected rotation or reflection before being evaluated by the neural network, so that the Monte Carlo evaluation is averaged over different biases.

Oh..., I didn't generate 8 symmetries for each position...
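For reference, a minimal sketch of that kind of augmentation, assuming the board and policy are stored as square 8x8 numpy arrays (the shapes and function name are illustrative, not this repo's actual API):

```python
import numpy as np

def augment_with_symmetries(board, policy_plane):
    """Yield the 8 symmetric variants (4 rotations x optional reflection)
    of a square board and its matching policy plane."""
    samples = []
    for k in range(4):  # 0, 90, 180, 270 degree rotations
        rot_b = np.rot90(board, k)
        rot_p = np.rot90(policy_plane, k)
        samples.append((rot_b, rot_p))
        samples.append((np.fliplr(rot_b), np.fliplr(rot_p)))  # mirrored variant
    return samples
```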

@mokemokechicken
Owner Author

mokemokechicken commented Dec 6, 2017

Dirichlet noise Dir(α) was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α = {0.3, 0.15, 0.03} for chess, shogi and Go respectively.

For Reversi, would α ≈ 0.3 ~ 0.5 be a better choice?
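For reference, a minimal sketch of how that noise is typically injected at the root (ε = 0.25 is the mixing weight from the AlphaGo Zero paper; the right α for Reversi is exactly the open question above, so the default here is just a guess):

```python
import numpy as np

def add_dirichlet_noise(priors, alpha=0.3, epsilon=0.25):
    """Mix Dirichlet noise into the root prior probabilities.

    priors  : 1-D numpy array of prior probabilities over the legal moves
    alpha   : Dirichlet concentration (0.3 ~ 0.5 may suit Reversi; unconfirmed)
    epsilon : fraction of the prior replaced by noise (0.25 in AlphaGo Zero)
    """
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * priors + epsilon * noise
```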

@mokemokechicken
Owner Author

Illegal moves are masked out by setting their probabilities to zero, and re-normalising the probabilities for remaining moves.

Re-normalising over the legal moves may be important for keeping the balance between the value and the policy.
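A sketch of that masking and re-normalisation, assuming the raw policy comes out of the network as a probability vector over all moves (names are illustrative, not this repo's actual code):

```python
import numpy as np

def mask_and_renormalize(policy, legal_moves_mask):
    """Zero out illegal moves and re-normalise over the remaining ones.

    policy           : 1-D numpy array, network output over all moves
    legal_moves_mask : 1-D 0/1 array of the same length (1 = legal)
    """
    masked = policy * legal_moves_mask
    total = masked.sum()
    if total > 0:
        return masked / total
    # Degenerate case: the network put (almost) all mass on illegal moves,
    # so fall back to a uniform distribution over the legal moves.
    return legal_moves_mask / legal_moves_mask.sum()
```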

@Zeta36

Zeta36 commented Dec 6, 2017

In chess, AlphaZero outperformed Stockfish after just 4 hours (300k steps)

Wow!!

@gooooloo
Contributor

gooooloo commented Dec 6, 2017

For Reversi, would α ≈ 0.3 ~ 0.5 be a better choice?

Agreed. Go 19x19 has roughly 180 legal actions on average, while Reversi may have only around 10. Following the new paper's inverse scaling, multiplying the Go value of 0.03 by roughly a factor of 10 seems more reasonable.
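That inverse-proportionality argument can be written down directly; a tiny back-of-the-envelope sketch (the branching-factor numbers are the rough estimates from this thread, not measured values):

```python
# Scale alpha inversely with the typical number of legal moves,
# anchored on the Go value from the paper (alpha = 0.03, ~180 moves).
GO_ALPHA = 0.03
GO_LEGAL_MOVES = 180          # rough average for 19x19 Go
REVERSI_LEGAL_MOVES = 10      # rough average guessed in this thread

reversi_alpha = GO_ALPHA * GO_LEGAL_MOVES / REVERSI_LEGAL_MOVES
print(reversi_alpha)          # 0.54, close to the 0.3 ~ 0.5 range above
```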

@apollo-time

What is the main difference between AlphaGo Zero and AlphaZero?
Is the MCTS architecture the same?

@mokemokechicken
Owner Author

Hi @apollo-time

I think the main differences are as follows (pp. 3~4 of the paper).

AlphaZero:

  • AlphaZero does not augment the training data with symmetries and does not transform the board position during MCTS (for generality across games).
  • The evaluation step is omitted; self-play is always performed with the newest model parameters. (!) (see the sketch below)
  • Hyper-parameters were not re-tuned by Bayesian optimization; the previous values are reused, except for the policy noise.

So MCTS is also run without transforming the board position.
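To illustrate the second point, a minimal sketch of an AlphaZero-style loop with no evaluator gate, where the functions for self-play and training are caller-provided placeholders (this is not this repo's actual pipeline):

```python
import random

def alphazero_style_loop(model, self_play_fn, train_fn, replay_buffer, iterations=1000):
    """AlphaZero-style loop: there is no evaluation/gating step;
    self-play always uses the newest model parameters.

    self_play_fn(model) -> list of training samples   (caller-provided)
    train_fn(model, batch)                             (caller-provided)
    """
    for _ in range(iterations):
        # Self-play with the current (newest) parameters.
        replay_buffer.extend(self_play_fn(model))

        # Optimize; the updated weights feed straight into the next
        # self-play round, with no "best vs. candidate" match in between.
        batch = random.sample(replay_buffer, min(len(replay_buffer), 2048))
        train_fn(model, batch)
```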
