
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm #13

Open
mokemokechicken opened this issue Dec 6, 2017 · 7 comments


@mokemokechicken
Owner

FYI: https://arxiv.org/abs/1712.01815

@mokemokechicken
Owner Author

The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo and AlphaGo Zero in two ways. First, training data was augmented by generating 8 symmetries for each position. Second, during MCTS, board positions were transformed using a randomly selected rotation or reflection before being evaluated by the neural network, so that the Monte Carlo evaluation is averaged over different biases.

Oh..., I didn't generate 8 symmetries for each position...
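For reference, a minimal sketch of that kind of augmentation, assuming the board and policy are stored as square 8x8 numpy arrays (the shapes and function name are illustrative, not this repo's actual API):

```python
import numpy as np

def augment_with_symmetries(board, policy_plane):
    """Yield the 8 symmetric variants (4 rotations x optional reflection)
    of a square board and its matching policy plane."""
    samples = []
    for k in range(4):  # 0, 90, 180, 270 degree rotations
        rot_b = np.rot90(board, k)
        rot_p = np.rot90(policy_plane, k)
        samples.append((rot_b, rot_p))
        samples.append((np.fliplr(rot_b), np.fliplr(rot_p)))  # mirrored variant
    return samples
```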

@mokemokechicken
Owner Author

mokemokechicken commented Dec 6, 2017

Dirichlet noise Dir(α) was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α = {0.3, 0.15, 0.03} for chess, shogi and Go respectively.

For Reversi, would α ≈ 0.3 ~ 0.5 be a better choice?
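For reference, a minimal sketch of how that noise is typically injected at the root (ε = 0.25 is the mixing weight from the AlphaGo Zero paper; the right α for Reversi is exactly the open question above, so the default here is just a guess):

```python
import numpy as np

def add_dirichlet_noise(priors, alpha=0.3, epsilon=0.25):
    """Mix Dirichlet noise into the root prior probabilities.

    priors  : 1-D numpy array of prior probabilities over the legal moves
    alpha   : Dirichlet concentration (0.3 ~ 0.5 may suit Reversi; unconfirmed)
    epsilon : fraction of the prior replaced by noise (0.25 in AlphaGo Zero)
    """
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * priors + epsilon * noise
```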

@mokemokechicken
Owner Author

Illegal moves are masked out by setting their probabilities to zero, and re-normalising the probabilities for remaining moves.

Re-normalising over the legal moves may be important for keeping the balance between the value and the policy.
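A sketch of that masking and re-normalisation, assuming the raw policy comes out of the network as a probability vector over all moves (names are illustrative, not this repo's actual code):

```python
import numpy as np

def mask_and_renormalize(policy, legal_moves_mask):
    """Zero out illegal moves and re-normalise over the remaining ones.

    policy           : 1-D numpy array, network output over all moves
    legal_moves_mask : 1-D 0/1 array of the same length (1 = legal)
    """
    masked = policy * legal_moves_mask
    total = masked.sum()
    if total > 0:
        return masked / total
    # Degenerate case: the network put (almost) all mass on illegal moves,
    # so fall back to a uniform distribution over the legal moves.
    return legal_moves_mask / legal_moves_mask.sum()
```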

@Zeta36

Zeta36 commented Dec 6, 2017

In chess, AlphaZero outperformed Stockfish after just 4 hours (300k steps)

Wow!!

@gooooloo
Contributor

gooooloo commented Dec 6, 2017

For Reversi, would α ≈ 0.3 ~ 0.5 be a better choice?

Agreed. Go 19x19 has roughly 180 legal actions on average, while Reversi may have only around 10. Following the new paper's inverse scaling, multiplying the Go value of 0.03 by roughly a factor of 10 seems more reasonable.
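That inverse-proportionality argument can be written down directly; a tiny back-of-the-envelope sketch (the branching-factor numbers are the rough estimates from this thread, not measured values):

```python
# Scale alpha inversely with the typical number of legal moves,
# anchored on the Go value from the paper (alpha = 0.03, ~180 moves).
GO_ALPHA = 0.03
GO_LEGAL_MOVES = 180          # rough average for 19x19 Go
REVERSI_LEGAL_MOVES = 10      # rough average guessed in this thread

reversi_alpha = GO_ALPHA * GO_LEGAL_MOVES / REVERSI_LEGAL_MOVES
print(reversi_alpha)          # 0.54, close to the 0.3 ~ 0.5 range above
```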

@apollo-time

What is the main difference between AlphaGo Zero and AlphaZero?
Is the MCTS architecture the same?

@mokemokechicken
Owner Author

Hi @apollo-time

I think the main differences are as follows (pp. 3~4 of the paper).

AlphaZero:

  • AlphaZero does not augment the training data with symmetries and does not transform the board position during MCTS (for generality across games).
  • The evaluation step is omitted; self-play is always performed with the newest model parameters. (!) (see the sketch below)
  • Hyper-parameters were not re-tuned by Bayesian optimization; the previous values are reused, except for the policy noise.

So MCTS is also run without transforming the board position.
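To illustrate the second point, a minimal sketch of an AlphaZero-style loop with no evaluator gate, where the functions for self-play and training are caller-provided placeholders (this is not this repo's actual pipeline):

```python
import random

def alphazero_style_loop(model, self_play_fn, train_fn, replay_buffer, iterations=1000):
    """AlphaZero-style loop: there is no evaluation/gating step;
    self-play always uses the newest model parameters.

    self_play_fn(model) -> list of training samples   (caller-provided)
    train_fn(model, batch)                             (caller-provided)
    """
    for _ in range(iterations):
        # Self-play with the current (newest) parameters.
        replay_buffer.extend(self_play_fn(model))

        # Optimize; the updated weights feed straight into the next
        # self-play round, with no "best vs. candidate" match in between.
        batch = random.sample(replay_buffer, min(len(replay_buffer), 2048))
        train_fn(model, batch)
```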
