RFC: Adding the Bootstrapped Dual Policy Iteration algorithm for discrete action spaces #499
Hey! This looks impressive, with lots of work behind it! Given that we want to keep the SB3 main repo (this one) compact-ish, with well-established baselines, could you open this PR on the contrib repo, which is designed for custom algorithms like this? We would be more than happy to review and accept it there!
Thanks! sb3_contrib was my original plan, but I encountered problems with running the unit tests in that repo (problems I don't have with stable-baselines3). I will try again to port my code to sb3_contrib and see if I can run all the unit tests.
Hello, yes, the paper was published at the European Conference on Machine Learning, with the proceedings in the Lecture Notes in Computer Science: https://link.springer.com/chapter/10.1007/978-3-030-46133-1_2 . I've put an arXiv link in the documentation, as that is what the other algorithms have.
Ah, feel free to open a PR regardless. There might be some bugs with unit tests in that repo (it's not as battle-tested as this one), so we can iron those out there :)
Dear maintainers,
I have implemented Bootstrapped Dual Policy Iteration in my fork of stable-baselines3: https://github.com/steckdenis/stable-baselines3 . As instructed by the contributing guide, and because this is the first time I am contributing to stable-baselines3, I'm opening this issue before making a pull request.
The original BDPI paper is https://arxiv.org/abs/1903.04193. The main reason I propose adding BDPI to stable-baselines3 is that it is quite different from the other algorithms: it heavily trades compute-efficiency for sample-efficiency, which is nice for slow-to-sample robotic tasks. The main results in the paper show that BDPI outperforms several state-of-the-art RL algorithms on many environments (Table is an environment that is difficult to explore, and Hallway comes from gym-miniworld and is a 3D environment):
I have reproduced these results with the BDPI algorithm I implemented in stable-baselines3, on LunarLander. PPO, DQN and A2C were run using the default hyper-parameters from rl-baselines3-zoo (I suppose the tuned ones?), for 8 random seeds each. The BDPI curve is also the result of 8 random seeds. I apologize for the truncated runs: BDPI and DQN only ran for 100K time-steps (for BDPI, due to time constraints, as it takes about an hour per run to perform 100K time-steps):
Features that I have implemented in my fork of stable-baselines3:
I welcome any comments regarding the quality of my code, and anything I can do to make the code and algorithm interesting to the community.
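For readers unfamiliar with BDPI, the core idea from the paper is that several bootstrapped critics are trained off-policy, and the actor is repeatedly pulled toward the greedy policy of one randomly chosen critic. The following is a minimal tabular sketch of that actor update only, not the code in my fork; the names (`actor`, `critics`, `ACTOR_LR`) and the randomly initialized critics are illustrative assumptions.

```python
import numpy as np

# Illustrative tabular sketch of the BDPI actor update (arXiv:1903.04193).
# The critics here are random Q-tables standing in for trained ones;
# in the real algorithm each critic is trained with aggressive
# off-policy (Clipped DQN-style) updates on its own bootstrap sample.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, N_CRITICS = 4, 3, 8
ACTOR_LR = 0.05  # the actor learning rate (lambda in the paper)

# actor[s] is a probability distribution over actions for state s
actor = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)
# each critic is an independent Q-table of shape (states, actions)
critics = rng.normal(size=(N_CRITICS, N_STATES, N_ACTIONS))

def bdpi_actor_update(actor, critics, states):
    """Move the actor toward the greedy policy of one sampled critic."""
    q = critics[rng.integers(len(critics))]          # pick one bootstrapped critic
    greedy = np.eye(N_ACTIONS)[q[states].argmax(axis=1)]  # one-hot greedy actions
    actor[states] = (1 - ACTOR_LR) * actor[states] + ACTOR_LR * greedy
    return actor

for _ in range(200):
    actor = bdpi_actor_update(actor, critics, np.arange(N_STATES))
```

Because each update mixes the current policy with a one-hot greedy policy, the actor stays a valid distribution and, with fixed critics, converges toward the average of the critics' greedy policies.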