RFC: Adding the Bootstrapped Dual Policy Iteration algorithm for discrete action spaces #499

Closed
steckdenis opened this issue Jul 2, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@steckdenis

Dear maintainers,

I have implemented Bootstrapped Dual Policy Iteration (BDPI) in my fork of stable-baselines3: https://github.com/steckdenis/stable-baselines3 . As instructed by the contributing guide, and because this is the first time I am contributing to stable-baselines3, I am opening this issue before making a pull request.

The original BDPI paper is https://arxiv.org/abs/1903.04193. The main reason I propose adding BDPI to stable-baselines3 is that it is quite different from the other algorithms: it heavily favors sample-efficiency at the cost of compute-efficiency (which is nice for slowly-sampled robotic tasks). The main results in the paper show that BDPI outperforms several state-of-the-art RL algorithms on many environments (Table is an environment that is difficult to explore, and Hallway comes from gym-miniworld and is a 3D environment):

[Figure bdpi_results: main benchmark results from the BDPI paper]

I have reproduced these results with the BDPI algorithm I implemented in stable-baselines3, on LunarLander. PPO, DQN and A2C were run using the default hyper-parameters from rl-baselines3-zoo (the tuned ones, I assume), for 8 random seeds each. The BDPI curve is also the result of 8 random seeds. I apologize for the truncated runs: BDPI and DQN only ran for 100K time-steps (for BDPI this was due to time constraints, as it takes about an hour per run to complete 100K time-steps):

[Figure bdpi_stablebaselines3: LunarLander learning curves for BDPI, PPO, DQN and A2C (8 random seeds each)]
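
For context, a single baseline run looks roughly like the following when launched directly with the stable-baselines3 API (illustrative only; the actual comparison used rl-baselines3-zoo and its training scripts, and LunarLander-v2 requires gym with the box2d extra):

```python
# Illustrative only: one DQN baseline run on LunarLander-v2 with default
# hyper-parameters. The zoo runs above used rl-baselines3-zoo's own scripts
# and tuned settings.
import gym
from stable_baselines3 import DQN

env = gym.make("LunarLander-v2")                  # needs gym[box2d]
model = DQN("MlpPolicy", env, seed=1, verbose=1)  # one of the 8 random seeds
model.learn(total_timesteps=100_000)              # same truncated 100K-step budget
model.save("dqn_lunarlander_seed1")
```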

Features that I have implemented in my fork of stable-baselines3:

  • Policies for the BDPI actor and critics, with documentation. Supported policies are MlpPolicy, CnnPolicy and MultiInputPolicy (a minimal usage sketch follows this list).
  • The BDPI algorithm itself, with a simple (non-invasive) multi-processing approach to help with the compute requirements. I aimed to write the code so that it is as easy as possible to understand.
  • The unit tests have been updated for BDPI, to the best of my knowledge. Every test involving BDPI passes, and there are no regressions.
  • I have updated the existing documentation and written new documentation for BDPI.
  • My changes to rl-baselines3-zoo (hyper-parameter definition, and tuned hyper-parameters for LunarLander) are here: https://github.com/steckdenis/rl-baselines3-zoo
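
As mentioned above, here is a minimal usage sketch. It assumes the fork exposes a BDPI class following the standard stable-baselines3 API; the exact import path and default arguments may differ slightly in the fork, and LunarLander-v2 requires gym with the box2d extra:

```python
# Minimal, illustrative sketch: training and evaluating BDPI with the usual
# stable-baselines3 API. The import below assumes BDPI is exported like the
# other algorithms in the fork; adjust the path if it differs.
import gym
from stable_baselines3 import BDPI

env = gym.make("LunarLander-v2")            # needs gym[box2d]
model = BDPI("MlpPolicy", env, verbose=1)   # MlpPolicy is one of the supported policies
model.learn(total_timesteps=100_000)

# Greedy evaluation rollout with the trained actor
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```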

I welcome any comments regarding the quality of my code, and any suggestions for making the code and algorithm more useful to the community.

@Miffyli
Collaborator

Miffyli commented Jul 2, 2021

Hey! This looks impressive, with lots of work behind it!

Given that we want to keep the SB3 main repo (this one) compact-ish with well-established baselines, could you open this PR on the contrib repo, which is designed for custom algorithms like this? We would be more than happy to review and accept it there!

@Miffyli added the enhancement label on Jul 2, 2021
@steckdenis
Author

Thanks!

sb3_contrib was my original plan, but I encountered problems running the unit tests in that repo (problems I don't have with stable-baselines3). I will try again to port my code to sb3_contrib and see if I can run all the unit tests.

@araffin
Member

araffin commented Jul 2, 2021

Hello,
as @Miffyli said, this is the perfect use case for SB3 contrib ;)
I will try to take a look at the paper later on; was it published somewhere?

@steckdenis
Author

Hello, yes, the paper was published at the European Conference on Machine Learning, with the proceedings in the Lecture Notes in Computer Science: https://link.springer.com/chapter/10.1007/978-3-030-46133-1_2 . I've put an arXiv link in the documentation, as that is what the other algorithms use.

@Miffyli
Collaborator

Miffyli commented Jul 2, 2021

sb3_contrib was my original plan, but I encountered problems running the unit tests in that repo (problems I don't have with stable-baselines3). I will try again to port my code to sb3_contrib and see if I can run all the unit tests.

Ah, feel free to open the PR regardless. There might be some bugs with the unit tests in that repo (it is not as battle-tested as this one), so we can iron those out there :)

@araffin
Member

araffin commented Jul 5, 2021

Hello,
could you re-open this issue in the contrib repo?

@araffin closed this as completed on Jul 5, 2021