RFC: Adding the Bootstrapped Dual Policy Iteration algorithm for discrete action spaces #499
Hey! This looks impressive, with lots of work behind it! Given that we want to keep the SB3 main repo (this one) compact-ish, with well-established baselines, could you open this PR on the contrib repo, which is designed for custom algorithms like this? We would be more than happy to review and accept it there!
Thanks! sb3_contrib was my original plan, but I encountered problems with running the unit tests in that repo (problems I don't have with stable-baselines3). I will try again to port my code to sb3_contrib and see if I can run all the unit tests.
Hello, yes, the paper was published at the European Conference on Machine Learning, with the proceedings in the Lecture Notes in Computer Science: https://link.springer.com/chapter/10.1007/978-3-030-46133-1_2 . I've put an arXiv link in the documentation, as that is what the other algorithms have.
Ah, feel free to open a PR regardless. There might be some bugs with unit tests in that repo (it's not as battle-tested as this one), so we can iron those out there :)
Dear maintainers,
I have implemented Bootstrapped Dual Policy Iteration in my fork of stable-baselines3: https://github.com/steckdenis/stable-baselines3 . As instructed by the contributing guide, and because this is the first time I am contributing to stable-baselines3, I'm opening this issue before making a pull request.
The original BDPI paper is https://arxiv.org/abs/1903.04193. The main reason I propose adding BDPI to stable-baselines3 is that it is quite different from the other algorithms: it heavily trades compute-efficiency for sample-efficiency, which is nice for slow-to-sample robotic tasks. The main results in the paper show that BDPI outperforms several state-of-the-art RL algorithms on many environments (Table is an environment that is difficult to explore, and Hallway comes from gym-miniworld and is a 3D environment):
I have reproduced these results with the BDPI algorithm I implemented in stable-baselines3, on LunarLander. PPO, DQN and A2C were run using the default hyper-parameters from rl-baselines3-zoo (I suppose the tuned ones?), for 8 random seeds each. The BDPI curve is also the result of 8 random seeds. I apologize for the truncated runs: BDPI and DQN only ran for 100K time-steps (for BDPI, due to time constraints, as it takes about an hour per run to perform 100K time-steps):
Features that I have implemented in my fork of stable-baselines3:
I welcome any comments regarding the quality of my code, and anything I can do to make the code and algorithm interesting to the community.
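For readers unfamiliar with BDPI, the core idea from the paper is that several bootstrapped critics are trained off-policy, and the actor is repeatedly pulled toward the greedy policy of one randomly chosen critic. The following is a minimal tabular sketch of that actor update only, not the code in my fork; the names (`actor`, `critics`, `ACTOR_LR`) and the randomly initialized critics are illustrative assumptions.

```python
import numpy as np

# Illustrative tabular sketch of the BDPI actor update (arXiv:1903.04193).
# The critics here are random Q-tables standing in for trained ones;
# in the real algorithm each critic is trained with aggressive
# off-policy (Clipped DQN-style) updates on its own bootstrap sample.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, N_CRITICS = 4, 3, 8
ACTOR_LR = 0.05  # the actor learning rate (lambda in the paper)

# actor[s] is a probability distribution over actions for state s
actor = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)
# each critic is an independent Q-table of shape (states, actions)
critics = rng.normal(size=(N_CRITICS, N_STATES, N_ACTIONS))

def bdpi_actor_update(actor, critics, states):
    """Move the actor toward the greedy policy of one sampled critic."""
    q = critics[rng.integers(len(critics))]          # pick one bootstrapped critic
    greedy = np.eye(N_ACTIONS)[q[states].argmax(axis=1)]  # one-hot greedy actions
    actor[states] = (1 - ACTOR_LR) * actor[states] + ACTOR_LR * greedy
    return actor

for _ in range(200):
    actor = bdpi_actor_update(actor, critics, np.arange(N_STATES))
```

Because each update mixes the current policy with a one-hot greedy policy, the actor stays a valid distribution and, with fixed critics, converges toward the average of the critics' greedy policies.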