
[Feature Request] Double DQN #487

Closed
NickLucche opened this issue Jun 22, 2021 · 18 comments
Labels: enhancement (New feature or request)

Comments

@NickLucche
Contributor

🚀 Feature

Add the Double DQN variant of the DQN algorithm.

Motivation

It's on the roadmap (#1).

Pitch

I suggest we go from:

with th.no_grad():
    # Compute the next Q-values using the target network
    next_q_values = self.q_net_target(replay_data.next_observations)
    # Follow greedy policy: use the one with the highest value
    next_q_values, _ = next_q_values.max(dim=1)

to:

with th.no_grad():
    # Compute the next Q-values using the target network
    next_q_values = self.q_net_target(replay_data.next_observations)
    if self.double_dqn:
        # Use the current (online) network to select the action with the maximal Q-value
        max_actions = th.argmax(self.q_net(replay_data.next_observations), dim=1)
        # Evaluate the Q-value of that action using the fixed target network
        next_q_values = th.gather(next_q_values, dim=1, index=max_actions.unsqueeze(-1))
    else:
        # Follow greedy policy: use the one with the highest value
        next_q_values, _ = next_q_values.max(dim=1)

with `double_dqn` as an additional flag passed to the DQN constructor.
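For clarity, here is a minimal, self-contained sketch of the proposed target computation as a standalone function (the function name and signature are illustrative, not part of SB3; in the actual change this logic would live inside DQN.train()):

import torch as th

def compute_td_target(q_net, q_net_target, next_observations, rewards, dones, gamma, double_dqn=True):
    # Sketch of the 1-step TD target with an optional Double DQN branch.
    with th.no_grad():
        # Next-state Q-values under the frozen target network: shape (batch, n_actions)
        next_q_values = q_net_target(next_observations)
        if double_dqn:
            # The online network selects the greedy action...
            max_actions = q_net(next_observations).argmax(dim=1, keepdim=True)
            # ...and the target network evaluates it: shape (batch, 1)
            next_q_values = th.gather(next_q_values, dim=1, index=max_actions)
        else:
            # Vanilla DQN: the target network both selects and evaluates
            next_q_values, _ = next_q_values.max(dim=1)
            next_q_values = next_q_values.reshape(-1, 1)
        # rewards and dones are assumed to have shape (batch, 1), as in replay buffer samples
        return rewards + (1.0 - dones) * gamma * next_q_values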

### Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
NickLucche added the enhancement (New feature or request) label on Jun 22, 2021
@Miffyli
Collaborator

Miffyli commented Jun 22, 2021

Yup, indeed, this is one of the core enhancements we would like to see!

@araffin just checking with you: is bloating DQN code here slightly ok (with a new parameter), or should this go to contrib?

@NickLucche If you wish to work on this PR, note that we would like to see results matching reference implementations (e.g. the same results as in the original paper or in a well-established codebase). One or two Atari games should be enough.

@NickLucche
Contributor Author

Sure thing, I'll make sure to set up the experiments in a similar way.
I'll do some searching starting from the source paper.

@Miffyli
Collaborator

Miffyli commented Jun 22, 2021

I highly recommend using the zoo for that, btw, since it already has known-to-work preprocessing for Atari :). It takes a moment to get into, but it will definitely pay off with these experiments (there are many tiny details that need taking care of).

@araffin
Member

araffin commented Jun 23, 2021

@araffin just checking with you: is bloating DQN code here slightly ok (with a new parameter), or should this go to contrib?

I think Double DQN is ok, as it is only a three-line change.
However, for prioritized replay and the rest of the DQN extensions, I would prefer to have a RAINBOW implementation with everything in it. That way we keep the DQN implementation clean but still have the full-blown RAINBOW too.

@NickLucche
Contributor Author

@Miffyli Thanks for the tip.

It may take a while to run the experiments for the same number of timesteps as in "Human-level control through deep reinforcement learning" though.
e.g. I'm getting a decent improvement in mean reward after ~5.5M steps (currently 5.92e+03 vs 1.41e+04) on Atlantis-v0.
Can I cut the number of steps short, or should I match the experiments in that respect too? What kind of benchmark do you expect as a result?

@araffin
Member

araffin commented Jun 23, 2021

Can I cut the number of steps short, or should I match the experiments in that respect too? What kind of benchmark do you expect as a result?

I could help with that if needed, but basically you can check what we did for QR-DQN: Stable-Baselines-Team/stable-baselines3-contrib#13

So Breakout + Pong using 10e6 training steps and 3 seeds (40e6 frames because of frame skip), plus some benchmark on a simple gym task.
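For reference, here is a hedged sketch of launching such a run directly with SB3's Atari helpers (the RL Zoo remains the recommended route, since it also ships tuned hyperparameters; the values below are illustrative, not the zoo's):

from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing (frame skip, grayscale, resize) plus 4-frame stacking
env = make_atari_env("PongNoFrameskip-v4", n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)

model = DQN("CnnPolicy", env, buffer_size=100_000, verbose=1, seed=0)
model.learn(total_timesteps=10_000_000)  # 10e6 steps, as suggested above
model.save("dqn_pong")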

@NickLucche
Contributor Author

NickLucche commented Jun 23, 2021

Thanks a lot for your help, it is most appreciated!
I'd love to help out with this small contribution, so I'll try to mimic what was done in that issue, if that's okay with you.

So far I've been using a script very similar to this one for testing: Stable-Baselines-Team/stable-baselines3-contrib#13 (comment), as I'm not yet super familiar with the zoo.

@araffin
Member

araffin commented Jun 24, 2021

So far I've been using a script very similar to this one for testing: Stable-Baselines-Team/stable-baselines3-contrib#13 (comment), as I'm not yet super familiar with the zoo.

The RL Zoo is quite easy to use. If you want to replicate results and plot them, we have a section for that in the documentation ;)
https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#how-to-replicate-the-results

@NickLucche
Contributor Author

Sorry for the long inactivity.
I managed to run a few experiments with the proposed change on Pong + Breakout. I'll leave the training curves here, although I can't notice much of a difference (at least that's on par with the original paper's findings).

Vanilla DQN Pong

[plot: pong_Training_Episodic_Reward]

Vanilla DQN Breakout

[plot: breakout_Training_Episodic_Reward]

Double DQN Pong

[plot: double_pong_Training_Episodic_Reward]

Double DQN Breakout

[plot: breakout_double_Training_Episodic_Reward]

@araffin
Member

araffin commented Sep 16, 2021

Thanks for the update. It would be best to have both DQN and Double DQN in the same plot (I can help you with the command if needed).
Also, you should probably try Atari games where Double DQN is known to make a difference (the paper gives some examples; I think Asterix is one of them).
Edit: you mentioned Atlantis too, no?

@NickLucche
Contributor Author

Oh yeah, I'd appreciate that. I've used the plot_train script from rl-zoo to be more standard, but I can't figure out how to put the results in the same plot using it.

Sure thing, I'll make sure to test it on a few more interesting use cases.

@araffin
Member

araffin commented Sep 17, 2021

Oh yeah, I'd appreciate that. I've used the plot_train script from rl-zoo to be more standard, but I can't figure out how to put the results in the same plot using it.

cf. https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#how-to-replicate-the-results

You need to plot the evaluations, for instance comparing DQN and PPO on Pong and Breakout:

python scripts/all_plots.py -a dqn ppo -e Pong Breakout -f rl-trained-agents/

(you can also specify multiple folders)
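If the zoo plotting scripts don't fit, a rough manual alternative is sketched below; it assumes each run folder contains the evaluations.npz file written by SB3's EvalCallback, and the paths are illustrative:

import numpy as np
import matplotlib.pyplot as plt

runs = [
    ("DQN", "logs/dqn/PongNoFrameskip-v4_1/evaluations.npz"),
    ("Double DQN", "logs/ddqn/PongNoFrameskip-v4_1/evaluations.npz"),
]
for label, path in runs:
    data = np.load(path)
    # "results" has shape (n_evaluations, n_eval_episodes): average over episodes
    plt.plot(data["timesteps"], data["results"].mean(axis=1), label=label)

plt.xlabel("Timesteps")
plt.ylabel("Mean evaluation reward")
plt.legend()
plt.show()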

@ercoargante

ercoargante commented Oct 22, 2021

As a lecturer in an AI program, I recommend that our students use Stable Baselines for their projects due to its ease of use and clear documentation (thanks for that!). From an educational point of view, Q-learning and DQN are a good introduction to RL, so students start off using DQN.

Results using DQN of SB3 are much, much worse compared to SB2 (both with default values for the parameters). This hampers the adoption of SB3 (and the enthusiasm of students for RL). I have not yet understood/investigated the reason for this difference. Obvious candidates are the missing extensions like PER and DDQN, but of course this is an assumption.

The goal of this comment is just to mention that progress in SB3 on this topic is much appreciated. If I can be of help, for example in testing improvements, let me know. Best regards, Erco Argante

@Miffyli
Collaborator

Miffyli commented Oct 22, 2021

@ercoargante

Thank you for the kind words!

You are right that the DQN augmentations do contribute to performance; however, they should not be "make or break" level changes across games. You need to check the original publications for how much each augmentation matters in the specific game you are using (it varies a lot).

Another common point is the hyperparameters. I think we changed some of the default training parameters between SB and SB3, so you might want to double-check that they are valid. Once those are set right (and you disable the augmentations of SB DQN), you should get equal-ish performance: #110

Another source of discrepancy may be the logging of rewards, where the reported performance reflects a different number (see e.g. #181). I do not think we included a fix for that specific issue in SB.

TL;DR: the augmentations definitely contribute here, but I would double-check the parameters and the evaluation procedure (these are two points we often bring up to others seeking advice) :)
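To make the evaluation point concrete, here is a hedged sketch of evaluating a trained agent on a separately wrapped environment instead of reading training-time episode rewards (the saved model path is illustrative):

from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack

# Evaluation env with the same Atari wrappers used during training
eval_env = VecFrameStack(make_atari_env("PongNoFrameskip-v4", n_envs=1, seed=1), n_stack=4)

model = DQN.load("dqn_pong")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")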

@araffin
Member

araffin commented Oct 22, 2021

Hello,

Results using DQN of SB3 are much, much worse compared to SB2 (both with default values for the parameters).

As mentioned in the docs, it is highly recommended to use the RL Zoo (https://github.com/DLR-RM/rl-baselines3-zoo), because the default hyperparameters are tuned for Atari games only. And you should compare apples to apples... (that's what we did in #110, where we deactivated the SB2 DQN extensions and matched hyperparameters).
SB2 DQN defaults are not the same as SB3 defaults...

Obvious candidates are the missing extensions like PER and DDQN, but of course this is an assumption.

In case you want to use a more recent algorithm, we have QR-DQN in our contrib repo.

The goal of this comment is just to mention that progress in SB3 on this topic is much appreciated.

My idea would be to have a RAINBOW implementation in SB3, to keep DQN simple but have all the tricks in one algorithm. I will open an issue (and contributions are welcome ;)).

@NickLucche
Contributor Author

Hi,
sorry I didn't post any update. I'd gladly accept any help with running further experiments if you're still interested in the proposed feature as a "flag argument" to the DQN class.
I've had trouble running experiments due to the long training times and (personal) GPU availability.
The one I managed to run on Atlantis didn't show any improvement over the vanilla implementation.

[plot: Results_Atlantis]

@Miffyli
Collaborator

Miffyli commented Oct 22, 2021

@NickLucche

Could you try training on the VideoPinball environment (it has the highest gap between DQN and DDQN according to the original paper)? I also recommend doing the runs with sb3-zoo, as it automatically applies all the necessary wrappers for proper evaluation (this can easily go wrong).

@araffin
Member

araffin commented Apr 11, 2022

Closing in favor of #622.
