Python script to balance Pendulum from open ai gym using Q-Learning and Double Q-Learning

BhanuPrakashPebbeti/Q-Learning_and_Double-Q-Learning

Balancing-Pendulum-with-Q-Learning

Q-Learning

Q-learning is an off-policy reinforcement learning algorithm that seeks the best action to take in the current state. It is considered off-policy because it learns the value of the greedy policy while the agent may follow a different behaviour policy, such as one that takes random exploratory actions. More specifically, Q-learning seeks to learn a policy that maximizes the total reward.

Important Terms in Q-Learning

  • States: The State, S, represents the current position of an agent in an environment.
  • Action: The Action, A, is the step taken by the agent when it is in a particular state.
  • Rewards: For every action, the agent will get a positive or negative reward.
  • Episodes: An episode ends when the agent reaches a terminating state and can’t take a new action.
  • Q-Values: Used to determine how good an Action, A, taken in a particular state, S, is: Q(S, A).

Bellman Equation

The Bellman Equation expresses the value of taking an action in a state as the immediate reward plus the discounted value of the best action available in the next state. Q-learning uses this equation as the update rule for the Q-Table. In any state, the best action is the one with the highest Q-value.
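For reference, the standard textbook form of the Bellman optimality equation and the Q-learning update derived from it can be written as below. The symbols α (learning rate), γ (discount factor), and r (reward) are the conventional ones and are assumptions here, since the README does not name them.

```latex
% Bellman optimality equation for the action-value function
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]

% Q-learning update applied to the Q-Table after observing (s, a, r, s')
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```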

Q-Learning Pseudo code
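The following is a minimal sketch of tabular Q-learning, not the repository's script. It assumes a hypothetical five-state chain environment instead of the Pendulum environment so that it runs with the standard library alone; the function names, `alpha`, `gamma`, and `epsilon` values are illustrative choices.

```python
import random


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Terminal states have no future value.
    best_next = 0.0 if done else max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    n_actions = len(q[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    best = max(q[state])
    # Break ties at random so the untrained agent does not favour one action.
    return rng.choice([a for a in range(n_actions) if q[state][a] == best])


def train_chain(episodes=500, n_states=5, epsilon=0.2, seed=0):
    """Hypothetical toy task: action 1 moves right, action 0 moves left.

    Reaching the last state gives reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on steps per episode
            action = epsilon_greedy(q, state, epsilon, rng)
            if action == 1:
                next_state = min(state + 1, n_states - 1)
            else:
                next_state = max(state - 1, 0)
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            q_learning_update(q, state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return q
```

For a continuous task such as Pendulum, the states would first have to be mapped to table indices, for example by binning the observations; that step is left out of this sketch.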

Reward Stats while Training Q-Learning

Problem with Q-Learning

The key term in Q-Learning, max Q(S', a'), is also its biggest problem. Because the same noisy estimates are used both to select the best next action and to evaluate it, the max operator tends to overestimate the Q-Values of some actions. This is why the algorithm can perform poorly in some stochastic environments.
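The overestimation can be illustrated with a small experiment, sketched below under assumed settings: every action has a true value of 0, each estimate is the mean of a few noisy rewards, and we average the maximum estimate over many trials. The function name and all parameters are hypothetical.

```python
import random


def mean_of_max_estimate(n_actions=10, n_samples=5, trials=2000, seed=0):
    """Average the max over noisy value estimates whose true value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each action's estimate is the sample mean of n_samples noisy rewards.
        estimates = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
            for _ in range(n_actions)
        ]
        total += max(estimates)
    return total / trials
```

Although no action is actually better than 0, the average of the maximum comes out clearly positive, which is the bias the max operator introduces into the Q-learning target.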

Solution - Double Q-Learning

The proposed solution is to maintain two Q-value functions, QA and QB, each of which is updated using the other's estimate for the next state. To update QA(s, a), first find the action a' that maximises QA in the next state (a' = argmax over a of QA(s’, a)), then use QB(s’, a') as the value of the next state. QB is updated in the same way with the roles of the two functions swapped.

Double Q-Learning Pseudo code
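A minimal sketch of a single Double Q-learning update is shown below. It is not the repository's code: the function name, the coin flip through `rng`, and the default `alpha` and `gamma` are assumptions, following the usual formulation in which one of the two tables is chosen at random on each step.

```python
import random


def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.9, rng=random):
    """Update one of the two Q-tables in place.

    Returns True if q_a was updated and False if q_b was updated.
    """
    # Flip a coin to decide which table is updated on this step.
    if rng.random() < 0.5:
        update, evaluate = q_a, q_b
    else:
        update, evaluate = q_b, q_a

    if done:
        next_value = 0.0
    else:
        # Select the action with the table being updated ...
        row = update[next_state]
        best_action = max(range(len(row)), key=row.__getitem__)
        # ... but evaluate that action with the other table.
        next_value = evaluate[next_state][best_action]

    td_target = reward + gamma * next_value
    update[state][action] += alpha * (td_target - update[state][action])
    return update is q_a
```

When acting, a common choice is to pick actions greedily with respect to the sum or average of the two tables.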

Reward Stats while Training Double Q-Learning

Pendulum Balancing

Pendulum_gif
