
DEEP REINFORCEMENT LEARNING ALGORITHMS

This repository contains PyTorch implementations of various Deep Reinforcement Learning algorithms and a comparison of their results.

Algorithms

The following algorithms have been implemented so far:

  1. REINFORCE/Vanilla Policy Gradient (VPG): OpenAI's Spinning Up
  2. Deep Q Learning (DQN): Mnih et al. 2013
  3. Trust Region Policy Gradient (TRPG): Spinning Up / (Schulman et al. 2015b) *
  4. Proximal Policy Optimization (PPO): Spinning Up / (Schulman et al. 2017)
  5. Soft Actor Critic (SAC) for discrete environments: Spinning Up (continuous version) / (Christodoulou 2019)

The policy gradient algorithms (1, 3, and 4) use Generalized Advantage Estimation (Schulman et al. 2015a); a minimal sketch of the advantage computation follows the footnote below.

* The implementation of TRPG occasionally fails during learning due to numerical issues; the reported results are from successful runs only.
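To illustrate how Generalized Advantage Estimation works, here is a minimal sketch of the advantage computation. The function and variable names (compute_gae, rewards, values, gamma, lam) are illustrative and are not taken from this repository; see the training scripts for the actual implementation.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory.

    rewards: array of shape (T,), the rewards r_t
    values:  array of shape (T + 1,), value estimates V(s_t) including a
             bootstrap value for the final state (0 if the episode ended)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Work backwards through the trajectory:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t     = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```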

Results

Space Invaders

This is the result of the DQN agent on the Gymnasium Atari Space Invaders environment. The training setup is similar to the original paper (Mnih et al. 2013): agent observations are the last four frames, scaled to 84x84 and converted to grayscale. The exact hyperparameters can be found in train_DQN_for_Space_Invaders.py and also closely follow Mnih et al. 2013. However, I only trained for 5 million steps, as opposed to 50 million in the original paper.
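As a rough illustration of the observation preprocessing described above (a stack of the last four frames, each scaled to 84x84 grayscale), here is a hedged sketch. It assumes OpenCV (cv2) for the grayscale conversion and resizing; the names preprocess_frame and FrameStack are hypothetical and may differ from the code in train_DQN_for_Space_Invaders.py.

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(frame):
    """Convert an RGB Atari frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

class FrameStack:
    """Keep the last four preprocessed frames as the agent's observation."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At the start of an episode, fill the stack with the first frame.
        processed = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return np.stack(self.frames, axis=0)  # shape (4, 84, 84)

    def step(self, frame):
        # Append the newest frame; the oldest one is dropped automatically.
        self.frames.append(preprocess_frame(frame))
        return np.stack(self.frames, axis=0)
```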

DQN Space Invaders

DQN Space Invaders Learning Curve

Cartpole

The algorithms were trained on OpenAI Gym's implementation of the Cart Pole environment. Each agent was trained for 400 training steps, with episodes automatically terminating after 200 timesteps. For the exact hyperparameters, see the training scripts (train_X_for_cartpole.py). Each learning curve shows the mean score over 5 runs of the algorithm, and the shaded area around the curve corresponds to the standard deviation across those runs. The curves were smoothed using a moving average with a window size of 4.
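For reference, here is a minimal sketch of how such curves can be produced: mean over 5 runs, one-standard-deviation shading, and a moving average with window size 4. The function names and the use of matplotlib are assumptions for illustration, not the plotting code used in this repository.

```python
import matplotlib.pyplot as plt
import numpy as np

def moving_average(x, window=4):
    """Smooth a 1-D array with a simple moving average."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

def plot_learning_curve(scores, label):
    """scores: array of shape (5, num_training_steps), one row per run."""
    mean = moving_average(scores.mean(axis=0))
    std = moving_average(scores.std(axis=0))
    steps = np.arange(len(mean))
    plt.plot(steps, mean, label=label)
    plt.fill_between(steps, mean - std, mean + std, alpha=0.3)
    plt.xlabel("training step")
    plt.ylabel("mean score")
    plt.legend()
```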

Cartpole Results

Note that the learning curves alone are not sufficient to compare two algorithms. First, the same number of training steps does not necessarily require the same amount of compute or training time. For example, DQN can perform a training step after every timestep once an initial period of exploration is over, whereas VPG must complete multiple full episodes for every training step. Furthermore, no hyperparameter tuning was done before running the algorithms; doing so might significantly improve performance. Hence, the learning curves only serve to demonstrate the correct implementation of the algorithms and their learning behaviour.

Acknowledgements

The implementations of the algorithms in this repository are my own, but it was immensely useful to look at the Spinning Up repository and Deep Reinforcement Learning Algorithms in PyTorch when I was stuck or looking for things to improve.

This Medium article by Rohan Tangri helped me understand Generalized Advantage Estimation.
