This repository contains PyTorch implementations of various Deep Reinforcement Learning algorithms and a comparison of their results.
The following algorithms have been implemented so far:
- REINFORCE/Vanilla Policy Gradient (VPG): OpenAI's Spinning Up
- Deep Q Learning (DQN): Mnih et al. 2013
- Trust Region Policy Optimization (TRPO): Spinning Up / (Schulman et al. 2015b) *
- Proximal Policy Optimization (PPO): Spinning Up / (Schulman et al. 2017)
- Soft Actor-Critic (SAC) for discrete environments: Spinning Up (continuous version) / (Christodoulou 2019)
The policy gradient algorithms (VPG, TRPO, and PPO) use Generalized Advantage Estimation (Schulman et al. 2015a); a minimal sketch of the advantage computation is shown below.
* The TRPO implementation occasionally fails during training due to numerical issues; the reported results are from successful runs only.
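For reference, Generalized Advantage Estimation boils down to a backward pass over the TD residuals of a trajectory. The following is a minimal, self-contained sketch; the function name, the default values of gamma and lambda, and the single-trajectory layout are illustrative assumptions and may differ from the code in this repository.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al. 2015a).

    rewards:    rewards r_0 ... r_{T-1} of one trajectory
    values:     value estimates V(s_0) ... V(s_{T-1})
    last_value: bootstrap value V(s_T); use 0 if the episode terminated
    """
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    # Walk backwards through the trajectory, accumulating the discounted sum
    # of TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```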
This is the result of the DQN agent on the gymnasium Atari Space Invaders environment. The training setup is similar to the original paper by Mnih et al. (2013b): the agent observes the last four frames, scaled to 84x84 grayscale. The exact hyperparameters can be found in train_DQN_for_Space_Invaders.py and are also similar to those of Mnih et al. (2013b). However, I only trained for 5 million steps as opposed to 50 million in the original paper.
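The observation preprocessing (last four frames, 84x84, grayscale) can be set up with the standard gymnasium wrappers roughly as follows. This is a sketch, not the code from the training script: the environment id and wrapper names are assumptions and depend on the installed gymnasium / ale-py versions.

```python
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing, FrameStackObservation

# Depending on your gymnasium / ale-py versions, the ALE environments may need
# to be registered first (import ale_py; gym.register_envs(ale_py)), and the
# frame-stacking wrapper may be called FrameStack instead.
env = gym.make("ALE/SpaceInvaders-v5", frameskip=1)  # frameskip handled by the wrapper below
env = AtariPreprocessing(env, frame_skip=4, screen_size=84, grayscale_obs=True)
env = FrameStackObservation(env, stack_size=4)

obs, info = env.reset()
print(obs.shape)  # (4, 84, 84): the last four grayscale frames
```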
The algorithms were trained on OpenAI Gym's implementation of the Cart Pole environment. Each agent was trained for 400 training steps, with episodes automatically terminating after 200 timesteps. For the exact hyperparameters see the training scripts (train_X_for_cartpole.py). The y-value of each learning curve is the mean score over 5 independent runs, and the shaded area around the curve is the standard deviation across those runs. The curves were smoothed using a moving average with a window size of 4.
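The curves can be reproduced from the per-run score logs roughly as follows. The array shapes and placeholder data here are hypothetical; the actual plotting code may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(x, window=4):
    # Simple moving average used to smooth the learning curves.
    return np.convolve(x, np.ones(window) / window, mode="valid")

# scores: one row per run, one column per training step, e.g. shape (5, 400).
# Placeholder data for illustration only.
scores = np.random.rand(5, 400) * 200

mean = moving_average(scores.mean(axis=0))
std = moving_average(scores.std(axis=0))
steps = np.arange(len(mean))

plt.plot(steps, mean, label="mean over 5 runs")
plt.fill_between(steps, mean - std, mean + std, alpha=0.3, label="standard deviation")
plt.xlabel("training step")
plt.ylabel("score")
plt.legend()
plt.show()
```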
Note that looking at the learning curves alone is not sufficient to compare two algorithms. Firstly, the same number of training steps does not necessarily require the same amount of computing power and training time. For example, DQN can perform a training step after every timestep once an initial exploration period has passed, whereas VPG must complete multiple full episodes for every training step. Furthermore, no hyperparameter tuning was done before running the algorithms; doing so might significantly improve performance. Hence, the learning curves only serve to demonstrate the correct implementation of the algorithms and their learning behaviour.
The implementations of the algorithms in this repository are my own, but it was immensely useful to look at the Spinning Up repository and Deep Reinforcement Learning Algorithms in PyTorch when I was stuck or looking for things to improve.
This Medium article by Rohan Tangri helped me understand Generalized Advantage Estimation.