a2c_is_a_special_case_of_ppo


A2C is a special case of PPO

See our technical paper here: https://arxiv.org/abs/2205.09123

We can make PPO match A2C's performance exactly by applying the following tweaks to PPO:

  1. Set the learning rate to exactly $0.0007$ (which also means turning off learning rate annealing), the entropy coefficient to $0$, and the number of steps to $5$.
  2. Turn off advantage normalization.
  3. Disable GAE by setting its lambda parameter to $1$.
  4. Set the number of update epochs $K$ to $1$, so the clipped objective has nothing to clip: the policy being updated is still the one that collected the data, so the probability ratio is $1$.
  5. Perform the update on the whole batch of training data (batch_size = n_envs * n_steps).
  6. Disable value function clipping.
  7. Use A2C's RMSprop optimizer and configuration.
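The steps above can be sketched as a set of keyword arguments. This is a minimal sketch, not the repository's script: the helper name `ppo_kwargs_matching_a2c` is hypothetical, the keys assume the parameter names of the Stable-Baselines3 `PPO` constructor, and the RMSprop values assume the defaults that Stable-Baselines3 uses for A2C.

```python
def ppo_kwargs_matching_a2c(n_envs: int, n_steps: int = 5) -> dict:
    """Hypothetical helper: PPO settings that mirror the seven steps.

    Key names are assumed to match Stable-Baselines3's PPO constructor.
    """
    return {
        "learning_rate": 7e-4,            # step 1: a constant, so no annealing
        "ent_coef": 0.0,                  # step 1: entropy coefficient
        "n_steps": n_steps,               # step 1: rollout length per environment
        "normalize_advantage": False,     # step 2
        "gae_lambda": 1.0,                # step 3: lambda = 1 disables GAE
        "n_epochs": 1,                    # step 4: K = 1
        "batch_size": n_envs * n_steps,   # step 5: one update on the whole batch
        "clip_range_vf": None,            # step 6: no value function clipping
    }


# Step 7 (assumed values): RMSprop settings believed to match A2C's defaults.
# They are listed for reference only; no optimizer is constructed here.
RMSPROP_SETTINGS = {"alpha": 0.99, "eps": 1e-5, "weight_decay": 0.0}

kwargs = ppo_kwargs_matching_a2c(n_envs=4)
print(kwargs["batch_size"])
```

With a setup like this, the dictionary would be unpacked into the PPO constructor, and the RMSprop settings would be supplied through the policy's optimizer options. The value of `n_envs` here is illustrative only.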

To see it in action, run:

poetry install
poetry run python sb3_ppo.py
poetry run python sb3_a2c.py

The following screenshot shows the sum of the first-layer weights of each updated model; the two sums are exactly the same.

[Screenshot: A2C vs PPO code]

Therefore, A2C is a special case of PPO when PPO 1) uses a learning rate of $0.0007$ with learning rate annealing turned off, 2) sets the entropy coefficient to $0$, 3) sets the number of steps to $5$, 4) turns off advantage normalization, 5) disables GAE, 6) sets the number of update epochs $K=1$, 7) uses the whole batch of data for each update, 8) disables value function clipping, and 9) uses the RMSprop optimizer.
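The role of $K=1$ can be checked numerically on a toy problem. The sketch below is an illustration under assumptions, not code from the repository: it uses a hypothetical three-action softmax policy, arbitrary logits and an arbitrary advantage, and finite differences instead of automatic differentiation. It compares the gradient of the A2C policy objective, $\log \pi_\theta(a) \cdot A$, with the gradient of PPO's clipped surrogate evaluated at $\theta = \theta_{\text{old}}$, where the probability ratio is $1$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_objective(logits, action, advantage):
    """A2C policy objective for one sample: log pi(a) * A."""
    return math.log(softmax(logits)[action]) * advantage


def ppo_clip_objective(logits, old_logits, action, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r) * A)."""
    ratio = softmax(logits)[action] / softmax(old_logits)[action]
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def finite_difference_grad(f, logits, h=1e-6):
    """Central-difference gradient of f with respect to the logits."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += h
        down = list(logits)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


# Arbitrary illustrative values. The old policy equals the current one,
# which is the situation in the first (and, with K = 1, only) update.
theta = [0.3, -1.2, 0.8]
action, advantage = 1, -2.5

grad_a2c = finite_difference_grad(
    lambda th: a2c_objective(th, action, advantage), theta
)
grad_ppo = finite_difference_grad(
    lambda th: ppo_clip_objective(th, theta, action, advantage), theta
)

# Largest componentwise difference between the two gradients.
max_gap = max(abs(a - b) for a, b in zip(grad_a2c, grad_ppo))
print(max_gap)
```

The two gradients agree up to finite-difference error because the ratio stays inside the clipping interval near $\theta_{\text{old}}$, so the clip is inactive and $\nabla_\theta r = \nabla_\theta \log \pi_\theta(a)$ at that point.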

Citation

@misc{https://doi.org/10.48550/arxiv.2205.09123,
  doi = {10.48550/ARXIV.2205.09123},
  url = {https://arxiv.org/abs/2205.09123},
  author = {Huang, Shengyi and Kanervisto, Anssi and Raffin, Antonin and Wang, Weixun and Ontañón, Santiago and Dossa, Rousslan Fernand Julien},
  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {A2C is a special case of PPO},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}