I’ve been studying the PPO1 implementation in OpenAI’s Baselines repository to understand how Proximal Policy Optimization works in practice. I’m having trouble with two specific parameters of the learn function: optim_batchsize and timesteps_per_actorbatch.
Can someone explain what these hyperparameters control and how they differ from each other?
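For context, here is roughly the call I’m looking at in baselines/ppo1/run_atari.py. I’m writing it out from memory, so the exact hyperparameter values, the env id, and the session setup are my approximation rather than a verbatim copy of the repo:

```python
from baselines.common import tf_util as U
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.ppo1 import pposgd_simple, cnn_policy

U.make_session(num_cpu=1).__enter__()  # single-threaded TF session, as the script sets up

def policy_fn(name, ob_space, ac_space):
    return cnn_policy.CnnPolicy(name=name, ob_space=ob_space, ac_space=ac_space)

# wrap_deepmind applies EpisodicLifeEnv (among other wrappers) by default
env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'))

pposgd_simple.learn(env, policy_fn,
                    max_timesteps=int(1e6),
                    timesteps_per_actorbatch=256,  # one of the two parameters I'm asking about
                    clip_param=0.2, entcoeff=0.01,
                    optim_epochs=4, optim_stepsize=1e-3,
                    optim_batchsize=256,           # the other one
                    gamma=0.99, lam=0.95,
                    schedule='linear')
env.close()
```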
Also, when I look at the Atari setup in run_atari.py, I notice they use the EpisodicLifeEnv wrapper, which terminates an episode whenever a life is lost. Early in training this produces very short episodes of around 7-8 steps. But if timesteps_per_actorbatch is set to 256, how does the algorithm collect enough data to perform an update? There seems to be a mismatch between the episode length and the batch size the update needs.
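For what it’s worth, this is the minimal check I used to convince myself how short those early episodes are. The env id and the random policy are just my choices for this test, not something taken from run_atari.py, and I’m assuming the old gym step API that Baselines uses:

```python
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

# wrap_deepmind uses episode_life=True by default, so "done" fires when a life is lost
env = wrap_deepmind(make_atari('BreakoutNoFrameskip-v4'))
env.reset()

steps, done = 0, False
while not done:
    _, _, done, _ = env.step(env.action_space.sample())  # random actions
    steps += 1

print('steps until the first "done" (one life lost):', steps)  # far fewer than 256 early on
env.close()
```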
Any insights into how this batching mechanism works would be really helpful. Thanks!