Understanding Batch Parameters in PPO Implementation from OpenAI Baselines

I’ve been studying the PPO1 algorithm in OpenAI’s Baselines repository to understand how Proximal Policy Optimization works in practice. I’m having trouble understanding two specific parameters in the learn function: optim_batchsize and timesteps_per_actorbatch.
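For reference, this is roughly the shape of the call I’m looking at in run_atari.py (typed from memory, so the exact argument values may not match the repo):

```python
from baselines.ppo1 import pposgd_simple

# Paraphrased from memory; env, policy_fn and num_timesteps are built
# earlier in run_atari.py, and the values below may be slightly off.
pposgd_simple.learn(env, policy_fn,
                    max_timesteps=num_timesteps,
                    timesteps_per_actorbatch=256,   # the "256" I refer to below
                    clip_param=0.2, entcoeff=0.01,
                    optim_epochs=4, optim_stepsize=1e-3,
                    optim_batchsize=64,             # the other parameter I'm unsure about
                    gamma=0.99, lam=0.95,
                    schedule='linear')
```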

Can someone explain what these hyperparameters control and how they differ from each other?

Also, when I look at the Atari setup in run_atari.py, I notice they use the EpisodicLifeEnv wrapper, which terminates episodes when a life is lost. This creates very short episodes of around 7-8 steps at the start of training. But if the batch size is set to 256, how does the algorithm collect enough data to perform updates? It seems like there’s a mismatch between episode length and required batch size.
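As far as I can tell, the wrapper does roughly this (my own simplified paraphrase, not the actual atari_wrappers.py source):

```python
class EpisodicLifeSketch:
    """Rough paraphrase of EpisodicLifeEnv: report done=True whenever a life
    is lost, but only reset the underlying game on a real game over."""
    def __init__(self, env):
        self.env = env
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        lives = self.env.unwrapped.ale.lives()
        if 0 < lives < self.lives:
            done = True  # a life was lost: signal end-of-episode to the agent
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        if self.was_real_done:
            obs = self.env.reset()           # full reset after a real game over
        else:
            obs, _, _, _ = self.env.step(0)  # otherwise just step past the life loss
        self.lives = self.env.unwrapped.ale.lives()
        return obs
```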

Any insights into how this batching mechanism works would be really helpful. Thanks!

Sure! The two parameters control different stages of each PPO iteration.

timesteps_per_actorbatch is the amount of experience collected per iteration: the rollout code (traj_segment_generator in pposgd_simple.py, if I remember right) keeps stepping the environment until it has gathered that many timesteps, running straight through episode boundaries and simply resetting whenever an episode ends. optim_batchsize is the minibatch size used in the optimization phase: once the rollout is in, advantages are computed, the data is shuffled, and the surrogate objective is optimized for optim_epochs passes over minibatches of that size.

So there is no mismatch with EpisodicLifeEnv. Even if early episodes only last 7-8 steps, the collector just concatenates transitions from many short episodes until it has timesteps_per_actorbatch of them (256 in the Atari config), and only then does the update phase run on minibatches of optim_batchsize.
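Here is a minimal sketch of that flow, assuming the usual gym step/reset API. It is my own simplification for illustration, not the actual pposgd_simple.py code, and names like collect_rollout and update_fn are made up:

```python
import numpy as np

def collect_rollout(env, policy, horizon):
    """Step the environment until `horizon` transitions are gathered,
    continuing across episode boundaries (reset whenever done=True)."""
    obs_buf, act_buf, rew_buf, done_buf = [], [], [], []
    ob = env.reset()
    for _ in range(horizon):
        ac = policy(ob)
        next_ob, rew, done, _ = env.step(ac)
        obs_buf.append(ob); act_buf.append(ac)
        rew_buf.append(rew); done_buf.append(done)
        ob = env.reset() if done else next_ob
    return (np.array(obs_buf), np.array(act_buf),
            np.array(rew_buf), np.array(done_buf))

def ppo_iteration(env, policy, update_fn,
                  timesteps_per_actorbatch=256,
                  optim_epochs=4, optim_batchsize=64):
    # 1) Collect a fixed-size batch of experience, regardless of episode length.
    obs, acts, rews, dones = collect_rollout(env, policy, timesteps_per_actorbatch)
    # (the real code would compute advantages and value targets here via GAE;
    #  raw rewards stand in below just to keep the sketch short)

    # 2) Optimize the surrogate objective on shuffled minibatches.
    idx = np.arange(timesteps_per_actorbatch)
    for _ in range(optim_epochs):
        np.random.shuffle(idx)
        for start in range(0, timesteps_per_actorbatch, optim_batchsize):
            mb = idx[start:start + optim_batchsize]
            update_fn(obs[mb], acts[mb], rews[mb])  # one gradient step per minibatch
```

So timesteps_per_actorbatch sets how much data is gathered before any update, and optim_batchsize sets how that data is chopped up during the SGD phase; the two never have to match episode length.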
