I’ve been studying the PPO1 algorithm code from OpenAI’s Baselines reinforcement learning library to understand how Proximal Policy Optimization actually works under the hood.
I’m having trouble understanding two specific parameters that get passed to the learn function: optim_batchsize and timesteps_per_actorbatch. Can someone explain what these hyperparameters control and how they differ from each other?
Also, when I look at the Atari example script, I notice they use environment wrappers like make_atari and wrap_deepmind. The EpisodicLifeEnv wrapper terminates episodes when a life is lost. This means episodes are really short at the start of training (around 7-8 steps), but the batch size is set to 256. How does the algorithm collect enough data for parameter updates when episodes are so brief?
The confusion around these parameters is pretty common when diving into PPO implementations. Think of timesteps_per_actorbatch as your data collection budget - it determines how many environment interactions you perform before triggering a policy update. The optim_batchsize parameter controls the minibatch size used when you run gradient descent over that collected data during the actual neural network training phase.
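To make that split concrete, here is a minimal sketch of the control flow (not the Baselines code itself; collect_rollout and update_on_minibatch are hypothetical placeholders standing in for the rollout generator and the gradient step):

```python
import numpy as np

TIMESTEPS_PER_ACTORBATCH = 256   # data-collection budget per policy update
OPTIM_BATCHSIZE = 64             # minibatch size used during the update phase
OPTIM_EPOCHS = 4                 # passes of SGD over the collected batch

def train_iteration(collect_rollout, update_on_minibatch):
    # Phase 1: interact with the environment until the budget is spent.
    batch = collect_rollout(TIMESTEPS_PER_ACTORBATCH)  # dict of NumPy arrays, length 256

    # Phase 2: several epochs of minibatch gradient descent over that fixed batch.
    indices = np.arange(TIMESTEPS_PER_ACTORBATCH)
    for _ in range(OPTIM_EPOCHS):
        np.random.shuffle(indices)
        for start in range(0, TIMESTEPS_PER_ACTORBATCH, OPTIM_BATCHSIZE):
            mb_idx = indices[start:start + OPTIM_BATCHSIZE]
            update_on_minibatch({k: v[mb_idx] for k, v in batch.items()})
```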
Regarding your Atari observation, the algorithm handles short episodes quite elegantly. Even though individual episodes might only last 7-8 steps initially, PPO continues collecting data across multiple episode resets until it reaches the timesteps_per_actorbatch threshold. So if you have 256 timesteps to collect but episodes are only 8 steps long, the agent will play through approximately 32 episodes before performing an update.
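A rough sketch of that collection loop, assuming a classic Gym-style env with reset()/step() and a `policy` callable that maps observations to actions (this mirrors the idea, not the exact Baselines segment generator):

```python
def collect_rollout(env, policy, horizon):
    """Step the env, resetting whenever an episode ends, until exactly
    `horizon` timesteps have been gathered (e.g. horizon = 256)."""
    obs = env.reset()
    observations, actions, rewards, dones = [], [], [], []
    for _ in range(horizon):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)   # old Gym API: (obs, rew, done, info)
        observations.append(obs)
        actions.append(action)
        rewards.append(reward)
        dones.append(done)
        obs = env.reset() if done else next_obs        # 8-step episode? just start another
    return observations, actions, rewards, dones
```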
The EpisodicLifeEnv wrapper actually helps with learning efficiency: by treating each lost life as an episode boundary, it gives the value function more frequent terminal signals to learn from, which can accelerate early training despite the apparent contradiction of shorter episodes.
Having worked with PPO implementations for several months now, I can share some practical insights about these parameters. The key thing to understand is that timesteps_per_actorbatch controls the tradeoff between gradient stability and policy freshness. Larger values mean you collect more experience before updating, which tends to produce more stable gradient estimates, but it also means you keep acting with an unchanged policy for longer stretches between updates.
Regarding the short episode issue in Atari games, I’ve noticed that while the EpisodicLifeEnv wrapper does create brief episodes initially, this actually works in PPO’s favor during early training. The frequent episode resets help the agent explore different starting states more thoroughly rather than getting stuck in long sequences of poor actions. The algorithm simply aggregates these short episodes until it hits the timestep target, so you’re still getting meaningful batch sizes for training.
One thing that helped me understand this better was logging the actual episode lengths and batch compositions during training. You’ll see that as the agent improves, episode lengths naturally increase, but the data collection mechanism remains consistent throughout the training process.
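If you want to reproduce that kind of logging, a lightweight way is to summarize the done flags in each collected batch; this helper is a hypothetical addition, not something in the library:

```python
def summarize_batch(dones):
    """Print how many episodes ended up inside one actor batch."""
    episode_lengths, current = [], 0
    for done in dones:
        current += 1
        if done:
            episode_lengths.append(current)
            current = 0
    n = len(episode_lengths)
    mean_len = sum(episode_lengths) / n if n else float('nan')
    print(f"batch of {len(dones)} steps: {n} complete episodes, "
          f"mean length {mean_len:.1f}, {current} trailing steps from an unfinished episode")
```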
timesteps_per_actorbatch is how many environment steps you take before each policy update, while optim_batchsize is the minibatch size used for gradient descent on that data. So with 2048 timesteps collected, you could run several minibatches of 64 per optimization epoch. For short episodes, PPO just gathers data across multiple episodes until it reaches the timestep limit.
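For reference, the two knobs show up together in a call roughly like the one below (argument names recalled from the Baselines ppo1 module and its Atari example; double-check them against your checkout since they have shifted between versions, and note the example scripts also set up a TF session and seeding first, which is omitted here):

```python
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.ppo1 import pposgd_simple, cnn_policy

env = wrap_deepmind(make_atari('PongNoFrameskip-v4'))  # EpisodicLifeEnv is applied here

def policy_fn(name, ob_space, ac_space):
    return cnn_policy.CnnPolicy(name=name, ob_space=ob_space, ac_space=ac_space)

pposgd_simple.learn(
    env, policy_fn,
    max_timesteps=int(1e7),
    timesteps_per_actorbatch=2048,   # env steps gathered before each policy update
    optim_batchsize=64,              # minibatch size for the optimization phase
    optim_epochs=4,
    optim_stepsize=1e-3,
    clip_param=0.2,
    entcoeff=0.01,
    gamma=0.99,
    lam=0.95,
    schedule='linear',
)
```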
Just went through this myself recently. One thing that tripped me up was thinking optim_batchsize had to match timesteps_per_actorbatch, but that's not how it works: you can collect 2048 timesteps, then shuffle and split them into smaller 64-sample minibatches for the SGD updates (2048 / 64 = 32 minibatches per epoch). Makes the training more stable, in my opinion.