What are the key distinctions between CartPole-v0 and CartPole-v1 in OpenAI Gym?

I’m trying to understand what makes CartPole-v0 different from CartPole-v1 in OpenAI Gym but can’t find clear documentation anywhere.

From what I can tell, both versions point to the same CartPoleEnv implementation in the gym repository. When I checked the specs programmatically, I noticed they have different episode limits and reward thresholds: v0 is capped at 200 steps with a 195.0 reward threshold, while v1 runs for up to 500 steps with a 475.0 threshold.

import gym

env_v1 = gym.make("CartPole-v1")
print(env_v1.spec.max_episode_steps)  # Shows 500
print(env_v1.spec.reward_threshold)   # Shows 475.0

env_v0 = gym.make("CartPole-v0")
print(env_v0.spec.max_episode_steps)  # Shows 200
print(env_v0.spec.reward_threshold)   # Shows 195.0

Are these the only differences between the two versions? Does anyone know if there are other subtle changes in the physics or reward calculation that I might be missing?

You got the main differences right. I’ve worked with both versions a lot, and the physics and reward structure are exactly the same; the episode length and reward threshold changes you listed are the whole story.

The longer episodes in v1 do make training noticeably harder, though. Your agent has to keep the pole balanced much longer to hit the reward threshold, so you generally need a stronger policy than on v0, where simple approaches sometimes scrape by on luck. I’ve seen algorithms that barely solve v0 fail completely on v1. It’s not different physics; the extended horizon just exposes weak spots in the policy.

I checked the registrations in gym’s envs/__init__.py, and those are the only spec differences. There are no hidden changes to the CartPole class itself.
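If you’d rather verify that from code than by reading the registration file, a sketch like this works with classic Gym. gym.spec returns the registered EnvSpec; the entry_point lookup is defensive because that attribute was private in some older releases:

import gym

# Inspect the registered specs directly, without instantiating the envs.
# EnvSpec attribute names have shifted a little between gym releases,
# so entry_point is looked up defensively.
for env_id in ("CartPole-v0", "CartPole-v1"):
    spec = gym.spec(env_id)
    entry_point = getattr(spec, "entry_point", getattr(spec, "_entry_point", None))
    print(env_id)
    print("  entry_point:      ", entry_point)  # same CartPoleEnv class for both
    print("  max_episode_steps:", spec.max_episode_steps)
    print("  reward_threshold: ", spec.reward_threshold)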

You nailed it, those are all the differences between the versions. I spent ages comparing them when I migrated my RL experiments, and there’s nothing else beyond what you listed.

What’s striking is how much that change in episode length affects algorithm comparisons. Basic Q-learning that works fine on v0 gets crushed by v1 because of the longer episodes. The thresholds scale proportionally (195/200 vs 475/500), but the extra variance over longer rollouts makes convergence noticeably harder.

My suggestion: start with v0 for initial development and hyperparameter tuning, then test on v1. The longer episodes in v1 are great for checking whether your policy is actually robust, without changing the environment itself. Both versions are useful: v0 for quick prototyping, v1 for serious evaluation.
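Here’s a minimal sketch of that workflow, assuming the classic Gym step API (obs, reward, done, info) that your snippet already uses; evaluate and random_policy are just placeholders for your own training and agent code:

import gym
import numpy as np

def evaluate(env_id, policy, episodes=20, seed=0):
    # Average undiscounted return of `policy` over a few episodes (classic Gym API).
    env = gym.make(env_id)
    env.seed(seed)
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    env.close()
    return np.mean(returns)

def random_policy(obs):
    # Placeholder for illustration only; swap in your trained agent here.
    return np.random.randint(2)

for env_id in ("CartPole-v0", "CartPole-v1"):
    score = evaluate(env_id, random_policy)
    threshold = gym.spec(env_id).reward_threshold
    print(f"{env_id}: mean return {score:.1f} (solved threshold: {threshold})")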

Yeah, that’s it - v1 runs longer with higher thresholds. I’ve noticed v1 really separates good algos from bad ones though. Those extra 300 steps test whether your policy actually learned the right thing or just got lucky. Same reward per step, but v1 exposes overfitting issues that v0 completely misses.
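If you want to double-check the same-reward-per-step part yourself, a quick sanity loop like this (again assuming the classic Gym API from the question) shows both versions hand out +1 on every step:

import gym

# Confirm the per-step reward is identical in both versions.
for env_id in ("CartPole-v0", "CartPole-v1"):
    env = gym.make(env_id)
    env.reset()
    rewards = set()
    done = False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        rewards.add(reward)
    env.close()
    print(env_id, "per-step rewards observed:", rewards)  # {1.0} for both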