I’m having trouble getting my Deep Q Learning model to work properly with the CartPole environment from OpenAI Gym. Instead of getting better over time, my agent is actually performing worse as training continues.
The reward per episode keeps going down when it should be going up. I’m using experience replay and a target network like most DQN tutorials suggest. I tried changing the network architecture by adding more layers and adjusting neuron counts, but that didn’t help. I also experimented with different exploration decay schedules with no success.
I think there might be an issue with how I’m calculating the loss, but I’m not sure what’s wrong. Here’s the relevant code:
Your Q calculation looks wrong. You're masking the main network's output and then taking the max, but that's not how DQN training works. During training you need the Q-values for the specific actions you actually took (stored in your replay buffer), not the max Q-values. I had the same CartPole degradation issue when my epsilon-greedy exploration wasn't working right - double-check that you're actually taking random actions during exploration, not just following your current policy. Also watch out for reward clipping or normalization: CartPole gives +1 per timestep, so any unnecessary preprocessing will mess up that simple learning signal.
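Roughly what I mean by selecting the taken action, as a sketch (q_network, state_batch, action_batch, and num_actions are placeholder names, adapt them to your code):

    import tensorflow as tf

    # Q-values for every action in the sampled states: shape (batch_size, num_actions)
    q_values = q_network(state_batch)

    # Pick out the Q-value of the action that was actually taken in each transition,
    # instead of reducing with max over all actions.
    action_mask = tf.one_hot(action_batch, depth=num_actions)
    q_taken = tf.reduce_sum(q_values * action_mask, axis=1)  # shape (batch_size,)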
CartPole's tricky because of its terminal states - they need special handling in your Q-learning updates. When the pole falls or the cart goes out of bounds there's no meaningful next state, so don't include a discounted future-reward term for those transitions. Make sure you set the target to just the immediate reward when done=True in your experience tuples. I had the same degradation problem - my agent thought terminal states had high future value when it should have been zero. Also check whether your exploration decay is too aggressive: CartPole needs plenty of random exploration, since the state space is continuous and tiny changes in angle or velocity make a big difference.
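One common way to write that target, sketched with assumed names (reward_batch, done_batch, next_state_batch, target_network, gamma are placeholders for whatever you call them):

    # Max next-state Q-value from the target network
    next_q = tf.reduce_max(target_network(next_state_batch), axis=1)

    # Zero out the bootstrap term for terminal transitions so the target
    # collapses to just the immediate reward when done=True.
    not_done = 1.0 - tf.cast(done_batch, tf.float32)
    target_value = reward_batch + gamma * next_q * not_done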
Your learning rate's probably too high - that causes instability. Drop it to 0.001 or 0.0005. Also make sure you're not training on the same batch twice by accident; I did that once and it completely wrecked my CartPole agent.
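If it helps, that's roughly this with a Keras optimizer (assuming you're using Adam - swap in whatever optimizer you actually have):

    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # or 5e-4 if it's still unstable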
I see the problem with your loss calculation. You've got target_value = next_q + self.reward_batch where it should be target_value = self.reward_batch + next_q. That's not the main issue, though - using tf.reduce_max on the current Q-values with the action mask is what's really breaking things. You need to grab the Q-value of the actual action stored in your experience replay buffer, not the max. Taking the max over the target network's Q-values is fine, since that's your estimate of the best future value. Also check your experience replay buffer: if you're sampling the same experiences over and over or have indexing problems, that will tank your performance.
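Putting those pieces together, here's a hedged sketch of a full training step - the variable names (q_network, target_network, the batch tensors, gamma, optimizer) are assumptions, not your actual code:

    with tf.GradientTape() as tape:
        # Q-values of the actions actually taken, from the main network
        q_values = q_network(state_batch)
        action_mask = tf.one_hot(action_batch, depth=num_actions)
        q_taken = tf.reduce_sum(q_values * action_mask, axis=1)

        # Bellman target from the target network; only q_network's variables
        # are differentiated below, so no gradient flows into the target net.
        next_q = tf.reduce_max(target_network(next_state_batch), axis=1)
        not_done = 1.0 - tf.cast(done_batch, tf.float32)
        target_value = reward_batch + gamma * next_q * not_done

        loss = tf.keras.losses.Huber()(target_value, q_taken)

    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))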
Your target network update frequency might be the issue. If you update it too often, the target values become unstable, which makes it hard for your agent to learn. Also consider checking the size of your replay buffer; a small one can lead to overfitting on recent bad experiences.
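The usual hard-update schedule looks something like this (step and target_update_every are placeholder names for your own step counter and sync interval):

    # Copy main-network weights into the target network every N gradient steps,
    # not every step, so the targets stay fixed between syncs.
    if step % target_update_every == 0:
        target_network.set_weights(q_network.get_weights())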