I’m working on converting a TensorFlow Q-learning implementation to Keras for the FrozenLake environment, but my results are much worse than expected.
Here’s my current implementation:
import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
env = gym.make('FrozenLake-v0')
network = Sequential()
network.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
network.add(Dense(4, activation='linear', kernel_initializer='uniform'))
def q_loss(target, prediction):
    return K.mean(K.square(target - prediction))
network.compile(loss=q_loss, optimizer='adam')
# Training parameters
gamma = 0.95
epsilon = 0.1
episode_rewards = []
step_counts = []
total_episodes = 2000
for episode in range(total_episodes):
    state = env.reset()
    total_reward = 0
    done = False
    steps = 0
    while steps < 99:
        steps += 1
        # one-hot encode the current state and pick the greedy action
        q_values = network.predict(np.identity(16)[state:state+1], batch_size=1)
        next_action = np.argmax(q_values[0])
        if np.random.random() < epsilon:
            next_action = env.action_space.sample()
        next_state, reward, done, _ = env.step(next_action)
        total_reward += reward
        # bootstrap the target from the next state's Q-values
        future_q_values = network.predict(np.identity(16)[next_state:next_state+1], batch_size=1)
        max_future_q = np.max(future_q_values)
        target_q_values = q_values
        target_q_values[0, next_action] = reward + gamma * max_future_q
        network.fit(np.identity(16)[state:state+1], target_q_values, verbose=0, batch_size=1)
        state = next_state
        if done:
            epsilon = 1.0 / ((episode / 50) + 10)
            break
    step_counts.append(steps)
    episode_rewards.append(total_reward)
print(f"Success rate: {sum(episode_rewards)/total_episodes * 100}%")
My Keras version only achieves about 0.05% success rate, which is significantly lower than what I expected. The learning curve shows very poor performance compared to other implementations I’ve seen. What might be causing this low performance in my Keras Q-learning setup?
There’s a subtle but critical bug in your target calculation that’s sabotaging the entire learning process. You’re computing the future Q-values for the next state, but then using the maximum future Q-value even when the episode is done. When done=True, there’s no future state, so the target should just be the immediate reward without the discounted future value. Add a condition like max_future_q = 0 if done else np.max(future_q_values) before calculating your target. This was killing my performance when I worked on a similar implementation last year. Also, your network architecture might be too simple for the state representation; I found that adding a second hidden layer with 32 units helped significantly with FrozenLake. The environment is deceptively tricky because of the stochastic nature of the ice, so proper terminal state handling is crucial for convergence.
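For reference, here’s roughly how that condition slots into your inner loop, reusing the variable names from the question (a sketch, not a drop-in patch):

    future_q_values = network.predict(np.identity(16)[next_state:next_state+1], batch_size=1)
    # no future state to bootstrap from once the episode has ended
    max_future_q = 0.0 if done else np.max(future_q_values)
    target_q_values[0, next_action] = reward + gamma * max_future_q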
The main issue I see is that you’re modifying the predicted Q-values directly instead of creating a proper target. When you do target_q_values = q_values, you’re creating a reference to the same array rather than an independent copy, so writing the target also mutates your predicted Q-values. This creates instability in training. I had similar problems when I first implemented Q-learning with Keras. Try creating a copy of the Q-values first: target_q_values = q_values.copy() before modifying the specific action value. Also, your epsilon decay seems too aggressive: starting at 0.1 and decaying so quickly means your agent barely explores the environment. Another thing to consider is that FrozenLake has sparse rewards, making it notoriously difficult to learn. You might want to increase your exploration initially (start epsilon at 1.0) and decay it more gradually. The current setup doesn’t give your agent enough chance to discover the successful path through random exploration.
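In context, the change is a single line (sketch based on the loop in the question):

    q_values = network.predict(np.identity(16)[state:state+1], batch_size=1)
    # build the target on an independent copy so the prediction array is left untouched
    target_q_values = q_values.copy()
    target_q_values[0, next_action] = reward + gamma * max_future_q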
looks like your epsilon decay is way off - you start at 0.1 which is already pretty low for exploration, then decay it even more aggressively. frozenlake needs lots of exploration since rewards are so sparse. try starting epsilon at 1.0 and decaying it more slowly, something like epsilon = max(0.01, epsilon * 0.995) each episode instead
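something like this (just a sketch, the exact numbers are a starting point):

    epsilon = 1.0            # explore heavily at first
    min_epsilon = 0.01

    for episode in range(total_episodes):
        # ... run the episode exactly as before ...
        epsilon = max(min_epsilon, epsilon * 0.995)   # decay once per episode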
Your training loop has a fundamental flaw that’s preventing proper learning convergence. You’re calling network.predict() twice per step and immediately training on single samples, which creates massive instability in the Q-value estimates. This approach leads to catastrophic forgetting, where each update overwrites previous learning. I encountered this exact issue when implementing DQN variants - the solution is experience replay with a memory buffer. Store your (state, action, reward, next_state, done) tuples and train on random batches instead of doing an immediate update every step.

Additionally, your reward signal is too sparse for this kind of per-step update: FrozenLake only gives reward at the goal, so consider reward shaping, such as a small negative reward per step to encourage shorter paths. The current setup essentially has your agent learning from noise rather than meaningful patterns. Without experience replay you’re doing online updates on highly correlated single samples, which is a very unstable approximation that rarely converges well even on small discrete environments.
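To make that concrete, here’s a minimal replay-buffer sketch built around your existing network and gamma (and numpy imported as np, as in your script). The buffer size, batch size, and the train_from_replay helper are my own choices for illustration:

    from collections import deque
    import random

    replay_buffer = deque(maxlen=2000)   # stores (state, action, reward, next_state, done)
    batch_size = 32

    def one_hot(s):
        return np.identity(16)[s:s+1]

    def train_from_replay():
        if len(replay_buffer) < batch_size:
            return
        batch = random.sample(replay_buffer, batch_size)
        states = np.vstack([one_hot(s) for s, _, _, _, _ in batch])
        targets = network.predict(states, batch_size=batch_size)
        for i, (s, a, r, s2, d) in enumerate(batch):
            # terminal transitions get no bootstrapped future value
            max_future_q = 0.0 if d else np.max(network.predict(one_hot(s2), batch_size=1))
            targets[i, a] = r + gamma * max_future_q
        network.fit(states, targets, verbose=0, batch_size=batch_size)

    # inside the step loop, replace the immediate fit with:
    # replay_buffer.append((state, next_action, reward, next_state, done))
    # train_from_replay()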
twice per step and immediately training on single samples, which creates massive instability in the Q-value estimates. This approach leads to catastrophic forgetting where each update overwrites previous learning. I encountered this exact issue when implementing DQN variants - the solution is implementing experience replay with a memory buffer. Store your (state, action, reward, next_state, done) tuples and train on random batches instead of immediate updates. Additionally, your reward signal is too sparse for the current learning rate. FrozenLake only gives reward at the goal, so consider reward shaping like giving small negative rewards for each step to encourage shorter paths. The current setup essentially has your agent learning from noise rather than meaningful patterns. Without experience replay, you’re not doing proper Q-learning but rather a very unstable approximation that rarely converges on discrete environments.