Creating a reinforcement learning agent for the blackjack environment using OpenAI Gym

I’m working on building a reinforcement learning bot for the blackjack game using the OpenAI Gym framework. Since I’m still learning Python and the Gym library, I’m having trouble getting my Q-learning implementation to work properly.

My main issues are:

  • Can’t figure out how to get the observation space size (env.observation_space.n throws an error, while env.action_space.n returns 2)
  • My code is mostly copied from other Gym examples like CartPole
  • Need help finishing this basic implementation so I can later upgrade it to a Deep Q-Network

Here’s what I have so far:

import gym
import numpy as np
import matplotlib.pyplot as plt

blackjack_env = gym.make('Blackjack-v0')

# Initialize Q-table
q_table = np.zeros([500, blackjack_env.action_space.n])

episode_count = 8000
discount_factor = 0.95
reward_history = []

for episode in range(episode_count):
    current_state = blackjack_env.reset()
    total_reward = 0
    finished = False
    
    while not finished:
        # Choose action with exploration
        chosen_action = np.argmax(q_table[current_state, :] + 
                                np.random.randn(1, blackjack_env.action_space.n) / (episode + 1))
        
        next_state, reward_value, finished, info = blackjack_env.step(chosen_action)
        
        # Update Q-table
        q_table[current_state, chosen_action] = reward_value + discount_factor * np.max(q_table[next_state, :])
        
        total_reward += reward_value
        current_state = next_state
    
    reward_history.append(total_reward)

print("Final Q-table:", q_table)

I want to see the reward_history improving over time to confirm my algorithm is learning. Also looking for tips on using Gym effectively.

The blackjack observation space is a tuple, not discrete, that's why .n fails. Try converting the state to an index, e.g. state_idx = state[0] + (state[1]-1)*32 + state[2]*320, or use a dict for the q-table instead. Also your q-update is wrong - it's missing the current q-value in the bellman equation.
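Rough sketch of the dict version (untested, the names here are just placeholders):

import gym
import numpy as np
from collections import defaultdict

env = gym.make('Blackjack-v0')

# key the q-table by the raw (player_sum, dealer_showing, usable_ace) tuple,
# so no index math is needed and unreachable states take no space
q_table = defaultdict(lambda: np.zeros(env.action_space.n))

state = env.reset()            # e.g. (14, 10, False) on older gym versions
q_values = q_table[state]      # length-2 array, one value per action
best_action = int(np.argmax(q_values))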

The main problem is that Blackjack returns a 3-tuple observation (player_sum, dealer_showing, usable_ace) but you’re treating it like a single integer for array indexing. This causes issues when you try to use current_state as an index into your q_table array. I ran into this exact same issue when I started with gym environments. You need to convert the tuple to a single index. Something like state_index = state[0] + state[1]*32 + state[2]*32*11 works because the player-sum observation is Discrete(32) (it can go above 21 when you bust), the dealer card is 1-10, and the usable-ace flag is boolean. That gives at most 32*11*2 = 704 distinct indices, so your Q-table size of 500 is definitely not enough - I’d go with 704 rows, or round up to 1000 to be safe.

Another thing - your exploration method is overly complex. Simple epsilon-greedy works much better for blackjack. Start with epsilon=0.9 and decay it slowly. The random noise approach you’re using can actually hurt convergence in discrete action spaces like this.
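Here’s a rough sketch of what I mean for the encoding plus epsilon-greedy selection (the helper names and the exact epsilon value are just illustrative, not anything from your code):

import gym
import numpy as np

env = gym.make('Blackjack-v0')

def state_to_index(state):
    # player-sum observation is Discrete(32), dealer card is 1-10, usable_ace is a bool
    player_sum, dealer_card, usable_ace = state
    return player_sum + dealer_card * 32 + int(usable_ace) * 32 * 11

q_table = np.zeros([32 * 11 * 2, env.action_space.n])

epsilon = 0.9  # decay this toward ~0.05 over the course of training

def choose_action(state_idx):
    # epsilon-greedy: random action with probability epsilon, otherwise greedy
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[state_idx]))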

Your Q-learning update formula is incomplete. You need to incorporate the current Q-value into the Bellman equation. The correct update should be q_table[current_state, chosen_action] = q_table[current_state, chosen_action] + learning_rate * (reward_value + discount_factor * np.max(q_table[next_state, :]) - q_table[current_state, chosen_action]). Add a learning rate parameter around 0.1 to 0.3.

The observation space issue stems from blackjack returning a tuple (player_sum, dealer_card, usable_ace) rather than a single integer. I recommend using a dictionary for your Q-table with the tuple as a key, or creating a proper state encoding function.

Your exploration strategy also needs work - consider epsilon-greedy instead of adding random noise. Start with epsilon=1.0 and decay it gradually.
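Putting that together, the corrected loop could look roughly like this - note that the learning_rate value, the epsilon schedule, and the state_to_index helper are illustrative choices of mine, not something prescribed by Gym:

import gym
import numpy as np

env = gym.make('Blackjack-v0')

def state_to_index(state):
    # player-sum observation is Discrete(32), dealer card 1-10, usable ace 0/1
    player_sum, dealer_card, usable_ace = state
    return player_sum + dealer_card * 32 + int(usable_ace) * 32 * 11

q_table = np.zeros([32 * 11 * 2, env.action_space.n])

episode_count = 50000       # blackjack rewards are noisy, more than 8000 episodes helps
learning_rate = 0.1         # assumed value, tune somewhere around 0.1-0.3
discount_factor = 0.95
epsilon = 1.0
reward_history = []

for episode in range(episode_count):
    current_state = state_to_index(env.reset())
    total_reward = 0
    finished = False

    while not finished:
        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            chosen_action = env.action_space.sample()
        else:
            chosen_action = int(np.argmax(q_table[current_state]))

        next_obs, reward_value, finished, info = env.step(chosen_action)
        next_state = state_to_index(next_obs)

        # Q-learning update: nudge the old estimate toward the TD target,
        # and don't bootstrap from a terminal state
        td_target = reward_value
        if not finished:
            td_target += discount_factor * np.max(q_table[next_state])
        q_table[current_state, chosen_action] += learning_rate * (td_target - q_table[current_state, chosen_action])

        total_reward += reward_value
        current_state = next_state

    epsilon = max(0.05, epsilon * 0.9999)   # gradual decay, exact schedule is illustrative
    reward_history.append(total_reward)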

Gym version matters here - newer versions return a different format from reset(). Try current_state = blackjack_env.reset()[0] if you're getting weird errors. Also that 500-row q-table is way too small for the blackjack state space, you'll get index errors eventually.
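Something like this should cover both APIs (just a sketch, double-check against the version you actually have installed):

import gym

# on older gym the env id is 'Blackjack-v0'; newer releases renamed it to 'Blackjack-v1'
env = gym.make('Blackjack-v0')

reset_result = env.reset()
# newer gym/gymnasium returns (observation, info) from reset(),
# older versions return just the 3-tuple observation
# (newer versions also make step() return 5 values: obs, reward, terminated, truncated, info)
if len(reset_result) == 2 and isinstance(reset_result[1], dict):
    current_state = reset_result[0]
else:
    current_state = reset_result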