Creating a reinforcement learning agent for the blackjack environment using OpenAI Gym

I’m working on building a reinforcement learning bot for the blackjack game using the OpenAI Gym framework. Since I’m still learning Python and the Gym library, I’m having trouble getting my Q-learning implementation to work properly.

My main issues are:

  • Can’t figure out how to get the observation space size (env.observation_space.n throws an error, while env.action_space.n returns 2)
  • My code is mostly copied from other Gym examples like CartPole
  • Need help finishing this basic implementation so I can later upgrade it to a Deep Q-Network

Here’s what I have so far:

import gym
import numpy as np
import matplotlib.pyplot as plt

blackjack_env = gym.make('Blackjack-v0')

# Initialize Q-table
q_table = np.zeros([500, blackjack_env.action_space.n])

episode_count = 8000
discount_factor = 0.95
reward_history = []

for episode in range(episode_count):
    current_state = blackjack_env.reset()
    total_reward = 0
    finished = False
    
    while not finished:
        # Choose action with exploration
        chosen_action = np.argmax(q_table[current_state, :] + 
                                np.random.randn(1, blackjack_env.action_space.n) / (episode + 1))
        
        next_state, reward_value, finished, info = blackjack_env.step(chosen_action)
        
        # Update Q-table
        q_table[current_state, chosen_action] = reward_value + discount_factor * np.max(q_table[next_state, :])
        
        total_reward += reward_value
        current_state = next_state
    
    reward_history.append(total_reward)

print("Final Q-table:", q_table)

I want to see the reward_history improving over time to confirm my algorithm is learning. Also looking for tips on using Gym effectively.

The blackjack observation space is a tuple, not discrete, that's why .n fails. Try converting the state to an index, e.g. state_idx = state[0] + (state[1]-1)*32 + state[2]*320, or use a dict for the q-table instead. Also your q-update is wrong - it's missing the current q-value in the bellman equation.
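Rough sketch of the dict version (untested, the names here are just placeholders):

import gym
import numpy as np
from collections import defaultdict

env = gym.make('Blackjack-v0')

# key the q-table by the raw (player_sum, dealer_showing, usable_ace) tuple,
# so no index math is needed and unreachable states take no space
q_table = defaultdict(lambda: np.zeros(env.action_space.n))

state = env.reset()            # e.g. (14, 10, False) on older gym versions
q_values = q_table[state]      # length-2 array, one value per action
best_action = int(np.argmax(q_values))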

The main problem is that Blackjack returns a 3-tuple observation (player_sum, dealer_showing, usable_ace) but you’re treating it like a single integer for array indexing. This causes issues when you try to use current_state as an index into your q_table array. I ran into this exact same issue when I started with gym environments. You need to convert the tuple to a single index. Something like state_index = state[0] + state[1]*32 + state[2]*32*11 works because the player-sum observation is Discrete(32) (it can go above 21 when you bust), the dealer card is 1-10, and the usable-ace flag is boolean. That gives at most 32*11*2 = 704 distinct indices, so your Q-table size of 500 is definitely not enough - I’d go with 704 rows, or round up to 1000 to be safe.

Another thing - your exploration method is overly complex. Simple epsilon-greedy works much better for blackjack. Start with epsilon=0.9 and decay it slowly. The random noise approach you’re using can actually hurt convergence in discrete action spaces like this.
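Here’s a rough sketch of what I mean for the encoding plus epsilon-greedy selection (the helper names and the exact epsilon value are just illustrative, not anything from your code):

import gym
import numpy as np

env = gym.make('Blackjack-v0')

def state_to_index(state):
    # player-sum observation is Discrete(32), dealer card is 1-10, usable_ace is a bool
    player_sum, dealer_card, usable_ace = state
    return player_sum + dealer_card * 32 + int(usable_ace) * 32 * 11

q_table = np.zeros([32 * 11 * 2, env.action_space.n])

epsilon = 0.9  # decay this toward ~0.05 over the course of training

def choose_action(state_idx):
    # epsilon-greedy: random action with probability epsilon, otherwise greedy
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[state_idx]))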

Your Q-learning update formula is incomplete. You need to incorporate the current Q-value into the Bellman equation. The correct update should be q_table[current_state, chosen_action] = q_table[current_state, chosen_action] + learning_rate * (reward_value + discount_factor * np.max(q_table[next_state, :]) - q_table[current_state, chosen_action]). Add a learning rate parameter around 0.1 to 0.3.

The observation space issue stems from blackjack returning a tuple (player_sum, dealer_card, usable_ace) rather than a single integer. I recommend using a dictionary for your Q-table with the tuple as a key, or creating a proper state encoding function.

Your exploration strategy also needs work - consider epsilon-greedy instead of adding random noise. Start with epsilon=1.0 and decay it gradually.
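Putting that together, the corrected loop could look roughly like this - note that the learning_rate value, the epsilon schedule, and the state_to_index helper are illustrative choices of mine, not something prescribed by Gym:

import gym
import numpy as np

env = gym.make('Blackjack-v0')

def state_to_index(state):
    # player-sum observation is Discrete(32), dealer card 1-10, usable ace 0/1
    player_sum, dealer_card, usable_ace = state
    return player_sum + dealer_card * 32 + int(usable_ace) * 32 * 11

q_table = np.zeros([32 * 11 * 2, env.action_space.n])

episode_count = 50000       # blackjack rewards are noisy, more than 8000 episodes helps
learning_rate = 0.1         # assumed value, tune somewhere around 0.1-0.3
discount_factor = 0.95
epsilon = 1.0
reward_history = []

for episode in range(episode_count):
    current_state = state_to_index(env.reset())
    total_reward = 0
    finished = False

    while not finished:
        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            chosen_action = env.action_space.sample()
        else:
            chosen_action = int(np.argmax(q_table[current_state]))

        next_obs, reward_value, finished, info = env.step(chosen_action)
        next_state = state_to_index(next_obs)

        # Q-learning update: nudge the old estimate toward the TD target,
        # and don't bootstrap from a terminal state
        td_target = reward_value
        if not finished:
            td_target += discount_factor * np.max(q_table[next_state])
        q_table[current_state, chosen_action] += learning_rate * (td_target - q_table[current_state, chosen_action])

        total_reward += reward_value
        current_state = next_state

    epsilon = max(0.05, epsilon * 0.9999)   # gradual decay, exact schedule is illustrative
    reward_history.append(total_reward)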

Gym version matters here - newer versions return a different format from reset(). Try current_state = blackjack_env.reset()[0] if you're getting weird errors. Also that 500-row q-table is way too small for the blackjack state space, you'll get index errors eventually.
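Something like this should cover both APIs (just a sketch, double-check against the version you actually have installed):

import gym

# on older gym the env id is 'Blackjack-v0'; newer releases renamed it to 'Blackjack-v1'
env = gym.make('Blackjack-v0')

reset_result = env.reset()
# newer gym/gymnasium returns (observation, info) from reset(),
# older versions return just the 3-tuple observation
# (newer versions also make step() return 5 values: obs, reward, terminated, truncated, info)
if len(reset_result) == 2 and isinstance(reset_result[1], dict):
    current_state = reset_result[0]
else:
    current_state = reset_result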