I’m working on building a reinforcement learning bot for the Blackjack game using the OpenAI Gym framework. Since I’m still learning Python and the Gym library, I’m having trouble getting my Q-learning implementation to work properly.
My main issues are:
Can’t figure out how to get the observation space size (env.observation_space.n throws an error, while env.action_space.n returns 2)
My code is mostly copied from other Gym examples like CartPole
Need help finishing this basic implementation so I can later upgrade it to a Deep Q-Network
The Blackjack observation space is a Tuple, not Discrete, that's why .n fails. Try converting the state to an index, e.g. state_idx = state[0] + (state[1]-1)*32 + state[2]*320 (the multipliers need to be at least 32 so different states can't collide), or use a dict keyed on the observation tuple for the q-table instead. Also your q-update is wrong - it's missing the current q-value in the Bellman equation.
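A minimal sketch of the dict idea (names are just illustrative, not your exact variables):

```python
from collections import defaultdict
import numpy as np

# Q-table keyed directly on the (player_sum, dealer_card, usable_ace) tuple,
# so no manual index arithmetic is needed; two values per state (stick, hit).
q_table = defaultdict(lambda: np.zeros(2))

state = (14, 10, False)                         # example Blackjack observation
greedy_action = int(np.argmax(q_table[state]))  # unseen states default to zeros
```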
The main problem is that Blackjack returns a 3-tuple observation (player_sum, dealer_showing, usable_ace), but you're treating it like a single integer for array indexing. This causes issues when you try to use current_state as an index into your q_table array. I ran into this exact same issue when I started with gym environments. You need to convert the tuple to a single index. Something like state_index = state[0] + state[1]*32 + state[2]*32*11 works, since Gym reserves 32 values for the player sum (it only actually reaches 4-21), 11 for the dealer card (1-10), and the ace flag is boolean.

Also, your Q-table size of 500 isn't enough - with that encoding the largest index is 31 + 10*32 + 1*32*11 = 703, so the table needs at least 32*11*2 = 704 rows.

Another thing - your exploration method is overly complex. Simple epsilon-greedy works much better for blackjack. Start with epsilon=0.9 and decay it slowly. The random noise approach you're using can actually hurt convergence in discrete action spaces like this.
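Roughly what that encoding plus epsilon-greedy could look like, as a sketch (state_index and choose_action are hypothetical helpers, not from your code):

```python
import numpy as np

N_STATES = 32 * 11 * 2   # player_sum (0-31) x dealer_card (0-10) x usable_ace (0/1)
N_ACTIONS = 2            # 0 = stick, 1 = hit

def state_index(state):
    """Flatten the (player_sum, dealer_card, usable_ace) tuple into one integer."""
    player_sum, dealer_card, usable_ace = state
    return player_sum + dealer_card * 32 + int(usable_ace) * 32 * 11

q_table = np.zeros((N_STATES, N_ACTIONS))

def choose_action(state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(q_table[state_index(state)]))
```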
Your Q-learning update formula is incomplete. You need to incorporate the current Q-value into the Bellman equation. The correct update is:

q_table[current_state, chosen_action] = q_table[current_state, chosen_action] + learning_rate * (reward_value + discount_factor * np.max(q_table[next_state, :]) - q_table[current_state, chosen_action])

Add a learning rate parameter around 0.1 to 0.3.

The observation space issue stems from Blackjack returning a tuple (player_sum, dealer_card, usable_ace) rather than a single integer. I recommend using a dictionary for your Q-table with the tuple as the key, or creating a proper state encoding function.

Your exploration strategy also needs work - consider epsilon-greedy instead of adding random noise. Start with epsilon=1.0 and decay it gradually.
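For context, here's a rough end-to-end sketch using the dict-keyed Q-table and that update. It assumes the newer gym API (>= 0.26, where reset() returns (obs, info) and step() returns five values); the hyperparameters are placeholders to tune, not recommendations:

```python
from collections import defaultdict
import numpy as np
import gym

env = gym.make("Blackjack-v1")
q_table = defaultdict(lambda: np.zeros(env.action_space.n))

learning_rate = 0.1
discount_factor = 1.0   # episodes are short, so no discounting is common here
epsilon = 1.0

for episode in range(50_000):
    state, _ = env.reset()              # newer gym: reset() -> (obs, info)
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: nudge the old estimate toward reward + discounted best next value
        best_next = np.max(q_table[next_state])
        q_table[state][action] += learning_rate * (
            reward + discount_factor * best_next - q_table[state][action]
        )
        state = next_state

    epsilon = max(0.05, epsilon * 0.9999)   # slow decay toward mostly-greedy play
```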
Gym version matters here - newer versions return a different format from reset() (an (obs, info) tuple). Try current_state = blackjack_env.reset()[0] if you're getting weird errors. Also that 500-entry q-table is way too small for the blackjack state space, you'll get index errors eventually.
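If you're not sure which version you're on, a small guard like this (sketch only) handles both reset() formats:

```python
reset_result = blackjack_env.reset()
# gym >= 0.26 returns (observation, info_dict); older versions return just the observation tuple
if isinstance(reset_result, tuple) and len(reset_result) == 2 and isinstance(reset_result[1], dict):
    current_state, _info = reset_result
else:
    current_state = reset_result
```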