Building a TensorFlow Deep Learning Model for Gym Environment Data

I’m working on creating a deep learning model using TensorFlow that learns from collected game data, but I keep getting an error that I can’t figure out. Here’s what I’m trying to do:

import gym
import tensorflow as tf
import numpy as np

game_env = 'CartPole-v0'
environment = gym.make(game_env)
input_data = []
target_data = []
best_score = 1

for episode in range(100):
    state = environment.reset()
    for step in range(100):
        environment.render()
        move = environment.action_space.sample()
        state, score, finished, details = environment.step(move)
        if score >= best_score:
            input_data.append(state)
            target_data.append(move)
            best_score = score
        if finished:
            break

layer1_size = 50
layer2_size = 50
input_size = 4
output_size = 1

input_placeholder = tf.placeholder("float")
output_placeholder = tf.placeholder("float")

def create_network(input_data):
    network_weights = {
        'layer1': tf.Variable(tf.random_normal([input_size, layer1_size])),
        'layer2': tf.Variable(tf.random_normal([layer1_size, layer2_size])),
        'output': tf.Variable(tf.random_normal([layer2_size, output_size]))
    }
    network_biases = {
        'bias1': tf.Variable(tf.random_normal([layer1_size])),
        'bias2': tf.Variable(tf.random_normal([layer2_size])),
        'output': tf.Variable(tf.random_normal([output_size]))
    }
    
    first_layer = tf.add(tf.matmul(input_data, network_weights['layer1']), network_biases['bias1'])
    first_layer = tf.nn.relu(first_layer)
    
    second_layer = tf.add(tf.matmul(first_layer, network_weights['layer2']), network_biases['bias2'])
    second_layer = tf.nn.relu(second_layer)
    
    final_output = tf.matmul(second_layer, network_weights['output']) + network_biases['output']
    return final_output

model_output = create_network(input_placeholder)
cost_function = tf.reduce_sum(tf.square(model_output - output_placeholder))
training_step = tf.train.GradientDescentOptimizer(0.001).minimize(cost_function)
initializer = tf.global_variables_initializer()
session = tf.Session()
session.run(initializer)

for iteration in range(100):
    session.run(training_step, {input_placeholder: input_data, output_placeholder: target_data})

with tf.Session() as test_session:
    test_session.run(initializer)
    test_env = gym.make(game_env)
    current_state = test_env.reset
    for time_step in range(1000):
        test_env.render()
        predicted_action = np.around(create_network(current_state))
        current_state, reward, done, info = test_env.step(predicted_action)
        if done:
            break

The error I’m getting is:

TypeError: Expected binary or unicode string, got <bound method Env.reset of >

I think there might be something wrong with how I’m handling the environment state or passing data to the network. Can someone help me understand what’s causing this issue? Thanks for any help!

I ran into similar issues when I started working with gym environments and TensorFlow. The main problem I can spot is in your testing section - you have current_state = test_env.reset but it should be current_state = test_env.reset() with parentheses to actually call the method. Without the parentheses, you’re assigning the method object itself rather than executing it.

Another issue is that you’re creating a new session for testing but then calling create_network() outside of that session context. The network operations need to be run within the session that has the initialized variables. You should use session.run(model_output, {input_placeholder: current_state}) instead of calling the function directly.

Also, your data collection logic might not work as expected since you’re only keeping states where the score improves, but CartPole gives +1 reward for every step until failure. You probably want to collect data from successful episodes rather than individual high-scoring steps.
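For instance, here's a rough, environment-free sketch of that idea. The `filter_episodes` helper and the 30-reward threshold are made up for illustration, not part of your code or the gym API:

```python
import numpy as np

# Hypothetical helper: keep only episodes whose total reward clears a threshold.
# The threshold (30) is arbitrary here; tune it based on your random-play runs.
def filter_episodes(episodes, min_total_reward=30):
    """episodes: list of (states, actions, rewards) tuples, one per episode."""
    training_states, training_actions = [], []
    for states, actions, rewards in episodes:
        if sum(rewards) >= min_total_reward:
            training_states.extend(states)
            training_actions.extend(actions)
    return np.array(training_states), np.array(training_actions)
```

Weak episodes then contribute nothing to the training set, instead of every step's +1 reward polluting it.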

The core issue is how you're mixing TensorFlow 1.x graph construction with session management. Beyond the missing parentheses in test_env.reset(), you're facing a fundamental problem with how you execute the model during testing: calling create_network(current_state) directly in the test loop creates brand-new graph operations (with their own uninitialized variables) rather than running inference on the network you trained.

I encountered this exact pattern when migrating older RL code. You need to run inference through your existing session using session.run(model_output, feed_dict={input_placeholder: current_state.reshape(1, -1)}). The reshape is crucial because your network expects a batch dimension. Also note that your test block opens a second session and re-runs the initializer, which would discard the trained weights even if everything else worked.
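The batch-dimension point can be checked with numpy alone, no session required:

```python
import numpy as np

state = np.array([0.02, -0.01, 0.03, 0.04])  # a raw CartPole observation, shape (4,)
batched = state.reshape(1, -1)               # shape (1, 4): one sample, four features
# batched is what you feed as {input_placeholder: batched}, since tf.matmul
# against a [4, 50] weight matrix needs a 2-D [batch, features] input.
```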

Additionally, your training data collection is problematic - you’re treating CartPole rewards incorrectly. The environment gives +1 for each step, so your condition score >= best_score will trigger constantly. Consider collecting complete episode trajectories instead of individual steps, then filter based on total episode performance.

Looks like you've got some tensor shape issues too. Your output_size is 1, but CartPole needs discrete actions (0 or 1), so the rounded output has to end up as a valid integer action. The placeholders also need explicit shapes, something like tf.placeholder(tf.float32, [None, 4]) for the input and tf.placeholder(tf.float32, [None, 1]) for the target. The error itself happens because you're mixing up a bound method with an actual state value when testing.
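Since np.around on a raw network output can land outside {0, 1}, one way to guarantee a valid action before calling env.step is to round and clip. This is my own suggestion, not anything from the gym API:

```python
import numpy as np

def to_discrete_action(raw_output):
    """Map a raw scalar network output to a CartPole action (0 or 1).

    np.round handles outputs near 0 or 1; np.clip guards against
    values that round to something outside the valid action set.
    """
    return int(np.clip(np.round(raw_output), 0, 1))
```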