I’m having trouble getting the OpenAI universe starter agent to work correctly. The training process seems stuck and not making any progress.
I set up the environment on an AWS m4.16xlarge instance with 32 workers, but after running it for more than 30 minutes, the agent shows no improvement at all. According to the documentation, it should be able to solve the environment in about 10 minutes.
I’m monitoring everything through TensorBoard, and the results are really disappointing. The original example uses 16 workers and reaches an episode reward of 21 in half an hour. Even though I doubled the workers and gave it the same time, there’s zero improvement in the reward.
I checked the logs and there don’t seem to be any errors or stack traces. Here’s the command I’m using:
python train.py --num-workers 32 --env-id PongDeterministic-v3 --log-dir /tmp/pong_results
One thing that caught my attention is this error message that keeps appearing: “failed to connect to server”. It doesn’t stop the execution, but it makes me wonder if it’s related to the problem.
Has anyone else experienced similar issues with the universe starter agent? Any ideas on what might be causing this?
That connection error is probably causing your training problems. OpenAI Universe uses VNC connections to talk to game environments, and when they fail, your agents aren’t getting proper observations or sending actions. So your training loop keeps running but the agent is basically training on garbage data or nothing at all.
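If you want to confirm that independently of the starter agent, you can drive a single Universe remote by hand and watch whether observations ever arrive. A minimal sketch, assuming gym, universe, and Docker are installed locally (gym-core.PongDeterministic-v3 is the id Universe uses to expose the Atari env over VNC; adjust if your setup differs):

import gym
import universe  # importing this registers the Universe environments

env = gym.make('gym-core.PongDeterministic-v3')
env.configure(remotes=1)  # starts one local Docker container with a VNC server
observation_n = env.reset()
for step in range(200):
    # Empty event lists are valid no-op actions; we only care about the connection.
    observation_n, reward_n, done_n, info = env.step([[] for _ in observation_n])
    print(step, 'ok' if observation_n[0] is not None else 'observation still None')

If the observation stays None for hundreds of steps, the VNC side never came up and the starter agent has nothing to learn from.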
I hit something similar with Universe environments last year. The ‘failed to connect to server’ message usually means the VNC server instances aren’t starting properly or are crashing. With 32 workers you’re putting heavy load on the machine; each VNC server eats a fair amount of resources on its own. Try dropping to 8 or 16 workers first and see if that fixes the connection issues. Also make sure your AWS instance has enough memory per worker; Universe environments are memory hungry, especially with VNC rendering.
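To put numbers on the memory point, you can check how much resident memory the worker processes are actually using. A rough sketch, assuming psutil is installed and that the starter agent’s workers show up as worker.py processes (adjust the match string to whatever your process list shows):

import psutil

workers = []
for p in psutil.process_iter(attrs=['cmdline', 'memory_info']):
    cmdline = p.info['cmdline'] or []
    if any('worker.py' in arg for arg in cmdline):  # one process per --num-workers
        workers.append(p)

total_rss = sum(p.info['memory_info'].rss for p in workers if p.info['memory_info'])
print('%d workers, %.1f GiB resident' % (len(workers), total_rss / 2 ** 30))

If that total is anywhere near the instance’s RAM, the VNC servers and Docker containers running on top of it will start failing in exactly the way you’re describing.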
Universe environments fail silently with VNC rendering all the time. Turn the log level up to DEBUG to see what’s actually going on. I’ve hit this before: workers spawn but never connect to the game instances. That ‘failed to connect’ error is your problem; without proper game connections, your agent is just learning noise.
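If your copy of the script doesn’t expose a flag for that, you can get the same effect from Python’s standard logging module before the environment is created. A small sketch, assuming Universe’s loggers live under the ‘universe’ namespace (the usual logging.getLogger(__name__) convention):

import logging

# Send everything at DEBUG and above to stderr, including the loggers
# that report VNC and rewarder connection attempts.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('universe').setLevel(logging.DEBUG)

With that in place you should see each connection attempt and why it fails, instead of the one-line ‘failed to connect to server’.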
This looks like a Docker resource limit issue, not just VNC connections. Universe spawns a Docker container for each worker, and the defaults on an EC2 instance can bottleneck that badly. I hit this exact problem scaling workers on EC2.

Check your Docker daemon config and make sure each container gets enough shared memory; Universe needs a lot of /dev/shm space for rendering. Also verify your instance isn’t being throttled on CPU or network; large instance types like the m4.16xlarge can run into AWS account limits if you haven’t requested increases.

Since the documented 16-worker setup trains fine and your 32-worker run doesn’t, you’re probably hitting resource exhaustion rather than a code problem. Watch htop and docker stats during training to see what’s actually being consumed. Universe workers sometimes fail silently without proper error reporting, which matches your symptom of training continuing with no learning happening.
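A quick way to check the shared-memory point from Python, standard library only, assuming a Linux host where /dev/shm is the shared-memory mount:

import os

# Report how much of /dev/shm is left; Universe's VNC rendering needs a
# generous amount, and Docker's default --shm-size is only 64 MB per container.
stats = os.statvfs('/dev/shm')
to_gib = stats.f_frsize / 2 ** 30
print('/dev/shm: %.1f GiB free of %.1f GiB'
      % (stats.f_bavail * to_gib, stats.f_blocks * to_gib))

Run it on the host and, with docker exec, inside one of the environment containers; a tiny number inside the container points straight at the shm limit.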