I’m working with a gym environment that has multiple different types of sensors. My setup includes a camera producing 24x24 images, an x-ray sensor outputting 1x25 readings, and 10 individual temperature sensors.
Right now I'm using spaces.Dict to organize these different inputs:
import gym
import numpy as np

class SensorEnv(gym.Env):
    def __init__(self, ...):
        # One Box sub-space per sensor, grouped into a single Dict observation space.
        sensor_spaces = {
            'scanner': gym.spaces.Box(low=-np.inf, high=np.inf, shape=(sensor_count,)),
            'camera_a': gym.spaces.Box(low=-np.inf, high=np.inf, shape=(width_a, height_a)),
            'camera_b': gym.spaces.Box(low=-np.inf, high=np.inf, shape=(width_b, height_b)),
            'heat_map': gym.spaces.Box(low=-np.inf, high=np.inf, shape=(heat_w, heat_h)),
        }
        self.observation_space = gym.spaces.Dict(sensor_spaces)
This works great for custom agents since I can access obs['camera_a'] or obs['scanner'] directly. But when I try using algorithms from stable-baselines3, they don't work with spaces.Dict.
Should I flatten everything into one big 1D array instead? Something like combining all dimensions into a single spaces.Box? I'm worried about the performance overhead of constantly converting between 2D matrices and 1D arrays. It also seems like it would be easy to mess up the indexing.
What’s the best way to handle this situation? Should I create a wrapper environment that converts between the two formats depending on which agent I’m using?
From my experience working with mixed sensor environments, the flattening approach can actually work quite well if you structure it properly. I found that creating a standardized preprocessing pipeline, where each sensor type gets normalized to a consistent range before concatenation, helps avoid the scaling issues that plague naive flattening. The key insight was treating it like feature engineering: I concatenate in a predictable order (images first, then 1D sensor arrays) and document the layout thoroughly.

Performance-wise, numpy concatenation is surprisingly fast for reasonably sized observations. What really sold me on this approach was compatibility: almost every RL library expects Box spaces, so you avoid constantly hitting edge cases with different algorithms.

The debugging concern is valid, though. I mitigated it by logging the raw sensor values separately during training so I could always trace issues back to specific sensors. For your case with 24x24 images plus relatively small sensor arrays, the memory overhead should be negligible compared to model forward passes.
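In code, the idea is roughly this. The sensor keys match your snippet, but the normalization constants here are made-up placeholders; you'd substitute your own calibration ranges:

import numpy as np

# The order of concatenation defines the layout of the flat vector:
# images first, then 1D sensor arrays. Keep this order fixed and documented.
def preprocess_and_flatten(obs):
    camera_a = obs['camera_a'].astype(np.float32) / 255.0           # assumes 8-bit pixels
    camera_b = obs['camera_b'].astype(np.float32) / 255.0
    heat_map = (obs['heat_map'].astype(np.float32) - 20.0) / 80.0   # assumes ~20-100 degree range
    scanner  = (obs['scanner'].astype(np.float32) + 1.0) / 2.0      # assumes readings in [-1, 1]
    return np.concatenate([a.ravel() for a in (camera_a, camera_b, heat_map, scanner)])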
Honestly I'd stick with spaces.Dict and just use stable-baselines3-contrib instead of regular SB3. They have better multi-input support and you won't need hacky wrappers. I tried flattening before and indexing became a nightmare when debugging sensor issues later.
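For reference, a minimal sketch of what keeping the Dict space looks like, assuming a recent stable-baselines3 that supports Dict observations via "MultiInputPolicy" (sb3-contrib algorithms accept analogous multi-input policy strings). Your SensorEnv constructor arguments are elided in the question, so treat the instantiation as a placeholder:

from stable_baselines3 import PPO

env = SensorEnv(...)  # your Dict-observation env, unchanged
# "MultiInputPolicy" builds a feature extractor per Dict key and concatenates
# the extracted features before the policy/value heads.
model = PPO("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)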
I've dealt with this exact problem in several projects and ended up going with the wrapper approach you mentioned. I created a simple wrapper that flattens Dict observations to a Box for stable-baselines3 compatibility while keeping the original structure for custom agents.

The performance hit from flattening is minimal in practice. I was worried about the same thing initially, but profiling showed the conversion overhead was negligible compared to the actual RL computation. The bigger issue I found was debugging: when something goes wrong with a flattened observation, it's much harder to trace back to which sensor caused the problem.

One thing that helped was adding metadata to track the slice indices for each sensor type in the flattened array. This way you can still extract individual sensor data when needed for analysis or debugging. I also normalized each sensor type separately before flattening, since the temperature readings and image pixels had completely different scales. The wrapper solution gives you the best of both worlds without forcing you to rewrite existing code that expects the Dict format.
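A rough sketch of that wrapper, assuming gym's ObservationWrapper API and float32 observations; the extract helper is hypothetical, just to show how the slice metadata gets used for debugging:

import numpy as np
import gym

class FlattenDictWrapper(gym.ObservationWrapper):
    """Flattens a Dict observation into one Box and records each sensor's slice."""

    def __init__(self, env):
        super().__init__(env)
        self.slices = {}
        start = 0
        for key, space in env.observation_space.spaces.items():
            size = int(np.prod(space.shape))
            self.slices[key] = slice(start, start + size)  # metadata for debugging
            start += size
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(start,), dtype=np.float32)

    def observation(self, obs):
        # Concatenate in the same key order used to build the slices above.
        return np.concatenate([np.asarray(obs[k], dtype=np.float32).ravel()
                               for k in self.slices])

    def extract(self, flat_obs, key):
        # Recover an individual sensor from the flat array for analysis/debugging.
        return flat_obs[self.slices[key]].reshape(self.env.observation_space[key].shape)

You'd wrap only when handing the env to stable-baselines3 (env = FlattenDictWrapper(SensorEnv(...))), and keep using the unwrapped Dict env for your custom agents.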