How to handle variable action spaces in OpenAI Gym environments

I’m working on a custom OpenAI Gym environment and running into issues with dynamic action spaces. My environment has 46 possible actions in total, but depending on the current state, only a subset of them (say, 7) is actually valid.

I’m trying to integrate this with keras-rl agents but can’t figure out how to properly handle this situation. The standard Gym framework seems to expect fixed action spaces.

Has anyone dealt with similar scenarios where the available actions change based on the environment state? I’m curious about how RL agents like DQN handle action selection when the action space is constrained.

Any suggestions or workarounds would be really helpful. Thanks!

Another simple trick: return a big negative reward (like -999) when the agent picks an invalid action. It learns pretty quickly to avoid them without messing with the Gym interfaces. Way easier than restructuring everything, and it works with any RL algorithm.
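If it helps, here’s a rough sketch of what that can look like inside your environment’s step() method; get_valid_actions(), _get_obs(), and _apply_action() are just placeholder names for whatever helpers your env already has:

```python
def step(self, action):
    # Penalty route: invalid actions cost -999 and leave the state untouched.
    if action not in self.get_valid_actions():      # placeholder helper
        return self._get_obs(), -999.0, False, {"invalid_action": True}
    # Valid actions go through the normal transition logic.
    return self._apply_action(action)               # placeholder helper
```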

I encountered a similar issue in a trading environment where the available actions changed based on the assets on hand. What worked for me was implementing action masking at the agent level, rather than modifying the Gym space. I kept the full action space of 46 but added a method called get_valid_actions() that returns the indices of the currently valid actions. I then adjusted the agent’s action selection to consider only those indices. For DQN, I masked the invalid actions’ Q-values by setting them to negative infinity before applying softmax or taking the argmax. This keeps things compatible with keras-rl while ensuring the agent never selects an invalid action. Just remember to apply the mask consistently during both training and inference.
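For reference, the masking step itself can be as small as the sketch below (assuming the Q-values come out as a NumPy array and get_valid_actions() gives you the valid indices). Wiring it into keras-rl typically means subclassing a policy such as EpsGreedyQPolicy and handing it a reference to the environment, since policies only see the Q-values:

```python
import numpy as np

def masked_greedy_action(q_values, valid_actions):
    """Pick the greedy action among the currently valid indices only."""
    masked = np.full(q_values.shape, -np.inf)
    masked[valid_actions] = q_values[valid_actions]
    return int(np.argmax(masked))

# Example: 46 Q-values, but only these 7 actions are valid right now.
q = np.random.randn(46)
valid = [0, 3, 9, 17, 22, 31, 40]
action = masked_greedy_action(q, valid)
```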

Here’s what worked for me: I used a two-stage action selection instead of masking invalid actions. The agent first picks an action category, then chooses the specific action within that category. This handles the variable action space naturally since each category only has valid actions for the current state. You skip all the masking complexity while keeping keras-rl happy. Basically, you’re turning your dynamic 46-action problem into several smaller fixed-action problems. The agent learns which category works best AND which specific action to take within it. Bonus: learning efficiency improves since the agent develops hierarchical strategies instead of wrestling with a huge sparse action space.
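To make that concrete, here’s a hypothetical grouping of the 46 actions into categories, plus the decode step that maps the two choices back to the original flat action id (the actual split obviously depends on your environment’s semantics):

```python
# Hypothetical grouping of the 46 flat actions into 3 categories.
ACTION_CATEGORIES = [
    list(range(0, 10)),    # category 0 -> flat actions 0-9
    list(range(10, 28)),   # category 1 -> flat actions 10-27
    list(range(28, 46)),   # category 2 -> flat actions 28-45
]

def decode_action(category_idx, sub_idx):
    """Map a (category, sub-action) choice back to the original flat id."""
    return ACTION_CATEGORIES[category_idx][sub_idx]

# e.g. category 1, 4th action within it -> flat action 13
flat_action = decode_action(1, 3)
```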

Here’s another approach that’s worked well for me: observation augmentation. You basically append an action validity vector to your state representation. So your observation becomes [original_state, valid_actions_mask] - the mask is just a binary vector showing which of the 46 actions are valid right now. The agent learns the connection between states and valid actions as it trains. The big win? You don’t touch the agent’s action selection logic at all. It figures out on its own to avoid actions when the mask bit is zero. I’ve had great results with this on policy gradient methods since the agent builds an intuitive sense of when actions are available instead of having it forced through masking.
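A minimal sketch of that as a gym.ObservationWrapper, assuming the wrapped env exposes a get_valid_actions() helper and uses a 1-D Box observation space:

```python
import gym
import numpy as np
from gym import spaces

class MaskAugmentedObs(gym.ObservationWrapper):
    """Append a binary validity mask over all 46 actions to each observation."""

    def __init__(self, env):
        super().__init__(env)
        n = env.action_space.n
        # Extend the observation space so the mask bits fit in [0, 1].
        low = np.concatenate([env.observation_space.low, np.zeros(n)])
        high = np.concatenate([env.observation_space.high, np.ones(n)])
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        mask = np.zeros(self.env.action_space.n, dtype=np.float32)
        mask[self.env.get_valid_actions()] = 1.0   # assumed helper on the env
        return np.concatenate([obs, mask]).astype(np.float32)
```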

You could also wrap your env to handle invalid actions internally. When the agent picks something invalid, the wrapper either penalizes it or maps it to a valid action automatically. I’ve seen people use gym.ActionWrapper for this, which keeps your RL code clean without touching the agent logic.
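Something along these lines (again assuming a get_valid_actions() helper on the env); this variant remaps an invalid choice to a random valid action, but you could just as easily return a penalty from a plain gym.Wrapper instead:

```python
import gym
import numpy as np

class RemapInvalidActions(gym.ActionWrapper):
    """Silently replace an invalid action with a random currently valid one."""

    def action(self, act):
        valid = self.env.get_valid_actions()   # assumed helper on the env
        return act if act in valid else int(np.random.choice(valid))
```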