I’m working on a grid-based problem where my explorer moves through cells containing stationary entities. Each entity can award points, subtract points, or have no effect when encountered. The explorer can only move in four cardinal directions and cannot see what’s in neighboring cells before entering them.
The tricky part is that learning only happens after completing a full exploration run. The explorer starts at one corner and must return to that same corner with positive health to trigger the learning phase. During exploration, it collects data about cell coordinates and entity characteristics it encounters. If health reaches zero during exploration, the run fails but can be restarted.
Each stationary entity has three visual properties: shape (3 options), color (3 options), and size (2 options). Every entity also has a reward value that determines point changes. Movement costs one point per step.
My goal is to create an intelligent system that can figure out which entity types give positive or negative rewards. The challenge is the limited visibility - I can’t peek at adjacent cells. This makes it hard to plan routes or avoid dangerous entities.
What machine learning or evolutionary strategies would work well for this scenario? I’m particularly struggling with how to extract useful patterns from just position data and entity features without the ability to scout ahead.
I tried evolutionary algorithms for this and got pretty good results. Treat each exploration run like an individual in a population - the fitness function looks at total points and how much ground you covered. Encode different strategies as chromosomes: movement weights, risk tolerance, entity avoidance rules. Run multiple attempts each generation and breed the winners.

The game-changer was sharing knowledge across all individuals in each generation. When any explorer finds a new entity type, everyone else immediately knows about it. Later runs can use what earlier ones discovered.

The best mutations were tweaking movement biases and adjusting risk/reward thresholds for known entities. Crossover between successful explorers mixed good pathfinding with smart entity recognition.

Critical point - weight your fitness heavily toward completion, not just points. Failed runs with lots of data have some value, but finishing the circuit should be priority #1. Otherwise you’ll get stuck with aggressive strategies that can’t sustain full runs.
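Here’s a rough Python sketch of how I structured it. run_exploration() stands in for your own simulator, and the genome layout and fitness weights are illustrative placeholders, not tuned values:

```python
import random

# Assumed interface: run_exploration(genome, entity_table) simulates one run
# and returns (points, cells_covered, completed, encounters), where
# encounters maps entity feature combos to observed rewards.

def make_genome():
    return {
        "move_bias": [random.random() for _ in range(4)],  # N, E, S, W weights
        "risk_tolerance": random.uniform(0.0, 1.0),
    }

def fitness(points, coverage, completed):
    # Weight completion heavily so aggressive, non-finishing strategies lose.
    return points + 0.5 * coverage + (100.0 if completed else 0.0)

def crossover(a, b):
    return {
        "move_bias": [random.choice(pair) for pair in zip(a["move_bias"], b["move_bias"])],
        "risk_tolerance": random.choice([a["risk_tolerance"], b["risk_tolerance"]]),
    }

def mutate(g, rate=0.1):
    if random.random() < rate:
        g["move_bias"][random.randrange(4)] += random.gauss(0, 0.1)
    if random.random() < rate:
        g["risk_tolerance"] = min(1.0, max(0.0, g["risk_tolerance"] + random.gauss(0, 0.05)))
    return g

def evolve(run_exploration, generations=50, pop_size=20):
    entity_table = {}  # shared across the whole population
    population = [make_genome() for _ in range(pop_size)]
    for _ in range(generations):
        scored = []
        for g in population:
            points, coverage, completed, encounters = run_exploration(g, entity_table)
            entity_table.update(encounters)  # everyone learns from every run
            scored.append((fitness(points, coverage, completed), g))
        scored.sort(key=lambda s: s[0], reverse=True)
        parents = [g for _, g in scored[: pop_size // 2]]  # breed the winners
        population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                      for _ in range(pop_size)]
    return scored[0][1]
```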
MCTS is probably your best bet. I’ve used it for similar exploration problems where you can’t see what’s coming and need to balance going after known rewards vs exploring new areas.

Treat each grid cell as a tree node and build the tree out as you discover new spots. Start each run using your existing tree to head toward promising areas, then switch to exploration mode when you hit unknown territory. After each run, backpropagate rewards through all the cells you visited to update their values.

What’s great about MCTS is how it handles uncertainty - it naturally figures out when to stick with known good areas vs when to try new paths. For entity recognition, keep separate value estimates for each combo of visual properties; the more entities you see with similar features, the better it gets at predicting rewards. The movement cost adds an interesting twist, since the tree learns to optimize path efficiency while collecting rewards. I throw in a bit of randomness in move selection - it helps avoid getting stuck in local optima early on.
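Roughly, the per-cell bookkeeping and backprop step look like this. The UCT constant, epsilon, and node layout are assumptions on my part, not from any particular MCTS library:

```python
import math
import random

DIRECTIONS = ["N", "E", "S", "W"]

class CellNode:
    """One node per discovered grid cell."""
    def __init__(self):
        self.visits = 0
        self.total_reward = 0.0
        self.children = {}  # direction -> CellNode, added as cells are discovered

def uct_score(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")  # always try a discovered-but-unvisited cell once
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits + 1) / child.visits)
    return exploit + explore

def select_move(node, epsilon=0.05):
    # A bit of randomness avoids locking onto early local optima.
    if not node.children or random.random() < epsilon:
        return random.choice(DIRECTIONS)
    return max(node.children, key=lambda d: uct_score(node.children[d], node.visits))

def backpropagate(path, run_reward):
    # After a full run, push the final score through every visited cell.
    for node in path:
        node.visits += 1
        node.total_reward += run_reward
```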
I’d go with a hybrid approach - reinforcement learning plus a simple memory system. Built something similar last year for pathfinding.
Treat this as two separate problems. First: learn entity rewards through pattern matching. Second: learn exploration strategies through RL.
For entities, use a lookup table mapping visual properties to reward estimates. Update confidence scores after each encounter. Start conservative - assume neutral rewards until you’ve got solid data.
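Something like this, where the (shape, color, size) key and a running mean are just one way to do it - the defaults of zero give you the conservative "assume neutral" start:

```python
from collections import defaultdict

class EntityMemory:
    """Running reward estimate per (shape, color, size) combination."""
    def __init__(self):
        self.mean = defaultdict(float)  # estimated reward; 0.0 = assumed neutral
        self.count = defaultdict(int)   # encounters seen; doubles as confidence

    def update(self, shape, color, size, reward):
        key = (shape, color, size)
        self.count[key] += 1
        # Incremental mean: converges toward the true reward as data comes in.
        self.mean[key] += (reward - self.mean[key]) / self.count[key]

    def estimate(self, shape, color, size):
        key = (shape, color, size)
        return self.mean[key], self.count[key]  # (expected reward, confidence)
```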
For exploration, use basic RL to learn state-action values. State = current position + health + known entity info. Actions = your four movement directions.
Reward shaping is where the magic happens. Small rewards for discovering new entities (even harmful ones), medium for unexplored areas, big for completion.
Epsilon-greedy works great here. High epsilon early for discovery, then gradually reduce as entity knowledge improves.
One trick that really helped: separate models for different health ranges. The agent learns different strategies when health is high (exploratory) vs low (conservative).
Start with tabular Q-learning before deep networks. The state space isn’t huge with grid coordinates and basic entity memory.
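Something like this is all the tabular version needs - health buckets from the trick above, epsilon decay, and the shaped reward. Every constant here is an untuned placeholder:

```python
import random
from collections import defaultdict

ACTIONS = ["N", "E", "S", "W"]

def health_bucket(health):
    # Coarse buckets let the agent learn separate high/low-health strategies.
    return "high" if health > 10 else "low"  # threshold is a placeholder

class QAgent:
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.9):
        self.q = defaultdict(float)  # (state, action) -> value, defaults to 0
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def state(self, pos, health):
        return (pos, health_bucket(health))

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)                      # explore
        return max(ACTIONS, key=lambda a: self.q[(state, a)])  # exploit

    def learn(self, s, a, shaped_reward, s_next):
        best_next = max(self.q[(s_next, a2)] for a2 in ACTIONS)
        self.q[(s, a)] += self.alpha * (shaped_reward + self.gamma * best_next - self.q[(s, a)])

    def decay_epsilon(self, rate=0.99, floor=0.05):
        # High epsilon early for discovery, then gradually reduce.
        self.epsilon = max(floor, self.epsilon * rate)

def shape_reward(raw, new_entity, new_cell, completed):
    # Small bonus for any new entity, medium for unexplored cells, big for completion.
    return raw + (1.0 if new_entity else 0.0) \
               + (2.0 if new_cell else 0.0) \
               + (50.0 if completed else 0.0)
```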
Your idea on Q-learning is interesting. Maybe consider a deep reinforcement learning approach too? It can handle complex patterns better, and even if you can’t see adjacent cells, the model can learn from past runs. Good luck figuring it out!
Reinforcement learning’s a solid approach, but I’d skip the ML training headaches and go full automation instead.
This screams automation problem to me. Skip the neural networks - just build automated pattern recognition that learns entity rewards across multiple runs.
Here’s my approach: an automated system runs exploration cycles, logs every entity encounter (visual properties + reward outcome), then rebuilds a decision tree from that log. After each successful run, it updates what it knows about which shape/color/size combos are good or bad.
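Sketched in Python, with scikit-learn’s DecisionTreeRegressor as a stand-in for whatever the pipeline actually calls; the property labels are made up:

```python
from sklearn.tree import DecisionTreeRegressor

# Integer-encode the visual properties; the labels themselves are illustrative.
SHAPES = {"circle": 0, "square": 1, "triangle": 2}
COLORS = {"red": 0, "green": 1, "blue": 2}
SIZES = {"small": 0, "large": 1}

def retrain(log):
    """log: list of ((shape, color, size), reward) from every run so far,
    failed runs included - they still tell you about dangerous entities."""
    X = [[SHAPES[s], COLORS[c], SIZES[z]] for (s, c, z), _ in log]
    y = [reward for _, reward in log]
    # A shallow tree is plenty: 3 shapes x 3 colors x 2 sizes is only
    # 18 possible entity types.
    return DecisionTreeRegressor(max_depth=4).fit(X, y)

def predicted_reward(tree, shape, color, size):
    return tree.predict([[SHAPES[shape], COLORS[color], SIZES[size]]])[0]
```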
For exploration, automate different movement patterns - random walks, spiral searches, edge following. Let the system figure out what works best for different grid layouts.
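The canned patterns can just be direction generators the runner draws moves from. Two examples, assuming the runner consumes one move at a time:

```python
import itertools
import random

def random_walk():
    # Unbiased drunkard's walk.
    while True:
        yield random.choice(["N", "E", "S", "W"])

def spiral_search():
    # Expanding square spiral: 1 step E, 1 N, 2 W, 2 S, 3 E, 3 N, ...
    directions = itertools.cycle(["E", "N", "W", "S"])
    run_length = 1
    while True:
        for _ in range(2):  # run length grows every two turns
            d = next(directions)
            for _ in range(run_length):
                yield d
        run_length += 1
```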
Best part? Run hundreds of cycles overnight automatically. Even failed runs contribute data about dangerous entities. The system builds confidence scores for each entity type and gradually shifts toward safer, more rewarding paths.
I’ve tackled similar problems by automating the entire learning pipeline instead of hand-coding ML algorithms. Way less complex, faster results.
You can build this whole automated learning system with Latenode. Handles data collection, pattern analysis, and decision making without messing with ML frameworks.