I’ve been experimenting with using multiple AI models in a coordinated workflow for building features. This isn’t about asking one AI to write everything at once. Instead, I break the work into manageable pieces and use different models for different jobs.
The Basic Steps
Step 1: Break it down
I take whatever feature I’m building and split it into smaller chunks. Each chunk should be small enough for a single pull request. This keeps things organized and makes it easier to review changes.
Step 2: Make detailed plans
For each chunk, I create a specific plan that includes which files need changes, what those changes should do, and what tests are needed. I use one AI model just for planning because it can look at my whole codebase.
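To make the plan format concrete, here is a minimal sketch of how one chunk's plan could be stored and rendered as text for the coding model. The class names, fields, and the example feature are all hypothetical, not something prescribed by any tool.

```python
from dataclasses import dataclass, field


@dataclass
class FileChange:
    path: str    # file the plan expects to touch (hypothetical example path below)
    intent: str  # what the change in that file should accomplish


@dataclass
class ChunkPlan:
    title: str
    changes: list[FileChange] = field(default_factory=list)
    tests: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the plan as plain text to hand to the coding model."""
        lines = [f"Plan: {self.title}", "Files to change:"]
        lines += [f"- {c.path}: {c.intent}" for c in self.changes]
        lines.append("Tests required:")
        lines += [f"- {t}" for t in self.tests]
        return "\n".join(lines)


# Illustrative plan for one pull-request-sized chunk.
plan = ChunkPlan(
    title="Add rate limiting to login",
    changes=[FileChange("auth/login.py", "reject more than 5 attempts per minute")],
    tests=["test_login_blocks_sixth_attempt"],
)
print(plan.to_prompt())
```

Keeping the plan as structured data rather than free text means the same object can later be compared against what the coding model actually changed.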
Step 3: Code implementation
Once I have a solid plan, I give it to a different AI model to actually write the code. This model follows the plan step by step and creates the diffs I need.
Step 4: Double-check everything
After the code is written, I use a third AI model to review what was actually changed versus what the plan said should happen. This catches mistakes and scope creep.
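Part of this check can be done mechanically before the reviewing model even sees the diff. The sketch below assumes you already have two lists of paths: the files the plan named and the files that actually changed (for example from `git diff --name-only`). The function name and the example paths are made up for illustration.

```python
def compare_to_plan(planned_files, changed_files):
    """Return files changed outside the plan and planned files left untouched."""
    planned, changed = set(planned_files), set(changed_files)
    return {
        "unplanned": sorted(changed - planned),  # possible scope creep
        "missing": sorted(planned - changed),    # planned work that never happened
    }


report = compare_to_plan(
    planned_files=["auth/login.py", "tests/test_login.py"],
    changed_files=["auth/login.py", "auth/session.py"],
)
print(report)
# {'unplanned': ['auth/session.py'], 'missing': ['tests/test_login.py']}
```

This only catches file-level drift. Whether the changes inside a planned file match the plan is still a job for the reviewing model.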
Why This Works Better
Using different models for different jobs prevents common problems, like the AI adding features nobody asked for or forgetting about tests. The planning step keeps everything focused, and the review step catches issues before they get merged.
Tools I Use
I mostly work with Sonnet 4 for planning and coding, then switch to a different model for verification. The key is keeping each model focused on one specific task instead of trying to do everything.
This process takes longer than just asking an AI to build the whole feature, but the results are much more reliable and the code actually does what I intended.
Honestly sounds like overkill for smaller projects, but I get why it works for complex stuff. Do you ever find the handoff between models gets messy? Like when the coding AI misinterprets what the planning AI meant? I’ve tried something similar, but keeping context straight between different models was a pain.
The coordination overhead is real but totally worth it once your projects get complex enough. I’ve found the sweet spot is sticking with the same model family but using different instances with specialized prompts instead of completely different models. Way fewer context issues while keeping things properly separated. What really changed everything for me was adding quick sanity checks between steps. Before moving from planning to coding, I spend 30 seconds scanning for obvious gaps or overreach. Same before review - just a gut check to see if the implementation makes sense. This approach saves your ass during maintenance. When bugs pop up months later, having that structured trail makes debugging infinitely easier than trying to figure out what some AI was thinking in one giant prompt.
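The gut check before handing a plan to the coding step can be partly scripted. This is only a rough sketch: it assumes the plan is a plain dict with `files`, `intent`, and `tests` keys, and the five-file limit is an arbitrary threshold chosen for illustration, not a rule from the workflow above.

```python
def plan_warnings(plan: dict, max_files: int = 5) -> list[str]:
    """Flag obvious gaps or overreach in a plan before it goes to the coding model."""
    warnings = []
    files = plan.get("files", [])
    if not plan.get("tests"):
        warnings.append("no tests listed")
    if len(files) > max_files:
        warnings.append(f"touches more than {max_files} files; consider splitting")
    for path in files:
        # Every file in the plan should say what is supposed to change in it.
        if not plan.get("intent", {}).get(path):
            warnings.append(f"no described change for {path}")
    return warnings


draft = {
    "files": ["auth/login.py", "auth/session.py"],
    "intent": {"auth/login.py": "add rate limiting"},
    "tests": [],
}
found = plan_warnings(draft)
print(found)
# ['no tests listed', 'no described change for auth/session.py']
```

It doesn't replace reading the plan yourself, but it makes the most common gaps hard to miss.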
Your structured approach reminds me of traditional software engineering - probably why it works so well. I’ve had similar success treating AI models like specialized team members instead of magic code generators. One thing I’d add: keep context between steps. When I hand off from planning AI to coding AI, I include the plan plus relevant code snippets and architecture notes. Stops the coding model from making assumptions that break existing patterns. I also keep a simple log of what each model delivered vs what I asked for. Over time, this shows me which models handle different tasks better. Some crush database queries, others nail UI components. The verification step is crucial but gets skipped way too often. I see developers assume AI output is either perfect or obviously broken. Reality check: the most dangerous bugs look fine at first glance but break under specific conditions.
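A log like that doesn't need to be fancy. Here is a minimal sketch, assuming each entry records the model, the kind of task, and whether the output matched the request. The model names, task labels, and field names are placeholders.

```python
from collections import defaultdict


def success_rates(log):
    """Share of deliveries that matched the request, per (model, task) pair."""
    totals = defaultdict(lambda: [0, 0])  # (model, task) -> [matched, total]
    for entry in log:
        key = (entry["model"], entry["task"])
        totals[key][0] += int(entry["matched_request"])
        totals[key][1] += 1
    return {key: matched / total for key, (matched, total) in totals.items()}


log = [
    {"model": "model-a", "task": "sql", "matched_request": True},
    {"model": "model-a", "task": "sql", "matched_request": True},
    {"model": "model-a", "task": "ui", "matched_request": False},
    {"model": "model-b", "task": "ui", "matched_request": True},
]
rates = success_rates(log)
for (model, task), rate in sorted(rates.items()):
    print(f"{model} {task}: {rate:.0%}")
```

With only a handful of entries the numbers mean little, so treat them as a hint about where to look rather than a ranking.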
This reminds me of code reviews at work. Having that third AI as a reviewer is brilliant - it catches stuff you miss when you’re buried in the code.
I do something similar but add real testing after the review. AI misses edge cases that only surface with actual data.
Early on, the planning AI was way too optimistic with time estimates. Now I always pad the timeline since the coding AI often needs several tries for complex logic.
Keeping everything in one conversation thread works way better than starting fresh each time. Past decisions stay in the model’s context, so its output stays consistent with them.
Debugging becomes super powerful with this setup. When things break, I can trace back through each step to find where the plan and execution diverged.