Hi there! I ran into an interesting challenge recently. I had a really well-tuned prompt that worked great with GPT-3.5. But when I tried the exact same prompt with GPT-4o-mini, the results were totally different and not what I was looking for.
This made me wonder if there’s a way to automate the whole prompt engineering process. My idea is to create a simple application that takes a set of example inputs and their expected outputs, then automatically generates and tests different prompt variations using AI models to evaluate which ones work best.
Has anyone come across existing solutions for this kind of automated prompt optimization? I know some platforms have evaluation features built in, but I haven’t found anything that specifically tackles this prompt adaptation problem when switching between models.
Would love to hear your thoughts on whether this would be valuable for your projects!
Prompt optimization across models is brutal. Spent weeks on this last year migrating our support chatbot between models.
What worked: build a simple evaluation pipeline first. Take your input-output pairs and score them with another AI model as judge. Sounds weird but works.
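A rough sketch of what that judge loop can look like. Everything here is a placeholder: `call_model` and `call_judge` stand in for your real API calls, and the exact-match "judge" is just a stub where a second model would normally rate the output:

```python
# Minimal sketch of an LLM-as-judge evaluation loop.
# call_model / call_judge are placeholders for actual API calls.

def call_model(prompt: str, example_input: str) -> str:
    # Placeholder: in practice this hits the model under test.
    return prompt.format(input=example_input)

def call_judge(output: str, expected: str) -> float:
    # Placeholder judge: in practice, ask a second model to rate the
    # match on a 0-1 scale. Exact match is a stand-in here.
    return 1.0 if output.strip() == expected.strip() else 0.0

def score_prompt(prompt: str, examples: list[tuple[str, str]]) -> float:
    """Average judge score of one prompt over (input, expected) pairs."""
    scores = [call_judge(call_model(prompt, inp), exp) for inp, exp in examples]
    return sum(scores) / len(scores)

examples = [("hello", "Echo: hello"), ("world", "Echo: world")]
print(score_prompt("Echo: {input}", examples))
```

The point is the shape, not the judge: once `score_prompt` exists, every variation experiment reduces to comparing numbers.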
I built this with basic Python scripts. The key is systematic prompt variations - not random changes. Try different instruction formats, add/remove examples, change tone, adjust temperature. Run batches through both models and compare scores.
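"Systematic, not random" basically means crossing a few defined axes. A toy version (the axis names and template strings are made up for illustration):

```python
from itertools import product

# Sketch: generate prompt variants by crossing structured axes
# instead of making random edits. Axis values are illustrative only.

INSTRUCTION_STYLES = {
    "direct": "Answer the question: {input}",
    "step_by_step": "Think step by step, then answer: {input}",
}
TONES = {
    "formal": "Respond in a formal tone. ",
    "casual": "Keep it casual. ",
}

def generate_variants() -> list[str]:
    """Cross every tone prefix with every instruction style."""
    return [tone + style
            for (_, tone), (_, style) in product(TONES.items(),
                                                 INSTRUCTION_STYLES.items())]

variants = generate_variants()  # 2 tones x 2 styles = 4 variants
```

Because the axes are explicit, you can tell afterwards *which* axis moved the score, which random edits never give you.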
One thing I learned - prompts that work on older models completely fail on newer ones because training changed how they interpret instructions. You’ve got to test systematic differences like formal vs casual language, step-by-step vs direct instructions.
For testing workflow, I used simple API calls in a loop with rate limiting. Store results in a basic database to track which modifications actually improve performance.
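That workflow fits in a few lines with `sqlite3` and a sleep between calls. `run_and_score` below is a placeholder for the real API call plus judging:

```python
import sqlite3
import time

# Sketch of the batch test loop: each (model, prompt) pair gets called
# with crude sleep-based rate limiting, results land in SQLite.

def run_and_score(model: str, prompt: str) -> float:
    return 0.5  # placeholder for real API call + judge score

def run_batch(models, prompts, delay_s=0.0, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results "
                 "(model TEXT, prompt TEXT, score REAL)")
    for model in models:
        for prompt in prompts:
            score = run_and_score(model, prompt)
            conn.execute("INSERT INTO results VALUES (?, ?, ?)",
                         (model, prompt, score))
            time.sleep(delay_s)  # rate limit between API calls
    conn.commit()
    return conn

conn = run_batch(["model-a", "model-b"], ["prompt-v1", "prompt-v2"])
```

For real use you'd want retries and proper backoff, but even this much gives you a queryable history of which modifications moved the numbers.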
Biggest win was automated regression testing. Every time we update prompts, the system runs them against test cases and flags major performance drops before deployment.
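The regression gate itself is tiny once you have per-case scores stored. A sketch, with an arbitrary drop threshold and made-up case names:

```python
# Sketch of a regression gate: compare a candidate prompt's per-case
# scores against the stored baseline and flag big drops pre-deploy.

def regression_check(baseline: dict[str, float],
                     candidate: dict[str, float],
                     max_drop: float = 0.1) -> list[str]:
    """Return the test-case ids whose score dropped more than max_drop."""
    return [case for case, base in baseline.items()
            if base - candidate.get(case, 0.0) > max_drop]

baseline = {"refund": 0.90, "greeting": 0.95}
candidate = {"refund": 0.92, "greeting": 0.60}
flagged = regression_check(baseline, candidate)
```

Wire that into CI so a prompt change that tanks any case blocks the deploy instead of surfacing in production.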
This problem hits different when you're managing production systems. I've watched prompt performance tank 40% just by switching from GPT-3.5 to Claude on the same task.

The real pain isn't generating variations - it's getting reliable metrics that actually matter. Learned this the hard way when automated scoring kept picking prompts that tested well but bombed in real use. Now I focus on domain-specific benchmarks instead of generic ones. Build test cases that match your actual workflow, not textbook scenarios. Your scoring model needs to get your specific context and requirements.

One thing I've learned building these: model-specific prompt libraries crush universal prompts every time. Each model follows instructions differently and has its own response style. Document what works for each instead of forcing them to play nice together.

Start simple - A/B test two prompt versions before you build anything complex. Manually checking your first 100 test cases teaches you more about automation than any framework will.
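The "start simple" A/B step needs almost no machinery. A sketch, assuming you already have per-case scores for both prompt versions (the numbers here are stand-ins):

```python
# Minimal A/B comparison: score two prompts on the same test cases
# and report how often B beats A. Scores are illustrative stand-ins.

def ab_compare(scores_a: list[float], scores_b: list[float]) -> float:
    """Fraction of test cases where prompt B beats prompt A."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    return wins / len(scores_a)

win_rate = ab_compare([0.70, 0.80, 0.60, 0.90],
                      [0.90, 0.70, 0.80, 0.95])
```

A per-case win rate also shows you *where* B loses, which is exactly the kind of thing you catch when manually reviewing those first 100 cases.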
i totally feel u! it’s so frustrating when prompts don’t hit right on different models. langsmith is pretty cool, and i’ve heard good things about promptfoo too. a tool that simplifies this would be awesome! count me in for testing it!
Been dealing with this exact headache for months at work. Different models need completely different approaches even for the same task.
What you’re describing sounds perfect for automation. Skip building from scratch - set up a workflow that automatically tests prompt variations across multiple models and scores outputs against your expected results.
I’ve built similar systems. Feed in your example inputs and outputs, auto-generate prompt variations, run them through different models, and rank results. The key is having good scoring to evaluate which prompts actually work better.
You can set it up to learn from previous optimizations and get smarter about what prompt modifications work best for specific model transitions. Way more efficient than manual testing.
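The ranking step at the end of a run is straightforward once scores exist per model. A sketch with hypothetical model and variant names, where `scores` stands in for the output of the full run-and-judge pipeline:

```python
# Sketch: rank prompt variants separately for each model, since the
# best variant usually differs per model. Names are hypothetical.

def rank_prompts(scores: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """For each model, return variant names sorted best-first by score."""
    return {model: sorted(s, key=s.get, reverse=True)
            for model, s in scores.items()}

scores = {
    "model-a": {"v1": 0.6, "v2": 0.8},
    "model-b": {"v1": 0.9, "v2": 0.5},
}
ranking = rank_prompts(scores)
```

Keeping the ranking per model rather than averaging across models is what feeds the "model-specific prompt library" idea mentioned upthread.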
The whole thing becomes straightforward when you have the right automation platform handling all the API calls, comparisons, and data flow.
Check out Latenode for building this kind of automated prompt optimization system: https://latenode.com