I spent the last few weeks diagnosing why some of my Playwright tests are flaky, and I’ve been experimenting with using multiple AI models to help identify root causes and suggest fixes automatically within the workflow itself.
the idea is solid: when a test fails, instead of manually reviewing logs and trying to guess what happened, I give an AI agent the test failure details, error logs, and the original test code. The AI analyzes it, suggests potential causes, and proposes specific fixes. Then I test those fixes and iterate.
what’s been interesting is that different models tend to catch different things. One model picks up on timing issues. Another spots selector brittleness. A third catches logic errors. So I started setting up workflows where, if one model’s diagnosis doesn’t resolve the issue, the next model takes a crack at it.
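for illustration, that fallback chain can be sketched like this. Everything here is hypothetical: each "model" is just a callable that takes failure context and returns a proposed fix, and `resolves` stands in for re-running the test with the fix applied.

```python
from typing import Callable, Optional

# Hypothetical sketch of a model-fallback diagnosis loop. A "model" takes
# failure context and returns a proposed fix string (or None if it has no
# diagnosis); "resolves" re-runs the test with the fix and reports success.
Model = Callable[[dict], Optional[str]]

def diagnose_with_fallback(failure: dict, models: list[Model],
                           resolves: Callable[[str], bool]) -> Optional[str]:
    """Ask each model in turn; stop at the first fix that makes the test pass."""
    for model in models:
        fix = model(failure)
        if fix and resolves(fix):
            return fix
    return None  # no model produced a working fix -> escalate to a human
```

the point of the shape is that each model only runs if the previous one failed to produce a validated fix, so you pay for one model call in the common case.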
But here’s what I’m struggling with: how do you actually automate this in a sustainable way? Like, do you embed this diagnosis loop into the test workflow itself, so every failure automatically triggers analysis? Or do you run it separately, only when you have time? And how do you prevent it from becoming more overhead than the testing itself?
also curious: when an AI suggests a fix, how confident are you in applying it automatically versus reviewing it first? I’ve had some suggestions that were spot-on, and others that were… creative but wrong.
Would love to hear how people have built this into their testing pipeline. Is this actually reducing the time you spend debugging, or is it adding another layer to manage?
you’re thinking about this right, but the workflow structure is what makes it scalable.
here’s how it should work: test fails, a diagnostic agent runs automatically. that agent collects the error logs, the test code, recent UI changes, and feeds that context to an AI model. the AI model analyzes it, suggests a fix. then the system validates the fix by re-running a smaller test to verify the suggestion is sound. you don’t apply untrusted fixes automatically. you validate them first.
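a minimal sketch of that loop, with the context collectors, model call, and test rerun stubbed out as callables (all names here are hypothetical, not a real API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnosis:
    cause: str          # model's explanation of the failure
    proposed_fix: str   # concrete change to the test code

def handle_failure(test_name: str,
                   collect_context: Callable[[str], dict],
                   ask_model: Callable[[dict], Diagnosis],
                   rerun_with_fix: Callable[[str, str], bool]) -> tuple[Diagnosis, bool]:
    """Gather context, get a proposed fix, then validate it by re-running
    the test. Nothing is applied until validation passes."""
    context = collect_context(test_name)      # logs, test code, recent UI changes
    diagnosis = ask_model(context)            # model proposes cause + fix
    validated = rerun_with_fix(test_name, diagnosis.proposed_fix)
    return diagnosis, validated
```

the return value is the key design choice: the caller gets both the diagnosis and a validation verdict, and only validated fixes move forward.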
the key is confidence scoring. when the AI suggests a fix, it should also score how confident it is. high confidence fixes can be auto-applied in a dev environment for testing. low confidence suggestions get flagged for human review. this way you’re not managing chaos—you’re automating confident fixes and flagging uncertain ones.
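the routing itself is simple once the model emits a score; the threshold below is illustrative, not a recommendation:

```python
# Illustrative threshold -- tune it against your own false-positive rate.
AUTO_APPLY_THRESHOLD = 0.85

def route_fix(confidence: float) -> str:
    """High-confidence fixes auto-apply in a dev environment for testing;
    everything else goes to a human review queue."""
    return "auto-apply-in-dev" if confidence >= AUTO_APPLY_THRESHOLD else "human-review"
```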
for the multiple models angle, that’s smart but structure it right. first model is your primary diagnostic. if it doesn’t resolve the issue, second model gets a shot. but don’t run all models simultaneously. that’s overhead.
the real win is embedding this in your dev/prod environment strategy. bugs get diagnosed in dev, fixes validated there, then promoted to prod. you’re not debugging production tests. you’re fixing test logic in a safe environment and deploying stable tests.
this absolutely saves time if you’re running hundreds of tests. for a few dozen, you’re adding overhead. but at volume, the ROI is solid.
We set up automatic diagnosis for high-priority tests. When those fail, the AI agent runs automatically, analyzes the failure, and flags issues. For low-priority tests, we only run diagnosis if the failure persists across multiple runs.
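That triggering rule is easy to make explicit. A sketch of how we think about it (the persistence threshold of 3 runs is just an example value, not what any particular team uses):

```python
def should_diagnose(priority: str, consecutive_failures: int,
                    persistence_threshold: int = 3) -> bool:
    """High-priority tests are diagnosed on the first failure; low-priority
    tests only once the failure has persisted across several runs."""
    if priority == "high":
        return consecutive_failures >= 1
    return consecutive_failures >= persistence_threshold
```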
confidence scoring changed things for us. We trust high-confidence suggestions from our primary model enough to re-run the test with the suggestion in a staging environment. If it passes, we promote the fix. Low-confidence suggestions go to a queue for our team to review.
The breakthrough was treating an AI suggestion as a proposal, not gospel. We always verify before deploying. That keeps us confident in the fixes and maintains test quality.
Sustainable flakiness diagnosis requires architectural discipline. Decouple diagnosis from test execution: don’t block test runs while analyzing; run diagnosis asynchronously. Classify failures by type and route them to specialized diagnostic agents, and put safeguards in place so fixes are never auto-applied at scale without validation.

The value compounds when you maintain a feedback loop: fix applied, test rerun, result logged. Over time the system learns which fix categories are reliable and which need human review. This substantially reduces manual debugging overhead, but it requires real investment in validation infrastructure.
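One way to sketch that feedback loop is a ledger that tracks, per fix category, how often an applied fix actually held up on rerun. The class name, sample minimum, and reliability threshold below are all hypothetical:

```python
from collections import defaultdict

class FixLedger:
    """Per-category record of whether applied fixes passed on rerun."""

    def __init__(self):
        self.results = defaultdict(list)  # category -> list of pass/fail bools

    def record(self, category: str, passed: bool) -> None:
        self.results[category].append(passed)

    def reliability(self, category: str) -> float:
        outcomes = self.results[category]
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    def needs_human_review(self, category: str, min_samples: int = 5,
                           threshold: float = 0.8) -> bool:
        """Route categories with thin or poor track records to humans."""
        outcomes = self.results[category]
        return len(outcomes) < min_samples or self.reliability(category) < threshold
```

With enough history, categories like timing fixes may earn auto-apply status while rarer categories stay gated behind review.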