Plain English test descriptions to Playwright workflows: how stable is this in production?

I’ve been looking at AI copilot features for converting plain English test scenarios into Playwright workflows, and I’m trying to figure out if this is actually stable enough to rely on in production or if it’s just a nice-to-have for quick prototypes.

Right now, our team writes test requirements in plain language, and someone manually converts them into Playwright test scripts. It works, but it’s slow and we lose details in translation. The idea of feeding a description like “verify user can login with valid credentials and sees dashboard” directly into a system that generates a ready-to-run workflow sounds amazing.
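For reference, the kind of output I'd expect from that description is something like the sketch below. This is entirely hypothetical (not from any specific tool); the URL, labels, and credentials are placeholders I made up:

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical generated workflow for: "verify user can login with valid
// credentials and sees dashboard". All URLs, labels, and credentials are
// illustrative placeholders.
test('user can log in with valid credentials and sees dashboard', async ({ page }) => {
  await page.goto('https://example.com/login');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('valid-password');
  await page.getByRole('button', { name: 'Log in' }).click();

  // Web-first assertions retry until the condition holds, so no manual waits.
  await expect(page).toHaveURL(/dashboard/);
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
```

Whether a tool actually emits something this clean, or something tied to brittle auto-generated selectors, is exactly what I'm trying to find out.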

But I’m hesitant. Dynamic web apps are already causing us pain with flaky tests—selectors break, timing issues pop up, elements load in weird orders. When I add AI generation on top of that, I’m worried we’re just automating the creation of fragile tests.
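To make the concern concrete, here's the difference I mean, shown as a sketch (placeholder URL and names, not real app code). The fragile style is what I fear AI generation produces by default; the resilient style is what we'd write by hand:

```typescript
import { test, expect } from '@playwright/test';

// Illustrative contrast only; the URL, selectors, and button names are
// placeholders, not output from any real tool.
test('dashboard refreshes after clicking the button', async ({ page }) => {
  await page.goto('https://example.com/dashboard');

  // Fragile: couples the test to DOM structure and generated class names,
  // and a fixed sleep races against the app's actual load time.
  //   await page.waitForTimeout(3000);
  //   await page.click('div.app-root > div:nth-child(2) .btn-x7f3a');

  // Resilient: user-facing role and accessible name survive markup
  // refactors, and web-first assertions retry until the UI settles.
  await page.getByRole('button', { name: 'Refresh' }).click();
  await expect(page.getByRole('status')).toHaveText(/Updated/);
});
```

If the generated workflows lean on the second style, flakiness might be manageable; if they lean on the first, I suspect we'd just be generating technical debt faster.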

Has anyone actually gotten this working reliably? Are the generated workflows robust enough to handle real-world sites, or do you end up spending as much time debugging the AI output as you would writing the tests manually?