I’ve been experimenting with converting plain English descriptions into Playwright workflows, and I’m curious how reliable this actually is in practice. The idea sounds great on paper—just describe what you want the test to do, and the AI generates the workflow. But I’m running into some questions.
My main concern is brittleness. When I write something like “log in with credentials and verify the dashboard loads,” the generated workflow handles the happy path fine. But the moment the UI changes slightly or there’s a timing issue, things fall apart. I’m wondering if this is just my experience or if others are hitting the same walls.
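To make that concrete, here's roughly the shape of what I get back versus what I end up rewriting it to. This is just a sketch; the URL, selectors, and credentials are placeholders, not from a real project:

```ts
import { test, expect } from '@playwright/test';

// Roughly what the generated workflow looks like: CSS selectors tied to the
// current DOM, plus a fixed sleep to "handle" the dashboard load.
test('login - generated version', async ({ page }) => {
  await page.goto('https://example.com/login'); // placeholder URL
  await page.fill('#username', 'testuser');
  await page.fill('#password', 'secret');
  await page.click('button.btn-primary');
  await page.waitForTimeout(3000); // breaks as soon as load time varies
  expect(await page.isVisible('.dashboard-header')).toBe(true);
});

// What I end up rewriting it to: role/label-based locators and web-first
// assertions that auto-wait, so small UI or timing changes don't kill the run.
test('login - hand-tweaked version', async ({ page }) => {
  await page.goto('https://example.com/login');
  await page.getByLabel('Username').fill('testuser');
  await page.getByLabel('Password').fill('secret');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
```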
Also, I’m curious about edge cases. How does the conversion handle things like waiting for dynamic or asynchronously loaded content, or running the same workflow across multiple browsers? Does the AI bake in that kind of resilience, or are you manually adding the waits and error handling afterward?
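For context, the resilience I keep adding by hand looks something like this. Again just a sketch; the API endpoint, test IDs, and expected count are hypothetical:

```ts
// playwright.config.ts — the generated output never touches this; the
// cross-browser projects and retries are things I add myself.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  retries: 1, // absorb the occasional flaky run
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

```ts
import { test, expect } from '@playwright/test';

test('dashboard widgets load asynchronously', async ({ page }) => {
  await page.goto('https://example.com/dashboard'); // placeholder URL

  // Wait on the actual data request instead of sleeping;
  // the endpoint path here is made up.
  const widgetsLoaded = page.waitForResponse(
    (r) => r.url().includes('/api/widgets') && r.ok()
  );
  await page.getByRole('button', { name: 'Refresh' }).click();
  await widgetsLoaded;

  // Web-first assertion retries until the async render settles.
  await expect(page.getByTestId('widget-card')).toHaveCount(4, { timeout: 10_000 });
});
```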
Has anyone actually gotten this working reliably for production tests, or is it more useful as a starting point that needs significant manual tweaking?