i’ve been experimenting with using plain english descriptions to generate playwright test workflows, and i’m curious if anyone else has tried this. the idea sounds great in theory—describe what you want to test, get a ready-to-run workflow—but i’m wondering about real-world stability.
my main concern is whether these generated workflows hold up when the actual website changes. i’ve noticed that when i use detailed descriptions of my UI flows, the generated tests seem pretty solid initially. but once the site gets updated or class names change, things start breaking faster than hand-written tests would.
also, i’m trying to figure out if there’s a sweet spot for how detailed your description needs to be. too vague and you get generic steps that don’t match your actual app. too detailed and you’re basically writing code anyway, which defeats the purpose.
has anyone found a reliable pattern for this? like, are there certain types of flows where ai-generated workflows from descriptions actually stay stable, and others where they just fall apart?
i handle this exact problem at work. the key is that descriptions alone aren’t enough—you need the ai to actually understand your app’s context and structure.
what changed for me was using a workflow builder that lets the ai copilot generate the initial playwright steps from your description, but then you can tweak and validate those steps in a visual interface before running them. that way you catch issues before they sink your test suite.
the stability issue you’re hitting usually happens because the generated workflow doesn’t account for dynamic content or selectors that have changed since generation. but if you use a tool that lets you test and refine those steps immediately, you can build in checks for those changes.
i stopped fighting with brittle ai-generated tests when i moved to a platform that kept the human in the loop—describe what you want, let the ai draft it, then you validate and adjust the actual selectors and assertions in the builder.
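to make the “validate before it enters the suite” idea concrete, here’s a rough sketch of a lint pass over drafted steps. the `Step` shape and the fragility heuristics are my own assumptions, not any particular tool’s format:

```typescript
// sketch: lint ai-generated steps before they land in the test suite.
// the Step shape and both heuristics are assumptions for illustration.
type Step = { action: string; locator: string };

// flag locators that depend on dom position or on generated class names,
// since both tend to break on the next deploy
function validateStep(step: Step): string[] {
  const problems: string[] = [];
  if (/nth-child|nth-of-type/.test(step.locator)) {
    problems.push("position-based locator");
  }
  if (/\.[a-zA-Z]+[-_][0-9a-f]{4,}/.test(step.locator)) {
    problems.push("looks like a generated class name");
  }
  return problems;
}

const draft: Step[] = [
  { action: "click", locator: "getByRole('button', { name: 'Log in' })" },
  { action: "click", locator: "div > ul > li:nth-child(3) > a" },
];

for (const step of draft) {
  const problems = validateStep(step);
  if (problems.length) {
    console.log(`review needed: ${step.locator} (${problems.join(", ")})`);
  }
}
```

a human reviewing the flagged steps in a builder ui is the “human in the loop” part; the lint just decides what gets surfaced for review.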
i’ve seen this break down pretty consistently when people just trust the initial generation. the real issue is that ai doesn’t know your app’s quirks—loading states, modal delays, dynamic class names that change per session.
what actually works is generating the basic flow from your description, then immediately running it a few times against your live environment to see what breaks. the tests that stick around are the ones where someone invested time in understanding why each assertion matters.
plain english descriptions are great for getting started fast, but treating them as final is where things go wrong. think of them as scaffolding, not the building itself.
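the “run it a few times and see what breaks” step can be automated with a tiny harness. `runOnce` here is a stand-in for an actual playwright run (an assumption, so the sketch stays runnable without a browser):

```typescript
// sketch: run a generated test n times and measure its failure rate
// before trusting it. runOnce stands in for a real playwright run.
function flakinessReport(runOnce: () => boolean, runs: number): number {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    if (!runOnce()) failures++;
  }
  return failures / runs;
}

// a simulated test that fails every 4th run, e.g. when a modal
// is still animating (hypothetical failure mode)
let tick = 0;
const simulated = () => tick++ % 4 !== 0;
console.log(`failure rate: ${flakinessReport(simulated, 20)}`); // 0.25
```

anything with a nonzero rate goes back for selector and wait-condition fixes instead of straight into ci.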
stability really depends on how much of the workflow is selector-based versus behavior-based. if your description focuses on user actions—“click the login button, fill in the form, submit”—those tend to be more resilient. but if the generated workflow ends up with hardcoded selectors from your current dom, you’re in trouble the moment the design changes.
the teams i know that got this working well started with descriptions that emphasize what the user does, not what the page looks like. then they validate those steps can handle minor visual tweaks.
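here’s a toy sketch of what “emphasize what the user does” can compile into: action phrases mapping to role- and label-based playwright locator strings instead of css paths. the phrase patterns are my own assumption, not how any real generator parses descriptions:

```typescript
// sketch: compile a user-action phrase into a behavior-based locator
// string rather than a css path. phrase patterns are illustrative only.
function toLocator(phrase: string): string {
  // "click the login button" -> a role-based locator
  const btn = phrase.match(/click the (.+) button/);
  if (btn) return `getByRole('button', { name: '${btn[1]}' })`;
  // "fill in the email field" -> a label-based locator
  const field = phrase.match(/fill in the (.+) field/);
  if (field) return `getByLabel('${field[1]}')`;
  return `getByText('${phrase}')`; // fallback on visible text
}

console.log(toLocator("click the login button"));
// getByRole('button', { name: 'login' })
```

role and label locators survive a restyle because they track what the user perceives, which is exactly why action-focused descriptions generate more durable steps.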
from what i’ve seen, the stability issue boils down to how the ai interprets your description versus what’s actually on the page. if you’re describing user interactions at a high level—“complete the checkout process”—the generated workflow has more flexibility. but if you include specific visual details in your description, the ai locks onto those and breaks when things update.
the workflows that actually hold up are ones where the initial generation is just the first pass. then you run them, find what breaks, and adjust. treating it as a one-time generation is where most people hit problems.
the core problem with stability in ai-generated playwright workflows from descriptions is selector fragility. when the ai infers selectors from your description without seeing the actual dom, it’s making educated guesses. those guesses work until the page changes.
better approach: descriptions should focus on intent, not implementation. describe the user’s goal, not the page structure. then let the ai draft the workflow, but validate it against your actual site immediately. any mismatches get caught before they become flaky tests in production.
validate generated workflows against live environments immediately before trusting them. ai descriptions work best for high-level flows, not selector-specific steps.
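the “catch mismatches before they become flaky tests” pass can be sketched as a diff between drafted steps and what the live page actually exposes. the snapshot shape here is an assumption; in practice you’d derive it from the real page’s accessibility tree:

```typescript
// sketch: check drafted steps against a snapshot of what the live page
// actually exposes, before the test lands in ci. shapes are assumptions.
type PageSnapshot = { roles: Set<string>; labels: Set<string> };
type DraftStep = { kind: "role" | "label"; name: string };

// return every step whose target doesn't exist on the page
function findMismatches(steps: DraftStep[], page: PageSnapshot): DraftStep[] {
  return steps.filter((s) =>
    s.kind === "role" ? !page.roles.has(s.name) : !page.labels.has(s.name)
  );
}

const snapshot: PageSnapshot = {
  roles: new Set(["button:log in", "button:submit"]),
  labels: new Set(["email", "password"]),
};

const drafted: DraftStep[] = [
  { kind: "role", name: "button:log in" },
  { kind: "label", name: "username" }, // ai guessed "username"; page says "email"
];

console.log(findMismatches(drafted, snapshot)); // only the "username" step is flagged
```

running this diff right after generation turns a future flaky failure into an immediate, fixable mismatch report.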