Phrase steps the way a user experiences them
The agent's strength is resolving human-facing intent, so write to that strength. "Click 'Start free trial'" gives it the visible label a real user would look for; "click button.cta--primary" drags you back into the brittle selector world the agent exists to avoid. Refer to things by what's on screen — the button text, the link text, the field label, the heading — and the agent re-finds them each run even after the markup shifts underneath.
Keep each step a single observable action: navigate, click, type, select, wait for something to appear. If a step smuggles in two actions ("log in and go to settings"), split it — one action per step keeps the execution log readable, so when something fails you can see precisely which step the agent was on. Ambiguity is the other trap: if a page has two "Submit" buttons, say which one ("click 'Submit' in the billing form"), the same way you'd disambiguate it for a colleague.
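As a rough illustration of the two checks above, here is a minimal sketch of a pre-run review helper. It is not part of assertly: the `lint_step` function, the verb list, and the selector pattern are all assumptions made for this example, and the heuristics will produce false positives (a bare domain such as `example.com` looks like a selector to it).

```python
import re
from typing import List

# Hypothetical heuristics for reviewing steps before a run.
# The verb list and the selector pattern are assumptions, not assertly behavior.
ACTION_VERBS = ("go to", "navigate", "log in", "sign in", "click",
                "type", "select", "wait", "verify")

# Tokens that look like CSS selectors: .class, #id, tag.class or tag#id.
SELECTOR_PATTERN = re.compile(r"(?:^|\s)(?:[a-z]+)?[.#][a-zA-Z][\w-]*")


def lint_step(step: str) -> List[str]:
    """Return a list of warnings for one natural-language step."""
    warnings = []
    # Drop quoted labels first so text like 'ada@example.com' is not flagged.
    unquoted = re.sub(r"'[^']*'|\"[^\"]*\"", "", step.lower())

    if SELECTOR_PATTERN.search(unquoted):
        warnings.append("looks like a CSS selector; use the visible label instead")

    verbs_found = [v for v in ACTION_VERBS
                   if re.search(rf"\b{re.escape(v)}\b", unquoted)]
    if len(verbs_found) > 1:
        warnings.append("more than one action; split into separate steps")

    return warnings


for example in ("Click 'Start free trial'",
                "click button.cta--primary",
                "log in and go to settings"):
    print(example, "->", lint_step(example))
```

A helper like this only catches the obvious cases. Ambiguity between two identical labels still needs a human who knows the page.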
End with one assertion you can actually check
A scenario without an assertion just proves the agent could click around without crashing — it never proves the flow did the right thing. Every test should end on something observable and verifiable: the URL contains a path ("verify the url contains /signup"), specific text is visible ("verify the text 'Welcome back' is visible"), or an element is present or gone. Those are checks the agent can evaluate unambiguously and report as a clean pass or fail.
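To see why these forms are unambiguous, it can help to picture each one reducing to a plain true-or-false check. The sketch below is an illustration only and says nothing about how the agent evaluates assertions internally; `PageSnapshot` and `check_assertion` are hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class PageSnapshot:
    """Hypothetical stand-in for what is observable at the end of a run."""
    url: str
    visible_text: str


def check_assertion(assertion: str, page: PageSnapshot) -> bool:
    """Evaluate one assertion phrased like the examples in this guide."""
    text = assertion.strip()

    url_match = re.fullmatch(r"verify the url contains (\S+)", text, re.IGNORECASE)
    if url_match:
        return url_match.group(1) in page.url

    text_match = re.fullmatch(r"verify the text '(.+)' is visible", text, re.IGNORECASE)
    if text_match:
        return text_match.group(1) in page.visible_text

    # Anything else, such as "verify the page looks right", has no clear rule.
    raise ValueError(f"not a checkable assertion: {assertion!r}")


# Made-up page state for the example.
page = PageSnapshot(url="https://example.com/signup", visible_text="Welcome back, Ada")
print(check_assertion("verify the url contains /signup", page))
print(check_assertion("verify the text 'Welcome back' is visible", page))
```

The point of the sketch is the last branch: a vague assertion has nothing to reduce to, so there is no honest way to call it a pass or a fail.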
Resist stuffing several assertions into one test. "Verify the url contains /dashboard AND the welcome banner shows AND the avatar loaded" is really three tests wearing a trench coat — when it fails, you can't tell which part broke. One goal, one assertion: if you care about three things, write three scenarios. They're cheap to author and each failure points straight at its cause, which is the entire reason you're testing in the first place.
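The split described above can be sketched mechanically. This is a hypothetical helper, not an assertly feature: the function names, the dictionary layout, and the shared login steps are assumptions chosen for illustration.

```python
import re
from typing import Dict, List


def split_compound_assertion(assertion: str) -> List[str]:
    """Turn 'verify A AND B AND C' into three single assertions."""
    body = re.sub(r"^verify\s+", "", assertion.strip(), flags=re.IGNORECASE)
    parts = re.split(r"\s+AND\s+", body)
    return [f"verify {part.strip()}" for part in parts]


def one_scenario_per_assertion(steps: List[str], compound: str) -> List[Dict[str, object]]:
    """Reuse the same steps for each assertion so every failure has one cause."""
    return [
        {"steps": list(steps), "assertion": single}
        for single in split_compound_assertion(compound)
    ]


# Hypothetical shared steps; only the final assertion differs per scenario.
shared_steps = ["go to /login", "click 'Log in'"]
compound = ("Verify the url contains /dashboard AND the welcome banner shows "
            "AND the avatar loaded")

for scenario in one_scenario_per_assertion(shared_steps, compound):
    print(scenario["assertion"])
```

In practice you would still reword each piece into something checkable, since "the avatar loaded" needs a visible signal to verify against.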
Run it, read the evidence, tighten the wording
First-run failures are usually a wording problem, not an app problem, and assertly gives you the evidence to tell them apart. The execution log shows what the agent did at each step, and on a failure the agent captures a screenshot of the page at that moment. If the screenshot shows the agent stuck on the wrong element, your label was ambiguous: name it more precisely and re-run. If it shows the page genuinely in the wrong state, you found a real bug, which is exactly what the test is for.
Once a scenario runs clean, save it, and it joins a short run history you can re-fire whenever a change lands. For your critical flows, give the generated steps a real read the first time through — the agent is flexible by design, and you want to confirm it's checking what you actually meant before you start trusting a green tick. That review-once, re-run-forever rhythm is where natural-language testing earns its keep.
How it works
- 01
Pick one user goal
Decide the single thing this test proves — "a visitor can reach the signup page from pricing". One goal per scenario keeps the test readable and makes any failure point at one cause. If you're tempted to test three things, that's three scenarios.
- 02
Name a concrete starting point
Begin from a real URL or a clearly named page: "go to the homepage" or "go to /pricing". The agent needs an unambiguous place to start so the run is reproducible rather than dependent on wherever it happened to be.
- 03
Write each step as one user-visible action
"Click 'Learn more'", "type 'ada@example.com' into the email field", "select 'Annual' from the plan dropdown". Refer to things by their on-screen label, not a CSS selector, and keep it to one action per step.
- 04
Disambiguate where the page is repetitive
If two elements share a label, say which one — "click 'Submit' in the billing form". You're giving the agent the same context you'd give a teammate who'd never seen the page.
- 05
End on a single checkable assertion
Finish with something observable: "verify the url contains /signup", "verify the text 'Thanks for subscribing' is visible". This is the step that makes the test mean something — without it you've only proven the agent didn't crash.
- 06
Run it and read the execution log
Run the scenario, then read the step-by-step log and, on failure, the screenshot the agent captured. The evidence tells you whether the wording tripped up the agent or the app genuinely misbehaved.
- 07
Tighten the wording, then save and re-run
If a label was ambiguous, make it more specific and run again. Once it passes for the right reason, save it — it joins your run history, ready to fire on the next change. Review the generated steps once for any critical flow before you trust the green.
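Put together, the steps above produce a scenario like the one in this sketch, written for the goal "a visitor can reach the signup page from pricing". The list-of-strings format and the `check_shape` helper are assumptions for illustration, not an assertly format or API, and the example assumes the pricing page has a 'Start free trial' button.

```python
from typing import List

# One goal, a concrete starting point, one action per step, one final assertion.
scenario = [
    "go to /pricing",
    "click 'Start free trial'",
    "verify the url contains /signup",
]


def check_shape(steps: List[str]) -> List[str]:
    """Return structural problems: missing start, or not exactly one final assertion."""
    if not steps:
        return ["scenario is empty"]

    problems = []
    lowered = [s.strip().lower() for s in steps]

    if not lowered[0].startswith("go to"):
        problems.append("first step should name a starting point, e.g. 'go to /pricing'")

    verify_count = sum(1 for s in lowered if s.startswith("verify"))
    if verify_count != 1:
        problems.append(f"expected exactly one assertion, found {verify_count}")
    elif not lowered[-1].startswith("verify"):
        problems.append("the assertion should be the last step")

    return problems


print(check_shape(scenario))
```

A shape check like this says nothing about whether the labels match the live page. That is what the run, the execution log, and the failure screenshot are for.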
Frequently asked
How specific do I need to be about elements?
Specific about what a user would see, not about the markup. Use the visible label — button text, link text, field label — and the agent resolves it against the live page. You only need to add detail when the page is genuinely ambiguous, e.g. two buttons reading 'Submit', where you'd name which form it's in.
What makes a good assertion versus a weak one?
A good assertion is observable and unambiguous: a URL path, visible text, or an element being present or absent. A weak one is vague ("verify the page looks right") or bundles several checks into one, which hides which part failed. End each scenario on exactly one assertion the agent can evaluate cleanly.
What happens when a scenario fails on the first run?
Read the execution log and the failure screenshot. Most first-run failures are a wording snag — an ambiguous label the agent resolved to the wrong element — which you fix by naming the target more precisely and re-running. If the screenshot shows the page genuinely in the wrong state, you've caught a real bug.
Can I write one big scenario that covers a whole journey?
You can, but shorter scenarios with one goal and one assertion each are easier to trust and far easier to debug. A long chain that fails on step nine tells you less than three focused tests would. If a journey matters end-to-end, it's fine to write the long version too — just keep the focused ones as the diagnostic layer.
Last updated June 20, 2026