A Graft build looked finished until I asked it to perform the action it existed to support.

At first, the shell is there. The sidebar looks plausible. The composer has a reassuring rectangle where text ought to go. Documentation exists, which always gives a project a faint institutional smell. There may even be settings, icons, routes, and a few animations. Everything has the air of a railway timetable, right up until you type a message, press send, and discover that the train was never actually scheduled to depart, which was the Graft moment in miniature.

Work began in the reasonable way: read the design docs, compare them to the code, identify what was missing. The app had a substantial set of docs describing prompts, automations, input behavior, message streams, navigation, skills, tools, settings, and the rest of the Codex-like surface it wanted to replicate.

On paper, agents handle this sort of work well: give them a pile of docs, give them a codebase, and ask them to map aspiration to implementation.

The early output helped, but product truth arrived from the simplest possible interaction: entering text in the input and sending it did nothing. The chat path was not real enough.

One click changed the assignment. What had looked like a coverage audit became an end-to-end obligation.

The test became explicit: use the browser. Send Hello!. See whether a response appears. Then send What's hello world in python. See whether it completes. Then test the UI elements, inspect errors, fix the 500s, wire /api/turns/start to the agent runtime, and stream events back to the client.

I keep returning to that turn because it is the difference between code generation and product development.

An agent can make code that satisfies a document. It can create routes, components, adapters, and types. It can make the codebase look more like the spec. But a user does not experience a spec. A user experiences the click.

Clicking is merciless because it has no respect for architecture diagrams, cleaner component names, plausible routes, or any other evidence that lives one layer away from the user. It asks one question: did the system respond?

AI-built software has to ask that question much earlier in the process. Do not wait until the final hour to discover whether the bridge has planks.

I have become increasingly suspicious of any session that spends too long in implementation without touching the running surface. Not because code inspection is bad. Code inspection is essential. But inspection and execution catch different species of bug.

Inspection finds missing routes, broken abstractions, duplicated logic, stale docs, unimplemented stubs. Execution finds whether the supposedly wired system survives contact with a browser, a real request, a real response, a real error, a real state transition.

Graft required both. It started with docs and code. It moved through implementation. But the quality bar only became honest when the browser became a participant.

Browser testing exposed a deeper problem: “UI complete” is often used far too generously. A composer that accepts typing but does not start a turn is not a composer. A chat stream that renders placeholder history but cannot show live agent output is not a chat stream. A route that returns 500 under the first real flow is not a route, at least not in the way users mean it.

Pedantic, until you are using agents. Then it becomes survival.

Agents are very good at producing intermediate artifacts that look like progress. Scaffolding helps when everyone still knows it is scaffolding. The danger begins when it is mistaken for the building, especially when the scaffolding has tasteful typography.

Trusting agents less is not the fix. Defining “done” in a way that leaves less room for theater is.

In a chat product, “done” means a user can type a message, send it, see the agent begin, watch useful intermediate state, and receive a final answer. It means errors are visible without being dumped as raw internals. It means refresh does not destroy the conversation. It means the app can perform the happy path twice, not once by accident.

That sounds like a lot because it is a lot. It is also the product.

An agent can help build it, test it, and even write the browser tests that hold it accountable. It does not get to define completion by the presence of files. When it says “done”, ask it to click.

A painterly editorial collage for agents should click the UI, showing the concrete objects and system relationships around browser path, send action, and response evidence.
Browser path, send action, and response evidence.

For anything with a user interface, the first meaningful test belongs in the interface early, not after the refactor, not after the last route is wired, and not after the agent has spent a long session making the internal architecture more beautiful. Open the app, type the message, press the button, watch the logs, refresh, and try again.

A lot of false progress evaporates under that little ritual.

Chris Chabot · February 2026