The first version of the Graft bug was ugly in the most visible way.
Raw tool output, JSON, command payloads, and internal event objects were showing up in the web UI message stream. Not in a debug pane. Not behind a setting. Right there in the conversation, where a person expected to read what the agent was doing. It was rather like ordering dinner and being handed the kitchen inventory.
Tempting diagnosis: a rendering bug. Something forgot to filter a tool call, so one adds a toggle, hides the JSON, and moves on. Roughly what we tried.
After the first attempted fix, the app showed no messages at all, which was certainly tidier, in the same sense that a boarded-up library is easier to dust.
Small comedy, large warning. In agent UI work, the line between “too much internal machinery” and “nothing visible enough to trust” is thin, and the one failure worse than showing users raw plumbing is hiding the entire house.
Graft turned into a study in how little the phrase “message stream” explains. It sounds simple, like a brook through a meadow. In practice it is closer to customs processing at a busy port.
An agent conversation is not just a chat transcript. It braids different signals into one timeline: user text, assistant prose, tool calls, tool results, command output, file edits, plans, status changes, thinking indicators, errors, retries, summaries, final answers, and sometimes debug detail that earns its keep only when everything has gone sideways.
Dump all of it and the product feels broken; hide all of it and the product feels dead. Design begins at that boundary: which parts deserve public shape, and which parts stay in the debug record unless summoned.
The request quickly moved from “hide raw JSON by default” to “study exactly how CodexApp does this.” Which messages appear as prose? Which tool calls become expandable rows? Which completed work gets summarized? What stays visible after completion? What collapses? What remains available for inspection without taking over the transcript?
Not cosmetic questions. They define the contract between the agent and the person trusting it.
If a tool call writes a file, every byte of the payload can stay out of the transcript, but the written file becomes visible. If a command fails, the environment dump can stay folded away, but the failure remains legible. If the agent thinks for a while, liveness appears before the user assumes the app froze. If the final answer arrives after a tool sequence, the UI distinguishes “tools happened” from “the answer happened.”
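A minimal sketch of that contract, under stated assumptions: the `ToolOutcome` shape, the function names, and the 1500 ms quiet threshold are all hypothetical, chosen for illustration rather than taken from Graft or CodexApp. The point is only that the summary line carries the fact (a file was written, a command failed) while the payload stays behind.

```typescript
// Hypothetical outcome shapes; a real agent would define its own.
type ToolOutcome =
  | { kind: "file_write"; path: string; contents: string }
  | {
      kind: "command";
      command: string;
      exitCode: number;
      stderr: string;
      env: Record<string, string>;
    };

// One legible line per outcome. File contents and the environment
// are deliberately never copied into the returned string.
function summarizeOutcome(outcome: ToolOutcome): string {
  switch (outcome.kind) {
    case "file_write":
      return `Wrote ${outcome.path}`;
    case "command": {
      if (outcome.exitCode === 0) {
        return `Ran ${outcome.command}`;
      }
      // Keep the failure readable: first line of stderr, not the whole dump.
      const [firstLine = ""] = outcome.stderr.split("\n");
      const reason = firstLine ? `: ${firstLine}` : "";
      return `${outcome.command} failed (exit ${outcome.exitCode})${reason}`;
    }
  }
}

// Show a liveness indicator once the stream has been quiet for a while.
// The default threshold is an assumed value, not a measured one.
function shouldShowLiveness(
  lastEventAtMs: number,
  nowMs: number,
  quietThresholdMs: number = 1500,
): boolean {
  return nowMs - lastEventAtMs >= quietThresholdMs;
}
```

The full outcome object can still go to a debug record; the transcript only ever sees the string.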
Streaming is the interface. Transport may be an implementation detail, but the stream is what the user experiences as agency. It is how they decide whether the system is working, whether it is stuck, whether it is making progress, whether it is safe to wait, whether to interrupt, whether to trust the final result.
A conventional app can often hide its internals behind a spinner. An agent cannot. Work that long, varied, and consequential requires a visible rhythm.
Visible does not mean raw. The raw event stream is optimized for machines, while the message stream is for people, and confusing those two creates the worst of both worlds: unreadable UI for humans and lossy semantics for machines. The stronger pattern is translation: tool calls become tool rows, results become summaries that can expand on request, debug payloads become opt-in, plans become structured progress, final prose remains final prose, and the completed conversation cleans itself up without erasing the evidence of what happened.
A product philosophy hides in that last sentence. An AI agent cannot behave like a magician who refuses to show the trick, or like a compiler dumping its entire AST into the user’s lap. The figure I trust is a careful operator: show enough process to be accountable, keep the raw record available, and do not make the person read every scratch mark unless they ask.
Code fixed the Graft bug, but the durable change was architectural: event streams need view models. A protocol event is not automatically a UI element. A tool payload is not automatically a transcript line. A final answer is not interchangeable with a status update.
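One way to picture the separation is a reducer that turns protocol events into view models. This is a sketch, not Graft's actual code: the `ProtocolEvent` and `ViewModel` types and the `reduceEvents` function are hypothetical names, and a real stream would carry many more event kinds.

```typescript
// Hypothetical protocol events, as they might arrive on the wire.
type ProtocolEvent =
  | { type: "assistant_text"; text: string; final: boolean }
  | { type: "tool_call"; id: string; tool: string; args: unknown }
  | { type: "tool_result"; id: string; ok: boolean; output: string }
  | { type: "status"; label: string };

// What the transcript is allowed to render.
type ViewModel =
  | { view: "prose"; text: string }
  | { view: "final_answer"; text: string }
  | {
      view: "tool_row";
      id: string;
      title: string;
      state: "running" | "done" | "failed";
      detail?: string;
    }
  | { view: "status"; label: string };

function reduceEvents(events: ProtocolEvent[]): ViewModel[] {
  const items: ViewModel[] = [];
  for (const ev of events) {
    switch (ev.type) {
      case "assistant_text":
        // A final answer gets its own view, distinct from interim prose.
        items.push(
          ev.final
            ? { view: "final_answer", text: ev.text }
            : { view: "prose", text: ev.text },
        );
        break;
      case "tool_call":
        // The row shows the tool name only; args are not copied into the view.
        items.push({ view: "tool_row", id: ev.id, title: ev.tool, state: "running" });
        break;
      case "tool_result": {
        // A result updates its existing row instead of adding a transcript line.
        const row = items.find((i) => i.view === "tool_row" && i.id === ev.id);
        if (row && row.view === "tool_row") {
          row.state = ev.ok ? "done" : "failed";
          row.detail = ev.output; // intended for the expanded row only
        }
        break;
      }
      case "status":
        items.push({ view: "status", label: ev.label });
        break;
    }
  }
  return items;
}
```

A call followed by its result yields one row, not two lines, and the arguments never enter the view at all.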
Once you see this, a lot of agent products start to look underdesigned. Not because they lack features, but because they have not decided what work looks like while it is underway.
Better models will not fix a transcript that cannot explain what happened. The product work is making a long, messy, partially observable process feel legible without turning the interface into a server log.
The repair left me with three layers every agent UI should keep separate:
- Human transcript: what the user asked, what the agent answered, and the durable result.
- Work surface: tool calls, files, commands, plans, errors, progress, and decisions, presented as structured interface.
- Debug layer: raw events and payloads, available on request, hidden by default.
Most broken agent UIs are broken because these layers collapse into one another.
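As a rough illustration of keeping the three layers apart, each item can carry an explicit layer tag, with the debug layer filtered out unless someone asks for it. The `Layer`, `StreamItem`, `splitLayers`, and `visibleItems` names below are hypothetical, a sketch under the assumption that every item is tagged when it is created.

```typescript
// The three layers named in the list above.
type Layer = "transcript" | "work" | "debug";

// Hypothetical item shape: every item declares which layer it belongs to.
interface StreamItem {
  layer: Layer;
  label: string;
}

// Group items by layer so each surface renders only its own.
function splitLayers(items: StreamItem[]): Record<Layer, StreamItem[]> {
  const layers: Record<Layer, StreamItem[]> = { transcript: [], work: [], debug: [] };
  for (const item of items) {
    layers[item.layer].push(item);
  }
  return layers;
}

// Debug items are hidden by default and shown only on request.
function visibleItems(items: StreamItem[], showDebug: boolean = false): StreamItem[] {
  return items.filter((item) => item.layer !== "debug" || showDebug);
}
```

Because the tag is explicit, an item cannot drift from the debug layer into the transcript by accident; showing it is always a deliberate choice.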