Field entry, 4 May.

The thread looked like a crash, with the shape I have learned to dread: a first sentence, a few tool calls, then nothing that felt like a real completed answer. From the outside, the agent appeared to have fallen off a cliff mid-thought. This is the kind of failure that destroys trust quickly because it is so hard for the user to classify. One is left staring at the transcript like a railway platform after the last train has failed to appear.

The possible explanations all had the usual bad smell. Maybe the model failed, maybe the tool hung, maybe the backend crashed, maybe the UI lost connection, or maybe the answer existed somewhere, trapped behind a refresh.

The answer, in this case, was the most annoying possible version: the agent had completed. The final answer existed. The presentation layer suppressed it. The letter had been written, sealed, and then politely placed behind the mantelpiece.

That is almost worse than a crash, because a crash at least has the decency to be broken. A completed turn with a missing final answer is a lie by omission: the system did the work and then failed to show the evidence that would let the user know the work was done.

The root cause was in the event sink. After tool results, plain prose was being filtered out: a small visible prefix earlier in the turn had convinced the sink that it had already emitted an answer, so the real final summary was dropped and replaced with a generic fallback. The interface showed enough to suggest life, but not enough to communicate completion.
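
Reconstructed in miniature, the bug was one boolean doing two jobs. This is not the Cadenza code, only its shape; every identifier here is invented:

```typescript
// Hypothetical reconstruction of the failure shape, not the real sink.
type AgentEvent =
  | { kind: "prose"; text: string }
  | { kind: "tool_result"; payload: unknown };

const GENERIC_FALLBACK = "Run complete."; // stand-in for the real fallback

class BuggySink {
  private answerEmitted = false;
  private sawTool = false;
  private finalProse = "";

  push(ev: AgentEvent): void {
    if (ev.kind === "tool_result") {
      this.sawTool = true; // rendered in the evidence pane, not as prose
      return;
    }
    if (this.answerEmitted) {
      // BUG: the short first sentence, streamed before the tool calls,
      // already flipped the flag, so the real post-tool summary is
      // silently dropped right here.
      return;
    }
    this.answerEmitted = true;
    if (this.sawTool) this.finalProse = ev.text;
  }

  endOfTurn(): string {
    // No post-tool prose survived the filter, so the user sees the
    // generic fallback instead of the summary the agent wrote.
    return this.finalProse || GENERIC_FALLBACK;
  }
}
```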

This is the dark twin of the raw-JSON problem.

In February, the UI leaked too much of the machine stream. In May, the UI filtered too aggressively and ate the human answer. Both are the same class of issue: the boundary between raw events and user-facing narrative was not precise enough, and agent products live or die at that boundary.

A conventional request/response app can often treat the final response as one object. Agent systems do not have that luxury. The answer emerges after a sequence: thinking, tool calls, results, more thinking, maybe another tool, maybe a file write, maybe a browser visit, then final prose. The UI has to preserve the order and meaning of those pieces without confusing one for another.

If it mistakes the first post-tool phrase for the whole answer, the user gets silence; if it treats tool output as assistant prose, the user gets JSON; if it hides internal steps too early, the user gets a jump cut; and if it never hides them, the user gets a transcript no human wants to read.
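
To make the boundary concrete: one plausible event vocabulary and set of surfaces, typed so that each of the four mistakes becomes a visible mis-routing in a single switch. The names are mine, not any real schema:

```typescript
// Assumed event kinds and UI surfaces; invented for illustration.
type StreamEvent =
  | { kind: "thinking"; text: string }       // internal reasoning
  | { kind: "tool_call"; name: string }      // "running grep…"
  | { kind: "tool_result"; payload: unknown }
  | { kind: "prose_chunk"; text: string }    // streamed assistant text
  | { kind: "final_answer"; text: string };  // the one sacred event

type Surface = "hidden" | "activity" | "evidence" | "answer";

function route(ev: StreamEvent): Surface {
  switch (ev.kind) {
    case "thinking":
      return "hidden";    // hide it too early elsewhere → jump cut
    case "tool_call":
      return "activity";  // show progress, not payloads
    case "tool_result":
      return "evidence";  // route to "answer" instead → raw JSON
    case "prose_chunk":
      return "activity";  // promote to "answer" too eagerly → silence
    case "final_answer":
      return "answer";    // drop this → the May bug
  }
}
```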

Reliability here is not just uptime. It is narrative continuity: can the user tell what happened, see that the agent is working, inspect the evidence, read the final answer, and refresh without losing the thread of the story?

The Cadenza fix was technical: emit final plain-prose summaries after tool work, buffer streamed chunks so short prefixes are not lost, keep scratchpad/debug suppression intact, deploy, restart, run regressions, run live smoke checks.
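
The buffering half of that fix, sketched against the same invented AgentEvent shape as the BuggySink above: chunks accumulate instead of being judged one at a time, and a tool boundary resets the answer buffer, so a short prefix can no longer pass for a delivered answer.

```typescript
// Sketch of the fix, reusing the invented AgentEvent type from above.
class BufferingSink {
  private buffer = "";
  private sawTool = false;
  private finalAnswer = "";

  push(ev: AgentEvent): void {
    if (ev.kind === "tool_result") {
      // Tool boundary: whatever prose preceded it was preamble, not
      // the answer. The activity pane already showed it; clear the
      // answer buffer rather than letting a prefix count as done.
      this.sawTool = true;
      this.buffer = "";
      return;
    }
    // Accumulate chunks instead of classifying each in isolation.
    this.buffer += ev.text;
  }

  endOfTurn(): string {
    // Post-tool prose is the final summary; emit it verbatim.
    if (this.sawTool && this.buffer.length > 0) {
      this.finalAnswer = this.buffer;
    }
    return this.finalAnswer || this.buffer;
  }
}
```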

But the lesson is product-shaped: never let the system confuse “some text appeared” with “the answer was delivered.”
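
Reduced to code, the rule is two flags and an end-of-turn assertion, never one flag. Both names here are invented:

```typescript
// Invented names; the point is keeping the two facts separate.
interface TurnState {
  sawVisibleText: boolean;  // "some text appeared"
  answerDelivered: boolean; // "the final answer reached the user"
}

function assertTurnComplete(turn: TurnState): void {
  // Visible text proves nothing; only an explicit final-answer
  // event is allowed to set answerDelivered.
  if (!turn.answerDelivered) {
    throw new Error(
      "turn ended without a delivered answer" +
        (turn.sawVisibleText ? " (despite visible text)" : ""),
    );
  }
}
```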

This is especially important because users will anthropomorphize the failure. If the answer disappears, they will say the agent stopped, got confused, crashed, gave up, or forgot. That may be false. The agent may have done exactly what it should have done. The UI may have destroyed the visible proof, and that distinction matters for debugging and for trust.

If the model failed, improve the model path; if the tool failed, improve tool execution; if the event sink failed, fix the sink. Do not punish the wrong layer.

This is one of the recurring patterns in the whole five-month archive. Bugs keep presenting themselves at the human surface while their causes live in a different layer: stale binaries pretending to be new releases, auth loops hiding project settings, existing files appearing missing because links were not promoted, completed answers appearing absent because the sink filtered prose.

The field note is not “AI is unreliable”, which is too blunt an instrument and too boring besides. The more precise and useful version is that AI systems are full of evidence, and product reliability is the discipline of not losing it on the way to the user.

[Figure: hand-drawn notebook detail plate.]
FIG. 02 — EVENT SINK, SUPPRESSED PROSE, AND COMPLETION EVIDENCE.

Field note

I want every agent UI to have regression tests for the boring endings: a tool call followed by final prose, a tool call followed by a short prefix and then final prose, multiple tool calls followed by a summary, debug text suppressed while the final answer is preserved, and refresh after completion with the same answer still visible.
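
Sketched as table-driven checks against the BufferingSink above, covering the first three endings; the suppression and refresh cases need more machinery, and none of this is a real suite:

```typescript
// Regression sketch for the "boring endings", reusing the invented
// AgentEvent and BufferingSink from earlier; not the real tests.
import { strict as assert } from "node:assert";

const tool = { kind: "tool_result", payload: {} } as const;
const prose = (text: string) => ({ kind: "prose", text }) as const;

const cases: { name: string; events: AgentEvent[]; expect: string }[] = [
  { name: "tool call then final prose",
    events: [tool, prose("All tests pass.")],
    expect: "All tests pass." },
  { name: "tool call, short prefix, then final prose",
    events: [tool, prose("All "), prose("tests pass.")],
    expect: "All tests pass." },
  { name: "multiple tool calls then summary",
    events: [tool, tool, prose("Both lookups agree.")],
    expect: "Both lookups agree." },
];

for (const c of cases) {
  const sink = new BufferingSink();
  for (const ev of c.events) sink.push(ev);
  assert.equal(sink.endOfTurn(), c.expect, c.name);
}
```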

It is tempting to test the spectacular parts: diagrams, generated UI, long tool chains, browser automation. But the final answer is sacred. If the system loses that, everything before it becomes suspicious.