A Cadenza thread looked like a crash, with the shape I have learned to dread: a first sentence, a few tool calls, then nothing that felt like a real completed answer. To the user, the agent appeared to have fallen off a cliff mid-thought. Failures like that destroy trust quickly because they are so hard to classify. One is left staring at the transcript like a railway platform after the last train has failed to appear.
Possible explanations all had the usual bad smell. Maybe the model failed, maybe the tool hung, maybe the backend crashed, maybe the UI lost connection, or maybe the answer existed somewhere, trapped behind a refresh.
In this case, the answer was the most annoying possible version: the agent had completed. The final answer existed. The presentation layer suppressed it. The letter had been written, sealed, and then politely placed behind the mantelpiece.
Almost worse than a crash, because a crash at least has the decency to be broken. A completed turn with a missing final answer is a lie by omission: the system did the work and then failed to show the evidence that would let the user know the work was done.
Root cause lived in the event sink. After tool results, plain prose was being filtered. A small visible prefix made the sink believe it had already emitted an answer, so the real final summary was dropped and replaced with a generic fallback. The interface showed enough to suggest life, but not enough to communicate completion.
Dark twin of the raw-JSON problem.
In February, the UI leaked too much of the machine stream. In May, the UI filtered too aggressively and ate the human answer. Both are the same class of issue: the boundary between raw events and user-facing narrative was not precise enough, and agent products live or die at that boundary.
A conventional request/response app can often treat the final response as one object. Agent systems do not have that luxury. An answer emerges after a sequence: thinking, tool calls, results, more thinking, maybe another tool, maybe a file write, maybe a browser visit, then final prose. The UI has to preserve the order and meaning of those pieces without confusing one for another.
If it mistakes the first post-tool phrase for the whole answer, the user gets silence; if it treats tool output as assistant prose, the user gets JSON; if it hides internal steps too early, the user gets a jump cut; and if it never hides them, the user gets a transcript no human wants to read.
Reliability here is narrative continuity: can the user tell what happened, see that the agent is working, inspect the evidence, read the final answer, and refresh without losing the thread of the story?
Cadenza needed a technical fix: emit final plain-prose summaries after tool work, buffer streamed chunks so short prefixes are not lost, keep scratchpad/debug suppression intact, deploy, restart, run regressions, run live smoke checks.
Product rule: never let the system confuse “some text appeared” with “the answer was delivered.”
Users will anthropomorphize this failure. If the answer disappears, they will say the agent stopped, got confused, crashed, gave up, or forgot. That may be false. The agent may have done exactly what it should have done. The UI may have destroyed the visible proof. Debugging starts by refusing to punish the wrong layer.
If the model failed, improve the model path; if the tool failed, improve tool execution; if the event sink failed, fix the sink. Do not punish the wrong layer.
Across the five-month archive, bugs keep presenting themselves at the human surface while their causes live in a different layer: stale binaries pretending to be new releases, auth loops hiding project settings, existing files appearing missing because links were not promoted, completed answers appearing absent because the sink filtered prose.
“AI is unreliable” is too blunt an instrument and too boring besides. More precise: AI systems are full of evidence, and product reliability is the discipline of not losing it on the way to the user.
Every agent UI needs regression tests for the boring endings: a tool call followed by final prose, a tool call followed by a short prefix and then final prose, multiple tool calls followed by a summary, debug text suppressed while the final answer is preserved, and refresh after completion with the same answer still visible.
Spectacular parts are tempting to test: diagrams, generated UI, long tool chains, browser automation. But the final answer is sacred. If the system loses that, everything before it becomes suspicious.