Field entry, 5 April.

At some point the test plan stopped sounding like a test plan and started sounding like a dispatch from a difficult crossing.

Start the whole stack. Clean the slate. Delete projects and sessions. Create multiple projects. Create multiple threads. Send prompts through the browser. Watch whether responses arrive quickly. Confirm thinking and reasoning show up. Inspect markdown, generative UI, Mermaid diagrams, SVGs, forty-one components, tool output, end-of-turn animations, show/hide behavior, logs, resource usage. Visually inspect every thread because rendering errors like to hide in corners.

Then run it again, and keep running it until it passes twice in a row, because a single successful passage may be luck while two begins to look like a route. That was Fabric in April.
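The shape of that rule is simple enough to sketch. Here it is in TypeScript, with every name invented for illustration (runBatteryOnce, the placeholder steps); the journal does not show Fabric's actual harness:

```ts
// Hypothetical sketch of the battery loop. The step functions are stubs;
// the real ones start the stack, wipe projects and sessions, create
// projects and threads, drive a browser, and inspect the output.
type StepResult = { step: string; ok: boolean; detail?: string };

const noop = async (): Promise<void> => {};
const steps: Array<[string, () => Promise<void>]> = [
  ["start stack", noop],
  ["clean slate", noop],
  ["create projects and threads", noop],
  ["send prompts via browser", noop],
  ["inspect every thread", noop],
];

async function runBatteryOnce(): Promise<StepResult[]> {
  const results: StepResult[] = [];
  for (const [step, run] of steps) {
    try {
      await run();
      results.push({ step, ok: true });
    } catch (err) {
      results.push({ step, ok: false, detail: String(err) });
      break; // one failed step invalidates the rest of the run
    }
  }
  return results;
}

// "Keep running until it passes twice in a row": a streak counter that
// resets on any failure, so luck has to repeat itself before it counts.
async function runUntilTwoConsecutivePasses(maxRuns = 10): Promise<boolean> {
  let streak = 0;
  for (let run = 1; run <= maxRuns; run++) {
    const passed = (await runBatteryOnce()).every((r) => r.ok);
    streak = passed ? streak + 1 : 0;
    console.log(`run ${run}: ${passed ? "pass" : "fail"}, streak ${streak}`);
    if (streak >= 2) return true;
  }
  return false;
}
```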

This is the kind of workflow that makes sense only after enough smaller validations have lied to you.

A unit test says the renderer handles a shape. A protocol test says the event is valid. A service health check says the process is alive. A typecheck says the code is coherent. All useful. None sufficient.

The product was the sum of too many surfaces: Fabric protocol, Diminuendo clients, agent runtimes, file rendering, tool events, markdown, diagrams, live streams, histories, end-of-turn states, and whatever happens when a real browser sits there waiting for proof.

In that world, “green” has to mean more than “the test suite passed”; it has to mean a person can watch the system do the work and understand what happened.

The April battery session is absurd in the raw numbers. The log includes thousands of shell commands, hundreds of browser evaluations, hundreds of patches. That is not a boast. It is a warning label. Systems this interconnected do not fail politely; they leave little muddy footprints across several rooms at once.

One of the root causes from the session is almost comically specific: the files existed, but when a thread was rendered from transcript-backed state, inline mentions like solution.ts were not being promoted into clickable file links. The backend had not lost the file. The UI simply failed to treat the transcript as evidence that the file existed, and that distinction matters.

From the user’s perspective, “I cannot click the file” and “the file is gone” rhyme. Both break trust. But the fixes live in different parts of the system. One is durability. One is presentation. If you guess wrong, you go digging in the storage layer while the renderer quietly keeps lying.
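The fix, at least in shape, lives in the presentation layer. A hedged sketch, with invented event names (file_created, tool_output) and a deliberately naive path regex standing in for whatever the Fabric protocol actually emits:

```ts
// Hypothetical sketch: promote inline file mentions (e.g. "solution.ts")
// into clickable links. The bug shape was consulting only live backend
// state, never the transcript, so transcript-backed threads rendered
// mentions as plain text even though the files existed.
type TranscriptEvent = { type: string; path?: string };

function knownFiles(
  liveFiles: Set<string>,
  transcript: TranscriptEvent[],
): Set<string> {
  const known = new Set(liveFiles);
  for (const ev of transcript) {
    // The transcript itself is evidence the file existed: any event that
    // names a path should make that path linkable.
    if (ev.path && (ev.type === "file_created" || ev.type === "tool_output")) {
      known.add(ev.path);
    }
  }
  return known;
}

function promoteMentions(text: string, files: Set<string>): string {
  // Wrap any mention of a known path in a link; the regex is kept naive
  // on purpose, since the point here is the lookup, not the parsing.
  return text.replace(/\b[\w./-]+\.\w+\b/g, (m) =>
    files.has(m) ? `[${m}](file://${m})` : m,
  );
}
```

The point of the sketch is the union: the renderer asks both the live backend and the transcript before deciding a mention is dead text.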

This is why visual inspection was part of the battery. Not because pixels are precious, but because the UI is often where distributed truth becomes visible. If a file exists but is not inspectable, the product has still failed. If a tool ran but the output is unreadable, the product has still failed. If a diagram was generated but collapses the layout, the product has still failed.
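Scripted rather than eyeballed, the same inspection might look something like this Playwright sketch; the selectors ([data-file-link], [data-tool-output]) and the layout threshold are assumptions, not Fabric's real markup:

```ts
import { chromium } from "playwright";

// Hedged sketch of the "existence is not inspectability" checks.
async function inspectThread(url: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const failures: string[] = [];
  try {
    await page.goto(url, { waitUntil: "networkidle" });

    // A file the agent created must be clickable, not just mentioned.
    if ((await page.locator("[data-file-link]").count()) === 0) {
      failures.push("no clickable file links rendered");
    }

    // A generated diagram must not collapse the layout around it.
    const overflow = await page.evaluate(
      () => document.body.scrollWidth > window.innerWidth * 1.5,
    );
    if (overflow) failures.push("diagram or component broke horizontal layout");

    // Tool output must be readable, not an empty block.
    for (const block of await page.locator("[data-tool-output]").all()) {
      const text = (await block.innerText()).trim();
      if (text.length === 0) failures.push("empty tool output block");
    }
  } finally {
    await browser.close();
  }
  return failures;
}
```

Even a crude check like this catches the failure class in question: the artifact exists somewhere, but a person pointing a browser at it cannot use it.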

The battery test became the product because the product was not a single service. It was the lived continuity between services, the difference between owning a collection of fine instruments and hearing them play in tune.

I used to think of broad manual-plus-browser test passes as something you do near release. I think agents change that. When implementation velocity increases, the cost of false confidence increases with it. You can accumulate plausible changes faster than you can understand their second-order effects.

The counterweight is not slower coding. It is harsher feedback: clean the slate, use a real browser, create multiple threads, watch the logs, inspect the artifacts, and rerun until the same success happens twice.

That “twice” matters more than it sounds. One clean run can be luck. Two consecutive clean runs are not proof of correctness, but they are at least evidence that you are not looking at a one-off alignment of caches, timing, and benevolent spirits.

The ritual is tedious. It is also calming. There is dignity in a checklist when the weather is poor.

There is a particular relief in moving from “I think this is working” to “I watched it work across twelve threads and then watched it do it again.” It does not eliminate bugs. Nothing does. But it changes the conversation from vibes to evidence.

And that, more than anything, is what developing with AI seems to demand.

The demand is not less testing, but more direct, more varied, more product-shaped testing.

The agent can write code all day. The field journal begins when the code has to survive weather.

[Hand-drawn notebook detail plate.]
FIG. 02 — MULTI-THREAD RUN SHEET, CHECKS, AND REPEATED PASSES.

Field note

The most useful tests are not always the most elegant tests.

Sometimes the right test is a messy, browser-driven, high-friction, multi-thread battery that makes the whole system prove it can still breathe. Engineering culture prefers small isolated checks because they are clean. Product reliability often comes from dirty checks, because users are dirty checks.