Fabric’s test plan stopped sounding like a test plan and started sounding like a dispatch from a difficult crossing.
Start the whole stack. Clean the slate. Delete projects and sessions. Create multiple projects. Create multiple threads. Send prompts through the browser. Watch whether responses arrive quickly. Confirm thinking and reasoning show up. Inspect markdown, generative UI, Mermaid diagrams, SVGs, forty-one components, tool output, end-of-turn animations, show/hide behavior, logs, resource usage. Visually inspect every thread because rendering errors like to hide in corners.
Then run it again. Keep running it until it passes twice in a row, because a single successful passage may be luck, while two begin to look like a route. That was Fabric in April.
A workflow like that makes sense only after enough smaller validations have lied to you.
Receipts mattered more than ritual: twelve threads, visible browser passes, logs, resource usage, generated files, diagrams, and a second clean run after the first one stopped feeling lucky.
A unit test says the renderer handles a shape. A protocol test says the event is valid. A service health check says the process is alive. A typecheck says the code is coherent. All useful. None sufficient.
Fabric was the sum of too many surfaces: protocol, Diminuendo clients, agent runtimes, file rendering, tool events, markdown, diagrams, live streams, histories, end-of-turn states, and whatever happens when a real browser sits there waiting for proof.
In that world, “green” has to mean more than “the test suite passed”; it has to mean a person can watch the system do the work and understand what happened.
April’s battery session is absurd in the raw numbers: 5,345 shell commands, 343 Playwright browser evaluations, and 523 patches. Not a boast. A warning label. Systems this interconnected do not fail politely; they leave little muddy footprints across several rooms at once.
One root cause from the session is almost comically specific: files existed, but inline mentions like solution.ts were not being promoted into clickable file links from transcript-backed state. Nothing had vanished from storage. The interface simply failed to treat the transcript as evidence that the file existed, and that distinction matters.
From the user’s perspective, “I cannot click the file” and “the file is gone” rhyme. Both break trust. But the fixes live in different parts of the system. One is durability. One is presentation. If you guess wrong, you go digging in the storage layer while the renderer quietly keeps lying.
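A minimal sketch of the presentation-side fix, under loud assumptions: the event shape, the `file_written` kind, and both function names are hypothetical illustrations, not Fabric's actual code. The point it shows is that the set of known files is derived from the transcript itself, so a mention can become a link without asking storage anything.

```typescript
// Hypothetical transcript event; Fabric's real event schema is not shown in this post.
interface TranscriptEvent {
  kind: string;
  path?: string;
}

// A rendered message is a sequence of plain-text runs and clickable file links.
type Segment =
  | { type: "text"; value: string }
  | { type: "file"; path: string };

// Treat the transcript as evidence: any path a tool reported writing counts as known.
function knownFilesFromTranscript(events: TranscriptEvent[]): Set<string> {
  const files = new Set<string>();
  for (const event of events) {
    if (event.kind === "file_written" && event.path) {
      files.add(event.path);
    }
  }
  return files;
}

// Promote whitespace-delimited mentions of known files into link segments.
// Simplification: a mention with trailing punctuation ("solution.ts.") is not matched.
function promoteFileMentions(text: string, knownFiles: Set<string>): Segment[] {
  const segments: Segment[] = [];
  let buffer = "";
  // The capturing group keeps the whitespace separators in the token list.
  for (const token of text.split(/(\s+)/)) {
    if (knownFiles.has(token)) {
      if (buffer) segments.push({ type: "text", value: buffer });
      buffer = "";
      segments.push({ type: "file", path: token });
    } else {
      buffer += token;
    }
  }
  if (buffer) segments.push({ type: "text", value: buffer });
  return segments;
}
```

With a transcript that records `solution.ts` as written, the sentence "See solution.ts for details" comes back as three segments: text, file link, text. With an empty transcript the same sentence stays one run of plain text, which is roughly the failure the session chased.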
Visual inspection belonged in the battery because the UI is often where distributed truth becomes visible. If a file exists but is not inspectable, the product has still failed. If a tool ran but the output is unreadable, the product has still failed. If a diagram was generated but collapses the layout, the product has still failed.
Battery testing became product work because Fabric was not a single service. It was the lived continuity between services, the difference between owning a collection of fine instruments and hearing them play in tune.
I used to think of broad manual-plus-browser test passes as something you do near release. I think agents change that. When implementation velocity increases, the cost of false confidence increases with it. You can accumulate plausible changes faster than you can understand their second-order effects.
Slower coding is not the counterweight. Harsher feedback is: clean the slate, use a real browser, create multiple threads, watch the logs, inspect the artifacts, and rerun until the same success happens twice.
Two consecutive runs matter more than it sounds. One clean run can be luck. Two are not proof of correctness, but they are at least evidence that you are not looking at a one-off alignment of caches, timing, and benevolent spirits.
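The twice-in-a-row rule is small enough to write down. What follows is a sketch, not the harness from the April session: `PassStreak` and `attemptsUntilGreen` are names invented here, and the only behavior assumed is that any failure resets the count.

```typescript
// Hypothetical helper: tracks consecutive passing battery runs.
class PassStreak {
  private readonly required: number;
  private streak = 0;
  private attempts = 0;

  constructor(required: number = 2) {
    this.required = required;
  }

  // Record one full battery run; returns true once the required streak is reached.
  record(passed: boolean): boolean {
    this.attempts += 1;
    // A failure resets the streak: earlier passes no longer count as evidence.
    this.streak = passed ? this.streak + 1 : 0;
    return this.streak >= this.required;
  }

  get attemptCount(): number {
    return this.attempts;
  }
}

// Replay a history of run outcomes and report which attempt first went "green",
// or null if the required streak never happened.
function attemptsUntilGreen(outcomes: boolean[], required: number = 2): number | null {
  const streak = new PassStreak(required);
  for (const outcome of outcomes) {
    if (streak.record(outcome)) return streak.attemptCount;
  }
  return null;
}
```

Under this rule a history of pass, fail, pass, pass goes green on the fourth attempt, not the third: the first pass was spent the moment the second run failed.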
Ritual can be tedious. It can also be calming. There is dignity in a checklist when the weather is poor.
Moving from “I think this is working” to “I watched it work across twelve threads and then watched it do it again” brings a particular relief. It does not eliminate bugs. Nothing does. But it changes the conversation from vibes to evidence.
The most useful tests are not always the most elegant tests.
Sometimes the right test is a messy, browser-driven, high-friction battery that makes the whole system prove it can still breathe. Small isolated checks are clean. Users arrive as whole workflows.
AI does not reduce the testing burden. It makes the useful tests more direct, more varied, and more product-shaped.
The agent can write code all day. Proof begins when that code has to survive weather.