The first agent runtime I built this year didn’t know when to stop. In February I watched it finish its answer and keep going anyway — another auto-continued round, then another, with nothing anywhere in the system to tell it the answer was already done. I can’t show you that run; nothing of it survives except the fix, one commit line capping auto-continue at five rounds. Once you’ve watched an agent spend rounds on a finished answer, you stop trusting what it says about its own progress. I went back to checking its work by hand, which defeats most of the point of handing work over.
Hand an agent real work in the morning — a refactor, an investigation, a deploy to keep honest — and by evening you should be able to trust its account of what happened. That is the whole ask. It’s easy to say, and it has proven expensive to build: since February I have built three runtimes chasing it, and thrown two of them away.
All three live in private repos, so I can describe them but not link them. The first was a local, all-in-one workspace whose initial import on 17 February was already broad rather than cautious. The second was an edge-native rewrite of the same instincts that landed on 9 April. The third is the kernel I run today, built around durable events, projections, work items, and approval gates. This week I ran git log through all three histories with a single test: what did each version make visible that the one before it could hide?
February put everything in one room
Version one solved delegation by proximity. Chat surface, tool runtime, local database, terminal sessions, automations, skills, channel adapters, and the first swarm experiments all lived in a single application. It wasn’t elegant, but when something broke, the pieces sat close enough together that finding the break rarely took long.
The commit history moves fast. Four days after the import, on 21 February, a large platform expansion landed — services, UI, skills, and what the commits call soul documents, the files that gave each agent its standing personality. The same day produced four fixes I now read as the real curriculum. The whole first four days fit in six lines at the bottom of the log:
$ git log --format='%as %s' | tail -6
2026-02-21 Remove personality presets (friendly/pragmatic)
2026-02-21 Scale down chat heading sizes and match inline code font size
2026-02-21 Cap length-truncation auto-continue at 5 rounds
2026-02-21 Fix model token limits: pass maxOutputTokens to LLM API
2026-02-21 Phase 9-J: skills, soul documents, services, UI, and infra
2026-02-17 Initial commit
The third line is the cap from the run that opens this post — filed four days into the project, between a font-size fix and a token-limit fix.
Together those four lines describe an agent that had every capability it needed and no discipline about spending any of it — responses that ignored their token budget, a chat surface set in oversized headings, personality presets standing in for policy. Most of the code that mattered sat underneath the prompt: schemas, approval policy, path checks, result truncation, command lifetimes, session ids, and whether a tool result could be traced back to the call that produced it.
By the end of the month I was asking the system smaller, stricter questions: is this tool read-only, does this command stay inside its directory, has this response hit its budget, can the user open the skill file and read what it does.
April stopped pretending to be a laptop
System two began from one architectural bet: the runtime would no longer get to behave like a laptop. Processes became isolates; files, durable workspace views; cron, scheduled state; browser sessions, model calls, memory, and publishing all moved behind platform-shaped boundaries. The design question changed from “can this run in the cloud” to “which durable object owns this kind of truth.”
The rewrite landed on 9 April. Two days later came the swarm runtime and a round of chat performance work; the day after that, a Claude-native runtime and an overhaul of long-running tasks; within the week the chat stream had been rebuilt and runtime recovery hardened. Porting the code went quickly. The slow part was reassigning custody — every invariant the February workspace had kept safe through sheer adjacency now needed a named home.
April taught me that once an agent can outlive the browser tab, the model stream, and the warm process that started it, “working on it” has to survive a dead tab and a restarted process — and the transcript cannot be what keeps it alive. A transcript is written for a reader: it smooths over retries and narrates a half-finished run as if it were a story, and a story is a poor place to look up the precise point where a run stopped.
Six jobs for one transcript
By the third system I could name the defect precisely: the transcript had been holding six jobs at once — model history, user interface, recovery log, tool ledger, approval queue, progress indicator. The live kernel takes those apart. Everything that happens — a tool call, its result, an approval, a context change, a worker’s status — gets recorded as a durable event, and the transcript becomes one projection rendered from that record, alongside the progress view, the recovery path, and replay.
You could read this as event-sourcing for its own sake, and for a chat product it would be; transcripts have carried chat products for years without complaint. The break comes with delegation. The moment a run is allowed to continue after I close the tab, something other than a rendered conversation has to know, exactly and replayably, what has happened so far.
What converted me is what the split does to recovery. The kernel treats a dead tab as nothing special: the worker keeps going behind the boundary, a stalled model stream gets retried there too, and a page that reconnects replays the events it missed and shows the run as it actually stands, the parts that happened while the tab was gone included.
Multi-agent work became interesting to me again on the same terms. In the current kernel a subagent takes an owned, scoped task with its own abort path; edits and deploys carry their owner and approval trail; and the verification pass opens the produced artifact and checks it against the task.
The tests changed shape alongside the runtimes. February’s suite asked whether the pieces worked; April’s asked whether the runtime survived; the kernel’s take the sentence “the agent can do this” and ask under which interruption, through which projection, with whose permission, after which replay, and with what visible evidence. A model stream that stalls without reporting anything used to be a support mystery. Now it is a failing test.
Context management landed in the same books. Conversations grow, tool output is noisy, and a model can drown in its own helpfulness. I used to describe the resulting bad runs in terms of mood — the agent seemed distracted, or overconfident, or strangely loyal to a plan it should have dropped — and the moods turned out to be bookkeeping errors. So the kernel keeps books: an explicit account of what the model may still remember, what the user can inspect, what can be reconstructed, and what is never allowed to disappear quietly. When the ledger is right the moods mostly go away. Providers get entries too: the first system hid every model behind one generic shape, which held until the differences stopped being cosmetic — streaming semantics, tool-call formats, token counting, and recovery paths all diverge. The April commits went Claude-native; the kernel keeps provider reality at the boundary and normalizes it into durable events as early as it can.
What survived
I still want agents that act — long investigations, delegated builds, browser-validated deploys, work that continues while I sleep. Three generations in, the part I trust is the machinery around the action: permissions, budgets, durable events, replay, recovery, explicit ownership, and tests that make the product’s promises falsifiable.
The three systems in this post are staying private, but the convictions in them did not. Arcwell is where they now run in public: an MIT-licensed Rust platform that gives Codex-, Claude-, and MCP-capable agents personal-assistant services on this exact spine — durable records, policy gates, work runs, projections. The honest front door is STATUS.md, which is blunt about what works and what doesn’t yet.