When research reports should fail

When a research agent reads two credible sources that disagree, the useful output is often not a better paragraph. It is a failed report.

That sounds awkward only if the goal is to make every run produce a polished answer. For a system that stores durable research, failure is a product feature. It keeps uncertainty visible while there is still time to fix the corpus.

Arcwell is my local assistant layer for research, memory, monitoring, and long-running agent work. The deep-research path is built around a simple discipline: collect source cards, extract bounded claims, group the claims, run a skeptic pass, and only then compile a report. The model is allowed to help, but generated text is not allowed to become evidence for itself.

Here is the smallest useful version of that rule.

The fixture

I created a disposable research run with two source cards. Benchmark Alpha report was a primary, high-trust card saying that Indexer A was the fastest implementation for the measured query mix. Benchmark Beta report was also primary and high-trust, but said the opposite: Indexer A was not the fastest implementation for that same measured query mix.

The point of the fixture is not benchmarking. It is the gate behavior. Both cards are source-like, both are linked to the same research run, and both make a measurement claim about the same subject and predicate.

The extracted claim ledger looks like this:

[
  {
    "subject": "Indexer A",
    "predicate": "fastest implementation",
    "object": "yes",
    "confidence": 0.8,
    "source_card_id": "src-b4c8ffa00eaff80f"
  },
  {
    "subject": "Indexer A",
    "predicate": "fastest implementation",
    "object": "no",
    "confidence": 0.8,
    "source_card_id": "src-26019cd58d71d268"
  }
]

This is the step that matters. If the system only held summaries, the contradiction could disappear into prose. With structured claims, the disagreement has an address.

The skeptic pass

Arcwell groups the claims into a cluster for indexer a, then runs a skeptic pass over the linked sources, source roles, claims, and clusters.

The pass returned:

{
  "ok": false,
  "contradictions": [
    {
      "severity": "error",
      "notes": "`Indexer A is the fastest implementation for the measured query mix.` conflicts with `Indexer A is not the fastest implementation for the measured query mix.`"
    }
  ],
  "findings": [
    {
      "severity": "error",
      "code": "structured_claim_contradiction",
      "message": "Structured claims appear to conflict and require resolution or caveat."
    }
  ]
}

That is a better result than a confident answer. A research system should not turn disagreement into an averaged sentence unless the method for resolving the disagreement is explicit.

The code path is intentionally ordinary: build_research_clusters groups claim records by theme, run_research_skeptic_pass checks source roles and contradictions, and compile_research_report uses the skeptic and audit results to decide whether the report is complete. The important part is where the decision sits. It is outside the writer.

What the report wrote

The compiled report was marked incomplete:

status: incomplete
source cards: 2
primary source cards: 2
extracted claims: 2
clusters: 1
contradictions: 1

The generated report did not pick a winner. It carried both claims in the evidence section and included this audit finding:

structured_claim_contradiction:
Structured claims appear to conflict and require resolution or caveat.

It also kept the corpus warnings:

thin_source_corpus: linked_sources=2
thin_primary_source_coverage: primary_cards=2
measurement_claim_without_document_anchor

Those warnings are not decoration. They prevent a small fixture from being mistaken for broad research. In a real run, the next action would be to pull more benchmark sources, inspect methodology, anchor measurement claims to documents or tables, and decide whether the disagreement comes from different workloads, dates, versions, or measurement practice.

Why this belongs before writing

The tempting architecture is:

Gather links.
Ask a model to summarize them.
Store the summary.

That works until one of the links disagrees with another, or until a generated summary gets fed back in as if it were source material. Once that happens, the prose can look better while the evidence gets worse.

The safer architecture is:

Store sources as sources.
Extract claims with source-card IDs attached.
Keep claims separate from generated reports.
Run an adversarial pass over the claim ledger.
Let the report fail when the evidence fails.

The run demonstrates a narrow behavior: the research pipeline can preserve a contradiction between two primary-role source cards and prevent the compiled report from presenting the run as complete.

That does not establish broad web research quality, long-running scheduling, live provider behavior, or benchmark truth. It shows something more basic: disagreement survives contact with the report writer.

Chris Chabot · June 2026

technical blog