Most agent-written test plans are too polite.
They add a happy path, a couple of edge cases, and a build command. Sometimes that is enough. For code that handles permissions, files, external input, generated code, identity, money, or durable state, it is not.
severe-testing is a small skill I use to push agents toward a more useful habit: start by naming the claim, then design tests that would refute it.
Obvious in theory; in practice it changes the whole shape of the work.
Start with the claim
The wrong first question is:
What tests should I add?
The better first question is:
What would make the implementation's claim false?
For a file upload path, the claim may involve MIME handling, size limits, storage boundaries, cleanup, and authorization. For an API endpoint, it may involve object ownership, idempotency, retry behavior, stale sessions, and audit trails. For an agent tool, it may involve tool scope, prompt-injection resistance, data exfiltration, and whether indirect instructions can influence privileged actions.
Once the claim is explicit, the test design becomes less generic. You are no longer adding tests because tests are good. You are attacking a concrete promise the system makes.
Refutations are more useful than categories
Security and reliability categories are helpful, but they are not findings by themselves.
- authz
- injection
- concurrency
- resource exhaustion
- secrets
- path traversal
- prompt injection
Those labels only become useful when tied to an observable failure:
- wrong user can read object by direct ID
- symlink escape writes outside the workspace
- replayed request creates a duplicate charge
- markdown payload becomes script execution
- tool output carries a secret into the model prompt
- interrupted save leaves corrupt state marked complete
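As an illustration, the symlink case can be turned into a test that tries to refute the claim "writes never land outside the workspace". This is a minimal sketch: `safe_write` is a hypothetical helper standing in for whatever code is under test, and the check reads the filesystem directly rather than trusting the helper's return value.

```python
import os
import tempfile
from pathlib import Path


def safe_write(workspace: Path, relative_name: str, data: bytes) -> Path:
    """Hypothetical helper under test: write under workspace, refuse escapes."""
    root = workspace.resolve()
    # resolve() follows symlinks, so a link pointing outside is exposed here.
    target = (root / relative_name).resolve()
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise PermissionError(f"{relative_name!r} resolves outside the workspace")
    target.write_bytes(data)
    return target


def symlink_escape_is_blocked() -> bool:
    """Return True if a symlink inside the workspace cannot redirect a write."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        workspace = base / "workspace"
        outside = base / "outside"
        workspace.mkdir()
        outside.mkdir()
        # Attack: a link inside the workspace that points at a directory outside it.
        os.symlink(outside, workspace / "link", target_is_directory=True)
        try:
            safe_write(workspace, "link/pwned.txt", b"x")
        except PermissionError:
            pass  # Refusing is acceptable; what matters is the state on disk.
        # Observable failure: the file exists outside the workspace.
        return not (outside / "pwned.txt").exists()
```

The test passes only if nothing was written outside the workspace, which is the observable failure named in the list rather than the "path traversal" label.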
The skill is written to keep pushing in that direction. It asks for preconditions, postconditions, invariants, malicious inputs, recovery paths, and independent oracles. It also asks the agent to suppress false positives when there is no attack surface or when a higher layer demonstrably handles the risk.
Aggressive testing without filtering becomes noise. The useful bar is not “can I imagine a scary failure?” It is “can I reproduce a failure against the claim, or at least show strong evidence of one?”
Oracles keep tests honest
A severe test needs an oracle that is not just the implementation repeating itself.
Useful oracles include:
- permission matrices
- schema validators
- state-machine invariants
- normalized diffs
- reference libraries
- accessibility trees
- resource limits
- audit logs
- filesystem boundaries
- mathematical identities
For generated or agentic code, this becomes especially important. A model can write a test that agrees with the bug. It can assert a shape instead of a behavior. It can test the helper that happens to exist rather than the guarantee users depend on.
An independent oracle makes the test less vulnerable to that circularity.
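One way to picture an independent oracle is a permission matrix written from the policy rather than derived from the code. The sketch below assumes a hypothetical `can_read` function and an invented three-role policy; neither comes from the skill itself.

```python
# Hypothetical implementation under test.
def can_read(role: str, is_owner: bool) -> bool:
    if role == "admin":
        return True
    if role == "member":
        return is_owner
    return False


# Independent oracle: transcribed from the (assumed) written policy,
# not computed by calling or copying the implementation.
PERMISSION_MATRIX = {
    ("admin", True): True,
    ("admin", False): True,
    ("member", True): True,
    ("member", False): False,
    ("guest", True): False,
    ("guest", False): False,
}


def matrix_violations() -> list[tuple[str, bool]]:
    """Return every (role, is_owner) cell where the code disagrees with the matrix."""
    return [
        cell
        for cell, expected in PERMISSION_MATRIX.items()
        if can_read(*cell) != expected
    ]
```

Because the matrix enumerates every cell, a test that asserts `matrix_violations() == []` cannot quietly agree with a bug the way a single hand-picked assertion can.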
Evidence has a confidence score
The skill also separates demonstrated failures from imagined ones:
| Score | Meaning |
|---|---|
| 0 | The failure mode does not apply to this code. |
| 25 | Plausible risk, but no concrete path or reproducer. |
| 50 | Reproduced under contrived conditions. |
| 75 | Reliably reproduced under realistic conditions. |
| 100 | Demonstrated in the real runtime with captured evidence. |
The scale prevents two bad outcomes.
The first is under-reporting: a real failure gets waved away because the test was uncomfortable to write. The second is over-reporting: a speculative security concern gets presented as if it were a bug.
For engineering work, both are expensive. One ships defects. The other teaches people to ignore the review.
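The scale is easy to encode so that reports cannot blur the two categories. In this sketch the names `Confidence`, `Finding`, and `split_findings` are hypothetical, and the default threshold of 50 for "report as a bug" is my assumption, not something the skill prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    NOT_APPLICABLE = 0     # failure mode does not apply to this code
    PLAUSIBLE = 25         # no concrete path or reproducer
    CONTRIVED_REPRO = 50   # reproduced under contrived conditions
    REALISTIC_REPRO = 75   # reliably reproduced under realistic conditions
    DEMONSTRATED = 100     # shown in the real runtime with captured evidence


@dataclass(frozen=True)
class Finding:
    title: str
    confidence: Confidence


def split_findings(findings, threshold=Confidence.CONTRIVED_REPRO):
    """Separate reproduced failures from speculative risks; drop non-applicable ones."""
    bugs = [f for f in findings if f.confidence >= threshold]
    risks = [
        f for f in findings
        if Confidence.NOT_APPLICABLE < f.confidence < threshold
    ]
    return bugs, risks
```

Keeping speculative items in a separate list, instead of deleting them, preserves the concern without presenting it as a defect.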
Agents need this more, not less
Agentic coding makes severe testing more valuable because agents are good at producing work that looks complete. They can add a green test, run a command, and summarize the result in a way that sounds finished even when the important claim was never threatened.
The corrective is simple but hard to keep doing:
1. Name the claim.
2. Name what would falsify it.
3. Pick an oracle.
4. Attack the input space.
5. Keep only the findings the evidence can support.
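Applied to the replayed-request example from earlier, the five steps might look like the sketch below. `PaymentService` is a hypothetical in-memory stand-in for the code under test, and the key values are arbitrary; the point is that the oracle counts ledger entries instead of trusting the response.

```python
from collections import Counter
from dataclasses import dataclass, field

# Steps 1 and 2: the claim and what would falsify it, stated up front.
CLAIM = "Replaying a request with the same idempotency key never creates a second charge."
FALSIFIER = "The ledger holds more than one charge for any key."


@dataclass
class PaymentService:
    """Hypothetical service under test."""
    ledger: list[tuple[str, int]] = field(default_factory=list)
    charge_ids: dict[str, int] = field(default_factory=dict)

    def charge(self, key: str, amount_cents: int) -> int:
        if key in self.charge_ids:  # replay: return the original charge id
            return self.charge_ids[key]
        self.ledger.append((key, amount_cents))
        self.charge_ids[key] = len(self.ledger)
        return self.charge_ids[key]


def duplicate_keys(service: PaymentService) -> list[str]:
    """Step 3, the oracle: count charges per key directly in the ledger."""
    counts = Counter(key for key, _ in service.ledger)
    return sorted(key for key, n in counts.items() if n > 1)


def run_attack() -> list[str]:
    """Step 4: replay keys, including interleaved and empty ones."""
    service = PaymentService()
    for key in ["order-1", "order-1", "order-2", "order-1", "", ""]:
        service.charge(key, 500)
    # Step 5: report only what the ledger actually shows.
    return [f"duplicate charge for key {key!r}" for key in duplicate_keys(service)]
```

An empty result here means the attack did not refute the claim for these inputs, not that the claim is proven.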
That is the whole point of severe-testing. It is not a fetish for bigger test suites. It is a way to attack the system’s promise yourself before a user or attacker does it for you.