TapeChaos: A Fault-Injection Substrate for Durable Agents
Tape proves the agent survived one crash. TapeChaos proves it survives the matrix.
This is the design spec for fault injection and chaos engineering on top of
Tape. It is a companion to tape.md (the runtime) and tape-event-bus.md
(the reactor surface); read those first.
0. The one-paragraph version
A durable agent runtime has a journal. A journal is also the perfect chaos
substrate: every fault has a natural coordinate (a seq, a decision index,
an effect key, a subject path), every invariant is already projected (one
wire, one record; obligations drained; budgets respected), and every run is
replayable. TapeChaos turns that journal into a chaos engine. It ships four
stacked layers — deterministic simulation underneath, in-process failpoints
at the six primitives, span-keyed faults at the SDK surface, and a
reliability-surface harness on top — so the same scenario can be driven from
a unit test, a soak job, a GameDay, or a CI replay. The result is the first
chaos framework where the system under test and the chaos substrate
share the same data model.
1. The thesis
Most chaos frameworks bolt on. They inject a fault, run a workload, and hope an alert fires. The journal Tape already keeps changes the shape:
- Faults can be keyed to the journal coordinates the system itself uses. tape::begin_effect::post_db is a real surface, not a wrapping.
- Invariants can be read off the projections that already exist. "One wire" is count(effects WHERE business_key=b AND status=CONFIRMED) = 1, not a custom assertion the test had to compute.
- Failures can be replayed bit-for-bit, because the seed that drives the simulator is the seed of every choice the run made.
- Minimum failure sets can be derived from a successful run's lineage, not guessed by the test author.
The journal tells you what happened. TapeChaos asks the journal what would have to break for the answer to change.
The four properties above — keyed, projected, replayed, derived — are what separate this from "Chaos Monkey for agents."
2. The taxonomy: six primitives × six failure shapes
Tape's six primitives (§IX of the treatise) admit six failure shapes
each. The matrix is the contract: every cell needs a name, an injector, and
an invariant that detects it.
| Primitive | Crash | Timeout | Lost ack | Duplicate | Drift | Adversarial |
|---|---|---|---|---|---|---|
| Decision | model 5xx | slow token stream | partial JSON | retry collision | schema drift | prompt injection |
| Effect | mid-call exit | upstream timeout | AckLost | duplicate dispatch | replay drift | poisoned response |
| Obligation | compensator throws | inverse slow | rollback unknown | double-compensate | wrong payload | malicious payload |
| Gate | reactor down | gate timeout | duplicate signal | conflicting resolves | resume race | spoofed signal |
| Timer | reactor lag | clock skew | missed fire | duplicate fire | timer storm | adversarial wakeup |
| KV | watcher disconnect | stale read | CAS miss | watch gap | version skew | hostile writer |
Every cell here is a real production failure mode someone has hit. Every cell here is also a Tape-specific failure mode — i.e. it admits a projection-based invariant. That is the unit of test.
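One way to make that contract checkable is to hold the matrix as data and flag any cell that lacks an injector or an invariant. The sketch below is illustrative only: Cell, build_cells and missing_cells are hypothetical helpers, not part of the Tape SDK, only two rows of the matrix are filled in, and pairing AckLost with ExactlyOne is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SHAPES = ["crash", "timeout", "lost_ack", "duplicate", "drift", "adversarial"]

# Cell names copied from the matrix above; two rows shown for brevity.
MATRIX = {
    "effect": ["mid-call exit", "upstream timeout", "AckLost",
               "duplicate dispatch", "replay drift", "poisoned response"],
    "timer": ["reactor lag", "clock skew", "missed fire",
              "duplicate fire", "timer storm", "adversarial wakeup"],
}

@dataclass
class Cell:  # hypothetical record, not a Tape type
    name: str
    injector: Optional[Callable[[], None]] = None  # arms the fault
    invariant: Optional[str] = None                # name of the detecting invariant

def build_cells(matrix):
    """Expand {primitive: [names]} into {(primitive, shape): Cell}."""
    return {(prim, shape): Cell(name)
            for prim, names in matrix.items()
            for shape, name in zip(SHAPES, names)}

def missing_cells(cells):
    """Cells that break the contract: no injector or no invariant."""
    return sorted(key for key, cell in cells.items()
                  if cell.injector is None or cell.invariant is None)

cells = build_cells(MATRIX)
# Illustrative wiring only: pair the AckLost cell with ExactlyOne.
cells[("effect", "lost_ack")].injector = lambda: None
cells[("effect", "lost_ack")].invariant = "ExactlyOne"

print(len(cells), "cells,", len(missing_cells(cells)), "still unwired")
```

A CI step could fail while missing_cells returns anything, which turns "every cell needs a name, an injector, and an invariant" into a build-time check rather than a convention.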
3. The four-layer stack
Layer 4 — agent reliability surface (Inspect AI, agent-chaos, R(k,ε,λ))
Layer 3 — span-keyed user API (OTel gen_ai.* semconv, tape.obs.SPAN_*)
Layer 2 — in-process failpoints (tikv/fail-rs in the Rust core)
Layer 1 — deterministic simulation (madsim or turmoil, seeded + replayable)
No single layer is sufficient; each covers what the others can't.
- Layer 1: DST. A seeded, deterministic simulation of the Rust core's IO and time. Same seed = same trace, every time. This is the FoundationDB / TigerBeetle pattern, applied to a durable-agent runtime. Madsim or turmoil: pick one; this spec is agnostic. The killer property is bit-for-bit replay of any failure CI ever found.
- Layer 2: failpoints. Named injection sites in the Rust server, at each of the six primitives' mutating boundaries. Zero-cost when compiled without the chaos feature; active when compiled with it. Configured by env (FAILPOINTS=tape::begin_effect::post_db=panic) or by RPC. tikv's fail crate is the de-facto standard in this ecosystem.
- Layer 3: span-keyed user API. The OpenTelemetry gen_ai.* semantic conventions stabilized in late 2025 (gen_ai.invoke_agent, gen_ai.execute_tool, gen_ai.create_agent). They map 1:1 to Tape's tape.obs.SPAN_* constants (tape.begin_effect, tape.record_decision, …). TapeChaos uses span names as the fault-injection vocabulary, so the same scenario runs against any agent framework that emits the same spans.
- Layer 4: agent reliability surface. The harness that drives whole runs through scenarios, checks the journal invariants, and emits a reliability score. This is ReliabilityBench's R(k, ε, λ) formalism, recast for durable runtimes.
The reason for four layers, not one: a single layer always has a blind spot. DST can't simulate Anthropic's API. Failpoints can't model an adversarial prompt. Spans can't reproduce a kernel-clock skew. Reliability scoring can't tell you where in the journal the run broke. Stacking is the design.
4. The architecture
                ┌────────────────────────────────────────────┐
                │  tape chaos  run/soak/replay/fuzz/lint     │  CLI
                └────────────┬───────────────────────────────┘
                             │
   ┌─────────────────────────┼─────────────────────────────┐
   │                         │                             │
┌──▼──────────────┐ ┌────────▼─────────┐ ┌─────────────────▼────────────┐
│ chaos.scenario  │ │ chaos.invariants │ │ chaos.lineage (LDFI)         │
│ faults[]        │ │ ExactlyOne,      │ │ derive minimal fault sets    │
│ invariants[]    │ │ NoStuckOblig,    │ │ from a successful run        │
│ seed            │ │ NoBlindRetry,    │ │                              │
└──┬──────────────┘ │ ReplayDetermin   │ └──────────────────────────────┘
   │                │ Linearizable     │
   │                └──────────────────┘
   │
   │  span-keyed faults
   ▼
┌────────────────────────────┐    ┌───────────────────────────────────┐
│ tape.chaos (SDK)           │    │ ChaosService (gRPC, optional)     │
│ crash, delay, lose_ack,    │◀──▶│ Configure(failpoints map)         │
│ wrap_connector, attach,    │    │ Inject(name, action)              │
│ hypothesize (PBT)          │    │ StartScenario(name, seed)         │
└────────────────────────────┘    └────────────────┬──────────────────┘
                                                   │
                                                   ▼
                                  ┌────────────────────────────────┐
                                  │ tape-server (Rust)             │
                                  │ fail_point!("tape::*::post")   │  ~80 named sites
                                  │ + madsim/turmoil DST harness   │  seeded; replayable
                                  └────────────────────────────────┘
The split is deliberate: the catalog of injection sites lives in the server (because that's where the boundaries are); the language of scenarios lives in the SDK (because that's where the user writes them); and a small RPC bridges them (because the same Python test should drive a server it doesn't own).
5. Failpoint catalog (v1)
Two markers per primitive's mutating RPC: pre_db (the request is parsed,
the store has not yet been touched) and post_db (the store mutation
completed, the response has not yet been sent). The post_db site is the
load-bearing one — it's exactly the window the treatise calls the one
place uncertainty lives, made injectable.
tape::begin_run::{pre_db, post_db}
tape::resume_run::{pre_db, post_db}
tape::end_run::{pre_db, post_db}
tape::record_decision::{pre_db, post_db}
tape::get_decision::{pre_db, post_db}
tape::begin_effect::{pre_db, post_db}
tape::complete_effect::{pre_db, post_db}
tape::reconcile_effect::{pre_db, post_db}
tape::claim_effect_dispatch::{pre_db, post_db}
tape::record_dispatch_attempt::{pre_db, post_db}
tape::record_external_observation::{pre_db, post_db}
tape::register_compensation::{pre_db, post_db}
tape::claim_obligation::{pre_db, post_db}
tape::resolve_obligation::{pre_db, post_db}
tape::record_obligation_attempt::{pre_db, post_db}
tape::await_signal::{pre_db, post_db}
tape::send_signal::{pre_db, post_db}
tape::set_timer::{pre_db, post_db}
tape::cancel_timer::{pre_db, post_db}
tape::list_due_timers::{pre_db, post_db}
tape::write_value::{pre_db, post_db}
tape::delete_value::{pre_db, post_db}
tape::append_event::{pre_db, post_db}
Each site supports the fail crate's standard actions: off, return,
panic, sleep(ms), pause, yield, print(msg), with percentage (p%) and
count (n*) prefixes and chained alternatives (10%panic->return).
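A test driver usually has to assemble the FAILPOINTS value from several sites at once. The sketch below assumes the fail crate's documented name=action;name=action environment format; failpoints_env is a hypothetical helper, and how the server process is launched is left out because it is not specified here.

```python
import os

def failpoints_env(actions: dict) -> str:
    """Join {site: action} pairs into the `name=action;name=action`
    form the fail crate reads from the FAILPOINTS variable."""
    for site, action in actions.items():
        # ';' separates entries and '=' separates name from action,
        # so neither may appear inside a site name.
        if ";" in site or "=" in site or ";" in action:
            raise ValueError(f"unsafe failpoint entry: {site!r}={action!r}")
    return ";".join(f"{site}={action}" for site, action in actions.items())

env = dict(os.environ)
env["FAILPOINTS"] = failpoints_env({
    "tape::begin_effect::post_db": "panic",
    "tape::send_signal::pre_db": "sleep(2000)",
})
print(env["FAILPOINTS"])
# tape::begin_effect::post_db=panic;tape::send_signal::pre_db=sleep(2000)
```

The resulting env mapping would be passed to whatever spawns a chaos-enabled server build; a binary built without the chaos feature ignores it.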
Compile-time: the macro is zero-cost when tape-server is built without
--features chaos. Production binaries pay nothing for this surface.
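Why post_db is the load-bearing site is easiest to see in a toy model. The sketch below is not the Tape server: ToyServer and CrashInjected are hypothetical stand-ins, and the store is a plain dict. It only shows what state a client is left facing when the crash lands on either side of the store mutation.

```python
class CrashInjected(Exception):
    """Stands in for the process dying at an armed failpoint."""

class ToyServer:
    def __init__(self, armed=None):
        self.store = {}       # effect key -> status
        self.armed = armed    # "pre_db", "post_db", or None

    def _failpoint(self, site):
        if self.armed == site:
            self.armed = None  # fire once, like a crash followed by a restart
            raise CrashInjected(site)

    def begin_effect(self, key):
        self._failpoint("pre_db")               # request parsed, store untouched
        self.store.setdefault(key, "PENDING")   # mutation, idempotent on key
        self._failpoint("post_db")              # store mutated, response not sent
        return "ok"

for site in ("pre_db", "post_db"):
    server = ToyServer(armed=site)
    try:
        server.begin_effect("wire-1")
    except CrashInjected:
        print(site, "-> store after crash:", server.store)
    # The client never saw a response, so it retries with the same key.
    print(site, "-> retry:", server.begin_effect("wire-1"),
          "rows:", len(server.store))
```

After a pre_db crash the store is empty and the retry is an ordinary first attempt. After a post_db crash the row already exists but the caller cannot know that, so the retry is only safe because the mutation is keyed and idempotent.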
6. The scenario: what users write
import tape.chaos as chaos

bank_wire_under_chaos = chaos.scenario(
    name="bank-wire-under-chaos",
    seed=42,
    faults=[
        # at the Tape server
        chaos.crash("tape::complete_effect::post_db",
                    when="tool == 'execute_sweep'", after_n=1),
        chaos.delay("tape::send_signal::pre_db", ms=2_000),
        # at the SDK / span layer
        chaos.crash("tape.dispatch_effect",
                    when="effect.business_key contains 'sweep'"),
        # at the connector
        chaos.lose_ack("bank.wire", probability=0.3),
        chaos.duplicate("bank.wire", probability=0.05),
        # at the infrastructure (Cilium / Chaos Mesh emit these)
        chaos.partition(["tape-server", "reactor"], duration_s=10),
        chaos.clock_skew("reactor", offset_s=300),
    ],
    invariants=[
        chaos.invariant.ExactlyOne(connector="bank.wire", by="business_key"),
        chaos.invariant.NoStuckObligations,
        chaos.invariant.NoBudgetOverrun,
        chaos.invariant.NoBlindNonIdempotentRetry,
        chaos.invariant.ReplayDeterminism,
    ],
)

result = chaos.run(bank_wire_under_chaos, runner_fn=build_runner)
print(result.reliability_surface)  # R(k=20, ε=0, λ=0.99) ✓
Every field above maps to a journal projection or a span name. The DSL is
declarative, terse, and the same shape as @tape.on and
tape.connectors.register — Tape's existing user-facing patterns.
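How a span-keyed fault decides whether to fire can be sketched in a few lines. Everything here is an assumption for illustration: FaultSpec is a hypothetical class, the predicate is a Python callable rather than the string expressions shown above, and after_n is read as "skip the first n matching spans, then fire", which this spec does not pin down.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Mapping

def always(attrs: Mapping[str, Any]) -> bool:
    """Default predicate: match every span with the right name."""
    return True

@dataclass
class FaultSpec:  # hypothetical, not the tape.chaos implementation
    span: str
    when: Callable[[Mapping[str, Any]], bool] = always
    after_n: int = 0                      # assumed: matches to skip before firing
    _seen: int = field(default=0, init=False)

    def should_fire(self, span: str, attrs: Mapping[str, Any]) -> bool:
        # Only spans with the right name and attributes count as matches.
        if span != self.span or not self.when(attrs):
            return False
        self._seen += 1
        return self._seen > self.after_n

spec = FaultSpec("tape::complete_effect::post_db",
                 when=lambda a: a.get("tool") == "execute_sweep",
                 after_n=1)
events = [
    ("tape::complete_effect::post_db", {"tool": "execute_sweep"}),  # 1st match: skipped
    ("tape::complete_effect::post_db", {"tool": "lookup"}),         # predicate fails
    ("tape::complete_effect::post_db", {"tool": "execute_sweep"}),  # 2nd match: fires
]
fired = [spec.should_fire(name, attrs) for name, attrs in events]
print(fired)  # [False, False, True]
```

Keying the match on a span name plus attributes is what lets the same fault description apply at the server, the SDK, or a connector without changing shape.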
7. Invariants: the oracle
Most chaos frameworks inject and forget to check. The catalog ships with:
| Invariant | Read from the journal |
|---|---|
| ExactlyOne(connector, by) | count(effects WHERE connector=c AND key=k AND status=CONFIRMED) = 1 |
| NoStuckObligations | count(obligations WHERE status=STUCK) = 0 |
| NoBudgetOverrun | sum(charges) ≤ cap per run |
| NoBlindNonIdempotentRetry | for every non_idempotent effect, dispatch_attempts ≤ 1 until RecordExternalObservation fires |
| GatesEventuallyResolve(bound) | status=WAITING ⇒ resolution within bound |
| CompensationLIFO | obligations resolve in newest-first order |
| ReplayDeterminism | re-drive(seed) produces the same projection |
| LinearizableJournal | SubscribeRun(from=0) is consistent with per-RPC happens-before (Porcupine) |
| NoEffectWithoutDecision | for every effect, decision_index ≥ 0 ⇒ decision exists at that index |
| NoOrphanCompensation | every obligation points to an effect that exists |
Each invariant is a one-pass query over the projections in
tape.md §6 — no extra instrumentation, no parallel test ledger. The
journal is the oracle.
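As a minimal sketch of what "a one-pass query" means, the following runs two of the catalog invariants as plain SQL over an in-memory SQLite database. The table and column names (effects, obligations, business_key) are assumptions for illustration, not the projection schema from tape.md §6.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE effects (connector TEXT, business_key TEXT, status TEXT);
CREATE TABLE obligations (effect_key TEXT, status TEXT);
INSERT INTO effects VALUES
  ('bank.wire', 'sweep-1', 'CONFIRMED'),
  ('bank.wire', 'sweep-2', 'CONFIRMED'),
  ('bank.wire', 'sweep-2', 'CONFIRMED');  -- a duplicate dispatch got through
INSERT INTO obligations VALUES ('sweep-1', 'RESOLVED');
""")

def exactly_one_violations(db, connector):
    """Business keys whose CONFIRMED count is not exactly 1.
    Only keys that have at least one effect row are considered."""
    rows = db.execute(
        "SELECT business_key, SUM(status = 'CONFIRMED') "
        "FROM effects WHERE connector = ? "
        "GROUP BY business_key "
        "HAVING SUM(status = 'CONFIRMED') != 1 "
        "ORDER BY business_key",
        (connector,),
    )
    return rows.fetchall()

def no_stuck_obligations(db):
    """True when no obligation row is in the STUCK state."""
    (n,) = db.execute(
        "SELECT COUNT(*) FROM obligations WHERE status = 'STUCK'"
    ).fetchone()
    return n == 0

print(exactly_one_violations(db, "bank.wire"))  # [('sweep-2', 2)]
print(no_stuck_obligations(db))                 # True
```

Returning the violating keys rather than a bare boolean gives the chaos report a coordinate to point at.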
8. The killer move: lineage-driven fault injection
Tape's WAL is the lineage graph. Every effect points to a decision; every obligation to an effect; every signal to a gate. Molly-style LDFI (Alvaro et al., 2015) becomes one pass:
- Take a successful run's journal.
- Compute the lineage DAG: edges from each row to the rows it depends on.
- Enumerate minimal cuts of the DAG.
- For each cut, re-run the scenario with that cut faulted.
- If the run still succeeds — the invariant holds against that fault.
- If not — emit the minimal counterexample.
No existing agent framework does this because none has the right journal; Tape does. Deriving the lineage costs one DAG traversal per successful run, plus one re-run per cut; the value is "the next thing to test, derived rather than guessed."
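The cut-enumeration step can be sketched on a toy lineage graph. This is a brute-force illustration under stated assumptions, not Molly's SAT-based solver and not the chaos.lineage implementation: the graph is a small acyclic dict of hypothetical row ids, edges point from a row to the rows that depend on it, and a cut is any set of interior rows that intersects every path from the first decision to the successful outcome.

```python
from itertools import combinations

def all_paths(dag, start, goal):
    """Interior nodes of every start -> goal path (the graph must be acyclic)."""
    paths, stack = [], [(start, [])]
    while stack:
        node, interior = stack.pop()
        if node == goal:
            paths.append(frozenset(interior))
            continue
        for nxt in dag.get(node, ()):
            stack.append((nxt, interior + ([nxt] if nxt != goal else [])))
    return paths

def minimal_cuts(dag, start, goal, max_size=3):
    """Smallest fault sets that break every derivation of the goal."""
    paths = all_paths(dag, start, goal)
    candidates = sorted(set().union(*paths))
    cuts = []
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            fault_set = frozenset(combo)
            if any(cut <= fault_set for cut in cuts):
                continue  # a smaller cut is already contained: not minimal
            if all(fault_set & path for path in paths):
                cuts.append(fault_set)
    return cuts

# Hypothetical journal: one decision, two dispatch attempts, one ack.
lineage = {
    "decision:0": ["attempt:1", "attempt:2"],
    "attempt:1": ["ack"],
    "attempt:2": ["ack"],
    "ack": ["run_ok"],
}
for cut in minimal_cuts(lineage, "decision:0", "run_ok"):
    print(sorted(cut))
# ['ack']
# ['attempt:1', 'attempt:2']
```

Each printed set is one candidate scenario: losing the ack alone, or losing both attempts together. Faulting a single attempt is not listed because the other attempt still carries the run, which is exactly the redundancy LDFI is meant to discover.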
9. Deterministic simulation: Layer 1 in detail
The Rust core compiles with one of two simulation backends behind a cargo feature:
[features]
chaos = ["fail/failpoints"] # Layer 2 only
sim = ["chaos", "madsim", "madsim-tokio"] # Layer 1 + 2
Under --features sim, tokio is swapped for madsim; IO, time, RNG, file
system and process control are all virtualized. The seed is an env var
(TAPE_SIM_SEED) and is stamped onto every chaos report. A failing seed
re-runs to bit-identity. CI runs one fresh seed per build; nightly fuzzes
1 000 seeds per scenario.
Three properties hold under sim:
- Determinism: same seed ⇒ same trace.
- Reduction: a failing seed shrinks to a minimal counterexample via binary search on injected fault count.
- Coverage feedback: each seed emits a coverage vector (failpoints triggered, RPC branches taken); the runner steers toward unexplored vectors.
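The first two properties can be sketched in plain Python. This is a toy, not the madsim harness: fault_schedule and shrink are hypothetical helpers, the "system under test" is a stand-in predicate, and the binary search assumes failure is monotone in the number of injected faults (if a prefix of the schedule fails, every longer prefix fails too).

```python
import random

def fault_schedule(seed, n_faults=10, n_steps=100):
    """Seeded schedule: the simulation steps at which a fault is injected."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_steps), n_faults))

def shrink(faults, fails):
    """Shortest failing prefix of `faults`, found by binary search.
    Requires fails(faults) to be True and fails to be monotone in prefix length."""
    lo, hi = 0, len(faults)  # the prefix of length hi is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(faults[:mid]):
            hi = mid          # still fails with fewer faults
        else:
            lo = mid + 1      # needs more faults to reproduce
    return faults[:lo]

# Determinism: the same seed reproduces the same schedule.
assert fault_schedule(42) == fault_schedule(42)

# Toy failure condition: the run breaks once four faults have landed.
def run_fails(faults):
    return len(faults) >= 4

schedule = fault_schedule(42)
minimal = shrink(schedule, run_fails)
print(len(schedule), "faults shrunk to", len(minimal))
```

Because the schedule is a pure function of the seed, the shrunk prefix can be stamped onto the chaos report next to TAPE_SIM_SEED and replayed later without storing the trace itself.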
This is the pattern FoundationDB presented publicly in 2014 and that many durable systems have rebuilt since, applied here for the first time on top of an agent runtime.
10. What this is not
- Not another chaos vendor. Antithesis stays the gold standard for proprietary hypervisor-level DST; TapeChaos doesn't reinvent the hypervisor.
- Not a generic agent eval. Inspect AI, DeepEval, PromptBench cover capability evaluation; TapeChaos covers durability — the intersection of an agent's behaviour and the journal's invariants.
- Not Jepsen. Jepsen tests databases. TapeChaos tests the agent runtime on top of the database, using Jepsen-style techniques (Porcupine) on Tape's journal.
- Not a Toxiproxy. In-process failpoints + a service-mesh layer (Cilium L7, Istio Ambient) replace it, faster and deterministically.
- Not a UI. Chaos as code, like everything else in Tape. Reports write Markdown; the dashboard is the journal.
11. Phases: how this lands
| Phase | Deliverable | Status |
|---|---|---|
| 0 | Failpoint catalogue at the server primitives + this treatise | done |
| 1 | tape.chaos Python SDK: scenarios, invariants, connector shim | done |
| 2 | Deterministic replay + single-thread DST harness | done |
| 3 | LDFI + Reliability Surface R(k,ε,λ) + DeepSnapshot | done |
| 4 | Agent-layer chaos: model_proxy, mcp_proxy (HTTP/SSE), stdlib forward-proxy | done |
| 4.1 | Stdio MCP proxy: subprocess wrapper for line-delimited JSON-RPC | done |
| 2.5 | Madsim DST foundation: virtualised time, deterministic scheduling, seeded RNG | done |
| 2.6 | Store bridge: sim-only MemRunStore so real TapeService tests run under cfg(madsim) | done |
| 3.5 | Wing-Gong linearizability checker (Porcupine-style) for the obligation lease | done |
| 5 | tape chaos CLI + Chaos Mesh manifests (pod-kill, net-delay, time-skew, soak workflow) | done |
| 5.5 | Hypothesis-stateful runner: random-sequence model checker over the obligation lifecycle | done |
| 6 | Mirror the SDK surface to Go / TypeScript / Java | |
Phase 0 is the foundation: a small, named injection surface in the Rust core, gated behind a cargo feature, with this treatise pinning the shape. Everything else is additive.
12. The one-line elevator
Tape ships a journal. TapeChaos turns the journal into the chaos engine. Seed it, drive the matrix of (primitive × failure), check the invariants the journal already knows about, score the run, replay any failure bit-for-bit. The first chaos engineering framework where the system under test and the chaos substrate share the same data model.