Leases⌗
A lease is the temporary authority to extend a run's journal. It is the only coordination primitive Tape relies on — and the only one it needs.
The journal preserves execution history. The lease grants temporary authority to extend that history.
Why durability creates replay races⌗
Once the journal is the source of truth, any process with a connection to the Tape server can attach to a run and try to drive it forward. That is the property that makes recovery work — a fresh pod, started after a crash, can resume an in-flight run because the previous pod's heap is not the source of truth.
It is also the property that makes multiple-driver races dangerous:
recovery worker A          recovery worker B
       │                          │
       │ list RUNNABLE            │ list RUNNABLE
       ▼                          ▼
  finds run R                finds run R
       │                          │
       │    (no lease, no CAS)    │
       ▼                          ▼
  re-drives R                re-drives R
       │                          │
       ▼                          ▼
  wires money                wires money   ← bad
The fix is not "try harder to coordinate." The fix is to grant exactly one writer at a time the right to extend the journal, and let everyone else short-circuit harmlessly.
What a lease is⌗
A lease is a tuple stored on the run row:
- owner_id: the process / pod that holds the lease right now.
- reactor_name: which reactor / driver took it (recovery, reconciler, outbox, agent).
- expires_at: when the lease becomes stale. Other workers may try to take over after this point.
- fencing_token: a monotonically increasing integer. Any write the lease holder makes carries the token; the store rejects writes with a stale token. This is the "STONITH" for split brain.
The lease is acquired via CAS on the run row. The store rejects an
acquire that doesn't match the expected (owner_id, fencing_token).
The acquire is the only point of contention; everything downstream is
unblocked.
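A minimal sketch of the tuple and the compare-and-set acquire, assuming an in-memory stand-in for the run row. The names `Lease`, `InMemoryRunRow`, and `cas_lease` are hypothetical and are not Tape's API; a real backend would perform the comparison and the write in a single atomic statement rather than under a Python lock.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    owner_id: Optional[str]      # None means nobody has held the lease yet
    reactor_name: Optional[str]
    expires_at: float            # seconds since the epoch
    fencing_token: int           # only ever increases


class InMemoryRunRow:
    """Hypothetical stand-in for one run row in the store."""

    def __init__(self) -> None:
        # This lock only simulates the store's atomic CAS; it is not the lease.
        self._atomic = threading.Lock()
        self.lease = Lease(None, None, 0.0, 0)

    def cas_lease(self, expected_owner: Optional[str], expected_token: int,
                  new_lease: Lease) -> bool:
        """Install new_lease only if the row still holds the expected pair."""
        with self._atomic:
            current = (self.lease.owner_id, self.lease.fencing_token)
            if current != (expected_owner, expected_token):
                return False     # somebody else changed the row first
            self.lease = new_lease
            return True
```

Two workers that both read `(None, 0)` and both call `cas_lease(None, 0, ...)` cannot both succeed: the first call changes the pair on the row, so the second comparison fails.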
Leases vs. locks⌗
Two distinctions worth making explicit:
- A lock stops other readers. A lease does not — anybody may read the journal at any time. The lease only bounds writers.
- A lock is held until released. A lease expires. If the holder crashes, the world keeps moving: the lease times out, another worker takes over, the journal grows.
The combination — bounded writer, free reader, expiry on crash — is what makes Tape's reactors safely multi-replica with no extra coordination service.
Lease lifecycle⌗
worker A acquires lease
│
▼
run executing
│
│ ── heartbeat ── heartbeat ── heartbeat …
▼
crash
│
▼
lease expires (TTL passes without heartbeat)
│
▼
worker B observes expired lease, CAS-acquires
│
▼
replay from seq=0 to resume_point
│
▼
continue from resume_point
The cycle is the same for every actor that extends the journal: recovery, reconciler, outbox, the agent itself.
Heartbeats⌗
Long-running drivers must renew the lease before it expires. The SDK
takes care of this for tape.effect bodies; for outbox dispatch and
reactor work, the reactor framework bumps the lease on each tick.
If a heartbeat fails (lost connection, store unavailable), the driver must abandon the work in progress. The lease will expire and another worker will pick up the run. Once that worker has taken over, any write the abandoning driver still attempts is rejected by fencing, so abandoning is safe.
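The renew-or-abandon behaviour can be sketched as a background loop, assuming a hypothetical `renew` callable that returns `True` when the store accepted the renewal. This is an illustration of the pattern, not the SDK's or the reactor framework's implementation; `heartbeat_loop` and `drive` are made-up names.

```python
import threading


def heartbeat_loop(renew, interval_s, done, lost):
    """Call renew() every interval_s seconds until `done` is set.

    On the first failed renewal, set `lost` and stop: the driver is
    expected to abandon its work rather than keep retrying.
    """
    while not done.wait(interval_s):
        try:
            ok = renew()
        except Exception:
            ok = False           # treat a lost connection as a failed renewal
        if not ok:
            lost.set()
            return


def drive(steps, renew, ttl_s):
    """Run steps in order, checking between steps whether the lease was lost."""
    done, lost = threading.Event(), threading.Event()
    beat = threading.Thread(
        target=heartbeat_loop, args=(renew, ttl_s / 3, done, lost), daemon=True
    )
    beat.start()
    completed = []
    try:
        for step in steps:
            if lost.is_set():
                break            # abandon: another worker will take over
            completed.append(step())
    finally:
        done.set()               # wakes the heartbeat thread immediately
        beat.join()
    return completed, lost.is_set()
```

The check between steps is cooperative, so a long step can still finish after the lease is gone. That is acceptable only because the store's fencing rejects the stale write, as described above.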
Takeover rules⌗
Takeover is the moment a worker observes an expired lease and tries to claim it. The rules:
- Read the lease: note expires_at and fencing_token.
- Compute the new state: (my_owner, my_reactor, now + ttl, token + 1).
- CAS the lease: succeed only if (owner_id, fencing_token) on the row still matches what we read. If not, somebody else already took over; back off.
- Replay the journal: reconstruct the resume point. Confirmed effects are short-circuited; pending effects go through the reconciler.
- Continue from the resume point: extend the journal under the new fencing token.
The store enforces fencing on every write — AppendEvent,
CompleteEffect, WriteValue. A stranded driver with a stale token
cannot corrupt the journal.
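The read, CAS, and fenced-write steps can be sketched against SQLite using Python's standard `sqlite3` module. The table layout and the function names (`read_lease`, `cas_takeover`, `try_takeover`, `append_event`) are assumptions made for illustration; Tape's actual schema and store API may look different. Replay is left out.

```python
import sqlite3

# Hypothetical schema for illustration only; Tape's real tables may differ.
SCHEMA = """
CREATE TABLE runs (
    run_id        TEXT PRIMARY KEY,
    owner_id      TEXT,
    reactor_name  TEXT,
    expires_at    REAL NOT NULL,
    fencing_token INTEGER NOT NULL
);
CREATE TABLE journal (
    run_id  TEXT NOT NULL,
    seq     INTEGER NOT NULL,
    payload TEXT NOT NULL,
    PRIMARY KEY (run_id, seq)
);
"""


def read_lease(db, run_id):
    """Return (owner_id, expires_at, fencing_token), or None for an unknown run."""
    return db.execute(
        "SELECT owner_id, expires_at, fencing_token FROM runs WHERE run_id = ?",
        (run_id,),
    ).fetchone()


def cas_takeover(db, run_id, seen_owner, seen_token, me, reactor, now, ttl):
    """Write the new lease only if the row still matches what we read.

    Returns the new fencing token, or None if somebody else took over first.
    """
    new_token = seen_token + 1
    cur = db.execute(
        "UPDATE runs SET owner_id = ?, reactor_name = ?, expires_at = ?, "
        "fencing_token = ? "
        "WHERE run_id = ? AND owner_id IS ? AND fencing_token = ?",
        (me, reactor, now + ttl, new_token, run_id, seen_owner, seen_token),
    )
    db.commit()
    return new_token if cur.rowcount == 1 else None


def try_takeover(db, run_id, me, reactor, now, ttl):
    """Read, check expiry, then CAS. Returns the new token or None."""
    lease = read_lease(db, run_id)
    if lease is None:
        return None
    owner, expires_at, token = lease
    if expires_at > now:
        return None              # lease is still live; nothing to take over
    return cas_takeover(db, run_id, owner, token, me, reactor, now, ttl)


def append_event(db, run_id, token, payload):
    """Fenced write: insert only if `token` is the row's current fencing token."""
    cur = db.execute(
        "INSERT INTO journal (run_id, seq, payload) "
        "SELECT ?, (SELECT COALESCE(MAX(seq), -1) + 1 FROM journal "
        "           WHERE run_id = ?), ? "
        "WHERE EXISTS (SELECT 1 FROM runs "
        "              WHERE run_id = ? AND fencing_token = ?)",
        (run_id, run_id, payload, run_id, token),
    )
    db.commit()
    return cur.rowcount == 1     # False means the write was fenced out
```

Putting the token check inside the same statement as the insert is the design choice that matters here: a check in application code followed by a separate insert would leave a window in which a takeover could slip between the two.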
Why this is enough⌗
The lease is the only thing standing between "two workers race into the same wire transfer" and "exactly one worker, at a time, extends the journal." It works because the journal is the source of truth:
- The loser of a takeover race short-circuits harmlessly: it reads the same journal the winner is extending, and its next write is fenced out.
- The winner doesn't need to know the loser even existed — replay is deterministic given the journal, so it starts from the same state regardless of who held the previous lease.
- A network partition that strands a driver does not strand the run — the lease expires, another driver takes over, and the journal keeps growing.
There is no need for a separate coordination service (ZooKeeper, etcd, a leader-election library). The store is the coordinator. Every RunStore backend (SQLite, Postgres, AlloyDB, Bigtable) supports the CAS shape the lease needs.
When the lease isn't enough⌗
Two cases the lease cannot fix on its own:
- The upstream is non-idempotent. The lease prevents two drivers from attempting the dispatch at the same time, but it does not prevent a single driver from dispatching, crashing before recording, and a successor blindly retrying. The fix for that is the outbox dispatch + reconciliation pattern, which is built on top of leases.
- The store loses the journal. A corrupted store is outside Tape's recovery model. Run on a transactional backend (Postgres, AlloyDB, Spanner) for any workload where this matters, and back it up like any other system of record.
Operational shape⌗
| Knob | Default | Why you'd change it |
|---|---|---|
| Lease TTL | 30 s | Lower for faster takeover at the cost of more heartbeat traffic; raise if your reactors are slow. |
| Heartbeat interval | TTL / 3 | The standard safety factor. |
| Takeover backoff | jittered exponential, 1–10 s | Avoid thundering herd when N reactors all see an expired lease. |
| Max in-flight lease attempts | unlimited | The store CAS handles contention; the loser just retries. |
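To make the defaults concrete, here is a small sketch of the arithmetic behind the heartbeat interval and the takeover backoff. The table only says "jittered exponential, 1–10 s"; the specific jitter scheme below (uniform between the base and a doubling ceiling) is an assumption, and both function names are hypothetical.

```python
import random


def heartbeat_interval(ttl_s: float) -> float:
    """TTL / 3: two heartbeats can be missed before the lease expires."""
    return ttl_s / 3


def takeover_backoff(attempt: int, base_s: float = 1.0, cap_s: float = 10.0,
                     rng=random) -> float:
    """Delay before takeover attempt number `attempt` (0-based).

    The ceiling doubles with each attempt and is capped at cap_s; the
    actual delay is drawn uniformly from [base_s, ceiling] so that N
    reactors seeing the same expired lease do not retry in lockstep.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(base_s, ceiling)
```

With the default 30 s TTL, `heartbeat_interval(30)` gives a 10 s interval, and every backoff delay stays within the 1 to 10 s range from the table.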
The CLI exposes these per-reactor in tape.yaml:
Next⌗
- Reactors — what each reactor does under a lease.
- Replay & resume — the mechanics of resuming after a takeover.
- Architecture — where leases sit in the system.