Troubleshooting⌗
When something is wrong, work from the symptom down. Each section starts with the symptom, then what to check, then how to fix.
If tape doctor is unhappy, fix that first — it catches half the
problems on this page.
Installation & startup⌗
tape: command not found⌗
Check. Did pip install -e tape/cli succeed? Is pip's bin/
directory on PATH?
Fix. If installed into a venv, activate it. If bin/ isn't on
PATH, add it. On macOS with Homebrew Python, ensure
~/Library/Python/3.x/bin is on PATH.
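The two checks above can be scripted. This is a minimal sketch using only the Python standard library; `diagnose_command` is a hypothetical helper name, not part of tape. It reports the scripts directory of the interpreter that runs it, so run it with the same Python (or venv) you installed into. A `pip install --user` install places scripts elsewhere, which this sketch does not cover.

```python
import os
import shutil
import sysconfig


def diagnose_command(name: str = "tape") -> dict:
    """Report whether `name` resolves on PATH and where this Python puts scripts."""
    # Directory where pip places console scripts for this interpreter/venv.
    scripts_dir = sysconfig.get_path("scripts")
    path_entries = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    return {
        # Absolute path to the executable, or None if it is not on PATH.
        "found": shutil.which(name),
        "scripts_dir": scripts_dir,
        "scripts_dir_on_path": any(
            os.path.normpath(p) == os.path.normpath(scripts_dir)
            for p in path_entries
        ),
    }


if __name__ == "__main__":
    for key, value in diagnose_command().items():
        print(f"{key}: {value}")
```

If `found` is None and `scripts_dir_on_path` is False, adding `scripts_dir` to PATH is the likely fix.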
tape dev exits immediately with "no tape-server binary"⌗
Check. Do you have either cargo (for native mode) or docker
(for the bundled image)?
Fix. Install one. Or, if you've built the server separately, point
tape dev at it with tape dev --server-binary /path/to/tape-server.
Schema migration errors on startup⌗
Check. Did a previous server version run against this store, and have you upgraded since?
Fix. tape migrate --dry-run shows what's pending. If the migration
state is genuinely confused (rare), open an issue with the dry-run output
and the store dump.
Agent / SDK⌗
ValueError: @tape.outbox_tool ... requires at least one of business_key=, status_check=, compensate=⌗
This is intentional. A non-idempotent tool with no UNKNOWN-resolution strategy can't be safely retried.
Fix. Add at least one:
- business_key=lambda ... if the upstream can be queried by your identity.
- status_check=callable if you can call the upstream's status API.
- compensate=callable if you can undo a confirmed commit.
If you genuinely want to bypass this (after explicit review), pass
allow_unsafe=True. The server will still warn.
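To make the rule concrete, here is a rough sketch of the kind of check that produces this error. `outbox_tool_sketch` is a hypothetical stand-in written for illustration, not tape's implementation. The `charge_card` tool and its `order_id` field are also made up.

```python
from typing import Any, Callable, Optional


def outbox_tool_sketch(
    *,
    business_key: Optional[Callable[..., Any]] = None,
    status_check: Optional[Callable[..., Any]] = None,
    compensate: Optional[Callable[..., Any]] = None,
    allow_unsafe: bool = False,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical stand-in for the check described above; not tape's code."""
    strategies = (business_key, status_check, compensate)
    has_strategy = any(s is not None for s in strategies)
    if not has_strategy and not allow_unsafe:
        # No way to resolve an UNKNOWN outcome, so a retry cannot be made safe.
        raise ValueError(
            "requires at least one of business_key=, status_check=, compensate="
        )

    def decorate(fn: Callable[..., Any]) -> Callable[..., Any]:
        return fn  # a real decorator would also register the tool

    return decorate


# Accepted: the upstream can be looked up by the order id (hypothetical field).
@outbox_tool_sketch(business_key=lambda payload: payload["order_id"])
def charge_card(payload: dict) -> dict:
    return {"intent": "charge", **payload}
```

Calling `outbox_tool_sketch()` with no arguments raises the ValueError at decoration time, which is why the failure shows up on import rather than at the first tool call.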
Tool body is being re-executed on resume⌗
This is a contract violation. A CONFIRMED effect should be replayed
from history, not re-executed.
Check. What status is the effect in?
If the effect is PENDING (not CONFIRMED), the tool body legitimately
needs to re-run — your code crashed before recording the result.
Fix. If PENDING and the body has side effects you didn't want
repeated, you've either:
- Forgotten to mark the upstream as idempotent and pass idempotency_key: fix the tool body.
- Used @tape.outbox_tool but performed the upstream call in the body instead of returning intent: refactor so the body returns the payload only.
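The status distinction above can be modelled in a few lines. This is a toy model under stated assumptions: the `history` dict, its record shape, and `run_effect` are hypothetical and only illustrate why a PENDING effect re-runs its body while a CONFIRMED one does not.

```python
from typing import Callable, Dict, List


def run_effect(
    history: Dict[str, dict], effect_id: str, body: Callable[[], object]
) -> object:
    """Toy model: replay CONFIRMED results, execute the body otherwise."""
    record = history.get(effect_id)
    if record is not None and record["status"] == "CONFIRMED":
        return record["result"]  # replayed from history; body is not called
    # No record, or still PENDING: no result was recorded, so run the body.
    history[effect_id] = {"status": "PENDING", "result": None}
    result = body()
    history[effect_id] = {"status": "CONFIRMED", "result": result}
    return result


calls: List[int] = []
# Simulates a crash after the effect was registered but before a result was saved.
history = {"e-1": {"status": "PENDING", "result": None}}


def send_email() -> str:
    calls.append(1)  # stands in for a side effect
    return "sent"


first = run_effect(history, "e-1", send_email)   # PENDING: body runs
second = run_effect(history, "e-1", send_email)  # CONFIRMED: replayed
```

In this model the body runs once across both calls. Any side effect that happened before the crash would be repeated by the first call, which is the case the idempotency key is meant to cover.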
Replay diverges — agent makes a different decision second time⌗
Check. Is your prompt construction deterministic? Are you reading
wall-clock, randomness, or mutable external state outside tape.sample?
Did policy_version change between attempts?
Fix. Route nondeterministic reads through tape.sample(tool_context,
fn) or make them their own @tape.effect. Use tape.policy_is(...) to
branch on policy version explicitly. See
replay & resume.
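The idea behind routing nondeterministic reads through a recorder can be sketched as follows. `sample_sketch` is a hypothetical stand-in that takes an explicit log and key. It does not mirror the signature of tape.sample and only shows the record-then-replay pattern.

```python
import random
from typing import Callable, Dict


def sample_sketch(log: Dict[str, object], key: str, fn: Callable[[], object]) -> object:
    """Hypothetical stand-in: record fn() on the first attempt, reuse it on replay."""
    if key not in log:
        log[key] = fn()
    return log[key]


def decide(log: Dict[str, object]) -> str:
    # The random read goes through the recorder, so a replay sees the same value.
    roll = sample_sketch(log, "roll", random.random)
    return "escalate" if roll > 0.5 else "retry"


log: Dict[str, object] = {}
first = decide(log)
second = decide(log)  # replay against the same log
```

Both decisions agree because the second attempt reads the recorded value instead of drawing a new one. Calling `random.random()` directly inside `decide` would reintroduce the divergence.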
RunCancelled raised on a tool body that shouldn't have been cancelled⌗
Check. Who called tape.cancel_run?
Fix. Trace back to the caller. If cancel_run was called by a
timeout handler you set up, your timeout was too tight; bump it. If by
a user action, the user did the right thing.
Reactors⌗
A run is RUNNABLE for hours and never re-driven⌗
Check. Is the recovery reactor running? Is it leasing the run?
On GCP, the Cloud Run service for tape-reactor-recovery should have
min-instances >= 1. If min-instances=0, the reactor is sleeping.
Fix. Bump min-instances. Or, if you're event-driven, check the
Pub/Sub subscription metrics — the push endpoint might be returning
non-200 (Cloud Run cold start, missing IAM, etc.).
A reactor is stealing leases from another reactor⌗
This shouldn't happen. Leases are CAS-acquired; two reactors can't both think they own the same row.
Check. Are two reactors of the same kind running? That's fine — they coordinate via the lease. Are two reactors of different kinds both trying to handle the same state? That's a bug; file an issue.
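As a reminder of why compare-and-set acquisition rules out double ownership, here is a minimal in-memory sketch. `LeaseTable` is hypothetical, and a real store would use its own conditional write rather than a Python lock. Lease expiry is left out for brevity.

```python
import threading
from typing import Dict, Optional


class LeaseTable:
    """In-memory toy; a real store would use its own conditional write."""

    def __init__(self) -> None:
        self._owners: Dict[str, Optional[str]] = {}
        self._lock = threading.Lock()  # makes compare-and-set atomic here

    def compare_and_set(
        self, run_id: str, expected: Optional[str], new: Optional[str]
    ) -> bool:
        with self._lock:
            if self._owners.get(run_id) != expected:
                return False  # someone else changed the row first
            self._owners[run_id] = new
            return True

    def owner(self, run_id: str) -> Optional[str]:
        with self._lock:
            return self._owners.get(run_id)


table = LeaseTable()
# Both reactors saw the row as unowned and try to claim it.
won_a = table.compare_and_set("r-1", expected=None, new="reactor-a")
won_b = table.compare_and_set("r-1", expected=None, new="reactor-b")
```

Only the first claim succeeds. The second fails because the row no longer matches what reactor-b expected, so it has to re-read before trying again.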
Reactor lag is climbing⌗
Check. Is the reactor CPU-bound? Are there unusually many runs in the watched state?
Fix. Scale up (gcloud run services update <service> --max-instances=N).
Or switch to event-driven mode for that reactor.
Effects⌗
Effect stuck in UNKNOWN⌗
Check. Is the reconciler reactor running? Does the effect's tool
have a status_check or its connector have an observe? Is the
upstream's lookup endpoint healthy?
Fix. If the reconciler is running and observe() keeps returning
inconclusive, the upstream is degraded — the effect is correctly
marked STUCK. Page yourself, look at the upstream's books, and run
tape resolve --effect <id> --as confirmed|failed.
Effect went CONFIRMED but the upstream never received it⌗
This is rare and indicates an upstream contract violation (the upstream
acked, then dropped the request). It is also what happens if you
manually run tape resolve --as confirmed when you shouldn't have.
Check. What was the connector's dispatch() result? What was the
external_ref?
Fix. Re-dispatch by hand: register a fresh effect via the agent or
the API, idempotency-keyed differently so it doesn't dedup. If you
caused this with a manual resolve, you've taken on a compensation
obligation; see compensate_run.
Storage⌗
Bigtable schema isn't recognised⌗
Check. The Bigtable Terraform module creates the table + family + GC policy. Did you skip that?
Fix.
tape doctor --gcp --fix # creates missing schema if possible
# or by hand:
cbt -instance <inst> createtable tape families=m
cbt -instance <inst> setgcpolicy tape m maxversions=1
Spanner refuses to start⌗
This is intentional. The Spanner backend is feature-flagged.
Fix. Either set TAPE_SPANNER_EXPERIMENTAL=1 and accept the trade-off,
or switch to AlloyDB / Bigtable. See stores.
Auth & IAM⌗
PermissionDenied: caller does not have role roles/run.invoker⌗
Check. Whose identity is the SDK using?
Fix. Grant the caller roles/run.invoker on tape-server:
gcloud run services add-iam-policy-binding tape-server \
--member="serviceAccount:<caller>@<project>.iam.gserviceaccount.com" \
--role="roles/run.invoker" --region=<region>
See IAM cheat sheet.
Unauthenticated: invalid audience⌗
Check. Is the tapes:// URL pointing at the Cloud Run service URL?
The ID token's audience is bound to that URL.
Fix. Match the URL exactly. tapes://tape-server-...run.app is
right; tapes://internal-lb.example.com (your LB) is not — re-bind the
audience to the LB URL or hit the service URL directly.
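When it is unclear which audience a token was minted for, decoding its payload usually settles it. The sketch below reads the `aud` claim from a JWT using only the standard library. It does not verify the signature, so treat it as a debugging aid only. How you obtain the ID token, and which URL you compare against, are left to you.

```python
import base64
import json


def token_audience(id_token: str):
    """Return the `aud` claim of a JWT without verifying its signature."""
    parts = id_token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims.get("aud")


def audience_matches(id_token: str, expected_url: str) -> bool:
    """True only on an exact string match, mirroring the 'match exactly' advice."""
    return token_audience(id_token) == expected_url
```

A trailing slash or a different hostname is enough to make the comparison fail, which is the usual cause when a load balancer sits in front of the service.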
When all else fails⌗
tape doctor --dump-run r-<id> # full state of one run, JSON
tape logs --service tape-server --limit 200
Attach both to a GitHub issue with what you expected vs. what you saw. The maintainers' best debugging starts there.