RBE / Cache / Runner SLO Definitions


Status: draft, working (W5.1 / TIN-1477, under parent E5/TIN-1449). Snapshot date: 2026-05-18. Audience: Jess (primary operator) and Codex (proofs/guards/eligibility).

Frame

SLOs exist on GloriousFlywheel because the platform now has more than one authority surface that can quietly degrade: shared cache (CAS + action cache), the gf-reapi-cell REAPI endpoint, ARC runners, RustFS-backed cache storage, and external-input authority. The product contract is “one shared substrate for local development and CI” — when that substrate degrades, the failure mode is not loud (a red CI badge), it is slow (cache miss rate creeping up, queues filling, retries papering over digest mismatches). SLOs are the contract that turns those slow drifts into countable events.

The principle this doc enforces: an SLO without an error budget is decoration. Each SLI here gets a target, a 30-day budget in real units, a named gate epic, and a written consequence when the budget trips. SLOs that don’t gate anything are deleted, not demoted.

This doc serves the six-gate frame for the RBE Production Readiness initiative:

  • E1 CAS authority — TIN-1445
  • E2 AC authority — TIN-1446
  • E3 external input authority — TIN-1447
  • E4 tenant model — TIN-1448
  • E5 observability (parent) — TIN-1449
  • E6 target-class breadth — TIN-1450

Sprint horizon: 2026-05-18 → 2026-08-31. Current gate scorecard at the start of this sprint: 27/60 ≈ 45%.

The SLO Table

Each row is one SLI. “Source” is where the value comes from today, not where it should come from after E5 lands. Where a metric name does not yet exist in the repo or emitter, it is marked proposed so readers don’t grep for ghosts.

| SLI | Definition | Measurement source | Target | Error budget (30d) | Gates / consequence on breach |
|---|---|---|---|---|---|
| CAS hit rate | cas_hits / (cas_hits + cas_misses) over warm CI builds, excluding cold-clone first 60s | proposed gf_reapi_cas_hit_ratio from gf-reapi-cell worker logs; until then, bazel-cache access log parse | ≥ 90% | ~4.5h/30d under target (~0.62%) | E1/TIN-1445 CAS authority close; sustained breach freezes E6 acceleration |
| Action cache hit rate | ac_hits / actions_executed on warm CI (post-first-run) | proposed gf_reapi_ac_hit_ratio; today via Bazel --experimental_remote_cache_async/info parse | ≥ 70% | ~9h/30d under target (~1.25%) | E2/TIN-1446 AC authority close; gates E6/TIN-1450 acceleration |
| Analysis cache hit rate (Bazel-side) | Bazel client-side analysis cache reuse on warm CI runs; not REAPI AC | Bazel --profile JSON analysis phase cached + analysis_cache_hit events | ≥ 80% | ~6h/30d under target (~0.83%) | Diagnostic; informs runner workdir contract, not a hard E-gate |
| Remote action p50 latency (CppCompile) | Median end-to-end action latency for CppCompile mnemonic | proposed gf_reapi_action_latency_seconds{mnemonic="CppCompile"}, p50 | < 2s | (latency, not budget-window; uses rolling 7d p50) | E6/TIN-1450: CppCompile target class promotion blocked above target |
| Remote action p95 latency (CppCompile) | 95th percentile end-to-end | same, p95 | < 5s | rolling 7d p95 must stay under target on 28/30 days | E6/TIN-1450 acceleration |
| Remote action p99 latency (CppCompile) | 99th percentile end-to-end | same, p99 | < 15s | rolling 7d p99 must stay under target on 25/30 days | E6/TIN-1450 acceleration |
| Remote action p50/p95 (GoCompile) | Same shape, GoCompile mnemonic | proposed gf_reapi_action_latency_seconds{mnemonic="GoCompile"} | p50 < 1s, p95 < 5s | as above | E6/TIN-1450 |
| Remote action p50/p95 (TestRunner) | Same shape, TestRunner mnemonic (e.g. Vitest, Go test) | proposed gf_reapi_action_latency_seconds{mnemonic="TestRunner"} | p50 < 3s, p95 < 30s | as above | E6/TIN-1450; relevant to //app:unit_tests proof class |
| Remote action p50/p95 (GenRule) | Same shape, GenRule mnemonic | proposed gf_reapi_action_latency_seconds{mnemonic="GenRule"} | p50 < 2s, p95 < 10s | as above | E6/TIN-1450 |
| Scheduler queue time p95 | Time between action submit and worker pickup, p95 across all mnemonics | implemented gf_reapi_scheduler_queue_seconds_bucket/_sum/_count by instance_name and pool plus gf_reapi_worker_pool_available_slots; first Grafana view in docs/monitoring/gf-reapi-fairness-dashboard.json | < 2s | breach if > 2s for > 4.5h/30d | E4/TIN-1448 tenant model gates; sustained breach freezes E6 |
| Action retry rate | Fraction of actions retried at the REAPI client (Bazel --remote_retries) | Bazel BES + proposed gf_reapi_action_retries_total | < 1% | > 1% for > 1h/30d (~0.14%) | E1/TIN-1445 + E2/TIN-1446; pages on sustained breach |
| Digest-mismatch rate (poison signal) | Count of digest mismatches observed during CAS read / AC lookup, normalized by action count | implemented minimum gf_reapi_digest_mismatch_total{path="read\|write"}; richer labels proposed | < 1e-6 | any single mismatch is a paged incident; “budget” is per-quarter not 30d | Hard halt on E1/TIN-1445; freezes E2/TIN-1446 and E6/TIN-1450 until RCA |
| Eviction rate (CAS) | Bytes evicted per hour from CAS by the configured policy | proposed gf_reapi_cas_bytes_evicted_total rate | < 5% of CAS volume/24h | sustained > 5%/24h for 3 consecutive days | E1/TIN-1445 capacity contract |
| Bytes-evicted-while-referenced (poison) | Bytes evicted from CAS for digests still referenced by an unfinished action | proposed gf_reapi_cas_bytes_evicted_referenced_total | 0 | any nonzero value over 7d is paged | E1/TIN-1445 hard halt; correctness gate |
| Per-tenant queue-time skew | max(p95_queue_per_tenant) / median(p95_queue_per_tenant) across active tenants | implemented through gf_reapi_scheduler_queue_seconds_bucket{instance_name="...",pool="..."} and docs/monitoring/gf-reapi-fairness-dashboard.json skew panel | < 2× | breach if skew > 2× for > 9h/30d (~1.25%) | E4/TIN-1448: tenant model close depends on this |
| Time-to-first-cache-hit (TTFCH) | Wall time from git clone complete to first observed remote cache hit during a warm CI build of a representative target; samples with cache metadata timestamps before clone completion are counted as ok outcomes but excluded from the latency histogram | implemented first-slice gf_runner_ttfch_seconds histogram from .github/workflows/ttfch-probe.yml; dashboard contract in docs/monitoring/gf-runner-ttfch-dashboard.json | < 90s | breach if > 90s on > 3 days/30d | E5/TIN-1449 close gate; customer-shaped KPI; not closed until sustained live samples exist |
| Vendor-mode CI lane green rate | Green / (green + red) rate for the vendor-mode external-input authority CI lane (nightly + on-demand) | GitHub Actions workflow .github/workflows/gf-bazel-vendor-mode.yml; evidence artifact bazel-vendor-mode-evidence.json; operator rollup just e3-external-input-authority-status | nightly green for ≥ 14 consecutive days | one red day = budget consumed; 3 reds/30d trips | E3/TIN-1447 close gate |
| Lockfile-error CI pass rate | Fraction of CI runs where Bazel does not emit a MODULE.bazel.lock digest-mismatch or missing-entry error | Source Bazel Proof workflow log scan; proposed gf_bazel_lockfile_error_total | ≥ 99.5% | breach if > 0.5% lockfile-error rate over 7d trailing window | E3/TIN-1447 + ties to bazel-external-input-manifest |
| gf-reapi-cell availability | Successful REAPI gRPC response rate, excluding client-side cancellations | proposed gf_reapi_grpc_server_handled_total{code="OK"} / gf_reapi_grpc_server_started_total | ≥ 99.5% | ~3.6h/30d unavailability (rolling 7d window) | Hard halt across all gates: if the cell is unreachable, no SLO below it is meaningful. Audit deferred from 2026-05-18 review. |
| AC write rejection rate | Fraction of AC write attempts rejected for attestation/identity/platform/ref reasons (per the reject_reason schema in W2.3 / ac-writer-attestation-design.md) | proposed gf_reapi_ac_write_rejected_total / gf_reapi_ac_write_attempts_total | nonzero is normal | spike pages if rejection rate > 5% over 1h trailing window | E2/TIN-1446: operational signal (client misconfig or attestation regression), not a poison signal. Sourced from W2.3 audit log. |
| GF REAPI AC Attestation Chaos | Nightly and AC-path-triggered CI check that an authenticated but non-attested AC writer gets gRPC PermissionDenied / HTTP 403 and leaves no AC entry | GitHub Actions workflow GF REAPI AC Attestation Chaos; local target just gf-reapi-ac-attestation-chaos-check | 100% green | any red run pages GF RBE on-call and freezes E6 target-class acceleration | E2/TIN-1446 / W2.5: proves the non-attested writer rejection stays closed while default RBE remains disabled. |

Notes on the table:

  • “rolling 7d p95” rows are not standard 30d budget windows because percentile metrics don’t compose into a time budget the way ratio SLIs do. The 28/30 and 25/30 day formulations are placeholders; see Open Questions.
  • “Poison signal” rows (digest mismatch, evicted-while-referenced) do not get a normal error budget. They are correctness invariants; nonzero values page immediately. They are recorded here so the dashboard layout (W5.3) knows to separate them from latency/availability SLIs.
  • Most gf_reapi_* metric names are proposed. The gf-reapi-cell binary now exports a minimal /metrics endpoint for the digest-mismatch poison counter and continues to emit worker/platform/action/command evidence in logs. Broader latency, queue, cache-hit, availability, and tenant metrics remain reserved here so E5/TIN-1449 implementation can use them without renaming later.
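
To make the “28/30 days” placeholder concrete, the sketch below counts the days on which a trailing 7d percentile stays under target. It is an illustration only: the function names, the per-day list-of-samples input shape, and the nearest-rank percentile method are assumptions, not the behaviour of any existing emitter or dashboard.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile (integer q in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def days_under_target(daily_samples, target_s, q=95, window_days=7):
    """Count days whose trailing-window percentile is below target_s.

    daily_samples: list of per-day latency lists (seconds), oldest first.
    Days early in the series use a shorter window (hypothetical choice).
    """
    good_days = 0
    for day in range(len(daily_samples)):
        start = max(0, day - window_days + 1)
        window = [s for d in daily_samples[start:day + 1] for s in d]
        if window and percentile(window, q) < target_s:
            good_days += 1
    return good_days


def meets_day_budget(daily_samples, target_s, required_good_days, q=95):
    """True when enough days had an in-target rolling percentile."""
    return days_under_target(daily_samples, target_s, q) >= required_good_days
```

Under these assumptions, one sufficiently bad day keeps the rolling 7d p95 over target for every window that contains it, so a single incident can consume up to seven "days" of a 28/30 allowance. That interaction is one reason to treat the day counts as placeholders.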

Targets

Concrete numbers, restated for the operator who is reading just this section:

  • CAS hit rate ≥ 90% on warm CI (post-cold-clone window). Cold-clone first 60s is excluded; otherwise the first build of every fresh runner workspace poisons the ratio.
  • Action cache hit rate ≥ 70% on warm CI. Below this, incremental CI rebuilds stop feeling like a shared substrate.
  • Analysis cache hit rate ≥ 80% on warm CI. This is Bazel-client-side and driven by --disk_cache / workdir reuse, not REAPI.
  • p95 remote action latency < 5s for *Compile mnemonics (CppCompile, GoCompile). Compile-shaped actions dominate the felt experience of “is RBE worth it?”
  • p95 scheduler queue time < 2s. Above 2s, remote execution costs more than local execution for a non-trivial fraction of actions.
  • Action retry rate < 1%. Retries hide correctness problems; they should be rare enough to investigate one-by-one when they happen.
  • Digest-mismatch rate < 1e-6 (effectively zero). Any single mismatch pages.
  • Vendor-mode CI lane green nightly for ≥ 14 consecutive days before E3 is allowed to close.
  • TTFCH < 90s for a fresh clone of GloriousFlywheel on a representative tinyland-nix runner. The first probe contract is implemented in ttfch-probe.md, but the SLO is not satisfied until the hourly ttfch-probe.yml workflow has enough sustained live samples.
  • Per-tenant queue-time skew < 2× between the slowest and median active tenant. (Tenant model is E4/TIN-1448; this is the SLO the model has to make honest.)
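
The skew target in the last bullet follows the table definition, max(p95_queue_per_tenant) / median(p95_queue_per_tenant). A minimal sketch, assuming the per-tenant p95 values have already been computed elsewhere; the function name, the dict input, and the tenant keys in the usage note are hypothetical:

```python
from statistics import median


def queue_skew(p95_by_tenant):
    """Return max(p95) / median(p95) across active tenants.

    p95_by_tenant: dict mapping tenant name -> p95 queue time in seconds.
    Returns None when there are no tenants or the median is not positive,
    since the ratio is undefined in those cases.
    """
    values = [v for v in p95_by_tenant.values() if v is not None]
    if not values:
        return None
    mid = median(values)
    if mid <= 0:
        return None
    return max(values) / mid


def skew_in_target(p95_by_tenant, limit=2.0):
    """True when skew is defined and below the limit (2x in this doc)."""
    skew = queue_skew(p95_by_tenant)
    return skew is not None and skew < limit
```

For example, `queue_skew({"tenant-a": 0.5, "tenant-b": 1.0, "tenant-c": 3.0})` gives 3.0, which would count against the < 2× target.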

Error budgets

A 30-day error budget is the operator’s permission to accept some failure without paging. When the budget is spent, work that increases risk on that surface stops until the budget recovers. Concretely:

  • CAS hit rate. Budget = 10% of time below 90% = ~72h/30d. Action: at >50% budget burn rate in any 7d window, freeze E6/TIN-1450 acceleration work and open a CAS-authority hot-spot RCA.
  • Action cache hit rate. Budget = 30% of time below 70% = ~216h/30d. Action: at >50% burn in any 7d window, freeze E6/TIN-1450 and notify Codex (AC authority owner via E2/TIN-1446).
  • Analysis cache hit rate. Diagnostic-only; budget burn alerts to the runner-workdir contract owner but does not freeze an epic.
  • Remote action latency (p95). Budget = “p95 must be under target on 28 of 30 days”, which allows 2 bad days per 30d. Action: at 3 or more bad days, pause new mnemonic onboarding into E6/TIN-1450.
  • Scheduler queue time p95. Budget = 4.5h/30d over target (~0.62%). Action: when the budget is spent, page during business hours and freeze E4/TIN-1448 close.
  • Action retry rate. Budget = 1h/30d over 1% (~0.14%). Action: immediate operator page (this is the leading indicator for CAS/AC authority failure).
  • Digest mismatch. No budget. One event = page + halt E1/E2 work pending RCA.
  • Eviction-while-referenced. No budget. One event = page + halt E1 work pending RCA.
  • Per-tenant queue skew. Budget = 9h/30d > 2× (~1.25%). Action: skew budget burn blocks E4/TIN-1448 close until tenant model is reworked or capacity is rebalanced.
  • TTFCH. Budget = 3 days/30d over 90s. Action: budget burn blocks E5/TIN-1449 close.
  • Vendor-mode green rate. Budget = 3 red nights/30d. Hitting that limit reopens the E3/TIN-1447 close gate.
  • Distdir full package proof green rate. Budget = 3 red nights/30d for the Bazel Distdir Full Package Proof workflow. A red run means TIN-1468 package completeness regressed before durable backend authority is even in play.
  • Lockfile-error rate. Budget = 0.5% over 7d trailing. Burn ties back to bazel-external-input-manifest and E3/TIN-1447.

Operationally: a budget that trips triggers (a) a freeze on the named gate epic, (b) a page if the SLI is in the paging set (retry rate, queue time, poison signals, TTFCH), and (c) a short written RCA before the freeze lifts. Paging routing and on-call rotation are explicitly out of scope for this doc — see W5.2.
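
The hour figures in this section are a fraction of a 30-day (720h) window. The sketch below shows that arithmetic and one possible reading of “>50% burn in any 7d window”, namely that more than half of the 30d budget was consumed inside a single 7d window. The function names and that interpretation are assumptions for illustration, not an agreed alerting rule.

```python
WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day budget window


def budget_hours(budget_fraction, window_hours=WINDOW_HOURS):
    """Convert a budget fraction (e.g. 0.00625) into hours of allowed breach."""
    return budget_fraction * window_hours


def budget_consumed(bad_hours, budget_h):
    """Fraction of the budget spent; 1.0 or more means the budget tripped."""
    return bad_hours / budget_h


def burn_exceeds(bad_hours, budget_h, threshold):
    """True when more than `threshold` (0..1) of the budget is consumed."""
    return budget_consumed(bad_hours, budget_h) > threshold


# Hypothetical usage: a 0.625% budget is about 4.5h of breach per 30 days.
scheduler_budget_h = budget_hours(0.00625)
# 2.5h of breach inside one 7d window is more than half of that budget.
freeze = burn_exceeds(2.5, scheduler_budget_h, 0.5)
```

With these inputs, 0.625% of 720h is about 4.5h, 1.25% is about 9h, and 0.5% is about 3.6h, matching the table's per-row hour figures.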

Gates

Cross-references, written so each epic can be closed against an SLO row:

  • E1 / TIN-1445 (CAS authority) closes when: CAS hit rate ≥ 90% for 14 consecutive days and digest-mismatch rate = 0 over the same 14 days and evicted-while-referenced = 0 over the same 14 days. Backend candidate must pass just ha-state-candidate-static-gate-shaped contract checks (CAS-flavored, proposed: just cas-primitives-static-gate).
  • E2 / TIN-1446 (AC authority) closes when: AC hit rate ≥ 70% for 14 consecutive days and action retry rate < 1% over the same window.
  • E3 / TIN-1447 (external input authority) closes when all of the following hold: (a) the vendor-mode CI lane is green nightly for 14 consecutive days; (b) Bazel Distdir Full Package Proof is green nightly for the same 14 days; (c) the non-secret authority package exists, passes the package gate, and has posture=proof_ready; (d) at least one reviewed Bazel Distdir Mirror Live Proof run passes for that selected package; and (e) lockfile-error rate ≤ 0.5% over the same window. Ties to scripts/e3-external-input-authority-status.py / just e3-external-input-authority-status, scripts/validate-rbe-target-eligibility.py invariants, scripts/bazel-distdir-full-package-proof.sh, scripts/bazel-distdir-mirror-live-proof.py, and the existing bazel-external-input-manifest artifact.
  • E4 / TIN-1448 (tenant model) closes when: per-tenant queue-time skew < 2× sustained over 14d and scheduler queue p95 < 2s sustained over 14d. The tenant model must make these numbers honest, not vice versa.
  • E5 / TIN-1449 (observability) closes when: TTFCH < 90s for 14 consecutive days on the synthetic probe and every SLO row above is emitting from a named metric source (no proposed markers remain for any closing-gate SLO).
  • E6 / TIN-1450 (target-class breadth) closes when: each promoted target class meets its mnemonic-specific p50/p95/p99 targets sustained over 14d, and AC hit rate ≥ 70% over the same window, and digest-mismatch rate = 0. New target-class promotions in config/rbe-target-eligibility.json require a matching SLO row in this doc.
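
Most close conditions above share one shape: several daily checks must all pass over the same trailing 14 days. A minimal sketch of that evaluation, assuming each condition has already been reduced to one boolean per day; the function names and the condition keys in the usage note are hypothetical, and this is not the logic of any existing status script.

```python
def trailing_streak(daily_ok):
    """Length of the run of True values ending at the most recent day."""
    streak = 0
    for ok in reversed(daily_ok):
        if not ok:
            break
        streak += 1
    return streak


def gate_closable(conditions, required_days=14):
    """Check that every condition held on each of the last required_days.

    conditions: dict of condition name -> list of daily booleans,
    oldest first, all covering the same days.
    """
    if not conditions:
        return False
    lengths = {len(days) for days in conditions.values()}
    if len(lengths) != 1:
        raise ValueError("all conditions must cover the same days")
    # A day counts only if every condition passed on that day.
    combined = [all(day) for day in zip(*conditions.values())]
    return trailing_streak(combined) >= required_days
```

For an E1-shaped check this might be called as `gate_closable({"cas_hit_ok": [...], "digest_mismatch_zero": [...], "evicted_referenced_zero": [...]})`. Combining the conditions per day before counting the streak is what enforces “over the same 14 days” rather than three independent streaks.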

The runner-dashboard SvelteKit surface (under app/ and surfaced by //app:build proofs) is the intended operator-facing render of these SLOs. config/rbe-target-eligibility.json remains the source of truth for which target classes are even eligible to be measured.

Out of scope

Explicitly not defined in this doc:

  • Per-tenant per-mnemonic targets. Tenants get a global queue-skew SLO here. Per-mnemonic-per-tenant targets wait until E4/TIN-1448 produces a real tenant model.
  • Paging routing and on-call rotation. Which SLI pages whom, on what schedule, with what severity ladder — that’s W5.2.
  • Dashboard layout. The first W5.3 Grafana contract lives at docs/monitoring/gf-reapi-fairness-dashboard.json. The runner-dashboard SvelteKit SLO page, burn-rate rendering, and alert composition remain follow-up E5 work.
  • Metric backend selection. Whether gf-reapi-cell emits Prometheus natively, emits OTel and relies on a collector, or does both is left to Open Questions. This doc names SLIs without committing to an emitter.
  • Cost/efficiency SLIs. No bytes-per-action, GB-hours-per-build, or $-per-CI-minute targets. Cost is real but it is not a correctness or liveness SLO.
  • Local-developer-side cache hit rate. Developer-machine BAZEL_REMOTE_CACHE attachment is bounded (operator-provided endpoint only, per TIN-758). Developer-side SLOs wait until that exposure policy is revisited.
  • Cross-overlay (Tinyland / Jess) concurrency policy. Per the current-state doc, this is not yet a real surface; SLOs cannot define it into existence.
  • RustFS HA / state-authority SLOs. TIN-1012 owns that surface. This doc deliberately does not double-count the existing just tofu-state-ha-readiness gate as an RBE SLO.

Open questions

These are the questions the next iteration of this doc must answer before E5/TIN-1449 can close. Each is named so it can be turned into a sub-ticket under TIN-1449.

  1. Which metric backend is the authority? Prometheus? OTel? Both with a translation layer? Today the gf-reapi-cell binary emits structured logs and exports only a minimal /metrics endpoint for the digest-mismatch counter; no backend has been named as the authority. The next SLI to emit will set precedent. Recommendation pending: pick one, write it down, and stop debating.
  2. 30d vs 7d budget window. This doc uses 30d for ratio SLIs and a rolling 7d for percentile latency SLIs. The inconsistency is deliberate: percentile SLOs don’t compose cleanly into a 30d time budget. A future decision: either accept the inconsistency and document the math, or convert latency SLOs to Apdex-style ratios.
  3. Which mnemonics get individual p95 targets vs aggregate? This doc names CppCompile, GoCompile, TestRunner, GenRule. The full Bazel mnemonic set is much larger (e.g. Javac, SkylarkAction, JsRunBinary, Linkstamp). Need a rule for when a new mnemonic gets its own row vs falls into “all other actions, aggregate p95 < 10s”.
  4. How is TTFCH measured — cold cache, cold clone, or both? The target as written is “cold clone, warm cache” (90s). A separate cold-clone-and-cold-cache TTFCH SLO probably belongs alongside, with a looser target. Pending: write both, agree on which one gates E5.
  5. What is the canonical SLO probe target? TTFCH and several latency SLIs require a representative target. //app:build is the default-branch proof and a natural candidate, but it is a SvelteKit build (JsRunBinary-heavy), which does not exercise CppCompile or GoCompile. The probe set likely needs to be ≥ 2 targets to cover the proved mnemonic surface.
  6. How are external proofs (the Jesssullivan/8311 public-vendor handoff) represented in SLOs? The proof class exists in config/rbe-target-eligibility.json but is run from a consumer workspace. Pending: decide whether external-consumer SLIs are in-scope here or live in a separate consumer-facing doc.
  7. What is the SLO grace period for a newly-promoted target class? When a target class moves from candidate to proved in config/rbe-target-eligibility.json, does it get a grace window before its p95 latency starts counting against the 30d budget? Pending: propose 7 days.
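
As input to question 2, here is what converting a latency SLO into a ratio could look like. This is a simplified good-event ratio, not the full Apdex formula (which also gives half credit to a "tolerating" band); the function names and the 95% example target are assumptions for illustration.

```python
def good_ratio(latencies_s, threshold_s):
    """Fraction of actions that finished under threshold_s.

    Returns None for an empty sample set, where the ratio is undefined.
    """
    if not latencies_s:
        return None
    good = sum(1 for value in latencies_s if value < threshold_s)
    return good / len(latencies_s)


def ratio_slo_met(latencies_s, threshold_s, target_ratio):
    """True when the good-event ratio is defined and meets target_ratio."""
    ratio = good_ratio(latencies_s, threshold_s)
    return ratio is not None and ratio >= target_ratio
```

Expressed this way, “p95 < 5s” becomes roughly “at least 95% of actions finish in under 5s”, which is a ratio SLI and can share the 30d budget window used by the hit-rate rows. Whether that trade is worth losing the percentile view is the open decision.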
