2026-05-24 RBE Production-Gate Plan

Current Truth

GloriousFlywheel is a cache-first Bazel/Nix runner substrate with real, target-scoped REAPI evidence. Current main is ac05f68d559ee4d36769681136d98df6e4c3957e after PR #777. First-party CI dogfoods shared tinyland-* capability-class runner lanes; hosted runners are not an acceptable fallback for this repo.

config/rbe-target-eligibility.json records 34 proved target classes across source-repo targets and spoke-style consumers such as omux.xoxd.ai, jesssullivan.github.io, MassageIthaca, and tinyland.dev. Those proofs include remote check, test, smoke, package, and build work. They do not yet make GloriousFlywheel default-capable multi-tenant RBE.
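As a rough illustration of how the manifest count could be checked mechanically, the sketch below tallies proved target classes by kind. The schema is an assumption: the field names target_classes, id, consumer, kind, and proved, and the sample entries, are hypothetical and may not match the real config/rbe-target-eligibility.json.

```python
import json
from collections import Counter

# Hypothetical manifest shape and entries, for illustration only.
# The real config/rbe-target-eligibility.json may use different fields.
SAMPLE_MANIFEST = """
{
  "target_classes": [
    {"id": "example-check", "consumer": "source-repo", "kind": "check", "proved": true},
    {"id": "example-test", "consumer": "tinyland.dev", "kind": "test", "proved": true},
    {"id": "example-build", "consumer": "omux.xoxd.ai", "kind": "build", "proved": false}
  ]
}
"""


def summarize_proved(manifest_text):
    """Return (number of proved classes, Counter of proved classes per kind)."""
    manifest = json.loads(manifest_text)
    proved = [tc for tc in manifest["target_classes"] if tc.get("proved")]
    return len(proved), Counter(tc["kind"] for tc in proved)


count, by_kind = summarize_proved(SAMPLE_MANIFEST)
print(count, dict(by_kind))
```

A check like this could run in CI so that the class count quoted in docs cannot drift from the manifest without a failing job.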

Live runner capacity is not the core blocker for default RBE. At the latest review, honey, bumble, and sting were Ready; all tinyland-* ARC scale sets were Running; and tinyland-nix-heavy was committed with a 64Gi memory request, 160Gi memory limit, 192Gi ephemeral-storage request, and 256Gi ephemeral-storage limit. The May 28 W3.4 canary showed that this target class is scratch-heavy rather than memory-heavy, so the production correction is a schedulable request plus a large burst limit, not a repo-specific runner label or a hosted-runner fallback.
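The "schedulable request plus a large burst limit" shape can be sanity-checked with a few lines of arithmetic. The sketch below uses the request and limit values quoted above; the node allocatable figures are hypothetical, and the fit check is simplified (a real scheduler also subtracts the requests of pods already placed on the node).

```python
# Binary suffixes used by Kubernetes resource quantities (subset only).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity):
    """Convert a quantity such as '64Gi' to bytes. Handles binary suffixes only."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


# Request and limit values taken from the text above for tinyland-nix-heavy.
NIX_HEAVY = {
    "requests": {"memory": "64Gi", "ephemeral-storage": "192Gi"},
    "limits": {"memory": "160Gi", "ephemeral-storage": "256Gi"},
}

# Hypothetical node allocatable values, for illustration only.
EXAMPLE_ALLOCATABLE = {"memory": "128Gi", "ephemeral-storage": "512Gi"}


def is_schedulable(resources, allocatable):
    """True if each request is within its limit and fits an otherwise empty node."""
    for name, request in resources["requests"].items():
        request_bytes = parse_quantity(request)
        if request_bytes > parse_quantity(resources["limits"][name]):
            return False
        if request_bytes > parse_quantity(allocatable[name]):
            return False
    return True


def burst_headroom(resources, name):
    """Bytes the pod may use beyond its request before hitting the limit."""
    limit = parse_quantity(resources["limits"][name])
    request = parse_quantity(resources["requests"][name])
    return limit - request


print(is_schedulable(NIX_HEAVY, EXAMPLE_ALLOCATABLE))
print(burst_headroom(NIX_HEAVY, "memory") // 2**30, "Gi memory headroom")
```

Only requests are compared against the node here, because limits are allowed to exceed what the scheduler reserves; that gap is the burst headroom.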

Sprint Priority

The next sprint prioritizes production gates over additional ad hoc target breadth. New target classes are valuable only when they also tighten one of the production contracts below.

Truth and tracker alignment. Keep docs/current-state.md, docs/roadmap.md, Linear, and GitHub issues aligned to current main, the 34-class manifest, dogfood routing, and live runner posture.
E2 action-cache authority. Treat attested writers, platform tagging, AC audit, nuke-key, chaos, and poison alerting as one trust system. Do not call broad/default RBE safe until the system has end-to-end evidence.
E3 external-input authority. Finish durable distdir/repository-cache authority for non-BCR inputs. Lockfile and vendor checks are necessary, but not enough without durable byte authority and restore evidence.
E4 tenant enforcement. Convert declaration into enforcement: remote_instance_name, IAM/OIDC, executor pools, quotas, structured denials, and self-service spoke onboarding must line up.
E5 operator and developer visibility. Expose TTFCH, split CAS/AC/analysis hit rates, fairness, poison signals, and queue/blocker diagnosis as operator- and agent-readable surfaces.
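To make the E2 and E4 items concrete, the following sketch derives client-side Bazel flags per CI lane, with cache uploads enabled only for a trusted lane and the tenant carried in --remote_instance_name. The lane names, the TRUSTED_WRITER_LANES set, and the endpoint are hypothetical. Client flags are only the declaration half: enforcement still has to happen server-side (IAM/OIDC, executor pools, quotas), since a client can always pass different flags.

```python
# Hypothetical lane names; only lanes listed here may upload cache results.
TRUSTED_WRITER_LANES = frozenset({"main-attested"})


def bazel_remote_flags(lane, tenant, endpoint="grpcs://cache.example.invalid"):
    """Build Bazel remote-cache flags for a lane.

    The endpoint is a placeholder. PR and developer lanes get
    --noremote_upload_local_results, so they read from the cache but do not
    upload locally executed results.
    """
    flags = [
        f"--remote_cache={endpoint}",
        f"--remote_instance_name={tenant}",
        "--remote_accept_cached",
    ]
    if lane in TRUSTED_WRITER_LANES:
        flags.append("--remote_upload_local_results")
    else:
        flags.append("--noremote_upload_local_results")
    return flags


for lane in ("pr", "main-attested"):
    print(lane, " ".join(bazel_remote_flags(lane, tenant="example-tenant")))
```

Defaulting every unknown lane to read-only keeps the failure mode safe: a misnamed lane loses write access instead of gaining it.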

Default-RBE Promotion Gate

Broad/default RBE requires all of the following evidence:

no first-party hosted-runner fallback in GloriousFlywheel CI
a durable CAS/action-cache substrate with restore, retention, quota, audit, and failure-mode proof
AC writes restricted to trusted/attested writers; PR and developer lanes are read-only by default
non-BCR external inputs served from durable distdir/repository-cache authority with offline or vendor-mode proof
tenant isolation enforced by instance_name, IAM/OIDC, executor pool policy, and quota errors that are observable and documented
dashboard/API/agent surfaces that explain whether a slowdown is queueing, cache miss, eligibility, quota, external input, worker image, or backend health
every promoted target class still has forced --remote_executor evidence, nonzero remote processes, worker image digest, and proof artifacts
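The gate above is an all-of condition, which can be sketched as a small evaluator that returns the gates still blocking promotion. The gate identifiers and the TargetProof fields are hypothetical names chosen to mirror the list above; they are not an existing schema in this repo.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers, one per evidence line in the promotion gate.
GATES = (
    "no_hosted_fallback",
    "durable_cas_ac",
    "trusted_ac_writers",
    "durable_external_inputs",
    "tenant_isolation",
    "diagnostic_surfaces",
    "per_class_proof",
)


@dataclass
class TargetProof:
    """Per-target-class evidence; field names are illustrative."""
    name: str
    forced_remote_executor: bool
    remote_processes: int
    worker_image_digest: str
    proof_artifacts: list = field(default_factory=list)

    def is_complete(self):
        return bool(
            self.forced_remote_executor
            and self.remote_processes > 0
            and self.worker_image_digest.startswith("sha256:")
            and self.proof_artifacts
        )


def blocking_gates(evidence, proofs):
    """Return the gates that still block broad/default RBE, in GATES order."""
    blockers = [g for g in GATES[:-1] if not evidence.get(g, False)]
    # The last gate is derived from per-class proofs rather than a single flag.
    if not proofs or not all(p.is_complete() for p in proofs):
        blockers.append("per_class_proof")
    return blockers


evidence = {g: True for g in GATES[:-1]}
evidence["trusted_ac_writers"] = False
proofs = [TargetProof("example-class", True, 12, "sha256:abc", ["proof.json"])]
print(blocking_gates(evidence, proofs))
```

Returning the list of blockers, not a single boolean, also answers the question of which gate owns the next blocker.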

Boundaries

RustFS canaries are health evidence, not a CAS/action-cache promotion path. RustFS remains out of trusted Attic publication and REAPI CAS/action-cache authority until TIN-1147 proves repair, upgrade, or replacement against the known bucket-index recurrence.

BCR/Bzlmod package authority is adjacent, not proof of remote execution. Keep internal registry and module-name compatibility decisions separate from the RBE data path.

Multi-VCS is still staged. GitHub ARC is the primary production adapter; GitLab is bounded compatibility; Forgejo/Codeberg remain proof paths until they have their own runner registration, cache attachment, and operator evidence.

Working Set

TIN-1446 / E2: trusted-writer AC, audit, nuke-key, chaos, poison alerts
TIN-1447 / E3: durable external input authority and vendor/offline proof
TIN-1468 / W3.2: durable distdir mirror with SHA256 pinning
TIN-1448 / E4: tenant model enforcement
TIN-1475 / W4.4: per-tenant quotas
TIN-1476 / W4.5: self-service spoke onboarding
TIN-1449 / E5: observability and KPIs
TIN-1450 / E6: breadth expansion, downstream of E2/E3/E4/E5
TIN-668: target-class eligibility manifest and proof discipline
TIN-1147 / TIN-1046: RustFS and trusted Attic publication gate
GitHub #731: remote proof contract for spoke package consumption and Bazel-first workflows
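For the W3.2 item in the list above, SHA256 pinning amounts to checking each mirrored archive against a recorded digest before it is trusted. A minimal sketch follows, assuming a pin table that maps file name to hex digest; that table format and the function names are hypothetical, not the repo's actual layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distdir(distdir, pins):
    """Return a list of (filename, reason) for missing or mismatched files.

    `pins` maps file name to expected hex SHA-256 (a hypothetical format).
    An empty list means every pinned file is present and matches.
    """
    problems = []
    for filename, expected in sorted(pins.items()):
        path = Path(distdir) / filename
        if not path.is_file():
            problems.append((filename, "missing"))
        elif sha256_of(path) != expected.lower():
            problems.append((filename, "sha256 mismatch"))
    return problems
```

Running the same check after a restore from backup is one way to produce the restore evidence that E3 asks for, since it proves the restored bytes match the pinned digests.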

Acceptance For This Sprint

The sprint is successful when a maintainer can answer these questions from source-controlled docs, checks, and tracker evidence without tribal context:

What is proved today, and which proof artifacts show remote execution?
Why is broad/default RBE still blocked?
Which gate owns the next blocker?
Which repos can pilot cache-first mode now?
Which target classes may use executor-backed mode explicitly?
What evidence would make a target or spoke eligible for default use?