2026-05-24 RBE Production-Gate Plan
Current Truth
GloriousFlywheel is a cache-first Bazel/Nix runner substrate with real,
target-scoped REAPI evidence. Current main is ac05f68d559ee4d36769681136d98df6e4c3957e
after PR #777. First-party CI dogfoods shared tinyland-* capability-class
runner lanes; hosted runners are not an acceptable fallback for this repo.
config/rbe-target-eligibility.json records 34 proved target classes across
source-repo targets and spoke-style consumers such as omux.xoxd.ai,
jesssullivan.github.io, MassageIthaca, and tinyland.dev. Those proofs
include remote check, test, smoke, package, and build work. They do not yet make
GloriousFlywheel default-capable multi-tenant RBE.
Live runner capacity is not the core blocker for default RBE. On the latest
review, honey, bumble, and sting were Ready; all tinyland-* ARC scale
sets were Running; and tinyland-nix-heavy was committed at a 64Gi memory
request, 160Gi memory limit, 192Gi ephemeral-storage request, and 256Gi
ephemeral-storage limit. The May 28 W3.4 canary showed this target class is
scratch-heavy rather than memory-heavy, so the production correction is a
schedulable request plus a large burst limit, not a repo-specific runner label
or hosted-runner fallback.
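As a rough illustration of the "schedulable request plus large burst limit" shape, the sketch below parses the four quantities quoted above and reports the limit-to-request ratio for each resource. The helper names (`parse_quantity`, `burst_headroom`) and the dictionary layout are assumptions made for this example, not part of the repo; only the quantity values come from the review.

```python
# Binary-suffix multipliers for the Kubernetes-style quantities quoted above.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '64Gi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count, no suffix


# Values quoted for tinyland-nix-heavy; the dict layout is hypothetical.
NIX_HEAVY = {
    "memory": {"request": "64Gi", "limit": "160Gi"},
    "ephemeral-storage": {"request": "192Gi", "limit": "256Gi"},
}


def burst_headroom(resources: dict) -> dict:
    """Return limit/request per resource; reject a request above its limit."""
    ratios = {}
    for name, spec in resources.items():
        request = parse_quantity(spec["request"])
        limit = parse_quantity(spec["limit"])
        if request > limit:
            raise ValueError(f"{name}: request exceeds limit")
        ratios[name] = limit / request
    return ratios
```

Calling `burst_headroom(NIX_HEAVY)` gives a 2.5x memory burst and roughly a 1.33x ephemeral-storage burst over the committed requests.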
Sprint Priority
The next sprint prioritizes production gates over more random target breadth. New target classes are valuable only when they also tighten one of the production contracts below.
- Truth and tracker alignment. Keep docs/current-state.md, docs/roadmap.md, Linear, and GitHub issues aligned to current main, the 34-class manifest, dogfood routing, and live runner posture.
- E2 action-cache authority. Treat attested writers, platform tagging, AC audit, nuke-key, chaos, and poison alerting as one trust system. Do not call broad/default RBE safe until the system has end-to-end evidence.
- E3 external-input authority. Finish durable distdir/repository-cache authority for non-BCR inputs. Lockfile and vendor checks are necessary, but not enough without durable byte authority and restore evidence.
- E4 tenant enforcement. Convert declaration into enforcement: remote_instance_name, IAM/OIDC, executor pools, quotas, structured denials, and self-service spoke onboarding must line up.
- E5 operator and developer visibility. Expose TTFCH, split CAS/AC/analysis hit rates, fairness, poison signals, and queue/blocker diagnosis as operator- and agent-readable surfaces.
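For the E3 item, "durable byte authority" implies that every non-BCR archive in the distdir can be checked against a pinned digest. The sketch below is one minimal way to express that check. It assumes a flat distdir and a hypothetical `pins` mapping of file name to SHA256 hex digest; neither the function names nor the pin format are taken from the repo.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA256 so large archives are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distdir(distdir: Path, pins: dict) -> list:
    """Return a list of problems; an empty list means every pinned file matches.

    `pins` maps file name -> expected SHA256 hex digest (hypothetical format).
    """
    problems = []
    for name, expected in sorted(pins.items()):
        candidate = distdir / name
        if not candidate.is_file():
            problems.append(f"{name}: missing from distdir")
        elif sha256_of(candidate) != expected.lower():
            problems.append(f"{name}: sha256 mismatch")
    return problems
```

A check like this covers only byte integrity. Restore evidence and offline or vendor-mode proof would still need their own tests.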
Default-RBE Promotion Gate
Broad/default RBE requires all of the following evidence:
- no first-party hosted-runner fallback in GloriousFlywheel CI
- a durable CAS/action-cache substrate with restore, retention, quota, audit, and failure-mode proof
- AC writes restricted to trusted/attested writers; PR and developer lanes are read-only by default
- non-BCR external inputs served from durable distdir/repository-cache authority with offline or vendor-mode proof
- tenant isolation enforced by instance_name, IAM/OIDC, executor pool policy, and quota errors that are observable and documented
- dashboard/API/agent surfaces that explain whether a slowdown is queueing, cache miss, eligibility, quota, external input, worker image, or backend health
- every promoted target class still has forced --remote_executor evidence, nonzero remote processes, worker image digest, and proof artifacts
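The per-target-class requirement in the last item can be read as a four-field check. The sketch below shows one possible shape for it. The `TargetProof` record and its field names are hypothetical and do not reflect the actual schema of config/rbe-target-eligibility.json. The `sha256:` prefix test is likewise an assumption about how image digests are recorded.

```python
from dataclasses import dataclass, field


@dataclass
class TargetProof:
    """Hypothetical proof record for one promoted target class."""
    target_class: str
    forced_remote_executor: bool      # build was forced through --remote_executor
    remote_processes: int             # count of remotely executed processes
    worker_image_digest: str          # assumed to look like "sha256:<hex>"
    proof_artifacts: list = field(default_factory=list)


def missing_evidence(proof: TargetProof) -> list:
    """Name each piece of promotion evidence the record does not carry."""
    gaps = []
    if not proof.forced_remote_executor:
        gaps.append("forced --remote_executor evidence")
    if proof.remote_processes <= 0:
        gaps.append("nonzero remote processes")
    if not proof.worker_image_digest.startswith("sha256:"):
        gaps.append("worker image digest")
    if not proof.proof_artifacts:
        gaps.append("proof artifacts")
    return gaps
```

A target class would stay promoted only while `missing_evidence` returns an empty list. The other gate items (substrate durability, AC write restrictions, tenant isolation) are system-wide and sit outside this per-class check.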
Boundaries
RustFS canaries are health evidence, not a CAS/action-cache promotion path. RustFS remains out of trusted Attic publication and REAPI CAS/action-cache authority until TIN-1147 proves repair, upgrade, or replacement against the known bucket-index recurrence.
BCR/Bzlmod package authority is adjacent, not proof of remote execution. Keep internal registry and module-name compatibility decisions separate from the RBE data path.
Multi-VCS is still staged. GitHub ARC is the primary production adapter; GitLab is bounded compatibility; Forgejo/Codeberg remain proof paths until they have their own runner registration, cache attachment, and operator evidence.
Working Set
- TIN-1446 / E2: trusted-writer AC, audit, nuke-key, chaos, poison alerts
- TIN-1447 / E3: durable external input authority and vendor/offline proof
- TIN-1468 / W3.2: durable distdir mirror with SHA256 pinning
- TIN-1448 / E4: tenant model enforcement
- TIN-1475 / W4.4: per-tenant quotas
- TIN-1476 / W4.5: self-service spoke onboarding
- TIN-1449 / E5: observability and KPIs
- TIN-1450 / E6: breadth expansion, downstream of E2/E3/E4/E5
- TIN-668: target-class eligibility manifest and proof discipline
- TIN-1147 / TIN-1046: RustFS and trusted Attic publication gate
- GitHub #731: remote proof contract for spoke package consumption and Bazel-first workflows
Acceptance For This Sprint
The sprint is successful when a maintainer can answer these questions from source-controlled docs, checks, and tracker evidence without tribal context:
- What is proved today, and which proof artifacts show remote execution?
- Why is broad/default RBE still blocked?
- Which gate owns the next blocker?
- Which repos can pilot cache-first mode now?
- Which target classes may use executor-backed mode explicitly?
- What evidence would make a target or spoke eligible for default use?