GloriousFlywheel BCR, RBE, And RustFS Product Reality Review

Date: 2026-05-08

Related issues: TIN-974, TIN-1043, TIN-1046, TIN-1012, TIN-1027, TIN-665, TIN-1041

Executive Reality

GloriousFlywheel is currently a cache-forward local/CI execution substrate. That is real product value: shared ARC runner lanes, Attic-backed Nix substitution, Bazel remote-cache acceleration, implementation overlays, and repo-managed proof workflows are all working on main.

It is not yet the full target product.

The target product is broader:

  • local acceleration through shared caches that attach cleanly from devshells, CI, and runner jobs
  • BCR/Bzlmod package authority where reusable packages resolve from versioned module releases instead of local copies or ad hoc source pins
  • Bazel external input authority through repository-cache, distdir, approved mirrors, or generated injected repositories rather than mutable upstream URLs at build time
  • Bazel remote execution through a countable REAPI executor endpoint, separate from remote cache, with remote process evidence
  • remote test and remote build expansion after target eligibility is classified
  • remote runner capacity that remains capability-class based rather than repo-shaped
  • storage authorities that are separated by role: OpenTofu state, Attic cache, Bazel cache/CAS/action-cache, BCR mirrors, and future RBE CAS must not be conflated

The most important current blocker is RustFS reliability debt. RustFS can still serve the cache-forward read path after recovery, but it has reproduced bucket-index loss during trusted Attic publication. That makes it unsafe as a trusted write-publication backend and disqualifies it as a final HA state authority or future RBE CAS/action-cache authority until repair, replacement, or a stronger recovery proof exists.

Where We Are

Pooled Runner Substrate

Current main proves the shared runner model:

  • capability classes such as tinyland-nix, tinyland-docker, tinyland-dind, and bounded additive lanes such as tinyland-nix-heavy remain the product taxonomy
  • repo-specific runner labels are debt, not normal product structure
  • implementation overlays own owner-specific GitHub App installs, tfvars, backend settings, and private registration anchors
  • ARC/GitHub Actions jobs are remote CI jobs, but that is coarse-grained runner execution, not Bazel action-level remote execution

This is a viable pooled runner substrate. It is still not the same thing as Bazel remote execution.

Nix And Attic

Current Attic truth:

  • Attic reads are part of the intended shared cache substrate
  • nix-job defaults push-cache to false
  • pull requests remain read-only for Attic publication
  • pilot and downstream examples may use default-branch plus ATTIC_TOKEN gated push-cache
  • broad GloriousFlywheel proof workflows still keep push-cache: "false" while the RustFS bucket-index recurrence is unresolved
  • the manual Attic publication probe remains the only current strict require-cache-push trusted-write workflow exception
  • small-check and medium-check are now controlled reproduction profiles, not safe ramp steps

Trusted Attic writes remain quarantined. TIN-1043 closed the immediate default-read-only safety gate. TIN-1046 owns the trusted publication ramp, but TIN-1147 is the active stop/go blocker: it must prove non-restart RustFS repair/reindex, a RustFS upgrade/topology fix, or a replacement backend before any clean representative write ramp can restore broad push-cache.
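The publication gates above can be read as a single predicate. The following is a minimal sketch, assuming a hypothetical JobContext shape; the field names and the may_push_to_attic helper are illustrative and are not taken from the nix-job implementation. It does not model the manual Attic publication probe exception.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobContext:
    """Hypothetical view of a CI job; field names are illustrative only."""
    event: str                 # e.g. "pull_request" or "push"
    ref: str                   # e.g. "refs/heads/main"
    default_branch: str        # e.g. "main"
    has_attic_token: bool      # whether ATTIC_TOKEN is available to the job
    push_cache_requested: bool = False  # mirrors the push-cache default of false


def may_push_to_attic(job: JobContext, trusted_writes_quarantined: bool) -> bool:
    """Return True only when every publication gate is open."""
    if trusted_writes_quarantined:
        return False  # backend gate: nothing publishes while quarantined
    if not job.push_cache_requested:
        return False  # default posture is read-only
    if job.event == "pull_request":
        return False  # pull requests never publish
    if job.ref != f"refs/heads/{job.default_branch}":
        return False  # only the default branch may publish
    return job.has_attic_token
```

Under the current quarantine the predicate is false for every job, which matches the broad proof workflows keeping push-cache: "false".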

Bazel Remote Cache

Current Bazel truth:

  • BAZEL_REMOTE_CACHE is the only default Bazel substrate endpoint
  • source-repo proof passes cache-backed Bazel through the repo-managed wrapper
  • .bazelrc intentionally avoids executor endpoint literals and placeholder --remote_executor= values
  • scripts/bazel-cache-backed.sh can enter opt-in executor-backed mode only when BAZEL_REMOTE_EXECUTOR is set and the strict contract validates the shell as executor-backed
  • ARC runner endpoint wiring can now inject a backend-neutral BAZEL_REMOTE_EXECUTOR only when bazel_executor_endpoint is explicitly configured; the default runner posture remains cache-backed
  • just rbe-boundary-check keeps default operational surfaces cache-backed

This is cache-forward Bazel acceleration. It is not Bazel remote execution.
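The cache-backed default and the opt-in executor-backed mode can be illustrated with a small flag builder. This is a sketch only, not the logic of scripts/bazel-cache-backed.sh: the remote_flags helper, the strict_executor_contract_ok parameter, the fail-closed behavior, and the example endpoints are assumptions. The --remote_cache and --remote_executor flags are standard Bazel flags.

```python
from typing import Mapping


def remote_flags(env: Mapping[str, str], strict_executor_contract_ok: bool) -> list[str]:
    """Build Bazel remote flags: cache-backed by default, executor only on opt-in."""
    cache = env.get("BAZEL_REMOTE_CACHE", "").strip()
    executor = env.get("BAZEL_REMOTE_EXECUTOR", "").strip()

    flags = []
    if cache:
        flags.append(f"--remote_cache={cache}")
    if executor:
        # Assumption: fail closed rather than silently fall back to cache-only
        # when an executor is requested but the strict contract did not validate.
        if not strict_executor_contract_ok:
            raise ValueError("BAZEL_REMOTE_EXECUTOR is set but strict validation failed")
        flags.append(f"--remote_executor={executor}")
    return flags


if __name__ == "__main__":
    # Hypothetical endpoint; the default posture emits no executor flag.
    print(remote_flags({"BAZEL_REMOTE_CACHE": "grpc://cache.example.internal:9092"}, False))
```

Keeping the executor flag out of the result unless the variable is explicitly set mirrors the rule that .bazelrc carries no executor endpoint literals.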

Bazel REAPI Proof Lane

Current RBE proof truth:

  • gf-reapi-cell is the first GloriousFlywheel-owned REAPI proof endpoint
  • the proof lane is explicit and non-default through GF_RBE_PROOF_MODE=explicit
  • PR #564 proved //app:build through --remote_executor with worker image sha256:be2832171ac69cc9a2d012b3c789e8b765afb7cae0df8f7e9677dd6d8542dbc0
  • the proof reported 2308 processes: 1439 internal, 869 remote
  • app/sveltekit_sync and app/vite_build both ran remotely with exit code 0
  • cache-warm proof reruns must use GF_RBE_PROOF_FORCE_EXECUTION=true; remote cache hits alone are endpoint continuity evidence, not fresh remote-worker evidence
  • PR #572 made the WAS-110 public input workflow artifact machine-verifiable; main run 25589377905 built //:public_vendor_handoff_fixture with forced execution, 1 remote process, worker provenance, and injected was110_vendor_blobs evidence
  • PR #582 made build/test proof mode explicit; main run 25601913985 tested //app:unit_tests with bazel_command=test, forced execution, 527 remote processes, 20 Vitest files, 168 passing tests, and worker evidence for test-setup.sh app/unit_tests_/unit_tests
  • main run 25602726443 built //:deployment_bundle with bazel_command=build, forced execution, 1 remote process, and worker evidence for the rules_pkg build_tar action that writes deployment_bundle.tar.gz
  • main run 25608601158 built //docs-site:build with bazel_command=build, forced execution, 1046 remote processes, and remote JsRunBinary evidence for docs-site/.svelte-kit and docs-site/build
  • PR #604 added Stage 1 rust/c++/go cache-backed test targets; this broadened cache-backed toolchain evidence but did not promote those languages to RBE
  • PR #605 fixed gf-reapi-cell output inlining after forced Go proof run 25631848864; retry run 25632300253 reached 2 remote rules_go actions and then failed in runtime/cgo with cc: no such file or directory; run 25634296833 proved pure-Go //examples/hello-go:hello_test with bazel_command=test, forced execution, 11 remote processes, and remote test-setup evidence
  • the RBE target eligibility manifest now records the proved target classes and promotes:
      • //app:unit_tests as the first remote-test target class
      • //:deployment_bundle as the first deployment packaging target class
      • //docs-site:build for static docs-site rendering
      • //examples/hello-go:hello_test for one pure-Go rules_go unit-test class
      • //examples/hello-go-cgo:cgo_test for one cgo-backed rules_go unit-test class
      • //examples/hello-rust:hello_test and //examples/hello-cc:hello_test for one trivial unit-test class each
      • //docs-site:playwright_chromium_smoke from run 25712694947 for one Chromium static-site Vite/SvelteKit Playwright smoke class with 1060 remote processes
      • public consumer web target classes from tinyland-inc/omux.xoxd.ai //:puppeteer_chromium_smoke run 25826953857
      • tinyland-inc/omux.xoxd.ai //:playwright_chromium_smoke run 25897326537 with proof nonce 20260515T024138Z-25897326537-1 and 6 remote processes
      • Jesssullivan/jesssullivan.github.io //:puppeteer_chromium_smoke / //:sveltekit_vite_build_smoke runs 25777472760 and 25779597385
      • Jesssullivan/jesssullivan.github.io //:types_unit_tests run 25892939448 for one public SvelteKit/Vite/Vitest target class with 855 remote processes
  • later private web proofs also promote narrow tinyland-inc/tinyland.dev app/package classes and Jesssullivan/MassageIthaca classes, including:
      • //:sveltekit_node_build from run 25983800544 with 3193 remote processes, remote sveltekit_sync_bin_/sveltekit_sync_bin, and remote vite_build_bin_/vite_build_bin evidence
      • tinyland-inc/tinyland.dev //packages/tinyland-a11y-engine:typecheck from run 25984827370 with proof nonce 20260517T073751Z-25984827370-1, 2 remote processes, remote esbuild lifecycle-hook execution, and remote TypeScript tsc for packages/tinyland-color-utils
      • tinyland-inc/tinyland.dev //:playwright_local_route_smoke from run 25989829826 with proof nonce 20260517T114200Z-25989829826-1, 53 remote processes, remote test-setup.sh, and a passing local-server Playwright route smoke over loopback-served SvelteKit output
  • OpenTofu target classes remain blocked until toolchain, provider, runfiles, and mutable-state behavior are hermetic

This is countable RBE evidence for narrow build, test, and public-input target classes. It is not broad product RBE, not broad Playwright/Puppeteer/E2E RBE, not ARC dispatch, not cache-only execution, and not RustFS-backed CAS/action-cache authority.
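The difference between fresh remote-worker evidence and remote cache hits can be checked mechanically from Bazel's end-of-build process summary. The sketch below assumes the summary line has the shape "INFO: N processes: A internal, B remote." and that cache hits appear under a separate "remote cache hit" label; the helper names are hypothetical and this is not the repo's proof verifier.

```python
import re

# Matches e.g. "INFO: 2308 processes: 1439 internal, 869 remote."
SUMMARY = re.compile(r"(\d+) processes?: (.+?)\.?$")


def parse_process_summary(line: str) -> dict[str, int]:
    """Split a Bazel process summary line into a label -> count mapping."""
    match = SUMMARY.search(line.strip())
    if match is None:
        raise ValueError("not a process summary line")
    counts = {"total": int(match.group(1))}
    for part in match.group(2).split(","):
        number, _, label = part.strip().partition(" ")
        counts[label] = int(number)
    return counts


def has_fresh_remote_evidence(counts: dict[str, int]) -> bool:
    """Only actions executed on a remote worker count; cache hits do not."""
    return counts.get("remote", 0) > 0


if __name__ == "__main__":
    proof = parse_process_summary("INFO: 2308 processes: 1439 internal, 869 remote.")
    print(proof, has_fresh_remote_evidence(proof))
```

A cache-warm rerun whose summary reports only "remote cache hit" and "internal" entries would fail this check, which is why such reruns need GF_RBE_PROOF_FORCE_EXECUTION=true.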

Bazel External Fetch Authority

Current external fetch truth:

  • the default Source Bazel Proof now materializes a verified ephemeral BAZEL_DISTDIR for the nodejs_linux_amd64:22.13.1:linux_amd64 toolchain archive before Bazel starts
  • docs/contracts/bazel-distdir-source-proof-coverage.json validates that required source-proof archive and classifies the other seven generated Node archives as deferred
  • repo-wide default status is still not durable mirror authority
  • .bazelrc has retry and timeout mitigation
  • no repo-wide repository-cache or durable distdir authority is live by default
  • BAZEL_REPOSITORY_CACHE, BAZEL_DISTDIR, and GF_BAZEL_INJECT_REPOSITORIES are wired as authority inputs for the cache-backed wrapper and explicit RBE proof wrapper
  • docs/contracts/bazel-external-input-mirror-candidates.json records candidate integrity for the eight generated Node.js 22.13.1 toolchain archives, but it is candidate-integrity-only with materialized: false
  • docs/contracts/bazel-external-input-durable-authority.json records the promotion gate as no-live-durable-authority: no candidate is durably covered yet, all eight candidates are pending, and future promotion requires auth, retention, restore, provenance, and consumer exposure evidence
  • the WAS-110 public-input mirror proof exists, but it is a specific public input staging path, not universal external fetch authority

Remote cache does not cover repository resolution. The verified ephemeral distdir removes raw Node template fetching from the source proof’s Bazel phase, but the product still needs a durable repository-cache/distdir or mirror policy before it can claim broad cache authority for Bazel external inputs. The durable authority contract makes the promotion criteria executable, but it does not make any storage backend live by itself. Candidate hashes reduce ambiguity about what to mirror next; they do not prove offline fetch authority by themselves.
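Verifying a distdir before Bazel starts amounts to checking every required archive against a pinned digest. A minimal sketch follows, assuming a hypothetical mapping of archive file names to expected SHA-256 values; the verify_distdir helper is illustrative and the mapping does not reflect the schema of the contract JSON files named above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large archives are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distdir(distdir: Path, expected: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every archive matched."""
    problems = []
    for name, want in expected.items():
        archive = distdir / name
        if not archive.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(archive) != want:
            problems.append(f"sha256 mismatch: {name}")
    return problems
```

A check like this only proves that the staged bytes match the recorded hashes at that moment. It says nothing about retention, restore, or consumer exposure, which is why candidate integrity alone is not durable authority.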

BCR And Bzlmod

Current BCR/Bzlmod truth:

  • GloriousFlywheel itself is Bzlmod-shaped, but its module name is still attic-iac for compatibility
  • Bzlmod currently helps separate reusable core code from implementation overlays
  • internal package authority work is happening in package repos and registry follow-ups, not in the RustFS/RBE backend lane
  • BCR readiness is not proved by green cache-backed CI

The BCR goal is package/module authority:

  • package releases exist at their source authority
  • registry entries point to correct versions and dependencies
  • consumer repos resolve through Bzlmod without local package copies
  • compatibility names and dependency pins are reconciled
  • public BCR or internal-registry posture is explicit

That is adjacent to RBE but not the same work. BCR controls module discovery and dependency authority. RBE controls where Bazel actions execute.
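One narrow, checkable signal for "consumers resolve through Bzlmod without local package copies" is the absence of non-registry overrides in a consumer's MODULE.bazel. The sketch below scans for the standard Bzlmod directives local_path_override, git_override, and archive_override. The helper name and the sample module content are hypothetical, and a clean scan is not by itself proof of BCR readiness.

```python
import re

# Bzlmod overrides that bypass registry resolution for a module.
NON_REGISTRY_OVERRIDES = ("local_path_override", "git_override", "archive_override")

_PATTERN = re.compile(
    r"^\s*(" + "|".join(NON_REGISTRY_OVERRIDES) + r")\s*\(",
    re.MULTILINE,
)


def find_non_registry_overrides(module_bazel_text: str) -> list[str]:
    """Return the override directives found, in file order (comments are skipped)."""
    return [match.group(1) for match in _PATTERN.finditer(module_bazel_text)]


if __name__ == "__main__":
    # Hypothetical consumer MODULE.bazel content, for illustration only.
    sample = '''
module(name = "example_consumer", version = "0.0.0")
bazel_dep(name = "example_pkg", version = "1.2.3")
# local_path_override(module_name = "ignored_comment", path = "../x")
local_path_override(module_name = "example_pkg", path = "../example_pkg")
'''
    print(find_non_registry_overrides(sample))
```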

RustFS

Current RustFS truth:

  • RustFS backs Attic object storage, Bazel cache storage, and the interim OpenTofu state path
  • the current active RustFS service is one RustFS Deployment, one service endpoint, one OpenEBS ZFS node, and a bumble-scoped ReadWriteOnce PVC
  • tofu-state is readable again after the May 19 recovery, but the same date also showed that it can disappear from S3 list-buckets while its disk markers persist; strict HA still fails
  • live inventory returns NO_LIVE_HA_STATE_CANDIDATE
  • the current RustFS image does not expose an obvious non-restart heal/repair/reindex command surface
  • Attic publication has reproduced NoSuchBucket, HTTP 500, and InternalServerError while /data/<bucket> and /data/.rustfs.sys/buckets/<bucket> markers existed
  • May 14 repair-surface inventory confirmed the deployed rustfs v1.0.0-beta.1 CLI exposes only server and info; the pod has no rustfs-admin, rc, mc, aws, or s5cmd client binary
  • the tagged source has internal admin heal endpoints; a May 14 signed background-heal status endpoint probe returned HTTP 200 with valid JSON from /rustfs/admin/v3/background-heal/status, which proves observability, not repair
  • a source semantics audit also found that the bucket/object heal endpoint drops the parsed dryRun option: the handler builds channel work with create_heal_request, that constructor sets dry_run: None, and the heal processor defaults missing dry_run to false
  • there is still no repo-proved signed repair runbook for the live bucket-index recurrence
  • a follow-on source audit found export/import-bucket-metadata is not a proved repair path: export depends on the current list_bucket/get_bucket_info API view, while import is a mutating zip-archive path that can call make_bucket(force_create) and does not persist accumulated imported metadata config updates in the current handler
  • TIN-1147 remains active until that repair path, a RustFS upgrade/topology change, or a replacement backend is proved

RustFS is currently guarded interim infrastructure. It is not an HA state authority, not trusted write-publication authority, not BCR authority, and not future RBE CAS/action-cache authority.
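The bucket-index failure signature described above (disk markers present, bucket absent from the S3 listing) reduces to a set comparison. This is a read-only detection sketch under stated assumptions: the caller has already collected the S3 list-buckets result and the directory names under /data and /data/.rustfs.sys/buckets; the function name and the "attic" bucket name are hypothetical. It detects the condition and does not repair it.

```python
from typing import Iterable


def find_index_loss(
    listed_buckets: Iterable[str],
    data_markers: Iterable[str],
    sys_markers: Iterable[str],
) -> list[str]:
    """Return buckets that have both disk markers but are missing from the S3 listing."""
    on_disk = set(data_markers) & set(sys_markers)
    return sorted(on_disk - set(listed_buckets))


if __name__ == "__main__":
    # Illustrative inputs: tofu-state has markers on disk but is not listed.
    missing = find_index_loss(
        listed_buckets=["attic"],
        data_markers=["attic", "tofu-state"],
        sys_markers=["attic", "tofu-state"],
    )
    print(missing)
```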

Where We Want To Be

Product North Star

GloriousFlywheel should become a pooled build substrate where a developer or CI job can attach once and receive the same governed acceleration and execution contract:

  • Nix substitution from trusted caches
  • Bazel remote cache for action outputs
  • Bazel external input authority for repository/archive fetches
  • Bzlmod/BCR package authority for reusable package dependencies
  • REAPI remote execution for eligible Bazel actions
  • capability-class runner capacity for workflows that remain runner-shaped
  • explicit local-only escape hatches for actions that are not remote-execution eligible

The product should not claim “remote build” just because a GitHub Actions job runs on a self-hosted runner. The countable remote-build claim starts when Bazel actions execute through a validated --remote_executor.

RBE Target

The first countable RBE milestone was intentionally small and has now landed for //app:build. Subsequent target-class expansions have landed for //app:unit_tests, //:deployment_bundle, the WAS-110 public injected repository handoff, and //docs-site:build. TIN-1027 and TIN-665 are closed on the minimum proof, while TIN-668 owns the continuing target-class gate. The milestone steps were:

  • provision a backend-neutral REAPI executor endpoint
  • validate BAZEL_REMOTE_EXECUTOR separately from BAZEL_REMOTE_CACHE
  • pass both as explicit Bazel CLI flags through a repo-managed wrapper after strict mode validation
  • prove a small hermetic target such as //app:build or //app:unit_tests
  • record remote process evidence, not only remote cache hits
  • pin or otherwise identify the worker image/provenance
  • mark unsupported targets local-only or explicitly excluded

Public/operator language should move from “shared cache acceleration” to “Bazel remote execution” only for explicitly proved target classes and only through the explicit proof lane or opt-in executor-backed wrapper mode until a default product posture is selected.
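The rule that unproved targets stay local-only can be expressed as a lookup against an eligibility manifest. The sketch below is illustrative: the dictionary shape and the execution_mode helper are assumptions rather than the format of the real RBE target eligibility manifest, and only four targets named in this review are listed.

```python
# Hypothetical manifest: target label -> Bazel command it was proved with.
PROVED_TARGETS = {
    "//app:build": "build",
    "//app:unit_tests": "test",
    "//:deployment_bundle": "build",
    "//docs-site:build": "build",
}


def execution_mode(target: str, bazel_command: str, manifest=PROVED_TARGETS) -> str:
    """Return "remote" only for a target proved with the same Bazel command."""
    if manifest.get(target) == bazel_command:
        return "remote"
    return "local-only"  # unproved or mismatched targets never default to remote


if __name__ == "__main__":
    print(execution_mode("//app:unit_tests", "test"))
    print(execution_mode("//tofu:plan", "build"))  # hypothetical unproved target
```

Matching on the command as well as the label keeps a build proof from being read as a test proof for the same target.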

BCR Target

The BCR target is a package authority lane:

  • source packages publish versioned releases
  • internal registry entries are current and dependency pins are reconciled
  • consumers resolve through Bzlmod instead of local copies
  • compatibility module names are intentionally retired or documented
  • official BCR readiness is decided separately from internal registry health

BCR work should run in parallel with backend work, but it should not depend on RustFS being the object store and should not be counted as RBE evidence.

RustFS Target

The RustFS target is a stop/go decision, not endless probing:

  • either RustFS gains a proved non-restart repair/reindex path and enough retention/observability to be trusted for the relevant role
  • or trusted writes and state authority move to a backend that removes this bucket-index failure class

For OpenTofu state, the selected direction is managed or appliance S3-compatible state authority with scratch and disposable OpenTofu proofs before protected migration.

For Attic cache publication, TIN-1147 is the next gate: backend repair/reindex, RustFS upgrade/topology fix, or backend replacement must remove the failure class before a representative clean write ramp can count. The signed background-heal status probe is useful operator observability, not repair, so it does not unblock TIN-1046 by itself. The current tagged RustFS bucket/object heal path is not a safe dry-run proof surface because dryRun is not preserved into the queued heal request. The current tagged RustFS bucket-metadata export/import path is also not a safe repair proof: export uses the current bucket API view, and import is a mutating archive path rather than a disk-marker reindex.

rustfs-trusted-publication-backend-gate.json is the static TIN-1147 stop/go gate that keeps the next decision finite: non-restart RustFS repair/reindex, RustFS upgrade/topology fix, or backend replacement. It rejects restart-only recovery, canary-only coherence, source-only admin-route existence, dry-run assumptions, ARC dispatch evidence, RBE proof evidence, and OpenTofu state-only HA proof as substitutes for trusted Attic publication backend evidence.

rustfs-upgrade-topology-candidate.json is the concrete TIN-1152 candidate packet for the upgrade/topology path. It records upstream RustFS 1.0.0-beta.4 as a candidate because the release and beta.1…beta.4 comparison touch ListBuckets CreationDate, filemeta/metacache, bucket metadata, list_object_v2/listing, HeadObject, scanner/rebalance, and S3 tracing paths, but they explicitly do not claim the bucket-index recurrence is fixed. The selected Docker Hub beta.4 manifest/platform digests are recorded, but the preflight must now treat RustFS State Authority Canary as expected-red and preserve the uploaded evidence artifact alongside normal green main-suite health. Maintenance-window approval, state readiness, bucket-index RCA, NAR integrity, and representative small-check/medium-check publication evidence are still required.

rustfs-upgrade-topology-proof-plan.json turns that candidate into the next source-owned, non-mutating operating plan: only tofu/stacks/attic/honey.tfvars rustfs_image may change in the eventual maintenance window, just tofu-plan-guard attic must approve the saved plan, live secret authorities and OpenEBS/PVC selectors must remain stable, Civo is not an endpoint or fallback, and TIN-1046 stays blocked until post-upgrade tofu-state, bucket-index RCA, NAR integrity, and representative publication evidence clear the current NoSuchBucket, curl 18, and size_download=0 failure classes.

just rustfs-upgrade-topology-plan-guard is the saved-plan guard for that future maintenance window. It accepts only the digest-pinned beta.1 -> upgrade-topology candidate RustFS image update on the live Deployment and the drained legacy StatefulSet template when the shared module input touches both workload templates; it rejects Secret rotation, selector/PVC/storage/service drift, delete/create actions, wrong image direction, and plans that do not update the live Deployment.

The managed Deploy Attic Stack workflow now has a manual plan_scope=rustfs_upgrade_topology path for this candidate: plan may continue past expected-red TIN-1147 state authority only to produce the saved plan and run both guards, while apply keeps strict state authority and adds post-apply candidate-image verification.
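The accept/reject shape of a saved-plan guard can be sketched against the JSON plan representation, whose resource_changes entries carry an address, a type, and change.actions. This is a simplified illustration and not the just rustfs-upgrade-topology-plan-guard implementation: the resource type names are assumptions about how the stack models its workloads and Secrets, and the sketch does not inspect image digests, image direction, or field-level selector/PVC drift.

```python
# Assumed resource type names; the real stack may use different ones.
WORKLOAD_TYPES = {"kubernetes_deployment", "kubernetes_stateful_set"}
SECRET_TYPES = {"kubernetes_secret"}


def guard_plan(plan: dict) -> list[str]:
    """Return violations for a saved plan; an empty list means the plan is in scope."""
    violations = []
    deployment_updated = False

    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if actions in (["no-op"], ["read"]):
            continue  # untouched resources are always acceptable
        address = change.get("address", "<unknown>")
        kind = change.get("type", "")

        if actions != ["update"]:
            # Covers create, delete, and replace (delete+create) actions.
            violations.append(f"{address}: disallowed actions {actions}")
        elif kind in SECRET_TYPES:
            violations.append(f"{address}: Secret rotation is not allowed")
        elif kind not in WORKLOAD_TYPES:
            violations.append(f"{address}: unexpected update to {kind}")
        elif kind == "kubernetes_deployment":
            deployment_updated = True

    if not deployment_updated:
        violations.append("plan does not update the live Deployment")
    return violations
```

Returning a list of violations rather than a boolean keeps the guard's output usable as review evidence for the maintenance window.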

For future RBE, CAS/action-cache storage should be designed separately and must not inherit the current RustFS trust gap by accident.

Stop/Go Table

| Area | Current state | Go condition |
| --- | --- | --- |
| Nix cache reads | usable cache-forward acceleration | keep active |
| Trusted Attic writes | quarantined after small/medium reproduction failures | non-restart repair/reindex, backend replacement, or clean representative ramp after backend fix |
| OpenTofu state | readable guarded interim RustFS path | strict HA proof through TIN-1012/TIN-1017 path |
| Bazel remote cache | proved shared cache acceleration; 2026-05-25 recovered one remote-cache digest mismatch, and the cache-object delete reproduced RustFS bucket-index loss until restart recovery | keep active as acceleration only; run decoded CAS integrity audit on digest-mismatch incidents |
| Bazel external inputs | upstream-with-retries | repository-cache/distdir/mirror authority with retention and consumer policy |
| BCR/Bzlmod | internal Bzlmod/package authority in progress | reconciled package releases, registry entries, dependency pins, and public/internal BCR decision |
| RBE | narrow explicit //app:build, //app:unit_tests, //:deployment_bundle, //docs-site:build, WAS-110 public-input, and target-scoped web/private consumer proofs with worker provenance | next target inventory, product wrapper posture, and backend authority before broad/default RBE |
| RBE CAS/action-cache | not designed as product authority | separate backend/storage/auth/retention design, not inherited from current RustFS |

Sprint Implications

  1. Keep the green cache-forward baseline green.
  2. Keep Attic read-only by default, with trusted writes quarantined, until TIN-1147 proves a repaired, upgraded, or replaced backend path and TIN-1046 then records a clean representative ramp.
  3. Move OpenTofu state off the interim RustFS singleton through the selected HA candidate proof path.
  4. Continue BCR/package authority work independently from backend repair.
  5. Harden Bazel external input authority before broad remote-build claims.
  6. Keep the minimum REAPI executor proof lane alive while deciding backend and product wrapper posture.
  7. Expand remote test/build eligibility one target class at a time.

Boundary Statement

The current product is cache-forward local/CI execution with shared runners, shared caches, and a narrow explicit REAPI proof lane. It is valuable, but it is not yet full RBE, not official BCR readiness, not HA state authority, and not a trusted RustFS write-publication backend.
