Backend Authority Decision (May 2026)

Snapshot date: 2026-05-09, with 2026-05-19 state-authority addendum. Author: J2 of the May 10-16 sprint plan (2026-05-10-cache-forward-toward-rbe.md).

Status: Approved. Drafted by Claude from existing repo truth; Jess signed off on 2026-05-09 as product authority.

Purpose

Lock down what the GloriousFlywheel substrate trusts each backend for, so that future work — Attic publication, HA OpenTofu state, RBE CAS/action-cache — does not silently inherit an interim trust model that hasn’t been re-validated.

This decision is referenced by the issues and documents listed at the end of this document.

Decision Summary

  1. Current RustFS is interim-only. The live singleton RustFS path is acceptable only for guarded reads and non-trusted state probes. It is not the current trusted backend for Attic publication, strict HA OpenTofu state, or future RBE CAS/action-cache, because bucket-index visibility has repeatedly failed and a restart has been the only observed recovery path. A future RustFS promotion would require a separate TIN-1147 repair/topology proof and an explicit replacement decision; green canaries or restart recovery do not count as that proof.
  2. Attic trusted writes stay quarantined. TIN-1043 closed the default-read-only quarantine; TIN-1046 owns any future trusted publication ramp, gated on a non-restart repair path, a different backend, or a clean representative ramp after a backend fix.
  3. HA OpenTofu state has a candidate but no live endpoint. TIN-1016 selected a managed/appliance S3-compatible service as the proof target; TIN-1026 is the live-endpoint blocker. TIN-1012 stays In Progress until a real HA endpoint or replicated RustFS proof passes strict mode.
  4. RBE CAS / action-cache is a separate future design. Broad/default RBE is the product goal, but its durable CAS/action-cache authority does not inherit the current RustFS trust model. Backend, auth, retention, quota, tenant isolation, and observability for RBE storage are evaluated as their own production gate. The narrow gf-reapi-cell proof uses node-local storage by design and does not pre-commit a CAS choice.
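The default-read-only quarantine in item 2 can be illustrated with a small check. This is a sketch only: it is not the logic of scripts/validate-attic-write-quarantine.py, whose implementation is not shown in this document. The helper name `find_push_cache_overrides` and the line-based matching are assumptions, and it assumes workflows pass `push-cache` as a plain YAML key.

```python
import re

# Hypothetical helper; the real validator's rules are not reproduced here.
# Matches a YAML line that explicitly sets the push-cache input to true,
# with optional quotes and an optional trailing comment.
_PUSH_CACHE_ON = re.compile(
    r"^\s*push-cache:\s*['\"]?true['\"]?\s*(#.*)?$", re.IGNORECASE
)


def find_push_cache_overrides(workflow_text: str) -> list[int]:
    """Return 1-based line numbers where push-cache is explicitly enabled."""
    return [
        lineno
        for lineno, line in enumerate(workflow_text.splitlines(), start=1)
        if _PUSH_CACHE_ON.match(line)
    ]
```

An empty result means the workflow text relies on the action's default of false; any returned line number would be a candidate quarantine violation for a human to review.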

Context

What is true today (verified 2026-05-09; amended 2026-05-19)

  • tests/ha_state_candidate_inventory.sh classifies the current attic-rustfs-openebs RustFS service as interim_only — not HA-ready.
  • .github/actions/nix-job/action.yml:10 defaults push-cache to false. scripts/validate-attic-write-quarantine.py enforces the quarantine in CI workflows.
  • The 2026-05-06 RustFS bucket-index incident (docs/research/gloriousflywheel-attic-rustfs-nar-index-incident-2026-05-06.md) reproduced NoSuchBucket + HTTP 500 on both small and medium publication probes; restart restored the S3 view but no non-restart repair path exists.
  • The TIN-1016 candidate contract (docs/contracts/ha-opentofu-state-managed-s3-candidate.json) chose managed/appliance S3-compatible state. TIN-1026 is the blocker: it covers the live-endpoint package and the scoped TOFU_HA_STATE_* proof credentials.
  • On 2026-05-19, the active RustFS state path failed the interim readiness guard again: tofu-state was absent from S3 list-buckets while /data/tofu-state and /data/.rustfs.sys/buckets/tofu-state remained present. GloriousFlywheel PR #735 also failed Plan ARC Runners on the same state-authority guard. The PR fixes parser truth only; it does not make RustFS a deploy/state authority.
  • Later on 2026-05-19, post-merge RustFS canary run 26083251931 passed. That is renewed current coherence evidence, not non-restart repair or strict HA proof.
  • The narrow gf-reapi-cell proof (gf-reapi-cell.md) deliberately uses node-local PVC storage on a compute-expansion lane and is explicitly separated from RustFS, Attic, and OpenTofu state buckets.
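The 2026-05-19 failure pattern (a bucket present on disk but missing from the S3 listing) can be written as a small classification. This is a sketch under assumptions, not the repo's readiness guard: the function name and the result labels are hypothetical, and the two input sets are assumed to be gathered by whatever means the operator already uses.

```python
def classify_bucket_visibility(
    bucket: str, s3_listed: set[str], on_disk: set[str]
) -> str:
    """Compare the S3 list-buckets view with directories present on disk."""
    listed = bucket in s3_listed
    present = bucket in on_disk
    if listed and present:
        return "coherent"
    if present and not listed:
        # Data exists but the bucket index does not expose it:
        # the pattern described for tofu-state on 2026-05-19.
        return "index_invisible"
    if listed and not present:
        return "listed_without_data"
    return "absent"
```

A "coherent" result corresponds only to the renewed coherence evidence from the canary run; it says nothing about non-restart repair or strict HA proof.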

What this decision is NOT

  • Not a vendor decision for HA state. TIN-1016 names a class (managed S3-compatible); the specific vendor decision lands when TIN-1026 unblocks the live endpoint.
  • Not a Garage/SeaweedFS/MinIO/managed-S3 selection for RBE. RBE CAS/action-cache backend selection is gated on the broad-RBE work and does not happen in this decision.
  • Not a deprecation of RustFS. RustFS continues to back guarded interim read paths where the guard is green. The May 19 state-authority failure means it must not be treated as deploy/state authority, trusted Attic publication, strict HA state, or future CAS/action-cache authority until TIN-1147 repairs or replaces that role.

Decision Detail

Per-role backend posture

Role | Current backend | Trusted? | Reference
-----|-----------------|----------|----------
Nix cache reads (Attic) | RustFS via attic-rustfs-openebs | Yes (guarded interim) | tofu/modules/rustfs/main.tf, roadmap “Now”
Trusted Attic writes (publication) | RustFS | No (quarantined) | TIN-1043, validate-attic-write-quarantine.py
OpenTofu state (tofu-state) | RustFS bucket | Degraded guarded interim; recurring guard failures | TIN-1147 evidence; just tofu-state-ha-readiness
Strict HA OpenTofu state | (no live backend) | No backend selected | TIN-1012 In Progress; TIN-1026 blocker
Bazel remote action cache | RustFS-backed bazel-cache bucket via attic-rustfs-openebs | Yes for cache-forward acceleration; not for trusted writes | roadmap “Now”
RBE CAS / action cache | (none; narrow proof uses node-local PVC) | N/A (separate future design) | docs/build-system/gf-reapi-cell.md “Storage Boundary”
WAS-110 public archive mirror | was110-public-inputs RustFS bucket | Yes for read-side public-input pinning | roadmap “Now”
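The per-role posture can also be encoded as data so that tooling asks one question, "is this role trusted for this operation?", instead of re-deriving it from prose. The sketch below is an illustrative transcription of the table; the role keys, the `Posture` names, and `is_trusted` are hypothetical and do not exist in the repo.

```python
from enum import Enum


class Posture(Enum):
    GUARDED_READ = "guarded interim, read-side only"
    QUARANTINED = "writes quarantined"
    DEGRADED = "guarded interim with recurring guard failures"
    NO_BACKEND = "no live backend selected"


# Hypothetical role keys, one per table row.
ROLE_POSTURE = {
    "attic_reads": Posture.GUARDED_READ,
    "attic_publication": Posture.QUARANTINED,
    "tofu_state": Posture.DEGRADED,
    "strict_ha_tofu_state": Posture.NO_BACKEND,
    "bazel_action_cache": Posture.GUARDED_READ,  # cache-forward acceleration only
    "rbe_cas": Posture.NO_BACKEND,  # separate future design
    "was110_mirror": Posture.GUARDED_READ,
}


def is_trusted(role: str, operation: str) -> bool:
    """Answer whether a role is trusted for 'read' or 'trusted_write'."""
    posture = ROLE_POSTURE[role]  # KeyError for roles not in the table
    if operation == "read":
        return posture is Posture.GUARDED_READ
    if operation == "trusted_write":
        # No posture in the table grants trusted-write authority today.
        return False
    raise ValueError(f"unknown operation: {operation!r}")
```

Keeping the trusted-write branch as an unconditional refusal reflects the table as written: a later decision would have to add a new posture explicitly rather than inherit one.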

Required next decisions (out of scope here)

These are not J2 decisions; they are flagged so future authors can pick them up cleanly:

  • TIN-1026 live HA endpoint package. Names the vendor and endpoint for managed S3-compatible state. Owns the actual cutover.
  • RBE CAS / action-cache backend selection. Probably blocks broader RBE rollout. Distinct from HA state because the write profile, retention, and auth model differ. See “Stop/Go Table” in the BCR/RBE/RustFS product reality review.
  • Trusted Attic write ramp (TIN-1046). Requires one of: a RustFS non-restart repair, a different backend, or a clean representative ramp after a backend fix. rustfs-trusted-publication-backend-gate.json is the static TIN-1147 stop/go gate for that backend decision. It prevents restart-only recovery, green canary-only coherence, source-only admin-route existence, and unrelated RBE or OpenTofu state evidence from counting as trusted Attic publication backend proof. rustfs-upgrade-topology-proof-plan.json is the non-mutating beta.4 upgrade/topology operating plan. It narrows the eventual live change to rustfs_image, requires just tofu-plan-guard attic plus just rustfs-upgrade-topology-plan-guard and operator approval, rejects Civo, and keeps TIN-1046 blocked until representative publication evidence clears the current failure classes. The managed Deploy Attic Stack workflow exposes this as manual plan_scope=rustfs_upgrade_topology. Under that scope, only plan may continue past the expected-red state authority check, and only to produce a saved plan that passes both guards; apply remains strict and still requires operator approval plus post-upgrade evidence.
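The evidence filter described for the trusted Attic write ramp can be sketched as follows. The evidence labels and `gate_decision` are hypothetical; the actual schema of rustfs-trusted-publication-backend-gate.json is not shown in this document, so this models only the stated rule that certain evidence kinds never count.

```python
# Evidence kinds the text says must not count as proof (illustrative labels).
NON_COUNTING = frozenset({
    "restart_only_recovery",
    "green_canary_only",
    "source_only_admin_route",
    "unrelated_rbe_evidence",
    "unrelated_tofu_state_evidence",
})

# Paths the text names as able to unblock the ramp (illustrative labels).
QUALIFYING = frozenset({
    "non_restart_repair",
    "different_backend",
    "clean_representative_ramp_after_fix",
})


def gate_decision(evidence: set[str]) -> str:
    """Return 'go' only if at least one qualifying evidence kind is present."""
    unknown = set(evidence) - NON_COUNTING - QUALIFYING
    if unknown:
        raise ValueError(f"unrecognized evidence kinds: {sorted(unknown)}")
    return "go" if set(evidence) & QUALIFYING else "stop"
```

For example, `gate_decision({"green_canary_only", "restart_only_recovery"})` yields "stop" no matter how many non-counting items are supplied. A "go" here covers the evidence filter alone; operator approval and post-upgrade evidence remain separate requirements.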

Open questions

These do not block J2 sign-off but should be tracked:

  1. Does tofu-state-on-RustFS need to migrate before TIN-1026’s HA endpoint lands, or is an emergency restoration window needed first? It can no longer be described as simply “interim, read-side trusted” while the guard is red.
  2. Should was110-public-inputs graduate to a different backend if the public-alpha mirror story expands? (Currently fine as a guarded read-side use.)
  3. When RBE broad-rollout work begins, who owns the CAS backend evaluation — the same person who owns TIN-1016, or a separate evaluation?

Approval

Role | Name | Sign-off
-----|------|---------
Product authority (Jess) | Jess Sullivan | approved 2026-05-09
Drafter (Claude) | n/a | drafted 2026-05-09

Referenced by

  • TIN-1012 — strict HA state authority, In Progress
  • TIN-1016 — selected managed/appliance S3-compatible state candidate
  • TIN-1026 — live HA endpoint package (active blocker)
  • TIN-1043 — Attic publication quarantine, Done
  • TIN-1046 — future trusted Attic publication ramp owner
  • TIN-1070 — May 10-16 sprint control list
  • docs/research/gloriousflywheel-attic-rustfs-nar-index-incident-2026-05-06.md
  • docs/research/gloriousflywheel-bcr-rbe-rustfs-product-reality-2026-05-08.md
  • docs/build-system/gf-reapi-cell.md — narrow proof storage boundary
  • tests/ha_state_candidate_inventory.sh — interim_only classification
