GloriousFlywheel

GloriousFlywheel RustFS State Backend RCA Gate 2026-05-06

TIN-987 working note for turning the RustFS bucket-index incident into a state-backend decision gate.

Current Reality

RustFS has served the active tofu-state bucket for the four active OpenTofu stacks.
RustFS has also shown bucket-index reliability debt: the S3 API returned NoSuchBucket for tofu-state while /data/tofu-state and /data/.rustfs.sys/buckets/tofu-state existed on disk.
A controlled RustFS restart restored the S3 API view.
That makes restart an incident recovery action, not an availability design.
This is backend hardening. It is not Bazel remote execution proof.

2026-05-10 Recurrence

The same failure class recurred during the ARC runner capacity apply:

scripts/rustfs-state-authority-check.py failed before recovery because tofu-state was absent from list-buckets while both /data/tofu-state and /data/.rustfs.sys/buckets/tofu-state existed.
tofu init for arc-runners failed with S3 NoSuchBucket.
A controlled restart (kubectl rollout restart deployment/attic-rustfs-openebs -n nix-cache) restored bucket visibility.
After restart, the deep check read all four protected state objects and a temporary write/read/delete proof succeeded.
The source-owned arc-runners apply then completed with 0 added, 1 changed, 0 destroyed, advancing arc-runners/terraform.tfstate to serial 65.
A post-apply plan returned no changes.

This recurrence strengthens the existing gate: restart is the only proven live repair response for this RustFS image, and it remains insufficient for strict HA state authority.
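The failure class seen in both incidents can be stated as a small classification over three observations: whether the bucket is listed by the S3 API, and whether each of the two on-disk markers exists. The sketch below is illustrative only; it is not the logic of scripts/rustfs-state-authority-check.py, and the names BucketEvidence, BucketView, and classify are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class BucketView(Enum):
    HEALTHY = "healthy"                # listed by the API and both markers on disk
    INDEX_DIVERGED = "index-diverged"  # both markers on disk, invisible to the API
    ABSENT = "absent"                  # no markers on disk and not listed
    INCONSISTENT = "inconsistent"      # any other combination


@dataclass(frozen=True)
class BucketEvidence:
    bucket: str
    listed_by_api: bool    # bucket appears in list-buckets
    data_dir_exists: bool  # /data/<bucket>
    meta_dir_exists: bool  # /data/.rustfs.sys/buckets/<bucket>


def classify(evidence: BucketEvidence) -> BucketView:
    """Map collected evidence onto the failure classes named in this note."""
    markers = (evidence.data_dir_exists, evidence.meta_dir_exists)
    if all(markers):
        if evidence.listed_by_api:
            return BucketView.HEALTHY
        return BucketView.INDEX_DIVERGED
    if not any(markers) and not evidence.listed_by_api:
        return BucketView.ABSENT
    return BucketView.INCONSISTENT


# The state observed before each restart: on disk, but NoSuchBucket from the API.
incident = BucketEvidence("tofu-state", listed_by_api=False,
                          data_dir_exists=True, meta_dir_exists=True)
print(classify(incident).value)
```

Keeping INDEX_DIVERGED distinct from ABSENT matters for the runbook: the first is the restart-recoverable case, while the second would point at data loss or a wrong endpoint.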

2026-05-11 Storage-Node Recovery

The May 11 outage was a separate storage-node failure path. Recovering from it is not evidence that the bucket-index bug has been fixed:

bumble rebooted into an XR kernel while the installed ZFS module set only matched the stock 6.12.0-124.8.1.el10_1.x86_64 kernel.
OpenEBS ZFS could not load its node plugin because the ZFS modules could not be auto-loaded for that running kernel.
Restoring the default boot kernel to the stock ZFS-compatible kernel and rebooting bumble brought the node back with zfs/spl loaded and the tank pool online.
OpenEBS then recovered the bumble-backed PVC plane; Attic PostgreSQL and the RustFS pod recovered after the storage plane returned.
RustFS still needed single-pod cleanup because the terminated container state and volume ownership work extended unavailability after storage returned.

The operational invariant is now explicit: do not boot the OpenEBS/ZFS storage node into a kernel unless the matching ZFS module package is already installed and verified. That is host maintenance hygiene; it does not make singleton RustFS an HA state authority.
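A pre-reboot check for this invariant could look roughly like the following. It is a minimal sketch that assumes kernel modules live under /lib/modules/<kernel-release>/ and that the ZFS packaging ships files named zfs.ko and spl.ko, possibly compressed. The function name is hypothetical, and the exact module layout depends on how ZFS was packaged on the host.

```python
from pathlib import Path


def zfs_modules_present(kernel_release: str,
                        modules_root: Path = Path("/lib/modules")) -> bool:
    """Return True if zfs and spl module files exist for the given kernel.

    This only proves the files are installed for that kernel release;
    it does not prove that they load cleanly after boot.
    """
    kernel_dir = modules_root / kernel_release
    if not kernel_dir.is_dir():
        return False
    # Match zfs.ko, zfs.ko.xz, zfs.ko.zst and so on, anywhere under the tree.
    found = {path.name.split(".ko")[0] for path in kernel_dir.rglob("*.ko*")}
    return {"zfs", "spl"} <= found


if __name__ == "__main__":
    # The stock kernel release named in this note; any candidate default
    # boot kernel would be checked the same way before a reboot.
    target = "6.12.0-124.8.1.el10_1.x86_64"
    print(target, "ok" if zfs_modules_present(target) else "MISSING zfs/spl")
```

A file-presence check is deliberately weaker than the "installed and verified" wording above. Verification after boot still means confirming that zfs/spl are loaded and that the tank pool is online.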

The Tofu-side restart hygiene response is also explicit. RustFS data pods use fsGroupChangePolicy = "OnRootMismatch", and the adopted OpenEBS deployment init container only corrects the data-volume root ownership when the root is wrong. It must not run a recursive chown -R over the bucket tree on every pod start. If nested ownership repair is ever needed, it should be an intentional operator repair action with before/after evidence, not the normal restart path.
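The root-only ownership rule can be illustrated with a short sketch. This is not the actual init container, whose contents are not shown in this note. It only demonstrates the intended behavior under the assumption that the expected uid and gid are known: inspect the volume root, change it only on mismatch, and never walk the tree. The function name is hypothetical.

```python
import os
from pathlib import Path


def fix_root_ownership(root: Path, uid: int, gid: int) -> bool:
    """Correct ownership of the volume root only; never recurse.

    Returns True if a chown was performed, False if the root already matched.
    """
    st = os.stat(root)
    if st.st_uid == uid and st.st_gid == gid:
        # Root already matches: do nothing, so restart cost stays constant
        # regardless of how many objects the bucket tree holds.
        return False
    os.chown(root, uid, gid)  # single inode; nested entries are left untouched
    return True
```

The design point is that restart time should not scale with bucket size. A recursive pass is proportional to the number of files and is what extended unavailability during the May 11 recovery.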

RCA Capture

Use the repo-managed collector before and after any recovery action:

just rustfs-bucket-index-rca --scratch-probe --strict-scratch-disk-markers

The collector captures:

RustFS Deployment, pod, node, image, resources, and restart evidence
service, PVC, bootstrap job, and lifecycle job shape
RustFS process/version evidence from the selected pod
/data, /data/<bucket>, .rustfs.sys, and .rustfs.sys/buckets/<bucket> layout evidence
bootstrap/lifecycle job logs
recent RustFS logs and namespace events
signed S3 authority checks for the active state keys
read/JSON validation for the active state object bodies, without logging state contents
optional scratch bucket create/head/list/write/read/delete proof

The scratch probe is intentionally bounded. It creates a unique bucket, writes one probe object, reads it back, deletes the object, deletes the bucket, and confirms the bucket no longer answers through the S3 API. In strict mode it also requires the scratch bucket disk markers to appear after create and disappear after API delete.
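The bounded probe sequence described above can be sketched as follows. The collector's real implementation is not reproduced here. The client interface, the InMemoryS3 stand-in, the rca-scratch- bucket prefix, and the markers_exist callback are all assumptions made for illustration.

```python
import uuid
from typing import Callable, Optional


class InMemoryS3:
    """Hypothetical stand-in for a signed S3 client, used only for this sketch."""

    def __init__(self) -> None:
        self.buckets: dict[str, dict[str, bytes]] = {}

    def create_bucket(self, bucket: str) -> None:
        self.buckets[bucket] = {}

    def bucket_exists(self, bucket: str) -> bool:
        return bucket in self.buckets

    def put(self, bucket: str, key: str, body: bytes) -> None:
        self.buckets[bucket][key] = body

    def get(self, bucket: str, key: str) -> bytes:
        return self.buckets[bucket][key]

    def delete(self, bucket: str, key: str) -> None:
        del self.buckets[bucket][key]

    def delete_bucket(self, bucket: str) -> None:
        if self.buckets[bucket]:
            raise RuntimeError("bucket not empty")
        del self.buckets[bucket]


def scratch_probe(s3, markers_exist: Optional[Callable[[str], bool]] = None) -> str:
    """Create, write, read, delete, and confirm removal of one scratch bucket.

    Passing markers_exist enables the strict variant: disk markers must
    appear after create and disappear after API delete.
    """
    bucket = f"rca-scratch-{uuid.uuid4().hex[:12]}"  # unique per run
    key, body = "probe.txt", uuid.uuid4().hex.encode()

    s3.create_bucket(bucket)
    if not s3.bucket_exists(bucket):
        raise RuntimeError("bucket not visible after create")
    if markers_exist is not None and not markers_exist(bucket):
        raise RuntimeError("disk markers missing after create")

    s3.put(bucket, key, body)
    if s3.get(bucket, key) != body:
        raise RuntimeError("read-back mismatch")
    s3.delete(bucket, key)
    s3.delete_bucket(bucket)

    if s3.bucket_exists(bucket):
        raise RuntimeError("bucket still answers after delete")
    if markers_exist is not None and markers_exist(bucket):
        raise RuntimeError("disk markers remain after API delete")
    return bucket


if __name__ == "__main__":
    fake = InMemoryS3()
    # In this fake, "disk markers" are modeled by the same in-memory dict.
    print("probe ok:", scratch_probe(fake, markers_exist=fake.bucket_exists))
```

The probe never touches tofu-state or any protected key, which is what makes it safe to run both before and after a recovery action.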

Decision Gate

Keep the current RustFS-backed state path only as an interim authority until one of these gates is satisfied.

Preferred Direction

Move OpenTofu state to a small dedicated HA S3-compatible state authority that is separate from the cache object-store runtime.

That target may be SeaweedFS/shared-S3, another boring replicated S3 endpoint, or a managed object-store equivalent, but it must be evaluated as state authority first and cache storage second.

Required proof:

no-op tofu plan succeeds after state service pod restart
state authority deep check succeeds after pod restart
state authority deep check succeeds through one node maintenance event
versioning, backup, or restore path is explicit
protected applies fail closed before mutation when API health is bad
backend coordinates can be materialized for every active stack without local shell-only drift
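The fail-closed requirement in this list can be made concrete with a small wrapper. This is a sketch under stated assumptions, not the repository's protected-apply guard: health_check stands for whatever deep check the stack uses, apply stands for the mutating step, and both names are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class StateAuthorityUnhealthy(RuntimeError):
    """Raised when the apply is refused because the state backend is not trusted."""


def protected_apply(health_check: Callable[[], bool], apply: Callable[[], T]) -> T:
    """Run apply only after health_check passes; otherwise refuse before mutation."""
    try:
        healthy = health_check()
    except Exception as exc:
        # A check that cannot complete counts as unhealthy: fail closed.
        raise StateAuthorityUnhealthy(f"health check errored: {exc}") from exc
    if not healthy:
        raise StateAuthorityUnhealthy("state authority check failed; refusing to apply")
    return apply()


if __name__ == "__main__":
    try:
        protected_apply(lambda: False, lambda: print("mutating state"))
    except StateAuthorityUnhealthy as err:
        print("blocked:", err)
```

Treating a crashed or timed-out check the same as a failed one is the important choice. An apply that proceeds because the check could not answer is exactly the case the gate is meant to exclude.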

RustFS Retention Gate

Keep RustFS as the state backend only if a follow-up proves:

bucket metadata can be reconciled without a full process restart, or the failure mode is fixed upstream and captured by a version pin
the scratch probe and real state-key checks survive pod restart
the scratch probe and real state-key checks survive node maintenance
lifecycle/bootstrap jobs cannot produce state-bucket index divergence
the operator runbook names the exact non-destructive repair path

Unacceptable Long-Term State

The current singleton RustFS path on its own is not enough for the BCR/RBE horizon. It may remain the interim cache and state backend only with the protected-apply guard in place.

BCR And RBE Boundary

BCR/internal registry work can continue if it uses immutable release artifacts, integrity metadata, and approved source mirrors. BCR correctness must not depend on RustFS bucket-index behavior.

RBE remains a separate gate. RustFS must not be treated as CAS/AC authority for remote execution until storage consistency, durability, recovery, and observability are proved or a different HA store is chosen. Countable RBE still requires a real executor and observed --remote_executor use.