GloriousFlywheel HA State Authority Gate 2026-05-06
TIN-1002 successor gate for selecting and proving the next OpenTofu state-authority target after the RustFS bucket-index RCA. TIN-989 recorded the expected-red HA gate; TIN-1002 captured the candidate plan and guardrail. TIN-1012 now owns turning that gate green without weakening the checks.
Current Live Read
Live read from the honey context on 2026-05-06:
- bumble, honey, and sting are Ready.
- sting is back online and still labeled for compute expansion and KVM, but it is not an OpenEBS ZFS node.
- OpenEBS ZFS has one ZFSNode: bumble.
- all current zfsvolumes.zfs.openebs.io objects are Ready on bumble.
- attic-rustfs-openebs is one RustFS Deployment replica in nix-cache.
- the RustFS pod is running on bumble.
- the state data PVC is attic-rustfs-openebs-data, 50Gi, openebs-bumble-zfs, ReadWriteOnce.
- the RustFS image is pinned by digest to ghcr.io/tinyland-inc/rustfs@sha256:3c2d55977829620284ece8593901bf776bcfc0fc9972784352de4dcffdb92416.
- top-level rustfs --help exposes server, info, and help; it does not expose an obvious admin, heal, repair, or reindex command surface.
The current path is healthier than it was before TIN-986/TIN-987 because protected applies now check S3 state authority before mutation, and the RCA collector can capture API/disk evidence. It is still not HA.
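The check-before-mutation behavior described above amounts to a fail-closed guard: the apply runs only when the state-authority check positively succeeds, and any error counts as a failure. The following is a minimal sketch of that pattern, not the repo's implementation; `guarded_apply`, `failing_check`, and the callables passed in are hypothetical names introduced for illustration.

```python
from typing import Callable


def guarded_apply(check: Callable[[], bool], apply: Callable[[], None]) -> bool:
    """Run apply() only if check() returns exactly True; report whether it ran."""
    try:
        healthy = check()
    except Exception:
        # An unreachable or erroring backend counts as unhealthy,
        # never as "unknown, proceed".
        healthy = False
    if healthy is not True:
        return False
    apply()
    return True


def failing_check() -> bool:
    # Hypothetical check that simulates an unreachable state backend.
    raise ConnectionError("state backend unreachable")


applied = []
print(guarded_apply(failing_check, lambda: applied.append("apply")))
print(applied)  # stays empty because the check raised and apply was skipped
```

The guard treats anything other than a literal True as a failure, so a check that returns None or a truthy placeholder cannot accidentally unlock a mutation.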
Selected Target Class
Select a dedicated HA S3-compatible state authority, separate from the cache object-store runtime, as the next target class.
This deliberately selects a target class rather than a product implementation inside this note. SeaweedFS/shared-S3, replicated RustFS, or another small boring S3 endpoint can still compete, but the selected backend must be judged as OpenTofu state authority first and cache storage second.
The candidate and proof sequence is tracked in HA State Authority Candidate Plan.
The state authority must prove:
- S3 list-buckets, head-bucket, state-key HEAD, and bounded write/read/delete checks.
- no-op OpenTofu plan against at least one protected stack after the backend is wired.
- pod restart survival.
- node-maintenance or node-unavailability survival for the current bumble failure mode.
- explicit backup, restore, retention, and state-object versioning behavior.
- a repo-managed pre-mutation guard comparable to just tofu-state-authority-deep-check.
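The bounded write/read/delete requirement in the list above can be sketched as a probe that writes one uniquely named canary object, reads it back, deletes it, and confirms the deletion. This is an illustration under stated assumptions, not the repo's guard: `ObjectStore`, `InMemoryStore`, `bounded_probe`, and the `probe/` key prefix are hypothetical, "bounded" is read here as "one small canary object per run", and the in-memory store stands in for a real S3 client so the example runs without a network.

```python
import uuid
from typing import Dict, Optional, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface; a real guard would wrap an S3 client."""

    def put(self, key: str, body: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...
    def delete(self, key: str) -> None: ...


class InMemoryStore:
    """Stand-in backend so the sketch runs without network access."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)


def bounded_probe(store: ObjectStore, prefix: str = "probe/") -> bool:
    """Write, read back, delete, and confirm deletion of one canary object.

    The key is unique per run and lives under a dedicated prefix (an
    assumption), so the probe never touches real state objects.
    """
    key = f"{prefix}{uuid.uuid4().hex}"
    body = key.encode()
    store.put(key, body)
    try:
        if store.get(key) != body:
            return False
    finally:
        # Always attempt cleanup so a failed read does not leak canaries.
        store.delete(key)
    return store.get(key) is None


store = InMemoryStore()
print(bounded_probe(store))
```

Keeping the probe on its own prefix separates "the backend accepts writes" from "the state key exists", which the state-key HEAD check covers separately.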
Rejected For Final Authority
Current bumble-local RustFS singleton
Rejected as final authority.
Reasons:
- one RustFS Deployment replica
- one ready service endpoint
- one bumble-bound ReadWriteOnce data PVC
- one OpenEBS ZFS node
- no proved non-restart bucket-index repair path
- prior incident showed NoSuchBucket from S3 while disk bucket markers still existed
Allowed use: guarded interim state authority only.
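The structural reasons above can be expressed as a small evaluator over observed topology facts. The sketch below is hypothetical: the `Topology` fields, the `single_points_of_failure` function, and the "fewer than two" thresholds are assumptions chosen for illustration, not checks taken from the repo gate. The incident-history reason is not modeled because it is evidence rather than topology.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Topology:
    """Observed facts about a candidate backend (hypothetical field names)."""

    replicas: int
    ready_endpoints: int
    pvc_access_mode: str
    storage_nodes: int
    has_online_index_repair: bool


def single_points_of_failure(t: Topology) -> List[str]:
    """Return rejection reasons; an empty list means none of these were found."""
    reasons = []
    if t.replicas < 2:
        reasons.append("single Deployment replica")
    if t.ready_endpoints < 2:
        reasons.append("single ready service endpoint")
    if t.pvc_access_mode == "ReadWriteOnce" and t.storage_nodes < 2:
        reasons.append("node-bound ReadWriteOnce data PVC")
    if t.storage_nodes < 2:
        reasons.append("single storage node")
    if not t.has_online_index_repair:
        reasons.append("no proved non-restart bucket-index repair path")
    return reasons


# Values mirror the rejection reasons listed above for the current singleton.
current = Topology(
    replicas=1,
    ready_endpoints=1,
    pvc_access_mode="ReadWriteOnce",
    storage_nodes=1,
    has_online_index_repair=False,
)
for reason in single_points_of_failure(current):
    print("-", reason)
```

An empty result would only mean these specific single points of failure are absent; it would not replace the restart, node-unavailability, and restore proofs the state authority still has to pass.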
Sting local-path storage
Rejected as durable state authority.
Reasons:
- Sting fast-local classes are scratch/cache/recoverable capacity, not replicated storage
- a Ready node plus local-path storage does not create state availability when the object-store metadata authority is still singleton
Allowed use: cache, scratch, and recoverable workloads only.
Bazel cache or Attic cache store as state authority
Rejected as the state-backend selection criterion.
Reasons:
- cache systems are allowed to trade availability for rebuild latency
- OpenTofu state is mutation authority and must fail closed before apply
- BCR/RBE correctness must not depend on RustFS bucket-index behavior
Allowed use: acceleration layers with separate durability expectations.
Repo Gate
Use the expected-red readiness gate to keep this boundary visible:
just tofu-state-ha-readiness --expect-interim
Without --expect-interim, the gate exits nonzero until TIN-1012 proves a
state authority that satisfies the HA requirements. That makes the known
interim posture machine-checkable without teaching operators that the current
RustFS singleton is acceptable final state.
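The exit behavior can be modeled as a small truth table. This sketch is an assumption-laden illustration, not the gate's source: `gate_exit_code` is a hypothetical function, and this note does not say what the gate does when --expect-interim is passed after HA has been proved, so the sketch assumes it fails in that case to force removal of the flag.

```python
def gate_exit_code(ha_proven: bool, expect_interim: bool) -> int:
    """Return a process exit code for the readiness gate (0 = pass)."""
    if expect_interim:
        # Assumption: the flag asserts "we know this is still interim",
        # so it passes only while HA is not yet proved.
        return 0 if not ha_proven else 1
    # Without the flag, the gate stays red until HA is proved.
    return 0 if ha_proven else 1


for ha_proven in (False, True):
    for expect_interim in (False, True):
        code = gate_exit_code(ha_proven, expect_interim)
        print(f"ha_proven={ha_proven} expect_interim={expect_interim} exit={code}")
```

Treating the flag as an assertion rather than a bypass keeps the interim posture explicit: the red state is acknowledged, not silenced.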
RBE Boundary
This is OpenTofu state availability hardening. It is not RBE.
RustFS must not be treated as Bazel CAS/AC authority until a separate storage
consistency, durability, recovery, and observability proof lands. Countable RBE
still requires a real executor endpoint and observed --remote_executor use.