GloriousFlywheel HA State Authority Gate 2026-05-06

This is the TIN-1002 successor gate for selecting and proving the next OpenTofu state-authority target after the RustFS bucket-index RCA. TIN-989 recorded the expected-red HA gate; TIN-1002 captured the candidate plan and guardrail. TIN-1012 now owns turning that gate green without weakening the checks.

Current Live Read

Live read from the honey context on 2026-05-06:

  • bumble, honey, and sting are Ready.
  • sting is back online and still labeled for compute expansion and KVM, but it is not an OpenEBS ZFS node.
  • OpenEBS ZFS has one ZFSNode: bumble.
  • all current zfsvolumes.zfs.openebs.io objects are Ready on bumble.
  • attic-rustfs-openebs is one RustFS Deployment replica in nix-cache.
  • the RustFS pod is running on bumble.
  • the state data PVC is attic-rustfs-openebs-data, 50Gi, openebs-bumble-zfs, ReadWriteOnce.
  • the RustFS image is pinned by digest to ghcr.io/tinyland-inc/rustfs@sha256:3c2d55977829620284ece8593901bf776bcfc0fc9972784352de4dcffdb92416.
  • top-level rustfs --help exposes server, info, and help; it does not expose an obvious admin, heal, repair, or reindex command surface.

The current path is healthier than it was before TIN-986/TIN-987 because protected applies now check S3 state authority before mutation, and the RCA collector can capture API/disk evidence. It is still not HA.

Selected Target Class

Select a dedicated HA S3-compatible state authority, separate from the cache object-store runtime, as the next target class.

This note deliberately selects a target class rather than a product implementation. SeaweedFS/shared-S3, replicated RustFS, or another small, boring S3 endpoint can still compete, but the selected backend must be judged as OpenTofu state authority first and cache storage second.

The candidate and proof sequence is tracked in HA State Authority Candidate Plan.

The state authority must prove:

  • S3 list-buckets, head-bucket, state-key HEAD, and bounded write/read/delete checks.
  • no-op OpenTofu plan against at least one protected stack after the backend is wired.
  • pod restart survival.
  • node-maintenance or node-unavailability survival for the current bumble failure mode.
  • explicit backup, restore, retention, and state-object versioning behavior.
  • a repo-managed pre-mutation guard comparable to just tofu-state-authority-deep-check.
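
The S3 checks in the first item can be sketched roughly as follows. This is a minimal illustration, not the implementation behind just tofu-state-authority-deep-check, which this note does not show. It assumes a client object that exposes the boto3 S3 client method names (list_buckets, head_bucket, head_object, put_object, get_object, delete_object). The function name deep_check and the probe prefix ha-gate-probe/ are hypothetical. The write/read/delete step is bounded to one small probe object and never touches the state key itself.

```python
import uuid


def deep_check(s3, bucket, state_key, probe_prefix="ha-gate-probe/"):
    """Run the read checks and one bounded write/read/delete round trip.

    `s3` is any object exposing the boto3 S3 client method names used below.
    Returns a list of failure strings; an empty list means every check passed.
    The function name and probe prefix are hypothetical, not repo names.
    """
    failures = []

    def step(name, fn):
        # Any exception counts as a failed check; nothing is retried or ignored.
        try:
            fn()
        except Exception as exc:
            failures.append(f"{name}: {exc}")

    def list_has_bucket():
        names = [b["Name"] for b in s3.list_buckets().get("Buckets", [])]
        if bucket not in names:
            raise RuntimeError(f"{bucket} missing from list-buckets")

    step("list-buckets", list_has_bucket)
    step("head-bucket", lambda: s3.head_bucket(Bucket=bucket))
    step("state-key HEAD", lambda: s3.head_object(Bucket=bucket, Key=state_key))

    # Bounded probe: one uniquely named object, removed even if the read fails.
    probe_key = f"{probe_prefix}{uuid.uuid4().hex}"
    payload = probe_key.encode()

    def round_trip():
        s3.put_object(Bucket=bucket, Key=probe_key, Body=payload)
        try:
            body = s3.get_object(Bucket=bucket, Key=probe_key)["Body"].read()
            if body != payload:
                raise RuntimeError("probe read did not match probe write")
        finally:
            s3.delete_object(Bucket=bucket, Key=probe_key)

    step("write/read/delete", round_trip)
    return failures
```

With boto3 installed, the client could be built with boto3.client("s3", endpoint_url=...) pointed at the candidate endpoint. Listing and HEAD are kept as separate steps because the prior incident returned NoSuchBucket from S3 while the disk bucket markers still existed, so one passing step should not stand in for another.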

Rejected For Final Authority

Current bumble-local RustFS singleton

Rejected as final authority.

Reasons:

  • one RustFS Deployment replica
  • one ready service endpoint
  • one bumble-bound ReadWriteOnce data PVC
  • one OpenEBS ZFS node
  • no proved non-restart bucket-index repair path
  • prior incident showed NoSuchBucket from S3 while disk bucket markers still existed

Allowed use: guarded interim state authority only.

Sting local-path storage

Rejected as durable state authority.

Reasons:

  • Sting fast-local classes are scratch/cache/recoverable capacity, not replicated storage
  • a Ready node plus local-path storage does not create state availability when the object-store metadata authority is still singleton

Allowed use: cache, scratch, and recoverable workloads only.

Bazel cache or Attic cache store as state authority

Rejected as the state-backend selection criterion.

Reasons:

  • cache systems are allowed to trade availability for rebuild latency
  • OpenTofu state is mutation authority and must fail closed before apply
  • BCR/RBE correctness must not depend on RustFS bucket-index behavior

Allowed use: acceleration layers with separate durability expectations.
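
The difference between the two failure policies can be illustrated with a small sketch. All names here (cache_lookup, guarded_apply, StateAuthorityUnavailable) are hypothetical and do not come from the repo. The cache path treats a backend error as a miss and lets the caller rebuild, while the state path refuses to run the apply step if the pre-mutation check fails or errors.

```python
class StateAuthorityUnavailable(RuntimeError):
    """Raised when the state backend cannot be verified before a mutation."""


def cache_lookup(fetch, key):
    # Cache path: a backend error is treated as a miss, so the caller rebuilds.
    try:
        return fetch(key)
    except Exception:
        return None


def guarded_apply(check, apply):
    # State path: any check failure or error blocks the apply entirely.
    try:
        failures = check()
    except Exception as exc:
        raise StateAuthorityUnavailable(f"check errored: {exc}") from exc
    if failures:
        raise StateAuthorityUnavailable("; ".join(failures))
    return apply()
```

Here check is assumed to return a list of failure strings, with an empty list meaning the state authority was verified. The apply callable is only reached on that empty result.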

Repo Gate

Use the expected-red readiness gate to keep this boundary visible:

just tofu-state-ha-readiness --expect-interim

Without --expect-interim, the gate exits nonzero until TIN-1012 proves a state authority that satisfies the HA requirements. That makes the known interim posture machine-checkable without teaching operators that the current RustFS singleton is acceptable final state.
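
The exit behavior described above could look roughly like the following sketch. It is not the recipe behind just tofu-state-ha-readiness, whose implementation is not shown in this note. The Posture fields mirror the rejection reasons listed for the bumble-local singleton, but the numeric thresholds (at least two replicas and two storage nodes) are illustrative assumptions rather than the HA requirements TIN-1012 must meet. The sketch also assumes that --expect-interim still passes once HA is proved, which the note does not state.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Posture:
    replicas: int             # object-store replicas serving state
    storage_nodes: int        # nodes able to host the state volume
    repair_path_proved: bool  # a non-restart bucket-index repair path is proved


def blockers(p):
    # Thresholds are assumptions for illustration, not the repo's HA definition.
    out = []
    if p.replicas < 2:
        out.append("single object-store replica")
    if p.storage_nodes < 2:
        out.append("single storage node")
    if not p.repair_path_proved:
        out.append("no proved non-restart repair path")
    return out


def gate_exit_code(p, expect_interim):
    # Green when nothing blocks HA; otherwise green only if interim is expected.
    if not blockers(p):
        return 0
    return 0 if expect_interim else 1


def main(argv, posture):
    parser = argparse.ArgumentParser(prog="tofu-state-ha-readiness")
    parser.add_argument("--expect-interim", action="store_true")
    args = parser.parse_args(argv)
    # Reasons are printed in both modes so the interim posture stays visible.
    for reason in blockers(posture):
        print(f"not HA: {reason}")
    return gate_exit_code(posture, args.expect_interim)
```

For the current live read (one replica, one ZFS node, no proved repair path), main([], posture) returns 1 and main(["--expect-interim"], posture) returns 0.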

RBE Boundary

This is OpenTofu state availability hardening. It is not RBE.

RustFS must not be treated as Bazel CAS/AC authority until a separate storage consistency, durability, recovery, and observability proof lands. Countable RBE still requires a real executor endpoint and observed --remote_executor use.
