GloriousFlywheel HA State Authority Candidate Plan 2026-05-06
TIN-1002 decision package for moving OpenTofu state beyond the current guarded RustFS singleton.
Decision
Select a dedicated HA S3-compatible state authority as the next proof target.
The selected target class is deliberately separate from the Attic, Bazel cache, and public-input mirror object-store path. It must be judged as OpenTofu state authority first and cache storage second.
Do not promote the current attic-rustfs-openebs RustFS deployment to final
state authority. It remains acceptable only as guarded interim authority while
strict just tofu-state-ha-readiness stays expected-red.
Live Constraints
Live read from honey on 2026-05-06:
- cluster nodes bumble, honey, and sting are Ready.
- OpenEBS ZFS has one node, bumble.
- sting is Ready and labeled for compute expansion and KVM, but it is not an OpenEBS ZFS node.
- Sting local-path classes are scratch/cache/recoverable capacity, not durable state authority.
- current tofu-state storage is one RustFS Deployment replica on one bumble-scoped ReadWriteOnce OpenEBS ZFS PVC.
- the current RustFS image exposes no obvious top-level admin/heal/repair/reindex command surface.
Candidate Classes
| candidate class | posture | reason |
|---|---|---|
| dedicated HA S3-compatible state service | primary spike | separates mutation authority from cache object-store failure modes |
| replicated RustFS with repair proof | conditional | viable only if bucket-index repair and failure survival are proved |
| current bumble-local RustFS singleton | interim only | one replica, one endpoint, one bumble-scoped RWO PVC |
| Sting local-path storage | rejected final | node-local scratch/cache is not replicated state authority |
| Attic or Bazel cache object-store surface | rejected final | cache availability is not the same contract as OpenTofu state authority |
Proof Phases
Phase 0: Candidate Static Gate
Before any live state migration, the candidate must have a written static contract:
- S3 endpoint shape and audience
- credential source and rotation owner
- bucket versioning, retention, backup, or restore behavior
- failure behavior when one pod or one node is unavailable
- observability and alert surface for bucket/index/API divergence
- explicit statement that Attic, Bazel cache, and RBE CAS/AC are not using the state bucket as authority
Phase 1: Scratch S3 Proof
Use a non-state scratch bucket/path first. The proof must show:
- list-buckets
- head-bucket
- bounded object write/read/delete
- bucket/object HEAD after one pod restart
- bucket/object HEAD through one node-maintenance or node-unavailability event
- failure capture that is non-destructive to OpenTofu state
Do not point any active stack at the candidate during this phase.
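The scratch checks above could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: it assumes a boto3-style S3 client is passed in (for example one built with `boto3.client("s3", endpoint_url=...)`), and the function name, key prefix, and payload are hypothetical. Pod restarts and node-maintenance events are triggered outside the script; the idea is to run it before, during, and after each event and keep the results as evidence.

```python
import uuid


def run_scratch_proof(client, bucket, prefix="ha-proof/"):
    """Run the bounded scratch checks and return a list of (step, ok, detail).

    `client` is assumed to expose the boto3 S3 client methods used below.
    Only the single object written by this run is deleted.
    """
    key = f"{prefix}{uuid.uuid4().hex}"
    payload = b"scratch-proof"
    results = []

    def step(name, fn):
        # Record failures instead of raising so one failed step
        # does not hide the outcome of the steps after it.
        try:
            fn()
            results.append((name, True, ""))
        except Exception as exc:
            results.append((name, False, repr(exc)))

    def read_back():
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        if body != payload:
            raise ValueError("read-back mismatch")

    step("list-buckets", lambda: client.list_buckets())
    step("head-bucket", lambda: client.head_bucket(Bucket=bucket))
    step("put-object", lambda: client.put_object(Bucket=bucket, Key=key, Body=payload))
    step("get-object", read_back)
    step("head-object", lambda: client.head_object(Bucket=bucket, Key=key))
    step("delete-object", lambda: client.delete_object(Bucket=bucket, Key=key))
    return results
```

Passing the client in keeps the script independent of how credentials and the endpoint are sourced, and makes it easy to point only at the scratch bucket.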
Phase 2: Disposable OpenTofu Proof
Wire a disposable or non-production OpenTofu stack to the candidate backend. The proof must show:
- backend init succeeds from repo-managed config
- first state write succeeds
- no-op plan succeeds
- no-op plan still succeeds after pod restart
- no-op plan still succeeds through one node-maintenance or node-unavailability event
- restore or rollback path is exercised on non-production state
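One way to drive the disposable proof is sketched below. It assumes the `tofu` CLI is on the PATH and that the disposable stack lives in its own working directory; the function names and the backend config path are hypothetical. It relies on `tofu plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending.

```python
import subprocess


def proof_commands(backend_config):
    """Return the ordered tofu invocations for the disposable proof."""
    return [
        # Backend init from a repo-managed config file (path is an assumption).
        ["tofu", "init", "-input=false", f"-backend-config={backend_config}"],
        # First state write, on the disposable stack only.
        ["tofu", "apply", "-input=false", "-auto-approve"],
        # No-op plan check.
        ["tofu", "plan", "-input=false", "-detailed-exitcode"],
    ]


def classify_plan(exit_code):
    """Map a `tofu plan -detailed-exitcode` exit code to a proof verdict."""
    if exit_code == 0:
        return "no-op"
    if exit_code == 2:
        return "drift"
    return "error"


def run_noop_plan(workdir, runner=subprocess.run):
    """Run the no-op plan in `workdir` and classify the result."""
    cmd = ["tofu", "plan", "-input=false", "-detailed-exitcode"]
    return classify_plan(runner(cmd, cwd=workdir).returncode)
```

`run_noop_plan` is the piece to repeat after the pod restart and during the node-maintenance event; anything other than `"no-op"` would count against the candidate.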
Phase 3: Protected Stack Migration Plan
Only after the scratch and disposable proofs pass, write the migration plan for the four active stacks:
- attic
- arc-runners
- gitlab-runners
- runner-dashboard
Migration must be one stack at a time. Each stack needs a state pull, backend config review, init/migrate plan, no-op plan, and rollback note before apply.
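The one-stack-at-a-time rule can be expressed as a small gate. The sketch below is illustrative only: the evidence keys are hypothetical names for the five prerequisites listed above, and the stack order simply follows the order in which the stacks are listed, which this plan does not prescribe.

```python
# Stack order here is an assumption, not a decision made by this plan.
STACKS = ["attic", "arc-runners", "gitlab-runners", "runner-dashboard"]

# Hypothetical keys for the per-stack prerequisites.
REQUIRED = [
    "state_pull",
    "backend_config_review",
    "init_migrate_plan",
    "noop_plan",
    "rollback_note",
]


def missing_prereqs(evidence):
    """Return the prerequisites that have no recorded evidence yet."""
    return [item for item in REQUIRED if not evidence.get(item)]


def next_stack(evidence_by_stack, migrated):
    """Return (stack, missing) for the first unmigrated stack, or None when done.

    Only one stack is ever returned, so a later stack cannot start
    while an earlier one is still in progress.
    """
    for stack in STACKS:
        if stack in migrated:
            continue
        return stack, missing_prereqs(evidence_by_stack.get(stack, {}))
    return None
```

A stack would be cleared for apply only when `next_stack` returns it with an empty `missing` list.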
Phase 4: Implementation Gate
TIN-1002 selected the candidate plan and expected-red guardrail. TIN-1012 closes only when strict mode passes:
just tofu-state-ha-readiness
Until then, evidence capture should use:
just tofu-state-ha-readiness --expect-interim
Stop Conditions
Stop before migration if any of these are true:
- fewer than two independent service endpoints, unless the backend proves an equivalent HA authority model
- state data remains on a single bumble-scoped RWO volume
- no backup, restore, retention, or versioning behavior is documented
- bucket-index divergence has no non-destructive response
- candidate proof depends on Attic or Bazel cache state staying healthy
- candidate proof is only a cache hit, ARC job dispatch, or RBE-adjacent signal
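The stop conditions lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the `CandidateFacts` fields are hypothetical names for the facts a reviewer would record, and the function is not part of the `just tofu-state-ha-readiness` recipe.

```python
from dataclasses import dataclass


@dataclass
class CandidateFacts:
    # All field names are hypothetical; they mirror the stop conditions above.
    independent_endpoints: int
    equivalent_ha_model_proved: bool
    single_rwo_volume: bool
    durability_behavior_documented: bool  # backup, restore, retention, or versioning
    divergence_response_nondestructive: bool
    depends_on_cache_health: bool
    proof_is_cache_or_rbe_signal_only: bool


def stop_reasons(facts):
    """Return the stop conditions that currently apply (empty list means go)."""
    reasons = []
    if facts.independent_endpoints < 2 and not facts.equivalent_ha_model_proved:
        reasons.append("fewer than two independent endpoints without an equivalent HA model")
    if facts.single_rwo_volume:
        reasons.append("state data on a single RWO volume")
    if not facts.durability_behavior_documented:
        reasons.append("no documented backup, restore, retention, or versioning behavior")
    if not facts.divergence_response_nondestructive:
        reasons.append("no non-destructive response to bucket-index divergence")
    if facts.depends_on_cache_health:
        reasons.append("proof depends on Attic or Bazel cache health")
    if facts.proof_is_cache_or_rbe_signal_only:
        reasons.append("proof is only a cache hit, ARC dispatch, or RBE-adjacent signal")
    return reasons
```

Returning the full list of reasons, rather than stopping at the first, keeps the evidence capture complete for review.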
BCR And RBE Boundary
This is OpenTofu state availability work. It is not BCR, RBE, or CAS/AC authority.
BCR work can continue against immutable releases, approved mirrors, and
integrity metadata. RBE still requires a real executor endpoint and observed
--remote_executor use. Neither should depend on RustFS bucket-index behavior
or on the future state-authority bucket.