GloriousFlywheel HA State Authority Candidate Plan 2026-05-06

TIN-1002 decision package for moving OpenTofu state beyond the current guarded RustFS singleton.

Decision

Select a dedicated HA S3-compatible state authority as the next proof target.

The selected target class is deliberately separate from the Attic, Bazel cache, and public-input mirror object-store path. It must be judged as OpenTofu state authority first and cache storage second.

Do not promote the current attic-rustfs-openebs RustFS deployment to final state authority. It remains acceptable only as a guarded interim authority while the strict readiness check (just tofu-state-ha-readiness) stays expected-red.

Live Constraints

Live read from honey on 2026-05-06:

  • cluster nodes bumble, honey, and sting are Ready.
  • OpenEBS ZFS has one node, bumble.
  • sting is Ready and labeled for compute expansion and KVM, but it is not an OpenEBS ZFS node.
  • Sting local-path storage classes provide scratch, cache, or recoverable capacity, not durable state authority.
  • current tofu-state storage is one RustFS Deployment replica on one bumble-scoped ReadWriteOnce OpenEBS ZFS PVC.
  • the current RustFS image exposes no obvious top-level admin/heal/repair/reindex command surface.

Candidate Classes

candidate class                           | posture        | reason
dedicated HA S3-compatible state service  | primary spike  | separates mutation authority from cache object-store failure modes
replicated RustFS with repair proof       | conditional    | viable only if bucket-index repair and failure survival are proved
current bumble-local RustFS singleton     | interim only   | one replica, one endpoint, one bumble-scoped RWO PVC
Sting local-path storage                  | rejected final | node-local scratch/cache is not replicated state authority
Attic or Bazel cache object-store surface | rejected final | cache availability is not the same contract as OpenTofu state authority

Proof Phases

Phase 0: Candidate Static Gate

Before any live state migration, the candidate must have a written static contract:

  • S3 endpoint shape and audience
  • credential source and rotation owner
  • bucket versioning, retention, backup, or restore behavior
  • failure behavior when one pod or one node is unavailable
  • observability and alert surface for bucket/index/API divergence
  • explicit statement that Attic, Bazel cache, and RBE CAS/AC are not using the state bucket as authority
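The static contract above can be treated as a checklist that blocks later phases while any item is blank. The following is a minimal sketch of that idea, not the project's actual tooling; the class name, field names, and helper functions are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class StaticContract:
    """Hypothetical record of the Phase 0 written contract (one field per bullet)."""
    s3_endpoint_and_audience: str = ""
    credential_source_and_rotation_owner: str = ""
    versioning_retention_backup_restore: str = ""
    pod_or_node_failure_behavior: str = ""
    divergence_observability_and_alerts: str = ""
    cache_surfaces_excluded_statement: str = ""


def missing_items(contract: StaticContract) -> List[str]:
    """Return the names of contract items that are still blank."""
    return [f.name for f in fields(contract)
            if not getattr(contract, f.name).strip()]


def static_gate_passes(contract: StaticContract) -> bool:
    """The gate passes only when every item has written content."""
    return not missing_items(contract)
```

A contract with only the endpoint filled in would fail the gate and report the other five items as missing.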

Phase 1: Scratch S3 Proof

Use a non-state scratch bucket/path first. The proof must show:

  • list-buckets
  • head-bucket
  • bounded object write/read/delete
  • bucket/object HEAD after one pod restart
  • bucket/object HEAD through one node-maintenance or node-unavailability event
  • failure capture that is non-destructive to OpenTofu state
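The scratch checks above could be scripted roughly as follows. This is a sketch under assumptions: `client` is expected to expose the boto3 S3 client method names used here, and the function name, probe key, and payload are hypothetical. It touches only the probe object in the scratch bucket and records a failure instead of raising, so a failed run leaves OpenTofu state untouched.

```python
def scratch_s3_proof(client, bucket, key="ha-proof/probe.txt"):
    """Run the bounded Phase 1 checks against a scratch bucket.

    Assumes `client` behaves like a boto3 S3 client. Only the probe key
    is written and deleted; no state bucket or state object is touched.
    """
    payload = b"scratch-probe"
    results = {}
    try:
        names = [b["Name"] for b in client.list_buckets()["Buckets"]]
        results["list_buckets"] = bucket in names
        client.head_bucket(Bucket=bucket)  # raises if the bucket is unreachable
        results["head_bucket"] = True
        client.put_object(Bucket=bucket, Key=key, Body=payload)
        head = client.head_object(Bucket=bucket, Key=key)
        results["head_object"] = head["ContentLength"] == len(payload)
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        results["read_back"] = body == payload
        client.delete_object(Bucket=bucket, Key=key)
        results["delete"] = True
    except Exception as exc:  # capture the failure as evidence, do not retry
        results["error"] = repr(exc)
    return results
```

The same function would be rerun after a pod restart and during a node-maintenance window, with each result dictionary kept as evidence. With boto3 installed, a client for an S3-compatible endpoint is typically built with `boto3.client("s3", endpoint_url=...)`; the endpoint value for the candidate is not specified here.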

Do not point any active stack at the candidate during this phase.

Phase 2: Disposable OpenTofu Proof

Wire a disposable or non-production OpenTofu stack to the candidate backend. The proof must show:

  • backend init succeeds from repo-managed config
  • first state write succeeds
  • no-op plan succeeds
  • no-op plan still succeeds after pod restart
  • no-op plan still succeeds through one node-maintenance or node-unavailability event
  • restore or rollback path is exercised on non-production state
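One way to make "no-op plan succeeds" checkable is to rely on `tofu plan -detailed-exitcode`, which exits 0 when the plan is empty, 2 when changes are pending, and 1 on error. The sketch below assumes the `tofu` binary is on PATH and that the disposable stack lives in a working directory of its own; the function names and the backend config path are hypothetical.

```python
import subprocess
from typing import List


def proof_commands(backend_config: str) -> List[List[str]]:
    """Commands for the first pass of the disposable proof, in order."""
    return [
        ["tofu", "init", "-input=false", f"-backend-config={backend_config}"],
        ["tofu", "apply", "-input=false", "-auto-approve"],  # first state write
        ["tofu", "plan", "-input=false", "-detailed-exitcode"],
    ]


def classify_plan_exit(code: int) -> str:
    """Map the -detailed-exitcode result to a proof outcome."""
    return {0: "no-op", 2: "changes-pending"}.get(code, "error")


def run_noop_plan(workdir: str) -> str:
    """Re-run the plan, for example after a pod restart or node drain."""
    proc = subprocess.run(
        ["tofu", "plan", "-input=false", "-detailed-exitcode"], cwd=workdir
    )
    return classify_plan_exit(proc.returncode)
```

Only a "no-op" result would count as a pass for the post-restart and node-unavailability checks; "changes-pending" on an unchanged disposable stack would itself be a signal worth capturing.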

Phase 3: Protected Stack Migration Plan

Only after the scratch and disposable proofs pass, write the migration plan for the four active stacks:

  • attic
  • arc-runners
  • gitlab-runners
  • runner-dashboard

Migration must be one stack at a time. Each stack needs a state pull, backend config review, init/migrate plan, no-op plan, and rollback note before apply.
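The one-stack-at-a-time rule can be expressed as a small gate that names the next stack and what it still lacks. This is an illustrative sketch only: the evidence keys are hypothetical labels for the five prerequisites above, and the migration order is assumed to follow the list order, which the plan does not state.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumption: migrate in the order the stacks are listed.
STACK_ORDER = ["attic", "arc-runners", "gitlab-runners", "runner-dashboard"]

# Hypothetical labels for the per-stack prerequisites.
REQUIRED_EVIDENCE = [
    "state_pull",
    "backend_config_review",
    "init_migrate_plan",
    "noop_plan",
    "rollback_note",
]


def next_migration(
    evidence: Dict[str, Set[str]], migrated: Set[str]
) -> Tuple[Optional[str], List[str]]:
    """Return the next unmigrated stack and its missing prerequisites.

    Only one stack is ever returned, so later stacks cannot start
    until the earlier one is recorded as migrated.
    """
    for stack in STACK_ORDER:
        if stack in migrated:
            continue
        have = evidence.get(stack, set())
        missing = [item for item in REQUIRED_EVIDENCE if item not in have]
        return stack, missing
    return None, []
```

An empty missing list for the returned stack means it is ready for apply; `(None, [])` means all four stacks are done.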

Phase 4: Implementation Gate

TIN-1002 selected the candidate plan and expected-red guardrail. TIN-1012 closes only when strict mode passes:

just tofu-state-ha-readiness

Until then, evidence capture should use:

just tofu-state-ha-readiness --expect-interim

Stop Conditions

Stop before migration if any of these are true:

  • fewer than two independent service endpoints, unless the backend proves an equivalent HA authority model
  • state data remains on a single bumble-scoped RWO volume
  • no backup, restore, retention, or versioning behavior is documented
  • bucket-index divergence has no non-destructive response
  • candidate proof depends on Attic or Bazel cache state staying healthy
  • candidate proof is only a cache hit, ARC job dispatch, or RBE-adjacent signal
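The stop conditions above can be evaluated mechanically once the facts about a candidate are recorded. The following is a minimal sketch, assuming a hand-filled facts record; the class and field names are hypothetical and do not describe the behavior of just tofu-state-ha-readiness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateFacts:
    """Hypothetical summary of what the proofs established about a candidate."""
    independent_endpoints: int
    equivalent_ha_model_proved: bool
    single_bumble_rwo_volume: bool
    recovery_behavior_documented: bool  # backup, restore, retention, or versioning
    nondestructive_divergence_response: bool
    depends_on_cache_health: bool  # Attic or Bazel cache must stay healthy
    proof_is_only_adjacent_signal: bool  # cache hit, ARC dispatch, RBE-adjacent


def stop_reasons(f: CandidateFacts) -> List[str]:
    """Return every stop condition that currently holds; empty means proceed."""
    reasons = []
    if f.independent_endpoints < 2 and not f.equivalent_ha_model_proved:
        reasons.append("fewer than two independent service endpoints")
    if f.single_bumble_rwo_volume:
        reasons.append("state data on a single bumble-scoped RWO volume")
    if not f.recovery_behavior_documented:
        reasons.append("no documented backup, restore, retention, or versioning")
    if not f.nondestructive_divergence_response:
        reasons.append("no non-destructive response to bucket-index divergence")
    if f.depends_on_cache_health:
        reasons.append("proof depends on Attic or Bazel cache health")
    if f.proof_is_only_adjacent_signal:
        reasons.append("proof is only a cache hit, ARC dispatch, or RBE-adjacent signal")
    return reasons
```

Fed the facts from the live constraints, the current singleton would trip at least the first two conditions: one endpoint with no proved equivalent HA model, and one bumble-scoped RWO volume.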

BCR And RBE Boundary

This is OpenTofu state availability work. It is not BCR, RBE, or CAS/AC authority.

BCR work can continue against immutable releases, approved mirrors, and integrity metadata. RBE still requires a real executor endpoint and observed --remote_executor use. Neither should depend on RustFS bucket-index behavior or on the future state-authority bucket.
