GloriousFlywheel HA State Authority Candidate Contract 2026-05-07
TIN-1014 contract for the next TIN-1012 implementation proof.
Decision
The next proof target is a dedicated HA S3-compatible OpenTofu state authority.
This contract does not select the current attic-rustfs-openebs singleton as
the final backend. Current RustFS can remain guarded interim authority while it
is healthy, but it does not satisfy the strict state-authority gate.
The candidate must be judged as OpenTofu state authority first. It must not be coupled to Attic cache availability, Bazel remote-cache availability, BCR input mirrors, or future RBE CAS/action-cache authority.
Current Live Evidence
Live read on 2026-05-07:
- just tofu-state-ha-readiness --expect-interim passes as evidence capture, but still returns INTERIM_ONLY.
- just rustfs-bucket-index-rca --scratch-probe --since 30m passed against the current RustFS service.
- scratch bucket gf-rca-20260507005853-918208ce was created, listed, written, read, deleted, and confirmed absent after cleanup.
- scratch disk markers appeared under both /data/<bucket> and /data/.rustfs.sys/buckets/<bucket>, then disappeared after bucket cleanup.
- required tofu-state objects are reachable:
  - attic/terraform.tfstate
  - arc-runners/terraform.tfstate
  - tinyland-infra/gitlab-runners/terraform.tfstate
  - tinyland-infra/runner-dashboard/terraform.tfstate
- live RustFS image: ghcr.io/tinyland-inc/rustfs@sha256:3c2d55977829620284ece8593901bf776bcfc0fc9972784352de4dcffdb92416.
- live RustFS binary reports rustfs v1.0.0-beta.1.
- the live container does not include rc or rustfs-admin, even though the public RustFS troubleshooting docs describe rc admin heal style repair.
This is a coherent current backend. It is not an HA backend.
Candidate Requirements
The selected state backend must have a written contract before any active stack state is migrated.
Required contract fields:
- S3 endpoint shape and audience.
- credential source, rotation owner, and read/write scope.
- bucket versioning, retention, backup, or restore behavior.
- state locking behavior for OpenTofu.
- failure behavior when one service pod is unavailable.
- failure behavior when one node or storage failure domain is unavailable.
- observability for S3 API, bucket metadata, object availability, and state lock failures.
- non-destructive response for bucket/API/index divergence, or evidence that the selected backend removes the current RustFS bucket-index failure class.
- explicit separation from Attic cache, Bazel cache, BCR/Bzlmod mirrors, and RBE CAS/action-cache authority.
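The required fields above lend themselves to a mechanical presence check. The following is a minimal sketch only, not the repo's static gate: every JSON key name is hypothetical, because the real contract schema is not shown in this document.

```python
from __future__ import annotations

# Hypothetical key names; the real contract schema is not defined in this document.
REQUIRED_FIELDS = (
    "s3_endpoint_shape",
    "audience",
    "credential_source",
    "rotation_owner",
    "read_write_scope",
    "recovery_behavior",  # versioning, retention, backup, or restore
    "opentofu_locking_behavior",
    "pod_failure_behavior",
    "node_failure_behavior",
    "observability",
    "bucket_index_divergence_response",
    "authority_separation",
)


def _is_blank(value) -> bool:
    """Treat None, whitespace-only strings, and empty containers as unwritten."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict, tuple, set)):
        return len(value) == 0
    return False


def missing_contract_fields(contract: dict) -> list[str]:
    """Return the required fields that are absent or blank, in checklist order."""
    return [name for name in REQUIRED_FIELDS if _is_blank(contract.get(name))]
```

An empty result means every field is at least written down; it says nothing about whether the written behavior has been proved.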
Candidate Classes
| candidate class | posture | reason |
|---|---|---|
| dedicated HA S3-compatible state service | primary proof target | separates OpenTofu mutation authority from cache object-store incidents |
| managed or appliance S3-compatible state bucket with versioning | acceptable candidate | removes cluster-local single-node storage as the state authority if credentials, locking, recovery, and tailnet/operator access are proved |
| cluster-local MinIO or AIStor-style distributed object store | conditional candidate | viable only with an explicit multi-node/multi-drive quorum design and separate state-only bucket policy |
| replicated RustFS | conditional candidate | viable only if multi-node topology and admin/heal tooling are present in the deployed image and proved under failure |
| current attic-rustfs-openebs singleton | rejected final | one pod, one endpoint, one OpenEBS ZFS node, one bumble-scoped RWO PVC |
| Sting local-path storage | rejected final | fast scratch/cache/recoverable capacity is not replicated state authority |
| Attic or Bazel cache object-store surface | rejected final | cache availability is not OpenTofu state authority |
Stop/Go Decision
Go only if the candidate can prove one of these:
- at least two independent ready service endpoints and state durability through one endpoint loss, or
- equivalent managed/backend HA semantics documented by the provider and verified with a scratch proof.
Stop if any of these remain true:
- state data is on a single bumble-scoped RWO volume.
- the proof depends on the current Attic/Bazel cache object-store path.
- bucket versioning, retention, backup, or restore behavior is absent.
- state locking behavior is undefined.
- node-maintenance proof requires mutating a protected stack.
- the proof signal is only cache hit rate, ARC runner scheduling, or any other RBE-adjacent observation.
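The Go and Stop lists above can be read as a single decision function: any Stop condition blocks the candidate, and at least one Go condition must hold. The sketch below assumes a hypothetical evidence record; its field names are illustrative and do not come from any repo tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEvidence:
    """Observed facts about a candidate; all field names are hypothetical."""
    independent_ready_endpoints: int
    durable_through_one_endpoint_loss: bool
    provider_ha_documented_and_scratch_verified: bool
    single_bumble_rwo_volume: bool
    depends_on_cache_object_store_path: bool
    recovery_behavior_defined: bool
    locking_behavior_defined: bool
    maintenance_proof_mutates_protected_stack: bool
    only_rbe_adjacent_signal: bool


def decide(e: CandidateEvidence) -> tuple[str, list[str]]:
    """Return ("GO", []) or ("STOP", reasons) following the two lists."""
    reasons = []
    if e.single_bumble_rwo_volume:
        reasons.append("state data is on a single bumble-scoped RWO volume")
    if e.depends_on_cache_object_store_path:
        reasons.append("proof depends on the Attic/Bazel cache object-store path")
    if not e.recovery_behavior_defined:
        reasons.append("versioning, retention, backup, or restore behavior is absent")
    if not e.locking_behavior_defined:
        reasons.append("state locking behavior is undefined")
    if e.maintenance_proof_mutates_protected_stack:
        reasons.append("node-maintenance proof requires mutating a protected stack")
    if e.only_rbe_adjacent_signal:
        reasons.append("proof signal is only a cache or RBE-adjacent observation")

    # Either Go condition is sufficient on its own.
    has_ha_proof = (
        e.independent_ready_endpoints >= 2 and e.durable_through_one_endpoint_loss
    ) or e.provider_ha_documented_and_scratch_verified
    if not has_ha_proof:
        reasons.append("no accepted HA proof")

    return ("STOP", reasons) if reasons else ("GO", [])
```

Collecting every reason, rather than returning on the first failure, keeps the output useful as evidence for the next contract revision.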
Proof Order
1. Candidate static contract.
2. Non-state scratch S3 proof.
3. Disposable OpenTofu backend proof.
4. Protected stack migration plan.
5. One-stack-at-a-time migration.
6. Strict just tofu-state-ha-readiness without --expect-interim.
TIN-1013 owns steps 2 and 3 after this contract lands.
The repo-managed proof entrypoint is:
just ha-state-candidate-proof \
--endpoint-package <endpoint-package.json> \
--run-disposable-tofu
The repo-managed static contract gate is:
just ha-state-candidate-static-gate --contract <candidate-contract.json>
It validates that a written candidate contract names the S3 endpoint shape,
audience, credential source, rotation owner, read/write scope, recovery
behavior, OpenTofu locking behavior, endpoint and node failure behavior,
observability, bucket-index divergence response, authority separation, proof
plan, and protected-state migration sequencing. It rejects contracts that try
to promote the current attic-rustfs-openebs singleton, Sting local-path
storage, Attic/Bazel cache surfaces, or the active tofu-state bucket as the
final HA state authority.
The selected non-secret TIN-1016 proof-target artifact is:
docs/contracts/ha-opentofu-state-managed-s3-candidate.json
Validate it with:
just ha-state-selected-candidate-static-gate
That artifact chooses a managed or appliance S3-compatible OpenTofu state
service as the next proof target. It is deliberately not a claim that a live
endpoint already exists and not permission to migrate protected tofu-state
keys. TIN-1017 owns the later scratch and disposable OpenTofu proof once a real
endpoint and state-only credentials exist.
TIN-1026 must provide a non-secret endpoint package before that proof runs. Validate the package with:
just ha-state-endpoint-package-gate --package <endpoint-package.json>
That gate requires a concrete HTTPS endpoint, region, operator/proof-runner
audience, state-only credential source, TOFU_HA_STATE_* injection variables,
scratch bucket policy, protected state denials, recovery behavior, maintenance
proof method, OpenTofu S3 lockfile proof behavior, observability, and authority
separation. Its proof commands must include
--run-disposable-tofu --use-lockfile; endpoint readiness cannot skip state
locking. It rejects the current RustFS singleton, in-cluster cache/state
service endpoints, active tofu-state bucket, inline secret fields, and
cache/RBE authority claims.
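Three of the package-gate rules described above (HTTPS endpoint, both proof flags present, no inline secrets) are easy to illustrate. This is a sketch under stated assumptions, not the gate itself: the keys endpoint and proof_commands and the set of secret-looking key names are all hypothetical.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Flags named in this contract; they must appear together in a proof command.
REQUIRED_PROOF_FLAGS = ("--run-disposable-tofu", "--use-lockfile")
# Hypothetical key names that would suggest a secret was pasted inline.
SECRET_LIKE_KEYS = {"access_key", "secret_key", "secret_access_key", "token", "password"}


def find_inline_secret_keys(node, path: str = "") -> list[str]:
    """Recursively list paths whose key name looks like an inline secret."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if str(key).lower() in SECRET_LIKE_KEYS:
                found.append(here)
            found.extend(find_inline_secret_keys(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_inline_secret_keys(item, f"{path}[{index}]"))
    return found


def package_problems(package: dict) -> list[str]:
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []
    url = urlparse(str(package.get("endpoint", "")))
    if url.scheme != "https" or not url.hostname:
        problems.append("endpoint must be a concrete HTTPS URL")

    commands = package.get("proof_commands", [])
    has_both = any(
        all(flag in command.split() for flag in REQUIRED_PROOF_FLAGS)
        for command in commands
    )
    if not has_both:
        problems.append("a proof command must include " + " ".join(REQUIRED_PROOF_FLAGS))

    for where in find_inline_secret_keys(package):
        problems.append(f"inline secret field not allowed: {where}")
    return problems
```

The remaining gate requirements (audience, denials, recovery behavior, observability) would follow the same pattern of explicit, named checks.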
The live proof harness consumes this package with --endpoint-package and
refuses endpoint, region, or scratch bucket values that do not match it before
it performs any S3 or OpenTofu operation.
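The refusal described above amounts to comparing the requested coordinates with the package before doing anything else. A minimal sketch, assuming hypothetical key names for the three compared values:

```python
from __future__ import annotations

# Hypothetical key names for the values the harness compares.
MATCHED_FIELDS = ("endpoint", "region", "scratch_bucket")


def mismatched_fields(package: dict, requested: dict) -> list[str]:
    """Return the names of requested values that differ from the package."""
    return [
        name for name in MATCHED_FIELDS
        if requested.get(name) != package.get(name)
    ]


def require_package_match(package: dict, requested: dict) -> None:
    """Raise before any S3 or OpenTofu call if the request strays from the package."""
    bad = mismatched_fields(package, requested)
    if bad:
        raise ValueError("values do not match endpoint package: " + ", ".join(bad))
```

Running this check first means a mistyped endpoint fails locally instead of producing proof evidence against the wrong service.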
The repo-managed live inventory entrypoint is:
just ha-state-candidate-inventory
Use it before writing a candidate contract. It classifies known live object-store
and storage surfaces, including the current RustFS state path, staging MinIO,
TCFS/SeaweedFS, Sting local-path storage, and Longhorn. It is intentionally
read-only and may report NO_LIVE_HA_STATE_CANDIDATE; that is valid evidence,
not a command failure, unless --fail-without-candidate is set.
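The reporting rule above, where an empty inventory is valid evidence unless the caller opts into failure, can be sketched as follows. The exit codes and the status label used when a candidate is found are assumptions for illustration; only NO_LIVE_HA_STATE_CANDIDATE and --fail-without-candidate come from this document.

```python
from __future__ import annotations


def inventory_result(
    ha_candidates: list[str], fail_without_candidate: bool = False
) -> tuple[str, int]:
    """Return (status, exit_code) for a read-only inventory run."""
    if ha_candidates:
        # Hypothetical label; the real tool's success status is not documented here.
        return ("LIVE_HA_STATE_CANDIDATE", 0)
    # No candidate is still a successful evidence capture by default.
    return ("NO_LIVE_HA_STATE_CANDIDATE", 1 if fail_without_candidate else 0)
```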
The harness is intentionally scratch-only. It creates or verifies a non-state
scratch bucket, keeps all object keys under
.gloriousflywheel/ha-state-candidate/, refuses the active tofu-state bucket,
and refuses the protected attic, arc-runners, gitlab-runners, and
runner-dashboard state keys. The disposable OpenTofu proof must be run with
--use-lockfile; the harness fails early if the repo-managed OpenTofu binary is
older than the native S3 lockfile-capable version required by this contract.
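The scratch-only rules above reduce to three refusals that can run before any write. The sketch below uses the bucket name, key prefix, and protected keys stated in this document; the function name and error messages are illustrative and are not taken from the harness.

```python
SCRATCH_PREFIX = ".gloriousflywheel/ha-state-candidate/"
ACTIVE_STATE_BUCKET = "tofu-state"
PROTECTED_STATE_KEYS = frozenset({
    "attic/terraform.tfstate",
    "arc-runners/terraform.tfstate",
    "tinyland-infra/gitlab-runners/terraform.tfstate",
    "tinyland-infra/runner-dashboard/terraform.tfstate",
})


def check_scratch_target(bucket: str, key: str) -> None:
    """Raise ValueError unless (bucket, key) is a permitted scratch location."""
    if bucket == ACTIVE_STATE_BUCKET:
        raise ValueError("refusing the active tofu-state bucket")
    if key in PROTECTED_STATE_KEYS:
        raise ValueError(f"refusing protected state key: {key}")
    if not key.startswith(SCRATCH_PREFIX):
        raise ValueError(f"object key must live under {SCRATCH_PREFIX}")
```

The protected-key check is redundant with the prefix check, but keeping it explicit makes the refusal reason unambiguous in captured evidence.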
For restart or node-maintenance evidence, run it with
--endpoint-package <endpoint-package.json> --keep-scratch-bucket --checkpoint-file <path>
before the event and with
--endpoint-package <endpoint-package.json> --verify-existing --from-checkpoint <path>
after the
event. The checkpoint records the endpoint package digest, so verification
refuses a different endpoint package file even if the endpoint and scratch
bucket coordinates match. When the proof is captured, clean the retained
scratch object and bucket with
--cleanup-checkpoint --delete-scratch-bucket --from-checkpoint <path>.
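The digest binding described above can be illustrated with a file hash stored in the checkpoint. This is a sketch under assumptions: SHA-256 over the raw file bytes and the checkpoint key names are illustrative choices, since the harness's actual checkpoint format is not shown here.

```python
import hashlib
import json
from pathlib import Path


def package_digest(package_path) -> str:
    """Hash the endpoint package file byte-for-byte (SHA-256 is an assumption)."""
    return hashlib.sha256(Path(package_path).read_bytes()).hexdigest()


def write_checkpoint(checkpoint_path, package_path, endpoint, scratch_bucket) -> None:
    """Record the package digest alongside the coordinates used before the event."""
    record = {
        "endpoint_package_sha256": package_digest(package_path),  # hypothetical key
        "endpoint": endpoint,
        "scratch_bucket": scratch_bucket,
    }
    Path(checkpoint_path).write_text(json.dumps(record, indent=2))


def verify_checkpoint(checkpoint_path, package_path) -> dict:
    """Refuse verification if the package file differs from the recorded one."""
    record = json.loads(Path(checkpoint_path).read_text())
    if record["endpoint_package_sha256"] != package_digest(package_path):
        raise ValueError("endpoint package differs from the one recorded at checkpoint time")
    return record
```

Because the hash covers the file bytes, a package that names the same endpoint and bucket but differs in any other byte is still refused, which matches the behavior described above.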
Source Notes
- OpenTofu’s S3 backend stores state at a bucket/key and recommends bucket versioning for recovery: https://opentofu.org/docs/v1.9/language/settings/backends/s3/.
- OpenTofu 1.10 introduced native S3 lockfile support; the candidate proof must verify the exact OpenTofu version and locking mode used by this repo before relying on it: https://opentofu.org/docs/v1.10/intro/whats-new/.
- MinIO/AIStor erasure-coded deployments have explicit read/write quorum behavior; a cluster-local proof must use that topology intentionally, not assume that one PVC or one node is enough: https://docs.min.io/aistor/operations/core-concepts/erasure-coding/.
- RustFS troubleshooting docs describe rc admin heal-style repair, but the currently deployed image lacks rc and rustfs-admin, so current live repair authority is not proved: https://docs.rustfs.com/troubleshooting/healing.html.
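The version requirement in the OpenTofu 1.10 source note can be checked by parsing the binary's reported version. The sketch below assumes the first line of tofu version output has the form "OpenTofu v1.10.3" and takes 1.10.0 as the minimum; both should be confirmed against the repo-managed binary rather than taken from this example.

```python
import re

# Assumed minimum, based on the note that 1.10 introduced native S3 lockfile support.
MIN_LOCKFILE_VERSION = (1, 10, 0)


def parse_tofu_version(output: str) -> tuple:
    """Extract (major, minor, patch) from text such as 'OpenTofu v1.10.3'."""
    match = re.search(r"OpenTofu v(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("could not find an OpenTofu version in the output")
    return tuple(int(part) for part in match.groups())


def supports_native_s3_lockfile(output: str) -> bool:
    """True when the reported version is at least the assumed minimum.

    Pre-release suffixes (for example '-rc1') are ignored by this sketch.
    """
    return parse_tofu_version(output) >= MIN_LOCKFILE_VERSION
```

Comparing integer tuples avoids the string-comparison trap in which "1.9" sorts after "1.10".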
Boundary
This is OpenTofu state authority work.
It is not Bazel remote execution. It is not BCR publication. It is not proof of CAS/action-cache durability. Those lanes can proceed only through their own authority contracts and default-branch proofs.