GloriousFlywheel

GloriousFlywheel HA State Authority Candidate Contract 2026-05-07

TIN-1014 contract for the next TIN-1012 implementation proof.

Decision

The next proof target is a dedicated HA S3-compatible OpenTofu state authority.

This contract does not select the current attic-rustfs-openebs singleton as the final backend. The current RustFS service can remain the guarded interim authority while it is healthy, but it does not satisfy the strict state-authority gate.

The candidate must be judged as OpenTofu state authority first. It must not be coupled to Attic cache availability, Bazel remote-cache availability, BCR input mirrors, or future RBE CAS/action-cache authority.

Current Live Evidence

Live read on 2026-05-07:

- `just tofu-state-ha-readiness --expect-interim` passes as evidence capture, but still returns INTERIM_ONLY.
- `just rustfs-bucket-index-rca --scratch-probe --since 30m` passed against the current RustFS service.
- scratch bucket gf-rca-20260507005853-918208ce was created, listed, written, read, deleted, and confirmed absent after cleanup.
- scratch disk markers appeared under both `/data/<bucket>` and `/data/.rustfs.sys/buckets/<bucket>`, then disappeared after bucket cleanup.
- required tofu-state objects are reachable:
  - attic/terraform.tfstate
  - arc-runners/terraform.tfstate
  - tinyland-infra/gitlab-runners/terraform.tfstate
  - tinyland-infra/runner-dashboard/terraform.tfstate
- live RustFS image: ghcr.io/tinyland-inc/rustfs@sha256:3c2d55977829620284ece8593901bf776bcfc0fc9972784352de4dcffdb92416.
- live RustFS binary reports rustfs v1.0.0-beta.1.
- the live container does not include `rc` or `rustfs-admin`, even though the public RustFS troubleshooting docs describe `rc admin heal`-style repair.

This is a coherent current backend. It is not an HA backend.

Candidate Requirements

The selected state backend must have a written contract before any active stack state is migrated.

Required contract fields:

- S3 endpoint shape and audience.
- credential source, rotation owner, and read/write scope.
- bucket versioning, retention, backup, or restore behavior.
- state locking behavior for OpenTofu.
- failure behavior when one service pod is unavailable.
- failure behavior when one node or storage failure domain is unavailable.
- observability for S3 API, bucket metadata, object availability, and state lock failures.
- non-destructive response for bucket/API/index divergence, or evidence that the selected backend removes the current RustFS bucket-index failure class.
- explicit separation from Attic cache, Bazel cache, BCR/Bzlmod mirrors, and RBE CAS/action-cache authority.
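
The field list above lends itself to a mechanical presence check. The following is a minimal sketch, not the repo's gate: the key names in `REQUIRED_FIELDS` are hypothetical stand-ins for the nine contract fields, and the schema the repo actually enforces may differ.

```python
# Hypothetical key names; the real contract schema may differ.
REQUIRED_FIELDS = (
    "s3_endpoint",           # endpoint shape and audience
    "credentials",           # source, rotation owner, read/write scope
    "recovery",              # versioning, retention, backup, or restore
    "state_locking",         # OpenTofu locking behavior
    "pod_failure",           # one service pod unavailable
    "node_failure",          # one node or storage failure domain unavailable
    "observability",         # S3 API, bucket metadata, objects, lock failures
    "index_divergence",      # non-destructive response or removal evidence
    "authority_separation",  # not Attic, Bazel, BCR/Bzlmod, or RBE
)


def missing_contract_fields(contract: dict) -> list:
    """Return required fields that are absent or empty, in list order."""
    return [name for name in REQUIRED_FIELDS if not contract.get(name)]
```

An empty result only shows that every field is present; it says nothing about whether the written behavior is adequate, which remains a review decision.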

Candidate Classes

| candidate class | posture | reason |
| --- | --- | --- |
| dedicated HA S3-compatible state service | primary proof target | separates OpenTofu mutation authority from cache object-store incidents |
| managed or appliance S3-compatible state bucket with versioning | acceptable candidate | removes cluster-local single-node storage as the state authority if credentials, locking, recovery, and tailnet/operator access are proved |
| cluster-local MinIO or AIStor-style distributed object store | conditional candidate | viable only with an explicit multi-node/multi-drive quorum design and separate state-only bucket policy |
| replicated RustFS | conditional candidate | viable only if multi-node topology and admin/heal tooling are present in the deployed image and proved under failure |
| current `attic-rustfs-openebs` singleton | rejected final | one pod, one endpoint, one OpenEBS ZFS node, one bumble-scoped RWO PVC |
| Sting local-path storage | rejected final | fast scratch/cache/recoverable capacity is not replicated state authority |
| Attic or Bazel cache object-store surface | rejected final | cache availability is not OpenTofu state authority |
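
The postures above can be encoded as a lookup that fails closed. This is an illustrative sketch only: the short class identifiers are hypothetical, and treating an unlisted class as rejected is an assumption, not something this contract states.

```python
from enum import Enum


class Posture(Enum):
    PRIMARY = "primary proof target"
    ACCEPTABLE = "acceptable candidate"
    CONDITIONAL = "conditional candidate"
    REJECTED = "rejected final"


# Hypothetical short identifiers for the candidate classes listed above.
CANDIDATE_POSTURE = {
    "dedicated-ha-s3-state-service": Posture.PRIMARY,
    "managed-or-appliance-s3-bucket": Posture.ACCEPTABLE,
    "cluster-local-minio-or-aistor": Posture.CONDITIONAL,
    "replicated-rustfs": Posture.CONDITIONAL,
    "attic-rustfs-openebs-singleton": Posture.REJECTED,
    "sting-local-path": Posture.REJECTED,
    "attic-or-bazel-cache-surface": Posture.REJECTED,
}


def is_rejected_final(candidate_class: str) -> bool:
    """Fail closed: a class missing from the mapping counts as rejected."""
    posture = CANDIDATE_POSTURE.get(candidate_class, Posture.REJECTED)
    return posture is Posture.REJECTED
```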

Stop/Go Decision

Go only if the candidate can prove one of these:

- at least two independent ready service endpoints and state durability through one endpoint loss, or
- equivalent managed/backend HA semantics documented by the provider and verified with a scratch proof.

Stop if any of these remain true:

- state data is on a single bumble-scoped RWO volume.
- the proof depends on the current Attic/Bazel cache object-store path.
- bucket versioning, retention, backup, or restore behavior is absent.
- state locking behavior is undefined.
- node-maintenance proof requires mutating a protected stack.
- the proof signal is only cache hit rate, ARC runner scheduling, or any other RBE-adjacent observation.
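
One way to read the two lists together is that a candidate is Go only when it proves at least one Go condition and trips no Stop condition. The sketch below encodes that reading; the `CandidateFacts` field names are hypothetical, and combining the lists this way is an interpretation of the text above rather than the behavior of any repo command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateFacts:
    # Hypothetical field names summarizing the evidence for one candidate.
    ready_endpoints: int
    durable_through_endpoint_loss: bool
    managed_ha_verified_by_scratch_proof: bool
    single_rwo_volume: bool
    depends_on_cache_path: bool
    recovery_behavior_defined: bool
    locking_behavior_defined: bool
    maintenance_mutates_protected_stack: bool
    only_cache_or_rbe_signal: bool


def stop_reasons(facts: CandidateFacts) -> list:
    """Return every Stop condition that still holds."""
    checks = [
        (facts.single_rwo_volume, "state data is on a single RWO volume"),
        (facts.depends_on_cache_path, "proof depends on the cache object-store path"),
        (not facts.recovery_behavior_defined, "recovery behavior is absent"),
        (not facts.locking_behavior_defined, "state locking behavior is undefined"),
        (facts.maintenance_mutates_protected_stack, "maintenance proof mutates a protected stack"),
        (facts.only_cache_or_rbe_signal, "proof signal is only cache or RBE-adjacent"),
    ]
    return [reason for holds, reason in checks if holds]


def decide(facts: CandidateFacts) -> str:
    """GO needs one proven HA condition and no remaining Stop condition."""
    proves_ha = (
        facts.ready_endpoints >= 2 and facts.durable_through_endpoint_loss
    ) or facts.managed_ha_verified_by_scratch_proof
    return "GO" if proves_ha and not stop_reasons(facts) else "STOP"
```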

Proof Order

1. Candidate static contract.
2. Non-state scratch S3 proof.
3. Disposable OpenTofu backend proof.
4. Protected stack migration plan.
5. One-stack-at-a-time migration.
6. Strict `just tofu-state-ha-readiness` without `--expect-interim`.

TIN-1013 owns steps 2 and 3 after this contract lands.
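
For the disposable OpenTofu backend proof, a throwaway backend configuration could be generated in OpenTofu's JSON configuration syntax. This is a sketch under assumptions: the function name, the `disposable/terraform.tfstate` key suffix, and the placeholder values are hypothetical, and the repo-managed harness may build its backend differently. It sets `use_lockfile` on the S3 backend and keeps the state key under the scratch prefix the harness uses. Credentials are deliberately left out so that nothing secret is written to disk.

```python
import json

# Scratch prefix the harness keeps all object keys under.
SCRATCH_PREFIX = ".gloriousflywheel/ha-state-candidate/"


def disposable_backend_config(endpoint: str, region: str, scratch_bucket: str) -> dict:
    """Build a backend.tf.json-style document for a disposable S3 backend."""
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": scratch_bucket,
                    # Hypothetical key name; it only has to stay under the prefix.
                    "key": SCRATCH_PREFIX + "disposable/terraform.tfstate",
                    "region": region,
                    "endpoints": {"s3": endpoint},
                    # Native S3 lockfile locking, available from OpenTofu 1.10.
                    "use_lockfile": True,
                }
            }
        }
    }


if __name__ == "__main__":
    # Placeholder coordinates; real values come from the endpoint package.
    config = disposable_backend_config(
        "https://state.example.invalid", "us-east-1", "gf-ha-scratch"
    )
    print(json.dumps(config, indent=2))
```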

The repo-managed proof entrypoint is:

just ha-state-candidate-proof \
  --endpoint-package <endpoint-package.json> \
  --run-disposable-tofu

The repo-managed static contract gate is:

just ha-state-candidate-static-gate --contract <candidate-contract.json>

It validates that a written candidate contract names the S3 endpoint shape, audience, credential source, rotation owner, read/write scope, recovery behavior, OpenTofu locking behavior, endpoint and node failure behavior, observability, bucket-index divergence response, authority separation, proof plan, and protected-state migration sequencing. It rejects contracts that try to promote the current attic-rustfs-openebs singleton, Sting local-path storage, Attic/Bazel cache surfaces, or the active tofu-state bucket as the final HA state authority.

The selected non-secret TIN-1016 proof-target artifact is:

docs/contracts/ha-opentofu-state-managed-s3-candidate.json

Validate it with:

just ha-state-selected-candidate-static-gate

That artifact chooses a managed or appliance S3-compatible OpenTofu state service as the next proof target. It is deliberately neither a claim that a live endpoint already exists nor permission to migrate protected tofu-state keys. TIN-1017 owns the later scratch and disposable OpenTofu proof once a real endpoint and state-only credentials exist.

TIN-1026 must provide a non-secret endpoint package before that proof runs. Validate the package with:

just ha-state-endpoint-package-gate --package <endpoint-package.json>

That gate requires a concrete HTTPS endpoint, region, operator/proof-runner audience, state-only credential source, TOFU_HA_STATE_* injection variables, scratch bucket policy, protected state denials, recovery behavior, maintenance proof method, OpenTofu S3 lockfile proof behavior, observability, and authority separation. Its proof commands must include --run-disposable-tofu --use-lockfile; endpoint readiness cannot skip state locking. It rejects the current RustFS singleton, in-cluster cache/state service endpoints, active tofu-state bucket, inline secret fields, and cache/RBE authority claims. The live proof harness consumes this package with --endpoint-package and refuses endpoint, region, or scratch bucket values that do not match it before it performs any S3 or OpenTofu operation.
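
Three of the checks described above are simple enough to sketch: the HTTPS requirement, the requirement that proof commands carry both `--run-disposable-tofu` and `--use-lockfile`, and the harness refusing coordinates that differ from the package. The function names and the `endpoint`, `region`, and `scratch_bucket` dictionary keys are hypothetical; the real package schema is not shown in this document.

```python
import shlex
from urllib.parse import urlsplit

REQUIRED_PROOF_FLAGS = {"--run-disposable-tofu", "--use-lockfile"}


def is_https_endpoint(endpoint: str) -> bool:
    """True only for an https:// URL that names a host."""
    parts = urlsplit(endpoint)
    return parts.scheme == "https" and bool(parts.hostname)


def proof_command_has_locking(command: str) -> bool:
    """True if the command line carries both required proof flags."""
    return REQUIRED_PROOF_FLAGS.issubset(shlex.split(command))


def mismatched_coordinates(package: dict, requested: dict) -> list:
    """List the coordinates where the request differs from the package."""
    keys = ("endpoint", "region", "scratch_bucket")  # hypothetical key names
    return [key for key in keys if package.get(key) != requested.get(key)]
```

A harness following this shape would abort when `mismatched_coordinates` returns anything, before it issues a single S3 or OpenTofu call.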

The repo-managed live inventory entrypoint is:

just ha-state-candidate-inventory

Use it before writing a candidate contract. It classifies known live object-store and storage surfaces, including the current RustFS state path, staging MinIO, TCFS/SeaweedFS, Sting local-path storage, and Longhorn. It is intentionally read-only and may report NO_LIVE_HA_STATE_CANDIDATE; that is valid evidence, not a command failure, unless --fail-without-candidate is set.

The harness is intentionally scratch-only. It creates or verifies a non-state scratch bucket, keeps all object keys under .gloriousflywheel/ha-state-candidate/, refuses the active tofu-state bucket, and refuses the protected attic, arc-runners, gitlab-runners, and runner-dashboard state keys. The disposable OpenTofu proof must be run with --use-lockfile; the harness fails early if the repo-managed OpenTofu binary is older than the native S3 lockfile-capable version required by this contract. For restart or node-maintenance evidence, run it with --endpoint-package <endpoint-package.json> --keep-scratch-bucket --checkpoint-file <path> before the event and --endpoint-package <endpoint-package.json> --verify-existing --from-checkpoint <path> after the event. The checkpoint records the endpoint package digest, so verification refuses a different endpoint package file even if the endpoint and scratch bucket coordinates match. When the proof is captured, clean the retained scratch object and bucket with --cleanup-checkpoint --delete-scratch-bucket --from-checkpoint <path>.
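
The scratch-only guard and the checkpoint digest check can be sketched as follows. Several details are assumptions: that the active bucket is literally named `tofu-state`, that the digest is SHA-256 over the package file bytes, and that the checkpoint stores it under a field called `endpoint_package_digest`. The protected keys are the four state objects listed under Current Live Evidence.

```python
import hashlib
from typing import Optional

SCRATCH_PREFIX = ".gloriousflywheel/ha-state-candidate/"
ACTIVE_STATE_BUCKET = "tofu-state"  # assumed bucket name
PROTECTED_STATE_KEYS = frozenset({
    "attic/terraform.tfstate",
    "arc-runners/terraform.tfstate",
    "tinyland-infra/gitlab-runners/terraform.tfstate",
    "tinyland-infra/runner-dashboard/terraform.tfstate",
})


def refusal_reason(bucket: str, key: str) -> Optional[str]:
    """Return why a bucket/key pair must be refused, or None if allowed."""
    if bucket == ACTIVE_STATE_BUCKET:
        return "active tofu-state bucket"
    if key in PROTECTED_STATE_KEYS:
        return "protected state key"
    if not key.startswith(SCRATCH_PREFIX):
        return "key outside scratch prefix"
    return None


def package_digest(package_bytes: bytes) -> str:
    """Digest of the endpoint package file (SHA-256 is an assumption)."""
    return hashlib.sha256(package_bytes).hexdigest()


def checkpoint_matches(checkpoint: dict, package_bytes: bytes) -> bool:
    """True only if the checkpoint was written for this exact package file."""
    return checkpoint.get("endpoint_package_digest") == package_digest(package_bytes)
```

Comparing a digest of the whole file, rather than individual coordinates, is what lets verification reject a different package file even when its endpoint and scratch bucket are identical.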

Source Notes

- OpenTofu’s S3 backend stores state at a bucket/key and recommends bucket versioning for recovery: https://opentofu.org/docs/v1.9/language/settings/backends/s3/.
- OpenTofu 1.10 introduced native S3 lockfile support; the candidate proof must verify the exact OpenTofu version and locking mode used by this repo before relying on it: https://opentofu.org/docs/v1.10/intro/whats-new/.
- MinIO/AIStor erasure-coded deployments have explicit read/write quorum behavior; a cluster-local proof must use that topology intentionally, not assume that one PVC or one node is enough: https://docs.min.io/aistor/operations/core-concepts/erasure-coding/.
- RustFS troubleshooting docs describe `rc admin heal`-style repair, but the currently deployed image lacks `rc` and `rustfs-admin`, so current live repair authority is not proved: https://docs.rustfs.com/troubleshooting/healing.html.
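
Because native S3 lockfile support arrived in OpenTofu 1.10, a version gate is a natural early check. The sketch below assumes the version text looks like `OpenTofu v1.10.0`, which is an assumption about the binary's output format; it also ignores pre-release suffixes, so a `1.10.0-rc1` build would be treated as capable. The function names are hypothetical.

```python
import re

# First OpenTofu release with native S3 lockfile support, per the notes above.
MIN_LOCKFILE_VERSION = (1, 10, 0)


def parse_tofu_version(text: str) -> tuple:
    """Extract (major, minor, patch) from the first version-like token."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version found in: " + repr(text))
    return tuple(int(part) for part in match.groups())


def supports_native_s3_lockfile(version_output: str) -> bool:
    """True if the reported version is at least MIN_LOCKFILE_VERSION."""
    return parse_tofu_version(version_output) >= MIN_LOCKFILE_VERSION
```

Tuple comparison avoids the string-ordering trap where `"1.9"` sorts after `"1.10"`.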

Boundary

This is OpenTofu state authority work.

It is not Bazel remote execution. It is not BCR publication. It is not proof of CAS/action-cache durability. Those lanes can proceed only through their own authority contracts and default-branch proofs.