GloriousFlywheel HA State Authority Candidate Contract 2026-05-07

TIN-1014 contract for the next TIN-1012 implementation proof.

Decision

The next proof target is a dedicated HA S3-compatible OpenTofu state authority.

This contract does not select the current attic-rustfs-openebs singleton as the final backend. Current RustFS can remain guarded interim authority while it is healthy, but it does not satisfy the strict state-authority gate.

The candidate must be judged as OpenTofu state authority first. It must not be coupled to Attic cache availability, Bazel remote-cache availability, BCR input mirrors, or future RBE CAS/action-cache authority.

Current Live Evidence

Live read on 2026-05-07:

  • just tofu-state-ha-readiness --expect-interim passes as evidence capture, but still returns INTERIM_ONLY.
  • just rustfs-bucket-index-rca --scratch-probe --since 30m passed against the current RustFS service.
  • scratch bucket gf-rca-20260507005853-918208ce was created, listed, written, read, deleted, and confirmed absent after cleanup.
  • scratch disk markers appeared under both /data/<bucket> and /data/.rustfs.sys/buckets/<bucket>, then disappeared after bucket cleanup.
  • required tofu-state objects are reachable:
    • attic/terraform.tfstate
    • arc-runners/terraform.tfstate
    • tinyland-infra/gitlab-runners/terraform.tfstate
    • tinyland-infra/runner-dashboard/terraform.tfstate
  • live RustFS image: ghcr.io/tinyland-inc/rustfs@sha256:3c2d55977829620284ece8593901bf776bcfc0fc9972784352de4dcffdb92416.
  • live RustFS binary reports rustfs v1.0.0-beta.1.
  • the live container includes neither rc nor rustfs-admin, even though the public RustFS troubleshooting docs describe repair through "rc admin heal"-style commands.

This is a coherent current backend. It is not an HA backend.

Candidate Requirements

The selected state backend must have a written contract before any active stack state is migrated.

Required contract fields:

  • S3 endpoint shape and audience.
  • credential source, rotation owner, and read/write scope.
  • bucket versioning, retention, backup, or restore behavior.
  • state locking behavior for OpenTofu.
  • failure behavior when one service pod is unavailable.
  • failure behavior when one node or storage failure domain is unavailable.
  • observability for S3 API, bucket metadata, object availability, and state lock failures.
  • non-destructive response for bucket/API/index divergence, or evidence that the selected backend removes the current RustFS bucket-index failure class.
  • explicit separation from Attic cache, Bazel cache, BCR/Bzlmod mirrors, and RBE CAS/action-cache authority.
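A minimal sketch of how the required fields above could be checked against a candidate contract file. The JSON key names are hypothetical; this contract names the topics, not the keys, and the repo-managed gate may use a different schema.

```python
import json

# Hypothetical key names: the contract lists required topics, not JSON keys.
REQUIRED_FIELDS = (
    "s3_endpoint_shape",
    "audience",
    "credential_source",
    "rotation_owner",
    "read_write_scope",
    "recovery_behavior",
    "opentofu_locking",
    "pod_failure_behavior",
    "node_failure_behavior",
    "observability",
    "bucket_index_divergence_response",
    "authority_separation",
)


def missing_contract_fields(contract: dict) -> list[str]:
    """Return required fields that are absent or empty, in declaration order."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = contract.get(field)
        # Treat None and empty strings/lists/dicts as "not written down".
        if value is None or (isinstance(value, (str, list, dict)) and not value):
            missing.append(field)
    return missing


if __name__ == "__main__":
    draft = json.loads('{"s3_endpoint_shape": "path-style HTTPS", "audience": "operators"}')
    print(missing_contract_fields(draft))
```

A draft with any missing field should not proceed to a scratch proof; the written contract comes first.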

Candidate Classes

| candidate class | posture | reason |
| --- | --- | --- |
| dedicated HA S3-compatible state service | primary proof target | separates OpenTofu mutation authority from cache object-store incidents |
| managed or appliance S3-compatible state bucket with versioning | acceptable candidate | removes cluster-local single-node storage as the state authority if credentials, locking, recovery, and tailnet/operator access are proved |
| cluster-local MinIO or AIStor-style distributed object store | conditional candidate | viable only with an explicit multi-node/multi-drive quorum design and separate state-only bucket policy |
| replicated RustFS | conditional candidate | viable only if multi-node topology and admin/heal tooling are present in the deployed image and proved under failure |
| current attic-rustfs-openebs singleton | rejected final | one pod, one endpoint, one OpenEBS ZFS node, one bumble-scoped RWO PVC |
| Sting local-path storage | rejected final | fast scratch/cache/recoverable capacity is not replicated state authority |
| Attic or Bazel cache object-store surface | rejected final | cache availability is not OpenTofu state authority |

Stop/Go Decision

Go only if the candidate can prove one of these:

  • at least two independent ready service endpoints and state durability through one endpoint loss, or
  • equivalent managed/backend HA semantics documented by the provider and verified with a scratch proof.

Stop if any of these remain true:

  • state data is on a single bumble-scoped RWO volume.
  • the proof depends on the current Attic/Bazel cache object-store path.
  • bucket versioning, retention, backup, or restore behavior is absent.
  • state locking behavior is undefined.
  • node-maintenance proof requires mutating a protected stack.
  • the proof signal is only cache hit rate, ARC runner scheduling, or any other RBE-adjacent observation.
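The Stop/Go rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the repo's gate: every attribute name is hypothetical and simply mirrors one bullet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEvidence:
    # Hypothetical attribute names, one per Stop/Go bullet.
    ready_endpoints: int = 0
    survives_one_endpoint_loss: bool = False
    provider_ha_documented: bool = False
    provider_ha_scratch_verified: bool = False
    single_rwo_volume: bool = False
    depends_on_cache_path: bool = False
    recovery_behavior_defined: bool = False
    locking_defined: bool = False
    maintenance_mutates_protected_stack: bool = False
    only_rbe_adjacent_signals: bool = False


def stop_reasons(e: CandidateEvidence) -> list[str]:
    """Collect every Stop condition that still holds."""
    reasons = []
    if e.single_rwo_volume:
        reasons.append("state data is on a single RWO volume")
    if e.depends_on_cache_path:
        reasons.append("proof depends on the Attic/Bazel cache object-store path")
    if not e.recovery_behavior_defined:
        reasons.append("versioning, retention, backup, or restore behavior is absent")
    if not e.locking_defined:
        reasons.append("state locking behavior is undefined")
    if e.maintenance_mutates_protected_stack:
        reasons.append("node-maintenance proof mutates a protected stack")
    if e.only_rbe_adjacent_signals:
        reasons.append("proof signal is only cache or RBE-adjacent observation")
    return reasons


def decide(e: CandidateEvidence) -> str:
    """Return "GO" only with no Stop condition and one proved Go path."""
    if stop_reasons(e):
        return "STOP"
    endpoint_path = e.ready_endpoints >= 2 and e.survives_one_endpoint_loss
    managed_path = e.provider_ha_documented and e.provider_ha_scratch_verified
    return "GO" if (endpoint_path or managed_path) else "STOP"
```

Clearing every Stop condition is not enough on its own: without one of the two Go proofs the result stays STOP.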

Proof Order

  1. Candidate static contract.
  2. Non-state scratch S3 proof.
  3. Disposable OpenTofu backend proof.
  4. Protected stack migration plan.
  5. One-stack-at-a-time migration.
  6. Strict just tofu-state-ha-readiness without --expect-interim.

TIN-1013 owns steps 2 and 3 after this contract lands.

The repo-managed proof entrypoint is:

just ha-state-candidate-proof \
  --endpoint-package <endpoint-package.json> \
  --run-disposable-tofu \
  --use-lockfile

The repo-managed static contract gate is:

just ha-state-candidate-static-gate --contract <candidate-contract.json>

It validates that a written candidate contract names the S3 endpoint shape, audience, credential source, rotation owner, read/write scope, recovery behavior, OpenTofu locking behavior, endpoint and node failure behavior, observability, bucket-index divergence response, authority separation, proof plan, and protected-state migration sequencing. It rejects contracts that try to promote the current attic-rustfs-openebs singleton, Sting local-path storage, Attic/Bazel cache surfaces, or the active tofu-state bucket as the final HA state authority.
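The rejection side of the static gate could look roughly like this. Only attic-rustfs-openebs and the tofu-state bucket are names taken from this contract; the other identifiers and the "backend" and "bucket" keys are hypothetical.

```python
from typing import Optional

# Hypothetical identifiers, except attic-rustfs-openebs which this contract names.
REJECTED_BACKENDS = {
    "attic-rustfs-openebs": "current singleton: one pod, one endpoint, one RWO PVC",
    "sting-local-path": "local-path storage is not replicated state authority",
    "attic-cache": "cache availability is not OpenTofu state authority",
    "bazel-cache": "cache availability is not OpenTofu state authority",
}
# Assumption: the active state bucket is literally named "tofu-state".
ACTIVE_STATE_BUCKET = "tofu-state"


def rejection_reason(contract: dict) -> Optional[str]:
    """Return why a contract cannot be the final HA authority, or None."""
    backend = contract.get("backend")
    if backend in REJECTED_BACKENDS:
        return REJECTED_BACKENDS[backend]
    if contract.get("bucket") == ACTIVE_STATE_BUCKET:
        return "the active tofu-state bucket cannot be promoted as the HA authority"
    return None
```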

The selected non-secret TIN-1016 proof-target artifact is:

docs/contracts/ha-opentofu-state-managed-s3-candidate.json

Validate it with:

just ha-state-selected-candidate-static-gate

That artifact chooses a managed or appliance S3-compatible OpenTofu state service as the next proof target. It is deliberately not a claim that a live endpoint already exists and not permission to migrate protected tofu-state keys. TIN-1017 owns the later scratch and disposable OpenTofu proof once a real endpoint and state-only credentials exist.

TIN-1026 must provide a non-secret endpoint package before that proof runs. Validate the package with:

just ha-state-endpoint-package-gate --package <endpoint-package.json>

That gate requires a concrete HTTPS endpoint, region, operator/proof-runner audience, state-only credential source, TOFU_HA_STATE_* injection variables, scratch bucket policy, protected state denials, recovery behavior, maintenance proof method, OpenTofu S3 lockfile proof behavior, observability, and authority separation. Its proof commands must include --run-disposable-tofu --use-lockfile; endpoint readiness cannot skip state locking. It rejects the current RustFS singleton, in-cluster cache/state service endpoints, active tofu-state bucket, inline secret fields, and cache/RBE authority claims. The live proof harness consumes this package with --endpoint-package and refuses endpoint, region, or scratch bucket values that do not match it before it performs any S3 or OpenTofu operation.
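A partial sketch of the endpoint package checks described above, covering only the HTTPS endpoint, region, required proof flags, and inline-secret rejection. The package key names and the secret-marker list are assumptions, not the gate's actual schema, and only top-level keys are inspected.

```python
from urllib.parse import urlparse

# Flags named by this contract.
REQUIRED_PROOF_FLAGS = ("--run-disposable-tofu", "--use-lockfile")
# Hypothetical markers for key names that suggest an inline secret.
SECRET_LIKE_KEYS = ("secret", "password", "access_key", "token")


def package_problems(package: dict) -> list[str]:
    """Return human-readable problems; an empty list means these checks pass."""
    problems = []
    endpoint = urlparse(str(package.get("endpoint", "")))
    if endpoint.scheme != "https" or not endpoint.netloc:
        problems.append("endpoint must be a concrete HTTPS URL")
    if not package.get("region"):
        problems.append("region is required")
    commands = package.get("proof_commands", [])
    for flag in REQUIRED_PROOF_FLAGS:
        # Each required flag must appear as a whole token in some command.
        if not any(flag in cmd.split() for cmd in commands):
            problems.append(f"proof commands must include {flag}")
    for key in package:
        if any(marker in key.lower() for marker in SECRET_LIKE_KEYS):
            problems.append(f"inline secret field not allowed: {key}")
    return problems
```

Returning a list of problems rather than failing on the first one lets an operator fix a package in a single pass.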

The repo-managed live inventory entrypoint is:

just ha-state-candidate-inventory

Use it before writing a candidate contract. It classifies known live object-store and storage surfaces, including the current RustFS state path, staging MinIO, TCFS/SeaweedFS, Sting local-path storage, and Longhorn. It is intentionally read-only and may report NO_LIVE_HA_STATE_CANDIDATE; that is valid evidence, not a command failure, unless --fail-without-candidate is set.

The harness is intentionally scratch-only. It creates or verifies a non-state scratch bucket, keeps all object keys under .gloriousflywheel/ha-state-candidate/, refuses the active tofu-state bucket, and refuses the protected attic, arc-runners, gitlab-runners, and runner-dashboard state keys.

The disposable OpenTofu proof must be run with --use-lockfile; the harness fails early if the repo-managed OpenTofu binary is older than the native S3 lockfile-capable version required by this contract.

For restart or node-maintenance evidence, run it with --endpoint-package <endpoint-package.json> --keep-scratch-bucket --checkpoint-file <path> before the event and --endpoint-package <endpoint-package.json> --verify-existing --from-checkpoint <path> after the event. The checkpoint records the endpoint package digest, so verification refuses a different endpoint package file even if the endpoint and scratch bucket coordinates match. When the proof is captured, clean the retained scratch object and bucket with --cleanup-checkpoint --delete-scratch-bucket --from-checkpoint <path>.
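The scratch-only guard and the checkpoint digest check can be illustrated as follows. This is a sketch under assumptions: the bucket name "tofu-state", the use of SHA-256, and the checkpoint key endpoint_package_sha256 are not confirmed by this contract; the key prefix and protected state keys are taken from it.

```python
import hashlib

SCRATCH_PREFIX = ".gloriousflywheel/ha-state-candidate/"
# Assumption: the active state bucket is literally named "tofu-state".
ACTIVE_STATE_BUCKET = "tofu-state"
PROTECTED_KEYS = {
    "attic/terraform.tfstate",
    "arc-runners/terraform.tfstate",
    "tinyland-infra/gitlab-runners/terraform.tfstate",
    "tinyland-infra/runner-dashboard/terraform.tfstate",
}


def check_scratch_target(bucket: str, key: str) -> None:
    """Raise ValueError unless the target is a scratch object outside state."""
    if bucket == ACTIVE_STATE_BUCKET:
        raise ValueError("refusing the active tofu-state bucket")
    if key in PROTECTED_KEYS:
        raise ValueError(f"refusing protected state key: {key}")
    if not key.startswith(SCRATCH_PREFIX):
        raise ValueError(f"key must live under {SCRATCH_PREFIX}")


def package_digest(package_bytes: bytes) -> str:
    # SHA-256 is an assumption; the contract only says "digest".
    return hashlib.sha256(package_bytes).hexdigest()


def verify_checkpoint(checkpoint: dict, package_bytes: bytes) -> None:
    """Refuse verification when the endpoint package file has changed."""
    # "endpoint_package_sha256" is a hypothetical checkpoint key.
    if checkpoint.get("endpoint_package_sha256") != package_digest(package_bytes):
        raise ValueError("endpoint package digest does not match checkpoint")
```

Comparing the digest of the whole file, rather than individual coordinates, is what makes a byte-different package fail even when endpoint and bucket values agree.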

Source Notes

Boundary

This is OpenTofu state authority work.

It is not Bazel remote execution. It is not BCR publication. It is not proof of CAS/action-cache durability. Those lanes can proceed only through their own authority contracts and default-branch proofs.
