RustFS Trusted Publication Decision Runbook

Use this runbook for TIN-1147 when deciding how to restore trusted Attic publication after the RustFS bucket-index and NAR-body failures.

This is a decision and evidence runbook only. It does not authorize a live OpenTofu apply, RustFS restart, RustFS image upgrade, ARC apply, package runner flip, or trusted Attic write ramp. Each of those actions requires explicit operator maintenance-window approval, and quarantine must stay in place until TIN-1046 records the staged write ramp.

Current Stop State

Trusted Attic publication stays quarantined while rustfs-trusted-publication-backend-gate.json has decision_state: no_go_until_selected_path_proved.
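
The stop state can be confirmed by reading the field directly. The following is a minimal sketch, assuming the gate file is JSON with decision_state stored as a plain string and that it sits in the working directory. The function name gate_is_quarantined is hypothetical, and the repository may already provide an equivalent check.

```shell
# Hypothetical helper: exits 0 only while the gate file still records the
# quarantine state. Assumes "decision_state" is a plain string key in the JSON.
# Uses grep on the raw text so it has no dependency beyond POSIX tools.
gate_is_quarantined() {
  gate_file="${1:-rustfs-trusted-publication-backend-gate.json}"
  grep -Eq '"decision_state"[[:space:]]*:[[:space:]]*"no_go_until_selected_path_proved"' "$gate_file"
}
```

Calling gate_is_quarantined with no argument checks the default file name in the current directory; a non-zero exit means the field is missing or holds a different value, which should be investigated before anything else proceeds.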

Current post-PR #815 evidence:

  • tofu-state bucket markers exist on disk while list-buckets omits the bucket.
  • attic bucket markers exist on disk while list-buckets omits the bucket.
  • incident-shaped NAR metadata exists, but the NAR body fails with curl exit 18 and zero transferred bytes.
  • rustfs-bucket-ensure-* activity is visible only in namespace events after short-lived Job/Pod cleanup.
  • no live HA state candidate is selected.
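
The NAR body symptom in the list above (curl exit 18, zero bytes) can be reproduced by hand. The sketch below is an illustration only: the function names and verdict words are hypothetical, the URL must be supplied by the operator, and the real canary may collect this evidence differently. It relies on curl's documented exit code 18 (partial file) and the size_download write-out variable.

```shell
# Hypothetical helper: map a curl exit code and byte count to a coarse verdict.
# The verdict words are illustrative and are not the canary's own output.
classify_nar_fetch() {
  rc="$1"
  bytes="$2"
  if [ "$rc" -eq 0 ] && [ "$bytes" -gt 0 ]; then
    echo "clean"
  elif [ "$rc" -eq 18 ]; then
    echo "partial"   # curl exit 18: transfer ended before the full body arrived
  else
    echo "error"     # any other failure, including an empty body with exit 0
  fi
}

# Hypothetical wrapper: fetch the body, discard it, and classify the result.
# The URL is an operator-supplied argument; no real endpoint is assumed here.
probe_nar_body() {
  bytes=$(curl --silent --output /dev/null --write-out '%{size_download}' "$1")
  rc=$?
  echo "curl exit: $rc"
  echo "actual bytes: ${bytes:-0}"
  classify_nar_fetch "$rc" "${bytes:-0}"
}
```

A "partial" verdict with zero bytes matches the incident shape described above; a "clean" verdict on a single object is not by itself publication repair evidence.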

The expected-red RustFS State Authority Canary is signal, not noise. Do not silence it by treating green plan-only checks, ARC runner dispatch, RBE proof, or OpenTofu state-only checks as Attic publication repair evidence.

Decision Lanes

Choose exactly one lane before doing live work.

| Lane | When To Choose It | Required Proof Before TIN-1147 Can Close |
| --- | --- | --- |
| rustfs_repair_reindex | A non-restart deployed RustFS admin operation can repair or rebuild the API bucket index from preserved disk markers. | Pre/post S3 API evidence, pre/post disk-marker evidence, scratch write/list/delete, and small/medium Attic publication profiles without NoSuchBucket, HTTP 500, or InternalServerError. |
| rustfs_upgrade_topology | The operator chooses the digest-pinned beta.4 candidate/topology path from TIN-1152. | Explicit maintenance-window approval, saved OpenTofu plan guarded by just tofu-plan-guard attic and just rustfs-upgrade-topology-plan-guard, post-upgrade state/bucket/NAR checks, and small/medium publication profiles. |
| backend_replacement | RustFS should stop being trusted Attic publication storage. | A non-secret replacement package passes just attic-backend-replacement-package-gate, scratch object proof passes on the replacement backend, read compatibility/rollback is documented, and small/medium publication profiles pass there. |

If none of these lanes has an operator owner and a proof path, leave TIN-1147 open and keep TIN-1046, TIN-1630, and trusted publication blocked.

Pre-Decision Checklist

Run or cite current evidence before selecting a lane:

just rustfs-trusted-publication-gate-check
just rustfs-upgrade-topology-candidate-check
just rustfs-upgrade-topology-proof-plan-check

Review the latest expected-red canary artifact and confirm it preserves:

  • FAIL: bucket tofu-state is absent from list-buckets while disk bucket markers are present
  • FAIL: bucket attic is absent from list-buckets while disk bucket markers are present
  • curl exit: 18
  • actual bytes: 0
  • FAIL: narinfo exists but NAR body did not stream cleanly
  • NO_LIVE_HA_STATE_CANDIDATE

If the artifact lacks those lines, fix artifact capture before running live repair work.
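
The artifact review above can be scripted as a plain substring check. This is a sketch under stated assumptions: the artifact is a text file, each required line appears verbatim somewhere in it (possibly with a prefix such as a timestamp), and the function name is hypothetical.

```shell
# Hypothetical helper: exits 0 only if every required evidence line is present.
# Matching is fixed-string and substring-based, so prefixed lines still match.
canary_artifact_preserves_evidence() {
  artifact="$1"
  missing=0
  while IFS= read -r needle; do
    [ -n "$needle" ] || continue
    if ! grep -Fq -- "$needle" "$artifact"; then
      echo "missing: $needle" >&2
      missing=1
    fi
  done <<'EOF'
FAIL: bucket tofu-state is absent from list-buckets while disk bucket markers are present
FAIL: bucket attic is absent from list-buckets while disk bucket markers are present
curl exit: 18
actual bytes: 0
FAIL: narinfo exists but NAR body did not stream cleanly
NO_LIVE_HA_STATE_CANDIDATE
EOF
  [ "$missing" -eq 0 ]
}
```

Each missing line is reported on stderr, which makes it easier to see which part of artifact capture needs fixing.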

Upgrade/Topology Maintenance Window

Use this only if the selected lane is rustfs_upgrade_topology.

Required before apply:

  1. Confirm PR #811 or its successor is non-mutating: honey.tfvars still preserves the beta.1 rollback image in source.

  2. Record the live RustFS image and rollback digest.

  3. Make a maintenance-window branch or operator patch that changes only tofu/stacks/attic/honey.tfvars rustfs_image to the digest-pinned beta.4 candidate.

  4. Produce a saved Attic OpenTofu plan.

  5. Run:

    just tofu-plan-guard attic
    just rustfs-upgrade-topology-plan-guard

  6. Confirm the saved plan changes only the live RustFS Deployment image and, if the shared module input requires it, the drained legacy StatefulSet template.

  7. Confirm the saved plan has zero destroy and no Secret, selector, PVC, storage-class, service, Attic API, Attic GC, or Bazel cache drift.
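
Steps 6 and 7 can be spot-checked by hand against the plan's JSON form. The sketch below assumes jq is installed and that the plan JSON comes from tofu show -json on the saved plan file; the function names and the plan file name in the usage note are hypothetical. It supplements the two plan guards and does not replace them.

```shell
# Both helpers read the JSON plan representation on stdin.
# Requires jq; function names are hypothetical.

# Count resources whose planned actions include "delete"
# (covers plain destroys and delete/create replacements).
plan_destroy_count() {
  jq '[.resource_changes[]? | select(.change.actions | index("delete") != null)] | length'
}

# List the address of every resource that is not a no-op, one per line.
# Data-source reads, if any, also appear here and can be ignored by eye.
plan_changed_addresses() {
  jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .address'
}
```

For example, with a saved plan named attic.tfplan (a placeholder name), tofu show -json attic.tfplan | plan_destroy_count should print 0, and tofu show -json attic.tfplan | plan_changed_addresses should list only the RustFS Deployment and, where applicable, the drained legacy StatefulSet.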

Required after apply:

just tofu-state-authority-deep-check attic
just rustfs-bucket-index-rca --bucket attic --scratch-probe --strict-scratch-disk-markers
just attic-nar-integrity-check

Then run representative small-check and medium-check Attic publication probes. Do not restore trusted publication defaults until those probes pass and TIN-1046 records the staged ramp.

Rollback immediately if state readiness, bucket-index RCA, NAR integrity, or publication probes regress.
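
The post-apply sequence and its rollback trigger can be wrapped so that the first regression stops the run. This is a sketch only: the function name, the JUST override variable, and the ROLLBACK messages are hypothetical, and the publication probes still have to be run separately afterwards.

```shell
# Hypothetical wrapper: run the three post-apply checks in order and stop at
# the first failure. JUST may be overridden (for example with a stub) for dry
# runs; it defaults to the real `just` binary.
run_post_apply_checks() {
  runner="${JUST:-just}"
  "$runner" tofu-state-authority-deep-check attic || {
    echo "ROLLBACK: state readiness regressed" >&2
    return 1
  }
  "$runner" rustfs-bucket-index-rca --bucket attic --scratch-probe --strict-scratch-disk-markers || {
    echo "ROLLBACK: bucket-index RCA regressed" >&2
    return 1
  }
  "$runner" attic-nar-integrity-check || {
    echo "ROLLBACK: NAR integrity regressed" >&2
    return 1
  }
  echo "post-apply checks passed; publication probes still required"
}
```

A non-zero return is the signal to begin rollback; a zero return only means the small-check and medium-check probes may proceed.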

Replacement Backend Package

Use this only if the selected lane is backend_replacement.

Generate a non-secret package skeleton:

just attic-backend-replacement-package-template /tmp/attic-backend-replacement.json

Fill in non-secret endpoint, region, failure-domain, retention, restore, rollback, observability, and compatibility details. The package may name environment variables such as ATTIC_BACKEND_REPLACEMENT_ENDPOINT and ATTIC_BACKEND_REPLACEMENT_SECRET_ACCESS_KEY; it must not contain credential values.

Validate it:

just attic-backend-replacement-package-gate --package /tmp/attic-backend-replacement.json

Only after the package passes should an operator run scratch object proof and representative Attic publication profiles against the replacement backend.

Explicit Non-Proofs

Do not close TIN-1147 with any of these:

  • restart-only recovery
  • green canary-only coherence
  • source-only admin route existence
  • background-heal observability without a proved repair
  • Attic write restore without controlled recurrence or representative failing-state proof
  • ARC runner dispatch evidence
  • Bazel RBE proof evidence
  • OpenTofu state-only HA proof
  • a digest-pinned image upgrade without post-upgrade publication evidence

Exit Criteria

TIN-1147 can move toward closure only after one selected lane produces:

  • pre-change failing-state evidence or representative recurrence evidence
  • post-change S3 API and disk-marker evidence where RustFS remains involved
  • clean NAR body streaming for incident-shaped objects still available
  • representative small-check and medium-check Attic publication success
  • rollback/quarantine evidence
  • TIN-1046 staged write-ramp record

Until then, keep trusted Attic publication quarantined and keep TIN-1630 package runner flips held.
