RustFS Trusted Publication Decision Runbook
Use this runbook for TIN-1147 when deciding how to restore trusted Attic publication after the RustFS bucket-index and NAR-body failures.
This is a decision and evidence runbook only. It does not authorize a live OpenTofu apply, RustFS restart, RustFS image upgrade, ARC apply, package runner flip, or trusted Attic write ramp. Those actions require an explicit operator maintenance-window approval and must retain quarantine until TIN-1046 records the staged write ramp.
Current Stop State
Trusted Attic publication stays quarantined while `rustfs-trusted-publication-backend-gate.json` has `decision_state: no_go_until_selected_path_proved`.
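A minimal sketch of reading the stop state, assuming the gate file is a JSON object with a top-level `decision_state` key (the file layout is not specified here, so treat that as an assumption). The helper name `is_stop_state` is hypothetical.

```python
import json

# Stop-state value quoted in this runbook.
NO_GO = "no_go_until_selected_path_proved"


def is_stop_state(gate_text: str) -> bool:
    """Return True when the gate document records the no-go decision state.

    Assumption: the gate is a JSON object with a top-level "decision_state".
    Unparseable input or a missing key counts as stop state (fail closed).
    A False result only means some other state is recorded; it is not an
    approval to publish.
    """
    try:
        gate = json.loads(gate_text)
    except json.JSONDecodeError:
        return True
    if not isinstance(gate, dict):
        return True
    state = gate.get("decision_state")
    return state is None or state == NO_GO
```

Failing closed keeps a damaged or truncated gate file from being read as permission to lift quarantine.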
Current post-PR #815 evidence:
- `tofu-state` bucket markers exist on disk while `list-buckets` omits the bucket.
- `attic` bucket markers exist on disk while `list-buckets` omits the bucket.
- incident-shaped NAR metadata exists, but the NAR body fails with curl exit 18 and zero transferred bytes.
- `rustfs-bucket-ensure-*` activity is visible only in namespace events after short-lived Job/Pod cleanup.
- no live HA state candidate is selected.
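The NAR-body failure above (metadata present, curl exit 18, zero bytes) can be expressed as a small check. This is a sketch only: it assumes the narinfo carries a `FileSize` field, which the narinfo format allows but does not require, and the function names are hypothetical. curl exit code 18 is its "partial file" error.

```python
def parse_narinfo(text: str) -> dict:
    """Parse "Key: value" lines from a narinfo document into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def nar_body_streamed_cleanly(narinfo_text: str, curl_exit: int, bytes_received: int) -> bool:
    """Return True only if the transfer succeeded and the size matches.

    Assumption: FileSize, when present, is the byte length of the object
    served at the narinfo URL. Without it, any non-empty body is accepted.
    """
    if curl_exit != 0:
        return False
    expected = parse_narinfo(narinfo_text).get("FileSize")
    if expected is None:
        return bytes_received > 0
    return bytes_received == int(expected)
```

The incident shape corresponds to `nar_body_streamed_cleanly(narinfo, 18, 0)`, which returns `False` even though the narinfo itself parses.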
The expected-red RustFS State Authority Canary is signal, not noise. Do not silence it by treating green plan-only checks, ARC runner dispatch, RBE proof, or OpenTofu state-only checks as Attic publication repair evidence.
Decision Lanes
Choose exactly one lane before doing live work.
| Lane | When To Choose It | Required Proof Before TIN-1147 Can Close |
|---|---|---|
| `rustfs_repair_reindex` | A non-restart deployed RustFS admin operation can repair or rebuild the API bucket index from preserved disk markers. | Pre/post S3 API evidence, pre/post disk-marker evidence, scratch write/list/delete, and small/medium Attic publication profiles without NoSuchBucket, HTTP 500, or InternalServerError. |
| `rustfs_upgrade_topology` | The operator chooses the digest-pinned beta.4 candidate/topology path from TIN-1152. | Explicit maintenance-window approval, saved OpenTofu plan guarded by `just tofu-plan-guard attic` and `just rustfs-upgrade-topology-plan-guard`, post-upgrade state/bucket/NAR checks, and small/medium publication profiles. |
| `backend_replacement` | RustFS should stop being trusted Attic publication storage. | A non-secret replacement package passes `just attic-backend-replacement-package-gate`, scratch object proof passes on the replacement backend, read compatibility/rollback is documented, and small/medium publication profiles pass there. |
If none of these lanes has an operator owner and a proof path, leave TIN-1147 open and keep TIN-1046, TIN-1630, and trusted publication blocked.
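The selection rule above (exactly one lane, with an operator owner and a proof path) can be sketched as a small validator. The lane names come from the table; the function name and return shape are hypothetical.

```python
# Lane identifiers from the Decision Lanes table.
LANES = ("rustfs_repair_reindex", "rustfs_upgrade_topology", "backend_replacement")


def lane_decision(selected, has_operator_owner, has_proof_path):
    """Return (ok, reason) for a proposed lane selection."""
    unknown = [lane for lane in selected if lane not in LANES]
    if unknown:
        return False, "unknown lane(s): " + ", ".join(unknown)
    if len(selected) != 1:
        return False, "choose exactly one lane before doing live work"
    if not (has_operator_owner and has_proof_path):
        # TIN-1046, TIN-1630, and trusted publication stay blocked too.
        return False, "no operator owner or proof path: leave TIN-1147 open"
    return True, "lane " + selected[0] + " selected"
```

For example, `lane_decision(["rustfs_upgrade_topology"], True, False)` is rejected because the lane has no proof path yet.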
Pre-Decision Checklist
Run or cite current evidence before selecting a lane:
just rustfs-trusted-publication-gate-check
just rustfs-upgrade-topology-candidate-check
just rustfs-upgrade-topology-proof-plan-check
Review the latest expected-red canary artifact and confirm it preserves:
- `FAIL: bucket tofu-state is absent from list-buckets while disk bucket markers are present`
- `FAIL: bucket attic is absent from list-buckets while disk bucket markers are present`
- `curl exit: 18`
- `actual bytes: 0`
- `FAIL: narinfo exists but NAR body did not stream cleanly`
- `NO_LIVE_HA_STATE_CANDIDATE`
If the artifact lacks those lines, fix artifact capture before running live repair work.
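One way to confirm the artifact preserves those lines is a substring scan, sketched below. It assumes the canary artifact is plain text and that each marker appears within a single line, possibly with a prefix such as a timestamp. The function name is hypothetical.

```python
# Marker lines the expected-red canary artifact must preserve.
REQUIRED_CANARY_LINES = (
    "FAIL: bucket tofu-state is absent from list-buckets while disk bucket markers are present",
    "FAIL: bucket attic is absent from list-buckets while disk bucket markers are present",
    "curl exit: 18",
    "actual bytes: 0",
    "FAIL: narinfo exists but NAR body did not stream cleanly",
    "NO_LIVE_HA_STATE_CANDIDATE",
)


def missing_canary_lines(artifact_text: str) -> list:
    """Return the required markers that no line of the artifact contains."""
    lines = artifact_text.splitlines()
    return [
        marker
        for marker in REQUIRED_CANARY_LINES
        if not any(marker in line for line in lines)
    ]
```

A non-empty result means artifact capture needs fixing before any live repair work starts.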
Upgrade/Topology Maintenance Window
Use this only if the selected lane is rustfs_upgrade_topology.
Required before apply:
- Confirm PR #811 or its successor is non-mutating: `honey.tfvars` still preserves the beta.1 rollback image in source.
- Record the live RustFS image and rollback digest.
- Make a maintenance-window branch or operator patch that changes only `rustfs_image` in `tofu/stacks/attic/honey.tfvars` to the digest-pinned beta.4 candidate.
- Produce a saved Attic OpenTofu plan.
- Run `just tofu-plan-guard attic` and `just rustfs-upgrade-topology-plan-guard`.
- Confirm the saved plan changes only the live RustFS Deployment image and, if the shared module input requires it, the drained legacy StatefulSet template.
- Confirm the saved plan has zero destroy and no Secret, selector, PVC, storage-class, service, Attic API, Attic GC, or Bazel cache drift.
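The two plan confirmations above can be cross-checked by hand against the JSON form of the saved plan (`tofu show -json <planfile>`). The sketch below is not the logic of the `just` guards, which this runbook does not describe. It only reads the standard `resource_changes` array, and the allowed resource types passed in are an assumption about how the stack models the Deployment and StatefulSet.

```python
import json


def plan_violations(plan_json_text: str, allowed_types: set) -> list:
    """List planned changes that break the zero-destroy / narrow-scope rule.

    allowed_types is a caller-supplied assumption, for example
    {"kubernetes_deployment_v1", "kubernetes_stateful_set_v1"}.
    This does not verify that only the image field changes.
    """
    plan = json.loads(plan_json_text)
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions in (["no-op"], ["read"]):
            continue  # nothing is mutated for this resource
        address = rc.get("address", "<unknown>")
        if "delete" in actions:
            violations.append(address + ": destroy planned")
        elif rc.get("type") not in allowed_types:
            violations.append(address + ": unexpected change to " + str(rc.get("type")))
    return violations
```

An empty list is necessary but not sufficient: the field-level diff still needs a manual read to confirm the image is the only attribute that moves.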
Required after apply:
just tofu-state-authority-deep-check attic
just rustfs-bucket-index-rca --bucket attic --scratch-probe --strict-scratch-disk-markers
just attic-nar-integrity-check
Then run representative small-check and medium-check Attic publication probes. Do not restore trusted publication defaults until those probes pass and TIN-1046 records the staged ramp.
Rollback immediately if state readiness, bucket-index RCA, NAR integrity, or publication probes regress.
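The post-apply sequence and the rollback trigger can be sketched as a stop-at-first-failure runner. The commands are the three listed above; the assumption is that each recipe exits non-zero on failure. The publication probes are not included because their invocation is not given here, and the function names are hypothetical.

```python
import subprocess

# Post-apply checks, in the order listed in this runbook.
POST_APPLY_CHECKS = (
    ("just", "tofu-state-authority-deep-check", "attic"),
    ("just", "rustfs-bucket-index-rca", "--bucket", "attic",
     "--scratch-probe", "--strict-scratch-disk-markers"),
    ("just", "attic-nar-integrity-check"),
)


def _run(argv):
    """Run one command and return its exit code (assumed non-zero on failure)."""
    return subprocess.run(argv, check=False).returncode


def first_failed_check(run=_run):
    """Return the first failing command as a tuple, or None if all pass."""
    for argv in POST_APPLY_CHECKS:
        if run(list(argv)) != 0:
            return argv
    return None
```

Any non-`None` result is a rollback signal; a `None` result only clears the way for the small-check and medium-check probes.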
Replacement Backend Package
Use this only if the selected lane is backend_replacement.
Generate a non-secret package skeleton:
just attic-backend-replacement-package-template /tmp/attic-backend-replacement.json
Fill in non-secret endpoint, region, failure-domain, retention, restore,
rollback, observability, and compatibility details. The package may name
environment variables such as ATTIC_BACKEND_REPLACEMENT_ENDPOINT and
ATTIC_BACKEND_REPLACEMENT_SECRET_ACCESS_KEY; it must not contain credential
values.
Validate it:
just attic-backend-replacement-package-gate --package /tmp/attic-backend-replacement.json
Only after the package passes should an operator run scratch object proof and representative Attic publication profiles against the replacement backend.
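Before handing the package to the gate, a quick local scan can catch an accidentally pasted credential. This sketch does not reproduce the gate's rules, which are not documented here. It assumes that secret-like fields should hold only an environment variable name, and the key patterns and function name are hypothetical.

```python
import json
import re

# An environment variable name such as ATTIC_BACKEND_REPLACEMENT_ENDPOINT.
ENV_NAME = re.compile(r"^[A-Z][A-Z0-9_]*$")
# Hypothetical hints that a key is meant to reference a credential.
SECRET_KEY_HINT = re.compile(r"secret|password|token|access_key", re.IGNORECASE)


def suspicious_fields(package_text: str) -> list:
    """Return paths of secret-like fields whose value is not an env var name."""
    findings = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, path + "." + key if path else key)
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, path + "[" + str(index) + "]")
        elif isinstance(node, str) and node:
            leaf = path.rsplit(".", 1)[-1]
            if SECRET_KEY_HINT.search(leaf) and not ENV_NAME.match(node):
                findings.append(path)

    walk(json.loads(package_text), "")
    return findings
```

Empty strings are skipped so that an unfilled template skeleton does not trip the scan.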
Explicit Non-Proofs
Do not close TIN-1147 with any of these:
- restart-only recovery
- green canary-only coherence
- source-only admin route existence
- background-heal observability without a proved repair
- Attic write restore without controlled recurrence or representative failing-state proof
- ARC runner dispatch evidence
- Bazel RBE proof evidence
- OpenTofu state-only HA proof
- a digest-pinned image upgrade without post-upgrade publication evidence
Exit Criteria
TIN-1147 can move toward closure only after one selected lane produces:
- pre-change failing-state evidence or representative recurrence evidence
- post-change S3 API and disk-marker evidence where RustFS remains involved
- clean NAR body streaming for incident-shaped objects still available
- representative small-check and medium-check Attic publication success
- rollback/quarantine evidence
- TIN-1046 staged write-ramp record
Until then, keep trusted Attic publication quarantined and keep TIN-1630 package runner flips held.
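The exit criteria can be tracked as a simple evidence checklist. In this sketch the criterion keys are hypothetical labels for the bullets above, and the S3 API/disk-marker item is required only when RustFS remains involved, as the list states.

```python
# Hypothetical labels for the exit criteria listed above.
ALWAYS_REQUIRED = (
    "pre_change_failing_state_or_recurrence",
    "clean_nar_body_streaming",
    "small_and_medium_publication_success",
    "rollback_quarantine_evidence",
    "tin_1046_staged_write_ramp_record",
)
RUSTFS_ONLY = ("post_change_s3_api_and_disk_markers",)


def missing_exit_evidence(recorded, rustfs_involved=True) -> list:
    """Return the exit criteria not yet backed by recorded evidence."""
    required = ALWAYS_REQUIRED + (RUSTFS_ONLY if rustfs_involved else ())
    return [name for name in required if name not in recorded]
```

TIN-1147 stays open, and quarantine stays in place, for as long as this list is non-empty for the selected lane.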