Cache and State Backend Roles

Canonical reference for cache systems and state backend authority in GloriousFlywheel. Each system has a specific role; this document is the single place that defines what each does, where it runs, and what its limits are.

For the FlakeHub vs Attic evaluation and hybrid architecture rationale, see FlakeHub vs Attic.

Cache Roles

| System | Role | Endpoint | Backing Store | Notes |
| --- | --- | --- | --- | --- |
| Attic | Internal CI binary cache for Nix store paths | nix-cache.tinyland.dev (HTTPS) / attic.nix-cache.svc (cluster-internal) | RustFS S3 on bumble (OpenEBS ZFS) | Primary CI cache. Chunk-level NAR dedup. Signing keys per machine/runner. |
| FlakeHub | Public Nix flake publication and discovery | flakehub.com/f/tinyland-inc/GloriousFlywheel/* | Determinate Systems SaaS | Publication only; not in the active runtime path. Used for downstream discoverability, not as a CI cache. |
| Bazel remote cache | Optional build acceleration for Bazel targets | bazel-cache.tinyland.dev | Disk-backed on honey (not S3-backed currently) | Cache miss = rebuild. Non-durable. |
| RustFS | S3-compatible object storage backend for Attic | Not user-facing | OpenEBS ZFS on bumble | NOT a user-facing cache. Provides the S3 API that Attic writes to. |
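CI runners consume the Attic cache as an ordinary Nix substituter. A minimal sketch of the runner-side configuration, using the cluster-internal endpoint from the table above; the cache name and the Attic public key are placeholders, not values from this document:

```ini
# /etc/nix/nix.conf on a CI runner -- illustrative values.
# "gloriousflywheel" and the attic key are placeholders; the
# cache.nixos.org key is the well-known upstream key.
substituters = http://attic.nix-cache.svc/gloriousflywheel https://cache.nixos.org
trusted-public-keys = gloriousflywheel:PLACEHOLDER_KEY= cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=
```

Attic serves each cache under a path segment of the server URL, so adding a second cache is just another substituter entry.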

Relationship diagram

```
CI Runners (honey)
    |
    v
  Attic  ──S3 API──>  RustFS (bumble, OpenEBS ZFS)
    |
    v
  /nix/store PVC

Bazel targets ──gRPC──> Bazel remote cache (honey, disk)

Post-merge publish ──OIDC──> FlakeHub (SaaS)
```
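The Bazel path in the diagram is opt-in per checkout via .bazelrc. A sketch using standard Bazel flags; the endpoint is from the cache table above, while the port is an assumption:

```
# .bazelrc -- illustrative; port and plaintext gRPC are assumptions
build --remote_cache=grpc://bazel-cache.tinyland.dev:9092
# Upload local build results so other runners can reuse them
build --remote_upload_local_results=true
```

Because the cache is non-durable, a miss or an outage simply means Bazel rebuilds locally; no configuration change is needed on failure.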

State Backend

| Property | Current | Planned |
| --- | --- | --- |
| Backend type | HTTP backend (GitLab Managed Terraform State) | S3-compatible backend (RustFS or similar) |
| Rationale | Transitional; inherited from GitLab-primary era | Local-first operation; no SaaS dependency in the critical path |
| Stack isolation | Per-stack state isolation | Per-stack state isolation (unchanged) |

OpenTofu stacks use per-stack state files. The backend type is orthogonal to stack isolation — migrating from HTTP to S3 changes the transport, not the state partitioning.
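Concretely, the migration touches only each stack's backend block. A before/after sketch; all addresses, bucket names, and the port are placeholders, and exact S3-backend attribute names vary across OpenTofu versions:

```hcl
# Current: GitLab-managed HTTP state backend (address illustrative)
terraform {
  backend "http" {
    address = "https://gitlab.example.com/api/v4/projects/<id>/terraform/state/<stack>"
  }
}
```

```hcl
# Planned: S3-compatible backend against RustFS (all values illustrative)
terraform {
  backend "s3" {
    bucket = "tofu-state"
    key    = "stacks/<stack>/terraform.tfstate"
    region = "us-east-1" # required by the backend, ignored by RustFS
    endpoints {
      s3 = "http://rustfs.example.svc:9000"
    }
    # Typical settings for non-AWS S3 endpoints:
    use_path_style              = true
    skip_credentials_validation = true
    skip_region_validation      = true
  }
}
```

In both cases the key (or HTTP address) stays unique per stack, which is what preserves the per-stack state isolation noted above.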

HA and Durability Limits

None of these systems are HA today. This section is intentionally honest about what fails when a node goes down.

| Component | Deployment | HA | Replication | Backup |
| --- | --- | --- | --- | --- |
| Attic PostgreSQL | Single replica on bumble (OpenEBS ZFS) | No | None | None off-site |
| RustFS | Single replica on bumble (OpenEBS ZFS) | No | None | None off-site |
| Bazel remote cache | Disk-backed on honey | No | None | None (non-durable by design) |
| ZFS volumes | lz4 compression on bumble | No | None | None off-site |

Impact of bumble outage: Attic and RustFS are unavailable. CI builds fall back to uncached Nix evaluation (slow but functional). Bazel cache on honey survives independently but is also non-durable.
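The uncached fallback during a bumble outage is standard Nix substituter behavior, and it can be made explicit on the runners. `fallback` and `connect-timeout` are real nix.conf settings; enabling them here is an assumption about this deployment, and the timeout value is illustrative:

```ini
# nix.conf on CI runners -- build locally when a substituter is unreachable
fallback = true
# Shorten the wait on a dead substituter (seconds)
connect-timeout = 5
```

Without a short connect timeout, each uncached build may stall while Nix waits on the unreachable Attic endpoint before falling back.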

Impact of honey outage: All caches and runners are unavailable. No failover cluster exists.
