Cache and State Backend Roles
Canonical internal reference for cache systems and state backend authority in GloriousFlywheel.
This page is the contract-facing summary of the current live cache and state
shape. It should agree with the internal live contract note in
docs/research/.
Current Contract
- honey is the only active physical cluster target
- Attic and Bazel are acceleration layers, not publication surfaces
- Attic and Bazel remote cache are shared acceleration layers for both CI and developer workflows
- Attic and Bazel are part of the pooled substrate contract, not CI-only decorations
- FlakeHub is publication and discovery only
- the four active infrastructure stacks now use the honey-local S3-compatible state path
- GitLab-managed HTTP state is compatibility-only
- current main proves shared cache acceleration plus narrow explicit REAPI proofs for selected target classes
- current main does not prove full remote execution or full remote-builder offload for every developer workload
Cache Surfaces
| System | Primary audience | Canonical current address | Backing store | Current role |
|---|---|---|---|---|
| Attic API | in-cluster runners | http://attic.nix-cache.svc.cluster.local | RustFS-backed object storage on honey | runner-side Nix cache API |
| Attic HTTPS | operators and internal consumers | https://nix-cache.tinyland.dev | same Attic service family | human-facing read path and internal API base |
| Bazel remote cache | in-cluster runners | grpc://bazel-cache.nix-cache.svc.cluster.local:9092 | RustFS/S3-backed object storage with local hot cache | optional Bazel acceleration |
| FlakeHub | public publication/discovery | https://flakehub.com/f/tinyland-inc/GloriousFlywheel/* | Determinate Systems SaaS | flake publication only |
| RustFS | internal storage plane | not user-facing | OpenEBS-backed object storage on bumble | S3-compatible storage backing the Attic family |
Attic contract
Current Attic truth:
- shared self-hosted runners talk to the cluster-internal API endpoint
- the default shared cache name is main
- the shared main cache is public-read and credentialed-write
- developer workflows are also meant to consume this shared cache-backed substrate where the repo wiring proves it
- internal human or dev-machine reads should use the HTTPS endpoint with the cache path they intend to consume, for example https://nix-cache.tinyland.dev/main
- write access is internal and credentialed, not a public default
- cache signing keys, JWT signing keys, workflow ATTIC_TOKEN values, and RustFS/S3 credentials are separate artifacts with separate rotation rules
- pull requests must remain read-only for Attic publication
- pilot and downstream examples may use default-branch plus ATTIC_TOKEN-gated push-cache, but broad GloriousFlywheel proof workflows still keep push-cache: "false" while the 2026-05-08 RustFS bucket-index recurrence is unresolved
- scripts/validate-attic-write-quarantine.py statically enforces that split across .github/workflows and operator-facing docs
- the manual publication probe requires profile-specific confirmation: confirm=probe-attic-publication for the one-path synthetic profile, confirm=probe-attic-publication-small-check for the known-risk bounded statix check-output profile, and confirm=probe-attic-publication-medium-check for the known-risk representative deadnix medium closure
- the one-path synthetic profile now has repeated clean evidence, including 2026-05-13 run 25816771239; this does not override the known small-check and medium-check reproduction failures
- a current-main repeat medium-check run, 25817881900, still reproduced the failure with a 22-path Attic push delta and post-failure S3 list-buckets loss for both attic and tofu-state; restart recovery restored guarded reads but did not prove a non-restart repair
- manual publication probe artifacts include the requested closure inventory, the actual Attic push delta, and Attic push stdout/stderr logs; push stderr is classified for RustFS/S3 bucket-index recurrence and credential/auth signatures; both real-output profiles are controlled reproduction tools, not a restore path for default workflow writes
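
The push-cache quarantine above lends itself to a static check. The following is a minimal sketch only, not the contents of scripts/validate-attic-write-quarantine.py; the function name and the assumption that workflows set the value on a plain `push-cache: "<value>"` line are illustrative.

```python
import re

# Matches uncommented lines such as:  push-cache: "false"   or   push-cache: true
PUSH_CACHE_RE = re.compile(r'^\s*push-cache:\s*["\']?([^"\'\s#]+)["\']?', re.MULTILINE)


def find_push_cache_violations(workflow_text: str) -> list[str]:
    """Return every push-cache value in the text that is not the literal "false"."""
    values = PUSH_CACHE_RE.findall(workflow_text)
    return [value for value in values if value != "false"]


# Illustrative workflow fragments (hypothetical, not copied from the repo).
quarantined = 'with:\n  push-cache: "false"\n'
writable = 'with:\n  push-cache: "true"\n'

quarantined_violations = find_push_cache_violations(quarantined)
writable_violations = find_push_cache_violations(writable)
```

A real validator would also need to allow the documented default-branch plus ATTIC_TOKEN-gated exceptions; this sketch only flags any value other than "false".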
Bazel contract
Current Bazel truth:
- the only stable default contract today is the cluster-internal runner path
- the live cache persists through the bazel-cache bucket on the OpenEBS-backed attic-rustfs-openebs service; the pod-local /data volume is only a hot cache
- on 2026-05-25 the source Bazel proof reported a remote-cache digest mismatch while reading the RustFS-backed bazel-cache bucket through bazel-remote. The implicated cas.v2 key was removed as cache-only acceleration data, and that delete reproduced the RustFS bucket-index class: S3 list-buckets went empty while on-disk bucket markers remained, and a controlled attic-rustfs-openebs restart was required to restore API visibility. That is cache acceleration corruption, not source-code evidence, and it reinforces that the current RustFS-backed Bazel cache is not RBE CAS/action-cache authority. Do not infer corruption from raw S3 object hashes alone: the live bazel-remote backend stores zstd-encoded objects, so audits must hash decoded payload bytes.
- Bazel cache is part of the intended shared developer-plus-CI substrate, not a CI-only decoration
- no stable general-consumer external Bazel endpoint is promised yet
- any private or operator-only Bazel hostname is internal implementation detail, not the onboarding contract
- current cache-backed local execution is real; universal remote execution is not yet the proved default contract
- executor-backed mode is available only when BAZEL_REMOTE_EXECUTOR is set separately from BAZEL_REMOTE_CACHE and the target class is eligible through config/rbe-target-eligibility.json
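
The rule that audits must hash decoded payload bytes can be illustrated with a short sketch. It assumes the CAS key is the SHA-256 of the uncompressed blob, and it takes the decoder as a parameter because the exact on-disk framing of bazel-remote objects is not described here. The demo uses zlib purely as a standard-library stand-in for the real zstd decoder; all names are hypothetical.

```python
import hashlib
import zlib
from typing import Callable


def decoded_sha256(stored: bytes, decode: Callable[[bytes], bytes]) -> str:
    """Hash the decoded payload, not the bytes as stored in the bucket."""
    return hashlib.sha256(decode(stored)).hexdigest()


def audit_object(expected_digest: str, stored: bytes, decode: Callable[[bytes], bytes]) -> bool:
    """True when the decoded payload matches the expected content digest."""
    return decoded_sha256(stored, decode) == expected_digest


# Stand-in demo: zlib replaces the real decoder only so the sketch runs as written.
payload = b"example action output"
expected = hashlib.sha256(payload).hexdigest()
stored = zlib.compress(payload)

# Hashing the stored (encoded) bytes disagrees with the content digest ...
raw_hash_matches = hashlib.sha256(stored).hexdigest() == expected
# ... while hashing the decoded payload agrees.
decoded_hash_matches = audit_object(expected, stored, zlib.decompress)
```

A mismatch on the raw hash is therefore expected for a healthy object; only a mismatch on the decoded payload is evidence of corruption.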
Developer-machine attachment
There are two cache attachment modes, and they must not be blurred.
| Context | Attic behavior | Bazel behavior | Contract status |
|---|---|---|---|
| Shared self-hosted runner | workflow setup injects ATTIC_SERVER=http://attic.nix-cache.svc.cluster.local and ATTIC_CACHE=main when reachable | workflow setup injects BAZEL_REMOTE_CACHE=grpc://bazel-cache.nix-cache.svc.cluster.local:9092 | proved source-repo CI path |
| Internal developer machine | .envrc derives ATTIC_SERVER=https://nix-cache.<domain> and ATTIC_CACHE=main; Nix may use https://nix-cache.tinyland.dev/main as a substituter when trusted locally | .envrc leaves BAZEL_REMOTE_CACHE empty unless the operator provides a routable endpoint | supported as explicit attachment, not automatic discovery |
| Future public or third-party consumer | use documented variables and exported public docs, not private Tinyland topology | use documented variables and exported public docs, not private Tinyland topology | projection only until a public product contract exists |
If BAZEL_REMOTE_CACHE is empty, just info must report
compatibility-local-only. That is not a failure; it is the guardrail that
prevents developer machines from silently depending on stale or invented
endpoints.
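
The reporting rule above can be sketched as a small function. This is an illustration under assumptions, not the implementation behind just info: only the compatibility-local-only label comes from this page, and the label returned for a configured endpoint is hypothetical.

```python
import os
from typing import Mapping

LOCAL_ONLY = "compatibility-local-only"  # label required by the contract above
REMOTE_ATTACHED = "remote-cache-attached"  # hypothetical label, for illustration only


def bazel_cache_mode(env: Mapping[str, str]) -> str:
    """Report local-only mode when BAZEL_REMOTE_CACHE is unset, empty, or blank."""
    endpoint = env.get("BAZEL_REMOTE_CACHE", "").strip()
    return REMOTE_ATTACHED if endpoint else LOCAL_ONLY


if __name__ == "__main__":
    # Reports the mode for the current shell environment.
    print(bazel_cache_mode(os.environ))
```

Treating a blank value the same as an unset one keeps a stray empty export from reading as a configured endpoint.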
State Backend
Active stack authority
| Stack | Current backend | Proven state key |
|---|---|---|
| attic | S3-compatible | attic/terraform.tfstate |
| arc-runners | S3-compatible | arc-runners/terraform.tfstate |
| gitlab-runners | S3-compatible | tinyland-infra/gitlab-runners/terraform.tfstate |
| runner-dashboard | S3-compatible | tinyland-infra/runner-dashboard/terraform.tfstate |
Current truth:
- all four active stacks use backend "s3" on current main
- the active local operator path is the TOFU_BACKEND_S3_* family or a materialized backend HCL file consumed by just tofu-init <stack>
- TF_HTTP_* remains compatibility-only for archived or external repair paths
- RustFS has known bucket-index reliability debt: the S3 API has returned NoSuchBucket for tofu-state while both /data/tofu-state and /data/.rustfs.sys/buckets/tofu-state existed on disk. On 2026-05-08, a representative Attic publication probe also left both attic and tofu-state absent from S3 list-buckets while their disk bucket markers existed. A controlled RustFS restart restored the API view both times, but this is only an operator recovery action, not a proved non-restart repair.
- On 2026-05-19, the state-authority guard failed again after the latest controlled restart window: tofu-state was absent from S3 list-buckets while disk bucket markers remained present, and GloriousFlywheel PR #735 failed Plan ARC Runners on that same guard. Treat the configured RustFS state path as degraded until just tofu-state-ha-readiness --expect-interim passes again, and treat strict HA as blocked until TIN-1026 and TIN-1017 produce endpoint package, scratch/disposable OpenTofu, lockfile, maintenance, and cleanup evidence.
Authority order
Use this order when reasoning about state:
- repo code plus stack inputs define desired state
- S3-compatible OpenTofu state defines managed-resource authority
- live cluster state is observed state and may drift
Manual cluster edits are drift unless they are an explicit bounded operator action that is expected to reconcile later.
RustFS bucket-index guardrail
The S3 API view is the OpenTofu state authority. On-disk bucket directories are evidence for incident response, but they are not sufficient proof that OpenTofu can safely read, lock, or persist state.
If S3 returns NoSuchBucket while disk markers are present, preserve the
failed workflow/apply logs and run the repo RCA scripts before restarting
RustFS. Restarting nix-cache/attic-rustfs-openebs can restore service, but it
does not close TIN-1012 or TIN-1046. TIN-1043 closed the default-read-only
quarantine response, not the backend repair/replacement requirement.
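
The distinction between the S3 API view and on-disk markers can be sketched as a small classifier. The enum names and labels below are hypothetical and are not taken from the repo RCA scripts; the sketch only encodes the rule that the API view decides authority while disk markers are incident evidence.

```python
from enum import Enum
from typing import Optional


class BucketView(Enum):
    API_VISIBLE = "api-visible"            # S3 lists the bucket
    INDEX_INCOHERENT = "index-incoherent"  # S3 denies the bucket while disk markers exist
    NOT_VISIBLE = "not-visible"            # S3 denies the bucket and no disk evidence was seen


def classify_bucket(api_lists_bucket: bool, disk_markers_present: Optional[bool]) -> BucketView:
    """Classify one bucket; disk_markers_present is None when pod exec is unavailable."""
    if api_lists_bucket:
        return BucketView.API_VISIBLE
    if disk_markers_present:
        return BucketView.INDEX_INCOHERENT
    return BucketView.NOT_VISIBLE


# The incident shape described above: NoSuchBucket from S3, markers still on disk.
incident = classify_bucket(api_lists_bucket=False, disk_markers_present=True)
```

An index-incoherent result is the case where logs should be preserved and the RCA scripts run before any restart. An API-visible result is necessary but not sufficient for safe mutation.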
Before protected OpenTofu mutation, run the deep RustFS state authority guard:
just tofu-state-authority-deep-check <stack>
The guard checks RustFS workload health, disk bucket markers when pod exec is
available, S3 bucket and object metadata, the stack state object, optional
state-object body readability/JSON shape, and a temporary write/read/delete
proof. If apply mutates live resources but state
persistence fails, preserve errored.tfstate and use a controlled
tofu state push; do not rerun apply as the first recovery action.
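
The temporary write/read/delete proof mentioned above can be pictured with the following sketch. It runs against an in-memory stand-in rather than a real S3 client, and the ObjectStore interface, key prefix, and function name are all hypothetical; the actual guard may be implemented differently.

```python
import uuid
from typing import Protocol


class ObjectStore(Protocol):
    def put(self, key: str, body: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...
    def exists(self, key: str) -> bool: ...


class InMemoryStore:
    """Stand-in store so the sketch runs without a live S3 endpoint."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def delete(self, key: str) -> None:
        self.objects.pop(key, None)

    def exists(self, key: str) -> bool:
        return key in self.objects


def write_read_delete_proof(store: ObjectStore, prefix: str = "authority-probe/") -> bool:
    """Write a unique probe object, read it back, delete it, and confirm it is gone."""
    key = f"{prefix}{uuid.uuid4().hex}"  # unique key, never a real state object
    body = uuid.uuid4().bytes
    store.put(key, body)
    try:
        read_back_ok = store.get(key) == body
    finally:
        store.delete(key)  # always clean up the probe object
    return read_back_ok and not store.exists(key)


store = InMemoryStore()
proof_passed = write_read_delete_proof(store)
```

Using a random key under a dedicated prefix keeps the probe away from the stack state objects it is meant to protect.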
For incident capture or RCA follow-up, run:
just rustfs-bucket-index-rca --scratch-probe
This collects RustFS workload, pod, version, data-layout, bootstrap/lifecycle job, log, event, and S3 authority evidence. The scratch probe creates and deletes an isolated bucket to prove normal bucket-index API/disk coherence without touching OpenTofu state objects.
The self-hosted RustFS State Authority Canary workflow runs the same evidence
path on main, on demand, and hourly while RustFS remains the interim state
authority. It runs on the shared tinyland-nix-operator dogfood lane because
it uses kubeconfig-backed operator probes and bounded port-forwards; generic
Nix overflow lanes and hosted runners are not valid substitutes. It executes
tofu-state-ha-readiness --expect-interim, then
read-only rustfs-bucket-index-rca --bucket attic and
attic-nar-integrity-check evidence, then
rustfs-bucket-index-rca --scratch-probe --strict-scratch-disk-markers, then
the read-only ha-state-candidate-inventory classifier, and publishes the logs
as workflow artifacts. The Attic read-path evidence is marked if: always() so
the canary still captures attic bucket-index and NAR body evidence when the
current tofu-state check fails first. A green canary means the current RustFS
state path is coherent now, the Attic incident-shaped NAR body streamed, the
scratch bucket appeared in both the API and disk bucket markers, the scratch
bucket markers disappeared after API delete, and the known OpenTofu state
objects were readable as JSON state bodies. It also means the candidate
inventory completed. It is not an HA claim and it does not promote RustFS to
Bazel CAS/AC or RBE authority. When the inventory reports
NO_LIVE_HA_STATE_CANDIDATE, that is valid evidence for TIN-1012 rather than a
canary failure.
To check the state path against the HA authority gate, run:
just tofu-state-ha-readiness --expect-interim
That command is expected-red without --expect-interim until TIN-1012 proves
the implementation gate. If the command is red even with --expect-interim,
do not start protected OpenTofu mutations through this state path. TIN-1002
captured the candidate plan and guardrail; it did not make the current RustFS
path HA. Even when the interim guard is green, the current RustFS path is still
one RustFS Deployment replica on a bumble-bound OpenEBS ZFS ReadWriteOnce
PVC.
The same RustFS bucket-index class can break the Attic cache body path while
narinfo and Attic database metadata remain present. When Nix reports
Transferred a partial file from the Attic substituter, run:
just attic-nar-integrity-check --store-hash <nix-store-hash>
If that check fails, keep nix build --fallback enabled for CI safety and treat
the incident as cache-object availability debt. A cache object-store repair or
restart may restore acceleration, but it is not a substitute for the separate
HA OpenTofu state authority decision.
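
A partial-transfer symptom of this kind can be illustrated by comparing the size a narinfo advertises with the number of body bytes actually streamed. This is a minimal sketch, assuming the narinfo carries a FileSize field; it does not describe what just attic-nar-integrity-check does, and the sample narinfo values are invented for illustration.

```python
def parse_narinfo(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines from a narinfo document into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


def nar_body_is_complete(narinfo_text: str, streamed_bytes: int) -> bool:
    """True when the streamed body length equals the advertised FileSize."""
    fields = parse_narinfo(narinfo_text)
    if "FileSize" not in fields:
        raise ValueError("narinfo has no FileSize field; cannot check completeness")
    return int(fields["FileSize"]) == streamed_bytes


# Illustrative narinfo; the store path, URL, and sizes are placeholders.
sample_narinfo = (
    "StorePath: /nix/store/00000000000000000000000000000000-example\n"
    "URL: nar/example.nar.zst\n"
    "Compression: zstd\n"
    "FileSize: 1024\n"
    "NarSize: 4096\n"
)

complete = nar_body_is_complete(sample_narinfo, 1024)
truncated = nar_body_is_complete(sample_narinfo, 512)
```

A length match does not prove the body is intact; a fuller check would also verify the advertised file hash.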
This is backend hardening. It is not Bazel remote execution proof, and RustFS must not be treated as CAS/AC authority for RBE until bucket-index recovery, durability, and observability are explicitly proved or a different HA store is chosen.
HA And Durability Limits
None of these systems are HA today.
| Component | Deployment shape | HA | Durability notes |
|---|---|---|---|
| Attic API | single-cluster service on honey | No | depends on the honey cache/storage plane |
| Attic metadata database | single-node stateful service family | No | no cross-cluster failover |
| RustFS | single-node storage-biased deployment on bumble | No | no off-site backup guarantee |
| Bazel remote cache | RustFS/S3-backed service with pod-local hot cache | No | durable within the current RustFS/OpenEBS envelope; no cross-cluster failover |
Impact summary:
- loss of bumble degrades the cache/storage plane sharply
- loss of honey removes runners and cache access together
- cache misses should slow work, not redefine the platform contract
Explicitly Out Of Contract
Do not treat these as current authority:
- GitLab-managed HTTP state for the four active stacks
- attic-cache-dev as the current live cache namespace
- grpc://bazel-cache.attic-cache-dev.svc.cluster.local:9092
- https://attic.dev-cluster.example.com
- https://attic.tinyland.dev
- old fuzzy-dev cache hostnames