Cache and State Backend Roles

Canonical internal reference for cache systems and state backend authority in GloriousFlywheel.

This page is the contract-facing summary of the current live cache and state shape. It should agree with the internal live contract note in docs/research/.

Current Contract

  • honey is the only active physical cluster target
  • Attic and Bazel are acceleration layers, not publication surfaces
  • Attic and Bazel remote cache are shared acceleration layers for both CI and developer workflows
  • Attic and Bazel are part of the pooled substrate contract, not CI-only decorations
  • FlakeHub is publication and discovery only
  • the four active infrastructure stacks now use the honey-local S3-compatible state path
  • GitLab-managed HTTP state is compatibility-only
  • current main proves shared cache acceleration and carries narrow, explicit REAPI proofs for selected target classes
  • current main does not prove full remote execution or full remote-builder offload for every developer workload

Cache Surfaces

| System | Primary audience | Canonical current address | Backing store | Current role |
|---|---|---|---|---|
| Attic API | in-cluster runners | http://attic.nix-cache.svc.cluster.local | RustFS-backed object storage on honey | runner-side Nix cache API |
| Attic HTTPS | operators and internal consumers | https://nix-cache.tinyland.dev | same Attic service family | human-facing read path and internal API base |
| Bazel remote cache | in-cluster runners | grpc://bazel-cache.nix-cache.svc.cluster.local:9092 | RustFS/S3-backed object storage with local hot cache | optional Bazel acceleration |
| FlakeHub | public publication/discovery | https://flakehub.com/f/tinyland-inc/GloriousFlywheel/* | Determinate Systems SaaS | flake publication only |
| RustFS | internal storage plane | not user-facing | OpenEBS-backed object storage on bumble | S3-compatible storage backing the Attic family |

Attic contract

Current Attic truth:

  • shared self-hosted runners talk to the cluster-internal API endpoint
  • the default shared cache name is main
  • the shared main cache is public-read and credentialed-write
  • developer workflows are also meant to consume this shared cache-backed substrate where the repo wiring proves it
  • internal human or dev-machine reads should use the HTTPS endpoint with the cache path they intend to consume, for example https://nix-cache.tinyland.dev/main
  • write access is internal and credentialed, not a public default
  • cache signing keys, JWT signing keys, workflow ATTIC_TOKEN values, and RustFS/S3 credentials are separate artifacts with separate rotation rules
  • pull requests must remain read-only for Attic publication
  • pilot and downstream examples may use default-branch plus ATTIC_TOKEN gated push-cache, but broad GloriousFlywheel proof workflows still keep push-cache: "false" while the 2026-05-08 RustFS bucket-index recurrence is unresolved
  • scripts/validate-attic-write-quarantine.py statically enforces that split across .github/workflows and operator-facing docs
  • the manual publication probe requires a profile-specific confirmation value: confirm=probe-attic-publication for the one-path synthetic profile, confirm=probe-attic-publication-small-check for the known-risk bounded statix check-output profile, and confirm=probe-attic-publication-medium-check for the known-risk representative deadnix medium-closure profile
  • the one-path synthetic profile now has repeated clean evidence, including 2026-05-13 run 25816771239; this does not override the known small-check and medium-check reproduction failures
  • a current-main repeat medium-check run, 25817881900, still reproduced the failure with a 22-path Attic push delta and post-failure S3 list-buckets loss for both attic and tofu-state; restart recovery restored guarded reads but did not prove a non-restart repair
  • manual publication probe artifacts include the requested closure inventory, the actual Attic push delta, and Attic push stdout/stderr logs; push stderr is classified for RustFS/S3 bucket-index recurrence and for credential/auth signatures; the two real-output profiles (small-check and medium-check) are controlled reproduction tools, not a restore path for default workflow writes
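
The write quarantine described above lends itself to a static check. The following is a minimal illustrative sketch and not the contents of scripts/validate-attic-write-quarantine.py: the function name, the regular expression, and the simplification that every push-cache value in a proof workflow must be the literal false are assumptions made here. A real check would also need to allow the gated pilot and downstream examples.

```python
import re

# Hypothetical pattern: matches lines such as
#   push-cache: "false"
#   push-cache: true
PUSH_CACHE_RE = re.compile(
    r"""^\s*push-cache:\s*["']?([^"'\s#]+)["']?""",
    re.MULTILINE,
)


def find_push_cache_violations(workflow_text: str) -> list[str]:
    """Return every push-cache value in the text that is not the literal false."""
    return [
        value
        for value in PUSH_CACHE_RE.findall(workflow_text)
        if value.lower() != "false"
    ]
```

An empty result means the scanned text keeps Attic publication disabled. Any returned value marks a line that a reviewer would need to justify against the quarantine.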

Bazel contract

Current Bazel truth:

  • the only stable default contract today is the cluster-internal runner path
  • the live cache persists through the bazel-cache bucket on the OpenEBS-backed attic-rustfs-openebs service; the pod-local /data volume is only a hot cache
  • on 2026-05-25 the source Bazel proof reported a remote-cache digest mismatch while reading the RustFS-backed bazel-cache bucket through bazel-remote. The implicated cas.v2 key was removed as cache-only acceleration data, and that delete reproduced the RustFS bucket-index class: S3 list-buckets went empty while on-disk bucket markers remained, and a controlled attic-rustfs-openebs restart was required to restore API visibility. That is cache acceleration corruption, not source-code evidence, and it reinforces that the current RustFS-backed Bazel cache is not RBE CAS/action-cache authority. Do not infer corruption from raw S3 object hashes alone: the live bazel-remote backend stores zstd-encoded objects, so audits must hash decoded payload bytes.
  • Bazel cache is part of the intended shared developer-plus-CI substrate, not a CI-only decoration
  • no stable general-consumer external Bazel endpoint is promised yet
  • any private or operator-only Bazel hostname is internal implementation detail, not the onboarding contract
  • current cache-backed local execution is real; universal remote execution is not yet the proved default contract
  • executor-backed mode is available only when BAZEL_REMOTE_EXECUTOR is set separately from BAZEL_REMOTE_CACHE and the target class is eligible through config/rbe-target-eligibility.json
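
The audit rule above, hash decoded payload bytes rather than raw stored bytes, can be sketched as follows. This is an assumption-laden illustration: it assumes the cache key is the SHA-256 digest of the payload (Bazel's default digest function), and it uses zlib as a stand-in codec only so the sketch runs on the Python standard library. A real audit would pass a zstd decoder that also handles whatever framing the backend actually writes. The function name is hypothetical.

```python
import hashlib
import zlib
from typing import Callable


def payload_matches_key(
    stored: bytes,
    cas_key_hex: str,
    decode: Callable[[bytes], bytes],
) -> bool:
    """Compare the SHA-256 of the decoded payload, not of the stored bytes."""
    return hashlib.sha256(decode(stored)).hexdigest() == cas_key_hex


# Stand-in demonstration: zlib replaces zstd purely for portability.
payload = b"example action output"
key = hashlib.sha256(payload).hexdigest()
stored = zlib.compress(payload)

# Hashing the encoded object gives a different digest, which would look
# like corruption even though the object is intact.
raw_hash_matches = hashlib.sha256(stored).hexdigest() == key

# Hashing the decoded payload is the meaningful comparison.
decoded_hash_matches = payload_matches_key(stored, key, zlib.decompress)
```

Comparing raw_hash_matches with decoded_hash_matches shows why a raw S3 object hash alone cannot establish corruption.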

Developer-machine attachment

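There are two current cache attachment modes, shared self-hosted runner and internal developer machine, and they must not be blurred. Public or third-party consumption is a projection only.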

| Context | Attic behavior | Bazel behavior | Contract status |
|---|---|---|---|
| Shared self-hosted runner | workflow setup injects ATTIC_SERVER=http://attic.nix-cache.svc.cluster.local and ATTIC_CACHE=main when reachable | workflow setup injects BAZEL_REMOTE_CACHE=grpc://bazel-cache.nix-cache.svc.cluster.local:9092 | proved source-repo CI path |
| Internal developer machine | .envrc derives ATTIC_SERVER=https://nix-cache.<domain> and ATTIC_CACHE=main; Nix may use https://nix-cache.tinyland.dev/main as a substituter when trusted locally | .envrc leaves BAZEL_REMOTE_CACHE empty unless the operator provides a routable endpoint | supported as explicit attachment, not automatic discovery |
| Future public or third-party consumer | use documented variables and exported public docs, not private Tinyland topology | use documented variables and exported public docs, not private Tinyland topology | projection only until a public product contract exists |

If BAZEL_REMOTE_CACHE is empty, just info must report compatibility-local-only. That is not a failure; it is the guardrail that prevents developer machines from silently depending on stale or invented endpoints.
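
The attachment rules can be summarized as a small decision function. This is a hedged sketch, not the logic behind just info: only the compatibility-local-only label comes from this page, while the function name and the cache-backed-local and executor-backed labels are hypothetical. The target_eligible flag stands in for a lookup in config/rbe-target-eligibility.json, whose schema is not shown here.

```python
from typing import Mapping


def bazel_attachment_mode(
    env: Mapping[str, str],
    target_eligible: bool = False,
) -> str:
    """Classify the Bazel attachment mode implied by the environment."""
    cache = env.get("BAZEL_REMOTE_CACHE", "").strip()
    executor = env.get("BAZEL_REMOTE_EXECUTOR", "").strip()

    # An empty cache endpoint is a guardrail result, not a failure.
    if not cache:
        return "compatibility-local-only"

    # Executor-backed mode needs its own variable and an eligible target class.
    if executor and target_eligible:
        return "executor-backed"

    # Otherwise actions run locally and only the cache is shared.
    return "cache-backed-local"
```

For example, an environment with neither variable set classifies as compatibility-local-only, which matches the expected report on a developer machine without a routable endpoint.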

State Backend

Active stack authority

| Stack | Current backend | Proven state key |
|---|---|---|
| attic | S3-compatible | attic/terraform.tfstate |
| arc-runners | S3-compatible | arc-runners/terraform.tfstate |
| gitlab-runners | S3-compatible | tinyland-infra/gitlab-runners/terraform.tfstate |
| runner-dashboard | S3-compatible | tinyland-infra/runner-dashboard/terraform.tfstate |

Current truth:

  • all four active stacks use backend "s3" on current main
  • the active local operator path is the TOFU_BACKEND_S3_* family or a materialized backend HCL file consumed by just tofu-init <stack>
  • TF_HTTP_* remains compatibility-only for archived or external repair paths
  • RustFS has known bucket-index reliability debt: the S3 API has returned NoSuchBucket for tofu-state while both /data/tofu-state and /data/.rustfs.sys/buckets/tofu-state existed on disk. On 2026-05-08, a representative Attic publication probe also left both attic and tofu-state absent from S3 list-buckets while their disk bucket markers existed. A controlled RustFS restart restored the API view both times, but this is only an operator recovery action, not a proved non-restart repair.
  • On 2026-05-19, the state-authority guard failed again after the latest controlled restart window: tofu-state was absent from S3 list-buckets while disk bucket markers remained present, and GloriousFlywheel PR #735 failed Plan ARC Runners on that same guard. Treat the configured RustFS state path as degraded until just tofu-state-ha-readiness --expect-interim passes again, and treat strict HA as blocked until TIN-1026 and TIN-1017 produce endpoint package, scratch/disposable OpenTofu, lockfile, maintenance, and cleanup evidence.

Authority order

Use this order when reasoning about state:

  1. repo code plus stack inputs define desired state
  2. S3-compatible OpenTofu state defines managed-resource authority
  3. live cluster state is observed state and may drift

Manual cluster edits are drift unless they are an explicit bounded operator action that is expected to reconcile later.

RustFS bucket-index guardrail

The S3 API view is the OpenTofu state authority. On-disk bucket directories are evidence for incident response, but they are not sufficient proof that OpenTofu can safely read, lock, or persist state.

If S3 returns NoSuchBucket while disk markers are present, preserve the failed workflow/apply logs and run the repo RCA scripts before restarting RustFS. Restarting nix-cache/attic-rustfs-openebs can restore service, but it does not close TIN-1012 or TIN-1046. TIN-1043 closed the default-read-only quarantine response, not the backend repair/replacement requirement.
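
The incident signature, a bucket missing from the S3 API view while its disk markers still exist, can be expressed as a simple classification. This is an illustrative sketch rather than the repo RCA scripts: the function name and all four labels are hypothetical, and it assumes disk markers were collected, which needs pod exec access.

```python
def classify_bucket_view(
    bucket: str,
    api_buckets: set[str],
    disk_markers: set[str],
) -> str:
    """Compare the S3 API view of a bucket with its on-disk markers."""
    in_api = bucket in api_buckets
    on_disk = bucket in disk_markers

    if in_api and on_disk:
        return "coherent"
    if on_disk:
        # Incident shape: preserve logs and run RCA before any restart.
        return "bucket-index-incoherent"
    if in_api:
        # Visible through the API but with no disk marker observed.
        return "api-only"
    return "absent"
```

Only the coherent result is compatible with treating the S3 API view as usable state authority. Disk presence alone never upgrades the result.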

Before protected OpenTofu mutation, run the deep RustFS state authority guard:

just tofu-state-authority-deep-check <stack>

The guard checks RustFS workload health, disk bucket markers when pod exec is available, S3 bucket and object metadata, the stack state object, optional state-object body readability/JSON shape, and a temporary write/read/delete proof. If apply mutates live resources but state persistence fails, preserve errored.tfstate and use a controlled tofu state push; do not rerun apply as the first recovery action.

For incident capture or RCA follow-up, run:

just rustfs-bucket-index-rca --scratch-probe

This collects RustFS workload, pod, version, data-layout, bootstrap/lifecycle job, log, event, and S3 authority evidence. The scratch probe creates and deletes an isolated bucket to prove normal bucket-index API/disk coherence without touching OpenTofu state objects.

The self-hosted RustFS State Authority Canary workflow runs the same evidence path on main, on demand, and hourly while RustFS remains the interim state authority. It runs on the shared tinyland-nix-operator dogfood lane because it uses kubeconfig-backed operator probes and bounded port-forwards; generic Nix overflow lanes and hosted runners are not valid substitutes.

The canary executes these steps in order and publishes the logs as workflow artifacts:

  1. tofu-state-ha-readiness --expect-interim
  2. read-only rustfs-bucket-index-rca --bucket attic and attic-nar-integrity-check evidence
  3. rustfs-bucket-index-rca --scratch-probe --strict-scratch-disk-markers
  4. the read-only ha-state-candidate-inventory classifier

The Attic read-path evidence is marked if: always() so the canary still captures attic bucket-index and NAR body evidence when the current tofu-state check fails first.

A green canary means:

  • the current RustFS state path is coherent now
  • the Attic incident-shaped NAR body streamed
  • the scratch bucket appeared in both the API and disk bucket markers
  • the scratch bucket markers disappeared after API delete
  • the known OpenTofu state objects were readable as JSON state bodies
  • the candidate inventory completed

It is not an HA claim and it does not promote RustFS to Bazel CAS/AC or RBE authority. When the inventory reports NO_LIVE_HA_STATE_CANDIDATE, that is valid evidence for TIN-1012 rather than a canary failure.

To check the state path against the HA authority gate, run:

just tofu-state-ha-readiness --expect-interim

That command is expected-red without --expect-interim until TIN-1012 proves the implementation gate. If the command is red even with --expect-interim, do not start protected OpenTofu mutations through this state path. TIN-1002 captured the candidate plan and guardrail; it did not make the current RustFS path HA. Even when the interim guard is green, the current RustFS path is still one RustFS Deployment replica on a bumble-bound OpenEBS ZFS ReadWriteOnce PVC.
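
An operator wrapper could refuse protected mutations unless both guards pass. The sketch below is an assumption, not repo tooling: it assumes a red result surfaces as a non-zero exit code, the order of the two commands is chosen here for illustration, and the function names are hypothetical. The runner is injectable so the gate logic can be exercised without a cluster.

```python
import subprocess
from typing import Callable, Sequence


def guard_commands(stack: str) -> list[tuple[str, ...]]:
    """Commands that must succeed before a protected OpenTofu mutation."""
    return [
        ("just", "tofu-state-ha-readiness", "--expect-interim"),
        ("just", "tofu-state-authority-deep-check", stack),
    ]


def _run(cmd: Sequence[str]) -> int:
    """Run one guard command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def mutation_allowed(
    stack: str,
    run: Callable[[Sequence[str]], int] = _run,
) -> bool:
    """True only when every guard exits 0; stops at the first failure."""
    return all(run(cmd) == 0 for cmd in guard_commands(stack))
```

A caller would check mutation_allowed("arc-runners") before plan or apply and stop on False instead of retrying against a degraded state path.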

The same RustFS bucket-index class can break the Attic cache body path while narinfo and Attic database metadata remain present. When Nix reports Transferred a partial file from the Attic substituter, run:

just attic-nar-integrity-check --store-hash <nix-store-hash>

If that check fails, keep nix build --fallback enabled for CI safety and treat the incident as cache-object availability debt. A cache object-store repair or restart may restore acceleration, but it is not a substitute for the separate HA OpenTofu state authority decision.

This is backend hardening. It is not Bazel remote execution proof, and RustFS must not be treated as CAS/AC authority for RBE until bucket-index recovery, durability, and observability are explicitly proved or a different HA store is chosen.

HA And Durability Limits

None of these systems are HA today.

| Component | Deployment shape | HA | Durability notes |
|---|---|---|---|
| Attic API | single-cluster service on honey | No | depends on the honey cache/storage plane |
| Attic metadata database | single-node stateful service family | No | no cross-cluster failover |
| RustFS | single-node storage-biased deployment on bumble | No | no off-site backup guarantee |
| Bazel remote cache | RustFS/S3-backed service with pod-local hot cache | No | durable within the current RustFS/OpenEBS envelope; no cross-cluster failover |

Impact summary:

  • loss of bumble degrades the cache/storage plane sharply
  • loss of honey removes runners and cache access together
  • cache misses should slow work, not redefine the platform contract

Explicitly Out Of Contract

Do not treat these as current authority:

  • GitLab-managed HTTP state for the four active stacks
  • attic-cache-dev as the current live cache namespace
  • grpc://bazel-cache.attic-cache-dev.svc.cluster.local:9092
  • https://attic.dev-cluster.example.com
  • https://attic.tinyland.dev
  • old fuzzy-dev cache hostnames
