CAS Primitives — In-House Build Plan for `gf-reapi-cell`


Framing note (load-bearing). This doc replaces the deleted cas-backend-decision.md, which incorrectly framed the work as a Buildbarn / BuildBuddy / bazel-remote vendor bake-off. GloriousFlywheel is a peer of those projects, not a consumer. gf-reapi-cell owns every REAPI byte. Peers are studied as reference architectures for inspiration; they are never adoption candidates. Any future PR or doc that drifts back toward vendor adoption is a defect.

Decision summary

  • Status: Working draft (W1.1 / TIN-1458, under parent E1 / TIN-1445).
  • Frame: Pure in-house. Buildbarn, BuildBuddy, NativeLink, bazel-remote, and EngFlow are class peers, studied as inspiration only. No third-party REAPI server, CAS daemon, or storage daemon lives in the data path.
  • What gf-reapi-cell ships today: Capabilities, ByteStream, CAS, AC, Execution, WaitExecution against ephemeral node-local PVC (local-path-sting-fast-ephemeral), digest verification on read/write paths, first-slice instance_name CAS/AC routing on proof-local disk, opt-in JWT tenant/scope authz, opt-in local-backend TTL and size-bound eviction (GF_REAPI_BLOB_TTL, GF_REAPI_CAS_MAX_BYTES), no production eviction policy, no documented TTL contract, single-replica, no AC writer attestation.
  • What’s being added in E1: S3-compatible storage substrate; digest verification on both read and write paths; namespace routing keyed by instance_name; bounded eviction (LRU or LFU); TTL + FindMissingBlobs keep-alive contract; sharding/replication topology decision; write-side attestation parallel to the AC writer.
  • Explicitly out: Adopting any peer’s CAS server or storage daemon. Promoting the current RustFS topology to CAS/AC authority without TIN-1147 repair/proof evidence. Per-tenant per-mnemonic SLOs (E4 territory). Multi-cell topology (single-cell horizon).

Frame

GloriousFlywheel competes with Buildbarn, BuildBuddy, NativeLink, and bazel-remote on the same primitive: a REAPI surface that turns Bazel’s content-addressable build model into a shared substrate for local development and CI. Those projects are class peers — they solve the same problem, with their own architectural choices, in their own languages, against their own storage stacks. gf-reapi-cell is the GloriousFlywheel-owned REAPI surface. It already implements Capabilities, ByteStream, CAS, AC, Execution, and WaitExecution end-to-end (see docs/build-system/gf-reapi-cell.md); the default-branch proofs for //app:build, //app:unit_tests, //:deployment_bundle, //docs-site:build, and the WAS-110 public-input fixture all run through it. The data path is in-house from the first byte to the last.

The work in E1 is to extend gf-reapi-cell’s primitives in-house so its CAS data path is production-grade. “Production-grade” here means: durable storage substrate that survives node loss; digest verification on every byte crossing the boundary; namespace routing that makes tenant isolation honest; bounded eviction that doesn’t poison referenced digests; an explicit TTL contract Bazel’s --experimental_remote_cache_ttl can hold the cell to; sharding/replication that survives a single-replica failure; and write-side attestation so the audit log can attribute every byte. Each of these primitives is owned by gf-reapi-cell. Peers are read for “how does Buildbarn solve eviction across multiple bb-storage backends?” — not for “which one do we adopt?” The answer to the second question is always: write it in-house. The peer-frame discipline is captured in feedback_rbe_backend and is non-negotiable.

What “Primitive” Means Here

The CAS layer decomposes into seven primitives, and gf-reapi-cell owns all seven: storage substrate (where bytes live), digest verification (every byte hashed and checked on read and write), namespace routing (instance_name → tenant-keyed storage), eviction policy (bounded capacity with LRU or LFU), TTL / lease (how long a blob is promised to survive after last touch), sharding / replication (how the cell stays available across replicas), and TLS / write attestation (who wrote this byte and can we prove it). Each primitive has its own contract, its own failure mode, and its own gating ticket. Decomposing this way lets the build plan attack each primitive in isolation, name a library boundary where one exists, and refuse a vendor adoption at the REAPI server level.

Layer-by-Layer Build Plan

One section per primitive. For each: what gf-reapi-cell ships today, what’s missing, the in-house build recommendation, a short peer reference (tagged inspiration only), and the failure mode if the primitive is not built.

a. Storage substrate

What gf-reapi-cell ships today. Ephemeral node-local PVC. The deploy/gf-rbe/gf-reapi-cell.yaml manifest mounts local-path-sting-fast-ephemeral (or another explicit compute-expansion KVM/worker class), as documented in docs/build-system/gf-reapi-cell.md under “Storage Boundary.” Storage survives pod restart on the same node, does not survive node loss, does not support cross-replica access, and has no quota or eviction layer.

What’s missing for production. Durability across node loss; a single storage authority that multiple gf-reapi-cell replicas can write to and read from; bounded capacity with an explicit eviction story; a backup or disaster-recovery shape that doesn’t depend on the PVC’s underlying disk.

In-house build recommendation. Target an S3-compatible object store whose repair, restore, quota, tenant-isolation, and lifecycle behavior has been proved before it becomes the CAS/action-cache authority. The CAS layer in gf-reapi-cell talks to it through a thin storage interface (type CASStore interface { Get(ctx, digest) ([]byte, error); Put(ctx, digest, []byte) error; Stat(ctx, digest) (Metadata, error); Delete(ctx, digest) error; List(ctx, prefix) ([]Entry, error) }) so the substrate can be swapped without touching the REAPI surface. No provider is selected in this doc. Civo Object Storage is explicitly not an option. The candidate family is a managed/appliance S3-compatible service, a self-hosted S3-compatible object-store class, or a repaired/topology-changed RustFS path only if TIN-1147 supplies a separate promotion decision and evidence stronger than restart recovery. See the dedicated Storage substrate decision section below for the current picking grid. The storage SDK is borrowable at the library level (AWS SDK for Go’s s3 client, equivalent S3 client libraries) — these are clients to a substrate we operate, not REAPI servers. Current RustFS is not the CAS/AC authority. TIN-1147 / the RustFS RCA (docs/research/gloriousflywheel-rustfs-state-backend-rca-gate-2026-05-06.md) blocks it today: bucket-index reliability debt, restart-as-recovery, and single-pod failure modes that recurred during the 2026-05-10 and 2026-05-11 incidents.
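The inline CASStore signature above omits parameter types. A minimal sketch of how it might look with explicit Go types follows, backed by an in-memory map standing in for the S3-compatible substrate. The Digest, Metadata, and Entry fields, the ErrNotFound sentinel, the "cas/<hash>-<size>" key layout, and the roundTrip helper are assumptions made for illustration; none of them is taken from gf-reapi-cell.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Digest mirrors the REAPI digest shape: a hex hash plus the blob size.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// Metadata and Entry are hypothetical: the doc names the types, not their fields.
type Metadata struct{ SizeBytes int64 }

type Entry struct {
	Key       string
	SizeBytes int64
}

// ErrNotFound is a hypothetical sentinel for a missing blob.
var ErrNotFound = errors.New("cas: blob not found")

// CASStore is the thin storage interface from the text, with explicit types.
type CASStore interface {
	Get(ctx context.Context, d Digest) ([]byte, error)
	Put(ctx context.Context, d Digest, data []byte) error
	Stat(ctx context.Context, d Digest) (Metadata, error)
	Delete(ctx context.Context, d Digest) error
	List(ctx context.Context, prefix string) ([]Entry, error)
}

// memStore is a single-goroutine stand-in for the real substrate.
type memStore map[string][]byte

func NewMemStore() CASStore { return memStore{} }

// keyFor uses an illustrative "cas/<hash>-<size>" layout.
func keyFor(d Digest) string { return fmt.Sprintf("cas/%s-%d", d.Hash, d.SizeBytes) }

func (m memStore) Get(_ context.Context, d Digest) ([]byte, error) {
	b, ok := m[keyFor(d)]
	if !ok {
		return nil, ErrNotFound
	}
	return append([]byte(nil), b...), nil // copy so callers cannot mutate stored bytes
}

func (m memStore) Put(_ context.Context, d Digest, data []byte) error {
	m[keyFor(d)] = append([]byte(nil), data...)
	return nil
}

func (m memStore) Stat(_ context.Context, d Digest) (Metadata, error) {
	b, ok := m[keyFor(d)]
	if !ok {
		return Metadata{}, ErrNotFound
	}
	return Metadata{SizeBytes: int64(len(b))}, nil
}

func (m memStore) Delete(_ context.Context, d Digest) error {
	delete(m, keyFor(d))
	return nil
}

// List returns matching entries in unspecified order.
func (m memStore) List(_ context.Context, prefix string) ([]Entry, error) {
	var out []Entry
	for k, b := range m {
		if strings.HasPrefix(k, prefix) {
			out = append(out, Entry{Key: k, SizeBytes: int64(len(b))})
		}
	}
	return out, nil
}

// roundTrip writes then reads a blob through the interface only.
func roundTrip(s CASStore, d Digest, data []byte) ([]byte, error) {
	ctx := context.Background()
	if err := s.Put(ctx, d, data); err != nil {
		return nil, err
	}
	return s.Get(ctx, d)
}

func main() {
	got, err := roundTrip(NewMemStore(), Digest{Hash: "abc123", SizeBytes: 5}, []byte("hello"))
	fmt.Println(string(got), err)
}
```

Because handlers would depend only on the interface, an S3-backed implementation could replace memStore without changes to the REAPI surface, which is the swap property the recommendation asks for.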

Peer reference (inspiration only). Buildbarn separates its storage backend (bb-storage with pluggable blobstore configs: local disk, S3, GCS, Redis, sharded combinations) from its REAPI frontend. BuildBuddy’s OSS server uses local disk or S3, keyed by API key + instance_name. bazel-remote uses local disk with optional S3 proxy upload. NativeLink uses a stack of Store implementations composed in TOML. EngFlow runs on managed object storage. The inspiration: separate the storage interface from the protocol surface, pick S3 semantics as the wire, swap substrates without touching handlers. The adoption: none.

Failure mode if not built. Ephemeral storage means every node-loss event re-poisons the CAS hit-rate SLO (slo.md target ≥ 90%), every cell restart loses warm working set, and the cell cannot horizontally scale because replicas have no shared storage to read from. This is the dominant gating failure for E1/TIN-1445.

b. Digest verification

What gf-reapi-cell ships today. The first implementation slice re-hashes CAS bytes on BatchUpdateBlobs / ByteStream.Write, BatchReadBlobs / ByteStream.Read, directory materialization, tree walking, and presence-style checks (FindMissingBlobs, QueryWriteStatus, AC write precondition). Write mismatches return structured DIGEST_MISMATCH write / INVALID_ARGUMENT; read mismatches return structured DIGEST_MISMATCH read / DATA_LOSS where the REAPI method can surface a status. The cell exports the initial poison counter as gf_reapi_digest_mismatch_total{path="read|write"} from /metrics.

What’s missing for production. The current counter is intentionally minimal. Production still needs labels for {hash_function="sha256", instance_name="..."}, dashboard wiring, alert routing, an AC lookup audit path that can prove referenced output digests were read through verified CAS, and a durable CAS substrate with restore/retention evidence. The mismatch counter is the gf_reapi_digest_mismatch_total SLI from slo.md; any nonzero value is a paged incident.

In-house build recommendation. Keep the verification primitive owned inside gf-reapi-cell’s CAS package. The structured error remains gRPC INVALID_ARGUMENT on write mismatch (client sent bad bytes) and gRPC DATA_LOSS on read mismatch (storage substrate corrupted the bytes; this is the poison case). The hash function is borrowable at the library level (Go crypto/sha256, BLAKE3 if/when the cell advertises BLAKE3 digests in its capabilities response). The verification logic itself is ours. Cross-link TIN-1459 (W1.2 digest verification + write-side attestation primitive).
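As a rough illustration of the write/read split described above, the sketch below re-hashes bytes with crypto/sha256 and returns a different sentinel error depending on the path. The sentinel errors and the in-process counter map are hypothetical stand-ins: a real handler would map them to gRPC INVALID_ARGUMENT and DATA_LOSS and increment the gf_reapi_digest_mismatch_total metric instead.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Digest mirrors the REAPI digest shape: a hex hash plus the blob size.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// Hypothetical sentinels; a real handler would translate them to
// gRPC INVALID_ARGUMENT (write) and DATA_LOSS (read).
var (
	ErrWriteMismatch = errors.New("DIGEST_MISMATCH write")
	ErrReadMismatch  = errors.New("DIGEST_MISMATCH read")
)

// mismatchTotal stands in for gf_reapi_digest_mismatch_total{path="read|write"}.
// It is not safe for concurrent use; a real cell would use a metrics library.
var mismatchTotal = map[string]int{}

// DigestOf computes the SHA-256 digest of data.
func DigestOf(data []byte) Digest {
	sum := sha256.Sum256(data)
	return Digest{Hash: hex.EncodeToString(sum[:]), SizeBytes: int64(len(data))}
}

// VerifyWrite checks client-supplied bytes against the declared digest.
func VerifyWrite(declared Digest, data []byte) error {
	if DigestOf(data) != declared {
		mismatchTotal["write"]++
		return ErrWriteMismatch
	}
	return nil
}

// VerifyRead checks bytes returned by the substrate against the requested digest.
func VerifyRead(requested Digest, data []byte) error {
	if DigestOf(data) != requested {
		mismatchTotal["read"]++
		return ErrReadMismatch
	}
	return nil
}

func main() {
	d := DigestOf([]byte("hello"))
	fmt.Println(VerifyWrite(d, []byte("hello"))) // clean write
	fmt.Println(VerifyRead(d, []byte("hellp")))  // simulated substrate corruption
	fmt.Println(mismatchTotal)
}
```

Comparing both hash and size catches truncated blobs as well as flipped bits, and keeping the two paths as separate functions lets each carry its own status code and counter label.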

Peer reference (inspiration only). Buildbarn re-hashes on read by default through its VerifyingBlobAccess decorator; BuildBuddy re-hashes on write but skips on read for latency unless verification mode is enabled; bazel-remote verifies on write and trusts on read; NativeLink composes verification into its store-decorator stack. The inspiration: the read path verification is the one that catches storage substrate corruption — it’s expensive and mandatory. The adoption: none.

Failure mode if not built. Storage corruption silently propagates into Bazel actions; one bit-flip in object storage becomes a poisoned action result that gets cached in the AC and re-served forever. This is the dominant correctness failure for E1/TIN-1445 and feeds directly into the digest-mismatch rate poison signal in slo.md (no error budget; any event pages).

c. Namespace routing

What gf-reapi-cell ships today. The first implementation slice validates instance_name as default, system, or spoke-<slug>, reads it from the standard REAPI request field or ByteStream path prefix, and keys proof-local CAS/AC files under instances/<instance_name>/cas/... and instances/<instance_name>/ac/.... Cross-instance CAS and AC lookups miss by construction. Existing empty instance_name traffic maps to default.

What’s missing for production. The implementation is still proof-local. Production still needs IAM binding from caller identity to allowed instance_name, a durable shared CAS substrate, per-instance quota/eviction, audit records, metrics by instance, default read-only/deletion policy, and multi-replica behavior. The wire-level design is docs/build-system/instance-name-routing-design.md (W4.1 / TIN-1472); this primitive is the CAS-side implementation of that contract.

In-house build recommendation. Keep the current proof-cell route shape and lift it behind a storage interface before production durability lands. Storage keys remain instance-scoped; the future CASStore interface gains an instance_name parameter on every method. The cell validates against the ^(spoke-[a-z][a-z0-9-]{1,62}|default|system)$ regex and routes accordingly. Cross-tenant reads return NOT_FOUND (not PERMISSION_DENIED — that confirms existence in another tenant’s namespace, which is the info-disclosure channel we close). The routing logic is in-house; no library applies. Cross-link docs/build-system/instance-name-routing-design.md (sibling W4.1) and TIN-1472.
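The validation and key-prefix rules above can be sketched as follows. The regex and the instances/<instance_name>/cas/ prefix come from this section; the "<hash>-<size>" suffix, the function names, and the error value are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// instanceNameRE is the validation regex quoted in the text.
var instanceNameRE = regexp.MustCompile(`^(spoke-[a-z][a-z0-9-]{1,62}|default|system)$`)

// ErrInvalidInstance is a hypothetical error for names that fail validation.
var ErrInvalidInstance = errors.New("invalid instance_name")

// NormalizeInstance maps the empty instance_name to "default" and rejects
// anything the regex does not accept.
func NormalizeInstance(name string) (string, error) {
	if name == "" {
		return "default", nil
	}
	if !instanceNameRE.MatchString(name) {
		return "", ErrInvalidInstance
	}
	return name, nil
}

// CASKey builds an instance-scoped storage key. The suffix after "cas/" is
// an assumed layout; the text only fixes the instances/<name>/cas/ prefix.
func CASKey(instance, hash string, size int64) (string, error) {
	inst, err := NormalizeInstance(instance)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("instances/%s/cas/%s-%d", inst, hash, size), nil
}

func main() {
	for _, name := range []string{"", "spoke-web", "spoke-a", "Spoke-web", "spoke-web/../x"} {
		key, err := CASKey(name, "abc123", 5)
		fmt.Printf("%q -> %q, %v\n", name, key, err)
	}
}
```

Since the regex admits no slash or dot, a validated name cannot escape its own prefix, and a lookup under another instance's prefix simply misses, which is consistent with returning NOT_FOUND for cross-tenant reads.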

Peer reference (inspiration only). Buildbarn maps instance_name to a storage-config block via its BlobAccessConfiguration switch. BuildBuddy prefixes storage keys with the API key + instance_name pair. bazel-remote uses instance_name as a literal on-disk path component with no auth. NativeLink composes it through its Store stack. The inspiration: prefix the storage key, route at the middleware boundary, default-deny cross-tenant. The adoption: none.

Failure mode if not built. Every spoke shares one global CAS. Noisy neighbors evict quiet spokes’ working set. Digest-guess info disclosure becomes a real attack surface. Per-tenant quota enforcement (E4/W4.4 / TIN-1475) has nothing to count. The tenant model in E4 cannot close without this primitive landing in the CAS layer.

d. Eviction policy

What gf-reapi-cell ships today. The first local-backend size bound exists: GF_REAPI_CAS_MAX_BYTES enables a lease-protected, LRU-ordered CAS evictor that skips blobs inside GF_REAPI_MIN_CLIENT_CACHE_TTL, reconciles durable quota counters after reclamation, and emits gf_reapi_size_eviction_* plus the gf_reapi_evicted_while_referenced_total poison tripwire. It is disabled by default and applies to CAS blobs on the local backend only; S3/object-store size policy still belongs to the backend’s lifecycle/ILM/quota layer.

What’s missing for production. A per-instance_name size budget (initial default reads from spoke-cache-quota’s cache_gib ConfigMap value; fallback global default if absent), high/low-water hysteresis, S3/object-store lifecycle policy proof, and a distributed referenced-set when multiple cell replicas or worker pools are executing against the same CAS. LRU is the recommended default because Bazel’s access pattern is dominated by “recently-built ⇒ likely-rebuilt”; LFU is documented as the alternative and left as an open question pending real workload data.

In-house build recommendation. Owned by gf-reapi-cell’s storage layer. The eviction loop runs as a background goroutine per instance_name, reads bytes-used from the storage substrate (S3 ListObjectsV2 with prefix), holds an in-memory access-time index (or atime/last-touch-time stored as object metadata on the substrate; the implementation picks one and documents it), and issues Delete calls in least-recent-first order until utilization is back under the low-water mark. The LRU index data structure is borrowable at the library level (github.com/hashicorp/golang-lru/v2, generic Go LRU implementations); the policy wiring is ours. Critical invariant: eviction must never delete a digest that is currently referenced by an unfinished action. The bytes-evicted-while-referenced poison signal in slo.md (no error budget; any event pages) is the contract this primitive must uphold. The referenced-set lives in gf-reapi-cell’s in-memory action tracker: every Execute and WaitExecution registers the action’s input digests as referenced and releases them on action completion or timeout.
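One pass of such an eviction loop might be planned as in the sketch below: nothing happens until usage crosses the high-water mark, then victims are chosen least-recent-first until usage reaches the low-water mark, skipping referenced and lease-protected blobs. The Blob fields, the PlanEviction name, and the use of Unix seconds are assumptions for illustration; the sketch only selects keys and leaves the Delete calls to the caller.

```go
package main

import (
	"fmt"
	"sort"
)

// Blob is a hypothetical index entry. Times are Unix seconds.
type Blob struct {
	Key        string
	Size       int64
	LastAccess int64
	ExpiresAt  int64 // lease floor; zero means no lease
}

// PlanEviction returns the keys to delete, least-recently-used first.
// It never selects a referenced blob or one whose lease is still in the future.
func PlanEviction(blobs []Blob, referenced map[string]bool, used, highWater, lowWater, now int64) []string {
	if used <= highWater {
		return nil // hysteresis: do nothing until the high-water mark is crossed
	}
	ordered := append([]Blob(nil), blobs...) // do not reorder the caller's slice
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].LastAccess < ordered[j].LastAccess
	})
	var victims []string
	for _, b := range ordered {
		if used <= lowWater {
			break
		}
		if referenced[b.Key] || b.ExpiresAt > now {
			continue // pinned by an unfinished action or by an unexpired lease
		}
		victims = append(victims, b.Key)
		used -= b.Size
	}
	return victims
}

func main() {
	blobs := []Blob{
		{Key: "a", Size: 40, LastAccess: 100},
		{Key: "b", Size: 40, LastAccess: 200},
		{Key: "c", Size: 40, LastAccess: 300},
		{Key: "d", Size: 40, LastAccess: 50, ExpiresAt: 9999},
	}
	referenced := map[string]bool{"a": true}
	fmt.Println(PlanEviction(blobs, referenced, 160, 120, 80, 1000)) // [b c]
}
```

In the example, "d" is the oldest blob but holds an unexpired lease and "a" is referenced, so the plan falls through to "b" and "c". If pinned blobs alone exceed the low-water mark, the plan returns fewer victims than needed rather than breaking the invariant.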

Peer reference (inspiration only). Buildbarn’s SizeDistinguishingBlobAccess splits small and large blobs with separate eviction policies. BuildBuddy uses LRU at the row level with TTL overlay. bazel-remote uses LRU with a disk-bound --max_size. NativeLink composes via FilesystemStore + size limits. EngFlow runs LRU with hot-tier pinning. The inspiration: bound the cache, evict on a high-water trigger, separate the referenced-set from the LRU set. The adoption: none.

Failure mode if not built. The substrate fills and writes start failing. Either Bazel clients go into retry storms (the action retry rate SLI from slo.md trips, with the < 1% budget consumed in hours), or the entire cell becomes unwritable until an operator intervenes manually. Worse: an ad-hoc eviction (operator-driven rm) hits a referenced digest and poisons an in-flight action.

e. TTL / lease

What gf-reapi-cell ships today. GF_REAPI_BLOB_TTL enables local TTL eviction for instances/<name>/{cas,ac}, and GF_REAPI_MIN_CLIENT_CACHE_TTL is a startup guard: TTL must be greater than or equal to the Bazel client cache lease. The local backend also touches CAS/AC objects on read so the filesystem mtime is an LRU signal for TTL and size eviction. GF_REAPI_CAS_MAX_BYTES requires the same lease floor before it will start.

What’s missing for production. A documented TTL contract Bazel’s --experimental_remote_cache_ttl can hold the cell to: after a successful FindMissingBlobs returns “not missing” for a digest, the cell promises the digest survives for at least N seconds. The Bazel client uses FindMissingBlobs as a keep-alive: as long as a build periodically refreshes the action’s input digests, the cell agrees not to evict them. The minimum contractually-guaranteed lifetime is the TTL.

In-house build recommendation. Add an expires_at field to the CAS metadata (stored as object metadata on the storage substrate, or in an adjacent index). On Put, expires_at = now() + ttl_default. On every FindMissingBlobs hit, refresh: expires_at = max(expires_at, now() + ttl_default). The eviction loop from primitive d treats expires_at as a floor — never evict a blob whose expires_at is in the future, regardless of LRU position. Default ttl_default is 7 days (matches the planned ttl_days field on spoke-cache-quota ConfigMaps); per-instance override reads from the ConfigMap. The contract is named in --experimental_remote_cache_ttl semantics on the Bazel client side; cell side it is enforced in the eviction loop. Cross-link TIN-1460 (W1.3 TTL + lease contract).
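The lease arithmetic above is small enough to sketch directly. The LeaseTable type, its method names, and the use of Unix seconds are assumptions for illustration; in the cell, expires_at would live in object metadata or an adjacent index rather than in an in-memory map.

```go
package main

import "fmt"

// DefaultTTLSeconds is the 7-day default named in the text.
const DefaultTTLSeconds int64 = 7 * 24 * 60 * 60

// LeaseTable is a hypothetical in-memory stand-in for per-blob expires_at metadata.
type LeaseTable struct {
	ttl       int64
	expiresAt map[string]int64 // digest -> Unix seconds
}

func NewLeaseTable(ttl int64) *LeaseTable {
	return &LeaseTable{ttl: ttl, expiresAt: map[string]int64{}}
}

// Put sets expires_at = now + ttl for a newly written blob.
func (t *LeaseTable) Put(digest string, now int64) {
	t.expiresAt[digest] = now + t.ttl
}

// FindMissing reports absent digests and renews the lease of every hit:
// expires_at = max(expires_at, now + ttl).
func (t *LeaseTable) FindMissing(digests []string, now int64) []string {
	var missing []string
	for _, d := range digests {
		current, ok := t.expiresAt[d]
		if !ok {
			missing = append(missing, d)
			continue
		}
		if renewed := now + t.ttl; renewed > current {
			t.expiresAt[d] = renewed
		}
	}
	return missing
}

// Evictable is true only when the lease is no longer in the future.
// Digests with no recorded lease are treated as evictable.
func (t *LeaseTable) Evictable(digest string, now int64) bool {
	return t.expiresAt[digest] <= now
}

func main() {
	leases := NewLeaseTable(DefaultTTLSeconds)
	leases.Put("blob-a", 1000)
	missing := leases.FindMissing([]string{"blob-a", "blob-b"}, 5000)
	fmt.Println(missing, leases.expiresAt["blob-a"], leases.Evictable("blob-a", 6000))
}
```

Taking the maximum on renewal means a keep-alive can only extend a lease, so a blob that was promised a longer lifetime earlier is never shortened by a later hit.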

Peer reference (inspiration only). Buildbarn’s CompletenessCheckingBlobAccess extends TTL on FindMissingBlobs hits. BuildBuddy implements --experimental_remote_cache_ttl natively with per-org overrides. bazel-remote does not honor TTL beyond raw LRU. NativeLink supports TTL through its Store decorators. The inspiration: FindMissingBlobs is the keep-alive primitive; treat every hit as a lease renewal. The adoption: none.

Failure mode if not built. Long-running builds (CI cold paths, multi-hour RBE jobs) lose digests mid-action; the AC then surfaces an action result whose referenced inputs are gone; the next read returns NOT_FOUND and the client falls back to local execution or fails. This trips the action retry rate and digest-mismatch rate SLIs in slo.md simultaneously.

f. Sharding / replication

What gf-reapi-cell ships today. Single-replica Deployment. One pod, one PVC, one node. Pod restart loses any in-memory state (action tracker, referenced-set, LRU index); node loss loses all storage.

What’s missing for production. A deployment topology that survives single pod / single node loss without losing the working set. With S3-compatible storage as the substrate (primitive a), the data tier is already horizontally available; what’s missing is multi-replica gf-reapi-cell pods that share that substrate. Replicas must agree on the referenced-set (so eviction in replica A doesn’t poison an action running in replica B) and on the LRU index (so eviction decisions are consistent).

In-house build recommendation. Phase 1: stateless replicas. Multiple gf-reapi-cell pods, all reading and writing the same S3 substrate. Each replica maintains its own in-memory referenced-set scoped to actions it itself is executing; the eviction loop runs on a single elected leader (Kubernetes lease via coordination.k8s.io/v1/Lease). Phase 2 (deferred): shared referenced-set via Redis or an in-cell consensus layer if leader eviction becomes a bottleneck. The leader-election library is borrowable (k8s.io/client-go/tools/leaderelection); the eviction policy and referenced-set logic are ours. Sharding by instance_name is implicit because storage is already prefixed (primitive c); no additional sharding layer is needed at the CAS level for the single-cell horizon. Cross-link TIN-1461 (W1.4 sharding / replication topology).
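A per-replica referenced-set for Phase 1 could look like the sketch below. It is reference-counted because two unfinished actions may share an input digest, and releasing one must not unpin the other. The type and method names are hypothetical, and timeout handling is left to the caller, which would invoke Release when an action's deadline passes.

```go
package main

import (
	"fmt"
	"sync"
)

// ReferencedSet tracks digests pinned by unfinished actions on this replica.
type ReferencedSet struct {
	mu      sync.Mutex
	counts  map[string]int      // digest -> number of unfinished actions using it
	actions map[string][]string // action ID -> input digests it registered
}

func NewReferencedSet() *ReferencedSet {
	return &ReferencedSet{counts: map[string]int{}, actions: map[string][]string{}}
}

// Register pins an action's input digests. Re-registering a known action
// (for example, WaitExecution after Execute) is a no-op.
func (r *ReferencedSet) Register(actionID string, inputs []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.actions[actionID]; ok {
		return
	}
	r.actions[actionID] = append([]string(nil), inputs...)
	for _, d := range inputs {
		r.counts[d]++
	}
}

// Release unpins an action's digests on completion or timeout.
// Releasing an unknown action is a no-op.
func (r *ReferencedSet) Release(actionID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, d := range r.actions[actionID] {
		r.counts[d]--
		if r.counts[d] <= 0 {
			delete(r.counts, d)
		}
	}
	delete(r.actions, actionID)
}

// IsReferenced reports whether any unfinished action still uses the digest.
func (r *ReferencedSet) IsReferenced(digest string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.counts[digest] > 0
}

func main() {
	refs := NewReferencedSet()
	refs.Register("action-1", []string{"x", "y"})
	refs.Register("action-2", []string{"y", "z"})
	refs.Release("action-1")
	fmt.Println(refs.IsReferenced("x"), refs.IsReferenced("y"), refs.IsReferenced("z"))
}
```

This set only sees actions running on its own replica, so it does not by itself protect a digest referenced on another replica; that gap is the reason Phase 2 is kept on the table.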

Peer reference (inspiration only). Buildbarn shards via ShardingBlobAccess with weighted backends. BuildBuddy scales horizontally on its API layer with shared object storage. bazel-remote does not shard. NativeLink composes ShardStore for backend distribution. EngFlow runs a managed sharded fleet. The inspiration: stateless REAPI replicas on shared durable storage is the simplest path to HA; leader-elect the eviction loop. The adoption: none.

Failure mode if not built. Single-pod failure = full cell outage. The gf-reapi-cell availability SLI (not yet named in slo.md; tracked as part of E5/TIN-1449) trips immediately. Cold-start time after pod restart spikes the TTFCH < 90s SLO.

g. TLS / write attestation

What gf-reapi-cell ships today. mTLS terminates at the cell ingress (per the deploy/gf-rbe/gf-reapi-cell.yaml ingress boundary); CAS writes have no per-blob attestation. The AC side has a parallel writer-attestation design landing in docs/build-system/ac-writer-attestation-design.md.

What’s missing for production. Per-blob write attestation: the audit record for every BatchUpdateBlobs / ByteStream.Write carries the writing identity (from the mTLS peer cert or the IAM token from E4/W4.2 once landed), the digest, the byte count, the instance_name, and a timestamp. The audit log is the same surface instance-name-routing-design.md names as W2.3 / TIN-1464.

In-house build recommendation. Mirror the AC-side design from ac-writer-attestation-design.md: same audit envelope, same identity extraction, same field names. The CAS-side handler emits one audit record per blob written (batched if multiple blobs in one RPC, but each blob gets its own record). The signing of audit records, if/when audit records get signed, is the same primitive as AC; both share whatever signing library lands there. TLS termination remains at the ingress; in-cluster traffic between gf-reapi-cell replicas and the storage substrate uses cluster mTLS (NetworkPolicy already in gf-reapi-cell.yaml). Cross-link docs/build-system/ac-writer-attestation-design.md (sibling in this corpus).
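The one-record-per-blob rule can be illustrated with the sketch below. The record carries the fields listed under "What’s missing for production" (identity, digest, byte count, instance_name, timestamp), but the struct, the JSON field names, and the function names are assumptions: the real envelope is whatever ac-writer-attestation-design.md defines, which this sketch does not reproduce.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WriteAuditRecord is a hypothetical envelope; field names are illustrative
// and would need to match the AC-side design.
type WriteAuditRecord struct {
	Identity      string `json:"identity"`
	InstanceName  string `json:"instance_name"`
	DigestHash    string `json:"digest_hash"`
	SizeBytes     int64  `json:"size_bytes"`
	TimestampUnix int64  `json:"timestamp_unix"`
}

// BlobWrite describes one blob accepted in a BatchUpdateBlobs or ByteStream.Write call.
type BlobWrite struct {
	Hash      string
	SizeBytes int64
}

// RecordsForBatch emits one record per blob, even when blobs share an RPC.
func RecordsForBatch(identity, instance string, blobs []BlobWrite, now int64) []WriteAuditRecord {
	records := make([]WriteAuditRecord, 0, len(blobs))
	for _, b := range blobs {
		records = append(records, WriteAuditRecord{
			Identity:      identity,
			InstanceName:  instance,
			DigestHash:    b.Hash,
			SizeBytes:     b.SizeBytes,
			TimestampUnix: now,
		})
	}
	return records
}

// EncodeLines renders each record as one JSON line for a structured log.
func EncodeLines(records []WriteAuditRecord) ([]string, error) {
	lines := make([]string, 0, len(records))
	for _, r := range records {
		raw, err := json.Marshal(r)
		if err != nil {
			return nil, err
		}
		lines = append(lines, string(raw))
	}
	return lines, nil
}

func main() {
	blobs := []BlobWrite{{Hash: "aaa", SizeBytes: 3}, {Hash: "bbb", SizeBytes: 7}}
	lines, err := EncodeLines(RecordsForBatch("ci-runner", "spoke-web", blobs, 1700000000))
	if err != nil {
		panic(err)
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```

Keeping one record per blob means a poisoned digest can be looked up directly in the log without unpacking batch entries.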

Peer reference (inspiration only). Buildbarn does not emit write attestation by default; it logs at the storage backend layer. BuildBuddy attaches write attestation to its audit log keyed by API key. bazel-remote has no write attestation. NativeLink can emit through its StoreFilter stack. EngFlow has enterprise audit. The inspiration: the write attestation primitive is the AC and CAS sharing one audit envelope; don’t invent two. The adoption: none.

Failure mode if not built. No forensic trail when a poisoned digest surfaces. The digest-mismatch rate SLI in slo.md becomes a paged incident with no way to attribute the bad write. Tenant-attribution for quota and noisy-neighbor RCAs has no source data.

The Build-vs-Borrow Line

Explicit, because this is where the deleted predecessor doc went wrong.

| Borrowable (library level only) | Not borrowable (REAPI / CAS server level) |
| --- | --- |
| AWS SDK for Go s3 client; equivalent S3 client libraries | Buildbarn bb-storage daemon |
| crypto/sha256, BLAKE3 (lukechampine.com/blake3): hash function implementations | BuildBuddy OSS server |
| github.com/hashicorp/golang-lru/v2: generic LRU data structure | bazel-remote binary |
| k8s.io/client-go/tools/leaderelection: Kubernetes lease primitive | NativeLink REAPI binary |
| google.golang.org/grpc: gRPC server framework (already in use) | EngFlow scheduler |
| github.com/prometheus/client_golang/prometheus: metric emission | Any peer’s CAS routing layer |
| go.opentelemetry.io/otel: tracing emission | Any peer’s eviction policy implementation |
| Standard structured-log libraries (zap, zerolog, slog) | Any peer’s digest-verification middleware |
| TLS / mTLS primitives from the Go standard library | Any peer’s audit log envelope |

The rule. When in doubt, write it in-house. A library is borrowable when it implements a generic primitive (LRU, SHA-256, gRPC, S3 client). A library is not borrowable when it implements REAPI semantics, CAS protocol behavior, or any decision-layer logic specific to remote build execution. That second class is where strategic coupling to a competitor hides: buildbarn/bb-storage’s BlobAccess interface looks like a generic storage shim until you notice it carries Buildbarn-shaped semantics for instance_name, eviction, and verification, at which point adopting it adopts Buildbarn’s product decisions. Don’t.

Storage Substrate Decision

Picking the storage substrate in one section, so the operator reading just this part can see the answer.

No default provider selected yet. The Week-1 decision is to keep CAS on a proved S3-compatible substrate and to require a concrete endpoint package before implementation. Civo Object Storage is explicitly excluded. Current RustFS is blocked by TIN-1147 and does not qualify on the strength of green canaries or restart recovery. A future candidate must name the endpoint family, authentication model, lifecycle policy, restore proof, regional/failure-domain behavior, and tenant-isolation shape before it can become the CAS authority.

Managed/appliance S3-compatible candidate. This is the preferred class if it gives us operator-owned credentials, lifecycle policy, restore evidence, and enough availability without adding a second storage service we run ourselves. The implementation should still use the same CASStore interface and an S3-compatible client library so the REAPI surface is not coupled to a provider.

Self-hosted S3-compatible candidate. Keep this as a candidate class when operator control and predictable latency matter more than reducing the storage operations surface. The current live member of that class is RustFS for existing cache/state paths; using RustFS for CAS/AC requires TIN-1147 repair, restore, lifecycle, bucket-index coherence, and failure-domain evidence before promotion.

Test-environment fallback. CI or integration environments may use a scoped S3-compatible test bucket with ephemeral credentials. The test endpoint must exercise the same CASStore contract and must not become the production authority by accident.

Current RustFS promotion is gated today. TIN-1147 / the RustFS RCA (docs/research/gloriousflywheel-rustfs-state-backend-rca-gate-2026-05-06.md) documents the bucket-index reliability failure that recurred on 2026-05-10 and the storage-node recovery failure on 2026-05-11. Restart-as-recovery is an incident response, not an availability design. CAS hit rate, action retry rate, and digest-mismatch rate SLOs cannot be met on a substrate whose bucket-index visibility depends on a clean restart. Current RustFS is not CAS/AC authority. TIN-1147 may still produce a repaired RustFS topology or a replacement backend, but current RustFS CAS/AC promotion remains blocked by TIN-1147 until a new promotion decision and its evidence clear the recurrence class.

Explicitly disqualified: any peer’s storage daemon as the substrate. Running bb-storage as the S3 endpoint would be adopting Buildbarn at the data-tier layer — the CASStore interface would be talking to a Buildbarn-flavored API, and Buildbarn’s eviction / sharding / verification choices would propagate up into our cell. That’s the strategic coupling the peer-frame discipline refuses.

Prioritized Backlog

Ordered by gating-power for the E1/TIN-1445 close criteria (CAS hit rate ≥ 90% for 14 days, digest-mismatch rate = 0 for 14 days, evicted-while-referenced = 0 for 14 days, cas-primitives-static-gate passes).

  1. Storage substrate (primitive a) — TIN-1458 / W1.1 (this doc). Without durable substrate, every other primitive measures noise. Require a concrete S3-compatible endpoint package with restore evidence. Keep managed/appliance, self-hosted S3-compatible, and any separately promoted RustFS candidate explicit; Civo is excluded and current RustFS CAS/AC promotion remains blocked by TIN-1147.
  2. Digest verification (primitive b) — TIN-1459 / W1.2. The first correctness slice is implemented in gf-reapi-cell read/write and presence-check paths with a minimal poison counter. Remaining production work is richer labels, dashboard/alert wiring, and AC lookup provenance.
  3. TTL / lease (primitive e) — TIN-1460 / W1.3. Required before any bounded substrate (a) can run an eviction loop (d) without poisoning long-running actions.
  4. Eviction policy (primitive d) — under TIN-1460’s neighborhood; schedule the eviction-loop sub-ticket explicitly. LRU default; LFU deferred. Referenced-set invariant is the hard contract.
  5. Sharding / replication (primitive f) — TIN-1461 / W1.4. Phase 1 stateless replicas + leader-elected eviction. Phase 2 deferred.
  6. Namespace routing (primitive c) — W4.1 / TIN-1472 (already drafted in instance-name-routing-design.md). The CAS-side implementation lands alongside that doc; this primitive’s gating-power is on E4, not E1, so it’s lower in the E1 backlog even though it touches every CAS request.
  7. TLS / write attestation (primitive g) — co-lands with ac-writer-attestation-design.md. Audit envelope is shared with AC; no separate ticket needed for E1 close, but the audit record must include CAS writes by the time E5/TIN-1449 closes.

Failure Modes — At the Primitive Level

| Failure | Which primitive owns the defense | Residual risk |
| --- | --- | --- |
| Node loss wipes warm CAS | a. storage substrate | Substrate outage or object-store host loss; mitigated by explicit restore and fallback substrate selection |
| Storage substrate returns corrupted bytes | b. digest verification (read path) | Pre-verification corruption between substrate write and substrate read (in-flight memory corruption); rare |
| Client uploads bytes that don’t match the declared digest | b. digest verification (write path) | None; write-side is fully covered by re-hashing |
| Spoke A reads spoke B’s blob by digest-guess | c. namespace routing | First JWT authz slice exists behind GF_REAPI_AUTHZ_MODE; live rollout still needs token exchange, credential helper, and enforce-mode proof |
| Substrate fills, writes start failing | d. eviction policy | Eviction loop fails to keep up under burst write load; mitigated by sizing high-water mark conservatively |
| Eviction deletes a referenced digest mid-action | d. eviction policy (referenced-set invariant) | Referenced-set is per-replica in Phase 1; cross-replica reference leak deferred to Phase 2 + Redis |
| Long-running build loses digests mid-action | e. TTL / lease | Client must actually call FindMissingBlobs periodically; older Bazel versions don’t always |
| Single pod failure takes down the cell | f. sharding / replication | Leader election failure during eviction is recoverable on next lease cycle; brief eviction pause acceptable |
| Poisoned blob surfaces without forensic trail | g. TLS / write attestation | Audit log is append-only but not yet signed; signing deferred (see Open Questions) |
| Storage substrate provider compromise | a. + g. (substrate selection + audit) | Outside the trust boundary of this doc; named in Open Questions |
| Cross-cell instance_name collision | c. namespace routing | Single-cell horizon; multi-cell named in Open Questions |

Open Questions

These must be answered before W1.1 closes. Each is named so it can become a sub-ticket under TIN-1458.

  1. Managed/appliance S3-compatible service vs self-hosted S3-compatible service? Civo Object Storage is not a candidate. The open decision is whether a managed/appliance endpoint can give us a better operator contract than running a self-hosted S3-compatible service ourselves or promoting a repaired RustFS topology. Pending: run a measured availability, restore, and P99 read/write-latency probe against the real candidate set.
  2. LRU vs LFU for eviction in our access pattern? LRU is the default recommendation. LFU may win on workloads where a small hot set is re-accessed across many builds (compile-cache-dominant) and a long tail of one-shot blobs (test fixtures, generated configs) gets discarded correctly. Pending: instrument the access-time histogram on the first 30 days post-substrate-cutover, decide LRU vs LFU vs hybrid based on data, not vibes.
  3. Where does the CAS hot-blob set live across gf-reapi-cell replicas? Phase 1: per-replica in-memory referenced-set, leader-elected eviction. Phase 2: shared referenced-set in Redis or equivalent. The question: when does Phase 1 stop being sufficient? Pending: if eviction-leader pod restart causes a measurable eviction-pause SLI breach, promote to Phase 2.
  4. When does TIN-1147 RustFS recovery change anything here? Only if it produces more than restart recovery: a repaired or topology-changed RustFS authority with retention, restore, quota, tenant isolation, and recurrence-clearing evidence, plus an explicit promotion decision. Otherwise CAS/AC stays on a different selected S3-compatible substrate.
  5. How does this interact with the spoke-cache-quota tenant model? The spoke-cache-quota ConfigMap declares cache_gib, ttl_days, and (as of the most recent module variant) attic_namespace and bazel_cache_prefix. The CAS eviction loop reads these as the per-instance_name quota; the TTL contract reads ttl_days. Open question: what happens when an operator updates the ConfigMap mid-flight — does the eviction loop watch for changes or re-read on each pass? Pending: pick “watch” because update-during-burn is the operator-recovery path.
  6. Audit-log signing. The write-attestation primitive emits an audit record per blob; the AC side does the same. Are records signed? By what key? Today: append-only structured log on the cell’s existing log surface. Pending: when the audit log graduates from “evidence for operators” to “evidence for compliance,” pick a signing scheme; for now, un-signed structured records are sufficient.
  7. Cross-substrate migration shape. If we cut from one selected S3-compatible substrate to another, what’s the data move? Pending: write a one-page cutover runbook when the substrate decision lands. Likely answer: dual-write window, read-fallthrough during cutover, single-substrate cutoff once the new substrate’s hit-rate matches the old.
  8. Hash function plurality. REAPI v2 supports SHA-256 as the default, with BLAKE3 and others optionally. Bazel today uses SHA-256. gf-reapi-cell’s capabilities response should declare its supported hash functions; today it declares only SHA-256. Pending: leave as SHA-256-only until a client asks for BLAKE3.

References

Peer architectures — INSPIRATION ONLY — these are peers, not adoption candidates

  • Buildbarn bb-deployments — reference deployment shapes for a peer’s CAS / scheduler / worker topology. INSPIRATION ONLY — peer, not adoption candidate.
  • Buildbarn bb-storage blobstore configuration — peer’s storage interface decomposition. INSPIRATION ONLY — peer, not adoption candidate.
  • BuildBuddy OSS — peer’s monolith REAPI server with per-org tenant model. INSPIRATION ONLY — peer, not adoption candidate.
  • buchgr/bazel-remote — peer’s minimalist cache-only server. INSPIRATION ONLY — peer, not adoption candidate.
  • NativeLink — peer’s Rust-implemented composable Store stack. INSPIRATION ONLY — peer, not adoption candidate.
  • EngFlow — peer’s managed-service patterns for IAM scope composition with instance_name. INSPIRATION ONLY — peer, not adoption candidate.

GloriousFlywheel