Sharded / Replicated CAS Topology and Eviction Policy Design

Sharded / Replicated CAS Topology and Eviction Policy Design

Framing note (load-bearing). GloriousFlywheel is a peer to Buildbarn / BuildBuddy / NativeLink / bazel-remote, not a consumer. gf-reapi-cell owns every REAPI byte. The architectures below are studied as reference patterns only; none is an adoption candidate by default. The current durability seam is the provider-neutral S3-compatible BlobStore backend. GloriousFlywheel already runs RustFS for existing self-hosted cache/state paths, but RustFS is not automatically promoted to CAS/AC authority; TIN-1147 must prove durable repair and bucket-index coherence before any RustFS-backed CAS/AC namespace is selected.

W1.4 / TIN-1461. This is the storage-plane design plus the implementation contract for the first local-backend slice. Phase 1 now exists in code as GF_REAPI_CAS_MAX_BYTES: a lease-protected, LRU-ordered local CAS size evictor with metrics and quota reconciliation. The sharded / replicated topology below remains design until the shardedBlobStore phase lands.

This doc is storage-plane only. It does not replace the separate executor-scheduler / worker-pool placement work needed for broad/default RBE. The goal is that the CAS/AC substrate can survive that future scheduler load without poisoning live actions or collapsing under one tenant’s working set.

It builds on what already exists in code:

  • the provider-neutral BlobStore seam (internal/cell/blobstore.go: Get / Put / Stat / Ping, plus the optional usageReporter and evicting capabilities);
  • the local + dependency-free s3 backends (the s3 backend proven against a real S3-compatible integration fixture, W1.1/E1);
  • the age-TTL garbage collector and its gf_reapi_gc_* metrics (W1.3);
  • the GC-vs-client-lease startup guard (W1.3/TIN-1460), which already makes evict-while-referenced structurally impossible when configured.

Part A — Eviction policy and sizing

What exists

The shipped GC is age-TTL: a blob not recently accessed within GF_REAPI_BLOB_TTL is evicted, scoped to instances/<name>/{cas,ac}. The local backend best-effort touches object mtimes on Get, so mtime is the LRU signal. It publishes gf_reapi_gc_runs_total, gf_reapi_gc_evicted_objects_total, gf_reapi_gc_evicted_bytes_total, and gf_reapi_gc_errors_total, and the startup guard refuses GF_REAPI_BLOB_TTL < GF_REAPI_MIN_CLIENT_CACHE_TTL so the sweeper never evicts a blob inside the Bazel client cache lease (--experimental_remote_cache_ttl, default 3h).

What this ticket adds: a size bound

Age-TTL bounds staleness, not footprint. A burst of large builds can blow past disk/bucket capacity before any blob ages out. TIN-1461 asks for a bounded --max_size with deterministic LRU (or LFU) so the store has a hard ceiling.

The first local implementation is GF_REAPI_CAS_MAX_BYTES. It accepts bytes or Ki/Mi/Gi/Ti suffixes, requires GF_REAPI_MIN_CLIENT_CACHE_TTL, evicts only CAS blobs older than that lease floor, reconciles durable quota counters after reclaim, and emits gf_reapi_size_eviction_* plus gf_reapi_evicted_while_referenced_total.

Recommended policy: size-bounded LRU, lease-protected.

  • Track total stored bytes per backend (already available for the local backend via usageReporter.scanUsage, used by the durable byte quota).
  • On Put that would cross maxBytes, evict least-recently-used blobs until under the high-water mark, skipping any blob younger than the client lease floor (GF_REAPI_MIN_CLIENT_CACHE_TTL) — the same invariant as the age-TTL guard, so an in-flight build’s inputs are never reclaimed.
  • LRU over LFU for v1: access-recency is cheap to approximate from filesystem mtime/atime (or a sidecar last-access touch on read), and CAS access patterns are recency-dominated. LFU is a documented follow-up if a frequency signal proves worthwhile.

Eviction order precedence: lease-protected blobs are never evicted; among evictable blobs, LRU; ties broken by larger-first to reclaim space fastest.

Sizing analysis (headroom ≥ 2× active protected working set)

The completion criterion is not merely maxBytes ≥ 2 × (largest single-build working set). That rule is sufficient for a one-build proof cell, but it is too small for multi-tenant/default-RBE operation. The production rule is:

GF_REAPI_CAS_MAX_BYTES >= 2 * p99(active lease-protected CAS bytes)

active lease-protected CAS bytes includes all blobs younger than the client cache lease floor across the configured concurrency envelope. Until direct dashboard support exists, the conservative approximation is:

GF_REAPI_CAS_MAX_BYTES >= 2 * concurrency_limit * p99(per-build distinct CAS bytes)

The factor of 2 is headroom, not the concurrency multiplier. This keeps a burst of allowed concurrent actions from forcing eviction of another still-referenced build’s inputs.

The working set per proved target class is bounded by the proof-run process counts already recorded (e.g. //docs-site:build ≈ 2.5k processes / ~1k remote; the private tinyland.dev fan-outs ≈ 5–6k processes / 2.5–3k remote-cache hits). A conservative upper bound for a single heavy web build’s CAS working set is on the order of a few GiB of distinct blobs, but the product sizing surface must scale that by the tenant or pool concurrency envelope.

p99(active lease-protected CAS bytes) comes from the same dashboards that feed the W1.3 TTL tuning (W5 observability). Until those dashboards exist, the safe default is to leave the size bound unset (disabled) — exactly as GC ships disabled by default — and to size the substrate generously. The design must not hard-code a number that pretends to be dashboard-validated.

Eviction metrics and the poison signal

Signal Status
eviction rate / bytes / blobs published (gf_reapi_gc_evicted_{objects,bytes}_total, W1.3; gf_reapi_size_eviction_*, W1.4 phase 1)
current stored bytes (per tenant) published (gf_reapi_tenant_cas_bytes gauge, W4.6)
bytes-evicted-while-referenced (poison) structurally prevented by the lease-floor guard; the alert is W5.5

The size-bounded evictor must add a gf_reapi_evicted_while_referenced_total counter that stays at zero by construction (it can only increment if the lease-protection check is bypassed), so the W5.5 poison alert is a defence-in-depth tripwire, not the primary guarantee.

Part B — Sharded / replicated topology

Why

One BlobStore = one object-store endpoint (or one disk) = a capacity and availability ceiling. Horizontal scale and single-endpoint-loss survival need N shards with a replication factor R.

Shape

A shardedBlobStore implements BlobStore over an ordered set of N backend BlobStores (each shard is a local or s3 store). It is itself a BlobStore, so nothing above the seam (the REAPI handlers, digest verification, quotas, instance routing) changes.

Placement: rendezvous (HRW) hashing

For a key k, compute score(k, shard_i) = hash(k || shard_id_i) for every shard and select the top-R shards as the replica set. Rendezvous hashing is chosen over modulo or a hash ring because adding/removing a shard reshuffles only ~1/N of keys, and placement is deterministic and client-independent (the key is the content digest, already uniformly distributed).

  • CAS blobs key on the blob digest hash; AC entries key on the action digest.
  • instance_name is not a placement input by default: a tenant’s blobs spread across all shards (best balance). A per-tenant-shard-subset variant is noted below for blast-radius isolation.

Operation mapping

Op Behavior
Put write to all R replicas; succeed if ≥ W (write quorum) succeed; the rest get hinted-handoff or are repaired on read
Get try replicas in HRW order until a hit; the server verifies the digest (W1.2); on a miss-then-hit, read-repair writes the blob back to the replicas that missed
Stat query replicas in HRW order until one reports present + size
Ping ready iff ≥ a configured quorum of shards are reachable

Because CAS is content-addressed and immutable, R=2 or R=3 with W=R (write-all) and read-one is sufficient and simple: any replica’s bytes are verifiable, so there is no write-write conflict and no need for vector clocks. AC entries are keyed by action digest and remain behind the writer-attestation gate; read-repair must never promote an unauthenticated AC entry.

Rebalancing trade-off

When the shard set changes, HRW moves ~1/N of keys to new replica sets. Two options:

  1. Active rebalancer — a background walk re-places moved blobs. Correct but expensive (full scan + copy).
  2. Lazy + re-upload — accept that a topology change makes some blobs temporarily unreachable at their new HRW location; the client re-uploads on the next FindMissingBlobs miss (CAS is a cache, not a database). The cost is a transient cache-hit-rate dip, never data loss or a wrong answer.

v1 chooses (2): CAS/AC are build-execution working storage, not publication or OpenTofu state authority, and a miss must degrade to re-upload/re-execute rather than return wrong bytes. The transient dip is observable via the existing cache-hit metrics and is the price of an online topology change. (2) plus read-repair converges the new layout as traffic flows. (1) is a documented follow-up for environments that cannot tolerate the dip.

Failure modes

  • One shard down with R≥2: reads fall through to a surviving replica; writes meet quorum if W ≤ surviving replicas; Ping stays ready above the quorum.
  • Corruption: caught by per-read digest verification (W1.2), then read-repair from a good replica.
  • Split brain: not applicable — content-addressed immutable blobs have no conflicting versions.

Aggregating the optional capabilities

  • usageReporter: the sharded store sums per-shard usage (dividing replicated bytes by R for a logical total, or reporting physical bytes — the durable byte quota wants the logical figure; document the choice).
  • evicting / size-bound: each shard runs its own evictor against its physical footprint; the global size bound is N/R × per-shard bound. The lease protection is per-shard and identical, so the global invariant holds.

Tenant-isolation variant (future)

Optionally pin an instance_name to a subset of shards (combine with executor pools and per-tenant quotas) so one noisy tenant’s footprint and failure blast radius are bounded. This trades balance for isolation and is out of scope for v1.

Implementation phases

  1. Eviction: add the size-bounded, lease-protected LRU evictor to the local backend, plus GF_REAPI_CAS_MAX_BYTES and the gf_reapi_evicted_while_referenced_total tripwire. Disabled by default. Status: landed as phase 1.
  2. Sharded store v1: shardedBlobStore with HRW placement, R replicas, write-quorum, read-repair, lazy rebalance; config GF_REAPI_SHARDS (JSON list of backend configs) + GF_REAPI_SHARD_REPLICAS + write quorum. Aggregates usageReporter/evicting. Dependency-free (reuses the stdlib s3 client per shard).
  3. Follow-ups: active rebalancer, hinted handoff, LFU option, tenant-shard pinning — each behind its own gate.

Interactions

  • W1.2 digest verification — unchanged; the read-repair path relies on it.
  • W1.3 GC + lease guard — the size evictor reuses the lease-floor protection; per-shard GC keeps the same TTL and guard.
  • W4.4/W4.6 quotas — usage aggregation feeds the durable byte quota.
  • instance_name routing — orthogonal to placement (keys already carry the instance).
  • Executor scheduler / worker pools — orthogonal. This storage topology is necessary for broad/default RBE, but it does not pick workers, enforce pool placement, or solve queue fairness.
  • Substrate — shards are local or provider-neutral s3 backends. Self-hosted/appliance S3-compatible substrates must be proved behind the same seam. RustFS remains a real current cache/state substrate, but CAS/AC use remains gated on TIN-1147 repair/proof evidence and a dedicated namespace.

Completion-criteria mapping (TIN-1461)

Criterion This design
Sizing analysis in decision doc Part A “Sizing analysis”; bound = 2 × p99(active lease-protected CAS bytes), approximated as 2 × concurrency_limit × p99(per-build distinct bytes) until dashboard-fed (W5)
Eviction metrics (rate, bytes, blobs) gf_reapi_gc_evicted_{objects,bytes}_total shipped (W1.3); + size-evictor counters in phase 1
bytes-evicted-while-referenced alert structurally prevented by the lease-floor guard (W1.3); gf_reapi_evicted_while_referenced_total tripwire + W5.5 alert
Sharded / replicated topology Part B: shardedBlobStore, HRW, R replicas, read-repair, quorum

GloriousFlywheel