Sharded / Replicated CAS Topology and Eviction Policy Design
Framing note (load-bearing). GloriousFlywheel is a peer to Buildbarn / BuildBuddy / NativeLink / bazel-remote, not a consumer.
gf-reapi-cellowns every REAPI byte. The architectures below are studied as reference patterns only; none is an adoption candidate by default. The current durability seam is the provider-neutral S3-compatibleBlobStorebackend. GloriousFlywheel already runs RustFS for existing self-hosted cache/state paths, but RustFS is not automatically promoted to CAS/AC authority; TIN-1147 must prove durable repair and bucket-index coherence before any RustFS-backed CAS/AC namespace is selected.
W1.4 / TIN-1461. This is the storage-plane design plus the implementation
contract for the first local-backend slice. Phase 1 now exists in code as
GF_REAPI_CAS_MAX_BYTES: a lease-protected, LRU-ordered local CAS size evictor
with metrics and quota reconciliation. The sharded / replicated topology below
remains design until the shardedBlobStore phase lands.
This doc is storage-plane only. It does not replace the separate executor-scheduler / worker-pool placement work needed for broad/default RBE. The goal is that the CAS/AC substrate can survive that future scheduler load without poisoning live actions or collapsing under one tenant’s working set.
It builds on what already exists in code:
- the provider-neutral
BlobStoreseam (internal/cell/blobstore.go:Get/Put/Stat/Ping, plus the optionalusageReporterandevictingcapabilities); - the local + dependency-free
s3backends (the s3 backend proven against a real S3-compatible integration fixture, W1.1/E1); - the age-TTL garbage collector and its
gf_reapi_gc_*metrics (W1.3); - the GC-vs-client-lease startup guard (W1.3/TIN-1460), which already makes evict-while-referenced structurally impossible when configured.
Part A — Eviction policy and sizing
What exists
The shipped GC is age-TTL: a blob not recently accessed within
GF_REAPI_BLOB_TTL is evicted, scoped to instances/<name>/{cas,ac}. The local
backend best-effort touches object mtimes on Get, so mtime is the LRU signal.
It publishes
gf_reapi_gc_runs_total, gf_reapi_gc_evicted_objects_total,
gf_reapi_gc_evicted_bytes_total, and gf_reapi_gc_errors_total, and the
startup guard refuses GF_REAPI_BLOB_TTL < GF_REAPI_MIN_CLIENT_CACHE_TTL so the
sweeper never evicts a blob inside the Bazel client cache lease
(--experimental_remote_cache_ttl, default 3h).
What this ticket adds: a size bound
Age-TTL bounds staleness, not footprint. A burst of large builds can blow
past disk/bucket capacity before any blob ages out. TIN-1461 asks for a
bounded --max_size with deterministic LRU (or LFU) so the store has a hard
ceiling.
The first local implementation is GF_REAPI_CAS_MAX_BYTES. It accepts bytes or
Ki/Mi/Gi/Ti suffixes, requires GF_REAPI_MIN_CLIENT_CACHE_TTL, evicts
only CAS blobs older than that lease floor, reconciles durable quota counters
after reclaim, and emits gf_reapi_size_eviction_* plus
gf_reapi_evicted_while_referenced_total.
Recommended policy: size-bounded LRU, lease-protected.
- Track total stored bytes per backend (already available for the local backend
via
usageReporter.scanUsage, used by the durable byte quota). - On
Putthat would crossmaxBytes, evict least-recently-used blobs until under the high-water mark, skipping any blob younger than the client lease floor (GF_REAPI_MIN_CLIENT_CACHE_TTL) — the same invariant as the age-TTL guard, so an in-flight build’s inputs are never reclaimed. - LRU over LFU for v1: access-recency is cheap to approximate from filesystem mtime/atime (or a sidecar last-access touch on read), and CAS access patterns are recency-dominated. LFU is a documented follow-up if a frequency signal proves worthwhile.
Eviction order precedence: lease-protected blobs are never evicted; among evictable blobs, LRU; ties broken by larger-first to reclaim space fastest.
Sizing analysis (headroom ≥ 2× active protected working set)
The completion criterion is not merely
maxBytes ≥ 2 × (largest single-build working set). That rule is sufficient
for a one-build proof cell, but it is too small for multi-tenant/default-RBE
operation. The production rule is:
GF_REAPI_CAS_MAX_BYTES >= 2 * p99(active lease-protected CAS bytes)
active lease-protected CAS bytes includes all blobs younger than the client
cache lease floor across the configured concurrency envelope. Until direct
dashboard support exists, the conservative approximation is:
GF_REAPI_CAS_MAX_BYTES >= 2 * concurrency_limit * p99(per-build distinct CAS bytes)
The factor of 2 is headroom, not the concurrency multiplier. This keeps a burst of allowed concurrent actions from forcing eviction of another still-referenced build’s inputs.
The working set per proved target class is bounded by the proof-run process
counts already recorded (e.g. //docs-site:build ≈ 2.5k processes / ~1k remote;
the private tinyland.dev fan-outs ≈ 5–6k processes / 2.5–3k remote-cache hits).
A conservative upper bound for a single heavy web build’s CAS working set is on
the order of a few GiB of distinct blobs, but the product sizing surface must
scale that by the tenant or pool concurrency envelope.
p99(active lease-protected CAS bytes) comes from the same dashboards that feed
the W1.3 TTL tuning (W5 observability). Until those dashboards exist, the safe
default is to leave the size bound unset (disabled) — exactly as GC ships
disabled by default — and to size the substrate generously. The design must not
hard-code a number that pretends to be dashboard-validated.
Eviction metrics and the poison signal
| Signal | Status |
|---|---|
| eviction rate / bytes / blobs | published (gf_reapi_gc_evicted_{objects,bytes}_total, W1.3; gf_reapi_size_eviction_*, W1.4 phase 1) |
| current stored bytes (per tenant) | published (gf_reapi_tenant_cas_bytes gauge, W4.6) |
bytes-evicted-while-referenced (poison) |
structurally prevented by the lease-floor guard; the alert is W5.5 |
The size-bounded evictor must add a gf_reapi_evicted_while_referenced_total
counter that stays at zero by construction (it can only increment if the
lease-protection check is bypassed), so the W5.5 poison alert is a
defence-in-depth tripwire, not the primary guarantee.
Part B — Sharded / replicated topology
Why
One BlobStore = one object-store endpoint (or one disk) = a capacity and
availability ceiling. Horizontal scale and single-endpoint-loss survival need N
shards with a replication factor R.
Shape
A shardedBlobStore implements BlobStore over an ordered set of N backend
BlobStores (each shard is a local or s3 store).
It is itself a BlobStore, so nothing above the seam (the REAPI handlers,
digest verification, quotas, instance routing) changes.
Placement: rendezvous (HRW) hashing
For a key k, compute score(k, shard_i) = hash(k || shard_id_i) for every
shard and select the top-R shards as the replica set. Rendezvous hashing is
chosen over modulo or a hash ring because adding/removing a shard reshuffles only
~1/N of keys, and placement is deterministic and client-independent (the key
is the content digest, already uniformly distributed).
- CAS blobs key on the blob digest hash; AC entries key on the action digest.
instance_nameis not a placement input by default: a tenant’s blobs spread across all shards (best balance). A per-tenant-shard-subset variant is noted below for blast-radius isolation.
Operation mapping
| Op | Behavior |
|---|---|
Put |
write to all R replicas; succeed if ≥ W (write quorum) succeed; the rest get hinted-handoff or are repaired on read |
Get |
try replicas in HRW order until a hit; the server verifies the digest (W1.2); on a miss-then-hit, read-repair writes the blob back to the replicas that missed |
Stat |
query replicas in HRW order until one reports present + size |
Ping |
ready iff ≥ a configured quorum of shards are reachable |
Because CAS is content-addressed and immutable, R=2 or R=3 with W=R
(write-all) and read-one is sufficient and simple: any replica’s bytes are
verifiable, so there is no write-write conflict and no need for vector clocks.
AC entries are keyed by action digest and remain behind the writer-attestation
gate; read-repair must never promote an unauthenticated AC entry.
Rebalancing trade-off
When the shard set changes, HRW moves ~1/N of keys to new replica sets. Two
options:
- Active rebalancer — a background walk re-places moved blobs. Correct but expensive (full scan + copy).
- Lazy + re-upload — accept that a topology change makes some blobs
temporarily unreachable at their new HRW location; the client re-uploads on
the next
FindMissingBlobsmiss (CAS is a cache, not a database). The cost is a transient cache-hit-rate dip, never data loss or a wrong answer.
v1 chooses (2): CAS/AC are build-execution working storage, not publication or OpenTofu state authority, and a miss must degrade to re-upload/re-execute rather than return wrong bytes. The transient dip is observable via the existing cache-hit metrics and is the price of an online topology change. (2) plus read-repair converges the new layout as traffic flows. (1) is a documented follow-up for environments that cannot tolerate the dip.
Failure modes
- One shard down with R≥2: reads fall through to a surviving replica; writes
meet quorum if W ≤ surviving replicas;
Pingstays ready above the quorum. - Corruption: caught by per-read digest verification (W1.2), then read-repair from a good replica.
- Split brain: not applicable — content-addressed immutable blobs have no conflicting versions.
Aggregating the optional capabilities
usageReporter: the sharded store sums per-shard usage (dividing replicated bytes by R for a logical total, or reporting physical bytes — the durable byte quota wants the logical figure; document the choice).evicting/ size-bound: each shard runs its own evictor against its physical footprint; the global size bound isN/R ×per-shard bound. The lease protection is per-shard and identical, so the global invariant holds.
Tenant-isolation variant (future)
Optionally pin an instance_name to a subset of shards (combine with executor
pools and per-tenant quotas) so one noisy tenant’s footprint and failure blast
radius are bounded. This trades balance for isolation and is out of scope for
v1.
Implementation phases
- Eviction: add the size-bounded, lease-protected LRU evictor to the local
backend, plus
GF_REAPI_CAS_MAX_BYTESand thegf_reapi_evicted_while_referenced_totaltripwire. Disabled by default. Status: landed as phase 1. - Sharded store v1:
shardedBlobStorewith HRW placement, R replicas, write-quorum, read-repair, lazy rebalance; configGF_REAPI_SHARDS(JSON list of backend configs) +GF_REAPI_SHARD_REPLICAS+ write quorum. AggregatesusageReporter/evicting. Dependency-free (reuses the stdlib s3 client per shard). - Follow-ups: active rebalancer, hinted handoff, LFU option, tenant-shard pinning — each behind its own gate.
Interactions
- W1.2 digest verification — unchanged; the read-repair path relies on it.
- W1.3 GC + lease guard — the size evictor reuses the lease-floor protection; per-shard GC keeps the same TTL and guard.
- W4.4/W4.6 quotas — usage aggregation feeds the durable byte quota.
instance_namerouting — orthogonal to placement (keys already carry the instance).- Executor scheduler / worker pools — orthogonal. This storage topology is necessary for broad/default RBE, but it does not pick workers, enforce pool placement, or solve queue fairness.
- Substrate — shards are
localor provider-neutrals3backends. Self-hosted/appliance S3-compatible substrates must be proved behind the same seam. RustFS remains a real current cache/state substrate, but CAS/AC use remains gated on TIN-1147 repair/proof evidence and a dedicated namespace.
Completion-criteria mapping (TIN-1461)
| Criterion | This design |
|---|---|
| Sizing analysis in decision doc | Part A “Sizing analysis”; bound = 2 × p99(active lease-protected CAS bytes), approximated as 2 × concurrency_limit × p99(per-build distinct bytes) until dashboard-fed (W5) |
| Eviction metrics (rate, bytes, blobs) | gf_reapi_gc_evicted_{objects,bytes}_total shipped (W1.3); + size-evictor counters in phase 1 |
bytes-evicted-while-referenced alert |
structurally prevented by the lease-floor guard (W1.3); gf_reapi_evicted_while_referenced_total tripwire + W5.5 alert |
| Sharded / replicated topology | Part B: shardedBlobStore, HRW, R replicas, read-repair, quorum |