Cache-Hit Dashboard Design
Decision summary
- Status: Working draft (W5.2 / TIN-1478, under parent E5 / TIN-1449).
- Three panels, three distinct measurements: CAS hit (REAPI FindMissingBlobs), AC hit (REAPI GetActionResult), analysis hit (Bazel skyframe via BES). Three different signals; three different sources; never aggregated into a single “cache hit %.”
- Drill-down dimensions: mnemonic (from action command.arguments[0]) and tenant (remote_instance_name, per TIN-1472).
- Metric source: CAS and AC panels read gf-reapi-cell’s own /metrics endpoint (Prometheus-shaped, proposed). Analysis panel reads Bazel BES events via a gf-bes-collector sidecar service. No peer’s metrics endpoint is scraped: GloriousFlywheel is a peer to Buildbarn / BuildBuddy / NativeLink / bazel-remote, not a consumer.
- What blocks if absent: E5/TIN-1449 cannot close. SLO targets in slo.md are unobservable without these panels; W5.3 fairness panel (TIN-1479) consumes the tenant drill-down as its primitive.
- Scope: Panel taxonomy, metric naming, drill-down design, dashboard layout. Not: alert routing (W5.6), TTFCH probe (W5.4), poison signal alerting policy (W5.5); those are siblings that this dashboard cross-references.
Frame
The SLO doc (slo.md) names three cache hit-rate SLIs and treats them as cleanly separate rows. In practice, every observation tool that ships with peer RBE systems (Buildbarn, BuildBuddy, bazel-remote) conflates them in operator-facing surfaces — a single “cache hit rate” panel that is silently a mix of CAS lookups, AC lookups, and Bazel-client analysis hits, weighted by whatever happens to flow through the scrape window. The result is a number that looks meaningful, isn’t, and drifts in ways an operator can’t decompose. The first cache-degradation incident on such a dashboard ends with the operator reverse-engineering which numerator and denominator each panel actually uses, which is exactly the work this doc does once, in writing, before the panels ship.
The principle this doc enforces: a dashboard is an operating instrument or it is decoration. The difference is whether the operator, woken at 0300 by a cache-hit-rate burn alert, can identify within one click which of CAS / AC / analysis is responsible, drill to which mnemonic is causing the drop, and drill again to which tenant is hot — and read the numerator/denominator off the panel without grepping source. This doc promotes the proposed metric names in slo.md to a concrete panel taxonomy, fixes the source of each metric, defines what each panel does not measure (the anti-definitions), and lays out the drill-down topology so the dashboard composes the way the SLO targets read.
The dashboard lives in the runner-dashboard SvelteKit surface under app/src/routes/cache/ (existing route, currently a placeholder). It does not scrape a peer’s /metrics. gf-reapi-cell owns every REAPI byte; its own emitter is the authority. Analysis-hit data comes from Bazel’s BES stream, which is a client-side signal Bazel emits regardless of which RBE backend is on the far end — so this panel works the same whether the action lands on gf-reapi-cell or (in a degraded-mode scenario) on no RBE at all.
One stylistic constraint, stated explicitly: this dashboard never renders a single aggregate “cache hit rate” number. There are three numbers, side by side, named for what they measure, and the operator reads them as three. The temptation to roll them up into a “system cache health %” composite is the load-bearing decoration failure mode of every peer dashboard in this space; that composite is not a number; it is the absence of information dressed as a number. The panel layout in this doc deliberately makes the three-up rendering the only rendering: there is no aggregate view to land on, no roll-up tile in the page summary, no top-of-fold “%” indicator. The operator reads three or they read nothing.
The Three Panels — Definitions
One subsection per panel. Each has a precise definition, a source, a numerator/denominator, an anti-definition (what it isn’t, to head off the conflation that is the default in peer dashboards), and the drill-down dimensions.
Panel 1 — CAS Hit Rate
What it measures. The fraction of CAS digest queries against gf-reapi-cell that returned “present” (the blob exists in this tenant’s CAS namespace) versus “missing” (the caller must upload the blob). This is the REAPI server-side view of “did the cache already have what the action needed?”
Source. gf-reapi-cell’s /metrics endpoint (Prometheus-shaped, proposed). The cell counts every FindMissingBlobs digest result and every ByteStream/BatchReadBlobs lookup, tagged with tenant (taken from the request’s instance_name), mnemonic, and, on the read paths, op.
Proposed metric.
gf_reapi_cas_findmissingblobs_results_total{
tenant="spoke-<slug>|default|system",
mnemonic="CppCompile|GoCompile|TestRunner|GenRule|JsRunBinary|...",
result="present|missing"
}
A companion counter for explicit read paths:
gf_reapi_cas_read_results_total{tenant, mnemonic, op="BatchReadBlobs|ByteStreamRead", result="ok|not_found"}
Numerator / denominator.
| Quantity | Formula |
|---|---|
| Numerator | sum(rate(gf_reapi_cas_findmissingblobs_results_total{result="present"}[$window])) |
| Denominator | sum(rate(gf_reapi_cas_findmissingblobs_results_total[$window])) |
| Ratio | numerator / denominator |
Two windows render side-by-side per panel: rolling 1h (operational view) and rolling 30d (SLO view).
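The numerator/denominator table above can be sketched as a small query builder plus a zero-safe ratio. This is a minimal sketch, assuming the proposed metric and label names from this doc; the helper names (selector, cas_hit_ratio_query, safe_ratio) are hypothetical and do not exist in any codebase referenced here.

```python
from typing import Dict, Optional

# Proposed metric name from this doc; not yet exported by the cell.
METRIC = "gf_reapi_cas_findmissingblobs_results_total"


def selector(labels: Dict[str, str]) -> str:
    """Render a PromQL label selector, e.g. {result="present",tenant="default"}.

    Assumes label values are trusted (no quote escaping is done here).
    """
    if not labels:
        return ""
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + inner + "}"


def cas_hit_ratio_query(window: str, tenant: Optional[str] = None,
                        mnemonic: Optional[str] = None) -> str:
    """Build 'present / all' over one window, with optional drill-down filters."""
    filters: Dict[str, str] = {}
    if tenant is not None:
        filters["tenant"] = tenant
    if mnemonic is not None:
        filters["mnemonic"] = mnemonic
    present = selector({**filters, "result": "present"})
    everything = selector(filters)
    numerator = f"sum(rate({METRIC}{present}[{window}]))"
    denominator = f"sum(rate({METRIC}{everything}[{window}]))"
    return f"{numerator} / {denominator}"


def safe_ratio(present: float, total: float) -> Optional[float]:
    """Return None (render as 'no data') rather than dividing by zero."""
    return None if total <= 0 else present / total
```

Calling cas_hit_ratio_query once with "1h" and once with "30d" yields the two side-by-side windows. Returning None for an empty denominator keeps a quiet window from rendering as a 0% hit rate.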
Anti-definition — what this panel does NOT measure.
- Not “fraction of bytes served from cache vs disk.” Byte-served ratio is a storage-tier metric; this panel is request-count ratio over digest existence.
- Not “fraction of builds with no CAS misses.” Per-build aggregation is a different roll-up; this panel is per-digest-query.
- Not action-cache hit rate. AC is a separate REAPI surface and has its own panel below. A high AC hit rate reduces the absolute number of CAS queries (the action never gets executed, so its inputs are never FindMissingBlobs-queried); confusing the two will make the operator chase a phantom CAS regression that is actually an AC win.
- Not Bazel-client --disk_cache hit rate. That’s local-disk, never reaches the wire, and is invisible to gf-reapi-cell.
Drill-down dimensions.
| Dimension | Source | Cardinality concern |
|---|---|---|
| Tenant | instance_name from REAPI request | Bounded by spoke count (currently 0, target single-digit) |
| Mnemonic | command.arguments[0] extracted by the cell’s middleware | Bounded by the proved-and-eligible target classes in config/rbe-target-eligibility.json; long-tail capped (see Failure Modes) |
| Op | FindMissingBlobs vs BatchReadBlobs vs ByteStreamRead | Three values, no cardinality risk |
Poison-signal inset. Below the main CAS panel, two single-stat cells render the poison signals (W5.5 / TIN-1481): gf_reapi_cas_bytes_evicted_referenced_total and gf_reapi_digest_mismatch_total{path="read|write"}. Both have target = 0; any nonzero value renders red and links to the alert runbook. They sit on the CAS panel because they’re CAS-substrate signals, but they are not folded into the hit-rate number — they live in a visually-separated inset, exactly because conflating “we have correctness damage” with “we’re getting fewer cache hits than we’d like” is the failure mode of every existing peer dashboard.
Panel 2 — AC Hit Rate
What it measures. The fraction of GetActionResult requests against gf-reapi-cell that returned a cached ActionResult (the action was already executed and its result is reusable) versus NOT_FOUND (the action must be executed). This is the per-action REAPI server-side view of “did we already compute this?”
Source. gf-reapi-cell’s /metrics endpoint. The cell counts every GetActionResult reply by status.
Proposed metric.
gf_reapi_ac_getactionresult_results_total{
tenant="spoke-<slug>|default|system",
mnemonic="CppCompile|GoCompile|TestRunner|GenRule|...",
result="hit|miss"
}
Numerator / denominator.
| Quantity | Formula |
|---|---|
| Numerator | sum(rate(gf_reapi_ac_getactionresult_results_total{result="hit"}[$window])) |
| Denominator | sum(rate(gf_reapi_ac_getactionresult_results_total[$window])) |
| Ratio | numerator / denominator |
Anti-definition — what this panel does NOT measure.
- Not “fraction of builds with zero remote actions.” Build-level “fully cached build” is a roll-up metric; this is per-action.
- Not CAS hit rate. AC hits short-circuit before any CAS query for that action’s inputs happens. An AC hit is one GetActionResult returning OK; a CAS hit is one digest query returning “present.” A 100% AC hit rate makes the CAS hit-rate denominator very small (only inputs to actions that aren’t AC-hit get queried), which is fine and expected; see the cross-panel note below.
- Not the same as Bazel’s analysis cache. Analysis cache is loaded-graph reuse; AC is executed-action result reuse. A build can have 100% analysis cache hits and 0% AC hits (analysis says “I know what the graph is” without saying “the result is already there”).
- Not UpdateActionResult rate. Writes are a separate metric, attestation-scoped per TIN-1462. The hit-rate panel is read-side only.
Drill-down dimensions.
| Dimension | Source | Cardinality concern |
|---|---|---|
| Tenant | instance_name | Same as CAS |
| Mnemonic | command.arguments[0] at the moment of GetActionResult, extracted from the embedded Action/Command proto since the action digest lookup doesn’t itself carry the mnemonic | Same as CAS; long-tail capped |
Cross-panel note. AC hits reduce CAS hit denominator. This is mechanical, not a regression. The dashboard does not “correct” for this: each panel reports its own ratio honestly, and the operator reads the pair together. A degraded build has both panels drop. A genuinely-healthy build sees AC climb and CAS stay flat-to-moderate. The panel-header copy says this explicitly so the operator doesn’t go hunting.
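The mechanical effect in the cross-panel note can be illustrated with a toy model. This is a deliberately simplified sketch: it assumes every AC-missed action queries a fixed number of input digests and every AC-hit action queries none; the function name and the numbers are illustrative only, not measurements.

```python
def cas_queries(actions: int, inputs_per_action: int, ac_hit_rate: float) -> int:
    """CAS digest queries issued when only AC-missed actions query their inputs."""
    ac_misses = round(actions * (1.0 - ac_hit_rate))
    return ac_misses * inputs_per_action


# Same workload, two AC hit rates: the CAS denominator shrinks tenfold,
# yet nothing about CAS behaviour itself has changed.
cold = cas_queries(actions=100, inputs_per_action=10, ac_hit_rate=0.0)
warm = cas_queries(actions=100, inputs_per_action=10, ac_hit_rate=0.9)
```

With a smaller denominator, the CAS ratio is computed over fewer, less representative queries, which is why the two panels are read as a pair rather than corrected against each other.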
Panel 3 — Analysis Cache Hit Rate (Bazel-side)
What it measures. The fraction of Bazel ConfiguredTarget evaluations during a build that were satisfied from the client-side skyframe / analysis cache (in-memory analysis cache or workspace-persisted state) without re-running analysis. --disk_cache is not part of this layer; it covers action outputs, as the anti-definition below spells out. This is the Bazel-client view of “did we already know what this target looks like?”
Source. Bazel BES (Build Event Stream) events, captured by a gf-bes-collector sidecar service. Not gf-reapi-cell. BES is emitted by the Bazel client and is independent of which RBE backend is on the far end.
The relevant event types: AnalysisProgress, LoadingProgress, ConfiguredTargetEvent, and the BuildMetadata event that carries analysis_cache_hits and analysis_cache_misses counts in the trailing metrics. Bazel does not expose these as Prometheus metrics natively; the BES collector reads them off the event stream and re-emits as a counter.
Proposed metric.
gf_bazel_analysis_cache_hits_total{
invocation_id="<bes invocation uuid>",
build_target_set="<canonical //...:all label set hash>"
}
gf_bazel_analysis_cache_misses_total{
invocation_id, build_target_set
}
The invocation_id is per-build (high cardinality, but bounded by build count); aggregation is by build_target_set (which approximates “what build is this — //app:build, //docs-site:build, etc.”).
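One way to derive a stable build_target_set value is to hash the sorted, de-duplicated label set. The doc does not specify the canonicalisation, so this is only a sketch under that assumption; the function name, the SHA-256 choice, and the 16-character truncation are all hypothetical.

```python
import hashlib
from typing import Iterable


def build_target_set_id(labels: Iterable[str]) -> str:
    """Hash a set of Bazel labels so ordering and duplicates do not change the id."""
    canonical = "\n".join(sorted({label.strip() for label in labels}))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two invocations naming the same targets in a different order share an id.
a = build_target_set_id(["//app:build", "//docs-site:build"])
b = build_target_set_id(["//docs-site:build", "//app:build", "//app:build"])
```

Keeping the label bounded this way is what lets the dashboard aggregate by build_target_set while leaving invocation_id unenumerated.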
Numerator / denominator.
| Quantity | Formula |
|---|---|
| Numerator | sum(gf_bazel_analysis_cache_hits_total) by (build_target_set) |
| Denominator | numerator + sum(gf_bazel_analysis_cache_misses_total) by (build_target_set) |
| Ratio | numerator / denominator |
The denominator is the total analysis evaluations for that build, not a time-window rate. Analysis hit rate is a per-build property, aggregated across recent builds in the dashboard’s chosen window.
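The per-build aggregation described above can be sketched as follows. This is a minimal illustration, assuming the collector hands over one (build_target_set, hits, misses) tuple per invocation; the function name and input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def analysis_hit_rates(
    samples: Iterable[Tuple[str, int, int]]
) -> Dict[str, Optional[float]]:
    """Sum hits and misses per build_target_set, then take hits / (hits + misses)."""
    totals = defaultdict(lambda: [0, 0])
    for target_set, hits, misses in samples:
        totals[target_set][0] += hits
        totals[target_set][1] += misses
    rates: Dict[str, Optional[float]] = {}
    for target_set, (hits, misses) in totals.items():
        evaluations = hits + misses
        # No evaluations means 'no data', not a 0% hit rate.
        rates[target_set] = None if evaluations == 0 else hits / evaluations
    return rates
```

Summing before dividing weights each build by its number of evaluations, which matches the sum(...) by (build_target_set) formulas in the table; averaging per-build ratios would be a different statistic.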
Anti-definition — what this panel does NOT measure.
- Not REAPI-side. Analysis cache is the Bazel client’s skyframe; it never touches gf-reapi-cell. A build can hit 100% analysis cache and still send every action to RBE for execution; it can hit 0% analysis cache and still get 100% AC hits.
- Not --disk_cache for action outputs. --disk_cache covers both AC and CAS; this panel covers analysis only, which is a third orthogonal cache layer in the Bazel architecture.
- Not “the build was fast.” A high analysis cache hit rate can coexist with slow remote execution. Latency is a sibling panel set (see slo.md p50/p95/p99 rows; W5.4 / TIN-1480 owns the TTFCH dashboard).
- Not per-tenant. Analysis happens on the Bazel client, not the cell, and the analysis phase carries no instance_name. The drill-down does not split by tenant.
Drill-down dimensions.
| Dimension | Source | Cardinality concern |
|---|---|---|
| Build target set | BES BuildMetadata event, canonicalised | Bounded by the proved target classes |
| Invocation | BES invocation_id | High but per-build; aggregated, not enumerated |
| Not by mnemonic | Analysis is whole-build, not per-action | — |
| Not by tenant | Bazel client doesn’t know about tenants | — |
Drill-Down Design
The dashboard exposes mnemonic and tenant as filters at the top of the page, applied uniformly to the CAS and AC panels (analysis ignores them, by definition above). The filter chips render the current selection so the operator can see at a glance whether they’re looking at aggregate or sliced data.
Filter bar (top of page).
[ Tenant: ▼ all spokes ] [ Mnemonic: ▼ all mnemonics ] [ Window: ▼ rolling 1h | 30d ]
- Tenant is a multi-select dropdown sourced from the live set of instance_name values seen in the last 30 days, plus the reserved default and system. Default selection: all. Selecting one chip narrows every CAS/AC panel and table to that tenant.
- Mnemonic is a multi-select sourced from the mnemonic label set on gf_reapi_* metrics, capped to the top-N by traffic (default N=12), with everything else grouped as other. Default selection: all.
- Window toggles 1h (operational) and 30d (SLO compliance) views.
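The top-N cap on the mnemonic filter can be sketched as below. This is an illustration only, assuming traffic is available as a mapping from mnemonic to request count; the function name is hypothetical.

```python
from typing import Dict


def cap_mnemonics(traffic: Dict[str, int], n: int = 12) -> Dict[str, int]:
    """Keep the n busiest mnemonics; fold the remainder into a single 'other' bucket."""
    # Sort by descending traffic, breaking ties by name so the result is deterministic.
    ranked = sorted(traffic.items(), key=lambda item: (-item[1], item[0]))
    capped = dict(ranked[:n])
    tail = sum(count for _, count in ranked[n:])
    if tail:
        capped["other"] = capped.get("other", 0) + tail
    return capped
```

The deterministic tie-break matters for a dashboard: without it, two mnemonics with equal traffic could swap in and out of the visible set between refreshes.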
Default landing view. All tenants, all mnemonics, 1h rolling. Three big panels (CAS / AC / analysis) across the top. Below each, a small drill-down table.
Drill from aggregate to mnemonic. Click on any CAS or AC panel — the panel expands into a stacked-bar breakdown by mnemonic, showing each mnemonic’s contribution to total hits and misses. The aggregate ratio stays visible above the breakdown.
Drill from mnemonic to tenant. Click a specific mnemonic bar — the breakdown re-pivots to show that mnemonic’s hit rate split by tenant, as a small-multiples grid (one tile per tenant) or a horizontal bar chart depending on tenant count.
Drill from tenant back to mnemonic. Selecting a tenant chip in the filter bar narrows every panel and table to that tenant; clicking into a panel then shows that tenant’s mnemonic breakdown.
The drill never inverts: tenant is always the outer filter (set on the filter bar); mnemonic is always the inner one (set by clicking a panel). This keeps the navigation deterministic — the operator can’t end up in a state where they don’t know which tenant the numbers belong to. Per-tenant fairness is W5.3’s panel (TIN-1479), and uses the same tenant primitive but does the cross-tenant comparison the cache-hit dashboard deliberately doesn’t.
Worked example — operator drill. It’s 0300; the on-call pager fires “CAS hit rate burn 50% of 7d budget.” The operator opens the dashboard:
- Default landing: CAS panel reads 84% (target 90%, budget burning). AC panel reads 71% (target 70%, fine). Analysis reads 83% (target 80%, fine).
- Operator clicks the CAS panel. Mnemonic breakdown: CppCompile 93%, GoCompile 91%, TestRunner 88%, GenRule 41% (outlier).
- Operator clicks the GenRule bar. Tenant breakdown: spoke-elders 89%, default 88%, spoke-blahaj 12% (outlier).
- Operator opens spoke-blahaj’s GenRule action evidence (link out from the panel to the AC writer attestation surface or the BES events for recent spoke-blahaj invocations). Discovers a recent spoke BUILD.bazel change introduced a non-deterministic genrule whose digest churns on every run.
- Total time from page to root cause: under 90 seconds, because the drill path is deterministic.
Compare against the failure mode the design is preventing: a single conflated “cache hit rate” panel reading 79%, with no decomposition, sending the operator to grep the cell’s logs for two hours.
Metric Source Plumbing
Two streams, two paths, one aggregation surface.
Stream 1 — gf-reapi-cell → /metrics → Prometheus → SvelteKit.
- gf-reapi-cell exports a Prometheus /metrics endpoint on its admin port (proposed: :9090, separate from the gRPC port :8980). The cell does not export today; W5.2 is the workstream that lights this up. The metrics taxonomy:
| Metric | Type | Labels |
|---|---|---|
| gf_reapi_cas_findmissingblobs_results_total | counter | tenant, mnemonic, result |
| gf_reapi_cas_read_results_total | counter | tenant, mnemonic, op, result |
| gf_reapi_ac_getactionresult_results_total | counter | tenant, mnemonic, result |
| gf_reapi_ac_updateactionresult_results_total | counter | tenant, mnemonic, result |
| gf_reapi_cas_bytes_evicted_total | counter | tenant |
| gf_reapi_cas_bytes_evicted_referenced_total | counter | tenant (poison signal) |
| gf_reapi_digest_mismatch_total | counter | initial path; proposed tenant, op, hash_function (poison signal) |
| gf_reapi_action_latency_seconds | histogram | tenant, mnemonic, op |
- A Prometheus instance scrapes gf-reapi-cell on the gf-rbe cluster-internal endpoint. The Prom instance is operator-internal; it does not sit on the public ingress.
- The SvelteKit dashboard at app/src/routes/cache/+page.server.ts queries the Prom HTTP API server-side, formats the response into panel-shaped JSON, and renders into app/src/routes/cache/+page.svelte.
Stream 2 — Bazel BES → gf-bes-collector → Prometheus → SvelteKit.
- Bazel clients are configured (via .bazelrc) with --bes_backend=grpc://gf-bes-collector.gf-rbe.svc.cluster.local:1985.
- The gf-bes-collector service receives the BES stream, reads analysis_cache_hits and analysis_cache_misses off the trailing BuildMetadata event, and re-emits as Prom counters labeled with invocation_id and build_target_set.
- Same Prom instance scrapes; same SvelteKit endpoint queries.
Where the BES collector lives — open question. It can be a sidecar inside gf-reapi-cell’s pod (one binary, one deploy) or a sibling service in the gf-rbe namespace. Recommendation pending in Open Questions.
Why not scrape a peer’s /metrics. Restating the framing: GloriousFlywheel is an in-house peer to Buildbarn, BuildBuddy, NativeLink, and bazel-remote. gf-reapi-cell is the product authority for every REAPI byte. The dashboard’s metric source is gf-reapi-cell’s own emitter, not a sidecar that translates from a peer system, not a federated scrape of a vendor’s metrics, not an OTel translation from a vendor-emitted signal. The work to light up gf-reapi-cell’s /metrics endpoint is in scope for W5.2; the work to integrate against a peer’s metrics endpoint is explicitly out of scope for this design and the broader RBE Production Readiness initiative. This boundary matters because the dashboard’s credibility as an operating instrument depends on the metrics it surfaces being authored by the same code that’s doing the work — which is the same reason cas-primitives.md and instance-name-routing-design.md keep their data-plane primitives in-house.
SvelteKit page contract. The dashboard page at app/src/routes/cache/+page.server.ts exports a load() that issues N Prom queries in parallel (one per panel × per window × per drill state), reduces them into a single typed CacheDashboardSnapshot object, and passes it to +page.svelte. The snapshot type lives in app/src/lib/types/cache-dashboard.ts (proposed); the typed shape is the contract between server and client. Stale-metric detection happens in the server load(): any panel whose latest scrape is over the staleness threshold gets a { stale: true, last_seen_at: ... } flag on its slice of the snapshot, and the Svelte component renders the banner.
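The stale-flag step of the page contract can be sketched as follows. The real load() would be TypeScript; Python is used here purely for illustration. The PanelSlice shape and the 15-second scrape interval are assumptions; the 2× scrape-interval threshold is the one named in the Failure Modes table.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class PanelSlice:
    """One panel's slice of the snapshot (hypothetical shape)."""
    name: str
    value: Optional[float]
    last_seen_at: float      # unix seconds of the newest sample for this panel
    stale: bool = False


def flag_stale(panels: List[PanelSlice], now: float,
               scrape_interval_s: float = 15.0) -> List[PanelSlice]:
    """Mark any panel whose newest sample is older than 2x the scrape interval."""
    threshold = 2.0 * scrape_interval_s
    return [replace(p, stale=(now - p.last_seen_at) > threshold) for p in panels]
```

Doing this on the server side means the client component only has to render the banner; it never has to know the scrape interval.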
Authentication. Prometheus scraping is cluster-internal; no public endpoint. The SvelteKit surface enforces E4 IAM (TIN-1473) on the read side: operator role sees all tenants, tenant-scoped role sees only their own (per TIN-153 role views).
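The read-side scoping rule can be sketched as a fail-closed filter applied before any query is issued. The role strings and the function name are assumptions for illustration; the actual role model is defined by TIN-1473 and TIN-153, not here.

```python
from typing import Optional, Set


def legal_tenants(role: str, own_tenant: Optional[str],
                  requested: Set[str], known: Set[str]) -> Set[str]:
    """Return the tenants a caller may query; an empty set means 'query nothing'."""
    if role == "operator":
        legal = set(known)
    elif role == "tenant" and own_tenant in known:
        legal = {own_tenant}
    else:
        legal = set()  # unknown role or unknown tenant: fail closed
    # An empty request means 'everything I am allowed to see'.
    return legal & requested if requested else legal
```

The caller must treat an empty result as "issue no query" rather than "no filter"; otherwise a tenant-role request for another tenant would silently widen to all tenants.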
Panel Layout
ASCII mockup of the dashboard page at default landing. Three top-level panels; filter bar above; drill-down tables below; sidebar holds tenant detail when drilled.
┌─────────────────────────────────────────────────────────────────────────────┐
│ GloriousFlywheel · Cache Hit Dashboard │
│ [ Tenant: all spokes ▼ ] [ Mnemonic: all ▼ ] [ Window: 1h ▼ ] │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ CAS hit rate │ │ AC hit rate │ │ Analysis hit │ │
│ │ │ │ │ │ │ │
│ │ 92.4 % │ │ 73.1 % │ │ 84.6 % │ │
│ │ ╱╲╱╲╱─╲╱╲ trend │ │ ╱╲─╱╲╱─╲╱ trend │ │ ─╱╲╱─╲╱─ trend │ │
│ │ SLO ≥ 90 % │ │ SLO ≥ 70 % │ │ SLO ≥ 80 % │ │
│ │ budget: 81 % │ │ budget: 64 % │ │ budget: 92 % │ │
│ ├───────────────────┤ ├───────────────────┤ ├───────────────────┤ │
│ │ poison: │ │ (write rejects: │ │ (no poison │ │
│ │ evict-ref: 0 │ │ see attestation │ │ signals at │ │
│ │ digest-mismatch │ │ dashboard) │ │ this layer) │ │
│ │ : 0 │ │ │ │ │ │
│ └───────────────────┘ └───────────────────┘ └───────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ By mnemonic (top 12, click to drill): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CppCompile ████████████████░░░░ CAS 94% AC 78% │ │
│ │ GoCompile ███████████████░░░░░ CAS 91% AC 71% │ │
│ │ TestRunner █████████████░░░░░░░ CAS 88% AC 62% │ │
│ │ JsRunBinary ████████████░░░░░░░░ CAS 86% AC 69% │ │
│ │ GenRule ██████████░░░░░░░░░░ CAS 83% AC 55% │ │
│ │ … (cap at 12; tail grouped as `other`) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ By tenant (operator view; tenant-role users see only their own): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ spoke-elders CAS 93% AC 74% (skew: 0.97×) │ │
│ │ spoke-blahaj CAS 91% AC 71% (skew: 1.02×) │ │
│ │ default CAS 89% AC 68% (skew: 1.11×) ← migration tail │ │
│ │ system CAS 99% AC -- (probe traffic only) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
The panel uses the same compact-density convention as app/src/routes/runners/ and app/src/routes/monitoring/. The trend sparkline below each headline number is a 30d view regardless of selected window, so the operator sees both “right now” and “rolling history” without toggling.
SLO Overlay
Each of the three top-level panels carries the following overlay:
| Element | Source | Notes |
|---|---|---|
| Current value | Prom query, rolling 1h | Headline number |
| SLO target | slo.md table | Hardcoded in the panel config, NOT scraped (single source of truth is the doc) |
| Error budget remaining (30d) | computed | (1 - actual_miss_rate / target_miss_rate) × 100% |
| Trend sparkline | Prom range query, 30d | Renders below headline |
| Budget-burn rate | computed | Highlighted red if 7d burn rate exceeds 2× nominal |
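The error-budget row in the table above can be worked through in a few lines. This is a sketch of that formula only; the function name is hypothetical and hit rates are passed as fractions in [0, 1].

```python
def error_budget_remaining_pct(actual_hit_rate: float, target_hit_rate: float) -> float:
    """(1 - actual_miss_rate / target_miss_rate) x 100, as in the overlay table."""
    actual_miss = 1.0 - actual_hit_rate
    target_miss = 1.0 - target_hit_rate
    if target_miss <= 0.0:
        raise ValueError("a 100% target leaves no error budget to measure against")
    return (1.0 - actual_miss / target_miss) * 100.0
```

For example, a 95% hit rate against a 90% target leaves about half the budget, while an 85% hit rate against the same target overspends it and the result goes negative.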
Cross-references to slo.md targets.
- CAS hit ≥ 90% on warm CI (excluding cold-clone first 60s).
- AC hit ≥ 70% on warm CI.
- Analysis hit ≥ 80% on warm CI.
The cold-clone-exclusion logic for CAS hit rate is implemented in the Prom query, not in the cell’s emitter: the query filters out the first 60 seconds of any new-runner session (identified by a runner_session_age label, proposed — see Open Questions). The cell counts everything; the dashboard chooses what to expose against SLO.
Why the SLO target is hardcoded into the panel rather than scraped. The SLO doc is the single source of truth for the target. If the doc says ≥ 90% and a future change wants to relax it to ≥ 85%, the change happens in slo.md and propagates to the panel config via a code update — not via a config-flip that bypasses the doc. This is the same discipline config/rbe-target-eligibility.json enforces for target-class promotion: the doc is the gate, not the config. A panel that reads its target from a live config can be silently relaxed mid-incident; a panel whose target lives in a committed doc cannot.
Burn-rate computation. The dashboard computes a fast-burn signal from the fraction of the miss budget still remaining: (target_miss_rate × budget_window_seconds - elapsed_misses) / (target_miss_rate × budget_window_seconds), where target_miss_rate × budget_window_seconds is the miss allowance for the window. If 50% of the budget has been consumed (the fraction falls to 0.5) in less than 25% of the window, the panel renders red and links to the on-call runbook. The math is intentionally simple: burn-rate alerting (multi-window, multi-burn-rate) is a sibling concern (W5.6, alert routing); this dashboard exposes the input signal, not the alert.
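The fast-burn check can be sketched as below, reading the formula as (allowance - elapsed misses) / allowance. The function names are hypothetical, and allowed_misses stands in for the window's miss allowance however it is ultimately derived.

```python
def budget_remaining_fraction(allowed_misses: float, elapsed_misses: float) -> float:
    """Fraction of the window's miss allowance that has not been spent yet."""
    if allowed_misses <= 0.0:
        raise ValueError("allowed_misses must be positive")
    return (allowed_misses - elapsed_misses) / allowed_misses


def is_fast_burn(allowed_misses: float, elapsed_misses: float,
                 elapsed_s: float, window_s: float) -> bool:
    """True when at least 50% of the budget is gone in under 25% of the window."""
    consumed = 1.0 - budget_remaining_fraction(allowed_misses, elapsed_misses)
    return consumed >= 0.5 and (elapsed_s / window_s) < 0.25
```

Half the budget gone five days into a 30-day window trips the check; the same spend at day ten does not, because a third of the window has already elapsed.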
Per-Tenant Fairness — Sibling Cross-Link
The per-tenant drill-down in this dashboard is the visible primitive that W5.3 / TIN-1479 (fairness panel) reads from. This dashboard shows per-tenant hit rates as a flat table; the fairness panel computes derived metrics (queue-time skew, hit-rate skew, eviction-share) across tenants and renders the cross-tenant comparison. The cache-hit dashboard answers “how is each tenant’s cache doing?”; the fairness panel answers “are tenants getting a fair share?” Same data, two different presentation layers.
Both panels group by instance_name (per TIN-1472). When TIN-1479 lands, the per-tenant table at the bottom of this dashboard adds a deep-link to the fairness view for any tenant whose skew exceeds the W4.4 quota-enforcement budget.
TTFCH — Sibling Cross-Link
Time-to-first-cache-hit (TTFCH) is its own dashboard / panel set, owned by W5.4 / TIN-1480. TTFCH measures wall-time from git clone complete to first observed remote cache hit on a representative target — it is fundamentally a latency metric, not a hit-rate metric. The first dashboard JSON contract lives at docs/monitoring/gf-runner-ttfch-dashboard.json; the cache-hit dashboard links to the TTFCH page from the panel header, but does not display TTFCH metrics inline.
The relationship: a regression on TTFCH does not necessarily move any of the three hit-rate panels (the cache might still be 92% hit, just slow to first-touch on a cold runner). A regression on cache hit rate does not necessarily move TTFCH (the first hit might come on time while the rest of the build gets few hits). Keep them separate.
Poison Signals — Cross-Link to W5.5
Two poison signals render as an inset under the CAS panel, with the following posture:
- gf_reapi_cas_bytes_evicted_referenced_total: bytes evicted from CAS while still referenced by an in-flight action. Target = 0. Any nonzero value pages immediately.
- gf_reapi_digest_mismatch_total{path="read|write"}: count of digest mismatches observed on CAS read or write paths. Target = 0. Any nonzero value pages immediately. AC lookup labeling/provenance is still a follow-on slice.
These are not budgeted and not folded into the CAS hit-rate number. They are correctness invariants, not performance SLIs. The inset is on the CAS panel because they are CAS-substrate signals; the operator sees them in context, but the dashboard’s visual hierarchy makes clear they are not the hit-rate story. The alert routing and runbook attachment for these signals is owned by W5.5 / TIN-1481.
A useful test of this design: if a poison signal fires while CAS hit rate is at 92%, the dashboard does not let the operator be reassured by the 92%. The 92% is honest, and the poison signal is independent — the panel renders both, and the runbook says the poison signal wins.
The reverse test also holds: if CAS hit rate is degrading toward the budget edge but no poison signal has fired, the operator knows the failure is performance-shaped (eviction policy, capacity, working-set shift) rather than correctness-shaped (digest collision, premature eviction of live references). The two channels are independently informative, and the dashboard layout makes the distinction legible at a glance — no operator should ever conflate “we are getting fewer cache hits than the SLO says we should” with “the cache may be returning the wrong data,” because the design isolates the two signals on the same panel with deliberate visual separation. This is a load-bearing claim against peer dashboards in this space that fold eviction-rate-while-referenced into a generic “eviction rate” panel that gets lost next to hit-rate trends.
Failure Modes (of the Dashboard Itself)
The dashboard can lie. Here are the ways, and the defenses.
| Failure | Impact | Design defense | Residual risk |
|---|---|---|---|
| Metric source goes silent (cell crashes, scraper fails, BES collector down) | Dashboard renders stale data with no indicator; operator believes the system is fine | Stale-metric alert: any panel whose latest scrape is > 2× scrape-interval old renders a “STALE” banner. Prom up{job="gf-reapi-cell"} alert pages independently of the data. | Operator may dismiss the banner during a known-maintenance window; mitigated by maintenance-window markers on the dashboard. |
| Wrong denominator on a panel | Number looks plausible, isn’t; trust-erosion when later discovered | Explicit panel spec in this doc; every panel header carries a tooltip with the numerator/denominator formula; PromQL queries reviewed against this doc before landing | Drift between this doc and the deployed query if the doc isn’t updated when the query changes; mitigated by a doc-test in CI (proposed) that asserts the live query matches what this doc says. |
| Mnemonic cardinality explosion | Prom storage cost spikes; dashboard render slows; rare mnemonics drown out signal | Cap mnemonic label set at top-12 by traffic per scrape window; everything else grouped as other. Cap enforced at the emitter (cell-side relabel-style logic), not the dashboard. | A genuinely-new important mnemonic gets folded into other until the operator promotes it; mitigated by a “candidate mnemonics” view showing the top-of-other traffic. |
| Tenant cardinality explosion | Same as mnemonic, scaled by tenant count | Cap tenant count similarly; alert if count(distinct instance_name) exceeds a threshold (proposed: 50), which is the signal of an attack or a misconfigured caller spraying invented instance_name values. The validator regex in instance-name-routing-design.md already rejects malformed instance names at the cell perimeter; this is belt-and-suspenders on the metric side. | A legitimate growth event (the project actually has 50 spokes) requires raising the cap; that’s a one-line PR, not a redesign. |
| Cross-tenant data leak in dashboard | Tenant A’s user sees Tenant B’s hit-rate data | Dashboard auth respects E4 IAM (TIN-1473): the SvelteKit +page.server.ts extracts the caller’s role from the session, filters Prom queries to the legal tenant set before sending. Operator role sees all; tenant role sees only their own. | Bug in the filter logic could leak; mitigated by an integration test that asserts a tenant-role session cannot pull another tenant’s metric via direct API call (test lives in app/scripts/, proposed). |
| Panel conflates CAS / AC / analysis | The exact failure mode this dashboard is designed to prevent | Three separate Prom queries, three separate metric names, three separate panels, three anti-definitions in this doc, panel header copy that names what each one measures and links to this doc | Operator habit: “cache hit rate” is a familiar shorthand and operators may verbally aggregate the three. Mitigated by never showing an aggregated “cache hit rate” number anywhere on the dashboard. |
| Trend sparkline confuses with current value | Operator sees a downward trend and pages, but current value is fine | Sparkline always renders 30d trend; headline always renders current-window value. Both labeled explicitly. Sparkline never includes the current window’s value (which can spike on partial-window data) | Confusion remains possible; mitigated by tooltips on hover. |
| Cold-clone window not excluded correctly | First 60s of every new runner session poisons CAS hit rate downward; operator chases a phantom regression every time a runner cycles | runner_session_age label on cell-side metric; dashboard Prom query filters > 60s. Verified against the SLO doc’s cold-clone exclusion language. | If runner_session_age is unreliable (clock skew, missing emit), the exclusion silently does nothing. Mitigated by a synthetic test that asserts a cold-clone-only window reads as “insufficient data” rather than a low hit-rate. |
| Tenant role view leaks “I am one of many” | A tenant-role user sees their own panel reading 71% AC hit, infers the system has multiple tenants, infers tenant count from page layout | Tenant role sees only their own panel; no cross-tenant comparison surface; no tenant count anywhere; no system-aggregate row | Side-channel: response timing might reveal aggregate state. Acceptable residual; the dashboard is operator-internal in steady state. |
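
The cross-tenant leak row describes narrowing Prom queries to the caller's legal tenant set before they are sent. A minimal sketch of that narrowing step follows, under stated assumptions: the `Session` shape, the function names, and the `SAFE_INSTANCE_NAME` pattern are all hypothetical and stand in for whatever E4 IAM (TIN-1473) and instance-name-routing-design.md actually define. It relies only on the fact that PromQL `=~` matchers are fully anchored, so `a|b` matches exactly `a` or `b`.

```typescript
// Hypothetical session shape; the real one comes from E4 IAM (TIN-1473).
type Role = "operator" | "tenant";

interface Session {
  role: Role;
  tenants: string[]; // instance_name values a tenant-role session owns
}

// Placeholder pattern, NOT the validator regex from
// instance-name-routing-design.md. It only admits characters that are
// safe inside a PromQL regex matcher without escaping.
const SAFE_INSTANCE_NAME = /^[A-Za-z0-9_-]+$/;

// Narrows the requested tenants to the set this session may read.
// Operators keep everything they asked for; tenant-role sessions keep
// only the intersection with the tenants they own.
function legalTenantSet(session: Session, requested: string[]): string[] {
  if (session.role === "operator") {
    return [...requested];
  }
  const owned = new Set(session.tenants);
  return requested.filter((name) => owned.has(name));
}

// Builds an instance_name label matcher, or returns null when nothing is
// legal so the caller can skip the query instead of sending an
// unfiltered one.
function tenantMatcher(tenants: string[]): string | null {
  if (tenants.length === 0) {
    return null;
  }
  for (const name of tenants) {
    if (!SAFE_INSTANCE_NAME.test(name)) {
      throw new Error(`unsafe instance_name: ${name}`);
    }
  }
  return `instance_name=~"${tenants.join("|")}"`;
}
```

Returning `null` for an empty legal set is a deliberate choice in this sketch: a tenant-role request for someone else's data produces no query at all, which is the behavior the proposed integration test would assert.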
Integration with Siblings
Where this dashboard fits in the broader observability work.
- TIN-1477: W5.1 SLO definitions (slo.md). This dashboard surfaces the SLIs that doc defines. Targets and budgets are read from the SLO doc; this dashboard does not redefine them.
- TIN-1479: W5.3 fairness panel. Per-tenant drill-down here is the primitive the fairness panel consumes. Shared instance_name grouping; complementary presentation.
- TIN-1480: W5.4 TTFCH probe. Sibling dashboard / panel set. Cross-linked from panel headers; not displayed inline.
- TIN-1481: W5.5 poison alerts. Poison signals inlined as an inset on the CAS panel. Alert routing owned by W5.5; visual surface owned here.
- TIN-154: MCP wrap (re-parented under E5). When the MCP wrap lands, the dashboard’s data is exposed to the MCP surface through it. Today the dashboard reads Prom directly; under MCP wrap it reads through the wrap.
- TIN-155: API envelope (re-parented under E5). Same: the dashboard’s HTTP API surface conforms to the envelope shape when that lands.
- TIN-153: role views. Operator vs tenant scoping uses this. Operator role sees all tenants; tenant role sees only their own.
- TIN-1462: AC writer attestation. AC write rejection counters live on a sibling attestation dashboard; this dashboard cross-links but does not duplicate.
- TIN-1472: instance_name routing. The tenant drill-down depends on this routing primitive existing and being honest. Without it, “tenant” is decoration.
Open Questions
These must be resolved before W5.2 closes. Each is named so it can become a sub-ticket under TIN-1478.
- Prometheus vs OTel as the metric authority. Cross-link slo.md Open Question 1. This dashboard’s design assumes Prometheus because the existing monitoring/ route already speaks Prom, and Bazel’s BES → Prom translation is straightforward. If gf-reapi-cell adopts OTel as the primary emitter (with a Prom translation layer), the panel queries change shape but the panel taxonomy doesn’t. Recommendation pending: pick one in TIN-1449 and stop debating.
- 7d vs 30d default window. Cross-link slo.md Open Question 2. The dashboard offers a 1h / 30d toggle; whether the default landing is 1h-rolling or 7d-rolling is a separate operator-UX question. Recommendation: 1h for the operational page; 30d for the SLO-compliance page (separate route under app/src/routes/cache/slo/, proposed).
- Where does the BES collector live? Sidecar to gf-reapi-cell (one deploy, tighter coupling, simpler IAM) or sibling service in gf-rbe namespace (cleaner separation, allows BES traffic from non-gf-reapi-cell lanes, e.g. local bazel build against bazel-cache only, to still flow into analysis-hit metrics)? Recommendation pending: sibling service, on the grounds that analysis-hit data is client-side and shouldn’t require the executor to be in the path.
- Tenant cardinality cap: what’s the threshold and what’s the alert behavior? Proposed: cap at 50 distinct instance_name values per 30d window; alert at 40; hard-reject (at the cell perimeter validator) above 50. Numbers are placeholders; revisit when spoke count is real.
- Mnemonic top-N: what’s N, and how is the long-tail surfaced? Proposed: N=12 for the main panel, with an expandable “candidates” view showing the next 8 by traffic. Operators can promote a candidate to a first-class panel via a config change.
- Cold-clone exclusion mechanics. The SLO target excludes the first 60s of a new runner workspace. The dashboard query implements this via a runner_session_age label, but gf-reapi-cell doesn’t natively know a request is from a cold runner; that signal has to come from the runner’s session-start marker. Recommendation pending: have the runner-dashboard SvelteKit surface emit a runner_session_start event that the cell consumes and uses to label requests for the first 60s.
- Should the dashboard show write metrics at all? AC and CAS write rates are observable; the attestation dashboard (TIN-1462) is the primary surface for them. Recommendation: this dashboard stays read-side only; writes link out to attestation.
- What’s the BES collector’s retention shape? BES events carry the analysis-cache hit/miss counts the dashboard needs, but the full event stream is much richer (action-level events, target-level events, profile data). The collector either retains everything (storage cost) or only the metric-relevant fields (lighter, but loses replay capability when an investigation needs the full BES). Recommendation pending: retain only the metric-relevant fields in steady state; flip to full-retention for explicit-mode proof runs and incident-response windows.
- Synthetic-probe traffic in the denominator. The system instance carries TTFCH-probe traffic, which by design is a small set of canonical actions repeated frequently. This inflates the AC and CAS hit-rate denominators with very-cacheable traffic, biasing the numbers upward versus genuine spoke traffic. Recommendation pending: render system separately on the per-tenant table (already in the mockup) and exclude system from the headline aggregate ratio. Tenant filter default should be “all spokes + default; excludes system.”
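
The cold-clone and synthetic-probe questions both come down to which lookups are allowed into the numerator and denominator of a panel's headline ratio. The sketch below illustrates that arithmetic for a single panel (CAS or AC, never both together), assuming both pending recommendations are adopted. The `LookupBucket` shape, its field names, and the function name are hypothetical; in practice the filtering would more likely live in the Prom query than in application code.

```typescript
// Hypothetical aggregated bucket for one panel; field names are
// illustrative and do not come from slo.md or gf-reapi-cell.
interface LookupBucket {
  instanceName: string;
  sessionAgeSeconds: number; // age of the runner session that issued the lookups
  hits: number;
  misses: number;
}

const SYSTEM_INSTANCE = "system"; // carries TTFCH-probe traffic
const COLD_CLONE_SECONDS = 60; // exclusion window from the SLO target

// Headline hit ratio over genuine spoke traffic. Returns null, which the
// panel would render as "insufficient data", when no lookups survive
// the exclusions, rather than reporting a misleading 0%.
function headlineHitRatio(buckets: LookupBucket[]): number | null {
  let hits = 0;
  let lookups = 0;
  for (const bucket of buckets) {
    if (bucket.instanceName === SYSTEM_INSTANCE) continue; // probe traffic
    if (bucket.sessionAgeSeconds <= COLD_CLONE_SECONDS) continue; // cold clone
    hits += bucket.hits;
    lookups += bucket.hits + bucket.misses;
  }
  return lookups === 0 ? null : hits / lookups;
}
```

With one warm spoke bucket of 60 hits and 40 misses, a highly cacheable system bucket, and a cold-clone bucket of pure misses, only the warm spoke bucket counts and the ratio is 60 / 100. A window containing nothing but cold-clone traffic yields `null`, matching the "insufficient data" behavior the failure-mode table asks the synthetic test to assert.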
References
- Repo-local: docs/build-system/slo.md (the SLIs this dashboard surfaces; voice exemplar).
- Repo-local: docs/build-system/instance-name-routing-design.md (tenant identity; voice exemplar).
- Repo-local: docs/build-system/ac-writer-attestation-design.md (voice exemplar).
- Repo-local: docs/build-system/gf-reapi-cell.md (current REAPI shape and metrics-surface readiness).
- Repo-local: config/rbe-target-eligibility.json (proved target classes; informs the mnemonic cap).
- Repo-local: app/src/routes/cache/ (SvelteKit surface for this dashboard).
- Repo-local: app/src/routes/monitoring/ (sibling surface, Prom query conventions).
- Bazel BES protocol: build_event_stream.proto (canonical BES event shapes; BuildMetadata carries analysis_cache_hits / analysis_cache_misses).
- REAPI v2: remote_execution.proto (FindMissingBlobs, GetActionResult, BatchReadBlobs definitions).
- Peer dashboards (INSPIRATION ONLY, not adoption candidates). GloriousFlywheel is a peer to these systems, not a consumer:
  - BuildBuddy dashboards split CAS / AC / executor at the namespace and the API-key dimensions. Useful reference for tenant-scoped views; not the implementation. INSPIRATION ONLY.
  - Buildbarn bb-storage exports per-backend hit-rate metrics with a similar three-panel decomposition. Useful reference for metric naming conventions; not the implementation. INSPIRATION ONLY.
  - buchgr/bazel-remote exports CAS and AC hits as separate Prom counters with no tenant dimension. Useful as a minimal-shape reference; not the implementation. INSPIRATION ONLY.
  - EngFlow’s tenant fairness dashboards influence the per-tenant drill-down design. INSPIRATION ONLY.
- Linear epic: TIN-1449 — E5 observability (parent).
- Linear siblings: TIN-1477 (W5.1 SLO), TIN-1479 (W5.3 fairness), TIN-1480 (W5.4 TTFCH), TIN-1481 (W5.5 poison alerts).
- Linear adjacent: TIN-1472 (instance-name routing), TIN-1462 (AC writer attestation), TIN-153 (role views), TIN-154 (MCP wrap), TIN-155 (API envelope).