Cache-Hit Dashboard Design

Decision summary

  • Status: Working draft (W5.2 / TIN-1478, under parent E5 / TIN-1449).
  • Three panels, three distinct measurements: CAS hit (REAPI FindMissingBlobs), AC hit (REAPI GetActionResult), analysis hit (Bazel skyframe via BES). Three different signals; three different sources; never aggregated into a single “cache hit %.”
  • Drill-down dimensions: mnemonic (from action command.arguments[0]) and tenant (remote_instance_name, per TIN-1472).
  • Metric source: CAS and AC panels read gf-reapi-cell’s own /metrics endpoint (Prometheus-shaped, proposed). Analysis panel reads Bazel BES events via a gf-bes-collector sidecar service. No peer’s metrics endpoint is scraped — GloriousFlywheel is a peer to Buildbarn / BuildBuddy / NativeLink / bazel-remote, not a consumer.
  • What blocks if absent: E5/TIN-1449 cannot close. SLO targets in slo.md are unobservable without these panels; W5.3 fairness panel (TIN-1479) consumes the tenant drill-down as its primitive.
  • Scope: Panel taxonomy, metric naming, drill-down design, dashboard layout. Not: alert routing (W5.6), TTFCH probe (W5.4), poison signal alerting policy (W5.5) — those are siblings that this dashboard cross-references.

Frame

The SLO doc (slo.md) names three cache hit-rate SLIs and treats them as cleanly separate rows. In practice, every observation tool that ships with peer RBE systems (Buildbarn, BuildBuddy, bazel-remote) conflates them in operator-facing surfaces — a single “cache hit rate” panel that is silently a mix of CAS lookups, AC lookups, and Bazel-client analysis hits, weighted by whatever happens to flow through the scrape window. The result is a number that looks meaningful, isn’t, and drifts in ways an operator can’t decompose. The first cache-degradation incident on such a dashboard ends with the operator reverse-engineering which numerator and denominator each panel actually uses, which is exactly the work this doc does once, in writing, before the panels ship.

The principle this doc enforces: a dashboard is an operating instrument or it is decoration. The difference is whether the operator, woken at 0300 by a cache-hit-rate burn alert, can identify within one click which of CAS / AC / analysis is responsible, drill to which mnemonic is causing the drop, and drill again to which tenant is hot — and read the numerator/denominator off the panel without grepping source. This doc promotes the proposed metric names in slo.md to a concrete panel taxonomy, fixes the source of each metric, defines what each panel does not measure (the anti-definitions), and lays out the drill-down topology so the dashboard composes the way the SLO targets read.

The dashboard lives in the runner-dashboard SvelteKit surface under app/src/routes/cache/ (existing route, currently a placeholder). It does not scrape a peer’s /metrics. gf-reapi-cell owns every REAPI byte; its own emitter is the authority. Analysis-hit data comes from Bazel’s BES stream, which is a client-side signal Bazel emits regardless of which RBE backend is on the far end — so this panel works the same whether the action lands on gf-reapi-cell or (in a degraded-mode scenario) on no RBE at all.

One stylistic constraint, stated explicitly: this dashboard never renders a single aggregate “cache hit rate” number. There are three numbers, side by side, named for what they measure, and the operator reads them as three. The temptation to roll them up into a “system cache health %” composite is the characteristic decoration failure of every peer dashboard in this space; that composite is not a number but the absence of information dressed as a number. The panel layout in this doc deliberately makes the three-up rendering the only rendering: there is no aggregate view to land on, no roll-up tile in the page summary, no top-of-fold “%” indicator. The operator reads three or they read nothing.

The Three Panels — Definitions

One subsection per panel. Each has a precise definition, a source, a numerator/denominator, an anti-definition (what it isn’t, to head off the conflation that became the default in peer dashboards), and the drill-down dimensions.

Panel 1 — CAS Hit Rate

What it measures. The fraction of CAS digest queries against gf-reapi-cell that returned “present” (the blob exists in this tenant’s CAS namespace) versus “missing” (the caller must upload the blob). This is the REAPI server-side view of “did the cache already have what the action needed?”

Source. gf-reapi-cell’s /metrics endpoint (Prometheus-shaped, proposed). The cell counts every FindMissingBlobs digest result and every ByteStream/BatchReadBlobs lookup, tagged with tenant (taken from instance_name), mnemonic, and, on the read paths, op.

Proposed metric.

gf_reapi_cas_findmissingblobs_results_total{
  tenant="spoke-<slug>|default|system",
  mnemonic="CppCompile|GoCompile|TestRunner|GenRule|JsRunBinary|...",
  result="present|missing"
}

A companion counter for explicit read paths:

gf_reapi_cas_read_results_total{tenant, mnemonic, op="BatchReadBlobs|ByteStreamRead", result="ok|not_found"}

Numerator / denominator.

Quantity | Formula
Numerator | sum(rate(gf_reapi_cas_findmissingblobs_results_total{result="present"}[$window]))
Denominator | sum(rate(gf_reapi_cas_findmissingblobs_results_total[$window]))
Ratio | numerator / denominator

Two windows render side-by-side per panel: rolling 1h (operational view) and rolling 30d (SLO view).
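The numerator/denominator pair above can be generated rather than hand-written, which keeps the CAS and AC panels on the same query shape. The following is a minimal sketch, assuming a hypothetical helper named hitRatioQuery on the dashboard's server side; the helper, the RatioSpec type, and the RateWindow type are illustrative and not part of the proposed design. Only the metric names and label values come from this doc.

```typescript
// Hypothetical helper: builds the hit-ratio PromQL string for one panel and window.
type RateWindow = "1h" | "30d";

interface RatioSpec {
  metric: string;   // counter name as proposed in this doc
  hitValue: string; // value of the `result` label that counts as a hit
}

const CAS: RatioSpec = { metric: "gf_reapi_cas_findmissingblobs_results_total", hitValue: "present" };
const AC: RatioSpec = { metric: "gf_reapi_ac_getactionresult_results_total", hitValue: "hit" };

// Escape backslashes and double quotes so a value is safe inside a PromQL string literal.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Render a label map as a PromQL selector, or an empty string when there are no labels.
function selector(labels: Record<string, string>): string {
  const parts = Object.keys(labels).map((k) => `${k}="${escapeLabelValue(labels[k])}"`);
  return parts.length > 0 ? `{${parts.join(",")}}` : "";
}

// The numerator adds the hit label to the filters; the denominator uses the filters alone.
function hitRatioQuery(spec: RatioSpec, window: RateWindow, filters: Record<string, string> = {}): string {
  const numeratorLabels = { ...filters, result: spec.hitValue };
  const numerator = `sum(rate(${spec.metric}${selector(numeratorLabels)}[${window}]))`;
  const denominator = `sum(rate(${spec.metric}${selector(filters)}[${window}]))`;
  return `${numerator} / ${denominator}`;
}

// One query per window, rendered side by side on the panel.
const casOperational = hitRatioQuery(CAS, "1h");
const casSlo = hitRatioQuery(CAS, "30d", { tenant: "default" });
```

Because the denominator is always the numerator's selector minus the result label, the two cannot drift apart when a tenant or mnemonic filter is added.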

Anti-definition — what this panel does NOT measure.

  • Not “fraction of bytes served from cache vs disk.” Byte-served ratio is a storage-tier metric; this panel is request-count ratio over digest existence.
  • Not “fraction of builds with no CAS misses.” Per-build aggregation is a different roll-up; this panel is per-digest-query.
  • Not action-cache hit rate. AC is a separate REAPI surface and has its own panel below. A high AC hit rate shrinks the CAS panel’s absolute query count (the action never gets executed, so its inputs are never FindMissingBlobs-queried); confusing the two will make the operator chase a phantom CAS regression that is actually an AC win.
  • Not Bazel-client --disk_cache hit rate. That’s local-disk, never reaches the wire, and is invisible to gf-reapi-cell.

Drill-down dimensions.

Dimension | Source | Cardinality concern
Tenant | instance_name from REAPI request | Bounded by spoke count (currently 0, target single-digit)
Mnemonic | command.arguments[0] extracted by the cell’s middleware | Bounded by the proved-and-eligible target classes in config/rbe-target-eligibility.json; long-tail capped (see Failure Modes)
Op | FindMissingBlobs vs BatchReadBlobs vs ByteStreamRead | Three values, no cardinality risk

Poison-signal inset. Below the main CAS panel, two single-stat cells render the poison signals (W5.5 / TIN-1481): gf_reapi_cas_bytes_evicted_referenced_total and gf_reapi_digest_mismatch_total{path="read|write"}. Both have target = 0; any nonzero value renders red and links to the alert runbook. They sit on the CAS panel because they’re CAS-substrate signals, but they are not folded into the hit-rate number — they live in a visually-separated inset, exactly because conflating “we have correctness damage” with “we’re getting fewer cache hits than we’d like” is the failure mode of every existing peer dashboard.

Panel 2 — AC Hit Rate

What it measures. The fraction of GetActionResult requests against gf-reapi-cell that returned a cached ActionResult (the action was already executed and its result is reusable) versus NOT_FOUND (the action must be executed). This is the per-action REAPI server-side view of “did we already compute this?”

Source. gf-reapi-cell’s /metrics endpoint. The cell counts every GetActionResult reply by status.

Proposed metric.

gf_reapi_ac_getactionresult_results_total{
  tenant="spoke-<slug>|default|system",
  mnemonic="CppCompile|GoCompile|TestRunner|GenRule|...",
  result="hit|miss"
}

Numerator / denominator.

Quantity | Formula
Numerator | sum(rate(gf_reapi_ac_getactionresult_results_total{result="hit"}[$window]))
Denominator | sum(rate(gf_reapi_ac_getactionresult_results_total[$window]))
Ratio | numerator / denominator

Anti-definition — what this panel does NOT measure.

  • Not “fraction of builds with zero remote actions.” Build-level “fully cached build” is a roll-up metric; this is per-action.
  • Not CAS hit rate. AC hits short-circuit before any CAS query for that action’s inputs happens. An AC hit is one GetActionResult returning OK; a CAS hit is one digest query returning “present.” A very high AC hit rate makes the CAS hit-rate denominator very small (only inputs to actions that miss in AC get queried), which is fine and expected; see the cross-panel note below.
  • Not the same as Bazel’s analysis cache. Analysis cache is loaded-graph reuse; AC is executed-action result reuse. A build can have 100% analysis cache hits and 0% AC hits (analysis says “I know what the graph is” without saying “the result is already there”).
  • Not UpdateActionResult rate. Writes are a separate metric, attestation-scoped per TIN-1462. The hit-rate panel is read-side only.

Drill-down dimensions.

Dimension | Source | Cardinality concern
Tenant | instance_name | Same as CAS
Mnemonic | command.arguments[0] at the moment of GetActionResult, extracted from the embedded Action/Command proto since the action digest lookup doesn’t itself carry the mnemonic | Same as CAS; long-tail capped

Cross-panel note. AC hits reduce the CAS hit-rate denominator. This is mechanical, not a regression. The dashboard does not “correct” for this: each panel reports its own ratio honestly, and the operator reads the pair together. A degraded build has both panels drop. A genuinely healthy build sees AC climb and CAS stay flat-to-moderate. The panel-header copy says this explicitly so the operator doesn’t go hunting.
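The mechanical coupling in the cross-panel note is easier to see with numbers. The sketch below is a toy model, not a description of real traffic: it assumes every action that misses in AC queries a fixed number of input digests, and every figure passed in is hypothetical. The names BuildModel and panelReadings are illustrative only.

```typescript
// Illustrative model only: all inputs are hypothetical, chosen to show the coupling.
interface BuildModel {
  actions: number;         // actions looked up via GetActionResult
  inputsPerAction: number; // digests queried per action that has to execute
  acHitRate: number;       // fraction of actions served from AC
  casPresentRate: number;  // fraction of queried digests already present
}

interface PanelReadings {
  acRatio: number;
  casQueries: number;      // the CAS panel's denominator
  casRatio: number | null; // null when nothing was queried
}

function panelReadings(m: BuildModel): PanelReadings {
  const acHits = m.actions * m.acHitRate;
  const executed = m.actions - acHits;             // only AC misses go on to execute
  const casQueries = executed * m.inputsPerAction; // so only their inputs are queried
  const casHits = casQueries * m.casPresentRate;
  return {
    acRatio: acHits / m.actions,
    casQueries,
    casRatio: casQueries === 0 ? null : casHits / casQueries,
  };
}

// Same CAS behaviour, different AC hit rates.
const coldAc = panelReadings({ actions: 1000, inputsPerAction: 10, acHitRate: 0.2, casPresentRate: 0.9 });
const warmAc = panelReadings({ actions: 1000, inputsPerAction: 10, acHitRate: 0.8, casPresentRate: 0.9 });
```

In this model the CAS query count falls to a quarter as AC warms up while the CAS ratio does not move, which is the pattern the panel-header copy asks the operator to expect.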

Panel 3 — Analysis Cache Hit Rate (Bazel-side)

What it measures. The fraction of Bazel ConfiguredTarget evaluations during a build that were satisfied from the client-side skyframe / analysis cache (the in-memory analysis cache or workspace-persisted state) without re-running analysis. This is the Bazel-client view of “did we already know what this target looks like?”

Source. Bazel BES (Build Event Stream) events, captured by a gf-bes-collector sidecar service. Not gf-reapi-cell. BES is emitted by the Bazel client and is independent of which RBE backend is on the far end.

The relevant event types: AnalysisProgress, LoadingProgress, ConfiguredTargetEvent, and the BuildMetadata event that carries analysis_cache_hits and analysis_cache_misses counts in the trailing metrics. Bazel does not expose these as Prometheus metrics natively; the BES collector reads them off the event stream and re-emits as a counter.

Proposed metric.

gf_bazel_analysis_cache_hits_total{
  invocation_id="<bes invocation uuid>",
  build_target_set="<canonical //...:all label set hash>"
}
gf_bazel_analysis_cache_misses_total{
  invocation_id, build_target_set
}

The invocation_id is per-build (high cardinality, but bounded by build count); aggregation is by build_target_set (which approximates “what build is this — //app:build, //docs-site:build, etc.”).

Numerator / denominator.

Quantity | Formula
Numerator | sum(gf_bazel_analysis_cache_hits_total) by (build_target_set)
Denominator | numerator + sum(gf_bazel_analysis_cache_misses_total) by (build_target_set)
Ratio | numerator / denominator

The denominator is the total analysis evaluations for that build, not a time-window rate. Analysis hit rate is a per-build property, aggregated across recent builds in the dashboard’s chosen window.
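As a rough illustration of the per-build aggregation, the sketch below pools hit and miss counts by build_target_set and divides once per set, mirroring the formula table. The BuildSample shape and the function name are assumptions for the example, not the collector's actual types.

```typescript
// Hypothetical per-build record, as the collector might hand it to the dashboard.
interface BuildSample {
  invocationId: string;
  buildTargetSet: string;
  hits: number;
  misses: number;
}

// Pool hits and misses per build_target_set, then divide once per set.
function analysisHitRateBySet(builds: BuildSample[]): Record<string, number> {
  const totals: Record<string, { hits: number; misses: number }> = {};
  for (const b of builds) {
    const t = totals[b.buildTargetSet] || { hits: 0, misses: 0 };
    t.hits += b.hits;
    t.misses += b.misses;
    totals[b.buildTargetSet] = t;
  }
  const ratios: Record<string, number> = {};
  for (const set of Object.keys(totals)) {
    const t = totals[set];
    const denominator = t.hits + t.misses;
    // A set with no evaluations has no ratio; it is omitted rather than reported as 0.
    if (denominator > 0) ratios[set] = t.hits / denominator;
  }
  return ratios;
}
```

Pooling counts before dividing weights large builds more heavily than small ones; averaging per-build ratios would be a different statistic and is not what the formula table describes.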

Anti-definition — what this panel does NOT measure.

  • Not REAPI-side. Analysis cache is the Bazel client’s skyframe — it never touches gf-reapi-cell. A build can hit 100% analysis cache and still send every action to RBE for execution; it can hit 0% analysis cache and still get 100% AC hits.
  • Not --disk_cache for action outputs. --disk_cache covers both AC and CAS; this panel covers analysis only, which is a third orthogonal cache layer in the Bazel architecture.
  • Not “the build was fast.” A high analysis cache hit can coexist with slow remote execution. Latency is a sibling panel set (see slo.md p50/p95/p99 rows; W5.4 / TIN-1480 owns the TTFCH dashboard).
  • Not per-tenant. Analysis happens on the Bazel client, not the cell — Bazel has no concept of instance_name. The drill-down does not split by tenant.

Drill-down dimensions.

Dimension | Source | Cardinality concern
Build target set | BES BuildMetadata event, canonicalised | Bounded by the proved target classes
Invocation | BES invocation_id | High but per-build; aggregated, not enumerated
Not by mnemonic | n/a | Analysis is whole-build, not per-action
Not by tenant | n/a | Bazel client doesn’t know about tenants

Drill-Down Design

The dashboard exposes mnemonic and tenant as filters at the top of the page, applied uniformly to the CAS and AC panels (analysis ignores them, by definition above). The filter chips render the current selection so the operator can see at a glance whether they’re looking at aggregate or sliced data.

Filter bar (top of page).

[ Tenant: ▼ all spokes ] [ Mnemonic: ▼ all mnemonics ] [ Window: ▼ rolling 1h | 30d ]
  • Tenant is a multi-select dropdown sourced from the live set of instance_name values seen in the last 30 days, plus the reserved default and system. Default selection: all. Selecting one chip narrows every CAS/AC panel and table to that tenant.
  • Mnemonic is a multi-select sourced from the mnemonic label set on gf_reapi_* metrics, capped to the top-N by traffic (default N=12), with everything else grouped as other. Default selection: all.
  • Window toggles 1h (operational) and 30d (SLO compliance) views.
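The top-N-plus-other grouping for the mnemonic filter could look roughly like the sketch below. It assumes per-mnemonic request totals are already available for the window; the MnemonicTraffic type and capMnemonics function are hypothetical names, and ties are broken alphabetically only to keep the output deterministic.

```typescript
interface MnemonicTraffic {
  mnemonic: string;
  requests: number;
}

// Keep the n busiest mnemonics and fold the remainder into a single "other" row.
function capMnemonics(traffic: MnemonicTraffic[], n: number = 12): MnemonicTraffic[] {
  const sorted = traffic.slice().sort((a, b) => {
    if (b.requests !== a.requests) return b.requests - a.requests;
    return a.mnemonic < b.mnemonic ? -1 : a.mnemonic > b.mnemonic ? 1 : 0;
  });
  const head = sorted.slice(0, n);
  const tail = sorted.slice(n);
  if (tail.length === 0) return head;
  const otherRequests = tail.reduce((sum, t) => sum + t.requests, 0);
  return head.concat([{ mnemonic: "other", requests: otherRequests }]);
}
```

The input array is copied before sorting, so the caller's data is left untouched.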

Default landing view. All tenants, all mnemonics, 1h rolling. Three big panels (CAS / AC / analysis) across the top. Below each, a small drill-down table.

Drill from aggregate to mnemonic. Click on any CAS or AC panel — the panel expands into a stacked-bar breakdown by mnemonic, showing each mnemonic’s contribution to total hits and misses. The aggregate ratio stays visible above the breakdown.

Drill from mnemonic to tenant. Click a specific mnemonic bar — the breakdown re-pivots to show that mnemonic’s hit rate split by tenant, as a small-multiples grid (one tile per tenant) or a horizontal bar chart depending on tenant count.

Drill from tenant back to mnemonic. Selecting a tenant chip in the filter bar narrows every panel and table to that tenant; clicking into a panel then shows that tenant’s mnemonic breakdown.

The drill never inverts: tenant is always the outer filter (set on the filter bar); mnemonic is always the inner one (set by clicking a panel). This keeps the navigation deterministic — the operator can’t end up in a state where they don’t know which tenant the numbers belong to. Per-tenant fairness is W5.3’s panel (TIN-1479), and uses the same tenant primitive but does the cross-tenant comparison the cache-hit dashboard deliberately doesn’t.

Worked example — operator drill. It’s 0300; the on-call pager fires “CAS hit rate burn 50% of 7d budget.” The operator opens the dashboard:

  1. Default landing: CAS panel reads 84% (target 90%, budget burning). AC panel reads 71% (target 70%, fine). Analysis reads 83% (target 80%, fine).
  2. Operator clicks the CAS panel. Mnemonic breakdown: CppCompile 93%, GoCompile 91%, TestRunner 88%, GenRule 41% — outlier.
  3. Operator clicks the GenRule bar. Tenant breakdown: spoke-elders 89%, default 88%, spoke-blahaj 12% — outlier.
  4. Operator opens spoke-blahaj’s GenRule action evidence (link out from the panel to the AC writer attestation surface or the BES events for recent spoke-blahaj invocations). Discovers a recent spoke BUILD.bazel change introduced a non-deterministic genrule whose digest churns on every run.
  5. Total time from page to root cause: under 90 seconds, because the drill path is deterministic.

Compare against the failure mode the design is preventing: a single conflated “cache hit rate” panel reading 79%, with no decomposition, sending the operator to grep the cell’s logs for two hours.

Metric Source Plumbing

Two streams, two paths, one aggregation surface.

Stream 1 — gf-reapi-cell/metrics → Prometheus → SvelteKit.

  • gf-reapi-cell exports a Prometheus /metrics endpoint on its admin port (proposed: :9090 separate from the gRPC port :8980). The cell does not export today; W5.2 is the workstream that lights this up. The metrics taxonomy:
Metric | Type | Labels
gf_reapi_cas_findmissingblobs_results_total | counter | tenant, mnemonic, result
gf_reapi_cas_read_results_total | counter | tenant, mnemonic, op, result
gf_reapi_ac_getactionresult_results_total | counter | tenant, mnemonic, result
gf_reapi_ac_updateactionresult_results_total | counter | tenant, mnemonic, result
gf_reapi_cas_bytes_evicted_total | counter | tenant
gf_reapi_cas_bytes_evicted_referenced_total | counter | tenant (poison signal)
gf_reapi_digest_mismatch_total | counter | path initially; tenant, op, hash_function proposed (poison signal)
gf_reapi_action_latency_seconds | histogram | tenant, mnemonic, op
  • A Prometheus instance scrapes gf-reapi-cell on the gf-rbe cluster-internal endpoint. The Prom instance is operator-internal — it does not sit on the public ingress.
  • The SvelteKit dashboard at app/src/routes/cache/+page.server.ts queries the Prom HTTP API server-side, formats the response into panel-shaped JSON, and renders into app/src/routes/cache/+page.svelte.

Stream 2 — Bazel BES → gf-bes-collector → Prometheus → SvelteKit.

  • Bazel clients are configured (via .bazelrc) with --bes_backend=grpc://gf-bes-collector.gf-rbe.svc.cluster.local:1985.
  • The gf-bes-collector service receives the BES stream, reads analysis_cache_hits and analysis_cache_misses off the trailing BuildMetadata event, and re-emits as Prom counters labeled with invocation_id and build_target_set.
  • Same Prom instance scrapes; same SvelteKit endpoint queries.

Where the BES collector lives — open question. It can be a sidecar inside gf-reapi-cell’s pod (one binary, one deploy) or a sibling service in the gf-rbe namespace. Recommendation pending in Open Questions.

Why not scrape a peer’s /metrics. Restating the framing: GloriousFlywheel is an in-house peer to Buildbarn, BuildBuddy, NativeLink, and bazel-remote. gf-reapi-cell is the product authority for every REAPI byte. The dashboard’s metric source is gf-reapi-cell’s own emitter, not a sidecar that translates from a peer system, not a federated scrape of a vendor’s metrics, not an OTel translation from a vendor-emitted signal. The work to light up gf-reapi-cell’s /metrics endpoint is in scope for W5.2; the work to integrate against a peer’s metrics endpoint is explicitly out of scope for this design and the broader RBE Production Readiness initiative. This boundary matters because the dashboard’s credibility as an operating instrument depends on the metrics it surfaces being authored by the same code that’s doing the work — which is the same reason cas-primitives.md and instance-name-routing-design.md keep their data-plane primitives in-house.

SvelteKit page contract. The dashboard page at app/src/routes/cache/+page.server.ts exports a load() that issues N Prom queries in parallel (one per panel × per window × per drill state), reduces them into a single typed CacheDashboardSnapshot object, and passes it to +page.svelte. The snapshot type lives in app/src/lib/types/cache-dashboard.ts (proposed); the typed shape is the contract between server and client. Stale-metric detection happens in the server load(): any panel whose latest scrape is over the staleness threshold gets a { stale: true, last_seen_at: ... } flag on its slice of the snapshot, and the Svelte component renders the banner.
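One possible shape for the snapshot and its staleness flag is sketched below. Only the type name CacheDashboardSnapshot and the stale / last_seen_at fields come from this doc; the remaining fields, the PanelResult input type, and the function names are assumptions. The threshold uses the 2× scrape-interval rule from the Failure Modes table.

```typescript
// Hypothetical shape; the doc fixes only the type name and the stale/last_seen_at fields.
type PanelId = "cas" | "ac" | "analysis";

interface PanelSlice {
  ratio: number | null;
  stale: boolean;
  last_seen_at: string | null; // ISO timestamp of the latest sample, if any
}

type CacheDashboardSnapshot = Record<PanelId, PanelSlice>;

// What a reduced Prom response is assumed to look like before it becomes a slice.
interface PanelResult {
  ratio: number | null;
  lastSampleMs: number | null;
}

function toSlice(r: PanelResult, nowMs: number, scrapeIntervalMs: number): PanelSlice {
  const thresholdMs = 2 * scrapeIntervalMs;
  // A panel with no sample at all is treated as stale, never as healthy.
  const stale = r.lastSampleMs === null || nowMs - r.lastSampleMs > thresholdMs;
  return {
    ratio: r.ratio,
    stale,
    last_seen_at: r.lastSampleMs === null ? null : new Date(r.lastSampleMs).toISOString(),
  };
}

function buildSnapshot(
  results: Record<PanelId, PanelResult>,
  nowMs: number,
  scrapeIntervalMs: number,
): CacheDashboardSnapshot {
  return {
    cas: toSlice(results.cas, nowMs, scrapeIntervalMs),
    ac: toSlice(results.ac, nowMs, scrapeIntervalMs),
    analysis: toSlice(results.analysis, nowMs, scrapeIntervalMs),
  };
}
```

Passing nowMs in rather than calling Date.now() inside keeps the staleness logic deterministic under test.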

Authentication. Prometheus scraping is cluster-internal; no public endpoint. The SvelteKit surface enforces E4 IAM (TIN-1473) on the read side: operator role sees all tenants, tenant-scoped role sees only their own (per TIN-153 role views).
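The read-side filter can be sketched as a pure function that decides the legal tenant set before any query is built. This is an assumption-laden illustration: the DashboardSession type, both function names, and the tenant-name pattern are hypothetical, and the real role model is whatever E4 IAM defines.

```typescript
// Hypothetical session shape; the real one comes from E4 IAM.
type DashboardSession =
  | { role: "operator" }
  | { role: "tenant"; tenant: string };

// Decide which tenants a caller may query. A tenant-role caller's request is
// ignored entirely, so asking for another tenant cannot widen the result.
function allowedTenants(session: DashboardSession, requested: string[], known: string[]): string[] {
  if (session.role === "operator") {
    if (requested.length === 0) return known.slice();
    return requested.filter((t) => known.indexOf(t) !== -1);
  }
  return [session.tenant];
}

// Assumed conservative tenant-name pattern, so names need no regex escaping.
const TENANT_NAME = /^[a-z0-9-]+$/;

// Build the PromQL label matcher for the legal tenant set.
function tenantMatcher(tenants: string[]): string {
  if (tenants.length === 0) throw new Error("refusing to build a matcher for an empty tenant set");
  for (const t of tenants) {
    if (!TENANT_NAME.test(t)) throw new Error("unexpected tenant name: " + t);
  }
  return `tenant=~"${tenants.join("|")}"`;
}
```

Failing closed on an empty or malformed tenant set means a bug upstream produces an error page rather than an unfiltered query.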

Panel Layout

ASCII mockup of the dashboard page at default landing. Three top-level panels; filter bar above; drill-down tables below; sidebar holds tenant detail when drilled.

┌─────────────────────────────────────────────────────────────────────────────┐
│  GloriousFlywheel · Cache Hit Dashboard                                     │
│  [ Tenant: all spokes ▼ ]  [ Mnemonic: all ▼ ]  [ Window: 1h ▼ ]            │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐        │
│  │  CAS hit rate     │  │  AC hit rate      │  │  Analysis hit     │        │
│  │                   │  │                   │  │                   │        │
│  │       92.4 %      │  │       73.1 %      │  │       84.6 %      │        │
│  │  ╱╲╱╲╱─╲╱╲ trend  │  │  ╱╲─╱╲╱─╲╱ trend  │  │  ─╱╲╱─╲╱─ trend   │        │
│  │  SLO ≥ 90 %       │  │  SLO ≥ 70 %       │  │  SLO ≥ 80 %       │        │
│  │  budget: 81 %     │  │  budget: 64 %     │  │  budget: 92 %     │        │
│  ├───────────────────┤  ├───────────────────┤  ├───────────────────┤        │
│  │  poison:          │  │  (write rejects:  │  │  (no poison       │        │
│  │   evict-ref: 0    │  │   see attestation │  │    signals at     │        │
│  │   digest-mismatch │  │   dashboard)      │  │    this layer)    │        │
│  │   : 0             │  │                   │  │                   │        │
│  └───────────────────┘  └───────────────────┘  └───────────────────┘        │
├─────────────────────────────────────────────────────────────────────────────┤
│  By mnemonic (top 12, click to drill):                                      │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │ CppCompile    ████████████████░░░░  CAS 94%  AC 78%             │        │
│  │ GoCompile     ███████████████░░░░░  CAS 91%  AC 71%             │        │
│  │ TestRunner    █████████████░░░░░░░  CAS 88%  AC 62%             │        │
│  │ JsRunBinary   ████████████░░░░░░░░  CAS 86%  AC 69%             │        │
│  │ GenRule       ██████████░░░░░░░░░░  CAS 83%  AC 55%             │        │
│  │ … (cap at 12; tail grouped as `other`)                          │        │
│  └─────────────────────────────────────────────────────────────────┘        │
├─────────────────────────────────────────────────────────────────────────────┤
│  By tenant (operator view; tenant-role users see only their own):           │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │ spoke-elders  CAS 93% AC 74% (skew: 0.97×)                      │        │
│  │ spoke-blahaj  CAS 91% AC 71% (skew: 1.02×)                      │        │
│  │ default       CAS 89% AC 68% (skew: 1.11×) ← migration tail     │        │
│  │ system        CAS 99% AC --   (probe traffic only)              │        │
│  └─────────────────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘

The panel uses the same compact-density convention as app/src/routes/runners/ and app/src/routes/monitoring/. The trend sparkline below each headline number is a 30d view regardless of selected window, so the operator sees both “right now” and “rolling history” without toggling.

SLO Overlay

Each of the three top-level panels carries the following overlay:

Element | Source | Notes
Current value | Prom query, rolling 1h | Headline number
SLO target | slo.md table | Hardcoded in the panel config, NOT scraped (single source of truth is the doc)
Error budget remaining (30d) | computed | (1 - actual_miss_rate / target_miss_rate) × 100%
Trend sparkline | Prom range query, 30d | Renders below headline
Budget-burn rate | computed | Highlighted red if 7d burn rate exceeds 2× nominal
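The error-budget row in the overlay table reduces to a few lines of arithmetic. The sketch below assumes hit rates are expressed as fractions between 0 and 1 and derives miss rates from them; the function name is hypothetical.

```typescript
// Error budget remaining, as a percentage: (1 - actual_miss_rate / target_miss_rate) × 100.
function errorBudgetRemainingPct(actualHitRate: number, targetHitRate: number): number {
  const actualMissRate = 1 - actualHitRate;
  const targetMissRate = 1 - targetHitRate;
  if (targetMissRate <= 0) {
    throw new Error("a 100% target leaves no miss budget to measure against");
  }
  return (1 - actualMissRate / targetMissRate) * 100;
}

// Hypothetical 30d readings against the 90% CAS target:
// 0.92 leaves roughly 20% of the budget; 0.84 gives roughly -60%, meaning the budget is overspent.
const healthyBudget = errorBudgetRemainingPct(0.92, 0.9);
const overspentBudget = errorBudgetRemainingPct(0.84, 0.9);
```

A negative result is left as-is rather than clamped to zero, so the panel can show how far past the budget the window has gone.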

Cross-references to slo.md targets.

  • CAS hit ≥ 90% on warm CI (excluding cold-clone first 60s).
  • AC hit ≥ 70% on warm CI.
  • Analysis hit ≥ 80% on warm CI.

The cold-clone-exclusion logic for CAS hit rate is implemented in the Prom query, not in the cell’s emitter: the query filters out the first 60 seconds of any new-runner session (identified by a runner_session_age label, proposed — see Open Questions). The cell counts everything; the dashboard chooses what to expose against SLO.

Why the SLO target is hardcoded into the panel rather than scraped. The SLO doc is the single source of truth for the target. If the doc says ≥ 90% and a future change wants to relax it to ≥ 85%, the change happens in slo.md and propagates to the panel config via a code update — not via a config-flip that bypasses the doc. This is the same discipline config/rbe-target-eligibility.json enforces for target-class promotion: the doc is the gate, not the config. A panel that reads its target from a live config can be silently relaxed mid-incident; a panel whose target lives in a committed doc cannot.

Burn-rate computation. The dashboard computes the remaining budget fraction as: (target_miss_rate × budget_window_seconds - elapsed_misses) / (target_miss_rate × budget_window_seconds). If this fraction falls to 50% (half the budget consumed) in less than 25% of the window, the panel renders red and links to the on-call runbook. The math is intentionally simple: burn-rate alerting (multi-window, multi-burn-rate) is a sibling concern (W5.6, alert routing); this dashboard exposes the input signal, not the alert.
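A minimal sketch of the fast-burn check follows. It assumes target_miss_rate is an allowed number of misses per second, so that multiplying by the window length gives a total miss allowance; that unit interpretation, the BurnInput type, and the function names are assumptions made for the example.

```typescript
interface BurnInput {
  targetMissRate: number;      // assumed unit: allowed misses per second
  budgetWindowSeconds: number; // length of the budget window
  elapsedSeconds: number;      // how far into the window we are
  elapsedMisses: number;       // misses observed so far in the window
}

// Fraction of the miss budget still unspent; negative once the budget is exceeded.
function budgetRemainingFraction(b: BurnInput): number {
  const allowedMisses = b.targetMissRate * b.budgetWindowSeconds;
  return (allowedMisses - b.elapsedMisses) / allowedMisses;
}

// Fast burn: at least half the budget consumed in under a quarter of the window.
function isFastBurn(b: BurnInput): boolean {
  const consumed = 1 - budgetRemainingFraction(b);
  const windowElapsed = b.elapsedSeconds / b.budgetWindowSeconds;
  return consumed >= 0.5 && windowElapsed < 0.25;
}
```

The check is a single threshold on purpose; it is the input signal for the panel colour, not a replacement for the multi-window alerting owned by W5.6.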

The per-tenant drill-down in this dashboard is the visible primitive that W5.3 / TIN-1479 (fairness panel) reads from. This dashboard shows per-tenant hit rates as a flat table; the fairness panel computes derived metrics (queue-time skew, hit-rate skew, eviction-share) across tenants and renders the cross-tenant comparison. The cache-hit dashboard answers “how is each tenant’s cache doing?”; the fairness panel answers “are tenants getting a fair share?” Same data, two different presentation layers.

Both panels group by instance_name (per TIN-1472). When TIN-1479 lands, the per-tenant table at the bottom of this dashboard adds a deep-link to the fairness view for any tenant whose skew exceeds the W4.4 quota-enforcement budget.

Time-to-first-cache-hit (TTFCH) is its own dashboard / panel set, owned by W5.4 / TIN-1480. TTFCH measures wall-time from git clone complete to first observed remote cache hit on a representative target — it is fundamentally a latency metric, not a hit-rate metric. The first dashboard JSON contract lives at docs/monitoring/gf-runner-ttfch-dashboard.json; the cache-hit dashboard links to the TTFCH page from the panel header, but does not display TTFCH metrics inline.

The relationship: a regression on TTFCH does not necessarily move any of the three hit-rate panels (the cache might still be 92% hit, just slow to first-touch on a cold runner). A regression on cache hit rate does not necessarily move TTFCH (the first hit might come on time and the rest of the build catches few). Keep them separate.

Two poison signals render as an inset under the CAS panel, with the following posture:

  • gf_reapi_cas_bytes_evicted_referenced_total — bytes evicted from CAS while still referenced by an in-flight action. Target = 0. Any nonzero value pages immediately.
  • gf_reapi_digest_mismatch_total{path="read|write"} — count of digest mismatches observed on CAS read or write paths. Target = 0. Any nonzero value pages immediately. AC lookup labeling/provenance is still a follow-on slice.

These are not budgeted and not folded into the CAS hit-rate number. They are correctness invariants, not performance SLIs. The inset is on the CAS panel because they are CAS-substrate signals; the operator sees them in context, but the dashboard’s visual hierarchy makes clear they are not the hit-rate story. The alert routing and runbook attachment for these signals is owned by W5.5 / TIN-1481.

A useful test of this design: if a poison signal fires while CAS hit rate is at 92%, the dashboard does not let the operator be reassured by the 92%. The 92% is honest, and the poison signal is independent — the panel renders both, and the runbook says the poison signal wins.

The reverse test also holds: if CAS hit rate is degrading toward the budget edge but no poison signal has fired, the operator knows the failure is performance-shaped (eviction policy, capacity, working-set shift) rather than correctness-shaped (digest collision, premature eviction of live references). The two channels are independently informative, and the dashboard layout makes the distinction legible at a glance — no operator should ever conflate “we are getting fewer cache hits than the SLO says we should” with “the cache may be returning the wrong data,” because the design isolates the two signals on the same panel with deliberate visual separation. This is a load-bearing claim against peer dashboards in this space that fold eviction-rate-while-referenced into a generic “eviction rate” panel that gets lost next to hit-rate trends.
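The independence of the two channels can be expressed as a small pure function: the hit rate passes through untouched, the poison flag is computed separately, and only the runbook ordering lets poison take precedence. All names below are hypothetical, and the sketch compares against the SLO target directly rather than modelling budget burn.

```typescript
interface CasPanelInput {
  hitRate: number;
  targetHitRate: number;
  evictedReferencedBytes: number; // gf_reapi_cas_bytes_evicted_referenced_total
  digestMismatches: number;       // gf_reapi_digest_mismatch_total
}

interface CasPanelState {
  hitRate: number;       // rendered exactly as measured
  belowTarget: boolean;  // performance channel
  poisonFiring: boolean; // correctness channel
  firstRunbook: "poison" | "hit-rate" | null;
}

function casPanelState(p: CasPanelInput): CasPanelState {
  // Both poison signals have target = 0, so any nonzero value fires.
  const poisonFiring = p.evictedReferencedBytes > 0 || p.digestMismatches > 0;
  const belowTarget = p.hitRate < p.targetHitRate;
  return {
    hitRate: p.hitRate,
    belowTarget,
    poisonFiring,
    // Poison wins the runbook ordering even when the hit rate looks healthy.
    firstRunbook: poisonFiring ? "poison" : belowTarget ? "hit-rate" : null,
  };
}
```

Keeping belowTarget and poisonFiring as separate fields, instead of collapsing them into one status value, is what lets the component render them in visually separate regions.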

Failure Modes (of the Dashboard Itself)

The dashboard can lie. Here are the ways, and the defenses.

Failure Impact Design defense Residual risk
Metric source goes silent (cell crashes, scraper fails, BES collector down) Dashboard renders stale data with no indicator; operator believes the system is fine Stale-metric alert: any panel whose latest scrape is > 2× scrape-interval old renders a “STALE” banner. Prom up{job="gf-reapi-cell"} alert pages independently of the data. Operator may dismiss the banner during a known-maintenance window; mitigated by maintenance-window markers on the dashboard.
Wrong denominator on a panel Number looks plausible, isn’t; trust-erosion when later discovered Explicit panel spec in this doc; every panel header carries a tooltip with the numerator/denominator formula; PromQL queries reviewed against this doc before landing Drift between this doc and the deployed query if the doc isn’t updated when the query changes; mitigated by a doc-test in CI (proposed) that asserts the live query matches what this doc says.
Mnemonic cardinality explosion | Prom storage cost spikes; dashboard render slows; rare mnemonics drown out signal | Cap mnemonic label set at top-12 by traffic per scrape window; everything else grouped as other. Cap enforced at the emitter (cell-side relabel-style logic), not the dashboard. | A genuinely-new important mnemonic gets folded into other until the operator promotes it; mitigated by a “candidate mnemonics” view showing the top-of-other traffic.
Tenant cardinality explosion | Same as mnemonic, scaled by tenant count | Cap tenant count similarly; alert if count(distinct instance_name) exceeds a threshold (proposed: 50) — that’s the signal of an attack or a misconfigured caller spraying invented instance_name values. The validator regex in instance-name-routing-design.md already rejects malformed instance names at the cell perimeter; this is belt-and-suspenders on the metric side. | A legitimate growth event (the project actually has 50 spokes) requires raising the cap; that’s a one-line PR, not a redesign.
Cross-tenant data leak in dashboard | Tenant A’s user sees Tenant B’s hit-rate data | Dashboard auth respects E4 IAM (TIN-1473): the SvelteKit +page.server.ts extracts the caller’s role from the session, filters Prom queries to the legal tenant set before sending. Operator role sees all; tenant role sees only their own. | Bug in the filter logic could leak — mitigated by an integration test that asserts a tenant-role session cannot pull another tenant’s metric via direct API call (test lives in app/scripts/, proposed).
Panel conflates CAS / AC / analysis | The exact failure mode this dashboard is designed to prevent | Three separate Prom queries, three separate metric names, three separate panels, three anti-definitions in this doc, panel header copy that names what each one measures and links to this doc | Operator habit: “cache hit rate” is a familiar shorthand and operators may verbally aggregate the three. Mitigated by never showing an aggregated “cache hit rate” number anywhere on the dashboard.
Trend sparkline confuses with current value | Operator sees a downward trend and pages, but current value is fine | Sparkline always renders 30d trend; headline always renders current-window value. Both labeled explicitly. Sparkline never includes the current window’s value (which can spike on partial-window data) | Confusion remains possible; mitigated by tooltips on hover.
Cold-clone window not excluded correctly | First 60s of every new runner session poisons CAS hit rate downward; operator chases a phantom regression every time a runner cycles | runner_session_age label on cell-side metric; dashboard Prom query filters > 60s. Verified against the SLO doc’s cold-clone exclusion language. | If runner_session_age is unreliable (clock skew, missing emit), the exclusion silently does nothing. Mitigated by a synthetic test that asserts a cold-clone-only window reads as “insufficient data” rather than a low hit-rate.
Tenant role view leaks “I am one of many” | A tenant-role user sees their own panel reading 71% AC hit, infers the system has multiple tenants, infers tenant count from page layout | Tenant role sees only their own panel; no cross-tenant comparison surface; no tenant count anywhere; no system-aggregate row | Side-channel: response timing might reveal aggregate state. Acceptable residual; the dashboard is operator-internal in steady state.
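
The emitter-side mnemonic cap from the table above could look roughly like the sketch below. It assumes per-window lookup counts are already available as a map keyed by mnemonic; `capMnemonics`, `OTHER_LABEL`, and the input shape are hypothetical names for illustration, not gf-reapi-cell's actual emitter API.

```typescript
// Hypothetical sketch: fold all but the top-N mnemonics into "other".
// Function name, label constant, and input shape are illustrative assumptions.

const OTHER_LABEL = "other";

interface CappedMnemonics {
  /** Counts keyed by emitted label: the top-N mnemonics plus "other". */
  emitted: Map<string, number>;
  /** Mnemonics folded into "other", highest traffic first (the "candidates" view). */
  candidates: Array<[string, number]>;
}

function capMnemonics(counts: ReadonlyMap<string, number>, topN: number): CappedMnemonics {
  // Rank by traffic descending; break ties by name so the output is deterministic.
  const ranked = Array.from(counts.entries()).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
  const emitted = new Map<string, number>(ranked.slice(0, topN));
  const candidates = ranked.slice(topN);
  if (candidates.length > 0) {
    const folded = candidates.reduce((sum, [, n]) => sum + n, 0);
    // Add to any existing "other" entry instead of overwriting it.
    emitted.set(OTHER_LABEL, (emitted.get(OTHER_LABEL) ?? 0) + folded);
  }
  return { emitted, candidates };
}

// Example with N=2 for brevity (the doc proposes N=12).
const capped = capMnemonics(
  new Map([["CppCompile", 90], ["Javac", 50], ["Genrule", 7], ["GoLink", 5]]),
  2,
);
// capped.emitted    holds CppCompile: 90, Javac: 50, other: 12
// capped.candidates holds [["Genrule", 7], ["GoLink", 5]]
```

Returning the folded entries alongside the capped set keeps the "candidate mnemonics" view cheap: the same ranking pass produces both, and only the capped set would ever become label values.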

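The role-based scoping in the cross-tenant leak row can be sketched as a pure function that runs before any Prom query is built. This is a minimal illustration under stated assumptions: the `Session` shape, `legalTenantSet`, and `instanceNameMatcher` are hypothetical, and the real E4 IAM session type and +page.server.ts wiring are not defined in this doc. Only the `instance_name` label name comes from the design.

```typescript
// Hypothetical session shape; the real E4 IAM session type is not defined here.
type Session =
  | { role: "operator" }
  | { role: "tenant"; tenant: string };

class ForbiddenTenantError extends Error {}

// Returns the tenants this session may query. `requested` is the tenant the
// caller asked for, or null for "everything I am allowed to see".
function legalTenantSet(
  session: Session,
  requested: string | null,
  knownTenants: readonly string[],
): string[] {
  if (session.role === "operator") {
    if (requested === null) return [...knownTenants];
    return knownTenants.includes(requested) ? [requested] : [];
  }
  // Tenant role: only ever its own instance_name, whatever was requested.
  if (requested !== null && requested !== session.tenant) {
    throw new ForbiddenTenantError("tenant role may only query its own instance_name");
  }
  return [session.tenant];
}

// Builds an exact-match PromQL label matcher, escaping backslashes and quotes.
function instanceNameMatcher(tenant: string): string {
  const escaped = tenant.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `instance_name="${escaped}"`;
}
```

Keeping the decision in a function with no I/O is what makes the proposed integration test cheap to back with unit tests: a tenant-role session asking for another tenant fails before a query string exists.
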
Integration with Siblings

Where this dashboard fits in the broader observability work.

  • TIN-1477 — W5.1 SLO definitions (slo.md). This dashboard surfaces the SLIs that doc defines. Targets and budgets are read from the SLO doc; this dashboard does not redefine them.
  • TIN-1479 — W5.3 fairness panel. Per-tenant drill-down here is the primitive the fairness panel consumes. Shared instance_name grouping; complementary presentation.
  • TIN-1480 — W5.4 TTFCH probe. Sibling dashboard / panel set. Cross-linked from panel headers; not displayed inline.
  • TIN-1481 — W5.5 poison alerts. Poison signals inlined as an inset on the CAS panel. Alert routing owned by W5.5; visual surface owned here.
  • TIN-154 — MCP wrap (re-parented under E5). When the MCP wrap lands, the dashboard’s data is exposed to the MCP surface through it. Today the dashboard reads Prom directly; under MCP wrap it reads through the wrap.
  • TIN-155 — API envelope (re-parented under E5). Same — the dashboard’s HTTP API surface conforms to the envelope shape when that lands.
  • TIN-153 — role views. Operator vs tenant scoping uses this. Operator role sees all tenants; tenant role sees only their own.
  • TIN-1462 — AC writer attestation. AC write rejection counters live on a sibling attestation dashboard; this dashboard cross-links but does not duplicate.
  • TIN-1472 — instance_name routing. The tenant drill-down depends on this routing primitive existing and being honest. Without it, “tenant” is decoration.

Open Questions

These must be resolved before W5.2 closes. Each is named so it can become a sub-ticket under TIN-1478.

  1. Prometheus vs OTel as the metric authority. Cross-link slo.md Open Question 1. This dashboard’s design assumes Prometheus because the existing monitoring/ route already speaks Prom, and Bazel’s BES → Prom translation is straightforward. If gf-reapi-cell adopts OTel as the primary emitter (with a Prom translation layer), the panel queries change shape but the panel taxonomy doesn’t. Recommendation pending: pick one in TIN-1449 and stop debating.
  2. 7d vs 30d default window. Cross-link slo.md Open Question 2. The dashboard offers a 1h / 30d toggle; whether the default landing is 1h-rolling or 7d-rolling is a separate operator-UX question. Recommendation: 1h for the operational page; 30d for the SLO-compliance page (separate route under app/src/routes/cache/slo/, proposed).
  3. Where does the BES collector live? Sidecar to gf-reapi-cell (one deploy, tighter coupling, simpler IAM) or sibling service in gf-rbe namespace (cleaner separation, allows BES traffic from non-gf-reapi-cell lanes — e.g. local bazel build against bazel-cache only — to still flow into analysis-hit metrics)? Recommendation pending: sibling service, on the grounds that analysis-hit data is client-side and shouldn’t require the executor to be in the path.
  4. Tenant cardinality cap — what’s the threshold and what’s the alert behavior? Proposed: cap at 50 distinct instance_name values per 30d window; alert at 40; hard-reject (at the cell perimeter validator) above 50. Numbers are placeholders; revisit when spoke count is real.
  5. Mnemonic top-N — what’s N, and how is the long-tail surfaced? Proposed: N=12 for the main panel, with an expandable “candidates” view showing the next 8 by traffic. Operators can promote a candidate to a first-class panel via a config change.
  6. Cold-clone exclusion mechanics. The SLO target excludes the first 60s of a new runner workspace. The dashboard query implements this via a runner_session_age label, but gf-reapi-cell doesn’t natively know a request is from a cold runner — that signal has to come from the runner’s session-start marker. Recommendation pending: have the runner-dashboard SvelteKit surface emit a runner_session_start event that the cell consumes and uses to label requests during the first 60s of the session.
  7. Should the dashboard show write metrics at all? AC and CAS write rates are observable; the attestation dashboard (TIN-1462) is the primary surface for them. Recommendation: this dashboard stays read-side only; writes link out to attestation.
  8. What’s the BES collector’s retention shape? BES events carry the analysis-cache hit/miss counts the dashboard needs, but the full event stream is much richer (action-level events, target-level events, profile data). The collector either retains everything (storage cost) or only the metric-relevant fields (lighter, but loses replay capability when an investigation needs the full BES). Recommendation pending: retain only the metric-relevant fields in steady state; flip to full-retention for explicit-mode proof runs and incident-response windows.
  9. Synthetic-probe traffic in the denominator. The system instance carries TTFCH-probe traffic, which by design is a small set of canonical actions repeated frequently. This inflates the AC and CAS hit-rate denominators with very-cacheable traffic, biasing the numbers upward versus genuine spoke traffic. Recommendation pending: render system separately on the per-tenant table (already in the mockup) and exclude system from the headline aggregate ratio. Tenant filter default should be “all spokes + default; excludes system.”
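
The cold-clone exclusion in question 6 can be illustrated with a small sketch of the intended arithmetic. It assumes each sample already carries the runner session age in seconds; `LookupSample`, `warmHitRate`, and `minLookups` are hypothetical names, and the real panel would express this as a Prom query rather than application code.

```typescript
// Hypothetical sample shape; field names are illustrative assumptions.
interface LookupSample {
  runnerSessionAgeSeconds: number;
  hits: number;
  misses: number;
}

type HitRate =
  | { kind: "ratio"; value: number; lookups: number }
  | { kind: "insufficient-data" };

// The SLO excludes the first 60s of a new runner workspace.
const COLD_CLONE_SECONDS = 60;

function warmHitRate(samples: readonly LookupSample[], minLookups = 1): HitRate {
  let hits = 0;
  let lookups = 0;
  for (const s of samples) {
    // Drop anything observed inside the cold-clone window.
    if (s.runnerSessionAgeSeconds <= COLD_CLONE_SECONDS) continue;
    hits += s.hits;
    lookups += s.hits + s.misses;
  }
  // Never report a ratio over zero lookups; that would be NaN or a misleading 0%.
  if (lookups < Math.max(1, minLookups)) return { kind: "insufficient-data" };
  return { kind: "ratio", value: hits / lookups, lookups };
}

// A window containing only cold-clone traffic reads as insufficient data.
const coldOnly = warmHitRate([{ runnerSessionAgeSeconds: 10, hits: 0, misses: 100 }]);

// Cold traffic is ignored; only the warm sample (3 hits, 1 miss) counts.
const mixed = warmHitRate([
  { runnerSessionAgeSeconds: 10, hits: 0, misses: 100 },
  { runnerSessionAgeSeconds: 120, hits: 3, misses: 1 },
]);
```

The explicit `insufficient-data` variant matches the synthetic test described in the failure-modes table: a cold-clone-only window should never render as a low hit rate.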

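The bias described in question 9 is easy to see with numbers. The sketch below computes a per-signal headline ratio across tenants while excluding the system instance. `TenantCounts`, `headlineRatio`, and the sample figures are hypothetical and chosen only to show the effect.

```typescript
// Hypothetical per-tenant counts for one signal (for example, AC lookups).
interface TenantCounts {
  instanceName: string;
  hits: number;
  lookups: number;
}

const SYSTEM_INSTANCE = "system";

// Ratio of summed hits to summed lookups, skipping excluded instances.
// Returns null when no lookups remain, so the caller renders "no data".
function headlineRatio(
  rows: readonly TenantCounts[],
  excluded: ReadonlySet<string> = new Set([SYSTEM_INSTANCE]),
): number | null {
  let hits = 0;
  let lookups = 0;
  for (const row of rows) {
    if (excluded.has(row.instanceName)) continue;
    hits += row.hits;
    lookups += row.lookups;
  }
  return lookups === 0 ? null : hits / lookups;
}

// Made-up figures: probe traffic is small in variety but very cacheable.
const rows: TenantCounts[] = [
  { instanceName: "spoke-a", hits: 60, lookups: 100 },
  { instanceName: "default", hits: 20, lookups: 100 },
  { instanceName: "system", hits: 990, lookups: 1000 },
];

const withoutSystem = headlineRatio(rows); // 80 / 200
const withSystem = headlineRatio(rows, new Set<string>()); // 1070 / 1200
```

With these figures the headline is 0.4 once system is excluded and roughly 0.89 when it is included, which is the upward bias the recommendation is meant to remove.
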
References

  • Repo-local: docs/build-system/slo.md — the SLIs this dashboard surfaces; voice exemplar.
  • Repo-local: docs/build-system/instance-name-routing-design.md — tenant identity; voice exemplar.
  • Repo-local: docs/build-system/ac-writer-attestation-design.md — voice exemplar.
  • Repo-local: docs/build-system/gf-reapi-cell.md — current REAPI shape and metrics-surface readiness.
  • Repo-local: config/rbe-target-eligibility.json — proved target classes; informs the mnemonic cap.
  • Repo-local: app/src/routes/cache/ — SvelteKit surface for this dashboard.
  • Repo-local: app/src/routes/monitoring/ — sibling surface, Prom query conventions.
  • Bazel BES protocol: build_event_stream.proto — canonical BES event shapes; BuildMetadata carries analysis_cache_hits / analysis_cache_misses.
  • REAPI v2: remote_execution.proto — FindMissingBlobs, GetActionResult, BatchReadBlobs definitions.
  • Peer dashboards — INSPIRATION ONLY — not adoption candidates. GloriousFlywheel is a peer to these systems, not a consumer:
    • BuildBuddy dashboards split CAS / AC / executor at the namespace and the API-key dimensions. Useful reference for tenant-scoped views; not the implementation. INSPIRATION ONLY.
    • Buildbarn bb-storage exports per-backend hit-rate metrics with a similar three-panel decomposition. Useful reference for metric naming conventions; not the implementation. INSPIRATION ONLY.
    • buchgr/bazel-remote exports CAS and AC hits as separate Prom counters with no tenant dimension. Useful as a minimal-shape reference; not the implementation. INSPIRATION ONLY.
    • EngFlow’s tenant fairness dashboards influence the per-tenant drill-down design. INSPIRATION ONLY.
  • Linear epic: TIN-1449 — E5 observability (parent).
  • Linear siblings: TIN-1477 (W5.1 SLO), TIN-1479 (W5.3 fairness), TIN-1480 (W5.4 TTFCH), TIN-1481 (W5.5 poison alerts).
  • Linear adjacent: TIN-1472 (instance-name routing), TIN-1462 (AC writer attestation), TIN-153 (role views), TIN-154 (MCP wrap), TIN-155 (API envelope).
