TTFCH Synthetic Probe
Time-to-first-cache-hit (TTFCH) is the customer-facing signal for how quickly a fresh caller benefits from GloriousFlywheel’s shared cache. It measures wall time from fresh clone completion to the first Bazel spawn reported as a remote cache hit.
Current Contract
- Workflow: .github/workflows/ttfch-probe.yml
- Cadence: hourly, 0 * * * *
- Runner pool: tinyland-nix
- Default ref: refs/heads/main until GF_TTFCH_REF is set to a maintained probe ref such as refs/tags/ttfch-probe-base
- Default target: //docs-site:build
- Instance name: system
- Endpoint: grpc://gf-reapi-cell.gf-rbe.svc.cluster.local:8980
- Endpoint serialization: .github/workflows/ttfch-probe.yml and .github/workflows/gf-reapi-cell-proof.yml share the gf-reapi-cell-live-endpoint concurrency group so the synthetic probe does not race proof-cell apply/cleanup.
- Evidence script: scripts/gf-runner-ttfch-probe.sh
- Parser: scripts/gf-runner-ttfch-evidence.py
- Dashboard contract: docs/monitoring/gf-runner-ttfch-dashboard.json
The probe runs through the same Nix and Bazel substrate contract as normal
GloriousFlywheel jobs. It does not use hosted runners, does not use raw local
Bazel as the product path, and does not bypass scripts/bazel-cache-backed.sh.
Measurement
The runner script performs a fresh git fetch --depth=1 into a disposable
workdir, then runs:
scripts/bazel-cache-backed.sh build \
--remote_instance_name=system \
--remote_accept_cached=true \
--remote_local_fallback=false \
--execution_log_json_file="$evidence_dir/bazel-execution-log.json" \
--execution_log_sort=false \
//docs-site:build
The parser reads Bazel’s --execution_log_json_file output and finds the first
spawn where:
- cacheHit is true
- runner is exactly remote cache hit
- metrics.startTime is present
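The three conditions above can be sketched in Python as follows. This is an illustration, not the contents of scripts/gf-runner-ttfch-evidence.py. It assumes the execution log is a stream of concatenated JSON objects, one per spawn, and that "first" means first in file order; the function names iter_spawns and first_remote_cache_hit are hypothetical.

```python
import json
from typing import Iterator, Optional


def iter_spawns(text: str) -> Iterator[dict]:
    """Yield each JSON object from a stream of concatenated objects."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def first_remote_cache_hit(text: str) -> Optional[dict]:
    """Return the first spawn that meets all three conditions, else None."""
    for spawn in iter_spawns(text):
        if (
            spawn.get("cacheHit") is True
            and spawn.get("runner") == "remote cache hit"
            and spawn.get("metrics", {}).get("startTime")
        ):
            return spawn
    return None
```

Fields are read with get because a JSON rendering of the log may omit values that are unset or false, so a missing cacheHit is treated as "not a hit".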
TTFCH is first_remote_cache_hit.metrics.startTime - clone_end only when
Bazel’s execution-log timestamp is in the probe’s fresh-clone/build window.
Some remote cache hits carry action metadata timestamps from the warmed cache
entry rather than the local observation time. If the first hit timestamp
predates clone_end, the probe still reports result=ok because a remote
cache hit was observed, but it suppresses the gf_runner_ttfch_seconds
histogram sample and sets ttfch_clock_valid=false. Operators should treat
that as a measurement-quality follow-up, not as a zero-second latency proof.
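A minimal sketch of the clock-validity rule, assuming startTime is an RFC 3339 UTC string ending in Z and that clone_end is recorded as a timezone-aware datetime. The helper names parse_rfc3339 and ttfch_sample are hypothetical, and only the lower bound of the window (clone_end) is checked here.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating sub-microsecond digits."""
    body = value[:-1] if value.endswith("Z") else value
    whole, _, frac = body.partition(".")
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def ttfch_sample(hit_start: str, clone_end: datetime) -> Tuple[Optional[float], bool]:
    """Return (ttfch_seconds, ttfch_clock_valid) for an observed cache hit."""
    delta = (parse_rfc3339(hit_start) - clone_end).total_seconds()
    if delta < 0:
        # The hit timestamp predates clone_end: keep result=ok elsewhere,
        # but suppress the latency sample.
        return None, False
    return delta, True
```

Returning None rather than 0.0 for the invalid case keeps a stale timestamp from ever being recorded as a zero-second latency.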
Private probes use GF_TTFCH_GITHUB_TOKEN through a temporary GIT_ASKPASS
helper with GIT_TERMINAL_PROMPT=0. The helper is removed on script exit and
is not part of the bounded evidence artifact set.
Before clone/build, the runner performs a bounded TCP preflight against the
configured REAPI endpoint. If the live gf-reapi-cell service has no reachable
endpoint, the probe emits endpoint_unavailable evidence and exits before
doing a cold clone. That is a probe/service availability incident, not a TTFCH
latency sample.
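A bounded TCP preflight of this kind can be sketched as below. The function name endpoint_reachable and the 5-second default timeout are assumptions; the probe's actual bound is not stated here.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(endpoint: str, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint opens within timeout_s."""
    parts = urlsplit(endpoint)
    if not parts.hostname or parts.port is None:
        return False
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

A caller would record endpoint_unavailable and exit when endpoint_reachable("grpc://gf-reapi-cell.gf-rbe.svc.cluster.local:8980") returns False. The check only proves that a TCP connection opens; it says nothing about REAPI health beyond that.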
The probe emits:
- gf_runner_ttfch_seconds as a Prometheus histogram
- gf_runner_ttfch_clone_seconds as a diagnostic gauge
- gf_runner_ttfch_probe_runs_total{result=...} as an outcome counter
The histogram is emitted only for clock-valid cache-hit samples. The outcome counter is always emitted, so dashboards can distinguish “hit observed but latency unmeasured” from “probe failed” and “no hit observed.”
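The emission rule can be illustrated with a small renderer for the Prometheus text exposition format. The bucket bounds below are hypothetical, since the real histogram layout is not specified here, and render_metrics is an illustrative name rather than part of the probe.

```python
from typing import List, Optional

# Hypothetical bucket bounds in seconds; the real layout may differ.
BUCKETS = (15.0, 30.0, 60.0, 90.0, 120.0)


def render_metrics(result: str, clone_seconds: Optional[float],
                   ttfch_seconds: Optional[float]) -> List[str]:
    """Render one probe run as Prometheus text exposition lines."""
    # The outcome counter is emitted for every run.
    lines = [f'gf_runner_ttfch_probe_runs_total{{result="{result}"}} 1']
    if clone_seconds is not None:
        lines.append(f"gf_runner_ttfch_clone_seconds {clone_seconds}")
    if ttfch_seconds is not None:
        # Only clock-valid hits contribute a histogram observation.
        for bound in BUCKETS:
            count = 1 if ttfch_seconds <= bound else 0
            lines.append(f'gf_runner_ttfch_seconds_bucket{{le="{bound}"}} {count}')
        lines.append('gf_runner_ttfch_seconds_bucket{le="+Inf"} 1')
        lines.append(f"gf_runner_ttfch_seconds_sum {ttfch_seconds}")
        lines.append("gf_runner_ttfch_seconds_count 1")
    return lines
```

Passing ttfch_seconds=None with result="ok" produces the "hit observed but latency unmeasured" shape: a counter increment with no histogram lines.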
The SLO target remains the row in slo.md: p95 TTFCH under 90
seconds, with budget burn if it exceeds 90 seconds on more than 3 days in a
30-day window.
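As a worked reading of that rule, the sketch below counts breach days from a list of daily p95 values. It assumes the SLO is evaluated on one p95 value per day, which slo.md may define differently; the function names are hypothetical.

```python
from typing import Sequence

TARGET_SECONDS = 90.0
ALLOWED_BREACH_DAYS = 3
WINDOW_DAYS = 30


def breach_days(daily_p95: Sequence[float]) -> int:
    """Count days whose p95 TTFCH exceeded the 90-second target."""
    return sum(1 for value in daily_p95 if value > TARGET_SECONDS)


def budget_burning(daily_p95: Sequence[float]) -> bool:
    """True when more than 3 of the most recent 30 days breached the target."""
    window = daily_p95[-WINDOW_DAYS:]
    return breach_days(window) > ALLOWED_BREACH_DAYS
```

Under this reading, three breach days in the window are still within budget, and the fourth starts the burn.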
Outcome Classes
- ok: build succeeded and at least one remote cache hit was observed
- no_remote_cache_hit: build succeeded but the probe saw no remote cache hit
- build_failed_no_hit: build failed before any remote cache hit was observed
- build_failed_after_hit: build failed after at least one remote cache hit
- endpoint_unavailable: the REAPI endpoint preflight failed before clone/build
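The five classes reduce to three observations: whether the preflight passed, whether the build succeeded, and whether any remote cache hit was seen. A minimal sketch of that mapping, with a hypothetical function name and parameters:

```python
def classify_outcome(endpoint_ok: bool, build_succeeded: bool, saw_hit: bool) -> str:
    """Map probe observations onto the documented outcome classes."""
    if not endpoint_ok:
        # The preflight failed, so no clone or build was attempted.
        return "endpoint_unavailable"
    if build_succeeded:
        return "ok" if saw_hit else "no_remote_cache_hit"
    return "build_failed_after_hit" if saw_hit else "build_failed_no_hit"
```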
Probe-broken outcomes are not TTFCH breaches by themselves. They are probe
availability incidents. The dashboard separates gf_runner_ttfch_seconds from
gf_runner_ttfch_probe_runs_total so operators do not conflate “cache was slow”
with “the probe could not measure.”
Boundaries
This is a first production contract slice for TIN-1480, not the close gate. TIN-1480 is complete only after the probe is reporting and TTFCH is sustained under target for the agreed window.
This probe measures cold-clone, warm-cache CI runner experience from
tinyland-nix. It does not measure:
- end-to-end build time
- developer laptop public-ingress TTFCH
- cold-cache rewarm cost
- broad/default RBE readiness by itself
- cache correctness or poison safety
Those remain sibling productionization gates under the RBE readiness project.