TTFCH Synthetic Probe

Time-to-first-cache-hit (TTFCH) is the customer-facing signal for how quickly a fresh caller benefits from GloriousFlywheel’s shared cache. It measures wall-clock time from the completion of a fresh clone to the first Bazel spawn reported as a remote cache hit.

Current Contract

  • Workflow: .github/workflows/ttfch-probe.yml
  • Cadence: hourly, 0 * * * *
  • Runner pool: tinyland-nix
  • Default ref: refs/heads/main until GF_TTFCH_REF is set to a maintained probe ref such as refs/tags/ttfch-probe-base
  • Default target: //docs-site:build
  • Instance name: system
  • Endpoint: grpc://gf-reapi-cell.gf-rbe.svc.cluster.local:8980
  • Endpoint serialization: .github/workflows/ttfch-probe.yml and .github/workflows/gf-reapi-cell-proof.yml share the gf-reapi-cell-live-endpoint concurrency group so the synthetic probe does not race proof-cell apply/cleanup.
  • Evidence script: scripts/gf-runner-ttfch-probe.sh
  • Parser: scripts/gf-runner-ttfch-evidence.py
  • Dashboard contract: docs/monitoring/gf-runner-ttfch-dashboard.json

The probe runs through the same Nix and Bazel substrate contract as normal GloriousFlywheel jobs. It does not use hosted runners, does not use raw local Bazel as the product path, and does not bypass scripts/bazel-cache-backed.sh.

Measurement

The runner script performs a fresh git fetch --depth=1 into a disposable workdir, then runs:

scripts/bazel-cache-backed.sh build \
  --remote_instance_name=system \
  --remote_accept_cached=true \
  --remote_local_fallback=false \
  --execution_log_json_file="$evidence_dir/bazel-execution-log.json" \
  --execution_log_sort=false \
  //docs-site:build

The parser reads Bazel’s --execution_log_json_file output and finds the first spawn where:

  • cacheHit is true
  • runner is exactly remote cache hit
  • metrics.startTime is present
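
The selection rule above can be sketched as follows. This is a minimal illustration, not the contents of scripts/gf-runner-ttfch-evidence.py: it assumes the log file is a stream of concatenated JSON objects (one per spawn) and that "first" means first in log order, which is what --execution_log_sort=false preserves. The function names iter_spawns and first_remote_cache_hit are hypothetical.

```python
import json
from typing import Iterator, Optional


def iter_spawns(log_text: str) -> Iterator[dict]:
    """Yield each spawn object from a stream of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(log_text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and log_text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        spawn, pos = decoder.raw_decode(log_text, pos)
        yield spawn


def first_remote_cache_hit(log_text: str) -> Optional[dict]:
    """Return the first spawn, in log order, that meets all three conditions."""
    for spawn in iter_spawns(log_text):
        metrics = spawn.get("metrics") or {}
        if (
            spawn.get("cacheHit") is True
            and spawn.get("runner") == "remote cache hit"
            and metrics.get("startTime")
        ):
            return spawn
    return None
```

A spawn that is a cache hit but has no metrics.startTime is skipped rather than treated as a zero-latency sample.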

TTFCH is computed as first_remote_cache_hit.metrics.startTime - clone_end, but only when Bazel’s execution-log timestamp falls inside the probe’s fresh-clone/build window. Some remote cache hits carry action metadata timestamps from the warmed cache entry rather than the local observation time. If the first hit timestamp predates clone_end, the probe still reports result=ok because a remote cache hit was observed, but it suppresses the gf_runner_ttfch_seconds histogram sample and sets ttfch_clock_valid=false. Operators should treat that case as a measurement-quality follow-up, not as proof of zero-second latency.
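
The clock-validity rule can be illustrated with a short sketch. It assumes the window runs from clone_end to the end of the build and that metrics.startTime is an RFC 3339 UTC string ending in Z; both are assumptions, and the names parse_timestamp and ttfch_sample are hypothetical rather than taken from the probe scripts.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def parse_timestamp(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2024-01-01T00:00:05.123456789Z."""
    whole, _, frac = ts.rstrip("Z").partition(".")
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # Keep at most microsecond precision; extra digits are truncated.
    micros = int((frac + "000000")[:6]) if frac else 0
    return parsed + timedelta(microseconds=micros)


def ttfch_sample(
    start_time: str, clone_end: datetime, build_end: datetime
) -> Tuple[Optional[float], bool]:
    """Return (ttfch_seconds, ttfch_clock_valid) for the first remote cache hit."""
    hit_at = parse_timestamp(start_time)
    clock_valid = clone_end <= hit_at <= build_end
    if not clock_valid:
        # The hit was observed, but its timestamp cannot be used as a latency sample.
        return None, False
    return (hit_at - clone_end).total_seconds(), True
```

Returning None rather than 0.0 for the out-of-window case keeps a stale cache-entry timestamp from ever being recorded as an instant hit.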

Private probes use GF_TTFCH_GITHUB_TOKEN through a temporary GIT_ASKPASS helper with GIT_TERMINAL_PROMPT=0. The helper is removed on script exit and is not part of the bounded evidence artifact set.
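
One way to picture the temporary helper is the sketch below. It is not the logic in scripts/gf-runner-ttfch-probe.sh: the helper body, the x-access-token username, and the askpass_env name are assumptions for illustration. It relies only on documented Git behavior, namely that Git runs the GIT_ASKPASS program with the prompt as its argument and reads the answer from stdout.

```python
import os
import shutil
import stat
import tempfile
from contextlib import contextmanager

# Hypothetical helper body: the token is read from the environment, never written to disk.
ASKPASS_SCRIPT = """#!/bin/sh
case "$1" in
  Username*) printf '%s\\n' "x-access-token" ;;
  *)         printf '%s\\n' "$GF_TTFCH_GITHUB_TOKEN" ;;
esac
"""


@contextmanager
def askpass_env(token: str):
    """Yield an environment for git that uses a throwaway askpass helper."""
    tmpdir = tempfile.mkdtemp(prefix="ttfch-askpass-")
    helper = os.path.join(tmpdir, "askpass.sh")
    try:
        with open(helper, "w") as handle:
            handle.write(ASKPASS_SCRIPT)
        os.chmod(helper, stat.S_IRWXU)  # owner-only read/write/execute
        yield dict(
            os.environ,
            GIT_ASKPASS=helper,
            GIT_TERMINAL_PROMPT="0",  # fail instead of prompting on a terminal
            GF_TTFCH_GITHUB_TOKEN=token,
        )
    finally:
        # Remove the helper on exit, even if the fetch fails.
        shutil.rmtree(tmpdir, ignore_errors=True)
```

A caller would pass the yielded mapping as env= to the subprocess that runs git fetch. Because the helper lives in its own temporary directory, it stays outside any evidence directory.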

Before clone/build, the runner performs a bounded TCP preflight against the configured REAPI endpoint. If the live gf-reapi-cell service has no reachable endpoint, the probe emits endpoint_unavailable evidence and exits before doing a cold clone. That is a probe/service availability incident, not a TTFCH latency sample.
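
A bounded TCP preflight of this kind might look like the sketch below. The function name endpoint_reachable and the 5-second default timeout are assumptions; the document does not state the probe’s actual bound.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(endpoint: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint opens within the timeout."""
    try:
        parts = urlsplit(endpoint)
        host, port = parts.hostname, parts.port
    except ValueError:
        # Raised by urlsplit or .port for a malformed URL or an invalid port.
        return False
    if not host or not port:
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

A caller would check endpoint_reachable("grpc://gf-reapi-cell.gf-rbe.svc.cluster.local:8980") first and, on False, record endpoint_unavailable and exit without cloning. The check only proves that the port accepts connections; it does not exercise the REAPI protocol.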

The probe emits:

  • gf_runner_ttfch_seconds as a Prometheus histogram
  • gf_runner_ttfch_clone_seconds as a diagnostic gauge
  • gf_runner_ttfch_probe_runs_total{result=...} as an outcome counter

The histogram is emitted only for clock-valid cache-hit samples. The outcome counter is always emitted, so dashboards can distinguish “hit observed but latency unmeasured” from “probe failed” and “no hit observed.”
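
The emission rule can be sketched as a function that renders Prometheus text exposition for a single run. The bucket bounds and the render_metrics name are hypothetical, since the real histogram layout is not given here; only the metric names and the "histogram only when clock-valid" rule come from the text above.

```python
from typing import List, Optional

# Hypothetical bucket bounds in seconds.
BUCKETS = (5.0, 15.0, 30.0, 60.0, 90.0, 180.0)


def render_metrics(
    result: str,
    clone_seconds: Optional[float],
    ttfch_seconds: Optional[float],
) -> str:
    """Render one probe run; pass ttfch_seconds=None when the sample is not clock-valid."""
    lines: List[str] = [
        "# TYPE gf_runner_ttfch_probe_runs_total counter",
        f'gf_runner_ttfch_probe_runs_total{{result="{result}"}} 1',
    ]
    if clone_seconds is not None:
        lines.append("# TYPE gf_runner_ttfch_clone_seconds gauge")
        lines.append(f"gf_runner_ttfch_clone_seconds {clone_seconds}")
    if ttfch_seconds is not None:
        lines.append("# TYPE gf_runner_ttfch_seconds histogram")
        for bound in BUCKETS:
            # Histogram buckets are cumulative: count the sample in every bucket it fits.
            count = 1 if ttfch_seconds <= bound else 0
            lines.append(f'gf_runner_ttfch_seconds_bucket{{le="{bound}"}} {count}')
        lines.append('gf_runner_ttfch_seconds_bucket{le="+Inf"} 1')
        lines.append(f"gf_runner_ttfch_seconds_sum {ttfch_seconds}")
        lines.append("gf_runner_ttfch_seconds_count 1")
    return "\n".join(lines) + "\n"
```

With result="ok" and ttfch_seconds=None, the output contains the outcome counter but no histogram series, which is the "hit observed but latency unmeasured" case.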

The SLO target remains the row in slo.md: p95 TTFCH under 90 seconds, with budget burn if it exceeds 90 seconds on more than 3 days in a 30-day window.
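
As a worked illustration of that target, the sketch below counts breach days over a window. It assumes the SLO is evaluated on a per-day p95 using the nearest-rank method; slo.md remains authoritative on the actual evaluation, and all names here are hypothetical.

```python
from typing import Dict, List

TARGET_SECONDS = 90.0
ALLOWED_BREACH_DAYS = 3


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def breach_days(samples_by_day: Dict[str, List[float]]) -> List[str]:
    """Days whose p95 TTFCH exceeds the target; days with no samples are skipped."""
    return sorted(
        day
        for day, samples in samples_by_day.items()
        if samples and p95(samples) > TARGET_SECONDS
    )


def budget_burning(samples_by_day: Dict[str, List[float]]) -> bool:
    """True when more than the allowed number of days breach within the window."""
    return len(breach_days(samples_by_day)) > ALLOWED_BREACH_DAYS
```

The caller is expected to pass only the days inside the 30-day window. Days without clock-valid samples contribute nothing, which matches the rule that unmeasured runs are not latency samples.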

Outcome Classes

  • ok: build succeeded and at least one remote cache hit was observed
  • no_remote_cache_hit: build succeeded but the probe saw no remote cache hit
  • build_failed_no_hit: build failed before any remote cache hit was observed
  • build_failed_after_hit: build failed after at least one remote cache hit
  • endpoint_unavailable: the REAPI endpoint preflight failed before clone/build
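
The five classes above reduce to three observations, as in this minimal sketch. The function name and its boolean inputs are hypothetical; the class names are the ones listed above.

```python
def classify_outcome(endpoint_ok: bool, build_ok: bool, hit_seen: bool) -> str:
    """Map probe observations to one of the documented outcome classes."""
    if not endpoint_ok:
        # Preflight failed, so no clone or build was attempted.
        return "endpoint_unavailable"
    if build_ok:
        return "ok" if hit_seen else "no_remote_cache_hit"
    return "build_failed_after_hit" if hit_seen else "build_failed_no_hit"
```

The endpoint check comes first because, when preflight fails, the build and hit flags carry no information.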

Probe-broken outcomes are not TTFCH breaches by themselves. They are probe availability incidents. The dashboard separates gf_runner_ttfch_seconds from gf_runner_ttfch_probe_runs_total so operators do not conflate “cache was slow” with “the probe could not measure.”

Boundaries

This is a first production contract slice for TIN-1480, not the close gate. TIN-1480 is complete only after the probe is reporting and TTFCH is sustained under target for the agreed window.

This probe measures cold-clone, warm-cache CI runner experience from tinyland-nix. It does not measure:

  • end-to-end build time
  • developer laptop public-ingress TTFCH
  • cold-cache rewarm cost
  • broad/default RBE readiness by itself
  • cache correctness or poison safety

Those remain sibling productionization gates under the RBE readiness project.
