GloriousFlywheel Benchmark Scorecard 2026-04-19

Snapshot date: 2026-04-19

GitHub owner: #212

Purpose

Record the currently measured benchmark evidence for the release baseline, as produced on the merged main branch.

This note is intentionally narrower than the full competitive benchmark plan:

  • it captures the currently measured GloriousFlywheel runner lanes
  • it does not invent missing GitHub-hosted or commercial comparison data
  • it keeps release gating tied to real runs instead of the scorecard template

Measured commit: e6c5871310435c387720de55b74e7ddcddfd258a

Current Measured Pack

  • workflow: Runner Benchmarks
  • dispatch date: 2026-04-19
  • workload selection: all
  • measured lanes:
    • tinyland-nix
    • tinyland-nix-heavy
  • measured workloads:
    • nix-build via nix build .#runner-dashboard-image --no-link
    • flake-check via nix flake check
  • bootstrap mode: DeterminateSystems/determinate-nix-action@v3 through .github/actions/nix-job/action.yml
  • cache posture during these runs:
    • Attic cache: unknown in artifact output
    • Bazel cache: available in artifact output
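As a rough illustration of how the timed command bodies are wrapped, the pattern can be sketched as a thin shell function. This is a hypothetical sketch, not the actual scripts/benchmark/runner-benchmark.sh implementation; the function name `time_workload` and the output format are assumptions:

```shell
# Hypothetical sketch of the per-workload timing wrapper; the real
# scripts/benchmark/runner-benchmark.sh may differ in naming and output.
time_workload() {
  workload_id="$1"; shift
  t0=$(date +%s.%N)               # total-runtime timer starts
  tb0=$(date +%s.%N)              # build-runtime timer starts
  "$@"                            # timed command body, e.g. nix flake check
  tb1=$(date +%s.%N)
  t1=$(date +%s.%N)               # total-runtime timer stops
  build=$(awk -v a="$tb0" -v b="$tb1" 'BEGIN { printf "%.3f", b - a }')
  total=$(awk -v a="$t0" -v b="$t1" 'BEGIN { printf "%.3f", b - a }')
  overhead=$(awk -v t="$total" -v b="$build" 'BEGIN { printf "%.3f", t - b }')
  echo "$workload_id total=${total}s build=${build}s overhead=${overhead}s"
}

# Example with a trivial command standing in for the real workload body:
time_workload gf-nix-build true
```

The small gap between the two timer pairs is what the workload tables report as in-script overhead.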

Run Summary

| Run id | Lane | Queue latency | Time to first step | Benchmark step duration | Timed workload total | Approx bootstrap/setup before timers | Result |
| 24641466958 | tinyland-nix | 47s | 48s | 26.000s | 13.930s | 12.070s | success |
| 24641466963 | tinyland-nix-heavy | 29s | 30s | 1383.000s | 1368.150s | 14.850s | success |
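Queue latency here is the gap from workflow creation to job start. A minimal check of that arithmetic using GNU date, with illustrative timestamps chosen to match the measured 47s (the actual run timestamps are not reproduced here):

```shell
# Illustrative recomputation of queue latency for run 24641466958:
# workflow creation time to job start time. These timestamps are made
# up to match the measured 47s; they are not taken from the artifact.
created="2026-04-19T12:00:00Z"
started="2026-04-19T12:00:47Z"
queue=$(( $(date -u -d "$started" +%s) - $(date -u -d "$created" +%s) ))
echo "queue latency: ${queue}s"
```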

Notes:

  • queue latency is measured from workflow creation to job start
  • time to first step is measured from workflow creation to the first job step
  • bootstrap/setup before timers is the benchmark-step wall clock minus the summed workload timers from the artifact JSON
  • the in-workload timers come from scripts/benchmark/runner-benchmark.sh and only cover the timed command body plus the script’s own minimal wrapper
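The bootstrap/setup figure is therefore derived, not measured directly. Reproducing the run 24641466958 row from the numbers above:

```shell
# Derive "approx bootstrap/setup before timers" for run 24641466958:
# benchmark-step wall clock minus the summed in-workload timers.
step_wall=26.000                                                  # benchmark step duration
workload_total=$(awk 'BEGIN { printf "%.3f", 1.222 + 12.708 }')   # gf-nix-build + gf-flake-check
bootstrap=$(awk -v s="$step_wall" -v w="$workload_total" 'BEGIN { printf "%.3f", s - w }')
echo "timed workload total: ${workload_total}s"    # 13.930s, matching the run summary
echo "bootstrap before timers: ${bootstrap}s"      # 12.070s, matching the run summary
```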

Workload Results

| Run id | Lane | Workload id | Total runtime | Build runtime | In-script overhead | Attic cache | Bazel cache | Nix store size | Hostname |
| 24641466958 | tinyland-nix | gf-nix-build | 1.222s | 1.220s | 0.002s | unknown | available | 5021 MiB | tinyland-nix-fhdlf-runner-75fdt |
| 24641466958 | tinyland-nix | gf-flake-check | 12.708s | 12.706s | 0.002s | unknown | available | 5021 MiB | tinyland-nix-fhdlf-runner-75fdt |
| 24641466963 | tinyland-nix-heavy | gf-nix-build | 1m 25.411s | 1m 25.408s | 0.003s | unknown | available | 1813 MiB | tinyland-nix-heavy-7n5bq-runner-wnctg |
| 24641466963 | tinyland-nix-heavy | gf-flake-check | 21m 22.739s | 21m 22.736s | 0.003s | unknown | available | 6440 MiB | tinyland-nix-heavy-7n5bq-runner-wnctg |

Current Read

What Is Proven

  • the repo-owned benchmark workflow runs successfully on merged main
  • both currently documented Nix lanes produced artifact-backed results
  • the heavy lane is not theoretical; it completed a real nix flake check benchmark on main
  • explicit Nix bootstrap overhead exists and is now separated from the timed workload body

What Is Not Yet Proven

  • no GitHub-hosted baseline is included in this scorecard yet
  • no commercial comparison lane is included yet
  • no warm-cache versus cold-cache split is controlled yet
  • this is still only one repo and two workload shapes, not the full pack described in the methodology note

What We Currently Win On

  • GloriousFlywheel has real measured Nix lanes on merged main; the source repo is not relying on hypothetical runner claims
  • the heavy lane is real and can complete a full nix flake check benchmark, which is materially stronger evidence than a light smoke-only contract
  • bootstrap/setup overhead is now separated from the timed workload body, so self-hosted runner cost is not hidden inside one opaque wall-clock number
  • the benchmark workflow produces reproducible artifacts and a parsable scorecard instead of ad hoc timing notes

What We Currently Lose On

  • queue latency is still well above the aspirational < 15s target in the current measured runs (29s to 47s)
  • the current measured pack is too narrow to support broad competitiveness claims against GitHub-hosted or commercial alternatives
  • cache behavior is only visible as coarse availability metadata in the current artifacts; real hit-rate and restore/save timing evidence is still missing

Where Evidence Is Still Missing

  • GitHub-hosted baseline runs on the same named workloads
  • commercial trial or clearly separated vendor-claim comparison rows
  • warm-cache versus cold-cache splits
  • a broader workload pack beyond the source repo’s two current Nix shapes

Release-Gate Read

This scorecard is current enough for the release baseline checklist because it documents the latest successful benchmark evidence on merged main.

It is not enough to make broad competitive claims for #212. Public claims should remain limited to:

  • tinyland-nix and tinyland-nix-heavy are real measured lanes
  • benchmark automation and artifact capture exist on the source repo
  • broader GitHub-hosted and commercial comparisons are still pending
