GloriousFlywheel Runner Benchmark Methodology 2026-04-16

Snapshot date: 2026-04-16

Purpose

Define a benchmark method that can answer a practical question:

How competitive is GloriousFlywheel, on real workloads, against:

  • GitHub-hosted runners
  • GloriousFlywheel ARC runners on honey
  • commercial alternatives such as Namespace, Blacksmith, and RWX

This note is about methodology, not marketing claims.

GitHub owner: #212

Core Principle

GloriousFlywheel should not claim competitiveness on speed, reliability, cost, or observability without measured evidence on named workloads.

Comparison Lanes

Baseline

  • GitHub-hosted standard runners

Product-under-test

  • GloriousFlywheel tinyland-docker
  • GloriousFlywheel tinyland-nix
  • GloriousFlywheel tinyland-dind

Commercial reference set

  • Namespace
  • Blacksmith
  • RWX

The commercial set may be evaluated through trial accounts, public benchmark material, or downstream customer evidence; in every case, vendor-claimed numbers and our own measured results must be reported as distinct categories.

Benchmark Dimensions

Performance

  • queue latency
  • cold-start latency
  • toolchain bootstrap time
  • time-to-first-step
  • total wall-clock job duration
  • container build duration
  • Nix build duration
  • test-suite duration
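
The latency metrics above can be derived from job timestamps. A minimal sketch, assuming the `created_at`, `started_at`, and `completed_at` fields as returned by the GitHub Actions jobs API (the sample values are invented):

```python
from datetime import datetime

def _ts(s: str) -> datetime:
    # GitHub timestamps look like "2026-04-16T12:00:05Z"
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def queue_latency_seconds(job: dict) -> float:
    """Seconds a job waited for a runner: started_at - created_at."""
    return (_ts(job["started_at"]) - _ts(job["created_at"])).total_seconds()

def wall_clock_seconds(job: dict) -> float:
    """Total job duration: completed_at - started_at."""
    return (_ts(job["completed_at"]) - _ts(job["started_at"])).total_seconds()

# Illustrative job record, not real data.
job = {
    "created_at": "2026-04-16T12:00:00Z",
    "started_at": "2026-04-16T12:00:42Z",
    "completed_at": "2026-04-16T12:09:42Z",
}
print(queue_latency_seconds(job))  # 42.0
print(wall_clock_seconds(job))     # 540.0
```

Cold-start latency and time-to-first-step need step-level timestamps from the same jobs endpoint, but follow the same subtraction pattern.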

Cache Behavior

  • cache restore time
  • cache save time
  • cache hit rate
  • remote-cache throughput
  • Docker layer reuse effectiveness
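
Hit rate and throughput are simple ratios once per-run cache records exist. A sketch with hypothetical record fields (`cache_hit`, restored byte counts are invented for illustration):

```python
def cache_hit_rate(runs: list[dict]) -> float:
    """Fraction of runs whose cache restore produced a usable hit."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r["cache_hit"]) / len(runs)

def restore_throughput_mib_s(restored_bytes: int, restore_seconds: float) -> float:
    """Effective remote-cache restore throughput in MiB/s."""
    return (restored_bytes / (1024 * 1024)) / restore_seconds

# Illustrative records, not measurements.
runs = [{"cache_hit": True}, {"cache_hit": True}, {"cache_hit": False}]
print(round(cache_hit_rate(runs), 3))                    # 0.667
print(restore_throughput_mib_s(512 * 1024 * 1024, 8.0))  # 64.0
```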

Reliability

  • flaky job rate
  • failed-run rate attributable to runner platform
  • retry success rate
  • time lost to platform-induced failures
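
These reliability dimensions reduce to ratios over per-run records. A sketch with a hypothetical record shape (the `ok` / `platform_fault` / `retried` / `retry_ok` fields are assumptions, not an existing schema):

```python
def reliability_summary(runs: list[dict]) -> dict:
    """Aggregate platform-attributable failure rate and retry success rate."""
    total = len(runs)
    platform_failures = sum(1 for r in runs if not r["ok"] and r["platform_fault"])
    retried = [r for r in runs if r["retried"]]
    return {
        # failed-run rate attributable to the runner platform
        "platform_failure_rate": platform_failures / total if total else 0.0,
        # retry success rate; None when nothing was retried
        "retry_success_rate": (
            sum(1 for r in retried if r["retry_ok"]) / len(retried)
            if retried else None
        ),
    }

# Illustrative records, not measurements.
runs = [
    {"ok": True,  "platform_fault": False, "retried": False, "retry_ok": False},
    {"ok": False, "platform_fault": True,  "retried": True,  "retry_ok": True},
    {"ok": False, "platform_fault": False, "retried": True,  "retry_ok": False},
    {"ok": True,  "platform_fault": False, "retried": False, "retry_ok": False},
]
print(reliability_summary(runs))
# {'platform_failure_rate': 0.25, 'retry_success_rate': 0.5}
```

Classifying a failure as platform-attributable is an operator judgment call; the function only aggregates whatever classification was recorded.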

Operator Experience

  • debug path quality
  • log quality
  • visibility into queueing and failure causes
  • time to isolate an infrastructure-caused failure

Private-Network Fit

  • ability to reach tailnet-only services
  • ability to keep cluster management private
  • secrets and identity overhead

Cost And Overhead

  • direct compute cost
  • cache/storage cost
  • artifact or transfer cost
  • operator-maintenance cost
  • one-time setup effort

Candidate Workloads

The first benchmark pack should use named repos that already matter:

  • tinyland-inc/GloriousFlywheel
    • validation workflow
    • Nix derivation build
    • container/image path where relevant
  • tinyland-inc/tinyland.dev
    • representative Nix or site-validation workflow
  • tinyland-inc/lab
    • operator/validation workflow
    • Nix build or devshell workflow as the self-hosted Nix bootstrap canary
  • Jesssullivan/XoxdWM
    • user-repo canary workflow
  • linux-xr as the named Linux-builder canary, if the workload is available for controlled comparison

Benchmark Rules

  1. compare like with like
  2. separate cold-cache and warm-cache runs
  3. do not blend GitHub-hosted and self-hosted results into one number
  4. distinguish measured result from vendor-claimed result
  5. record enough config context that another operator could reproduce the run
  6. separate runner bootstrap overhead from repo build logic when self-hosted lanes install toolchains during the workflow
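
Rules 2-4 and 6 can be enforced by the record type itself rather than by reviewer discipline. A sketch of one possible record; the class and its fields are hypothetical, not an existing GloriousFlywheel schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkRun:
    workload: str             # e.g. a named workflow in a named repo
    lane: str                 # "github-hosted", "tinyland-nix", ...
    cache_mode: str           # "cold" or "warm" -- rule 2: never blended
    bootstrap_seconds: float  # runner/toolchain setup (rule 6: kept separate)
    build_seconds: float      # repo build and test logic only
    measured: bool            # False marks a vendor-claimed number (rule 4)

    def __post_init__(self) -> None:
        if self.cache_mode not in ("cold", "warm"):
            raise ValueError(f"cache_mode must be cold or warm, got {self.cache_mode!r}")

    @property
    def total_seconds(self) -> float:
        # Total runtime is derived, so bootstrap overhead is never hidden in it.
        return self.bootstrap_seconds + self.build_seconds

# Illustrative values, not a measurement.
run = BenchmarkRun("validation", "tinyland-nix", "cold", 95.0, 310.0, True)
print(run.total_seconds)  # 405.0
```

Keeping `measured` on every record makes rule 4 mechanical: any aggregation can filter or label vendor-claimed rows instead of trusting the table author to remember.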

Minimum Output

The first benchmark report should include:

  • workload name
  • repo
  • runner lane
  • bootstrap mode
  • bootstrap overhead
  • run date
  • cold or warm cache mode
  • total runtime
  • queue latency
  • major cache notes
  • success or failure
  • key operator observations
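
The field list above can be pinned down as a CSV schema so every lane emits comparable rows. A sketch; the column names are a proposed mapping of the list above, not an existing format:

```python
import csv
import io

# One column per "Minimum Output" item; proposed names, not an existing schema.
REPORT_FIELDS = [
    "workload", "repo", "runner_lane", "bootstrap_mode", "bootstrap_overhead_s",
    "run_date", "cache_mode", "total_runtime_s", "queue_latency_s",
    "cache_notes", "outcome", "operator_notes",
]

def report_csv(rows: list[dict]) -> str:
    """Serialize benchmark rows to CSV.

    Unknown extra keys raise; missing keys become empty cells, so gaps
    stay visible in the table rather than failing silently.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=REPORT_FIELDS, extrasaction="raise")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A fixed header also makes the later comparison table trivial to assemble: one file per lane, concatenated and grouped by `runner_lane` and `cache_mode`.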

Acceptance Criteria

  • at least 3 representative workloads are benchmarked locally against GitHub-hosted and GloriousFlywheel lanes
  • at least one tinyland-nix workload captures explicit Nix bootstrap cost instead of hiding it inside total runtime
  • one comparison table exists for:
    • GitHub-hosted baseline
    • GloriousFlywheel ARC lane
    • commercial reference claims or measured trial results
  • the repo has a written “what we currently win on / what we currently lose on” summary grounded in those runs

Non-Goals

  • do not publish vendor takedowns
  • do not turn public marketing claims into internal truth without measurement
  • do not benchmark every repo before the first comparison set is useful
