GloriousFlywheel Runner Benchmark Methodology 2026-04-16
Snapshot date: 2026-04-16
Purpose
Define a benchmark method that can answer a practical question:
How competitive is GloriousFlywheel, on real workloads, against:
- GitHub-hosted runners
- GloriousFlywheel ARC runners on honey
- commercial alternatives such as Namespace, Blacksmith, and RWX
This note is about methodology, not marketing claims.
GitHub owner: #212
Core Principle
GloriousFlywheel should not claim competitiveness on speed, reliability, cost, or observability without measured evidence on named workloads.
Comparison Lanes
Baseline
- GitHub-hosted standard runners
Product-under-test
- GloriousFlywheel tinyland-docker
- GloriousFlywheel tinyland-nix
- GloriousFlywheel tinyland-dind
Commercial reference set
- Namespace
- Blacksmith
- RWX
The commercial set may be evaluated through trial accounts, public benchmark material, or downstream customer evidence, but vendor claims and measured results must stay clearly separated.
Benchmark Dimensions
Performance
- queue latency
- cold-start latency
- toolchain bootstrap time
- time-to-first-step
- total wall-clock job duration
- container build duration
- Nix build duration
- test-suite duration
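As one way to pin down the latency definitions above, the sketch below derives queue latency and total wall-clock duration from the timestamp fields a GitHub Actions workflow-run object carries (`created_at`, `run_started_at`, `updated_at`); treat it as a minimal illustration, not a fixed measurement harness, and note that `updated_at` only approximates completion time:

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # GitHub API timestamps look like "2026-04-16T12:00:05Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def derive_latencies(run: dict) -> dict:
    """Derive queue latency and total wall-clock duration (seconds)
    from a workflow-run object as returned by the GitHub REST API."""
    created = parse_ts(run["created_at"])      # run was requested
    started = parse_ts(run["run_started_at"])  # a runner picked it up
    finished = parse_ts(run["updated_at"])     # last state change (~ completion)
    return {
        "queue_latency_s": (started - created).total_seconds(),
        "total_runtime_s": (finished - started).total_seconds(),
    }
```

The same split keeps queue latency from contaminating the lane's build-duration numbers.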
Cache Behavior
- cache restore time
- cache save time
- cache hit rate
- remote-cache throughput
- Docker layer reuse effectiveness
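Two of the cache metrics above reduce to simple ratios; a minimal sketch (function names are illustrative, not an agreed schema):

```python
def cache_hit_rate(restores: list[bool]) -> float:
    """Fraction of cache-restore attempts that hit.
    `restores` holds one bool per job: True if the restore key matched."""
    if not restores:
        return 0.0
    return sum(restores) / len(restores)


def remote_cache_throughput_mib_s(bytes_transferred: int, seconds: float) -> float:
    """Effective remote-cache throughput in MiB/s for one restore or save."""
    return (bytes_transferred / (1024 * 1024)) / seconds
```

Reporting throughput per transfer, rather than averaged, makes cold-cache and warm-cache lanes directly comparable.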
Reliability
- flaky job rate
- failed-run rate attributable to runner platform
- retry success rate
- time lost to platform-induced failures
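The reliability dimensions above can be aggregated from per-run records; the record keys in this sketch (`conclusion`, `platform_fault`, `retried`, `retry_succeeded`) are a hypothetical schema chosen for illustration:

```python
def reliability_metrics(runs: list[dict]) -> dict:
    """Aggregate reliability metrics from per-run records.
    Assumed (hypothetical) per-record keys:
      conclusion      -- "success" or "failure"
      platform_fault  -- True if the failure was attributed to the runner platform
      retried         -- True if the run was re-run after a failure
      retry_succeeded -- True if that re-run passed
    """
    total = len(runs)
    failures = [r for r in runs if r["conclusion"] == "failure"]
    platform_failures = [r for r in failures if r["platform_fault"]]
    retries = [r for r in runs if r.get("retried")]
    return {
        "failed_run_rate": len(failures) / total,
        "platform_failure_rate": len(platform_failures) / total,
        "retry_success_rate": (
            sum(r["retry_succeeded"] for r in retries) / len(retries)
            if retries else None
        ),
    }
```

Keeping platform-attributed failures separate from all failures is what lets the report charge lost time to the runner platform rather than to the repo's own tests.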
Operator Experience
- debug path quality
- log quality
- visibility into queueing and failure causes
- time to isolate an infrastructure-caused failure
Private-Network Fit
- ability to reach tailnet-only services
- ability to keep cluster management private
- secrets and identity overhead
Cost And Overhead
- direct compute cost
- cache/storage cost
- artifact or transfer cost
- operator-maintenance cost
- one-time setup effort
Candidate Workloads
The first benchmark pack should use named repos that already matter:
tinyland-inc/GloriousFlywheel
- validation workflow
- Nix derivation build
- container/image path where relevant
tinyland-inc/tinyland.dev
- representative Nix or site-validation workflow
tinyland-inc/lab
- operator/validation workflow
- Nix build or devshell workflow as the self-hosted Nix bootstrap canary
Jesssullivan/XoxdWM
- user-repo canary workflow
linux-xr as the named Linux-builder canary, if the workload is available for controlled comparison
Benchmark Rules
- compare like with like
- separate cold-cache and warm-cache runs
- do not blend GitHub-hosted and self-hosted results into one number
- distinguish measured result from vendor-claimed result
- record enough config context that another operator could reproduce the run
- separate runner bootstrap overhead from repo build logic when self-hosted lanes install toolchains during the workflow
Minimum Output
The first benchmark report should include:
- workload name
- repo
- runner lane
- bootstrap mode
- bootstrap overhead
- run date
- cold or warm cache mode
- total runtime
- queue latency
- major cache notes
- success or failure
- key operator observations
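The Minimum Output list above maps naturally onto one record type per benchmark run; this sketch uses illustrative field names (not a fixed schema) so the report can be emitted as CSV or a table:

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchmarkRow:
    """One row of the minimum benchmark report; fields follow the
    'Minimum Output' list (names here are illustrative)."""
    workload: str
    repo: str
    runner_lane: str            # e.g. "github-hosted" or a self-hosted lane
    bootstrap_mode: str         # e.g. "preinstalled" or "in-workflow"
    bootstrap_overhead_s: float
    run_date: str               # ISO date
    cache_mode: str             # "cold" or "warm"
    total_runtime_s: float
    queue_latency_s: float
    cache_notes: str
    succeeded: bool
    operator_observations: str


def header() -> list[str]:
    return [f.name for f in fields(BenchmarkRow)]


def to_row(r: BenchmarkRow) -> list[str]:
    return [str(v) for v in asdict(r).values()]
```

Carrying bootstrap overhead as its own column is what enforces the rule above about not hiding toolchain installation inside total runtime.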
Acceptance Criteria
- at least 3 representative workloads are benchmarked locally against GitHub-hosted and GloriousFlywheel lanes
- at least one tinyland-nix workload captures explicit Nix bootstrap cost instead of hiding it inside total runtime
- one comparison table exists for:
- GitHub-hosted baseline
- GloriousFlywheel ARC lane
- commercial reference claims or measured trial results
- the repo has a written “what we currently win on / what we currently lose on” summary grounded in those runs
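A minimal sketch of rendering the required three-lane comparison table as plain text; the lane names here are placeholders, and missing lanes surface as "n/a" rather than being blended into one number:

```python
def comparison_table(rows: dict[str, dict[str, float]]) -> str:
    """Render a per-workload comparison table across the three lanes.
    `rows` maps workload -> {lane: total_runtime_s}; lane names are
    illustrative placeholders, not product identifiers."""
    lanes = ["github-hosted", "flywheel-arc", "commercial-ref"]
    out = ["workload | " + " | ".join(lanes)]
    for workload in sorted(rows):
        cells = []
        for lane in lanes:
            v = rows[workload].get(lane)
            # absent lanes stay visible as gaps instead of being averaged away
            cells.append(f"{v:.0f}s" if v is not None else "n/a")
        out.append(workload + " | " + " | ".join(cells))
    return "\n".join(out)
```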
Non-Goals
- do not publish vendor takedowns
- do not turn public marketing claims into internal truth without measurement
- do not benchmark every repo before the first comparison set is useful