Runner Topology

Post-Liqo ARC topology for the GloriousFlywheel runner platform. Supersedes the prior multi-cluster burst model that used Liqo virtual nodes.

Current State: Single-Cluster ARC

All runners deploy to the on-prem honey RKE2 cluster managed by a single ARC controller in arc-systems.

honey RKE2 cluster
+-- arc-systems namespace
|   +-- ARC Controller (v0.14.0)
|       - manages all AutoScalingRunnerSets
|       - Prometheus metrics on :8080
|       - 5 alert rules (PrometheusRule)
|
+-- arc-runners namespace
|   +-- gh-nix (ARC ScaleSet)
|   |   - Nix builds, Attic cache integration
|   |   - Runner image ships Determinate Nix
|   |   - No shared /nix/store PVC requirement for baseline scheduling
|   |   - Warm pool: CronJob scales up during business hours
|   |   - CPU: 4 limit, Memory: 8Gi limit
|   |
|   +-- gh-docker (ARC ScaleSet)
|   |   - General CI workloads
|   |   - CPU: 2 limit, Memory: 4Gi limit
|   |
|   +-- gh-dind (ARC ScaleSet)
|   |   - Container builds (Docker-in-Docker)
|   |   - 40-50Gi ephemeral storage
|   |
|   +-- linux-xr-docker (ARC ScaleSet, extra)
|       - linux-xr kernel builds (rockylinux:10)
|       - Max 2 runners
|
+-- gitlab-runners namespace
|   +-- GitLab HPA runners (docker, dind, nix)
|       - HPA-managed deployments
|       - Separate from ARC, managed by gitlab-runner Helm chart
|
+-- nix-cache namespace
    +-- Attic binary cache server
    +-- Bazel remote cache
    +-- CNPG PostgreSQL cluster
    +-- RustFS object storage

Runner Classes

| Runner | Forge | Type | Scale Model | Storage | Use Case |
|---|---|---|---|---|---|
| gh-nix | GitHub | ARC ScaleSet | Scale-to-zero + warm pool | Ephemeral rootfs + Attic/Bazel acceleration | Nix builds, flake checks |
| gh-docker | GitHub | ARC ScaleSet | Scale-to-zero | Ephemeral | General CI, tests, linting |
| gh-dind | GitHub | ARC ScaleSet | Scale-to-zero | Large ephemeral (40-50Gi) | Container image builds |
| linux-xr-docker | GitHub | ARC ScaleSet (extra) | Max 2 | Ephemeral | Rocky Linux kernel builds |
| gl-docker | GitLab | HPA Deployment | Min 1 replica | Ephemeral | General GitLab CI |
| gl-dind | GitLab | HPA Deployment | Min 1 replica | Ephemeral | GitLab container builds |
| gl-nix | GitLab | HPA Deployment | Min 1 replica | Ephemeral | GitLab Nix builds |

Scale-Up and Scale-Down Lifecycle

ARC Runners (GitHub)

Job queued in GitHub Actions
  -> GitHub webhook -> ARC Controller
  -> Controller creates runner pod in scale set
  -> Pod starts (pulls image, exposes runtime hints)
  -> Runner registers with GitHub
  -> Job executes
  -> Job completes -> pod terminates (ephemeral)
  -> Scale set returns to minRunners (default: 0)
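The sizing behavior implied by this lifecycle can be sketched as a small function. This is a simplified model of how an ARC scale set converges on a pod count (one ephemeral runner per queued job, clamped to the configured bounds), not the controller's actual implementation:

```python
def desired_runners(queued_jobs: int, min_runners: int, max_runners: int) -> int:
    """Simplified view of scale-set sizing: one ephemeral runner per
    queued job, clamped to the [minRunners, maxRunners] range."""
    return max(min_runners, min(queued_jobs, max_runners))

# Scale-to-zero set (minRunners=0): no queued jobs means no pods.
print(desired_runners(queued_jobs=0, min_runners=0, max_runners=10))  # 0

# Burst of 4 jobs on linux-xr-docker (max 2 runners): two run, two wait.
print(desired_runners(queued_jobs=4, min_runners=0, max_runners=2))   # 2
```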

Warm pool override (gh-nix):

06:00 UTC (business hours start)
  -> CronJob patches AutoScalingRunnerSet: minRunners=2
  -> ARC pre-creates 2 idle runner pods
  -> Pods already carry the runner image + Nix toolchain
  -> Jobs get instant assignment (no registration cold start)

22:00 UTC (business hours end)
  -> CronJob patches AutoScalingRunnerSet: minRunners=0
  -> Idle pods drain and terminate
  -> Attic remains the reusable acceleration layer between jobs
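The warm pool schedule above reduces to a single decision: what minRunners value should the CronJob patch onto the gh-nix AutoScalingRunnerSet at a given time? A minimal sketch of that decision, using the 06:00-22:00 UTC window from this document (the constants are named here for illustration only):

```python
from datetime import datetime, timezone

BUSINESS_START_UTC = 6   # 06:00 UTC, warm pool scales up
BUSINESS_END_UTC = 22    # 22:00 UTC, warm pool scales down

def warm_pool_min_runners(now: datetime) -> int:
    """Return the minRunners value the CronJob patches onto the gh-nix
    scale set: 2 during business hours, 0 otherwise."""
    return 2 if BUSINESS_START_UTC <= now.hour < BUSINESS_END_UTC else 0

print(warm_pool_min_runners(datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)))   # 2
print(warm_pool_min_runners(datetime(2025, 1, 6, 23, 0, tzinfo=timezone.utc)))  # 0
```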

GitLab Runners (HPA)

Always-on: minimum 1 replica per runner type
  -> HPA monitors CPU/memory utilization
  -> Scales up when CPU > target (default 70%)
  -> Scale-down stabilization: 300s window
  -> Never scales below min replicas
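The scale-up step follows the standard Kubernetes HPA formula: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the replica bounds. A sketch with this platform's 70% CPU target (the max of 10 is an illustrative bound, not taken from the chart):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_cpu_pct: float,
                         target_cpu_pct: float = 70.0,
                         min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Core HPA formula: desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(desired, max_replicas))

print(hpa_desired_replicas(1, 140.0))  # 2: CPU at double the target
print(hpa_desired_replicas(2, 20.0))   # 1: quiet, but never below minReplicas
```

In practice the 300s stabilization window means scale-down only happens once the recommendation has stayed low for the full window.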

Cold Start Times

| Runner | Cold Start | Warm Start | Notes |
|---|---|---|---|
| gh-docker | ~20-30s | N/A (ephemeral) | Image pull + registration |
| gh-dind | ~25-35s | N/A (ephemeral) | Larger image, DinD setup |
| gh-nix (cold) | ~90-120s | ~5-10s | Runner image already contains Nix; cache misses dominate |
| gh-nix (warm pool) | ~5-10s | ~5-10s | Warm pool avoids registration cold start; Attic still carries reuse |
| gl-* | ~5s | ~5s | Always running (HPA min replicas) |

What Changed from Liqo Era

The prior topology used Liqo to extend capacity across multiple clusters (blahaj, honey) via virtual nodes. This created several problems:

  1. Scheduling complexity: Liqo virtual nodes required affinity rules and tolerations that differed per cluster, making runner configs non-portable.
  2. Network partitions: Cross-cluster pod communication was fragile when Liqo peering connections dropped.
  3. State management: Shared volumes (like /nix/store) couldn’t span Liqo-peered clusters without additional infrastructure.
  4. Debugging difficulty: Pod failures on virtual nodes were hard to diagnose — logs and events were split across clusters.

Resolution: All runners now deploy to a single cluster. Multi-cluster burst is deferred until a clear need emerges. When it does, the approach will be federated ARC (multiple independent ARC controllers with a shared GitHub App) rather than virtual-node scheduling.

Future Runner Classes (Planned)

| Runner | Status | Blocker | Issue |
|---|---|---|---|
| GPU (L40S/A100) | Design phase | Node pool provisioning, scheduling policy | #44 |
| ARM64 | Not started | ARM node availability | |
| Windows | Not started | Windows node pool, licensing | |

Capacity Planning

Current cluster: 3 nodes (on-prem RKE2) — honey (control plane), bumble (storage/ZFS), sting (stateless compute).

| Scenario | Concurrent Runners | Estimated Resource Usage |
|---|---|---|
| Quiet (off-hours) | 3 GitLab + 0 ARC | ~3 vCPU / 6Gi |
| Normal (business hours) | 3 GitLab + 2 warm Nix | ~11 vCPU / 22Gi |
| Burst (multiple PRs) | 3 GitLab + 4-6 ARC | Exceeds cluster capacity; jobs queue |
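
The scenario estimates follow from simple per-runner arithmetic. The sketch below assumes a hypothetical ~1 vCPU / 2Gi footprint per GitLab runner (inferred from the quiet-scenario total, not stated in the chart) and the 4 vCPU / 8Gi gh-nix limits listed above:

```python
# Assumed per-runner footprints: (vCPU, Gi of memory).
GL_RUNNER = (1, 2)   # hypothetical GitLab runner footprint
GH_NIX = (4, 8)      # gh-nix limits from the topology above

def scenario_usage(gitlab_runners: int, warm_nix_runners: int) -> tuple[int, int]:
    """Sum the estimated (vCPU, Gi) footprint for a scenario."""
    cpu = gitlab_runners * GL_RUNNER[0] + warm_nix_runners * GH_NIX[0]
    mem = gitlab_runners * GL_RUNNER[1] + warm_nix_runners * GH_NIX[1]
    return cpu, mem

print(scenario_usage(3, 0))  # (3, 6)   quiet: ~3 vCPU / 6Gi
print(scenario_usage(3, 2))  # (11, 22) normal: ~11 vCPU / 22Gi
```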

Scaling strategy: Vertical (larger nodes) before horizontal (more nodes). ARC scale-to-zero means burst capacity is only consumed when needed.

Ownership

| Component | Owner | Change Process |
|---|---|---|
| ARC Controller config | Platform Engineer | PR to tofu/stacks/arc-runners/ |
| Runner scale set params | Operator (via dashboard) | GitOps MR through /api/gitops/submit |
| Warm pool schedules | Operator | PR to tofu/stacks/arc-runners/ |
| HPA policies | Platform Engineer | PR to tofu/modules/gitlab-runner/ |
| Cluster nodes | Org Admin | On-prem provisioning (honey, bumble, sting) |
| Prometheus alerts | Platform Engineer | PR to tofu/modules/arc-controller/monitoring.tf |
