Runner Topology Model
Post-Liqo ARC topology for the GloriousFlywheel runner platform. Supersedes the prior multi-cluster burst model that used Liqo virtual nodes.
Current State: Single-Cluster ARC
All runners deploy to the on-prem honey RKE2 cluster, managed by a single ARC controller in arc-systems.
honey RKE2 cluster
+-- arc-systems namespace
| +-- ARC Controller (v0.14.0)
| - manages all AutoScalingRunnerSets
| - Prometheus metrics on :8080
| - 5 alert rules (PrometheusRule)
|
+-- arc-runners namespace
| +-- gh-nix (ARC ScaleSet)
| | - Nix builds, Attic cache integration
| | - Runner image ships Determinate Nix
| | - No shared /nix/store PVC requirement for baseline scheduling
| | - Warm pool: CronJob scales up during business hours
| | - CPU: 4 limit, Memory: 8Gi limit
| |
| +-- gh-docker (ARC ScaleSet)
| | - General CI workloads
| | - CPU: 2 limit, Memory: 4Gi limit
| |
| +-- gh-dind (ARC ScaleSet)
| | - Container builds (Docker-in-Docker)
| | - 40-50Gi ephemeral storage
| |
| +-- linux-xr-docker (ARC ScaleSet, extra)
| - linux-xr kernel builds (rockylinux:10)
| - Max 2 runners
|
+-- gitlab-runners namespace
| +-- GitLab HPA runners (docker, dind, nix)
| - HPA-managed deployments
| - Separate from ARC, managed by gitlab-runner Helm chart
|
+-- nix-cache namespace
+-- Attic binary cache server
+-- Bazel remote cache
+-- CNPG PostgreSQL cluster
+-- RustFS object storage
Runner Classes
| Runner | Forge | Type | Scale Model | Storage | Use Case |
|---|---|---|---|---|---|
| gh-nix | GitHub | ARC ScaleSet | Scale-to-zero + warm pool | Ephemeral rootfs + Attic/Bazel acceleration | Nix builds, flake checks |
| gh-docker | GitHub | ARC ScaleSet | Scale-to-zero | Ephemeral | General CI, tests, linting |
| gh-dind | GitHub | ARC ScaleSet | Scale-to-zero | Large ephemeral (40-50Gi) | Container image builds |
| linux-xr-docker | GitHub | ARC ScaleSet (extra) | Max 2 | Ephemeral | Rocky Linux kernel builds |
| gl-docker | GitLab | HPA Deployment | Min 1 replica | Ephemeral | General GitLab CI |
| gl-dind | GitLab | HPA Deployment | Min 1 replica | Ephemeral | GitLab container builds |
| gl-nix | GitLab | HPA Deployment | Min 1 replica | Ephemeral | GitLab Nix builds |
Scale-Up and Scale-Down Lifecycle
ARC Runners (GitHub)
Job queued in GitHub Actions
-> GitHub webhook -> ARC Controller
-> Controller creates runner pod in scale set
-> Pod starts (pulls image, exposes runtime hints)
-> Runner registers with GitHub
-> Job executes
-> Job completes -> pod terminates (ephemeral)
-> Scale set returns to minRunners (default: 0)
Warm pool override (gh-nix):
06:00 UTC (business hours start)
-> CronJob patches AutoScalingRunnerSet: minRunners=2
-> ARC pre-creates 2 idle runner pods
-> Pods already carry the runner image + Nix toolchain
-> Jobs get instant assignment (no registration cold start)
22:00 UTC (business hours end)
-> CronJob patches AutoScalingRunnerSet: minRunners=0
-> Idle pods drain and terminate
-> Attic remains the reusable acceleration layer between jobs
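The warm-pool mechanism above boils down to a scheduled JSON merge patch against the scale set's `minRunners` field. A minimal sketch of what the CronJob applies (the namespace and scale-set name here mirror this document, but the exact resource kind spelling and invocation are illustrative, not copied from the real CronJob):

```python
import json

# Illustrative names; the authoritative values live in tofu/stacks/arc-runners/.
NAMESPACE = "arc-runners"
SCALE_SET = "gh-nix"

def min_runners_patch(min_runners: int) -> dict:
    """JSON merge patch that adjusts the scale set's idle-runner floor."""
    return {"spec": {"minRunners": min_runners}}

def kubectl_patch_cmd(min_runners: int) -> str:
    """The equivalent kubectl invocation the CronJob could run."""
    patch = json.dumps(min_runners_patch(min_runners))
    return (
        f"kubectl -n {NAMESPACE} patch autoscalingrunnerset {SCALE_SET} "
        f"--type merge -p '{patch}'"
    )

print(kubectl_patch_cmd(2))  # 06:00 UTC business-hours scale-up
print(kubectl_patch_cmd(0))  # 22:00 UTC off-hours scale-down
```

Because the patch only touches `spec.minRunners`, ARC's own scale-to-zero behavior is untouched outside the warm-pool window.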
GitLab Runners (HPA)
Always-on: minimum 1 replica per runner type
-> HPA monitors CPU/memory utilization
-> Scales up when CPU > target (default 70%)
-> Scale-down stabilization: 300s window
-> Never scales below min replicas
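The scale-up rule above follows the standard Kubernetes HPA formula: desired replicas = ceil(current replicas x current utilization / target utilization), clamped to the replica bounds. A sketch with this runner platform's 70% target and min of 1 (the max of 10 is an assumed placeholder, and the 300s scale-down stabilization window is not modeled):

```python
from math import ceil

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float = 70.0,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Core Kubernetes HPA formula:
    desired = ceil(current * currentUtilization / targetUtilization),
    clamped to [min_replicas, max_replicas]. Real scale-downs are also
    held back by the 300s stabilization window (not modeled here)."""
    desired = ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

print(hpa_desired_replicas(1, 140.0))  # CPU at 140% of target -> scale to 2
print(hpa_desired_replicas(2, 10.0))   # idle -> clamped back to min, 1
```

The clamp on the last line is what implements "never scales below min replicas".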
Cold Start Times
| Runner | Cold Start | Warm Start | Notes |
|---|---|---|---|
| gh-docker | ~20-30s | N/A (ephemeral) | Image pull + registration |
| gh-dind | ~25-35s | N/A (ephemeral) | Larger image, DinD setup |
| gh-nix (cold) | ~90-120s | ~5-10s | Runner image already contains Nix; cache misses dominate |
| gh-nix (warm pool) | ~5-10s | ~5-10s | Warm pool avoids registration cold start; Attic still carries reuse |
| gl-* | ~5s | ~5s | Always running (HPA min replicas) |
What Changed from Liqo Era
The prior topology used Liqo to extend capacity across multiple clusters (blahaj, honey) via virtual nodes. This created several problems:
- Scheduling complexity: Liqo virtual nodes required affinity rules and tolerations that differed per cluster, making runner configs non-portable.
- Network partitions: Cross-cluster pod communication was fragile when Liqo peering connections dropped.
- State management: Shared volumes (like /nix/store) couldn’t span Liqo-peered clusters without additional infrastructure.
- Debugging difficulty: Pod failures on virtual nodes were hard to diagnose — logs and events were split across clusters.
Resolution: All runners now deploy to a single cluster. Multi-cluster burst is deferred until a clear need emerges. When it does, the approach will be federated ARC (multiple independent ARC controllers with a shared GitHub App) rather than virtual-node scheduling.
Future Runner Classes (Planned)
| Runner | Status | Blocker | Issue |
|---|---|---|---|
| GPU (L40S/A100) | Design phase | Node pool provisioning, scheduling policy | #44 |
| ARM64 | Not started | ARM node availability | — |
| Windows | Not started | Windows node pool, licensing | — |
Capacity Planning
Current cluster: 3 nodes (on-prem RKE2) — honey (control plane), bumble (storage/ZFS), sting (stateless compute).
| Scenario | Concurrent Runners | Estimated Resource Usage |
|---|---|---|
| Quiet (off-hours) | 3 GitLab + 0 ARC | ~3 vCPU / 6Gi |
| Normal (business hours) | 3 GitLab + 2 warm Nix | ~11 vCPU / 22Gi |
| Burst (multiple PRs) | 3 GitLab + 4-6 ARC | Exceeds cluster capacity; jobs queue |
Scaling strategy: Vertical (larger nodes) before horizontal (more nodes). ARC scale-to-zero means burst capacity is only consumed when needed.
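The scenario figures in the table can be reproduced by summing per-runner limits. The gh-nix figure (4 vCPU / 8Gi) comes from the scale-set config above; the ~1 vCPU / 2Gi per GitLab replica is an assumption back-derived from the "Quiet" row, not a documented limit:

```python
# Per-runner resource limits (vCPU, memory in Gi).
# gh-nix matches the arc-runners scale-set config; the gl-runner figure
# is assumed from the capacity table's "Quiet" row.
LIMITS = {
    "gl-runner": (1, 2),
    "gh-nix":    (4, 8),
}

def estimate(counts: dict[str, int]) -> tuple[int, int]:
    """Sum vCPU and memory limits for a mix of concurrent runners."""
    cpu = sum(LIMITS[name][0] * n for name, n in counts.items())
    mem = sum(LIMITS[name][1] * n for name, n in counts.items())
    return cpu, mem

print(estimate({"gl-runner": 3}))               # quiet:  (3, 6)
print(estimate({"gl-runner": 3, "gh-nix": 2}))  # normal: (11, 22)
```

Extending `LIMITS` with the gh-docker and gh-dind limits from the cluster tree would let the same helper bound the burst scenario.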
Ownership
| Component | Owner | Change Process |
|---|---|---|
| ARC Controller config | Platform Engineer | PR to tofu/stacks/arc-runners/ |
| Runner scale set params | Operator (via dashboard) | GitOps MR through /api/gitops/submit |
| Warm pool schedules | Operator | PR to tofu/stacks/arc-runners/ |
| HPA policies | Platform Engineer | PR to tofu/modules/gitlab-runner/ |
| Cluster nodes | Org Admin | On-prem provisioning (honey, bumble, sting) |
| Prometheus alerts | Platform Engineer | PR to tofu/modules/arc-controller/monitoring.tf |