- **Operator**: Day-to-day runner fleet manager. Monitors health, responds to incidents, adjusts capacity. Typical context: on-call, reacting to a build queue backup or a pod crash loop.
- **Org Admin**: Sets policy for the runner platform: who can use which runners, resource budgets, security boundaries. Approves configuration changes via GitOps merge requests.
- **Developer (CI Consumer)**: A developer or CI pipeline that submits jobs. Does not manage runners directly. Cares about jobs starting quickly, a warm build cache, and artifacts landing in the right registry.
- **Platform Engineer**: Maintains the IaC, Nix flake, Bazel build, and deployment pipeline. Extends the platform with new runner types, modules, or integrations.
| ID | Story | Acceptance |
| --- | --- | --- |
| OP-1 | As an operator, I want to see all runners and their status at a glance so I can spot problems quickly | Dashboard landing page shows fleet status, job counts, failure rates |
| OP-2 | As an operator, I want to pause a runner without SSHing into the cluster | Pause/resume buttons on runner detail page, `pause_runner` MCP tool |
| OP-3 | As an operator, I want to see CPU/memory time series for a specific runner | Monitoring page with configurable time windows (1h/6h/24h/7d) |
| OP-4 | As an operator, I want drift detection to alert me when live state diverges from tfvars | GitOps page shows drift items with severity, `/api/gitops/drift` endpoint |
| OP-5 | As an operator, I want to adjust runner concurrency via the dashboard instead of editing tfvars | Config edit form submits a GitOps MR through `/api/gitops/submit` |
| OP-6 | As an operator, I want warm pool scheduling so Nix runners are pre-scaled during business hours | CronJob-based warm pool module with configurable schedules and min replicas |
| OP-7 | As an operator, I want to use Claude Code to query runner status hands-free | MCP server with 13 tools exposing the full dashboard API |
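The drift detection in OP-4 boils down to diffing desired state (derived from tfvars) against live cluster state. A minimal sketch, assuming both states arrive as flat dicts and using an invented severity rule; the real `/api/gitops/drift` schema may differ:

```python
def detect_drift(desired: dict, live: dict) -> list[dict]:
    """Compare desired (tfvars-derived) state against live state.

    Returns one item per diverging field, shaped loosely like a
    hypothetical /api/gitops/drift payload.
    """
    drift = []
    for field, want in desired.items():
        have = live.get(field)
        if have != want:
            drift.append({
                "field": field,
                "desired": want,
                "live": have,
                # Assumed rule: capacity-related divergence is high severity.
                "severity": "high" if field in ("replicas", "cpu", "memory") else "low",
            })
    return drift

items = detect_drift(
    {"replicas": 3, "image": "runner:v2", "paused": False},
    {"replicas": 1, "image": "runner:v2", "paused": True},
)
```

Here the paused flag and replica count have drifted, so two items come back, with the replica divergence flagged high.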
| ID | Story | Acceptance |
| --- | --- | --- |
| OA-1 | As an org admin, I want role-based access so viewers can't mutate runner state | RBAC with viewer/operator/admin hierarchy; mutation endpoints require operator+ |
| OA-2 | As an org admin, I want all config changes tracked as merge requests | GitOps submit flow creates a branch + MR, never applies directly |
| OA-3 | As an org admin, I want per-runner resource limits enforced at the platform level | OpenTofu variables define CPU/memory requests and limits per runner type |
| OA-4 | As an org admin, I want HPA policies that prevent runaway scaling | Configurable min/max replicas, CPU/memory targets, scale-down stabilization window |
| OA-5 | As an org admin, I want multi-forge support (GitHub + GitLab) from a single platform | Dashboard groups runners by forge; ARC handles GitHub, gitlab-runner handles GitLab |
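The viewer/operator/admin hierarchy in OA-1 can be expressed as an ordered ranking. A sketch under that assumption — a simple rank comparison, not necessarily how the dashboard implements it:

```python
# Ordered role hierarchy: each role inherits everything below it.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}

def can_call(role: str, endpoint_min_role: str) -> bool:
    """True if `role` meets the endpoint's minimum role.

    Mutation endpoints would declare a minimum of "operator", so
    viewers are rejected while operators and admins pass.
    """
    return ROLE_RANK[role] >= ROLE_RANK[endpoint_min_role]
```

For example, `can_call("viewer", "operator")` is false, which is exactly the "viewers can't mutate runner state" guarantee.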
| ID | Story | Acceptance |
| --- | --- | --- |
| DC-1 | As a developer, I want my CI job to start within 60 seconds of being queued | Warm pool keeps minimum replicas ready during peak hours |
| DC-2 | As a developer, I want Nix builds to reuse cached inputs across jobs | Image-baked Nix plus Attic-backed reuse across ephemeral jobs, without a hard shared-PVC dependency |
| DC-3 | As a developer, I want to know which runner types support my workload tags | Runner detail page and `list_runners` MCP tool show tags per runner |
| DC-4 | As a developer, I want build cache hits so incremental builds are fast | Attic binary cache for Nix, Bazel remote cache, BuildKit registry cache for containers |
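The tag matching behind DC-3 is a subset test: a runner can take a job when its tag set covers every tag the job requests. A sketch of that assumed semantics:

```python
def runners_for_tags(job_tags: set[str], runners: dict[str, set[str]]) -> list[str]:
    """Return runner names whose tags cover every tag the job requests.

    Assumed semantics: a job tagged {"nix", "linux"} can only land on a
    runner advertising at least both tags.
    """
    return sorted(name for name, tags in runners.items() if job_tags <= tags)
```

A fleet like `{"nix-large": {"nix", "linux", "x86_64"}, "bazel-small": {"bazel", "linux"}}` would match a `{"nix", "linux"}` job only to `nix-large`, while a bare `{"linux"}` job matches both.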
| ID | Story | Acceptance |
| --- | --- | --- |
| PE-1 | As a platform engineer, I want to add a new runner type with a single module call | `arc-runner` module accepts name, type, image, resources, tags |
| PE-2 | As a platform engineer, I want CI to validate all OpenTofu modules on every PR | `validate.yml` runs `tofu init -backend=false && tofu validate` for all 15+ modules |
| PE-3 | As a platform engineer, I want Prometheus metrics from the ARC controller | ServiceMonitor scrapes `gha_controller_*` metrics; PrometheusRule fires 5 alerts |
| PE-4 | As a platform engineer, I want the dashboard API to have a consistent contract | Envelope pattern: `{ data, meta }` on success, `{ error: { code, message } }` on failure |
| PE-5 | As a platform engineer, I want FlakeHub publishing automated on release | GitHub Actions workflow with OIDC auth publishes flake on push to main or tag |
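The envelope contract in PE-4 can be captured with two tiny constructors. A sketch — the field names beyond `data`, `meta`, `error`, `code`, and `message` (which the acceptance criterion specifies) are illustrative:

```python
def ok(data, **meta):
    """Success envelope: { data, meta }."""
    return {"data": data, "meta": meta}

def err(code: str, message: str):
    """Failure envelope: { error: { code, message } }."""
    return {"error": {"code": code, "message": message}}

# A listing response and a not-found response under this contract:
listing = ok([{"name": "nix-large"}], page=1, total=1)
failure = err("runner_not_found", "no runner named 'foo'")
```

Every endpoint returning one of exactly these two shapes is what makes the contract "consistent": clients branch on the presence of `error` and never need per-endpoint parsing.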
Alert fires (Prometheus)
-> Operator opens dashboard or calls `get_fleet_metrics` via MCP
-> Identifies failing runner via status/metrics
-> Pauses runner (dashboard button or `pause_runner` MCP tool)
-> Investigates pods (`list_pods` or k9s)
-> Fixes issue (restart pod, adjust config)
-> Resumes runner
-> Verifies jobs flowing via metrics
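The incident-response loop above can be encoded as an ordered runbook. A sketch using the MCP tool names this document mentions (`get_fleet_metrics`, `pause_runner`, `list_pods`); `resume_runner` and the manual `fix` step are assumed names:

```python
# Ordered incident-response runbook: (tool-or-step, purpose).
INCIDENT_RUNBOOK = [
    ("get_fleet_metrics", "confirm the alert and locate the failing runner"),
    ("pause_runner", "stop new jobs landing on the bad runner"),
    ("list_pods", "inspect pods for the crash loop"),
    ("fix", "restart pod or adjust config (manual step)"),
    ("resume_runner", "re-enable the runner"),
    ("get_fleet_metrics", "verify jobs are flowing again"),
]

def next_step(completed: int):
    """Given how many steps are done, return the next one, or None when finished."""
    return INCIDENT_RUNBOOK[completed] if completed < len(INCIDENT_RUNBOOK) else None
```

Note the loop is bracketed by the same metrics check: the incident is only closed once the post-resume `get_fleet_metrics` matches the pre-pause baseline.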
Operator opens runner detail page
-> Clicks "Edit Config"
-> Adjusts concurrency, resources, or flags
-> Submits form
-> Dashboard calls POST /api/gitops/submit
-> Backend creates branch + merge request
-> Org admin reviews and approves MR
-> CI runs tofu plan
-> Merge triggers tofu apply
-> Dashboard drift check confirms convergence
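What the dashboard might POST to `/api/gitops/submit` can be sketched as follows; the payload schema here is an assumption, not the actual contract:

```python
import json

def gitops_submit_payload(runner: str, changes: dict, author: str) -> str:
    """Build a hypothetical /api/gitops/submit request body.

    The backend turns this into a branch + merge request; nothing is
    applied directly, matching OA-2.
    """
    return json.dumps({
        "runner": runner,
        "changes": changes,          # e.g. {"concurrency": 8}
        "author": author,
        "apply": False,              # never applied directly; MR review required
    }, sort_keys=True)
```

The hard-coded `"apply": False` mirrors the workflow above: the submit endpoint only ever opens an MR, and convergence is confirmed afterwards by the drift check rather than by the submitter.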
Platform engineer adds module call in stacks/arc-runners/main.tf
-> Defines runner name, type, image, resources, tags
-> Optionally enables nix_store, warm_pool
-> Runs `just tofu-plan arc-runners` locally
-> Opens PR
-> CI validates module + stack
-> Review + merge
-> CI applies to cluster
-> Runner appears in dashboard automatically
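The single module call in this workflow might look like the following HCL. The inputs `name`, `type`, `image`, `resources`, and `tags` come from PE-1 and the optional `nix_store`/`warm_pool` flags from the step above; the `source` path, value shapes, and image are guesses:

```hcl
# stacks/arc-runners/main.tf -- hypothetical shape of an arc-runner module call
module "nix_large" {
  source = "../../modules/arc-runner" # path is an assumption

  name  = "nix-large"
  type  = "github"
  image = "ghcr.io/example/nix-runner:latest" # placeholder image
  tags  = ["nix", "linux", "x86_64"]

  resources = {
    cpu    = "4"
    memory = "8Gi"
  }

  # Optional features enabled per the workflow above
  nix_store = true
  warm_pool = true
}
```

After merge and apply, the runner surfaces in the dashboard with these tags, so no dashboard-side registration step is needed.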
Developer pushes to repo
-> CI dispatches job with `nix` tag
-> ARC assigns to warm Nix runner (pre-scaled by warm pool CronJob)
-> Runner image already contains the Nix toolchain
-> Build resolves from Attic cache or builds and pushes to Attic
-> Subsequent jobs on same or different pod reuse Attic-backed cache state
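The warm pool behavior this flow relies on (OP-6, DC-1) reduces to a schedule-driven minimum replica count. A sketch assuming business hours of 08:00-18:00 on weekdays; the actual CronJob schedules and replica counts are configurable:

```python
def warm_pool_min_replicas(weekday: int, hour: int,
                           peak_min: int = 3, off_peak_min: int = 0) -> int:
    """Minimum ready replicas for the Nix runner pool.

    weekday: 0=Monday..6=Sunday; hour: 0-23 local time.
    During assumed business hours (08:00-18:00, Mon-Fri) keep `peak_min`
    replicas warm so queued jobs land on a pre-scaled runner instead of
    paying the cold-start penalty.
    """
    in_business_hours = weekday < 5 and 8 <= hour < 18
    return peak_min if in_business_hours else off_peak_min
```

Scaling to zero off-peak keeps the idle cost down, while the image-baked Nix toolchain plus the Attic cache keep even a cold start well under a from-scratch toolchain install.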
| Metric | Baseline | Target | Source |
| --- | --- | --- | --- |
| Runner fleet uptime | unmeasured | 99.5% | Prometheus `up` metric |
| Mean job queue time | unmeasured | < 60s (peak), < 15s (off-peak) | ARC controller metrics |
| Failed job rate | unmeasured | < 2% | `gha_controller_pending_ephemeral_runners` |
| Configuration drift items | unmeasured | 0 sustained | `/api/gitops/drift` |
| Alert response time | unmeasured | < 15 min (P1), < 1h (P2) | PagerDuty/manual |
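These targets can be checked mechanically once the sources are wired up. A sketch that evaluates a metrics snapshot against the operational targets above; the snapshot field names are assumptions:

```python
def operational_breaches(snapshot: dict) -> list[str]:
    """Return the names of operational targets the snapshot misses.

    Thresholds mirror the table above: 99.5% uptime, <60s mean queue
    time, <2% failed jobs, zero sustained drift items.
    """
    checks = {
        "uptime": snapshot["uptime_pct"] >= 99.5,
        "queue_time": snapshot["mean_queue_s"] < 60,
        "failed_jobs": snapshot["failed_job_rate_pct"] < 2.0,
        "drift": snapshot["drift_items"] == 0,
    }
    return [name for name, passed in checks.items() if not passed]
```

An empty list means all operational targets are met; otherwise the list names the metrics to investigate first.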
| Metric | Baseline | Target | Source |
| --- | --- | --- | --- |
| Nix cache hit rate | unmeasured | > 80% | Attic server metrics |
| Bazel cache hit rate | unmeasured | > 70% | Bazel remote cache metrics |
| Cold start time (non-Nix) | ~30s | < 20s | ARC scale-up duration |
| Cold start time (Nix, warm pool) | ~120s | < 45s | Runner registration + cache warm-up |
| Resource utilization (CPU) | unmeasured | 40-70% average | Prometheus node metrics |
| Metric | Baseline | Target | Source |
| --- | --- | --- | --- |
| Mean CI pipeline duration | unmeasured | < 10 min (unit), < 30 min (full) | Forge pipeline APIs |
| Dashboard page load time | unmeasured | < 2s (P95) | Synthetic monitoring |
| MCP tool response time | unmeasured | < 3s (P95) | MCP server logs |
| Config change lead time (MR open to applied) | unmeasured | < 2h (business hours) | Forge MR metrics |
| ID | Question | Context | Owner |
| --- | --- | --- | --- |
| D-1 | FlakeHub Cache vs self-hosted Attic? | #187 — cost, latency, and sovereignty tradeoffs | Platform Engineer |
| D-2 | Tailscale Operator for cluster auth? | #178 — replaces manual kubeconfig, enables tsidp for dashboard auth | Org Admin |
| D-3 | Liqo or pure ARC for multi-cluster burst? | #170 — Liqo was the prior model, now deprecated; define post-Liqo topology | Platform Engineer |
| D-4 | Single cluster (blahaj) or multi-cluster? | #169 — deployment contract for dev/prod stacks | Org Admin |
| D-5 | Pulp for unified registry caching? | #66 — containers, npm, PyPI, Nix through one cache layer | Platform Engineer |
| D-6 | GPU runner scheduling strategy? | #44 — Dawn/WebGPU/L40S/A100 workload placement | Platform Engineer |