GloriousFlywheel User Stories, Flows & KPIs

Personas

Operator

Day-to-day runner fleet manager. Monitors health, responds to incidents, adjusts capacity. Typical context: on-call, reacting to build queue backup or pod crash loop.

Org Admin

Sets policy for the runner platform: who can use which runners, resource budgets, security boundaries. Approves configuration changes via GitOps merge requests.

Downstream Consumer

Developer or CI pipeline that submits jobs. Does not manage runners directly. Cares about: job starts quickly, build cache is warm, artifacts land in the right registry.

Platform Engineer

Maintains the IaC, Nix flake, Bazel build, and deployment pipeline. Extends the platform with new runner types, modules, or integrations.

User Stories

Operator Stories

| ID | Story | Acceptance |
|----|-------|------------|
| OP-1 | As an operator, I want to see all runners and their status at a glance so I can spot problems quickly | Dashboard landing page shows fleet status, job counts, failure rates |
| OP-2 | As an operator, I want to pause a runner without SSHing into the cluster | Pause/resume buttons on runner detail page, MCP `pause_runner` tool |
| OP-3 | As an operator, I want to see CPU/memory time series for a specific runner | Monitoring page with configurable time windows (1h/6h/24h/7d) |
| OP-4 | As an operator, I want drift detection to alert me when live state diverges from tfvars | GitOps page shows drift items with severity, `/api/gitops/drift` endpoint |
| OP-5 | As an operator, I want to adjust runner concurrency via the dashboard instead of editing tfvars | Config edit form submits GitOps MR through `/api/gitops/submit` |
| OP-6 | As an operator, I want warm pool scheduling so Nix runners are pre-scaled during business hours | CronJob-based warm pool module with configurable schedules and min replicas |
| OP-7 | As an operator, I want to use Claude Code to query runner status hands-free | MCP server with 13 tools exposing full dashboard API |
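
OP-6's warm pool can be pictured as a CronJob that raises the minimum replica count at the start of business hours. This is a sketch only: the resource names, the `AutoscalingRunnerSet` patch target, and the schedule are assumptions, not the warm pool module's actual output.

```yaml
# Sketch: scale the Nix runner set up at 08:00 on weekdays; a sibling
# CronJob would lower minRunners again in the evening. All names here
# are illustrative.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nix-warm-pool-up
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: warm-pool-scaler
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - autoscalingrunnerset
                - nix-runners
                - --type=merge
                - -p
                - '{"spec": {"minRunners": 4}}'
```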

Org Admin Stories

| ID | Story | Acceptance |
|----|-------|------------|
| OA-1 | As an org admin, I want role-based access so viewers can’t mutate runner state | RBAC with viewer/operator/admin hierarchy, mutation endpoints require operator+ |
| OA-2 | As an org admin, I want all config changes tracked as merge requests | GitOps submit flow creates branch + MR, never applies directly |
| OA-3 | As an org admin, I want per-runner resource limits enforced at the platform level | OpenTofu variables define CPU/memory requests and limits per runner type |
| OA-4 | As an org admin, I want HPA policies that prevent runaway scaling | Configurable min/max replicas, CPU/memory targets, scale-down stabilization window |
| OA-5 | As an org admin, I want multi-forge support (GitHub + GitLab) from a single platform | Dashboard groups runners by forge, ARC handles GitHub, gitlab-runner handles GitLab |

Downstream Consumer Stories

| ID | Story | Acceptance |
|----|-------|------------|
| DC-1 | As a developer, I want my CI job to start within 60 seconds of being queued | Warm pool keeps minimum replicas ready during peak hours |
| DC-2 | As a developer, I want Nix builds to reuse cached inputs across jobs | Image-baked Nix plus Attic-backed reuse across ephemeral jobs, without a hard shared-PVC dependency |
| DC-3 | As a developer, I want to know which runner types support my workload tags | Runner detail page and `list_runners` MCP tool show tags per runner |
| DC-4 | As a developer, I want build cache hits so incremental builds are fast | Attic binary cache for Nix, Bazel remote cache, BuildKit registry cache for containers |

Platform Engineer Stories

| ID | Story | Acceptance |
|----|-------|------------|
| PE-1 | As a platform engineer, I want to add a new runner type with a single module call | `arc-runner` module accepts name, type, image, resources, tags |
| PE-2 | As a platform engineer, I want CI to validate all OpenTofu modules on every PR | `validate.yml` runs `tofu init -backend=false && tofu validate` for all 15+ modules |
| PE-3 | As a platform engineer, I want Prometheus metrics from the ARC controller | ServiceMonitor scrapes `gha_controller_*` metrics, PrometheusRule fires 5 alerts |
| PE-4 | As a platform engineer, I want the dashboard API to have a consistent contract | Envelope pattern: `{ data, meta }` success / `{ error: { code, message } }` failure |
| PE-5 | As a platform engineer, I want FlakeHub publishing automated on release | GitHub Actions workflow with OIDC auth publishes flake on push to main or tag |
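
PE-4's envelope contract can be sketched in TypeScript. Only `data`, `meta`, and `error.code`/`error.message` come from the story; the `requestId` field and the example error code are assumptions for illustration.

```typescript
// Sketch of the dashboard API envelope described in PE-4.
// The concrete contents of `meta` (e.g. requestId) are assumptions.
type Success<T> = { data: T; meta: { requestId?: string } };
type Failure = { error: { code: string; message: string } };
type Envelope<T> = Success<T> | Failure;

// Narrowing helper: an envelope is a failure iff it carries an `error` key.
function isFailure<T>(e: Envelope<T>): e is Failure {
  return "error" in e;
}

const ok: Envelope<{ runners: number }> = {
  data: { runners: 12 },
  meta: {},
};

const bad: Envelope<never> = {
  error: { code: "RUNNER_NOT_FOUND", message: "no such runner" },
};
```

The discriminated-union shape lets clients handle both arms exhaustively without inspecting HTTP status codes.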

Key Flows

Flow 1: Runner Incident Response

Alert fires (Prometheus)
  -> Operator opens dashboard or calls `get_fleet_metrics` via MCP
  -> Identifies failing runner via status/metrics
  -> Pauses runner (dashboard button or `pause_runner` MCP tool)
  -> Investigates pods (`list_pods` or k9s)
  -> Fixes issue (restart pod, adjust config)
  -> Resumes runner
  -> Verifies jobs flowing via metrics
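
The triage step in the flow above amounts to filtering fleet state for unhealthy runners. A minimal sketch, assuming a runner summary shape (the `RunnerSummary` fields are illustrative, not the actual dashboard API model):

```typescript
// Hypothetical runner summary, loosely modeled on what a fleet
// metrics endpoint might return; field names are assumptions.
interface RunnerSummary {
  name: string;
  status: "healthy" | "degraded" | "failed" | "paused";
  failedJobs: number;
}

// Pick out runners worth pausing: failed outright, or degraded with
// a non-trivial failure count. The operator would then pause each
// candidate via the dashboard or the pause_runner MCP tool.
function triage(fleet: RunnerSummary[], failureThreshold = 3): RunnerSummary[] {
  return fleet.filter(
    (r) =>
      r.status === "failed" ||
      (r.status === "degraded" && r.failedJobs >= failureThreshold)
  );
}
```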

Flow 2: Configuration Change

Operator opens runner detail page
  -> Clicks "Edit Config"
  -> Adjusts concurrency, resources, or flags
  -> Submits form
  -> Dashboard calls POST /api/gitops/submit
  -> Backend creates branch + merge request
  -> Org admin reviews and approves MR
  -> CI runs tofu plan
  -> Merge triggers tofu apply
  -> Dashboard drift check confirms convergence
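
The submit call in the flow above carries the proposed change as JSON. A sketch of a request-body builder, assuming field names for the `/api/gitops/submit` payload (the real schema lives in the dashboard backend):

```typescript
// Hypothetical shape of a GitOps config-change submission.
// Field names here are illustrative assumptions.
interface ConfigChange {
  runner: string;
  changes: Record<string, string | number | boolean>;
  reason: string;
}

// Build the JSON body for POST /api/gitops/submit. The backend is
// expected to turn this into a branch plus merge request, never a
// direct apply (per OA-2).
function buildSubmitBody(change: ConfigChange): string {
  return JSON.stringify({
    runner: change.runner,
    changes: change.changes,
    reason: change.reason,
    dryRun: false, // assumed flag; MR review is the real gate
  });
}
```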

Flow 3: New Runner Type Onboarding

Platform engineer adds module call in stacks/arc-runners/main.tf
  -> Defines runner name, type, image, resources, tags
  -> Optionally enables nix_store, warm_pool
  -> Runs `just tofu-plan arc-runners` locally
  -> Opens PR
  -> CI validates module + stack
  -> Review + merge
  -> CI applies to cluster
  -> Runner appears in dashboard automatically
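
Per PE-1, the first step of this flow implies a module call roughly like the following. Attribute names mirror the story's list (name, type, image, resources, tags); the actual variable names in the `arc-runner` module may differ, and the image reference is a placeholder.

```hcl
# Sketch of a new runner type in stacks/arc-runners/main.tf.
# Attribute names are illustrative, not the module's actual interface.
module "nix_large" {
  source = "../../modules/arc-runner"

  name  = "nix-large"
  type  = "nix"
  image = "ghcr.io/example/nix-runner:latest" # placeholder image
  tags  = ["nix", "large"]

  resources = {
    cpu    = "4"
    memory = "16Gi"
  }

  # Optional features named in this flow.
  nix_store = true
  warm_pool = true
}
```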

Flow 4: Nix Build with Warm Cache

Developer pushes to repo
  -> CI dispatches job with `nix` tag
  -> ARC assigns to warm Nix runner (pre-scaled by warm pool CronJob)
  -> Runner image already contains the Nix toolchain
  -> Build resolves from Attic cache or builds and pushes to Attic
  -> Subsequent jobs on same or different pod reuse Attic-backed cache state
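
The Attic-backed reuse above assumes the runner image's Nix configuration trusts the cache as a substituter, roughly like this (the Attic URL and its public key are placeholders; the cache.nixos.org entry is the standard upstream default):

```
# /etc/nix/nix.conf baked into the runner image (placeholder values)
substituters = https://attic.example.internal/gf-cache https://cache.nixos.org
trusted-public-keys = gf-cache:AAAA...= cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=
```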

KPIs

Operational Health

| Metric | Baseline | Target | Source |
|--------|----------|--------|--------|
| Runner fleet uptime | unmeasured | 99.5% | Prometheus `up` metric |
| Mean job queue time | unmeasured | < 60s (peak), < 15s (off-peak) | ARC controller metrics |
| Failed job rate | unmeasured | < 2% | `gha_controller_pending_ephemeral_runners` |
| Configuration drift items | unmeasured | 0 sustained | `/api/gitops/drift` |
| Alert response time | unmeasured | < 15 min (P1), < 1h (P2) | PagerDuty/manual |
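
The fleet-uptime KPI could be computed with a PromQL expression along these lines (the `job` label value is a placeholder for whatever the ServiceMonitor actually sets):

```
# Fraction of scrapes over 30 days in which controller targets were up.
# The job label is an assumption, not the deployed scrape config.
avg_over_time(up{job="arc-controller"}[30d])
```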

Platform Efficiency

| Metric | Baseline | Target | Source |
|--------|----------|--------|--------|
| Nix cache hit rate | unmeasured | > 80% | Attic server metrics |
| Bazel cache hit rate | unmeasured | > 70% | Bazel remote cache metrics |
| Cold start time (non-Nix) | ~30s | < 20s | ARC scale-up duration |
| Cold start time (Nix, warm pool) | ~120s | < 45s | Runner registration + cache warm-up |
| Resource utilization (CPU) | unmeasured | 40-70% average | Prometheus node metrics |

Developer Experience

| Metric | Baseline | Target | Source |
|--------|----------|--------|--------|
| Mean CI pipeline duration | unmeasured | < 10 min (unit), < 30 min (full) | Forge pipeline APIs |
| Dashboard page load time | unmeasured | < 2s (P95) | Synthetic monitoring |
| MCP tool response time | unmeasured | < 3s (P95) | MCP server logs |
| Config change lead time (MR open to applied) | unmeasured | < 2h (business hours) | Forge MR metrics |

Open Decision Points

| ID | Question | Context | Owner |
|----|----------|---------|-------|
| D-1 | FlakeHub Cache vs self-hosted Attic? | #187: cost, latency, and sovereignty tradeoffs | Platform Engineer |
| D-2 | Tailscale Operator for cluster auth? | #178: replaces manual kubeconfig, enables tsidp for dashboard auth | Org Admin |
| D-3 | Liqo or pure ARC for multi-cluster burst? | #170: Liqo was the prior model, now deprecated; define post-Liqo topology | Platform Engineer |
| D-4 | Single cluster (blahaj) or multi-cluster? | #169: deployment contract for dev/prod stacks | Org Admin |
| D-5 | Pulp for unified registry caching? | #66: containers, npm, PyPI, Nix through one cache layer | Platform Engineer |
| D-6 | GPU runner scheduling strategy? | #44: Dawn/WebGPU/L40S/A100 workload placement | Platform Engineer |