GloriousFlywheel User Stories, Flows & KPIs

Personas

Operator

Day-to-day runner fleet manager. Monitors health, responds to incidents, adjusts capacity. Typical context: on-call, reacting to build queue backup or pod crash loop.

Org Admin

Sets policy for the runner platform: who can use which runners, resource budgets, security boundaries. Approves configuration changes via GitOps merge requests.

Downstream Consumer

Developer or CI pipeline that submits jobs. Does not manage runners directly. Cares about: job starts quickly, build cache is warm, artifacts land in the right registry.

Platform Engineer

Maintains the IaC, Nix flake, Bazel build, and deployment pipeline. Extends the platform with new runner types, modules, or integrations.

User Stories

Operator Stories

| ID | Story | Acceptance |
|----|-------|------------|
| OP-1 | As an operator, I want to see all runners and their status at a glance so I can spot problems quickly | Dashboard landing page shows fleet status, job counts, failure rates |
| OP-2 | As an operator, I want to pause a runner without SSHing into the cluster | Pause/resume buttons on runner detail page, MCP `pause_runner` tool |
| OP-3 | As an operator, I want to see CPU/memory time series for a specific runner | Monitoring page with configurable time windows (1h/6h/24h/7d) |
| OP-4 | As an operator, I want drift detection to alert me when live state diverges from tfvars | GitOps page shows drift items with severity, `/api/gitops/drift` endpoint |
| OP-5 | As an operator, I want to adjust runner concurrency via the dashboard instead of editing tfvars | Config edit form submits GitOps MR through `/api/gitops/submit` |
| OP-6 | As an operator, I want warm pool scheduling so Nix runners are pre-scaled during business hours | CronJob-based warm pool module with configurable schedules and min replicas |
| OP-7 | As an operator, I want to use Claude Code to query runner status hands-free | MCP server with 13 tools exposing full dashboard API |
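
OP-6's warm pool can be pictured as a CronJob that raises the minimum replica count at the start of business hours. This is a sketch only: the resource names, the `AutoscalingRunnerSet` patch target, and the schedule are assumptions, not the warm pool module's actual output.

```yaml
# Sketch: scale the Nix runner set up at 08:00 on weekdays; a sibling
# CronJob would lower minRunners again in the evening. All names here
# are illustrative.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nix-warm-pool-up
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: warm-pool-scaler
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - autoscalingrunnerset
                - nix-runners
                - --type=merge
                - -p
                - '{"spec": {"minRunners": 4}}'
```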

Org Admin Stories

| ID | Story | Acceptance |
|----|-------|------------|
| OA-1 | As an org admin, I want role-based access so viewers can’t mutate runner state | RBAC with viewer/operator/admin hierarchy, mutation endpoints require operator+ |
| OA-2 | As an org admin, I want all config changes tracked as merge requests | GitOps submit flow creates branch + MR, never applies directly |
| OA-3 | As an org admin, I want per-runner resource limits enforced at the platform level | OpenTofu variables define CPU/memory requests and limits per runner type |
| OA-4 | As an org admin, I want HPA policies that prevent runaway scaling | Configurable min/max replicas, CPU/memory targets, scale-down stabilization window |
| OA-5 | As an org admin, I want multi-forge support (GitHub + GitLab) from a single platform | Dashboard groups runners by forge, ARC handles GitHub, gitlab-runner handles GitLab |

Downstream Consumer Stories

| ID | Story | Acceptance |
|----|-------|------------|
| DC-1 | As a developer, I want my CI job to start within 60 seconds of being queued | Warm pool keeps minimum replicas ready during peak hours |
| DC-2 | As a developer, I want Nix builds to reuse cached inputs across jobs | Image-baked Nix plus Attic-backed reuse across ephemeral jobs, without a hard shared-PVC dependency |
| DC-3 | As a developer, I want to know which runner types support my workload tags | Runner detail page and `list_runners` MCP tool show tags per runner |
| DC-4 | As a developer, I want build cache hits so incremental builds are fast | Attic binary cache for Nix, Bazel remote cache, BuildKit registry cache for containers |

Platform Engineer Stories

| ID | Story | Acceptance |
|----|-------|------------|
| PE-1 | As a platform engineer, I want to add a new runner type with a single module call | `arc-runner` module accepts name, type, image, resources, tags |
| PE-2 | As a platform engineer, I want CI to validate all OpenTofu modules on every PR | `validate.yml` runs `tofu init -backend=false && tofu validate` for all 15+ modules |
| PE-3 | As a platform engineer, I want Prometheus metrics from the ARC controller | ServiceMonitor scrapes `gha_controller_*` metrics, PrometheusRule fires 5 alerts |
| PE-4 | As a platform engineer, I want the dashboard API to have a consistent contract | Envelope pattern: `{ data, meta }` success / `{ error: { code, message } }` failure |
| PE-5 | As a platform engineer, I want FlakeHub publishing automated on release | GitHub Actions workflow with OIDC auth publishes flake on push to main or tag |
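
PE-4's envelope contract can be sketched in TypeScript. Only `data`, `meta`, and `error.code`/`error.message` come from the story; the `requestId` field and the example error code are assumptions for illustration.

```typescript
// Sketch of the dashboard API envelope described in PE-4.
// The concrete contents of `meta` (e.g. requestId) are assumptions.
type Success<T> = { data: T; meta: { requestId?: string } };
type Failure = { error: { code: string; message: string } };
type Envelope<T> = Success<T> | Failure;

// Narrowing helper: an envelope is a failure iff it carries an `error` key.
function isFailure<T>(e: Envelope<T>): e is Failure {
  return "error" in e;
}

const ok: Envelope<{ runners: number }> = {
  data: { runners: 12 },
  meta: {},
};

const bad: Envelope<never> = {
  error: { code: "RUNNER_NOT_FOUND", message: "no such runner" },
};
```

The discriminated-union shape lets clients handle both arms exhaustively without inspecting HTTP status codes.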

Key Flows

Flow 1: Runner Incident Response

Alert fires (Prometheus)
  -> Operator opens dashboard or calls `get_fleet_metrics` via MCP
  -> Identifies failing runner via status/metrics
  -> Pauses runner (dashboard button or `pause_runner` MCP tool)
  -> Investigates pods (`list_pods` or k9s)
  -> Fixes issue (restart pod, adjust config)
  -> Resumes runner
  -> Verifies jobs flowing via metrics
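
The triage step in the flow above amounts to filtering fleet state for unhealthy runners. A minimal sketch, assuming a runner summary shape (the `RunnerSummary` fields are illustrative, not the actual dashboard API model):

```typescript
// Hypothetical runner summary, loosely modeled on what a fleet
// metrics endpoint might return; field names are assumptions.
interface RunnerSummary {
  name: string;
  status: "healthy" | "degraded" | "failed" | "paused";
  failedJobs: number;
}

// Pick out runners worth pausing: failed outright, or degraded with
// a non-trivial failure count. The operator would then pause each
// candidate via the dashboard or the pause_runner MCP tool.
function triage(fleet: RunnerSummary[], failureThreshold = 3): RunnerSummary[] {
  return fleet.filter(
    (r) =>
      r.status === "failed" ||
      (r.status === "degraded" && r.failedJobs >= failureThreshold)
  );
}
```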

Flow 2: Configuration Change

Operator opens runner detail page
  -> Clicks "Edit Config"
  -> Adjusts concurrency, resources, or flags
  -> Submits form
  -> Dashboard calls POST /api/gitops/submit
  -> Backend creates branch + merge request
  -> Org admin reviews and approves MR
  -> CI runs tofu plan
  -> Merge triggers tofu apply
  -> Dashboard drift check confirms convergence
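
The submit call in the flow above carries the proposed change as JSON. A sketch of a request-body builder, assuming field names for the `/api/gitops/submit` payload (the real schema lives in the dashboard backend):

```typescript
// Hypothetical shape of a GitOps config-change submission.
// Field names here are illustrative assumptions.
interface ConfigChange {
  runner: string;
  changes: Record<string, string | number | boolean>;
  reason: string;
}

// Build the JSON body for POST /api/gitops/submit. The backend is
// expected to turn this into a branch plus merge request, never a
// direct apply (per OA-2).
function buildSubmitBody(change: ConfigChange): string {
  return JSON.stringify({
    runner: change.runner,
    changes: change.changes,
    reason: change.reason,
    dryRun: false, // assumed flag; MR review is the real gate
  });
}
```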

Flow 3: New Runner Type Onboarding

Platform engineer adds module call in stacks/arc-runners/main.tf
  -> Defines runner name, type, image, resources, tags
  -> Optionally enables nix_store, warm_pool
  -> Runs `just tofu-plan arc-runners` locally
  -> Opens PR
  -> CI validates module + stack
  -> Review + merge
  -> CI applies to cluster
  -> Runner appears in dashboard automatically
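
Per PE-1, the first step of this flow implies a module call roughly like the following. Attribute names mirror the story's list (name, type, image, resources, tags); the actual variable names in the `arc-runner` module may differ, and the image reference is a placeholder.

```hcl
# Sketch of a new runner type in stacks/arc-runners/main.tf.
# Attribute names are illustrative, not the module's actual interface.
module "nix_large" {
  source = "../../modules/arc-runner"

  name  = "nix-large"
  type  = "nix"
  image = "ghcr.io/example/nix-runner:latest" # placeholder image
  tags  = ["nix", "large"]

  resources = {
    cpu    = "4"
    memory = "16Gi"
  }

  # Optional features named in this flow.
  nix_store = true
  warm_pool = true
}
```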

Flow 4: Nix Build with Warm Cache

Developer pushes to repo
  -> CI dispatches job with `nix` tag
  -> ARC assigns to warm Nix runner (pre-scaled by warm pool CronJob)
  -> Runner image already contains the Nix toolchain
  -> Build resolves from Attic cache or builds and pushes to Attic
  -> Subsequent jobs on same or different pod reuse Attic-backed cache state
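
The Attic-backed reuse above assumes the runner image's Nix configuration trusts the cache as a substituter, roughly like this (the Attic URL and its public key are placeholders; the cache.nixos.org entry is the standard upstream default):

```
# /etc/nix/nix.conf baked into the runner image (placeholder values)
substituters = https://attic.example.internal/gf-cache https://cache.nixos.org
trusted-public-keys = gf-cache:AAAA...= cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=
```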

KPIs

Operational Health

| Metric | Baseline | Target | Source |
|--------|----------|--------|--------|
| Runner fleet uptime | unmeasured | 99.5% | Prometheus `up` metric |
| Mean job queue time | unmeasured | < 60s (peak), < 15s (off-peak) | ARC controller metrics |
| Failed job rate | unmeasured | < 2% | `gha_controller_pending_ephemeral_runners` |
| Configuration drift items | unmeasured | 0 sustained | `/api/gitops/drift` |
| Alert response time | unmeasured | < 15 min (P1), < 1h (P2) | PagerDuty/manual |
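
The fleet-uptime KPI could be computed with a PromQL expression along these lines (the `job` label value is a placeholder for whatever the ServiceMonitor actually sets):

```
# Fraction of scrapes over 30 days in which controller targets were up.
# The job label is an assumption, not the deployed scrape config.
avg_over_time(up{job="arc-controller"}[30d])
```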

Platform Efficiency

| Metric | Baseline | Target | Source |
|--------|----------|--------|--------|
| Nix cache hit rate | unmeasured | > 80% | Attic server metrics |
| Bazel cache hit rate | unmeasured | > 70% | Bazel remote cache metrics |
| Cold start time (non-Nix) | ~30s | < 20s | ARC scale-up duration |
| Cold start time (Nix, warm pool) | ~120s | < 45s | Runner registration + cache warm-up |
| Resource utilization (CPU) | unmeasured | 40-70% average | Prometheus node metrics |

Developer Experience

| Metric | Baseline | Target | Source |
|--------|----------|--------|--------|
| Mean CI pipeline duration | unmeasured | < 10 min (unit), < 30 min (full) | Forge pipeline APIs |
| Dashboard page load time | unmeasured | < 2s (P95) | Synthetic monitoring |
| MCP tool response time | unmeasured | < 3s (P95) | MCP server logs |
| Config change lead time (MR open to applied) | unmeasured | < 2h (business hours) | Forge MR metrics |

Open Decision Points

| ID | Question | Context | Owner |
|----|----------|---------|-------|
| D-1 | FlakeHub Cache vs self-hosted Attic? | #187: cost, latency, and sovereignty tradeoffs | Platform Engineer |
| D-2 | Tailscale Operator for cluster auth? | #178: replaces manual kubeconfig, enables tsidp for dashboard auth | Org Admin |
| D-3 | Liqo or pure ARC for multi-cluster burst? | #170: Liqo was the prior model, now deprecated; define post-Liqo topology | Platform Engineer |
| D-4 | Single cluster (blahaj) or multi-cluster? | #169: deployment contract for dev/prod stacks | Org Admin |
| D-5 | Pulp for unified registry caching? | #66: containers, npm, PyPI, Nix through one cache layer | Platform Engineer |
| D-6 | GPU runner scheduling strategy? | #44: Dawn/WebGPU/L40S/A100 workload placement | Platform Engineer |