Runbook
Operational procedures for managing the runner infrastructure.
For planned k8s/OpenTofu hardening rollouts, start with Live Runner Rollout Checklist. The procedures below are narrower recovery and tuning notes, and several GitLab-oriented sections describe compatibility behavior rather than the primary ARC path.
Important scope note:
- the HPA-oriented procedures below primarily describe the legacy GitLab runner path
- GitHub Actions ARC scale sets use minRunners/maxRunners and static per-runner resource envelopes
- ARC can scale runner count horizontally, but it does not automatically enlarge one runner pod’s memory or CPU limit
- #213 is now primarily about honey ARC runtime integrity and runner-host hygiene, not generic GitLab runner tuning
Scaling Up
To increase the maximum number of replicas for a runner type:
- Edit the HPA max value for the target runner in organization.yaml.
- Apply the change:
tofu apply
- Verify the new HPA configuration:
kubectl get hpa -n {org}-runners
See HPA Tuning for details on stabilization windows and scaling behavior.
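The verification step can also be scripted rather than read off the table by eye. The following is a minimal sketch, not part of the repo: `hpa_max_is` is a hypothetical helper, `ORG` stands in for the {org} placeholder, and the HPA name runner-docker and the value 10 are illustrative.

```shell
# Hypothetical helper: succeed only if the named HPA reports the expected max.
# Reads "name<TAB>minReplicas<TAB>maxReplicas" lines on stdin.
hpa_max_is() {
  name="$1"; expected="$2"
  awk -F '\t' -v n="$name" -v e="$expected" '
    $1 == n { found = 1; if ($3 != e) bad = 1 }
    END { exit (found && !bad) ? 0 : 1 }
  '
}

# Usage against a live cluster (not run here; names and values are examples):
#   kubectl get hpa -n "${ORG}-runners" \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.minReplicas}{"\t"}{.spec.maxReplicas}{"\n"}{end}' \
#     | hpa_max_is runner-docker 10 && echo "max applied"
```

The helper fails both when the value differs and when the HPA is missing, so a typo in the name does not pass silently.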
GitHub Actions ARC Scaling
For the ARC-backed GitHub Actions lanes (tinyland-nix, tinyland-docker,
tinyland-dind):
- classify the current bottleneck before editing source:
just arc-burst-capacity-audit \
  --include-label tinyland-dind \
  --include-label tinyland-nix
just arc-shared-label-capacity-audit \
  --include-label tinyland-dind \
  --include-label tinyland-nix
just kubelet-imagefs-capacity-audit
kubectl --context honey get resourcequota,limitrange -n arc-runners
Read the burst audit’s node-consumer and active-runner-job sections before changing caps. Honey pressure caused by one repo’s live jobs is a different operator problem from stale completed pods, missing quota, or absent Sting PVC-backed overflow.
- only after the audit identifies a source-owned capacity issue, edit the relevant arc-runners stack values for:
  - min_runners / max_runners if you need more parallel capacity
  - cpu_request / cpu_limit / memory_request / memory_limit if one job needs a larger envelope
  - Sting compute-expansion placement, tolerations, and PVC-backed scratch if the workload belongs on overflow capacity
- prove the source envelope before planning or applying:
just runner-scale-contract-check
just runner-capacity-model-check
- apply the stack changes through the managed path:
ENV=dev just tofu-plan arc-runners
ENV=dev just tofu-apply arc-runners
The controller runs update_strategy=immediate (TIN-2056; upstream’s default): a runner-spec change recreates the listener and the new EphemeralRunnerSet generation at once, old-generation busy runners finish their jobs out, and the only documented cost is a transient overprovision overlap. Honey’s pod ceiling is the binding capacity axis (all shared runner payloads pin to that node), so overlap pods from a busy-window respec may briefly sit Pending and can trip the alert_pending_threshold PrometheusRule (5 pending for 10 minutes). Treat that alert as expected signal during a respec, and prefer applying respecs in a quiet window so the quiesce/freeze lane makes the overlap a non-event.
Historically the controller ran eventual, which removed the listener and refused to recreate anything until every running AND pending runner finished, deadlocking saturated sets (TIN-2055 signature: AutoscalingRunnerSet phase Pending, no listener pod in arc-systems, controller log Waiting for the running and pending runners to finish).
The managed Deploy ARC Runners apply asserts a read-only pre-apply wedge canary (every AutoscalingRunnerSet has exactly one EphemeralRunnerSet generation and a Running listener) before any mutating step, reaps idle leaked (no-job) EphemeralRunner CRs between quiesce scoping and the cap freeze so zombies cannot stall the 20-minute drain, and its post-apply listener cap prove is settle-aware (scripts/arc-prove-listener-caps.sh): transient listener recreation settles instead of failing the gate, while persistent drift and unconverged recreation still go red.
The quiesce scope unions in any shared set whose live caps drifted from the tfvars source (warm-pool cron patches are invisible to the helm plan): the always() cap restore reverts that drift on every apply, and a cap revert recreates the listener under either update strategy.
After caps are restored, every successful apply settle-reaps stale-generation zombies by running the reap helper with --stale-generation before the prove gate, so jobless leftovers converge instead of turning the prove red at 1500s. The step is retained as a backstop that degrades to a safe no-op now that immediate deletes stale generations itself. Between applies the cluster-wide runner-zombie-reap CronJob backstop (every 30 minutes, --min-age-seconds 3600) sweeps the same shape across every live scale set, including owner overlays.
- verify the scale sets and rerun the burst audit:
kubectl get autoscalingrunnersets -n arc-runners
kubectl get pods -n arc-runners -o wide
just arc-runtime-audit
just arc-burst-capacity-audit \
  --include-label tinyland-dind \
  --include-label tinyland-nix
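During a respec, the pending-pod signal described in the apply step can be read from a point-in-time snapshot. This is a rough sketch under stated assumptions: `count_pending` and `pending_status` are hypothetical helpers, the threshold of 5 comes from this runbook, whether the real rule compares with "at least" or "more than" is an assumption, and the 10-minute duration is not modeled.

```shell
# Count Pending pods in "name<TAB>phase" lines read from stdin.
count_pending() {
  awk -F '\t' '$2 == "Pending" { n++ } END { print n + 0 }'
}

# Compare one snapshot against the pending threshold (default 5).
# Assumption: "at or above" is treated as tripping; the real rule may differ.
pending_status() {
  threshold="${1:-5}"
  pending="$(count_pending)"
  if [ "$pending" -ge "$threshold" ]; then
    echo "pending=$pending at-or-above-threshold"
  else
    echo "pending=$pending below-threshold"
  fi
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n arc-runners \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#     | pending_status 5
```

A single snapshot at or above the threshold is not by itself an alert; it only indicates that the overlap described above is in progress.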
Do not assume that adding cluster nodes or having large aggregate free RAM will
automatically raise the memory available to one existing ARC runner pod.
Do not treat shared-label maxRunners as a global concurrency policy; it is a
per-scale-set cap, and owner overlays can advertise the same workflow label.
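To make the per-scale-set point concrete, the advertised ceiling for a label is the sum of the caps of every scale set that advertises it. The sketch below assumes a hand-built three-column summary; `label_ceiling` is a hypothetical helper, and the scale-set names and numbers in the usage comment are invented for illustration.

```shell
# Sum maxRunners per workflow label across scale sets.
# Input lines: "scale-set<TAB>label<TAB>maxRunners".
# Output lines: "label<TAB>number-of-sets<TAB>summed-maxRunners", sorted by label.
label_ceiling() {
  awk -F '\t' '
    { total[$2] += $3; sets[$2]++ }
    END { for (l in total) printf "%s\t%d\t%d\n", l, sets[l], total[l] }
  ' | sort
}

# Illustrative input (hypothetical names and caps):
#   printf 'shared-nix\ttinyland-nix\t8\noverlay-nix\ttinyland-nix\t4\n' | label_ceiling
# prints: tinyland-nix<TAB>2<TAB>12
```

The sum is only what the scale sets advertise. Since honey's pod ceiling is the binding capacity axis, the number of runners that actually schedule can be lower.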
Honey Runner Workspace Hygiene
Use this when a downstream job dies inside actions/checkout before repository
code runs, especially with EACCES unlink failures under persistent _work/
paths.
See Honey Runner Workdir Contract for the lifecycle boundary, escalation rules, and replace-first read behind this recovery path.
- If you have the failing run URL or run id, let the repo parse that run and target the affected honey hosts first:
just honey-runner-checkout-triage \
  https://github.com/Jesssullivan/scheduling-bridge/actions/runs/24525417273
Use --parse-only when the current shell should not touch the remote hosts yet or cannot reach them over SSH.
- Audit the affected runner hosts directly when you need the raw host view or are not starting from a known run:
just honey-runner-workdir-audit
Or target specific hosts:
just honey-runner-workdir-audit honey-am-1 honey-am-2
- Generate the bounded recovery plan:
just honey-runner-workdir-reconcile
This will stop at escalation when more than one repo workdir is contaminated on the same host.
- If the reconcile output identifies a safe one-repo host, drain the affected runner host root first:
just honey-runner-host-lifecycle honey-am-2 drain
- Preview the bounded remediation explicitly when you want to inspect one repo target yourself:
just honey-runner-workdir-remediate honey-am-2 scheduling-bridge
- Apply one of the bounded remediation modes:
just honey-runner-workdir-remediate honey-am-2 scheduling-bridge --apply
Use --mode unlock --apply when you need to restore owner write bits before inspection or manual follow-up. The default remove mode restores owner write bits and deletes the repo workdir. Or let the repo-owned automation path execute the safe single-repo fixes:
just honey-runner-workdir-reconcile --apply --confirm-drained
- Restart the runner host root when bounded cleanup is finished:
just honey-runner-host-lifecycle honey-am-2 start
If the host root does not come back cleanly, replace that runner host instead of widening salvage.
- Rerun the blocked downstream job.
This is host hygiene, not a downstream repo patch path. If checkout dies before repo code runs, the platform should own remediation.
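The escalation boundary in the reconcile step (one contaminated repo workdir per host is safe, more than one escalates) can be sketched as a small classifier. This is an illustration of the stated rule only, not the reconcile implementation, which likely checks more; `classify_hosts` and its two-column input format are hypothetical.

```shell
# Classify hosts from "host<TAB>repo" lines, one per contaminated _work/<repo> tree.
# Exactly one contaminated repo on a host -> "safe"; more than one -> "escalate".
# Duplicate lines are collapsed first so a repeated finding is not double counted.
classify_hosts() {
  sort -u | awk -F '\t' '
    { count[$1]++ }
    END { for (h in count) printf "%s\t%s\n", h, (count[h] > 1 ? "escalate" : "safe") }
  ' | sort
}

# Illustrative input (second repo name is invented):
#   printf 'honey-am-1\tscheduling-bridge\nhoney-am-1\tother-repo\nhoney-am-2\tscheduling-bridge\n' \
#     | classify_hosts
```

Hosts with no contaminated workdir simply do not appear in the output.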
Contract note:
- this path only touches one _work/<repo> tree at a time
- just honey-runner-host-lifecycle is the bounded stop/start path for one runner host root when the launcher can be discovered
- if ownership drift remains after unlock, escalate to host-level recovery instead of widening the repo-local cleanup scope
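When inspecting one repo workdir by hand, directories that lack the owner write bit are one common reason an unlink fails with EACCES for the owning user. The sketch below only lists such directories and changes nothing; `list_unwritable_dirs` is a hypothetical helper, the path in the usage comment is an example, and it is not a description of how the unlock mode works internally.

```shell
# List directories under a repo workdir that lack the owner write bit.
# Read-only: nothing is modified.
list_unwritable_dirs() {
  find "$1" -type d ! -perm -200 -print | sort
}

# Usage on a runner host (path is an example):
#   list_unwritable_dirs /path/to/_work/scheduling-bridge
```

This check looks at mode bits only. It does not detect ownership drift, which the contract note above routes to host-level recovery.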
Heavy Nix Validation
Use this when a Rust-heavy or memory-heavy Nix workflow should prove the live
tinyland-nix-heavy lane rather than rely on theory.
- Confirm the live ARC contract:
just arc-runtime-audit
- Verify the heavy lane exists:
kubectl --context honey get autoscalingrunnersets -n arc-runners
- While a heavy job is running, re-run:
just arc-runtime-audit
Confirm any active tinyland-nix-heavy runner pod lands on the currently admitted ARC payload surface and does not depend on storage-biased bumble.
- Compare node pressure:
kubectl --context honey top nodes
- If the heavy job still fails, treat that as a lane-envelope or workload-fit problem, not proof that the whole cluster is out of memory.
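The placement check in the steps above can be approximated from a pod-to-node listing. This is a sketch under two assumptions that should be verified against the live cluster: that heavy-lane pod names start with tinyland-nix-heavy, and that bumble appears as a node name in .spec.nodeName. `heavy_pods_on_node` is a hypothetical helper.

```shell
# Print pods whose name starts with PREFIX and that are scheduled on NODE.
# Input lines: "pod<TAB>node". Empty output means no matching pod is on that node.
heavy_pods_on_node() {
  prefix="$1"; node="$2"
  awk -F '\t' -v p="$prefix" -v n="$node" \
    'index($1, p) == 1 && $2 == n { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl --context honey get pods -n arc-runners \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' \
#     | heavy_pods_on_node tinyland-nix-heavy bumble
```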
Rotating Runner Tokens
To rotate the GitLab runner registration token:
- Delete the Kubernetes Secret containing the current token:
kubectl delete secret runner-token-TYPE -n {org}-runners
- Re-apply to recreate the secret with a new token:
tofu apply
- Runner pods will pick up the new token on their next restart.
Adding a New Runner Type
- Add the new runner definition to organization.yaml with its configuration (base image, tags, resource limits, HPA settings).
- Create corresponding tfvars entries in the overlay for the new runner type.
- Apply:
tofu apply
- Verify the new runner appears in the GitLab group runner list.
Emergency Stop
To immediately stop all runners of a specific type:
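Option A: Scale the runner deployment to zero:
kubectl scale deployment runner-TYPE --replicas=0 -n {org}-runners
Option B: Delete the runner deployment:
kubectl delete deployment runner-TYPE -n {org}-runners
Note: Option B requires a tofu apply to recreate the deployment when
service is restored. Option A can be reversed by scaling the deployment back to
the desired minimum. kubectl scale cannot target an HPA object, which has no
scale subresource; while the deployment sits at zero replicas and the HPA
minimum is above zero, the HPA stops adjusting it until the replica count is
raised again.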
Log Collection
View logs for all pods of a specific runner type:
kubectl logs -n {org}-runners -l app=runner-TYPE
Follow logs in real time:
kubectl logs -n {org}-runners -l app=runner-TYPE --follow
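When a label selector is used, kubectl logs returns only the last 10 lines per pod unless --tail is set, and following many pods at once is limited by --max-log-requests. A small sketch of a fuller invocation follows; `runner_logs_cmd` is a hypothetical helper that only prints the command, and the org name and time window are examples.

```shell
# Print a kubectl logs command for one runner type.
# Arguments: ORG TYPE [SINCE]; ORG stands in for the {org} placeholder.
# --tail=-1 lifts the 10-line default that applies when a selector is used;
# --prefix labels each line with its source pod and container.
runner_logs_cmd() {
  org="$1"; rtype="$2"; since="${3:-1h}"
  printf 'kubectl logs -n %s-runners -l app=runner-%s --all-containers --prefix --tail=-1 --since=%s\n' \
    "$org" "$rtype" "$since"
}

# Example: print the command for the docker runners over the last 30 minutes.
#   runner_logs_cmd myorg docker 30m
```

Printing the command instead of running it lets the operator review the namespace and selector before pulling logs.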
Health Check
From the overlay repository, run the health check target:
just runners-health
This verifies that all runner types have at least one healthy pod and that the runners are registered with GitLab.
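The "at least one healthy pod per runner type" half of that check can be approximated by hand when the target is unavailable. This is not the implementation of just runners-health and does not cover GitLab registration; `types_without_running` is a hypothetical helper, "healthy" is simplified to pod phase Running, and the expected type names are passed in by the caller.

```shell
# Report expected runner types that have no Running pod.
# Input lines: "type<TAB>phase", one per pod; expected types are the arguments.
# Output: one missing type per line, in argument order; empty output means all present.
types_without_running() {
  awk -F '\t' -v want="$*" '
    $2 == "Running" { ok[$1] = 1 }
    END {
      n = split(want, t, " ")
      for (i = 1; i <= n; i++) if (!(t[i] in ok)) print t[i]
    }
  '
}

# Usage against a live cluster (not run here; type names are examples):
#   kubectl get pods -n "${ORG}-runners" \
#     -o jsonpath='{range .items[*]}{.metadata.labels.app}{"\t"}{.status.phase}{"\n"}{end}' \
#     | types_without_running runner-docker runner-nix
```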
Manual Status Check
To inspect the full state of the runner namespace:
kubectl get pods,hpa,deployments -n {org}-runners
For a specific runner type:
kubectl get pods -n {org}-runners -l app=runner-docker
kubectl describe hpa runner-docker -n {org}-runners
Related
- HPA Tuning — autoscaler configuration details
- Troubleshooting — diagnosing common issues
- Security Model — access controls and secrets
- Resource Limits — per-runner pod envelope reference