# Runbook

Operational procedures for managing the runner infrastructure.
Important scope note:

- The HPA-oriented procedures below primarily describe the legacy GitLab runner path.
- GitHub Actions ARC scale sets use `minRunners`/`maxRunners` and static per-runner resource envelopes. ARC can scale runner count horizontally, but it does not automatically enlarge one runner pod's memory or CPU limit.
- #213 is now primarily about honey ARC runtime integrity and runner-host hygiene, not generic GitLab runner tuning.
## Scaling Up

To increase the maximum number of replicas for a runner type:

- Edit the HPA `max` value for the target runner in `organization.yaml`.
- Apply the change:

  ```shell
  tofu apply
  ```

- Verify the new HPA configuration:

  ```shell
  kubectl get hpa -n {org}-runners
  ```

See HPA Tuning for details on stabilization windows and scaling behavior.
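For illustration, the `organization.yaml` edit might look roughly like this. The key layout below is an assumption about the file's schema; only the HPA `max` value should change:

```yaml
# organization.yaml (illustrative layout; match the real schema)
runners:
  docker:
    hpa:
      min: 2
      max: 10   # raised ceiling for the docker runner type
```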
## GitHub Actions ARC Scaling

For the ARC-backed GitHub Actions lanes (`tinyland-nix`, `tinyland-docker`, `tinyland-dind`):

- Edit the relevant `arc-runners` stack values:
  - `min_runners` / `max_runners` if you need more parallel capacity
  - `cpu_request` / `cpu_limit` / `memory_request` / `memory_limit` if one job needs a larger envelope
- Apply the stack changes:

  ```shell
  ENV=dev just tofu-plan arc-runners
  ENV=dev just tofu-apply arc-runners
  ```

- Verify the scale sets:

  ```shell
  kubectl get autoscalingrunnersets -n arc-runners
  kubectl get pods -n arc-runners -o wide
  just arc-runtime-audit
  ```

Do not assume that adding cluster nodes or having large aggregate free RAM will automatically raise the memory available to one existing ARC runner pod.
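As a concrete sketch, a stack values edit for one lane might look like this. The variable names mirror the list above, but the surrounding block structure is an assumption about the `arc-runners` stack layout:

```hcl
# arc-runners stack values (illustrative; adjust to the real variable layout)
tinyland_nix = {
  min_runners    = 1
  max_runners    = 8       # raise for more parallel capacity
  cpu_request    = "2"
  cpu_limit      = "4"
  memory_request = "8Gi"
  memory_limit   = "16Gi"  # raise when one job needs a larger envelope
}
```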
## Honey Runner Workspace Hygiene

Use this when a downstream job dies inside `actions/checkout` before repository code runs, especially with `EACCES` unlink failures under persistent `_work/` paths.

See Honey Runner Workdir Contract for the lifecycle boundary, escalation rules, and replace-first read behind this recovery path.

- If you have the failing run URL or run id, let the repo parse that run and target the affected honey hosts first:

  ```shell
  just honey-runner-checkout-triage \
    https://github.com/Jesssullivan/acuity-middleware/actions/runs/24525417273
  ```

  Use `--parse-only` when the current shell should not touch the remote hosts yet or cannot reach them over SSH.

- Audit the affected runner hosts directly when you need the raw host view or are not starting from a known run:

  ```shell
  just honey-runner-workdir-audit
  ```

  Or target specific hosts:

  ```shell
  just honey-runner-workdir-audit honey-am-1 honey-am-2
  ```

- Generate the bounded recovery plan:

  ```shell
  just honey-runner-workdir-reconcile
  ```

  This will stop at escalation when more than one repo workdir is contaminated on the same host.

- If the reconcile output identifies a safe one-repo host, stop or drain the affected runner host or service first.
- Preview the bounded remediation explicitly when you want to inspect one repo target yourself:

  ```shell
  just honey-runner-workdir-remediate honey-am-2 acuity-middleware
  ```

- Apply one of the bounded remediation modes:

  ```shell
  just honey-runner-workdir-remediate honey-am-2 acuity-middleware --apply
  ```

  Use `--mode unlock --apply` when you need to restore owner write bits before inspection or manual follow-up. The default `remove` mode restores owner write bits and deletes the repo workdir.

  Or let the repo-owned automation path execute the safe single-repo fixes:

  ```shell
  just honey-runner-workdir-reconcile --apply --confirm-drained
  ```

- Restart or replace the affected runner.
- Rerun the blocked downstream job.

This is host hygiene, not a downstream repo patch path. If checkout dies before repo code runs, the platform should own remediation.

Contract note:

- This path only touches one `_work/<repo>` tree at a time.
- It does not stop or restart runner services for you.
- If ownership drift remains after `unlock`, escalate to host-level recovery instead of widening the repo-local cleanup scope.
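If you need to pull the run id out of a run URL by hand (for example, to cross-check what the triage recipe is targeting), that parsing can be approximated in plain shell. The helper below is hypothetical — `extract_run_id` is not a real recipe — and only illustrates the URL shape shown above:

```shell
# extract_run_id: pull the numeric run id out of a GitHub Actions run URL.
# Hypothetical helper; `just honey-runner-checkout-triage` may parse differently.
extract_run_id() {
  local url="$1"
  # strip everything up to and including "/actions/runs/"
  local id="${url##*/actions/runs/}"
  # drop any trailing path or query segments (e.g. "/job/456")
  id="${id%%[/?]*}"
  printf '%s\n' "$id"
}

extract_run_id "https://github.com/Jesssullivan/acuity-middleware/actions/runs/24525417273"
# prints 24525417273
```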
## Heavy Nix Validation on sting

Use this when a Rust-heavy or memory-heavy Nix workflow should prove the live `tinyland-nix-heavy` lane rather than rely on theory.

- Confirm the live ARC contract:

  ```shell
  just arc-runtime-audit
  ```

- Verify the heavy lane exists:

  ```shell
  kubectl --context honey get autoscalingrunnersets -n arc-runners
  ```

- While a heavy job is running, re-run:

  ```shell
  just arc-runtime-audit
  ```

  Confirm any active `tinyland-nix-heavy` runner pod lands on `sting`.

- Compare node pressure:

  ```shell
  kubectl --context honey top nodes
  ```

- If the heavy job still fails, treat that as a lane-envelope or workload-fit problem, not proof that the whole cluster is out of memory.
## Rotating Runner Tokens

To rotate the GitLab runner registration token:

- Delete the Kubernetes Secret containing the current token:

  ```shell
  kubectl delete secret runner-token-TYPE -n {org}-runners
  ```

- Re-apply to recreate the secret with a new token:

  ```shell
  tofu apply
  ```

- Runner pods will pick up the new token on their next restart.
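For reference when verifying the rotation, the recreated Secret should look roughly like this. The data key name `registration-token` is an assumption; check the module that renders the Secret:

```yaml
# Expected shape of the recreated Secret (illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: runner-token-docker   # runner-token-TYPE
  namespace: acme-runners     # {org}-runners
type: Opaque
data:
  registration-token: <base64-encoded token>   # assumed key name
```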
## Adding a New Runner Type

- Add the new runner definition to `organization.yaml` with its configuration (base image, tags, resource limits, HPA settings).
- Create corresponding `tfvars` entries in the overlay for the new runner type.
- Apply:

  ```shell
  tofu apply
  ```

- Verify the new runner appears in the GitLab group runner list.
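A new runner definition might look roughly like this. The runner name, field names, and values below are all hypothetical; match the schema of the existing entries in `organization.yaml`:

```yaml
# organization.yaml — hypothetical new runner type (illustrative schema)
runners:
  rust:
    image: rust:1.79          # base image
    tags: [rust, cargo]
    resources:
      requests: { cpu: "1", memory: 2Gi }
      limits:   { cpu: "2", memory: 4Gi }
    hpa:
      min: 1
      max: 5
```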
## Emergency Stop

To immediately stop all runners of a specific type:

Option A — Delete the HPA, then scale the deployment to zero (`kubectl scale` cannot target an HPA object, and a live HPA would scale the deployment back up):

```shell
kubectl delete hpa runner-TYPE -n {org}-runners
kubectl scale deployment runner-TYPE --replicas=0 -n {org}-runners
```

Option B — Delete the runner deployment:

```shell
kubectl delete deployment runner-TYPE -n {org}-runners
```

Note: Both options require a `tofu apply` when service is restored: Option B to recreate the deployment, Option A to recreate the HPA and return the deployment to its desired minimum.
## Log Collection

View logs for all pods of a specific runner type:

```shell
kubectl logs -n {org}-runners -l app=runner-TYPE
```

Follow logs in real time:

```shell
kubectl logs -n {org}-runners -l app=runner-TYPE --follow
```
## Health Check

From the overlay repository, run the health check target:

```shell
just runners-health
```

This verifies that all runner types have at least one healthy pod and that the runners are registered with GitLab.
## Manual Status Check

To inspect the full state of the runner namespace:

```shell
kubectl get pods,hpa,deployments -n {org}-runners
```

For a specific runner type:

```shell
kubectl get pods -n {org}-runners -l app=runner-docker
kubectl describe hpa runner-docker -n {org}-runners
```
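When `just runners-health` is unavailable, the "at least one healthy pod" part of the check can be approximated by hand. The helper below is a hypothetical sketch that reads "pod status" pairs on stdin, which you would produce from `kubectl get pods --no-headers`:

```shell
# Hypothetical per-type health check: succeed if at least one pod is Running.
any_running() {
  awk '$2 == "Running" { found = 1 } END { exit !found }'
}

# Example over captured output; against the cluster you would pipe in:
#   kubectl get pods -n {org}-runners -l app=runner-docker --no-headers \
#     | awk '{ print $1, $3 }'
if printf 'runner-docker-abc Running\nrunner-docker-def Pending\n' | any_running; then
  echo "runner-docker: healthy"
else
  echo "runner-docker: no running pods"
fi
# prints: runner-docker: healthy
```

Note this covers pod health only; GitLab-side registration still needs checking in the group runner list.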
## Related
- HPA Tuning — autoscaler configuration details
- Troubleshooting — diagnosing common issues
- Security Model — access controls and secrets
- Resource Limits — per-runner pod envelope reference