Runbook

Operational procedures for managing the runner infrastructure.

Important scope note:

  • the HPA-oriented procedures below primarily describe the legacy GitLab runner path
  • GitHub Actions ARC scale sets use minRunners / maxRunners and static per-runner resource envelopes
  • ARC can scale runner count horizontally, but it does not automatically enlarge one runner pod’s memory or CPU limit
  • #213 is now primarily about honey ARC runtime integrity and runner-host hygiene, not generic GitLab runner tuning

Scaling Up

To increase the maximum number of replicas for a runner type:

  1. Edit the HPA max value for the target runner in organization.yaml.
  2. Apply the change:
    tofu apply
  3. Verify the new HPA configuration:
    kubectl get hpa -n {org}-runners

See HPA Tuning for details on stabilization windows and scaling behavior.
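The verification in step 3 can be narrowed to just the configured bounds. A minimal sketch, assuming the standard HPA spec fields; the helper name is ours, and the namespace is passed in as a placeholder:

```shell
# Sketch: print each HPA's configured min/max replicas in a runner
# namespace. Pass the concrete {org}-runners namespace as the argument.
runner_hpa_bounds() {
  local ns=${1:?usage: runner_hpa_bounds <namespace>}
  kubectl get hpa -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" min="}{.spec.minReplicas}{" max="}{.spec.maxReplicas}{"\n"}{end}'
}
```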

GitHub Actions ARC Scaling

For the ARC-backed GitHub Actions lanes (tinyland-nix, tinyland-docker, tinyland-dind):

  1. edit the relevant arc-runners stack values for:
    • min_runners / max_runners if you need more parallel capacity
    • cpu_request / cpu_limit / memory_request / memory_limit if one job needs a larger envelope
  2. apply the stack changes:
    ENV=dev just tofu-plan arc-runners
    ENV=dev just tofu-apply arc-runners
  3. verify the scale sets:
    kubectl get autoscalingrunnersets -n arc-runners
    kubectl get pods -n arc-runners -o wide
    just arc-runtime-audit

Do not assume that adding cluster nodes or having large aggregate free RAM will automatically raise the memory available to one existing ARC runner pod.
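The static per-runner envelope can be read back from the live scale sets alongside the min/max bounds. A sketch; the field paths follow the AutoscalingRunnerSet CRD, and indexing the first container is an assumption about the pod template:

```shell
# Sketch: show each ARC scale set's min/max runner count and the
# memory limit of its first template container (assumed to be the
# runner container).
arc_scaleset_bounds() {
  kubectl get autoscalingrunnersets -n arc-runners \
    -o jsonpath='{range .items[*]}{.metadata.name}{" min="}{.spec.minRunners}{" max="}{.spec.maxRunners}{" mem-limit="}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}'
}
```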

Honey Runner Workspace Hygiene

Use this when a downstream job dies inside actions/checkout before repository code runs, especially with EACCES unlink failures under persistent _work/ paths.

See Honey Runner Workdir Contract for the lifecycle boundary, escalation rules, and replace-first read behind this recovery path.

  1. If you have the failing run URL or run id, let the repo parse that run and target the affected honey hosts first:
    just honey-runner-checkout-triage \
      https://github.com/Jesssullivan/acuity-middleware/actions/runs/24525417273
    Use --parse-only when the current shell should not touch the remote hosts yet or cannot reach them over SSH.
  2. Audit the affected runner hosts directly when you need the raw host view or are not starting from a known run:
    just honey-runner-workdir-audit
    Or target specific hosts:
    just honey-runner-workdir-audit honey-am-1 honey-am-2
  3. Generate the bounded recovery plan:
    just honey-runner-workdir-reconcile
    This stops and escalates when more than one repo workdir is contaminated on the same host.
  4. If the reconcile output identifies a safe one-repo host, stop or drain the affected runner host or service first.
  5. Preview the bounded remediation explicitly when you want to inspect one repo target yourself:
    just honey-runner-workdir-remediate honey-am-2 acuity-middleware
  6. Apply one of the bounded remediation modes:
    just honey-runner-workdir-remediate honey-am-2 acuity-middleware --apply
    The default remove mode restores owner write bits and then deletes the repo workdir; use --mode unlock --apply when you only need to restore owner write bits for inspection or manual follow-up. Alternatively, let the repo-owned automation path execute the safe single-repo fixes:
    just honey-runner-workdir-reconcile --apply --confirm-drained
  7. Restart or replace the affected runner.
  8. Rerun the blocked downstream job.

This is host hygiene, not a downstream repo patch path. If checkout dies before repo code runs, the platform should own remediation.
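What the audit steps above are looking for can be approximated locally. A minimal sketch, assuming the conventional _work layout; the default path is an assumption, not the tooling's actual implementation:

```shell
# Sketch: list entries under a runner workdir whose owner write bit is
# clear -- the precondition for the EACCES unlink failures that kill
# actions/checkout before repo code runs. Default path is an assumption.
find_locked() {
  local workdir=${1:-/home/runner/_work}
  find "$workdir" \( -type f -o -type d \) ! -perm -u+w -print
}
```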

Contract note:

  • this path only touches one _work/<repo> tree at a time
  • it does not stop or restart runner services for you
  • if ownership drift remains after unlock, escalate to host-level recovery instead of widening the repo-local cleanup scope
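The bounded semantics above can be pictured as two small operations over a single _work/<repo> tree. A hypothetical sketch of the two modes, not the tooling's actual implementation:

```shell
# Hypothetical sketch of the two bounded remediation modes. Both touch
# exactly one repo workdir; neither stops or restarts runner services.
workdir_unlock() {  # --mode unlock: restore owner write bits only
  chmod -R u+w "${1:?repo workdir}"
}
workdir_remove() {  # default remove: restore write bits, then delete
  chmod -R u+w "${1:?repo workdir}" && rm -rf "$1"
}
```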

Heavy Nix Validation On sting

Use this when a Rust-heavy or memory-heavy Nix workflow should prove the live tinyland-nix-heavy lane rather than rely on theory.

  1. Confirm the live ARC contract:
    just arc-runtime-audit
  2. Verify the heavy lane exists:
    kubectl --context honey get autoscalingrunnersets -n arc-runners
  3. While a heavy job is running, re-run:
    just arc-runtime-audit
    Confirm any active tinyland-nix-heavy runner pod lands on sting.
  4. Compare node pressure:
    kubectl --context honey top nodes
  5. If the heavy job still fails, treat that as a lane-envelope or workload-fit problem, not proof that the whole cluster is out of memory.
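The placement check in step 3 can be scripted. A sketch; the label key is an assumption about how ARC labels scale-set pods, so verify it against a live pod first:

```shell
# Sketch: list heavy-lane runner pods with the node each landed on, to
# confirm they schedule onto sting. Label key is an assumption.
heavy_lane_nodes() {
  kubectl --context honey get pods -n arc-runners \
    -l actions.github.com/scale-set-name=tinyland-nix-heavy \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
}
```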

Rotating Runner Tokens

To rotate the GitLab runner registration token:

  1. Delete the Kubernetes Secret containing the current token:
    kubectl delete secret runner-token-TYPE -n {org}-runners
  2. Re-apply to recreate the secret with a new token:
    tofu apply
  3. Runner pods will pick up the new token on their next restart.
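If waiting for the next natural restart is not acceptable, the pickup in step 3 can be forced with a rollout restart. A sketch; the runner type and namespace are placeholders, as elsewhere in this runbook:

```shell
# Sketch: restart runner pods so they re-read the recreated token
# secret, then wait for the rollout to settle.
rotate_restart() {
  local type=${1:?runner type} ns=${2:?namespace}
  kubectl rollout restart deployment "runner-$type" -n "$ns"
  kubectl rollout status deployment "runner-$type" -n "$ns" --timeout=120s
}
```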

Adding a New Runner Type

  1. Add the new runner definition to organization.yaml with its configuration (base image, tags, resource limits, HPA settings).
  2. Create corresponding tfvars entries in the overlay for the new runner type.
  3. Apply:
    tofu apply
  4. Verify the new runner appears in the GitLab group runner list.

Emergency Stop

To immediately stop all runners of a specific type:

Option A — Scale to zero:

kubectl delete hpa runner-TYPE -n {org}-runners
kubectl scale deployment runner-TYPE --replicas=0 -n {org}-runners

Option B — Delete the runner deployment:

kubectl delete deployment runner-TYPE -n {org}-runners

Note: kubectl scale cannot target an HPA, so Option A deletes the HPA first; otherwise the autoscaler would immediately scale the deployment back up. Option A can be reversed by scaling the deployment back up, though autoscaling stays off until a tofu apply recreates the HPA. Option B requires a tofu apply to recreate the deployment when service is restored.

Log Collection

View logs for all pods of a specific runner type:

kubectl logs -n {org}-runners -l app=runner-TYPE

Follow logs in real time:

kubectl logs -n {org}-runners -l app=runner-TYPE --follow
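For incident attachments, per-pod capture can be scripted with the same label selector. A sketch; the helper name and one-hour window are ours:

```shell
# Sketch: dump the last hour of logs from every pod of one runner type
# into per-pod files in the current directory.
collect_runner_logs() {
  local type=${1:?runner type} ns=${2:?namespace}
  local pod
  for pod in $(kubectl get pods -n "$ns" -l "app=runner-$type" \
      -o jsonpath='{.items[*].metadata.name}'); do
    kubectl logs -n "$ns" "$pod" --since=1h > "logs-$pod.txt"
  done
}
```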

Health Check

From the overlay repository, run the health check target:

just runners-health

This verifies that all runner types have at least one healthy pod and that the runners are registered with GitLab.

Manual Status Check

To inspect the full state of the runner namespace:

kubectl get pods,hpa,deployments -n {org}-runners

For a specific runner type:

kubectl get pods -n {org}-runners -l app=runner-docker
kubectl describe hpa runner-docker -n {org}-runners
