Runbook

Operational procedures for managing the runner infrastructure.

For planned k8s/OpenTofu hardening rollouts, start with the Live Runner Rollout Checklist. The procedures below are narrower recovery and tuning notes, and several GitLab-oriented sections describe compatibility behavior rather than the primary ARC path.

Important scope note:

  • the HPA-oriented procedures below primarily describe the legacy GitLab runner path
  • GitHub Actions ARC scale sets use minRunners / maxRunners and static per-runner resource envelopes
  • ARC can scale runner count horizontally, but it does not automatically enlarge one runner pod’s memory or CPU limit
  • #213 is now primarily about honey ARC runtime integrity and runner-host hygiene, not generic GitLab runner tuning

Scaling Up

To increase the maximum number of replicas for a runner type:

  1. Edit the HPA max value for the target runner in organization.yaml.
  2. Apply the change:
    tofu apply
  3. Verify the new HPA configuration:
    kubectl get hpa -n {org}-runners

See HPA Tuning for details on stabilization windows and scaling behavior.
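
To confirm that the new ceiling actually landed after the apply, a check along these lines may help. It is a minimal sketch: hpa_max_ok and live_hpa_max are hypothetical helper names, and runner-docker / myorg-runners in the example are placeholders for your runner type and {org}-runners namespace.

```shell
#!/bin/sh
# Compare an HPA's live maxReplicas with the value you set in organization.yaml.
# Both helpers are illustrative and not part of the repo tooling.

hpa_max_ok() {
  # $1 = expected maxReplicas, $2 = observed maxReplicas
  if [ "$1" -eq "$2" ] 2>/dev/null; then
    echo "ok: maxReplicas=$2"
  else
    echo "mismatch: expected $1, observed '$2'"
    return 1
  fi
}

live_hpa_max() {
  # $1 = HPA name, $2 = namespace
  # .spec.maxReplicas is a standard HorizontalPodAutoscaler field.
  kubectl get hpa "$1" -n "$2" -o jsonpath='{.spec.maxReplicas}'
}

# Example (placeholder names):
#   hpa_max_ok 8 "$(live_hpa_max runner-docker myorg-runners)"
```

An empty or non-numeric observed value is reported as a mismatch, which covers a missing HPA as well as a stale one.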

GitHub Actions ARC Scaling

For the ARC-backed GitHub Actions lanes (tinyland-nix, tinyland-docker, tinyland-dind):

  1. classify the current bottleneck before editing source:
    just arc-burst-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix
    just arc-shared-label-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix
    just kubelet-imagefs-capacity-audit
    kubectl --context honey get resourcequota,limitrange -n arc-runners
    Read the burst audit’s node-consumer and active-runner-job sections before changing caps. Honey pressure caused by one repo’s live jobs is a different operator problem from stale completed pods, missing quota, or absent Sting PVC-backed overflow.
  2. only after the audit identifies a source-owned capacity issue, edit the relevant arc-runners stack values for:
    • min_runners / max_runners if you need more parallel capacity
    • cpu_request / cpu_limit / memory_request / memory_limit if one job needs a larger envelope
    • Sting compute-expansion placement, tolerations, and PVC-backed scratch if the workload belongs on overflow capacity
  3. prove the source envelope before planning or applying:
    just runner-scale-contract-check
    just runner-capacity-model-check
  4. apply the stack changes through the managed path:
    ENV=dev just tofu-plan arc-runners
    ENV=dev just tofu-apply arc-runners
    The controller runs update_strategy=immediate (TIN-2056; upstream’s default): a runner-spec change recreates the listener and the new EphemeralRunnerSet generation at once, old-generation busy runners finish their jobs, and the only documented cost is a transient overprovision overlap.
    Honey’s pod ceiling is the binding capacity axis (all shared runner payloads pin to that node), so overlap pods from a busy-window respec may briefly sit Pending and can trip the alert_pending_threshold PrometheusRule (5 pending for 10 minutes). Treat that alert as expected signal during a respec, and prefer applying respecs in a quiet window so the quiesce/freeze lane makes the overlap a non-event.
    Historically the controller ran eventual, which removed the listener and refused to recreate anything until every running and pending runner finished, deadlocking saturated sets. The TIN-2055 signature is: AutoscalingRunnerSet phase Pending, no listener pod in arc-systems, and the controller log line "Waiting for the running and pending runners to finish".
    The managed Deploy ARC Runners apply adds these safeguards:
    • before any mutating step, it asserts a read-only pre-apply wedge canary (every AutoscalingRunnerSet has exactly one EphemeralRunnerSet generation and a Running listener)
    • between quiesce scoping and the cap freeze, it reaps idle leaked (no-job) EphemeralRunner CRs so zombies cannot stall the 20-minute drain
    • after the apply, its listener cap prove is settle-aware (scripts/arc-prove-listener-caps.sh): transient listener recreation settles instead of failing the gate, while persistent drift and unconverged recreation still go red
    The quiesce scope unions in any shared set whose live caps drifted from the tfvars source (warm-pool cron patches are invisible to the helm plan): the always() cap restore reverts that drift on every apply, and a cap revert recreates the listener under either update strategy.
    After caps are restored, every successful apply settle-reaps stale-generation zombies by running the reap helper with --stale-generation before the prove gate, so jobless leftovers converge instead of turning the prove red at 1500s. This step is retained as a backstop that degrades to a safe no-op now that the immediate strategy deletes stale generations itself. Between applies, the cluster-wide runner-zombie-reap CronJob backstop (every 30 minutes, --min-age-seconds 3600) sweeps the same shape across every live scale set, including owner overlays.
  5. verify the scale sets and rerun the burst audit:
    kubectl get autoscalingrunnersets -n arc-runners
    kubectl get pods -n arc-runners -o wide
    just arc-runtime-audit
    just arc-burst-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix

Do not assume that adding cluster nodes or having large aggregate free RAM will automatically raise the memory available to one existing ARC runner pod. Do not treat shared-label maxRunners as a global concurrency policy; it is a per-scale-set cap, and owner overlays can advertise the same workflow label.
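
Because maxRunners is a per-scale-set cap, the effective ceiling for a shared label is the sum over every scale set that advertises it. The sketch below adds up the caps for whatever list you feed it. It assumes the AutoscalingRunnerSet resource exposes .spec.maxRunners, and both helper names are hypothetical. Deciding which sets share a label is left to the shared-label audit above; this only does the arithmetic.

```shell
#!/bin/sh
# Sum maxRunners over "name maxRunners" lines read from stdin.
# Rows whose second column is not a number (for example <none>) are skipped.
sum_max_runners() {
  awk 'NF >= 2 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# List every AutoscalingRunnerSet in arc-runners as "name maxRunners".
list_scale_set_caps() {
  kubectl --context honey get autoscalingrunnersets -n arc-runners \
    -o custom-columns=NAME:.metadata.name,MAX:.spec.maxRunners --no-headers
}

# Example: total cap across all sets, or grep first to narrow to one label.
#   list_scale_set_caps | sum_max_runners
#   list_scale_set_caps | grep tinyland-dind | sum_max_runners
```

Comparing that total with the pod headroom on honey gives a quick sense of whether the combined caps could ever be scheduled at once.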

Honey Runner Workspace Hygiene

Use this when a downstream job dies inside actions/checkout before repository code runs, especially with EACCES unlink failures under persistent _work/ paths.

See Honey Runner Workdir Contract for the lifecycle boundary, escalation rules, and replace-first read behind this recovery path.

  1. If you have the failing run URL or run id, let the repo parse that run and target the affected honey hosts first:
    just honey-runner-checkout-triage \
      https://github.com/Jesssullivan/scheduling-bridge/actions/runs/24525417273
    Use --parse-only when the current shell should not touch the remote hosts yet or cannot reach them over SSH.
  2. Audit the affected runner hosts directly when you need the raw host view or are not starting from a known run:
    just honey-runner-workdir-audit
    Or target specific hosts:
    just honey-runner-workdir-audit honey-am-1 honey-am-2
  3. Generate the bounded recovery plan:
    just honey-runner-workdir-reconcile
    This will stop at escalation when more than one repo workdir is contaminated on the same host.
  4. If the reconcile output identifies a safe one-repo host, drain the affected runner host root first:
    just honey-runner-host-lifecycle honey-am-2 drain
  5. Preview the bounded remediation explicitly when you want to inspect one repo target yourself:
    just honey-runner-workdir-remediate honey-am-2 scheduling-bridge
  6. Apply one of the bounded remediation modes:
    just honey-runner-workdir-remediate honey-am-2 scheduling-bridge --apply
    Use --mode unlock --apply when you need to restore owner write bits before inspection or manual follow-up. The default remove mode restores owner write bits and deletes the repo workdir.
    Or let the repo-owned automation path execute the safe single-repo fixes:
    just honey-runner-workdir-reconcile --apply --confirm-drained
  7. Restart the runner host root when bounded cleanup is finished:
    just honey-runner-host-lifecycle honey-am-2 start
    If the host root does not come back cleanly, replace that runner host instead of widening salvage.
  8. Rerun the blocked downstream job.

This is host hygiene, not a downstream repo patch path. If checkout dies before repo code runs, the platform should own remediation.

Contract note:

  • this path only touches one _work/<repo> tree at a time
  • just honey-runner-host-lifecycle is the bounded stop/start path for one runner host root when the launcher can be discovered
  • if ownership drift remains after unlock, escalate to host-level recovery instead of widening the repo-local cleanup scope
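
For a raw look at why unlink fails, it can help to list the directories that have lost their owner write bit, since removing a file requires write permission on its parent directory. The following is a read-only sketch, not the repo's audit implementation; dirs_without_owner_write is a hypothetical helper and the path in the example is a placeholder.

```shell
#!/bin/sh
# List directories under a repo workdir that lack the owner write bit.
# Read-only: this prints paths and changes nothing.
dirs_without_owner_write() {
  # $1 = repo workdir to inspect (a _work/<repo> tree)
  find "$1" -type d ! -perm -200 -print
}

# Example (placeholder path):
#   dirs_without_owner_write /path/to/_work/scheduling-bridge
```

An empty result with checkout still failing points at ownership drift rather than mode bits, which is the case the contract note escalates to host-level recovery.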

Heavy Nix Validation

Use this when a Rust-heavy or memory-heavy Nix workflow should prove the live tinyland-nix-heavy lane rather than rely on theory.

  1. Confirm the live ARC contract:
    just arc-runtime-audit
  2. Verify the heavy lane exists:
    kubectl --context honey get autoscalingrunnersets -n arc-runners
  3. While a heavy job is running, re-run:
    just arc-runtime-audit
    Confirm any active tinyland-nix-heavy runner pod lands on the currently admitted ARC payload surface and does not depend on storage-biased bumble.
  4. Compare node pressure:
    kubectl --context honey top nodes
  5. If the heavy job still fails, treat that as a lane-envelope or workload-fit problem, not proof that the whole cluster is out of memory.
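
The placement check in step 3 can be narrowed to the heavy lane with a small filter. This sketch assumes that runner pod names begin with the scale set name (tinyland-nix-heavy) and that bumble is the node name to flag; both are assumptions to verify against your cluster, and the helper names are hypothetical.

```shell
#!/bin/sh
# Print heavy-lane pods scheduled on a given node.
# Input: "pod-name node-name" lines on stdin.
heavy_pods_on_node() {
  # $1 = node name to flag
  awk -v node="$1" \
    'index($1, "tinyland-nix-heavy") == 1 && $2 == node { print $1 }'
}

# Emit "pod-name node-name" for every pod in arc-runners.
list_runner_pod_nodes() {
  kubectl --context honey get pods -n arc-runners \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers
}

# Example: any output means a heavy runner landed on bumble.
#   list_runner_pod_nodes | heavy_pods_on_node bumble
```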

Rotating Runner Tokens

To rotate the GitLab runner registration token:

  1. Delete the Kubernetes Secret containing the current token:
    kubectl delete secret runner-token-TYPE -n {org}-runners
  2. Re-apply to recreate the secret with a new token:
    tofu apply
  3. Runner pods will pick up the new token on their next restart.
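
If waiting for the next natural restart is not acceptable, a rolling restart is one way to force it. This is a sketch under the assumption that each runner type is a Deployment named runner-TYPE, as in the Emergency Stop commands; restart_runner is a hypothetical helper, and DRY_RUN is a convenience added here for previewing.

```shell
#!/bin/sh
# Restart one runner deployment so its pods start again with the recreated Secret.
# Set DRY_RUN=1 to print the commands instead of running them.
restart_runner() {
  # $1 = runner type, $2 = namespace
  for sub in restart status; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "kubectl rollout $sub deployment/runner-$1 -n $2"
    else
      kubectl rollout "$sub" "deployment/runner-$1" -n "$2" || return 1
    fi
  done
}

# Example (placeholder names):
#   DRY_RUN=1 restart_runner docker myorg-runners
```

The status call blocks until the rollout finishes, so a non-zero exit is a signal to inspect the pods before assuming the new token is in use.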

Adding a New Runner Type

  1. Add the new runner definition to organization.yaml with its configuration (base image, tags, resource limits, HPA settings).
  2. Create corresponding tfvars entries in the overlay for the new runner type.
  3. Apply:
    tofu apply
  4. Verify the new runner appears in the GitLab group runner list.

Emergency Stop

To immediately stop all runners of a specific type:

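Option A (scale the deployment to zero):

kubectl scale deployment runner-TYPE --replicas=0 -n {org}-runners

Option B (delete the runner deployment):

kubectl delete deployment runner-TYPE -n {org}-runners

Note: kubectl scale targets the deployment because an HPA object has no scale subresource. While the deployment sits at zero replicas and the HPA minimum is above zero, the HPA stops adjusting it, so Option A holds until you scale the deployment back to the desired minimum. Option B requires a tofu apply to recreate the deployment when service is restored.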

Log Collection

View logs for all pods of a specific runner type:

kubectl logs -n {org}-runners -l app=runner-TYPE

Follow logs in real time:

kubectl logs -n {org}-runners -l app=runner-TYPE --follow
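
When a label selector is used, kubectl logs returns only the last 10 lines per pod by default, so incident capture usually needs explicit flags. The helper below is a hedged sketch: collect_runner_logs and the KUBECTL override are illustrative names, while the flags themselves are standard kubectl logs options.

```shell
#!/bin/sh
# Collect full recent logs for every pod of one runner type.
# KUBECTL can be overridden (for example with echo) to preview the call.
collect_runner_logs() {
  # $1 = runner type, $2 = namespace, $3 = time window (default 1h)
  ${KUBECTL:-kubectl} logs -n "$2" -l "app=runner-$1" \
    --all-containers --prefix --tail=-1 --since="${3:-1h}"
}

# Example (placeholder names), saving the output for a ticket:
#   collect_runner_logs docker myorg-runners 30m > runner-docker.log
```

With --follow and more than five matching pods, kubectl also needs a higher --max-log-requests value than its default of 5.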

Health Check

From the overlay repository, run the health check target:

just runners-health

This verifies that all runner types have at least one healthy pod and that the runners are registered with GitLab.

Manual Status Check

To inspect the full state of the runner namespace:

kubectl get pods,hpa,deployments -n {org}-runners

For a specific runner type:

kubectl get pods -n {org}-runners -l app=runner-docker
kubectl describe hpa runner-docker -n {org}-runners
