Troubleshooting

Common issues with the runner infrastructure and how to resolve them.

Runner Not Registering

Symptom: Runner pod starts but does not appear in the GitLab group runner list.

Causes and fixes:

  • Invalid runner token: Verify the Kubernetes Secret containing the registration token exists and is current. Delete the secret and run tofu apply to recreate it.
  • Group access: Confirm that the service account or user associated with the token has access to the target GitLab group.
  • Token auto-creation: Automatic runner token creation requires the Owner role on the GitLab group. If the deploying user does not have Owner, token creation will fail silently. Verify role assignment in GitLab group settings.

ARC Listener Fails With GitHub API Rate Limit

Symptom: The listener pod restarts or never establishes a broker session, and the logs contain a line like:

failed to get runner registration token (403 Forbidden): API rate limit exceeded for user ID ...

Read: This is an auth-contract problem, not a cluster-capacity problem. The affected scale set authenticates with a PAT-backed secret, and that user’s GitHub REST core rate-limit budget is exhausted. ARC cannot mint a runner registration token, so repo-scoped lanes stay offline even if the honey cluster and ARC controller are otherwise healthy.

Fix:

  1. Confirm the failure on the listener:
    kubectl -n arc-systems logs <listener-pod>
  2. Check whether the shared PAT is exhausted:
    gh api rate_limit
  3. Replace the PAT-backed secret with a dedicated GitHub App installation secret for the target org or repository, created in both arc-systems and arc-runners.
  4. Update the runner set to use that new secret and re-apply the ARC stack.
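
The rate-limit check in step 2 can be narrowed to the core bucket that this failure depends on. The following is a minimal sketch, assuming an authenticated gh CLI; core_budget_status is a hypothetical helper name and is not part of this repo. It only classifies three integers, so it can be tried without network access.

```shell
# Classify a PAT's REST core budget from three integers.
# Usage: core_budget_status REMAINING RESET_EPOCH NOW_EPOCH
core_budget_status() {
  remaining=$1
  reset=$2
  now=$3
  if [ "$remaining" -gt 0 ]; then
    echo "ok: $remaining core requests left"
  else
    secs=$(( reset - now ))
    # A reset time in the past means the budget should already be refilled.
    if [ "$secs" -lt 0 ]; then secs=0; fi
    echo "exhausted: resets in ${secs}s"
  fi
}

# Live usage (assumes gh is authenticated as the PAT's user):
#   core_budget_status \
#     "$(gh api rate_limit --jq '.resources.core.remaining')" \
#     "$(gh api rate_limit --jq '.resources.core.reset')" \
#     "$(date +%s)"
```

An "exhausted" result supports the read above. It does not change the durable fix, which is still the GitHub App installation secret.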

Temporary bridge only:

  • waiting for the PAT rate-limit reset can bring the listener back
  • reducing PAT traffic can delay the next outage
  • neither is a durable fix for an active repo-scoped runner lane

ARC Listener Fails With GHCR 403 On Cold Pull

Symptom: A listener lands on a newly used node and enters ImagePullBackOff. kubectl describe pod shows a GHCR 403 Forbidden while pulling ghcr.io/actions/gha-runner-scale-set-controller:....

Read: This is a controller-image auth drift problem, not a runner-label or cluster-capacity problem. The runner namespace has a working ghcr-pull secret, but the listener/controller namespace copy in arc-systems is stale, missing, or still carrying public-images-only credentials.

Fix:

  1. Confirm the listener pull failure:
    kubectl -n arc-systems describe pod <listener-pod>
  2. Check whether the controller copy matches the runner copy:
    just arc-ghcr-pull-secret-check
  3. Sync the controller copy from the authoritative runner copy:
    just arc-ghcr-pull-secret-sync
  4. Recreate the affected listener pod or re-run the downstream workflow so it can cold-pull with the corrected secret.
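
When the just recipes are not available, the comparison in step 2 can be approximated by hand. This is a sketch under assumptions, not the implementation of just arc-ghcr-pull-secret-check: it assumes the secret is named ghcr-pull in both namespaces and is of type kubernetes.io/dockerconfigjson. pull_secret_drift is a hypothetical helper name.

```shell
# Compare two base64 secret payloads without printing them.
# Usage: pull_secret_drift RUNNER_PAYLOAD CONTROLLER_PAYLOAD
pull_secret_drift() {
  if [ -z "$2" ]; then
    echo "missing: controller copy is absent or empty"
  elif [ "$1" = "$2" ]; then
    echo "in-sync"
  else
    echo "drift: controller copy differs from runner copy"
  fi
}

# Live usage (secret name and type are assumptions):
#   runner=$(kubectl -n arc-runners get secret ghcr-pull \
#     -o jsonpath='{.data.\.dockerconfigjson}')
#   controller=$(kubectl -n arc-systems get secret ghcr-pull \
#     -o jsonpath='{.data.\.dockerconfigjson}' 2>/dev/null)
#   pull_secret_drift "$runner" "$controller"
```

An "in-sync" result only shows that the two copies match. It does not prove that the shared credentials are still valid for GHCR.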

ARC Jobs Queue During A Shared-Label Burst

Symptom: ARC listener pods are healthy, but GitHub jobs stay queued, or new GPU, KVM, Nix, or DinD runner pods stay Pending during burst periods even though aggregate cluster CPU/RAM looks available.

Read: Start by classifying the burst topology, not by raising caps. ARC maxRunners is per scale set; a workflow-facing shared label can span multiple owner overlays, Honey-bound baselines, and Sting overflow lanes. The real limiter can be Honey pod slots, namespace quota, kubelet root/imagefs headroom, or missing fast-local scratch PVCs even when CPU and memory are not exhausted.

Fix:

  1. Run the combined read-only burst audit:
    just arc-burst-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix \
      --include-label tinyland-nix-heavy
  2. Keep the narrower diagnostics when you need source evidence for a single failure class:
    just arc-shared-label-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix \
      --include-label tinyland-nix-heavy
    just kubelet-imagefs-capacity-audit
    kubectl --context honey get resourcequota,limitrange -n arc-runners
  3. If the burst audit shows Honey pod-slot pressure, use its node-consumer and active-runner-job sections to classify whether the slots are stale residue, completed runner pods, or live jobs from a specific repo/workflow. Do not raise a downstream repo cap as the first response.
  4. Read the Shared Queue Fairness section separately from the capacity sections. If it reports shared-label-fairness-contention, the runner label is being occupied by multiple repositories and GitHub/ARC is not applying a repo priority policy. That is a fairness/admission question, not proof that the cluster is out of CPU, memory, or scratch storage.
  5. Read the Shared Label Queue Pressure section when a scarce lane reports pending runners. queued-behind-active-runner-capacity means the declared label capacity is currently occupied by active work, and the section names the holder repositories. scheduler-resource-pressure means Kubernetes is rejecting runner pods for reasons such as Insufficient ephemeral-storage. Treat that as placement/resource evidence before raising ARC maxRunners.
  6. Read the JIT Runner Assignment Traps section before deleting any EphemeralRunner. offline-no-job-cleanup-candidate can be cleaned up only after the GitHub runner is verified offline/not busy. assigned-job-at-risk has a real GitHub job attached; do not delete it unless that job is cancelled or explicitly declared stale. idle-no-job-runner is ready capacity and should be left alone unless arc-runtime-audit also shows cancelled-job handoff evidence for that exact pod and a replacement job is still queued behind a max-1 control-plane lane.
  7. If the audit shows an active tinyland-dind-compute-expansion runner without local-path-sting-fast-ephemeral work and Docker graph PVCs, treat that as a platform scratch-storage regression before rerunning the workload.
  8. If a capacity change is still justified, make it in source, run just runner-scale-contract-check and just runner-capacity-model-check, then use the managed ARC deploy path and rerun the burst audit.

Current bounded contract:

  • bumble is storage-biased OpenEBS/ZFS infrastructure, not default ARC scheduling authority
  • sting remains explicit compute-expansion capacity and needs both placement and toleration before it can carry ARC payloads
  • honey still carries the baseline Nix/DinD payload lanes; pod-slot pressure there is real queue debt even when the cluster has free CPU/RAM elsewhere
  • tinyland-dind currently has a source-owned envelope of 20 Honey slots plus 16 Sting fast-local overflow slots; that is not a global cross-owner label cap or a repository-priority policy
  • tinyland-nix-heavy is intentionally scarce today; if it is occupied by a long platform or consumer proof, the correct first read is queue-holder and scheduler evidence, not a repo-specific runner label

Pods Crashing (OOMKilled)

Symptom: Runner pods restart repeatedly. kubectl describe pod shows OOMKilled as the termination reason.

Fix: First capture the affected runner lane, job, pod, and cgroup evidence. For ARC lanes, start with just arc-burst-capacity-audit --include-label <workflow-label> and then update the owned tofu/stacks/arc-runners/*.tfvars envelope through the managed ARC workflow. Do not patch a live pod or route the job to GitHub-hosted runners as the fix.

Important read:

  • an OOM on a runner pod does not automatically mean the whole honey cluster is out of memory
  • the current ARC runners still run inside per-pod cgroup limits
  • for example, the committed baseline tinyland-nix lane is still an 8Gi memory-limit runner
  • Rust-heavy clippy or rustc workloads can still hit that limit even if the cluster aggregate has far more RAM available
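
To compare a lane’s per-pod limit with byte counters such as the cgroup memory files, the Kubernetes quantity first has to be converted to bytes. A minimal sketch follows; it handles only integer quantities with a Ki, Mi, or Gi suffix (or plain bytes), and k8s_mem_to_bytes is a hypothetical helper name.

```shell
# Convert an integer Kubernetes memory quantity to bytes.
# Usage: k8s_mem_to_bytes 8Gi
k8s_mem_to_bytes() {
  num=${1%[KMG]i}   # strip a trailing Ki/Mi/Gi, if present
  case "$num" in
    ""|*[!0-9]*)
      echo "unsupported quantity: $1" >&2
      return 1
      ;;
  esac
  case "$1" in
    *Ki) echo $(( num * 1024 )) ;;
    *Mi) echo $(( num * 1024 * 1024 )) ;;
    *Gi) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)   echo "$num" ;;
  esac
}
```

For the 8Gi baseline lane mentioned above, k8s_mem_to_bytes 8Gi prints 8589934592, which is the number to compare against a single pod’s usage, regardless of how much RAM the cluster has in aggregate.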

Common memory-hungry workloads:

  • dind: Container builds with large build contexts.
  • nix: Derivations that compile large packages from source.
  • Rust-heavy lint/build steps: clippy and parallel rustc processes can spike memory within one runner pod.

Recommended current path:

  • keep general Nix work on tinyland-nix
  • move recurring heavy Rust/Nix jobs to tinyland-nix-heavy
  • use just arc-runtime-audit to confirm the live heavy lane and placement
  • use just arc-burst-capacity-audit --include-label tinyland-nix to surface terminal ARC runner pods and their assigned GitHub jobs when the failure is on the baseline or compute-expansion Nix lane
  • do not infer one runner pod’s memory budget from total cluster RAM

Platform Proof Nix Build Dies With Signal 9

Symptom: The Platform Proof Prove tinyland-nix contract or Prove tinyland-nix-heavy contract job fails while building .#runner-dashboard, and the Nix log includes signal 9, Killed, or builder failed due to signal 9.

Read: Treat this as transient runner memory evidence until the cgroup diagnostics say otherwise. Both Nix platform-proof lanes use scripts/platform-proof-nix-runner-dashboard.sh, which prints host memory, filesystem state, and the runner cgroup files memory.events and memory.peak. Those two cgroup files are required evidence for this failure class.

Fix:

  1. Read the failed attempt log path printed by the helper.
  2. Check memory.events for oom or oom_kill increments, then compare memory.peak against memory.max.
  3. Use the helper’s single automatic retry as the contract result when the second attempt passes on the same commit.
  4. If the retry also fails, keep both attempt logs and the cgroup diagnostics together. Do not treat total cluster memory as proof that the runner pod had enough memory.
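
The comparison in step 2 can be sketched as a small function. It assumes the cgroup v2 file formats (memory.events as "key value" lines, memory.peak and memory.max as single values, with memory.max possibly the literal string max) and a directory that holds copies of those three files. cgroup_oom_summary is a hypothetical helper name, not part of scripts/platform-proof-nix-runner-dashboard.sh.

```shell
# Summarize OOM evidence from saved cgroup v2 memory files.
# Usage: cgroup_oom_summary DIR   (DIR holds memory.events, memory.peak, memory.max)
cgroup_oom_summary() {
  dir=$1
  # Exact key match, so oom_group_kill is not counted here.
  oom_kill=$(awk '$1 == "oom_kill" { print $2 }' "$dir/memory.events")
  oom_kill=${oom_kill:-0}
  peak=$(cat "$dir/memory.peak")
  max=$(cat "$dir/memory.max")
  echo "oom_kill=$oom_kill peak=$peak max=$max"
  if [ "$oom_kill" -gt 0 ]; then
    echo "verdict: cgroup OOM kill recorded"
  elif [ "$max" != "max" ] && [ "$peak" -ge "$max" ]; then
    echo "verdict: peak reached the limit without a recorded kill"
  else
    echo "verdict: no cgroup OOM evidence"
  fi
}
```

Keep the raw files alongside the verdict. The function is a reading aid and does not replace the attempt logs and diagnostics that step 4 asks you to retain.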

Docker Runner Hits ENFILE

Symptom: A tinyland-docker job fails during dependency installation with ENFILE, file table overflow, or Error: spawn ps ENFILE. GitHub may also show a proof run stuck after the job’s contract steps completed if the runner listener aborts during final upload or cleanup.

Read: Treat this as runner-host capacity or cleanup evidence, not as a normal application test failure. It is also separate from RustFS bucket-index debt, Attic publication, BCR, and RBE/CAS authority.

The repo-owned Platform Proof Docker lane intentionally bounds pnpm install fanout with --child-concurrency=2 --network-concurrency=4. That keeps the proof focused on the shared Docker runner contract instead of accidentally turning the proof into a host file-table stress test.

Fix:

  1. Capture the failing job runner and job id:
    gh api repos/<owner>/<repo>/actions/jobs/<job_id>
  2. Check recent ARC runner events and finalizer state:
    kubectl --context honey -n arc-runners get events --sort-by=.lastTimestamp \
      | rg '<runner-name>|ENFILE|OOM|Failed|Killing'
    kubectl --context honey -n arc-runners get ephemeralrunners,ephemeralrunnersets \
      | rg '<runner-scale-set>|<runner-name>'
  3. Capture the node file-table and memory state before deleting evidence:
    kubectl --context honey debug node/sting -it --image=busybox:1.36 -- \
      chroot /host sh -lc 'cat /proc/sys/fs/file-nr; cat /proc/sys/fs/file-max; cat /proc/sys/fs/nr_open; free -m'
  4. If the proof itself failed but a rerun passes on the same commit, classify the failed run as runner-host capacity evidence and keep the clean rerun as the contract result.
  5. If ordinary downstream package installs repeatedly hit this failure class, either lower package-manager concurrency in that workflow or move the job to a more appropriate lane. Do not raise tinyland-docker capacity blindly.
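
The file-nr line captured in step 3 has three fields: allocated file handles, allocated-but-unused handles, and the system maximum. A minimal sketch for turning that line into a utilization figure follows; file_table_usage is a hypothetical helper name.

```shell
# Report file-table utilization from one /proc/sys/fs/file-nr line.
# Usage: file_table_usage "ALLOCATED UNUSED MAX"
file_table_usage() {
  # Word-split the line (spaces or tabs) into positional parameters.
  set -- $1
  allocated=$1
  max=$3
  echo "allocated=$allocated max=$max used=$(( allocated * 100 / max ))%"
}

# Usage on a host shell:
#   file_table_usage "$(cat /proc/sys/fs/file-nr)"
```

A value close to 100% at the time of the failure supports the runner-host capacity read. A low value suggests looking at per-process limits or cleanup state instead.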

Runner Loses Communication Mid-Build

Symptom: GitHub Actions shows The self-hosted runner lost communication with the server, while the job log ends with error: interrupted by the user in the middle of a long build.

Read: This is often not a repo-local build failure. On honey, the first thing to check is whether Kubernetes has API/CNI continuity evidence, node pressure, or a runner-pod eviction near the failure.

Fix:

  1. Confirm the failed job’s runner name:
    gh api repos/<owner>/<repo>/actions/jobs/<job_id>
  2. Check recent ARC events on honey:
    kubectl --context honey -n arc-runners get events --sort-by=.lastTimestamp \
      | rg 'Evicted|DiskPressure|ephemeral-storage|<runner-name>'
  3. Classify Kubernetes continuity evidence for the runner:
    just arc-network-continuity-audit --runner-name <runner-name>
  4. Confirm the live runner-set envelope:
    just arc-runtime-audit
    If the log scan shows Job message not found / job was canceled on a max-1 operator lane, verify the queued replacement job still exists and the pod has no Runner.Worker process before deleting only that stale EphemeralRunner. If the listener is missing and no-job runners appear to be blocking listener recreation, rerun with:
    just arc-runtime-audit --fail-on-stale-idle-listener-blocker
    Treat any cleanup hint as gated evidence, not an automatic action. Capture GitHub runner API output, pass it with --github-runners-json, verify every candidate runner is busy=false, and confirm no pod has a Runner.Worker process before deleting the owning EphemeralRunnerSet.
    The managed ARC apply now automates the idle-leak shape pre-apply. Between quiesce scoping and the cap freeze, Deploy ARC Runners runs scripts/reap-idle-leaked-ephemeral-runners.sh against the affected scale sets. It deletes only no-job EphemeralRunner CRs whose owning EphemeralRunnerSet wants zero replicas or that are excess beyond desired and older than the minimum age, with a just-in-time per-CR job re-check before each delete.
    For a manual, evidence-first pass use:
    just arc-reap-zombies --scale-set <scale-set> --dry-run
    If the scale set is instead deadlocked mid listener rollover (TIN-2055 signature: AutoscalingRunnerSet phase Pending, no listener pod in arc-systems, controller log Waiting for the running and pending runners to finish), the blocker is an idle warm runner of the previous EphemeralRunnerSet generation sitting at current==desired. The default reap contract deliberately skips that shape.
    This deadlock class is produced by the controller’s old eventual update strategy. Since TIN-2056 the controller runs immediate (the upstream default), which recreates the listener and new generation at once, so a fresh occurrence of this signature now points at a transition that predates the flip or a controller restart that stranded one mid-drain. The recovery helpers are retained.
    Use the opt-in stale-generation mode, which also reaps jobless runners owned by a non-newest EphemeralRunnerSet generation (newest = latest creationTimestamp, falling back to differing actions.github.com/runner-spec-hash annotations) even at current==desired:
    just arc-reap-zombies-stale --scale-set <scale-set> --dry-run
    This shape is also automated twice over. Every successful managed apply settle-reaps it post-apply (after cap restore, before the listener-cap prove gate; a safe no-op when immediate already deleted the stale generation). The runner-zombie-reap CronJob backstop also sweeps every live scale set, owner overlays included, every 30 minutes with --min-age-seconds 3600.
    The managed apply additionally refuses to start while a rollover is already in flight (pre-apply wedge canary: single EphemeralRunnerSet per set, listeners Running), so wedged transitions never get a new transition stacked on top.
  5. Check kubelet root/imagefs capacity separately from durable storage:
    just kubelet-imagefs-capacity-audit
    just kubelet-imagefs-capacity-audit --node bumble
  6. If the node is overcommitted on disk, raise the Nix lane’s ephemeral-storage request and limit in the ARC stack instead of patching the downstream repo again.

Checkout Fails Before Repo Code Runs

Symptom: actions/checkout fails with EACCES, unlink errors, or stale workspace state under _work/, before the repository’s own build logic starts.

Read: This is usually a runner-host hygiene problem, not a downstream repo contract problem.

See Honey Runner Workdir Contract for the lifecycle boundary and escalation rules behind this failure class.

Fix:

  1. If you have the failing run URL or run id, start from the run itself:
    just honey-runner-checkout-triage \
      https://github.com/Jesssullivan/scheduling-bridge/actions/runs/24525417273
    Use --parse-only when you only want the run/log extraction or the current shell cannot reach the honey hosts over SSH.
  2. Audit the runner hosts directly when you need the raw host view:
    just honey-runner-workdir-audit
  3. Generate the bounded recovery plan:
    just honey-runner-workdir-reconcile
    If the host has more than one contaminated repo workdir, stop there and treat it as a replacement or wider manual-triage incident.
  4. If a specific host shows a safe single-repo recovery candidate, drain that runner host root first:
    just honey-runner-host-lifecycle honey-am-2 drain
  5. Preview the bounded remediation:
    just honey-runner-workdir-remediate honey-am-2 scheduling-bridge
  6. Apply the bounded remediation:
    just honey-runner-workdir-remediate honey-am-2 scheduling-bridge --apply
    Use --mode unlock --apply when you need to restore owner write bits before inspection or escalation. Or use:
    just honey-runner-workdir-reconcile --apply --confirm-drained
    when you want the repo-owned automation path to execute only the safe single-repo host recoveries.
  7. Restart the runner host root:
    just honey-runner-host-lifecycle honey-am-2 start
    If the host root cannot be restarted cleanly, replace it instead of widening salvage.
  8. Rerun the downstream job.

If checkout fails before repo code runs, post-checkout cleanup inside the downstream repo will not help.

Cache Misses on Nix Runner

Symptom: Nix builds download or compile everything from scratch despite previous builds having populated the cache.

Causes and fixes:

  • ATTIC_SERVER not set: Verify the environment variable is present in the runner pod. Check that the Kubernetes Secret for Attic credentials exists in the runner’s namespace (arc-runners or gitlab-runners, depending on the path you are debugging).
  • Attic service unreachable: Confirm the Attic cache service is running in the nix-cache namespace. Test connectivity from a runner pod with curl $ATTIC_SERVER.
  • Cache name mismatch: Verify ATTIC_CACHE matches the cache name used in attic push commands.
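
The first and third checks above can be run together from inside a runner pod. A minimal sketch follows; check_attic_env is a hypothetical helper name, and it only verifies that the two variables named above are non-empty, not that their values are correct.

```shell
# Report which Attic variables are missing from the current environment.
# Returns non-zero if any of them is unset or empty.
check_attic_env() {
  status=0
  [ -n "${ATTIC_SERVER:-}" ] || { echo "ATTIC_SERVER is not set"; status=1; }
  [ -n "${ATTIC_CACHE:-}" ] || { echo "ATTIC_CACHE is not set"; status=1; }
  return "$status"
}

# Usage inside a runner pod shell, followed by the connectivity test:
#   check_attic_env && curl "$ATTIC_SERVER"
```

If both variables are present and the service responds, the remaining suspect is the cache name: compare ATTIC_CACHE with the name used in the attic push commands.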

TOML Configuration Gotchas

The GitLab Runner TOML configuration has several pitfalls in Runner 17.x:

  • Resource limits must be flat keys: Values like cpu_limit, memory_limit, cpu_request, and memory_request must be specified as flat keys directly in the [runners.kubernetes] section. Do not nest them inside a further TOML sub-table.
  • pod_spec.containers type mismatch: Using pod_spec with a containers field causes a type mismatch error in Runner 17.x. Instead, use environment = [...] on the [[runners]] section to inject environment variables.

Runner Pods Pending

Symptom: Pods stay in Pending state and are not scheduled.

Causes and fixes:

  • Insufficient cluster resources: Check node capacity with kubectl describe nodes. The cluster may need more nodes or the runner resource requests may be too high.
  • HPA at maximum: If all replicas are running and jobs are still queuing, increase the HPA maximum. See HPA Tuning.
  • Image pull failure behind a Pending pod: A pod can still show Pending while the runner container is actually blocked in ImagePullBackOff. Check kubectl describe pod and the container waiting reason before treating the incident as a scheduler or capacity failure.

Live GloriousFlywheel example:

  • baseline tinyland-nix pods on honey were observed as Pending
  • the real blocker was ImagePullBackOff on ghcr.io/tinyland-inc/actions-runner-nix:latest
  • the concrete pull error was 401 Unauthorized

Meaning:

  • do not assume every Pending runner pod is a scheduler or capacity problem
  • verify image-pull auth before treating the incident as a memory or placement failure
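
The distinction above can be checked from the container waiting reason rather than the pod phase. The following is a minimal sketch; classify_waiting_reason is a hypothetical helper name, and the kubectl line in the comment assumes the pod lives in arc-runners.

```shell
# Map a container waiting reason to the first thing to investigate.
# Usage: classify_waiting_reason REASON   (REASON may be empty)
classify_waiting_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image-pull: verify registry auth before capacity" ;;
    "")
      echo "no waiting reason: check scheduler events and capacity" ;;
    *)
      echo "other: $1" ;;
  esac
}

# Live usage (the namespace is an assumption; substitute the pod name):
#   reason=$(kubectl -n arc-runners get pod <runner-pod> \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   classify_waiting_reason "$reason"
```

An empty reason usually means the pod has not been scheduled yet, which is the case where node capacity and scheduler events are the right next read.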

Heavy Nix Lane Fails With npm TLS Errors

Symptom: A tinyland-nix-heavy job starts, but the workload fails inside nix build with pnpm errors like UNABLE_TO_GET_ISSUER_CERT_LOCALLY against registry.npmjs.org.

Read: This means ARC placement is already working. The remaining problem is certificate trust inside the build path, not runner scheduling or GHCR pull auth.

Fix:

  1. Make sure the derivation or build wrapper exports a CA bundle into the Node and Nix fetch paths:
    • SSL_CERT_FILE
    • NIX_SSL_CERT_FILE
    • NODE_EXTRA_CA_CERTS
  2. Re-run the heavy canary after the derivation change lands.
  3. If the error persists, inspect the runner image and runtime trust store rather than changing runner placement.
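
For a build wrapper, step 1 can be sketched as a single function that points all three variables at one bundle. export_ca_bundle is a hypothetical helper name, and the bundle path in the usage comment is an assumption about the runner image. Inside a sandboxed derivation the same variables would need to be set in the derivation itself, because a wrapper’s environment may not reach the builder.

```shell
# Export one CA bundle path to the Nix and Node trust variables.
# Usage: export_ca_bundle /path/to/ca-bundle.crt
export_ca_bundle() {
  bundle=$1
  if [ ! -r "$bundle" ]; then
    echo "CA bundle not readable: $bundle" >&2
    return 1
  fi
  export SSL_CERT_FILE="$bundle"
  export NIX_SSL_CERT_FILE="$bundle"
  export NODE_EXTRA_CA_CERTS="$bundle"
}

# Usage (path is an assumption; use the bundle your image actually ships):
#   export_ca_bundle /etc/ssl/certs/ca-certificates.crt
```

Failing early on an unreadable bundle keeps a missing trust store from showing up later as UNABLE_TO_GET_ISSUER_CERT_LOCALLY in the middle of the build.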
