Troubleshooting
Common issues with the runner infrastructure and how to resolve them.
Runner Not Registering
Symptom: Runner pod starts but does not appear in the GitLab group runner list.
Causes and fixes:
- Invalid runner token: Verify the Kubernetes Secret containing the registration token exists and is current. If it is missing or stale, delete the secret and run tofu apply to recreate it.
- Group access: Confirm that the service account or user associated with the token has access to the target GitLab group.
- Token auto-creation: Automatic runner token creation requires the Owner role on the GitLab group. If the deploying user does not have Owner, token creation will fail silently. Verify role assignment in GitLab group settings.
ARC Listener Fails With GitHub API Rate Limit
Symptom: The listener pod restarts or never establishes a broker session, and the logs contain a line like:
failed to get runner registration token (403 Forbidden): API rate limit exceeded for user ID ...
Read: This is an auth-contract problem, not a cluster-capacity problem.
The affected scale set is authenticated with a PAT-backed secret, and GitHub
has exhausted that user’s REST core budget. ARC cannot mint a runner
registration token, so repo-scoped lanes stay offline even if the honey
cluster and ARC controller are otherwise healthy.
Fix:
- Confirm the failure on the listener:
    kubectl -n arc-systems logs <listener-pod>
- Check whether the shared PAT is exhausted:
    gh api rate_limit
- Replace the PAT-backed secret with a dedicated GitHub App installation secret for the target org or repository, created in both arc-systems and arc-runners.
- Update the runner set to use that new secret and re-apply the ARC stack.
Temporary bridge only:
- waiting for the PAT rate-limit reset can bring the listener back
- reducing PAT traffic can delay the next outage
- neither is a durable fix for an active repo-scoped runner lane
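
The rate-limit check above can be reduced to a one-line verdict. The sketch below is a minimal illustration, not part of the repo's just recipes: pat_budget_status is a hypothetical helper name, and the usage comment assumes the gh api rate_limit response carries resources.core.remaining and resources.core.reset (epoch seconds), which is how GitHub's REST rate-limit endpoint reports the core bucket.

```shell
# pat_budget_status REMAINING RESET_EPOCH [NOW_EPOCH]
# Prints "exhausted <seconds-until-reset>" or "ok <remaining>".
# Hypothetical helper for illustration only.
pat_budget_status() {
  remaining=$1
  reset_epoch=$2
  now=${3:-$(date +%s)}
  if [ "$remaining" -le 0 ]; then
    wait_s=$((reset_epoch - now))
    # A reset time in the past means the budget should already be refreshed.
    if [ "$wait_s" -lt 0 ]; then
      wait_s=0
    fi
    echo "exhausted $wait_s"
  else
    echo "ok $remaining"
  fi
}

# Usage (assumes an authenticated gh CLI using the same PAT as the scale set):
#   set -- $(gh api rate_limit --jq '.resources.core | "\(.remaining) \(.reset)"')
#   pat_budget_status "$1" "$2"
```

An exhausted result only tells you how long the temporary bridge will take; it does not change the durable fix, which is still the GitHub App installation secret.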
ARC Listener Fails With GHCR 403 On Cold Pull
Symptom: A listener lands on a newly used node and enters
ImagePullBackOff. kubectl describe pod shows a GHCR 403 Forbidden while
pulling ghcr.io/actions/gha-runner-scale-set-controller:....
Read: This is a controller-image auth drift problem, not a runner-label or
cluster-capacity problem. The runner namespace has a working ghcr-pull
secret, but the listener/controller namespace copy in arc-systems is stale,
missing, or still carrying public-images-only credentials.
Fix:
- Confirm the listener pull failure:
    kubectl -n arc-systems describe pod <listener-pod>
- Check whether the controller copy matches the runner copy:
    just arc-ghcr-pull-secret-check
- Sync the controller copy from the authoritative runner copy:
    just arc-ghcr-pull-secret-sync
- Recreate the affected listener pod or re-run the downstream workflow so it can cold-pull with the corrected secret.
ARC Jobs Queue During A Shared-Label Burst
Symptom: ARC listener pods are healthy, but GitHub jobs stay queued or new
GPU, KVM, Nix, or DinD runner pods stay Pending during burst periods, even
though aggregate cluster CPU/RAM looks available.
Read: Start by classifying the burst topology, not by raising caps. ARC
maxRunners is per scale set; a workflow-facing shared label can span multiple
owner overlays, Honey-bound baselines, and Sting overflow lanes. The real
limiter can be Honey pod slots, namespace quota, kubelet root/imagefs headroom,
or missing fast-local scratch PVCs even when CPU and memory are not exhausted.
Fix:
- Run the combined read-only burst audit:
    just arc-burst-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix \
      --include-label tinyland-nix-heavy
- Keep the narrower diagnostics when you need source evidence for a single failure class:
    just arc-shared-label-capacity-audit \
      --include-label tinyland-dind \
      --include-label tinyland-nix \
      --include-label tinyland-nix-heavy
    just kubelet-imagefs-capacity-audit
    kubectl --context honey get resourcequota,limitrange -n arc-runners
- If the burst audit shows Honey pod-slot pressure, use its node-consumer and active-runner-job sections to classify whether the slots are stale residue, completed runner pods, or live jobs from a specific repo/workflow. Do not raise a downstream repo cap as the first response.
- Read the Shared Queue Fairness section separately from the capacity sections. If it reports shared-label-fairness-contention, the runner label is being occupied by multiple repositories and GitHub/ARC is not applying a repo priority policy. That is a fairness/admission question, not proof that the cluster is out of CPU, memory, or scratch storage.
- Read the Shared Label Queue Pressure section when a scarce lane reports pending runners. queued-behind-active-runner-capacity means the declared label capacity is currently occupied by active work, and the section names the holder repositories. scheduler-resource-pressure means Kubernetes is rejecting runner pods for reasons such as Insufficient ephemeral-storage. Treat that as placement/resource evidence before raising ARC maxRunners.
- Read the JIT Runner Assignment Traps section before deleting any EphemeralRunner. offline-no-job-cleanup-candidate can be cleaned up only after the GitHub runner is verified offline/not busy. assigned-job-at-risk has a real GitHub job attached; do not delete it unless that job is cancelled or explicitly declared stale. idle-no-job-runner is ready capacity and should be left alone unless arc-runtime-audit also shows cancelled-job handoff evidence for that exact pod and a replacement job is still queued behind a max-1 control-plane lane.
- If the audit shows an active tinyland-dind-compute-expansion runner without local-path-sting-fast-ephemeral work and Docker graph PVCs, treat that as a platform scratch-storage regression before rerunning the workload.
- If a capacity change is still justified, make it in source, run just runner-scale-contract-check and just runner-capacity-model-check, then use the managed ARC deploy path and rerun the burst audit.
Current bounded contract:
- bumble is storage-biased OpenEBS/ZFS infrastructure, not default ARC scheduling authority
- sting remains explicit compute-expansion capacity and needs both placement and toleration before it can carry ARC payloads
- honey still carries the baseline Nix/DinD payload lanes; pod-slot pressure there is real queue debt even when the cluster has free CPU/RAM elsewhere
- tinyland-dind currently has a source-owned envelope of 20 Honey slots plus 16 Sting fast-local overflow slots; that is not a global cross-owner label cap or a repository-priority policy
- tinyland-nix-heavy is intentionally scarce today; if it is occupied by a long platform or consumer proof, the correct first read is queue-holder and scheduler evidence, not a repo-specific runner label
Pods Crashing (OOMKilled)
Symptom: Runner pods restart repeatedly. kubectl describe pod shows
OOMKilled as the termination reason.
Fix: First capture the affected runner lane, job, pod, and cgroup evidence.
For ARC lanes, start with just arc-burst-capacity-audit --include-label <workflow-label> and then update the owned tofu/stacks/arc-runners/*.tfvars
envelope through the managed ARC workflow. Do not patch a live pod or route the
job to GitHub-hosted runners as the fix.
Important read:
- an OOM on a runner pod does not automatically mean the whole honey cluster is out of memory
- the current ARC runners still run inside per-pod cgroup limits
- for example, the committed baseline tinyland-nix lane is still an 8Gi memory-limit runner
- Rust-heavy clippy or rustc workloads can still hit that limit even if the cluster aggregate has far more RAM available
Common memory-hungry workloads:
- dind: Container builds with large build contexts.
- nix: Derivations that compile large packages from source.
- Rust-heavy lint/build steps: clippy and parallel rustc processes can spike memory within one runner pod.
Recommended current path:
- keep general Nix work on tinyland-nix
- move recurring heavy Rust/Nix jobs to tinyland-nix-heavy
- use just arc-runtime-audit to confirm the live heavy lane and placement
- use just arc-burst-capacity-audit --include-label tinyland-nix to surface terminal ARC runner pods and their assigned GitHub jobs when the failure is on the baseline or compute-expansion Nix lane
- do not infer one runner pod’s memory budget from total cluster RAM
Platform Proof Nix Build Dies With Signal 9
Symptom: The Platform Proof Prove tinyland-nix contract or
Prove tinyland-nix-heavy contract job fails while building
.#runner-dashboard, and the Nix log includes signal 9, Killed, or
builder failed due to signal 9.
Read: Treat this as transient runner memory evidence until the cgroup
diagnostics say otherwise. Both Nix platform-proof lanes use
scripts/platform-proof-nix-runner-dashboard.sh, which prints host memory,
filesystem state, and the runner cgroup files memory.events and
memory.peak. Those two cgroup files are required evidence for this failure
class.
Fix:
- Read the failed attempt log path printed by the helper.
- Check memory.events for oom or oom_kill increments, then compare memory.peak against memory.max.
- Use the helper’s single automatic retry as the contract result when the second attempt passes on the same commit.
- If the retry also fails, keep both attempt logs and the cgroup diagnostics together. Do not treat total cluster memory as proof that the runner pod had enough memory.
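
The memory.events and memory.peak comparison above can be sketched as a small shell function. This is an illustration under stated assumptions, not the repo helper: cgroup_oom_verdict is a hypothetical name, and it assumes the three cgroup files have been saved in cgroup v2 format (memory.events as "key value" lines, memory.peak as a byte count, memory.max as a byte count or the literal max).

```shell
# cgroup_oom_verdict EVENTS_FILE PEAK_FILE MAX_FILE
# Summarises saved cgroup v2 memory evidence on one line.
# Hypothetical helper; file paths are whatever you saved from the job log.
cgroup_oom_verdict() {
  events_file=$1
  peak=$(cat "$2")
  max=$(cat "$3")
  # memory.events holds one "key value" pair per line; match keys exactly so
  # oom_group_kill is not mistaken for oom_kill.
  oom=$(awk '$1 == "oom" { print $2 }' "$events_file")
  oom_kill=$(awk '$1 == "oom_kill" { print $2 }' "$events_file")
  oom=${oom:-0}
  oom_kill=${oom_kill:-0}
  # memory.max is the literal "max" when the cgroup has no limit.
  if [ "$max" = "max" ]; then
    pct="unlimited"
  else
    pct="$((peak * 100 / max))%"
  fi
  if [ "$oom" -gt 0 ] || [ "$oom_kill" -gt 0 ]; then
    echo "oom-evidence oom=$oom oom_kill=$oom_kill peak=$pct"
  else
    echo "no-oom-events peak=$pct"
  fi
}

# Usage, with files copied out of the failed attempt's diagnostics:
#   cgroup_oom_verdict ./memory.events ./memory.peak ./memory.max
```

A peak close to 100% with non-zero oom_kill supports the runner-memory reading; a low peak with zero counters suggests looking elsewhere for the source of the signal 9.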
Docker Runner Hits ENFILE
Symptom: A tinyland-docker job fails during dependency installation with
ENFILE, file table overflow, or Error: spawn ps ENFILE. GitHub may also
show a proof run stuck after the job’s contract steps completed if the runner
listener aborts during final upload or cleanup.
Read: Treat this as runner-host capacity or cleanup evidence, not as a normal application test failure. It is also separate from RustFS bucket-index debt, Attic publication, BCR, and RBE/CAS authority.
The repo-owned Platform Proof Docker lane intentionally bounds pnpm install
fanout with --child-concurrency=2 --network-concurrency=4. That keeps the
proof focused on the shared Docker runner contract instead of accidentally
turning the proof into a host file-table stress test.
Fix:
- Capture the failing job runner and job id:
    gh api repos/<owner>/<repo>/actions/jobs/<job_id>
- Check recent ARC runner events and finalizer state:
    kubectl --context honey -n arc-runners get events --sort-by=.lastTimestamp \
      | rg '<runner-name>|ENFILE|OOM|Failed|Killing'
    kubectl --context honey -n arc-runners get ephemeralrunners,ephemeralrunnersets \
      | rg '<runner-scale-set>|<runner-name>'
- Capture the node file-table and memory state before deleting evidence:
    kubectl --context honey debug node/sting -it --image=busybox:1.36 -- \
      chroot /host sh -lc 'cat /proc/sys/fs/file-nr; cat /proc/sys/fs/file-max; cat /proc/sys/fs/nr_open; free -m'
- If the proof itself failed but a rerun passes on the same commit, classify the failed run as runner-host capacity evidence and keep the clean rerun as the contract result.
- If ordinary downstream package installs repeatedly hit this failure class, either lower package-manager concurrency in that workflow or move the job to a more appropriate lane. Do not raise tinyland-docker capacity blindly.
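
To read the file-nr output captured above, the following sketch turns its three fields (allocated handles, allocated-but-unused handles, system maximum) into a usage percentage. The helper name file_table_usage is hypothetical and the function is only an aid for interpreting saved evidence.

```shell
# file_table_usage "ALLOCATED UNUSED MAX"
# Takes the whitespace-separated content of /proc/sys/fs/file-nr as a single
# argument and prints in-use handles, the maximum, and the percentage used.
# Hypothetical helper for illustration only.
file_table_usage() {
  # Re-split the single argument into the three fields (tabs or spaces).
  set -- $1
  allocated=$1
  unused=$2
  max=$3
  in_use=$((allocated - unused))
  echo "in_use=$in_use max=$max used=$((in_use * 100 / max))%"
}

# Usage on the node, or with a value pasted from the debug session:
#   file_table_usage "$(cat /proc/sys/fs/file-nr)"
```

A high percentage at failure time supports the runner-host capacity reading; a low one suggests checking per-process limits such as nr_open instead.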
Runner Loses Communication Mid-Build
Symptom: GitHub Actions shows The self-hosted runner lost communication with the server, while the job log ends with error: interrupted by the user
in the middle of a long build.
Read: This is often not a repo-local build failure. On honey, the first
thing to check is whether Kubernetes has API/CNI continuity evidence, node
pressure, or a runner-pod eviction near the failure.
Fix:
- Confirm the failed job’s runner name:
    gh api repos/<owner>/<repo>/actions/jobs/<job_id>
- Check recent ARC events on honey:
    kubectl --context honey -n arc-runners get events --sort-by=.lastTimestamp \
      | rg 'Evicted|DiskPressure|ephemeral-storage|<runner-name>'
- Classify Kubernetes continuity evidence for the runner:
    just arc-network-continuity-audit --runner-name <runner-name>
- Confirm the live runner-set envelope:
    just arc-runtime-audit
  If the log scan shows Job message not found/job was canceled on a max-1 operator lane, verify the queued replacement job still exists and the pod has no Runner.Worker process before deleting only that stale EphemeralRunner. If the listener is missing and no-job runners appear to be blocking listener recreation, rerun with:
    just arc-runtime-audit --fail-on-stale-idle-listener-blocker
  Treat any cleanup hint as gated evidence, not an automatic action. Capture GitHub runner API output, pass it with --github-runners-json, verify every candidate runner is busy=false, and confirm no pod has a Runner.Worker process before deleting the owning EphemeralRunnerSet. The managed ARC apply now automates the idle-leak shape pre-apply: between quiesce scoping and the cap freeze, Deploy ARC Runners runs scripts/reap-idle-leaked-ephemeral-runners.sh against the affected scale sets, deleting only no-job EphemeralRunner CRs whose owning EphemeralRunnerSet wants zero replicas or that are excess beyond desired and older than the minimum age, with a just-in-time per-CR job re-check before each delete. For a manual, evidence-first pass use:
    just arc-reap-zombies --scale-set <scale-set> --dry-run
  If the scale set is instead deadlocked mid listener rollover (TIN-2055 signature: AutoscalingRunnerSet phase Pending, no listener pod in arc-systems, controller log Waiting for the running and pending runners to finish), the blocker is an idle warm runner of the previous EphemeralRunnerSet generation sitting at current==desired, a shape the default reap contract deliberately skips. This deadlock class is produced by the controller’s old eventual update strategy; since TIN-2056 the controller runs immediate (the upstream default), which recreates the listener and new generation at once, so a fresh occurrence of this signature now points at a transition that predates the flip or a controller restart that stranded one mid-drain. The recovery helpers are retained. Use the opt-in stale-generation mode, which also reaps jobless runners owned by a non-newest EphemeralRunnerSet generation (newest = latest creationTimestamp, falling back to differing actions.github.com/runner-spec-hash annotations) even at current==desired:
    just arc-reap-zombies-stale --scale-set <scale-set> --dry-run
  This shape is also automated twice over: every successful managed apply settle-reaps it post-apply (after cap restore, before the listener-cap prove gate; a safe no-op when immediate already deleted the stale generation), and the runner-zombie-reap CronJob backstop sweeps every live scale set, owner overlays included, every 30 minutes with --min-age-seconds 3600. The managed apply additionally refuses to start while a rollover is already in flight (pre-apply wedge canary: single EphemeralRunnerSet per set, listeners Running), so wedged transitions never get a new transition stacked on top.
- Check kubelet root/imagefs capacity separately from durable storage:
    just kubelet-imagefs-capacity-audit
    just kubelet-imagefs-capacity-audit --node bumble
- If the node is overcommitted on disk, raise the Nix lane’s ephemeral-storage request and limit in the ARC stack instead of patching the downstream repo again.
Checkout Fails Before Repo Code Runs
Symptom: actions/checkout fails with EACCES, unlink errors, or stale
workspace state under _work/, before the repository’s own build logic starts.
Read: This is usually a runner-host hygiene problem, not a downstream repo contract problem.
See Honey Runner Workdir Contract for the lifecycle boundary and escalation rules behind this failure class.
Fix:
- If you have the failing run URL or run id, start from the run itself:
    just honey-runner-checkout-triage \
      https://github.com/Jesssullivan/scheduling-bridge/actions/runs/24525417273
  Use --parse-only when you only want the run/log extraction or the current shell cannot reach the honey hosts over SSH.
- Audit the runner hosts directly when you need the raw host view:
    just honey-runner-workdir-audit
- Generate the bounded recovery plan:
    just honey-runner-workdir-reconcile
  If the host has more than one contaminated repo workdir, stop there and treat it as a replacement or wider manual-triage incident.
- If a specific host shows a safe single-repo recovery candidate, drain that runner host root first:
    just honey-runner-host-lifecycle honey-am-2 drain
- Preview the bounded remediation:
    just honey-runner-workdir-remediate honey-am-2 scheduling-bridge
- Apply the bounded remediation:
    just honey-runner-workdir-remediate honey-am-2 scheduling-bridge --apply
  Use --mode unlock --apply when you need to restore owner write bits before inspection or escalation. Or use:
    just honey-runner-workdir-reconcile --apply --confirm-drained
  when you want the repo-owned automation path to execute only the safe single-repo host recoveries.
- Restart the runner host root:
    just honey-runner-host-lifecycle honey-am-2 start
  If the host root cannot be restarted cleanly, replace it instead of widening salvage.
- Rerun the downstream job.
If checkout fails before repo code runs, post-checkout cleanup inside the downstream repo will not help.
Cache Misses on Nix Runner
Symptom: Nix builds download or compile everything from scratch despite previous builds having populated the cache.
Causes and fixes:
- ATTIC_SERVER not set: Verify the environment variable is present in the runner pod. Check that the Kubernetes Secret for Attic credentials exists in the runner’s namespace (arc-runners or gitlab-runners, depending on the path you are debugging).
- Attic service unreachable: Confirm the Attic cache service is running in the nix-cache namespace. Test connectivity from a runner pod with curl $ATTIC_SERVER.
- Cache name mismatch: Verify ATTIC_CACHE matches the cache name used in attic push commands.
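
The first and third checks above can be run together from a shell inside the runner pod. The sketch below is a minimal illustration: attic_env_check is a hypothetical helper name, and it only reports whether the two variables named in this section are set, not whether their values are correct.

```shell
# attic_env_check
# Reports which of ATTIC_SERVER and ATTIC_CACHE are unset or empty in the
# current environment. Returns non-zero if any is missing.
# Hypothetical helper for illustration only.
attic_env_check() {
  missing=""
  for name in ATTIC_SERVER ATTIC_CACHE; do
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}

# Usage inside a runner pod shell:
#   attic_env_check && curl "$ATTIC_SERVER"
```

If the function prints ok but builds still miss the cache, compare the value of ATTIC_CACHE with the cache name in the attic push commands by hand.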
TOML Configuration Gotchas
The GitLab Runner TOML configuration has several pitfalls in Runner 17.x:
- Resource limits must be flat keys: Values like cpu_limit, memory_limit, cpu_request, and memory_request must be specified as flat keys in the [[runners.kubernetes]] section. Do not nest them inside a TOML table.
- pod_spec.containers type mismatch: Using pod_spec with a containers field causes a type mismatch error in Runner 17.x. Instead, use environment = [...] on the [[runners]] section to inject environment variables.
Runner Pods Pending
Symptom: Pods stay in Pending state and are not scheduled.
Causes and fixes:
- Insufficient cluster resources: Check node capacity with kubectl describe nodes. The cluster may need more nodes or the runner resource requests may be too high.
- HPA at maximum: If all replicas are running and jobs are still queuing, increase the HPA maximum. See HPA Tuning.
- Image pull failure behind a Pending pod: A pod can still show Pending while the runner container is actually blocked in ImagePullBackOff. Check kubectl describe pod and the container waiting reason before treating the incident as a scheduler or capacity failure.
Live GloriousFlywheel example:
- baseline tinyland-nix pods on honey were observed as Pending
- the real blocker was ImagePullBackOff on ghcr.io/tinyland-inc/actions-runner-nix:latest
- the concrete pull error was 401 Unauthorized
Meaning:
- do not assume every Pending runner pod is a scheduler or capacity problem
- verify image-pull auth before treating the incident as a memory or placement failure
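
The triage order above can be written down as a small classifier over the container waiting reason. This is a sketch only: classify_pending_reason is a hypothetical name, and the usage comment assumes the reason is read from the pod status with a kubectl jsonpath query.

```shell
# classify_pending_reason REASON
# Maps a container waiting reason to the incident class used in this section.
# Hypothetical helper for illustration only.
classify_pending_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image-pull: check registry auth before capacity" ;;
    "")
      echo "no-waiting-reason: check scheduler events and node capacity" ;;
    *)
      echo "other: $1" ;;
  esac
}

# Usage (pod name and namespace are placeholders):
#   reason=$(kubectl -n arc-runners get pod <runner-pod> \
#     -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}')
#   classify_pending_reason "$reason"
```

An empty reason usually means no container has been created yet, which is the case where scheduler events and node capacity are the right place to look.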
Heavy Nix Lane Fails With npm TLS Errors
Symptom: A tinyland-nix-heavy job starts, but the workload
fails inside nix build with pnpm errors like
UNABLE_TO_GET_ISSUER_CERT_LOCALLY against registry.npmjs.org.
Read: This means ARC placement is already working. The remaining problem is certificate trust inside the build path, not runner scheduling or GHCR pull auth.
Fix:
- Make sure the derivation or build wrapper exports a CA bundle into the Node and Nix fetch paths:
  - SSL_CERT_FILE
  - NIX_SSL_CERT_FILE
  - NODE_EXTRA_CA_CERTS
- Re-run the heavy canary after the derivation change lands.
- If the error persists, inspect the runner image and runtime trust store rather than changing runner placement.
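
As a minimal sketch of the export step above, a build wrapper could point all three variables at one bundle file. The helper name export_ca_bundle and the bundle path in the usage comment are assumptions; the real path depends on the runner image or on the CA package the derivation provides.

```shell
# export_ca_bundle BUNDLE_PATH
# Points SSL_CERT_FILE, NIX_SSL_CERT_FILE, and NODE_EXTRA_CA_CERTS at one CA
# bundle. Leaves the environment unchanged and returns non-zero if the file
# is not readable. Hypothetical helper for illustration only.
export_ca_bundle() {
  bundle=$1
  if [ ! -r "$bundle" ]; then
    echo "CA bundle not readable: $bundle" >&2
    return 1
  fi
  export SSL_CERT_FILE="$bundle"
  export NIX_SSL_CERT_FILE="$bundle"
  export NODE_EXTRA_CA_CERTS="$bundle"
}

# Usage (the path is an assumption; Debian-style images commonly use it):
#   export_ca_bundle /etc/ssl/certs/ca-certificates.crt
```

Failing early on an unreadable bundle keeps a wrong path from surfacing later as the same UNABLE_TO_GET_ISSUER_CERT_LOCALLY error.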
Related
- Runbook — operational procedures for common tasks
- Security Model — access and permission details