# Troubleshooting
Common issues with the runner infrastructure and how to resolve them.
## Runner Not Registering
Symptom: Runner pod starts but does not appear in the GitLab group runner list.
Causes and fixes:
- Invalid runner token: Verify the Kubernetes Secret containing the registration token exists and is current. Delete the secret and run `tofu apply` to recreate it.
- Group access: Confirm that the service account or user associated with the token has access to the target GitLab group.
- Token auto-creation: Automatic runner token creation requires the Owner role on the GitLab group. If the deploying user does not have Owner, token creation will fail silently. Verify role assignment in GitLab group settings.
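The token check above can be sketched as follows. Everything here is illustrative: the namespace, secret name, and base64 value are placeholders, and the canned manifest stands in for real `kubectl` output.

```shell
# Placeholder manifest standing in for:
#   kubectl -n <namespace> get secret <name> -o yaml
cat > /tmp/runner-secret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-runner-registration-token
data:
  runner-registration-token: Z2xydC1leGFtcGxlLXRva2Vu
EOF

# Extract and decode the token to confirm the secret actually carries a value.
token_b64=$(sed -n 's/.*runner-registration-token: //p' /tmp/runner-secret.yaml)
echo "$token_b64" | base64 -d; echo
```

If the decoded value is empty or stale against the live secret, delete the secret and run `tofu apply` to recreate it.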
## ARC Listener Fails With GitHub API Rate Limit
Symptom: The listener pod restarts or never establishes a broker session, and the logs contain a line like:
```
failed to get runner registration token (403 Forbidden): API rate limit exceeded for user ID ...
```
Read: This is an auth-contract problem, not a cluster-capacity problem.
The affected scale set is authenticated with a PAT-backed secret, and GitHub
has exhausted that user’s REST core budget. ARC cannot mint a runner
registration token, so repo-scoped lanes stay offline even if the honey
cluster and ARC controller are otherwise healthy.
Fix:
- Confirm the failure on the listener: `kubectl -n arc-systems logs <listener-pod>`
- Check whether the shared PAT is exhausted: `gh api rate_limit`
- Replace the PAT-backed secret with a dedicated GitHub App installation secret for the target org or repository, created in both `arc-systems` and `arc-runners`.
- Update the runner set to use that new secret and re-apply the ARC stack.
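The PAT check can be sketched like this. The JSON below is a canned sample standing in for live `gh api rate_limit` output; only the threshold logic matters.

```shell
# Canned sample; live, capture it first: gh api rate_limit > /tmp/rate_limit.json
cat > /tmp/rate_limit.json <<'EOF'
{"resources":{"core":{"limit":5000,"remaining":0,"reset":1719999999}}}
EOF

# A zero core 'remaining' means ARC cannot mint registration tokens
# until the reset epoch, no matter how healthy the cluster is.
remaining=$(sed -n 's/.*"remaining":\([0-9]*\).*/\1/p' /tmp/rate_limit.json)
if [ "$remaining" -eq 0 ]; then
  echo "core budget exhausted: move this lane to a GitHub App secret"
fi
```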
Temporary bridge only:
- waiting for the PAT rate-limit reset can bring the listener back
- reducing PAT traffic can delay the next outage
- neither is a durable fix for an active repo-scoped runner lane
## Pods Crashing (OOMKilled)
Symptom: Runner pods restart repeatedly. `kubectl describe pod` shows `OOMKilled` as the termination reason.
Fix: Increase the memory limit for the affected runner type in `organization.yaml` and run `tofu apply`. See HPA Tuning for resource limit configuration.
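As an illustration only (the real `organization.yaml` schema is not reproduced here, and the lane name and values are placeholders), the change is a per-lane memory bump along these lines:

```yaml
# Hypothetical shape -- adapt to the actual organization.yaml schema.
runners:
  tinyland-nix-heavy:
    memory_request: "4Gi"
    memory_limit: "16Gi"   # raised from an assumed 8Gi baseline
```

Re-run `tofu apply` after the edit so the new limits reach the runner pods.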
Important read:
- an OOM on a runner pod does not automatically mean the whole honey cluster is out of memory
- the current ARC runners still run inside per-pod cgroup limits
- for example, the committed baseline `tinyland-nix` lane is still an `8Gi` memory-limit runner
- Rust-heavy `clippy` or `rustc` workloads can still hit that limit even if the cluster aggregate has far more RAM available
Common memory-hungry workloads:
- `dind`: Container builds with large build contexts.
- `nix`: Derivations that compile large packages from source.
- Rust-heavy lint/build steps: `clippy` and parallel `rustc` processes can spike memory within one runner pod.
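Before moving a job between lanes, it helps to confirm the kill really was the per-pod cgroup limit. A minimal sketch, using a canned snippet of `kubectl describe pod` output: exit code 137 is SIGKILL, which together with reason `OOMKilled` points at the container limit rather than node pressure.

```shell
# Canned sample of the relevant `kubectl describe pod` lines.
cat > /tmp/describe.txt <<'EOF'
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137
EOF

if grep -q 'Reason:.*OOMKilled' /tmp/describe.txt; then
  echo "per-pod cgroup OOM: raise memory_limit or move the job to the heavy lane"
fi
```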
Recommended current path:
- keep general Nix work on `tinyland-nix`
- move recurring heavy Rust/Nix jobs to `tinyland-nix-heavy`
- use `just arc-runtime-audit` to confirm the live heavy lane and placement
- do not infer one runner pod's memory budget from total cluster RAM
## Checkout Fails Before Repo Code Runs
Symptom: `actions/checkout` fails with `EACCES`, `unlink` errors, or stale workspace state under `_work/`, before the repository's own build logic starts.
Read: This is usually a runner-host hygiene problem, not a downstream repo contract problem.
See Honey Runner Workdir Contract for the lifecycle boundary and escalation rules behind this failure class.
Fix:
- If you have the failing run URL or run ID, start from the run itself: `just honey-runner-checkout-triage https://github.com/Jesssullivan/acuity-middleware/actions/runs/24525417273`. Add `--parse-only` when you only want the run/log extraction or the current shell cannot reach the honey hosts over SSH.
- Audit the runner hosts directly when you need the raw host view: `just honey-runner-workdir-audit`
- Generate the bounded recovery plan: `just honey-runner-workdir-reconcile`. If the host has more than one contaminated repo workdir, stop there and treat it as a replacement or wider manual-triage incident.
- If a specific host shows a safe single-repo recovery candidate, stop or drain that runner host or service first.
- Preview the bounded remediation: `just honey-runner-workdir-remediate honey-am-2 acuity-middleware`
- Apply the bounded remediation: `just honey-runner-workdir-remediate honey-am-2 acuity-middleware --apply`. Use `--mode unlock --apply` when you need to restore owner write bits before inspection or escalation, or `just honey-runner-workdir-reconcile --apply --confirm-drained` when you want the repo-owned automation path to execute only the safe single-repo host recoveries.
- Restart or replace the runner.
- Rerun the downstream job.
If checkout fails before repo code runs, post-checkout cleanup inside the downstream repo will not help.
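The failure class can be reproduced and detected locally. This sketch builds a throwaway `_work/`-style tree, removes the owner write bit from one file (the state behind the `EACCES`/`unlink` errors), and then flags it; paths and names are placeholders.

```shell
# Throwaway stand-in for a runner's _work/ tree.
WORK=/tmp/work-demo/_work
mkdir -p "$WORK/acuity-middleware"
touch "$WORK/acuity-middleware/stale.lock"
chmod a-w "$WORK/acuity-middleware/stale.lock"   # simulate contamination

# Flag files checkout can no longer unlink (GNU find's symbolic -perm test).
find "$WORK" -type f ! -perm -u+w -print
```

On a real host, the `just honey-runner-workdir-*` recipes above are the supported path; this sketch only illustrates the kind of state they look for.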
## Cache Misses on Nix Runner
Symptom: Nix builds download or compile everything from scratch despite previous builds having populated the cache.
Causes and fixes:
- ATTIC_SERVER not set: Verify the environment variable is present in the runner pod. Check that the Kubernetes Secret for Attic credentials exists in the `{org}-runners` namespace.
- Attic service unreachable: Confirm the Attic cache service is running in the `attic-cache-dev` namespace. Test connectivity from a runner pod with `curl $ATTIC_SERVER`.
- Cache name mismatch: Verify `ATTIC_CACHE` matches the cache name used in `attic push` commands.
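A minimal sketch of the cache-name consistency check; the server URL and cache names are placeholders, not this deployment's values.

```shell
# Placeholder values standing in for the pod's real environment.
ATTIC_SERVER="${ATTIC_SERVER:-http://attic.attic-cache-dev.svc:8080}"
ATTIC_CACHE="${ATTIC_CACHE:-tinyland}"
PUSH_CACHE="tinyland"   # the name the pipeline passes to `attic push`

if [ "$ATTIC_CACHE" = "$PUSH_CACHE" ]; then
  echo "cache name consistent: $ATTIC_CACHE"
else
  echo "mismatch: ATTIC_CACHE=$ATTIC_CACHE but pushes target $PUSH_CACHE"
fi
# From a runner pod, verify reachability with: curl -fsS "$ATTIC_SERVER"
```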
## TOML Configuration Gotchas
The GitLab Runner TOML configuration has several pitfalls in Runner 17.x:
- Resource limits must be flat keys: Values like `cpu_limit`, `memory_limit`, `cpu_request`, and `memory_request` must be specified as flat keys in the `[runners.kubernetes]` section. Do not nest them inside a TOML table.
- pod_spec.containers type mismatch: Using `pod_spec` with a `containers` field causes a type mismatch error in Runner 17.x. Instead, use `environment = [...]` on the `[[runners]]` section to inject environment variables.
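Both rules together, as a minimal `config.toml` sketch (the values themselves are placeholders):

```toml
[[runners]]
  # Inject env vars here; pod_spec.containers trips the 17.x type mismatch.
  environment = ["ATTIC_SERVER=http://attic.example", "ATTIC_CACHE=demo"]

  [runners.kubernetes]
    # Flat keys -- do not nest them inside a sub-table.
    cpu_limit      = "2"
    memory_limit   = "8Gi"
    cpu_request    = "500m"
    memory_request = "2Gi"
```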
## Runner Pods Pending
Symptom: Pods stay in `Pending` state and are not scheduled.
Causes and fixes:
- Insufficient cluster resources: Check node capacity with `kubectl describe nodes`. The cluster may need more nodes, or the runner resource requests may be too high.
- HPA at maximum: If all replicas are running and jobs are still queuing, increase the HPA maximum. See HPA Tuning.
- Image pull failure behind a Pending pod: A pod can still show `Pending` while the runner container is actually blocked in `ImagePullBackOff`. Check `kubectl describe pod` and the container waiting reason before treating the incident as a scheduler or capacity failure.
Live GloriousFlywheel example:
- baseline `tinyland-nix` pods on `honey` were observed as `Pending`
- the real blocker was `ImagePullBackOff` on `ghcr.io/tinyland-inc/actions-runner-nix:latest`
- the concrete pull error was `401 Unauthorized`
Meaning:
- do not assume every `Pending` runner pod is a scheduler or capacity problem
- verify image-pull auth before treating the incident as a memory or placement failure
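The waiting-reason check can be sketched as follows; the JSON is a canned stand-in for `kubectl get pod <pod> -o json`, shaped like the GloriousFlywheel incident above.

```shell
# Canned pod status: phase Pending, container actually stuck on an image pull.
cat > /tmp/pod.json <<'EOF'
{"status":{"phase":"Pending","containerStatuses":[{"state":{"waiting":{"reason":"ImagePullBackOff","message":"401 Unauthorized"}}}]}}
EOF

# Surface the container waiting reason instead of trusting the pod phase.
sed -n 's/.*"reason":"\([A-Za-z]*\)".*/\1/p' /tmp/pod.json
```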
## Heavy Nix Lane Fails With npm TLS Errors
Symptom: A `tinyland-nix-heavy` job starts on `sting`, but the workload fails inside `nix build` with pnpm errors like `UNABLE_TO_GET_ISSUER_CERT_LOCALLY` against `registry.npmjs.org`.
Read: This means ARC placement is already working. The remaining problem is certificate trust inside the build path, not runner scheduling or GHCR pull auth.
Fix:
- Make sure the derivation or build wrapper exports a CA bundle into the Node and Nix fetch paths: `SSL_CERT_FILE`, `NIX_SSL_CERT_FILE`, and `NODE_EXTRA_CA_CERTS`.
- Re-run the heavy canary after the derivation change lands.
- If the error persists, inspect the runner image and runtime trust store rather than changing runner placement.
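A minimal wrapper sketch for the export step; the bundle path is an assumption, so substitute whatever CA bundle the derivation actually provides.

```shell
# Route Node and Nix fetchers through one CA bundle (path is an assumption).
CA_BUNDLE="${CA_BUNDLE:-/etc/ssl/certs/ca-certificates.crt}"
export SSL_CERT_FILE="$CA_BUNDLE"
export NIX_SSL_CERT_FILE="$CA_BUNDLE"
export NODE_EXTRA_CA_CERTS="$CA_BUNDLE"
echo "CA bundle wired to: $CA_BUNDLE"
```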
## Related
- Runbook — operational procedures for common tasks
- Security Model — access and permission details