# Troubleshooting
Common issues with the runner infrastructure and how to resolve them.
## Runner Not Registering
Symptom: Runner pod starts but does not appear in the GitLab group runner list.
Causes and fixes:
- Invalid runner token: Verify the Kubernetes Secret containing the registration token exists and is current. Delete the secret and run `tofu apply` to recreate it.
- Group access: Confirm that the service account or user associated with the token has access to the target GitLab group.
- Token auto-creation: Automatic runner token creation requires the Owner role on the GitLab group. If the deploying user does not have Owner, token creation will fail silently. Verify role assignment in GitLab group settings.
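The token check above can be sketched as follows. Everything here is illustrative: the namespace, secret name, and base64 value are placeholders, and the canned manifest stands in for real `kubectl` output.

```shell
# Placeholder manifest standing in for:
#   kubectl -n <namespace> get secret <name> -o yaml
cat > /tmp/runner-secret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-runner-registration-token
data:
  runner-registration-token: Z2xydC1leGFtcGxlLXRva2Vu
EOF

# Extract and decode the token to confirm the secret actually carries a value.
token_b64=$(sed -n 's/.*runner-registration-token: //p' /tmp/runner-secret.yaml)
echo "$token_b64" | base64 -d; echo
```

If the decoded value is empty or stale against the live secret, delete the secret and run `tofu apply` to recreate it.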
## ARC Listener Fails With GitHub API Rate Limit
Symptom: The listener pod restarts or never establishes a broker session, and the logs contain a line like:
```
failed to get runner registration token (403 Forbidden): API rate limit exceeded for user ID ...
```
Read: This is an auth-contract problem, not a cluster-capacity problem.
The affected scale set is authenticated with a PAT-backed secret, and GitHub
has exhausted that user’s REST core budget. ARC cannot mint a runner
registration token, so repo-scoped lanes stay offline even if the honey
cluster and ARC controller are otherwise healthy.
Fix:
- Confirm the failure on the listener: `kubectl -n arc-systems logs <listener-pod>`
- Check whether the shared PAT is exhausted: `gh api rate_limit`
- Replace the PAT-backed secret with a dedicated GitHub App installation secret for the target org or repository, created in both `arc-systems` and `arc-runners`.
- Update the runner set to use that new secret and re-apply the ARC stack.
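The PAT check can be sketched like this. The JSON below is a canned sample standing in for live `gh api rate_limit` output; only the threshold logic matters.

```shell
# Canned sample; live, capture it first: gh api rate_limit > /tmp/rate_limit.json
cat > /tmp/rate_limit.json <<'EOF'
{"resources":{"core":{"limit":5000,"remaining":0,"reset":1719999999}}}
EOF

# A zero core 'remaining' means ARC cannot mint registration tokens
# until the reset epoch, no matter how healthy the cluster is.
remaining=$(sed -n 's/.*"remaining":\([0-9]*\).*/\1/p' /tmp/rate_limit.json)
if [ "$remaining" -eq 0 ]; then
  echo "core budget exhausted: move this lane to a GitHub App secret"
fi
```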
Temporary bridge only:
- waiting for the PAT rate-limit reset can bring the listener back
- reducing PAT traffic can delay the next outage
- neither is a durable fix for an active repo-scoped runner lane
## Pods Crashing (OOMKilled)
Symptom: Runner pods restart repeatedly. `kubectl describe pod` shows `OOMKilled` as the termination reason.
Fix: Increase the memory limit for the affected runner type in `organization.yaml` and run `tofu apply`. See HPA Tuning for resource limit configuration.
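As an illustration only (the real `organization.yaml` schema is not reproduced here, and the lane name and values are placeholders), the change is a per-lane memory bump along these lines:

```yaml
# Hypothetical shape -- adapt to the actual organization.yaml schema.
runners:
  tinyland-nix-heavy:
    memory_request: "4Gi"
    memory_limit: "16Gi"   # raised from an assumed 8Gi baseline
```

Re-run `tofu apply` after the edit so the new limits reach the runner pods.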
Important read:
- an OOM on a runner pod does not automatically mean the whole honey cluster is out of memory
- the current ARC runners still run inside per-pod cgroup limits
- for example, the committed baseline `tinyland-nix` lane is still an `8Gi` memory-limit runner
- Rust-heavy `clippy` or `rustc` workloads can still hit that limit even if the cluster aggregate has far more RAM available
Common memory-hungry workloads:
- `dind`: Container builds with large build contexts.
- `nix`: Derivations that compile large packages from source.
- Rust-heavy lint/build steps: `clippy` and parallel `rustc` processes can spike memory within one runner pod.
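Before moving a job between lanes, it helps to confirm the kill really was the per-pod cgroup limit. A minimal sketch, using a canned snippet of `kubectl describe pod` output: exit code 137 is SIGKILL, which together with reason `OOMKilled` points at the container limit rather than node pressure.

```shell
# Canned sample of the relevant `kubectl describe pod` lines.
cat > /tmp/describe.txt <<'EOF'
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137
EOF

if grep -q 'Reason:.*OOMKilled' /tmp/describe.txt; then
  echo "per-pod cgroup OOM: raise memory_limit or move the job to the heavy lane"
fi
```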
Recommended current path:
- keep general Nix work on `tinyland-nix`
- move recurring heavy Rust/Nix jobs to `tinyland-nix-heavy`
- use `just arc-runtime-audit` to confirm the live heavy lane and placement
- do not infer one runner pod's memory budget from total cluster RAM
## Checkout Fails Before Repo Code Runs
Symptom: `actions/checkout` fails with `EACCES`, `unlink` errors, or stale workspace state under `_work/`, before the repository's own build logic starts.
Read: This is usually a runner-host hygiene problem, not a downstream repo contract problem.
See Honey Runner Workdir Contract for the lifecycle boundary and escalation rules behind this failure class.
Fix:
- If you have the failing run URL or run ID, start from the run itself: `just honey-runner-checkout-triage https://github.com/Jesssullivan/acuity-middleware/actions/runs/24525417273`. Add `--parse-only` when you only want the run/log extraction or the current shell cannot reach the honey hosts over SSH.
- Audit the runner hosts directly when you need the raw host view: `just honey-runner-workdir-audit`
- Generate the bounded recovery plan: `just honey-runner-workdir-reconcile`. If the host has more than one contaminated repo workdir, stop there and treat it as a replacement or wider manual-triage incident.
- If a specific host shows a safe single-repo recovery candidate, stop or drain that runner host or service first.
- Preview the bounded remediation: `just honey-runner-workdir-remediate honey-am-2 acuity-middleware`
- Apply the bounded remediation: `just honey-runner-workdir-remediate honey-am-2 acuity-middleware --apply`. Use `--mode unlock --apply` when you need to restore owner write bits before inspection or escalation, or `just honey-runner-workdir-reconcile --apply --confirm-drained` when you want the repo-owned automation path to execute only the safe single-repo host recoveries.
- Restart or replace the runner.
- Rerun the downstream job.
If checkout fails before repo code runs, post-checkout cleanup inside the downstream repo will not help.
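The failure class can be reproduced and detected locally. This sketch builds a throwaway `_work/`-style tree, removes the owner write bit from one file (the state behind the `EACCES`/`unlink` errors), and then flags it; paths and names are placeholders.

```shell
# Throwaway stand-in for a runner's _work/ tree.
WORK=/tmp/work-demo/_work
mkdir -p "$WORK/acuity-middleware"
touch "$WORK/acuity-middleware/stale.lock"
chmod a-w "$WORK/acuity-middleware/stale.lock"   # simulate contamination

# Flag files checkout can no longer unlink (GNU find's symbolic -perm test).
find "$WORK" -type f ! -perm -u+w -print
```

On a real host, the `just honey-runner-workdir-*` recipes above are the supported path; this sketch only illustrates the kind of state they look for.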
## Cache Misses on Nix Runner
Symptom: Nix builds download or compile everything from scratch despite previous builds having populated the cache.
Causes and fixes:
- ATTIC_SERVER not set: Verify the environment variable is present in the runner pod. Check that the Kubernetes Secret for Attic credentials exists in the `{org}-runners` namespace.
- Attic service unreachable: Confirm the Attic cache service is running in the `attic-cache-dev` namespace. Test connectivity from a runner pod with `curl $ATTIC_SERVER`.
- Cache name mismatch: Verify `ATTIC_CACHE` matches the cache name used in `attic push` commands.
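A minimal sketch of the cache-name consistency check; the server URL and cache names are placeholders, not this deployment's values.

```shell
# Placeholder values standing in for the pod's real environment.
ATTIC_SERVER="${ATTIC_SERVER:-http://attic.attic-cache-dev.svc:8080}"
ATTIC_CACHE="${ATTIC_CACHE:-tinyland}"
PUSH_CACHE="tinyland"   # the name the pipeline passes to `attic push`

if [ "$ATTIC_CACHE" = "$PUSH_CACHE" ]; then
  echo "cache name consistent: $ATTIC_CACHE"
else
  echo "mismatch: ATTIC_CACHE=$ATTIC_CACHE but pushes target $PUSH_CACHE"
fi
# From a runner pod, verify reachability with: curl -fsS "$ATTIC_SERVER"
```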
## TOML Configuration Gotchas
The GitLab Runner TOML configuration has several pitfalls in Runner 17.x:
- Resource limits must be flat keys: Values like `cpu_limit`, `memory_limit`, `cpu_request`, and `memory_request` must be specified as flat keys in the `[runners.kubernetes]` section. Do not nest them inside a TOML table.
- pod_spec.containers type mismatch: Using `pod_spec` with a `containers` field causes a type mismatch error in Runner 17.x. Instead, use `environment = [...]` on the `[[runners]]` section to inject environment variables.
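Both rules together, as a minimal `config.toml` sketch (the values themselves are placeholders):

```toml
[[runners]]
  # Inject env vars here; pod_spec.containers trips the 17.x type mismatch.
  environment = ["ATTIC_SERVER=http://attic.example", "ATTIC_CACHE=demo"]

  [runners.kubernetes]
    # Flat keys -- do not nest them inside a sub-table.
    cpu_limit      = "2"
    memory_limit   = "8Gi"
    cpu_request    = "500m"
    memory_request = "2Gi"
```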
## Runner Pods Pending
Symptom: Pods stay in `Pending` state and are not scheduled.
Causes and fixes:
- Insufficient cluster resources: Check node capacity with `kubectl describe nodes`. The cluster may need more nodes, or the runner resource requests may be too high.
- HPA at maximum: If all replicas are running and jobs are still queuing, increase the HPA maximum. See HPA Tuning.
- Image pull failure behind a Pending pod: A pod can still show `Pending` while the runner container is actually blocked in `ImagePullBackOff`. Check `kubectl describe pod` and the container waiting reason before treating the incident as a scheduler or capacity failure.
Live GloriousFlywheel example:
- baseline `tinyland-nix` pods on `honey` were observed as `Pending`
- the real blocker was `ImagePullBackOff` on `ghcr.io/tinyland-inc/actions-runner-nix:latest`
- the concrete pull error was `401 Unauthorized`
Meaning:
- do not assume every `Pending` runner pod is a scheduler or capacity problem
- verify image-pull auth before treating the incident as a memory or placement failure
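The waiting-reason check can be sketched as follows; the JSON is a canned stand-in for `kubectl get pod <pod> -o json`, shaped like the GloriousFlywheel incident above.

```shell
# Canned pod status: phase Pending, container actually stuck on an image pull.
cat > /tmp/pod.json <<'EOF'
{"status":{"phase":"Pending","containerStatuses":[{"state":{"waiting":{"reason":"ImagePullBackOff","message":"401 Unauthorized"}}}]}}
EOF

# Surface the container waiting reason instead of trusting the pod phase.
sed -n 's/.*"reason":"\([A-Za-z]*\)".*/\1/p' /tmp/pod.json
```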
## Heavy Nix Lane Fails With npm TLS Errors
Symptom: A `tinyland-nix-heavy` job starts on `sting`, but the workload fails inside `nix build` with pnpm errors like `UNABLE_TO_GET_ISSUER_CERT_LOCALLY` against `registry.npmjs.org`.
Read: This means ARC placement is already working. The remaining problem is certificate trust inside the build path, not runner scheduling or GHCR pull auth.
Fix:
- Make sure the derivation or build wrapper exports a CA bundle into the Node and Nix fetch paths: `SSL_CERT_FILE`, `NIX_SSL_CERT_FILE`, and `NODE_EXTRA_CA_CERTS`.
- Re-run the heavy canary after the derivation change lands.
- If the error persists, inspect the runner image and runtime trust store rather than changing runner placement.
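A minimal wrapper sketch for the export step; the bundle path is an assumption, so substitute whatever CA bundle the derivation actually provides.

```shell
# Route Node and Nix fetchers through one CA bundle (path is an assumption).
CA_BUNDLE="${CA_BUNDLE:-/etc/ssl/certs/ca-certificates.crt}"
export SSL_CERT_FILE="$CA_BUNDLE"
export NIX_SSL_CERT_FILE="$CA_BUNDLE"
export NODE_EXTRA_CA_CERTS="$CA_BUNDLE"
echo "CA bundle wired to: $CA_BUNDLE"
```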
## Related
- Runbook — operational procedures for common tasks
- Security Model — access and permission details