Troubleshooting

Common issues with the runner infrastructure and how to resolve them.

Runner Not Registering

Symptom: Runner pod starts but does not appear in the GitLab group runner list.

Causes and fixes:

  • Invalid runner token: Verify the Kubernetes Secret containing the registration token exists and is current. Delete the secret and run tofu apply to recreate it.
  • Group access: Confirm that the service account or user associated with the token has access to the target GitLab group.
  • Token auto-creation: Automatic runner token creation requires the Owner role on the GitLab group. If the deploying user does not have Owner, token creation will fail silently. Verify role assignment in GitLab group settings.
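The token-secret recovery above can be sketched as a dry-run helper. This is a hypothetical sketch: the namespace and secret names below are assumptions, not the real resource names in this deployment.

```shell
# Hypothetical dry-run sketch: print the commands that recreate the
# registration-token secret so tofu can re-mint it on the next apply.
# "gitlab-runners" and "gitlab-runner-token" are assumed names.
recreate_runner_token() {
  ns="$1"; secret="$2"
  echo "kubectl -n ${ns} delete secret ${secret}"
  echo "tofu apply"
}
recreate_runner_token gitlab-runners gitlab-runner-token
```

Printing the commands first keeps the sketch safe to run anywhere; paste the output only once the names match your cluster.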

ARC Listener Fails With GitHub API Rate Limit

Symptom: The listener pod restarts or never establishes a broker session, and the logs contain a line like:

failed to get runner registration token (403 Forbidden): API rate limit exceeded for user ID ...

Read: This is an auth-contract problem, not a cluster-capacity problem. The affected scale set is authenticated with a PAT-backed secret, and GitHub has exhausted that user’s REST core budget. ARC cannot mint a runner registration token, so repo-scoped lanes stay offline even if the honey cluster and ARC controller are otherwise healthy.

Fix:

  1. Confirm the failure on the listener:
    kubectl -n arc-systems logs <listener-pod>
  2. Check whether the shared PAT is exhausted:
    gh api rate_limit
  3. Replace the PAT-backed secret with a dedicated GitHub App installation secret for the target org or repository, created in both arc-systems and arc-runners.
  4. Update the runner set to use that new secret and re-apply the ARC stack.
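Step 2 can be wrapped in a small helper. The `gh api rate_limit --jq` call is the intended data source; the threshold check itself is a hypothetical sketch, and the wording of its output is an assumption.

```shell
# Sketch: decide whether the shared PAT's REST core budget is exhausted.
# In practice the input comes from:
#   remaining=$(gh api rate_limit --jq '.resources.core.remaining')
core_budget_state() {
  if [ "$1" -eq 0 ]; then
    echo "exhausted: listener cannot mint registration tokens"
  else
    echo "ok: ${1} core requests remaining"
  fi
}
core_budget_state 0
```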

Temporary bridge only:

  • waiting for the PAT rate-limit reset can bring the listener back
  • reducing PAT traffic can delay the next outage
  • neither is a durable fix for an active repo-scoped runner lane

Pods Crashing (OOMKilled)

Symptom: Runner pods restart repeatedly. kubectl describe pod shows OOMKilled as the termination reason.

Fix: Increase the memory limit for the affected runner type in organization.yaml and run tofu apply. See HPA Tuning for resource limit configuration.

Important read:

  • an OOM on a runner pod does not automatically mean the whole honey cluster is out of memory
  • the current ARC runners still run inside per-pod cgroup limits
  • for example, the committed baseline tinyland-nix lane is still an 8Gi memory-limit runner
  • Rust-heavy clippy or rustc workloads can still hit that limit even if the cluster aggregate has far more RAM available
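As a concrete sanity check on the per-pod budget: the 8Gi limit comes from the baseline lane above, while the 9Gi peak is a hypothetical workload figure for illustration.

```shell
# Per-pod cgroup limit vs a hypothetical workload peak, in bytes.
# The cluster's aggregate RAM never enters this comparison.
limit=$((8 * 1024 * 1024 * 1024))   # 8Gi memory limit on the runner pod
peak=$((9 * 1024 * 1024 * 1024))    # assumed peak of a parallel rustc build
if [ "$peak" -gt "$limit" ]; then
  echo "OOMKilled: pod limit exceeded"
else
  echo "fits within pod limit"
fi
```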

Common memory-hungry workloads:

  • dind: Container builds with large build contexts.
  • nix: Derivations that compile large packages from source.
  • Rust-heavy lint/build steps: clippy and parallel rustc processes can spike memory within one runner pod.

Recommended current path:

  • keep general Nix work on tinyland-nix
  • move recurring heavy Rust/Nix jobs to tinyland-nix-heavy
  • use just arc-runtime-audit to confirm the live heavy lane and placement
  • do not infer one runner pod’s memory budget from total cluster RAM

Checkout Fails Before Repo Code Runs

Symptom: actions/checkout fails with EACCES, unlink errors, or stale workspace state under _work/, before the repository’s own build logic starts.

Read: This is usually a runner-host hygiene problem, not a downstream repo contract problem.

See Honey Runner Workdir Contract for the lifecycle boundary and escalation rules behind this failure class.

Fix:

  1. If you have the failing run URL or run id, start from the run itself:
    just honey-runner-checkout-triage \
      https://github.com/Jesssullivan/acuity-middleware/actions/runs/24525417273
    Use --parse-only when you only want the run/log extraction, or when the current shell cannot reach the honey hosts over SSH.
  2. Audit the runner hosts directly when you need the raw host view:
    just honey-runner-workdir-audit
  3. Generate the bounded recovery plan:
    just honey-runner-workdir-reconcile
    If the host has more than one contaminated repo workdir, stop there and treat it as a replacement or wider manual-triage incident.
  4. If a specific host shows a safe single-repo recovery candidate, stop or drain that runner host or service first.
  5. Preview the bounded remediation:
    just honey-runner-workdir-remediate honey-am-2 acuity-middleware
  6. Apply the bounded remediation:
    just honey-runner-workdir-remediate honey-am-2 acuity-middleware --apply
    Use --mode unlock --apply when you need to restore owner write bits before inspection or escalation. Or use:
    just honey-runner-workdir-reconcile --apply --confirm-drained
    when you want the repo-owned automation path to execute only the safe single-repo host recoveries.
  7. Restart or replace the runner.
  8. Rerun the downstream job.

If checkout fails before repo code runs, post-checkout cleanup inside the downstream repo will not help.
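A minimal host-side probe for this failure class can be sketched as follows; the `_work` path is an assumption and must be adjusted per host. The probe only lists suspect paths and changes nothing on disk.

```shell
# Hypothetical read-only probe: list paths under a runner _work dir that
# lost the owner write bit -- the usual cause of unlink/EACCES failures
# during actions/checkout cleanup. Changes nothing on disk.
list_unwritable() {
  find "$1" ! -perm -u+w -print 2>/dev/null
}
list_unwritable /home/runner/_work || true   # assumed workdir path
```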

Cache Misses on Nix Runner

Symptom: Nix builds download or compile everything from scratch despite previous builds having populated the cache.

Causes and fixes:

  • ATTIC_SERVER not set: Verify the environment variable is present in the runner pod. Check that the Kubernetes Secret for Attic credentials exists in the {org}-runners namespace.
  • Attic service unreachable: Confirm the Attic cache service is running in the attic-cache-dev namespace. Test connectivity from a runner pod with curl $ATTIC_SERVER.
  • Cache name mismatch: Verify ATTIC_CACHE matches the cache name used in attic push commands.
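The first two checks above can be combined into a small env probe meant to run inside the runner pod (for example via kubectl exec). The helper name is hypothetical.

```shell
# Minimal env sanity check: flag any missing Attic variable inside the
# runner pod before chasing service or cache-name problems.
check_attic_env() {
  for v in ATTIC_SERVER ATTIC_CACHE; do
    [ -n "$(printenv "$v")" ] || echo "missing: $v"
  done
}
check_attic_env
```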

TOML Configuration Gotchas

The GitLab Runner TOML configuration has several pitfalls in Runner 17.x:

  • Resource limits must be flat keys: Values like cpu_limit, memory_limit, cpu_request, and memory_request must be specified as flat keys in the [runners.kubernetes] section. Do not nest them inside a separate TOML table.
  • pod_spec.containers type mismatch: Using pod_spec with a containers field causes a type mismatch error in Runner 17.x. Instead, use environment = [...] on the [[runners]] section to inject environment variables.
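A sketch of the working shape, with assumed values: resource limits stay flat under [runners.kubernetes], and environment variables go through environment on the [[runners]] section instead of pod_spec.

```toml
# Sketch of a Runner 17.x-compatible layout; all values are assumptions.
[[runners]]
  # environment injection instead of pod_spec.containers
  environment = ["ATTIC_SERVER=https://attic.example.internal"]

  [runners.kubernetes]
    # flat keys, not a nested TOML table
    cpu_limit = "2"
    memory_limit = "8Gi"
    cpu_request = "500m"
    memory_request = "2Gi"
```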

Runner Pods Pending

Symptom: Pods stay in Pending state and are not scheduled.

Causes and fixes:

  • Insufficient cluster resources: Check node capacity with kubectl describe nodes. The cluster may need more nodes or the runner resource requests may be too high.
  • HPA at maximum: If all replicas are running and jobs are still queuing, increase the HPA maximum. See HPA Tuning.
  • Image pull failure behind a Pending pod: A pod can still show Pending while the runner container is actually blocked in ImagePullBackOff. Check kubectl describe pod and the container waiting reason before treating the incident as a scheduler or capacity failure.

Live GloriousFlywheel example:

  • baseline tinyland-nix pods on honey were observed as Pending
  • the real blocker was ImagePullBackOff on ghcr.io/tinyland-inc/actions-runner-nix:latest
  • the concrete pull error was 401 Unauthorized

Meaning:

  • do not assume every Pending runner pod is a scheduler or capacity problem
  • verify image-pull auth before treating the incident as a memory or placement failure
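The Pending triage above can be sketched as a classifier over the container waiting reason. The jsonpath query in the comment is the standard way to pull that reason; the classifier itself and its messages are a hypothetical sketch mirroring the cases in this section.

```shell
# Obtain the waiting reason (empty when no container has started):
#   kubectl get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
classify_pending() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "image-pull problem: check registry auth" ;;
    "") echo "no container started: check scheduler/capacity" ;;
    *) echo "other waiting reason: $1" ;;
  esac
}
classify_pending ImagePullBackOff
```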

Heavy Nix Lane Fails With npm TLS Errors

Symptom: A tinyland-nix-heavy job starts on sting, but the workload fails inside nix build with pnpm errors like UNABLE_TO_GET_ISSUER_CERT_LOCALLY against registry.npmjs.org.

Read: This means ARC placement is already working. The remaining problem is certificate trust inside the build path, not runner scheduling or GHCR pull auth.

Fix:

  1. Make sure the derivation or build wrapper exports a CA bundle into the Node and Nix fetch paths:
    • SSL_CERT_FILE
    • NIX_SSL_CERT_FILE
    • NODE_EXTRA_CA_CERTS
  2. Re-run the heavy canary after the derivation change lands.
  3. If the error persists, inspect the runner image and runtime trust store rather than changing runner placement.
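Step 1 can be sketched as a build-wrapper snippet. The bundle path below is an assumption and must match the trust store actually present in the runner image.

```shell
# Export one CA bundle into every fetch path the build touches.
# /etc/ssl/certs/ca-certificates.crt is an assumed location.
CA_BUNDLE="${CA_BUNDLE:-/etc/ssl/certs/ca-certificates.crt}"
export SSL_CERT_FILE="$CA_BUNDLE"        # OpenSSL-based tools
export NIX_SSL_CERT_FILE="$CA_BUNDLE"    # Nix fetchers
export NODE_EXTRA_CA_CERTS="$CA_BUNDLE"  # Node/pnpm TLS trust
```

Keeping all three variables pointed at the same bundle avoids the split where Nix trusts the certificate but the Node process inside the derivation does not.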
