ARC Hygiene Audit (2026-05-09)

This audit covers item J1 of the May 10-16 sprint plan (2026-05-10-cache-forward-toward-rbe.md). It was a read-only audit of the live cluster; no mutations were performed.

Headline finding

The tinyland-nix ARS-listener mismatch flagged at sprint kickoff is resolved. All 5 tinyland-nix* variants (tinyland-nix, tinyland-nix-compute-expansion, tinyland-nix-gpu, tinyland-nix-heavy, tinyland-nix-kvm) now have correctly-named matching listeners running on sting, and none reports the Pending STATE that was visible at sprint kickoff (~2026-05-09T19:15Z).

The fix most likely came from Codex’s #588 ARC horizon audit work or from normal ARC reconcile cycles. Either way, the J1 acceptance criterion “no orphaned ephemeral runners running against a Pending parent” is met.

Verification commands run

kubectl --context honey get autoscalingrunnerset -A
kubectl --context honey get autoscalinglisteners -A
just arc-runtime-audit
just arc-network-continuity-audit

All four are read-only.
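
A minimal sketch of how the first command's table output could be screened for a non-empty STATE cell. It assumes kubectl's default space-padded, fixed-width table layout. The helper name pending_ars is hypothetical and is not one of the repo's just recipes; only the STATE column header is taken from this audit.

```shell
# Hypothetical helper: print namespace/name and STATE for every row whose
# STATE cell is non-empty. Reads `kubectl get autoscalingrunnerset -A`
# table output on stdin. Assumes space-padded, fixed-width columns.
pending_ars() {
  awk '
    NR == 1 {
      start = index($0, "STATE")          # 1-based offset of the STATE column
      if (start == 0) exit 2              # header not found, bail out
      rest = substr($0, start)
      # The column runs up to the next header; a last column is open-ended.
      if (match(rest, /^STATE +[^ ]/)) width = RLENGTH - 1
      else width = 1000
      next
    }
    {
      state = substr($0, start, width)
      gsub(/ /, "", state)
      if (state != "") print $1 "/" $2, state
    }
  '
}

# Usage (read-only):
#   kubectl --context honey get autoscalingrunnerset -A | pending_ars
```

Empty output means no ARS reported any STATE value, which is the condition this audit observed.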

State summary

AutoscalingRunnerSets (arc-runners)

All ARS report an empty STATE column (no Pending). Sample:

ARS                              Max  Current  Listener                                          Listener node
tinyland-nix                     16   0        tinyland-nix-ddd868ff-listener                    sting
tinyland-nix-compute-expansion   2    0        tinyland-nix-compute-expansion-ddd868ff-listener  sting
tinyland-nix-gpu                 1    0        tinyland-nix-gpu-ddd868ff-listener                sting
tinyland-nix-heavy               1    0        tinyland-nix-heavy-ddd868ff-listener              sting
tinyland-nix-kvm                 1    0        tinyland-nix-kvm-ddd868ff-listener                sting
tinyland-dind                    12   7        listener Running                                  sting
tinyland-dind-compute-expansion  1    1        listener Running                                  sting
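
The naming match in the table above can be checked mechanically. The sketch below assumes listener names follow the `<ars>-<hash>-listener` pattern seen in the sample rows, with a lowercase hex hash; that pattern is inferred from this audit's output, not from ARC documentation. The helper name missing_listeners is hypothetical.

```shell
# Hypothetical helper: print each ARS name that has no listener named
# <ars>-<hex hash>-listener. Both arguments are newline-separated name lists.
missing_listeners() {
  ars_names=$1
  listener_names=$2
  printf '%s\n' "$ars_names" | while IFS= read -r ars; do
    [ -n "$ars" ] || continue
    # The hash segment must be hex only, so tinyland-nix does not
    # accidentally match tinyland-nix-gpu-<hash>-listener.
    if ! printf '%s\n' "$listener_names" |
        grep -Eq "^${ars}-[0-9a-f]+-listener\$"; then
      printf '%s\n' "$ars"
    fi
  done
}

# Usage (read-only):
#   ars=$(kubectl --context honey get autoscalingrunnerset -A \
#           --no-headers -o custom-columns=NAME:.metadata.name)
#   listeners=$(kubectl --context honey get autoscalinglisteners -A \
#           --no-headers -o custom-columns=NAME:.metadata.name)
#   missing_listeners "$ars" "$listeners"
```

Anchoring the pattern on both ends matters here because the tinyland-nix name is a prefix of every other variant.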

Active runner pods at audit time

tinyland-dind was busy with 7 runner pods (3 Running, 4 Pending) on honey, which is typical of a Bazel image-publication burst. No stale or orphaned runners were observed.

Session continuity

just arc-runtime-audit reported “no broker/session continuity drift found in scanned active runner logs” across 8 active pods. No TryAgain/acquirejob/job-assignment errors in scanned logs.
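
For a manual spot check of a single runner log, a rough sketch follows. The two patterns are the strings named above; the actual patterns used by just arc-runtime-audit are not shown in this audit, so treat this as an approximation. The helper name count_continuity_errors is hypothetical.

```shell
# Hypothetical helper: count log lines on stdin that mention TryAgain or
# acquirejob (case-insensitive). Always exits 0, even when the count is 0.
count_continuity_errors() {
  grep -Eic 'TryAgain|acquirejob' || true
}

# Usage (read-only; the pod name and the arc-runners namespace are
# placeholders to adjust):
#   kubectl --context honey logs -n arc-runners <runner-pod> | count_continuity_errors
```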

Node pressure snapshot

Node    CPU  Memory
bumble  35%  36%
honey   27%  20%
sting   7%   23%

All nodes are Ready=True. bumble is running noticeably hotter than honey and sting, which is consistent with its storage-biased node profile per existing ops notes.
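
This audit does not record how the snapshot was taken. Assuming it came from something like kubectl top nodes (columns NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%), a sketch for flagging nodes above a utilization limit could look like this. The helper name hot_nodes and the default limit of 30 are hypothetical.

```shell
# Hypothetical helper: print nodes whose CPU% or MEMORY% is at or above a
# limit (default 30). Reads `kubectl top nodes --no-headers` rows on stdin.
hot_nodes() {
  awk -v limit="${1:-30}" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent signs
      if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0) {
        printf "%s cpu=%s%% mem=%s%%\n", $1, cpu, mem
      }
    }
  '
}

# Usage (read-only):
#   kubectl --context honey top nodes --no-headers | hot_nodes 30
```

With the percentages in the table above and a limit of 30, only bumble would be listed.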

Findings (network-continuity audit)

just arc-network-continuity-audit reported 3 kubelet-eviction-pressure events. All three are the known TIN-613 bumble rootfs/imagefs headroom class, not new debt:

  1. tinyland-dind-compute-expansion-5xr6q-runner-q2wp7: FailedScheduling on sting due to insufficient ephemeral-storage. The other two nodes did not match the pod’s affinity/selector.
  2. Same pod, same cause, second event ~400 ms later (normal scheduler retry).
  3. tinyland-nix-heavy-d5wvp-runner-8mdt4: FailedScheduling on sting due to insufficient ephemeral-storage and insufficient memory. The heavy-lane resource envelope is hitting the boundary on a busy node.

Disposition: these match the existing scheduling-avoidance policy for bumble (TIN-613 was closed on that basis). No new mutation is needed. If the events persist or escalate, a host maintenance window for bumble is the documented next step, not an ARC config change.
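
To watch whether these events persist, FailedScheduling messages can be tallied by the resource they report as insufficient. The sketch below assumes the scheduler's usual message wording ("Insufficient <resource>"); the helper name tally_insufficient is hypothetical, and the audit itself relied on just arc-network-continuity-audit rather than this command.

```shell
# Hypothetical helper: count "Insufficient <resource>" mentions in
# FailedScheduling messages read from stdin, one "<resource>=<count>" per line.
tally_insufficient() {
  grep -Eo 'Insufficient [a-z0-9./-]+' |
    sort | uniq -c |
    awk '{ print $3 "=" $1 }'
}

# Usage (read-only):
#   kubectl --context honey get events -A \
#     --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | tally_insufficient
```

A growing ephemeral-storage count across audits would be the signal to schedule the bumble maintenance window.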

Acceptance vs J1

Plan acceptance: “every live runner anomaly is either fixed during an approved window or captured as a specific Linear/GitHub blocker.”

  • ARS Pending: fixed (no live mutation by Claude; resolved during the session through Codex/normal reconcile)
  • FailedScheduling events: captured. They are TIN-613-class debt, not fresh anomalies, and the existing scheduling-avoidance policy and future host maintenance plan both stand

J1 is done. No window-gated mutations were required.

What this audit did NOT do

  • No kubectl delete / kubectl edit / kubectl apply invocations.
  • No tofu plan / tofu apply.
  • No restart of any pod, listener, or service.
  • No state file mutation.

If a follow-on action does need a window (e.g. bumble maintenance), that’s a separate Linear-tracked decision, not part of this audit.

  • TIN-1070 — sprint control list (J1)
  • TIN-613 — bumble rootfs/imagefs headroom (closed on scheduling-avoidance basis; live remediation deferred)
  • just arc-runtime-audit / just arc-network-continuity-audit — the read-only diagnostic surfaces used here
