ARC Hygiene Audit — 2026-05-09
J1 of the May 10-16 sprint plan (2026-05-10-cache-forward-toward-rbe.md). Read-only live cluster audit, no mutations performed.
Headline finding
The tinyland-nix ARS-listener mismatch flagged at sprint kickoff is
resolved. All 5 tinyland-nix* variants (tinyland-nix,
tinyland-nix-compute-expansion, tinyland-nix-gpu,
tinyland-nix-heavy, tinyland-nix-kvm) now have correctly-named
matching listeners running on sting, and none reports the Pending
STATE that was visible at sprint kickoff (~2026-05-09T19:15Z).
Likely fixed by Codex’s #588 ARC horizon audit work, or self-resolved through normal ARC reconcile cycles. The “no orphaned ephemeral runners running against a Pending parent” acceptance from J1 is met.
Verification commands run
kubectl --context honey get autoscalingrunnerset -A
kubectl --context honey get autoscalinglisteners -A
just arc-runtime-audit
just arc-network-continuity-audit
All four are read-only.
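The Pending check against the first command's output could be scripted roughly as follows. This is a minimal sketch, not the tooling used in this audit: `pending_ars` is a hypothetical helper, and the `.status.state` field path in the commented usage is an assumption about the ARC CRD that should be confirmed before use.

```shell
#!/bin/sh
# pending_ars (hypothetical helper): print "namespace/name" for every ARS
# whose state column reads exactly "Pending".
# Input on stdin: three whitespace-separated columns (namespace, name, state).
pending_ars() {
  awk '$3 == "Pending" { print $1 "/" $2 }'
}

# Possible read-only live usage. The .status.state path is an ASSUMPTION;
# kubectl prints <none> for an empty field, which the filter ignores.
# kubectl --context honey get autoscalingrunnerset -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,STATE:.status.state \
#   | pending_ars
```

Empty output means no ARS is in the Pending state, which is the condition this audit reports.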
State summary
AutoscalingRunnerSets (arc-runners)
All ARS report empty STATE column (no Pending). Sample:
| ARS | Max | Current | Listener | Listener node |
|---|---|---|---|---|
| tinyland-nix | 16 | 0 | tinyland-nix-ddd868ff-listener | sting |
| tinyland-nix-compute-expansion | 2 | 0 | tinyland-nix-compute-expansion-ddd868ff-listener | sting |
| tinyland-nix-gpu | 1 | 0 | tinyland-nix-gpu-ddd868ff-listener | sting |
| tinyland-nix-heavy | 1 | 0 | tinyland-nix-heavy-ddd868ff-listener | sting |
| tinyland-nix-kvm | 1 | 0 | tinyland-nix-kvm-ddd868ff-listener | sting |
| tinyland-dind | 12 | 7 | listener Running | sting |
| tinyland-dind-compute-expansion | 1 | 1 | listener Running | sting |
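The "correctly-named matching listener" condition can be expressed as a name check. The sketch below assumes the listener name is the ARS name, a lowercase hex suffix, then `-listener`, which is inferred only from the sample rows above; `listener_matches` is a hypothetical helper.

```shell
#!/bin/sh
# listener_matches ARS LISTENER (hypothetical helper)
# Succeeds only if LISTENER is "<ARS>-<hex>-listener".
# The naming pattern is an ASSUMPTION inferred from the sampled rows.
listener_matches() {
  _ars="$1"
  _listener="$2"
  _rest="${_listener#"$_ars"-}"          # strip the "<ARS>-" prefix
  [ "$_rest" != "$_listener" ] || return 1
  _hash="${_rest%-listener}"             # strip the "-listener" suffix
  [ "$_hash" != "$_rest" ] || return 1
  case "$_hash" in
    ""|*[!0-9a-f]*) return 1 ;;          # empty or non-hex remainder
    *) return 0 ;;
  esac
}
```

Requiring the remainder to be pure hex keeps `tinyland-nix` from being paired with the listener of a longer variant such as `tinyland-nix-gpu`, which is the kind of mismatch the kickoff flag described.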
Active runner pods at audit time
tinyland-dind was busy with 7 runner pods (3 Running, 4 Pending) on
honey — typical Bazel image-publication burst. No stale orphaned
runners observed.
Session continuity
just arc-runtime-audit reported “no broker/session continuity drift
found in scanned active runner logs” across 8 active pods. No
TryAgain/acquirejob/job-assignment errors in scanned logs.
Node pressure snapshot
| Node | CPU | Memory |
|---|---|---|
| bumble | 35% | 36% |
| honey | 27% | 20% |
| sting | 7% | 23% |
All nodes report Ready=True. bumble is running noticeably hotter than honey and sting,
which is consistent with its storage-biased node profile per existing ops notes.
Findings (network-continuity audit)
just arc-network-continuity-audit reported 3 kubelet-eviction-pressure
events. All three are the known TIN-613 bumble rootfs/imagefs
headroom class, not new debt:
- tinyland-dind-compute-expansion-5xr6q-runner-q2wp7: FailedScheduling on sting due to insufficient ephemeral-storage. Pod did not match the other two nodes’ affinity/selector.
- Same pod, same cause, second event ~400ms later (normal scheduler retry).
- tinyland-nix-heavy-d5wvp-runner-8mdt4: FailedScheduling on sting due to insufficient ephemeral-storage and insufficient memory. Heavy-lane resource envelope hitting the boundary on a busy node.
Disposition: these match the existing scheduling-avoidance policy
for bumble (TIN-613 closed on that basis). No new mutation needed.
If they persist or escalate, host maintenance window for bumble is
the documented next step, not an ARC config change.
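Because one of the three events is a scheduler retry, it can help to count distinct pods rather than raw events when judging whether the situation is escalating. A minimal sketch, assuming a hypothetical one-event-per-line input with the pod name in the first field (this is not the audit recipe's actual output format):

```shell
#!/bin/sh
# distinct_pods (hypothetical helper): count distinct pod names on stdin.
# Input: one event per line, pod name as the first whitespace-separated field.
distinct_pods() {
  awk '!seen[$1]++ { n++ } END { print n + 0 }'
}

# The three events from this audit collapse to two affected pods.
printf '%s\n' \
  'tinyland-dind-compute-expansion-5xr6q-runner-q2wp7 FailedScheduling' \
  'tinyland-dind-compute-expansion-5xr6q-runner-q2wp7 FailedScheduling' \
  'tinyland-nix-heavy-d5wvp-runner-8mdt4 FailedScheduling' | distinct_pods
```

A rising distinct-pod count across audits would be a clearer "persist or escalate" signal than a rising event count, since retries inflate the latter.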
Acceptance vs J1
Plan acceptance: “every live runner anomaly is either fixed during an approved window or captured as a specific Linear/GitHub blocker.”
- ARS Pending: fixed (no live mutation by Claude; resolved during the session through Codex/normal reconcile)
- FailedScheduling events: captured — they are TIN-613-class debt, not fresh anomalies, and the existing scheduling-avoidance + future host maintenance plan stands
J1 is done. No window-gated mutations were required.
What this audit did NOT do
- No kubectl delete / kubectl edit / kubectl apply invocations.
- No tofu plan / tofu apply.
- No restart of any pod, listener, or service.
- No state file mutation.
If a follow-on action does need a window (e.g. bumble maintenance), that’s a separate Linear-tracked decision, not part of this audit.
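One way to enforce the read-only boundary in future audits is an allow-list on kubectl verbs. The helper below is a hypothetical sketch, not something this audit used; the allow-list is illustrative and deliberately conservative, so any verb not listed (including delete, edit, and apply) is rejected.

```shell
#!/bin/sh
# kubectl_verb_is_read_only VERB (hypothetical helper)
# Succeeds only for verbs on an explicit read-only allow-list.
# The list is an illustrative ASSUMPTION; extend it deliberately.
kubectl_verb_is_read_only() {
  case "$1" in
    get|describe|logs|top|explain|version|api-resources) return 0 ;;
    *) return 1 ;;
  esac
}
```

An allow-list is used rather than a deny-list so that an unrecognized or newly added mutating verb fails closed.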
Related
- TIN-1070 — sprint control list (J1)
- TIN-613 — bumble rootfs/imagefs headroom (closed on scheduling-avoidance basis; live remediation deferred)
- just arc-runtime-audit / just arc-network-continuity-audit: the read-only diagnostic surfaces used here