Resource Limits Reference

Default resource limits for CI job pods and their containers, by runner type.

Important boundary:

  • these are Kubernetes container requests and limits that count toward pod and namespace admission
  • they are not the same thing as total RAM or CPU available in the honey cluster
  • a job can still be OOM-killed inside an 8Gi runner pod even when the cluster as a whole has abundant free memory
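
The last point can be illustrated with a minimal sketch. It is not the kubelet's implementation: `parse_mem` and `is_oom_risk` are hypothetical helper names, and the usage figures are invented for the example. The point is only that the decision depends on the container's own limit, not on cluster-wide free memory.

```python
# Hypothetical helpers for illustration; not part of any repo tooling.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_mem(quantity: str) -> int:
    """Convert a Kubernetes binary memory quantity such as '8Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def is_oom_risk(container_usage: str, container_limit: str) -> bool:
    """True when a container's usage has reached its own memory limit.

    Node or cluster free memory is deliberately not an input here.
    """
    return parse_mem(container_usage) >= parse_mem(container_limit)


# Invented figures: a job that grows to 8Gi inside an 8Gi runner container.
print(is_oom_risk("8Gi", "8Gi"))     # True, regardless of cluster headroom
print(is_oom_risk("3500Mi", "8Gi"))  # False
```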

Default Job Pod Limits

These are the module defaults. Overlay deployments override these values in their *.tfvars files.

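| Runner | CPU Request | CPU Limit | Memory Request | Memory Limit | Ephemeral Request | Ephemeral Limit |
|--------|-------------|-----------|----------------|--------------|-------------------|-----------------|
| docker | 100m | 2 | 256Mi | 4Gi | none | none |
| dind | 500m | 4 | 1Gi | 8Gi | none | none |
| rocky8 | 100m | 2 | 256Mi | 2Gi | none | none |
| rocky9 | 100m | 2 | 256Mi | 2Gi | none | none |
| nix | 500m | 4 | 1Gi | 8Gi | none | none |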

For the current Honey ARC stack, committed *.tfvars values intentionally override the module defaults so the primary shared lanes have explicit burst caps and disk envelopes:

| Environment | Runner | Max runners | Node | CPU Request | CPU Limit | Memory Request | Memory Limit | Runner Ephemeral Request | Runner Ephemeral Limit | DinD Daemon Ephemeral Request | DinD Daemon Ephemeral Limit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| honey | tinyland-docker | 20 | sting | 100m | 2 | 256Mi | 4Gi | 1Gi | 8Gi | N/A | N/A |
| honey | tinyland-nix | 16 | honey | 500m | 4 | 1Gi | 8Gi | 24Gi | 40Gi | N/A | N/A |
| honey | tinyland-nix-compute-expansion | 8 | sting | 500m | 4 | 4Gi | 16Gi | 1Gi | 16Gi | N/A | N/A |
| honey | tinyland-dind | 20 | honey | 500m | 4 | 1Gi | 8Gi | 4Gi | 8Gi | 24Gi | 40Gi |
| honey | tinyland-dind-compute-expansion | 16 | sting | 500m | 4 | 1Gi | 8Gi | 512Mi | 2Gi | 1Gi | 4Gi |
| honey | tinyland-nix-operator | 1 | honey | 1 | 4 | 4Gi | 8Gi | 8Gi | 16Gi | N/A | N/A |
| honey | tinyland-nix-heavy | 2 | honey | 4 | 8 | 64Gi | 160Gi | 192Gi | 256Gi | N/A | N/A |
| honey | tinyland-nix-kvm | 1 | honey | 2 | 8 | 8Gi | 16Gi | 64Gi | 128Gi | N/A | N/A |
| honey | tinyland-nix-gpu | 1 | honey | 1 | 4 | 4Gi | 8Gi | 12Gi | 16Gi | N/A | N/A |
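
As a worked example of how these rows add up, the sketch below multiplies each lane's Max runners by its Memory Request and groups the result by node. It uses only the runner-container figures in the table above; the table does not list DinD daemon sidecar memory, so the totals are a lower bound on what a full burst would request. The function name is hypothetical.

```python
# (lane, max_runners, node, memory_request_in_Mi), copied from the table above.
LANES = [
    ("tinyland-docker", 20, "sting", 256),
    ("tinyland-nix", 16, "honey", 1024),
    ("tinyland-nix-compute-expansion", 8, "sting", 4096),
    ("tinyland-dind", 20, "honey", 1024),
    ("tinyland-dind-compute-expansion", 16, "sting", 1024),
    ("tinyland-nix-operator", 1, "honey", 4096),
    ("tinyland-nix-heavy", 2, "honey", 65536),
    ("tinyland-nix-kvm", 1, "honey", 8192),
    ("tinyland-nix-gpu", 1, "honey", 4096),
]


def memory_requests_by_node(lanes):
    """Sum max_runners * memory request per node and return totals in Gi."""
    totals_mi = {}
    for _name, max_runners, node, request_mi in lanes:
        totals_mi[node] = totals_mi.get(node, 0) + max_runners * request_mi
    return {node: mi / 1024 for node, mi in totals_mi.items()}


totals = memory_requests_by_node(LANES)
print(totals)                # {'sting': 53.0, 'honey': 180.0}
print(sum(totals.values()))  # 233.0
```

If every lane ran at its individual maximum, the runner containers alone would request about 180Gi on honey and 53Gi on sting, with tinyland-nix-heavy accounting for 128Gi of the honey figure.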

The Docker lane is still the lowest-overhead path for stateless CI, but it now has a small explicit ephemeral-storage request so a large burst is visible to Kubernetes scheduling. Container image builds belong on tinyland-dind, not on tinyland-docker. The tinyland-dind lane splits disk admission between the runner workspace container and the Docker daemon sidecar so neither container falls back to the namespace 1Gi default.

The tinyland-nix-compute-expansion lane is a sting overflow path for the same shared tinyland-nix capability label. It gives org-scoped Nix jobs an 8-slot bounded escape hatch when honey is saturated, but it does not make personal repos such as Dell-7810, XoxdWM, scheduling-kit, or scheduling-bridge able to see tinyland-inc org runners. That remains an owner-boundary problem, not a raw capacity problem. The TIN-1400 implementation uses generic ephemeral local-path-sting-fast-ephemeral PVCs for /nix (64Gi), /home/runner/_work (32Gi), and /home/runner/.cache (32Gi), and copies the baked image /nix into the per-pod PVC before runner startup. That preserves the installed Nix toolchain while moving store/workdir/cache churn off the kubelet root ephemeral-storage envelope. The container still has a small explicit root ephemeral-storage request/limit for logs, runner state, and non-mounted paths.

The tinyland-dind-compute-expansion lane is the sting overflow path for the same shared tinyland-dind capability label. Sting has large local SSD/NVMe scratch exposed through the local-path-sting-fast-ephemeral StorageClass, but its kubelet/root ephemeral-storage accounting is still the smaller root filesystem envelope. The compute-expansion lane therefore uses generic ephemeral PVCs on local-path-sting-fast-ephemeral for /home/runner/_work (48Gi) and the DinD Docker graph at /var/lib/docker (96Gi), while keeping container ephemeral-storage requests small enough to avoid pretending that root ephemeral storage is the same capacity pool as the fast-local PVCs.

The 2026-05-10 Rockies fanout showed the prior combined tinyland-dind capacity of 16 slots saturating while the cluster still had CPU, memory, honey ephemeral-storage, and sting fast-local headroom. The 2026-05-12 follow-up showed the same artificial cap pattern after sting returned to service: honey could hit its pod ceiling while sting still had pod, CPU, memory, and fast-local PVC headroom. The current bounded step is 20 honey DinD slots plus 16 sting fast-local overflow slots. Further increases should be based on live queue evidence and storage telemetry, not on repo-specific labels.

The 2026-05-11 runner availability check added one more constraint: a node can have spare CPU, memory, and storage while the active lane is blocked by the node pod-count ceiling or by selectors/tolerations that keep the pod off the node with spare resources. Treat pod capacity and placement as first-class capacity dimensions alongside CPU, memory, and disk.
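
A minimal sketch of that idea follows, treating pod slots and placement as checks alongside CPU, memory, and disk. The node snapshot, label keys, and function names are invented for illustration, and taints and tolerations are not modeled.

```python
def blocking_dimensions(free: dict, request: dict) -> list:
    """Return the capacity dimensions that stop one more pod from fitting."""
    return [dim for dim, needed in request.items() if free.get(dim, 0) < needed]


def placement_ok(node_labels: dict, node_selector: dict) -> bool:
    """True when every selector key/value is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


# Invented snapshot: spare CPU, memory, and disk, but no pod slots left.
node_free = {"cpu_m": 12000, "memory_mi": 65536, "ephemeral_mi": 204800, "pods": 0}
pod_request = {"cpu_m": 500, "memory_mi": 1024, "ephemeral_mi": 4096, "pods": 1}

print(blocking_dimensions(node_free, pod_request))  # ['pods']

# Invented labels: the pod is pinned to a different node than the one with headroom.
print(placement_ok({"kubernetes.io/hostname": "sting"},
                   {"kubernetes.io/hostname": "honey"}))  # False
```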

The 2026-05-14 Rockies fanout made the source-owned overflow cap visible: Honey was at its 110/110 pod ceiling, baseline tinyland-dind pods were Unschedulable with Too many pods, tinyland-dind-compute-expansion was at its maxRunners = 9, and Sting still had free pod slots plus low actual fast-local scratch usage. just arc-burst-capacity-audit now reports this as overflow saturation so the next change can be a reviewed cap and storage envelope decision instead of an emergency live patch. The PVC sizes are recovery/scratch intent for the lane; local-path does not make those nominal sizes a durable HA or global storage-quota authority.

The 2026-05-14 follow-up raised that source-owned overflow cap to maxRunners = 12 while tightening the fast-local PVC envelope to 48Gi work plus 96Gi Docker graph per runner. The 2026-05-15 recurrence then saturated that overflow cap again: tinyland-dind was current=20 pending=6, the overflow lane was current=12 pending=5, honey was at 110/110 pods, and sting still had 31 free pod slots plus low actual fast-local usage. The current source-owned follow-up raises only the additive sting overflow cap to maxRunners = 16; it does not widen the active honey baseline scale set. Treat the PVC sizes as recoverable scratch intent and validate actual fast-local headroom with the burst audit before any further cap change.

The 2026-05-14 PR #655 post-merge window exposed the next boundary: capacity and fairness are different things. The tinyland-dind and tinyland-dind-compute-expansion lanes can be healthy and still leave a GloriousFlywheel control-plane job queued behind a large downstream fanout on the same shared workflow label. GitHub ARC does not provide repository priority for one shared label. Treat that as shared-label fairness/admission policy, not as proof that the cluster is out of CPU, memory, or fast-local scratch. Use just arc-burst-capacity-audit --include-label tinyland-dind and read its Shared Queue Fairness section alongside pod-slot, quota, and PVC evidence before changing runner envelopes.

The 2026-05-19 same-label overflow work added a separate JIT assignment trap classifier to the burst audit. A pending runner with no visible GitHub job and a not-ready pod is cleanup evidence only after the GitHub runner is verified offline/not busy. A pending runner with jobRepositoryName, workflowRunId, or jobDisplayName is an assigned job at risk, not disposable residue. Do not use EphemeralRunner deletion as a queue-drain shortcut for assigned jobs; fix the placement/capacity cause or cancel the GitHub job explicitly.
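
The decision rule in the paragraph above can be sketched as follows. This is not the burst audit's actual code: the class, field names, and result labels are hypothetical, and "offline/not busy" is read conservatively as requiring both conditions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingRunner:
    # Snake-case stand-ins for jobRepositoryName, workflowRunId, jobDisplayName.
    job_repository_name: Optional[str] = None
    workflow_run_id: Optional[int] = None
    job_display_name: Optional[str] = None
    pod_ready: bool = False
    # State verified against GitHub, not inferred from the pod alone.
    github_runner_offline: bool = False
    github_runner_busy: bool = False


def classify(runner: PendingRunner) -> str:
    """Classify a pending runner; result labels are illustrative only."""
    if runner.job_repository_name or runner.workflow_run_id or runner.job_display_name:
        return "assigned-job-at-risk"  # never delete as a queue-drain shortcut
    if (not runner.pod_ready
            and runner.github_runner_offline
            and not runner.github_runner_busy):
        return "cleanup-evidence"
    return "verify-github-runner-state"


print(classify(PendingRunner(workflow_run_id=123)))         # assigned-job-at-risk
print(classify(PendingRunner(github_runner_offline=True)))  # cleanup-evidence
print(classify(PendingRunner()))                            # verify-github-runner-state
```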

The 2026-05-12 follow-up adds repo-owned cleanup for runner namespace residue: the ARC runner stack enables the runner-cleanup CronJob in arc-runners. This does not increase hardware capacity, but it keeps Succeeded and Failed utility pods from occupying honey pod slots after their evidence value has expired.

Do not widen an active baseline AutoscalingRunnerSet under load as the first queue-drain response. ARC can treat maxRunners changes as listener/config reconciliation work, and changing the baseline scale set can interrupt the listener path while existing runners drain. Prefer additive overflow lanes, or land the higher envelope as a reviewed source change followed by a planned deploy and runtime audit. The tinyland-dind=20 plus tinyland-dind-compute-expansion=16 envelope is canonical because it is encoded in the runner tfvars, docs, and validation, not because a live patch happened.

OpenEBS ZFS is currently the bumble-backed durable PVC plane, not the hot runner scratch plane. Do not route DinD Docker graph churn through bumble OpenEBS unless a separate storage design explicitly accepts the latency, failure-domain, and cleanup semantics. For transient container build scratch, prefer sting fast-local generic ephemeral PVCs.

The repo-owned Platform Proof job also keeps the tinyland-docker dependency install bounded with pnpm install --child-concurrency=2 --network-concurrency=4. That is intentional: the proof should verify the shared Docker lane, cache attachment, and ordinary app/MCP checks without turning a low-overhead runner into a host file-table stress test. Repeated ENFILE failures in downstream workflows should be treated as capacity or workflow-fanout evidence, not as a reason to silently widen the shared Docker lane.

The current additive shared capability class for heavier Nix work is tinyland-nix-heavy, defined in tofu/stacks/arc-runners/honey.tfvars with a larger pod envelope than baseline tinyland-nix.

TIN-1249 adds the matching REAPI proof-cell capacity boundary: gf-reapi-cell uses a 4 CPU / 8Gi request and 16 CPU / 16Gi limit when it is active, but the committed manifest is idle at replicas: 0. The local proof script still defaults GF_REAPI_CELL_SCALE_TO_ZERO_AFTER_PROOF=true for bounded operator windows. The GitHub proof workflow now defaults to leaving the cell resident after a successful apply because the repo is dogfooding hourly TTFCH and back-to-back RBE proofs against the live endpoint. Operators can opt into teardown with scale_to_zero_after_proof=true when scarce-lane pressure is more important than live endpoint continuity.

The 2026-05-17 PR #694 window exposed the corresponding scarce-lane queue case: tinyland-nix-heavy can be healthy and still block a control-plane or consumer proof when its single declared slot is occupied by another repository, and a follow-on pod can also report scheduler resource pressure such as Insufficient ephemeral-storage. just arc-burst-capacity-audit now reports this under Shared Label Queue Pressure, including the active holder repositories and not-ready runner pod messages. Treat that as queue/admission and scheduler evidence; this diagnostic does not mutate capacity and does not turn tinyland-nix-heavy into a repo-specific lane.

The 2026-05-19 PR #725/post-merge window showed the same contention twice in a row: once before merge and again on default-branch Platform Proof. The managed capacity response moves the existing shared tinyland-nix-heavy scale set to honey and raises it to a large shared heavy envelope. That keeps the workflow-facing capability label stable, avoids a GloriousFlywheel-only runner label, and avoids pretending that sting can safely host large heavy-Nix pods before the fast-local Nix scratch/store model exists there.

The 2026-05-19 PR #757 post-merge image-publish run then proved that 32Gi was still too small for the gf-reapi-cell OCI publication path when Nix had to compile the patched skopeo used by nix2container.copyTo. PR #758 kept the same shared capability label and raised the honey-backed tinyland-nix-heavy pod request/limit to 64Gi rather than minting a repo-specific publication lane.

The 2026-05-21 W3.4 vendor-mode canaries then proved that 64Gi and 128Gi scratch were still not full-graph Bazel vendor-mode envelopes. Run 26245482714 reached roughly 53Gi in the vendor temp tree before kubelet eviction at 64Gi. Run 26246609243 passed that boundary, reached roughly 79Gi while still healthy, and was later evicted at the 128Gi scratch limit. The lane then reserved 96Gi memory, allowed 160Gi memory, and reserved 192Gi scratch with a 256Gi scratch limit. Managed apply 26247461740 reconciled that envelope live, and canary 26247715938 proved the intended transition: the W3.4 red signal moved from runner eviction to a Bazel external-input defect in the rules_pkg@1.1.0 BCR module. Follow-up canary 26350919668 moved past that leak and exposed the next authority gap, an ambient local-Python lookup from pybind11_bazel, which is now made explicit through the CI devshell and PYTHON_BIN_PATH repository environment.

The 2026-05-28 current-main W3.4 canary 26549932671 passed on tinyland-nix-heavy with downloadable evidence: bazel vendor completed, //:deployment_bundle built from the vendor directory, the temporary vendor tree reached roughly 170Gi, and pod memory stayed around 14Gi with no OOM or eviction. That makes the lane scratch-heavy rather than memory-heavy for this target class. The committed envelope therefore keeps the 160Gi memory limit for bursts, lowers the memory request to 64Gi, and keeps the 192Gi/256Gi scratch request/limit that the canary actually needed.

The 2026-05-28 follow-up canary 26587033690 on main repeated the proof after the E3 status gate landed: full-scope vendor-mode passed, //:deployment_bundle built from the vendored graph, the workdir again reached roughly 170Gi, and the job completed without OOM or eviction. The vendor-mode workflow now uses a 192Gi scratch preflight by default so operators do not dispatch a known 170Gi-class proof onto a runner with only the older 40Gi floor available.

The 2026-05-31 canary 26710742767 confirmed that the next red signal was not runner capacity: it ran on tinyland-nix-heavy and failed while fetching a hermetic_launcher prebuilt stub from GitHub during bazel vendor. Treat that as W3 external-input authority debt. The runner envelope remains the right class for this proof; the fix is to stage verified distdir inputs, not to mint a repo-specific runner label or fall back to hosted runners.

Follow-up workflow-dispatch canary 26717187299 proved the distdir staging fix on the same dogfooded heavy lane after PR #855: bazel vendor completed, //:deployment_bundle built from the vendor directory, and the evidence classifier was ok. Post-merge main canary 26718312931 then repeated the proof on main at 3d68e10, with the vendor step, evidence summary, and artifact upload all green. That confirms the 192Gi/256Gi scratch envelope remains the right shared capability lane for W3.4; it does not promote durable external-input authority or broad/default RBE.

The 2026-05-27 PR #813/#814 dogfood window exposed the next heavy-lane truth: tinyland-nix-heavy is live at maxRunners = 2, but before this correction each runner requested 96Gi memory on honey. The second slot was only schedulable when Honey had another 96Gi request window available, which made maxRunners = 2 an overclaim under normal dogfood pressure. TIN-1649 now treats the proved 64Gi request as the truthful honey-backed fix; if future heavy pods still report Insufficient memory, handle it as capability-lane placement or quota evidence. Do not route around it with a repo-specific runner label.

The 2026-05-18 recurrence made the Sting scratch boundary explicit for Nix lanes. DinD compute-expansion already used local-path-sting-fast-ephemeral PVCs for recoverable workspace and Docker graph churn; the TIN-1400 follow-up adds the analogous Nix root/workdir PVC model for tinyland-nix-compute-expansion. Do not generalize that proof to heavy Nix, KVM, GPU, or compatibility lanes until their own runner image and storage behavior are proved.

The 2026-05-24 first-party dogfood surge then proved the next layer: the Nix compute-expansion fast-local PVC model is necessary but not sufficient if sting still advertises only a small kubelet root/nodefs ephemeral-storage surface. A live tinyland-nix-compute-expansion pod reported scheduler Insufficient ephemeral-storage while CPU, memory, namespace quota, and physical fast-local storage were not the limiting dimensions. Treat that as kubelet/local-path storage integration debt. Do not treat it as approval to fall back to hosted runners, and do not count the maxRunners = 8 overflow cap as fully usable until scheduler evidence shows the node can admit that many pods under the committed root ephemeral request envelope.

The 2026-05-25 follow-up tightened that envelope instead of raising the cap: tinyland-nix-compute-expansion keeps its 64Gi /nix PVC and 32Gi work PVC, adds a 32Gi /home/runner/.cache PVC for Bazel/Bazelisk/package-manager cache churn, keeps the runner root ephemeral-storage request at 1Gi, and raises only the root ephemeral-storage limit to 16Gi. The module’s Nix PVC init container still requests 1Gi, so an eight-runner burst still reserves 16Gi of Sting node ephemeral-storage instead of 40Gi for the PVC-backed lane. This is an admission correction for logs, runner state, and non-mounted paths; it is not a durability claim for Sting fast-local scratch.

The 2026-05-27 dogfood window exposed the separate memory dimension for that same Nix compute-expansion lane. A tinyland-nix-compute-expansion runner for tinyland-inc/lab Publish to FlakeHub run 26542162814 reached the runner container’s 8Gi cgroup limit and terminated with OOMKilled while sting itself remained Ready and below node memory pressure. Treat that as capability-envelope evidence, not proof that the cluster is out of RAM. The source-owned correction raises the shared lane to a 4Gi memory request and 16Gi memory limit so dogfood Nix work has room to burst on the existing capability label. The namespace request-memory quota moves with it to 280Gi so honey-heavy requests do not turn the sting overflow lane into an artificial cross-node quota casualty. Recurrent OOMs after this lands are evidence to route the workload to tinyland-nix-heavy or raise the shared contract again; they are not a hosted-runner fallback and not a repo-specific runner label.

The 2026-05-25 default-branch Validate run exposed the baseline honey Nix scratch floor: .#ci realization failed with No space left on device on shared tinyland-nix before stack validation could complete. The shared lane therefore reserves 24Gi and limits at 40Gi per runner. This keeps the workflow-facing capability label stable and repairs the dogfood lane instead of routing first-party checks to hosted runners or minting a repo-specific scale set.

The same window also exposed honey kubelet image/container filesystem pressure. tinyland-nix per-pod scratch limits and node imagefs hygiene are separate capacity dimensions: widening a runner pod does not clear DiskPressure if /var/lib/rancher is near the kubelet eviction threshold. Run just kubelet-imagefs-capacity-audit --node honey before blaming workflow labels or adding hosted fallback.

The 2026-05-25 Source Bazel Proof failure exposed a separate remote-cache digest mismatch while Bazel read through the RustFS-backed bazel-cache service. The recovery started as cache-only: delete the implicated CAS object and roll the stateless bazel-cache pods so local emptyDir hot caches cannot keep serving it. That delete also reproduced the RustFS bucket-index failure class: S3 list-buckets went empty while disk markers remained present. A controlled restart of attic-rustfs-openebs restored the API view. Use just bazel-remote-cache-cas-integrity-audit --object-encoding zstd for read-only decoded payload verification before treating a Bazel digest mismatch as a runner, Bzlmod, or source-code failure.

tinyland-nix-operator is the dedicated control-plane lane for ARC deployment and operator maintenance. It is intentionally small and honey-bound: it exists so managed ARC applies can quiesce and max-freeze consumer runner lanes without running on the same labels they are draining. During bootstrap, the deploy workflow may still fall back to tinyland-nix-heavy; after the operator lane is live, set ARC_DEPLOY_RUNNER_LABEL=tinyland-nix-operator in repository variables.

Additive Nix hardware lanes also carry explicit ephemeral-storage envelopes. Without those fields, Kubernetes would apply the namespace 1Gi default limit to the runner container, which is too small for cache-backed Nix shells that still materialize GPU userspace, KVM scratch state, or heavy build work inside an ephemeral runner pod.

Use just arc-runtime-audit to inspect the live ARC runner-set envelopes and confirm that the cluster actually matches the repo contract after rollout.

For the shared KVM VM-execution lane, also read KVM Capacity Policy. The KVM label can be advertised by multiple owner-overlay scale sets, so the active limit is a live node-budget-derived policy on top of per-scale-set ARC maxRunners values.

GitLab Compatibility Limits

The current Honey GitLab compatibility stack is deliberately explicit but does not provide ARC-equivalent queue-driven scale-to-zero. It uses GitLab runner manager pods, per-manager concurrent_jobs, and HPA replica ranges. Current live tfvars pin manager and job pods to sting:

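| Runner | Concurrent jobs | HPA min/max | Job ephemeral request | Job ephemeral limit | Helper ephemeral limit | Service ephemeral limit |
|--------|-----------------|-------------|-----------------------|---------------------|------------------------|-------------------------|
| nix | 2 | 1/3 | 12Gi | 16Gi | 2Gi | none |
| docker | 4 | 1/3 | 1Gi | 8Gi | 1Gi | none |
| dind | 2 | 1/3 | 24Gi | 40Gi | 2Gi | 40Gi |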

Typical Workload Profiles

| Workload | CPU (typical) | Memory (typical) | Recommended Runner |
|----------|---------------|------------------|--------------------|
| Python lint (ruff) | 50-200m | 128-256Mi | docker |
| Python tests (pytest) | 100-500m | 256-512Mi | docker |
| Nix flake check | 200-500m | 256-512Mi | nix |
| GHC build (warm cache) | 500m-1 | 512Mi-1Gi | nix |
| GHC build (cold cache) | 2-4 | 2-4Gi | nix |
| MUSL static build | 1-2 | 1-2Gi | nix |
| FPM RPM packaging | 100-500m | 256-512Mi | rocky8/rocky9 |
| Docker image build | 500m-2 | 512Mi-2Gi | dind |

Scheduling Controls

The committed runner namespace rollout now declares Kubernetes ResourceQuota plus LimitRange guardrails so aggregate burst admission fails before it silently exceeds the intended on-prem envelope. Capacity is controlled by:

  • ARC maxRunners and minRunners values per scale set
  • pod resource requests and limits
  • namespace ResourceQuota and LimitRange defaults
  • node selectors, taints, and tolerations
  • physical node CPU, memory, pod-count, and kubelet imagefs headroom

maxRunners is local to one ARC scale set. It is not a global cap for a shared workflow label across every owner overlay attached to Honey. The namespace quota is the aggregate in-namespace backstop; it is not a cross-overlay global capacity policy, and it may intentionally cap below the sum of every lane’s theoretical max when the goal is to stop admission at the finite machine envelope.

Committed Honey guardrails:

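| Namespace | Pods | Requests CPU | Requests Memory | Requests Ephemeral | Limits CPU | Limits Memory | Limits Ephemeral |
|-----------|------|--------------|-----------------|--------------------|------------|---------------|------------------|
| arc-runners | 96 | 42 | 280Gi | 1100Gi | 320 | 640Gi | 2000Gi |
| gitlab-runners | 48 | 10 | 24Gi | 450Gi | 80 | 160Gi | 760Gi |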

Run just runner-capacity-model-check after changing runner counts, HPA caps, or resource envelopes. The check compares committed quota values against the modeled ARC/GitLab burst envelopes and fails when a quota is too loose to be a real backstop or too small for the largest modeled runner pod. It is an admission-envelope check, not proof that every lane can run at its individual max at the same time.
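
The two failure conditions can be sketched as a simple bounds check. This is an illustration of the idea, not the logic of runner-capacity-model-check: the function name, the result labels, and especially the `max_slack` threshold for "too loose" are assumptions. The example inputs are the arc-runners memory-request quota (280Gi), the sum of maxRunners times memory request over the ARC lane table (233Gi, runner containers only), and the largest single memory request (64Gi for tinyland-nix-heavy).

```python
def check_quota(quota_gi: float, modeled_burst_gi: float,
                largest_pod_gi: float, max_slack: float = 1.5) -> str:
    """Classify a quota against a modeled burst envelope.

    max_slack is a hypothetical looseness threshold, not a repo value.
    """
    if quota_gi < largest_pod_gi:
        return "too-small"  # the largest modeled pod could never be admitted
    if quota_gi > modeled_burst_gi * max_slack:
        return "too-loose"  # the quota would never act as a real backstop
    return "ok"


print(check_quota(280, 233, 64))   # ok
print(check_quota(1000, 233, 64))  # too-loose
print(check_quota(48, 233, 64))    # too-small
```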

Read-only acceptance checks for runner storage-capacity changes:

  1. kubectl describe node honey and kubectl describe node sting show allocatable CPU, memory, pod count, and ephemeral-storage with no DiskPressure or unexpected eviction events.
  2. kubectl get resourcequota,limitrange -n arc-runners -o wide shows the namespace still below the committed aggregate envelope.
  3. kubectl get storageclass shows local-path-sting-fast-ephemeral present before a sting fast-local runner lane depends on it.
  4. kubectl get pvc -n arc-runners shows compute-expansion DinD work and Docker graph PVCs bound to local-path-sting-fast-ephemeral at the expected 48Gi and 96Gi sizes whenever compute-expansion DinD pods are active.
  5. kubectl get pods -n arc-runners -o wide shows compute-expansion DinD payloads scheduled on sting, while the honey baseline lane remains a source-owned envelope rather than an emergency live patch target.
  6. just arc-listener-queue-drift --repo <owner/repo> --run-id <run-id> --fail-on-drift and just arc-runtime-audit --fail-on-listener-cap-drift --fail-on-runner-count-drift --fail-on-runner-session-drift distinguish a real capacity cap from listener/session drift before any live mutation.

Requesting Limit Increases

If your jobs are being OOM-killed or throttled:

  1. Check pod resource usage: kubectl top pods -n <runner-namespace>
  2. Review the job logs for OOM or throttling messages
  3. Update the overlay *.tfvars file with higher limits for the relevant runner type
  4. For arc-runners, prefer the managed Deploy ARC Runners workflow from current main; local just tofu-apply arc-runners refuses dirty or non-origin/main source unless GF_ARC_RUNNERS_LOCAL_APPLY_ALLOW=1 is set for a reviewed break-glass apply.
  5. The runner Helm release will be updated with new pod resource templates
