Resource Limits Reference
Default resource limits for CI job pods and their containers, by runner type.
Important boundary:
- these are Kubernetes container requests and limits that add up into pod and namespace admission
- they are not the same thing as total RAM or CPU available in the honey cluster
- a job can still be OOM-killed inside an 8Gi runner pod even when the cluster as a whole has abundant free memory
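The boundary above can be sketched as a small check. This is an illustrative model only, not how the kubelet or kernel is implemented; the function name and all numbers are hypothetical.

```python
GI = 1024 ** 3  # bytes in one gibibyte


def is_oom_killed(container_usage_bytes: int, container_limit_bytes: int) -> bool:
    """Return True when a container exceeds its own memory limit.

    Cluster-wide free memory is deliberately not a parameter: the limit is
    enforced per container, so spare RAM elsewhere does not help.
    """
    return container_usage_bytes > container_limit_bytes


# Hypothetical case: a job needs 9Gi inside an 8Gi runner container while
# the cluster as a whole has 500Gi free.
cluster_free_bytes = 500 * GI  # unused on purpose; it does not change the result
print(is_oom_killed(9 * GI, 8 * GI))  # True
print(is_oom_killed(6 * GI, 8 * GI))  # False
```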
Default Job Pod Limits
These are the module defaults. Overlay deployments override these
values in their *.tfvars files.
| Runner | CPU Request | CPU Limit | Memory Request | Memory Limit | Ephemeral Request | Ephemeral Limit |
|---|---|---|---|---|---|---|
| docker | 100m | 2 | 256Mi | 4Gi | none | none |
| dind | 500m | 4 | 1Gi | 8Gi | none | none |
| rocky8 | 100m | 2 | 256Mi | 2Gi | none | none |
| rocky9 | 100m | 2 | 256Mi | 2Gi | none | none |
| nix | 500m | 4 | 1Gi | 8Gi | none | none |
For the current Honey ARC stack, committed *.tfvars values intentionally
override the module defaults so the primary shared lanes have explicit burst
caps and disk envelopes:
| Environment | Runner | Max runners | Node | CPU Request | CPU Limit | Memory Request | Memory Limit | Runner Ephemeral Request | Runner Ephemeral Limit | DinD Daemon Ephemeral Request | DinD Daemon Ephemeral Limit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| honey | tinyland-docker | 20 | sting | 100m | 2 | 256Mi | 4Gi | 1Gi | 8Gi | N/A | N/A |
| honey | tinyland-nix | 16 | honey | 500m | 4 | 1Gi | 8Gi | 24Gi | 40Gi | N/A | N/A |
| honey | tinyland-nix-compute-expansion | 8 | sting | 500m | 4 | 4Gi | 16Gi | 1Gi | 16Gi | N/A | N/A |
| honey | tinyland-dind | 20 | honey | 500m | 4 | 1Gi | 8Gi | 4Gi | 8Gi | 24Gi | 40Gi |
| honey | tinyland-dind-compute-expansion | 16 | sting | 500m | 4 | 1Gi | 8Gi | 512Mi | 2Gi | 1Gi | 4Gi |
| honey | tinyland-nix-operator | 1 | honey | 1 | 4 | 4Gi | 8Gi | 8Gi | 16Gi | N/A | N/A |
| honey | tinyland-nix-heavy | 2 | honey | 4 | 8 | 64Gi | 160Gi | 192Gi | 256Gi | N/A | N/A |
| honey | tinyland-nix-kvm | 1 | honey | 2 | 8 | 8Gi | 16Gi | 64Gi | 128Gi | N/A | N/A |
| honey | tinyland-nix-gpu | 1 | honey | 1 | 4 | 4Gi | 8Gi | 12Gi | 16Gi | N/A | N/A |
The Docker lane is still the lowest-overhead path for stateless CI, but it now
has a small explicit ephemeral-storage request so a large burst is visible to
Kubernetes scheduling. Container image builds belong on tinyland-dind, not on
tinyland-docker. The tinyland-dind lane splits disk admission between the
runner workspace container and the Docker daemon sidecar so neither container
falls back to the namespace 1Gi default.
The tinyland-nix-compute-expansion lane is a sting overflow path for the same
shared tinyland-nix capability label. It gives org-scoped Nix jobs an 8-slot
bounded escape hatch when honey is saturated, but it does not make personal
repos such as Dell-7810, XoxdWM, scheduling-kit, or scheduling-bridge able to
see tinyland-inc org runners. That remains an owner-boundary problem, not a
raw capacity problem. The TIN-1400 implementation uses generic ephemeral
local-path-sting-fast-ephemeral PVCs for /nix (64Gi),
/home/runner/_work (32Gi), and /home/runner/.cache (32Gi), and copies
the baked image /nix into the per-pod PVC before runner startup. That
preserves the installed Nix toolchain while moving store/workdir/cache churn
off the kubelet root ephemeral-storage envelope. The container still has a
small explicit root ephemeral-storage request/limit for logs, runner state, and
non-mounted paths.
The tinyland-dind-compute-expansion lane is the sting overflow path for the
same shared tinyland-dind capability label. Sting has large local SSD/NVMe
scratch exposed through the local-path-sting-fast-ephemeral StorageClass, but
its kubelet/root ephemeral-storage accounting is still the smaller root
filesystem envelope. The compute-expansion lane therefore uses generic
ephemeral PVCs on local-path-sting-fast-ephemeral for /home/runner/_work
(48Gi) and the DinD Docker graph at /var/lib/docker (96Gi), while keeping
container ephemeral-storage requests small enough to avoid pretending that root
ephemeral storage is the same capacity pool as the fast-local PVCs.
The 2026-05-10 Rockies fanout showed the prior combined tinyland-dind
capacity of 16 slots saturating while the cluster still had CPU, memory, honey
ephemeral-storage, and sting fast-local headroom. The 2026-05-12 follow-up
showed the same artificial cap pattern after sting returned to service: honey
could hit its pod ceiling while sting still had pod, CPU, memory, and
fast-local PVC headroom. The current bounded step is 20 honey DinD slots plus
16 sting fast-local overflow slots. Further increases should be based on live
queue evidence and storage telemetry, not on repo-specific labels.
The 2026-05-11 runner availability check added one more constraint: a node can have spare CPU, memory, and storage while the active lane is blocked by the node pod-count ceiling or by selectors/tolerations that keep the pod off the node with spare resources. Treat pod capacity and placement as first-class capacity dimensions alongside CPU, memory, and disk.
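One way to picture pod count and placement as capacity dimensions is a check that lists every blocker for a pod on a node. This is a minimal sketch, not the scheduler's logic; the class names, field names, and node numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeHeadroom:
    """Spare capacity on one node (all values are hypothetical inputs)."""
    cpu_millicores: int
    memory_mib: int
    ephemeral_mib: int
    free_pod_slots: int
    placement_allowed: bool  # selectors/tolerations let this pod land here


@dataclass
class PodRequest:
    cpu_millicores: int
    memory_mib: int
    ephemeral_mib: int


def blocking_dimensions(node: NodeHeadroom, pod: PodRequest) -> list:
    """List every dimension that stops the pod from landing on the node."""
    blockers = []
    if not node.placement_allowed:
        blockers.append("placement")
    if node.free_pod_slots < 1:
        blockers.append("pods")
    if node.cpu_millicores < pod.cpu_millicores:
        blockers.append("cpu")
    if node.memory_mib < pod.memory_mib:
        blockers.append("memory")
    if node.ephemeral_mib < pod.ephemeral_mib:
        blockers.append("ephemeral-storage")
    return blockers


# A DinD-sized request (500m CPU, 1Gi memory, 4Gi ephemeral) against a node
# with plenty of spare resources but no free pod slots.
dind = PodRequest(cpu_millicores=500, memory_mib=1024, ephemeral_mib=4096)
full_node = NodeHeadroom(16000, 65536, 204800, free_pod_slots=0, placement_allowed=True)
print(blocking_dimensions(full_node, dind))  # ['pods']
```

An empty list means none of the modeled dimensions blocks the pod; a result of only pods or placement is the case the paragraph above warns about.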
The 2026-05-14 Rockies fanout made the source-owned overflow cap visible:
Honey was at its 110/110 pod ceiling, baseline tinyland-dind pods were
Unschedulable with Too many pods, tinyland-dind-compute-expansion was at
its maxRunners = 9, and Sting still had free pod slots plus low actual
fast-local scratch usage. just arc-burst-capacity-audit now reports this as
overflow saturation so the next change can be a reviewed cap and storage
envelope decision instead of an emergency live patch. The PVC sizes are
recovery/scratch intent for the lane; local-path does not make those nominal
sizes a durable HA or global storage-quota authority.
The 2026-05-14 follow-up raised that source-owned overflow cap to
maxRunners = 12 while tightening the fast-local PVC envelope to 48Gi work
plus 96Gi Docker graph per runner. The May 15 recurrence then saturated that
overflow cap again: tinyland-dind was current=20 pending=6, the overflow
lane was current=12 pending=5, honey was at 110/110 pods, and sting still
had 31 free pod slots plus low actual fast-local usage. The current
source-owned follow-up raises only the additive sting overflow cap to
maxRunners = 16; it does not widen the active honey baseline scale set.
Treat the PVC sizes as recoverable scratch intent and validate actual fast-local
headroom with the burst audit before any further cap change.
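The saturation pattern described above can be expressed as a small classifier. This is a sketch under stated assumptions, not the logic of just arc-burst-capacity-audit: the class, the function, and the labels other than "overflow saturation" are hypothetical. The inputs are the May 15 numbers from the text.

```python
from dataclasses import dataclass


@dataclass
class LaneState:
    name: str
    current: int
    pending: int
    max_runners: int

    def at_cap_with_backlog(self) -> bool:
        """The lane is at its maxRunners cap and jobs are still waiting."""
        return self.current >= self.max_runners and self.pending > 0


def classify(baseline: LaneState, overflow: LaneState, overflow_node_free_pods: int) -> str:
    # The overflow lane is capped by its own maxRunners while its node can
    # still admit pods: the cap, not the hardware, is the limit.
    if overflow.at_cap_with_backlog() and overflow_node_free_pods > 0:
        return "overflow saturation"
    if baseline.at_cap_with_backlog():
        return "baseline saturation"  # hypothetical label
    return "no cap pressure"  # hypothetical label


baseline = LaneState("tinyland-dind", current=20, pending=6, max_runners=20)
overflow = LaneState("tinyland-dind-compute-expansion", current=12, pending=5, max_runners=12)
print(classify(baseline, overflow, overflow_node_free_pods=31))  # overflow saturation
```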
The 2026-05-14 PR #655 post-merge window exposed the next boundary: capacity
and fairness are different things. The tinyland-dind and
tinyland-dind-compute-expansion lanes can be healthy and still leave a
GloriousFlywheel control-plane job queued behind a large downstream fanout on
the same shared workflow label. GitHub ARC does not provide repository priority
for one shared label. Treat that as shared-label fairness/admission policy, not
as proof that the cluster is out of CPU, memory, or fast-local scratch. Use
just arc-burst-capacity-audit --include-label tinyland-dind and read its
Shared Queue Fairness section alongside pod-slot, quota, and PVC evidence
before changing runner envelopes.
The 2026-05-19 same-label overflow work added a separate JIT assignment trap
classifier to the burst audit. A pending runner with no visible GitHub job and a
not-ready pod is cleanup evidence only after the GitHub runner is verified
offline/not busy. A pending runner with jobRepositoryName, workflowRunId, or
jobDisplayName is an assigned job at risk, not disposable residue. Do not use
EphemeralRunner deletion as a queue-drain shortcut for assigned jobs; fix the
placement/capacity cause or cancel the GitHub job explicitly.
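The decision rule above can be sketched as follows. The field names jobRepositoryName, workflowRunId, and jobDisplayName come from the text; the function name, its parameters, and the returned labels are hypothetical and do not describe the audit's actual output.

```python
from typing import Optional

JOB_FIELDS = ("jobRepositoryName", "workflowRunId", "jobDisplayName")


def classify_pending_runner(
    runner: dict,
    pod_ready: bool,
    github_online: Optional[bool],
    github_busy: Optional[bool],
) -> str:
    """Classify one pending runner; None means GitHub state was not verified."""
    # Any job field means GitHub has assigned work: never treat as residue.
    if any(runner.get(field) for field in JOB_FIELDS):
        return "assigned-job-at-risk"
    # No visible job and a not-ready pod is cleanup evidence only once the
    # GitHub runner is verified offline and not busy.
    if not pod_ready and github_online is False and github_busy is False:
        return "cleanup-evidence"
    return "needs-verification"


print(classify_pending_runner({"workflowRunId": 123}, False, False, False))
# assigned-job-at-risk
print(classify_pending_runner({}, False, False, False))  # cleanup-evidence
print(classify_pending_runner({}, False, None, None))    # needs-verification
```

Checking the job fields first keeps an assigned job from ever being classified as disposable, regardless of pod or runner state.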
The 2026-05-12 follow-up adds repo-owned cleanup for runner namespace residue:
the ARC runner stack enables the runner-cleanup CronJob in arc-runners.
This does not increase hardware capacity, but it keeps Succeeded and Failed
utility pods from occupying honey pod slots after their evidence value has
expired.
Do not widen an active baseline AutoscalingRunnerSet under load as the first
queue-drain response. ARC can treat maxRunners changes as listener/config
reconciliation work, and changing the baseline scale set can interrupt the
listener path while existing runners drain. Prefer additive overflow lanes, or
land the higher envelope as a reviewed source change followed by a planned
deploy and runtime audit. The tinyland-dind=20 plus
tinyland-dind-compute-expansion=16 envelope is canonical because it is encoded
in the runner tfvars, docs, and validation, not because a live patch happened.
OpenEBS ZFS is currently the bumble-backed durable PVC plane, not the hot runner scratch plane. Do not route DinD Docker graph churn through bumble OpenEBS unless a separate storage design explicitly accepts the latency, failure-domain, and cleanup semantics. For transient container build scratch, prefer sting fast-local generic ephemeral PVCs.
The repo-owned Platform Proof job also keeps the tinyland-docker dependency
install bounded with pnpm install --child-concurrency=2 --network-concurrency=4. That is intentional: the proof should verify the
shared Docker lane, cache attachment, and ordinary app/MCP checks without
turning a low-overhead runner into a host file-table stress test. Repeated
ENFILE failures in downstream workflows should be treated as capacity or
workflow-fanout evidence, not as a reason to silently widen the shared Docker
lane.
The current additive shared capability class for heavier Nix work is
tinyland-nix-heavy, defined in tofu/stacks/arc-runners/honey.tfvars with a
larger pod envelope than baseline tinyland-nix.
TIN-1249 adds the matching REAPI proof-cell capacity boundary: gf-reapi-cell
uses a 4 CPU / 8Gi request and 16 CPU / 16Gi limit when it is active, but
the committed manifest is idle at replicas: 0. The local proof script still
defaults GF_REAPI_CELL_SCALE_TO_ZERO_AFTER_PROOF=true for bounded operator
windows. The GitHub proof workflow now defaults to leaving the cell resident
after a successful apply because the repo is dogfooding hourly TTFCH and
back-to-back RBE proofs against the live endpoint. Operators can opt into
teardown with scale_to_zero_after_proof=true when scarce-lane pressure is
more important than live endpoint continuity.
The 2026-05-17 PR #694 window exposed the corresponding scarce-lane queue
case: tinyland-nix-heavy can be healthy and still block a control-plane or
consumer proof when its single declared slot is occupied by another repository,
and a follow-on pod can also report scheduler resource pressure such as
Insufficient ephemeral-storage. just arc-burst-capacity-audit now reports
this under Shared Label Queue Pressure, including the active holder
repositories and not-ready runner pod messages. Treat that as queue/admission
and scheduler evidence; this diagnostic does not mutate capacity and does not
turn tinyland-nix-heavy into a repo-specific lane.
The 2026-05-19 PR #725/post-merge window showed the same contention twice in a
row: once before merge and again on default-branch Platform Proof. The managed
capacity response moves the existing shared tinyland-nix-heavy scale set to
honey and raises it to a large shared heavy envelope. That keeps the
workflow-facing capability label stable, avoids a GloriousFlywheel-only runner
label, and avoids pretending that sting can safely host large heavy-Nix pods
before the fast-local Nix scratch/store model exists there.
The 2026-05-19 PR #757 post-merge image-publish run then proved that 32Gi was
still too small for the gf-reapi-cell OCI publication path when Nix had to
compile the patched skopeo used by nix2container.copyTo. PR #758 kept the
same shared capability label and raised the honey-backed tinyland-nix-heavy
pod request/limit to 64Gi rather than minting a repo-specific publication lane.
The 2026-05-21 W3.4 vendor-mode canaries then proved that 64Gi and 128Gi
scratch were still not full-graph Bazel vendor-mode envelopes. Run
26245482714 reached roughly 53Gi in the vendor temp tree before kubelet
eviction at 64Gi. Run 26246609243 passed that boundary, reached roughly 79Gi
while still healthy, and was later evicted at the 128Gi scratch limit. The lane
then reserved 96Gi memory, allowed 160Gi memory, and reserved 192Gi scratch
with a 256Gi scratch limit. Managed apply 26247461740 reconciled that
envelope live, and canary 26247715938 proved the intended transition: the
W3.4 red signal moved from runner eviction to a Bazel external-input defect in
the rules_pkg@1.1.0 BCR module. Follow-up canary 26350919668 moved past
that leak and exposed the next authority gap, an ambient local-Python lookup
from pybind11_bazel, which is now made explicit through the CI devshell and
PYTHON_BIN_PATH repository environment.
The 2026-05-28 current-main W3.4 canary 26549932671 passed on
tinyland-nix-heavy with downloadable evidence: bazel vendor completed,
//:deployment_bundle built from the vendor directory, the temporary vendor
tree reached roughly 170Gi, and pod memory stayed around 14Gi with no OOM or
eviction. That makes the lane scratch-heavy rather than memory-heavy for this
target class. The committed envelope therefore keeps the 160Gi memory limit for
bursts, lowers the memory request to 64Gi, and keeps the 192Gi/256Gi scratch
request/limit that the canary actually needed.
The 2026-05-28 follow-up canary 26587033690 on main repeated the proof after
the E3 status gate landed: full-scope vendor-mode passed, //:deployment_bundle
built from the vendored graph, the workdir again reached roughly 170Gi, and the
job completed without OOM or eviction. The vendor-mode workflow now uses a
192Gi scratch preflight by default so operators do not dispatch a known
170Gi-class proof onto a runner with only the older 40Gi floor available.
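A scratch preflight of this kind can be sketched as a free-space comparison. This is a minimal illustration, not the workflow's actual preflight step: the function names are hypothetical, and free filesystem space is only a proxy for the pod's ephemeral-storage envelope.

```python
import shutil

GI = 1024 ** 3  # bytes in one gibibyte
REQUIRED_SCRATCH_GI = 192  # default preflight floor named in the text


def scratch_preflight(free_bytes: int, required_gi: int = REQUIRED_SCRATCH_GI) -> bool:
    """Return True when the free scratch space meets the required floor."""
    return free_bytes >= required_gi * GI


def check_path(path: str, required_gi: int = REQUIRED_SCRATCH_GI) -> bool:
    """Run the preflight against the filesystem that holds `path`."""
    return scratch_preflight(shutil.disk_usage(path).free, required_gi)


print(scratch_preflight(40 * GI))   # False: the older 40Gi floor is too small
print(scratch_preflight(200 * GI))  # True
```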
The 2026-05-31 canary 26710742767 confirmed that the next red signal was not
runner capacity: it ran on tinyland-nix-heavy and failed while fetching a
hermetic_launcher prebuilt stub from GitHub during bazel vendor. Treat that
as W3 external-input authority debt. The runner envelope remains the right
class for this proof; the fix is to stage verified distdir inputs, not to mint a
repo-specific runner label or fall back to hosted runners.
Follow-up workflow-dispatch canary 26717187299 proved the distdir staging
fix on the same dogfooded heavy lane after PR #855: bazel vendor completed,
//:deployment_bundle built from the vendor directory, and the evidence
classifier was ok. Post-merge main canary 26718312931 then repeated the
proof on main at 3d68e10, with the vendor step, evidence summary, and
artifact upload all green. That confirms the 192Gi/256Gi scratch envelope
remains the right shared capability lane for W3.4; it does not promote durable
external-input authority or broad/default RBE.
The 2026-05-27 PR #813/#814 dogfood window exposed the next heavy-lane truth:
tinyland-nix-heavy is live at maxRunners = 2, but before this correction each
runner requested 96Gi memory on honey. The second slot was only
schedulable when Honey had another 96Gi request window available, which made
maxRunners = 2 an overclaim under normal dogfood pressure. TIN-1649 now treats
the proved 64Gi request as the truthful honey-backed fix; if future heavy pods
still report Insufficient memory, handle it as capability-lane placement or
quota evidence. Do not route around it with a repo-specific runner label.
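The overclaim can be shown with simple arithmetic. This is a sketch only: the 150Gi of unreserved memory is a hypothetical figure chosen for illustration, not a measured value for honey, and the function name is an assumption.

```python
def schedulable_slots(free_request_gi: int, per_runner_request_gi: int, max_runners: int) -> int:
    """How many runners fit by memory request, capped at maxRunners."""
    return min(max_runners, free_request_gi // per_runner_request_gi)


HYPOTHETICAL_FREE_GI = 150  # assumed unreserved memory on the node

print(schedulable_slots(HYPOTHETICAL_FREE_GI, 96, max_runners=2))  # 1
print(schedulable_slots(HYPOTHETICAL_FREE_GI, 64, max_runners=2))  # 2
```

Under that assumption, a 96Gi request leaves the second declared slot unschedulable, while the 64Gi request lets both slots fit.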
The 2026-05-18 recurrence made the Sting scratch boundary explicit for Nix
lanes. DinD compute-expansion already used local-path-sting-fast-ephemeral
PVCs for recoverable workspace and Docker graph churn; the TIN-1400 follow-up
adds the analogous Nix root/workdir PVC model for tinyland-nix-compute-expansion.
Do not generalize that proof to heavy Nix, KVM, GPU, or compatibility lanes
until their own runner image and storage behavior are proved.
The 2026-05-24 first-party dogfood surge then proved the next layer: the Nix
compute-expansion fast-local PVC model is necessary but not sufficient if
sting still advertises only a small kubelet root/nodefs ephemeral-storage
surface. A live tinyland-nix-compute-expansion pod reported scheduler
Insufficient ephemeral-storage while CPU, memory, namespace quota, and the
physical fast-local storage story were not the limiting dimensions. Treat that
as kubelet/local-path storage integration debt. Do not treat it as approval to
fall back to hosted runners, and do not count the maxRunners = 8 overflow cap
as fully usable until scheduler evidence shows the node can admit that many
pods under the committed root ephemeral request envelope.
The 2026-05-25 follow-up tightened that envelope instead of raising the cap:
tinyland-nix-compute-expansion keeps its 64Gi /nix PVC and 32Gi work
PVC, adds a 32Gi /home/runner/.cache PVC for Bazel/Bazelisk/package-manager
cache churn, keeps the runner root ephemeral-storage request at 1Gi, and
raises only the root ephemeral-storage limit to 16Gi. The module’s Nix PVC
init container still requests 1Gi, so an eight-runner burst still reserves
16Gi of Sting node ephemeral-storage instead of 40Gi for the PVC-backed
lane. This is an admission correction for logs, runner state, and non-mounted
paths; it is not a durability claim for Sting fast-local scratch.
The 2026-05-27 dogfood window exposed the separate memory dimension for that
same Nix compute-expansion lane. A tinyland-nix-compute-expansion runner for
tinyland-inc/lab Publish to FlakeHub run 26542162814 reached the runner
container’s 8Gi cgroup limit and terminated with OOMKilled while sting
itself remained Ready and below node memory pressure. Treat that as
capability-envelope evidence, not proof that the cluster is out of RAM. The
source-owned correction raises the shared lane to a 4Gi memory request and
16Gi memory limit so dogfood Nix work has room to burst on the existing
capability label. The namespace request-memory quota moves with it to 280Gi
so honey-heavy requests do not turn the sting overflow lane into an artificial
cross-node quota casualty. Recurrent OOMs after this lands are evidence to
route the workload to tinyland-nix-heavy or raise the shared contract again;
they are not a hosted-runner fallback and not a repo-specific runner label.
The 2026-05-25 default-branch Validate run exposed the baseline honey Nix
scratch floor: .#ci realization failed with No space left on device on
shared tinyland-nix before stack validation could complete. The shared lane
therefore reserves 24Gi and limits at 40Gi per runner. This keeps the
workflow-facing capability label stable and repairs the dogfood lane instead of
routing first-party checks to hosted runners or minting a repo-specific scale
set.
The same window also exposed honey kubelet image/container filesystem pressure.
tinyland-nix per-pod scratch limits and node imagefs hygiene are separate
capacity dimensions: widening a runner pod does not clear DiskPressure if
/var/lib/rancher is near the kubelet eviction threshold. Run
just kubelet-imagefs-capacity-audit --node honey before blaming workflow
labels or adding hosted fallback.
The 2026-05-25 Source Bazel Proof failure exposed a separate remote-cache
digest mismatch while Bazel read through the RustFS-backed bazel-cache
service. The recovery started as cache-only: delete the implicated CAS object
and roll the stateless bazel-cache pods so local emptyDir hot caches cannot
keep serving it. That delete also reproduced the RustFS bucket-index failure
class: S3 list-buckets went empty while disk markers remained present. A
controlled restart of attic-rustfs-openebs restored the API view. Use
just bazel-remote-cache-cas-integrity-audit --object-encoding zstd for
read-only decoded payload verification before treating a Bazel digest mismatch
as a runner, Bzlmod, or source-code failure.
tinyland-nix-operator is the dedicated control-plane lane for ARC deployment
and operator maintenance. It is intentionally small and honey-bound: it exists
so managed ARC applies can quiesce and max-freeze consumer runner lanes without
running on the same labels they are draining. During bootstrap, the deploy
workflow may still fall back to tinyland-nix-heavy; after the operator lane is
live, set ARC_DEPLOY_RUNNER_LABEL=tinyland-nix-operator in repository
variables.
Additive Nix hardware lanes also carry explicit ephemeral-storage envelopes.
Without those fields, Kubernetes would apply the namespace 1Gi default limit
to the runner container, which is too small for cache-backed Nix shells that
still materialize GPU userspace, KVM scratch state, or heavy build work inside
an ephemeral runner pod.
Use just arc-runtime-audit to inspect the live ARC runner-set envelopes and
confirm that the cluster actually matches the repo contract after rollout.
For the shared KVM VM-execution lane, also read
KVM Capacity Policy. The KVM label can be advertised
by multiple owner-overlay scale sets, so the active limit is a live
node-budget-derived policy on top of per-scale-set ARC maxRunners values.
GitLab Compatibility Limits
The current Honey GitLab compatibility stack is deliberately explicit but does
not provide ARC-equivalent queue-driven scale-to-zero. It uses GitLab runner
manager pods, per-manager concurrent_jobs, and HPA replica ranges. Current
live tfvars pin manager and job pods to sting:
| Runner | Concurrent jobs | HPA min/max | Job ephemeral request | Job ephemeral limit | Helper ephemeral limit | Service ephemeral limit |
|---|---|---|---|---|---|---|
| nix | 2 | 1/3 | 12Gi | 16Gi | 2Gi | none |
| docker | 4 | 1/3 | 1Gi | 8Gi | 1Gi | none |
| dind | 2 | 1/3 | 24Gi | 40Gi | 2Gi | 40Gi |
Typical Workload Profiles
| Workload | CPU (typical) | Memory (typical) | Recommended Runner |
|---|---|---|---|
| Python lint (ruff) | 50-200m | 128-256Mi | docker |
| Python tests (pytest) | 100-500m | 256-512Mi | docker |
| Nix flake check | 200-500m | 256-512Mi | nix |
| GHC build (warm cache) | 500m-1 | 512Mi-1Gi | nix |
| GHC build (cold cache) | 2-4 | 2-4Gi | nix |
| MUSL static build | 1-2 | 1-2Gi | nix |
| FPM RPM packaging | 100-500m | 256-512Mi | rocky8/rocky9 |
| Docker image build | 500m-2 | 512Mi-2Gi | dind |
Scheduling Controls
The committed runner namespace rollout now declares Kubernetes ResourceQuota plus LimitRange guardrails so aggregate burst admission fails before it silently exceeds the intended on-prem envelope. Capacity is controlled by:
- ARC maxRunners and minRunners values per scale set
- pod resource requests and limits
- namespace ResourceQuota and LimitRange defaults
- node selectors, taints, and tolerations
- physical node CPU, memory, pod-count, and kubelet imagefs headroom
maxRunners is local to one ARC scale set. It is not a global cap for a shared
workflow label across every owner overlay attached to Honey. The namespace
quota is the aggregate in-namespace backstop; it is not a cross-overlay global
capacity policy, and it may intentionally cap below the sum of every lane’s
theoretical max when the goal is to stop admission at the finite machine
envelope.
Committed Honey guardrails:
| Namespace | Pods | Requests CPU | Requests Memory | Requests Ephemeral | Limits CPU | Limits Memory | Limits Ephemeral |
|---|---|---|---|---|---|---|---|
| arc-runners | 96 | 42 | 280Gi | 1100Gi | 320 | 640Gi | 2000Gi |
| gitlab-runners | 48 | 10 | 24Gi | 450Gi | 80 | 160Gi | 760Gi |
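To see how the namespace quota can sit below the sum of every lane's theoretical maximum, the committed ARC lane envelopes can be multiplied out and compared with the arc-runners quota. This is a rough sketch, not the implementation of just runner-capacity-model-check. It counts only the runner and DinD daemon requests listed in this document; init containers, listener pods, and any sidecar memory requests are not modeled, and the helper names are hypothetical.

```python
# (lane, max runners, memory request, ephemeral requests per pod)
# DinD lanes list the runner and DinD daemon ephemeral requests separately.
LANES = [
    ("tinyland-docker", 20, "256Mi", ["1Gi"]),
    ("tinyland-nix", 16, "1Gi", ["24Gi"]),
    ("tinyland-nix-compute-expansion", 8, "4Gi", ["1Gi"]),
    ("tinyland-dind", 20, "1Gi", ["4Gi", "24Gi"]),
    ("tinyland-dind-compute-expansion", 16, "1Gi", ["512Mi", "1Gi"]),
    ("tinyland-nix-operator", 1, "4Gi", ["8Gi"]),
    ("tinyland-nix-heavy", 2, "64Gi", ["192Gi"]),
    ("tinyland-nix-kvm", 1, "8Gi", ["64Gi"]),
    ("tinyland-nix-gpu", 1, "4Gi", ["12Gi"]),
]

# arc-runners quota values from the guardrail table.
QUOTA = {"pods": 96, "requests_memory_gi": 280, "requests_ephemeral_gi": 1100}


def to_mib(quantity: str) -> int:
    """Convert a Gi/Mi quantity string to mebibytes."""
    for suffix, factor in (("Gi", 1024), ("Mi", 1)):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity}")


def burst_totals(lanes) -> dict:
    """Sum requests as if every lane ran at its maxRunners at once."""
    pods = sum(n for _, n, _, _ in lanes)
    memory = sum(n * to_mib(mem) for _, n, mem, _ in lanes)
    ephemeral = sum(n * sum(to_mib(e) for e in eph) for _, n, _, eph in lanes)
    return {
        "pods": pods,
        "requests_memory_gi": memory // 1024,
        "requests_ephemeral_gi": ephemeral // 1024,
    }


totals = burst_totals(LANES)
for key, quota in QUOTA.items():
    relation = "within" if totals[key] <= quota else "above"
    print(f"{key}: modeled {totals[key]} is {relation} quota {quota}")
# pods: modeled 85 is within quota 96
# requests_memory_gi: modeled 233 is within quota 280
# requests_ephemeral_gi: modeled 1464 is above quota 1100
```

Under these assumptions the ephemeral-storage request quota is the dimension that stops admission before every lane reaches its individual maximum.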
Run just runner-capacity-model-check after changing runner counts, HPA caps,
or resource envelopes. The check compares committed quota values against the
modeled ARC/GitLab burst envelopes and fails when a quota is too loose to be a
real backstop or too small for the largest modeled runner pod. It is an
admission-envelope check, not proof that every lane can run at its individual
max at the same time.
Read-only acceptance checks for runner storage-capacity changes:
- kubectl describe node honey and kubectl describe node sting show allocatable CPU, memory, pod count, and ephemeral-storage with no DiskPressure or unexpected eviction events.
- kubectl get resourcequota,limitrange -n arc-runners -o wide shows the namespace still below the committed aggregate envelope.
- kubectl get storageclass shows local-path-sting-fast-ephemeral present before a sting fast-local runner lane depends on it.
- kubectl get pvc -n arc-runners shows compute-expansion DinD work and Docker graph PVCs bound to local-path-sting-fast-ephemeral at the expected 48Gi and 96Gi sizes whenever compute-expansion DinD pods are active.
- kubectl get pods -n arc-runners -o wide shows compute-expansion DinD payloads scheduled on sting, while the honey baseline lane remains a source-owned envelope rather than an emergency live patch target.
- just arc-listener-queue-drift --repo <owner/repo> --run-id <run-id> --fail-on-drift and just arc-runtime-audit --fail-on-listener-cap-drift --fail-on-runner-count-drift --fail-on-runner-session-drift distinguish a real capacity cap from listener/session drift before any live mutation.
Requesting Limit Increases
If your jobs are being OOM-killed or throttled:
- Check pod resource usage: kubectl top pods -n <runner-namespace>
- Review the job logs for OOM or throttling messages
- Update the overlay *.tfvars file with higher limits for the relevant runner type
- For arc-runners, prefer the managed Deploy ARC Runners workflow from current main; local just tofu-apply arc-runners refuses dirty or non-origin/main source unless GF_ARC_RUNNERS_LOCAL_APPLY_ALLOW=1 is set for a reviewed break-glass apply.
- The runner Helm release will be updated with new pod resource templates