GloriousFlywheel Honey Runner Memory Envelope 2026-04-16
Snapshot date: 2026-04-16
Purpose
Capture a second platform-level runner signal from downstream dogfooding:
memory-pressure failures on honey are currently easier to trigger than the
cluster-wide hardware footprint suggests.
This note separates:
- aggregate cluster capacity
- per-runner pod limits
- placement and scheduling behavior
Triggering Evidence
Downstream signal:
- a Rust CI lane on `honey` reported `clippy` being SIGKILLed from memory
  pressure rather than failing on a lint issue
- the downstream repo mitigated it in workflow code by capping clippy parallelism
That mitigation may be acceptable in the downstream repo, but it exposes a GloriousFlywheel runner-envelope question.
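A downstream parallelism cap of that kind is usually a one-line workflow change. A hypothetical sketch follows; the job name, step layout, and jobs value are illustrative, not taken from the downstream repo, and `runs-on: tinyland-nix` assumes the lane name doubles as the runner label:

```yaml
# Hypothetical GitHub Actions fragment: cap cargo/clippy build parallelism
# so peak concurrent rustc memory stays inside the runner pod's cgroup envelope.
jobs:
  lint:
    runs-on: tinyland-nix
    env:
      CARGO_BUILD_JOBS: "2"   # limits concurrent rustc processes (illustrative value)
    steps:
      - uses: actions/checkout@v4
      - run: cargo clippy --all-targets -- -D warnings
```

This trades lint wall-clock time for a lower peak RSS inside the fixed per-pod envelope.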
Live Cluster Evidence
Live honey cluster read on 2026-04-16:
- allocatable node envelopes:
  - `honey`: 32 CPU, 230483600Ki memory
  - `sting`: 32 CPU, 57332656Ki memory
  - `bumble`: 4 CPU, 16048956Ki memory
- live node usage from `kubectl top nodes`:
  - `honey`: 1754m CPU, 92193Mi memory, 40% memory usage
  - `sting`: 193m CPU, 3509Mi memory, 6% memory usage
  - `bumble`: 100m CPU, 5828Mi memory, 37% memory usage
- `sting` currently carries taint `dedicated.tinyland.dev/compute-expansion:NoSchedule`
- live ARC listeners for `tinyland-nix`, `tinyland-docker`, `tinyland-dind`,
  `personal-nix`, `personal-docker`, `linux-xr-docker`, and `tinyland-nix-heavy`
  are all on `honey`
- live repo-owned `AutoscalingRunnerSet/tinyland-nix` now uses:
  - `limits.cpu = "4"`
  - `limits.memory = "8Gi"`
  - normalized `ATTIC_SERVER = http://attic.nix-cache.svc.cluster.local`
- live repo-owned `AutoscalingRunnerSet/tinyland-nix-heavy` now uses:
  - `limits.cpu = "8"`
  - `limits.memory = "16Gi"`
  - `nodeSelector["kubernetes.io/hostname"] = "sting"`
  - toleration for `dedicated.tinyland.dev/compute-expansion:NoSchedule`
- live personal `AutoscalingRunnerSet/personal-nix` still uses:
  - `githubConfigUrl = https://github.com/jesssullivan/jesssullivan.github.io`
  - `githubConfigSecret = github-personal-secret`
  - stale `ATTIC_SERVER = http://attic-api.nix-cache.svc:8080`
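Rendered into a runner pod spec, the `tinyland-nix-heavy` placement contract above corresponds roughly to the following. This is a sketch of the relevant fields only, not a copy of the live manifest, and the toleration `operator` value is an assumption:

```yaml
# Sketch of the tinyland-nix-heavy runner pod placement contract:
# pin to sting, tolerate its dedicated taint, cap the cgroup at 16Gi.
spec:
  nodeSelector:
    kubernetes.io/hostname: sting
  tolerations:
    - key: dedicated.tinyland.dev/compute-expansion
      operator: Exists        # assumed; an Equal/value form would also work
      effect: NoSchedule
  containers:
    - name: runner
      resources:
        limits:
          cpu: "8"
          memory: 16Gi
```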
Meaning:
- this is not a cluster-wide memory shortage on `honey`; it is a per-runner
  envelope and runtime-drift problem
- `sting` is not part of the default scheduling surface unless GloriousFlywheel
  explicitly adds tolerations or a stronger placement contract
Current Repo Truth
ARC Nix Runner Envelope
Current committed arc-runners baseline for the Nix runner lane:
- `tofu/stacks/arc-runners/dev.tfvars`
- `tofu/stacks/arc-runners/dev-policy.tfvars`
Current values:
- `nix_cpu_limit = "4"`
- `nix_memory_limit = "8Gi"`
The stack defaults also align with this envelope:
- `tofu/stacks/arc-runners/variables.tf`: `nix_memory_limit` default is `8Gi`
ARC applies those limits directly to the runner container:
- `tofu/modules/arc-runner/main.tf`:
  - `resources.requests.memory = var.memory_request`
  - `resources.limits.memory = var.memory_limit`
Meaning:
- `tinyland-nix` does not inherit "all memory on honey"; it inherits a
  per-runner cgroup envelope
- a Rust workload can be OOM-killed inside that envelope even when the cluster still has abundant free RAM
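Concretely, the envelope described above surfaces on the runner container as fields like the following. The limit echoes the committed baseline; the request value is an illustrative assumption, since the note does not state `var.memory_request`:

```yaml
# Sketch: how var.memory_request / var.memory_limit land on the baseline
# tinyland-nix runner container. The limit is the hard cgroup cap that
# triggers the OOM kill, regardless of free RAM elsewhere on the node.
containers:
  - name: runner
    resources:
      requests:
        memory: 4Gi   # var.memory_request (illustrative value, not repo truth)
      limits:
        memory: 8Gi   # var.memory_limit, the per-pod hard cap
```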
Placement Truth
The ARC runner module supports:
- `node_selector`
- `tolerations`
But the baseline `tinyland-nix` stack path does not currently express an
explicit node-placement contract for the default Nix runner lane.
Meaning:
- the cluster may have large aggregate capacity
- but the effective runtime still depends on where the runner pod lands
- no current repo truth guarantees even distribution across all `honey` nodes
Quota Truth
The docs currently advertise a shared namespace quota of:
- 16 CPU requests
- 32Gi memory requests
- 50 pods
That quota is about namespace request accounting, not about the maximum memory available to one runner pod.
Meaning:
- namespace quota does not make an `8Gi` runner pod behave like a much larger
  memory machine
- HPA and runner-count behavior also do not change the single-pod cgroup limit
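The advertised quota would correspond to a ResourceQuota object shaped roughly like this; the object name and namespace are assumptions, not repo truth:

```yaml
# Sketch of the advertised namespace quota. It caps the SUM of requests
# across all pods in the namespace; it says nothing about the memory
# ceiling of any single runner pod.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: arc-runners-quota        # assumed name
  namespace: arc-runners         # assumed namespace
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    pods: "50"
```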
Current Read
Given current repo truth, a clippy SIGKILL on `tinyland-nix` is not actually
impossible.
It is compatible with the current platform if any of these are true:
- the Rust workload exceeds `8Gi` under current parallelism
- multiple rustc/clippy processes spike within one runner pod
- the job lands on a node under local contention
- the downstream workflow assumes cluster-wide capacity instead of the runner pod’s actual cgroup
Autoscaling Clarification
In the current GloriousFlywheel ARC model, autoscaling does exist, but it is horizontal:
- ARC can raise or lower the number of runner pods through scale-set behavior
- ARC does not automatically resize `cpu_limit` or `memory_limit` for one
  runner pod based on workload demand
- namespace quota and scale-set growth also do not override one pod's cgroup
So the cloud-native answer is not “limits should autoscale by themselves.”
The actual current choices are:
- scale out runner count for more parallel capacity
- raise the static per-runner envelope for a lane
- introduce a heavier dedicated builder lane
- reduce workload concurrency inside the job
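The first choice, scaling out runner count, lives on the scale-set resource rather than in any pod cgroup. A sketch using the ARC AutoscalingRunnerSet CRD fields, with illustrative values:

```yaml
# Sketch: horizontal scaling knobs on an ARC scale set.
# More pods in parallel, but each pod keeps the same 8Gi cgroup cap.
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: tinyland-nix
spec:
  minRunners: 1    # illustrative value
  maxRunners: 6    # illustrative value
```

This is why scale-out helps throughput but cannot rescue a single job that needs more than one pod's envelope.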
The live honey evidence narrows that even further:
- `tinyland-nix` is still a small default lane with an `8Gi` hard cap
- live ARC config now includes the additive `tinyland-nix-heavy` lane
- current repo config and live runtime now model `tinyland-nix-heavy` as
  explicit stateless compute-expansion capacity:
  - target hostname `sting`
  - tolerate `dedicated.tinyland.dev/compute-expansion:NoSchedule`
- the compute-expansion node `sting` is currently excluded by taint unless a
  lane is designed for it on purpose
Platform Gap
The missing platform answer is not “does honey have enough RAM overall?”
The missing answer is:
- what memory envelope `tinyland-nix` is supposed to guarantee
- whether Rust-heavy lint/build lanes belong on default `tinyland-nix` or on a
  heavier builder lane
- whether the heavy lane needs additional placement or listener isolation
- what observability we have for runner-pod OOM and memory saturation
Recommended Follow-On Work
- audit live memory usage for Rust-heavy `tinyland-nix` jobs
- decide whether `8Gi` remains the intended baseline Nix envelope
- add or document placement policy if specific nodes should back Nix-heavy CI
- add operator-facing guidance that cluster-wide capacity is not the same as a runner pod limit
- decide whether heavy Rust/Clippy lanes need a separate additive builder class instead of silently relying on workflow parallelism caps
Current repo-owned operator surfaces added after this note:
- `just arc-runtime-audit` now also reports active runner-pod placement and a
  node-pressure snapshot when metrics are available
- `docs/runners/runbook.md` now includes a heavy-Nix validation procedure for
  `tinyland-nix-heavy` on `sting`
- `docs/runners/troubleshooting.md` now points recurring heavy Rust/Nix jobs at
  `tinyland-nix-heavy`
Live Audit Update 2026-04-17
A live ARC audit on honey refined the immediate platform read:
- `tinyland-nix-heavy` is live with the intended contract:
  - `cpu = 8`
  - `memory = 16Gi`
  - `nodeSelector["kubernetes.io/hostname"] = "sting"`
  - toleration for `dedicated.tinyland.dev/compute-expansion:NoSchedule`
- no heavy-lane job pod was active at audit time, so the heavy lane is still not proven under a real downstream workload from this audit alone
- the only active repo-owned Nix runner pods observed were baseline
  `tinyland-nix` pods on `honey`
- those baseline pods were not blocked by scheduler pressure or memory pressure
- the concrete live blocker was `ImagePullBackOff` while pulling
  `ghcr.io/tinyland-inc/actions-runner-nix:latest`
- the concrete pull error was `401 Unauthorized`
Meaning:
- the current live failure on the baseline Nix lane is image-pull auth, not memory starvation
- `#214` remains real because the baseline envelope is still small and the
  heavy lane still needs proof under load
- but the immediate operational blocker for active baseline Nix pods is GHCR
  auth drift, not lack of RAM on `honey` or `sting`
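The GHCR auth-drift class of failure is normally resolved by refreshing the image pull secret the runner pods reference. A hedged sketch of the relevant pod-spec wiring; the secret name is an assumption, not repo truth:

```yaml
# Sketch: imagePullSecrets reference on the runner pod template.
# The referenced kubernetes.io/dockerconfigjson secret must hold a
# currently valid GHCR token; a 401 Unauthorized on pull means the
# stored credential has drifted or expired.
spec:
  imagePullSecrets:
    - name: ghcr-pull-secret   # assumed secret name
  containers:
    - name: runner
      image: ghcr.io/tinyland-inc/actions-runner-nix:latest
```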
Heavy Canary Update 2026-04-17
After restoring GHCR pull auth and dispatching a real org-side validation run, the heavy lane crossed the next proof boundary:
- `tinyland-nix-heavy` accepted a real `Test ARC Runners` job
- the matching ephemeral runner pod landed on `sting`
- the job completed setup, checkout, Nix bootstrap, and cache-contract checks
The remaining failure was inside the workload:
- `nix build .#runner-dashboard` failed during `pnpm` fetches from
  `registry.npmjs.org`
- the concrete error was `UNABLE_TO_GET_ISSUER_CERT_LOCALLY`
Meaning:
- heavy-lane scheduling and GHCR pull auth are now proven
- the next runtime-integrity gap is certificate trust in the heavy Nix build path, not ARC placement or cluster memory pressure
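If the certificate-trust gap is confirmed, one common remediation shape is to mount a CA bundle into the runner pod and point Node-based tooling at it. This is a hedged sketch only: the mount path and ConfigMap name are assumptions, and whether an env var set on the runner container actually reaches a sandboxed `nix build` is itself part of the open question:

```yaml
# Sketch: surfacing an extra CA bundle to Node/pnpm inside the runner pod.
# NODE_EXTRA_CA_CERTS is the standard Node.js knob for additional trust roots.
spec:
  containers:
    - name: runner
      env:
        - name: NODE_EXTRA_CA_CERTS
          value: /etc/ssl/extra/ca-bundle.crt   # assumed mount path
      volumeMounts:
        - name: extra-ca
          mountPath: /etc/ssl/extra
          readOnly: true
  volumes:
    - name: extra-ca
      configMap:
        name: cluster-ca-bundle   # assumed ConfigMap name
```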
Non-Goal
Do not describe this as “honey is out of memory.”
The repo evidence points to runner-envelope ambiguity, not to a proven
cluster-wide memory shortage.