GloriousFlywheel Honey Runner Memory Envelope 2026-04-16

Snapshot date: 2026-04-16

Purpose

Capture a second platform-level runner signal from downstream dogfooding: memory-pressure failures on honey are currently easier to trigger than the cluster-wide hardware footprint suggests.

This note separates:

  1. aggregate cluster capacity
  2. per-runner pod limits
  3. placement and scheduling behavior

Triggering Evidence

Downstream signal:

  • a Rust CI lane on honey reported clippy being SIGKILLed from memory pressure rather than failing on a lint issue
  • the downstream repo mitigated it in workflow code by capping clippy parallelism

That mitigation may be acceptable in the downstream repo, but it exposes a GloriousFlywheel runner-envelope question.
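A downstream mitigation of that shape might look like the following (a hypothetical sketch: the job name, runner label usage, and the chosen parallelism value are illustrative, not a copy of the downstream repo's workflow):

```yaml
# Hypothetical GitHub Actions fragment: cap clippy/rustc parallelism so peak
# memory stays inside the runner pod's envelope. The "2" is an illustrative
# guess, not the downstream repo's actual setting.
jobs:
  lint:
    runs-on: tinyland-nix
    steps:
      - uses: actions/checkout@v4
      - name: clippy (capped parallelism)
        run: cargo clippy --all-targets -- -D warnings
        env:
          CARGO_BUILD_JOBS: "2"   # fewer concurrent rustc processes, lower peak RSS
```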

Live Cluster Evidence

Live honey cluster read on 2026-04-16:

  • allocatable node envelopes:
    • honey: 32 CPU, 230483600Ki memory (≈220Gi)
    • sting: 32 CPU, 57332656Ki memory (≈55Gi)
    • bumble: 4 CPU, 16048956Ki memory (≈15Gi)
  • live node usage from kubectl top nodes:
    • honey: 1754m CPU, 92193Mi memory, 40% memory usage
    • sting: 193m CPU, 3509Mi memory, 6% memory usage
    • bumble: 100m CPU, 5828Mi memory, 37% memory usage
  • sting currently carries taint dedicated.tinyland.dev/compute-expansion:NoSchedule
  • live ARC listeners for tinyland-nix, tinyland-docker, tinyland-dind, personal-nix, personal-docker, linux-xr-docker, and tinyland-nix-heavy are all on honey
  • live repo-owned AutoscalingRunnerSet/tinyland-nix now uses:
    • limits.cpu = "4"
    • limits.memory = "8Gi"
    • normalized ATTIC_SERVER = http://attic.nix-cache.svc.cluster.local
  • live repo-owned AutoscalingRunnerSet/tinyland-nix-heavy now uses:
    • limits.cpu = "8"
    • limits.memory = "16Gi"
    • nodeSelector["kubernetes.io/hostname"] = "sting"
    • toleration for dedicated.tinyland.dev/compute-expansion:NoSchedule
  • live personal AutoscalingRunnerSet/personal-nix still uses:
    • githubConfigUrl = https://github.com/jesssullivan/jesssullivan.github.io
    • githubConfigSecret = github-personal-secret
    • stale ATTIC_SERVER = http://attic-api.nix-cache.svc:8080

Meaning:

  • this is not a cluster-wide memory shortage on honey
  • it is a per-runner envelope and runtime-drift problem
  • sting is not part of the default scheduling surface unless GloriousFlywheel explicitly adds tolerations or a stronger placement contract

Current Repo Truth

ARC Nix Runner Envelope

Current committed arc-runners baseline for the Nix runner lane:

  • tofu/stacks/arc-runners/dev.tfvars
  • tofu/stacks/arc-runners/dev-policy.tfvars

Current values:

  • nix_cpu_limit = "4"
  • nix_memory_limit = "8Gi"

The stack defaults also align with this envelope:

  • tofu/stacks/arc-runners/variables.tf
  • nix_memory_limit default: 8Gi

ARC applies those limits directly to the runner container:

  • tofu/modules/arc-runner/main.tf
  • resources.requests.memory = var.memory_request
  • resources.limits.memory = var.memory_limit
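The shape of that wiring, sketched in HCL (variable names follow the bullets above; the surrounding structure is a simplified assumption, not a copy of tofu/modules/arc-runner/main.tf):

```hcl
# Simplified sketch of how the runner container envelope is expressed.
# var.memory_request / var.memory_limit are the module inputs named above;
# the locals structure here is illustrative only.
variable "memory_request" { default = "8Gi" }
variable "memory_limit"   { default = "8Gi" }

locals {
  runner_resources = {
    requests = { memory = var.memory_request }
    limits   = { memory = var.memory_limit } # hard cgroup cap: the pod is OOM-killed above this
  }
}
```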

Meaning:

  • tinyland-nix does not inherit “all memory on honey”
  • it inherits a per-runner cgroup envelope
  • a Rust workload can be OOM-killed inside that envelope even when the cluster still has abundant free RAM

Placement Truth

The ARC runner module supports:

  • node_selector
  • tolerations

But the baseline tinyland-nix stack path does not currently express an explicit node-placement contract for the default Nix runner lane.
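If the baseline lane ever needs a placement contract, those module inputs suggest it would be expressed roughly like this (a sketch: only node_selector and tolerations are known module inputs; the hostname and taint key mirror the live sting evidence above):

```hcl
# Hypothetical stack-level values pinning a lane to a node and tolerating
# its dedicated taint. Not current repo truth for tinyland-nix.
node_selector = {
  "kubernetes.io/hostname" = "sting"
}

tolerations = [{
  key      = "dedicated.tinyland.dev/compute-expansion"
  operator = "Exists"
  effect   = "NoSchedule"
}]
```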

Meaning:

  • the cluster may have large aggregate capacity
  • but the effective runtime still depends on where the runner pod lands
  • no current repo truth guarantees even distribution across all honey nodes

Quota Truth

The docs currently advertise a shared namespace quota of:

  • 16 CPU requests
  • 32Gi memory requests
  • 50 pods

That quota is about namespace request accounting, not about the maximum memory available to one runner pod.
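For reference, the advertised quota corresponds to a ResourceQuota of roughly this shape (reconstructed from the documented numbers; the object name and namespace are placeholders):

```yaml
# Hypothetical ResourceQuota matching the documented namespace budget.
# Note it bounds the SUM of requests across all pods in the namespace,
# not the limit available to any single runner pod.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: arc-runners-quota    # placeholder name
  namespace: arc-runners     # placeholder namespace
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    pods: "50"
```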

Meaning:

  • namespace quota does not make an 8Gi runner pod behave like a much larger memory machine
  • HPA and runner-count behavior also do not change the single-pod cgroup limit

Current Read

Given current repo truth, a clippy SIGKILL on tinyland-nix is plausible, not anomalous.

It is compatible with the current platform if any of these are true:

  • the Rust workload exceeds 8Gi under current parallelism
  • multiple rustc/clippy processes spike within one runner pod
  • the job lands on a node under local contention
  • the downstream workflow assumes cluster-wide capacity instead of the runner pod’s actual cgroup

Autoscaling Clarification

In the current GloriousFlywheel ARC model, autoscaling does exist, but it is horizontal:

  • ARC can raise or lower the number of runner pods through scale-set behavior
  • ARC does not automatically resize cpu_limit or memory_limit for one runner pod based on workload demand
  • namespace quota and scale-set growth also do not override one pod’s cgroup
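The distinction is visible in the scale-set shape itself (a schematic AutoscalingRunnerSet fragment; the min/max values are illustrative, not the live settings):

```yaml
# Horizontal knobs scale pod COUNT; the per-pod cgroup cap is static.
spec:
  minRunners: 0
  maxRunners: 5                # illustrative scaling range
  template:
    spec:
      containers:
        - name: runner
          resources:
            limits:
              memory: 8Gi      # fixed per-pod envelope; never auto-resized by ARC
```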

So the cloud-native answer is not “limits should autoscale by themselves.”

The actual current choices are:

  1. scale out runner count for more parallel capacity
  2. raise the static per-runner envelope for a lane
  3. introduce a heavier dedicated builder lane
  4. reduce workload concurrency inside the job
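Choice 2, for example, would be a small tfvars change in the arc-runners stack (the raised value here is illustrative, not a recommendation):

```hcl
# tofu/stacks/arc-runners/dev.tfvars — hypothetical raised baseline envelope.
nix_cpu_limit    = "4"
nix_memory_limit = "12Gi"   # illustrative bump from the current 8Gi baseline
```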

The live honey evidence narrows that even further:

  • tinyland-nix is still a small default lane with an 8Gi hard cap
  • live ARC config now includes the additive tinyland-nix-heavy lane
  • current repo config and live runtime now model tinyland-nix-heavy as explicit stateless compute-expansion capacity:
    • target hostname sting
    • tolerate dedicated.tinyland.dev/compute-expansion:NoSchedule
  • the compute-expansion node sting is currently excluded by taint unless a lane is designed for it on purpose
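Rendered onto the runner pod, that compute-expansion contract looks roughly like this (schematic: field paths are standard Kubernetes pod spec, values mirror the live evidence above):

```yaml
# Schematic pod-spec fragment for the tinyland-nix-heavy lane on sting.
spec:
  nodeSelector:
    kubernetes.io/hostname: sting
  tolerations:
    - key: dedicated.tinyland.dev/compute-expansion
      operator: Exists
      effect: NoSchedule
  containers:
    - name: runner
      resources:
        limits:
          cpu: "8"
          memory: 16Gi
```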

Platform Gap

The missing platform answer is not “does honey have enough RAM overall?”

The missing answer is:

  • what memory envelope tinyland-nix is supposed to guarantee
  • whether Rust-heavy lint/build lanes belong on default tinyland-nix or on a heavier builder lane
  • whether the heavy lane needs additional placement or listener isolation
  • what observability we have for runner-pod OOM and memory saturation

Next actions:

  1. audit live memory usage for Rust-heavy tinyland-nix jobs
  2. decide whether 8Gi remains the intended baseline Nix envelope
  3. add or document placement policy if specific nodes should back Nix-heavy CI
  4. add operator-facing guidance that cluster-wide capacity is not the same as a runner pod limit
  5. decide whether heavy Rust/Clippy lanes need a separate additive builder class instead of silently relying on workflow parallelism caps

Current repo-owned operator surfaces added after this note:

  • just arc-runtime-audit now also reports active runner-pod placement and a node-pressure snapshot when metrics are available
  • docs/runners/runbook.md now includes a heavy-Nix validation procedure for tinyland-nix-heavy on sting
  • docs/runners/troubleshooting.md now points recurring heavy Rust/Nix jobs at tinyland-nix-heavy

Live Audit Update 2026-04-17

A live ARC audit on honey refined the immediate platform read:

  • tinyland-nix-heavy is live with the intended contract:
    • cpu = 8
    • memory = 16Gi
    • nodeSelector["kubernetes.io/hostname"] = "sting"
    • toleration for dedicated.tinyland.dev/compute-expansion:NoSchedule
  • no heavy-lane job pod was active at audit time, so the heavy lane is still not proven under a real downstream workload from this audit alone
  • the only active repo-owned Nix runner pods observed were baseline tinyland-nix pods on honey
  • those baseline pods were not blocked by scheduler pressure or memory pressure
  • the concrete live blocker was ImagePullBackOff while pulling: ghcr.io/tinyland-inc/actions-runner-nix:latest
  • the concrete pull error was 401 Unauthorized

Meaning:

  • the current live failure on the baseline Nix lane is image-pull auth, not memory starvation
  • #214 remains real because the baseline envelope is still small and the heavy lane still needs proof under load
  • but the immediate operational blocker for active baseline Nix pods is GHCR auth drift, not lack of RAM on honey or sting

Heavy Canary Update 2026-04-17

After restoring GHCR pull auth and dispatching a real org-side validation run, the heavy lane crossed the next proof boundary:

  • tinyland-nix-heavy accepted a real Test ARC Runners job
  • the matching ephemeral runner pod landed on sting
  • the job completed setup, checkout, Nix bootstrap, and cache-contract checks

The remaining failure was inside the workload:

  • nix build .#runner-dashboard failed during pnpm fetches from registry.npmjs.org
  • the concrete error was UNABLE_TO_GET_ISSUER_CERT_LOCALLY

Meaning:

  • heavy-lane scheduling and GHCR pull auth are now proven
  • the next runtime-integrity gap is certificate trust in the heavy Nix build path, not ARC placement or cluster memory pressure

Non-Goal

Do not describe this as “honey is out of memory.”

The repo evidence points to runner-envelope ambiguity, not to a proven cluster-wide memory shortage.

GloriousFlywheel