GloriousFlywheel Honey Runner Memory Envelope 2026-04-16

Snapshot date: 2026-04-16

Purpose

Capture a second platform-level runner signal from downstream dogfooding: memory-pressure failures on honey are currently easier to trigger than the cluster-wide hardware footprint suggests.

This note separates:

  1. aggregate cluster capacity
  2. per-runner pod limits
  3. placement and scheduling behavior

Triggering Evidence

Downstream signal:

  • a Rust CI lane on honey reported clippy being SIGKILLed from memory pressure rather than failing on a lint issue
  • the downstream repo mitigated it in workflow code by capping clippy parallelism

That mitigation may be acceptable in the downstream repo, but it exposes a GloriousFlywheel runner-envelope question.
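A downstream mitigation of that shape might look like the following (a hypothetical sketch: the job name, runner label usage, and the chosen parallelism value are illustrative, not a copy of the downstream repo's workflow):

```yaml
# Hypothetical GitHub Actions fragment: cap clippy/rustc parallelism so peak
# memory stays inside the runner pod's envelope. The "2" is an illustrative
# guess, not the downstream repo's actual setting.
jobs:
  lint:
    runs-on: tinyland-nix
    steps:
      - uses: actions/checkout@v4
      - name: clippy (capped parallelism)
        run: cargo clippy --all-targets -- -D warnings
        env:
          CARGO_BUILD_JOBS: "2"   # fewer concurrent rustc processes, lower peak RSS
```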

Live Cluster Evidence

Live honey cluster read on 2026-04-16:

  • allocatable node envelopes:
    • honey: 32 CPU, 230483600Ki memory (≈220Gi)
    • sting: 32 CPU, 57332656Ki memory (≈55Gi)
    • bumble: 4 CPU, 16048956Ki memory (≈15Gi)
  • live node usage from kubectl top nodes:
    • honey: 1754m CPU, 92193Mi memory, 40% memory usage
    • sting: 193m CPU, 3509Mi memory, 6% memory usage
    • bumble: 100m CPU, 5828Mi memory, 37% memory usage
  • sting currently carries taint dedicated.tinyland.dev/compute-expansion:NoSchedule
  • live ARC listeners for tinyland-nix, tinyland-docker, tinyland-dind, personal-nix, personal-docker, linux-xr-docker, and tinyland-nix-heavy are all on honey
  • live repo-owned AutoscalingRunnerSet/tinyland-nix now uses:
    • limits.cpu = "4"
    • limits.memory = "8Gi"
    • normalized ATTIC_SERVER = http://attic.nix-cache.svc.cluster.local
  • live repo-owned AutoscalingRunnerSet/tinyland-nix-heavy now uses:
    • limits.cpu = "8"
    • limits.memory = "16Gi"
    • nodeSelector["kubernetes.io/hostname"] = "sting"
    • toleration for dedicated.tinyland.dev/compute-expansion:NoSchedule
  • live personal AutoscalingRunnerSet/personal-nix still uses:
    • githubConfigUrl = https://github.com/jesssullivan/jesssullivan.github.io
    • githubConfigSecret = github-personal-secret
    • stale ATTIC_SERVER = http://attic-api.nix-cache.svc:8080

Meaning:

  • this is not a cluster-wide memory shortage on honey
  • it is a per-runner envelope and runtime-drift problem
  • sting is not part of the default scheduling surface unless GloriousFlywheel explicitly adds tolerations or a stronger placement contract

Current Repo Truth

ARC Nix Runner Envelope

Current committed arc-runners baseline for the Nix runner lane:

  • tofu/stacks/arc-runners/dev.tfvars
  • tofu/stacks/arc-runners/dev-policy.tfvars

Current values:

  • nix_cpu_limit = "4"
  • nix_memory_limit = "8Gi"

The stack defaults also align with this envelope:

  • tofu/stacks/arc-runners/variables.tf
  • nix_memory_limit default: 8Gi

ARC applies those limits directly to the runner container:

  • tofu/modules/arc-runner/main.tf
  • resources.requests.memory = var.memory_request
  • resources.limits.memory = var.memory_limit
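The shape of that wiring, sketched in HCL (variable names follow the bullets above; the surrounding structure is a simplified assumption, not a copy of tofu/modules/arc-runner/main.tf):

```hcl
# Simplified sketch of how the runner container envelope is expressed.
# var.memory_request / var.memory_limit are the module inputs named above;
# the locals structure here is illustrative only.
variable "memory_request" { default = "8Gi" }
variable "memory_limit"   { default = "8Gi" }

locals {
  runner_resources = {
    requests = { memory = var.memory_request }
    limits   = { memory = var.memory_limit } # hard cgroup cap: the pod is OOM-killed above this
  }
}
```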

Meaning:

  • tinyland-nix does not inherit “all memory on honey”
  • it inherits a per-runner cgroup envelope
  • a Rust workload can be OOM-killed inside that envelope even when the cluster still has abundant free RAM

Placement Truth

The ARC runner module supports:

  • node_selector
  • tolerations

But the baseline tinyland-nix stack path does not currently express an explicit node-placement contract for the default Nix runner lane.
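If the baseline lane ever needs a placement contract, those module inputs suggest it would be expressed roughly like this (a sketch: only node_selector and tolerations are known module inputs; the hostname and taint key mirror the live sting evidence above):

```hcl
# Hypothetical stack-level values pinning a lane to a node and tolerating
# its dedicated taint. Not current repo truth for tinyland-nix.
node_selector = {
  "kubernetes.io/hostname" = "sting"
}

tolerations = [{
  key      = "dedicated.tinyland.dev/compute-expansion"
  operator = "Exists"
  effect   = "NoSchedule"
}]
```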

Meaning:

  • the cluster may have large aggregate capacity
  • but the effective runtime still depends on where the runner pod lands
  • no current repo truth guarantees even distribution across all honey nodes

Quota Truth

The docs currently advertise a shared namespace quota of:

  • 16 CPU requests
  • 32Gi memory requests
  • 50 pods

That quota is about namespace request accounting, not about the maximum memory available to one runner pod.
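For reference, the advertised quota corresponds to a ResourceQuota of roughly this shape (reconstructed from the documented numbers; the object name and namespace are placeholders):

```yaml
# Hypothetical ResourceQuota matching the documented namespace budget.
# Note it bounds the SUM of requests across all pods in the namespace,
# not the limit available to any single runner pod.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: arc-runners-quota    # placeholder name
  namespace: arc-runners     # placeholder namespace
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    pods: "50"
```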

Meaning:

  • namespace quota does not make an 8Gi runner pod behave like a much larger memory machine
  • HPA and runner-count behavior also do not change the single-pod cgroup limit

Current Read

Given current repo truth, a clippy SIGKILL on tinyland-nix is plausible, not anomalous.

It is compatible with the current platform if any of these are true:

  • the Rust workload exceeds 8Gi under current parallelism
  • multiple rustc/clippy processes spike within one runner pod
  • the job lands on a node under local contention
  • the downstream workflow assumes cluster-wide capacity instead of the runner pod’s actual cgroup

Autoscaling Clarification

In the current GloriousFlywheel ARC model, autoscaling does exist, but it is horizontal:

  • ARC can raise or lower the number of runner pods through scale-set behavior
  • ARC does not automatically resize cpu_limit or memory_limit for one runner pod based on workload demand
  • namespace quota and scale-set growth also do not override one pod’s cgroup
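The distinction is visible in the scale-set shape itself (a schematic AutoscalingRunnerSet fragment; the min/max values are illustrative, not the live settings):

```yaml
# Horizontal knobs scale pod COUNT; the per-pod cgroup cap is static.
spec:
  minRunners: 0
  maxRunners: 5                # illustrative scaling range
  template:
    spec:
      containers:
        - name: runner
          resources:
            limits:
              memory: 8Gi      # fixed per-pod envelope; never auto-resized by ARC
```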

So the cloud-native answer is not “limits should autoscale by themselves.”

The actual current choices are:

  1. scale out runner count for more parallel capacity
  2. raise the static per-runner envelope for a lane
  3. introduce a heavier dedicated builder lane
  4. reduce workload concurrency inside the job
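Choice 2, for example, would be a small tfvars change in the arc-runners stack (the raised value here is illustrative, not a recommendation):

```hcl
# tofu/stacks/arc-runners/dev.tfvars — hypothetical raised baseline envelope.
nix_cpu_limit    = "4"
nix_memory_limit = "12Gi"   # illustrative bump from the current 8Gi baseline
```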

The live honey evidence narrows that even further:

  • tinyland-nix is still a small default lane with an 8Gi hard cap
  • live ARC config now includes the additive tinyland-nix-heavy lane
  • current repo config and live runtime now model tinyland-nix-heavy as explicit stateless compute-expansion capacity:
    • target hostname sting
    • tolerate dedicated.tinyland.dev/compute-expansion:NoSchedule
  • the compute-expansion node sting is currently excluded by taint unless a lane is designed for it on purpose
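Rendered onto the runner pod, that compute-expansion contract looks roughly like this (schematic: field paths are standard Kubernetes pod spec, values mirror the live evidence above):

```yaml
# Schematic pod-spec fragment for the tinyland-nix-heavy lane on sting.
spec:
  nodeSelector:
    kubernetes.io/hostname: sting
  tolerations:
    - key: dedicated.tinyland.dev/compute-expansion
      operator: Exists
      effect: NoSchedule
  containers:
    - name: runner
      resources:
        limits:
          cpu: "8"
          memory: 16Gi
```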

Platform Gap

The missing platform answer is not “does honey have enough RAM overall?”

The missing answer is:

  • what memory envelope tinyland-nix is supposed to guarantee
  • whether Rust-heavy lint/build lanes belong on default tinyland-nix or on a heavier builder lane
  • whether the heavy lane needs additional placement or listener isolation
  • what observability we have for runner-pod OOM and memory saturation

Next actions:

  1. audit live memory usage for Rust-heavy tinyland-nix jobs
  2. decide whether 8Gi remains the intended baseline Nix envelope
  3. add or document placement policy if specific nodes should back Nix-heavy CI
  4. add operator-facing guidance that cluster-wide capacity is not the same as a runner pod limit
  5. decide whether heavy Rust/Clippy lanes need a separate additive builder class instead of silently relying on workflow parallelism caps

Current repo-owned operator surfaces added after this note:

  • just arc-runtime-audit now also reports active runner-pod placement and a node-pressure snapshot when metrics are available
  • docs/runners/runbook.md now includes a heavy-Nix validation procedure for tinyland-nix-heavy on sting
  • docs/runners/troubleshooting.md now points recurring heavy Rust/Nix jobs at tinyland-nix-heavy

Live Audit Update 2026-04-17

A live ARC audit on honey refined the immediate platform read:

  • tinyland-nix-heavy is live with the intended contract:
    • cpu = 8
    • memory = 16Gi
    • nodeSelector["kubernetes.io/hostname"] = "sting"
    • toleration for dedicated.tinyland.dev/compute-expansion:NoSchedule
  • no heavy-lane job pod was active at audit time, so the heavy lane is still not proven under a real downstream workload from this audit alone
  • the only active repo-owned Nix runner pods observed were baseline tinyland-nix pods on honey
  • those baseline pods were not blocked by scheduler pressure or memory pressure
  • the concrete live blocker was ImagePullBackOff while pulling: ghcr.io/tinyland-inc/actions-runner-nix:latest
  • the concrete pull error was 401 Unauthorized

Meaning:

  • the current live failure on the baseline Nix lane is image-pull auth, not memory starvation
  • #214 remains real because the baseline envelope is still small and the heavy lane still needs proof under load
  • but the immediate operational blocker for active baseline Nix pods is GHCR auth drift, not lack of RAM on honey or sting

Heavy Canary Update 2026-04-17

After restoring GHCR pull auth and dispatching a real org-side validation run, the heavy lane crossed the next proof boundary:

  • tinyland-nix-heavy accepted a real Test ARC Runners job
  • the matching ephemeral runner pod landed on sting
  • the job completed setup, checkout, Nix bootstrap, and cache-contract checks

The remaining failure was inside the workload:

  • nix build .#runner-dashboard failed during pnpm fetches from registry.npmjs.org
  • the concrete error was UNABLE_TO_GET_ISSUER_CERT_LOCALLY

Meaning:

  • heavy-lane scheduling and GHCR pull auth are now proven
  • the next runtime-integrity gap is certificate trust in the heavy Nix build path, not ARC placement or cluster memory pressure

Non-Goal

Do not describe this as “honey is out of memory.”

The repo evidence points to runner-envelope ambiguity, not to a proven cluster-wide memory shortage.

GloriousFlywheel