GPU Runners

Current Status (2026-04-22): GPU has one current product proof floor: the shared tinyland-nix-gpu ARC lane on honey. An older Jesssullivan/cmux repo-anchored smoke proof is retained as historical evidence only; it does not define the runner taxonomy.

The shared lane is intentionally narrower than broad GPU maturity. It is a host-device /dev/dri plus Vulkan floor, together with one bounded Dawn/WebGPU userspace proof floor covering minimal compute and offscreen render paths. It is not a general Dawn/WebGPU contract or a Kubernetes device-plugin contract.

The richer-runner proof floors are now landed history rather than open issue lanes:

  • #312 / TIN-330 set the KVM floor
  • #338 / TIN-377 landed the legacy cmux proof slice
  • #342 / TIN-379 landed the shared honey lane implementation
  • #347 / TIN-386 landed the first bounded Dawn/WebGPU userspace proof on that lane
  • #333 / TIN-371 fixed the broader advanced-runner matrix ordering

The old local NVIDIA-fabric idea now sits in future design context rather than active product backlog.

This page describes the honest current boundary:

  • there is one historical repo-anchored GPU canary on an authoritative branch: Jesssullivan/cmux GPU Smoke Test (Self-Hosted) run 24756928163 succeeded on main, in job GPU smoke test (cmux-nix), after building libghostty, building cmux-linux, running Zig tests, and completing the GPU smoke workload
  • that proof is explicit and bounded, but it is not the current product contract: shared labels such as tinyland-nix-gpu are the path forward
  • there is now one shared ARC GPU runner set on honey: tinyland-nix-gpu with host /dev/dri pass-through, root runner access, and a bounded Vulkan userspace canary in Test ARC Runners Soak
  • there is now one bounded Dawn/WebGPU userspace proof on that shared lane: Test ARC Runners Soak run 24811202201 completed successfully for branch commit 4381988 in job Test tinyland-nix-gpu on honey, including the step Verify Dawn/WebGPU userspace path
  • there is now one downstream default-branch canary on that shared lane: tinyland-inc/lab WebGPU Canary run 24811145561 completed successfully on main merge commit d2a9af2, including both the Verify Vulkan userspace path and Verify Dawn/WebGPU userspace path steps
  • that shared proof is still intentionally narrow: it proves the current host-device GPU floor on honey plus one minimal shared-lane Dawn/WebGPU compute and offscreen render userspace path, not general Dawn/WebGPU maturity and not a Kubernetes nvidia.com/gpu device-plugin contract
  • the older GitLab l40s / a100 module notes remain compatibility-shaped design context, not a current product claim
  • broader Dawn/WebGPU maturity and wider downstream adoption can still happen later, but the older local NVIDIA-fabric idea is no longer treated as active product backlog after the current bounded compute-plus-render proof floor

Sequencing

The current runner-class order is:

  1. KVM through the landed #312 / TIN-330 floor
  2. GPU / WebGPU / Dawn through the matrix owned by #333 / TIN-371 with #338 / TIN-377 as the legacy cmux proof slice, #342 / TIN-379 as the current shared tinyland-nix-gpu host-device lane on honey, #347 / TIN-386 as the first bounded shared-lane Dawn/WebGPU userspace proof, and tinyland-inc/lab#163 as the current downstream default-branch canary on that shared lane
  3. macOS through #335 / TIN-376 as historical bounded proof context, not as normal runner taxonomy
  4. riscv and other rarer execution lanes only after the earlier lanes are materially stronger
  5. broader cross-forge parity only after a GitHub-first runtime pattern is repeatable

Historical Repo-Anchored Proof

  • repo: Jesssullivan/cmux
  • workflow: GPU Smoke Test (Self-Hosted)
  • run: 24756928163 on main commit f64a777
  • job: GPU smoke test (cmux-nix) 72432003462
  • gate: GitHub environment gpu-tests

This counted as real bounded evidence at the time because:

  • the workflow runs on an authoritative branch
  • it uses a named self-hosted GPU-capable lane
  • it performs real build work before the runtime smoke: Build libghostty (Nix), Build cmux-linux (Nix), and Test config parser (Nix)
  • it then completes an actual GPU smoke test step rather than only emitting host metadata

Current Shared Lane

  • lane: tinyland-nix-gpu
  • stack owner: tofu/stacks/arc-runners/honey.tfvars
  • proof workflow: Test ARC Runners Soak
  • proof job: Test tinyland-nix-gpu on honey
  • host contract: mount host /dev/dri, require renderD128, and verify vulkaninfo --summary sees a discrete GPU through the shared lane

This counts as real shared proof because:

  • the lane is owned by GloriousFlywheel ARC config, not only by one repo
  • it is pinned to the real current hardware floor on honey
  • the soak job proves device visibility and Vulkan userspace instead of only claiming that a GPU-shaped node exists
  • the same soak workflow now also proves one bounded Dawn/WebGPU userspace path through real adapter enumeration, a minimal compute submission, and a minimal offscreen render submission
  • it still avoids overstating current reality as a mature cluster-wide nvidia.com/gpu or broader Dawn/WebGPU contract
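
The Vulkan half of the host contract above can be approximated with a small shell check. This is a minimal sketch, not the actual soak job: has_discrete_gpu is a hypothetical helper name, and it assumes the summary output reports the device type as PHYSICAL_DEVICE_TYPE_DISCRETE_GPU on a deviceType line.

```shell
# Sketch only: has_discrete_gpu is a hypothetical helper, not taken from the soak workflow.
# Reads `vulkaninfo --summary` text on stdin and succeeds if any listed device
# reports the discrete-GPU device type.
has_discrete_gpu() {
  grep -Eq 'deviceType[[:space:]]*=[[:space:]]*PHYSICAL_DEVICE_TYPE_DISCRETE_GPU'
}

# Example use inside a runner job step:
#   vulkaninfo --summary | has_discrete_gpu || { echo "no discrete GPU visible" >&2; exit 1; }
```

Reading from stdin keeps the check independent of where the summary text comes from, so the same function can be exercised against saved output as well as a live runner.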

Current Downstream Default-Branch Canary

  • repo: tinyland-inc/lab
  • workflow: WebGPU Canary
  • PR: tinyland-inc/lab#163
  • default-branch proof: run 24811145561 on main merge commit d2a9af2
  • branch proof: PR run 24811086650 on branch commit a243687
  • shared label contract: runs-on: tinyland-nix-gpu

This counts as the first real downstream default-branch canary because:

  • it runs on an authoritative downstream main branch rather than only in GloriousFlywheel soak
  • it uses the shared org lane directly instead of a repo-scoped GPU runner
  • it proves /dev/dri, Vulkan userspace, and Dawn/WebGPU userspace in one repo-local workflow
  • the same workflow also passed on the PR head before merge, so the default branch result is not a one-off post-merge surprise
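
A downstream repo could guard the shared label contract with a small lint over its workflow file. This is a minimal sketch under stated assumptions: uses_shared_gpu_label is a hypothetical helper, the workflow path in the usage comment is illustrative, and the check only recognizes the label written as a plain string after runs-on.

```shell
# Sketch only: uses_shared_gpu_label is a hypothetical lint helper.
# Reads workflow YAML on stdin and succeeds if some line sets
# runs-on to the shared tinyland-nix-gpu label as a plain string.
uses_shared_gpu_label() {
  grep -Eq '^[[:space:]]*runs-on:[[:space:]]*tinyland-nix-gpu[[:space:]]*$'
}

# Example use (the path is an assumption, not taken from tinyland-inc/lab):
#   uses_shared_gpu_label < .github/workflows/webgpu-canary.yml
```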

What Would Count As Stronger GPU Proof

  • a real GitHub Actions workload proving runtime scheduling on that shared lane
  • clear trace-extraction, cache, cleanup, and cold-start behavior
  • broader graphics/runtime coverage than the current bounded compute-plus-offscreen-render userspace proof
  • a second downstream default-branch canary beyond tinyland-inc/lab
  • operator docs that describe the real proven path instead of only infrastructure prerequisites or legacy GitLab configuration

Current Next Move

The next GPU work is bounded as follows:

  • wider downstream adoption can continue on the shared lane
  • the older local NVIDIA-fabric idea stays future design context unless a new product requirement explicitly revives it

That work should stay narrower than broad GPU maturity claims. The goal is to keep the current shared /dev/dri plus bounded userspace proof floor honest before claiming more about Dawn, WebGPU, or cluster-wide GPU maturity.

Compatibility Notes

The legacy GitLab runner module supports two GPU-oriented runner types: l40s and a100. Keep that surface in the repo as compatibility-only design context, not as evidence that GloriousFlywheel currently has a mature GPU product lane.

Current Shared-Lane Prerequisites

The current shared ARC lane depends on:

  1. one honey node with host /dev/dri
  2. working host Vulkan userspace on that node
  3. ARC runner config that mounts /dev/dri into the runner container
  4. a bounded GitHub Actions canary that proves device visibility from inside the runner pod
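
The device-visibility prerequisite can be sketched as a check run from inside the runner container. This is a minimal sketch, not the real canary step: has_render_node is a hypothetical helper, and its defaults simply follow the /dev/dri and renderD128 names used on this page.

```shell
# Sketch only: has_render_node is a hypothetical helper.
# Usage: has_render_node [DIR] [NODE]
# Succeeds if DIR/NODE exists and is a character device.
# Defaults follow this page: /dev/dri and renderD128.
has_render_node() {
  dir="${1:-/dev/dri}"
  node="${2:-renderD128}"
  [ -c "$dir/$node" ]
}

# Example use inside a runner job step:
#   has_render_node || { echo "/dev/dri/renderD128 not visible in runner" >&2; exit 1; }
```

Testing for a character device rather than plain existence catches the case where the path is present but is an empty directory or regular file left by a bad mount.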

Legacy NVIDIA Prerequisites

  1. NVIDIA GPU Operator installed on the cluster
  2. GPU node pool with appropriate labels and taints
  3. Nodes visible via kubectl get nodes -l accelerator=nvidia

Legacy Compatibility Types

Type | GPU         | Architecture | VRAM     | Use Case
l40s | NVIDIA L40S | Ada Lovelace | 48 GB    | Inference, fine-tuning, rendering
a100 | NVIDIA A100 | Ampere       | 40/80 GB | Training, large model inference

Both types default to nvidia/cuda:12.4-devel-ubuntu22.04 and run in privileged mode for GPU device access.

Legacy Module Configuration

module "gpu_runner" {
  source = "../../modules/gitlab-runner"

  runner_name  = "gpu-l40s"
  runner_type  = "l40s"
  runner_token = var.l40s_runner_token
  namespace    = "gitlab-runners"

  # GPU configuration
  gpu_count         = 1
  gpu_resource_name = "nvidia.com/gpu"

  # Node selector is cluster-specific; adjust label to match your GPU node pool
  gpu_node_selector = {
    "accelerator" = "nvidia"
  }

  gpu_tolerations = [{
    key      = "nvidia.com/gpu"
    operator = "Exists"
    value    = null
    effect   = "NoSchedule"
  }]

  # Scale conservatively: GPU nodes are expensive
  concurrent_jobs  = 2
  hpa_enabled      = true
  hpa_min_replicas = 1
  hpa_max_replicas = 2
}

Legacy GitLab Job Example

train-model:
  tags: [gpu, nvidia, cuda, l40s]
  image: nvidia/cuda:12.4-devel-ubuntu22.04
  script:
    - nvidia-smi
    - python train.py --epochs 10

How The Legacy Module Allocates GPUs

The module injects a pod_spec strategic merge patch into the runner TOML config. This adds nvidia.com/gpu resource requests/limits to the build container in each CI job pod. The Kubernetes scheduler then places job pods on nodes with available GPUs.

Legacy Environment Variables

Legacy GitLab GPU runners automatically inject:

  • NVIDIA_VISIBLE_DEVICES=all — expose all GPUs to the container
  • NVIDIA_DRIVER_CAPABILITIES=compute,utility — enable compute + nvidia-smi
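
A legacy job script could confirm these variables before starting GPU work. This is a minimal sketch: has_driver_capability is a hypothetical helper, and it assumes the value is either a comma-separated list or the special value all, which the NVIDIA container runtime also accepts.

```shell
# Sketch only: has_driver_capability is a hypothetical preflight helper.
# Usage: has_driver_capability CAPABILITY [LIST]
# LIST defaults to $NVIDIA_DRIVER_CAPABILITIES; the value "all" satisfies any capability.
has_driver_capability() {
  want="$1"
  list="${2-${NVIDIA_DRIVER_CAPABILITIES-}}"
  case ",$list," in
    *,all,*|*",$want,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use at the top of a job script:
#   has_driver_capability compute || { echo "compute capability not enabled" >&2; exit 1; }
```

Wrapping the list in commas before matching avoids false positives on partial names, so a capability called compute would not match a hypothetical entry such as compute2.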

Troubleshooting

Shared ARC lane cannot see /dev/dri

Check the host device surface directly on honey:

ssh honey 'ls -ld /dev/dri /dev/dri/*'
ssh honey 'vulkaninfo --summary | grep -E "deviceName|deviceType|driverName|driverInfo"'

Job pod stuck in Pending

Check that GPU nodes are available and the NVIDIA device plugin is running:

kubectl get nodes -l accelerator=nvidia
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

nvidia-smi not found

Ensure the NVIDIA GPU Operator is installed and the driver container is running on GPU nodes:

kubectl get pods -n gpu-operator

Wrong number of GPUs

Adjust gpu_count in the module configuration. Each job pod requests exactly that many GPUs.
