GPU Runners
Current Status (2026-04-22): GPU now has one current product proof floor: the shared tinyland-nix-gpu ARC lane on honey. An older Jesssullivan/cmux repo-anchored smoke proof is retained as historical evidence only; it does not define the runner taxonomy. The shared lane is intentionally narrower than broad GPU maturity: it is a host-device /dev/dri + Vulkan floor plus one bounded Dawn/WebGPU userspace proof floor covering minimal compute and offscreen render paths, not a general Dawn/WebGPU or Kubernetes device-plugin contract.
The richer-runner proof floors are now landed history rather than open issue lanes: #312 / TIN-330 set the KVM floor, #338 / TIN-377 landed the legacy cmux proof slice, #342 / TIN-379 landed the shared honey lane implementation, #347 / TIN-386 landed the first bounded Dawn/WebGPU userspace proof on that lane, and #333 / TIN-371 fixed the broader advanced-runner matrix ordering. The old local NVIDIA-fabric idea now sits in future design context rather than active product backlog.
This page describes the honest current boundary:
- there is one historical repo-anchored GPU canary on an authoritative branch: Jesssullivan/cmux workflow GPU Smoke Test (Self-Hosted), run 24756928163, succeeded on main in job GPU smoke test (cmux-nix) after building libghostty, building cmux-linux, running Zig tests, and completing the GPU smoke workload
- that proof is explicit and bounded, but it is not the current product contract: shared labels such as tinyland-nix-gpu are the path forward
- there is now one shared ARC GPU runner set on honey: tinyland-nix-gpu with host /dev/dri pass-through, root runner access, and a bounded Vulkan userspace canary in Test ARC Runners Soak
- there is now one bounded Dawn/WebGPU userspace proof on that shared lane: Test ARC Runners Soak run 24811202201 completed with status success for branch commit 4381988 in job Test tinyland-nix-gpu on honey, including the step Verify Dawn/WebGPU userspace path
- there is now one downstream default-branch canary on that shared lane: tinyland-inc/lab workflow WebGPU Canary, run 24811145561, completed with status success on main merge commit d2a9af2, including both the Verify Vulkan userspace path and Verify Dawn/WebGPU userspace path steps
- that shared proof is still intentionally narrow: it proves the current host-device GPU floor on honey plus one minimal shared-lane Dawn/WebGPU compute and offscreen render userspace path, not general Dawn/WebGPU maturity and not a Kubernetes nvidia.com/gpu device-plugin contract
- the older GitLab l40s / a100 module notes remain compatibility-shaped design context, not a current product claim
- broader Dawn/WebGPU maturity and wider downstream adoption can still happen later, but the older local NVIDIA-fabric idea is no longer treated as active product backlog after the current bounded compute-plus-render proof floor
Sequencing
The current runner-class order is:
- KVM through the landed #312 / TIN-330 floor
- GPU / WebGPU / Dawn through the matrix owned by #333 / TIN-371, with #338 / TIN-377 as the legacy cmux proof slice, #342 / TIN-379 as the current shared tinyland-nix-gpu host-device lane on honey, #347 / TIN-386 as the first bounded shared-lane Dawn/WebGPU userspace proof, and tinyland-inc/lab#163 as the current downstream default-branch canary on that shared lane
- macOS through #335 / TIN-376 as historical bounded proof context, not as normal runner taxonomy
- riscv and other rarer execution lanes only after the earlier lanes are materially stronger
- broader cross-forge parity only after a GitHub-first runtime pattern is repeatable
Historical Repo-Anchored Proof
- repo: Jesssullivan/cmux
- workflow: GPU Smoke Test (Self-Hosted)
- run: 24756928163 on main commit f64a777
- job: GPU smoke test (cmux-nix) 72432003462
- gate: GitHub environment gpu-tests
This counted as real bounded evidence at the time because:
- the workflow runs on an authoritative branch
- it uses a named self-hosted GPU-capable lane
- it performs real build work before the runtime smoke: Build libghostty (Nix), Build cmux-linux (Nix), and Test config parser (Nix)
- it then completes an actual GPU smoke test step rather than only emitting host metadata
Current Shared Lane
- lane: tinyland-nix-gpu
- stack owner: tofu/stacks/arc-runners/honey.tfvars
- proof workflow: Test ARC Runners Soak
- proof job: Test tinyland-nix-gpu on honey
- host contract: mount host /dev/dri, require renderD128, and verify that vulkaninfo --summary sees a discrete GPU through the shared lane
This counts as real shared proof because:
- the lane is owned by GloriousFlywheel ARC config, not only by one repo
- it is pinned to the real current hardware floor on honey
- the soak job proves device visibility and Vulkan userspace instead of only claiming that a GPU-shaped node exists
- the same soak workflow now also proves one bounded Dawn/WebGPU userspace path through real adapter enumeration, a minimal compute submission, and a minimal offscreen render submission
- it still avoids overstating current reality as a mature cluster-wide nvidia.com/gpu or broader Dawn/WebGPU contract
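The host contract and the device-visibility part of the soak proof can be sketched as two small shell helpers. This is a minimal sketch, not the actual Test ARC Runners Soak job: the function names has_render_node and has_discrete_gpu are hypothetical, and the sketch assumes that vulkaninfo --summary prints the device type as PHYSICAL_DEVICE_TYPE_DISCRETE_GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from the Test ARC Runners Soak workflow.

# Succeeds when renderD128 exists under the given DRI directory
# (defaults to /dev/dri). A stricter check could use -c to require a
# character device.
has_render_node() {
  local dri_dir="${1:-/dev/dri}"
  [ -e "${dri_dir}/renderD128" ]
}

# Reads `vulkaninfo --summary` output on stdin and succeeds when at least
# one device is reported as a discrete GPU (assumed output format).
has_discrete_gpu() {
  grep -q 'PHYSICAL_DEVICE_TYPE_DISCRETE_GPU'
}
```

Inside a job step on the shared lane, the two checks could be chained as `has_render_node && vulkaninfo --summary | has_discrete_gpu`, so the step fails if either the render node or the discrete GPU is missing.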
Current Downstream Default-Branch Canary
- repo: tinyland-inc/lab
- workflow: WebGPU Canary
- PR: tinyland-inc/lab#163
- default-branch proof: run 24811145561 on main merge commit d2a9af2
- branch proof: PR run 24811086650 on branch commit a243687
- shared label contract: runs-on: tinyland-nix-gpu
This counts as the first real downstream default-branch canary because:
- it runs on an authoritative downstream main branch rather than only in GloriousFlywheel soak
- it uses the shared org lane directly instead of a repo-scoped GPU runner
- it proves /dev/dri, Vulkan userspace, and Dawn/WebGPU userspace in one repo-local workflow
- the same workflow also passed on the PR head before merge, so the default branch result is not a one-off post-merge surprise
What Would Count As Stronger GPU Proof
- a real GitHub Actions workload proving runtime scheduling on that shared lane
- clear trace-extraction, cache, cleanup, and cold-start behavior
- broader graphics/runtime coverage than the current bounded compute-plus-offscreen-render userspace proof
- a second downstream default-branch canary beyond tinyland-inc/lab
- operator docs that describe the real proven path instead of only infrastructure prerequisites or legacy GitLab configuration
Current Next Move
The next explicit GPU slice is:
- wider downstream adoption can continue, but it should stay narrower than broad GPU maturity claims
- the older local NVIDIA-fabric idea is now future design context unless a new product requirement revives it explicitly
The goal is to keep the current shared /dev/dri plus bounded userspace proof
floor honest before claiming more about Dawn, WebGPU, or cluster-wide GPU
maturity.
Compatibility Notes
The legacy GitLab runner module supports two GPU-oriented runner types:
l40s and a100. Keep that surface in the repo as compatibility-only design
context, not as evidence that GloriousFlywheel currently has a mature GPU
product lane.
Current Shared-Lane Prerequisites
The current shared ARC lane depends on:
- one honey node with host /dev/dri
- working host Vulkan userspace on that node
- ARC runner config that mounts /dev/dri into the runner container
- a bounded GitHub Actions canary that proves device visibility from inside the runner pod
Legacy NVIDIA Prerequisites
- NVIDIA GPU Operator installed on the cluster
- GPU node pool with appropriate labels and taints
- Nodes visible via kubectl get nodes -l accelerator=nvidia
Legacy Compatibility Types
| Type | GPU | Architecture | VRAM | Use Case |
|---|---|---|---|---|
| l40s | NVIDIA L40S | Ada Lovelace | 48 GB | Inference, fine-tuning, rendering |
| a100 | NVIDIA A100 | Ampere | 40/80 GB | Training, large model inference |
Both types default to nvidia/cuda:12.4-devel-ubuntu22.04 and run in
privileged mode for GPU device access.
Legacy Module Configuration
module "gpu_runner" {
  source = "../../modules/gitlab-runner"

  runner_name  = "gpu-l40s"
  runner_type  = "l40s"
  runner_token = var.l40s_runner_token
  namespace    = "gitlab-runners"

  # GPU configuration
  gpu_count         = 1
  gpu_resource_name = "nvidia.com/gpu"

  # Node selector is cluster-specific; adjust label to match your GPU node pool
  gpu_node_selector = {
    "accelerator" = "nvidia"
  }

  gpu_tolerations = [{
    key      = "nvidia.com/gpu"
    operator = "Exists"
    value    = null
    effect   = "NoSchedule"
  }]

  # Scale conservatively; GPU nodes are expensive
  concurrent_jobs  = 2
  hpa_enabled      = true
  hpa_min_replicas = 1
  hpa_max_replicas = 2
}
Legacy GitLab Job Example
train-model:
  tags: [gpu, nvidia, cuda, l40s]
  image: nvidia/cuda:12.4-devel-ubuntu22.04
  script:
    - nvidia-smi
    - python train.py --epochs 10
How The Legacy Module Allocates GPUs
The module injects a pod_spec strategic merge patch into the runner TOML
config. This adds nvidia.com/gpu resource requests/limits to the build
container in each CI job pod. The Kubernetes scheduler then places job pods
on nodes with available GPUs.
Legacy Environment Variables
GPU runners automatically inject:
- NVIDIA_VISIBLE_DEVICES=all: expose all GPUs to the container
- NVIDIA_DRIVER_CAPABILITIES=compute,utility: enable compute + nvidia-smi
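A legacy job could assert these injected variables before doing GPU work. The following is a minimal sketch under that assumption; the function name nvidia_env_ok is hypothetical and is not part of the legacy module.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a legacy GPU job.

# Succeeds when NVIDIA_VISIBLE_DEVICES is "all" and the comma-separated
# NVIDIA_DRIVER_CAPABILITIES list contains both compute and utility.
nvidia_env_ok() {
  [ "${NVIDIA_VISIBLE_DEVICES:-}" = "all" ] || return 1
  local caps=",${NVIDIA_DRIVER_CAPABILITIES:-},"
  case "$caps" in *,compute,*) ;; *) return 1 ;; esac
  case "$caps" in *,utility,*) ;; *) return 1 ;; esac
  return 0
}
```

Running `nvidia_env_ok || exit 1` as the first script line would make a misconfigured runner fail early, before nvidia-smi or any training command starts.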
Legacy Troubleshooting
Shared ARC lane cannot see /dev/dri
Check the host device surface directly on honey:
ssh honey 'ls -ld /dev/dri /dev/dri/*'
ssh honey 'vulkaninfo --summary | grep -E "deviceName|deviceType|driverName|driverInfo"'
Job pod stuck in Pending
Check that GPU nodes are available and the NVIDIA device plugin is running:
kubectl get nodes -l accelerator=nvidia
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
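If the nodes and the device plugin pods both look healthy, it can also help to confirm that the nodes actually advertise allocatable GPUs. The helpers below are a minimal sketch: the function names are hypothetical, and the accelerator=nvidia label is the same cluster-specific assumption used elsewhere on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking allocatable GPUs on labeled nodes.

# Lists each GPU node with its allocatable nvidia.com/gpu count.
# kubectl prints <none> when a node does not advertise the resource.
gpu_node_allocatable() {
  kubectl get nodes -l accelerator=nvidia --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Reads that listing on stdin and succeeds when at least one node
# advertises one or more GPUs.
any_gpu_allocatable() {
  awk '$2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit (found ? 0 : 1) }'
}
```

`gpu_node_allocatable | any_gpu_allocatable` returning non-zero suggests the device plugin has not registered GPUs with the kubelet, which would keep GPU job pods in Pending.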
nvidia-smi not found
Ensure the NVIDIA GPU Operator is installed and the driver container is running on GPU nodes:
kubectl get pods -n gpu-operator
Wrong number of GPUs
Adjust gpu_count in the module configuration. Each job pod requests
exactly that many GPUs.
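To confirm what a running job pod actually requested, the GPU limit can be read back from the pod spec. This is a minimal sketch: the function names are hypothetical, and it assumes the job container is named build, as the GitLab Kubernetes executor names it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume the job's build container is named "build".

# Prints the nvidia.com/gpu limit set on the build container of a job pod.
pod_gpu_limit() {
  local namespace="$1" pod="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.spec.containers[?(@.name=="build")].resources.limits.nvidia\.com/gpu}'
}

# Succeeds when the pod's GPU limit equals the expected gpu_count.
gpu_count_matches() {
  local expected="$1" namespace="$2" pod="$3"
  [ "$(pod_gpu_limit "$namespace" "$pod")" = "$expected" ]
}
```

For example, `gpu_count_matches 1 gitlab-runners <job-pod-name>` checks a pod against the gpu_count = 1 setting from the module configuration shown earlier; the pod name is a placeholder to fill in from kubectl get pods.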