# GPU Runners
Platform: GitLab CI only. GPU runners are not available on GitHub Actions (ARC) because ARC runner scale sets do not support NVIDIA device plugin integration. Use GitLab CI with `tags: [gpu, ...]` for GPU workloads.
The GitLab runner module supports two GPU runner types: `l40s` and `a100`.
## Prerequisites
- NVIDIA GPU Operator installed on the cluster
- GPU node pool with appropriate labels and taints
- Nodes visible via `kubectl get nodes -l accelerator=nvidia`
## Runner Types
| Type | GPU | Architecture | VRAM | Use Case |
|---|---|---|---|---|
| `l40s` | NVIDIA L40S | Ada Lovelace | 48 GB | Inference, fine-tuning, rendering |
| `a100` | NVIDIA A100 | Ampere | 40/80 GB | Training, large model inference |
Both types default to the `nvidia/cuda:12.4-devel-ubuntu22.04` image and run in privileged mode for GPU device access.
## Configuration
```hcl
module "gpu_runner" {
  source = "../../modules/gitlab-runner"

  runner_name  = "gpu-l40s"
  runner_type  = "l40s"
  runner_token = var.l40s_runner_token
  namespace    = "gitlab-runners"

  # GPU configuration
  gpu_count         = 1
  gpu_resource_name = "nvidia.com/gpu"

  # Node selector is cluster-specific; adjust the label to match your GPU node pool
  gpu_node_selector = {
    "accelerator" = "nvidia"
  }

  gpu_tolerations = [{
    key      = "nvidia.com/gpu"
    operator = "Exists"
    value    = null
    effect   = "NoSchedule"
  }]

  # Scale conservatively: GPU nodes are expensive
  concurrent_jobs  = 2
  hpa_enabled      = true
  hpa_min_replicas = 1
  hpa_max_replicas = 2
}
```
## CI Job Example
```yaml
train-model:
  tags: [gpu, nvidia, cuda, l40s]
  image: nvidia/cuda:12.4-devel-ubuntu22.04
  script:
    - nvidia-smi
    - python train.py --epochs 10
```
## How GPU Allocation Works
The module injects a `pod_spec` strategic merge patch into the runner TOML config. This adds `nvidia.com/gpu` resource requests/limits to the build container in each CI job pod. The Kubernetes scheduler then places job pods on nodes with available GPUs.
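The generated runner configuration looks roughly like the following. This is an illustrative sketch, not the module's literal output: the `[[runners.kubernetes.pod_spec]]` section is GitLab Runner's pod-spec patch mechanism, but the patch name and exact YAML emitted here are assumptions.

```toml
# config.toml (excerpt) -- illustrative sketch of the generated runner config
[[runners]]
  [runners.kubernetes]
    privileged = true

    # Strategic merge patch applied to every CI job pod
    [[runners.kubernetes.pod_spec]]
      name       = "gpu-resources"   # hypothetical patch name
      patch_type = "strategic"
      patch      = '''
        containers:
          - name: build
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
      '''
```

Because the request and limit are equal, each job pod gets a whole-GPU guarantee, and the scheduler only binds it to a node with an unallocated `nvidia.com/gpu` device.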
## Environment Variables
GPU runners automatically inject:

- `NVIDIA_VISIBLE_DEVICES=all`: expose all GPUs to the container
- `NVIDIA_DRIVER_CAPABILITIES=compute,utility`: enable compute workloads and `nvidia-smi`
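One plausible injection point is GitLab Runner's per-runner `environment` list in `config.toml`; a minimal sketch, assuming the surrounding runner settings (the name and executor shown are placeholders):

```toml
# config.toml (excerpt) -- one way the variables could be injected;
# the values match the defaults documented above
[[runners]]
  name        = "gpu-l40s"      # placeholder
  executor    = "kubernetes"
  environment = [
    "NVIDIA_VISIBLE_DEVICES=all",
    "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
  ]
```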
## Troubleshooting
### Job pod stuck in Pending
Check that GPU nodes are available and the NVIDIA device plugin is running:

```shell
kubectl get nodes -l accelerator=nvidia
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
```
### `nvidia-smi` not found
Ensure the NVIDIA GPU Operator is installed and the driver container is running on GPU nodes:

```shell
kubectl get pods -n gpu-operator
```
### Wrong number of GPUs
Adjust `gpu_count` in the module configuration. Each job pod requests exactly that many GPUs.
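For example, to give every job pod two GPUs, change only `gpu_count` in the module block from the Configuration section (the other arguments here are repeated unchanged; keep `gpu_count` at or below the number of GPUs on a single node, or pods will never schedule):

```hcl
module "gpu_runner" {
  source       = "../../modules/gitlab-runner"
  runner_name  = "gpu-l40s"
  runner_type  = "l40s"
  runner_token = var.l40s_runner_token
  namespace    = "gitlab-runners"

  gpu_count = 2  # was 1; each CI job pod now requests two nvidia.com/gpu devices
}
```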