# Load Testing Guide

Guide for load testing the GitLab Runners infrastructure.

## Test Scenarios

### Scenario 1: Burst Load

Simulate 20 simultaneous jobs to test scale-up behavior.
```yaml
# .gitlab-ci.yml for burst test
stages:
  - burst

.burst-job:
  stage: burst
  script:
    - echo "Job $CI_JOB_ID started at $(date)"
    - sleep 60  # Simulate 1-minute job
    - echo "Job $CI_JOB_ID completed"

burst-01: { extends: .burst-job, tags: [docker] }
burst-02: { extends: .burst-job, tags: [docker] }
burst-03: { extends: .burst-job, tags: [docker] }
burst-04: { extends: .burst-job, tags: [docker] }
burst-05: { extends: .burst-job, tags: [docker] }
burst-06: { extends: .burst-job, tags: [docker] }
burst-07: { extends: .burst-job, tags: [docker] }
burst-08: { extends: .burst-job, tags: [docker] }
burst-09: { extends: .burst-job, tags: [docker] }
burst-10: { extends: .burst-job, tags: [docker] }
burst-11: { extends: .burst-job, tags: [docker] }
burst-12: { extends: .burst-job, tags: [docker] }
burst-13: { extends: .burst-job, tags: [docker] }
burst-14: { extends: .burst-job, tags: [docker] }
burst-15: { extends: .burst-job, tags: [docker] }
burst-16: { extends: .burst-job, tags: [docker] }
burst-17: { extends: .burst-job, tags: [docker] }
burst-18: { extends: .burst-job, tags: [docker] }
burst-19: { extends: .burst-job, tags: [docker] }
burst-20: { extends: .burst-job, tags: [docker] }
```
**Expected Behavior:**

- HPA scales the docker runner to its maximum (5 replicas)
- All jobs complete within ~5 minutes
- No jobs fail due to infrastructure
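The 20 near-identical job definitions above can also be generated with GitLab's `parallel` keyword, which keeps the pipeline definition compact while still producing 20 simultaneous jobs; a sketch:

```yaml
# Equivalent burst test using `parallel` (instances are named "burst 1/20" ... "burst 20/20")
stages:
  - burst

burst:
  stage: burst
  tags: [docker]
  parallel: 20
  script:
    - echo "Job $CI_JOB_ID ($CI_NODE_INDEX/$CI_NODE_TOTAL) started at $(date)"
    - sleep 60  # Simulate 1-minute job
```

The explicit per-job form is kept above because it makes individual job names easier to spot in the pipeline graph.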
### Scenario 2: Sustained Load

Simulate 10 jobs/minute for 30 minutes.
```bash
#!/bin/bash
# sustained-load.sh - trigger 10 pipelines per minute for 30 minutes.
# The trigger endpoint authenticates with the trigger token alone;
# no PRIVATE-TOKEN header is needed.
for i in $(seq 1 30); do
  for j in $(seq 1 10); do
    curl -X POST \
      "https://gitlab.com/api/v4/projects/$PROJECT_ID/trigger/pipeline" \
      -F "ref=main" \
      -F "token=$TRIGGER_TOKEN" \
      -F "variables[TEST_ID]=$i-$j" &
  done
  sleep 60
done
wait
```
**Expected Behavior:**

- Runners stabilize at mid-scale
- No job queue backlog
- Memory/CPU stay below 90%
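To correlate the sustained load with scaling behavior, it helps to record replica counts once a minute for later analysis. A minimal sketch; the replica-count command is passed in as an argument so it can be swapped for your own cluster query (the `kubectl ... jsonpath` example below assumes an HPA named `docker-runner`, which may differ in your deployment):

```shell
#!/bin/bash
# record-replicas.sh - append a timestamped replica count to a CSV once a minute

record_replicas() {
  local get_replicas="$1" minutes="$2" out="$3"
  echo "minute,replicas" > "$out"
  for m in $(seq 1 "$minutes"); do
    # Run the supplied command to read the current replica count
    echo "$m,$($get_replicas)" >> "$out"
    if [ "$m" -lt "$minutes" ]; then sleep 60; fi
  done
}

# Example cluster query (HPA name is hypothetical):
# record_replicas \
#   'kubectl get hpa docker-runner -n {org}-runners -o jsonpath={.status.currentReplicas}' \
#   30 replicas.csv
```

The resulting CSV pairs naturally with the HPA events exported in the post-test analysis section.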
### Scenario 3: Mixed Workload

Test all runner types simultaneously.
```yaml
stages:
  - test

test-docker:
  tags: [docker, linux]
  script:
    - apk add --no-cache curl
    - curl -I https://example.com

test-dind:
  tags: [dind, privileged]
  services:
    - docker:27-dind
  variables:
    DOCKER_HOST: tcp://localhost:2375
    DOCKER_TLS_CERTDIR: ""
  script:
    - docker pull alpine:latest
    - docker images

test-tinyland-docker:
  tags: [tinyland-docker]
  script:
    - cat /etc/redhat-release
    - dnf list installed | head -20

test-tinyland-nix:
  tags: [tinyland-nix]
  script:
    - cat /etc/redhat-release
    - dnf list installed | head -20

test-nix:
  tags: [nix, flakes]
  script:
    - nix --version
    - nix build nixpkgs#hello
    - ./result/bin/hello
```
**Expected Behavior:**

- Each runner type handles its own jobs
- No cross-contamination between runner types
- All jobs pass
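To make cross-contamination visible in job logs, each job can print which runner picked it up; `CI_RUNNER_DESCRIPTION` and `CI_RUNNER_TAGS` are predefined GitLab CI/CD variables. A sketch of a shared template the jobs above could extend:

```yaml
# Shared check: record which runner executed the job
.runner-identity:
  before_script:
    - echo "Runner $CI_RUNNER_DESCRIPTION (tags: $CI_RUNNER_TAGS) picked up $CI_JOB_NAME"
```

A job whose log shows tags outside its expected set indicates a tag-routing problem.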
### Scenario 4: Idle Behavior

Test scale-down after load:

1. Run the burst test
2. Wait 10 minutes
3. Verify all runners scaled to minimum

```bash
# Check scale-down
watch -n 30 'kubectl get hpa -n {org}-runners'
```
**Expected Behavior:**

- All HPAs show 1 replica after 5-10 minutes
- No stuck pods
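Instead of watching manually, the scale-down check can be scripted as a poll with a timeout. A minimal sketch; the replica-count command is passed in as an argument (the commented `kubectl` query assumes an HPA named `docker-runner`):

```shell
#!/bin/bash
# wait_for_scale_down CMD [TIMEOUT_SECONDS]
# Polls CMD (which must print a replica count) every 30s until it reports <= 1.
wait_for_scale_down() {
  local get_replicas="$1" timeout="${2:-600}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if [ "$($get_replicas)" -le 1 ]; then
      echo "scaled down after ${elapsed}s"
      return 0
    fi
    sleep 30
    elapsed=$((elapsed + 30))
  done
  echo "still above minimum after ${timeout}s"
  return 1
}

# Example cluster query (HPA name is hypothetical):
# wait_for_scale_down \
#   'kubectl get hpa docker-runner -n {org}-runners -o jsonpath={.status.currentReplicas}'
```

A nonzero exit after the timeout is a signal to look for stuck pods or an overlong stabilization window.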
## Monitoring During Tests

### Real-time Monitoring

```bash
# Terminal 1: HPA status
watch -n 5 'kubectl get hpa -n {org}-runners'

# Terminal 2: Pod count
watch -n 5 'kubectl get pods -n {org}-runners | wc -l'

# Terminal 3: Resource usage
watch -n 10 'kubectl top pods -n {org}-runners'

# Terminal 4: Events
kubectl get events -n {org}-runners -w
```
### Prometheus Queries

```promql
# Jobs currently running per runner
gitlab_runner_jobs{state="running"}

# Scaling activity
changes(kube_horizontalpodautoscaler_status_current_replicas{namespace="{org}-runners"}[5m])

# Average CPU utilization (rate over the raw counter, in cores)
avg(rate(container_cpu_usage_seconds_total{namespace="{org}-runners"}[5m]))
```
## Success Criteria
| Metric | Target | How to Measure |
|---|---|---|
| Job start latency | < 30s P95 | GitLab job metrics |
| Scale-up time | < 60s | HPA event timestamps |
| Scale-down time | 5-10 min | HPA event timestamps |
| Job success rate | > 99% | GitLab CI analytics |
| Memory usage | < 85% peak | kubectl top |
| CPU usage | < 80% sustained | kubectl top |
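The success-rate target can be checked directly from raw job counts. A small awk helper; the counts in the usage comment are illustrative, not measurements:

```shell
#!/bin/bash
# success_rate TOTAL FAILED
# Prints the success rate as a percentage and exits 0 when it meets the 99% target.
success_rate() {
  awk -v total="$1" -v failed="$2" 'BEGIN {
    rate = (total - failed) * 100 / total
    printf "%.2f\n", rate
    exit (rate >= 99) ? 0 : 1
  }'
}

# Example: success_rate 300 2   # 298/300 succeeded -> meets the target
```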
## Post-Test Analysis

### Collect Metrics

```bash
# Export HPA events
kubectl get events -n {org}-runners \
  --field-selector reason=SuccessfulRescale \
  -o json > hpa-events.json

# Export pod metrics (requires metrics-server)
kubectl top pods -n {org}-runners --no-headers \
  > pod-metrics-$(date +%Y%m%d-%H%M).txt
```
### GitLab Analytics

- Go to **CI/CD > Analytics**
- Check:
  - Pipeline duration trends
  - Job wait times
  - Failure rates
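Job wait times exported from the analytics or API (one duration in seconds per line) can be reduced to the P95 figure used in the success criteria with sort/awk. A sketch using the nearest-rank method; the input file name is an assumption:

```shell
#!/bin/bash
# p95: print the 95th-percentile value of one-number-per-line stdin (nearest rank)
p95() {
  sort -n | awk '{ v[NR] = $1 } END {
    idx = int(NR * 0.95)
    if (idx < 1) idx = 1
    print v[idx]
  }'
}

# Example: p95 < job-wait-seconds.txt
```

Compare the result against the `< 30s P95` job start latency target.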
### Identify Issues

Common issues to look for:

- Jobs stuck in pending (not enough runners)
- OOM kills (memory limits too low)
- Slow scale-up (stabilization window too long)
- Thrashing (scaling targets too sensitive)
## Tuning Based on Results

If jobs wait too long:

```hcl
# Increase max replicas
docker_hpa_max_replicas = 10

# Faster scale-up
hpa_scale_up_window = 0

# Lower CPU target
hpa_cpu_target = 60
```

If HPA thrashes:

```hcl
# Increase stabilization windows
hpa_scale_up_window   = 60
hpa_scale_down_window = 600

# Higher targets
hpa_cpu_target    = 80
hpa_memory_target = 85
```

If OOM kills occur:

```hcl
# Increase job memory
docker_job_memory_limit = "4Gi"

# Or reduce concurrency
docker_concurrent_jobs = 4
```
## Automated Load Test Script

```bash
#!/bin/bash
# load-test.sh - Automated load testing

NAMESPACE="{org}-runners"
DURATION_MINUTES=30
JOBS_PER_MINUTE=10

echo "Starting load test: $JOBS_PER_MINUTE jobs/min for $DURATION_MINUTES minutes"
echo "Namespace: $NAMESPACE"
echo ""

# Baseline metrics
echo "Baseline HPA status:"
kubectl get hpa -n "$NAMESPACE"
echo ""

# Start monitoring in background
kubectl get events -n "$NAMESPACE" -w > events.log &
EVENTS_PID=$!

# Run load
for i in $(seq 1 "$DURATION_MINUTES"); do
  echo "Minute $i: Triggering $JOBS_PER_MINUTE jobs..."
  for j in $(seq 1 "$JOBS_PER_MINUTE"); do
    # Trigger pipeline via API (implement your trigger)
    # curl ... &
    echo "  Job $i-$j triggered"
  done

  # Snapshot metrics
  echo "HPA status at minute $i:"
  kubectl get hpa -n "$NAMESPACE" --no-headers
  sleep 60
done

# Stop monitoring
kill $EVENTS_PID 2>/dev/null

# Final metrics
echo ""
echo "Final HPA status:"
kubectl get hpa -n "$NAMESPACE"
echo ""
echo "Load test complete. Check events.log for scaling events."
```
## Cleanup After Testing

```bash
# List any still-running pipelines (cancel them before cleaning up)
glab ci list --status running -P $PROJECT_ID

# Clean up test artifacts
kubectl delete pods -n {org}-runners --field-selector=status.phase==Succeeded

# Reset any manual scaling
kubectl scale deployment -n {org}-runners --all --replicas=1
```