Live Runner Rollout Checklist

Use this checklist when rolling the hardened runner and cache substrate onto a real cluster. It is intentionally narrower than a general deployment guide: it covers the current ARC, GitLab compatibility, dashboard, and cache-plane guardrails needed for repeatable cache-first runner infrastructure.

Current product boundary:

  • GitHub Actions ARC is the primary scale-set path.
  • ARC lanes should default to minRunners = 0 and scale up to committed machine-envelope limits.
  • Scheduled warm pools are explicit exceptions and must have matching cold schedules.
  • GitLab runners are a compatibility path with HPA-based scaling; they do not yet provide queue-driven, ARC-equivalent scale-to-zero.
  • Bazel is remote-cache backed today; Bazel remote execution has been neither selected nor proven.
  • Local direnv and flake shells should attach to the same Attic and Bazel cache family as CI when the operator-provided endpoints are available.

Before Touching The Cluster

Enter the repo-managed shell first so local checks use the same contract as CI:

direnv allow
just info

The local shell may report GF_BAZEL_SUBSTRATE_MODE=compatibility-local-only until an operator-provided Bazel endpoint is present. That is acceptable for static validation, but it is not proof of cache-backed Bazel execution.
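
As a minimal sketch of that distinction, the helper below only interprets the variable's value. The name bazel_mode_note is hypothetical and not a repo command, and the sketch assumes nothing about values other than compatibility-local-only.

```shell
# Sketch only: bazel_mode_note is a hypothetical helper, not a repo command.
# It reads GF_BAZEL_SUBSTRATE_MODE and reports whether the shell is in the
# local-only compatibility mode.
bazel_mode_note() {
  mode="${GF_BAZEL_SUBSTRATE_MODE:-unset}"
  case "$mode" in
    compatibility-local-only)
      echo "local-only: fine for static validation, not cache proof"
      ;;
    unset)
      echo "unset: enter the repo-managed shell first"
      ;;
    *)
      # Other values are not documented here, so only echo them back.
      echo "mode=$mode"
      ;;
  esac
}

bazel_mode_note
```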

Run the offline hardening checks:

just tofu-validate-all
just check
git diff --check

These checks cover:

  • OpenTofu formatting and initialized stack validation
  • live tfvars naming and image pinning
  • provider lockfile presence
  • runner cache env injection
  • Attic public key agreement across runner stacks
  • ARC/GitLab scale contracts
  • namespace ResourceQuota and LimitRange capacity modeling
  • setup-flywheel and workflow cache proof shape
  • the current no-RBE operational boundary

Stop before planning if any of those checks fail.
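
One way to enforce the stop rule is to run the checks through a small gate that halts at the first failure. This is a sketch: run_gate is a hypothetical helper, not a recipe in the repo.

```shell
# Sketch only: run_gate is a hypothetical helper, not part of the repo.
# It runs each command string in order and stops at the first failure.
run_gate() {
  for step in "$@"; do
    echo "==> $step"
    # Each step is a trusted literal command string, so eval is acceptable here.
    if ! eval "$step"; then
      echo "FAILED: $step (stop before planning)" >&2
      return 1
    fi
  done
  echo "all offline checks passed"
}

# Usage, from the repo-managed shell:
#   run_gate "just tofu-validate-all" "just check" "git diff --check"
```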

Planning Preconditions

Set the operator context explicitly:

export ENV=dev
export KUBE_CONTEXT=honey
export TOFU_BACKEND_S3_ENDPOINT=...
export TOFU_BACKEND_S3_BUCKET=...
export TOFU_BACKEND_S3_ACCESS_KEY=...
export TOFU_BACKEND_S3_SECRET_KEY=...
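
A quick preflight can confirm that the operator context is complete before any init or plan. The sketch below uses a hypothetical helper, require_env, and never prints the variable values.

```shell
# Sketch only: require_env is a hypothetical helper, not a repo command.
# It fails if any named variable is unset or empty, without printing values.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are trusted literals).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before just tofu-init:
#   require_env ENV KUBE_CONTEXT TOFU_BACKEND_S3_ENDPOINT TOFU_BACKEND_S3_BUCKET \
#     TOFU_BACKEND_S3_ACCESS_KEY TOFU_BACKEND_S3_SECRET_KEY
```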

Then initialize and plan one stack at a time:

just tofu-init arc-runners
just tofu-plan arc-runners
just tofu-plan-guard arc-runners

Repeat for gitlab-runners, runner-dashboard, and attic only after the stack-specific notes below are satisfied. Review every plan before applying the saved tfplan.

tofu-apply runs the same saved-plan guard by default. It blocks destructive namespace, namespace-policy, cache PVC, object bucket, ARC controller, repo-scoped ARC registration URL, and repo-shaped runner label drift before apply. Secret replacement and runner Helm release deletion are printed as review items because they can be legitimate rotation or retirement work, but they still require operator attention.

State Moves And Adoption Checks

ARC Runners

The ARC stack owns:

  • arc-systems
  • arc-runners
  • ARC controller Helm release when deploy_arc_controller = true
  • runner scale-set Helm releases
  • runner namespace ResourceQuota and LimitRange

The current stack includes a state move from module.arc_controller to module.arc_controller[0]. Existing core deployments must apply once with deploy_arc_controller = true before any owner overlay disables controller ownership.

Plan review must show:

  • no destroy/recreate for the ARC controller in an existing healthy install
  • minRunners = 0 for cold ARC scale sets
  • any warm-pool lane has a cold schedule
  • maxRunners and pod resource envelopes stay within the namespace quota
  • cache variables are still injected into docker, dind, and nix runner pods
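
The quota item in the list above comes down to simple arithmetic: the worst-case burst is maxRunners times the per-runner request, and that product must not exceed the namespace quota. The sketch below shows the CPU case only. The numbers are made up and fits_quota is a hypothetical helper; a real review would also cover memory and any other quota-limited resources.

```shell
# Sketch only: fits_quota is a hypothetical helper with made-up numbers.
# Arguments: maxRunners, per-runner CPU request (millicores),
# namespace CPU quota (millicores).
fits_quota() {
  max_runners="$1"; per_runner_m="$2"; quota_m="$3"
  worst_case_m=$((max_runners * per_runner_m))
  echo "worst case ${worst_case_m}m of ${quota_m}m"
  [ "$worst_case_m" -le "$quota_m" ]
}

# 8 runners x 2000m = 16000m fits a 20000m quota; 12 runners would need 24000m.
fits_quota 8 2000 20000 && echo "within quota"
fits_quota 12 2000 20000 || echo "exceeds quota"
```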

GitLab Runners

The GitLab compatibility stack now owns the gitlab-runners namespace at the stack root and applies namespace policy before runner Helm releases. It also includes a state move from the old Nix module namespace resource to the new root namespace resource.

Plan review must show:

  • no namespace deletion for gitlab-runners
  • the state move is accepted cleanly, or the existing namespace is imported before apply
  • runner modules do not recreate namespace ownership themselves
  • GitLab manager HPA limits and concurrent_jobs stay within quota
  • job, helper, and service containers keep explicit cache and storage limits

GitLab is still a parity target, not the primary scale-to-zero implementation. Do not describe HPA min/max behavior as ARC-equivalent queue scaling.

Runner Dashboard

The dashboard RBAC should remain namespace-scoped. Plan review must show:

  • no cluster-wide secret list/watch authority
  • dashboard access to the GitLab runner namespace through runners_namespace
  • dashboard read access to the configured ARC namespaces through arc_namespaces
  • no mutation authority beyond the documented compatibility-backed flows

Attic And Bazel Cache

The honey Attic stack is an adoption path for resources that already exist on the cluster. Do not apply it until existing resources are imported or otherwise accounted for in state.

Current honey status as of 2026-05-06 UTC: the destructive adoption blocker is cleared, but a broad Attic apply still requires saved-plan review. The TIN-980 state-adoption pass removed stale addresses of the old Kubernetes provider resource types, imported the live RustFS, Attic API, Attic GC, and Secret objects at _v1 addresses, and imported the live attic-config ConfigMap under an explicit honey adoption mode. The latest operator plan shows 6 to add, 14 to change, 0 to destroy, and just tofu-plan-guard attic passes. The current honey tfvars intentionally read the live database URL, Attic JWT signing key, and RustFS credentials from existing service Secrets; saved-plan review must keep Secret data keys stable until an explicit rotation is approved.
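
As a coarse first look before the full review, the destroy count can be pulled from the plan summary line. This is a sketch and does not replace just tofu-plan-guard. The helper plan_destroy_count is hypothetical, and it assumes uncolored plan text on stdin, for example from tofu show -no-color run against whichever saved plan file the just recipes produce.

```shell
# Sketch only: plan_destroy_count is a hypothetical helper.
# It reads plan text on stdin and prints the "to destroy" number from the
# summary line, e.g. "Plan: 6 to add, 14 to change, 0 to destroy."
plan_destroy_count() {
  sed -n 's/^Plan: .* \([0-9][0-9]*\) to destroy\.*$/\1/p' | tail -n 1
}

summary="Plan: 6 to add, 14 to change, 0 to destroy."
count="$(printf '%s\n' "$summary" | plan_destroy_count)"
if [ "$count" = "0" ]; then
  echo "no destroys in summary; still review the saved plan"
else
  echo "destroys present or summary missing: stop"
fi
```

A zero here only says that nothing is deleted outright. In-place changes to Secrets or selectors still need line-by-line review.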

At minimum, verify adoption for:

  • nix-cache namespace
  • honey’s existing attic-config ConfigMap, or the generated attic-server-config Secret after a deliberate hardening cutover
  • existing service Secrets, including attic-secrets, attic-jwt-signing, attic-rustfs-credentials, and bazel-cache-s3-credentials
  • Attic API deployment and service resources
  • RustFS or other S3-compatible backing resources selected by the stack
  • Bazel cache service resources

Before applying, converge the current drift explicitly:

  • migrate or import old non-_v1 Kubernetes provider state addresses left from the pre-342efb4 resource type migration
  • make the RustFS model match the durable live object, deployment/attic-rustfs-openebs with OpenEBS ZFS storage, or record a reviewed data migration away from it
  • keep the legacy attic-rustfs service aliases pointed at the intended live backend; do not recreate the old local-path attic-rustfs StatefulSet as an accidental rollback
  • keep database_url_secret_name pointed at the existing service Secret when adopting honey without committing a database connection string to tfvars
  • keep attic_jwt_signing_secret_name and rustfs_credentials_secret_name pointed at the existing service Secrets during honey adoption unless the plan is an explicit credential rotation
  • preserve the live honey node selectors for Attic API, Attic GC, and bazel-cache pods unless the plan is an explicit scheduling-policy change
  • separate the server-config ConfigMap-to-Secret hardening from the current adoption plan; it can be valid, but it should not appear as surprise churn in a broad adoption plan
  • retain completed bootstrap Jobs during honey adoption with rustfs_bootstrap_job_ttl_seconds_after_finished = null and init_cache_job_ttl_seconds_after_finished = null; short Kubernetes TTLs garbage-collect successful Jobs and turn otherwise stable managed objects into recurring add-only plan noise

OpenTofu 1.8 does not allow declarative moved blocks or tofu state mv across Kubernetes provider resource aliases such as kubernetes_service to kubernetes_service_v1. Treat those as state adoption operations:

  1. remove the stale old-type address from state
  2. import the same live object at the new _v1 address
  3. run just tofu-plan attic
  4. run just tofu-plan-guard attic

For the RustFS portion, the known old-type addresses are:

module.rustfs[0].kubernetes_secret.credentials
module.rustfs[0].kubernetes_service.api
module.rustfs[0].kubernetes_service.headless
module.rustfs[0].kubernetes_stateful_set.rustfs
module.rustfs[0].kubernetes_job.create_buckets
module.rustfs[0].kubernetes_job.apply_lifecycle[0]

The known live RustFS objects that need new-address imports after the code path is deployed are:

module.rustfs[0].kubernetes_secret_v1.credentials                 nix-cache/attic-rustfs-credentials
module.rustfs[0].kubernetes_service_v1.api                        nix-cache/attic-rustfs
module.rustfs[0].kubernetes_service_v1.headless                   nix-cache/attic-rustfs-hl
module.rustfs[0].kubernetes_stateful_set_v1.rustfs                nix-cache/attic-rustfs
module.rustfs[0].kubernetes_deployment_v1.rustfs[0]               nix-cache/attic-rustfs-openebs

The RustFS bucket/lifecycle Jobs used to be TTL-style operational Jobs. For honey adoption, keep them retained after completion so OpenTofu can continue to observe the managed _v1 Job objects after the first reviewed apply. If an old TTL-managed Job has already disappeared, remove the old provider-type state address and let the retained _v1 Job be created by the eventual reviewed apply.
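
The RustFS adoption sequence can be sketched as a dry run that only prints commands for review. The addresses and object IDs are copied from the lists above. The helper name print_adoption_commands is hypothetical. The raw tofu state rm and tofu import invocations assume the stack's usual backend and variable settings are in effect; the repo may wrap them differently.

```shell
# Sketch only: prints commands for operator review and never touches state.
# print_adoption_commands is a hypothetical helper, not a repo recipe.
print_adoption_commands() {
  # Step 1: remove stale old-type addresses (skip any already absent from state).
  while read -r old_addr; do
    printf "tofu state rm '%s'\n" "$old_addr"
  done <<'EOF'
module.rustfs[0].kubernetes_secret.credentials
module.rustfs[0].kubernetes_service.api
module.rustfs[0].kubernetes_service.headless
module.rustfs[0].kubernetes_stateful_set.rustfs
module.rustfs[0].kubernetes_job.create_buckets
module.rustfs[0].kubernetes_job.apply_lifecycle[0]
EOF
  # Step 2: import each live object at its _v1 address (ID is namespace/name).
  while read -r new_addr object_id; do
    printf "tofu import '%s' %s\n" "$new_addr" "$object_id"
  done <<'EOF'
module.rustfs[0].kubernetes_secret_v1.credentials nix-cache/attic-rustfs-credentials
module.rustfs[0].kubernetes_service_v1.api nix-cache/attic-rustfs
module.rustfs[0].kubernetes_service_v1.headless nix-cache/attic-rustfs-hl
module.rustfs[0].kubernetes_stateful_set_v1.rustfs nix-cache/attic-rustfs
module.rustfs[0].kubernetes_deployment_v1.rustfs[0] nix-cache/attic-rustfs-openebs
EOF
  # Steps 3 and 4: re-plan and run the saved-plan guard.
  echo "just tofu-plan attic"
  echo "just tofu-plan-guard attic"
}

print_adoption_commands
```

The Job addresses appear only in the removal list because the retained _v1 Jobs are created by the reviewed apply rather than imported.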

When the live legacy StatefulSet is intentionally drained, preserve immutable claim-template fields with rustfs_statefulset_storage_class instead of forcing the drained StatefulSet to match the active Deployment’s OpenEBS PVC.

The stack supports two server-config modes. secret is the target hardening mode for generated config. existing_config_map preserves honey’s live attic-config object during adoption so the ConfigMap-to-Secret move can be reviewed and rolled separately. Both modes still place sensitive config material in OpenTofu state until a later state-secret hardening pass externalizes it.

Plan review must show:

  • no accidental replacement of live cache-plane PVCs or buckets
  • no Kubernetes Secret data-key rotations unless the rotation is explicitly approved
  • no accidental removal of existing service placement selectors
  • no regression from S3-compatible state backend wording to HTTP-only backend assumptions
  • immutable image references remain pinned
  • Bazel cache remains a remote-cache endpoint, not remote execution

Apply Order

For this hardening rollout on an existing honey deployment, prefer small operator-applied steps:

  1. arc-runners: land the ARC state move, namespace policy, cache env, and scale-set guardrails.
  2. gitlab-runners: land namespace ownership, namespace policy, cache env, image pins, and storage/resource envelopes.
  3. runner-dashboard: land narrowed namespace-scoped RBAC and ARC namespace visibility.
  4. attic: adopt/import first, then land cache image pin updates; perform the server-config Secret hardening as a separate reviewed cutover.

Use the saved plan path:

just tofu-apply arc-runners

Do not run a broad apply if the reviewed plan includes namespace deletion, cache-plane PVC replacement, runner label taxonomy drift, or unplanned Secret replacement.

Post-Apply Verification

Verify namespace guardrails:

kubectl --context honey get resourcequota,limitrange -n arc-runners
kubectl --context honey get resourcequota,limitrange -n gitlab-runners

Verify ARC scale sets and warm-pool state:

kubectl --context honey get autoscalingrunnersets -n arc-runners
kubectl --context honey get cronjobs -n arc-runners
just arc-runtime-audit
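
To spot lanes that are not cold, the scale-set listing can be reduced to name, minRunners, and maxRunners and then filtered. This is a sketch: flag_warm_sets is a hypothetical helper, the sample names are invented, and the field paths .spec.minRunners and .spec.maxRunners are an assumption about the installed ARC CRD version.

```shell
# Sketch only: flag_warm_sets is a hypothetical helper.
# Input on stdin: "NAME MIN MAX" rows, header allowed. It prints every scale
# set whose minRunners column is not 0 (including "<none>") for review.
flag_warm_sets() {
  awk '$1 != "NAME" && $2 != "0" { print $1 " minRunners=" $2 }'
}

# Invented sample rows, shaped like the custom-columns output below.
printf '%s\n' \
  'NAME MIN MAX' \
  'example-docker 0 10' \
  'example-nix 2 6' | flag_warm_sets
# prints: example-nix minRunners=2

# Against a live cluster (assumed field paths):
#   kubectl --context honey get autoscalingrunnersets -n arc-runners \
#     -o custom-columns=NAME:.metadata.name,MIN:.spec.minRunners,MAX:.spec.maxRunners \
#     | flag_warm_sets
```

Any lane this prints should correspond to a documented warm pool with a matching cold schedule.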

Verify GitLab compatibility runners:

kubectl --context honey get deploy,hpa,pods -n gitlab-runners

Verify dashboard access remains bounded:

kubectl --context honey auth can-i list pods \
  --as system:serviceaccount:runner-dashboard:runner-dashboard \
  -n arc-runners
kubectl --context honey auth can-i list secrets \
  --as system:serviceaccount:runner-dashboard:runner-dashboard \
  -n arc-runners

The first command should be allowed (print yes) for configured runner namespaces. The second should be denied (print no).
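
The two expectations can be made explicit so that an unexpected answer fails loudly. The sketch below wraps the same kubectl auth can-i calls; expect_can_i is a hypothetical helper, not a repo command.

```shell
# Sketch only: expect_can_i is a hypothetical helper around kubectl auth can-i,
# which prints "yes" or "no".
# Arguments: expected answer, namespace, verb, resource.
expect_can_i() {
  expected="$1"; ns="$2"; verb="$3"; resource="$4"
  # "|| true" keeps a "no" answer (non-zero exit) from aborting under set -e.
  actual="$(kubectl --context honey auth can-i "$verb" "$resource" \
    --as system:serviceaccount:runner-dashboard:runner-dashboard \
    -n "$ns" 2>/dev/null || true)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $verb $resource in $ns -> $actual"
  else
    echo "UNEXPECTED: $verb $resource in $ns -> ${actual:-error} (wanted $expected)" >&2
    return 1
  fi
}

# Usage:
#   expect_can_i yes arc-runners list pods
#   expect_can_i no  arc-runners list secrets
```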

Verify cache attachment from the relevant environment:

just cache-contract-nix-strict
just cache-contract-strict

For developer-machine Bazel proof, configure an operator-provided endpoint first, then run:

just developer-cache-attachment-proof //:deployment_bundle false

This proves shared remote-cache attachment for a bounded target. It still does not prove Bazel remote execution.

Rollback Boundaries

If a plan or apply exposes drift:

  • stop before applying if a namespace, PVC, bucket, or cache Secret would be destroyed unexpectedly
  • lower maxRunners or tighten warm-pool windows if namespace quota rejects expected bursts
  • roll back a runner envelope by reverting the stack tfvars and applying a new reviewed plan
  • prefer replacing unhealthy runner pods through the controller instead of mutating live Helm-managed pod specs
  • preserve Tofu state authority; manual kubectl edits are drift unless they are a documented bounded recovery action

Keep claims precise after rollout: the desired product shape is repeatable, cache-first, scale-set runner infrastructure across forges. The current implemented proof is shared cache-backed execution across local and CI entry points where endpoints are present, with ARC as the primary scale-to-zero runner implementation and GitLab still on the compatibility path.
