Live Runner Rollout Checklist
Use this checklist when rolling the hardened runner and cache substrate onto a real cluster. It is intentionally narrower than a general deployment guide: it covers the current ARC, GitLab compatibility, dashboard, and cache-plane guardrails needed for repeatable cache-first runner infrastructure.
Current product boundary:
- GitHub Actions ARC is the primary scale-set path.
- ARC lanes should default to minRunners = 0 and scale up to committed machine-envelope limits.
- Scheduled warm pools are explicit exceptions and must have matching cold schedules.
- GitLab runners are a compatibility path with HPA-based scaling, not yet queue-driven ARC-equivalent scale-to-zero.
- Bazel is remote-cache backed today; Bazel remote execution is not selected or proved.
- Local direnv and flake shells should attach to the same Attic and Bazel cache family as CI when the operator-provided endpoints are available.
Before Touching The Cluster
Enter the repo-managed shell first so local checks use the same contract as CI:
direnv allow
just info
The local shell may report GF_BAZEL_SUBSTRATE_MODE=compatibility-local-only
until an operator-provided Bazel endpoint is present. That is acceptable for
static validation, but it is not proof of cache-backed Bazel execution.
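A minimal sketch of how a local script might surface that distinction. Only the compatibility-local-only value comes from this checklist; the function name and the handling of other values are assumptions, so any other mode is treated as something to confirm rather than as proof.

```shell
# Sketch: classify the Bazel substrate mode reported by the local shell.
# Only compatibility-local-only is taken from this checklist; any other
# value is reported as "confirm first", never as proof of cache attachment.
describe_bazel_mode() {
  mode="${GF_BAZEL_SUBSTRATE_MODE:-unset}"
  case "$mode" in
    compatibility-local-only)
      echo "static validation only: no operator-provided Bazel endpoint" ;;
    unset)
      echo "mode not reported: run inside the repo-managed shell" ;;
    *)
      echo "mode '$mode': confirm the cache endpoint before claiming cache-backed execution" ;;
  esac
}
```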
Run the offline hardening checks:
just tofu-validate-all
just check
git diff --check
These checks cover:
- OpenTofu formatting and initialized stack validation
- live tfvars naming and image pinning
- provider lockfile presence
- runner cache env injection
- Attic public key agreement across runner stacks
- ARC/GitLab scale contracts
- namespace ResourceQuota and LimitRange capacity modeling
- setup-flywheel and workflow cache proof shape
- the current no-RBE operational boundary
Stop before planning if any of those checks fail.
Planning Preconditions
Set the operator context explicitly:
export ENV=dev
export KUBE_CONTEXT=honey
export TOFU_BACKEND_S3_ENDPOINT=...
export TOFU_BACKEND_S3_BUCKET=...
export TOFU_BACKEND_S3_ACCESS_KEY=...
export TOFU_BACKEND_S3_SECRET_KEY=...
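A preflight check along these lines can catch an incomplete operator context before any init or plan runs. This is a sketch: the variable names are the ones exported above, while the function name require_operator_env is hypothetical and not part of the repo.

```shell
# Sketch: fail early when a required operator variable is unset or empty.
# Variable names come from the export list; the function name is hypothetical.
require_operator_env() {
  missing=""
  for name in ENV KUBE_CONTEXT \
    TOFU_BACKEND_S3_ENDPOINT TOFU_BACKEND_S3_BUCKET \
    TOFU_BACKEND_S3_ACCESS_KEY TOFU_BACKEND_S3_SECRET_KEY; do
    # Indirect lookup through eval keeps this POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing operator context:$missing" >&2
    return 1
  fi
  # Print only the non-secret values.
  echo "operator context ok: ENV=$ENV KUBE_CONTEXT=$KUBE_CONTEXT"
}
```

Calling require_operator_env before the first init keeps the credentials out of the output and stops the session if any value is missing.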
Then initialize and plan one stack at a time:
just tofu-init arc-runners
just tofu-plan arc-runners
just tofu-plan-guard arc-runners
Repeat for gitlab-runners, runner-dashboard, and attic only after the
stack-specific notes below are satisfied. Review every plan before applying the
saved tfplan.
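The per-stack sequence can be wrapped so that a failed init or plan never reaches the guard step. The sketch below assumes the three just recipes named above; JUST_BIN is a hypothetical override added only so the sequence can be dry-run with echo.

```shell
# Sketch: run init, plan, and guard for one stack, stopping at the first failure.
# JUST_BIN is a hypothetical override; it defaults to the real just binary.
plan_stack() {
  stack="$1"
  just_bin="${JUST_BIN:-just}"
  "$just_bin" tofu-init "$stack" &&
    "$just_bin" tofu-plan "$stack" &&
    "$just_bin" tofu-plan-guard "$stack"
}

# Dry run that only prints the recipe names:
#   JUST_BIN=echo plan_stack arc-runners
```

Running it once per stack, rather than in a loop over all four, keeps the review of each saved plan as a deliberate step.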
tofu-apply runs the same saved-plan guard by default. It blocks destructive
namespace, namespace-policy, cache PVC, object bucket, ARC controller,
repo-scoped ARC registration URL, and repo-shaped runner label drift before
apply. Secret replacement and runner Helm release deletion are printed as
review items because they can be legitimate rotation or retirement work, but
they still require operator attention.
State Moves And Adoption Checks
ARC Runners
The ARC stack owns:
- arc-systems
- arc-runners
- ARC controller Helm release when deploy_arc_controller = true
- runner scale-set Helm releases
- runner namespace ResourceQuota and LimitRange
The current stack includes a state move from module.arc_controller to
module.arc_controller[0]. Existing core deployments must apply once with
deploy_arc_controller = true before any owner overlay disables controller
ownership.
Plan review must show:
- no destroy/recreate for the ARC controller in an existing healthy install
- minRunners = 0 for cold ARC scale sets
- any warm-pool lane has a cold schedule
- maxRunners and pod resource envelopes stay within the namespace quota
- cache variables are still injected into docker, dind, and nix runner pods
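The quota item reduces to simple arithmetic: the worst-case burst is maxRunners times the per-pod request, and that product has to fit in the namespace quota. The sketch below checks CPU only, in millicores, with illustrative numbers; the real capacity model in the offline checks may also account for memory, storage, and sidecar containers.

```shell
# Sketch: does maxRunners x per-pod CPU request fit inside the namespace CPU quota?
# All quantities are millicores; the example numbers are illustrative only.
fits_cpu_quota() {
  max_runners="$1"
  pod_cpu_millis="$2"
  quota_cpu_millis="$3"
  needed=$((max_runners * pod_cpu_millis))
  if [ "$needed" -le "$quota_cpu_millis" ]; then
    echo "fits: ${needed}m of ${quota_cpu_millis}m"
  else
    echo "over quota: ${needed}m of ${quota_cpu_millis}m"
    return 1
  fi
}

# Example: 8 runners at 2000m each against a 16000m quota.
#   fits_cpu_quota 8 2000 16000
```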
GitLab Runners
The GitLab compatibility stack now owns the gitlab-runners namespace at the
stack root and applies namespace policy before runner Helm releases. It also
includes a state move from the old Nix module namespace resource to the new
root namespace resource.
Plan review must show:
- no namespace deletion for gitlab-runners
- the state move is accepted cleanly, or the existing namespace is imported before apply
- runner modules do not recreate namespace ownership themselves
- GitLab manager HPA limits and concurrent_jobs stay within quota
- job, helper, and service containers keep explicit cache and storage limits
GitLab is still a parity target, not the primary scale-to-zero implementation. Do not describe HPA min/max behavior as ARC-equivalent queue scaling.
Runner Dashboard
The dashboard RBAC should remain namespace-scoped. Plan review must show:
- no cluster-wide secret list/watch authority
- dashboard access to the GitLab runner namespace through runners_namespace
- dashboard read access to the configured ARC namespaces through arc_namespaces
- no mutation authority beyond the documented compatibility-backed flows
Attic And Bazel Cache
The honey Attic stack is an adoption path for resources that already exist on
the cluster. Do not apply it until existing resources are imported or otherwise
accounted for in state.
Current honey status as of 2026-05-06 UTC: the destructive adoption blocker is
cleared, but a broad Attic apply still requires saved-plan review. The TIN-980
state-adoption pass removed stale old Kubernetes provider-type addresses,
imported the live RustFS/Attic API/GC/secret objects at _v1 addresses, and
imported the live attic-config ConfigMap as an explicit honey adoption mode.
The latest operator plan is 6 to add, 14 to change, 0 to destroy, and
just tofu-plan-guard attic passes. The current honey tfvars intentionally
read the live database URL, Attic JWT signing key, and RustFS credentials from
existing service Secrets; saved-plan review must keep Secret data keys stable
until an explicit rotation is approved.
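A quick way to confirm the "0 to destroy" part of a plan summary is to parse the summary line from plain-text plan output. This is only a sketch of that one check, assuming output produced without color codes; it is not the implementation of just tofu-plan-guard, which inspects specific resource types.

```shell
# Sketch: extract the destroy count from an OpenTofu plan summary line.
# Reads plain (no-color) plan text on stdin; prints the count, or fails
# when no "Plan: ..." summary line is present.
plan_destroy_count() {
  count="$(sed -n 's/^Plan: .* \([0-9][0-9]*\) to destroy\..*$/\1/p' | head -n 1)"
  if [ -z "$count" ]; then
    echo "no plan summary line found" >&2
    return 1
  fi
  echo "$count"
}

echo "Plan: 6 to add, 14 to change, 0 to destroy." | plan_destroy_count
```

A zero here is necessary but not sufficient: in-place changes to Secret data keys or node selectors still need the saved-plan review described in this section.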
At minimum, verify adoption for:
- nix-cache namespace
- honey’s existing attic-config ConfigMap, or the generated attic-server-config Secret after a deliberate hardening cutover
- existing service Secrets, including attic-secrets, attic-jwt-signing, attic-rustfs-credentials, and bazel-cache-s3-credentials
- RustFS or other S3-compatible backing resources selected by the stack
- Bazel cache service resources
Before applying, converge the current drift explicitly:
- migrate or import old non-_v1 Kubernetes provider state addresses left from the pre-342efb4 resource type migration
- make the RustFS model match the durable live object, deployment/attic-rustfs-openebs with OpenEBS ZFS storage, or record a reviewed data migration away from it
- keep the legacy attic-rustfs service aliases pointed at the intended live backend; do not recreate the old local-path attic-rustfs StatefulSet as an accidental rollback
- keep database_url_secret_name pointed at the existing service Secret when adopting honey without committing a database connection string to tfvars
- keep attic_jwt_signing_secret_name and rustfs_credentials_secret_name pointed at the existing service Secrets during honey adoption unless the plan is an explicit credential rotation
- preserve the live honey node selectors for Attic API, Attic GC, and bazel-cache pods unless the plan is an explicit scheduling-policy change
- separate the server-config ConfigMap-to-Secret hardening from the current adoption plan; it can be valid, but it should not appear as surprise churn in a broad adoption plan
- retain completed bootstrap Jobs during honey adoption with rustfs_bootstrap_job_ttl_seconds_after_finished = null and init_cache_job_ttl_seconds_after_finished = null; short Kubernetes TTLs garbage-collect successful Jobs and turn otherwise stable managed objects into recurring add-only plan noise
OpenTofu 1.8 does not allow declarative moved blocks or tofu state mv across
Kubernetes provider resource aliases such as kubernetes_service to
kubernetes_service_v1. Treat those as state adoption operations:
- remove the stale old-type address from state
- import the same live object at the new _v1 address
- run just tofu-plan attic
- run just tofu-plan-guard attic
For the RustFS portion, the known old-type addresses are:
module.rustfs[0].kubernetes_secret.credentials
module.rustfs[0].kubernetes_service.api
module.rustfs[0].kubernetes_service.headless
module.rustfs[0].kubernetes_stateful_set.rustfs
module.rustfs[0].kubernetes_job.create_buckets
module.rustfs[0].kubernetes_job.apply_lifecycle[0]
The known live RustFS objects that need new-address imports after the code path is deployed are:
module.rustfs[0].kubernetes_secret_v1.credentials nix-cache/attic-rustfs-credentials
module.rustfs[0].kubernetes_service_v1.api nix-cache/attic-rustfs
module.rustfs[0].kubernetes_service_v1.headless nix-cache/attic-rustfs-hl
module.rustfs[0].kubernetes_stateful_set_v1.rustfs nix-cache/attic-rustfs
module.rustfs[0].kubernetes_deployment_v1.rustfs[0] nix-cache/attic-rustfs-openebs
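The remove-then-import pairing can be sketched as a dry run that prints the commands for one object instead of executing them. The addresses and import ID below are taken from the lists above; invoking tofu directly is an assumption, since the repo may wrap state operations in its own recipes with backend and variable settings.

```shell
# Sketch: print the state-adoption commands for one object without running them.
# Arguments: old-type address, new _v1 address, import ID (namespace/name).
print_adoption_commands() {
  old_addr="$1"
  new_addr="$2"
  import_id="$3"
  # Addresses are single-quoted in the output because index brackets
  # such as [0] are glob characters in the shell.
  printf "tofu state rm '%s'\n" "$old_addr"
  printf "tofu import '%s' '%s'\n" "$new_addr" "$import_id"
}

print_adoption_commands \
  'module.rustfs[0].kubernetes_service.api' \
  'module.rustfs[0].kubernetes_service_v1.api' \
  'nix-cache/attic-rustfs'
```

Printing first gives the operator a reviewable list; the plan and plan-guard steps still follow once the real commands have been run.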
The RustFS bucket/lifecycle Jobs used to be TTL-style operational Jobs. For
honey adoption, keep them retained after completion so OpenTofu can continue to
observe the managed _v1 Job objects after the first reviewed apply. If an old
TTL-managed Job has already disappeared, remove the old provider-type state
address and let the retained _v1 Job be created by the eventual reviewed
apply.
When the live legacy StatefulSet is intentionally drained, preserve immutable
claim-template fields with rustfs_statefulset_storage_class instead of
forcing the drained StatefulSet to match the active Deployment’s OpenEBS PVC.
The stack supports two server-config modes. secret is the target hardening
mode for generated config. existing_config_map preserves honey’s live
attic-config object during adoption so the ConfigMap-to-Secret move can be
reviewed and rolled separately. Both modes still place sensitive config
material in OpenTofu state until a later state-secret hardening pass
externalizes it.
Plan review must show:
- no accidental replacement of live cache-plane PVCs or buckets
- no Kubernetes Secret data-key rotations unless the rotation is explicitly approved
- no accidental removal of existing service placement selectors
- no regression from S3-compatible state backend wording to HTTP-only backend assumptions
- immutable image references remain pinned
- Bazel cache remains a remote-cache endpoint, not remote execution
Apply Order
For this hardening rollout on an existing honey deployment, prefer small
operator-applied steps:
- arc-runners: land the ARC state move, namespace policy, cache env, and scale-set guardrails.
- gitlab-runners: land namespace ownership, namespace policy, cache env, image pins, and storage/resource envelopes.
- runner-dashboard: land narrowed namespace-scoped RBAC and ARC namespace visibility.
- attic: adopt/import first, then land cache image pin updates; perform the server-config Secret hardening as a separate reviewed cutover.
Use the saved plan path:
just tofu-apply arc-runners
Do not run a broad apply if the reviewed plan includes namespace deletion, cache-plane PVC replacement, runner label taxonomy drift, or unplanned Secret replacement.
Post-Apply Verification
Verify namespace guardrails:
kubectl --context honey get resourcequota,limitrange -n arc-runners
kubectl --context honey get resourcequota,limitrange -n gitlab-runners
Verify ARC scale sets and warm-pool state:
kubectl --context honey get autoscalingrunnersets -n arc-runners
kubectl --context honey get cronjobs -n arc-runners
just arc-runtime-audit
Verify GitLab compatibility runners:
kubectl --context honey get deploy,hpa,pods -n gitlab-runners
Verify dashboard access remains bounded:
kubectl --context honey auth can-i list pods \
--as system:serviceaccount:runner-dashboard:runner-dashboard \
-n arc-runners
kubectl --context honey auth can-i list secrets \
--as system:serviceaccount:runner-dashboard:runner-dashboard \
-n arc-runners
The first command should be allowed for configured runner namespaces. The second should not be allowed.
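Those two expectations can be asserted rather than read by eye. The sketch below compares the yes/no answer printed by kubectl auth can-i with the expected one; the function name and the KUBECTL_BIN override are hypothetical, and the context and service account are the ones used in the commands above.

```shell
# Sketch: compare a `kubectl auth can-i` answer with the expected one.
# KUBECTL_BIN is a hypothetical override so the check can be run against a stub.
expect_can_i() {
  expected="$1" verb="$2" resource="$3" namespace="$4"
  kubectl_bin="${KUBECTL_BIN:-kubectl}"
  # can-i exits non-zero on "no", so ignore the status and compare the text.
  answer="$("$kubectl_bin" --context honey auth can-i "$verb" "$resource" \
    --as system:serviceaccount:runner-dashboard:runner-dashboard \
    -n "$namespace" 2>/dev/null || true)"
  if [ "$answer" = "$expected" ]; then
    echo "ok: $verb $resource in $namespace is $answer"
  else
    echo "unexpected: $verb $resource in $namespace is '$answer', want $expected" >&2
    return 1
  fi
}

# Example:
#   expect_can_i yes list pods arc-runners
#   expect_can_i no list secrets arc-runners
```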
Verify cache attachment from the relevant environment:
just cache-contract-nix-strict
just cache-contract-strict
For developer-machine Bazel proof, use an operator-provided endpoint first, then run:
just developer-cache-attachment-proof //:deployment_bundle false
This proves shared remote-cache attachment for a bounded target. It still does not prove Bazel remote execution.
Rollback Boundaries
If a plan or apply exposes drift:
- stop before applying if a namespace, PVC, bucket, or cache Secret would be destroyed unexpectedly
- lower maxRunners or tighten warm-pool windows if namespace quota rejects expected bursts
- roll back a runner envelope by reverting the stack tfvars and applying a new reviewed plan
- prefer replacing unhealthy runner pods through the controller instead of mutating live Helm-managed pod specs
- preserve Tofu state authority; manual kubectl edits are drift unless they are a documented bounded recovery action
Keep claims precise after rollout: the desired product shape is repeatable, cache-first, scale-set runner infrastructure across forges. The current implemented proof is shared cache-backed execution across local and CI entry points where endpoints are present, with ARC as the primary scale-to-zero runner implementation and GitLab still on the compatibility path.