GloriousFlywheel

Honey Runner Workdir Contract

Canonical lifecycle contract for persistent runner _work/* state on the honey GitHub Actions runner hosts.

Use this when a job dies inside actions/checkout before repository code runs, especially for EACCES unlink failures, stale read-only files, or ownership drift under _work/<repo>.
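
As a rough illustration of the symptom described above, the sketch below pulls the affected _work/<repo> name out of a Node-style EACCES unlink error line. The function name contaminated_repo and the sample path are hypothetical, and the exact log layout on honey hosts is an assumption; it is not part of the repo-owned triage tooling.

```python
import re
from typing import Optional

# Node-style filesystem error text, e.g.
#   EACCES: permission denied, unlink '/srv/runner/_work/demo-repo/demo-repo/out.bin'
# Assumption: checkout failures on honey hosts surface in this shape.
_EACCES_UNLINK = re.compile(
    r"EACCES: permission denied, unlink "
    r"'(?P<path>[^']*/_work/(?P<repo>[^/']+)/[^']*)'"
)


def contaminated_repo(log_line: str) -> Optional[str]:
    """Return the _work/<repo> name for an EACCES unlink failure, else None."""
    match = _EACCES_UNLINK.search(log_line)
    return match.group("repo") if match else None
```

A line that does not match returns None, which keeps unrelated checkout errors out of this failure class.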

Scope

This contract applies to:

the persistent _work/* trees on honey-am-* runner hosts
failures that happen before downstream repository code starts
operator recovery and escalation for contaminated repo workdirs

This contract does not redefine:

ARC pod memory or placement behavior
downstream repository cleanup after checkout succeeds
GitHub-hosted workflow paths

Contract

honey runner hosts may be long-lived, but _work/<repo> trees are disposable scratch, not durable repo state
checkout failures before repo code runs are platform-owned incidents, not downstream patch requests
the default recovery unit is one repo workdir on one host, not the whole _work/* root
bounded remediation happens only after the affected runner has been stopped or drained
unlock mode (--mode unlock) is an inspection and recovery aid, not a steady-state fix
if contamination is broader than one repo workdir, or ownership drift remains after bounded remediation, replace the affected runner service or host instead of widening salvage
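
To make "one repo workdir on one host" concrete, here is a minimal read-only sketch that scans a single _work/<repo> tree for ownership drift and entries without owner write permission. The function audit_repo_workdir and its result keys are hypothetical; the real checks live in scripts/honey-runner-workdir-audit.sh, whose behavior is not reproduced here.

```python
import os
import stat
from pathlib import Path
from typing import Dict, List


def audit_repo_workdir(workdir: Path, expected_uid: int) -> Dict[str, List[str]]:
    """Scan the entries below one _work/<repo> tree; never modifies anything."""
    findings: Dict[str, List[str]] = {"foreign_owner": [], "not_owner_writable": []}
    for root, dirs, files in os.walk(workdir):
        for name in dirs + files:
            path = os.path.join(root, name)
            info = os.lstat(path)  # lstat: inspect symlinks, do not follow them
            if info.st_uid != expected_uid:
                findings["foreign_owner"].append(path)
            # Symlink mode bits are not meaningful, so skip them here.
            if not stat.S_ISLNK(info.st_mode) and not info.st_mode & stat.S_IWUSR:
                findings["not_owner_writable"].append(path)
    return findings
```

The scan is deliberately scoped to the directory it is given, so it cannot drift into a sweep of the whole _work/* root.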

Default Recovery Flow

If you have the failing GitHub Actions run URL or run ID, first run just honey-runner-checkout-triage <run-url|run-id>; otherwise run just honey-runner-workdir-audit to audit the affected host set.
Run just honey-runner-workdir-reconcile against the affected host set when you need the direct host view or want to confirm that the run-driven extraction still matches the live host state.
Drain the affected runner service or host by running just honey-runner-host-lifecycle <host> drain, or stop it by other bounded operator means if the lifecycle command cannot control that host root.
Preview bounded recovery by running just honey-runner-workdir-remediate <host> <repo> without --apply, or re-run just honey-runner-workdir-reconcile after the drain if you want the repo-owned automation path to choose the safe single-repo candidates.
Apply bounded recovery:
- add --mode unlock --apply if inspection or manual follow-up is still needed
- add --apply alone (default remove mode) when the repo workdir should be discarded
Restart the affected runner by running just honey-runner-host-lifecycle <host> start, or replace the host if the lifecycle command cannot restore a clean active runner process.
Re-run the blocked downstream job.
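
The ordering of the flow above can be sketched as a dry-run plan builder. It only assembles the documented just invocations as argument lists and executes nothing. The function recovery_plan, its parameters, and the host name in the usage note are hypothetical. The optional reconcile step is left out for brevity.

```python
from typing import List, Optional


def recovery_plan(host: str, repo: str, run_ref: Optional[str] = None,
                  discard: bool = True) -> List[List[str]]:
    """Build, but do not run, the ordered commands for one repo on one host."""
    plan: List[List[str]] = []
    # Entry point: run-driven triage if a run URL or ID is known, else audit.
    if run_ref:
        plan.append(["just", "honey-runner-checkout-triage", run_ref])
    else:
        plan.append(["just", "honey-runner-workdir-audit"])
    # Remediation is only allowed after the runner has been drained.
    plan.append(["just", "honey-runner-host-lifecycle", host, "drain"])
    remediate = ["just", "honey-runner-workdir-remediate", host, repo]
    plan.append(list(remediate))  # preview: no --apply
    if discard:
        plan.append(remediate + ["--apply"])  # default remove mode
    else:
        plan.append(remediate + ["--mode", "unlock", "--apply"])
    plan.append(["just", "honey-runner-host-lifecycle", host, "start"])
    return plan
```

For example, recovery_plan("honey-am-1", "demo-repo", run_ref="123456") yields five commands, with the drain always placed before any --apply step. An operator could print the plan for review before running each command by hand.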

Escalate To Replacement

Treat replacement as the default next step when any of the following are true:

more than one repo workdir on the same host shows contamination
the same repo contamination recurs after bounded cleanup
ownership drift remains after unlock
the operator would otherwise need broad chown -R or root-level cleanup over _work/*

Meaning:

host identity may persist, but contaminated workspace state does not deserve preservation
do not normalize repo-local or operator-manual salvage as the primary steady state
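
The escalation conditions in this section can be read as a simple any-of check. The sketch below encodes them; the HostFindings fields and the should_replace function are hypothetical names chosen for illustration, not an existing interface in the scripts listed in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostFindings:
    contaminated_repos: int         # repo workdirs on this host showing contamination
    recurred_after_cleanup: bool    # same repo contaminated again after bounded cleanup
    drift_after_unlock: bool        # ownership drift still present after unlock
    needs_broad_root_cleanup: bool  # would need chown -R or root-level cleanup of _work/*


def should_replace(findings: HostFindings) -> bool:
    """Return True when replacement, not wider salvage, is the default next step."""
    return (
        findings.contaminated_repos > 1
        or findings.recurred_after_cleanup
        or findings.drift_after_unlock
        or findings.needs_broad_root_cleanup
    )
```

A single contaminated repo workdir with none of the other conditions set stays on the bounded recovery path; any one condition is enough to escalate.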

Allowed Operator Surfaces

scripts/honey-runner-workdir-audit.sh
scripts/honey-runner-workdir-remediate.sh
scripts/honey-runner-workdir-reconcile.sh
scripts/honey-runner-host-lifecycle.sh
scripts/honey-runner-checkout-triage.py
just honey-runner-workdir-audit
just honey-runner-workdir-remediate <host> <repo> [--mode unlock|remove] [--apply]
just honey-runner-workdir-reconcile [--apply --confirm-drained]
just honey-runner-host-lifecycle <host> [status|drain|start|restart]
just honey-runner-checkout-triage <run-url|run-id> [--repo <owner/name>]
the runner Runbook and Troubleshooting guides

Non-Goals

do not treat this as a downstream repository contract fix
do not widen one-repo remediation into arbitrary cleanup of the full _work/* root without escalation
do not depend on post-checkout cleanup inside the downstream repository for this failure class

Related

Current State: internal operating status and active gaps
Runbook: operator recovery steps
Troubleshooting: incident entry point
GloriousFlywheel Honey Runner Workspace Hygiene 2026-04-16: incident evidence and original problem framing