Honey Runner Workdir Contract

Canonical lifecycle contract for persistent runner _work/* state on the honey GitHub Actions runner hosts.

Use this when a job dies inside actions/checkout before repository code runs, especially for EACCES unlink failures, stale read-only files, or ownership drift under _work/<repo>.
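
As a rough illustration of the two symptoms named above, the sketch below walks one repo workdir and flags entries that are owned by an unexpected user or that lack the owner write bit. It is not the implementation of scripts/honey-runner-workdir-audit.sh; the function name find_contamination and the expected_uid parameter are hypothetical, chosen only for this example.

```python
import os
import stat
from pathlib import Path
from typing import List, Tuple


def find_contamination(workdir: Path, expected_uid: int) -> List[Tuple[str, str]]:
    """Return (path, reason) pairs for entries that could break a later checkout.

    Hypothetical helper: expected_uid is assumed to be the uid of the runner
    service account on the host.
    """
    findings = []
    for root, dirs, files in os.walk(workdir):
        for name in dirs + files:
            path = os.path.join(root, name)
            info = os.lstat(path)  # lstat: never follow symlinks out of the tree
            if info.st_uid != expected_uid:
                findings.append((path, "ownership-drift"))
            elif not stat.S_ISLNK(info.st_mode) and not info.st_mode & stat.S_IWUSR:
                # A directory without owner write permission blocks unlink of
                # its children, which surfaces as EACCES during cleanup.
                findings.append((path, "read-only"))
    return findings
```

A scan like this is read-only, so it can run before the runner is drained; anything that changes the tree still belongs behind the stop-or-drain step of the recovery flow.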

Scope

This contract applies to:

  • the persistent _work/* trees on honey-am-* runner hosts
  • failures that happen before downstream repository code starts
  • operator recovery and escalation for contaminated repo workdirs

This contract does not redefine:

  • ARC pod memory or placement behavior
  • downstream repository cleanup after checkout succeeds
  • GitHub-hosted workflow paths

Contract

  • honey runner hosts may be long-lived, but _work/<repo> trees are disposable scratch, not durable repo state
  • checkout failures before repo code runs are platform-owned incidents, not downstream patch requests
  • the default recovery unit is one repo workdir on one host, not the whole _work/* root
  • bounded remediation happens only after the affected runner has been stopped or drained
  • unlock mode (--mode unlock) is an inspection and recovery aid, not a steady-state fix
  • if contamination is broader than one repo workdir, or ownership drift remains after bounded remediation, replace the affected runner service or host instead of widening salvage

Default Recovery Flow

  1. If you have the failing GitHub Actions run URL or run ID, start by running just honey-runner-checkout-triage <run-url|run-id>; otherwise, audit the affected host set by running just honey-runner-workdir-audit.
  2. Run just honey-runner-workdir-reconcile against the affected host set when you need the direct host view or want to confirm that the run-driven extraction still matches the live host state.
  3. Stop or drain the affected runner service or host.
  4. Preview bounded recovery by running just honey-runner-workdir-remediate <host> <repo> without --apply, or re-run just honey-runner-workdir-reconcile after the drain if you want the repo-owned automation path to choose the safe single-repo candidates.
  5. Apply bounded recovery:
    • --mode unlock --apply if inspection or manual follow-up is still needed
    • --apply with default remove mode when the repo workdir should be discarded
  6. Restart or replace the affected runner.
  7. Re-run the blocked downstream job.
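
The flow above can be sketched as a small plan builder that only assembles and prints the documented just invocations; it does not execute anything. This is a minimal sketch under stated assumptions: the names recovery_plan and render are hypothetical, the host and repo values in the usage lines are placeholders, and the optional reconcile step (step 2) is omitted. Steps with no recipe named in this contract (drain, restart, re-run) are marked as manual.

```python
import shlex
from typing import List, Optional, Tuple

# (description, argv); argv is None for steps with no documented recipe.
Step = Tuple[str, Optional[List[str]]]


def recovery_plan(host: str, repo: str, run_ref: Optional[str] = None,
                  unlock_only: bool = False) -> List[Step]:
    """Build the single-repo recovery sequence for one host (hypothetical helper)."""
    if run_ref:
        first: Step = ("triage the failing run",
                       ["just", "honey-runner-checkout-triage", run_ref])
    else:
        first = ("audit the affected host set",
                 ["just", "honey-runner-workdir-audit"])

    remediate = ["just", "honey-runner-workdir-remediate", host, repo]
    # Remove is the default mode; unlock is requested only when inspection
    # or manual follow-up is still needed.
    mode = ["--mode", "unlock"] if unlock_only else []

    return [
        first,
        ("stop or drain the affected runner", None),
        ("preview bounded recovery", remediate),
        ("apply bounded recovery", remediate + mode + ["--apply"]),
        ("restart or replace the affected runner", None),
        ("re-run the blocked downstream job", None),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as a numbered checklist with shell-quoted commands."""
    lines = []
    for number, (label, argv) in enumerate(plan, 1):
        command = shlex.join(argv) if argv else "(manual step)"
        lines.append(f"{number}. {label}: {command}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder host, repo, and run id, for illustration only.
    print(render(recovery_plan("honey-am-example", "example-repo", run_ref="12345")))
```

Keeping the preview and apply commands built from the same argument list makes it harder to apply against a different host or repo than the one that was previewed.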

Escalate To Replacement

Treat replacement as the default next step when any of the following is true:

  • more than one repo workdir on the same host shows contamination
  • the same repo contamination recurs after bounded cleanup
  • ownership drift remains after unlock
  • the operator would otherwise need broad chown -R or root-level cleanup over _work/*

Meaning:

  • host identity may persist, but contaminated workspace state does not deserve preservation
  • do not normalize repo-local or operator-manual salvage as the primary steady state
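
The escalation criteria above reduce to a simple any-of check. The sketch below encodes them as a predicate; the HostFindings type and its field names are hypothetical and exist only to make the four conditions explicit, and how each value is gathered is left to the operator.

```python
from dataclasses import dataclass


@dataclass
class HostFindings:
    """Operator observations for one runner host (hypothetical structure)."""
    contaminated_repos: int         # repo workdirs on this host showing contamination
    recurred_after_cleanup: bool    # same repo contaminated again after bounded cleanup
    drift_after_unlock: bool        # ownership drift still present after unlock
    needs_broad_root_cleanup: bool  # would require chown -R or root-level cleanup of _work/*


def should_replace(findings: HostFindings) -> bool:
    """Return True when any single escalation criterion is met."""
    return (
        findings.contaminated_repos > 1
        or findings.recurred_after_cleanup
        or findings.drift_after_unlock
        or findings.needs_broad_root_cleanup
    )
```

For example, a host with one contaminated repo workdir and no other signals stays on the bounded remediation path, while the same host with lingering ownership drift after unlock moves to replacement.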

Allowed Operator Surfaces

  • scripts/honey-runner-workdir-audit.sh
  • scripts/honey-runner-workdir-remediate.sh
  • scripts/honey-runner-workdir-reconcile.sh
  • scripts/honey-runner-checkout-triage.py
  • just honey-runner-workdir-audit
  • just honey-runner-workdir-remediate <host> <repo> [--mode unlock|remove] [--apply]
  • just honey-runner-workdir-reconcile [--apply --confirm-drained]
  • just honey-runner-checkout-triage <run-url|run-id> [--repo <owner/name>]
  • the runner Runbook and Troubleshooting guides

Non-Goals

  • do not treat this as a downstream repository contract fix
  • do not widen one-repo remediation into arbitrary cleanup of the full _work/* root without escalation
  • do not depend on post-checkout cleanup inside the downstream repository for this failure class
