GloriousFlywheel Honey Runner Workspace Hygiene 2026-04-16

Snapshot date: 2026-04-16

Purpose

Capture a new class of platform failure exposed by downstream dogfooding: runner-host workspace state on honey can block jobs before repository code even runs.

This is not a downstream repo contract problem. It is a GloriousFlywheel runner hygiene and lifecycle problem.

Triggering Evidence

Downstream repo:

  • Jesssullivan/acuity-middleware
  • PR #37
  • latest failed run inspected: 24525417273

Failure shape:

  • both failing jobs died inside actions/checkout@v6
  • no repo build or test logic executed
  • both jobs failed on honey runner hosts with the same EACCES unlink error

Observed paths from the failed logs:

  • /home/jess/am-runners/honey-am-1/_work/acuity-middleware/acuity-middleware/pkg/LICENSE
  • /home/jess/am-runners/honey-am-2/_work/acuity-middleware/acuity-middleware/pkg/LICENSE

Observed error:

  • File was unable to be removed Error: EACCES: permission denied, unlink .../pkg/LICENSE

Read

Current platform read:

  • GloriousFlywheel runner pickup is working
  • repo contract is not the blocker here
  • stale read-only files in persistent runner work directories can prevent actions/checkout from cleaning a prior workspace
  • a repo-side cleanup step after checkout cannot fix this class of failure, because checkout itself is where the job dies

Meaning:

  • the immediate fix is runner-host cleanup or runner replacement
  • the durable fix belongs in GloriousFlywheel runner lifecycle hygiene

What This Blocks

Current direct downstream consequence:

  • acuity-middleware#37 is blocked even though the latest repo patch already attempted post-checkout cleanup

Broader implication:

  • any self-hosted runner on honey with a stale read-only file under _work/ can fail the same way
  • this can surface as seemingly random downstream CI instability even when the runner label contract and cache contract are correct

Immediate Operator Action

For the affected runners:

  1. stop, drain, or replace honey-am-1 and honey-am-2
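  2. remove the stale workdirs under _work/acuity-middleware, or restore owner write permission on them with chmod (the live audit below found no ownership mismatch, so chown alone would not help)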
  3. restart the affected runners
  4. rerun acuity-middleware#37
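
Step 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the contents of the repo's remediation script: the function names `unlock_tree` and `remove_workdir` and the `apply` parameter are hypothetical, chosen to mirror a dry-run-by-default workflow. The sketch restores owner permissions top-down so that each directory is unlocked before its entries are touched, then removes the tree.

```python
import os
import shutil
import stat


def unlock_tree(workdir, apply=False):
    """Return paths under workdir that lack owner permissions.

    Hypothetical helper: with apply=False (the default) nothing is changed,
    and the returned list is only a report of what would be unlocked.
    """
    changed = []

    def unlock(path):
        mode = os.lstat(path).st_mode
        if stat.S_ISLNK(mode):
            return  # os.chmod follows symlinks, so leave links alone
        # Directories need owner rwx to be listed and emptied;
        # files only need the owner-write bit restored.
        want = stat.S_IRWXU if stat.S_ISDIR(mode) else stat.S_IWUSR
        if mode & want != want:
            changed.append(path)
            if apply:
                os.chmod(path, stat.S_IMODE(mode) | want)

    unlock(workdir)
    # Top-down walk: a directory's entries are unlocked while visiting the
    # parent, before os.walk descends into them.
    for root, dirnames, filenames in os.walk(workdir):
        for name in dirnames + filenames:
            unlock(os.path.join(root, name))
    return changed


def remove_workdir(workdir, apply=False):
    """Unlock, then delete, one repo workdir. Dry run unless apply=True."""
    changed = unlock_tree(workdir, apply=apply)
    if apply:
        shutil.rmtree(workdir)
    return changed
```

Calling `remove_workdir(path)` without `apply=True` only reports what would change, which keeps the default safe on a host that is still draining.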

GloriousFlywheel Follow-On Work

The platform needs one explicit hygiene lane for runner-host workdirs:

  • decide whether honey runners are cattle or pets for workspace state
  • define a cleanup contract for _work/* between jobs or on failure
  • decide whether runner replacement is cheaper than in-place salvage
  • add an operator-facing audit and remediation path for stale workspace state
  • make sure recovery does not depend on downstream repo patches

Current repo-owned operator surfaces added after this note:

  • scripts/honey-runner-workdir-audit.sh
  • just honey-runner-workdir-audit
  • scripts/honey-runner-workdir-remediate.sh
  • just honey-runner-workdir-remediate <host> <repo> [--mode unlock|remove] [--apply]
  • docs/runners/runbook.md
  • docs/runners/troubleshooting.md

Contract Update 2026-04-19

The repo now carries one explicit honey runner workdir lifecycle contract:

  • honey hosts may be long-lived, but _work/<repo> trees are disposable scratch
  • checkout failures before downstream code runs are platform-owned incidents
  • bounded remediation targets one repo workdir on one host at a time
  • default recovery is: drain, remediate, restart (or replace), then rerun
  • if contamination is broader than one repo tree, or ownership drift remains after bounded remediation, the runner service or host should be replaced instead of widening salvage
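
The escalation rule in the last bullet can be read as a small decision function. The sketch below is only an illustration of that rule; the name `recovery_action`, its parameters, and the returned strings are hypothetical and do not come from the contract document.

```python
def recovery_action(contaminated_repos, drift_after_remediation=False):
    """Pick a recovery path for one runner (hypothetical helper).

    contaminated_repos: names of repo workdirs under _work/ with stale state.
    drift_after_remediation: True if ownership drift is still present after
    a bounded remediation attempt.
    """
    # Broader contamination or persistent drift: replace, do not widen salvage.
    if len(contaminated_repos) > 1 or drift_after_remediation:
        return "replace"
    # Exactly one contaminated repo tree: bounded remediation applies.
    if len(contaminated_repos) == 1:
        return "remediate"
    return "none"
```

For the incident recorded here, `recovery_action(["acuity-middleware"])` selects bounded remediation, since each affected runner holds only one repo workdir.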

Canonical contract surface:

  • docs/architecture/honey-runner-workdir-contract.md

Live Audit Update 2026-04-17

Live host audit on jess@100.113.89.12 narrowed the problem further:

  • the affected runner service roots are:
    • /home/jess/am-runners/honey-am-1/_work
    • /home/jess/am-runners/honey-am-2/_work
  • both currently contain only one repo workdir: acuity-middleware
  • both workdirs are large, about 5.9G
  • no stale .git/index.lock was present
  • no ownership mismatch was observed; expected owner remained jess:jess
  • both trees contain many non-writable .git/objects/* files
  • the original pkg/LICENSE path still exists in both runner workdirs
  • honey-am-2 still has the exact non-writable symptom from the GitHub log:
    • /home/jess/am-runners/honey-am-2/_work/acuity-middleware/acuity-middleware/pkg/LICENSE
    • mode 0555 (-r-xr-xr-x)
  • honey-am-1 has the same path but it is currently writable by owner:
    • mode 0755 (-rwxr-xr-x)

Meaning:

  • this is not generic ownership drift
  • it is stale persisted checkout state inside the acuity-middleware workdirs
  • honey-am-2 still has the concrete checkout blocker
  • both runner workdirs also have broader read-only git-object contamination
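
A check of this kind can be sketched in Python as follows. It is a minimal illustration, not the repo's audit script; `find_non_writable` and `describe` are hypothetical names. Because permission to unlink a file on POSIX systems depends on write access to its parent directory, the sketch reports non-writable directories separately from non-writable files.

```python
import os
import stat


def describe(path):
    """Format a path's mode like the audit notes, e.g. '0555 (-r-xr-xr-x)'."""
    mode = os.lstat(path).st_mode
    return f"{stat.S_IMODE(mode):04o} ({stat.filemode(mode)})"


def find_non_writable(workdir):
    """Return (files, dirs) under workdir that lack the owner-write bit."""
    files, dirs = [], []
    if not os.lstat(workdir).st_mode & stat.S_IWUSR:
        dirs.append(workdir)
    for root, dirnames, filenames in os.walk(workdir):
        for name in dirnames + filenames:
            path = os.path.join(root, name)
            mode = os.lstat(path).st_mode
            # Skip symlinks (their own mode bits are not meaningful here)
            # and anything the owner can already write.
            if stat.S_ISLNK(mode) or mode & stat.S_IWUSR:
                continue
            (dirs if stat.S_ISDIR(mode) else files).append(path)
    return sorted(files), sorted(dirs)
```

A non-empty `dirs` result deserves attention first, since a read-only directory blocks removal of everything inside it regardless of the individual file modes.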

Non-Goal

Do not treat this as a downstream repo fix request.

The failure happens before downstream code runs, so the platform should own the recovery and prevention path.
