GloriousFlywheel Dogfood Reality Gap Analysis

Date: 2026-04-25

Linear: TIN-548

Purpose

This note records a cautious source-repo dogfood reality pass, made after the stability, pooled-substrate, cache-authority, auth-authority, and TIN-545 hardening work was completed.

The goal is not to invent a new architecture. The goal is to compare the written contract against the live repo surfaces and separate:

  • proved source-repo dogfood truth
  • implementation gaps
  • compatibility surfaces that must not steer the product story
  • historical residue that still looks executable enough to mislead agents

Current Proved Truth

At audit start, /Users/jess/git/GloriousFlywheel was clean on main at 5850bdacd9b46416a729a6002e24105a978258dd.

The latest default-branch proof package on that commit was green:

  • Platform Proof run 24922182403
  • Source Bazel Proof run 24922182401
  • Validate run 24922182399
  • Secret Detection run 24922182398
  • Tranche Proof Status run 24922182395
  • Deploy Docs run 24922182402
  • Publish to FlakeHub run 24922182393

The green proof matters. It does not mean every source-repo workflow already executes on the pooled GloriousFlywheel substrate.

The strongest source-repo dogfood proof is currently:

  • Source Bazel Proof runs on tinyland-nix
  • it requires BAZEL_REMOTE_CACHE
  • it requires GF_BAZEL_SUBSTRATE_MODE=shared-cache-backed
  • it calls scripts/cache-attachment-contract.sh --strict
  • it enters nix develop .#ci
  • it runs Bazel through scripts/bazel-cache-backed.sh

That proves shared cache acceleration through the wrapper path. It does not prove universal remote execution or full remote builder offload for every local developer workload.
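
As a minimal sketch of the environment side of that proof path, the two variable requirements listed above can be expressed as a preflight function. This is an illustration only and is not the contents of scripts/cache-attachment-contract.sh; the function name check_cache_attachment is hypothetical.

```shell
# Hypothetical preflight sketch; NOT the real scripts/cache-attachment-contract.sh.
# Returns 0 only when both variables named in the proof list are set as expected.
check_cache_attachment() {
  if [ -z "${BAZEL_REMOTE_CACHE:-}" ]; then
    echo "BAZEL_REMOTE_CACHE is not set" >&2
    return 1
  fi
  if [ "${GF_BAZEL_SUBSTRATE_MODE:-}" != "shared-cache-backed" ]; then
    echo "GF_BAZEL_SUBSTRATE_MODE must be shared-cache-backed" >&2
    return 1
  fi
  return 0
}
```

A passing check only says the cache contract variables are present. It says nothing about remote execution, which matches the boundary stated above.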

2026-04-26 Refresh

The latest audited default-branch proof package is stronger than the original TIN-548 snapshot, but the core boundary remains the same.

Current verified truth:

  • main at 7ae6b3d653199ec1dc5299f2a541a63225a9aa94 passed the proof package: Source Bazel Proof, Platform Proof, Validate, Secret Detection, Deploy Docs, and Publish to FlakeHub
  • just attic-cache-authority-check reports the live main Attic cache as public-read, with anonymous metadata returning HTTP 200
  • the Source Bazel Proof passed a real BAZEL_REMOTE_CACHE=grpc://bazel-cache.nix-cache.svc.cluster.local:9092 endpoint through scripts/bazel-cache-backed.sh and reported one remote cache hit
  • the Platform Proof showed runner-dashboard fetched from http://attic.nix-cache.svc.cluster.local/main and a post-job Attic delta push
  • ARC listener pods for the Tinyland and Jess owner-overlay scale sets were running during the audit, and no runner payload pods were pending at that instant

Current negative truth:

  • the active Bazel config still contains no --remote_executor or equivalent Bazel remote-execution path; the implemented surface is remote cache
  • local developer sessions are compatibility-local-only unless an operator provides a routable BAZEL_REMOTE_CACHE
  • Bazel external repository fetches still happen before action-cache hits can help, so TIN-643 remains real product debt
  • owner-overlay scale-set names solve GitHub registration/auth boundaries, but they do not create a global concurrency policy across shared labels
  • direct full-repo public visibility remains blocked by internal history and current-tree exposure; the safe route is still the scrubbed public-alpha export/mirror

So the current product status is: source-repo shared cache dogfood is green; developer-machine parity, broad cross-repo adoption, and Bazel remote execution are not complete.
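
The first negative-truth item can be re-checked mechanically. The sketch below assumes the Bazel configuration lives in bazelrc-style files and that stripping everything after a "#" is an acceptable approximation of comment handling; the function name has_remote_executor is hypothetical.

```shell
# Return 0 if the given bazelrc-style file sets --remote_executor outside a
# comment, 1 otherwise. Comment stripping here is deliberately naive.
has_remote_executor() {
  sed 's/#.*$//' "$1" | grep -Eq -e '--remote_executor([=[:space:]]|$)'
}
```

For example, `has_remote_executor .bazelrc || echo "remote cache only"` would report a cache-only configuration. The path .bazelrc is a placeholder for whichever config files the repo actually loads.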

Contract Baseline

The active contract is:

  • GloriousFlywheel is a pooled build, cache, and runner substrate.
  • Local development and CI are meant to ride the same shared substrate.
  • Capability classes, not repo names, define runner taxonomy.
  • Raw local Bazel and Bazelisk are not the default product path.
  • If the implementation only proves cache-backed local execution, say that explicitly.

This remains the right contract. The gaps below are places where implementation, examples, or workflow reality still fall short of that contract.

Gap 1: Hosted Default-Branch Workflows Still Exist

The default-branch proof package is green, but several workflows still run on GitHub-hosted ubuntu-latest runners.

Current hosted surfaces:

  • .github/workflows/validate.yml
  • .github/workflows/secrets-scan.yml
  • .github/workflows/flakehub-publish.yml
  • .github/workflows/pages.yml
  • .github/workflows/mirror-images.yml
  • .github/workflows/release.yml for release metadata creation
  • .github/workflows/tranche-proof-status.yml

Assessment:

  • This is a real dogfood gap.
  • It is not the same as a failed proof.
  • Some jobs may remain control-plane or third-party-action exceptions for a while, but they should be named exceptions or migrated intentionally.
  • The current repo should not imply that all default-branch work already avoids hosted runners.

Likely next slice:

  • Create a hosted-runner exception register and migrate the low-risk jobs first.
  • Candidate low-risk migrations: validate.yml, secrets-scan.yml, and tranche-proof-status.yml.
  • Leave GitHub Pages deploy, FlakeHub OIDC publish, image mirroring, and release metadata as separately evaluated control-plane jobs until proved on self-hosted lanes.
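
One possible shape for the exception-register check is sketched below. It assumes the register is a plain text file with one workflow filename per line; that file, its format, and the function name unclassified_hosted_workflows are illustrative assumptions, not existing repo surfaces.

```shell
# Print each workflow file that references ubuntu-latest but is not named in
# the exception register. Register format (one filename per line) is assumed.
unclassified_hosted_workflows() {
  workflows_dir=$1
  register=$2
  for wf in "$workflows_dir"/*.yml; do
    [ -f "$wf" ] || continue
    if grep -Eq 'runs-on:.*ubuntu-latest' "$wf"; then
      name=$(basename "$wf")
      # -F fixed string, -x whole-line match, -q quiet
      grep -Fxq "$name" "$register" || echo "$name"
    fi
  done
}
```

Any output means a hosted workflow is neither migrated nor a named exception. The grep is line-based, so matrix-driven or expression-based runs-on values would need a real YAML parser.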

Gap 2: Active Bazel Docs Still Normalize Direct Bazel Commands

Several active docs and examples still show direct bazel build or bazel test commands as the normal invocation after cache attachment.

Examples:

  • docs/build-system/bazel-targets.md
  • docs/architecture/bazel-version-policy.md
  • docs/guides/adoption-quickstart.md
  • docs/runners/downstream-migration-checklist.md
  • examples/github/cache-backed-workflow.yml
  • examples/gitlab/.gitlab-ci.yml
  • examples/flake/flake.nix

Assessment:

  • This is a real narrative and agent-safety gap.
  • Passing --remote_cache="$BAZEL_REMOTE_CACHE" explicitly is better than the old literal-placeholder bug, but it still teaches direct Bazel invocation as the path users copy.
  • The product story should center a wrapper or repo-managed entrypoint that performs the strict cache-attachment preflight before Bazel executes.

Likely next slice:

  • Add or expose a reusable consumer-side cache-backed Bazel wrapper example.
  • Rewrite active examples to call that wrapper or a just recipe, not direct bazel build.
  • Keep raw Bazel text only in explicit compatibility/debug sections.
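
A consumer-side wrapper could look roughly like the following. This is a sketch under stated assumptions, not the repo's scripts/bazel-cache-backed.sh: the function name cache_backed_bazel and the GF_BAZEL_DRY_RUN variable are hypothetical and exist here only so the composed command can be inspected without running Bazel.

```shell
# Hypothetical consumer-side wrapper: cache_backed_bazel <subcommand> [args...]
# Refuses to run without BAZEL_REMOTE_CACHE, then passes it explicitly.
cache_backed_bazel() {
  if [ -z "${BAZEL_REMOTE_CACHE:-}" ]; then
    echo "cache_backed_bazel: BAZEL_REMOTE_CACHE is required" >&2
    return 1
  fi
  if [ "$#" -lt 1 ]; then
    echo "usage: cache_backed_bazel <subcommand> [args...]" >&2
    return 2
  fi
  subcommand=$1
  shift
  # GF_BAZEL_DRY_RUN is an illustrative variable: print the command only.
  if [ "${GF_BAZEL_DRY_RUN:-}" = "1" ]; then
    echo "bazel $subcommand --remote_cache=$BAZEL_REMOTE_CACHE $*"
    return 0
  fi
  bazel "$subcommand" "--remote_cache=$BAZEL_REMOTE_CACHE" "$@"
}
```

Docs and examples would then show `cache_backed_bazel build //...` or an equivalent just recipe, so the preflight is part of the path users copy.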

Gap 3: The Devshell Still Exposes A Bare bazel Wrapper

flake.nix exposes bazel as a compatibility wrapper around Bazelisk. The comments say routine usage should go through just bazel-build-cached, but the command is still available.

Assessment:

  • This is not automatically wrong, because scripts/bazel-cache-backed.sh needs a Bazel binary to call.
  • It is still an enforcement gap: the repo relies on docs and agent guidance to stop raw bazel use instead of making misuse harder.

Likely next slice:

  • Consider a guarded devshell bazel shim that refuses heavy commands unless GF_BAZEL_SUBSTRATE_MODE=shared-cache-backed and BAZEL_REMOTE_CACHE are present, with an explicit escape hatch for compatibility debugging.
  • Do this carefully because bazel clean, bazel query, and wrapper-invoked commands need a clear allowance model.
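
The allowance model could be sketched as a guard that classifies the subcommand before anything runs. The heavy/light split below and the GF_BAZEL_ALLOW_RAW escape hatch are assumptions for illustration; neither is an existing repo contract.

```shell
# Decide whether a bazel subcommand may run in the current environment.
# Returns 0 to allow, 1 to refuse. GF_BAZEL_ALLOW_RAW is hypothetical.
gf_bazel_guard() {
  subcommand=${1:-help}
  case "$subcommand" in
    build|test|run|coverage)
      ;;            # heavy commands: continue to the environment checks
    *)
      return 0 ;;   # clean, query, info, version, etc. are always allowed
  esac
  if [ "${GF_BAZEL_ALLOW_RAW:-}" = "1" ]; then
    echo "warning: raw bazel $subcommand allowed by GF_BAZEL_ALLOW_RAW=1" >&2
    return 0
  fi
  if [ -n "${BAZEL_REMOTE_CACHE:-}" ] &&
     [ "${GF_BAZEL_SUBSTRATE_MODE:-}" = "shared-cache-backed" ]; then
    return 0        # wrapper-invoked commands pass because both are set
  fi
  echo "refusing bazel $subcommand: shared cache contract not satisfied" >&2
  return 1
}
```

The sketch treats the first argument as the subcommand, so Bazel startup options placed before the subcommand would need extra parsing in a real shim.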

Gap 4: GitLab Compatibility Surfaces Preserve Stale Cache Drift

The primary GitHub path is the current product path, but GitLab compatibility files are still live enough to mislead.

Observed drift:

  • .gitlab-ci.yml still sets BAZEL_REMOTE_CACHE=grpc://bazel-cache.attic-cache-dev.svc.cluster.local:9092
  • .gitlab/ci/jobs/bazel-build.gitlab-ci.yml builds a user.bazelrc and runs direct nix develop .#ci --command bazel ...
  • config/organization-s3.example.yaml and config/organization-ha.example.yaml still use attic-cache-dev namespace examples
  • docs/infrastructure/overlay-creation.md still teaches the old GitLab-first overlay path with attic-cache-dev examples

Assessment:

  • This is a compatibility-surface gap, not the primary source-repo proof path.
  • It is still dangerous because these files are active tracked examples and validate targets, not archived research notes.
  • The stale endpoint is explicitly out of contract elsewhere in the repo.

Likely next slice:

  • Remove hard-coded stale Bazel endpoint defaults from GitLab compatibility.
  • Require operator-provided BAZEL_REMOTE_CACHE for GitLab Bazel jobs.
  • Mark overlay creation as legacy compatibility or rewrite it against the current S3 state and shared cache contract.
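
The observed drift is easy to re-detect with a recursive search for the stale namespace. The function name find_stale_cache_refs is hypothetical; the search string attic-cache-dev comes from the drift list above.

```shell
# Print file:line:text for every occurrence of the stale namespace in the
# given files or directories. Returns 0 when at least one match exists.
find_stale_cache_refs() {
  grep -rnHF 'attic-cache-dev' "$@" 2>/dev/null
}
```

A check such as `if find_stale_cache_refs .gitlab-ci.yml .gitlab config docs/infrastructure; then exit 1; fi` would fail while any of the listed surfaces still carries the old endpoint.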

Gap 5: The Active Superpowers Plan Still Contains Executable Old Body Text

docs/superpowers/plans/2026-04-23-gloriousflywheel-pooled-substrate-dogfood-reset.md has accurate progress checkpoints at the top, but the older checkbox body still contains stale local paths and direct Bazel snippets.

Assessment:

  • This is not current implementation truth.
  • It remains a drift hazard because it is still presented as an implementation plan for agentic workers.
  • The top checkpoint says the route has moved on, but the old body is still easy to over-follow.

Likely next slice:

  • Collapse the completed plan body into a historical appendix, or add an explicit “do not execute the original checklist without reconciling against current canon” boundary.
  • Move active productization work into fresh issue-backed plan surfaces.

Gap 6: Future Runner Types Are Correctly Bounded For Now

Native aarch64, riscv, Dawn-native dispatch, and localized warm-cache guarantees for Hackage, Chapel, GPU backends, and similar toolchains are not currently implemented as dispatch contracts.

Assessment:

  • The current docs mostly classify these correctly as future-lane research.
  • They should stay out of the current product contract until there is a named proof surface, runner class, owner, and cache-warming plan.

Likely next slice:

  • Do not implement these inside the immediate dogfood repair lane.
  • Keep them on the productization roadmap as future proof packages.

Gap 7: RBE Planning Exists, But Is Not Yet Authority

The April 26 RBE planning pass correctly identified that Bazel “remote build” means remote execution, not only remote cache hits. It also correctly found BUILD_WORKSPACE_DIRECTORY and shell-environment hazards in the Tofu rules.

Assessment:

  • The plan is useful as a candidate sprint shape.
  • It is not implementation authority.
  • The NativeLink-shaped Linear scaffold created on April 26 assumes a peer backend choice before the repo has recorded that decision.
  • Buildbarn, Buildfarm, BuildBuddy, and NativeLink are class-peer projects with overlapping build cache, CAS, worker, scheduling, and REAPI concerns. They are not ordinary GloriousFlywheel dependencies.
  • ARC/GitHub Actions dispatch is real remote job execution, but not Bazel action-level remote execution. It must not be counted as --remote_executor proof.
  • The README must not claim remote build until a default-branch proof shows actual remote processes through --remote_executor.
  • See RBE Sprint Gate for the execution boundary.
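
To keep cache hits from being counted as execution proof, a check could separate the two in Bazel's end-of-build process summary. The sketch below assumes a summary line of the form "INFO: N processes: A remote cache hit, B internal, C remote." That format is an assumption about Bazel's output and should be confirmed against the Bazel version in use. The function name remote_executed_actions is hypothetical.

```shell
# Print the number of remotely executed actions in a Bazel process summary
# line, ignoring "remote cache hit" entries. Prints 0 when none are reported.
remote_executed_actions() {
  count=$(printf '%s\n' "$1" \
    | sed 's/[0-9][0-9]* remote cache hit//' \
    | sed -n 's/.*[^0-9]\([0-9][0-9]*\) remote[,.].*/\1/p')
  echo "${count:-0}"
}
```

Under that assumed format, a line reporting only "1 remote cache hit" yields 0. A remote-execution proof would need a non-zero value from a default-branch run.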

Likely next slice:

  • Add or annotate the Linear RBE work with an architecture-decision gate.
  • Keep TIN-650 as the nearer-term developer-machine cache attachment proof.
  • Treat a backend-neutral REAPI adapter, NativeLink, BuildBuddy, Buildbarn, Buildfarm, or deferral as candidates until the architecture decision is recorded.

Recommended Next Slices

  1. Add a repeatable source-repo dogfood contract audit. It should fail on unclassified hosted workflows, stale cache endpoints in live surfaces, and direct Bazel examples in active docs.
  2. Migrate the lowest-risk hosted workflows onto shared lanes. Start with validation/status jobs, not third-party publish/deploy jobs.
  3. Rewrite consumer Bazel examples around a wrapper entrypoint. Make direct Bazel commands compatibility/debug-only.
  4. Repair GitLab compatibility drift. Remove stale endpoints and require the same cache-variable contract.
  5. Evaluate a guarded devshell Bazel shim. Do this only after wrapper docs and CI checks are settled.
  6. Gate RBE work through a backend decision and minimum executor proof. Do not wire runner env vars, worker images, or public claims before the proof contract is explicit.

Non-Goals For This Audit

  • Do not treat downstream blocked repos as proof criteria for the source repo.
  • Do not invent repo-specific runner labels to close gaps.
  • Do not claim remote execution where the implementation only proves shared cache acceleration.
  • Do not move all hosted workflows in one broad PR without separating third-party control-plane risk.
