Cold-Start Enrollment — Off-Cluster Agents on the W4.2 IAM Token-Exchange (June 2026)


Snapshot date: 2026-06-14. Prepared from verified repo truth and an adversarial 6-axis design review.

Status: Accepted 2026-06-15. Records the decision for TIN-2112 (GF AX/DX: decide agent secret/profile authority for enrollment).

This is an implementation/rollout decision UNDER the already-decided W4.2 IAM design, iam-tenant-claim-design.md (TIN-1473, parent E4 / TIN-1448). It does not introduce a new auth stack or a competing “broker.” It defers to that design for the auth substrate (scope model, JWT shape, validation rules, the already-built credential helper at services/gf-reapi-cell/cmd/gf-reapi-credhelper) and records only the decisions that make that design reachable and usable by cold, off-cluster agents/devs. That is the gap the IAM doc leaves open: it is scoped to in-cluster workers + CI, and off-cluster reachability and the deny-egress ↔ GitHub-JWKS tension are unresolved there.

Anchors: TIN-2112 (this records it) · TIN-2107 (consumer lace-up epic). Related substrate: TIN-1473 (W4.2 IAM), TIN-1448 (E4 tenant model), TIN-1462 (W2.1 AC-writer attestation), TIN-1472 (W4.1 routing).

What the first draft got wrong (corrected here)

This ADR’s first draft proposed a net-new flywheel-broker and claimed it was a “low-risk integration, not a new auth stack.” A 6-axis adversarial review (verified against main @ 898a5c2) refuted that on every count. The corrections are folded into the decision below:

  1. Duplication. It re-proposed iam-tenant-claim-design.md’s GitHub-OIDC token-exchange (option d) and ignored the already-built credhelper. → This note defers to that design instead.
  2. “Integration, not new auth” is false. The cell deploy sets no GF_REAPI_AUTHZ_* env, so authz defaults to off (authz.go authorize() short-circuits allow-all); and a default-deny-all egress NetworkPolicy (gf-reapi-cell-deny-egress, policyTypes:[Egress], zero egress rules) blocks the JWKS fetch validation depends on. Becoming a working TrustedIssuer is a real, separately proof-gated rollout — enumerated below.
  3. Day-one cache token is theater. The live cache is bazel-remote at bazel-cache.nix-cache:9092, a different service in a different namespace from the cell (gf-reapi-cell.gf-rbe:8980); it enforces no client auth (only BAZEL_REMOTE_S3_AUTH_METHOD, a cache→S3 backend cred). A minted “cache-read” token is presented to a service that never checks it.
  4. Cross-tenant confused deputy. Validating at org granularity (repository_owner) while the registry binds per-repo lets any repo in an enrolled org mint another spoke’s tenant token, which the cell honors verbatim (authz.go has zero GitHub/repo awareness).
  5. Wrong scope vocabulary. {cache-read, cache-write, executor} does not match the cell’s exact cas:Read tenant:<slug> / … strings; validateScope rejects every other verb, so enforce-mode would deny every minted token.
  6. Reachability relocated the “no”. A tailnet-only MagicDNS endpoint (*.taila4c78d.ts.net) is unreachable by the exact cold population (fork-PR runners, un-joined devs, external agents) that goes out-of-band.
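
Correction 5 is easy to see with a small check. The sketch below is a Python stand-in, not the cell's Go code: CELL_VERBS lists the verbs this note cites from authz.go, and is_cell_scope is a hypothetical approximation of what validateScope is described as doing (a known verb plus a tenant:<slug> suffix). The slug pattern and the spoke-demo tenant are assumptions for illustration.

```python
import re

# Verbs this note cites as the cell's exact vocabulary (authz.go:29-33).
# system:* is left out on purpose; the exchange must never mint it.
CELL_VERBS = {
    "cas:Read",
    "cas:Write",
    "actioncache:Read",
    "actioncache:Write",
    "remoteexecution:Run",
}

# Assumed slug shape, for illustration only; the real rule lives in validateScope.
SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def is_cell_scope(scope: str) -> bool:
    """Return True only for '<verb> tenant:<slug>' with a known verb."""
    parts = scope.split(" ")
    if len(parts) != 2:
        return False
    verb, tenant = parts
    if verb not in CELL_VERBS or not tenant.startswith("tenant:"):
        return False
    return bool(SLUG_RE.match(tenant[len("tenant:"):]))


first_draft = ["cache-read", "cache-write", "executor"]
corrected = ["cas:Read tenant:spoke-demo", "actioncache:Write tenant:spoke-demo"]

print([is_cell_scope(s) for s in first_draft])  # [False, False, False]
print([is_cell_scope(s) for s in corrected])    # [True, True]
```

Every first-draft scope fails the check, which is why enforce mode would have denied every minted token.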

Decision

Adopt the W4.2 IAM design as the auth substrate unchanged. For cold, off-cluster enrollment specifically, decide:

  1. Identity is GitHub, rooted in the App install. The GloriousFlywheel GitHub App installation on an org IS the enrollment gate (docs/guides/github-app-adoption.md). CI presents GitHub Actions OIDC (IAM option d, id-token: write). Local dev uses a gh-derived OIDC CLI flow through the same exchange as the primary cold-start path; tsidp stays available as the tailnet-native fallback for already-joined developer machines. No new identity provider.

  2. Token-exchange reachability — the genuine cold-start delta. The IAM doc places /v1/token/exchange “on the cell,” but the cell is default-deny-egress (cannot call token.actions.githubusercontent.com for OIDC JWKS) and cluster-internal (cold callers cannot reach it). Decision: run the GitHub-OIDC exchange as a distinct egress-allowed component behind a public ingress (traefik + cert-manager are already in-cluster), authenticated purely by the validated GitHub OIDC token — no tailnet membership required to obtain a credential. The cell proper stays cluster-internal and deny-egress; only the exchange surface reaches GitHub, and the cell validates only the exchange’s minted JWT (static JWKS, see §9).

  3. Cross-tenant binding invariant (mandatory). The exchange MUST validate the OIDC repository claim (full owner/repo) against an exact config/spoke-registry.json lookup (github_repository == claim.repository), derive the spoke slug from THAT entry, and mint tenant:spoke-<that-slug> only. Never gate on repository_owner alone; never accept a caller-supplied slug/instance_name. Reject if repository is absent or not an exact registry key. (Closes the confused-deputy hole.)

  4. Ref-gated scopes (mirrors the IAM posture matrix + W2.1). PR-ref OIDC (sub contains :ref:refs/pull/) → at most cas:Read + actioncache:Read. Only :ref:refs/heads/<default> AND an exact-repo registry match → + cas:Write / actioncache:Write / remoteexecution:Run. A fork-PR contract test (where the OIDC subject differs) is required.

  5. Exact scope vocabulary. Mint the cell’s exact strings — cas:Read tenant:spoke-<slug>, cas:Write tenant:spoke-<slug>, actioncache:{Read,Write} tenant:spoke-<slug>, remoteexecution:Run tenant:spoke-<slug> (authz.go:29-33, suffix tenant:<slug> per validateScope). A contract test MUST run a minted token through the cell’s authorize() in enforce mode for each scope.

  6. Audience pinning. The exchange requires a fixed inbound aud (e.g. gf-reapi-broker or the exchange URL) and rejects GitHub’s default audience (which defaults to the repo/owner URL — attacker-predictable). The minted token carries aud = gf-reapi-cell.gf-rbe.svc (authz.go DefaultAuthzAudience).

  7. First-contact-YES via a default read-only org allowlist. Orgs {jesssullivan, tinyland-inc} receive cas:Read + actioncache:Read (tenant:default) WITHOUT a per-repo registry PR, so a cold caller’s first contact is a YES, not a registry-gated NO. Write/executor scopes require the CODEOWNERS-gated registry entry (treated as a credential-granting change). The registry is consumed from a single signed/pinned source (a ConfigMap rendered by CI from the merged commit, sha logged in every issuance audit line); de-enrollment effective within a bounded reload TTL; fail closed if the registry source is unreachable/unparseable.

  8. The cache front door — day-one value is gated on it. A minted cache-read token means nothing until the cache enforces it. Decision: phase-1’s real deliverable is an enforcing front door for the cache — either an authenticating gRPC proxy in front of bazel-remote that validates the cell-JWT, or routing CAS/AC through the cell’s already-authenticated path — not “tokens day one.” Until that front-door proof passes, this note makes no cache-acceleration claim.

  9. god-token deny + key custody. authz.go:197 grants system:* tenant:system an unconditional global bypass. The exchange MUST be structurally incapable of minting system:* or tenant=system (server-side deny-list + test), and can mint only for tenants present in the registry. The signing key lives as an out-of-band k8s Secret (never git), mounted read-only; kid-based rotation publishes current+next public keys in the JWKS with overlap exceeding the cell’s 5-min JWKS cache (authz.go:277) and the token TTL; under deny-egress the cell reads a static JWKS ConfigMap refreshed by an in-cluster job (alarmed on staleness). Exchange-key compromise = full cell compromise — custody bar is high.

  10. HA + break-glass. ≥2 exchange replicas + PodDisruptionBudget; a long-lived, tightly-scoped, audited operator break-glass spoke token so an exchange outage is recoverable WITHOUT standalone runners; the cell serves extended-stale JWKS rather than hard-deny on fetch failure. The single cluster (honey) remains one control-plane failure domain — named residual.

  11. Token-handling hardening. Single-digit-minute TTL for read tokens; remove the 30s leeway (authz.go:213) for exchange-minted tokens; the sourced env file is mode 0600 and unset after the smoke; the exchange records consumed OIDC jti+exp and rejects OIDC re-exchange (one GitHub OIDC token → one mint, not a rolling supply). Higher-value (Write/executor) scopes get either sender-constraint (DPoP/mTLS) or a bounded jti replay cache in the cell (today the cell parses jti but keeps no seen-set).
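
Items 3 to 7, 9 and 11 combine into one mint decision, sketched below in Python as a stand-in for whatever the exchange is eventually written in. It assumes the OIDC signature has already been verified. The registry is modeled as a dict keyed by exact owner/repo; SpokeEntry, its field names, the demo-spoke repository and the subject strings are hypothetical. The PR subject follows the shape this note describes; because writes require a positive default-branch match, any other subject shape also stays read-only.

```python
from dataclasses import dataclass

EXPECTED_AUDIENCE = "gf-reapi-broker"              # example inbound aud from item 6
READ_ONLY_ORGS = {"jesssullivan", "tinyland-inc"}  # item 7 allowlist
READ_VERBS = ["cas:Read", "actioncache:Read"]
WRITE_VERBS = ["cas:Write", "actioncache:Write", "remoteexecution:Run"]


class ExchangeDenied(Exception):
    """The exchange refuses to mint for this OIDC token."""


@dataclass(frozen=True)
class SpokeEntry:
    slug: str            # hypothetical registry field
    default_branch: str  # hypothetical registry field


class ReplayGuard:
    """Remembers consumed OIDC jti values until their exp passes (item 11)."""

    def __init__(self):
        self._seen = {}  # jti -> exp, in unix seconds

    def consume(self, jti, exp, now):
        # Expired entries can be forgotten: an expired token is rejected anyway.
        self._seen = {j: e for j, e in self._seen.items() if e > now}
        if exp <= now:
            raise ExchangeDenied("OIDC token has expired")
        if jti in self._seen:
            raise ExchangeDenied("OIDC token was already exchanged")
        self._seen[jti] = exp


def scopes_for(claims, registry, guard, now):
    """Return the scope strings to mint for already signature-verified claims."""
    if claims.get("aud") != EXPECTED_AUDIENCE:  # item 6
        raise ExchangeDenied("inbound audience is not the pinned value")
    repo = claims.get("repository")
    if not repo:  # item 3
        raise ExchangeDenied("repository claim is absent")
    if not claims.get("jti") or "exp" not in claims:
        raise ExchangeDenied("jti or exp claim is absent")
    guard.consume(claims["jti"], claims["exp"], now)  # item 11

    entry = registry.get(repo)  # exact owner/repo key, never an org or prefix match
    if entry is None:
        if claims.get("repository_owner") in READ_ONLY_ORGS:  # item 7
            return [f"{verb} tenant:default" for verb in READ_VERBS]
        raise ExchangeDenied("repository is not an exact registry key")
    if entry.slug == "system":  # item 9 deny-list, on top of the spoke- prefix
        raise ExchangeDenied("the system tenant is never mintable")

    verbs = list(READ_VERBS)
    # Item 4: writes need a positive default-branch match; any other subject,
    # including PR refs, stays read-only.
    if claims.get("sub") == f"repo:{repo}:ref:refs/heads/{entry.default_branch}":
        verbs += WRITE_VERBS
    return [f"{verb} tenant:spoke-{entry.slug}" for verb in verbs]


registry = {"tinyland-inc/demo-spoke": SpokeEntry(slug="demo", default_branch="main")}
base = {"aud": EXPECTED_AUDIENCE, "exp": 2000,
        "repository": "tinyland-inc/demo-spoke", "repository_owner": "tinyland-inc"}
pr_claims = dict(base, jti="a", sub="repo:tinyland-inc/demo-spoke:ref:refs/pull/7/merge")
main_claims = dict(base, jti="b", sub="repo:tinyland-inc/demo-spoke:ref:refs/heads/main")

guard = ReplayGuard()
print(scopes_for(pr_claims, registry, guard, now=1000))    # read-only pair
print(scopes_for(main_claims, registry, guard, now=1000))  # read, write and Run
```

The slug is never read from the caller: it comes only from the registry entry found by the exact repository key. A sibling repository in an allowlisted org therefore gets tenant:default reads at most, never another spoke's tenant.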

Prerequisite work (none of this exists today; each separately proof-gated)

  • Cell to replicas >= 1 with a real image (RBE Production Readiness).
  • GF_REAPI_AUTHZ_MODE=warn → enforce + GF_REAPI_AUTHZ_TRUSTED_ISSUERS=<exchange>=<jwks>.
  • Static-JWKS provisioning under deny-egress.
  • The distinct public-ingress GitHub-OIDC exchange component (per §2).
  • The enforcing cache front door (per §8).
  • Default org allowlist + signed/pinned registry consumption (per §7).
  • Contract tests: minted-token-through-authorize() (enforce), fork-PR read-only, reject-unknown-issuer, reject-system:*-mint.
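
The registry-consumption prerequisite (§7: bounded reload TTL, fail closed) can be sketched as a small cache. This is a minimal Python illustration under stated assumptions, not the exchange's implementation: RegistryCache, RegistryUnavailable and the load_text callable are hypothetical names, and the registry is assumed to be a JSON object keyed by owner/repo.

```python
import json
import time


class RegistryUnavailable(Exception):
    """The registry could not be loaded; callers must deny, not guess."""


class RegistryCache:
    def __init__(self, load_text, ttl_seconds, clock=time.monotonic):
        self._load_text = load_text  # callable returning the registry JSON text
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = None
        self._loaded_at = None

    def entries(self):
        """Return the registry, reloading once the TTL has elapsed."""
        now = self._clock()
        if self._entries is None or now - self._loaded_at >= self._ttl:
            try:
                parsed = json.loads(self._load_text())
                if not isinstance(parsed, dict):
                    raise ValueError("registry root must be a JSON object")
            except Exception as exc:
                # Fail closed: drop the stale copy instead of serving it.
                self._entries = None
                raise RegistryUnavailable(str(exc)) from exc
            self._entries = parsed
            self._loaded_at = now
        return self._entries


source = {"text": '{"tinyland-inc/demo-spoke": {"slug": "demo"}}', "now": 0.0}
cache = RegistryCache(lambda: source["text"], ttl_seconds=60, clock=lambda: source["now"])
print(sorted(cache.entries()))  # ['tinyland-inc/demo-spoke']
```

A removed entry stops granting access at the next reload, so the TTL is the upper bound on de-enrollment delay. Dropping the cached copy on a failed reload trades availability for safety; the break-glass token in §10 is the recovery path for that case.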

Phasing (honest served-population + success metric)

  1. Phase 1 — enforcing cache + read-only first-contact. Stand up the cache front door (§8) + the public OIDC exchange (§2) + the org allowlist (§7). Served population: any GitHub-identity caller in an enrolled org — including the cold off-cluster case, once the public ingress lands. Success metric: time-to-first-cache-hit beats a cold standalone runner, measured — not “token minted.”
  2. Phase 2 — write/executor. Starts once the cell runs at replicas >= 1 in enforce mode and W2.1 attestation is in place. remoteexecution:Run stays unmintable (the exchange rejects it as “not yet available”) until a real OIDC → exchange → cell Execute → result proof passes against a running replica.
  3. Phase 3 — local-dev + cross-cluster. gh-OIDC CLI exchange as the default cold-start developer path, with tsidp retained as the tailnet-native fallback; SPIFFE cross-cluster is out of scope, named as residual.
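
The Phase 2 gate on remoteexecution:Run can be expressed as a filter applied just before minting. The sketch below is a hypothetical Python illustration; PROOF_GATED, split_mintable and the execute_proof_passed flag are assumed names, and how the proof result is recorded is left open.

```python
# Verbs the exchange refuses until their proof gate passes, with the refusal reason.
PROOF_GATED = {"remoteexecution:Run": "not yet available"}


def split_mintable(scopes, execute_proof_passed):
    """Split requested scopes into (granted list, refused scope -> reason)."""
    granted, refused = [], {}
    for scope in scopes:
        verb = scope.split(" ", 1)[0]
        if verb in PROOF_GATED and not execute_proof_passed:
            refused[scope] = PROOF_GATED[verb]
        else:
            granted.append(scope)
    return granted, refused


requested = ["cas:Write tenant:spoke-demo", "remoteexecution:Run tenant:spoke-demo"]
print(split_mintable(requested, execute_proof_passed=False))
print(split_mintable(requested, execute_proof_passed=True))
```

Keeping the gate in the exchange means a caller sees an explicit refusal reason rather than a token the cell later rejects.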

Resolved operator decisions

  • TIN-2120 exchange placement: distinct egress-allowed component, not a cell endpoint with scoped egress. This keeps the cell proper small, cluster-internal, and deny-egress.
  • TIN-2121 reachability posture: public-ingress exchange. The product claim remains cold off-cluster enrollment for fork-PR runners, unjoined developer machines, and external agents; it is not descoped to tailnet-only.
  • TIN-2122 local-dev identity: gh-OIDC CLI exchange is the primary cold-start developer path. tsidp remains a supported tailnet-native fallback, not the default enrollment answer.
