Cold-Start Enrollment — Off-Cluster Agents on the W4.2 IAM Token-Exchange — June 2026
Snapshot date: 2026-06-14. Prepared from verified repo truth and an adversarial 6-axis design review.
Status: Accepted 2026-06-15. Records the decision for TIN-2112 (GF AX/DX: decide agent secret/profile authority for enrollment).
This is an implementation/rollout decision UNDER the already-decided W4.2 IAM design, iam-tenant-claim-design.md (TIN-1473, parent E4 / TIN-1448). It does not introduce a new auth stack or a competing “broker.” It defers to that design for the auth substrate (scope model, JWT shape, validation rules, the already-built credential helper at services/gf-reapi-cell/cmd/gf-reapi-credhelper) and records only the decisions that make that design reachable and usable by cold, off-cluster agents/devs. That is the gap the IAM doc leaves open: it is scoped to in-cluster workers + CI, and off-cluster reachability and the deny-egress ↔ GitHub-JWKS tension are unresolved there.
Anchors: TIN-2112 (this records it) · TIN-2107 (consumer lace-up epic). Related substrate: TIN-1473 (W4.2 IAM), TIN-1448 (E4 tenant model), TIN-1462 (W2.1 AC-writer attestation), TIN-1472 (W4.1 routing).
What the first draft got wrong (corrected here)
This ADR’s first draft proposed a net-new flywheel-broker and claimed it was
a “low-risk integration, not a new auth stack.” A 6-axis adversarial review
(verified against main @ 898a5c2) refuted that on every count. The
corrections are folded into the decision below:
- Duplication. It re-proposed iam-tenant-claim-design.md’s GitHub-OIDC token-exchange (option d) and ignored the already-built credhelper. → This note defers to that design instead.
- “Integration, not new auth” is false. The cell deploy sets no GF_REAPI_AUTHZ_* env, so authz defaults to off (authz.go authorize() short-circuits allow-all); and a default-deny-all egress NetworkPolicy (gf-reapi-cell-deny-egress, policyTypes: [Egress], zero egress rules) blocks the JWKS fetch that validation depends on. Becoming a working TrustedIssuer is a real, separately proof-gated rollout, enumerated below.
- Day-one cache token is theater. The live cache is bazel-remote at bazel-cache.nix-cache:9092, a different service in a different namespace from the cell (gf-reapi-cell.gf-rbe:8980); it enforces no client auth (only BAZEL_REMOTE_S3_AUTH_METHOD, a cache→S3 backend cred). A minted “cache-read” token is presented to a service that never checks it.
- Cross-tenant confused deputy. Validating at org granularity (repository_owner) while the registry binds per-repo lets any repo in an enrolled org mint another spoke’s tenant token, which the cell honors verbatim (authz.go has zero GitHub/repo awareness).
- Wrong scope vocabulary. {cache-read, cache-write, executor} does not match the cell’s exact cas:Read tenant:<slug> / … strings; validateScope rejects every other verb, so enforce-mode would deny every minted token.
- Reachability relocated the “no”. A tailnet-only MagicDNS endpoint (*.taila4c78d.ts.net) is unreachable by the exact cold population (fork-PR runners, un-joined devs, external agents) that goes out-of-band.
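To make the scope-vocabulary mismatch concrete, here is a minimal Python model of the check. It is a sketch, not the cell’s code: the real validateScope lives in authz.go (Go) and was not consulted for this example. The function name validate_scope, the regular expression, and the slug character set are assumptions for illustration; only the five verbs and the tenant:<slug> suffix come from this note.

```python
import re

# Verbs named in this note; the authoritative list is in authz.go.
ALLOWED_VERBS = {
    "cas:Read",
    "cas:Write",
    "actioncache:Read",
    "actioncache:Write",
    "remoteexecution:Run",
}

# Assumed shape: "<service>:<Verb> tenant:<slug>". The slug character set
# (lowercase letters, digits, hyphens) is a guess, not the cell's rule.
_SCOPE_RE = re.compile(
    r"^(?P<verb>[A-Za-z]+:[A-Za-z]+) tenant:(?P<slug>[a-z0-9][a-z0-9-]*)$"
)


def validate_scope(scope: str) -> bool:
    """Return True only for a known verb followed by a tenant suffix."""
    match = _SCOPE_RE.match(scope)
    return bool(match) and match.group("verb") in ALLOWED_VERBS


first_draft = ["cache-read", "cache-write", "executor"]
corrected = ["cas:Read tenant:spoke-demo", "remoteexecution:Run tenant:spoke-demo"]

print([validate_scope(s) for s in first_draft])  # [False, False, False]
print([validate_scope(s) for s in corrected])    # [True, True]
```

Under this model every first-draft scope fails before any tenant comparison happens, which is why an enforce-mode cell would have denied every token the first draft minted.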
Decision
Adopt the W4.2 IAM design as the auth substrate unchanged. For cold, off-cluster enrollment specifically, decide:
1. Identity is GitHub, rooted in the App install. The GloriousFlywheel GitHub App installation on an org IS the enrollment gate (docs/guides/github-app-adoption.md). CI presents GitHub Actions OIDC (IAM option d, id-token: write). Local dev uses a gh-derived OIDC CLI flow through the same exchange as the primary cold-start path; tsidp stays available as the tailnet-native fallback for already-joined developer machines. No new identity provider.
2. Token-exchange reachability (the genuine cold-start delta). The IAM doc places /v1/token/exchange “on the cell,” but the cell is default-deny-egress (cannot call token.actions.githubusercontent.com for OIDC JWKS) and cluster-internal (cold callers cannot reach it). Decision: run the GitHub-OIDC exchange as a distinct egress-allowed component behind a public ingress (traefik + cert-manager are already in-cluster), authenticated purely by the validated GitHub OIDC token; no tailnet membership is required to obtain a credential. The cell proper stays cluster-internal and deny-egress; only the exchange surface reaches GitHub, and the cell validates only the exchange’s minted JWT (static JWKS, see §9).
3. Cross-tenant binding invariant (mandatory). The exchange MUST validate the OIDC repository claim (full owner/repo) against an exact config/spoke-registry.json lookup (github_repository == claim.repository), derive the spoke slug from THAT entry, and mint tenant:spoke-<that-slug> only. Never gate on repository_owner alone; never accept a caller-supplied slug/instance_name. Reject if repository is absent or not an exact registry key. (Closes the confused-deputy hole.)
4. Ref-gated scopes (mirrors the IAM posture matrix + W2.1). PR-ref OIDC (sub contains :ref:refs/pull/) → at most cas:Read + actioncache:Read. Only :ref:refs/heads/<default> AND an exact-repo registry match → + cas:Write / actioncache:Write / remoteexecution:Run. A fork-PR contract test (where the OIDC subject differs) is required.
5. Exact scope vocabulary. Mint the cell’s exact strings: cas:Read tenant:spoke-<slug>, cas:Write tenant:spoke-<slug>, actioncache:{Read,Write} tenant:spoke-<slug>, remoteexecution:Run tenant:spoke-<slug> (authz.go:29-33, suffix tenant:<slug> per validateScope). A contract test MUST run a minted token through the cell’s authorize() in enforce mode for each scope.
6. Audience pinning. The exchange requires a fixed inbound aud (e.g. gf-reapi-broker or the exchange URL) and rejects GitHub’s default audience (which defaults to the repo/owner URL and is therefore attacker-predictable). The minted token carries aud = gf-reapi-cell.gf-rbe.svc (authz.go DefaultAuthzAudience).
7. First-contact-YES via a default read-only org allowlist. Orgs {jesssullivan, tinyland-inc} receive cas:Read + actioncache:Read (tenant:default) WITHOUT a per-repo registry PR, so a cold caller’s first contact is a YES, not a registry-gated NO. Write/executor scopes require the CODEOWNERS-gated registry entry (treated as a credential-granting change). The registry is consumed from a single signed/pinned source (a ConfigMap rendered by CI from the merged commit, sha logged in every issuance audit line); de-enrollment is effective within a bounded reload TTL; fail closed if the registry source is unreachable/unparseable.
8. The cache front door: day-one value is gated on it. A minted cache-read token means nothing until the cache enforces it. Decision: phase-1’s real deliverable is an enforcing front door for the cache (either an authenticating gRPC proxy in front of bazel-remote that validates the cell-JWT, or routing CAS/AC through the cell’s already-authenticated path), not “tokens day one.” Until that front-door proof passes, this note makes no cache-acceleration claim.
9. god-token deny + key custody. authz.go:197 grants system:* tenant:system an unconditional global bypass. The exchange MUST be structurally incapable of minting system:* or tenant=system (server-side deny-list + test), and can mint only for tenants present in the registry. The signing key lives as an out-of-band k8s Secret (never git), mounted read-only; kid-based rotation publishes current+next public keys in the JWKS with overlap exceeding the cell’s 5-min JWKS cache (authz.go:277) and the token TTL; under deny-egress the cell reads a static JWKS ConfigMap refreshed by an in-cluster job (alarmed on staleness). Exchange-key compromise = full cell compromise, so the custody bar is high.
10. HA + break-glass. ≥2 exchange replicas + PodDisruptionBudget; a long-lived, tightly-scoped, audited operator break-glass spoke token so an exchange outage is recoverable WITHOUT standalone runners; the cell serves extended-stale JWKS rather than hard-deny on fetch failure. The single cluster (honey) remains one control-plane failure domain, a named residual.
11. Token-handling hardening. Single-digit-minute TTL for read tokens; remove the 30s leeway (authz.go:213) for exchange-minted tokens; the sourced env file is mode 0600 and unset after the smoke; the exchange records consumed OIDC jti + exp and rejects OIDC re-exchange (one GitHub OIDC token → one mint, not a rolling supply). Higher-value (Write/executor) scopes get either sender-constraint (DPoP/mTLS) or a bounded jti replay cache in the cell (today the cell parses jti but keeps no seen-set).
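The cross-tenant binding invariant, the ref gating, and the system deny-list above can be read as one small mapping from validated OIDC claims to scopes. The Python sketch below illustrates that mapping under stated assumptions. It is not the exchange’s implementation: mint_scopes, ExchangeDenied, the registry field names slug and default_branch, and the example-org repositories are all hypothetical. It also assumes the token signature, expiry, and audience were already verified, and it leaves out the read-only org allowlist.

```python
from typing import List, Mapping

READ_VERBS = ("cas:Read", "actioncache:Read")
WRITE_VERBS = ("cas:Write", "actioncache:Write", "remoteexecution:Run")


class ExchangeDenied(Exception):
    """Raised when the exchange must refuse to mint."""


def mint_scopes(
    claims: Mapping[str, str],
    registry: Mapping[str, Mapping[str, str]],
) -> List[str]:
    """Map already-validated GitHub OIDC claims to cell scope strings."""
    repo = claims.get("repository")
    entry = registry.get(repo) if repo else None
    if entry is None:
        # Absent claim or no exact key: reject. repository_owner is never read.
        raise ExchangeDenied("repository claim is absent or not an exact registry key")

    # The tenant is derived from the registry entry only, never from caller input.
    tenant = f"tenant:spoke-{entry['slug']}"

    # Read-only by default. Write verbs are added only when the subject is
    # exactly the default-branch subject for this exact repository.
    verbs = list(READ_VERBS)
    default_sub = f"repo:{repo}:ref:refs/heads/{entry['default_branch']}"
    if claims.get("sub") == default_sub:
        verbs.extend(WRITE_VERBS)

    scopes = [f"{verb} {tenant}" for verb in verbs]

    # Explicit deny-list backstop. The "spoke-" prefix already makes a system
    # tenant unreachable here; the check guards against future edits.
    for scope in scopes:
        if scope.startswith("system:") or scope.endswith(" tenant:system"):
            raise ExchangeDenied("refusing to mint a system scope")
    return scopes


# Hypothetical registry keyed by exact owner/repo.
REGISTRY = {"example-org/spoke-a": {"slug": "a", "default_branch": "main"}}

feature = {
    "repository": "example-org/spoke-a",
    "sub": "repo:example-org/spoke-a:ref:refs/heads/feature",
}
print(mint_scopes(feature, REGISTRY))
# ['cas:Read tenant:spoke-a', 'actioncache:Read tenant:spoke-a']
```

Because write verbs require an exact subject match, any other subject (a feature branch, a pull request, a fork) falls through to read-only without needing its own rule. A sibling repository in the same org, such as example-org/other, is rejected outright because it is not a registry key.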
Prerequisite work (none of this exists today; each separately proof-gated)
- Cell to replicas >= 1 with a real image (RBE Production Readiness).
- GF_REAPI_AUTHZ_MODE=warn → enforce + GF_REAPI_AUTHZ_TRUSTED_ISSUERS=<exchange>=<jwks>.
- Static-JWKS provisioning under deny-egress.
- The distinct public-ingress GitHub-OIDC exchange component (per §2).
- The enforcing cache front door (per §8).
- Default org allowlist + signed/pinned registry consumption (per §7).
- Contract tests: minted-token-through-authorize() (enforce), fork-PR read-only, reject-unknown-issuer, reject-system:*-mint.
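The mode switch and the reject-unknown-issuer contract test can be pictured with a toy model of the authorization decision. This Python sketch is an illustration only and does not mirror authz.go. The Token and Decision types, the issuer URL, and the rule that warn mode allows the call while recording the violation are assumptions; the note itself states only that off short-circuits to allow-all and that the rollout moves from warn to enforce. Signature and expiry checks are omitted.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

CELL_AUDIENCE = "gf-reapi-cell.gf-rbe.svc"  # audience named in this note


@dataclass(frozen=True)
class Token:
    iss: str
    aud: str
    scopes: Tuple[str, ...]


@dataclass(frozen=True)
class Decision:
    allowed: bool
    violation: Optional[str]  # why enforce mode would deny, if it would


def _violation(token: Optional[Token], scope: str, trusted: FrozenSet[str]) -> Optional[str]:
    """Return the first reason to deny, or None when the token is acceptable."""
    if token is None:
        return "missing token"
    if token.iss not in trusted:
        return "untrusted issuer"
    if token.aud != CELL_AUDIENCE:
        return "wrong audience"
    if scope not in token.scopes:
        return "scope not granted"
    return None


def authorize(token: Optional[Token], scope: str, mode: str, trusted: FrozenSet[str]) -> Decision:
    if mode == "off":
        # Matches the current deploy described above: nothing is checked.
        return Decision(True, None)
    if mode not in ("warn", "enforce"):
        raise ValueError(f"unknown authz mode: {mode!r}")
    reason = _violation(token, scope, trusted)
    if mode == "warn":
        # Assumed semantics: allow the call but surface the would-be denial.
        return Decision(True, reason)
    return Decision(reason is None, reason)


TRUSTED = frozenset({"https://exchange.example.test"})  # hypothetical issuer
rogue = Token("https://rogue.example.test", CELL_AUDIENCE, ("cas:Read tenant:spoke-a",))
print(authorize(rogue, "cas:Read tenant:spoke-a", "enforce", TRUSTED))
# Decision(allowed=False, violation='untrusted issuer')
```

Running the same token through warn mode returns an allowed decision with the same violation string, which is the signal an operator would watch before flipping the cell to enforce.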
Phasing (honest served-population + success metric)
- Phase 1: enforcing cache + read-only first-contact. Stand up the cache front door (§8) + the public OIDC exchange (§2) + the org allowlist (§7). Served population: any GitHub-identity caller in an enrolled org, including the cold off-cluster case once the public ingress lands. Success metric: a measured time-to-first-cache-hit that beats a cold standalone runner, not “token minted.”
- Phase 2: write/executor. Starts once the cell is at replicas >= 1 in enforce mode, plus W2.1 attestation. remoteexecution:Run stays unmintable (exchange rejects it “not yet available”) until a real OIDC → exchange → cell Execute → result proof passes against a running replica.
- Phase 3: local-dev + cross-cluster. gh-OIDC CLI exchange as the default cold-start developer path, with tsidp retained as the tailnet-native fallback; SPIFFE cross-cluster is out of scope, named as residual.
Resolved operator decisions
- TIN-2120 exchange placement: distinct egress-allowed component, not a cell endpoint with scoped egress. This keeps the cell proper small, cluster-internal, and deny-egress.
- TIN-2121 reachability posture: public-ingress exchange. The product claim remains cold off-cluster enrollment for fork-PR runners, unjoined developer machines, and external agents; it is not descoped to tailnet-only.
- TIN-2122 local-dev identity: gh-OIDC CLI exchange is the primary cold-start developer path. tsidp remains a supported tailnet-native fallback, not the default enrollment answer.