IAM + OIDC Tenant Claim Design

Decision summary

Status: First gf-reapi-cell enforcement slice in progress (W4.2 / TIN-1473, under parent E4 / TIN-1448).

Rule: Every REAPI call to gf-reapi-cell carries a short-lived OIDC-shaped JWT in an Authorization: Bearer header. The cell validates the signature against a trusted issuer JWKS, asserts aud=gf-reapi-cell.gf-rbe.svc, and enforces the tenant claim against the request’s instance_name and operation scope.

Recommended issuers: (a) k8s ServiceAccount projected tokens for in-cluster workers + (d) GitHub Actions OIDC for workflow provenance, federated through a small token-exchange endpoint on the cell. GitHub OIDC is an identity signal, not permission to run first-party dogfood on GitHub-hosted runners. They coexist and cover different surfaces. (b) tsidp is the documented fallback for Tailscale-native developer flows.

JWT lifetime: 5–60 min, depending on caller class. Recommend 15 min for ServiceAccount-projected workers (kubelet rotates for free), 60 min for the GitHub Actions exchange token (one-shot per workflow), 5 min for developer tokens.

Scope shape: EngFlow-inspired (cas:Read, cas:Write, actioncache:Read, actioncache:Write, remoteexecution:Run) tuple’d with a tenant binding (tenant:<slug>). No third-party REAPI code on the data path — the validation middleware is in-house in gf-reapi-cell.

What blocks if absent: E4 / TIN-1448 cannot close. W4.1 routing without W4.2 IAM is unauthenticated namespacing — a caller can claim any instance_name. W4.3 pool selection (TIN-1474) and W4.4 quotas (TIN-1475) key on the same tenant claim. W2.3 audit log (TIN-1464) needs the JWT sub and tenant to be queryable.

Frame

W4.1 / TIN-1472 decided the routing primitive: each spoke gets remote_instance_name=spoke-<slug> and gf-reapi-cell keys CAS / AC storage on that pair. The default-deny behavior on cross-tenant reads (quiet NOT_FOUND) closes the digest-guess info-disclosure channel — a spoke-B caller asking for spoke-A’s blob by digest gets the same answer as if the blob did not exist. That is one defense, and it is real: the cell cannot leak a blob it does not route to in the first place.

But routing is only honest when the caller’s claimed tenant identity is verified. A misconfigured spoke that sets --remote_instance_name=spoke-elders while running the spoke-blahaj toolchain is the obvious case; the malicious case is a compromised CI lane or a hand-edited .bazelrc flipping the instance string. The second defense — the one this doc owns — is explicit authorization at the call layer: every REAPI request carries a signed token that names the caller’s tenant, and gf-reapi-cell refuses to act on any request whose token does not authorize the instance + operation pair. Routing tells the cell which namespace to look in; IAM tells the cell whether the caller is allowed to ask. Both must hold; neither replaces the other.

The Scope Model

EngFlow’s multi-tenancy IAM is the closest reference in the ecosystem and this doc cribs the shape — scope verbs paired with a tenant binding — without adopting EngFlow’s server, SDK, or call path. The implementation lives entirely inside gf-reapi-cell’s in-house Go middleware. We do not import EngFlow code; we do not run their server.

The scope set:

Scope	Grants	Operation surface
`cas:Read tenant:<slug>`	Read CAS blobs scoped to `instance_name=spoke-<slug>`	`FindMissingBlobs`, `BatchReadBlobs`, `ByteStream.Read`
`cas:Write tenant:<slug>`	Write CAS blobs into the tenant’s namespace	`BatchUpdateBlobs`, `ByteStream.Write`
`actioncache:Read tenant:<slug>`	Read AC entries from the tenant’s namespace	`GetActionResult`
`actioncache:Write tenant:<slug>`	Write AC entries into the tenant’s namespace	`UpdateActionResult` (also gated by W2.1 writer attestation)
`remoteexecution:Run tenant:<slug>`	Submit Execute / WaitExecution requests	`Execute`, `WaitExecution`
`system:*` (reserved)	Cross-tenant internal: cell health, audit reads, synthetic probes	All RPCs; only granted to `gf-reapi-cell` itself

A token may carry multiple scopes; the validator accepts a request when at least one of them matches it. A request matches a scope when (a) the operation verb maps to the scope verb, and (b) the request’s instance_name equals the scope’s tenant:<slug> binding (or the scope is system:*).
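The matching rule can be sketched in Go as follows. This is a minimal illustration, not the cell's middleware: the Scope type and the ParseScope and Allows functions are hypothetical names, and the sketch assumes the tenant binding is compared verbatim against instance_name (as in tenant:spoke-elders).

```go
package main

import (
	"fmt"
	"strings"
)

// Scope is one parsed grant, e.g. "cas:Read tenant:spoke-elders".
type Scope struct {
	Verb   string // e.g. "cas:Read"
	Tenant string // e.g. "spoke-elders"; empty for system:*
	System bool   // true for the reserved system:* scope
}

// ParseScope parses a flat scope string. The grammar is an assumption drawn
// from the scope table: "<name>:<Verb> tenant:<binding>" or "system:*".
func ParseScope(s string) (Scope, error) {
	if s == "system:*" {
		return Scope{System: true}, nil
	}
	parts := strings.Fields(s)
	if len(parts) != 2 || !strings.HasPrefix(parts[1], "tenant:") {
		return Scope{}, fmt.Errorf("malformed scope %q", s)
	}
	tenant := strings.TrimPrefix(parts[1], "tenant:")
	if tenant == "" || strings.Count(parts[0], ":") != 1 {
		return Scope{}, fmt.Errorf("malformed scope %q", s)
	}
	return Scope{Verb: parts[0], Tenant: tenant}, nil
}

// Allows reports whether any granted scope covers verb on instanceName.
func Allows(granted []Scope, verb, instanceName string) bool {
	for _, sc := range granted {
		if sc.System || (sc.Verb == verb && sc.Tenant == instanceName) {
			return true
		}
	}
	return false
}

func main() {
	var granted []Scope
	for _, raw := range []string{"cas:Read tenant:spoke-elders", "actioncache:Read tenant:spoke-elders"} {
		sc, err := ParseScope(raw)
		if err != nil {
			panic(err)
		}
		granted = append(granted, sc)
	}
	fmt.Println(Allows(granted, "cas:Read", "spoke-elders"))  // verb and tenant both match
	fmt.Println(Allows(granted, "cas:Write", "spoke-elders")) // verb not granted
	fmt.Println(Allows(granted, "cas:Read", "spoke-blahaj"))  // tenant binding differs
}
```

One matching scope is enough; a token with many scopes never needs all of them to apply to a single request.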

cas:Read and cas:Write are split intentionally. The expected steady state is that PR CI lanes get cas:Read + actioncache:Read only, merged-main gets +Write on both, and developer tokens get cas:Read + actioncache:Read (no Write at all, consistent with W2.1’s threat model where developer laptops are never trusted writers).

The actioncache:Write scope is necessary but not sufficient for AC writes: W2.1 / TIN-1462 adds a second check (writer-attestation: pod identity = the worker SA, image digest in the allow-list, git_ref on refs/heads/main). Both must pass. W2.1 is the AC-write-specific gate; W4.2 is the general authorization layer. They share a JWT, but each enforces its own clause.

OIDC Provider Analysis

Four candidate issuer surfaces, compared on the dimensions that matter for the two-operator team: identity primitive, JWT shape, rotation cost, audit trail, integration cost.

(a) k8s ServiceAccount projected tokens

The kubelet projects a short-lived JWT into each pod, signed by the kube-apiserver, with sub=system:serviceaccount:<ns>:<sa> and a configurable audience. The token validates against the kube-apiserver’s OIDC discovery endpoint or via TokenReview. This is the same primitive W2.1 / TIN-1462 picked for AC writer attestation.

Dimension	Score
Identity primitive	k8s `ServiceAccount` — already what `tofu/modules/arc-runner/` and the spoke modules consume
JWT shape	Native OIDC JWT; configurable audience; `tenant` claim added via the token-projection audience or via a small claim-mapper
Rotation cost	Zero operator cost — kubelet rotates automatically; default 1h, can tighten to 15min
Audit trail	`sub` carries the SA identity; `kubernetes.io/serviceaccount/pod.uid` carries pod identity
Integration cost	Near zero. Cell validates via kube-apiserver discovery (already in-cluster trust)
Cross-cluster posture	Single-cluster only (Honey). A future off-cluster worker would need SPIFFE federation

(b) tsidp (Tailscale Identity Provider)

tsidp is the Tailscale-native OIDC IdP. Identity is bound to a Tailscale node identity; the IdP issues OIDC tokens with sub matching the tailnet user or device. The repo already references Tailscale for runner authentication.

Dimension	Score
Identity primitive	Tailscale node / user — natural for developer laptops on the tailnet
JWT shape	Standard OIDC JWT; tenant claim would need to be set via a tsidp-side mapping (one-per-user)
Rotation cost	Medium — tsidp issues tokens on demand; client refresh discipline required
Audit trail	Good for human users; less good for unattended workers (a “tailscale device” is a coarse identity)
Integration cost	Medium — stand up the tsidp endpoint, configure ACLs, ship a developer credential helper
Cross-cluster posture	Strong — tailnet identity is cluster-agnostic; works equally well off-Honey

(c) Self-hosted Keycloak / Dex

A general-purpose OIDC IdP run inside the cluster. Federates with GitHub, Google, GitLab, LDAP, whatever. Standard, well-trodden, but a whole service to operate.

Dimension	Score
Identity primitive	Whatever federation backends it’s configured with
JWT shape	Fully customizable; `tenant` is a first-class custom claim
Rotation cost	High — operator owns the IdP service, its database, its JWKS rotation, its upgrade cadence
Audit trail	Excellent (Keycloak event log is rich)
Integration cost	High — new service in the cluster, new alert surface, new failure mode
Cross-cluster posture	Strong (IdP is centralized; clusters trust the same issuer)

(d) GitHub Actions OIDC

GitHub Actions issues a workflow-scoped OIDC JWT on demand; the token encodes the repo, the workflow, the branch, and the actor. We exchange it for a gf-reapi-cell JWT at a small token-exchange endpoint on the cell. This is workflow provenance, not runner placement. For GloriousFlywheel’s own merge-blocking validation, security, Bzlmod/Bazel, and RBE proof lanes, the workflow still runs on shared tinyland-* self-hosted runners.

Dimension	Score
Identity primitive	GitHub repo + workflow + ref — exactly the cardinality CI lane authorization wants
JWT shape	The incoming token is a GitHub OIDC JWT; the exchange endpoint mints the `gf-reapi-cell` JWT with `tenant`/`scopes` set per repo+ref policy
Rotation cost	One token per workflow run; no rotation in the long-running sense
Audit trail	Excellent — every token names the repo, the workflow file, the actor, the ref
Integration cost	Low — small token-exchange endpoint; no IdP to host
Cross-cluster posture	Cluster-agnostic; works for any CI-side caller

Scorecard

Provider	Operability	Rotation	Audit	Integration	E4 / W2.1 fit
(a) k8s SA projected tokens	Strong	Strong	Good	Strong	Strong (W2.1 already picked this)
(b) tsidp	Medium	Medium	Medium	Medium	Good — tenant via mapping
(c) Self-hosted Keycloak / Dex	Weak (new service)	Medium	Strong	Weak	Good
(d) GitHub Actions OIDC (via exchange endpoint)	Strong	n/a (one-shot)	Strong	Strong	Strong

Recommendation

Pick (a) + (d). They coexist and cover different surfaces:

(a) k8s SA projected tokens for in-cluster workers, in-cluster CI runners (ARC, the gf-rbe namespace), and the cell-internal system:* identity.
(d) GitHub Actions OIDC for GitHub workflow provenance — trusted same-repo ARC dogfood jobs, external tenant CI callers, and explicit control-plane exceptions can exchange this for a gf-reapi-cell JWT at a small token-exchange endpoint. This does not approve ubuntu-latest as a first-party dogfood path.

Three reasons:

W2.1 already chose (a). The AC writer attestation design picked k8s SA projected tokens; reusing the same substrate for general authorization keeps the identity story singular instead of forking. One token shape, one validator middleware, two enforcement clauses (W2.1 + W4.2).
(d) is necessary because (a) does not name the GitHub workflow. A k8s ServiceAccount token proves the in-cluster pod identity, but not the repository, workflow file, actor, or ref that requested work. GitHub Actions OIDC supplies that provenance on both self-hosted and external Actions callers. A token-exchange endpoint that validates the GitHub OIDC token and mints a gf-reapi-cell JWT is the canonical pattern (this is how every cloud provider’s GitHub Actions OIDC flow works); it is small, well-trodden, and matches the PR-lane / merged-main-lane policy story the W2.1 doc already sketches. It is not a hosted-runner fallback for GloriousFlywheel itself.
(b) tsidp is a real option for developer laptops on the tailnet, especially if read-only dev access becomes a regular need. Document it as the fallback for the dev path; do not block on it. If developer access ends up wanted via tailnet identity, we revisit.

JWT Contents

The validated token shape. Required claims:

Claim	Type / value	Purpose
`iss`	Issuer URL (kube-apiserver discovery URL, or `gf-reapi-cell` token-exchange endpoint)	Identifies the signer; JWKS lookup uses this
`aud`	`gf-reapi-cell.gf-rbe.svc` (audience-scoped)	Stops tokens issued for one audience being replayed at another
`sub`	`system:serviceaccount:gf-rbe:<sa>` (a) or `repo:tinyland-inc/<repo>:ref:<ref>` (d)	Workload identity; audit-log primary key
`exp`	Unix timestamp, `now + lifetime`	Short lifetime caps replay window
`iat`	Unix timestamp	Time of issuance
`nbf`	Unix timestamp	Not-before; equals or precedes `iat`
`tenant`	`spoke-<slug>` \| `default` \| `system`	The spoke slug; matches `instance_name` validator regex from W4.1
`scopes`	`[]string` of `<verb>:<resource> tenant:<slug>` strings	The verb-resource-tenant set this token authorizes
`worker_image_digest`	`sha256:<digest>` (workers only)	Optional; carries the AC-writer-attestation digest for W2.1 parity
`jti`	Unique token identifier	Per-token forensic primitive; used by W2.3 audit
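For orientation, the claim table can be read as a Go struct plus a sample payload. This is an illustrative sketch only: the Claims type and DecodeClaims function are hypothetical, every value in samplePayload is a placeholder, and aud is modeled as a single string (real issuers may emit an array).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Claims mirrors the claim table. JSON names follow the table; the Go type
// itself is illustrative and not the cell's actual struct.
type Claims struct {
	Iss               string   `json:"iss"`
	Aud               string   `json:"aud"` // assumption: single string, not an array
	Sub               string   `json:"sub"`
	Exp               int64    `json:"exp"`
	Iat               int64    `json:"iat"`
	Nbf               int64    `json:"nbf"`
	Tenant            string   `json:"tenant"`
	Scopes            []string `json:"scopes"`
	WorkerImageDigest string   `json:"worker_image_digest,omitempty"` // workers only
	Jti               string   `json:"jti"`
}

// samplePayload is a made-up payload for an in-cluster PR worker with a
// 15-minute lifetime. Issuer URL and jti are placeholders.
const samplePayload = `{
  "iss": "https://issuer.example",
  "aud": "gf-reapi-cell.gf-rbe.svc",
  "sub": "system:serviceaccount:gf-rbe:gf-reapi-cell-pr",
  "exp": 1700000900,
  "iat": 1700000000,
  "nbf": 1700000000,
  "tenant": "spoke-elders",
  "scopes": [
    "cas:Read tenant:spoke-elders",
    "actioncache:Read tenant:spoke-elders",
    "remoteexecution:Run tenant:spoke-elders"
  ],
  "jti": "00000000-0000-4000-8000-000000000000"
}`

// DecodeClaims unmarshals an already-verified JWT payload into Claims.
func DecodeClaims(payload []byte) (Claims, error) {
	var c Claims
	err := json.Unmarshal(payload, &c)
	return c, err
}

func main() {
	c, err := DecodeClaims([]byte(samplePayload))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s holds %d scopes on %s for %d seconds\n", c.Sub, len(c.Scopes), c.Tenant, c.Exp-c.Iat)
}
```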

Validation rules gf-reapi-cell applies on every inbound RPC:

Signature. Verify against the JWKS resolved from iss. JWKS is cached with a short TTL (5 min); JWKS rotation is fetched on cache miss.
Issuer allow-list. iss must be one of the configured trusted issuers (initially: the kube-apiserver and the cell’s own token-exchange endpoint). Any other iss → reject UNAUTHENTICATED.
Audience. aud must equal gf-reapi-cell.gf-rbe.svc. Mismatch → reject UNAUTHENTICATED.
Expiry. exp > now, nbf <= now. Outside the window → reject UNAUTHENTICATED.
Tenant claim. tenant must match ^(spoke-[a-z][a-z0-9-]{1,62}|default|system)$ (same regex as W4.1).
Scope shape. Each scope string parses as <verb>:<resource> tenant:<slug> or system:*. Malformed scopes → reject UNAUTHENTICATED.

Validation failure on any of the above returns UNAUTHENTICATED (gRPC 16) — distinct from authorization failure on the operation itself, which returns PERMISSION_DENIED (gRPC 7) or, on cross-tenant data access, quiet NOT_FOUND consistent with W4.1.
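The claim-level rules (everything after the signature check) can be sketched as one function. This is a hedged illustration: ValidateClaims, the trimmed Claims type, and the scopeRe grammar are assumptions made for the example, signature verification is assumed to have already succeeded, and ErrUnauthenticated stands in for the gRPC status.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

const expectedAud = "gf-reapi-cell.gf-rbe.svc"

var (
	// tenantRe is the tenant regex quoted in the validation rules.
	tenantRe = regexp.MustCompile(`^(spoke-[a-z][a-z0-9-]{1,62}|default|system)$`)
	// scopeRe is an assumed grammar for the flat scope strings.
	scopeRe = regexp.MustCompile(`^([a-z]+:[A-Za-z]+ tenant:[a-z][a-z0-9-]*|system:\*)$`)
	// ErrUnauthenticated stands in for gRPC status UNAUTHENTICATED (16).
	ErrUnauthenticated = errors.New("UNAUTHENTICATED")
)

// Claims holds only the fields the claim-level rules inspect.
type Claims struct {
	Iss, Aud, Tenant string
	Exp, Nbf         int64 // Unix seconds
	Scopes           []string
}

func reject(reason string) error {
	return fmt.Errorf("%w: %s", ErrUnauthenticated, reason)
}

// ValidateClaims applies the issuer, audience, expiry, tenant, and scope-shape
// rules in order and returns the first failure.
func ValidateClaims(c Claims, trustedIssuers map[string]bool, nowUnix int64) error {
	switch {
	case !trustedIssuers[c.Iss]:
		return reject("issuer not in allow-list")
	case c.Aud != expectedAud:
		return reject("audience mismatch")
	case c.Exp <= nowUnix || c.Nbf > nowUnix:
		return reject("outside exp/nbf window")
	case !tenantRe.MatchString(c.Tenant):
		return reject("tenant claim malformed")
	}
	for _, s := range c.Scopes {
		if !scopeRe.MatchString(s) {
			return reject("malformed scope " + s)
		}
	}
	return nil
}

func main() {
	trusted := map[string]bool{"https://issuer.example": true} // placeholder issuer
	c := Claims{
		Iss: "https://issuer.example", Aud: expectedAud, Tenant: "spoke-elders",
		Exp: 1700000900, Nbf: 1700000000,
		Scopes: []string{"cas:Read tenant:spoke-elders"},
	}
	fmt.Println(ValidateClaims(c, trusted, 1700000100)) // inside the validity window
	fmt.Println(ValidateClaims(c, trusted, 1700001000)) // after exp
}
```

Every failure wraps the same sentinel, so a caller can map all of them to UNAUTHENTICATED while the reason string feeds the audit row.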

The Authz Check in gf-reapi-cell

Implementation status: the first in-cell slice now exists behind GF_REAPI_AUTHZ_MODE=off|warn|enforce. It validates RSA-signed JWTs from configured JWKS issuers, checks aud=gf-reapi-cell.gf-rbe.svc, requires sub, tenant, scopes, jti, exp, iat, and nbf, and maps CAS, AC, ByteStream, Execute, and WaitExecution RPCs to the scope table below. The token-exchange endpoint and Bazel credential helper remain future rollout steps; current live proofs keep authz off until those callers can mint tokens.

A request flows through the cell as follows:

Edge: extract token. Read authorization: Bearer <jwt> from gRPC metadata. Missing token → UNAUTHENTICATED.
Edge: validate token. Apply the six rules above. Token validated → attach a (sub, tenant, scopes, jti) tuple to request context.
Per-handler: scope check. Each RPC handler maps its operation to a scope verb (e.g. BatchReadBlobs → cas:Read). Look up whether (verb, tenant) is in the token’s scope set:
- Operation verb not authorized for this tenant → PERMISSION_DENIED. This is the explicit identity error: “your token does not grant cas:Read tenant:spoke-elders”.
- tenant claim mismatches instance_name on the request → also PERMISSION_DENIED. The token authorizes tenant:spoke-A; the request is for tenant:spoke-B. The two must agree.
Per-handler: cross-tenant data lookups. When the operation reads data scoped to a tenant the token does not authorize, the response is NOT_FOUND (per the W4.1 quiet-default-deny rule). This applies only to the cross-instance data path, not to the token-vs-request-instance check above. The distinction:
- Token says tenant:spoke-A, request says instance_name=spoke-B → PERMISSION_DENIED (identity defect: caller asked for the wrong namespace).
- Token says tenant:spoke-A, request says instance_name=spoke-A, but the digest being read exists only in spoke-B’s namespace → NOT_FOUND (data isolation; do not confirm cross-tenant existence).
Audit emit. On every accept and every reject, emit an audit row carrying {ts, sub, tenant, jti, rpc, instance_name, outcome, reject_reason}. The audit shape is W2.3’s contract; this doc commits to the fields.

The W2.1 writer-attestation clause runs in addition to step 3 on the UpdateActionResult path: even with actioncache:Write tenant:<slug> in scope, the pod identity + image digest + git_ref checks from W2.1 must also pass. Failing either clause returns PERMISSION_DENIED; the audit row distinguishes the two via reject_reason.
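The per-handler decision (scope check, then data lookup) can be sketched as a small pure function. This is an illustration under stated assumptions: Outcome, Identity, and Authorize are hypothetical names, outcomes are plain strings rather than gRPC status values, unknown RPCs are denied (an assumption), and the system:* scope and the W2.1 clause are left out.

```go
package main

import "fmt"

// Outcome mirrors the gRPC status names used in this doc; plain strings keep
// the sketch free of non-stdlib imports.
type Outcome string

const (
	OK               Outcome = "OK"
	PermissionDenied Outcome = "PERMISSION_DENIED"
	NotFound         Outcome = "NOT_FOUND"
)

// rpcVerb maps each RPC to its scope verb, following the scope table.
var rpcVerb = map[string]string{
	"FindMissingBlobs":   "cas:Read",
	"BatchReadBlobs":     "cas:Read",
	"ByteStream.Read":    "cas:Read",
	"BatchUpdateBlobs":   "cas:Write",
	"ByteStream.Write":   "cas:Write",
	"GetActionResult":    "actioncache:Read",
	"UpdateActionResult": "actioncache:Write",
	"Execute":            "remoteexecution:Run",
	"WaitExecution":      "remoteexecution:Run",
}

// Identity is the validated tuple attached to request context, reduced to
// what the scope check needs.
type Identity struct {
	Tenant string
	Verbs  map[string]bool // verbs granted for Tenant
}

// Authorize applies the scope check, then the data-isolation rule.
// foundInNamespace says whether the requested data exists in instanceName's
// own namespace; data held only by another tenant is simply not found.
func Authorize(id Identity, rpc, instanceName string, foundInNamespace bool) Outcome {
	verb, known := rpcVerb[rpc]
	if !known || id.Tenant != instanceName || !id.Verbs[verb] {
		return PermissionDenied // identity defect: wrong namespace or verb
	}
	if !foundInNamespace {
		return NotFound // do not confirm cross-tenant existence
	}
	return OK
}

func main() {
	pr := Identity{Tenant: "spoke-a", Verbs: map[string]bool{"cas:Read": true, "actioncache:Read": true}}
	fmt.Println(Authorize(pr, "BatchReadBlobs", "spoke-a", true))      // OK
	fmt.Println(Authorize(pr, "BatchReadBlobs", "spoke-b", true))      // PERMISSION_DENIED
	fmt.Println(Authorize(pr, "BatchReadBlobs", "spoke-a", false))     // NOT_FOUND
	fmt.Println(Authorize(pr, "UpdateActionResult", "spoke-a", true))  // PERMISSION_DENIED
}
```

The ordering matters: the identity check runs before any storage lookup, so a mismatched caller learns nothing about what exists in either namespace.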

Token Rotation

For (a) k8s SA projected tokens:

The kubelet refreshes the projected token once it is older than 80% of its expirationSeconds (the k8s default refresh point).
The token file at /var/run/secrets/tokens/gf-reapi-cell-token is rewritten in place; the projected volume is the rotation channel.
Long-running gRPC connections that authenticated at connection-time must re-read the token at RPC time (the credential helper handles this), or the cell rejects on exp.
Recommended expirationSeconds=900 (15 min) for workers; the cell accepts tokens up to their exp, no longer.
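The timing for projected tokens works out as follows with the recommended 15-minute lifetime. This is a small arithmetic sketch; RefreshPoint and ShouldReread are hypothetical helpers operating on the token's own iat and exp claims (Unix seconds).

```go
package main

import "fmt"

// RefreshPoint returns the Unix time at which 80% of the token's lifetime
// has elapsed, which is when a refreshed file can be expected on disk.
func RefreshPoint(iat, exp int64) int64 {
	return iat + (exp-iat)*8/10
}

// ShouldReread reports whether a caller holding a token in memory should
// re-read the projected file before its next RPC.
func ShouldReread(iat, exp, now int64) bool {
	return now >= RefreshPoint(iat, exp)
}

func main() {
	iat, exp := int64(1700000000), int64(1700000900) // expirationSeconds=900
	fmt.Println(RefreshPoint(iat, exp) - iat)        // 720 seconds after issuance
	fmt.Println(ShouldReread(iat, exp, iat+600))     // false: 10 minutes in
	fmt.Println(ShouldReread(iat, exp, iat+780))     // true: past the refresh point
}
```

With a 900-second lifetime, a client that re-reads the file on each RPC (or at least after the 720-second mark) has a 180-second margin before the cell would reject the old token on exp.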

For (d) GitHub Actions OIDC:

The GitHub OIDC token is one-shot per workflow run (issued by the Actions runtime, exchanged once at the cell’s token-exchange endpoint).
The exchange endpoint mints a gf-reapi-cell JWT with exp = now + 60min.
For workflows longer than 60 min: the workflow re-fetches a new GitHub OIDC token and re-exchanges. The credential helper handles this on the Bazel side.

For (b) tsidp (fallback, dev only):

Tokens are issued on demand by tsidp; short-lived (5 min recommended).
Developer credential helper re-fetches from tsidp on expiry.

Bazel Credential Helper

Bazel’s --credential_helper flag (Bazel 6.1+) invokes a helper binary that reads stdin (a JSON GetCredentialsRequest) and writes stdout (a JSON GetCredentialsResponse with headers). gf-reapi-cell ships a helper binary that:

Reads the helper request (Bazel passes the target URL).
Picks the right token source based on environment:
- In-cluster (k8s pod): reads /var/run/secrets/tokens/gf-reapi-cell-token. Always fresh on each invocation (the kubelet keeps the file current).
- GitHub Actions runner: fetches a GitHub OIDC token from the Actions runtime (ACTIONS_ID_TOKEN_REQUEST_URL + ACTIONS_ID_TOKEN_REQUEST_TOKEN), exchanges it at the cell’s /v1/token/exchange endpoint, caches the result until exp - 60s.
- Developer machine: fetches from tsidp (fallback path) or from the dev-token issuer (gf-rbe-dev-issuer, see open questions).
Returns {"headers": {"Authorization": ["Bearer <jwt>"]}}.
Never caches stale: on expiry, refetch; on fetch failure, exit nonzero so Bazel surfaces the error loudly. Fail closed.

Helper binary location: gf-reapi-cell/cmd/gf-reapi-credhelper/ in the cell’s source tree, shipped alongside the cell binary and the cell OCI image.

Implementation status: the first helper slice exists for projected-token and explicit-token callers. It implements Bazel’s get protocol, reads GF_REAPI_CREDENTIAL_HELPER_TOKEN_FILE, GF_REAPI_CREDENTIAL_HELPER_TOKEN, or the default k8s projected-token path /var/run/secrets/tokens/gf-reapi-cell-token, requires a JWT exp claim, and returns Authorization: Bearer <jwt> with an expiry one minute before exp. The GitHub Actions OIDC exchange and developer issuer paths are still future work; the helper deliberately fails closed instead of minting or accepting opaque long-lived tokens.
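The core of the helper's response path can be sketched as below. This is a hedged illustration, not the shipped helper: BuildResponse, tokenExp, and fakeJWT are hypothetical names, the token comes from a demo function rather than a file, and the expires field name is an assumption to check against the Bazel version in use.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"strings"
	"time"
)

// Response is the JSON document the helper writes to stdout.
type Response struct {
	Headers map[string][]string `json:"headers"`
	Expires string              `json:"expires,omitempty"` // assumed field name
}

// tokenExp reads exp from the JWT payload without verifying the signature;
// verification is the cell's job, the helper only needs the expiry.
func tokenExp(jwt string) (int64, error) {
	parts := strings.Split(jwt, ".")
	if len(parts) != 3 {
		return 0, errors.New("token is not a three-part JWT")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return 0, err
	}
	var c struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(raw, &c); err != nil {
		return 0, err
	}
	if c.Exp == 0 {
		return 0, errors.New("token has no exp claim")
	}
	return c.Exp, nil
}

// BuildResponse fails closed: a token without exp, or within 60s of exp,
// yields an error instead of a header.
func BuildResponse(jwt string, nowUnix int64) (Response, error) {
	exp, err := tokenExp(jwt)
	if err != nil {
		return Response{}, err
	}
	cutoff := exp - 60
	if nowUnix >= cutoff {
		return Response{}, errors.New("token is expired or within 60s of exp")
	}
	return Response{
		Headers: map[string][]string{"Authorization": {"Bearer " + jwt}},
		Expires: time.Unix(cutoff, 0).UTC().Format(time.RFC3339),
	}, nil
}

// fakeJWT builds an unsigned placeholder token for the demo only.
func fakeJWT(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	return enc([]byte(`{"alg":"none"}`)) + "." + enc([]byte(fmt.Sprintf(`{"exp":%d}`, exp))) + ".sig"
}

func main() {
	now := time.Now().Unix()
	resp, err := BuildResponse(fakeJWT(now+900), now)
	if err == nil {
		err = json.NewEncoder(os.Stdout).Encode(resp)
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // nonzero exit so Bazel surfaces the failure
	}
}
```

A real helper would read the token from the projected file or environment described above in place of fakeJWT; the fail-closed shape stays the same.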

Bazel wiring (proposal for .bazelrc):

build --credential_helper=gf-reapi-cell.gf-rbe.svc=%workspace%/tools/gf-reapi-cell-credhelper

The helper is one binary per platform; the cell publishes Linux x86_64 and macOS arm64 builds as release artifacts.

CI vs Dev Posture

The posture matrix, per caller class. Cross-references W2.1’s “single AC writer” property — only merged-main CI gets the actioncache:Write scope.

Caller class	Identity source	Scope set granted	Notes
In-cluster merged-main CI worker	(a) k8s SA `gf-reapi-cell-worker`	`cas:{Read,Write} tenant:spoke-<slug>` + `actioncache:{Read,Write} tenant:spoke-<slug>` + `remoteexecution:Run tenant:spoke-<slug>` (W2.1 also enforced on AC write)	The single AC writer per W2.1.
In-cluster PR CI worker	(a) k8s SA `gf-reapi-cell-pr`	`cas:Read tenant:spoke-<slug>` + `actioncache:Read tenant:spoke-<slug>` + `remoteexecution:Run tenant:spoke-<slug>`	Read-only on cache; can execute but cannot poison.
GitHub-Actions PR CI	(d) GitHub OIDC → exchange	`cas:Read` + `actioncache:Read` (tenant scoped by exchange policy)	Token-exchange policy reads `repo`+`ref` claims; PR refs get read-only.
GitHub-Actions merged-main CI	(d) GitHub OIDC → exchange	`cas:{Read,Write}` + `actioncache:{Read,Write}` + `remoteexecution:Run` (tenant scoped)	Exchange policy: `ref:refs/heads/main` + repo allow-list → write scopes.
Developer machine	(b) tsidp (fallback) or `gf-rbe-dev-issuer`	`cas:Read tenant:<the dev's tenant>` + `actioncache:Read tenant:<...>`	Read-only. Cannot write AC under any condition.
Spoke runner (cross-cluster, future)	TBD (likely SPIFFE)	per-spoke scope set	Out of scope for v1; flagged for future cross-cluster work.
`gf-reapi-cell` itself (internal probes)	(a) k8s SA `gf-reapi-cell-system`	`system:*`	Used for the synthetic TTFCH probe and the cell’s own health checks.
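The two GitHub Actions rows of the matrix imply a small exchange policy, sketched here. This is an assumption-laden illustration: Policy, repoPolicy, and ScopesFor are hypothetical, the repository name and tenant in the allow-list are placeholders, and the inputs correspond to the repository and ref claims of the GitHub OIDC token.

```go
package main

import "fmt"

// Policy is one hypothetical allow-list row: which tenant a repo maps to.
type Policy struct {
	Tenant string
}

// repoPolicy is a placeholder allow-list; real entries would come from config.
var repoPolicy = map[string]Policy{
	"tinyland-inc/example-spoke": {Tenant: "spoke-example"},
}

// ScopesFor maps the GitHub OIDC repository and ref claims to scope strings.
// ok is false when the repository is not allow-listed, in which case the
// exchange endpoint would mint nothing.
func ScopesFor(repository, ref string) (scopes []string, ok bool) {
	p, found := repoPolicy[repository]
	if !found {
		return nil, false
	}
	// Every allow-listed caller gets read-only cache access.
	verbs := []string{"cas:Read", "actioncache:Read"}
	// Only merged-main gets write and execute scopes.
	if ref == "refs/heads/main" {
		verbs = append(verbs, "cas:Write", "actioncache:Write", "remoteexecution:Run")
	}
	for _, v := range verbs {
		scopes = append(scopes, v+" tenant:"+p.Tenant)
	}
	return scopes, true
}

func main() {
	for _, ref := range []string{"refs/pull/42/merge", "refs/heads/main"} {
		scopes, ok := ScopesFor("tinyland-inc/example-spoke", ref)
		fmt.Println(ref, ok, scopes)
	}
	_, ok := ScopesFor("someone-else/fork", "refs/heads/main")
	fmt.Println("someone-else/fork", ok)
}
```

Keeping the ref check as an exact match on refs/heads/main means any other ref, including release branches and tags, falls through to read-only unless the policy is extended deliberately.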

Integration with Siblings

This doc is the authorization substrate. Each sibling adds its own enforcement clause on top.

W2.1 AC writer attestation (TIN-1462). Already picked k8s SA projected tokens. This doc reuses the same JWT shape for the general authorization layer; W2.1 is the AC-write-specific clause on top. Both must validate; both must agree. The audit log row carries the JWT’s sub, tenant, worker_image_digest, and jti so W2.1 can distinguish “AC write rejected because the token lacked actioncache:Write” from “AC write rejected because the image digest was not in the W2.1 allow-list.”
W4.1 instance-name routing (TIN-1472). The tenant claim on the JWT must equal the instance_name on the request. Mismatch is a defect (PERMISSION_DENIED). W4.1 routes; W4.2 authorizes the routing.
W4.3 executor pool selection (TIN-1474). Pool selection reads the validated tenant claim from request context (set by this doc’s middleware) and chooses the pool. The pool selector does not re-validate the token; it consumes the context.
W4.4 quota enforcement (TIN-1475). Quotas key on tenant. The quota enforcer joins ConfigMap-declared budgets (from spoke-cache-quota) to live CAS bytes-used metrics, both keyed by tenant. The tenant here is the same JWT claim this doc validates.
W4.5 tenant-aware proof (TIN-1476). Proofs in config/rbe-target-eligibility.json may eventually carry an instance_name field (per W4.1 Open Question 7); the proof harness authenticates using this design’s JWT.
W2.3 audit log (TIN-1464). Captures the JWT sub, tenant, and jti on every RPC. Without these, W2.3 cannot answer “who wrote this” forensically.
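The audit fields this doc commits to can be pictured as one JSON line per decision. This is a sketch only: the AuditRow type and EncodeAuditRow function are hypothetical, the final shape is W2.3's contract, and omitting reject_reason on accepted requests is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AuditRow carries the fields {ts, sub, tenant, jti, rpc, instance_name,
// outcome, reject_reason}. JSON names follow that list.
type AuditRow struct {
	TS           string `json:"ts"` // RFC 3339 timestamp
	Sub          string `json:"sub"`
	Tenant       string `json:"tenant"`
	JTI          string `json:"jti"`
	RPC          string `json:"rpc"`
	InstanceName string `json:"instance_name"`
	Outcome      string `json:"outcome"`
	RejectReason string `json:"reject_reason,omitempty"` // assumption: absent on accept
}

// EncodeAuditRow renders one row as a single JSON line.
func EncodeAuditRow(r AuditRow) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	row := AuditRow{
		TS:           "2024-01-01T00:00:00Z",
		Sub:          "system:serviceaccount:gf-rbe:gf-reapi-cell-pr",
		Tenant:       "spoke-elders",
		JTI:          "00000000-0000-4000-8000-000000000000", // placeholder
		RPC:          "UpdateActionResult",
		InstanceName: "spoke-elders",
		Outcome:      "reject",
		RejectReason: "scope_missing:actioncache:Write", // illustrative reason string
	}
	line, err := EncodeAuditRow(row)
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```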

Failure Mode Table

Failure	Exposure today	Defense in this design	Residual risk
Expired token reused	n/a — no tokens today	`exp` checked on every RPC; reject `UNAUTHENTICATED`	Window between `exp` and request = zero
Spoofed token (forged signature)	n/a	Signature verified against issuer JWKS; reject on mismatch	Issuer key compromise (separate row)
Scope escalation (`spoke-A` claims `tenant:spoke-B`)	High — `instance_name` is currently a free-form client string	Token signature binds the `tenant` claim to issuer-trusted identity; `spoke-A`’s issuer cannot mint `tenant:spoke-B`	Issuer-side mapping bug (e.g. tsidp ACL misconfig) — caught by the audit log + dashboard skew alert
OIDC issuer (kube-apiserver) compromise	n/a (no issuer)	Short token TTL caps replay; rotate issuer keys; AC integrity inherits cluster trust	If the kube-apiserver is owned, RBE is the least of the problems
Token-exchange endpoint compromise	n/a	Exchange endpoint runs in `gf-rbe`, signed JWTs only; compromise = full IAM bypass; mitigation: short TTL + audit	Same shape as kube-apiserver compromise — cluster-trust posture
Credential helper failure (no token returned)	n/a	Helper exits nonzero; Bazel surfaces “credential helper failed”; build errors loudly. Fail closed.	None — failure is loud by design
Credential helper caches stale token past expiry	n/a	Helper never caches past `exp - 60s`; refetches on every invocation when no cached token is fresh	A bug here is a fail-open; nightly chaos test should assert helper refreshes on stale-token error
Cross-cluster identity (off-cluster worker)	n/a (no multi-cluster today)	Audience-scoped JWTs (`aud=gf-reapi-cell.gf-rbe.svc`); an untrusted external token would not validate	Future work: SPIFFE / cross-cluster federation if off-cluster workers become real
Replay of a captured live token	n/a	Short TTL (15 min for workers, 5 min for dev, 60 min for CI exchange) + audience check	Replay possible within TTL window from the same network — acceptable; optional `jti` single-use store later
`gf-rbe-dev-issuer` over-scoped tokens	n/a	Dev issuer must mint tokens with `cas:Read` + `actioncache:Read` only; never `Write`; never `system:*`	Operator discipline; see Open Questions
Token leaked to disk / .envrc / Slack	n/a	Short TTL bounds blast radius; audit log catches anomalous `sub` on `jti` reuse pattern	TTL window + the leakage detection rigor of W2.3
Issuer JWKS rotation flips the validator into reject-all	n/a	JWKS cache TTL is short (5 min); cell fetches on miss; warning alert on sustained JWKS-fetch failure	A bad rotation could black out the cell for up to the JWKS TTL; accept, and document the on-call playbook
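The JWKS caching behavior in the last row (short TTL, fetch on miss, keep a previously cached key when a refresh fails) can be sketched as follows. This is one possible shape under stated assumptions: JWKSCache, KeySet, and the injected fetch function are hypothetical, keys are opaque strings rather than parsed public keys, and locking and rate-limiting of unknown-kid fetches are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const jwksTTL = 5 * time.Minute

var errFetch = errors.New("jwks endpoint unreachable")

// KeySet maps kid to key material (placeholder for parsed public keys).
type KeySet map[string]string

// JWKSCache is a minimal, non-concurrent TTL cache in front of a fetch func.
type JWKSCache struct {
	ttl     time.Duration
	fetch   func() (KeySet, error)
	keys    KeySet
	fetched time.Time
}

func NewJWKSCache(ttl time.Duration, fetch func() (KeySet, error)) *JWKSCache {
	return &JWKSCache{ttl: ttl, fetch: fetch}
}

// Key returns the key for kid, refetching when the cache is older than ttl
// or the kid is unknown. If the refetch fails but the kid is already cached,
// the prior key keeps being served.
func (c *JWKSCache) Key(kid string, now time.Time) (string, error) {
	_, known := c.keys[kid]
	if !known || now.Sub(c.fetched) >= c.ttl {
		ks, err := c.fetch()
		switch {
		case err == nil:
			c.keys, c.fetched = ks, now
		case !known:
			return "", fmt.Errorf("kid %q not cached and refresh failed: %w", kid, err)
		}
	}
	key, ok := c.keys[kid]
	if !ok {
		return "", fmt.Errorf("kid %q not in issuer JWKS", kid)
	}
	return key, nil
}

// demoStart is a fixed clock origin so the example is deterministic.
var demoStart = time.Unix(1700000000, 0)

func main() {
	down := false
	cache := NewJWKSCache(jwksTTL, func() (KeySet, error) {
		if down {
			return nil, errFetch
		}
		return KeySet{"k1": "public-key-1"}, nil
	})
	fmt.Println(cache.Key("k1", demoStart)) // first use: fetched
	down = true
	fmt.Println(cache.Key("k1", demoStart.Add(jwksTTL))) // refresh fails, prior key served
	fmt.Println(cache.Key("k2", demoStart.Add(jwksTTL))) // unknown kid, refresh fails: error
}
```

Serving a prior key during an issuer outage trades a longer trust window for availability; an unknown kid still fails, so a token signed by a key the cell has never seen is not accepted.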

Rollout Plan

Sequential. Each step is a separately reviewable change; each step is revertable via a single feature flag on the cell’s Deployment.

Land this design doc. Recommendation locked: (a) + (d). Open questions itemized.
Land the JWT validation middleware in gf-reapi-cell. Configured with one trusted issuer initially (the kube-apiserver discovery URL). Middleware runs in warn-only mode: validates tokens, logs results, does not reject on absent or invalid tokens. This proves the validator on real traffic before flipping enforcement.
Stand up the token-exchange endpoint for GitHub Actions. New handler /v1/token/exchange on the cell. Validates incoming GitHub OIDC tokens against https://token.actions.githubusercontent.com, applies the policy table (repo allow-list, ref → scope mapping), mints gf-reapi-cell JWTs with 60-min TTL. Add the GitHub issuer to the trusted-issuer set on the cell.
Ship the Bazel credential helper. First slice landed as gf-reapi-cell/cmd/gf-reapi-credhelper/ for projected-token and explicit JWT callers. Remaining rollout work: release artifact packaging, token-exchange integration, developer issuer integration, and .bazelrc / per-spoke wrapper wiring with --credential_helper=....
Roll out per-tenant scopes for tenant:default. The migration cohort that hasn’t adopted spoke instances yet gets cas:Read + actioncache:Read + remoteexecution:Run on default; in-cluster merged-main CI gets +Write. Cell stays in warn-only mode.
Migrate one spoke at a time. For each spoke in lanes.json: provision the ServiceAccount, attach the projected token volume, define the GitHub exchange policy, and grant the spoke’s scope set. Spoke CI starts sending --remote_instance_name=spoke-<slug> and the new JWT.
Flip enforcement: default-deny on cross-tenant. Once every active caller is on a real tenant:<slug> (visible on the dashboard as zero un-tokenized traffic for 7 days), flip the cell from warn-only to enforce. Cross-tenant reads return NOT_FOUND; missing-token requests return UNAUTHENTICATED.
Drop default (per W4.1’s migration plan; W4.2 follows W4.1’s schedule).

Rollback: a single feature flag on the cell’s Deployment (GF_REAPI_AUTHZ_MODE=off|warn|enforce) flips back to warn-only at any step. The audit log continues to record rejections even in warn mode.

Open Questions

These do not block landing this doc. They do block closing E4 / TIN-1448.

OIDC provider final pick. This doc proposes (a)+(d). The operator may have a strong preference for tsidp (b) given Tailscale-native infra. Defended: (a) already chosen by W2.1, (d) covers the GitHub-hosted CI surface that (a) cannot reach, and (b) can be added as a developer-side fallback without re-architecting. Recommendation pending operator sign-off. If (b) is preferred for the dev path, adopt it alongside (a)+(d); they all validate the same JWT shape.
JWT lifetime: 15 min vs 60 min? Trade-off: shorter is safer (smaller replay window) but increases token-refresh frequency for long-running builds. Some target classes (docs-site:build cold, web-playwright-chromium-static-smoke) can run > 15 min. Recommendation: 15 min for SA-projected workers (kubelet rotates for free), 60 min for GitHub Actions exchange tokens (one-shot per workflow). Revisit after W2.5 chaos test passes for 14 days.
Credential helper binary location and packaging. The first slice lives at gf-reapi-cell/cmd/gf-reapi-credhelper/. Open: is the helper a separate release artifact, or always shipped inside the cell OCI image and kubectl cp'd out by operators? Recommendation: both, with a release artifact for dev machines and a copy bundled in the image for in-cluster use.
How does gf-rbe-dev-issuer get bootstrapped without becoming a vending machine for over-scoped tokens? The dev-mode issuer must mint only cas:Read + actioncache:Read scopes. Open: what authenticates a developer to the dev issuer? Three options: (i) tsidp identity (Tailscale-native; needs (b) in production); (ii) GitHub OIDC via a CLI flow (gh auth status → bearer → exchange); (iii) static dev tokens issued by an operator (least secure). Recommendation pending: (i) if (b) is adopted; (ii) otherwise. (iii) is rejected.
Cross-tenant NOT_FOUND vs PERMISSION_DENIED semantics. This doc says: token-vs-instance mismatch → PERMISSION_DENIED; cross-tenant data access with correct token-instance binding → NOT_FOUND. This is intentional but worth one more pass against the W4.1 default-deny matrix. Pending: confirm with Codex.
JWKS cache TTL. 5 min is the proposal. Open: should the cell pre-fetch JWKS on startup and refresh on a fixed schedule, or lazy-load on cache miss? Recommendation: pre-fetch on startup + refresh every 5 min in a background loop; on a refresh error, keep validating against the previously cached keys rather than rejecting all traffic.
Scope language: flat strings, or structured (CEL / OPA)? Initial proposal: flat <verb>:<resource> tenant:<slug> strings. Pre-committing to a policy language for a handful of verbs is over-engineering; revisit when the scope surface grows.

References

External:

REAPI v2 spec — Action, Platform, request headers; instance_name field on every request.
Bazel --credential_helper — the client-side hook for per-request auth headers.
Bazel --remote_header / --remote_cache_header — alternative header injection if the credential helper is not available.
Kubernetes ServiceAccount token volume projection — the (a) primitive.
Kubernetes TokenReview API — server-side validation primitive.
GitHub Actions OIDC for cloud providers — canonical exchange pattern for (d).
SPIFFE / SPIRE — cross-cluster generalization of (a), for a future off-cluster worker case.
EngFlow multi-tenancy IAM patterns — inspiration only; peers, not adoption candidates. Scope shape (<verb>:<resource> tenant:<slug>) cribbed; no code adopted.
BuildBuddy API key model — comparable trust-boundary pattern in a class peer; not adopted.

Repo-local:

docs/build-system/ac-writer-attestation-design.md — W2.1 voice exemplar + tight sibling; chose k8s SA projected tokens for AC writer attestation. This doc reuses that substrate.
docs/build-system/instance-name-routing-design.md — W4.1 voice exemplar + tight sibling; the tenant identity model this doc authorizes on.
docs/build-system/slo.md — voice exemplar; the SLOs this design gates.
docs/build-system/gf-reapi-cell.md — current REAPI cell shape; where the validator middleware lands.
tofu/modules/arc-runner/ — current runner identity shape; may inform what’s already wired for SA-based identity.
tofu/modules/spoke-cache-quota/, tofu/modules/spoke-runner-binding/, tofu/modules/spoke-state-namespace/ — the tenant declaration layer the JWT tenant claim binds to.
config/rbe-target-eligibility.json — proof schema; per W4.5 may eventually carry instance_name.

Linear:

Parent epic: TIN-1448 (E4 tenant model).
This workstream: TIN-1473 (W4.2 IAM + OIDC tenant claim).
Tight siblings: TIN-1472 (W4.1 instance-name routing), TIN-1474 (W4.3 executor pools), TIN-1475 (W4.4 quota enforcement), TIN-1476 (W4.5 tenant-aware proof).
Cross-epic siblings: TIN-1462 (W2.1 AC writer attestation — shares the JWT substrate), TIN-1464 (W2.3 audit log — consumes JWT sub, tenant, jti).
Related: TIN-1446 (E2 AC authority), TIN-1449 (E5 observability), TIN-1450 (E6 target-class breadth).