IAM + OIDC Tenant Claim Design
Decision summary
- Status: First gf-reapi-cell enforcement slice in progress (W4.2 / TIN-1473, under parent E4 / TIN-1448).
- Rule: Every REAPI call to gf-reapi-cell carries a short-lived OIDC-shaped JWT in an Authorization: Bearer header. The cell validates the signature against a trusted issuer JWKS, asserts aud=gf-reapi-cell.gf-rbe.svc, and enforces the tenant claim against the request’s instance_name and operation scope.
- Recommended issuers: (a) k8s ServiceAccount projected tokens for in-cluster workers + (d) GitHub Actions OIDC for workflow provenance, federated through a small token-exchange endpoint on the cell. GitHub OIDC is an identity signal, not permission to run first-party dogfood on GitHub-hosted runners. They coexist and cover different surfaces. (b) tsidp is the documented fallback for Tailscale-native developer flows.
- JWT lifetime: 15–60 min. Recommend 15 min for ServiceAccount-projected workers (kubelet rotates for free), 60 min for the GitHub Actions exchange token (one-shot per workflow), 5 min for developer tokens.
- Scope shape: EngFlow-inspired (cas:Read, cas:Write, actioncache:Read, actioncache:Write, remoteexecution:Run) paired with a tenant binding (tenant:<slug>). No third-party REAPI code on the data path; the validation middleware is in-house in gf-reapi-cell.
- What blocks if absent: E4 / TIN-1448 cannot close. W4.1 routing without W4.2 IAM is unauthenticated namespacing: a caller can claim any instance_name. W4.3 pool selection (TIN-1474) and W4.4 quotas (TIN-1475) key on the same tenant claim. W2.3 audit log (TIN-1464) needs the JWT sub and tenant to be queryable.
Frame
W4.1 / TIN-1472 decided the
routing primitive: each spoke gets remote_instance_name=spoke-<slug> and
gf-reapi-cell keys CAS / AC storage on that pair. The default-deny
behavior on cross-tenant reads (quiet NOT_FOUND) closes the digest-guess
info-disclosure channel — a spoke-B caller asking for spoke-A’s blob by
digest gets the same answer as if the blob did not exist. That is one
defense, and it is real: the cell cannot leak a blob it does not route to
in the first place.
But routing is only honest when the caller’s claimed tenant identity is
verified. A misconfigured spoke that sets
--remote_instance_name=spoke-elders while running the spoke-blahaj
toolchain is the obvious case; the malicious case is a compromised CI lane
or a hand-edited .bazelrc flipping the instance string. The second
defense — the one this doc owns — is explicit authorization at the call
layer: every REAPI request carries a signed token that names the caller’s
tenant, and gf-reapi-cell refuses to act on any request whose token does
not authorize the instance + operation pair. Routing tells the cell which
namespace to look in; IAM tells the cell whether the caller is allowed to
ask. Both must hold; neither replaces the other.
The Scope Model
EngFlow’s multi-tenancy IAM is the closest reference in the ecosystem and
this doc cribs the shape — scope verbs paired with a tenant binding —
without adopting EngFlow’s server, SDK, or call path. The implementation
lives entirely inside gf-reapi-cell’s in-house Go middleware. We do not
import EngFlow code; we do not run their server.
The scope set:
| Scope | Grants | Operation surface |
|---|---|---|
| cas:Read tenant:<slug> | Read CAS blobs scoped to instance_name=spoke-<slug> | FindMissingBlobs, BatchReadBlobs, ByteStream.Read |
| cas:Write tenant:<slug> | Write CAS blobs into the tenant’s namespace | BatchUpdateBlobs, ByteStream.Write |
| actioncache:Read tenant:<slug> | Read AC entries from the tenant’s namespace | GetActionResult |
| actioncache:Write tenant:<slug> | Write AC entries into the tenant’s namespace | UpdateActionResult (also gated by W2.1 writer attestation) |
| remoteexecution:Run tenant:<slug> | Submit Execute / WaitExecution requests | Execute, WaitExecution |
| system:* (reserved) | Cross-tenant internal: cell health, audit reads, synthetic probes | All RPCs; only granted to gf-reapi-cell itself |
A token may carry multiple scopes; the validator accepts a request when at
least one of them matches it. A request matches a scope when (a) the
operation verb maps to the scope verb, and (b) the request’s instance_name
equals the scope’s tenant:<slug> binding (or the scope is system:*).
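The matching rule above can be sketched in Go as follows. This is a minimal illustration, not the cell's middleware: the names Scope, parseScope, and allows are hypothetical, and the sketch assumes the tenant binding is compared to instance_name as a plain string (for example tenant:spoke-elders against instance_name=spoke-elders).

```go
package main

import (
	"fmt"
	"strings"
)

// Scope is a parsed scope string (hypothetical type, not the cell's API).
type Scope struct {
	Verb   string // e.g. "cas:Read"
	Tenant string // e.g. "spoke-elders"; empty when System is true
	System bool   // true for the reserved system:* scope
}

// parseScope accepts "<verb>:<resource> tenant:<slug>" or "system:*".
func parseScope(s string) (Scope, error) {
	if s == "system:*" {
		return Scope{System: true}, nil
	}
	verb, binding, ok := strings.Cut(s, " ")
	if !ok || !strings.Contains(verb, ":") {
		return Scope{}, fmt.Errorf("malformed scope %q", s)
	}
	tenant := strings.TrimPrefix(binding, "tenant:")
	if tenant == binding || tenant == "" {
		return Scope{}, fmt.Errorf("malformed tenant binding in %q", s)
	}
	return Scope{Verb: verb, Tenant: tenant}, nil
}

// allows reports whether at least one scope authorizes verb on instanceName.
func allows(scopes []Scope, verb, instanceName string) bool {
	for _, sc := range scopes {
		if sc.System || (sc.Verb == verb && sc.Tenant == instanceName) {
			return true
		}
	}
	return false
}

func main() {
	sc, err := parseScope("cas:Read tenant:spoke-elders")
	if err != nil {
		panic(err)
	}
	scopes := []Scope{sc}
	fmt.Println(allows(scopes, "cas:Read", "spoke-elders"))  // same verb, same tenant
	fmt.Println(allows(scopes, "cas:Write", "spoke-elders")) // verb not granted
	fmt.Println(allows(scopes, "cas:Read", "spoke-blahaj"))  // tenant mismatch
}
```

Only the first call is allowed; the other two fail on the verb and on the tenant binding respectively.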
cas:Read and cas:Write are split intentionally. The expected steady
state is that PR CI lanes get cas:Read + actioncache:Read only,
merged-main gets +Write on both, and developer tokens get cas:Read +
actioncache:Read (no Write at all, consistent with W2.1’s threat
model where developer laptops are never trusted writers).
The actioncache:Write scope is necessary but not sufficient for AC
writes: W2.1 / TIN-1462 adds
a second check (writer-attestation: pod identity = the worker SA, image
digest in the allow-list, git_ref on refs/heads/main). Both must pass.
W2.1 is the AC-write-specific gate; W4.2 is the general authorization
layer. They share a JWT, and each enforces its own clause.
OIDC Provider Analysis
Four candidate issuer surfaces, compared on the dimensions that matter for the two-operator team: identity primitive, JWT shape, rotation cost, audit trail, integration cost.
(a) k8s ServiceAccount projected tokens
The kubelet projects a short-lived JWT into each pod, signed by the
kube-apiserver, with sub=system:serviceaccount:<ns>:<sa> and a configurable
audience. The token validates against the kube-apiserver’s OIDC discovery
endpoint or via TokenReview. This is the same primitive W2.1 / TIN-1462
picked for AC writer attestation.
| Dimension | Score |
|---|---|
| Identity primitive | k8s ServiceAccount — already what tofu/modules/arc-runner/ and the spoke modules consume |
| JWT shape | Native OIDC JWT; configurable audience; tenant claim added via the token-projection audience or via a small claim-mapper |
| Rotation cost | Zero operator cost — kubelet rotates automatically; default 1h, can tighten to 15min |
| Audit trail | sub carries the SA identity; kubernetes.io/serviceaccount/pod.uid carries pod identity |
| Integration cost | Near zero. Cell validates via kube-apiserver discovery (already in-cluster trust) |
| Cross-cluster posture | Single-cluster only (Honey). A future off-cluster worker would need SPIFFE federation |
(b) tsidp (Tailscale Identity Provider)
tsidp is the Tailscale-native OIDC IdP. Identity is bound to a
Tailscale node identity; the IdP issues OIDC tokens with sub matching
the tailnet user or device. The repo already references Tailscale for
runner authentication.
| Dimension | Score |
|---|---|
| Identity primitive | Tailscale node / user — natural for developer laptops on the tailnet |
| JWT shape | Standard OIDC JWT; tenant claim would need to be set via a tsidp-side mapping (one-per-user) |
| Rotation cost | Medium — tsidp issues tokens on demand; client refresh discipline required |
| Audit trail | Good for human users; less good for unattended workers (a “tailscale device” is a coarse identity) |
| Integration cost | Medium — stand up the tsidp endpoint, configure ACLs, ship a developer credential helper |
| Cross-cluster posture | Strong — tailnet identity is cluster-agnostic; works equally well off-Honey |
(c) Self-hosted Keycloak / Dex
A general-purpose OIDC IdP run inside the cluster. Federates with GitHub, Google, GitLab, LDAP, whatever. Standard, well-trodden, but a whole service to operate.
| Dimension | Score |
|---|---|
| Identity primitive | Whatever federation backends it’s configured with |
| JWT shape | Fully customizable; tenant is a first-class custom claim |
| Rotation cost | High — operator owns the IdP service, its database, its JWKS rotation, its upgrade cadence |
| Audit trail | Excellent (Keycloak event log is rich) |
| Integration cost | High — new service in the cluster, new alert surface, new failure mode |
| Cross-cluster posture | Strong (IdP is centralized; clusters trust the same issuer) |
(d) GitHub Actions OIDC
GitHub Actions issues a workflow-scoped OIDC JWT on demand; the token
encodes the repo, the workflow, the branch, and the actor. We exchange it
for a gf-reapi-cell JWT at a small token-exchange endpoint on the cell.
This is workflow provenance, not runner placement. For GloriousFlywheel’s own
merge-blocking validation, security, Bzlmod/Bazel, and RBE proof lanes, the
workflow still runs on shared tinyland-* self-hosted runners.
| Dimension | Score |
|---|---|
| Identity primitive | GitHub repo + workflow + ref — exactly the cardinality CI lane authorization wants |
| JWT shape | The incoming token is a GitHub OIDC JWT; the exchange endpoint mints the gf-reapi-cell JWT with tenant/scopes set per repo+ref policy |
| Rotation cost | One token per workflow run; no rotation in the long-running sense |
| Audit trail | Excellent — every token names the repo, the workflow file, the actor, the ref |
| Integration cost | Low — small token-exchange endpoint; no IdP to host |
| Cross-cluster posture | Cluster-agnostic; works for any CI-side caller |
Scorecard
| Provider | Operability | Rotation | Audit | Integration | E4 / W2.1 fit |
|---|---|---|---|---|---|
| (a) k8s SA projected tokens | Strong | Strong | Good | Strong | Strong (W2.1 already picked this) |
| (b) tsidp | Medium | Medium | Medium | Medium | Good — tenant via mapping |
| (c) Self-hosted Keycloak / Dex | Weak (new service) | Medium | Strong | Weak | Good |
| (d) GitHub Actions OIDC (via exchange endpoint) | Strong | n/a (one-shot) | Strong | Strong | Strong |
Recommendation
Pick (a) + (d). They coexist and cover different surfaces:
- (a) k8s SA projected tokens for in-cluster workers, in-cluster CI runners (ARC, the gf-rbe namespace), and the cell-internal system:* identity.
- (d) GitHub Actions OIDC for GitHub workflow provenance: trusted same-repo ARC dogfood jobs, external tenant CI callers, and explicit control-plane exceptions can exchange this for a gf-reapi-cell JWT at a small token-exchange endpoint. This does not approve ubuntu-latest as a first-party dogfood path.
Three reasons:
- W2.1 already chose (a). The AC writer attestation design picked k8s SA projected tokens; reusing the same substrate for general authorization keeps the identity story singular instead of forking. One token shape, one validator middleware, two enforcement clauses (W2.1 + W4.2).
- (d) is necessary because (a) does not name the GitHub workflow. A k8s ServiceAccount token proves the in-cluster pod identity, but not the repository, workflow file, actor, or ref that requested work. GitHub Actions OIDC supplies that provenance on both self-hosted and external Actions callers. A token-exchange endpoint that validates the GitHub OIDC token and mints a gf-reapi-cell JWT is the canonical pattern (this is how every cloud provider’s GitHub Actions OIDC flow works); it is small, well-trodden, and matches the PR-lane / merged-main-lane policy story the W2.1 doc already sketches. It is not a hosted-runner fallback for GloriousFlywheel itself.
- (b) tsidp is a real option for developer laptops on the tailnet, especially if read-only dev access becomes a regular need. Document it as the fallback for the dev path; do not block on it. If developer access ends up wanted via tailnet identity, we revisit.
(c) is rejected: a whole IdP service for the two-operator team is too much surface for what (a)+(d) cover natively.
JWT Contents
The validated token shape. Required claims:
| Claim | Type / value | Purpose |
|---|---|---|
| iss | Issuer URL (kube-apiserver discovery URL, or gf-reapi-cell token-exchange endpoint) | Identifies the signer; JWKS lookup uses this |
| aud | gf-reapi-cell.gf-rbe.svc (audience-scoped) | Stops tokens issued for one audience being replayed at another |
| sub | system:serviceaccount:gf-rbe:<sa> (a) or repo:tinyland-inc/<repo>:ref:<ref> (d) | Workload identity; audit-log primary key |
| exp | Unix timestamp, now + lifetime | Short lifetime caps replay window |
| iat | Unix timestamp | Time of issuance |
| nbf | Unix timestamp | Not-before; equals or precedes iat |
| tenant | spoke-<slug>, default, or system | The spoke slug; matches instance_name validator regex from W4.1 |
| scopes | []string of <verb>:<resource> tenant:<slug> strings | The verb-resource-tenant set this token authorizes |
| worker_image_digest | sha256:<digest> (workers only) | Optional; carries the AC-writer-attestation digest for W2.1 parity |
| jti | Unique token identifier | Per-token forensic primitive; used by W2.3 audit |
Validation rules gf-reapi-cell applies on every inbound RPC:
- Signature. Verify against the JWKS resolved from iss. JWKS is cached with a short TTL (5 min); JWKS rotation is fetched on cache miss.
- Issuer allow-list. iss must be one of the configured trusted issuers (initially: the kube-apiserver and the cell’s own token-exchange endpoint). Any other iss → reject UNAUTHENTICATED.
- Audience. aud must equal gf-reapi-cell.gf-rbe.svc. Mismatch → reject UNAUTHENTICATED.
- Expiry. exp > now, nbf <= now. Outside the window → reject UNAUTHENTICATED.
- Tenant claim. tenant must match ^(spoke-[a-z][a-z0-9-]{1,62}|default|system)$ (same regex as W4.1).
- Scope shape. Each scope string parses as <verb>:<resource> tenant:<slug> or system:*. Malformed scopes → reject UNAUTHENTICATED.

Validation failure on any of the above returns UNAUTHENTICATED (gRPC 16),
distinct from authorization failure on the operation itself, which
returns PERMISSION_DENIED (gRPC 7) or, on cross-tenant data access,
quiet NOT_FOUND consistent with W4.1.
The Authz Check in gf-reapi-cell
Implementation status: the first in-cell slice now exists behind
GF_REAPI_AUTHZ_MODE=off|warn|enforce. It validates RSA-signed JWTs from
configured JWKS issuers, checks aud=gf-reapi-cell.gf-rbe.svc, requires
sub, tenant, scopes, jti, exp, iat, and nbf, and maps CAS, AC,
ByteStream, Execute, and WaitExecution RPCs to the scope table above. The
token-exchange endpoint and Bazel credential helper remain future rollout
steps; current live proofs keep authz off until those callers can mint tokens.
A request flows through the cell as follows:
1. Edge: extract token. Read authorization: Bearer <jwt> from gRPC metadata. Missing token → UNAUTHENTICATED.
2. Edge: validate token. Apply the six rules above. Token validated → attach a (sub, tenant, scopes, jti) tuple to request context.
3. Per-handler: scope check. Each RPC handler maps its operation to a scope verb (e.g. BatchReadBlobs → cas:Read). Look up whether (verb, tenant) is in the token’s scope set:
   - Operation verb not authorized for this tenant → PERMISSION_DENIED. This is the explicit identity error: “your token does not grant cas:Read tenant:spoke-elders”.
   - tenant claim mismatches instance_name on the request → also PERMISSION_DENIED. The token authorizes tenant:spoke-A; the request is for tenant:spoke-B. The two must agree.
4. Per-handler: cross-tenant data lookups. When the operation reads data scoped to a tenant the token does not authorize, the response is NOT_FOUND (per the W4.1 quiet-default-deny rule). This applies only to the cross-instance data path, not to the token-vs-request-instance check above. The distinction:
   - Token says tenant:spoke-A, request says instance_name=spoke-B → PERMISSION_DENIED (identity defect: caller asked for the wrong namespace).
   - Token says tenant:spoke-A, request says instance_name=spoke-A, but the digest being read exists only in spoke-B’s namespace → NOT_FOUND (data isolation; do not confirm cross-tenant existence).
5. Audit emit. On every accept and every reject, emit an audit row carrying {ts, sub, tenant, jti, rpc, instance_name, outcome, reject_reason}. The audit shape is W2.3’s contract; this doc commits to the fields.
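The outcome rules in the flow above can be condensed into a single decision function. The Go sketch below is illustrative only: Code, Token, and authorize are hypothetical names, the numeric values are the standard gRPC status codes, and the system:* scope is omitted for brevity.

```go
package main

import "fmt"

// Code mirrors the gRPC status codes named in this section.
type Code int

const (
	OK               Code = 0
	NotFound         Code = 5
	PermissionDenied Code = 7
	Unauthenticated  Code = 16
)

// Token is the validated context tuple, reduced to what the check needs.
type Token struct {
	Tenant string
	Verbs  map[string]bool // scope verbs granted for Tenant
}

// authorize returns the outcome for one request. blobInNamespace says whether
// the requested digest exists in the request's own instance namespace.
func authorize(tok *Token, verb, instanceName string, blobInNamespace bool) Code {
	if tok == nil {
		return Unauthenticated // missing or invalid token
	}
	if tok.Tenant != instanceName {
		return PermissionDenied // identity defect: wrong namespace requested
	}
	if !tok.Verbs[verb] {
		return PermissionDenied // verb not granted for this tenant
	}
	if !blobInNamespace {
		return NotFound // quiet default-deny: never confirm cross-tenant data
	}
	return OK
}

func main() {
	tok := &Token{Tenant: "spoke-a", Verbs: map[string]bool{"cas:Read": true}}
	fmt.Println(authorize(nil, "cas:Read", "spoke-a", true))  // 16
	fmt.Println(authorize(tok, "cas:Read", "spoke-b", true))  // 7
	fmt.Println(authorize(tok, "cas:Write", "spoke-a", true)) // 7
	fmt.Println(authorize(tok, "cas:Read", "spoke-a", false)) // 5
	fmt.Println(authorize(tok, "cas:Read", "spoke-a", true))  // 0
}
```

Note that the NOT_FOUND branch is reached only after both identity checks pass, which is what keeps the two error classes distinct.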
The W2.1 writer-attestation clause runs in addition to step 3 on the
UpdateActionResult path: even with actioncache:Write tenant:<slug> in
scope, the pod identity + image digest + git_ref checks from W2.1 must
also pass. Failing either clause returns PERMISSION_DENIED; the audit row
distinguishes the two via reject_reason.
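One way to picture the audit row is as a flat JSON record. The Go sketch below uses the field names committed to above; the AuditRow type, the encode helper, the example timestamp and jti, and the reject_reason value missing_scope are all hypothetical, since the actual vocabulary belongs to W2.3.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AuditRow carries the fields listed in the audit-emit step.
type AuditRow struct {
	TS           string `json:"ts"`
	Sub          string `json:"sub"`
	Tenant       string `json:"tenant"`
	JTI          string `json:"jti"`
	RPC          string `json:"rpc"`
	InstanceName string `json:"instance_name"`
	Outcome      string `json:"outcome"`
	// RejectReason is omitted on accepts; on rejects it separates the
	// scope clause from the W2.1 writer-attestation clause.
	RejectReason string `json:"reject_reason,omitempty"`
}

// encode renders one row as a single JSON line.
func encode(r AuditRow) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	row := AuditRow{
		TS:           "2025-01-01T00:00:00Z", // example value
		Sub:          "system:serviceaccount:gf-rbe:gf-reapi-cell-pr",
		Tenant:       "spoke-elders",
		JTI:          "example-jti",
		RPC:          "UpdateActionResult",
		InstanceName: "spoke-elders",
		Outcome:      "reject",
		RejectReason: "missing_scope", // hypothetical reason code
	}
	line, err := encode(row)
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```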
Token Rotation
For (a) k8s SA projected tokens:
- The kubelet rotates the projected token once it reaches 80% of expirationSeconds (the k8s default refresh point).
- The token file at /var/run/secrets/tokens/gf-reapi-cell-token is rewritten in place; the projected volume is the rotation channel.
- Long-running gRPC connections that authenticated at connection-time must re-read the token at RPC time (the credential helper handles this), or the cell rejects on exp.
- Recommended expirationSeconds=900 (15 min) for workers; the cell accepts tokens up to their exp, no longer.
For (d) GitHub Actions OIDC:
- The GitHub OIDC token is one-shot per workflow run (issued by the Actions runtime, exchanged once at the cell’s token-exchange endpoint).
- The exchange endpoint mints a gf-reapi-cell JWT with exp = now + 60min.
- For workflows longer than 60 min: the workflow re-fetches a new GitHub OIDC token and re-exchanges. The credential helper handles this on the Bazel side.
For (b) tsidp (fallback, dev only):
- Tokens are issued on demand by tsidp; short-lived (5 min recommended).
- Developer credential helper re-fetches from tsidp on expiry.
Bazel Credential Helper
Bazel’s --credential_helper flag (Bazel 6.1+) invokes a helper binary
that reads stdin (a JSON GetCredentialsRequest) and writes stdout (a
JSON GetCredentialsResponse with headers). gf-reapi-cell ships a
helper binary that:
- Reads the helper request (Bazel passes the target URL).
- Picks the right token source based on environment:
  - In-cluster (k8s pod): reads /var/run/secrets/tokens/gf-reapi-cell-token. Always fresh on each invocation (the kubelet keeps the file current).
  - GitHub Actions runner: fetches a GitHub OIDC token from the Actions runtime (ACTIONS_ID_TOKEN_REQUEST_URL + ACTIONS_ID_TOKEN_REQUEST_TOKEN), exchanges it at the cell’s /v1/token/exchange endpoint, caches the result until exp - 60s.
  - Developer machine: fetches from tsidp (fallback path) or from the dev-token issuer (gf-rbe-dev-issuer, see open questions).
- Returns {"headers": {"Authorization": ["Bearer <jwt>"]}}.
- Never caches stale: on expiry, refetch; on fetch failure, exit nonzero so Bazel surfaces the error loudly. Fail closed.
Helper binary location:
gf-reapi-cell/cmd/gf-reapi-credhelper/ in the cell’s source tree, shipped
alongside the cell binary and the cell OCI image.
Implementation status: the first helper slice exists for projected-token and
explicit-token callers. It implements Bazel’s get protocol, reads
GF_REAPI_CREDENTIAL_HELPER_TOKEN_FILE, GF_REAPI_CREDENTIAL_HELPER_TOKEN, or
the default k8s projected-token path
/var/run/secrets/tokens/gf-reapi-cell-token, requires a JWT exp claim, and
returns Authorization: Bearer <jwt> with an expiry one minute before exp.
The GitHub Actions OIDC exchange and developer issuer paths are still future
work; the helper deliberately fails closed instead of minting or accepting
opaque long-lived tokens.
Bazel wiring (proposal for .bazelrc):
build --credential_helper=gf-reapi-cell.gf-rbe.svc=%workspace%/tools/gf-reapi-cell-credhelper
The helper is one binary per platform; the cell publishes Linux x86_64 and macOS arm64 builds as release artifacts.
CI vs Dev Posture
The posture matrix, per caller class. Cross-references W2.1’s “single AC
writer” property — only merged-main CI gets the actioncache:Write scope.
| Caller class | Identity source | Scope set granted | Notes |
|---|---|---|---|
| In-cluster merged-main CI worker | (a) k8s SA gf-reapi-cell-worker | cas:{Read,Write} tenant:spoke-<slug> + actioncache:{Read,Write} tenant:spoke-<slug> + remoteexecution:Run tenant:spoke-<slug> (W2.1 also enforced on AC write) | The single AC writer per W2.1. |
| In-cluster PR CI worker | (a) k8s SA gf-reapi-cell-pr | cas:Read tenant:spoke-<slug> + actioncache:Read tenant:spoke-<slug> + remoteexecution:Run tenant:spoke-<slug> | Read-only on cache; can execute but cannot poison. |
| GitHub-Actions PR CI | (d) GitHub OIDC → exchange | cas:Read + actioncache:Read (tenant scoped by exchange policy) | Token-exchange policy reads repo+ref claims; PR refs get read-only. |
| GitHub-Actions merged-main CI | (d) GitHub OIDC → exchange | cas:{Read,Write} + actioncache:{Read,Write} + remoteexecution:Run (tenant scoped) | Exchange policy: ref:refs/heads/main + repo allow-list → write scopes. |
| Developer machine | (b) tsidp (fallback) or gf-rbe-dev-issuer | cas:Read tenant:<the dev's tenant> + actioncache:Read tenant:<...> | Read-only. Cannot write AC under any condition. |
| Spoke runner (cross-cluster, future) | TBD (likely SPIFFE) | per-spoke scope set | Out of scope for v1; flagged for future cross-cluster work. |
| gf-reapi-cell itself (internal probes) | (a) k8s SA gf-reapi-cell-system | system:* | Used for the synthetic TTFCH probe and the cell’s own health checks. |
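The two GitHub Actions rows in the matrix amount to a small repo+ref policy. A minimal Go sketch, assuming the repo allow-list gates every grant and that each allowed repo maps to exactly one tenant; the policy type, scopesFor, and the repo and tenant names are hypothetical.

```go
package main

import "fmt"

// policy is a hypothetical exchange policy: allowed repos mapped to tenants.
type policy struct {
	tenants map[string]string // repo -> tenant
}

// scopesFor returns the scope set to mint for a GitHub OIDC token with the
// given repo and ref claims. Unknown repos get nothing.
func (p policy) scopesFor(repo, ref string) []string {
	tenant, ok := p.tenants[repo]
	if !ok {
		return nil
	}
	bind := " tenant:" + tenant
	// Every allowed caller gets the read-only pair (the PR CI row).
	scopes := []string{"cas:Read" + bind, "actioncache:Read" + bind}
	// Only refs/heads/main gets write and execute (the merged-main row).
	if ref == "refs/heads/main" {
		scopes = append(scopes,
			"cas:Write"+bind, "actioncache:Write"+bind, "remoteexecution:Run"+bind)
	}
	return scopes
}

func main() {
	p := policy{tenants: map[string]string{
		"tinyland-inc/example-spoke": "spoke-example", // illustrative entry
	}}
	fmt.Println(p.scopesFor("tinyland-inc/example-spoke", "refs/pull/7/merge"))
	fmt.Println(p.scopesFor("tinyland-inc/example-spoke", "refs/heads/main"))
	fmt.Println(len(p.scopesFor("someone/fork", "refs/heads/main")))
}
```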
Integration with Siblings
This doc is the authorization substrate. Each sibling adds its own enforcement clause on top.
- W2.1 AC writer attestation (TIN-1462). Already picked k8s SA projected tokens. This doc reuses the same JWT shape for the general authorization layer; W2.1 is the AC-write-specific clause on top. Both must validate; both must agree. The audit log row carries the JWT’s sub, tenant, worker_image_digest, and jti so W2.1 can distinguish “AC write rejected because the token lacked actioncache:Write” from “AC write rejected because the image digest was not in the W2.1 allow-list.”
- W4.1 instance-name routing (TIN-1472). The tenant claim on the JWT must equal the instance_name on the request. Mismatch is a defect (PERMISSION_DENIED). W4.1 routes; W4.2 authorizes the routing.
- W4.3 executor pool selection (TIN-1474). Pool selection reads the validated tenant claim from request context (set by this doc’s middleware) and chooses the pool. The pool selector does not re-validate the token; it consumes the context.
- W4.4 quota enforcement (TIN-1475). Quotas key on tenant. The quota enforcer joins ConfigMap-declared budgets (from spoke-cache-quota) to live CAS bytes-used metrics, both keyed by tenant. The tenant here is the same JWT claim this doc validates.
- W4.5 tenant-aware proof (TIN-1476). Proofs in config/rbe-target-eligibility.json may eventually carry an instance_name field (per W4.1 Open Question 7); the proof harness authenticates using this design’s JWT.
- W2.3 audit log (TIN-1464). Captures the JWT sub, tenant, and jti on every RPC. Without these, W2.3 cannot answer “who wrote this” forensically.
Failure Mode Table
| Failure | Exposure today | Defense in this design | Residual risk |
|---|---|---|---|
| Expired token reused | n/a (no tokens today) | exp checked on every RPC; reject UNAUTHENTICATED | Window between exp and request = zero |
| Spoofed token (forged signature) | n/a | Signature verified against issuer JWKS; reject on mismatch | Issuer key compromise (separate row) |
| Scope escalation (spoke-A claims tenant:spoke-B) | High: instance_name is currently a free-form client string | Token signature binds the tenant claim to issuer-trusted identity; spoke-A’s issuer cannot mint tenant:spoke-B | Issuer-side mapping bug (e.g. tsidp ACL misconfig), caught by the audit log + dashboard skew alert |
| OIDC issuer (kube-apiserver) compromise | n/a (no issuer) | Short token TTL caps replay; rotate issuer keys; AC integrity inherits cluster trust | If the kube-apiserver is owned, RBE is the least of the problems |
| Token-exchange endpoint compromise | n/a | Exchange endpoint runs in gf-rbe, signed JWTs only; compromise = full IAM bypass; mitigation: short TTL + audit | Same shape as kube-apiserver compromise: cluster-trust posture |
| Credential helper failure (no token returned) | n/a | Helper exits nonzero; Bazel surfaces “credential helper failed”; build errors loudly. Fail closed. | None; failure is loud by design |
| Credential helper caches stale token past expiry | n/a | Helper never caches past exp - 60s; refetches on every invocation when no cached token is fresh | A bug here is a fail-open; nightly chaos test should assert helper refreshes on stale-token error |
| Cross-cluster identity (off-cluster worker) | n/a (no multi-cluster today) | Audience-scoped JWTs (aud=gf-reapi-cell.gf-rbe.svc); an untrusted external token would not validate | Future work: SPIFFE / cross-cluster federation if off-cluster workers become real |
| Replay of a captured live token | n/a | Short TTL (15 min for workers, 5 min for dev, 60 min for CI exchange) + audience check | Replay possible within TTL window from the same network; acceptable for now, with an optional jti single-use store later |
| gf-rbe-dev-issuer over-scoped tokens | n/a | Dev issuer must mint tokens with cas:Read + actioncache:Read only; never Write; never system:* | Operator discipline; see Open Questions |
| Token leaked to disk / .envrc / Slack | n/a | Short TTL bounds blast radius; audit log catches anomalous sub on jti reuse pattern | TTL window + the leakage detection rigor of W2.3 |
| Issuer JWKS rotation flips the validator into reject-all | n/a | JWKS cache TTL is short (5 min); cell fetches on miss; warning alert on sustained JWKS-fetch failure | A bad rotation could black out the cell for up to the JWKS TTL; accept, and document in the on-call playbook |
Rollout Plan
Sequential. Each step is a separately reviewable change; each step is revertable via a single feature flag on the cell’s Deployment.
- Land this design doc. Recommendation locked: (a) + (d). Open questions itemized.
- Land the JWT validation middleware in gf-reapi-cell. Configured with one trusted issuer initially (the kube-apiserver discovery URL). Middleware runs in warn-only mode: validates tokens, logs results, does not reject on absent or invalid tokens. This proves the validator on real traffic before flipping enforcement.
- Stand up the token-exchange endpoint for GitHub Actions. New handler /v1/token/exchange on the cell. Validates incoming GitHub OIDC tokens against https://token.actions.githubusercontent.com, applies the policy table (repo allow-list, ref → scope mapping), mints gf-reapi-cell JWTs with 60-min TTL. Add the GitHub issuer to the trusted-issuer set on the cell.
- Ship the Bazel credential helper. First slice landed as gf-reapi-cell/cmd/gf-reapi-credhelper/ for projected-token and explicit JWT callers. Remaining rollout work: release artifact packaging, token-exchange integration, developer issuer integration, and .bazelrc / per-spoke wrapper wiring with --credential_helper=....
- Roll out per-tenant scopes for tenant:default. The migration cohort that hasn’t adopted spoke instances yet gets cas:Read + actioncache:Read + remoteexecution:Run on default; in-cluster merged-main CI gets +Write. Cell stays in warn-only mode.
- Migrate one spoke at a time. For each spoke in lanes.json: provision the ServiceAccount, attach the projected token volume, define the GitHub exchange policy, and grant the spoke’s scope set. Spoke CI starts sending --remote_instance_name=spoke-<slug> and the new JWT.
- Flip enforcement: default-deny on cross-tenant. Once every active caller is on a real tenant:<slug> (visible on the dashboard as zero un-tokenized traffic for 7 days), flip the cell from warn-only to enforce. Cross-tenant reads return NOT_FOUND; missing-token requests return UNAUTHENTICATED.
- Drop default (per W4.1’s migration plan; W4.2 follows W4.1’s schedule).
Rollback: a single feature flag on the cell’s Deployment
(GF_REAPI_AUTHZ_MODE=warn|enforce) flips back to warn-only at any
step. The audit log continues to record rejections even in warn mode.
Open Questions
These do not block landing this doc. They do block closing E4 / TIN-1448.
- OIDC provider final pick. This doc proposes (a)+(d). The operator may have a strong preference for tsidp (b) given Tailscale-native infra. Defended: (a) already chosen by W2.1, (d) covers the GitHub-hosted CI surface that (a) cannot reach, and (b) can be added as a developer-side fallback without re-architecting. Recommendation pending operator sign-off. If (b) is preferred for the dev path, adopt it alongside (a)+(d); they all validate the same JWT shape.
- JWT lifetime: 15 min vs 60 min? Trade-off: shorter is safer (smaller replay window) but increases token-refresh frequency for long-running builds. Some target classes (docs-site:build cold, web-playwright-chromium-static-smoke) can run > 15 min. Recommendation: 15 min for SA-projected workers (kubelet rotates for free), 60 min for GitHub Actions exchange tokens (one-shot per workflow). Revisit after W2.5 chaos test passes for 14 days.
- Credential helper binary location and packaging. Proposal: gf-reapi-cell/cmd/credhelper/. Open: is the helper a separate release artifact, or always shipped inside the cell OCI image and kubectl cp’d out by operators? Recommendation: both, with a release artifact for dev machines and a bundled copy in the image for in-cluster use.
- How does gf-rbe-dev-issuer get bootstrapped without becoming a vending machine for over-scoped tokens? The dev-mode issuer must mint only cas:Read + actioncache:Read scopes. Open: what authenticates a developer to the dev issuer? Three options: (i) tsidp identity (Tailscale-native; needs (b) in production); (ii) GitHub OIDC via a CLI flow (gh auth status → bearer → exchange); (iii) static dev tokens issued by an operator (least secure). Recommendation pending: (i) if (b) is adopted; (ii) otherwise. (iii) is rejected.
- Cross-tenant NOT_FOUND vs PERMISSION_DENIED semantics. This doc says: token-vs-instance mismatch → PERMISSION_DENIED; cross-tenant data access with correct token-instance binding → NOT_FOUND. This is intentional but worth one more pass against the W4.1 default-deny matrix. Pending: confirm with Codex.
- JWKS cache TTL. 5 min is the proposal. Open: should the cell pre-fetch JWKS on startup and refresh on a fixed schedule, or lazy-load on cache miss? Recommendation: pre-fetch on startup + refresh every 5 min in a background loop; fail open on refresh error if a prior key is still in the cache.
- Scope language: flat strings, or structured (CEL / OPA)? Initial proposal: flat <verb>:<resource> tenant:<slug> strings. Pre-committing to a policy language for a handful of verbs is over-engineering; revisit when the scope surface grows.
References
External:
- REAPI v2 spec: Action, Platform, request headers; instance_name field on every request.
- Bazel --credential_helper: the client-side hook for per-request auth headers.
- Bazel --remote_header / --remote_cache_header: alternative header injection if the credential helper is not available.
- Kubernetes ServiceAccount token volume projection: the (a) primitive.
- Kubernetes TokenReview API: server-side validation primitive.
- GitHub Actions OIDC for cloud providers: canonical exchange pattern for (d).
- SPIFFE / SPIRE: cross-cluster generalization of (a), for a future off-cluster worker case.
- EngFlow multi-tenancy IAM patterns: inspiration only; peers, not adoption candidates. Scope shape (<verb>:<resource> tenant:<slug>) cribbed; no code adopted.
- BuildBuddy API key model: comparable trust-boundary pattern in a class peer; not adopted.
Repo-local:
- docs/build-system/ac-writer-attestation-design.md: W2.1 voice exemplar + tight sibling; chose k8s SA projected tokens for AC writer attestation. This doc reuses that substrate.
- docs/build-system/instance-name-routing-design.md: W4.1 voice exemplar + tight sibling; the tenant identity model this doc authorizes on.
- docs/build-system/slo.md: voice exemplar; the SLOs this design gates.
- docs/build-system/gf-reapi-cell.md: current REAPI cell shape; where the validator middleware lands.
- tofu/modules/arc-runner/: current runner identity shape; may inform what’s already wired for SA-based identity.
- tofu/modules/spoke-cache-quota/, tofu/modules/spoke-runner-binding/, tofu/modules/spoke-state-namespace/: the tenant declaration layer the JWT tenant claim binds to.
- config/rbe-target-eligibility.json: proof schema; per W4.5 may eventually carry instance_name.
Linear:
- Parent epic: TIN-1448 (E4 tenant model).
- This workstream: TIN-1473 (W4.2 IAM + OIDC tenant claim).
- Tight siblings: TIN-1472 (W4.1 instance-name routing), TIN-1474 (W4.3 executor pools), TIN-1475 (W4.4 quota enforcement), TIN-1476 (W4.5 tenant-aware proof).
- Cross-epic siblings: TIN-1462 (W2.1 AC writer attestation, shares the JWT substrate), TIN-1464 (W2.3 audit log, consumes JWT sub, tenant, jti).
- Related: TIN-1446 (E2 AC authority), TIN-1449 (E5 observability), TIN-1450 (E6 target-class breadth).