Action Cache Writer Attestation Design
Decision summary
- Status: Design landed on main; first feature-flagged cell primitive is in code. TIN-1462 — W2.1 under E2 / TIN-1446.
- Rule: Only an attested merged-main RBE worker running a pinned worker image digest in the gf-rbe namespace may write to the action cache. Everyone else reads only.
- Trusted writers: merged-main RBE worker pods, identified by a Kubernetes workload identity binding (recommended: SPIFFE-shaped k8s ServiceAccount projected token), allow-listed at the REAPI cell.
- Untrusted (read-only): developer laptops, PR CI lanes, self-hosted ARC runners outside gf-rbe, any caller without a valid attestation token, spoke / off-cluster / external consumers.
- Failure mode if attestation is wrong or absent: AC write returns PermissionDenied (gRPC 7) / HTTP 403; the rejected attempt is recorded in the AC audit log; a Prom alert fires on sustained nonzero gf_reapi_ac_write_rejected_total. E6/TIN-1450 target-class breadth acceleration is blocked until W2.1 + W2.3 are green.
The Problem
The action cache is input-addressed but trust-based. CAS entries are content-addressed and digest-verified on read — a corrupt CAS blob fails to match its own digest and is rejected. AC entries are different: the AC key is a hash of the action’s inputs (command, platform, input root digest), but the AC value is a bazel.remote.execution.v2.ActionResult message that asserts what running that action produced — output digests, exit codes, stdout/stderr. The cache trusts whoever wrote the value to have actually executed it faithfully on a matching platform.
There is no cryptographic proof that an AC entry is correct. An AC writer says “I ran this action, here is the result.” If the writer is honest and the platform is what it claims, downstream readers get a real speedup. If the writer is dishonest — wrong toolchain, tampered binary, malicious return value, a bazel build from a developer laptop whose /usr/bin/cc is a homebrew shim — every downstream reader that hits that AC entry inherits the lie.
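The asymmetry between the two stores can be shown with a toy model. The sketch below is illustrative only: ToyCAS and ToyAC are hypothetical in-memory stand-ins, not the gf-reapi-cell implementation, and the byte strings are made up. It shows that a tampered CAS blob is caught on read by re-hashing, while an AC entry is returned as-is because nothing in its key describes its value.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyCAS:
    """Content-addressed: the key is the digest of the value itself."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, digest: str, blob: bytes) -> None:
        # Stored unchecked here so the read-side verification is visible.
        self._blobs[digest] = blob

    def get(self, digest: str) -> bytes:
        blob = self._blobs[digest]
        if sha256_hex(blob) != digest:
            raise ValueError("CAS blob does not match its digest")
        return blob


class ToyAC:
    """Input-addressed: the key hashes the inputs, the value is a claim."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}

    def put(self, action_digest: str, result: dict) -> None:
        self._results[action_digest] = result

    def get(self, action_digest: str) -> dict:
        # Nothing in the key lets a reader verify the value.
        return self._results[action_digest]


# A corrupt CAS blob is detected on read.
cas = ToyCAS()
real_digest = sha256_hex(b"real object file")
cas.put(real_digest, b"tampered object file")
try:
    cas.get(real_digest)
    cas_detected = False
except ValueError:
    cas_detected = True

# A dishonest AC value is served unchanged.
ac = ToyAC()
action_digest = sha256_hex(b"command + platform + input root digest")
honest = {"exit_code": 0, "output_digest": real_digest}
poisoned = {"exit_code": 0, "output_digest": sha256_hex(b"tampered object file")}
ac.put(action_digest, poisoned)
ac_served_poison = ac.get(action_digest) == poisoned
```

Both results are well-formed; only the writer knows which one it actually produced, which is why the gate has to sit on writer identity.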
This is a one-way poison door:
- One bad AC write across one merged PR can poison the dependency closure for every downstream tenant.
- Once the entry lands, the cache hands it out to every subsequent reader until the entry is explicitly invalidated.
- Cache invalidation in a content-addressed-key world is hard: the only way to “remove” a poisoned entry is to evict it and ensure the next writer is trusted, because the key (input digest) hasn’t changed.
- Readers cannot detect poisoning by reading; they would have to re-execute the action and compare, which defeats the cache.
The defense is not “make AC writes safer.” The defense is to restrict who can write at all: only attested, merged-main RBE workers running a pinned worker image, in the gf-rbe namespace, with cryptographic identity that the AC layer checks before accepting a write.
E2/W2.1 is the most load-bearing safety gate in the RBE Production Readiness initiative. It must land before E6 (TIN-1450) target-class breadth acceleration. The cost of getting W2.1 wrong is irreversible cache poisoning across the multi-tenant substrate; the cost of stalling on it is permanent E6 freeze. Pick a mechanism, ship it, iterate.
The Threat Model
Who could write AC entries today without attestation, what they could poison, and what the consequence is.
| Writer | Why this could happen today | What they could poison | Consequence |
|---|---|---|---|
| Developer laptop running bazel build | A future .bazelrc edit, copied legacy config, or user override silently flips --remote_upload_local_results=true; a developer hand-passes --remote_executor= to the cell. Toolchain on a Mac laptop is not the hermetic gf-reapi-cell worker image. | CppCompile / JsRunBinary / GenRule AC entries with results produced under macOS/homebrew/non-hermetic glibc | Every CI run and every other developer hitting the same AC key inherits a result built by a wrong toolchain. Detection is “things rebuild weirdly” or “tests pass locally, fail in CI.” Recovery: invalidate the entry (nuke-key drill, W2.4). |
| PR CI lane (untrusted code) | A PR introduces a malicious BUILD.bazel or genrule that, when executed, writes attacker-controlled bytes to an output file referenced by a stable AC key. If the PR CI lane can write AC, the attacker can pre-poison main. | Any AC entry whose key is reachable from PR-introduced action graph | Cross-PR poison: PR #X’s malicious genrule writes an AC entry, PR #Y (innocent) gets the poisoned entry on its next CI run. Supply-chain attack vector. This is the canonical reason PR CI must be read-only. |
| Self-hosted runner with stale toolchain | An ARC runner pod that hasn’t been rotated in 30 days has a nix-store PVC with old derivations; if it’s allowed to write AC, it writes results produced by stale toolchain. | Any mnemonic whose action depends on the stale derivation | Slow drift: cache becomes inconsistent with current main. Manifests as “main rebuilds clean, incremental CI is wrong.” Hardest to detect because the staleness window is gradual. |
| Compromised credential / leaked AC write token | A long-lived bearer token / TLS cert leaks via .envrc, GitHub Actions secret exfiltration, or a developer’s ~/.bazelrc.user. Attacker writes AC entries from anywhere with internet access. | Anything the credential’s scope allows | Full poison authority for the lifetime of the credential. Detection requires audit log review (W2.3). Recovery: revoke the credential, invalidate every AC entry written under it (this is why audit logs need writer identity, not just timestamps). |
| Non-main branch CI lane | A feature/* branch CI lane configured (or misconfigured) to write AC. Diverged toolchain, vendor-mode lockfile drift, in-flight refactors. | Any AC entry produced from a non-main code state | Branch-isolated work pollutes the shared cache. Manifests as “merge to main rebuilds everything” or “merge to main returns wrong results until invalidated.” |
| Cross-tenant write (E4-adjacent) | Tenant A’s worker writes an AC entry into Tenant B’s instance namespace because instance-name routing is advisory, not enforced. Covered in detail by E4/W4.2 IAM scopes; the threat surfaces here too because AC writer identity is the substrate E4 builds on. | Cross-tenant poison: Tenant A’s toolchain poisons Tenant B’s cache | E4/TIN-1448 cannot land if AC writer attestation does not at least carry a tenant claim that E4 can enforce on. Out of scope here for enforcement, in scope here for identity shape. |
The common factor across these threats: the writer’s identity is the load-bearing variable, and today there is nothing forcing a writer to prove identity before the AC layer accepts a write.
Trust Boundary Definition
The boundary is a closed set on one side and an “everything else” on the other. There is no third bucket.
Trusted writers — may write AC, must prove identity on every write:
- Pod identity: Kubernetes ServiceAccount gf-reapi-cell-worker in namespace gf-rbe.
- Image identity: running the digest-pinned worker image ghcr.io/tinyland-inc/gf-reapi-cell@sha256:<digest> (current authority: sha256:be2832171ac69cc9a2d012b3c789e8b765afb7cae0df8f7e9677dd6d8542dbc0, rotates via the publish workflow).
- Source identity: the action being cached must have come from a request that the cell traces back to a merged-main commit SHA. The audit-log row captures this; W2.1 enforces it at write time by refusing AC writes whose accompanying git_ref is not on refs/heads/main.
- Workload identity binding: the worker pod presents a projected ServiceAccount token (or SPIFFE SVID; see Mechanism Option C) on every AC write RPC; the REAPI cell validates the token’s issuer, subject (system:serviceaccount:gf-rbe:gf-reapi-cell-worker), and expiry before accepting the write.
Untrusted readers — may read AC, must not be granted write capability under any condition:
- developer laptops (any bazel build invocation from a workstation)
- PR CI lanes (GitHub Actions workflows running on PR branches)
- ARC runners outside the gf-rbe namespace (tinyland-nix, spoke runners)
- nightly / scheduled jobs that don’t carry a merged-main commit SHA
- spoke-canary / off-cluster burst capacity that runs PR-shaped work
- external consumers (the public-vendor handoff fixture path)
Out of scope here, but referenced:
- Cross-tenant authorization (E4/TIN-1448 / W4.2 IAM scopes). This doc specifies that the attestation token carries a tenant claim so E4 can enforce on it; it does not specify how E4 enforces.
- CAS write attestation. CAS writes are digest-verified by construction (the CAS rejects any blob whose content doesn’t match its declared digest). Attestation here is AC-only. See Open Questions for the explicit recommendation.
- Bazel-cache (bazel-remote on Honey) write paths. The existing bazel-cache bucket is a separate authority from gf-reapi-cell’s AC. Its write boundary lives in tofu/modules/bazel-cache/ and is governed by scripts/cache-attachment-contract.sh. That cache is not in this doc’s scope; if and when it absorbs RBE-shaped AC traffic, this design becomes the model.
Mechanism
Three candidate mechanisms. One recommendation, justified.
Option A: mTLS at the REAPI Layer
Workers present a TLS client certificate on every gRPC call. The REAPI cell’s gRPC server validates the certificate chain against an internal CA, then checks the certificate’s Subject Alternative Names (SANs) or Common Name against an allow-list of attested-worker identities. The CA either issues short-lived certs (cert-manager + intermediate CA, 15–60 minute lifetime) or long-lived certs with explicit revocation (CRL or OCSP).
| Dimension | Score |
|---|---|
| Operability in Honey k8s topology | Medium. cert-manager + an internal CA is standard, but issuing per-pod client certs at scale means cert-manager Certificate resources per worker pod, or a sidecar that fetches certs on startup. Adds operator surface. |
| Token rotation cost | High if long-lived (requires CRL/OCSP infrastructure); medium if short-lived (cert-manager handles rotation but pods need to re-read certs mid-flight or accept short downtime). |
| Audit log shape | Good. Cert serial number is a stable identity primitive; SAN carries pod identity. |
| Blast radius of compromise | Medium. A stolen client cert is valid until rotation or revocation; short-lived certs cap this at the lifetime window. |
| E4 IAM integration | Weak. mTLS identity is not natively a JWT claim shape; E4’s planned OIDC tenant model would require a translation layer. |
Option B: Signed JWT in gRPC Metadata
Workers receive short-lived JWTs from a trusted issuer (an in-cluster
issuer like dex / kubernetes-projected-tokens, or an external OIDC IdP).
On every AC write, the worker sends the JWT as a gRPC metadata header
(authorization: Bearer <jwt>). The REAPI cell validates the signature
against the issuer’s JWKS, checks claims (sub, aud, exp, iat),
and matches sub against an allow-list. JWT lifetime is short (15min–1h);
rotation is automatic via the projection mechanism.
| Dimension | Score |
|---|---|
| Operability | Good. Bazel’s --remote_header / --remote_cache_header flags pass arbitrary headers including bearer tokens. JWKS endpoints are well-trodden infrastructure. |
| Token rotation cost | Low. Short-lived JWTs need no revocation infrastructure; you wait for expiry. |
| Audit log shape | Good. JWT jti is a per-token unique identifier; sub carries workload identity. |
| Blast radius of compromise | Low. 15min lifetime means a leaked JWT is dangerous only until expiry. |
| E4 IAM integration | Strong. A JWT with a tenant claim is exactly the shape E4 needs for OIDC-scoped CAS namespaces (the same pattern as BuildBuddy API-key-scoped namespaces or EngFlow’s IAM scope model). |
Option C: Workload Identity (SPIFFE / k8s ServiceAccount Projected Token)
Workers prove identity via Kubernetes ServiceAccount projected tokens
(a built-in k8s feature; the kubelet projects a short-lived JWT into
each pod that is signed by the kube-apiserver and carries the
ServiceAccount as sub). The REAPI cell validates the token via the
TokenReview API or against the kube-apiserver’s OIDC discovery
endpoint, and checks sub against an allow-list like
system:serviceaccount:gf-rbe:gf-reapi-cell-worker. SPIFFE / SPIRE
is the production-grade form of this same pattern with cross-cluster
identity federation, but is overkill for a single Honey cluster today.
| Dimension | Score |
|---|---|
| Operability | Strong. ServiceAccount projected tokens are a k8s primitive; no extra issuer to operate. cert-manager not required. |
| Token rotation cost | Zero operator cost. The kubelet rotates projected tokens automatically (default 1h; configurable per serviceAccountToken volume). |
| Audit log shape | Good. sub carries the ServiceAccount; the bound token’s kubernetes.io claim carries pod identity (pod name and UID). |
| Blast radius of compromise | Low. Projected tokens are short-lived by construction and bound to a pod; exfiltrating one off-cluster is not very useful (audience is checked, expiry is short). |
| E4 IAM integration | Strong. Projected tokens are JWTs. A custom audience claim (--service-account-issuer + the cell’s audience identifier) is the natural place for the tenant claim that E4 enforces on. Aligns with E4’s planned OIDC tenant model without forcing a separate issuer. |
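The claim checks described for Option C can be sketched as a pure function over an already-verified token payload. This is a minimal sketch under stated assumptions: signature verification (TokenReview or JWKS) is assumed to have happened upstream, EXPECTED_ISSUER is a placeholder since the real issuer URL is cluster-specific, check_claims is a hypothetical name, and the wrong_issuer label is invented here for illustration. The audience value and the trusted subject are the ones this document names.

```python
import time
from typing import Optional

# Assumed for illustration; the real issuer URL depends on the cluster.
EXPECTED_ISSUER = "https://kubernetes.default.svc.cluster.local"
EXPECTED_AUDIENCE = "gf-reapi-cell.gf-rbe.svc"
TRUSTED_SUBJECTS = frozenset({"system:serviceaccount:gf-rbe:gf-reapi-cell-worker"})


def check_claims(claims: Optional[dict], now: Optional[float] = None) -> Optional[str]:
    """Return a reject reason, or None when every claim check passes.

    Assumes the token signature was already verified upstream.
    """
    if not claims:
        return "no_attestation"
    now = time.time() if now is None else now
    if claims.get("iss") != EXPECTED_ISSUER:
        return "wrong_issuer"  # hypothetical label, not part of the doc's enum
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    if EXPECTED_AUDIENCE not in audiences:
        return "wrong_audience"
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return "expired_token"
    if claims.get("sub") not in TRUSTED_SUBJECTS:
        return "untrusted_subject"
    return None


# Example payload with a fixed expiry so the checks are deterministic.
good = {
    "iss": EXPECTED_ISSUER,
    "aud": [EXPECTED_AUDIENCE],
    "exp": 2_000,
    "sub": "system:serviceaccount:gf-rbe:gf-reapi-cell-worker",
}
```

Returning a reason string instead of a boolean keeps the rejection assertable in the audit log without a second lookup.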
Recommendation: Option C (Workload Identity via k8s ServiceAccount Projected Tokens)
Pick C. Three reasons:
- Zero new operator surface. ServiceAccount projected tokens exist in every k8s cluster; the kubelet handles rotation; there is no new issuer to deploy, monitor, or fail. mTLS (Option A) requires running cert-manager + an internal CA; standalone JWT issuance (Option B) requires running an issuer (dex, keycloak, or hand-rolled) and its JWKS. The two-operator team should not absorb a new identity authority service for this.
- Native E4 alignment. Projected tokens are JWTs with a custom audience; adding a tenant claim is configuration, not new infrastructure. E4’s tenant model will validate the same token shape from spoke workers (with different tenant claims), so W2.1 and E4/W4.2 share substrate instead of fighting for it.
- Rotation is free. Short-lived projected tokens (default 1h, can be tightened to 15min via expirationSeconds on the projected volume) cap blast radius without needing a CRL or revocation flow. When attestation is wrong (it will be wrong once; see Nuke-Key Drill), the recovery path is “rotate the ServiceAccount and re-project,” not “operate a CRL.”
Option B is a strong second; if cross-cluster workload identity ever becomes necessary (e.g. an off-cluster worker writing AC), upgrade to SPIFFE SVIDs (which is Option C’s general form) rather than retreating to a hand-rolled JWT issuer. Option A is reasonable but loses on operator surface and E4 alignment.
Default .bazelrc Posture
The shipped .bazelrc defaults are the architectural defense against the
“developer laptop wrote an AC entry” threat. The rule is: default-off
for AC writes, opt-in by named config that only merged-main CI can supply.
Concrete shape, layered onto the repo .bazelrc:
# === AC Writer Attestation Posture (TIN-1462) ===
# Default: no AC writes from this Bazel invocation.
# A merged-main RBE worker overrides this via --config=ac-write-attested,
# which requires a workload-identity JWT injected by the cell, not by
# Bazel client flags. Developer laptops and PR CI cannot supply that
# token; the AC layer rejects writes without it. This is architectural,
# not policy-based.
# Default: read AC, never upload local results.
build --remote_upload_local_results=false
# Default endpoints are intentionally not set here. Repo wrappers pass
# --remote_cache from BAZEL_REMOTE_CACHE after the strict attachment
# contract validates the endpoint. Executor endpoints remain explicit
# proof/production inputs, not ambient .bazelrc defaults.
# Cache-readonly config: explicit "I am a reader" posture for dev /
# untrusted CI. Identical to default but explicit, so workflow YAMLs
# can be self-documenting.
build:cache-readonly --remote_upload_local_results=false
# PR CI lanes MUST use --config=cache-readonly. PR CI lacks attestation
# credentials by construction — the AC layer would reject a write anyway,
# but the explicit config catches misconfiguration at flag-parse time.
build:ci --config=cache-readonly
# Merged-main CI flips to write-enabled via attested credential injection
# from the REAPI cell side. The Bazel client does NOT carry a write token;
# the worker pod (running inside gf-rbe) does. There is no Bazel flag here
# that flips the writer bit — that's intentional. The writer bit is a
# property of the calling identity, not a property of any client config.
build:ci-merged-main --config=ci-cached
# ci-cached remains --remote_upload_local_results=false. Remote-executed
# actions may populate AC through the attested worker path; local ARC/dev
# execution does not upload local results.
Why this shape:
- --remote_upload_local_results=false is the default-off switch. Without it, any bazel build with --remote_cache= set will attempt to upload local action results. With it, the only writes that happen come from the executor side (the worker pod itself, running inside gf-rbe), which is the only surface where attestation can be enforced honestly.
- The --remote_cache= URL is the same for readers and writers. Defense is not at the URL; defense is at the caller identity check the server performs on every write. Setting different URLs for readers vs writers would be defense-in-depth but is not the load-bearing gate.
- PR CI inheriting dev posture (cache-readonly) is intentional: PR CI is less trusted than a clean dev machine (it executes attacker-controlled code from PR branches). Same posture, same constraints.
- The ci-merged-main config does not carry any “I am allowed to write” flag. The writer bit is set by the calling identity, server-side. Any client claiming to be ci-merged-main while running outside gf-rbe is rejected at the AC layer.
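A posture check over the .bazelrc text could look like the sketch below. It is an illustration of the kind of assertion a contract script might make, not the contents of scripts/cache-attachment-contract.sh; the function names and the set of true-spellings are assumptions made for this example.

```python
import shlex

FLAG = "--remote_upload_local_results"
# Spellings treated as "true" in this sketch; a bare flag also means true.
TRUE_VALUES = {"true", "1", "yes", "t", "y"}


def upload_enabling_lines(bazelrc_text: str) -> list[str]:
    """Return every non-comment line that would turn local-result upload on."""
    offenders = []
    for raw in bazelrc_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        for token in shlex.split(line, comments=True):
            enables = token == FLAG or (
                token.startswith(FLAG + "=")
                and token.split("=", 1)[1].lower() in TRUE_VALUES
            )
            if enables:
                offenders.append(line)
                break
    return offenders


def has_default_off(bazelrc_text: str) -> bool:
    """True when an unconditional `build` line pins the flag to false."""
    for raw in bazelrc_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        tokens = shlex.split(line, comments=True)
        if tokens[:1] == ["build"] and FLAG + "=false" in tokens[1:]:
            return True
    return False


SAMPLE = """\
# Default: read AC, never upload local results.
build --remote_upload_local_results=false
build:cache-readonly --remote_upload_local_results=false
build:ci --config=cache-readonly
"""

posture_ok = has_default_off(SAMPLE) and not upload_enabling_lines(SAMPLE)
```

A check like this only guards the client-side default; the server-side identity check remains the gate that actually refuses a write.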
CI Lane Matrix
Which lane can write AC, under which conditions, with which attestation, and what happens when it fails.
| Lane | AC posture | Attestation source | Failure mode if attestation missing |
|---|---|---|---|
| Developer laptop | read-only | none (cannot attest) | AC write attempts (if any) return 403; build proceeds with local execution. Default .bazelrc prevents the attempt. |
| PR CI (pull_request workflows) | read-only | none (PR runners are outside gf-rbe) | AC write attempts return 403; CI fails loudly if --remote_upload_local_results=true is set anywhere (this is the alarm: PR CI should never be configured to write). |
| Merged-main CI (push to main workflows) | write | ServiceAccount projected token from gf-rbe:gf-reapi-cell-worker pod identity, validated by REAPI cell. The token is created inside gf-rbe by the worker pod that runs the action, not by the GitHub Actions runner that triggered the build. | If the worker pod’s projected token is missing/expired/wrong audience, AC write returns 403; the action’s execution result still returns to Bazel (the build doesn’t fail), but the cache doesn’t fill. Sustained breach pages on gf_reapi_ac_write_rejected_total. |
| Nightly / scheduled (e.g. vendor-mode lane) | read-only by default; write only if running on a merged-main commit SHA against gf-rbe workers | same as merged-main when applicable | same as merged-main when applicable; otherwise 403 on write |
| Dev attachment field-test | read-only | none | same as developer laptop |
| Spoke runners (tinyland-inc spokes) | read-only (against the spoke’s CAS/AC; cross-spoke AC traffic is E4 territory) | spoke-side workload identity (E4-shaped, not in this doc’s scope) | 403 on cross-spoke AC writes; logged via E4 audit infrastructure |
| Spoke-canary | read-only | none | 403 on writes; canary tests should never write AC by definition |
| External consumer (public-vendor handoff) | read-only against a public-readable CAS slice only; no AC access to gf-rbe cell | n/a | n/a |
The architectural property the matrix encodes: only one row writes. Merged-main CI is the single AC writer. Every other lane is read-only. That single-writer property is what makes the poison surface containable.
Audit Log Shape
Every AC write — accepted or rejected — must capture enough context that W2.3 (TIN-1464) can build the AC audit log surface on top of it without re-instrumenting. The minimum row schema:
| Field | Source | Purpose |
|---|---|---|
| timestamp | server clock | order of events; budget windowing |
| worker_image_digest | resolved from the writer pod’s container image SHA at write time | “which toolchain wrote this?”; answers stale-toolchain questions |
| platform_digest | the REAPI Action.platform digest from the action being cached | binds AC entry to a worker platform contract; W2.2 uses this on read |
| instance_name | REAPI-native wire field (set by client; routed by gf-reapi-cell per instance-name-routing-design.md) | canonical per-tenant audit key; matches the routing doc’s audit JSON field |
| tenant | tenant claim from the workload-identity JWT (E4 alignment); a derived alias of instance_name, kept as a separate field so IAM authority is auditable independently. Normally equals instance_name; divergence is a defect | per-tenant audit + future E4 enforcement; the IAM-authoritative tenant identity |
| git_ref | passed by the CI lane via Bazel build metadata (e.g. --build_metadata=COMMIT_SHA=...); must resolve to a merged-main commit | proves the action came from a merged-main code state, not a PR branch |
| action_digest | the REAPI action digest (key under which the AC entry is stored) | the key being written; supports nuke-key drill (W2.4) |
| attestation_proof | the JWT jti (token unique identifier) and sub from the validated workload-identity token | which specific credential authorized this write; supports revocation forensics |
| outcome | accepted \| rejected and the gRPC status code | distinguishes “we wrote it” from “we refused to write it”; both rows are kept |
| reject_reason | enum: no_attestation \| wrong_audience \| expired_token \| not_main_ref \| unknown_tenant \| wrong_image_digest | makes the chaos test (W2.5) assertable |
This is the schema. W2.3 implements it; W2.5 asserts on it; W2.4 queries it when invalidating poisoned entries.
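One way to hold the row shape in code is a frozen dataclass serialized as one JSON object per line. This is a sketch, not the W2.3 implementation: the class name, the split of outcome into outcome plus grpc_status, and the JSON encoding are assumptions, and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

OUTCOMES = ("accepted", "rejected")


@dataclass(frozen=True)
class AcWriteAuditRow:
    timestamp: str
    worker_image_digest: str
    platform_digest: str
    instance_name: str
    tenant: str
    git_ref: str
    action_digest: str
    attestation_proof: dict  # {"jti": ..., "sub": ...} from the validated token
    outcome: str
    grpc_status: int  # 0 = OK, 7 = PERMISSION_DENIED
    reject_reason: Optional[str] = None

    def __post_init__(self) -> None:
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if (self.outcome == "rejected") != (self.reject_reason is not None):
            raise ValueError("reject_reason must be set exactly when outcome is rejected")

    def tenant_diverges(self) -> bool:
        """The schema treats tenant != instance_name as a defect worth flagging."""
        return self.tenant != self.instance_name

    def to_jsonl(self) -> str:
        """Serialize as a single JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values only; none of these identify a real write.
row = AcWriteAuditRow(
    timestamp="2025-01-01T00:00:00Z",
    worker_image_digest="sha256:<worker-image-digest>",
    platform_digest="sha256:<platform-digest>",
    instance_name="example-tenant",
    tenant="example-tenant",
    git_ref="refs/heads/feature/example",
    action_digest="sha256:<action-digest>",
    attestation_proof={"jti": "<jti>", "sub": "system:serviceaccount:gf-rbe:gf-reapi-cell-worker"},
    outcome="rejected",
    grpc_status=7,
    reject_reason="not_main_ref",
)
line = row.to_jsonl()
```

Validating the outcome and reject_reason pairing at construction time keeps malformed rows out of the log rather than leaving them for the query side to handle.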
Failure Modes the Design Must Defend Against
| Failure | Current exposure | Design defense | Residual risk |
|---|---|---|---|
| Non-attested write attempted (developer, PR CI, external) | High: .bazelrc could be edited; --remote_upload_local_results=true could be passed | AC write requires valid workload-identity JWT; absence returns 403; audit log records reject_reason=no_attestation | Zero, conditional on the cell correctly validating tokens on every write. W2.5 chaos test verifies. |
| Stale worker image with diverged toolchain | Medium: a long-running gf-reapi-cell pod could outlive a worker image rotation | Worker pod identity carries the running image digest into the audit row; W2.2 (read-side validation) refuses AC entries whose worker_image_digest is no longer in the allow-list. AC write is allowed with a stale image, but reads of that entry will be rejected once the image is removed from the allow-list. | Window between image rotation and allow-list update. Mitigation: image rotation is gated on allow-list update (single commit). |
| Compromised credential | Currently unbounded (any AC write credential leak is permanent until manually revoked) | Workload-identity tokens are short-lived (≤1h) and bound to a specific pod. Revocation = delete the pod; the next token issuance is automatic. No CRL needed. | 1h window from leak to natural expiry. Acceptable; can be tightened to 15min via expirationSeconds. |
| PR CI lane accidentally configured to write | High: historically the most common misconfiguration in shared-cache designs | Architectural impossibility, not a policy block: PR CI runs outside gf-rbe, cannot present a valid gf-reapi-cell-worker ServiceAccount token, and the AC layer rejects the write. The .bazelrc default of --remote_upload_local_results=false is a belt-and-suspenders second line. | Zero, conditional on the workload-identity issuer not being reachable from the PR CI execution environment. (k8s ServiceAccount projection is pod-local; a PR runner cannot fabricate it.) |
| Cross-platform AC entry reuse | Out of scope here; covered by W2.2 (TIN-1463) platform-digest validation on read | n/a (this doc, by design) | n/a (W2.2’s residual risk) |
| Replay of a previously valid token | Low: JWTs are not natively single-use | exp is checked; window is ≤1h. A captured token can be replayed only within its remaining lifetime, and only against the same audience. | 1h window for replay. Acceptable. Optional: bind tokens to the source pod by checking the pod UID carried in the token’s kubernetes.io claim. |
| Token-issuer compromise (kube-apiserver signing key leak) | Catastrophic but cluster-wide; out of scope for AC specifically | If the kube-apiserver is owned, AC integrity is the least of the problems. AC trust inherits from k8s cluster trust. | Inherits cluster-trust posture. Acceptable for in-cluster identity model. |
Chaos Test Preview (W2.5 / TIN-1466)
The test that proves the design works: stand up a probe identity outside
the gf-reapi-cell-worker ServiceAccount — same namespace, same network, same
image even — and have it attempt an AC write against gf-reapi-cell with a
valid JWT carrying actioncache:Write for the tenant but a sub outside
GF_REAPI_AC_WRITE_TRUSTED_SUBJECTS. Expected outcome: the write returns
PermissionDenied (HTTP 403 equivalent), the AC entry is not written, the AC
audit log records one outcome=rejected row with
reject_reason=untrusted_subject, and the gf_reapi_ac_write_rejected_total
Prom counter increments. No-token, expired-token, and wrong-audience probes
are authentication failures and are rejected before the AC-attestation audit
path; they are authz chaos siblings, not this ticket’s non-attested-writer
proof. Run the non-attested writer check as a nightly job; the day it passes
silently is the day W2.5 closes. The day it stops passing is a hot incident.
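The expected outcomes of the probe can be written down as assertions against a toy model of the write path. The sketch below is not tests/gf_reapi_cell_ac_attestation_chaos.sh and does not talk to a real cell: ToyCell is a hypothetical in-memory stand-in, the chaos-probe ServiceAccount name is invented, and rejected_total only mimics the gf_reapi_ac_write_rejected_total counter.

```python
from dataclasses import dataclass, field

OK = 0
PERMISSION_DENIED = 7  # gRPC status code named in this design


@dataclass
class ToyCell:
    """In-memory stand-in for the AC write path; not the real gf-reapi-cell."""

    trusted_subjects: frozenset
    entries: dict = field(default_factory=dict)
    audit_rows: list = field(default_factory=list)
    rejected_total: int = 0  # mimics gf_reapi_ac_write_rejected_total

    def update_action_result(self, claims: dict, action_digest: str, result: dict) -> int:
        trusted = claims.get("sub") in self.trusted_subjects
        row = {
            "action_digest": action_digest,
            "sub": claims.get("sub"),
            "outcome": "accepted" if trusted else "rejected",
        }
        if trusted:
            self.entries[action_digest] = result
        else:
            row["reject_reason"] = "untrusted_subject"
            self.rejected_total += 1
        self.audit_rows.append(row)
        return OK if trusted else PERMISSION_DENIED


def run_probe(cell: ToyCell) -> None:
    """Assert the four outcomes expected from a non-attested writer."""
    before = cell.rejected_total
    probe = {
        "sub": "system:serviceaccount:gf-rbe:chaos-probe",  # hypothetical identity
        "scopes": ["actioncache:Write"],  # scope is valid; sub is not allow-listed
    }
    status = cell.update_action_result(probe, "probe-action-digest", {"exit_code": 0})
    assert status == PERMISSION_DENIED
    assert "probe-action-digest" not in cell.entries
    rejected = [r for r in cell.audit_rows if r["outcome"] == "rejected"]
    assert len(rejected) == 1
    assert rejected[0]["reject_reason"] == "untrusted_subject"
    assert cell.rejected_total == before + 1


cell = ToyCell(
    trusted_subjects=frozenset({"system:serviceaccount:gf-rbe:gf-reapi-cell-worker"})
)
run_probe(cell)
```

The probe deliberately carries a valid write scope so that the only thing standing between it and the cache is the subject allow-list.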
Nuke-Key Drill Integration (W2.4 / TIN-1465)
This attestation will be wrong once. A trusted worker will write an AC
entry that turns out to be poisoned — flaky test result mis-cached, a
genrule that captured timestamp output, a worker image with a regression
that wasn’t caught before publish. When that happens, the operator needs
to invalidate the specific AC entry without nuking the whole cache.
W2.4 owns that drill: given an action_digest and tenant (from the
audit log row this design produces), the nuke-key drill removes the
entry from the AC, refuses to re-cache it for a quarantine window, and
emits an event that subsequent re-executions (by an attested writer)
can re-fill the slot honestly. Cross-link:
TIN-1465. The audit log
fields in this doc are the input to W2.4 — without
action_digest and tenant recorded per write, W2.4 cannot operate
surgically.
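The remove, quarantine, and re-fill sequence can be sketched with an in-memory cache keyed by (tenant, action_digest). This is a toy model under stated assumptions, not the W2.4 drill: QuarantiningAC, its method names, the event shape, and the explicit now parameter are all hypothetical, and attestation of the re-filling writer is assumed to be checked elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class QuarantiningAC:
    """Toy AC that refuses to re-cache a nuked key until a window elapses."""

    quarantine_seconds: float
    entries: dict = field(default_factory=dict)  # (tenant, action_digest) -> result
    quarantined_until: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def put(self, tenant: str, action_digest: str, result: dict, now: float) -> bool:
        """Store a result unless the key is still quarantined."""
        key = (tenant, action_digest)
        if now < self.quarantined_until.get(key, float("-inf")):
            return False
        self.entries[key] = result
        return True

    def nuke(self, tenant: str, action_digest: str, now: float) -> bool:
        """Remove one entry, start its quarantine window, and emit an event."""
        key = (tenant, action_digest)
        removed = self.entries.pop(key, None) is not None
        self.quarantined_until[key] = now + self.quarantine_seconds
        self.events.append(
            {
                "event": "ac_entry_nuked",
                "tenant": tenant,
                "action_digest": action_digest,
                "removed": removed,
            }
        )
        return removed


ac = QuarantiningAC(quarantine_seconds=600)
ac.put("example-tenant", "digest-a", {"exit_code": 0}, now=0)
ac.put("example-tenant", "digest-b", {"exit_code": 0}, now=0)
removed = ac.nuke("example-tenant", "digest-a", now=100)
blocked = ac.put("example-tenant", "digest-a", {"exit_code": 0}, now=200)
refilled = ac.put("example-tenant", "digest-a", {"exit_code": 0}, now=701)
```

Only the named key is touched: digest-b stays cached throughout, which is the surgical property the drill depends on.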
Rollout Plan
Sequential. Each step is a separately reviewable change.
1. Land this design doc (this PR). Recommendation locked: Option C (k8s ServiceAccount projected tokens). Open questions itemized below.
2. Stand up the attestation issuer. For Option C, this is configuration, not a new service: enable ServiceAccount token projection on the gf-reapi-cell-worker pod template, with audience claim gf-reapi-cell.gf-rbe.svc and expirationSeconds=3600 (tighten later). Reflect in tofu/modules/ (probably a new gf-reapi-cell module sibling to bazel-cache).
3. Configure REAPI cell to require attestation on AC write. Add an AC-write interceptor to the gf-reapi-cell Go service that: (a) extracts the authorization header, (b) validates the token via TokenReview against the kube-apiserver, (c) checks sub against an allow-list (initially: just system:serviceaccount:gf-rbe:gf-reapi-cell-worker), (d) records the audit-log row, (e) returns PermissionDenied if any check fails. AC reads remain unauthenticated for now (cross-tenant read isolation is E4 territory).
4. Roll out the new .bazelrc posture. Add the --remote_upload_local_results=false default and the cache-readonly / ci-merged-main configs. Update scripts/cache-attachment-contract.sh to assert the default. Update CI workflows to pass the right config per lane (the matrix above).
5. Add the chaos test (W2.5) to nightly. New script, tests/gf_reapi_cell_ac_attestation_chaos.sh, sibling to tests/gf_reapi_cell_publish_contract.sh. Wired into GF REAPI AC Attestation Chaos on nightly schedule and AC-path changes.
6. Add audit-log requirement to AC write path. The first W2.3 implementation writes local JSONL rows at ${GF_REAPI_STORE_ROOT}/audit/ac-writes.jsonl by default and exposes an in-process tenant query primitive. Accepted AC writes fail closed if the audit append fails, so the cell does not create unaudited AC entries. The remaining W2.3 work is the operator Resource Usage API, 30-day retention policy, and dashboard/query surface.
7. Document rollback procedure. If the attestation gate breaks (false-negative storm: legit writers being rejected), the rollback is to flip a single feature flag on the gf-reapi-cell Deployment (AC_WRITE_ATTESTATION_ENFORCED=false), which puts the cell into warn-but-allow mode. Audit log still records rejected reasons; writes still proceed. This is not a graceful degradation in the threat model sense; it’s an emergency lever. Document it explicitly so an on-call operator at 3am doesn’t have to invent it.
Open Questions
These do not block landing this doc. They block closing E2/TIN-1446.
1. Cert authority scope (if Option C is ever replaced by Option A). Option C avoids a CA entirely. If a future requirement (cross-cluster workload identity, off-cluster workers) forces us to Option A or full SPIFFE, do we use the existing tinyland-internal CA or stand up a new CA scoped to gf-rbe? Recommendation pending: a new scoped CA, on the principle that AC-write authority should not share trust roots with unrelated tinyland services.
2. Token lifetime: 15min / 1h / build-duration? Default ServiceAccount projected token lifetime is 1h. Tighter is safer (smaller replay window) but increases token-refresh frequency. Build duration can exceed 1h for some target classes (docs-site:build cold can be close). Recommendation pending: 1h initial, tighten to 15min after W2.5 chaos test is green for 14 consecutive days. Note that a token issued mid-build remains valid for actions completed before expiry; build-duration tokens are not necessary.
3. Revocation strategy. For Option C: revocation = pod deletion + ServiceAccount rotation. No CRL needed. For Option A/B: would need either short-lived (preferred) or CRL/OCSP. Locked here on Option C = no infrastructure. Recommendation: stay on Option C.
4. How does this integrate with E4’s OIDC tenant claim (TIN-1473)? E4 plans an OIDC-shaped tenant claim. This design’s tenant audit field is the E4 substrate. Open: whether the workload-identity JWT itself carries the tenant claim (E4 directly enforces on the same token) or whether the cell looks up tenant from the calling pod’s namespace (the cell enriches the audit log, E4 enforces separately). Recommendation pending: same JWT, tenant claim added by the projection audience; this keeps W2.1 and E4/W4.2 on the same identity substrate.
5. Does attestation cover CAS writes too, or just AC? Recommendation: AC only. CAS writes are content-addressed and digest-verified on store: a malicious CAS writer cannot poison anyone because the blob digest doesn’t match anything the cache gives out. The damage surface for unauthenticated CAS writes is denial-of-service (filling the CAS with junk that gets evicted under LRU pressure) and information disclosure (an attacker who can read CAS can pull blobs by guessing digests, though digests are hard to guess). Both are real but neither is the poison surface AC writes are. AC-only attestation is the right scope for W2.1. If DoS or read-side info disclosure on CAS becomes the gating concern, that is a separate workstream, likely under E4’s tenant-isolated CAS namespace.
6. Allow-list shape. Initially the allow-list is a single ServiceAccount string. Should it be a flat list, a regex, or a structured policy (CEL, OPA)? Recommendation pending: flat list until there is more than one trusted writer; pre-committing to a policy language for one entry is over-engineering. Revisit when E6 introduces additional worker classes (Chromium worker, KVM worker, GPU worker) per the existing gf-reapi-cell.md non-goals.
7. Audit-log durability requirement. This doc says the log row must exist on every write attempt. It does not say the row must survive a pod restart. W2.3 (TIN-1464) owns the durable-store decision (Loki vs ClickHouse vs sqlite). Open here: should the AC write block on the audit row being durably stored, or fire-and-forget? Recommendation: fire-and-forget to the log stream; W2.3 owns the durable-side guarantee.
References
External:
- REAPI v2 spec: UpdateActionResult is the AC write RPC; permission errors return PermissionDenied.
- Bazel --remote_header / --remote_cache_header: how the client passes the attestation token.
- Bazel --remote_upload_local_results=false: the default-off switch.
- Kubernetes ServiceAccount token volume projection: the Option C primitive.
- TokenReview API: server-side validation primitive.
- SPIFFE / SPIRE: the cross-cluster generalization of Option C, listed for the cross-cluster future case in Open Question 1.
- EngFlow multi-tenancy and remote execution security patterns: inspiration only; IAM scope pattern this design cribs from for E4 alignment.
- BuildBuddy API-key model: comparable trust-boundary pattern in a peer product; we are not adopting BuildBuddy but the pattern is informative.
Repo-local:
- docs/build-system/slo.md: voice + structure exemplar; sibling doc that quantifies what “AC working” looks like.
- docs/build-system/cas-primitives.md: voice + structure exemplar; sibling doc on the in-house CAS primitives build-plan (replaces the deleted cas-backend-decision.md).
- docs/build-system/gf-reapi-cell.md: current REAPI cell shape; worker image digest and namespace authority.
- docs/build-system/remote-execution.md: surrounding remote-execution context.
- .bazelrc: current default posture; this design requires the --remote_upload_local_results=false default and the cache-readonly / ci-merged-main configs.
- scripts/cache-attachment-contract.sh: adjacent guard that already polices cache endpoint shape; this design proposes extending its strict mode with the AC writer posture.
- tofu/modules/bazel-cache/: adjacent precedent for a cache writer authority.
Linear:
- Parent epic: TIN-1446 (E2 AC authority).
- This workstream: TIN-1462 (W2.1 AC writer attestation).
- Sibling workstreams: TIN-1463 (W2.2 platform-digest read-side validation), TIN-1464 (W2.3 AC audit log durable store), TIN-1465 (W2.4 nuke-key drill), TIN-1466 (W2.5 chaos test).
- Related: TIN-1445 (E1 CAS authority — the substrate AC sits on); TIN-1448 (E4 tenant model — consumes this design’s tenant claim); TIN-1450 (E6 target-class breadth — gated by E2); TIN-1473 (E4 OIDC tenant claim — concrete tenant-claim shape).