Action Cache Writer Attestation Design

Decision summary

  • Status: Design landed on main; first feature-flagged cell primitive is in code. TIN-1462 — W2.1 under E2 / TIN-1446.
  • Rule: Only an attested merged-main RBE worker running a pinned worker image digest in the gf-rbe namespace may write to the action cache. Everyone else reads only.
  • Trusted writers: merged-main RBE worker pods, identified by a Kubernetes workload identity binding (recommended: SPIFFE-shaped k8s ServiceAccount projected token), allow-listed at the REAPI cell.
  • Untrusted (read-only): developer laptops, PR CI lanes, self-hosted ARC runners outside gf-rbe, any caller without a valid attestation token, spoke / off-cluster / external consumers.
  • Failure mode if attestation is wrong or absent: AC write returns PermissionDenied (gRPC 7) / HTTP 403; the rejected attempt is recorded in the AC audit log; a Prom alert fires on sustained nonzero gf_reapi_ac_write_rejected_total. E6/TIN-1450 target-class breadth acceleration is blocked until W2.1 + W2.3 are green.

The Problem

The action cache is input-addressed but trust-based. CAS entries are content-addressed and digest-verified on read: a corrupt CAS blob fails to match its own digest and is rejected. AC entries are different: the AC key is a hash of the action’s inputs (command, platform, input root digest), but the AC value is a build.bazel.remote.execution.v2.ActionResult message that asserts what running that action produced (output digests, exit code, stdout/stderr). The cache trusts whoever wrote the value to have actually executed it faithfully on a matching platform.
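The asymmetry can be illustrated with a minimal Python sketch. The two in-memory stores and the function names `cas_read` and `ac_read` are hypothetical stand-ins, not the cell's real storage API; the point is only that a CAS read can re-check the digest while an AC read has nothing to check the value against.

```python
import hashlib


def cas_read(store, digest):
    """Return a CAS blob only if its content hashes to the requested digest."""
    blob = store[digest]
    if hashlib.sha256(blob).hexdigest() != digest:
        raise ValueError("CAS blob does not match its digest")
    return blob


def ac_read(store, action_digest):
    """Return the stored ActionResult as-is: nothing ties the value to the key."""
    return store[action_digest]


# A tampered CAS blob is caught on read.
good_blob = b"compiled object file"
good_digest = hashlib.sha256(good_blob).hexdigest()
cas = {good_digest: b"tampered bytes"}
try:
    cas_read(cas, good_digest)
    cas_detected = False
except ValueError:
    cas_detected = True

# A forged AC value is handed back without complaint.
action_digest = hashlib.sha256(b"command+platform+input-root").hexdigest()
ac = {action_digest: {"exit_code": 0, "output_digest": "attacker-chosen"}}
forged = ac_read(ac, action_digest)
```

The reader of `forged` has no way to tell it apart from an honest result short of re-executing the action.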

There is no cryptographic proof that an AC entry is correct. An AC writer says “I ran this action, here is the result.” If the writer is honest and the platform is what it claims, downstream readers get a real speedup. If the writer is dishonest — wrong toolchain, tampered binary, malicious return value, a bazel build from a developer laptop whose /usr/bin/cc is a homebrew shim — every downstream reader that hits that AC entry inherits the lie.

This is a one-way poison door:

  • One bad AC write across one merged PR can poison the dependency closure for every downstream tenant.
  • Once the entry lands, the cache hands it out to every subsequent reader until the entry is explicitly invalidated.
  • Cache invalidation in an input-addressed-key world is hard: the only way to “remove” a poisoned entry is to evict it and ensure the next writer is trusted, because the key (input digest) hasn’t changed.
  • Readers cannot detect poisoning by reading; they would have to re-execute the action and compare, which defeats the cache.

The defense is not “make AC writes safer.” The defense is to restrict who can write at all: only attested, merged-main RBE workers running a pinned worker image, in the gf-rbe namespace, with a cryptographic identity that the AC layer checks before accepting a write.

E2/W2.1 is the most load-bearing safety gate in the RBE Production Readiness initiative. It must land before E6 (TIN-1450) target-class breadth acceleration. The cost of getting W2.1 wrong is irreversible cache poisoning across the multi-tenant substrate; the cost of stalling on it is a permanent E6 freeze. Pick a mechanism, ship it, iterate.

The Threat Model

Who could write AC entries today without attestation, what they could poison, and what the consequence is.

| Writer | Why this could happen today | What they could poison | Consequence |
| --- | --- | --- | --- |
| Developer laptop running bazel build | A future .bazelrc edit, copied legacy config, or user override silently flips --remote_upload_local_results=true; a developer hand-passes --remote_executor= to the cell. Toolchain on a Mac laptop is not the hermetic gf-reapi-cell worker image. | CppCompile / JsRunBinary / GenRule AC entries with results produced under macOS/homebrew/non-hermetic glibc | Every CI run and every other developer hitting the same AC key inherits a result built by a wrong toolchain. Detection is “things rebuild weirdly” or “tests pass locally fail in CI.” Recovery: invalidate the entry (nuke-key drill, W2.4). |
| PR CI lane (untrusted code) | A PR introduces a malicious BUILD.bazel or genrule that, when executed, writes attacker-controlled bytes to an output file referenced by a stable AC key. If the PR CI lane can write AC, the attacker can pre-poison main. | Any AC entry whose key is reachable from PR-introduced action graph | Cross-PR poison: PR #X’s malicious genrule writes an AC entry, PR #Y (innocent) gets the poisoned entry on its next CI run. Supply-chain attack vector. This is the canonical reason PR CI must be read-only. |
| Self-hosted runner with stale toolchain | An ARC runner pod that hasn’t been rotated in 30 days has a nix-store PVC with old derivations; if it’s allowed to write AC, it writes results produced by stale toolchain. | Any mnemonic whose action depends on the stale derivation | Slow drift: cache becomes inconsistent with current main. Manifests as “main rebuilds clean, incremental CI is wrong.” Hardest to detect because the staleness window is gradual. |
| Compromised credential / leaked AC write token | A long-lived bearer token / TLS cert leaks via .envrc, GitHub Actions secret exfiltration, or a developer’s ~/.bazelrc.user. Attacker writes AC entries from anywhere with internet access. | Anything the credential’s scope allows | Full poison authority for the lifetime of the credential. Detection requires audit log review (W2.3). Recovery: revoke the credential, invalidate every AC entry written under it (this is why audit logs need writer identity, not just timestamps). |
| Non-main branch CI lane | A feature/* branch CI lane configured (or misconfigured) to write AC. Diverged toolchain, vendor-mode lockfile drift, in-flight refactors. | Any AC entry produced from a non-main code state | Branch-isolated work pollutes the shared cache. Manifests as “merge to main rebuilds everything” or “merge to main returns wrong results until invalidated.” |
| Cross-tenant write (E4-adjacent) | Tenant A’s worker writes an AC entry into Tenant B’s instance namespace because instance-name routing is advisory, not enforced. Covered in detail by E4/W4.2 IAM scopes; the threat surfaces here too because AC writer identity is the substrate E4 builds on. | Cross-tenant poison: Tenant A’s toolchain poisons Tenant B’s cache | E4/TIN-1448 cannot land if AC writer attestation does not at least carry a tenant claim that E4 can enforce on. Out of scope here for enforcement, in scope here for identity shape. |

The common factor across these threats: the writer’s identity is the load-bearing variable, and today there is nothing forcing a writer to prove identity before the AC layer accepts a write.

Trust Boundary Definition

The boundary is a closed set on one side and an “everything else” on the other. There is no third bucket.

Trusted writers — may write AC, must prove identity on every write:

  • Pod identity: Kubernetes ServiceAccount gf-reapi-cell-worker in namespace gf-rbe.
  • Image identity: running the digest-pinned worker image ghcr.io/tinyland-inc/gf-reapi-cell@sha256:<digest> (current authority: sha256:be2832171ac69cc9a2d012b3c789e8b765afb7cae0df8f7e9677dd6d8542dbc0, rotates via the publish workflow).
  • Source identity: the action being cached must have come from a request that the cell traces back to a merged-main commit SHA. The audit-log row captures this; W2.1 enforces it at write time by refusing AC writes whose accompanying git_ref is not on refs/heads/main.
  • Workload identity binding: the worker pod presents a projected ServiceAccount token (or SPIFFE SVID — see Mechanism Option C) on every AC write RPC; the REAPI cell validates the token’s issuer, subject (system:serviceaccount:gf-rbe:gf-reapi-cell-worker), and expiry before accepting the write.
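The per-write identity check described in the last bullet could look roughly like the Python sketch below. It assumes the token's signature has already been verified and only the decoded claims are passed in. The subject and audience strings come from this design; the issuer URL, the function name `check_claims`, and the `wrong_issuer` reason are illustrative assumptions.

```python
import time

# Subject and audience are taken from this design; the issuer URL is a placeholder.
TRUSTED_SUBJECT = "system:serviceaccount:gf-rbe:gf-reapi-cell-worker"
EXPECTED_AUDIENCE = "gf-reapi-cell.gf-rbe.svc"
EXPECTED_ISSUER = "https://kubernetes.default.svc.cluster.local"  # assumption


def check_claims(claims, now=None):
    """Return None when the claims pass, otherwise a reject-reason string."""
    if not claims:
        return "no_attestation"
    now = time.time() if now is None else now
    if claims.get("iss") != EXPECTED_ISSUER:
        return "wrong_issuer"  # hypothetical reason, not in this doc's enum
    audiences = claims.get("aud", [])
    if isinstance(audiences, str):  # JWT "aud" may be a string or a list
        audiences = [audiences]
    if EXPECTED_AUDIENCE not in audiences:
        return "wrong_audience"
    if claims.get("exp", 0) <= now:
        return "expired_token"
    if claims.get("sub") != TRUSTED_SUBJECT:
        return "untrusted_subject"
    return None
```

Returning a reason string instead of a boolean keeps the result directly usable as the audit row's reject_reason.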

Untrusted readers — may read AC, must not be granted write capability under any condition:

  • developer laptops (any bazel build invocation from a workstation)
  • PR CI lanes (GitHub Actions workflows running on PR branches)
  • ARC runners outside the gf-rbe namespace (tinyland-nix, spoke runners)
  • nightly / scheduled jobs that don’t carry a merged-main commit SHA
  • spoke-canary / off-cluster burst capacity that runs PR-shaped work
  • external consumers (the public-vendor handoff fixture path)

Out of scope here, but referenced:

  • Cross-tenant authorization (E4/TIN-1448 / W4.2 IAM scopes). This doc specifies that the attestation token carries a tenant claim so E4 can enforce on it; it does not specify how E4 enforces.
  • CAS write attestation. CAS writes are digest-verified by construction (the CAS rejects any blob whose content doesn’t match its declared digest). Attestation here is AC-only. See Open Questions for the explicit recommendation.
  • Bazel-cache (bazel-remote on Honey) write paths. The existing bazel-cache bucket is a separate authority from gf-reapi-cell’s AC. Its write boundary lives in tofu/modules/bazel-cache/ and is governed by scripts/cache-attachment-contract.sh. That cache is not in this doc’s scope; if and when it absorbs RBE-shaped AC traffic, this design becomes the model.

Mechanism

Three candidate mechanisms. One recommendation, justified.

Option A: mTLS at the REAPI Layer

Workers present a TLS client certificate on every gRPC call. The REAPI cell’s gRPC server validates the certificate chain against an internal CA, then checks the certificate’s Subject Alternative Names (SANs) or Common Name against an allow-list of attested-worker identities. The CA either issues short-lived certs (cert-manager + intermediate CA, 15–60 minute lifetime) or long-lived certs with explicit revocation (CRL or OCSP).

| Dimension | Score |
| --- | --- |
| Operability in Honey k8s topology | Medium. cert-manager + an internal CA is standard, but issuing per-pod client certs at scale means cert-manager Certificate resources per worker pod, or a sidecar that fetches certs on startup. Adds operator surface. |
| Token rotation cost | High if long-lived (requires CRL/OCSP infrastructure); medium if short-lived (cert-manager handles rotation but pods need to re-read certs mid-flight or accept short downtime). |
| Audit log shape | Good. Cert serial number is a stable identity primitive; SAN carries pod identity. |
| Blast radius of compromise | Medium. A stolen client cert is valid until rotation or revocation; short-lived certs cap this at the lifetime window. |
| E4 IAM integration | Weak. mTLS identity is not natively a JWT claim shape; E4’s planned OIDC tenant model would require a translation layer. |

Option B: Signed JWT in gRPC Metadata

Workers receive short-lived JWTs from a trusted issuer (an in-cluster issuer like dex / kubernetes-projected-tokens, or an external OIDC IdP). On every AC write, the worker sends the JWT as a gRPC metadata header (authorization: Bearer <jwt>). The REAPI cell validates the signature against the issuer’s JWKS, checks claims (sub, aud, exp, iat), and matches sub against an allow-list. JWT lifetime is short (15min–1h); rotation is automatic via the projection mechanism.

| Dimension | Score |
| --- | --- |
| Operability | Good. Bazel’s --remote_header / --remote_cache_header flags pass arbitrary headers including bearer tokens. JWKS endpoints are well-trodden infrastructure. |
| Token rotation cost | Low. Short-lived JWTs need no revocation infrastructure; you wait for expiry. |
| Audit log shape | Good. JWT jti is a per-token unique identifier; sub carries workload identity. |
| Blast radius of compromise | Low. 15min lifetime means a leaked JWT is dangerous only until expiry. |
| E4 IAM integration | Strong. A JWT with a tenant claim is exactly the shape E4 needs for OIDC-scoped CAS namespaces (the same pattern as BuildBuddy API-key-scoped namespaces or EngFlow’s IAM scope model). |
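To make the Option B wire shape concrete, here is a small Python sketch of pulling the bearer token out of gRPC-style metadata and decoding its payload segment. It deliberately does not verify the signature; a real cell would verify against the issuer's JWKS before trusting any claim. The helper names are hypothetical.

```python
import base64
import json


def bearer_token(metadata):
    """Extract the token from an 'authorization: Bearer <jwt>' metadata entry."""
    for key, value in metadata:
        if key.lower() == "authorization" and value.startswith("Bearer "):
            return value[len("Bearer "):]
    return None


def _b64url_decode(segment):
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def unverified_claims(token):
    """Decode the JWT payload WITHOUT checking the signature (inspection only)."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    return json.loads(_b64url_decode(payload_b64))
```

A token that does not have exactly three dot-separated segments raises ValueError, which a caller would treat as a failed authentication.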

Option C: Workload Identity (SPIFFE / k8s ServiceAccount Projected Token)

Workers prove identity via Kubernetes ServiceAccount projected tokens (a built-in k8s feature; the kubelet projects a short-lived JWT into each pod that is signed by the kube-apiserver and carries the ServiceAccount as sub). The REAPI cell validates the token via the TokenReview API or against the kube-apiserver’s OIDC discovery endpoint, and checks sub against an allow-list like system:serviceaccount:gf-rbe:gf-reapi-cell-worker. SPIFFE / SPIRE is the production-grade form of this same pattern with cross-cluster identity federation, but is overkill for a single Honey cluster today.

| Dimension | Score |
| --- | --- |
| Operability | Strong. ServiceAccount projected tokens are a k8s primitive; no extra issuer to operate. cert-manager not required. |
| Token rotation cost | Zero operator cost. The kubelet rotates projected tokens automatically (default 1h; configurable per serviceAccountToken volume). |
| Audit log shape | Good. sub carries the ServiceAccount; the pod.name and pod.uid fields of the token’s kubernetes.io claim carry pod identity. |
| Blast radius of compromise | Low. Projected tokens are short-lived by construction and bound to a pod; exfiltrating one off-cluster is not very useful (audience is checked, expiry is short). |
| E4 IAM integration | Strong. Projected tokens are JWTs. A custom audience claim (--service-account-issuer + the cell’s audience identifier) is the natural place for the tenant claim that E4 enforces on. Aligns with E4’s planned OIDC tenant model without forcing a separate issuer. |
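As a rough illustration of the TokenReview path, the Python sketch below builds the request body and interprets a response. The authentication.k8s.io/v1 field names follow the Kubernetes TokenReview API; the function names and the way the allow-list is passed in are assumptions, and the HTTP call to the kube-apiserver is left out.

```python
def token_review_request(token, audience):
    """Build the TokenReview body that would be POSTed to the kube-apiserver."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "spec": {"token": token, "audiences": [audience]},
    }


def trusted_writer(review, allow_list):
    """Return True only for an authenticated subject that is on the allow-list."""
    status = review.get("status", {})
    if not status.get("authenticated", False):
        return False
    username = status.get("user", {}).get("username")
    return username in allow_list
```

For a ServiceAccount token, the username in the response has the form system:serviceaccount:<namespace>:<name>, which is the string the allow-list would hold.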

Recommendation: Option C (Workload Identity via k8s ServiceAccount Projected Tokens)

Pick C. Three reasons:

  1. Zero new operator surface. ServiceAccount projected tokens exist in every k8s cluster; the kubelet handles rotation; there is no new issuer to deploy, monitor, or fail. mTLS (Option A) requires running cert-manager + an internal CA; standalone JWT issuance (Option B) requires running an issuer (dex, keycloak, or hand-rolled) and its JWKS. The two-operator team should not absorb a new identity authority service for this.
  2. Native E4 alignment. Projected tokens are JWTs with a custom audience; adding a tenant claim is configuration, not new infrastructure. E4’s tenant model will validate the same token shape from spoke workers (with different tenant claims), so W2.1 and E4/W4.2 share substrate instead of fighting for it.
  3. Rotation is free. Short-lived projected tokens (default 1h, can be tightened to 15min via expirationSeconds on the projected volume) cap blast radius without needing a CRL or revocation flow. When attestation is wrong (it will be wrong once — see Nuke-Key Drill), the recovery path is “rotate the ServiceAccount and re-project,” not “operate a CRL.”

Option B is a strong second; if cross-cluster workload identity ever becomes necessary (e.g. an off-cluster worker writing AC), upgrade to SPIFFE SVIDs (which is Option C’s general form) rather than retreating to a hand-rolled JWT issuer. Option A is reasonable but loses on operator surface and E4 alignment.

Default .bazelrc Posture

The shipped .bazelrc defaults are the architectural defense against the “developer laptop wrote an AC entry” threat. The rule is: default-off for AC writes, opt-in by named config that only merged-main CI can supply.

Concrete shape, layered onto the repo .bazelrc:

# === AC Writer Attestation Posture (TIN-1462) ===
# Default: no AC writes from this Bazel invocation.
# A merged-main RBE worker overrides this via --config=ac-write-attested,
# which requires a workload-identity JWT injected by the cell, not by
# Bazel client flags. Developer laptops and PR CI cannot supply that
# token; the AC layer rejects writes without it. This is architectural,
# not policy-based.

# Default: read AC, never upload local results.
build --remote_upload_local_results=false

# Default endpoints are intentionally not set here. Repo wrappers pass
# --remote_cache from BAZEL_REMOTE_CACHE after the strict attachment
# contract validates the endpoint. Executor endpoints remain explicit
# proof/production inputs, not ambient .bazelrc defaults.

# Cache-readonly config: explicit "I am a reader" posture for dev /
# untrusted CI. Identical to default but explicit, so workflow YAMLs
# can be self-documenting.
build:cache-readonly --remote_upload_local_results=false

# PR CI lanes MUST use --config=cache-readonly. PR CI lacks attestation
# credentials by construction — the AC layer would reject a write anyway,
# but the explicit config catches misconfiguration at flag-parse time.
build:ci --config=cache-readonly

# Merged-main CI flips to write-enabled via attested credential injection
# from the REAPI cell side. The Bazel client does NOT carry a write token;
# the worker pod (running inside gf-rbe) does. There is no Bazel flag here
# that flips the writer bit — that's intentional. The writer bit is a
# property of the calling identity, not a property of any client config.
build:ci-merged-main --config=ci-cached
# ci-cached remains --remote_upload_local_results=false. Remote-executed
# actions may populate AC through the attested worker path; local ARC/dev
# execution does not upload local results.

Why this shape:

  • --remote_upload_local_results=false is the default-off switch. Without it, any bazel build with --remote_cache= set will attempt to upload local action results. With it, the only writes that happen come from the executor side (the worker pod itself, running inside gf-rbe), which is the only surface where attestation can be enforced honestly.
  • The --remote_cache= URL is the same for readers and writers. Defense is not at the URL; defense is at the caller identity check the server performs on every write. Setting different URLs for readers vs writers would be defense-in-depth but is not the load-bearing gate.
  • PR CI inheriting dev posture (cache-readonly) is intentional: PR CI is less trusted than a clean dev machine (it executes attacker-controlled code from PR branches). Same posture, same constraints.
  • The ci-merged-main config does not carry any “I am allowed to write” flag. The writer bit is set by the calling identity, server-side. Any client claiming to be ci-merged-main while running outside gf-rbe is rejected at the AC layer.
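A contract check for the default-off posture might be sketched as follows in Python. This is not the logic of scripts/cache-attachment-contract.sh, which is not shown here; it is a simplified, hypothetical parser that looks only at unconditional `build` / `common` lines and applies last-flag-wins.

```python
def default_upload_disabled(bazelrc_text):
    """True if the unconditional default disables uploading local results."""
    disabled = False
    for raw in bazelrc_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        command, _, rest = line.partition(" ")
        if command not in ("build", "common"):
            continue  # named configs such as build:cache-readonly are not defaults
        for flag in rest.split():
            if flag in ("--remote_upload_local_results=false",
                        "--noremote_upload_local_results"):
                disabled = True
            elif flag in ("--remote_upload_local_results=true",
                          "--remote_upload_local_results"):
                disabled = False  # a later flag overrides an earlier one
    return disabled
```

The check intentionally ignores named configs: a repo whose only disabling line is `build:cache-readonly ...` still uploads by default, which is exactly the misconfiguration the default is meant to prevent.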

CI Lane Matrix

Which lane can write AC, under which conditions, with which attestation, and what happens when it fails.

| Lane | AC posture | Attestation source | Failure mode if attestation missing |
| --- | --- | --- | --- |
| Developer laptop | read-only | none (cannot attest) | AC write attempts (if any) return 403; build proceeds with local execution. Default .bazelrc prevents the attempt. |
| PR CI (pull_request workflows) | read-only | none (PR runners are outside gf-rbe) | AC write attempts return 403; CI fails loudly if --remote_upload_local_results=true is set anywhere (this is the alarm: PR CI should never be configured to write). |
| Merged-main CI (push to main workflows) | write | ServiceAccount projected token from gf-rbe:gf-reapi-cell-worker pod identity, validated by REAPI cell. The token is created inside gf-rbe by the worker pod that runs the action, not by the GitHub Actions runner that triggered the build. | If the worker pod’s projected token is missing/expired/wrong audience, AC write returns 403; the action’s execution result still returns to Bazel (the build doesn’t fail), but the cache doesn’t fill. Sustained breach pages on gf_reapi_ac_write_rejected_total. |
| Nightly / scheduled (e.g. vendor-mode lane) | read-only by default; write only if running on a merged-main commit SHA against gf-rbe workers | same as merged-main when applicable | same as merged-main when applicable; otherwise 403 on write |
| Dev attachment field-test | read-only | none | same as developer laptop |
| Spoke runners (tinyland-inc spokes) | read-only (against the spoke’s CAS/AC; cross-spoke AC traffic is E4 territory) | spoke-side workload identity (E4-shaped, not in this doc’s scope) | 403 on cross-spoke AC writes; logged via E4 audit infrastructure |
| Spoke-canary | read-only | none | 403 on writes; canary tests should never write AC by definition |
| External consumer (public-vendor handoff) | read-only against a public-readable CAS slice only; no AC access to gf-rbe cell | n/a | n/a |

The architectural property the matrix encodes: only one row writes. Merged-main CI is the single AC writer. Every other lane is read-only. That single-writer property is what makes the poison surface containable.
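The single-writer property is easy to state as data. The Python sketch below encodes the matrix with hypothetical lane identifiers (the nightly lane is shown at its read-only default) and derives the HTTP-equivalent status for a write attempt.

```python
# Lane identifiers are illustrative; postures follow the matrix above.
LANES = {
    "developer-laptop": "read-only",
    "pr-ci": "read-only",
    "merged-main-ci": "write",
    "nightly-scheduled": "read-only",  # default posture
    "dev-attachment-field-test": "read-only",
    "spoke-runner": "read-only",
    "spoke-canary": "read-only",
    "external-consumer": "read-only",
}


def writers(lanes):
    """Return the sorted list of lanes whose posture allows AC writes."""
    return sorted(lane for lane, posture in lanes.items() if posture == "write")


def ac_write_status(lane, attested):
    """HTTP-equivalent status for an AC write attempt from the given lane."""
    if LANES.get(lane) == "write" and attested:
        return 200
    return 403
```

A unit test asserting that `writers(LANES)` has exactly one element would catch a second writer row being added by accident.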

Audit Log Shape

Every AC write — accepted or rejected — must capture enough context that W2.3 (TIN-1464) can build the AC audit log surface on top of it without re-instrumenting. The minimum row schema:

| Field | Source | Purpose |
| --- | --- | --- |
| timestamp | server clock | order of events; budget windowing |
| worker_image_digest | resolved from the writer pod’s container image SHA at write time | “which toolchain wrote this?”; answers stale-toolchain questions |
| platform_digest | the REAPI Action.platform digest from the action being cached | binds AC entry to a worker platform contract; W2.2 uses this on read |
| instance_name | REAPI-native wire field (set by client; routed by gf-reapi-cell per instance-name-routing-design.md) | canonical per-tenant audit key; matches the routing doc’s audit JSON field |
| tenant | tenant claim from the workload-identity JWT (E4 alignment); derived alias of instance_name, kept as a separate field so IAM authority is auditable independently. Normally equals instance_name; divergence is a defect | per-tenant audit + future E4 enforcement; the IAM-authoritative tenant identity |
| git_ref | passed by the CI lane via Bazel build metadata (e.g. --build_metadata=COMMIT_SHA=...); must resolve to a merged-main commit | proves the action came from a merged-main code state, not a PR branch |
| action_digest | the REAPI action digest (key under which the AC entry is stored) | the key being written; supports nuke-key drill (W2.4) |
| attestation_proof | the JWT jti (token unique identifier) and sub from the validated workload-identity token | which specific credential authorized this write; supports revocation forensics |
| outcome | accepted or rejected, plus the gRPC status code | distinguishes “we wrote it” from “we refused to write it”; both rows are kept |
| reject_reason | enum: no_attestation, wrong_audience, expired_token, untrusted_subject, not_main_ref, unknown_tenant, wrong_image_digest | makes the chaos test (W2.5) assertable |

This is the schema. W2.3 implements it; W2.5 asserts on it; W2.4 queries it when invalidating poisoned entries.
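One possible in-code shape for the row, sketched in Python: a dataclass serialized as one JSON object per line. Splitting attestation_proof into `attestation_jti` and `attestation_sub`, and outcome into `outcome` plus `grpc_status`, is an assumption made here for illustration; the authoritative layout is whatever W2.3 implements.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AcWriteAuditRow:
    timestamp: str
    worker_image_digest: str
    platform_digest: str
    instance_name: str
    tenant: str
    git_ref: str
    action_digest: str
    attestation_jti: str   # assumption: attestation_proof split into two fields
    attestation_sub: str
    outcome: str           # "accepted" or "rejected"
    grpc_status: int
    reject_reason: Optional[str] = None

    def to_jsonl(self):
        """Serialize to a single JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)

    def tenant_diverges(self):
        """tenant normally equals instance_name; divergence is a defect."""
        return self.tenant != self.instance_name
```

Keeping rejected rows in the same shape as accepted ones means a single query over the log answers both "who wrote this key?" and "who tried to?".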

Failure Modes the Design Must Defend Against

| Failure | Current exposure | Design defense | Residual risk |
| --- | --- | --- | --- |
| Non-attested write attempted (developer, PR CI, external) | High: .bazelrc could be edited; --remote_upload_local_results=true could be passed | AC write requires valid workload-identity JWT; absence returns 403; audit log records reject_reason=no_attestation | Zero, conditional on the cell correctly validating tokens on every write. W2.5 chaos test verifies. |
| Stale worker image with diverged toolchain | Medium: a long-running gf-reapi-cell pod could outlive a worker image rotation | Worker pod identity carries the running image digest into the audit row; W2.2 (read-side validation) refuses AC entries whose worker_image_digest is no longer in the allow-list. AC write is allowed with a stale image, but reads of that entry will be rejected once the image is removed from the allow-list. | Window between image rotation and allow-list update. Mitigation: image rotation is gated on allow-list update (single commit). |
| Compromised credential | Currently unbounded (any AC write credential leak is permanent until manually revoked) | Workload-identity tokens are short-lived (≤1h) and bound to a specific pod. Revocation = delete the pod; the next token issuance is automatic. No CRL needed. | 1h window from leak to natural expiry. Acceptable; can be tightened to 15min via expirationSeconds. |
| PR CI lane accidentally configured to write | High: historically the most common misconfiguration in shared-cache designs | Architectural impossibility, not a policy block: PR CI runs outside gf-rbe, cannot present a valid gf-reapi-cell-worker ServiceAccount token, and the AC layer rejects the write. The .bazelrc default of --remote_upload_local_results=false is a belt-and-suspenders second line. | Zero, conditional on the workload-identity issuer not being reachable from the PR CI execution environment. (k8s ServiceAccount projection is pod-local; a PR runner cannot fabricate it.) |
| Cross-platform AC entry reuse | Out of scope here; covered by W2.2 (TIN-1463) platform-digest validation on read | n/a (this doc, by design) | n/a (W2.2’s residual risk) |
| Replay of a previously valid token | Low: JWTs are not natively single-use | exp is checked; window is ≤1h. A captured token can be replayed only within its remaining lifetime, and only against the same audience. | 1h window for replay. Acceptable. Optional: bind tokens to the source pod via the pod.uid field of the token’s kubernetes.io claim. |
| Token-issuer compromise (kube-apiserver signing key leak) | Catastrophic but cluster-wide; out of scope for AC specifically | If the kube-apiserver is owned, AC integrity is the least of the problems. AC trust inherits from k8s cluster trust. | Inherits cluster-trust posture. Acceptable for in-cluster identity model. |

Chaos Test Preview (W2.5 / TIN-1466)

The test that proves the design works: stand up a probe identity outside the gf-reapi-cell-worker ServiceAccount — same namespace, same network, same image even — and have it attempt an AC write against gf-reapi-cell with a valid JWT carrying actioncache:Write for the tenant but a sub outside GF_REAPI_AC_WRITE_TRUSTED_SUBJECTS. Expected outcome: the write returns PermissionDenied (HTTP 403 equivalent), the AC entry is not written, the AC audit log records one outcome=rejected row with reject_reason=untrusted_subject, and the gf_reapi_ac_write_rejected_total Prom counter increments. No-token, expired-token, and wrong-audience probes are authentication failures and are rejected before the AC-attestation audit path; they are authz chaos siblings, not this ticket’s non-attested-writer proof. Run the non-attested writer check as a nightly job; the day it passes silently is the day W2.5 closes. The day it stops passing is a hot incident.
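The expected outcome can be written as a single predicate. The Python sketch below is a hypothetical checker, not the nightly script itself; it assumes the probe harness has already collected the gRPC status, whether the entry exists afterwards, the audit rows for the probe's action digest, and the counter value before and after.

```python
PERMISSION_DENIED = 7  # gRPC status code for PermissionDenied


def probe_passed(grpc_status, entry_exists, audit_rows,
                 rejected_before, rejected_after):
    """True only if every expected effect of the rejected write was observed."""
    rejected = [r for r in audit_rows if r.get("outcome") == "rejected"]
    return (
        grpc_status == PERMISSION_DENIED
        and not entry_exists
        and len(rejected) == 1
        and rejected[0].get("reject_reason") == "untrusted_subject"
        and rejected_after > rejected_before
    )
```

The counter comparison uses "greater than" instead of "exactly one more" because other rejections may land during the probe window; that tolerance is a choice made for this sketch.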

Nuke-Key Drill Integration (W2.4 / TIN-1465)

This attestation will be wrong once. A trusted worker will write an AC entry that turns out to be poisoned — flaky test result mis-cached, a genrule that captured timestamp output, a worker image with a regression that wasn’t caught before publish. When that happens, the operator needs to invalidate the specific AC entry without nuking the whole cache. W2.4 owns that drill: given an action_digest and tenant (from the audit log row this design produces), the nuke-key drill removes the entry from the AC, refuses to re-cache it for a quarantine window, and emits an event that subsequent re-executions (by an attested writer) can re-fill the slot honestly. Cross-link: TIN-1465. The audit log fields in this doc are the input to W2.4 — without action_digest and tenant recorded per write, W2.4 cannot operate surgically.
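A minimal Python sketch of the drill's core behaviour, under the assumption that entries are keyed by (tenant, action_digest) and that the quarantine is a simple time window. The class and method names are hypothetical, and event emission is omitted.

```python
class QuarantiningActionCache:
    """In-memory stand-in for an AC that supports a nuke-key quarantine."""

    def __init__(self, quarantine_seconds):
        self.entries = {}
        self.quarantined_until = {}
        self.quarantine_seconds = quarantine_seconds

    def nuke(self, tenant, action_digest, now):
        """Remove one entry and start its quarantine; report whether it existed."""
        key = (tenant, action_digest)
        removed = self.entries.pop(key, None) is not None
        self.quarantined_until[key] = now + self.quarantine_seconds
        return removed

    def write(self, tenant, action_digest, result, now):
        """Store a result unless the key is still inside its quarantine window."""
        key = (tenant, action_digest)
        if now < self.quarantined_until.get(key, 0):
            return False
        self.entries[key] = result
        return True
```

Only the named key is touched, which is the "surgical" property: every other entry in the cache keeps serving hits while the poisoned slot waits to be refilled.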

Rollout Plan

Sequential. Each step is a separately reviewable change.

  1. Land this design doc (this PR). Recommendation locked: Option C (k8s ServiceAccount projected tokens). Open questions itemized below.
  2. Stand up the attestation issuer. For Option C, this is configuration, not a new service: enable ServiceAccount token projection on the gf-reapi-cell-worker pod template, with audience claim gf-reapi-cell.gf-rbe.svc and expirationSeconds=3600 (tighten later). Reflect in tofu/modules/ (probably a new gf-reapi-cell module sibling to bazel-cache).
  3. Configure REAPI cell to require attestation on AC write. Add an AC-write interceptor to the gf-reapi-cell Go service that: (a) extracts the authorization header, (b) validates the token via TokenReview against the kube-apiserver, (c) checks sub against an allow-list (initially: just system:serviceaccount:gf-rbe:gf-reapi-cell-worker), (d) records the audit-log row, (e) returns PermissionDenied if any check fails. AC reads remain unauthenticated for now (cross-tenant read isolation is E4 territory).
  4. Roll out the new .bazelrc posture. Add the --remote_upload_local_results=false default and the cache-readonly / ci-merged-main configs. Update scripts/cache-attachment-contract.sh to assert the default. Update CI workflows to pass the right config per lane (the matrix above).
  5. Add the chaos test (W2.5) to nightly. New script, tests/gf_reapi_cell_ac_attestation_chaos.sh, sibling to tests/gf_reapi_cell_publish_contract.sh. Wired into GF REAPI AC Attestation Chaos on nightly schedule and AC-path changes.
  6. Add audit-log requirement to AC write path. The first W2.3 implementation writes local JSONL rows at ${GF_REAPI_STORE_ROOT}/audit/ac-writes.jsonl by default and exposes an in-process tenant query primitive. Accepted AC writes fail closed if the audit append fails, so the cell does not create unaudited AC entries. The remaining W2.3 work is the operator Resource Usage API, 30-day retention policy, and dashboard/query surface.
  7. Document rollback procedure. If the attestation gate breaks (a false-rejection storm: legitimate writers being rejected), the rollback is to flip a single feature flag on the gf-reapi-cell Deployment (AC_WRITE_ATTESTATION_ENFORCED=false), which puts the cell into warn-but-allow mode. The audit log still records reject reasons; writes still proceed. This is not graceful degradation in the threat-model sense; it is an emergency lever. Document it explicitly so an on-call operator at 3am doesn’t have to invent it.
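The interaction between the interceptor's check result (step 3) and the emergency flag (step 7) can be sketched in Python as below. The flag name comes from this plan; defaulting to enforced when the variable is unset, and recording the reason on an allowed write in warn-but-allow mode, are assumptions of this sketch.

```python
def enforcement_enabled(env):
    """Enforce unless the flag is explicitly set to 'false' (assumed default)."""
    return env.get("AC_WRITE_ATTESTATION_ENFORCED", "true").strip().lower() != "false"


def decide(reject_reason, enforced):
    """Map a check result and the flag to (allow_write, audit_fields)."""
    if reject_reason is None:
        return True, {"outcome": "accepted", "reject_reason": None}
    if enforced:
        return False, {"outcome": "rejected", "reject_reason": reject_reason}
    # Warn-but-allow: the write proceeds, but the reason is still recorded.
    return True, {"outcome": "accepted", "reject_reason": reject_reason}
```

Treating anything other than an explicit "false" as enforced means a typo in the Deployment cannot silently disable the gate.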

Open Questions

These do not block landing this doc. They block closing E2/TIN-1446.

  1. Cert authority scope (if Option C is ever replaced by Option A). Option C avoids a CA entirely. If a future requirement (cross-cluster workload identity, off-cluster workers) forces us to Option A or full SPIFFE, do we use the existing tinyland-internal CA or stand up a new CA scoped to gf-rbe? Recommendation pending: a new scoped CA, on the principle that AC-write authority should not share trust roots with unrelated tinyland services.
  2. Token lifetime: 15min / 1h / build-duration? Default ServiceAccount projected token lifetime is 1h. Tighter is safer (smaller replay window) but increases token-refresh frequency. Build duration can exceed 1h for some target classes (docs-site:build cold can be close). Recommendation pending: 1h initial, tighten to 15min after W2.5 chaos test is green for 14 consecutive days. Note that a token issued mid-build remains valid for actions completed before expiry; build-duration tokens are not necessary.
  3. Revocation strategy. For Option C: revocation = pod deletion + ServiceAccount rotation. No CRL needed. For Option A/B: would need either short-lived (preferred) or CRL/OCSP. Locked here on Option C = no infrastructure. Recommendation: stay on Option C.
  4. How does this integrate with E4’s OIDC tenant claim (TIN-1473)? E4 plans an OIDC-shaped tenant claim. This design’s tenant audit field is the E4 substrate. Open: whether the workload-identity JWT itself carries the tenant claim (E4 directly enforces on the same token) or whether the cell looks up tenant from the calling pod’s namespace (the cell enriches the audit log, E4 enforces separately). Recommendation pending: same JWT, tenant claim added by the projection audience — keeps W2.1 and E4/W4.2 on the same identity substrate.
  5. Does attestation cover CAS writes too, or just AC? Recommendation: AC only. CAS writes are content-addressed and digest-verified on store — a malicious CAS writer cannot poison anyone because the blob digest doesn’t match anything the cache gives out. The damage surface for unauthenticated CAS writes is denial-of-service (filling the CAS with junk that gets evicted under LRU pressure) and information disclosure (an attacker who can read CAS can pull blobs by guessing digests, though digests are hard to guess). Both are real but neither is the poison surface AC writes are. AC-only attestation is the right scope for W2.1. If DoS or read-side info disclosure on CAS becomes the gating concern, that is a separate workstream — likely under E4’s tenant-isolated CAS namespace.
  6. Allow-list shape. Initially the allow-list is a single ServiceAccount string. Should it be a flat list, a regex, or a structured policy (CEL, OPA)? Recommendation pending: flat list until there is more than one trusted writer; pre-committing to a policy language for one entry is over-engineering. Revisit when E6 introduces additional worker classes (Chromium worker, KVM worker, GPU worker) per the existing gf-reapi-cell.md non-goals.
  7. Audit-log durability requirement. This doc says the log row must exist on every write attempt. It does not say the row must survive a pod restart. W2.3 (TIN-1464) owns the durable-store decision (Loki vs ClickHouse vs sqlite). Open here: should the AC write block on the audit row being durably stored, or fire-and-forget? Recommendation: fire-and-forget to the durable log stream; W2.3 owns the durable-side guarantee. This is separate from the local JSONL append in Rollout Plan step 6, which already fails closed for accepted writes.

References

External:

  • REAPI v2 spec: UpdateActionResult is the AC write RPC; permission errors return PermissionDenied.
  • Bazel --remote_header / --remote_cache_header — how the client passes the attestation token.
  • Bazel --remote_upload_local_results=false — the default-off switch.
  • Kubernetes ServiceAccount token volume projection — the Option C primitive.
  • TokenReview API — server-side validation primitive.
  • SPIFFE / SPIRE — the cross-cluster generalization of Option C, listed for the cross-cluster future case in Open Question 1.
  • EngFlow multi-tenancy and remote execution security patterns: inspiration only; the source of the IAM scope pattern this design borrows for E4 alignment.
  • BuildBuddy API-key model — comparable trust-boundary pattern in a peer product; we are not adopting BuildBuddy but the pattern is informative.

Repo-local:

Linear:

  • Parent epic: TIN-1446 (E2 AC authority).
  • This workstream: TIN-1462 (W2.1 AC writer attestation).
  • Sibling workstreams: TIN-1463 (W2.2 platform-digest read-side validation), TIN-1464 (W2.3 AC audit log durable store), TIN-1465 (W2.4 nuke-key drill), TIN-1466 (W2.5 chaos test).
  • Related: TIN-1445 (E1 CAS authority — the substrate AC sits on); TIN-1448 (E4 tenant model — consumes this design’s tenant claim); TIN-1450 (E6 target-class breadth — gated by E2); TIN-1473 (E4 OIDC tenant claim — concrete tenant-claim shape).
