Tailnet-First Operator Plane

Defines the tailnet-first access model, dashboard exposure boundaries, and multi-org runner enrollment for the GloriousFlywheel platform.

Design Principle

The operator plane (dashboard, API, MCP server) is exposed exclusively over the tailnet. No public endpoints. Authentication is handled by Tailscale’s identity provider (tsidp), not application-level passwords.

Operator (tailnet device)
  -> Tailscale tunnel
  -> Caddy reverse proxy (tailscale_auth directive)
  -> Dashboard (SvelteKit) / API / MCP server
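
This chain can be sketched as a Caddyfile. The sketch assumes the caddy-tailscale plugin (which provides the tailscale_auth directive); the upstream port and the identity placeholder are illustrative:

```caddyfile
dashboard.tail12345.ts.net {
	# Reject any request that arrives without a Tailscale identity
	tailscale_auth

	reverse_proxy localhost:3000 {
		# Forward the authenticated tailnet identity to the app
		header_up X-Tailscale-User {http.auth.user.id}
	}
}
```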

Access Model

Who Can Access What

| Role | Access Method | Scope |
| --- | --- | --- |
| Operator | Dashboard UI over tailnet | View fleet, pause/resume, edit config, view metrics |
| Org Admin | Dashboard UI + GitOps MRs | All operator actions + approve config changes, manage enrollment |
| Platform Engineer | CLI (kubectl, tofu, MCP) over tailnet | Full cluster access, infrastructure changes |
| Downstream Consumer | GitHub/GitLab CI only | Submit jobs, no platform access |
| External User | None (tailnet-only) | No access to operator plane |

Authentication Flow

User opens dashboard URL (e.g. https://dashboard.tail12345.ts.net)
  -> Caddy checks Tailscale identity via tailscale_auth
  -> Extracts tailnet user identity (email, node name)
  -> Maps to platform role:
     - Org owner email -> admin
     - Org member email -> operator
     - Unknown -> denied (not on tailnet = no access)
  -> Sets X-Tailscale-User header
  -> SvelteKit reads identity from header, populates session

No passwords. No OAuth flows. No session cookies to manage. If you’re on the tailnet, you’re authenticated. Role comes from identity.
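
The identity-to-role mapping in the flow above can be sketched as a pure function. The role names and the deny-by-default behavior follow the flow; the owner/member sets are illustrative inputs that in practice would come from org configuration:

```typescript
type Role = 'admin' | 'operator';

// Maps a tailnet identity (from the X-Tailscale-User header) to a platform role.
// Returns null for unknown identities; in practice Caddy has already rejected
// any request that arrived without a Tailscale identity.
function mapRole(
  email: string | null,
  owners: ReadonlySet<string>,
  members: ReadonlySet<string>,
): Role | null {
  if (email === null) return null;           // not on tailnet = no access
  if (owners.has(email)) return 'admin';     // org owner email -> admin
  if (members.has(email)) return 'operator'; // org member email -> operator
  return null;                               // unknown -> denied
}
```

The SvelteKit server hook would call this with the header value and put the result in the session.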

Exposure Boundaries

What’s on the Tailnet

| Service | URL Pattern | Port | Auth |
| --- | --- | --- | --- |
| Dashboard UI | dashboard.tail*.ts.net | 443 (Caddy TLS) | tailscale_auth |
| Dashboard API | dashboard.tail*.ts.net/api/* | 443 | tailscale_auth + role check |
| MCP Server | stdio (local process) | N/A | DASHBOARD_TOKEN env var |
| Prometheus | prometheus.tail*.ts.net | 443 | tailscale_auth |
| Grafana | grafana.tail*.ts.net | 443 | tailscale_auth |

What’s NOT on the Tailnet

| Service | Access | Reason |
| --- | --- | --- |
| ARC Controller | Cluster-internal only | No external API needed |
| Runner pods | Cluster-internal only | Ephemeral, no direct access |
| Attic cache | Cluster-internal + tailnet | Runners use cluster DNS, devs use tailnet |
| PostgreSQL | Cluster-internal only | CNPG manages access via mTLS |
| RustFS | Cluster-internal only | S3 API for Attic only |

MCP Server Access

The MCP server runs as a local stdio process on the operator’s machine. It calls the dashboard API over the tailnet:

Claude Code -> MCP Server (stdio, local)
  -> HTTP to dashboard.tail*.ts.net/api/*
  -> Bearer token = tsidp-issued JWT
  -> Dashboard validates token, checks role
  -> Returns envelope response
  -> MCP Server formats for Claude
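
The MCP server's half of this flow can be sketched as two small helpers: one that builds the tailnet request carrying the JWT, and one that unwraps the envelope before formatting for Claude. The envelope shape ({ ok, data, error }) and helper names are assumptions, not the actual API contract:

```typescript
interface Envelope<T> {
  ok: boolean;
  data?: T;
  error?: string;
}

// Build the request the MCP server sends to the dashboard API over the tailnet.
// The token is the tsidp-issued JWT, supplied via the DASHBOARD_TOKEN env var.
function buildDashboardRequest(base: string, path: string, token: string) {
  return {
    url: `${base}/api${path}`,
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Unwrap the envelope response; throw so the MCP layer can surface the error.
function unwrap<T>(env: Envelope<T>): T {
  if (!env.ok || env.data === undefined) {
    throw new Error(env.error ?? 'dashboard API error');
  }
  return env.data;
}
```

In the real server these would feed a fetch call; retry and timeout handling are elided here.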

Multi-Org Runner Enrollment

Model

GloriousFlywheel supports runner sharing across multiple GitHub organizations and GitLab groups through a single platform instance.

Platform Instance (single cluster)
  +-- Org A (GitHub)
  |   +-- gh-nix (shared)
  |   +-- gh-docker (shared)
  |   +-- gh-dind (shared)
  |
  +-- Org B (GitHub)
  |   +-- gh-docker (shared, same scale set)
  |   +-- org-b-gpu (dedicated, org-specific)
  |
  +-- Group C (GitLab)
      +-- gl-docker (shared)
      +-- gl-nix (shared)

Enrollment Types

| Type | Scope | Registration | Lifecycle |
| --- | --- | --- | --- |
| Shared | All enrolled orgs | One GitHub App installation per org; all orgs route to the same scale set | Platform manages, orgs consume |
| Dedicated | Single org | Org-specific GitHub App or runner group | Org requests, platform provisions |
| Org-plus-user | Org + specific repos | GitHub App with restricted repo access | Org admin configures repo list |

Registration Flow

GitHub (ARC):

1. Org admin installs GloriousFlywheel GitHub App
2. App installation generates credentials
3. Platform engineer adds org to arc-runners stack:
   - New GitHub App secret in cluster
   - ARC scale set configured with org's app credentials
4. Runners appear in org's GitHub Actions runner list
5. Org's workflows use `runs-on: [self-hosted, nix]`
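
Step 3 might look like the following module call in the arc-runners stack. The module path, variable names, and secret name are all illustrative, not the stack's actual interface:

```hcl
# Hypothetical: provisioning Org B's dedicated GPU scale set in tofu/stacks/arc-runners/
module "org_b_gpu" {
  source = "../../modules/arc-scale-set" # illustrative module path

  github_org        = "org-b"
  scale_set_name    = "org-b-gpu"
  github_app_secret = "arc-org-b-app"    # cluster secret holding the App credentials from step 2
  namespace         = "runners-org-b"
  node_selector     = { pool = "gpu" }   # dedicated node pool for strict isolation
}
```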

GitLab:

1. Group admin creates group runner token
2. Platform engineer adds token to gitlab-runners stack
3. Runner registers with GitLab group
4. Group's pipelines pick up the runner via tags
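
The gitlab-runners stack entry might resemble a gitlab-runner Helm chart values fragment like this; the secret name, namespace, and image are illustrative:

```yaml
# Hypothetical values for the gitlab-runner chart in the gitlab-runners stack
gitlabUrl: https://gitlab.example.com
runners:
  # Group runner token sourced from a cluster secret, never committed to the repo
  secret: gitlab-group-c-runner-token
  config: |
    [[runners]]
      executor = "kubernetes"
      [runners.kubernetes]
        namespace = "gitlab-runners"
        image = "alpine:3.20"
```

The tags (gl-docker, gl-nix) are attached to the runner at registration, which is how step 4's pipelines select it.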

Runner Isolation

| Isolation Level | Mechanism | Use Case |
| --- | --- | --- |
| None (shared pool) | All orgs share the same scale set pods | Trusted orgs, cost efficiency |
| Namespace | Separate Kubernetes namespace per org | Untrusted orgs, resource isolation |
| Node | Dedicated node pool per org | Strict isolation, compliance |

Default: shared pool with ephemeral pods. Each job gets a fresh pod that is destroyed after completion. No cross-job data leakage.
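
For namespace-level isolation, a per-org ResourceQuota is the usual bounding mechanism. A minimal sketch; the namespace name and limits are placeholders:

```yaml
# Hypothetical quota bounding Org B's runner namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: runner-quota
  namespace: runners-org-b
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 64Gi
    pods: "50"
```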

Enrollment Lifecycle

Enroll:
  Org admin requests enrollment
  -> Platform engineer adds org config to tofu stack
  -> PR + review + merge + apply
  -> Runner appears in org's forge

Monitor:
  Dashboard shows enrollment status per org
  -> /api/runners groups by forge
  -> Metrics tracked per org via runner labels

Offboard:
  Org admin requests removal
  -> Platform engineer removes org config
  -> PR + review + merge + apply
  -> Runner deregisters from org's forge
  -> Pods drain, secrets deleted
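
The /api/runners call in the monitor step, grouped by forge and wrapped in the envelope, might return something shaped like this (a guess at the payload; field names are illustrative):

```json
{
  "ok": true,
  "data": {
    "github": { "org-a": ["gh-nix", "gh-docker", "gh-dind"] },
    "gitlab": { "group-c": ["gl-docker", "gl-nix"] }
  }
}
```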

Ownership Matrix

| Decision | Owner | Process |
| --- | --- | --- |
| Grant org enrollment | Org Admin | Request via issue, approved by platform team |
| Provision shared runner | Platform Engineer | PR to tofu/stacks/arc-runners/ |
| Provision dedicated runner | Platform Engineer | PR with new module call + namespace |
| Set resource quotas per org | Org Admin | PR to stack variables |
| Manage GitHub App installation | Org Admin (per org) | GitHub org settings |
| Rotate runner credentials | Platform Engineer | Scheduled or on-demand via runbook |
| Emergency pause (all orgs) | Operator | Dashboard or MCP pause_runner |
