RFC 0001: Fleet Sync Integration for Lab Machines
Status: Implemented (v0.3.0)
Author: xoxd
Date: 2026-02-22
Branch: fleet/multi-machine-sync (PR #18)
Tracking: Sprint 6
Abstract
This RFC describes the integration plan for deploying tcfs multi-machine sync across the tinyland lab fleet (xoxd-bates, yoga, petting-zoo-mini). It covers the Nix module design, infrastructure prerequisites, rollout sequence, and verification procedures.
Motivation
Three alpha machines share overlapping repos in ~/git/ that drift between
machines. Files may be stale, duplicative, or have uncommitted work. tcfs
brings all machines into a uniform view where every machine sees every file
but only hydrates on demand.
Current pain points:
- Manual rsync/scp between machines for file sharing
- No awareness of which machine has the latest version
- No conflict detection for concurrent edits
- No audit trail of what changed and where
Architecture
xoxd-bates (macOS)   yoga (Rocky Linux)      petting-zoo-mini (macOS)
     |                        |                        |
     +--- push/pull ----------+---- push/pull ---------+
     |                        |                        |
     +--- NATS events -------NATS JetStream------------+
                              |
                     SeaweedFS S3 (CAS)
                 nats.tcfs.svc.cluster.local
Data Flow
- Write: Machine writes file locally, ticks its VectorClock
- Push: Chunks via FastCDC, uploads to SeaweedFS CAS, writes SyncManifest v2 (JSON)
- Publish: StateEvent::FileSynced on NATS subject STATE.{device_id}.file_synced
- Subscribe: Other machines receive event via per-device durable consumer
- Compare: VectorClock comparison determines: LocalNewer / RemoteNewer / Conflict
- Pull: If RemoteNewer, auto-fetch chunks and reassemble
- Conflict: If concurrent, invoke ConflictResolver (auto/interactive/defer)
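The Compare step can be sketched as follows. This is a minimal illustration, not the tcfs implementation: it assumes a vector clock is a map from device ID to counter, treats a missing entry as 0, and adds an Equal outcome (not named in the list above) for identical clocks. The names compare_clocks and SyncOutcome are hypothetical.

```python
from enum import Enum


class SyncOutcome(Enum):
    EQUAL = "Equal"  # assumption: not listed in the RFC, added for completeness
    LOCAL_NEWER = "LocalNewer"
    REMOTE_NEWER = "RemoteNewer"
    CONFLICT = "Conflict"


def compare_clocks(local: dict[str, int], remote: dict[str, int]) -> SyncOutcome:
    """Compare two vector clocks given as {device_id: counter} maps."""
    devices = set(local) | set(remote)
    # A missing entry counts as 0: that device has not ticked in this history.
    local_ahead = any(local.get(d, 0) > remote.get(d, 0) for d in devices)
    remote_ahead = any(remote.get(d, 0) > local.get(d, 0) for d in devices)
    if local_ahead and remote_ahead:
        # Each side has seen a write the other has not: concurrent edits.
        return SyncOutcome.CONFLICT
    if local_ahead:
        return SyncOutcome.LOCAL_NEWER
    if remote_ahead:
        return SyncOutcome.REMOTE_NEWER
    return SyncOutcome.EQUAL
```

Two machines that each edit a file without seeing the other's change end up with clocks such as {"yoga": 1} and {"xoxd-bates": 1}. Neither dominates the other, so the comparison yields Conflict and never silently picks a side.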
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Conflict detection | Vector clocks | Full distributed partial ordering; no central coordinator |
| Conflict resolution | Pluggable (auto/interactive/defer) | Auto for headless, interactive for workstations |
| Auto-resolve tie-break | Lexicographic device name | Deterministic, reproducible, no coordination needed |
| .git sync | Opt-in, bundle mode default | Git bundles are atomic; raw mode risks corruption |
| Manifest format | JSON v2 with v1 text fallback | Backward-compatible with pre-Sprint 6 data |
| State transport | NATS JetStream (not S3 polling) | Real-time, durable, fan-out, per-device cursors |
| Stream retention | Limits (7 days, file storage) | Survives restarts, bounded growth |
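The auto-resolve tie-break is deterministic because every machine can compute the winner from the two device names alone. The sketch below shows that property. It assumes the lexicographically smallest device name wins (the table does not say which direction the comparison runs), and the Version type and auto_resolve function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    device_name: str   # name of the device that wrote this version
    content_hash: str  # placeholder for whatever identifies the content


def auto_resolve(a: Version, b: Version) -> Version:
    """Pick a winner between two concurrent versions.

    Assumption: the smallest device name wins. The opposite direction
    would be equally deterministic.
    """
    return a if a.device_name <= b.device_name else b
```

Because the result depends only on the inputs and not on argument order or local state, yoga and xoxd-bates reach the same answer without exchanging any further messages.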
Infrastructure Prerequisites
Already Running (Civo K8s)
| Service | Endpoint | Namespace |
|---|---|---|
| SeaweedFS S3 | seaweedfs.tcfs.svc.cluster.local:8333 | tcfs |
| NATS JetStream | nats.tcfs.svc.cluster.local:4222 | tcfs |
Required Before Rollout
- NATS accessible from lab machines
  - Current: NATS is cluster-internal only
  - Options: (a) Civo NodePort/LoadBalancer, (b) WireGuard tunnel, (c) NATS leaf node on local network
  - Recommended: NATS leaf node on local network (lowest latency, works offline)
- SeaweedFS S3 accessible from lab machines
  - Current: Accessible via dees-appu-bearts:8333 on local network
  - Status: Already reachable from all three machines
- Device enrollment
  - Each machine runs tcfs device enroll --name $(hostname) once
  - Registry stored in S3 at tcfs-meta/devices.json
Nix Module Design
Upstream (tummycrypt repo)
Located in nix/modules/ within the tummycrypt flake:
- tcfs-daemon.nix: NixOS system-level service (services.tcfsd.*)
- tcfs-user.nix: Home Manager user-level service (programs.tcfs.*)
Both expose fleet options: deviceName, conflictMode, syncGitDirs,
gitSyncMode, natsUrl, excludePatterns.
Downstream (crush-dots repo)
The existing nix/home-manager/tummycrypt.nix module needs extension:
# New fleet options under tinyland.tummycrypt
fleet = {
  enable = mkEnableOption "multi-machine fleet sync";
  conflictMode = mkOption { type = enum ["auto" "interactive" "defer"]; default = "auto"; };
  syncGitDirs = mkOption { type = bool; default = false; };
  gitSyncMode = mkOption { type = enum ["bundle" "raw"]; default = "bundle"; };
  natsUrl = mkOption { type = str; default = "nats://nats.tcfs.svc.cluster.local:4222"; };
  excludePatterns = mkOption { type = listOf str; default = ["*.swp" "*.swo" ".direnv"]; };
};
Feature Flag (flags.nix)
Existing flag: tinyland.host.tummycrypt.enable (default: false)
New sub-flags needed:
tinyland.host.tummycrypt = {
  enable = mkOption { type = bool; default = false; };
  fleet = mkOption { type = bool; default = false; };  # NEW
};
Per-Host Configuration
| Host | Platform | tummycrypt.enable | fleet | conflictMode | syncGitDirs |
|---|---|---|---|---|---|
| xoxd-bates | macOS (aarch64) | true | true | auto | false |
| yoga | Linux (x86_64) | true | true | interactive | true (bundle) |
| petting-zoo-mini | macOS (aarch64) | true | true | auto | false |
Rollout Plan
Phase 1: Merge + Package (Day 0)
- Merge PR #18 to main in tummycrypt
- Tag v0.3.0 release
- Nix flake update in crush-dots: nix flake update tummycrypt
- Verify nix build .#tcfs-cli succeeds on all platforms
Phase 2: Enroll Devices (Day 1)
- Deploy updated Home Manager config to all 3 machines
- On each machine:
  tcfs device enroll --name $(hostname)
  tcfs device list   # verify all 3 visible
- Verify S3 registry at tcfs-meta/devices.json shows 3 devices
Phase 3: Smoke Test (Day 1)
- On yoga:
  echo "fleet sync test" > /tmp/fleet-test.txt
  tcfs push /tmp/fleet-test.txt
- On xoxd-bates:
  tcfs pull tcfs/default/fleet-test.txt /tmp/fleet-test.txt
  diff <(echo "fleet sync test") /tmp/fleet-test.txt
- On petting-zoo-mini: same pull + verify
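What this smoke test exercises is a chunk, store, and reassemble round trip through a content-addressed store. The toy model below illustrates the idea under several stated assumptions: it uses fixed-size chunks as a stand-in for FastCDC (which picks content-defined boundaries), SHA-256 as the chunk key (the RFC does not name the hash), an in-memory dict in place of SeaweedFS, and a plain list of digests in place of SyncManifest v2. The push and pull functions here are hypothetical and unrelated to the tcfs CLI internals.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so short inputs split into several chunks


def push(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into chunks, store each under its digest, return the manifest."""
    manifest = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # identical chunks are stored only once
        manifest.append(digest)
    return manifest


def pull(manifest: list[str], store: dict[str, bytes]) -> bytes:
    """Fetch each chunk by digest, verify it, and reassemble the original bytes."""
    parts = []
    for digest in manifest:
        chunk = store[digest]
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError(f"corrupt chunk {digest}")
        parts.append(chunk)
    return b"".join(parts)
```

A byte-perfect round trip means pull(push(data, store), store) == data for any input, which is the property the diff in the smoke test checks end to end.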
Phase 4: Conflict Validation (Day 2)
- On yoga:
  echo "yoga version" > /tmp/conflict.txt && tcfs push /tmp/conflict.txt
- On xoxd-bates (before pulling):
  echo "xoxd version" > /tmp/conflict.txt && tcfs push /tmp/conflict.txt
- Verify conflict detected (not silently overwritten)
- Resolve via CLI:
  tcfs resolve-conflict conflict.txt --keep-local
Phase 5: Git Repo Sync (Day 3, yoga only)
- Enable syncGitDirs = true on yoga
- Push a test repo: tcfs push ~/git/test-repo
- Verify git bundle is created and round-trips
- Pull on another machine and verify git log matches
Phase 6: NATS Real-Time (Day 4+)
- Establish NATS connectivity from lab machines to Civo cluster
- Start tcfsd daemon on each machine
- Modify file on one machine, verify auto-pull on others within seconds
- Take one machine offline, modify files on others, bring back online, verify catch-up
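The offline catch-up check relies on each device having its own durable cursor into the event stream. The sketch below is a toy in-memory model of that behaviour, not the NATS client API: EventStream, publish, and fetch are hypothetical names, and the model ignores the 7-day retention limit, acknowledgements, and redelivery.

```python
class EventStream:
    """Append-only log standing in for a JetStream stream."""

    def __init__(self) -> None:
        self.events: list[tuple[str, str]] = []  # (subject, file path)
        self.cursors: dict[str, int] = {}        # consumer name -> next index

    def publish(self, device_id: str, path: str) -> None:
        # Subject layout follows the RFC: STATE.{device_id}.file_synced
        self.events.append((f"STATE.{device_id}.file_synced", path))

    def fetch(self, consumer: str) -> list[tuple[str, str]]:
        """Return every event this consumer has not seen and advance its cursor."""
        start = self.cursors.get(consumer, 0)
        pending = self.events[start:]
        self.cursors[consumer] = len(self.events)
        return pending
```

A machine that was offline simply calls fetch later and receives everything published in the meantime, in order. Other consumers are unaffected because each cursor advances independently.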
Verification Checklist
- PR #18 CI all green (Build + Lint + Test, Nix Build, Security Audit, cargo-deny)
- All 133 tests pass locally on each platform (linux-x86_64, darwin-aarch64)
- tcfs device enroll succeeds on all 3 machines
- Push → Pull round-trip byte-perfect
- Conflict detection works (concurrent writes detected, not overwritten)
- Auto-resolver picks lexicographic winner correctly
- Manifest v2 JSON written on push, v1 text still parseable
- VectorClock monotonicity holds across push/pull cycles
- Home Manager nix switch succeeds on all 3 machines after module update
- TUI shows Conflicts tab (key 5)
- MCP device_status and resolve_conflict tools respond correctly
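The VectorClock monotonicity item can be made concrete with a small model. The sketch below assumes a clock is a {device_id: counter} map, that a write ticks the local entry, and that a pull merges clocks by taking the per-device maximum. The merge rule is the conventional one for vector clocks and is an assumption about tcfs, and tick, merge, and dominates are hypothetical helper names.

```python
def tick(clock: dict[str, int], device_id: str) -> dict[str, int]:
    """Return a copy of the clock with this device's counter incremented."""
    updated = dict(clock)
    updated[device_id] = updated.get(device_id, 0) + 1
    return updated


def merge(local: dict[str, int], remote: dict[str, int]) -> dict[str, int]:
    """Combine two clocks by taking the maximum counter for every device."""
    return {d: max(local.get(d, 0), remote.get(d, 0)) for d in set(local) | set(remote)}


def dominates(later: dict[str, int], earlier: dict[str, int]) -> bool:
    """True if 'later' is at least as large as 'earlier' in every entry."""
    return all(later.get(d, 0) >= n for d, n in earlier.items())
```

Monotonicity then means that after any tick or merge, the resulting clock dominates every clock it was derived from. A property-based test can generate random sequences of ticks and merges and check dominates after each step.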
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| NATS unreachable from lab | Medium | Blocks Phase 6 | Sync works without NATS (manual push/pull) |
| Manifest v2 breaks old clients | Low | Data inaccessible | v1 fallback parser, tested in CI |
| VectorClock overflow (u64) | Negligible | Corruption | 2^64 ticks takes roughly 585,000 years at 1M ticks/sec |
| Git bundle fails (dirty worktree) | Medium | .git sync skipped | git_is_safe() pre-checks, warnings logged |
| macOS FUSE unavailable | Low | Mount doesn’t work | Push/pull/sync work without FUSE |
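The overflow estimate in the table can be checked with a few lines of arithmetic. The only inputs are the u64 counter width and the 1M ticks/sec rate from the table. The 365.25-day year is an assumption made for the conversion.

```python
TICKS = 2 ** 64                    # number of distinct values in a u64 counter
TICKS_PER_SECOND = 1_000_000       # rate assumed in the risk table
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

years_to_overflow = TICKS / TICKS_PER_SECOND / SECONDS_PER_YEAR
print(f"{years_to_overflow:,.0f} years until a single counter wraps")
```

The result is on the order of half a million years, so counter overflow can be ruled out as a practical concern even at a tick rate far above anything a file sync workload would generate.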
Open Questions (Resolved)
All open questions have been resolved. See Fleet Deployment Guide for full details.
- NATS access path (resolved): NATS leaf node on yoga (recommended). Provides lowest latency for LAN operations and offline resilience. NodePort and WireGuard documented as alternatives. See Fleet Deployment Guide, Section 1.
- Credential distribution (resolved): SOPS+age per-host encrypted secrets with sops-nix for NixOS hosts, env file for others. Credential precedence: env vars > SOPS > KDBX > config file. Rotation procedure documented. See Fleet Deployment Guide, Section 2.
- Automatic daemon startup (resolved): systemd on Linux (NixOS module or manual unit), launchd on macOS (plist at dist/com.tummycrypt.tcfsd.plist). Both configured for auto-start on boot with restart-on-failure. See Fleet Deployment Guide, Section 3.
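The credential precedence (env vars > SOPS > KDBX > config file) amounts to a first-match lookup over an ordered list of sources. The sketch below shows only that ordering logic. It assumes each source has already been read into an optional string, and the source keys and the resolve_credential function are hypothetical, not part of tcfs.

```python
from typing import Optional

# Highest priority first, matching the order given in the RFC.
PRECEDENCE = ("env", "sops", "kdbx", "config")


def resolve_credential(sources: dict[str, Optional[str]]) -> tuple[str, str]:
    """Return (source_name, value) for the highest-priority source that is set."""
    for name in PRECEDENCE:
        value = sources.get(name)
        if value:  # skip sources that are missing, None, or empty
            return name, value
    raise LookupError("no credential source provided a value")
```

Returning the source name alongside the value makes it easy to log which source supplied the credential, which helps when an environment variable unexpectedly shadows a SOPS secret.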
Implementation Notes
- PR #18: Core implementation (vector clocks, device identity, NATS events, git safety, PBT, Nix modules)
- PR #19: Stack wiring (device_id through CLI/daemon/gRPC/MCP, ResolveConflict RPC, NATS publishing)
- Tagged as v0.3.0 on 2026-02-22, all CI green across 9 build targets
- 133 tests passing, 18 proptest properties
Signed-off-by: xoxd