
Brain — MVP ingestion + first customer connection

Status: Approved (architecture confirmed by user 2026-05-03; remaining open questions decided by author).
Date: 2026-05-03
Supersedes scope of: 2026-04-29-brain-design.md (the original spec, which targeted a customer-deployed, public-internet world). This document narrows that spec to the M8trx-managed-service-on-Tailscale reality.

1. What changed since the original spec

The 2026-04-29 spec assumed customer-owned EC2 boxes, public HTTPS ingestion, and per-customer privacy tiers including Haiku-based redaction of free-text summaries. Subsequent clarifications collapse that scope:

  - Customer EC2s are M8trx-owned and M8trx-managed, one per customer org.
  - Ingestion travels over Tailscale only; there is no public surface.
  - Privacy tiers and redaction are dropped; telemetry is metadata-only, with no free-text summaries in v0.1.

The original spec's data model survives essentially intact — minus redaction and privacy tiers — because the (customers, agents, api_keys, events) shape was already correct.

2. Goals

  1. Stand up a brain ingestion server that collects metadata-only telemetry from M8trx agents over Tailscale.
  2. Connect the first real M8trx-owned EC2 (running paperclipai + agent-runtime containers) to it, end-to-end, so we can start answering: engagement per customer, plugin/tool usage mix, cost-to-serve.
  3. Replace the existing dashboard's synthetic data with real aggregates, falling back to synthetic when no data exists.

3. Architecture

                       Tailnet (100.64.0.0/10)
                       ─────────────────────────
                       only path; no public surface
┌──────────────────────────────────────┐         ┌──────────────────────────────────┐
│ Customer EC2 (1 per customer org)    │         │ Brain EC2 (100.72.249.59)        │
│                                      │         │                                  │
│ Tailscale daemon @ host (tag:agent)  │  HTTPS  │ Tailscale daemon (tag:brain)     │
│                                      │ ──────► │                                  │
│ ┌──────────────────────────────────┐ │  POST   │ ┌──────────────────────────────┐ │
│ │ docker-compose:                  │ │ /v1/    │ │ docker-compose:              │ │
│ │  caddy → paperclipai → postgres  │ │ events  │ │  brain-api (Node 22+Express) │ │
│ │  + m8trx-bridge (Node)           │ │         │ │  postgres:16 (JSONB events)  │ │
│ │  + ephemeral agent-runtime/N     │ │         │ └────────────┬─────────────────┘ │
│ │    (Claude Code per task)        │ │         │              │ same DB           │
│ │                                  │ │         │              ▼                   │
│ │ Telemetry sources:               │ │         │   Existing dashboard             │
│ │  • m8trx-claude-isolate wrapper  │ │         │   (live data; synthetic fallback)│
│ │  • Claude Code hooks             │ │         │                                  │
│ │  • m8trx-bridge tap (later)      │ │         │                                  │
│ │  • host systemd heartbeat timer  │ │         │                                  │
│ └──────────────────────────────────┘ │         │                                  │
└──────────────────────────────────────┘         └──────────────────────────────────┘

Components:

| Component | Where | Job |
|---|---|---|
| brain-api | brain EC2, docker-compose | Serves POST /v1/events, GET /v1/healthz, the dashboard rollup endpoints, and the dashboard static files at /. Bound to 100.72.249.59:8080. Bearer auth, JSONB writes. Node 22 + Express + pg. |
| postgres:16 | brain EC2, docker-compose | Single source of truth. JSONB payload column for forward compatibility. |
| Existing dashboard | brain EC2 | Same index.html, served by brain-api at /. JS fetches /v1/dashboard/* for live aggregates and falls back to the existing synthetic numbers when the API returns empty. |
| m8trx-claude-isolate patch | each customer EC2 | ~10-line bash patch: curl POST session.start at the top, EXIT trap for session.end. |
| Claude Code hooks | each customer EC2, in agent-runtime image | settings.json baked into the image; PostToolUse and Stop hooks curl events to the brain. |
| m8trx-bridge tap | each customer EC2 | Phase 3 (deferred from v0.1): instrument the existing Node MCP bridge to emit plugin tool calls. |
| Heartbeat timer | each customer EC2, host systemd | Every 5 min: paperclip uptime, runtime container count, host load, agent versions. |
| Tailscale | both sides | Only network path. ACL: tag:agent → tag:brain:8080 and nothing else. |
| SSM Parameter Store | M8trx control plane (AWS) | Stores BRAIN_API_KEY per customer EC2 plus BRAIN_URL. Cloud-init reads them at boot and writes /etc/m8trx/brain.env. |
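The tag:agent → tag:brain:8080 rule above might look like the following in the tailnet policy file — a HuJSON sketch, assuming the tag:agent and tag:brain tags are already defined under tagOwners:

```
{
  // Agents may reach the brain ingestion port and nothing else;
  // Tailscale ACLs are default-deny, so this is the whole telemetry path.
  "acls": [
    { "action": "accept", "src": ["tag:agent"], "dst": ["tag:brain:8080"] }
  ]
}
```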

4. HTTP contract

POST /v1/events (Tailscale-only)

Authorization: Bearer m8brain_<env>_<32 base32 chars>
Content-Type: application/json

{
  "event_id":   "uuid-v4",
  "ts":         "2026-05-03T14:32:00.123Z",
  "event_type": "session.start",
  "agent_id":   "agent_acme_employee_42",
  "run_id":     "run_8a3b...",
  "payload":    { ... }
}

GET /v1/healthz — returns {"ok": true, "version": "...", "ts": "..."}. Used for the dashboard's "brain online" indicator. No auth required.

GET /v1/dashboard/* (admin-bearer-token-gated, Tailscale-only) — read-side endpoints used by the dashboard. See §7.

POST /admin/customers (admin-bearer-token-gated, Tailscale-only) — register a customer and mint their first API key. See §8.
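As a sketch of the write path (the same shape the §14 smoke test uses), an event could be built and posted with curl like this. The key, agent ID, and run ID are placeholders; uuidgen/Linux /proc availability is assumed, and run_id is omitted for brevity:

```shell
#!/bin/sh
# Sketch: build a v1 event envelope and POST it to the brain.
# BRAIN_URL/BRAIN_API_KEY normally come from /etc/m8trx/brain.env;
# the defaults below are placeholders.
BRAIN_URL="${BRAIN_URL:-http://100.72.249.59:8080}"
BRAIN_API_KEY="${BRAIN_API_KEY:-m8brain_dev_PLACEHOLDER}"

mk_event() {  # mk_event <event_type> <agent_id> <payload-json>
  printf '{"event_id":"%s","ts":"%s","event_type":"%s","agent_id":"%s","payload":%s}\n' \
    "$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)" \
    "$(date -u +%Y-%m-%dT%H:%M:%S.000Z)" \
    "$1" "$2" "$3"
}

# Example envelope for a session.start event:
mk_event session.start agent_m8trx_test '{"paperclip_run_id":"run_local_1"}'

# POST it (uncomment on a box with tailnet access to the brain):
# mk_event session.start agent_m8trx_test '{}' | \
#   curl -sS -X POST "$BRAIN_URL/v1/events" \
#     -H "Authorization: Bearer $BRAIN_API_KEY" \
#     -H "Content-Type: application/json" --data-binary @-
```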

5. Event types (v0.1 — six)

| event_type | Source | Payload |
|---|---|---|
| session.start | m8trx-claude-isolate (top of script) | {paperclip_run_id, agent_kind?, source?} |
| session.end | m8trx-claude-isolate (EXIT trap) | {status: "success"\|"failed"\|"timeout", exit_code, duration_ms} |
| tool_call | Claude Code PostToolUse hook | {tool_name, duration_ms, ok, args_size_bytes?, output_size_bytes?} |
| llm_usage | Claude Code Stop hook | {model, input_tokens, output_tokens, cache_read_tokens?, cache_write_tokens?, cost_cents, total_turns} |
| error | wrapper / hook on failure | {kind, message_120c, stack_hash?} |
| heartbeat | host systemd timer, every 5 min | {paperclip_uptime_s, agent_runtime_count, host_load_1m, agent_versions: {id: version}} |

Out of scope for v0.1, additive later: free-text summaries, per-tool retry counts, queue-depth events, per-LLM-call breakdowns (rolled into Stop), prompt-cache savings $.

6. Postgres schema

create table customers (
  id            text primary key,                 -- 'cust_acme'
  name          text not null,
  created_at    timestamptz not null default now()
);

create table api_keys (
  id            text primary key,                 -- 'key_acme_prod'
  customer_id   text not null references customers(id),
  key_hash      bytea not null unique,            -- sha256 of plaintext
  label         text,
  created_at    timestamptz not null default now(),
  revoked_at    timestamptz
);
create index on api_keys (customer_id) where revoked_at is null;

create table agents (
  id              text primary key,               -- = PAPERCLIP_AGENT_ID, auto-upsert
  customer_id     text not null references customers(id),
  display_name    text,                           -- backfilled later
  first_seen_at   timestamptz not null default now(),
  last_seen_at    timestamptz not null default now(),
  agent_version   text
);
create index on agents (customer_id);

create table events (
  id            bigserial primary key,
  event_id      uuid not null unique,             -- client-supplied; dedupe key
  ts            timestamptz not null,
  ingested_at   timestamptz not null default now(),
  customer_id   text not null,
  agent_id      text not null,
  run_id        text,
  event_type    text not null,
  payload       jsonb not null
);
create index events_customer_ts on events (customer_id, ts desc);
create index events_agent_ts    on events (agent_id, ts desc);
create index events_run         on events (run_id) where run_id is not null;
create index events_type_ts     on events (event_type, ts desc);

JSONB on payload lets us add fields to any event type without migrations. Customer/agent rollups are SQL aggregates; the dashboard becomes ~5 queries.
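As an illustration of the "~5 queries" claim, the overview rollup might reduce to a single aggregate over events. A sketch, not the shipped query; it assumes cost_cents lives on llm_usage payloads as in §5:

```sql
-- Last-24h overview per customer (sketch).
select customer_id,
       count(*) filter (where event_type = 'session.start')  as sessions,
       count(*) filter (where event_type = 'error')          as errors,
       coalesce(sum((payload->>'cost_cents')::int)
                filter (where event_type = 'llm_usage'), 0)  as cost_cents,
       count(distinct agent_id)                              as agents,
       max(ts)                                               as last_seen
from events
where ts > now() - interval '24 hours'
group by customer_id;
```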

Auth model decision (vs original spec): One key per customer, not per agent. Inside Tailscale + M8trx-managed, the blast radius of a leaked customer key is bounded ("M8trx team can spoof agent_id within one customer's data"), and the operational simplicity of one SSM secret per box is significant. Per-agent keys are an additive change later.
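The hashing contract behind api_keys.key_hash can be sketched in shell: the server stores sha256(plaintext) and recomputes it per request. bin/mint-key.js does the real insert; sha256sum availability is assumed, and the hex digest shown here corresponds to the raw bytes stored in the bytea column:

```shell
#!/bin/sh
# Sketch: how a bearer key maps to the stored api_keys.key_hash value.
hash_key() {  # sha256 hex digest of a plaintext bearer key
  printf '%s' "$1" | sha256sum | awk '{print $1}'
}

key="m8brain_dev_EXAMPLEKEY"      # placeholder plaintext
stored=$(hash_key "$key")         # what lands in key_hash at mint time
presented=$(hash_key "$key")      # recomputed from the Authorization header

[ "$stored" = "$presented" ] && echo "auth ok"
```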

7. Dashboard wiring

Read endpoints on brain-api (admin-bearer-gated, Tailscale-only):

| Endpoint | Returns |
|---|---|
| GET /v1/dashboard/overview | Per-customer last-24h: session count, error count, total cost cents, distinct agent count, last-seen timestamp. |
| GET /v1/dashboard/customers | Customer list with cumulative counts and tier-style buckets. |
| GET /v1/dashboard/agents?customer_id=... | Per-agent breakdown for a customer. |
| GET /v1/dashboard/tools?window=24h | Tool-call frequency and average duration, optionally filtered by customer. |
| GET /v1/dashboard/health | Brain ingestion lag, oldest unflushed event, total events ingested in the last hour. |

The existing dashboard/index.html JavaScript is updated to call these endpoints with fetch(). When any endpoint returns an empty result, the dashboard falls back to its existing hard-coded synthetic numbers, so the demo view never breaks.

8. Provisioning / onboarding flow

For the first customer, manual is fine. For the second customer onward, a POST /admin/customers endpoint is the API surface used by Terraform.

v0.1 (manual, applied to the first real customer):

  1. Operator runs node /home/ubuntu/brain/server/bin/mint-key.js cust_<slug> "<Display Name>" on the brain box. Output: a one-time plaintext API key, e.g. m8brain_dev_AB3F.... The script inserts a customers row and an api_keys row with the sha256 hash.
  2. Operator stores the plaintext at /m8trx/<customer-slug>/brain_api_key in SSM Parameter Store (SecureString). Plaintext does not exist anywhere else; rotation = mint new + update SSM + revoke old.
  3. Customer EC2's cloud-init reads SSM at boot, writes /etc/m8trx/brain.env:
    BRAIN_URL=http://brain.tailnet:8080
    BRAIN_API_KEY=m8brain_dev_AB3F...
    
  4. The wrapper (m8trx-claude-isolate) and the Claude Code hook script both source /etc/m8trx/brain.env.
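The SSM-to-brain.env step above could be sketched as follows. The aws CLI, an instance profile allowing ssm:GetParameter, and the customer slug are assumptions; the helper name is illustrative:

```shell
#!/bin/sh
# Sketch of the cloud-init step: fetch the key from SSM, write brain.env.
slug="acme"                       # placeholder customer slug

write_brain_env() {  # write_brain_env <url> <key> <dest>
  umask 077                       # the key is a secret; file lands 0600
  printf 'BRAIN_URL=%s\nBRAIN_API_KEY=%s\n' "$1" "$2" > "$3"
}

# On a real box (uncomment):
# key=$(aws ssm get-parameter --name "/m8trx/$slug/brain_api_key" \
#         --with-decryption --query Parameter.Value --output text)
# write_brain_env "http://brain.tailnet:8080" "$key" /etc/m8trx/brain.env

write_brain_env "http://brain.tailnet:8080" "m8brain_dev_PLACEHOLDER" ./brain.env
cat ./brain.env
```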

v0.2 (Terraform-driven, additive):

POST /admin/customers body: {customer_id, name}. Returns {api_key_plaintext} exactly once. The Terraform module's local-exec calls this and writes the result to SSM. No human in the loop.

9. Instrumentation points

Three taps, additive, hitting the same /v1/events:

9.1 m8trx-claude-isolate (Phase B.1)

Bash wrapper, ~10-line patch; the unified diff lives in agent-artifacts/m8trx-claude-isolate.patch.
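A hedged sketch of the patch shape, not the shipped diff: session.start at the top of the wrapper, session.end from an EXIT trap. Helper names are illustrative, run_id wiring is elided, and GNU date is assumed for millisecond timestamps:

```shell
#!/bin/sh
# Sketch of the m8trx-claude-isolate instrumentation.
[ -f /etc/m8trx/brain.env ] && . /etc/m8trx/brain.env
start_ms=$(date +%s%3N)           # GNU date: epoch milliseconds

post_event() {  # post_event <event_type> <payload-json>; must never fail the wrapper
  curl -sS -m 5 -X POST "${BRAIN_URL:-}/v1/events" \
    -H "Authorization: Bearer ${BRAIN_API_KEY:-}" \
    -H "Content-Type: application/json" \
    -d "{\"event_id\":\"$(cat /proc/sys/kernel/random/uuid)\",\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"event_type\":\"$1\",\"agent_id\":\"${PAPERCLIP_AGENT_ID:-unknown}\",\"payload\":$2}" \
    >/dev/null 2>&1 || true
}

on_exit() {
  code=$?
  dur=$(( $(date +%s%3N) - start_ms ))
  [ "$code" -eq 0 ] && status=success || status=failed
  post_event session.end "{\"status\":\"$status\",\"exit_code\":$code,\"duration_ms\":$dur}"
}
trap on_exit EXIT

post_event session.start '{"paperclip_run_id":"run_placeholder"}'
# ... existing claude invocation continues here ...
```

The `|| true` and `-m 5` matter: telemetry must never block or fail the customer-facing agent run.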

9.2 Claude Code hooks (Phase B.2)

settings.json is baked into the agent-runtime image at /etc/claude/settings.json (or wherever Claude Code reads its config); the bundle lives in agent-artifacts/claude-hooks/.
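The hook wiring might look roughly like this — a sketch in Claude Code's hooks config shape, where the matcher, the /usr/local/bin/brain-hook path, and the event-type arguments are all assumptions (brain-hook being the POSIX sh helper from agent-artifacts/claude-hooks/):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "/usr/local/bin/brain-hook tool_call" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "/usr/local/bin/brain-hook llm_usage" }
        ]
      }
    ]
  }
}
```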

9.3 m8trx-bridge tap (deferred from v0.1)

Adds an after-hook to services/m8trx-bridge/server.js that POSTs a tool_call event tagged source: "plugin" for every dispatched plugin call. Captures the email/telegram/gdrive/imessage/memory plugin invocations that are invisible to Claude Code.

10. Heartbeat (Phase B.3)

Host-side m8trx-brain-heartbeat.service + .timer (every 5 min); the units and script live in agent-artifacts/heartbeat/. The service script gathers paperclip uptime, agent-runtime container count, host load, and agent versions, and POSTs a single heartbeat event.
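A minimal sketch of that script: build the §5 heartbeat payload and POST it. The collection commands are illustrative assumptions (docker CLI, Linux /proc), and agent_versions is elided:

```shell
#!/bin/sh
# Sketch of m8trx-brain-heartbeat.sh.
heartbeat_payload() {  # heartbeat_payload <uptime_s> <runtime_count> <load_1m>
  printf '{"paperclip_uptime_s":%s,"agent_runtime_count":%s,"host_load_1m":%s}\n' \
    "$1" "$2" "$3"
}

# On a real host (uncomment):
# count=$(docker ps -q --filter "name=agent-runtime" | wc -l | tr -d ' ')
# load=$(cut -d' ' -f1 /proc/loadavg)
# heartbeat_payload "$uptime_s" "$count" "$load" | curl ...   # POST as in §4

# Example payload with fixed inputs:
heartbeat_payload 3600 2 0.42
```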

11. Deferred — explicit YAGNI

Collected from the sections above, all additive later: per-agent API keys (§6), the m8trx-bridge plugin tap (§9.3), Terraform-driven onboarding via POST /admin/customers (§8 v0.2), and the §5 list — free-text summaries, per-tool retry counts, queue-depth events, per-LLM-call breakdowns, prompt-cache savings $.

12. Open risks / known limits

  - One key per customer means a leaked key can spoof agent_id within that customer's data (accepted trade-off; see §6).
  - Plugin tool calls dispatched through m8trx-bridge are invisible to telemetry until the Phase 3 tap lands (§9.3).

13. File layout (this repo)

brain/
├── server/                           # NEW — Phase A
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── index.js                  # Express app
│   │   ├── routes/events.js          # POST /v1/events
│   │   ├── routes/dashboard.js       # GET /v1/dashboard/*
│   │   ├── routes/admin.js           # POST /admin/customers (v0.2)
│   │   ├── routes/health.js          # GET /v1/healthz
│   │   ├── db.js                     # pg pool
│   │   └── auth.js                   # bearer → customer_id
│   ├── sql/
│   │   └── 001_init.sql              # schema in §6
│   └── bin/
│       └── mint-key.js               # operator CLI
├── dashboard/
│   └── index.html                    # MODIFIED — fetch real data, fall back to synthetic
├── agent-artifacts/                  # NEW — Phase B
│   ├── m8trx-claude-isolate.patch    # unified diff against M8trxAgent
│   ├── claude-hooks/
│   │   ├── settings.json             # for agent-runtime image
│   │   └── brain-hook                # POSIX sh helper
│   ├── heartbeat/
│   │   ├── m8trx-brain-heartbeat.service
│   │   ├── m8trx-brain-heartbeat.timer
│   │   └── m8trx-brain-heartbeat.sh
│   └── cloud-init-snippet.yaml
└── docs/
    ├── superpowers/specs/2026-05-03-brain-mvp-ingestion-design.md  # this file
    └── runbook-connect-customer-ec2.md  # NEW — Phase C

14. Definition of done for v0.1

  1. Brain ingestion server running in docker-compose on this EC2, bound to 100.72.249.59:8080, no public surface.
  2. Postgres schema initialised; first customer (cust_m8trx_test) seeded with one API key.
  3. Curl-driven smoke test sends a realistic burst of all six event types and observes them in Postgres + the dashboard.
  4. Dashboard fetches live data; falls back to synthetic when empty so the demo view never breaks.
  5. Agent-side artifacts produced: unified diff against m8trx-claude-isolate, Claude Code hooks bundle, heartbeat units, cloud-init snippet.
  6. Deployment runbook in docs/runbook-connect-customer-ec2.md walks through the manual steps to wire up the first real customer EC2.
  7. Everything committed to this repo. M8trxAgent repo is not modified by this work; the patch lives here for the human operator to apply on a branch.