
Brain — MVP ingestion + first customer connection

Status: Approved (architecture confirmed by user 2026-05-03; remaining open questions decided by author).
Date: 2026-05-03
Supersedes scope of: 2026-04-29-brain-design.md (the original spec, which targeted a customer-deployed, public-internet world). This document narrows that spec to the M8trx-managed-service-on-Tailscale reality.

1. What changed since the original spec

The 2026-04-29 spec assumed customer-owned EC2 boxes, public HTTPS ingestion, and per-customer privacy tiers including Haiku-based redaction of free-text summaries. Subsequent clarifications collapse that scope:

  - Customer EC2s are M8trx-owned and M8trx-managed, one per customer org.
  - Ingestion travels over Tailscale only; there is no public surface.
  - Privacy tiers and redaction are dropped; telemetry is metadata-only, with no free-text summaries in v0.1.

The original spec's data model survives essentially intact — minus redaction and privacy tiers — because the (customers, agents, api_keys, events) shape was already correct.

2. Goals

  1. Stand up a brain ingestion server that collects metadata-only telemetry from M8trx agents over Tailscale.
  2. Connect the first real M8trx-owned EC2 (running paperclipai + agent-runtime containers) to it, end-to-end, so we can start answering: engagement per customer, plugin/tool usage mix, cost-to-serve.
  3. Replace the existing dashboard's synthetic data with real aggregates, falling back to synthetic when no data exists.

3. Architecture

                       Tailnet (100.64.0.0/10)
                       ─────────────────────────
                       only path; no public surface
┌──────────────────────────────────────┐         ┌──────────────────────────────────┐
│ Customer EC2 (1 per customer org)    │         │ Brain EC2 (100.72.249.59)        │
│                                      │         │                                  │
│ Tailscale daemon @ host (tag:agent)  │  HTTPS  │ Tailscale daemon (tag:brain)     │
│                                      │ ──────► │                                  │
│ ┌──────────────────────────────────┐ │  POST   │ ┌──────────────────────────────┐ │
│ │ docker-compose:                  │ │ /v1/    │ │ docker-compose:              │ │
│ │  caddy → paperclipai → postgres  │ │ events  │ │  brain-api (Node 22+Express) │ │
│ │  + m8trx-bridge (Node)           │ │         │ │  postgres:16 (JSONB events)  │ │
│ │  + ephemeral agent-runtime/N     │ │         │ └────────────┬─────────────────┘ │
│ │    (Claude Code per task)        │ │         │              │ same DB           │
│ │                                  │ │         │              ▼                   │
│ │ Telemetry sources:               │ │         │   Existing dashboard             │
│ │  • m8trx-claude-isolate wrapper  │ │         │   (live data; synthetic fallback)│
│ │  • Claude Code hooks             │ │         │                                  │
│ │  • m8trx-bridge tap (later)      │ │         │                                  │
│ │  • host systemd heartbeat timer  │ │         │                                  │
│ └──────────────────────────────────┘ │         │                                  │
└──────────────────────────────────────┘         └──────────────────────────────────┘

Components:

| Component | Where | Job |
|---|---|---|
| brain-api | brain EC2, docker-compose | Serves POST /v1/events, GET /v1/healthz, the dashboard rollup endpoints, and the dashboard static files at /. Bound to 100.72.249.59:8080. Bearer auth, JSONB writes. Node 22 + Express + pg. |
| postgres:16 | brain EC2, docker-compose | Single source of truth. JSONB payload column for forward compatibility. |
| Existing dashboard | brain EC2 | Same index.html, served by brain-api at /. JS fetches /v1/dashboard/* for live aggregates and falls back to the existing synthetic numbers when the API returns empty. |
| m8trx-claude-isolate patch | each customer EC2 | ~10-line bash patch: curl POST session.start at the top, EXIT trap for session.end. |
| Claude Code hooks | each customer EC2, in agent-runtime image | settings.json baked into the image; PostToolUse and Stop hooks curl events to the brain. |
| m8trx-bridge tap | each customer EC2 | Phase 3 (deferred from v0.1): instrument the existing Node MCP bridge to emit plugin tool calls. |
| Heartbeat timer | each customer EC2, host systemd | Every 5 min: paperclip uptime, runtime container count, host load, agent versions. |
| Tailscale | both sides | Only network path. ACL: tag:agent → tag:brain:8080 and nothing else. |
| SSM Parameter Store | M8trx control plane (AWS) | Stores BRAIN_API_KEY per customer EC2 plus BRAIN_URL. Cloud-init reads them at boot and writes /etc/m8trx/brain.env. |
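The tag:agent → tag:brain:8080 rule above might look like the following in the tailnet policy file — a HuJSON sketch, assuming the tag:agent and tag:brain tags are already defined under tagOwners:

```
{
  // Agents may reach the brain ingestion port and nothing else;
  // Tailscale ACLs are default-deny, so this is the whole telemetry path.
  "acls": [
    { "action": "accept", "src": ["tag:agent"], "dst": ["tag:brain:8080"] }
  ]
}
```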

4. HTTP contract

POST /v1/events (Tailscale-only)

Authorization: Bearer m8brain_<env>_<32 base32 chars>
Content-Type: application/json

{
  "event_id":   "uuid-v4",
  "ts":         "2026-05-03T14:32:00.123Z",
  "event_type": "session.start",
  "agent_id":   "agent_acme_employee_42",
  "run_id":     "run_8a3b...",
  "payload":    { ... }
}

GET /v1/healthz — returns {"ok": true, "version": "...", "ts": "..."}. Used for the dashboard's "brain online" indicator. No auth required.

GET /v1/dashboard/* (admin-bearer-token-gated, Tailscale-only) — read-side endpoints used by the dashboard. See §7.

POST /admin/customers (admin-bearer-token-gated, Tailscale-only) — register a customer and mint their first API key. See §8.
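As a sketch of the write path (the same shape the §14 smoke test uses), an event could be built and posted with curl like this. The key, agent ID, and run ID are placeholders; uuidgen/Linux /proc availability is assumed, and run_id is omitted for brevity:

```shell
#!/bin/sh
# Sketch: build a v1 event envelope and POST it to the brain.
# BRAIN_URL/BRAIN_API_KEY normally come from /etc/m8trx/brain.env;
# the defaults below are placeholders.
BRAIN_URL="${BRAIN_URL:-http://100.72.249.59:8080}"
BRAIN_API_KEY="${BRAIN_API_KEY:-m8brain_dev_PLACEHOLDER}"

mk_event() {  # mk_event <event_type> <agent_id> <payload-json>
  printf '{"event_id":"%s","ts":"%s","event_type":"%s","agent_id":"%s","payload":%s}\n' \
    "$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)" \
    "$(date -u +%Y-%m-%dT%H:%M:%S.000Z)" \
    "$1" "$2" "$3"
}

# Example envelope for a session.start event:
mk_event session.start agent_m8trx_test '{"paperclip_run_id":"run_local_1"}'

# POST it (uncomment on a box with tailnet access to the brain):
# mk_event session.start agent_m8trx_test '{}' | \
#   curl -sS -X POST "$BRAIN_URL/v1/events" \
#     -H "Authorization: Bearer $BRAIN_API_KEY" \
#     -H "Content-Type: application/json" --data-binary @-
```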

5. Event types (v0.1 — six)

| event_type | Source | Payload |
|---|---|---|
| session.start | m8trx-claude-isolate (top of script) | {paperclip_run_id, agent_kind?, source?} |
| session.end | m8trx-claude-isolate (EXIT trap) | {status: "success"\|"failed"\|"timeout", exit_code, duration_ms} |
| tool_call | Claude Code PostToolUse hook | {tool_name, duration_ms, ok, args_size_bytes?, output_size_bytes?} |
| llm_usage | Claude Code Stop hook | {model, input_tokens, output_tokens, cache_read_tokens?, cache_write_tokens?, cost_cents, total_turns} |
| error | wrapper / hook on failure | {kind, message_120c, stack_hash?} |
| heartbeat | host systemd timer, every 5 min | {paperclip_uptime_s, agent_runtime_count, host_load_1m, agent_versions: {id: version}} |

Out of scope for v0.1, additive later: free-text summaries, per-tool retry counts, queue-depth events, per-LLM-call breakdowns (rolled into Stop), prompt-cache savings $.

6. Postgres schema

create table customers (
  id            text primary key,                 -- 'cust_acme'
  name          text not null,
  created_at    timestamptz not null default now()
);

create table api_keys (
  id            text primary key,                 -- 'key_acme_prod'
  customer_id   text not null references customers(id),
  key_hash      bytea not null unique,            -- sha256 of plaintext
  label         text,
  created_at    timestamptz not null default now(),
  revoked_at    timestamptz
);
create index on api_keys (customer_id) where revoked_at is null;

create table agents (
  id              text primary key,               -- = PAPERCLIP_AGENT_ID, auto-upsert
  customer_id     text not null references customers(id),
  display_name    text,                           -- backfilled later
  first_seen_at   timestamptz not null default now(),
  last_seen_at    timestamptz not null default now(),
  agent_version   text
);
create index on agents (customer_id);

create table events (
  id            bigserial primary key,
  event_id      uuid not null unique,             -- client-supplied; dedupe key
  ts            timestamptz not null,
  ingested_at   timestamptz not null default now(),
  customer_id   text not null,
  agent_id      text not null,
  run_id        text,
  event_type    text not null,
  payload       jsonb not null
);
create index events_customer_ts on events (customer_id, ts desc);
create index events_agent_ts    on events (agent_id, ts desc);
create index events_run         on events (run_id) where run_id is not null;
create index events_type_ts     on events (event_type, ts desc);

JSONB on payload lets us add fields to any event type without migrations. Customer/agent rollups are SQL aggregates; the dashboard becomes ~5 queries.
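As an illustration of the "~5 queries" claim, the overview rollup might reduce to a single aggregate over events. A sketch, not the shipped query; it assumes cost_cents lives on llm_usage payloads as in §5:

```sql
-- Last-24h overview per customer (sketch).
select customer_id,
       count(*) filter (where event_type = 'session.start')  as sessions,
       count(*) filter (where event_type = 'error')          as errors,
       coalesce(sum((payload->>'cost_cents')::int)
                filter (where event_type = 'llm_usage'), 0)  as cost_cents,
       count(distinct agent_id)                              as agents,
       max(ts)                                               as last_seen
from events
where ts > now() - interval '24 hours'
group by customer_id;
```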

Auth model decision (vs original spec): One key per customer, not per agent. Inside Tailscale + M8trx-managed, the blast radius of a leaked customer key is bounded ("M8trx team can spoof agent_id within one customer's data"), and the operational simplicity of one SSM secret per box is significant. Per-agent keys are an additive change later.
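The hashing contract behind api_keys.key_hash can be sketched in shell: the server stores sha256(plaintext) and recomputes it per request. bin/mint-key.js does the real insert; sha256sum availability is assumed, and the hex digest shown here corresponds to the raw bytes stored in the bytea column:

```shell
#!/bin/sh
# Sketch: how a bearer key maps to the stored api_keys.key_hash value.
hash_key() {  # sha256 hex digest of a plaintext bearer key
  printf '%s' "$1" | sha256sum | awk '{print $1}'
}

key="m8brain_dev_EXAMPLEKEY"      # placeholder plaintext
stored=$(hash_key "$key")         # what lands in key_hash at mint time
presented=$(hash_key "$key")      # recomputed from the Authorization header

[ "$stored" = "$presented" ] && echo "auth ok"
```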

7. Dashboard wiring

Read endpoints on brain-api (admin-bearer-gated, Tailscale-only):

| Endpoint | Returns |
|---|---|
| GET /v1/dashboard/overview | Per-customer last-24h: session count, error count, total cost cents, distinct agent count, last-seen timestamp. |
| GET /v1/dashboard/customers | Customer list with cumulative counts and tier-style buckets. |
| GET /v1/dashboard/agents?customer_id=... | Per-agent breakdown for a customer. |
| GET /v1/dashboard/tools?window=24h | Tool-call frequency and average duration, optionally filtered by customer. |
| GET /v1/dashboard/health | Brain ingestion lag, oldest unflushed event, total events ingested in the last hour. |

The existing dashboard/index.html JavaScript is updated to call these endpoints with fetch(). When any endpoint returns an empty result, the dashboard falls back to its existing hard-coded synthetic numbers, so the demo view never breaks.

8. Provisioning / onboarding flow

For the first customer, manual is fine. For the second customer onward, a POST /admin/customers endpoint is the API surface used by Terraform.

v0.1 (manual, applied to the first real customer):

  1. Operator runs node /home/ubuntu/brain/server/bin/mint-key.js cust_<slug> "<Display Name>" on the brain box. Output: a one-time plaintext API key, e.g. m8brain_dev_AB3F.... The script inserts a customers row and an api_keys row with the sha256 hash.
  2. Operator stores the plaintext at /m8trx/<customer-slug>/brain_api_key in SSM Parameter Store (SecureString). Plaintext does not exist anywhere else; rotation = mint new + update SSM + revoke old.
  3. Customer EC2's cloud-init reads SSM at boot, writes /etc/m8trx/brain.env:
    BRAIN_URL=http://brain.tailnet:8080
    BRAIN_API_KEY=m8brain_dev_AB3F...
    
  4. The wrapper (m8trx-claude-isolate) and the Claude Code hook script both source /etc/m8trx/brain.env.
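The SSM-to-brain.env step above could be sketched as follows. The aws CLI, an instance profile allowing ssm:GetParameter, and the customer slug are assumptions; the helper name is illustrative:

```shell
#!/bin/sh
# Sketch of the cloud-init step: fetch the key from SSM, write brain.env.
slug="acme"                       # placeholder customer slug

write_brain_env() {  # write_brain_env <url> <key> <dest>
  umask 077                       # the key is a secret; file lands 0600
  printf 'BRAIN_URL=%s\nBRAIN_API_KEY=%s\n' "$1" "$2" > "$3"
}

# On a real box (uncomment):
# key=$(aws ssm get-parameter --name "/m8trx/$slug/brain_api_key" \
#         --with-decryption --query Parameter.Value --output text)
# write_brain_env "http://brain.tailnet:8080" "$key" /etc/m8trx/brain.env

write_brain_env "http://brain.tailnet:8080" "m8brain_dev_PLACEHOLDER" ./brain.env
cat ./brain.env
```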

v0.2 (Terraform-driven, additive):

POST /admin/customers body: {customer_id, name}. Returns {api_key_plaintext} exactly once. The Terraform module's local-exec calls this and writes the result to SSM. No human in the loop.

9. Instrumentation points

Three taps, additive, hitting the same /v1/events:

9.1 m8trx-claude-isolate (Phase B.1)

Bash wrapper, ~10-line patch; the unified diff lives in agent-artifacts/m8trx-claude-isolate.patch.
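A hedged sketch of the patch shape, not the shipped diff: session.start at the top of the wrapper, session.end from an EXIT trap. Helper names are illustrative, run_id wiring is elided, and GNU date is assumed for millisecond timestamps:

```shell
#!/bin/sh
# Sketch of the m8trx-claude-isolate instrumentation.
[ -f /etc/m8trx/brain.env ] && . /etc/m8trx/brain.env
start_ms=$(date +%s%3N)           # GNU date: epoch milliseconds

post_event() {  # post_event <event_type> <payload-json>; must never fail the wrapper
  curl -sS -m 5 -X POST "${BRAIN_URL:-}/v1/events" \
    -H "Authorization: Bearer ${BRAIN_API_KEY:-}" \
    -H "Content-Type: application/json" \
    -d "{\"event_id\":\"$(cat /proc/sys/kernel/random/uuid)\",\"ts\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"event_type\":\"$1\",\"agent_id\":\"${PAPERCLIP_AGENT_ID:-unknown}\",\"payload\":$2}" \
    >/dev/null 2>&1 || true
}

on_exit() {
  code=$?
  dur=$(( $(date +%s%3N) - start_ms ))
  [ "$code" -eq 0 ] && status=success || status=failed
  post_event session.end "{\"status\":\"$status\",\"exit_code\":$code,\"duration_ms\":$dur}"
}
trap on_exit EXIT

post_event session.start '{"paperclip_run_id":"run_placeholder"}'
# ... existing claude invocation continues here ...
```

The `|| true` and `-m 5` matter: telemetry must never block or fail the customer-facing agent run.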

9.2 Claude Code hooks (Phase B.2)

settings.json is baked into the agent-runtime image at /etc/claude/settings.json (or wherever Claude Code reads its config); the bundle lives in agent-artifacts/claude-hooks/.
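The hook wiring might look roughly like this — a sketch in Claude Code's hooks config shape, where the matcher, the /usr/local/bin/brain-hook path, and the event-type arguments are all assumptions (brain-hook being the POSIX sh helper from agent-artifacts/claude-hooks/):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "/usr/local/bin/brain-hook tool_call" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "/usr/local/bin/brain-hook llm_usage" }
        ]
      }
    ]
  }
}
```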

9.3 m8trx-bridge tap (deferred from v0.1)

Adds an after-hook to services/m8trx-bridge/server.js that POSTs a tool_call event tagged source: "plugin" for every dispatched plugin call. Captures the email/telegram/gdrive/imessage/memory plugin invocations that are invisible to Claude Code.

10. Heartbeat (Phase B.3)

Host-side m8trx-brain-heartbeat.service + .timer (every 5 min); the units and script live in agent-artifacts/heartbeat/. The service script gathers paperclip uptime, agent-runtime container count, host load, and agent versions, and POSTs a single heartbeat event.
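A minimal sketch of that script: build the §5 heartbeat payload and POST it. The collection commands are illustrative assumptions (docker CLI, Linux /proc), and agent_versions is elided:

```shell
#!/bin/sh
# Sketch of m8trx-brain-heartbeat.sh.
heartbeat_payload() {  # heartbeat_payload <uptime_s> <runtime_count> <load_1m>
  printf '{"paperclip_uptime_s":%s,"agent_runtime_count":%s,"host_load_1m":%s}\n' \
    "$1" "$2" "$3"
}

# On a real host (uncomment):
# count=$(docker ps -q --filter "name=agent-runtime" | wc -l | tr -d ' ')
# load=$(cut -d' ' -f1 /proc/loadavg)
# heartbeat_payload "$uptime_s" "$count" "$load" | curl ...   # POST as in §4

# Example payload with fixed inputs:
heartbeat_payload 3600 2 0.42
```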

11. Deferred — explicit YAGNI

Collected from the sections above, all additive later: per-agent API keys (§6), the m8trx-bridge plugin tap (§9.3), Terraform-driven onboarding via POST /admin/customers (§8 v0.2), and the §5 list — free-text summaries, per-tool retry counts, queue-depth events, per-LLM-call breakdowns, prompt-cache savings $.

12. Open risks / known limits

  - One key per customer means a leaked key can spoof agent_id within that customer's data (accepted trade-off; see §6).
  - Plugin tool calls dispatched through m8trx-bridge are invisible to telemetry until the Phase 3 tap lands (§9.3).

13. File layout (this repo)

brain/
├── server/                           # NEW — Phase A
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── index.js                  # Express app
│   │   ├── routes/events.js          # POST /v1/events
│   │   ├── routes/dashboard.js       # GET /v1/dashboard/*
│   │   ├── routes/admin.js           # POST /admin/customers (v0.2)
│   │   ├── routes/health.js          # GET /v1/healthz
│   │   ├── db.js                     # pg pool
│   │   └── auth.js                   # bearer → customer_id
│   ├── sql/
│   │   └── 001_init.sql              # schema in §6
│   └── bin/
│       └── mint-key.js               # operator CLI
├── dashboard/
│   └── index.html                    # MODIFIED — fetch real data, fall back to synthetic
├── agent-artifacts/                  # NEW — Phase B
│   ├── m8trx-claude-isolate.patch    # unified diff against M8trxAgent
│   ├── claude-hooks/
│   │   ├── settings.json             # for agent-runtime image
│   │   └── brain-hook                # POSIX sh helper
│   ├── heartbeat/
│   │   ├── m8trx-brain-heartbeat.service
│   │   ├── m8trx-brain-heartbeat.timer
│   │   └── m8trx-brain-heartbeat.sh
│   └── cloud-init-snippet.yaml
└── docs/
    ├── superpowers/specs/2026-05-03-brain-mvp-ingestion-design.md  # this file
    └── runbook-connect-customer-ec2.md  # NEW — Phase C

14. Definition of done for v0.1

  1. Brain ingestion server running in docker-compose on this EC2, bound to 100.72.249.59:8080, no public surface.
  2. Postgres schema initialised; first customer (cust_m8trx_test) seeded with one API key.
  3. Curl-driven smoke test sends a realistic burst of all six event types and observes them in Postgres + the dashboard.
  4. Dashboard fetches live data; falls back to synthetic when empty so the demo view never breaks.
  5. Agent-side artifacts produced: unified diff against m8trx-claude-isolate, Claude Code hooks bundle, heartbeat units, cloud-init snippet.
  6. Deployment runbook in docs/runbook-connect-customer-ec2.md walks through the manual steps to wire up the first real customer EC2.
  7. Everything committed to this repo. M8trxAgent repo is not modified by this work; the patch lives here for the human operator to apply on a branch.