Status: Approved (architecture confirmed by user 2026-05-03; remaining open questions decided by author).
Date: 2026-05-03
Supersedes scope of: 2026-04-29-brain-design.md (the original spec, which targeted a customer-deployed-public-internet world). This document narrows that spec to the M8trx-managed-service-on-Tailscale reality.
The 2026-04-29 spec assumed customer-owned EC2 boxes, public HTTPS ingestion, and per-customer privacy tiers including Haiku-based redaction of free-text summaries. Subsequent clarifications collapse that scope:
- Redaction, privacy tiers, and the `summary_raw`/`summary_redacted` columns are all dropped.

The original spec's data model survives essentially intact, minus redaction and privacy tiers, because the (`customers`, `agents`, `api_keys`, `events`) shape was already correct.
                                     Tailnet (100.64.0.0/10)
                                    ─────────────────────────
                                  only path; no public surface
    ┌──────────────────────────────────────┐           ┌────────────────────────────────────┐
    │ Customer EC2 (1 per customer org)    │           │ Brain EC2 (this box, 100.72.249.59)│
    │                                      │           │                                    │
    │ Tailscale daemon @ host (tag:agent)  │   HTTPS   │ Tailscale daemon (tag:brain)       │
    │                                      │  ──────►  │                                    │
    │ ┌──────────────────────────────────┐ │   POST    │ ┌────────────────────────────────┐ │
    │ │ docker-compose:                  │ │   /v1/    │ │ docker-compose:                │ │
    │ │ caddy → paperclipai → postgres   │ │   events  │ │ brain-api (Node 22+Express)    │ │
    │ │ + m8trx-bridge (Node)            │ │           │ │ postgres:16 (JSONB events)     │ │
    │ │ + ephemeral agent-runtime/N      │ │           │ └─────────────┬──────────────────┘ │
    │ │     (Claude Code per task)       │ │           │               │ same DB            │
    │ │                                  │ │           │               ▼                    │
    │ │ Telemetry sources:               │ │           │       Existing dashboard           │
    │ │  • m8trx-claude-isolate wrapper  │ │           │  (live data; synthetic fallback)   │
    │ │  • Claude Code hooks             │ │           │                                    │
    │ │  • m8trx-bridge tap (later)      │ │           │                                    │
    │ │  • host systemd heartbeat timer  │ │           │                                    │
    │ └──────────────────────────────────┘ │           │                                    │
    └──────────────────────────────────────┘           └────────────────────────────────────┘
Components:
| Component | Where | Job |
|---|---|---|
| `brain-api` | brain EC2, docker-compose | `POST /v1/events`, `GET /v1/healthz`, dashboard rollup endpoints, and serves the dashboard static files at `/`. Bound to `100.72.249.59:8080`. Bearer auth, JSONB writes. Node 22 + Express + pg. |
| `postgres:16` | brain EC2, docker-compose | Single source of truth. JSONB `payload` column for forward-compat. |
| Existing dashboard | brain EC2 | Same `index.html`; served by `brain-api` at `/`. JS fetches `/v1/dashboard/*` for live aggregates and falls back to existing synthetic numbers when the API returns empty. |
| `m8trx-claude-isolate` patch | each customer EC2 | ~10-line bash patch: curl POST `session.start` at top, EXIT trap for `session.end`. |
| Claude Code hooks | each customer EC2, in `agent-runtime` image | `settings.json` baked into image; `PostToolUse` and `Stop` hooks curl events to brain. |
| `m8trx-bridge` tap | each customer EC2 | Phase 3 (deferred from v0.1): instrument the existing Node MCP bridge to emit plugin tool calls. |
| Heartbeat timer | each customer EC2, host systemd | Every 5 min: paperclip uptime, runtime container count, host load, agent versions. |
| Tailscale | both sides | Only network path. ACL: `tag:agent` → `tag:brain:8080` and nothing else. |
| SSM Parameter Store | M8trx control plane (AWS) | Stores `BRAIN_API_KEY` per customer EC2 + `BRAIN_URL`. Cloud-init reads at boot, writes `/etc/m8trx/brain.env`. |
    POST /v1/events        (Tailscale-only)
    Authorization: Bearer m8brain_<env>_<32 base32 chars>
    Content-Type: application/json

    {
      "event_id":   "uuid-v4",
      "ts":         "2026-05-03T14:32:00.123Z",
      "event_type": "session.start",
      "agent_id":   "agent_acme_employee_42",
      "run_id":     "run_8a3b...",
      "payload":    { ... }
    }
- `customer_id` is never in the body; it is resolved server-side from the bearer token.
- `agent_id` comes from the body and is auto-upserted into `agents` on first sight.
- Ingestion is idempotent on `event_id` (unique constraint). Client retries are safe.
- Responses: 202 accept · 401 bad/missing/revoked token · 400 schema violation (returns `{error, field}`) · 5xx server's problem; client buffers and retries.

`GET /v1/healthz` → `{"ok": true, "version": "...", "ts": "..."}`. Used for the dashboard's "brain online" indicator. No auth required.
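To make the 400 response for `POST /v1/events` concrete, here is a minimal sketch of body validation that returns the `{error, field}` shape described above. The field names come from the request example; the function name `validateEvent`, the exact rules, and the error strings are assumptions for illustration, not the spec's definitive contract.

```javascript
// Sketch of request-body validation for POST /v1/events.
// The rule set below (UUID regex, known event types) is an assumption.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const EVENT_TYPES = new Set([
  'session.start', 'session.end', 'tool_call', 'llm_usage', 'error', 'heartbeat',
]);

// Returns null when the body is acceptable, else {error, field} for a 400.
function validateEvent(body) {
  if (body === null || typeof body !== 'object' || Array.isArray(body)) {
    return { error: 'body must be a JSON object', field: null };
  }
  if (typeof body.event_id !== 'string' || !UUID_RE.test(body.event_id)) {
    return { error: 'must be a UUID', field: 'event_id' };
  }
  if (typeof body.ts !== 'string' || Number.isNaN(Date.parse(body.ts))) {
    return { error: 'must be a parseable timestamp', field: 'ts' };
  }
  if (!EVENT_TYPES.has(body.event_type)) {
    return { error: 'unknown event type', field: 'event_type' };
  }
  if (typeof body.agent_id !== 'string' || body.agent_id.length === 0) {
    return { error: 'must be a non-empty string', field: 'agent_id' };
  }
  if (body.run_id != null && typeof body.run_id !== 'string') {
    return { error: 'must be a string when present', field: 'run_id' };
  }
  if (body.payload === null || typeof body.payload !== 'object' || Array.isArray(body.payload)) {
    return { error: 'must be a JSON object', field: 'payload' };
  }
  return null;
}
```

A route handler would call `validateEvent(req.body)` after authentication and respond 400 with the returned object, or continue to the insert when it returns `null`.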
GET /v1/dashboard/* (admin-bearer-token-gated, Tailscale-only) — read-side endpoints used by the dashboard. See §7.
POST /admin/customers (admin-bearer-token-gated, Tailscale-only) — register a customer and mint their first API key. See §8.
| `event_type` | Source | Payload |
|---|---|---|
| `session.start` | `m8trx-claude-isolate` (top of script) | `{paperclip_run_id, agent_kind?, source?}` |
| `session.end` | `m8trx-claude-isolate` (EXIT trap) | `{status: "success"\|"failed"\|"timeout", exit_code, duration_ms}` |
| `tool_call` | Claude Code `PostToolUse` hook | `{tool_name, duration_ms, ok, args_size_bytes?, output_size_bytes?}` |
| `llm_usage` | Claude Code `Stop` hook | `{model, input_tokens, output_tokens, cache_read_tokens?, cache_write_tokens?, cost_cents, total_turns}` |
| `error` | wrapper / hook on failure | `{kind, message_120c, stack_hash?}` |
| `heartbeat` | host systemd timer every 5 min | `{paperclip_uptime_s, agent_runtime_count, host_load_1m, agent_versions: {id: version}}` |
Out of scope for v0.1, additive later: free-text summaries, per-tool retry counts, queue-depth events, per-LLM-call breakdowns (rolled into Stop), prompt-cache savings $.
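As an illustration of how the `heartbeat` payload in the table could be assembled from raw host readings, the sketch below derives each field from text a script would already have on hand. The heartbeat itself is specified as a host-side shell script; this JavaScript version, and the helper names `parseLoad1m`, `uptimeSeconds`, and `buildHeartbeatPayload`, are hypothetical and only show the arithmetic.

```javascript
// 1-minute load average: the first whitespace-separated field of /proc/loadavg.
function parseLoad1m(loadavgText) {
  const first = loadavgText.trim().split(/\s+/)[0];
  const value = Number.parseFloat(first);
  if (!Number.isFinite(value)) throw new Error('unparseable loadavg');
  return value;
}

// Container uptime in whole seconds from a docker State.StartedAt timestamp.
// Docker prints nanosecond precision, so the fraction is trimmed to milliseconds.
function uptimeSeconds(startedAtIso, nowMs = Date.now()) {
  const started = Date.parse(startedAtIso.replace(/(\.\d{3})\d+/, '$1'));
  if (Number.isNaN(started)) throw new Error('unparseable StartedAt');
  return Math.max(0, Math.floor((nowMs - started) / 1000));
}

// runtimeNames: one container name per entry, as printed by `docker ps`.
function buildHeartbeatPayload({ loadavgText, startedAtIso, runtimeNames, agentVersions, nowMs }) {
  return {
    paperclip_uptime_s: uptimeSeconds(startedAtIso, nowMs),
    agent_runtime_count: runtimeNames.filter((n) => n.trim() !== '').length,
    host_load_1m: parseLoad1m(loadavgText),
    agent_versions: agentVersions,
  };
}
```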
    create table customers (
      id          text primary key,                    -- 'cust_acme'
      name        text not null,
      created_at  timestamptz not null default now()
    );

    create table api_keys (
      id           text primary key,                   -- 'key_acme_prod'
      customer_id  text not null references customers(id),
      key_hash     bytea not null unique,              -- sha256 of plaintext
      label        text,
      created_at   timestamptz not null default now(),
      revoked_at   timestamptz
    );
    create index on api_keys (customer_id) where revoked_at is null;

    create table agents (
      id             text primary key,                 -- = PAPERCLIP_AGENT_ID, auto-upsert
      customer_id    text not null references customers(id),
      display_name   text,                             -- backfilled later
      first_seen_at  timestamptz not null default now(),
      last_seen_at   timestamptz not null default now(),
      agent_version  text
    );
    create index on agents (customer_id);

    create table events (
      id           bigserial primary key,
      event_id     uuid not null unique,               -- client-supplied; dedupe key
      ts           timestamptz not null,
      ingested_at  timestamptz not null default now(),
      customer_id  text not null,
      agent_id     text not null,
      run_id       text,
      event_type   text not null,
      payload      jsonb not null
    );
    create index events_customer_ts on events (customer_id, ts desc);
    create index events_agent_ts    on events (agent_id, ts desc);
    create index events_run         on events (run_id) where run_id is not null;
    create index events_type_ts     on events (event_type, ts desc);
JSONB on payload lets us add fields to any event type without migrations. Customer/agent rollups are SQL aggregates; the dashboard becomes ~5 queries.
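One way the write path could use this schema is sketched below: an agent upsert followed by an event insert that relies on the `event_id` unique constraint for deduplication. The function name `buildIngestQueries` is hypothetical, and the exact SQL is an assumption; the returned `{text, values}` objects are in the query-config shape that node-postgres accepts.

```javascript
// Build the two parameterized statements for one accepted event.
// customerId is whatever the bearer token resolved to, never a body field.
function buildIngestQueries(customerId, ev) {
  const upsertAgent = {
    text: `insert into agents (id, customer_id)
           values ($1, $2)
           on conflict (id) do update set last_seen_at = now()`,
    values: [ev.agent_id, customerId],
  };
  const insertEvent = {
    text: `insert into events
             (event_id, ts, customer_id, agent_id, run_id, event_type, payload)
           values ($1, $2, $3, $4, $5, $6, $7::jsonb)
           on conflict (event_id) do nothing`,
    values: [
      ev.event_id, ev.ts, customerId, ev.agent_id,
      ev.run_id ?? null, ev.event_type, JSON.stringify(ev.payload),
    ],
  };
  return [upsertAgent, insertEvent];
}
```

Because a duplicate `event_id` is silently skipped, a retried POST can still be answered with 202 without creating a second row.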
Auth model decision (vs original spec): One key per customer, not per agent. Inside Tailscale + M8trx-managed, the blast radius of a leaked customer key is bounded ("M8trx team can spoof agent_id within one customer's data"), and the operational simplicity of one SSM secret per box is significant. Per-agent keys are an additive change later.
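A minimal sketch of the one-key-per-customer lookup follows, assuming the server hashes the presented bearer token with sha256 and matches it against `api_keys.key_hash` for non-revoked rows. The helper names `extractBearer`, `hashKey`, and `keyLookupQuery` are hypothetical.

```javascript
const { createHash } = require('node:crypto');

// Extract the token from an Authorization header; null if absent or malformed.
function extractBearer(header) {
  if (typeof header !== 'string') return null;
  const m = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return m ? m[1] : null;
}

// sha256 digest of the plaintext key as a Buffer, matching the bytea key_hash column.
function hashKey(plaintext) {
  return createHash('sha256').update(plaintext, 'utf8').digest();
}

// Lookup statement: only non-revoked keys resolve to a customer_id.
function keyLookupQuery(authorizationHeader) {
  const token = extractBearer(authorizationHeader);
  if (token === null) return null; // caller responds 401
  return {
    text: 'select customer_id from api_keys where key_hash = $1 and revoked_at is null',
    values: [hashKey(token)],
  };
}
```

An empty result set from this query would map to the same 401 as a missing header, so revoked and unknown keys are indistinguishable to the client.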
Read endpoints on brain-api (admin-bearer-gated, Tailscale-only):
| Endpoint | Returns |
|---|---|
| `GET /v1/dashboard/overview` | Per-customer last-24h: session count, error count, total cost cents, distinct agent count, last-seen timestamp. |
| `GET /v1/dashboard/customers` | Customer list with cumulative counts and tier-style buckets. |
| `GET /v1/dashboard/agents?customer_id=...` | Per-agent breakdown for a customer. |
| `GET /v1/dashboard/tools?window=24h` | Tool-call frequency and average duration, optionally filtered by customer. |
| `GET /v1/dashboard/health` | Brain ingestion lag, oldest unflushed event, total events ingested in last hour. |
The existing dashboard/index.html JavaScript is updated to fetch these with fetch(). When any endpoint returns an empty result, the dashboard falls back to its existing hard-coded synthetic numbers, so the demo view never breaks.
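The fallback rule can be sketched as a small helper. This assumes "empty" means `null`, an empty array, or an empty object, and it additionally treats a failed request as empty; both choices are assumptions, as are the names `isEmptyResult`, `pickLiveOrSynthetic`, and `loadPanel`. How the dashboard supplies the admin bearer token is left out.

```javascript
// Treat null/undefined, empty arrays, and empty objects as "empty".
function isEmptyResult(data) {
  if (data == null) return true;
  if (Array.isArray(data)) return data.length === 0;
  if (typeof data === 'object') return Object.keys(data).length === 0;
  return false;
}

// Prefer live data; keep the synthetic numbers when there is nothing to show.
function pickLiveOrSynthetic(live, synthetic) {
  return isEmptyResult(live) ? synthetic : live;
}

// Fetch one dashboard endpoint. Any failure also falls back to synthetic data
// (an assumption beyond the "empty result" rule stated in the text).
async function loadPanel(path, synthetic, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(path);
    if (!res.ok) return synthetic;
    return pickLiveOrSynthetic(await res.json(), synthetic);
  } catch {
    return synthetic;
  }
}
```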
For the first customer, manual is fine. For the second customer onward, a POST /admin/customers endpoint is the API surface used by Terraform.
v0.1 (manual, applied to the first real customer):
1. Run `node /home/ubuntu/brain/server/bin/mint-key.js cust_<slug> "<Display Name>"` on the brain box. Output: a one-time plaintext API key, e.g. `m8brain_dev_AB3F...`. The script inserts a `customers` row and an `api_keys` row with the sha256 hash.
2. Store the key at `/m8trx/<customer-slug>/brain_api_key` in SSM Parameter Store (SecureString). Plaintext does not exist anywhere else; rotation = mint new + update SSM + revoke old.
3. Cloud-init on the customer EC2 writes `/etc/m8trx/brain.env`:

       BRAIN_URL=http://brain.tailnet:8080
       BRAIN_API_KEY=m8brain_dev_AB3F...

4. The wrapper (`m8trx-claude-isolate`) and the Claude Code hook script both source `/etc/m8trx/brain.env`.

v0.2 (Terraform-driven, additive):
POST /admin/customers body: {customer_id, name}. Returns {api_key_plaintext} exactly once. The Terraform module's local-exec calls this and writes the result to SSM. No human in the loop.
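The key-minting step, whether run from the CLI or behind `POST /admin/customers`, could look roughly like the sketch below. It assumes the 32 base32 characters come from 20 random bytes (160 bits encode to exactly 32 characters) and that only the sha256 digest is stored; `base32Encode` and `mintKey` are hypothetical names.

```javascript
const { randomBytes, createHash } = require('node:crypto');

const BASE32_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567'; // RFC 4648

// Encode bytes as unpadded base32.
function base32Encode(bytes) {
  let bits = 0;   // number of buffered bits not yet emitted
  let value = 0;  // buffered bits, right-aligned
  let out = '';
  for (const byte of bytes) {
    value = (value << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
    value &= (1 << bits) - 1; // keep only the leftover bits
  }
  if (bits > 0) out += BASE32_ALPHABET[(value << (5 - bits)) & 31];
  return out;
}

// Returns the one-time plaintext and the digest to store in api_keys.key_hash.
function mintKey(env) {
  const plaintext = `m8brain_${env}_${base32Encode(randomBytes(20))}`;
  const keyHash = createHash('sha256').update(plaintext, 'utf8').digest();
  return { plaintext, keyHash };
}
```

The caller would insert `keyHash` into `api_keys`, print or return `plaintext` once, and never persist it on the brain box.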
Three taps, additive, hitting the same /v1/events:
`m8trx-claude-isolate` (Phase B.1). Bash wrapper, ~10-line patch:

- At the top of the script: emit `session.start` with `event_id` (`uuidgen`), `agent_id=$PAPERCLIP_AGENT_ID`, `run_id=$PAPERCLIP_RUN_ID`. Source `/etc/m8trx/brain.env` for URL/key. Use a wrapper function `_brain_emit` that POSTs via curl with a 2-second timeout and logs failures but never raises.
- `trap` on EXIT: emit `session.end` with status derived from exit code (0 → success, non-zero → failed; SIGKILL/SIGTERM-derived → timeout if duration approaches the runtime limit).

Claude Code hooks. Baked into the `agent-runtime` image at `/etc/claude/settings.json` (or wherever Claude Code reads from):

- `PostToolUse` → run hook helper `brain-hook tool` with stdin = the hook payload Claude Code provides; emits a `tool_call` event with name + duration + ok status.
- `Stop` → run hook helper `brain-hook stop`; emits an `llm_usage` event with model + token counts + cost.
- `/usr/local/bin/brain-hook` is a small Node or POSIX-sh program that reads stdin JSON, formats a brain event, POSTs with a 2-second timeout, swallows all errors, and always exits 0 (telemetry must never block Claude Code).

`m8trx-bridge` tap (deferred from v0.1). Adds to `services/m8trx-bridge/server.js` an after-hook that POSTs a `tool_call` event tagged `source: "plugin"` for every plugin call dispatched. Captures the email/telegram/gdrive/imessage/memory plugin invocations not visible to Claude Code.
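If `brain-hook` were written in Node, its core could be sketched as below. The hook payload field names (`tool_name`, `tool_input`) are assumptions about what Claude Code passes on stdin, and `durationMs`/`ok` are supplied by the caller because how they are measured is not specified here. `toToolCallEvent` and `emit` are hypothetical names.

```javascript
const { randomUUID } = require('node:crypto');

// Build a tool_call event from a PostToolUse hook payload (field names assumed).
function toToolCallEvent(hook, env, { durationMs, ok }) {
  return {
    event_id: randomUUID(),
    ts: new Date().toISOString(),
    event_type: 'tool_call',
    agent_id: env.PAPERCLIP_AGENT_ID,
    run_id: env.PAPERCLIP_RUN_ID ?? null,
    payload: {
      tool_name: hook.tool_name,
      duration_ms: durationMs,
      ok,
      args_size_bytes: Buffer.byteLength(JSON.stringify(hook.tool_input ?? null)),
    },
  };
}

// POST with a 2-second timeout; never throws, so the hook can always exit 0.
async function emit(event, env, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(`${env.BRAIN_URL}/v1/events`, {
      method: 'POST',
      headers: {
        authorization: `Bearer ${env.BRAIN_API_KEY}`,
        'content-type': 'application/json',
      },
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(2000),
    });
    return res.status === 202;
  } catch {
    return false; // swallow network errors and timeouts
  }
}
```

A `brain-hook tool` entry point would read stdin, call `toToolCallEvent`, await `emit`, ignore its result, and exit 0 regardless.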
Host-side m8trx-brain-heartbeat.service + .timer (every 5 min). The service script:
- Sources `/etc/m8trx/brain.env`.
- Reads `paperclipai` container uptime (`docker inspect --format='{{.State.StartedAt}}'`).
- Counts `agent-runtime/*` containers (`docker ps --filter name=agent-runtime --format '{{.Names}}' | wc -l`).
- Reads the first field of `/proc/loadavg`.
- Emits `heartbeat` with `agent_id="_host"` and the metadata above. `_host` is a sentinel `agent_id` used only for heartbeats; it auto-upserts like any other agent, but the dashboard filters it out of agent-level aggregates.

Scope limits for v0.1:

- No `POST /admin/customers` endpoint (manual mint via CLI for the first customer).
- Errors carry only `kind` + 120-char message.
- No client SDK (`m8trx_brain` was in the original spec; the wrapper + hook helper are sufficient, and no per-language SDK is needed because we never instrument paperclipai's Node code directly).

Known risks:

- Auto-upserting `agents` rows on first event means a typo in `PAPERCLIP_AGENT_ID` would pollute the `agents` table. Acceptable risk inside M8trx-managed: it's a janitorial bug, not a security one.
- A leaked key for customer X lets its holder spoof `agent_id` within X. Not a security incident in this trust model; called out for awareness when we eventually expand to per-agent keys.

    brain/
    ├── server/                          # NEW (Phase A)
    │   ├── docker-compose.yml
    │   ├── Dockerfile
    │   ├── package.json
    │   ├── src/
    │   │   ├── index.js                 # Express app
    │   │   ├── routes/events.js         # POST /v1/events
    │   │   ├── routes/dashboard.js      # GET /v1/dashboard/*
    │   │   ├── routes/admin.js          # POST /admin/customers (v0.2)
    │   │   ├── routes/health.js         # GET /v1/healthz
    │   │   ├── db.js                    # pg pool
    │   │   └── auth.js                  # bearer → customer_id
    │   ├── sql/
    │   │   └── 001_init.sql             # schema in §6
    │   └── bin/
    │       └── mint-key.js              # operator CLI
    ├── dashboard/
    │   └── index.html                   # MODIFIED: fetch real data, fall back to synthetic
    ├── agent-artifacts/                 # NEW (Phase B)
    │   ├── m8trx-claude-isolate.patch   # unified diff against M8trxAgent
    │   ├── claude-hooks/
    │   │   ├── settings.json            # for agent-runtime image
    │   │   └── brain-hook               # POSIX sh helper
    │   ├── heartbeat/
    │   │   ├── m8trx-brain-heartbeat.service
    │   │   ├── m8trx-brain-heartbeat.timer
    │   │   └── m8trx-brain-heartbeat.sh
    │   └── cloud-init-snippet.yaml
    ├── docs/
    │   ├── superpowers/specs/2026-05-03-brain-mvp-ingestion-design.md   # this file
    │   └── runbook-connect-customer-ec2.md                              # NEW (Phase C)
- `brain-api` bound to `100.72.249.59:8080`, no public surface.
- A test customer (`cust_m8trx_test`) seeded with one API key.
- Agent artifacts: the `m8trx-claude-isolate` patch, Claude Code hooks bundle, heartbeat units, cloud-init snippet.
- `docs/runbook-connect-customer-ec2.md` walks through the manual steps to wire up the first real customer EC2.