Brain — Telemetry & Usage Analytics for Deployed AI Agents

Status: Draft (MVP scope)
Date: 2026-04-29
Owner: M8TRX.AI platform team

1. Goal & non-goals

Goal. Stand up a phone-home server ("the brain") that collects usage data from hundreds of M8TRX.AI agents deployed on customer EC2 instances, so we can monitor adoption and design product iterations from real usage.

Top questions the brain must answer (priorities, in order):

  1. Engagement / churn risk — "Is customer X using their agent enough to justify renewal?"
  2. Upsell signals — "What is customer X using their agent for most, and where could we expand?"
  3. Cross-customer patterns — "What is common across customers — what should we productize next?"

Cost-to-serve and QA-at-scale are explicit secondary goals: captured by the same data, not the design driver.

Non-goals for MVP (additive later, do not block shipping): real-time stream processing, alerting, and a customer-facing dashboard (the out-of-MVP components in §3), plus Tier-C raw-summary storage (§13).

2. Architecture overview

[Customer EC2]                        [Brain — single AWS region]
┌──────────────────┐                  ┌────────────────────────────┐
│ Agent process    │  HTTPS POST      │  ALB (TLS terminator)      │
│  (Python /       │  /v1/events      │   │                        │
│   Claude Code /  │ ───────────────► │   ▼                        │
│   paperclip)     │  Bearer api_key  │  FastAPI on EC2 (uvicorn)  │
│   + brain SDK    │                  │   │ validate, redact, write│
└──────────────────┘                  │   ▼                        │
                                      │  Postgres (RDS)            │
                                      │   events / agents /        │
                                      │   customers / api_keys     │
                                      └────────────┬───────────────┘
                                                   │ read-only role
                                                   ▼
                                            Metabase (queries)

Single-region, AWS-native, intentionally boring. We can horizontally scale FastAPI by adding another EC2 behind the ALB; we can introduce ClickHouse later when Postgres analytics start to hurt.

3. Components

Brain SDK — buffers and flushes events from inside the agent process; never raises. Tech: Python package m8trx_brain (pip-installable from a private index or git URL).
Ingestion API — POST /v1/events: bearer auth, schema validation, redaction, DB write; stateless. Tech: FastAPI + uvicorn in a Docker container on EC2 (t3.small) behind an ALB.
Redaction pipeline — in-process module that routes raw summaries through Claude Haiku to strip PII based on the customer's privacy tier. Tech: anthropic SDK; Haiku (claude-haiku-4-5).
Postgres (RDS) — primary store; JSONB payload column for schema flexibility during iteration. Tech: RDS Postgres 16, db.t4g.medium, gp3 storage, 7-day automated snapshots.
Metabase — read-only analytics UI for the team. Tech: open-source Metabase on a small EC2 / ECS task; uses a read-only DB role.

Out-of-MVP components (additive later, no design blocker): real-time stream consumer, alerting service, customer-facing dashboard.

4. Event model & schema

4.1 Common envelope

Every event sent by the SDK has the same outer shape:

{
  "event_id": "uuid-v4",
  "ts": "2026-04-29T10:30:00Z",
  "event_type": "session.end",
  "session_id": "sess_abc123",
  "payload": { ... }
}

agent_id and customer_id are never sent by the client. They are resolved server-side from the bearer token. This means a leaked key can only attribute events to the agent that owns it.
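
For concreteness, a sketch of the envelope as it might be validated on the ingestion side, assuming Pydantic (which FastAPI uses natively); the field names come from the example above and the event types from §4.2:

from datetime import datetime
from typing import Any, Literal
from uuid import UUID

from pydantic import BaseModel

class EventEnvelope(BaseModel):
    # agent_id / customer_id deliberately absent: resolved server-side from the bearer token
    event_id: UUID
    ts: datetime
    event_type: Literal["session.start", "session.end", "error", "heartbeat"]
    session_id: str | None = None   # heartbeats (and some errors) have no session
    payload: dict[str, Any]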

4.2 Event types (MVP — four)

session.start — emitted when the agent picks up a unit of work. Payload: category (free string, agent-tagged, e.g. "email_reply"); source (optional, e.g. "inbox").
session.end — emitted when a unit of work completes, fails, or escalates. Payload: status: "success"|"failed"|"escalated"; duration_ms; tool_calls: [{name, ms, ok}]; llm_usage: {model, input_tokens, output_tokens, cost_cents}; summary_raw? (redaction input).
error — emitted on an uncaught error in the agent. Payload: message, kind, stack_hash (sha256 of the stack trace, no body) — groupable without leaking code paths.
heartbeat — emitted by the SDK background thread every 5 min. Payload: agent_version, pid_uptime_s, dropped_events_since_last.

We deliberately do not model tool.call or llm.call as separate top-level events for MVP. Rolling them into session.end keeps query patterns simple (one row per session). Splitting later is an additive, not breaking, change.

4.3 Categories

category on session.start is a free string, agent-tagged. We accept that this means cross-customer comparison will need a normalization step (likely a periodic job that maps free strings → a curated taxonomy). Forcing a fixed enum on day one would either constrain real workflows or be ignored. The cost of free-string is one normalization job; the cost of premature enum is wrong data.
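
A sketch of what the batch-job flavor of that normalization could look like; the taxonomy keys and values here are invented for illustration:

# Hypothetical curated taxonomy; the real mapping would be maintained as data, not code.
CURATED_TAXONOMY = {
    "email_reply": "inbox.reply",
    "reply_email": "inbox.reply",
    "booking": "calendar.booking",
}

def normalize_category(raw: str) -> str | None:
    """Map an agent-tagged free string to the curated taxonomy; None means 'needs triage'."""
    key = raw.strip().lower().replace(" ", "_")
    return CURATED_TAXONOMY.get(key)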

5. SDK contracts

5.1 Python (reference implementation)

import os
import brain

brain.init(api_key=os.environ["BRAIN_API_KEY"])  # endpoint defaults to prod URL

with brain.session(category="email_reply") as s:
    s.tool("read_inbox", duration_ms=150, ok=True)
    s.llm(model="claude-sonnet-4-6", input_tokens=2400,
          output_tokens=350, cost_cents=2.1)
    s.set_summary("Replied to a billing dispute about an unpaid invoice.")
    # context exit emits session.end; status inferred from raised exception or set explicitly via s.fail("...") / s.escalate()

brain.track("error", {"message": "...", "kind": "TimeoutError",
                      "stack_hash": "sha256:..."})  # raw escape hatch

Behavior: every SDK call is wrapped so it never raises into the agent (§10). Events buffer in memory and a background thread flushes them; on network failure they spill to an on-disk ring buffer and are replayed later. On overflow the oldest events are dropped and the count is reported as dropped_events_since_last on the next heartbeat, which fires every 5 minutes. A minimal sketch follows.

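A minimal sketch of that contract; the class name, flush cadence, and one-envelope-per-POST shape are illustrative, not the real m8trx_brain internals:

import threading
import time

import requests

class _BufferedClient:
    """Illustrative: bounded in-memory buffer + daemon flush thread; lossy, never raises."""

    def __init__(self, api_key: str, endpoint: str, max_buffer: int = 10_000):
        self._api_key = api_key
        self._endpoint = endpoint          # the /v1/events URL
        self._max_buffer = max_buffer
        self._buf: list[dict] = []
        self._dropped = 0                  # surfaces as dropped_events_since_last
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def track(self, event: dict) -> None:
        try:
            with self._lock:
                if len(self._buf) >= self._max_buffer:
                    self._buf.pop(0)       # overflow: drop oldest
                    self._dropped += 1
                self._buf.append(event)
        except Exception:
            pass                           # lossy is allowed; crashing the agent is not

    def _flush_loop(self) -> None:
        while True:
            time.sleep(5)
            try:
                with self._lock:
                    batch, self._buf = self._buf, []
                for event in batch:        # one envelope per POST, per §4.1
                    requests.post(
                        self._endpoint,
                        json=event,
                        headers={"Authorization": f"Bearer {self._api_key}"},
                        timeout=5,
                    )
            except Exception:
                pass                       # a real client would spill the batch to the disk ring buffer
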
5.2 Claude Code adapter

A thin wrapper script claude-code-brain-hook.py plus a settings.json snippet customers add to their Claude Code install. Hooks used: SessionStart emits session.start, PostToolUse accumulates the per-tool entries that roll up into session.end, and Stop emits session.end itself.

The wrapper imports the same Python SDK; no separate code path.

5.3 paperclip

paperclip's runtime is not yet known to the brain team. For MVP, paperclip integrates via the raw HTTP contract (Section 4.1). Once we know its language, we wrap it as a thin adapter over the same wire format. Open question — see §13.
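
For reference, the raw contract is small enough to show in full; shown with Python's requests for concreteness, with a placeholder host:

import os
import uuid
from datetime import datetime, timezone

import requests

event = {
    "event_id": str(uuid.uuid4()),
    "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "event_type": "session.end",
    "session_id": "sess_abc123",
    "payload": {"status": "success", "duration_ms": 8200},
}
resp = requests.post(
    "https://<brain-host>/v1/events",   # placeholder: the SDK defaults to the prod URL
    json=event,
    headers={"Authorization": "Bearer " + os.environ["BRAIN_API_KEY"]},
    timeout=5,
)
resp.raise_for_status()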

6. Privacy & redaction

6.1 Privacy tiers

Per-customer config (stored on the customers row) chooses one of:

  Tier A — no summaries. summary_raw is stripped at ingest and summary_redacted stays null.
  Tier B — redacted summaries. summary_raw is routed through the Haiku pipeline (§6.2); only the redacted line is stored and the raw line is dropped.
  Tier C — raw summaries retained inside payload (§8). Not enabled for any customer at launch; see §13.

6.2 Redaction prompt

Haiku call uses the following user prompt (system prompt sets the role):

Rewrite the following one-line agent task summary in 120 characters or fewer. Strip every personal name, email address, phone number, postal address, account number, and order/ticket ID. Preserve the business intent (what kind of task it was, what the outcome was). Reply with only the rewritten line.

Wrapped in a 5-second timeout. On failure, the event is still stored, summary_redacted is null, and the payload gains summary_failed: true so we can re-run later. The raw summary is never persisted on a redaction failure for tier B customers — it is dropped.
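
A sketch of the redaction call with that timeout and failure path; the function shape and the one-line system prompt are illustrative, while the user prompt and model come from this section:

import anthropic

REDACT_PROMPT = (
    "Rewrite the following one-line agent task summary in 120 characters or fewer. "
    "Strip every personal name, email address, phone number, postal address, account "
    "number, and order/ticket ID. Preserve the business intent (what kind of task it "
    "was, what the outcome was). Reply with only the rewritten line.\n\n"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def redact(summary_raw: str) -> str | None:
    """Return the redacted line, or None on any failure (caller sets summary_failed)."""
    try:
        msg = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=100,
            system="You redact PII from agent telemetry summaries.",  # illustrative role
            messages=[{"role": "user", "content": REDACT_PROMPT + summary_raw}],
            timeout=5.0,  # the 5-second cap from above
        )
        return msg.content[0].text.strip()
    except Exception:
        return None  # event is still stored; the raw summary is dropped for tier B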

7. Auth model

One API key per agent ("key_<slug>" in api_keys, §8). The SDK presents it as a bearer token on every request; the server hashes the presented key, looks it up by key_hash, and resolves agent_id and, via the agents row, customer_id: the client never sends either (§4.1). Keys are stored only as hashes, so a database read cannot recover them, and setting revoked_at invalidates a key without deleting its event history.
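
A sketch of key resolution as a FastAPI dependency, assuming keys are stored as SHA-256 digests (§8 pins down only that key_hash is a bytea); get_db is a hypothetical connection dependency:

import hashlib

from fastapi import Depends, HTTPException, Request

async def get_db():
    # Hypothetical: yield an asyncpg connection from an app-level pool.
    raise NotImplementedError

async def resolve_agent(request: Request, db=Depends(get_db)) -> dict:
    auth = request.headers.get("authorization", "")
    if not auth.startswith("Bearer "):
        raise HTTPException(status_code=401)
    key_hash = hashlib.sha256(auth.removeprefix("Bearer ").encode()).digest()
    row = await db.fetchrow(
        "select a.id as agent_id, a.customer_id "
        "from api_keys k join agents a on a.id = k.agent_id "
        "where k.key_hash = $1 and k.revoked_at is null",
        key_hash,
    )
    if row is None:
        raise HTTPException(status_code=401)   # unknown and revoked keys look identical to the caller
    return {"agent_id": row["agent_id"], "customer_id": row["customer_id"]}
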
8. Storage schema

create table customers (
  id            text primary key,            -- "cust_<slug>"
  name          text not null,
  privacy_tier  text not null check (privacy_tier in ('a','b','c')),
  created_at    timestamptz not null default now()
);

create table agents (
  id            text primary key,            -- "agent_<slug>"
  customer_id   text not null references customers(id),
  kind          text,                        -- 'booking'|'inbox'|'sales'|'ops'|'other' (advisory)
  version       text,                        -- updated from heartbeats
  last_seen_at  timestamptz,                 -- updated on every event
  created_at    timestamptz not null default now()
);

create table api_keys (
  id            text primary key,            -- "key_<slug>"
  agent_id      text not null references agents(id),
  key_hash      bytea not null unique,
  label         text,
  created_at    timestamptz not null default now(),
  revoked_at    timestamptz
);

create table events (
  id               bigserial primary key,
  ts               timestamptz not null,
  ingested_at      timestamptz not null default now(),
  customer_id      text not null,             -- denormalized for fast filtering
  agent_id         text not null,
  event_type       text not null,
  session_id       text,
  payload          jsonb not null,
  summary_redacted text                       -- null for tier-A or pre-redaction
);

create index events_customer_ts on events (customer_id, ts desc);
create index events_agent_ts    on events (agent_id, ts desc);
create index events_type_ts     on events (event_type, ts desc);
create index events_session     on events (session_id) where session_id is not null;

Tier-C raw summaries (when in scope) live inside payload->>'summary_raw'; we deliberately do not promote them to a column to avoid an accidental SELECT * leaking them.

Partition events by month once we cross ~10 M rows. Not needed at MVP volume.

9. Saved analytics (Metabase)

Six saved questions ship with the brain, each mapped to one of the three priority signals:

  1. Engagement — sessions/day per customer, last 30 days, sparkline. (Priority 1.)
  2. Last-seen — now() - max(ts) per agent; agents with no event in > 1 h are flagged. (Priority 1.)
  3. Tool-call leaderboard — jsonb_array_elements(payload->'tool_calls') grouped by tool name × customer. (Priority 2.)
  4. Token spend per customer — daily sum((payload->'llm_usage'->>'cost_cents')::numeric). (Unit economics, secondary.)
  5. Category mix — session.start.category distribution per customer + cross-customer. (Priorities 2 & 3.)
  6. Escalation/failure rate — session.end.status ratios per customer; a leading churn indicator. (Priority 1.)

These are not the final analytics — they are the smallest set that demonstrably answers the three priority questions on day one.

10. Error handling

SDK side. Every internal error is caught. Network failures replay from the on-disk ring buffer. Buffer overflow drops oldest events and bumps the counter on the next heartbeat. The SDK is allowed to be lossy; it is never allowed to crash the agent.

Server side. An unknown or revoked key gets 401; an envelope that fails schema validation gets 400, and both are counted in events_rejected_total. A redaction failure does not reject the event: it is stored with summary_failed: true (§6.2). If Postgres is unreachable, the API returns 503 and the SDK retries from its buffer.

Backpressure. Not enforced at MVP. If Postgres falls behind, the SDK will see 503s and retry; we will see it in events_rejected_total before customers do.
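
A sketch of the SDK-side reaction to sustained 503s, assuming exponential backoff with jitter (the retry policy is not pinned down in this doc):

import random
import time

def post_with_retry(post_batch, batch, max_attempts: int = 5) -> bool:
    """post_batch is a hypothetical callable returning an HTTP status code."""
    delay = 1.0
    for _ in range(max_attempts):
        try:
            if post_batch(batch) != 503:
                return True
        except Exception:
            pass                             # network error: treat like a 503
        time.sleep(delay + random.random())  # jitter to avoid a thundering herd
        delay = min(delay * 2, 60.0)         # cap the backoff
    return False                             # give up; the batch stays in the ring buffer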

11. Ops & deployment

12. Testing

13. Open questions

  1. paperclip runtime. We assume HTTP-only integration until we know its language. If it turns out to be a runtime we can target with an SDK (Python/Node/Go), we wrap it post-MVP.
  2. Category taxonomy. Free string at MVP. The normalization job (free string → curated taxonomy) is post-MVP work; the question is whether it's a periodic batch job or an LLM-based classifier at ingest time.
  3. Tier-C launch criteria. No customer is at Tier C on day one. Before enabling, we need DPA template, a per-customer kill switch, and an audit log of every Tier-C row read.
  4. Multi-tenant isolation. All customers share one Postgres instance with customer_id-scoped queries. If a single large customer's volume becomes disruptive, we partition the events table by customer_id or move them to a dedicated DB. Not needed at MVP.