Phase B.3 — Host heartbeat

Date: 2026-05-03
Phase: B.3 (telemetry: host-side liveness + stats)
Predecessor: B.2 (Claude Code hooks for agent-runtime)
Successor: B.4 (Tailscale + cloud-init bootstrap)

Goal

Install a systemd timer + service on each customer EC2 host that POSTs a heartbeat event to the brain ingestion API every ~5 minutes. The event carries host metadata + system stats so operators get fleet-wide liveness ("this host is alive") plus light capacity-trend visibility (load avg, memory, disk %), without depending on the customer actively running agents.

Why now

Phase B.2 instrumented Claude itself (per-tool-call events from inside the agent-runtime container). But a customer with no recent agent activity disappears from the dashboard's "last seen" tracking. The heartbeat closes that gap with a periodic, agent-independent liveness signal — and while we're at it, attaches cheap host stats so a customer trending toward disk-full or thrashing on memory shows up in the brain before they page support.

Architecture

                   ┌─ customer EC2 host ─────────────────────┐
                   │                                         │
  systemd timer ──→│  m8trx-brain-heartbeat.timer            │
   (every 5 min)   │  OnBootSec=30s, OnUnitActiveSec=5min    │
                   │  RandomizedDelaySec=30s ◄── fleet-wide  │
                   │       │                     spread      │
                   │       ▼                                 │
                   │  m8trx-brain-heartbeat.service          │
                   │  Type=oneshot, User=root                │
                   │  EnvironmentFile=/etc/m8trx/brain.env   │
                   │       │                                 │
                   │       ▼                                 │
                   │  /usr/local/bin/m8trx-brain-heartbeat   │
                   │  (POSIX sh + jq + curl)                 │
                   │  - reads /proc/loadavg, /proc/meminfo,  │
                   │    /proc/uptime, uname -r, hostname     │
                   │  - df -P / for root disk %              │
                   │  - jq builds payload + event JSON       │
                   │  - curl POST to ${BRAIN_URL}/v1/events  │
                   │       │                                 │
                   └───────┼─────────────────────────────────┘
                           │ HTTPS over Tailscale
                           ▼
                   ┌─ brain ──────────────────────────────────┐
                   │  /v1/events                              │
                   │  agent_id="_host" → filtered out of      │
                   │  agent rollups by dashboard queries      │
                   └──────────────────────────────────────────┘

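RandomizedDelaySec=30s spreads firings so hosts that boot at the same moment don't all hit the brain in the same second (the timers are relative to boot and last activation, not to the wall clock). Customer ID is implicit from the bearer key in /etc/m8trx/brain.env, which B.4 cloud-init writes during host provisioning.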

Cadence

The timer fires every 5 minutes (OnUnitActiveSec=5min), with a 30-second startup delay (OnBootSec=30s) and a random extra delay of up to 30 seconds per firing (RandomizedDelaySec=30s) that spreads load across the fleet.

5 minutes was chosen against the dashboard's statusFromLastSeenMin thresholds (server/src/routes/dashboard.js:9–15):

healthy ≤ 60 min
warning ≤ 240 min
critical ≤ 1440 min
idle > 1440 min

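A single missed beat leaves a gap of about 10 minutes, well inside the 60-minute healthy threshold; a host only drops to warning after roughly 12 consecutive misses. At 12 beats/hour per host, the write rate stays manageable (~1.2k rows/hr at 100 customers).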
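The cadence arithmetic can be checked with a few lines of shell. This is only a worked example using the figures quoted in this section (5-minute interval, 100 customers, 60-minute healthy threshold); the variable names are illustrative and jitter is ignored.

```sh
#!/bin/sh
# Inputs taken from the Cadence section above.
interval_min=5
customers=100
healthy_max_min=60

# Beats each host sends per hour, and the resulting fleet-wide row rate.
beats_per_hour=$((60 / interval_min))
rows_per_hour=$((beats_per_hour * customers))

# Consecutive missed beats needed before last-seen age passes the healthy limit.
misses_to_warning=$((healthy_max_min / interval_min))

echo "beats/hour per host: $beats_per_hour"
echo "rows/hour at $customers customers: $rows_per_hour"
echo "consecutive misses before leaving healthy: $misses_to_warning"
```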

Components

Four files, all under agent-artifacts/heartbeat/:

1. `m8trx-brain-heartbeat.sh`

POSIX sh script. Installed at /usr/local/bin/m8trx-brain-heartbeat, mode 0755, root-owned.

Behaviour:

1. Read env vars BRAIN_URL and BRAIN_API_KEY. If either is unset, write a one-line stderr message and exit 1. (Non-zero exit: this is a misconfiguration the operator needs to see via systemctl status / journalctl.)
2. Gather host stats:
   - hostname from hostname (or /etc/hostname fallback).
   - uptime_sec from /proc/uptime (first field, integer-truncated).
   - kernel from uname -r.
   - load1, load5, load15 from /proc/loadavg (first three, floats).
   - mem_total_mb, mem_avail_mb from /proc/meminfo (MemTotal and MemAvailable are reported in kibibytes; divide by 1024 for MiB. The field names use _mb for brevity but the values are base-1024: operators reading the dashboard care about order of magnitude, not exact base, and 1024-based matches what free -m shows on the same host).
   - disk_root_pct from df -P / | tail -1 | awk '{print $5}', stripped of the %.
3. Build event JSON via jq -nc using --arg / --argjson so numeric fields are emitted as JSON numbers, not strings:
   - event_id from cat /proc/sys/kernel/random/uuid.
   - ts from date -u +"%Y-%m-%dT%H:%M:%S.000Z".
   - event_type=heartbeat, agent_id=_host.
   - Payload object with the 9 host-stat fields above.
4. POST to ${BRAIN_URL%/}/v1/events with curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null, with Authorization: Bearer ${BRAIN_API_KEY} and Content-Type: application/json.

set -e is on throughout, so any of (1)–(4) failing exits non-zero and systemd records the failure.
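The stat-gathering step could be sketched with small helpers like the ones below. This is not the shipped script: the function names are hypothetical, the use of awk for the /proc files is an assumption (the spec only names awk for the df line), and each helper takes an optional file path purely so it can be tried against a fixture.

```sh
#!/bin/sh
# Illustrative helpers for the "gather host stats" step. All names are
# hypothetical; only the data sources come from the spec.
set -e

# Hostname from the hostname command, falling back to /etc/hostname.
host_name() {
    hostname 2>/dev/null || cat /etc/hostname
}

# First field of /proc/uptime, truncated to whole seconds.
uptime_sec() {
    awk '{ printf "%d\n", $1 }' "${1:-/proc/uptime}"
}

# 1-, 5- and 15-minute load averages, space separated.
load_avgs() {
    awk '{ print $1, $2, $3 }' "${1:-/proc/loadavg}"
}

# One /proc/meminfo value (reported in KiB) converted to whole MiB.
# Usage: meminfo_mb MemTotal [file]
meminfo_mb() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}

# Use% column for the root filesystem with the trailing % removed.
disk_root_pct() {
    df -P / | tail -1 | awk '{ print $5 }' | tr -d '%'
}
```

A caller would then capture each value, for example mem_total_mb=$(meminfo_mb MemTotal) and mem_avail_mb=$(meminfo_mb MemAvailable), before handing them to jq.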

Implementation notes:

No set -u. Optional vars handled with ${VAR:-} defaults where needed.
No trap 'exit 0' (different from the brain-hook). The hook swallowed failures because non-zero would block Claude. The heartbeat wants non-zero to surface in systemctl status.
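A minimal sketch of the env check and the POST under these notes, assuming the curl flags listed in the behaviour section. The function names require_env and post_event are hypothetical, and treating an empty value the same as an unset one is an assumption the spec does not state.

```sh
#!/bin/sh
# Deliberately no `set -u` and no `trap 'exit 0'`: failures must propagate.
set -e

# Returns 1 with a one-line stderr message when either variable is missing.
# Assumption: an empty value counts as unset.
require_env() {
    if [ -z "${BRAIN_URL:-}" ] || [ -z "${BRAIN_API_KEY:-}" ]; then
        echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2
        return 1
    fi
}

# POST one JSON event body ($1). curl's exit status becomes the function's
# status, so `set -e` in the caller turns any failure into a failed unit.
post_event() {
    curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null \
        -H "Authorization: Bearer ${BRAIN_API_KEY}" \
        -H "Content-Type: application/json" \
        --data "$1" \
        "${BRAIN_URL%/}/v1/events"
}
```

In the script body this would be used as require_env || exit 1 followed by post_event "$event_json", with no failure swallowed along the way.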

2. `m8trx-brain-heartbeat.service`

systemd unit. Installed at /etc/systemd/system/m8trx-brain-heartbeat.service, mode 0644.

[Unit]
Description=M8trx brain host heartbeat
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
Type=oneshot
User=root
EnvironmentFile=/etc/m8trx/brain.env
ExecStart=/usr/local/bin/m8trx-brain-heartbeat

After=tailscaled.service because brain is reachable only over Tailscale.

3. `m8trx-brain-heartbeat.timer`

systemd timer. Installed at /etc/systemd/system/m8trx-brain-heartbeat.timer, mode 0644.

[Unit]
Description=Fire M8trx brain host heartbeat every 5 minutes

[Timer]
OnBootSec=30s
OnUnitActiveSec=5min
RandomizedDelaySec=30s
Unit=m8trx-brain-heartbeat.service

[Install]
WantedBy=timers.target

4. `README.md`

Short integration doc covering:

Where each artifact installs to on a customer EC2.
The expected /etc/m8trx/brain.env format (mode 0600 root:root, BRAIN_URL= and BRAIN_API_KEY= lines).

Install steps:

cp m8trx-brain-heartbeat.sh      /usr/local/bin/m8trx-brain-heartbeat
cp m8trx-brain-heartbeat.service /etc/systemd/system/
cp m8trx-brain-heartbeat.timer   /etc/systemd/system/
chmod 0755 /usr/local/bin/m8trx-brain-heartbeat
systemctl daemon-reload
systemctl enable --now m8trx-brain-heartbeat.timer

Operator debug recipe: systemctl status m8trx-brain-heartbeat.timer, journalctl -u m8trx-brain-heartbeat --since "10 min ago", plus a one-shot manual run via systemctl start m8trx-brain-heartbeat.service.
Required deps (curl, jq) — installed by B.4 cloud-init alongside this stack.
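As an aid for the debug recipe, the brain.env expectations listed above could be checked with a helper along these lines. It is a sketch only: check_brain_env is a hypothetical name, the check requires non-empty values (an assumption), and it verifies the 0600 mode but not root:root ownership.

```sh
#!/bin/sh
# Hypothetical helper: verify that an env file has both required keys and
# mode 0600. Ownership is not checked here.
check_brain_env() {
    f=${1:-/etc/m8trx/brain.env}
    if [ ! -r "$f" ]; then
        echo "missing or unreadable: $f" >&2
        return 1
    fi
    for key in BRAIN_URL BRAIN_API_KEY; do
        # Require KEY= followed by at least one character.
        if ! grep -q "^${key}=." "$f"; then
            echo "no ${key}= line in $f" >&2
            return 1
        fi
    done
    # The first ten characters of `ls -ld` are the file type and permission bits.
    perms=$(ls -ld "$f" | cut -c1-10)
    if [ "$perms" != "-rw-------" ]; then
        echo "expected mode 0600 on $f, found $perms" >&2
        return 1
    fi
}
```

Run as root on a host, check_brain_env with no argument inspects /etc/m8trx/brain.env and prints nothing when the file looks right.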

Event payload contract

{
  "event_id": "<uuid v4>",
  "ts": "2026-05-03T21:42:11.000Z",
  "event_type": "heartbeat",
  "agent_id": "_host",
  "payload": {
    "hostname":       "ip-10-0-1-42",
    "uptime_sec":     1234567,
    "kernel":         "6.8.0-1052-aws",
    "load1":          0.42,
    "load5":          0.51,
    "load15":         0.55,
    "mem_total_mb":   16384,
    "mem_avail_mb":   12000,
    "disk_root_pct":  38
  }
}

event_type=heartbeat is in brain's VALID_TYPES (server/src/routes/events.js:9).
agent_id="_host" is the convention for non-agent telemetry. Existing dashboard queries filter agent_id != '_host' in the per-customer agent count, the per-customer last-seen, and the fleet-wide active counts (server/src/routes/dashboard.js:23, 38, 151, 162), so heartbeats won't pollute agent-count or session-count rollups.
run_id is omitted — heartbeats have no concept of a run.
customer_id is not in the payload — brain derives it from the bearer key.
Numeric fields are JSON numbers (via --argjson), not strings.
All payload fields are non-PII host metadata. No IPs, no AWS instance IDs, no IAM identities.
The brain's existing agents table auto-upsert (server/src/routes/events.js:46–53) will create an agents row with id="_host" per customer. That gives operators a per-customer "_host" pseudo-agent in /v1/dashboard/agents queries — harmless, and arguably useful. Not suppressing in this phase; revisit only if it creates noise.
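An event matching this contract could be assembled with jq roughly as follows. The sketch assumes the shell variables are already populated by the stat-gathering step; build_event is a hypothetical name. Strings go through --arg and numbers through --argjson, which is what keeps the numeric fields as JSON numbers.

```sh
#!/bin/sh
# Hypothetical helper: print one compact heartbeat event on stdout.
# Expects event_id, ts and the nine host-stat variables to be set.
build_event() {
    jq -nc \
        --arg     event_id      "$event_id" \
        --arg     ts            "$ts" \
        --arg     hostname      "$hostname" \
        --arg     kernel        "$kernel" \
        --argjson uptime_sec    "$uptime_sec" \
        --argjson load1         "$load1" \
        --argjson load5         "$load5" \
        --argjson load15        "$load15" \
        --argjson mem_total_mb  "$mem_total_mb" \
        --argjson mem_avail_mb  "$mem_avail_mb" \
        --argjson disk_root_pct "$disk_root_pct" \
        '{
            event_id:   $event_id,
            ts:         $ts,
            event_type: "heartbeat",
            agent_id:   "_host",
            payload: {
                hostname:      $hostname,
                uptime_sec:    $uptime_sec,
                kernel:        $kernel,
                load1:         $load1,
                load5:         $load5,
                load15:        $load15,
                mem_total_mb:  $mem_total_mb,
                mem_avail_mb:  $mem_avail_mb,
                disk_root_pct: $disk_root_pct
            }
        }'
}
```

Because --argjson rejects anything that is not valid JSON, an empty or non-numeric stat makes jq exit non-zero instead of emitting a malformed event.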

Env contract

Var	Set by	Required	Behaviour if missing
`BRAIN_URL`	`/etc/m8trx/brain.env` (B.4 cloud-init)	yes	exit 1, stderr log
`BRAIN_API_KEY`	`/etc/m8trx/brain.env` (B.4 cloud-init)	yes	exit 1, stderr log

No optional env vars. Heartbeats need no AGENT_ID (uses fixed _host), no RUN_ID, no BRAIN_DEBUG (systemd journal already captures stderr — debug is operator-visible by default).

Error handling

Failure	Behaviour
`BRAIN_URL` or `BRAIN_API_KEY` unset	`echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2; exit 1`. systemd marks unit failed → visible in `systemctl status` and `journalctl -u`.
`/proc/*` read fails	`set -e` exits non-zero. (Should never happen on Linux.)
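`df -P /` fails	In a plain POSIX sh pipeline the exit status is `awk`'s, so `set -e` does not trip on `df` itself; the resulting empty `disk_root_pct` is rejected by `jq --argjson`, and `set -e` exits non-zero there.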
`jq` missing or fails	`set -e` exits non-zero. (B.4 cloud-init installs jq alongside.)
`curl` transport failure (Tailscale down, brain unreachable)	`curl --fail --max-time 5` exits non-zero; script exits non-zero; systemd marks failed. Operator sees curl's stderr in journal.
Brain returns non-2xx (401 wrong key, 4xx bad event, 5xx)	`--fail` makes curl exit non-zero (status 22) and `--show-error` prints the HTTP response code to stderr; same systemd failure path.
Script crashes (syntax error, etc.)	The shell exits non-zero; same systemd failure path.
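When triaging the curl rows above from the journal, the number in curl's message (for example curl: (22) ...) shows which row applies. The helper below is only an illustration with a hypothetical name; the statuses it maps are curl's documented exit codes, limited to the ones most relevant here.

```sh
#!/bin/sh
# Hypothetical triage helper: describe a curl exit status.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "could not connect" ;;
        22) echo "HTTP error status (--fail)" ;;
        28) echo "timed out (--max-time)" ;;
        *)  echo "other curl failure ($1)" ;;
    esac
}
```

Statuses 6, 7 and 28 correspond to the transport-failure row; 22 corresponds to the non-2xx row.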

The heartbeat is fundamentally different from the brain-hook (B.2) in its failure philosophy:

The hook ran inside Claude Code where non-zero blocks the user's tool call. It had to swallow failures and use BRAIN_DEBUG for conditional stderr.
The heartbeat runs under systemd where non-zero is the correct signal. Operator notices via systemctl status / journalctl. systemd journal is operator-only by design, so no BRAIN_DEBUG gate is needed.

No Restart= on the service unit. None is needed: the timer fires every 5 min regardless of the previous run's exit, so a transient failure self-heals on the next beat. A persistent failure stays loud via systemctl status m8trx-brain-heartbeat.service, which shows the last run's result.

Testing

Standalone script test (in scope for this phase)

A bin/test-brain-heartbeat.sh script in the brain repo, runnable against the local brain server, covering:

Happy path — pass real BRAIN_URL + freshly-minted KEY, run the script, verify exit 0 and that a row appears in postgres with event_type=heartbeat, agent_id=_host, all 9 payload fields present and within sane bounds (disk_root_pct 0–100, mem_total_mb >= mem_avail_mb, load1/5/15 >= 0, uptime_sec > 0, kernel and hostname non-empty strings).
Missing BRAIN_URL — unset BRAIN_URL, run, verify exit 1, stderr contains "BRAIN_URL/BRAIN_API_KEY unset", no new row.
Missing BRAIN_API_KEY — same as above with the other var.
Wrong key (401) — bogus BRAIN_API_KEY, verify exit non-zero from curl --fail, journal-bound stderr present, no new row.
Unreachable brain (transport fail) — BRAIN_URL=http://192.0.2.1:1, verify exit non-zero within ~6 s (curl --max-time 5 + slack), no new row.
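The exit-status checks in these cases could share one small helper. This is a sketch of how bin/test-brain-heartbeat.sh might assert on exit codes, not its actual contents; expect_exit and STDERR_FILE are hypothetical names.

```sh
#!/bin/sh
# expect_exit WANT CMD [ARGS...]
# WANT is an exact exit status or the word "nonzero". CMD's stderr is saved
# to $STDERR_FILE (default /dev/null) so a case can grep it afterwards.
expect_exit() {
    want=$1
    shift
    got=0
    "$@" 2>"${STDERR_FILE:-/dev/null}" || got=$?
    if [ "$want" = "nonzero" ]; then
        if [ "$got" -ne 0 ]; then ok=1; else ok=0; fi
    else
        if [ "$got" -eq "$want" ]; then ok=1; else ok=0; fi
    fi
    if [ "$ok" -eq 1 ]; then
        echo "ok: exit $got: $*"
    else
        echo "FAIL: wanted exit $want, got $got: $*" >&2
        return 1
    fi
}
```

The missing-variable case would then read expect_exit 1 env -u BRAIN_URL BRAIN_API_KEY=test-key ./m8trx-brain-heartbeat.sh (env -u is available in GNU and BSD env), and the wrong-key and unreachable-brain cases would use expect_exit nonzero with the relevant BRAIN_URL and BRAIN_API_KEY values.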

Bootstrap (same as the B.2 hook test): mint a fresh test key via docker compose exec brain-api node bin/mint-key.js cust_m8trx_test "M8TRX Test" at the top of the runner.

systemd integration test — deferred

Real-world test (cp files into /etc/systemd/system/, daemon-reload, enable + start the timer, wait, check systemctl status and journalctl) requires mutating this EC2's actual systemd state. That's heavier than the hook tests' "no host mutation" model.

Instead, the README documents the smoke procedure for an operator deploying to a real customer EC2 in B.4. The standalone script test above gets ≥90% of the value with zero system-state side effects.

Out of scope

Editing customer EC2 systemd state from this repo (B.4 / runbook territory).
A brain-side change to suppress the agents row auto-upsert when agent_id="_host". Revisit only if the per-customer "_host" pseudo-agent creates dashboard noise.
Per-mount disk reporting (only / for MVP).
CPU usage % (load avg is a reasonable proxy for now; true CPU% needs a sample window).
Pushing host stats more often than 5 min (the dashboard rollups don't currently consume sub-5-min granularity).
Reading the agent-runtime image tag (would couple the heartbeat to docker; deferred until fleet-version drift becomes a real ops problem).
Alerting on persistent failure (B.3 makes failure visible via systemd journal; an alerting layer is a later phase).

Open questions

None at design-approval time. All four clarifying questions were resolved interactively before this spec was written.