Phase B.3 — Host heartbeat

Date: 2026-05-03
Phase: B.3 (telemetry: host-side liveness + stats)
Predecessor: B.2 (Claude Code hooks for agent-runtime)
Successor: B.4 (Tailscale + cloud-init bootstrap)

Goal

Install a systemd timer + service on each customer EC2 host that POSTs a heartbeat event to the brain ingestion API every ~5 minutes. The event carries host metadata + system stats so operators get fleet-wide liveness ("this host is alive") plus light capacity-trend visibility (load avg, memory, disk %), without depending on the customer actively running agents.

Why now

Phase B.2 instrumented Claude itself (per-tool-call events from inside the agent-runtime container). But a customer with no recent agent activity disappears from the dashboard's "last seen" tracking. The heartbeat closes that gap with a periodic, agent-independent liveness signal — and while we're at it, attaches cheap host stats so a customer trending toward disk-full or thrashing on memory shows up in the brain before they page support.

Architecture

                   ┌─ customer EC2 host ─────────────────────┐
                   │                                         │
  systemd timer ──→│  m8trx-brain-heartbeat.timer            │
   (every 5 min)   │  OnBootSec=30s, OnUnitActiveSec=5min    │
                   │  RandomizedDelaySec=30s ◄── fleet-wide  │
                   │       │                     spread      │
                   │       ▼                                 │
                   │  m8trx-brain-heartbeat.service          │
                   │  Type=oneshot, User=root                │
                   │  EnvironmentFile=/etc/m8trx/brain.env   │
                   │       │                                 │
                   │       ▼                                 │
                   │  /usr/local/bin/m8trx-brain-heartbeat   │
                   │  (POSIX sh + jq + curl)                 │
                   │  - reads /proc/loadavg, /proc/meminfo,  │
                   │    /proc/uptime, uname -r, hostname     │
                   │  - df -P / for root disk %              │
                   │  - jq builds payload + event JSON       │
                   │  - curl POST to ${BRAIN_URL}/v1/events  │
                   │       │                                 │
                   └───────┼─────────────────────────────────┘
                           │ HTTPS over Tailscale
                           ▼
                   ┌─ brain ──────────────────────────────────┐
                   │  /v1/events                              │
                   │  agent_id="_host" → filtered out of      │
                   │  agent rollups by dashboard queries      │
                   └──────────────────────────────────────────┘

RandomizedDelaySec=30s spreads the fleet so 100 customers don't all fire at XX:00:00. Customer ID is implicit from the bearer key in /etc/m8trx/brain.env, which B.4 cloud-init writes during host provisioning.

Cadence

The timer fires 5 minutes after the service was last activated (OnUnitActiveSec=5min), with the first run 30 seconds after boot (OnBootSec=30s) and a random delay of up to 30 seconds added to each firing (RandomizedDelaySec=30s).

5 minutes was chosen against the dashboard's statusFromLastSeenMin thresholds (server/src/routes/dashboard.js:9–15):

A single missed beat is well below the warning threshold, and 12 beats/hour × N customers is a manageable write rate (~1.2k rows/hr at 100 customers).
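
The write-rate figure is plain arithmetic. A minimal sketch, assuming one host per customer (the spec does not state a host count) and using a hypothetical helper name, rows_per_hour:

```shell
#!/bin/sh
# Hypothetical helper: estimate heartbeat rows written per hour.
# Assumes one host per customer; the interval defaults to the 5-minute cadence.
rows_per_hour() {   # usage: rows_per_hour CUSTOMERS [INTERVAL_MIN]
    customers=$1
    interval_min=${2:-5}
    echo $(( customers * 60 / interval_min ))
}

rows_per_hour 100     # 12 beats/hour x 100 hosts -> 1200
rows_per_hour 1000    # -> 12000
```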

Components

Four files, all under agent-artifacts/heartbeat/:

1. m8trx-brain-heartbeat.sh

POSIX sh script. Installed at /usr/local/bin/m8trx-brain-heartbeat, mode 0755, root-owned.

Behaviour:

  1. Read env vars BRAIN_URL and BRAIN_API_KEY. If either is unset, write a one-line stderr message and exit 1. (Non-zero exit: this is a misconfiguration the operator needs to see via systemctl status / journalctl.)
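  2. Gather host stats: load averages from /proc/loadavg, memory totals from /proc/meminfo, uptime from /proc/uptime, kernel release from uname -r, the hostname, and root-disk usage from df -P /.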
  3. Build event JSON via jq -nc using --arg / --argjson so numeric fields are emitted as JSON numbers, not strings:
  4. POST to ${BRAIN_URL%/}/v1/events with curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null, with Authorization: Bearer ${BRAIN_API_KEY} and Content-Type: application/json.
  5. set -e is on throughout, so any of (1)–(4) failing exits non-zero and systemd records the failure.
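
Step 2 could be sketched roughly as below. This is an illustration under assumptions, not the shipped script: the helper names are hypothetical, and each helper reads its input format on stdin so it can be exercised with fixture text as well as the live files.

```shell
#!/bin/sh
# Sketch of the stats-gathering step. Helper names are hypothetical.

# /proc/meminfo reports values in kB; convert the requested key to whole MB.
meminfo_mb() {   # usage: meminfo_mb MemTotal < /proc/meminfo
    awk -v key="$1:" '$1 == key { print int($2 / 1024) }'
}

# /proc/uptime holds "<uptime seconds> <idle seconds>"; keep whole seconds.
uptime_sec() {
    awk '{ print int($1) }'
}

# df -P prints a header line, then one line per filesystem;
# column 5 is the capacity, e.g. "38%". Strip the percent sign.
disk_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Live run, only where the Linux /proc files are available.
if [ -r /proc/meminfo ] && [ -r /proc/uptime ] && [ -r /proc/loadavg ]; then
    mem_total_mb=$(meminfo_mb MemTotal < /proc/meminfo)
    mem_avail_mb=$(meminfo_mb MemAvailable < /proc/meminfo)
    up=$(uptime_sec < /proc/uptime)
    # The first three fields of /proc/loadavg are the 1/5/15 minute averages.
    read -r load1 load5 load15 _rest < /proc/loadavg
    disk_root_pct=$(df -P / | disk_pct)
    echo "mem=${mem_avail_mb}/${mem_total_mb}MB up=${up}s load=${load1},${load5},${load15} disk=${disk_root_pct}%"
fi
```

Reading from stdin keeps the parsing logic testable without touching the host, which matches the "no host mutation" model used for the script tests.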

Implementation notes:

2. m8trx-brain-heartbeat.service

systemd unit. Installed at /etc/systemd/system/m8trx-brain-heartbeat.service, mode 0644.

[Unit]
Description=M8trx brain host heartbeat
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
Type=oneshot
User=root
EnvironmentFile=/etc/m8trx/brain.env
ExecStart=/usr/local/bin/m8trx-brain-heartbeat

After=tailscaled.service because brain is reachable only over Tailscale.

3. m8trx-brain-heartbeat.timer

systemd timer. Installed at /etc/systemd/system/m8trx-brain-heartbeat.timer, mode 0644.

[Unit]
Description=Fire M8trx brain host heartbeat every 5 minutes

[Timer]
OnBootSec=30s
OnUnitActiveSec=5min
RandomizedDelaySec=30s
Unit=m8trx-brain-heartbeat.service

[Install]
WantedBy=timers.target

4. README.md

Short integration doc covering:

Event payload contract

{
  "event_id": "<uuid v4>",
  "ts": "2026-05-03T21:42:11.000Z",
  "event_type": "heartbeat",
  "agent_id": "_host",
  "payload": {
    "hostname":       "ip-10-0-1-42",
    "uptime_sec":     1234567,
    "kernel":         "6.8.0-1052-aws",
    "load1":          0.42,
    "load5":          0.51,
    "load15":         0.55,
    "mem_total_mb":   16384,
    "mem_avail_mb":   12000,
    "disk_root_pct":  38
  }
}
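
One way the script might assemble this event with jq is sketched below. The function name build_event is hypothetical, and the sources for event_id (the Linux kernel's UUID generator) and ts (date -u with a fixed .000 millisecond part) are assumptions, since the spec does not say how either value is produced.

```shell
#!/bin/sh
# Sketch of event assembly. build_event is a hypothetical name and reads the
# stats from shell variables set by the caller.
# --arg passes a string; --argjson parses its value as JSON, so the numeric
# fields are emitted as JSON numbers rather than quoted strings.
build_event() {   # usage: build_event EVENT_ID TIMESTAMP
    jq -nc \
        --arg event_id "$1" \
        --arg ts "$2" \
        --arg hostname "$hostname" \
        --arg kernel "$kernel" \
        --argjson uptime_sec "$uptime_sec" \
        --argjson load1 "$load1" \
        --argjson load5 "$load5" \
        --argjson load15 "$load15" \
        --argjson mem_total_mb "$mem_total_mb" \
        --argjson mem_avail_mb "$mem_avail_mb" \
        --argjson disk_root_pct "$disk_root_pct" \
        '{
            event_id: $event_id,
            ts: $ts,
            event_type: "heartbeat",
            agent_id: "_host",
            payload: {
                hostname: $hostname,
                uptime_sec: $uptime_sec,
                kernel: $kernel,
                load1: $load1,
                load5: $load5,
                load15: $load15,
                mem_total_mb: $mem_total_mb,
                mem_avail_mb: $mem_avail_mb,
                disk_root_pct: $disk_root_pct
            }
        }'
}

# Example values copied from the payload above.
hostname=ip-10-0-1-42 kernel=6.8.0-1052-aws uptime_sec=1234567
load1=0.42 load5=0.51 load15=0.55
mem_total_mb=16384 mem_avail_mb=12000 disk_root_pct=38

# Assumed sources for event_id and ts (not specified in this document).
if command -v jq >/dev/null 2>&1 && [ -r /proc/sys/kernel/random/uuid ]; then
    build_event "$(cat /proc/sys/kernel/random/uuid)" \
                "$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
fi
```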

Env contract

| Var | Set by | Required | Behaviour if missing |
| --- | --- | --- | --- |
| BRAIN_URL | /etc/m8trx/brain.env (B.4 cloud-init) | yes | exit 1, stderr log |
| BRAIN_API_KEY | /etc/m8trx/brain.env (B.4 cloud-init) | yes | exit 1, stderr log |

No optional env vars. Heartbeats need no AGENT_ID (uses fixed _host), no RUN_ID, no BRAIN_DEBUG (systemd journal already captures stderr — debug is operator-visible by default).
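
The required-variable check might look like the sketch below. The function name check_env is hypothetical, and treating an empty value the same as an unset one is an assumption; the spec only mentions the unset case.

```shell
#!/bin/sh
# Sketch of the required-variable check. check_env is a hypothetical name;
# the message text is the one given in this spec.
# Assumption: an empty value is treated the same as an unset variable.
check_env() {
    if [ -z "${BRAIN_URL:-}" ] || [ -z "${BRAIN_API_KEY:-}" ]; then
        echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2
        return 1
    fi
}

# In the script proper this would run first:
#   check_env || exit 1
```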

Error handling

| Failure | Behaviour |
| --- | --- |
| BRAIN_URL or BRAIN_API_KEY unset | echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2; exit 1. systemd marks unit failed → visible in systemctl status and journalctl -u. |
| /proc/* read fails | set -e exits non-zero. (Should never happen on Linux.) |
| df -P / fails | set -e exits non-zero. |
| jq missing or fails | set -e exits non-zero. (B.4 cloud-init installs jq alongside.) |
| curl transport failure (Tailscale down, brain unreachable) | curl --fail --max-time 5 exits non-zero; script exits non-zero; systemd marks failed. Operator sees curl's stderr in journal. |
| Brain returns non-2xx (401 wrong key, 4xx bad event, 5xx) | --fail makes curl exit non-zero (exit code 22) and report the HTTP status on stderr; same systemd failure path. |
| Script crashes (syntax error, etc.) | The shell exits non-zero (a syntax error aborts the script regardless of set -e). |
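
The curl rows above could be sketched as follows. post_event and describe_curl_exit are hypothetical names; the flags are the ones listed in this spec, and the exit codes are curl's documented values. The lookup function is only a reading aid for journal output, not something the spec requires.

```shell
#!/bin/sh
# Sketch of the POST step plus a lookup for common curl exit codes.
# post_event and describe_curl_exit are hypothetical names.

# POST one event; returns curl's exit status unchanged.
# Expects BRAIN_URL and BRAIN_API_KEY in the environment.
post_event() {   # usage: post_event "$event_json"
    curl --silent --show-error --max-time 5 --retry 0 --fail \
        -o /dev/null \
        -H "Authorization: Bearer ${BRAIN_API_KEY}" \
        -H "Content-Type: application/json" \
        --data "$1" \
        "${BRAIN_URL%/}/v1/events"
}

# Map a curl exit status to a short description.
describe_curl_exit() {   # usage: describe_curl_exit STATUS
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "could not connect" ;;
        22) echo "HTTP error status (--fail)" ;;
        28) echo "timed out (--max-time)" ;;
        *)  echo "other curl failure ($1)" ;;
    esac
}

describe_curl_exit 22
describe_curl_exit 28
```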

The heartbeat is fundamentally different from the brain-hook (B.2) in its failure philosophy:

No Restart= on the service unit: none is needed, because the timer fires every 5 min regardless of the previous run's exit. A transient failure self-heals on the next beat. A persistent failure stays loud via systemctl status m8trx-brain-heartbeat.service, which surfaces the last run's result, and journalctl -u m8trx-brain-heartbeat.service.

Testing

Standalone script test (in scope for this phase)

A bin/test-brain-heartbeat.sh script in the brain repo, runnable against the local brain server, covering:

  1. Happy path: pass a real BRAIN_URL and a freshly minted key, run the script, and verify exit 0 and that a row appears in postgres with event_type=heartbeat, agent_id=_host, and all 9 payload fields present and within sane bounds (disk_root_pct 0–100, mem_total_mb >= mem_avail_mb, load1/5/15 >= 0, uptime_sec > 0, kernel and hostname non-empty strings).
  2. Missing BRAIN_URL: unset BRAIN_URL, run, and verify exit 1, stderr contains "BRAIN_URL/BRAIN_API_KEY unset", and no new row.
  3. Missing BRAIN_API_KEY: same as above with the other var.
  4. Wrong key (401): use a bogus BRAIN_API_KEY and verify a non-zero exit from curl --fail, curl's error message on stderr, and no new row.
  5. Unreachable brain (transport fail): set BRAIN_URL=http://192.0.2.1:1 and verify a non-zero exit within ~6 s (curl --max-time 5 + slack) and no new row.

Bootstrap (same as the B.2 hook test): mint a fresh test key via docker compose exec brain-api node bin/mint-key.js cust_m8trx_test "M8TRX Test" at the top of the runner.
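
The exit-status and stderr checks in the cases above could share one small helper. A minimal sketch, with expect_exit as a hypothetical name; the postgres row checks are left out because the query mechanism is not described here.

```shell
#!/bin/sh
# Sketch of an assertion helper for the runner; expect_exit is hypothetical.
# It runs a command, captures only its stderr, and checks the exit status
# and that stderr contains a given substring (an empty substring always matches).
expect_exit() {   # usage: expect_exit WANT STDERR_SUBSTRING cmd [args...]
    want=$1; needle=$2; shift 2
    # 2>&1 first sends stderr to the capture; stdout is then discarded.
    err=$("$@" 2>&1 >/dev/null) && got=0 || got=$?
    case "$want" in
        nonzero) [ "$got" -ne 0 ] ;;
        *)       [ "$got" -eq "$want" ] ;;
    esac || { echo "FAIL: exit $got, wanted $want" >&2; return 1; }
    case "$err" in
        *"$needle"*) return 0 ;;
        *) echo "FAIL: stderr missing '$needle'" >&2; return 1 ;;
    esac
}

# Self-check with a stand-in command; in the runner the command would be
# the heartbeat script run with the environment for each case.
expect_exit 1 "unset" sh -c 'echo "BRAIN_URL/BRAIN_API_KEY unset" >&2; exit 1' \
    && echo "case passed"
```

WANT accepts either an exact status (cases 2 and 3 expect exactly 1) or the word nonzero (cases 4 and 5 only require a failure).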

systemd integration test — deferred

Real-world test (cp files into /etc/systemd/system/, daemon-reload, enable + start the timer, wait, check systemctl status and journalctl) requires mutating this EC2's actual systemd state. That's heavier than the hook tests' "no host mutation" model.

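Instead, the README documents the smoke procedure for an operator deploying to a real customer EC2 in B.4. The standalone script test above covers the script's success path and its main failure paths with zero system-state side effects; only the unit wiring (timer cadence, EnvironmentFile loading) is left to the smoke procedure.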

Out of scope

Open questions

None at design-approval time. All four clarifying questions were resolved interactively before this spec was written.