Date: 2026-05-03
Phase: B.3 (telemetry — host-side liveness + stats)
Predecessor: B.2 (Claude Code hooks for agent-runtime)
Successor: B.4 (Tailscale + cloud-init bootstrap)
Install a systemd timer + service on each customer EC2 host that POSTs
a heartbeat event to the brain ingestion API every ~5 minutes. The
event carries host metadata + system stats so operators get fleet-wide
liveness ("this host is alive") plus light capacity-trend visibility
(load avg, memory, disk %), without depending on the customer
actively running agents.
Phase B.2 instrumented Claude itself (per-tool-call events from inside the agent-runtime container). But a customer with no recent agent activity disappears from the dashboard's "last seen" tracking. The heartbeat closes that gap with a periodic, agent-independent liveness signal — and while we're at it, attaches cheap host stats so a customer trending toward disk-full or thrashing on memory shows up in the brain before they page support.
┌─ customer EC2 host ─────────────────────┐
│ │
systemd timer ──→│ m8trx-brain-heartbeat.timer │
(every 5 min) │ OnBootSec=30s, OnUnitActiveSec=5min │
│ RandomizedDelaySec=30s ◄── fleet-wide │
│ │ spread │
│ ▼ │
│ m8trx-brain-heartbeat.service │
│ Type=oneshot, User=root │
│ EnvironmentFile=/etc/m8trx/brain.env │
│ │ │
│ ▼ │
│ /usr/local/bin/m8trx-brain-heartbeat │
│ (POSIX sh + jq + curl) │
│ - reads /proc/loadavg, /proc/meminfo, │
│ /proc/uptime, uname -r, hostname │
│ - df -P / for root disk % │
│ - jq builds payload + event JSON │
│ - curl POST to ${BRAIN_URL}/v1/events │
│ │ │
└───────┼─────────────────────────────────┘
│ HTTPS over Tailscale
▼
┌─ brain ──────────────────────────────────┐
│ /v1/events │
│ agent_id="_host" → filtered out of │
│ agent rollups by dashboard queries │
└──────────────────────────────────────────┘
RandomizedDelaySec=30s spreads the fleet so 100 customers don't all
fire at XX:00:00. Customer ID is implicit from the bearer key in
/etc/m8trx/brain.env, which B.4 cloud-init writes during host
provisioning.
Timer fires every 5 minutes (OnUnitActiveSec=5min), with a
30-second startup delay (OnBootSec=30s) and a 30-second
fleet-wide jitter (RandomizedDelaySec=30s).
5 minutes was chosen against the dashboard's statusFromLastSeenMin
thresholds (server/src/routes/dashboard.js:9–15):
A single missed beat is well below the warning threshold, and 12 beats/hour × N customers is a manageable write rate (~1.2k rows/hr at 100 customers).
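The write-rate arithmetic is checkable in shell (100 customers is the fleet size assumed in the text above):

```shell
# 60 min / 5 min interval = 12 beats per host per hour; x 100 customers.
beats_per_host_hr=$(( 60 / 5 ))
echo "$(( beats_per_host_hr * 100 )) rows/hr"   # prints: 1200 rows/hr
```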
Four files, all under agent-artifacts/heartbeat/:
**m8trx-brain-heartbeat.sh** — POSIX sh script. Installed at /usr/local/bin/m8trx-brain-heartbeat, mode 0755, root-owned.

Behaviour:
1. Read BRAIN_URL and BRAIN_API_KEY. If either is unset, write a one-line stderr message and exit 1. (Non-zero exit: this is a misconfiguration the operator needs to see via systemctl status / journalctl.)
2. Gather stats:
   - hostname from hostname (or /etc/hostname fallback).
   - uptime_sec from /proc/uptime (first field, integer-truncated).
   - kernel from uname -r.
   - load1, load5, load15 from /proc/loadavg (first three fields, floats).
   - mem_total_mb, mem_avail_mb from /proc/meminfo (MemTotal and MemAvailable are reported in kibibytes; divide by 1024 for MiB. The field name uses _mb for brevity but the value is base-1024 — operators reading the dashboard care about order of magnitude, not exact base, and 1024-based matches what free -m shows on the same host).
   - disk_root_pct from df -P / | tail -1 | awk '{print $5}', stripped of the %.
3. Build the event JSON with jq -nc, using --arg / --argjson so numeric fields are emitted as JSON numbers, not strings:
   - event_id from cat /proc/sys/kernel/random/uuid.
   - ts from date -u +"%Y-%m-%dT%H:%M:%S.000Z".
   - event_type=heartbeat, agent_id=_host.
4. POST to ${BRAIN_URL%/}/v1/events with curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null, sending Authorization: Bearer ${BRAIN_API_KEY} and Content-Type: application/json.

set -e is on throughout, so any of (1)–(4) failing exits non-zero and systemd records the failure.

Implementation notes:
- set -u is also on. Optional vars are handled with ${VAR:-} defaults where needed.
- No trap 'exit 0' (different from the brain-hook). The hook swallowed failures because a non-zero exit would block Claude; the heartbeat wants non-zero to surface in systemctl status.

**m8trx-brain-heartbeat.service** — systemd unit. Installed at
/etc/systemd/system/m8trx-brain-heartbeat.service, mode 0644.
[Unit]
Description=M8trx brain host heartbeat
After=network-online.target tailscaled.service
Wants=network-online.target
[Service]
Type=oneshot
User=root
EnvironmentFile=/etc/m8trx/brain.env
ExecStart=/usr/local/bin/m8trx-brain-heartbeat
After=tailscaled.service because brain is reachable only over
Tailscale.
**m8trx-brain-heartbeat.timer** — systemd timer. Installed at /etc/systemd/system/m8trx-brain-heartbeat.timer, mode 0644.
[Unit]
Description=Fire M8trx brain host heartbeat every 5 minutes
[Timer]
OnBootSec=30s
OnUnitActiveSec=5min
RandomizedDelaySec=30s
Unit=m8trx-brain-heartbeat.service
[Install]
WantedBy=timers.target
**README.md** — short integration doc covering:
- /etc/m8trx/brain.env format (mode 0600 root:root, BRAIN_URL= and BRAIN_API_KEY= lines).
- Install steps:
  cp m8trx-brain-heartbeat.sh /usr/local/bin/m8trx-brain-heartbeat
  cp m8trx-brain-heartbeat.service /etc/systemd/system/
  cp m8trx-brain-heartbeat.timer /etc/systemd/system/
  chmod 0755 /usr/local/bin/m8trx-brain-heartbeat
  systemctl daemon-reload
  systemctl enable --now m8trx-brain-heartbeat.timer
- Verification: systemctl status m8trx-brain-heartbeat.timer, journalctl -u m8trx-brain-heartbeat --since "10 min ago", plus a one-shot manual run via systemctl start m8trx-brain-heartbeat.service.
- Dependencies (curl, jq) — installed by B.4 cloud-init alongside this stack.

Example heartbeat event as POSTed to /v1/events:
{
"event_id": "<uuid v4>",
"ts": "2026-05-03T21:42:11.000Z",
"event_type": "heartbeat",
"agent_id": "_host",
"payload": {
"hostname": "ip-10-0-1-42",
"uptime_sec": 1234567,
"kernel": "6.8.0-1052-aws",
"load1": 0.42,
"load5": 0.51,
"load15": 0.55,
"mem_total_mb": 16384,
"mem_avail_mb": 12000,
"disk_root_pct": 38
}
}
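The behaviour and event shape above can be sketched as a single script. This is a sketch, not the shipped implementation: JSON is built with printf instead of jq --arg/--argjson to keep it dependency-free, and when BRAIN_URL/BRAIN_API_KEY are unset it prints the payload to stdout instead of exiting 1 (per the spec), so the stat collection can be exercised without a brain endpoint.

```shell
#!/bin/sh
# Sketch of /usr/local/bin/m8trx-brain-heartbeat. Illustrative only; the real
# script uses jq and exits 1 on missing env vars.
set -eu

hostname="$(uname -n)"                               # spec: hostname / /etc/hostname fallback
kernel="$(uname -r)"
uptime_sec="$(awk '{print int($1)}' /proc/uptime)"   # first field, integer-truncated
read -r load1 load5 load15 _ < /proc/loadavg         # first three fields, floats
mem_total_mb="$(awk '/^MemTotal:/ {print int($2/1024)}' /proc/meminfo)"    # kB -> MiB
mem_avail_mb="$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)"
disk_root_pct="$(df -P / | tail -1 | awk '{print $5}' | tr -d '%')"
event_id="$(cat /proc/sys/kernel/random/uuid)"
ts="$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")"

# Numeric fields are interpolated unquoted so they land as JSON numbers.
body="$(printf '{"event_id":"%s","ts":"%s","event_type":"heartbeat","agent_id":"_host","payload":{"hostname":"%s","uptime_sec":%s,"kernel":"%s","load1":%s,"load5":%s,"load15":%s,"mem_total_mb":%s,"mem_avail_mb":%s,"disk_root_pct":%s}}' \
  "$event_id" "$ts" "$hostname" "$uptime_sec" "$kernel" \
  "$load1" "$load5" "$load15" "$mem_total_mb" "$mem_avail_mb" "$disk_root_pct")"

if [ -z "${BRAIN_URL:-}" ] || [ -z "${BRAIN_API_KEY:-}" ]; then
  printf '%s\n' "$body"   # real script: one-line stderr message + exit 1 here
else
  curl --silent --show-error --max-time 5 --retry 0 --fail -o /dev/null \
    -H "Authorization: Bearer ${BRAIN_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$body" "${BRAIN_URL%/}/v1/events"
fi
```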
- event_type=heartbeat is in brain's VALID_TYPES (server/src/routes/events.js:9).
- agent_id="_host" is the convention for non-agent telemetry. Existing dashboard queries filter agent_id != '_host' in the per-customer agent count, the per-customer last-seen, and the fleet-wide active counts (server/src/routes/dashboard.js:23, 38, 151, 162), so heartbeats won't pollute agent-count or session-count rollups.
- run_id is omitted — heartbeats have no concept of a run.
- customer_id is not in the payload — brain derives it from the bearer key.
- Numeric payload fields are JSON numbers (--argjson), not strings.
- The agents table auto-upsert (server/src/routes/events.js:46–53) will create an agents row with id="_host" per customer. That gives operators a per-customer "_host" pseudo-agent in /v1/dashboard/agents queries — harmless, and arguably useful. Not suppressing in this phase; revisit only if it creates noise.

| Var | Set by | Required | Behaviour if missing |
|---|---|---|---|
| BRAIN_URL | /etc/m8trx/brain.env (B.4 cloud-init) | yes | exit 1, stderr log |
| BRAIN_API_KEY | /etc/m8trx/brain.env (B.4 cloud-init) | yes | exit 1, stderr log |
No optional env vars. Heartbeats need no AGENT_ID (uses fixed
_host), no RUN_ID, no BRAIN_DEBUG (systemd journal already
captures stderr — debug is operator-visible by default).
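For reference, a sketch of /etc/m8trx/brain.env as B.4 cloud-init would write it (the URL and key values below are placeholders, not real endpoints or credentials):

```shell
# /etc/m8trx/brain.env — mode 0600 root:root, written by B.4 cloud-init.
# Placeholder values for illustration only.
BRAIN_URL=https://brain.example.ts.net
BRAIN_API_KEY=cust_example_placeholder_key
```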
| Failure | Behaviour |
|---|---|
| BRAIN_URL or BRAIN_API_KEY unset | echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2; exit 1. systemd marks the unit failed → visible in systemctl status and journalctl -u. |
| /proc/* read fails | set -e exits non-zero. (Should never happen on Linux.) |
| df -P / fails | set -e exits non-zero. |
| jq missing or fails | set -e exits non-zero. (B.4 cloud-init installs jq alongside.) |
| curl transport failure (Tailscale down, brain unreachable) | curl --fail --max-time 5 exits non-zero; script exits non-zero; systemd marks failed. Operator sees curl's stderr in the journal. |
| Brain returns non-2xx (401 wrong key, 4xx bad event, 5xx) | --fail makes curl exit non-zero (exit code 22, with the HTTP status on stderr); same systemd failure path. |
| Script crashes (syntax error, etc.) | set -e exits non-zero. |
The heartbeat is fundamentally different from the brain-hook (B.2) in its failure philosophy:
- The hook swallows failures (a non-zero exit would block Claude) and gates its stderr behind BRAIN_DEBUG for conditional output.
- The heartbeat fails loudly into systemctl status / journalctl. The systemd journal is operator-only by design, so no BRAIN_DEBUG gate is needed.
- No Restart= on the service unit — Type=oneshot units don't restart, and the timer fires every 5 min regardless of the previous run's exit. A transient failure self-heals on the next beat. A persistent failure stays loud via systemctl status m8trx-brain-heartbeat.timer, which surfaces the last result.
A bin/test-brain-heartbeat.sh script in the brain repo, runnable
against the local brain server, covering:
- Happy path: point the script at the local BRAIN_URL + a freshly-minted key, run it, verify exit 0 and that a row appears in postgres with event_type=heartbeat, agent_id=_host, all 9 payload fields present and within sane bounds (disk_root_pct 0–100, mem_total_mb >= mem_avail_mb, load1/5/15 >= 0, uptime_sec > 0, kernel and hostname non-empty strings).
- Missing env: unset BRAIN_URL, run, verify exit 1, stderr contains "BRAIN_URL/BRAIN_API_KEY unset", no new row.
- Bad key: run with an invalid BRAIN_API_KEY, verify exit non-zero from curl --fail, journal-bound stderr present, no new row.
- Unreachable brain: BRAIN_URL=http://192.0.2.1:1, verify exit non-zero within ~6 s (curl --max-time 5 + slack), no new row.

Bootstrap (same as the B.2 hook test): mint a fresh test key via docker compose exec brain-api node bin/mint-key.js cust_m8trx_test "M8TRX Test" at the top of the runner.
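The missing-env case is the one that needs no server or database. A self-contained sketch of it, run against a generated stub standing in for the installed script (stub path and contents are illustrative, not the real runner):

```shell
#!/bin/sh
# Sketch of the "missing env" case from bin/test-brain-heartbeat.sh. The real
# runner targets the installed script and also checks postgres for the absence
# of a new row; the stub below makes the exit-code and stderr assertions
# demonstrable without touching system state.
hb="$(mktemp)"
cat > "$hb" <<'EOF'
#!/bin/sh
set -eu
if [ -z "${BRAIN_URL:-}" ] || [ -z "${BRAIN_API_KEY:-}" ]; then
  echo "m8trx-brain-heartbeat: BRAIN_URL/BRAIN_API_KEY unset" >&2
  exit 1
fi
EOF
chmod +x "$hb"

# Run with both vars unset: expect exit 1 and the documented stderr line.
if err="$(env -u BRAIN_URL -u BRAIN_API_KEY "$hb" 2>&1)"; then
  echo "FAIL: expected exit 1" >&2; exit 1
fi
echo "$err" | grep -q "BRAIN_URL/BRAIN_API_KEY unset" || { echo "FAIL: wrong stderr" >&2; exit 1; }
echo "ok: missing-env case exits 1 with the expected message"
rm -f "$hb"
```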
Real-world test (cp files into /etc/systemd/system/, daemon-reload,
enable + start the timer, wait, check systemctl status and
journalctl) requires mutating this EC2's actual systemd state.
That's heavier than the hook tests' "no host mutation" model.
Instead, the README documents the smoke procedure for an operator deploying to a real customer EC2 in B.4. The standalone script test above gets ≥90% of the value with zero system-state side effects.
Deferred:
- The agents row auto-upsert when agent_id="_host". Revisit only if the per-customer "_host" pseudo-agent creates dashboard noise.
- Disk stats cover only the root filesystem (/ for MVP).

Open questions: none at design-approval time. All four clarifying questions were resolved interactively before this spec was written.