Internal operator runbook for onboarding a brand-new customer EC2 into the M8trx brain telemetry fleet end-to-end. Single page; the per-phase docs linked from § References are the source-of-truth for details.
Audience: internal M8trx operator. Assumes AWS console + Terraform familiarity, brain EC2 access, and Tailscale admin rights. Onboard only — customer offboard / decommission is out of scope (see § What this runbook deliberately does not cover).
Before any customer onboard, all of these must already exist. Each checklist line has a "verify by" hint. If something is missing, set it up before continuing — don't try to do it lazily during a per-customer onboard.
Brain server up on Tailscale.
Verify: curl -s http://<brain-tailscale-ip>:8080/v1/healthz returns
{"ok":true,...}.
Source: server/ in this repo;
design at docs/superpowers/specs/2026-04-29-brain-design.md.
Tailscale tailnet ACL set up with per-customer isolation.
The ACL needs the M8trx infrastructure tags
(tag:m8trx-brain, tag:m8trx-team) plus an ACL rule template
for adding per-customer tags. Default-deny means
cross-customer traffic is blocked; intra-customer traffic is
allowed by per-customer rules. The brain is reachable from
tag:m8trx-cust-*, and tag:m8trx-team can reach all customers.
Source-of-truth template at
agent-artifacts/cloud-init/README.md § Tailscale ACL.
Verify: in the Tailscale admin console, the ACL JSON has the
tagOwners + acls shape from that template.
Tailnet Lock enabled (Tailscale → Settings → Tailnet Lock). Without this, a stolen customer auth key from SSM lets an attacker onboard hostile devices. With it, new devices stay offline until an admin signs them.
Fleet-wide SSM param set in the AWS region you'll launch customer EC2s in:
/m8trx/brain-url (String) = the brain Tailscale URL,
e.g. http://brain.tailnet.ts.net:8080.
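```bash
# On AWS CLI v1, first run: aws configure set cli_follow_urlparam false
# (v1 fetches http(s):// parameter values by default; CLI v2 does not).
aws ssm put-parameter --name /m8trx/brain-url \
  --type String --value "http://brain.tailnet.ts.net:8080"
```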
(Per-customer Tailscale auth keys + brain bearers are minted
per-customer in the next section, not as fleet-wide setup.)
Fleet IAM applied (one-time per AWS region). Apply the
m8trx-fleet Terraform module:
```hcl
module "m8trx_fleet" {
  source    = "github.com/M8trxInfra/M8trx-Brain//agent-artifacts/cloud-init/terraform/m8trx-fleet?ref=main"
  brain_url = "http://brain.tailnet.ts.net:8080"
}
```
Creates the IAM role + policy + instance profile that every
customer-agent EC2 attaches. Optionally manages the
/m8trx/brain-url SSM param. Outputs
iam_instance_profile_name for the per-agent module to consume.
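The two checklist items that have a command-line answer (brain healthz and the fleet SSM param) can be bundled into a small preflight. This is a hedged sketch, not something shipped in the repo: the function names are hypothetical, it assumes curl and the AWS CLI are configured for the target region, and it takes the brain URL you expect /m8trx/brain-url to hold. The ACL, Tailnet Lock, and IAM items still need the console or Terraform state.

```bash
#!/usr/bin/env bash
# Hypothetical preflight for the § Prerequisites checklist (not in the repo).
# Usage: source this file, then: preflight http://brain.tailnet.ts.net:8080

# True if a healthz body looks like {"ok":true,...}; tolerates whitespace.
healthz_body_ok() {
  local re='"ok"[[:space:]]*:[[:space:]]*true'
  [[ $1 =~ $re ]]
}

# Brain answers /v1/healthz (run from a machine on the tailnet).
check_brain_health() {
  local url=$1 body
  body=$(curl -s --max-time 5 "$url/v1/healthz") || return 1
  healthz_body_ok "$body"
}

# /m8trx/brain-url exists in the CLI's current region and equals $1.
check_brain_url_param() {
  local want=$1 got
  got=$(aws ssm get-parameter --name /m8trx/brain-url \
    --query Parameter.Value --output text) || return 1
  [[ $got == "$want" ]]
}

# Runs both checks, prints ok/FAIL per item, returns non-zero on any failure.
preflight() {
  local url=$1 rc=0
  if check_brain_health "$url"; then echo "ok   brain healthz"
  else echo "FAIL brain healthz"; rc=1; fi
  if check_brain_url_param "$url"; then echo "ok   /m8trx/brain-url"
  else echo "FAIL /m8trx/brain-url"; rc=1; fi
  return "$rc"
}
```

A FAIL line points at the checklist item to fix before starting any per-customer onboard.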
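Run these in order for each new customer. The customer ID must match
brain's mint-key.js validation regex /^cust_[a-z0-9_]+$/,
e.g. cust_acme or cust_bigco. In the commands below, <id> (written
<id_without_cust_> in a few places) is the part after the cust_ prefix,
so cust_<id> is the full customer ID. Once these steps are done, future
agents launched for this customer connect without further per-customer
configuration (if Tailnet Lock is enabled, each new device still needs
the step 6 signature): the cloud-init bootstrap reads the instance's
Customer tag and resolves all per-customer secrets from SSM by that ID.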
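Since every per-customer name in the steps below is derived from the customer ID, a small helper can validate the ID once and print the derived values for copy-paste. A hedged sketch: the function names are hypothetical, the regex is the mint-key.js rule quoted above, and the printed ACL lines mirror step 1 (you still merge them into the existing tagOwners and acls by hand).

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the repo): derive per-customer names from an ID.

# Same rule as brain's mint-key.js: /^cust_[a-z0-9_]+$/
valid_customer_id() {
  local re='^cust_[a-z0-9_]+$'
  [[ $1 =~ $re ]]
}

# cust_acme -> acme (the <id> / <id_without_cust_> used in the steps)
customer_suffix() {
  printf '%s\n' "${1#cust_}"
}

# Prints the tag, ACL lines, and SSM param names for one customer.
print_customer_plan() {
  local id=$1 sfx tag
  if ! valid_customer_id "$id"; then
    echo "invalid customer ID: $id" >&2
    return 1
  fi
  sfx=$(customer_suffix "$id")
  tag="tag:m8trx-cust-$sfx"
  cat <<EOF
Tailscale tag:       $tag
tagOwners entry:     "$tag": ["autogroup:admin"]
acls rule:           { "action": "accept", "src": ["$tag"], "dst": ["$tag:*"] }
SSM auth-key param:  /m8trx/$id/tailscale-auth-key
SSM brain-key param: /m8trx/$id/brain-key
EOF
}
```

For example, run print_customer_plan cust_acme before step 1 and paste from its output; a non-zero exit means the ID would also be rejected by mint-key.js.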
1. Add the customer's Tailscale tag + ACL rule. In the
Tailscale admin console (Access Controls), add to the ACL JSON:
- A tagOwners entry: "tag:m8trx-cust-<id_without_cust_>": ["autogroup:admin"]
— for cust_acme, that's tag:m8trx-cust-acme.
- An intra-customer acls rule:
{ "action": "accept", "src": ["tag:m8trx-cust-<id>"], "dst": ["tag:m8trx-cust-<id>:*"] }
Save. The brain reach + team reach rules already cover the new
tag via the tag:m8trx-cust-* wildcards from § Prerequisites.
2. Mint the customer's Tailscale auth key in the Tailscale
admin console (Settings → Keys → Generate auth key) with:
- Reusable: yes
- Ephemeral: yes
- Tags: tag:m8trx-cust-<id_without_cust_> (the tag you added
in step 1)
Copy the tskey-auth-... value, then store it in SSM in the
same region as the customer EC2:
```bash
aws ssm put-parameter --name /m8trx/cust_<id>/tailscale-auth-key \
  --type SecureString --value "tskey-auth-..."
```
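Tailscale enforces the tag at registration: this key can ONLY
register devices with the customer's tag, so a leaked key cannot join
devices into any other customer's namespace. Tailnet Lock (step 6)
keeps even those devices offline until an admin signs them.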
3. Mint the customer's brain bearer key:
```bash
KEY=$(docker compose -f /home/ubuntu/brain/server/docker-compose.yml \
  exec -T brain-api node bin/mint-key.js cust_<id> "<Display Name>" \
  2>/dev/null | tail -1)
echo "$KEY"  # confirm it looks like m8brain_<env>_<32 base32 chars>
```
mint-key.js prints the plaintext key once on stdout; diagnostics
(incl. the human "minted key …" line) go to stderr and are
suppressed by 2>/dev/null. Capture into $KEY for the next step.
4. Store the brain key in SSM in the same region:
```bash
aws ssm put-parameter --name /m8trx/cust_<id>/brain-key \
  --type SecureString --value "$KEY"
```
5. Apply Terraform to launch the customer EC2. Use the
m8trx-agent
module (m8trx-deployer should wrap this):
```hcl
module "m8trx_agent_acme_1" {
  source                    = "github.com/M8trxInfra/M8trx-Brain//agent-artifacts/cloud-init/terraform/m8trx-agent?ref=main"
  customer_id               = "cust_acme"
  iam_instance_profile_name = module.m8trx_fleet.iam_instance_profile_name
  subnet_id                 = aws_subnet.m8trx.id
  vpc_security_group_ids    = [aws_security_group.m8trx_default.id]
}
```
The module sets the right tags, metadata options, IAM profile,
and user-data automatically. Hand-launching from the EC2 console
is also supported (paste bootstrap.sh as user-data, set the
Customer= tag, enable instance metadata tags, attach the
instance profile from § Prerequisites).
For multiple agents per customer, instantiate the module N times
with the same customer_id — they all share the customer's
tailnet namespace + brain key, and each agent emits telemetry
under a distinct agent_id.
6. Approve the new device(s) in Tailnet Lock (if enabled). Tailscale → Devices → find the newly-joined device by hostname, click "Sign device". Once signed, the device comes online inside its customer's tailnet namespace. (Tailnet Lock is the gate that makes a stolen SSM auth key useless to an attacker; signing is one-click per agent.)
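A quick sanity check on steps 3 and 4, worth running between steps 4 and 5, catches a bad $KEY capture (for example tail -1 grabbing the wrong line) and a param written to the wrong region. Hedged sketch with hypothetical names: the key regex is only a loose reading of the m8brain_<env>_<32 base32 chars> shape from step 3 (it assumes <env> contains no underscore and accepts any ASCII letter or digit in the 32-character part, since the exact base32 alphabet isn't pinned down here), and the SSM check relies on GetParameters listing unknown names under InvalidParameters.

```bash
#!/usr/bin/env bash
# Hypothetical post-step checks for steps 3-4 (not in the repo).

# Loose shape check for m8brain_<env>_<32 chars>. Assumption: <env> has no
# underscore; the 32-char part may be any ASCII letter or digit.
looks_like_brain_key() {
  local re='^m8brain_[A-Za-z0-9]+_[A-Za-z0-9]{32}$'
  [[ $1 =~ $re ]]
}

# Prints the per-customer param names SSM could not find (current region).
missing_customer_params() {
  local id=$1
  aws ssm get-parameters \
    --names "/m8trx/$id/tailscale-auth-key" "/m8trx/$id/brain-key" \
    --query 'InvalidParameters' --output text
}

# Succeeds only if both per-customer params exist. Treats empty output or
# "None" as "nothing missing".
customer_params_ok() {
  local missing
  missing=$(missing_customer_params "$1") || return 1
  if [[ -n $missing && $missing != None ]]; then
    echo "missing SSM params: $missing" >&2
    return 1
  fi
}
```

For example: looks_like_brain_key "$KEY" && customer_params_ok cust_acme && echo ready. The check never decrypts or prints the SecureString values.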
The three-step success check.
a. Bootstrap completed — check the EC2 console "Get System
Log" (Actions → Monitor and troubleshoot → Get system log). Expect
a line near the bottom:
m8trx-bootstrap: complete for cust_<id>
b. Wait ~6 minutes for a heartbeat (worst case: 30s OnBootSec
+ up to 30s RandomizedDelaySec jitter + one 5min
OnUnitActiveSec interval).
c. Heartbeat row landed in brain. On the brain EC2:
```bash
docker compose -f /home/ubuntu/brain/server/docker-compose.yml \
  exec -T postgres psql -U brain brain -tAc \
  "select payload->>'hostname' from events where customer_id='cust_<id>' and event_type='heartbeat' order by ts desc limit 1"
```
Expect a real Amazon-DNS-style hostname (ip-10-x-x-x or similar).
If all three check out, the customer EC2 is fully onboarded.
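Steps b and c can be folded into one polling loop on the brain EC2, so you don't have to time the wait by hand. Hedged sketch: the function names, the COMPOSE_FILE variable, the 30s poll interval, and the 420s default budget (chosen to exceed the 30s + 30s + 5min timer arithmetic in step b) are assumptions; the SQL is the step c query. The ID is validated before it is spliced into SQL.

```bash
#!/usr/bin/env bash
# Hypothetical heartbeat poller for the brain EC2 (not in the repo).
COMPOSE_FILE=${COMPOSE_FILE:-/home/ubuntu/brain/server/docker-compose.yml}

# Prints the hostname from the newest heartbeat row; empty if none yet.
latest_heartbeat_host() {
  docker compose -f "$COMPOSE_FILE" exec -T postgres psql -U brain brain -tAc \
    "select payload->>'hostname' from events where customer_id='$1' and event_type='heartbeat' order by ts desc limit 1"
}

# Usage: wait_for_heartbeat <customer_id> [budget_seconds] [interval_seconds]
wait_for_heartbeat() {
  local id=$1 budget=${2:-420} interval=${3:-30} waited=0 host
  local re='^cust_[a-z0-9_]+$'
  # Validate before the ID is spliced into SQL.
  if ! [[ $id =~ $re ]]; then
    echo "bad customer ID: $id" >&2
    return 2
  fi
  (( interval > 0 )) || interval=1
  while :; do
    host=$(latest_heartbeat_host "$id")
    if [[ -n $host ]]; then
      echo "heartbeat from $host after ~${waited}s"
      # Step c expects an Amazon-DNS-style name such as ip-10-x-x-x.
      [[ $host == ip-* ]] || echo "warning: unexpected hostname $host" >&2
      return 0
    fi
    (( waited >= budget )) && break
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "no heartbeat for $id after ${waited}s; see the debug recipe" >&2
  return 1
}
```

For example, wait_for_heartbeat cust_acme. Exit status 0 means step c passed; non-zero means go to the debug recipe.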
See agent-artifacts/cloud-init/README.md
§ Operator debug recipe for the full failure → fix mapping. Common
failures it covers:
- Customer tag missing or InstanceMetadataTags disabled
- An error occurred (AccessDeniedException) when calling the GetParameter operation
- An error occurred (ParameterNotFound)
- tailscale: failed to authenticate

Don't restate the recipe here; the source-of-truth is one click away.
aws ssm delete-parameter --name /m8trx/cust_<id>/brain-key,
optionally drop the customer's brain rows.)

Per-phase docs, README first (most operationally relevant), spec for context. Plans are intentionally omitted (implementation history, not operational reference).
| Phase | What it does | README | Design spec |
|---|---|---|---|
| B.1 wrapper | session.start/end events from the m8trx-claude-isolate wrapper | (no README; see modified script at agent-artifacts/m8trx-claude-isolate.modified) | 2026-05-03-brain-mvp-ingestion-design.md |
| B.2 hooks | tool_call events from Claude Code PostToolUse hook in the agent-runtime container | agent-artifacts/claude-hooks/README.md | 2026-05-03-brain-claude-hooks-design.md |
| B.3 heartbeat | host-side liveness + system-stats events every 5 min via systemd timer | agent-artifacts/heartbeat/README.md | 2026-05-03-brain-host-heartbeat-design.md |
| B.4 cloud-init | one-shot AWS user-data bash that installs deps, joins Tailscale, fetches SSM secrets, writes brain.env, installs heartbeat | agent-artifacts/cloud-init/README.md | 2026-05-03-brain-cloud-init-bootstrap-design.md |