Date: 2026-05-03
Phase: B.4 (customer-host bootstrap)
Predecessor: B.3 (host heartbeat)
Successor: C (deployment runbook)

Ship the AWS cloud-init user-data + IAM contract that turns a fresh
customer EC2 into a host that's ready to send brain telemetry. Single
self-contained bash script: install host deps, install Tailscale and
join the tailnet, fetch per-customer secrets from SSM, write
/etc/m8trx/brain.env, install the B.3 heartbeat trio, enable the
timer.
The agent telemetry the brain ultimately receives from a customer EC2, which is the actual operator value, comes from three feeds, all of which exist before this phase:

- Host heartbeat events from the B.3 heartbeat trio (one beat every 5 minutes).
- session.start and session.end events (one pair per agent task), with status, exit_code, duration_ms.
- A tool_call event per tool the agent invokes, with {tool_name, input_bytes, output_bytes}. This is the "what tools is this agent running, how often, how chunky" signal.

B.4's job is not to add new telemetry. It is the one-shot bootstrap that gets all three feeds firing from a fresh customer host.

Phases A through B.3 produced the brain server, the wrapper
telemetry patch, the in-container hooks, and the heartbeat trio. All
those artifacts have been validated against the local brain on this
EC2, but they've never run on a real customer host. B.4 closes the
provisioning loop so a "first customer connect" can actually happen —
operator runs terraform apply (or pastes user-data into the
console), and ~5 minutes later the customer's host is sending events
to brain.
┌─ AWS SSM Parameter Store ─────────────────────┐
│ /m8trx/brain-url                 (fleet-wide) │
│ /m8trx/<cust>/tailscale-auth-key (per-cust)   │
│ /m8trx/cust_acme/brain-key       (per-cust)   │
└────────────────────▲──────────────────────────┘
                     │ aws ssm get-parameter
                     │ (IAM: ec2 instance profile)
┌─ EC2 launch ─┐     │
│ user-data:   │     │
│ bootstrap.sh │  cloud-init runs at first boot, as root:
│              │  ┌──────────────────────────────────────────┐
│ Tags:        │  │ 1. Read CUSTOMER_ID from IMDSv2 tag      │
│  Customer=   │  │ 2. apt-get install deps                  │
│  cust_acme   │  │ 3. Install Tailscale, join tailnet       │
│              │  │ 4. Fetch brain key + URL from SSM        │
│ MetadataOpts:│  │ 5. Write /etc/m8trx/brain.env            │
│  Inst-Meta-  │  │ 6. Install B.3 heartbeat trio (heredoc)  │
│  Tags=       │  │ 7. systemctl daemon-reload               │
│  enabled     │  │ 8. systemctl enable --now ...timer       │
│              │  └──────────────────────────────────────────┘
└──────────────┘                       │
                                       │ once Tailscale up + heartbeat
                                       │ enabled, beats fire every 5m
                                       ▼
                  ┌─ brain ──────────────────────────┐
                  │ /v1/events (over Tailscale)      │
                  └──────────────────────────────────┘

The cloud-init is AWS-only (IMDSv2 + SSM are AWS-specific) and
fail-loud — set -euo pipefail; no idempotency. If any step
fails, the operator sees it in /var/log/cloud-init-output.log and
re-launches the EC2.
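
To make the fail-loud contract concrete, here is a minimal sketch that is not part of bootstrap.sh. The function name `demo_fail_loud` is hypothetical, and `false` stands in for a failing `aws ssm get-parameter`. It shows that under `set -euo pipefail` a failed command substitution in an assignment stops the script before any later step runs.

```shell
#!/bin/bash
# Hypothetical demo, not part of bootstrap.sh.
demo_fail_loud() {
  bash -c '
    set -euo pipefail
    echo "step 1 ok"
    VALUE=$(false)         # stands in for a failing aws ssm get-parameter
    echo "step 2 reached"  # never runs: set -e aborted on the line above
  ' || echo "aborted with status $?"
}
```

Calling `demo_fail_loud` prints the first step and then the abort message; the second step never appears, which is the behaviour the operator relies on when reading the bootstrap log.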
Out of scope (handled by other systems on the customer host):

- m8trx-claude-isolate wrapper (paperclipai's responsibility; the wrapper lives in paperclip's container).

CUSTOMER_ID comes from the EC2's Customer= tag, read at boot via
IMDSv2's instance-metadata-tags surface. This requires
MetadataOptions.InstanceMetadataTags=enabled set at launch (a
one-line Terraform attribute or a console toggle).
Why IMDS-tags vs. aws ec2 describe-tags: zero IAM perms beyond the
SSM ones we already need, no AWS API call (faster + cheaper), no
network dependency at the moment we need the customer ID.
Why tags vs. user-data variable: per-customer user-data files would need templating per launch; tags are declarative, easy to audit in the EC2 console, and survive across re-launches via launch templates.
The CUSTOMER_ID value (e.g. cust_acme) directly indexes into the
SSM path /m8trx/${CUSTOMER_ID}/brain-key. No mapping layer.
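
Because the tag value is used directly as a path segment, a cheap format check before the SSM lookup can turn a typo into a clear error. The sketch below is an assumption, not something bootstrap.sh is stated to do: the helper name `brain_key_path` is hypothetical, and the accepted pattern mirrors the `/^cust_[a-z0-9_]+$/` rule quoted for mint-key.js.

```shell
#!/bin/bash
# Hypothetical helper: validate a customer ID against ^cust_[a-z0-9_]+$
# and print the SSM path it indexes into.
brain_key_path() {
  local LC_ALL=C                 # keep [a-z] strictly ASCII lowercase
  local customer_id="$1"
  local suffix="${customer_id#cust_}"

  # Reject a missing cust_ prefix or an empty suffix.
  if [ "$suffix" = "$customer_id" ] || [ -z "$suffix" ]; then
    echo "invalid customer id: ${customer_id}" >&2
    return 1
  fi

  # Reject any character outside a-z, 0-9 and underscore.
  case "$suffix" in
    *[!a-z0-9_]*)
      echo "invalid customer id: ${customer_id}" >&2
      return 1
      ;;
  esac

  printf '/m8trx/%s/brain-key\n' "$customer_id"
}
```

For example, `brain_key_path cust_acme` prints `/m8trx/cust_acme/brain-key`, while `brain_key_path acme` returns non-zero with a message on stderr.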
Each customer gets their own reusable + ephemeral auth key,
stored in SSM at /m8trx/<customer_id>/tailscale-auth-key, minted
in the Tailscale admin console with the customer's tag baked in
(tag:m8trx-cust-<id>, e.g. tag:m8trx-cust-acme). Two mechanisms
provide the isolation:

- Key-tag binding: a key minted with tag:m8trx-cust-acme can ONLY register devices with that tag, never another customer's. So a leaked customer key cannot pivot into another customer's namespace.
- ACLs: customer hosts may talk only within their own tag (tag:m8trx-cust-acme ↔ tag:m8trx-cust-acme); cross-customer traffic is default-deny. Brain is reachable from all customer tags (tag:m8trx-cust-* → tag:m8trx-brain:8080). The team's tag (tag:m8trx-team) can reach all customer hosts (chat, debug).

Combined: complete network isolation between customers
(cust_acme cannot reach cust_bigco at the network layer at all)
using a single tailnet, one ACL JSON to maintain, and
Tailscale-enforced key-tag binding.

Tailnet Lock (recommended): enable in Tailscale → Settings. With Tailnet Lock on, new devices stay offline until an admin signs them — even with a valid auth key. Closes the "stolen SSM key" attack: an attacker who exfiltrated a customer's Tailscale key still can't onboard a hostile device without admin approval.
Customer-ID → tag derivation: bootstrap.sh computes the tag
from CUSTOMER_ID by stripping the cust_ prefix:
cust_acme → tag:m8trx-cust-acme. Operators must mint each
customer's auth key with the matching tag.
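
The derivation is a single prefix-stripping parameter expansion, the same one bootstrap.sh uses for TS_TAG. The sketch below wraps it in a hypothetical function, `derive_ts_tag`, so operators can compute the tag to mint a key with.

```shell
#!/bin/bash
# Hypothetical wrapper around the expansion bootstrap.sh uses for TS_TAG.
derive_ts_tag() {
  local customer_id="$1"
  # ${var#cust_} removes a leading "cust_" and leaves the rest untouched.
  printf 'tag:m8trx-cust-%s\n' "${customer_id#cust_}"
}
```

`derive_ts_tag cust_acme` prints `tag:m8trx-cust-acme`. Underscores after the prefix pass through unchanged, so `cust_b4_smoke` becomes `tag:m8trx-cust-b4_smoke`; it is worth confirming such a name against Tailscale's tag naming rules before minting the key.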
Rotation: mint a new tag-bound key in the Tailscale admin
console, aws ssm put-parameter --overwrite to update the SSM
value, revoke the old key in the Tailscale admin console. No
host-side action needed; existing devices continue to work because
the key was used at join time.
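
The SSM half of a rotation can be sketched as one call. This assumes the operator's shell already has AWS credentials and a default region configured; the function name `rotate_ts_key_in_ssm` is hypothetical. Minting the new key and revoking the old one still happen in the Tailscale admin console.

```shell
#!/bin/bash
# Hypothetical helper for the SSM step of a Tailscale key rotation.
# Assumes AWS credentials and a default region are already configured.
rotate_ts_key_in_ssm() {
  local customer_id="$1" new_key="$2"
  aws ssm put-parameter \
    --name "/m8trx/${customer_id}/tailscale-auth-key" \
    --type SecureString \
    --value "$new_key" \
    --overwrite
}
```

Usage would look like `rotate_ts_key_in_ssm cust_acme "$NEW_KEY"`, with NEW_KEY read from a prompt or a file rather than typed inline, so the secret does not land in shell history.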
Per-host preauth keys (Tailscale API minted at provisioning) and OAuth client dynamic minting were considered and rejected for MVP — the per-customer reusable key model gives equivalent isolation with no API integration overhead.
Why not separate tailnets per customer: A separate Tailscale tailnet per customer (Option A in the brainstorm) is the strongest possible isolation but requires brain to be multi-homed across N tailnets and creates N admin consoles to manage. Per-customer tag-and-key within a single tailnet (this spec) is the operational sweet spot: Tailscale enforces isolation server-side; one tailnet to administer; brain reachable from all customers naturally; team chat trivially spans all customers.
Three files under agent-artifacts/cloud-init/:
bootstrap.sh

Pure bash, fail-loud, ~120 lines including the embedded heartbeat
heredocs. Operator pastes the entire file into the EC2 user-data
field (or references it from Terraform via
file("...bootstrap.sh"); templatefile() is unsuitable here because
it would try to interpolate the script's own ${...} shell expansions).

Structure:
#!/bin/bash
set -euo pipefail
exec > >(tee -a /var/log/m8trx-bootstrap.log) 2>&1

# 1. Read CUSTOMER_ID from IMDSv2 instance tag.
#    -fsS: fail on HTTP errors (e.g. 404 if tag missing or
#    InstanceMetadataTags not enabled), silent but show errors.
#    Without -f, curl exits 0 on HTTP 4xx and stuffs the error body
#    into CUSTOMER_ID; SSM then gets a garbage path lookup later.
TOKEN=$(curl -fsS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
CUSTOMER_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/tags/instance/Customer) \
  || { echo "Customer tag missing or InstanceMetadataTags disabled"; exit 1; }
REGION=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/region)

# 2. Install host deps
apt-get update -y
apt-get install -y --no-install-recommends \
  awscli jq curl docker.io ca-certificates

# 3. Tailscale install + join
curl -fsSL https://tailscale.com/install.sh | sh
TS_KEY=$(aws ssm get-parameter --region "$REGION" \
  --name "/m8trx/${CUSTOMER_ID}/tailscale-auth-key" --with-decryption \
  --query Parameter.Value --output text)
TS_TAG="tag:m8trx-cust-${CUSTOMER_ID#cust_}"
tailscale up --auth-key="$TS_KEY" --ssh --advertise-tags="$TS_TAG"

# 4. Fetch brain bearer + URL from SSM
BRAIN_KEY=$(aws ssm get-parameter --region "$REGION" \
  --name "/m8trx/${CUSTOMER_ID}/brain-key" --with-decryption \
  --query Parameter.Value --output text)
BRAIN_URL=$(aws ssm get-parameter --region "$REGION" \
  --name /m8trx/brain-url \
  --query Parameter.Value --output text)

# 5. Write /etc/m8trx/brain.env
install -d -m 0700 /etc/m8trx
cat > /etc/m8trx/brain.env <<EOF
BRAIN_URL=${BRAIN_URL}
BRAIN_API_KEY=${BRAIN_KEY}
EOF
chmod 0600 /etc/m8trx/brain.env

# 6. Install B.3 heartbeat trio (verbatim copies, kept in sync by
#    bin/test-cloud-init.sh diff check)
cat > /usr/local/bin/m8trx-brain-heartbeat <<'HEARTBEAT_SH'
[verbatim copy of agent-artifacts/heartbeat/m8trx-brain-heartbeat.sh]
HEARTBEAT_SH
chmod 0755 /usr/local/bin/m8trx-brain-heartbeat

cat > /etc/systemd/system/m8trx-brain-heartbeat.service <<'HEARTBEAT_SERVICE'
[verbatim copy of agent-artifacts/heartbeat/m8trx-brain-heartbeat.service]
HEARTBEAT_SERVICE

cat > /etc/systemd/system/m8trx-brain-heartbeat.timer <<'HEARTBEAT_TIMER'
[verbatim copy of agent-artifacts/heartbeat/m8trx-brain-heartbeat.timer]
HEARTBEAT_TIMER

# 7. Enable + start
systemctl daemon-reload
systemctl enable --now m8trx-brain-heartbeat.timer
echo "m8trx-bootstrap: complete for ${CUSTOMER_ID}"

The single exec > >(tee ...) 2>&1 line at the top duplicates all
output to /var/log/m8trx-bootstrap.log for grep convenience while
preserving the normal cloud-init log path.
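
Step 5 of the script writes the env file and then tightens its mode. One possible variant, sketched here as an assumption rather than the spec's method, creates the file under a restrictive umask so that it is never readable by other users, even briefly. The function name `write_brain_env` and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical variant of step 5: write brain.env under umask 077.
write_brain_env() {
  local dir="$1" url="$2" key="$3"
  (
    # The subshell keeps the umask change from leaking into the caller.
    umask 077
    mkdir -p "$dir"
    printf 'BRAIN_URL=%s\nBRAIN_API_KEY=%s\n' "$url" "$key" > "${dir}/brain.env"
  )
}
```

In bootstrap.sh this would be called as `write_brain_env /etc/m8trx "$BRAIN_URL" "$BRAIN_KEY"`. A newly created file ends up with mode 0600 and a newly created directory with 0700; an existing file keeps whatever mode it already had.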
iam-policy.json

Sample IAM policy granting the three SSM GetParameter perms the
bootstrap needs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FetchBrainUrl",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:*:*:parameter/m8trx/brain-url"
    },
    {
      "Sid": "FetchPerCustomerSecrets",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": [
        "arn:aws:ssm:*:*:parameter/m8trx/*/brain-key",
        "arn:aws:ssm:*:*:parameter/m8trx/*/tailscale-auth-key"
      ]
    }
  ]
}

The per-customer-secrets statement uses a wildcard so that one IAM
role works fleet-wide. bootstrap.sh only references
/m8trx/${CUSTOMER_ID}/{brain-key,tailscale-auth-key}, never others.
Network isolation is the load-bearing mitigation here: even if a
compromised customer host abuses its IAM role to fetch another
customer's keys, Tailscale-enforced key-tag binding means that
customer's stolen Tailscale key still can only register devices with
THAT customer's tag — and the brain bearer is customer-scoped at the
server side via requireCustomerAuth. For belt-and-suspenders
IAM-level isolation (per-customer instance profile, each scoped to
that customer's SSM paths), revisit when fleet scale or contract
demands it.
No ec2:Describe* perms needed (IMDS-tags are metadata-direct). No
SSM write perms needed (bootstrap is read-only against SSM).
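
If per-customer instance profiles are adopted later, the scoped policy could be generated from the customer ID. The sketch below is hypothetical (the function `render_customer_policy` and the Sid `FetchOwnSecrets` are not in the spec); it mirrors the sample policy but replaces the wildcard with one customer's paths.

```shell
#!/bin/bash
# Hypothetical generator for a per-customer variant of iam-policy.json.
render_customer_policy() {
  local customer_id="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FetchBrainUrl",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:*:*:parameter/m8trx/brain-url"
    },
    {
      "Sid": "FetchOwnSecrets",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": [
        "arn:aws:ssm:*:*:parameter/m8trx/${customer_id}/brain-key",
        "arn:aws:ssm:*:*:parameter/m8trx/${customer_id}/tailscale-auth-key"
      ]
    }
  ]
}
EOF
}
```

`render_customer_policy cust_acme > cust_acme-policy.json` would produce a document that can be checked with `jq .` in the same way as the sample policy.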
README.md

Six h2 sections covering:

1. SSM params: the required parameters, with aws ssm put-parameter examples for each.
2. IAM: the instance-profile policy (aws_iam_policy_document + attachment).
3. Launch settings: Customer= tag, MetadataOptions.InstanceMetadataTags=enabled, instance profile attached, user-data = bootstrap.sh contents. Recommended Ubuntu 22.04 (the apt-get and tailscale install.sh paths assume Debian-family).
4. Terraform snippet: an aws_launch_template showing all the above wired together.
5. Smoke procedure.
6. Debug recipe: where to find /var/log/m8trx-bootstrap.log and /var/log/cloud-init-output.log, how to reach a half-bootstrapped host via AWS Session Manager, common failure modes and their fixes.

SSM parameters:

| Path | Type | Scope | Purpose |
|---|---|---|---|
| /m8trx/brain-url | String | fleet-wide | e.g. http://brain.tailnet.ts.net:8080. In SSM (rather than baked into bootstrap.sh) so brain can move without re-bootstrapping the fleet. |
| /m8trx/<customer_id>/tailscale-auth-key | SecureString | per-customer | Reusable + ephemeral Tailscale key with the customer's tag (tag:m8trx-cust-<id>) baked in. Tailscale-enforced binding: key for cust_acme can only register devices with tag:m8trx-cust-acme. |
| /m8trx/<customer_id>/brain-key | SecureString | per-customer | The bearer for Authorization: Bearer … against brain. <customer_id> matches the EC2's Customer= tag. |

The customer EC2 must launch with:

- Tag Customer=<customer_id>: the customer ID matching the /m8trx/<id>/brain-key SSM param. Customer IDs follow brain's mint-key.js validation: /^cust_[a-z0-9_]+$/ (e.g. cust_acme). Mismatches between the tag and the SSM param name surface as ParameterNotFound from aws ssm get-parameter, which set -e propagates as a bootstrap failure visible in /var/log/m8trx-bootstrap.log.
- MetadataOptions.InstanceMetadataTags=enabled so bootstrap.sh can read the tag from IMDSv2.
- An instance profile carrying the policy in iam-policy.json.
- User-data set to the contents of bootstrap.sh.
- Ubuntu 22.04 (apt-get and Tailscale install assume Debian-family with systemd).

Cloud-init runs once at first boot. Fail-loud, no idempotency:

| Failure | Behaviour | Operator response |
|---|---|---|
| Customer= tag missing | IMDS tag fetch returns 404 → curl -f exits non-zero → the script's error handler prints "Customer tag missing or InstanceMetadataTags disabled" and exits 1 | Terminate + relaunch with the tag set. |
| MetadataOptions.InstanceMetadataTags=enabled not set | IMDS tag fetch returns 404 → same as above | Same. Fix the launch template. |
| IAM role lacks ssm:GetParameter | aws ssm get-parameter exits non-zero (AccessDenied) → set -e aborts | Bootstrap log shows the AWS error. Fix IAM. |
| SSM param missing (e.g. customer key not minted) | aws ssm get-parameter exits with ParameterNotFound → set -e aborts | aws ssm put-parameter ... then relaunch. |
| Tailscale install fetch fails (no internet pre-tailnet) | curl ... \| sh exits non-zero (pipefail) → set -e aborts | Subnet has no NAT/IGW route. Fix networking. |
| tailscale up fails (bad auth key, ACL rejects tag) | tailscale CLI exits non-zero → set -e aborts | Rotate the SSM key or fix the Tailscale ACL. |
| apt-get fails (transient mirror issue) | set -e aborts | Usually transient; relaunch. |
| systemd commands fail | set -e aborts | Genuine systemd issue; investigate via Session Manager. |

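
The failure modes above leave recognisable strings in the bootstrap log, so a first-pass triage can be scripted. This is a hedged sketch, not part of the spec: the function name `triage_bootstrap_log` is hypothetical, and it assumes the AWS CLI error text contains `AccessDenied` or `ParameterNotFound` as named in the table.

```shell
#!/bin/bash
# Hypothetical triage helper: map known strings in the bootstrap log
# to the operator responses listed in the failure table.
triage_bootstrap_log() {
  local log="$1"
  if grep -q 'Customer tag missing' "$log"; then
    echo "Relaunch with the Customer= tag and InstanceMetadataTags=enabled."
  elif grep -q 'AccessDenied' "$log"; then
    echo "Fix the instance profile: it needs ssm:GetParameter."
  elif grep -q 'ParameterNotFound' "$log"; then
    echo "Create the missing SSM parameter, then relaunch."
  elif grep -q 'm8trx-bootstrap: complete' "$log"; then
    echo "Bootstrap completed."
  else
    echo "No known signature found; inspect the log manually."
  fi
}
```

On a host this would be run as `triage_bootstrap_log /var/log/m8trx-bootstrap.log`. The checks are ordered so that the earliest bootstrap step wins when several signatures appear.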
Logging:

- /var/log/cloud-init-output.log: cloud-init's own log; AWS console exposes it via "Get System Log."
- /var/log/m8trx-bootstrap.log: duplicate of the bash script's stdout/stderr for grep convenience.
- journalctl -u m8trx-brain-heartbeat (B.3 contract).

Debug access to a half-bootstrapped host:

tailscale up --ssh enables Tailscale SSH, so once Tailscale is up
(step 3 of bootstrap) the operator can tailscale ssh <ec2-tailscale-name> to investigate. If Tailscale itself failed to
come up, fall back to AWS Session Manager via the EC2 console. Note
that Session Manager needs its own instance-profile permissions (for
example the AmazonSSMManagedInstanceCore managed policy), which
iam-policy.json does not grant, so they must be attached separately.

bin/test-cloud-init.sh (in scope):

| Check | What it catches |
|---|---|
| shellcheck agent-artifacts/cloud-init/bootstrap.sh | Syntax errors, unquoted vars, classic shell pitfalls. If shellcheck is missing on the host, the test prints (shellcheck not installed — skip) and continues; the rest of the suite is the gate. |
| jq . agent-artifacts/cloud-init/iam-policy.json | Malformed IAM JSON. |
| Embedded heartbeat drift check | Extracts the three heredoc'd files from bootstrap.sh (HEARTBEAT_SH, HEARTBEAT_SERVICE, HEARTBEAT_TIMER heredoc tags) into /tmp/, then diff against the canonical agent-artifacts/heartbeat/* files. Exits non-zero on any byte difference. |
| README sanity (grep -c '^##') | Right number of h2 sections (6: SSM params, IAM, launch settings, Terraform snippet, smoke procedure, debug recipe). |

The drift check is the load-bearing one. Without it, a future update to the canonical heartbeat script would silently leave the embedded copy stale; new customer EC2s would ship the old version forever.
Extraction approach (POSIX awk):
awk '/^cat > .* <<.HEARTBEAT_SH./,/^HEARTBEAT_SH$/' bootstrap.sh \
| sed '1d;$d'
Strips the opening cat > line and the closing tag line; emits just
the body. Repeat for HEARTBEAT_SERVICE and HEARTBEAT_TIMER.
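
A self-contained sketch of the whole check for one tag might look like the following. It is a variant, not the spec's exact pipeline: instead of the awk range plus sed, it uses a single awk pass with a flag, and the function names `extract_heredoc` and `check_drift` are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the drift check for one heredoc tag.

# Print the body of the heredoc named $1 found in the script $2.
extract_heredoc() {
  awk -v tag="$1" '
    $0 ~ ("^cat > .* <<." tag ".$") { inside = 1; next }  # opening cat line
    inside && $0 == tag             { inside = 0; next }  # closing tag line
    inside                          { print }             # heredoc body
  ' "$2"
}

# Compare the embedded copy against the canonical file $3.
# Exits non-zero (diff status) on any difference.
check_drift() {
  extract_heredoc "$1" "$2" | diff - "$3"
}
```

For the heartbeat script this would be invoked as `check_drift HEARTBEAT_SH agent-artifacts/cloud-init/bootstrap.sh agent-artifacts/heartbeat/m8trx-brain-heartbeat.sh`, and likewise for the other two tags.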
Honest end-to-end requires a real EC2:

1. Mint a smoke-test customer key (cust_b4_smoke) and store it in SSM. bootstrap.sh also fetches /m8trx/cust_b4_smoke/tailscale-auth-key, so that parameter must exist as well.

       KEY=$(node bin/mint-key.js cust_b4_smoke "B.4 smoke test")
       aws ssm put-parameter --name /m8trx/cust_b4_smoke/brain-key \
         --type SecureString --value "$KEY"

2. Launch an EC2 with the Customer=cust_b4_smoke tag, InstanceMetadataTags=enabled, instance profile attached, user-data = bootstrap.sh.
3. Wait until the bootstrap log shows m8trx-bootstrap: complete for cust_b4_smoke.
4. Check brain for a heartbeat:

       psql -tAc "select payload->>'hostname' from events where customer_id='cust_b4_smoke' and event_type='heartbeat' order by ts desc limit 1"

   → expect a real hostname.
5. Optionally, tailscale ssh <ec2-tailscale-name> and run an agent task via paperclipai; expect tool_call events to land. Skip if paperclipai isn't installed on the smoke host.
6. Clean up, including the cust_b4_smoke SSM param + the brain row.

Out of scope: m8trx-claude-isolate wrapper (paperclipai's responsibility; the wrapper lives in paperclip's container and ships via paperclipai's own deployment).

Open questions: None at design-approval time. All seven clarifying questions were resolved interactively before this spec was written.