Phase B.4 — Tailscale + cloud-init bootstrap

Date: 2026-05-03
Phase: B.4 (customer-host bootstrap)
Predecessor: B.3 (host heartbeat)
Successor: C (deployment runbook)

Goal

Ship the AWS cloud-init user-data + IAM contract that turns a fresh customer EC2 into a host that's ready to send brain telemetry. Single self-contained bash script: install host deps, install Tailscale and join the tailnet, fetch per-customer secrets from SSM, write /etc/m8trx/brain.env, install the B.3 heartbeat trio, enable the timer.

The agent telemetry the brain ultimately receives from a customer EC2 — the actual operator value — comes from three feeds, all of which exist before this phase: the wrapper telemetry patch, the in-container hooks, and the B.3 heartbeat trio.

B.4's job is not to add new telemetry — it's the one-shot bootstrap that gets all three feeds firing from a fresh customer host.

Why now

Phases A through B.3 produced the brain server, the wrapper telemetry patch, the in-container hooks, and the heartbeat trio. All those artifacts have been validated against the local brain on this EC2, but they've never run on a real customer host. B.4 closes the provisioning loop so a "first customer connect" can actually happen — operator runs terraform apply (or pastes user-data into the console), and ~5 minutes later the customer's host is sending events to brain.

Architecture

                ┌─ AWS SSM Parameter Store ─────────────────┐
                │  /m8trx/brain-url            (fleet-wide) │
                │  /m8trx/<cust>/tailscale-auth-key  (per-cust) │
                │  /m8trx/cust_acme/brain-key  (per-cust)   │
                └────────────────────▲──────────────────────┘
                                     │ aws ssm get-parameter
                                     │  (IAM: ec2 instance profile)
        ┌─ EC2 launch ─┐             │
        │ user-data:   │             │
        │  bootstrap.sh│  cloud-init runs at first boot, as root:
        │              │  ┌──────────────────────────────────────────┐
        │ Tags:        │  │ 1. Read CUSTOMER_ID from IMDSv2 tag       │
        │  Customer=   │  │ 2. apt-get install deps                   │
        │   cust_acme  │  │ 3. Install Tailscale, join tailnet        │
        │              │  │ 4. Fetch brain key + URL from SSM         │
        │ MetadataOpts:│  │ 5. Write /etc/m8trx/brain.env             │
        │  Inst-Meta-  │  │ 6. Install B.3 heartbeat trio (heredoc)   │
        │  Tags=       │  │ 7. systemctl daemon-reload                │
        │   enabled    │  │ 8. systemctl enable --now ...timer        │
        │              │  └──────────────────────────────────────────┘
        └──────────────┘                  │
                                          │ once Tailscale up + heartbeat
                                          │ enabled, beats fire every 5m
                                          ▼
                                  ┌─ brain ──────────────────────────┐
                                  │  /v1/events  (over Tailscale)    │
                                  └──────────────────────────────────┘

The cloud-init is AWS-only (IMDSv2 + SSM are AWS-specific) and fail-loud (set -euo pipefail); no idempotency. If any step fails, the operator sees it in /var/log/cloud-init-output.log and re-launches the EC2.

Out of scope (handled by other systems on the customer host):

Customer ID resolution

CUSTOMER_ID comes from the EC2's Customer= tag, read at boot via IMDSv2's instance-metadata-tags surface. Requires MetadataOptions.InstanceMetadataTags=enabled set at launch (a one-line Terraform attribute or a console toggle).

Why IMDS-tags vs. aws ec2 describe-tags: zero IAM perms beyond the SSM ones we already need, no AWS API call (faster + cheaper), and no dependency on outbound network reachability at the moment we need the customer ID (IMDS is link-local).

Why tags vs. user-data variable: per-customer user-data files would need templating per launch; tags are declarative, easy to audit in the EC2 console, and survive across re-launches via launch templates.

The CUSTOMER_ID value (e.g. cust_acme) directly indexes into the SSM path /m8trx/${CUSTOMER_ID}/brain-key. No mapping layer.

Tailscale auth strategy — per-customer isolation

Each customer gets their own reusable + ephemeral auth key, stored in SSM at /m8trx/<customer_id>/tailscale-auth-key, minted in the Tailscale admin console with the customer's tag baked in (tag:m8trx-cust-<id>, e.g. tag:m8trx-cust-acme).

Combined, these give complete network isolation between customers (cust_acme cannot reach cust_bigco at the network layer at all) using a single tailnet, with one ACL JSON to maintain and Tailscale-enforced key-tag binding.

Tailnet Lock (recommended): enable in Tailscale → Settings. With Tailnet Lock on, new devices stay offline until an admin signs them — even with a valid auth key. Closes the "stolen SSM key" attack: an attacker who exfiltrated a customer's Tailscale key still can't onboard a hostile device without admin approval.
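A minimal admin-side sketch of the approval step, assuming Tailnet Lock is already initialised and the operator's machine holds a trusted signing key (the node-key value is a placeholder):

# Inspect lock status; newly bootstrapped, unsigned nodes are listed as locked out.
tailscale lock status

# Sign the new customer host's node key so it comes online.
tailscale lock sign nodekey:0123456789abcdef...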

Customer-ID → tag derivation: bootstrap.sh computes the tag from CUSTOMER_ID by stripping the cust_ prefix: cust_acme → tag:m8trx-cust-acme. Operators must mint each customer's auth key with the matching tag.

Rotation: mint a new tag-bound key in the Tailscale admin console, aws ssm put-parameter --overwrite to update the SSM value, revoke the old key in the Tailscale admin console. No host-side action needed; existing devices continue to work because the key was used at join time.
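For example, rotating cust_acme's key might look like this (the tskey value is a placeholder for the freshly minted tag-bound key):

aws ssm put-parameter \
    --name /m8trx/cust_acme/tailscale-auth-key \
    --type SecureString \
    --value "tskey-auth-PLACEHOLDER" \
    --overwrite
# Then revoke the old key in the Tailscale admin console; existing devices
# are unaffected because the key was only consumed at join time.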

Per-host preauth keys (Tailscale API minted at provisioning) and OAuth client dynamic minting were considered and rejected for MVP — the per-customer reusable key model gives equivalent isolation with no API integration overhead.

Why not separate tailnets per customer: A separate Tailscale tailnet per customer (Option A in the brainstorm) is the strongest possible isolation but requires brain to be multi-homed across N tailnets and creates N admin consoles to manage. Per-customer tag-and-key within a single tailnet (this spec) is the operational sweet spot: Tailscale enforces isolation server-side; one tailnet to administer; brain reachable from all customers naturally; team chat trivially spans all customers.

Components

Three files under agent-artifacts/cloud-init/:

1. bootstrap.sh

Pure bash, fail-loud, ~120 lines including the embedded heartbeat heredocs. Operator pastes the entire file into the EC2 user-data field (or references it from Terraform via templatefile("...bootstrap.sh", {})).

Structure:

#!/bin/bash
set -euo pipefail
exec > >(tee -a /var/log/m8trx-bootstrap.log) 2>&1

# 1. Read CUSTOMER_ID from IMDSv2 instance tag.
#    -fsS: fail on HTTP errors (e.g. 404 if tag missing or
#    InstanceMetadataTags not enabled), silent but show errors.
#    Without -f, curl exits 0 on HTTP 4xx and stuffs the error body
#    into CUSTOMER_ID — then SSM gets a garbage path lookup later.
TOKEN=$(curl -fsS -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
CUSTOMER_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/tags/instance/Customer) \
    || { echo "Customer tag missing or InstanceMetadataTags disabled"; exit 1; }
REGION=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/placement/region)

# 2. Install host deps
apt-get update -y
apt-get install -y --no-install-recommends \
    awscli jq curl docker.io ca-certificates

# 3. Tailscale install + join
curl -fsSL https://tailscale.com/install.sh | sh
TS_KEY=$(aws ssm get-parameter --region "$REGION" \
    --name "/m8trx/${CUSTOMER_ID}/tailscale-auth-key" --with-decryption \
    --query Parameter.Value --output text)
TS_TAG="tag:m8trx-cust-${CUSTOMER_ID#cust_}"
tailscale up --auth-key="$TS_KEY" --ssh --advertise-tags="$TS_TAG"

# 4. Fetch brain bearer + URL from SSM
BRAIN_KEY=$(aws ssm get-parameter --region "$REGION" \
    --name "/m8trx/${CUSTOMER_ID}/brain-key" --with-decryption \
    --query Parameter.Value --output text)
BRAIN_URL=$(aws ssm get-parameter --region "$REGION" \
    --name /m8trx/brain-url \
    --query Parameter.Value --output text)

# 5. Write /etc/m8trx/brain.env
install -d -m 0700 /etc/m8trx
cat > /etc/m8trx/brain.env <<EOF
BRAIN_URL=${BRAIN_URL}
BRAIN_API_KEY=${BRAIN_KEY}
EOF
chmod 0600 /etc/m8trx/brain.env

# 6. Install B.3 heartbeat trio (verbatim copies — kept in sync by
#    bin/test-cloud-init.sh diff check)
cat > /usr/local/bin/m8trx-brain-heartbeat <<'HEARTBEAT_SH'
[verbatim copy of agent-artifacts/heartbeat/m8trx-brain-heartbeat.sh]
HEARTBEAT_SH
chmod 0755 /usr/local/bin/m8trx-brain-heartbeat

cat > /etc/systemd/system/m8trx-brain-heartbeat.service <<'HEARTBEAT_SERVICE'
[verbatim copy of agent-artifacts/heartbeat/m8trx-brain-heartbeat.service]
HEARTBEAT_SERVICE

cat > /etc/systemd/system/m8trx-brain-heartbeat.timer <<'HEARTBEAT_TIMER'
[verbatim copy of agent-artifacts/heartbeat/m8trx-brain-heartbeat.timer]
HEARTBEAT_TIMER

# 7. Enable + start
systemctl daemon-reload
systemctl enable --now m8trx-brain-heartbeat.timer

echo "m8trx-bootstrap: complete for ${CUSTOMER_ID}"

The single exec > >(tee ...) 2>&1 line at the top duplicates all output to /var/log/m8trx-bootstrap.log for grep convenience while preserving the normal cloud-init log path.

2. iam-policy.json

Sample IAM policy granting the three SSM GetParameter perms the bootstrap needs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FetchBrainUrl",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:*:*:parameter/m8trx/brain-url"
    },
    {
      "Sid": "FetchPerCustomerSecrets",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": [
        "arn:aws:ssm:*:*:parameter/m8trx/*/brain-key",
        "arn:aws:ssm:*:*:parameter/m8trx/*/tailscale-auth-key"
      ]
    }
  ]
}

The per-customer-secrets statement uses a wildcard so a single IAM role works fleet-wide. bootstrap.sh only ever references /m8trx/${CUSTOMER_ID}/{brain-key,tailscale-auth-key}, never another customer's paths.

Network isolation is the load-bearing mitigation here: even if a compromised customer host abuses its IAM role to fetch another customer's keys, Tailscale-enforced key-tag binding means the stolen Tailscale key can still only register devices carrying that other customer's tag (and with Tailnet Lock on, an unsigned device never comes online at all), and the brain bearer is customer-scoped on the server side via requireCustomerAuth. For belt-and-suspenders IAM-level isolation (a per-customer instance profile, each scoped to that customer's SSM paths), revisit when fleet scale or a contract demands it.

No ec2:Describe* perms needed (IMDS-tags are metadata-direct). No SSM write perms needed (bootstrap is read-only against SSM).

3. README.md

Six h2 sections covering: the SSM parameter contract, the IAM policy, EC2 launch settings, a Terraform snippet, the smoke-test procedure, and the debug recipe.

SSM parameter contract

/m8trx/brain-url (String, fleet-wide): e.g. http://brain.tailnet.ts.net:8080. In SSM (rather than baked into bootstrap.sh) so brain can move without re-bootstrapping the fleet.
/m8trx/<customer_id>/tailscale-auth-key (SecureString, per-customer): Reusable + ephemeral Tailscale key with the customer's tag (tag:m8trx-cust-<id>) baked in. Tailscale-enforced binding: a key for cust_acme can only register devices with tag:m8trx-cust-acme.
/m8trx/<customer_id>/brain-key (SecureString, per-customer): The bearer for Authorization: Bearer … against brain. <customer_id> matches the EC2's Customer= tag.
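A hedged onboarding sketch that seeds this contract for a new customer (brain-url is set once per fleet; the Tailscale key value is a placeholder, and the mint-key.js invocation mirrors the smoke-test procedure below):

# Fleet-wide, once:
aws ssm put-parameter --name /m8trx/brain-url \
    --type String --value "http://brain.tailnet.ts.net:8080"

# Per customer (cust_acme as the example):
aws ssm put-parameter --name /m8trx/cust_acme/tailscale-auth-key \
    --type SecureString --value "tskey-auth-PLACEHOLDER"
aws ssm put-parameter --name /m8trx/cust_acme/brain-key \
    --type SecureString --value "$(node bin/mint-key.js cust_acme 'acme host')"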

EC2 launch settings (operator contract)

The customer EC2 must launch with the following (a sample launch command follows the list):

  1. Tag Customer=<customer_id> — the customer ID matching the /m8trx/<id>/brain-key SSM param. Customer IDs follow brain's mint-key.js validation: /^cust_[a-z0-9_]+$/ (e.g. cust_acme). Mismatches between the tag and the SSM param name surface as ParameterNotFound from aws ssm get-parameter, which set -e propagates as a bootstrap failure visible in /var/log/m8trx-bootstrap.log.
  2. MetadataOptions with InstanceMetadataTags=enabled so bootstrap.sh can read the tag from IMDSv2.
  3. Instance profile attached, granting the IAM policy in iam-policy.json.
  4. User-data = the contents of bootstrap.sh.
  5. AMI Ubuntu 22.04+ recommended (the apt-get and Tailscale install assume Debian-family with systemd).
  6. Network with egress to the public internet (for apt-get, Tailscale install + relay, and SSM API). No public ingress required — Tailscale handles inbound; the brain itself is tailnet-only.
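A hedged aws CLI sketch of a launch that satisfies the contract above; the AMI ID, subnet ID, and instance-profile name (m8trx-customer-host) are placeholders, and Terraform attributes map one-to-one:

# Placeholders: Ubuntu 22.04 AMI for the region, a subnet with internet
# egress, and an instance profile carrying iam-policy.json.
aws ec2 run-instances \
    --image-id ami-PLACEHOLDER \
    --instance-type t3.micro \
    --subnet-id subnet-PLACEHOLDER \
    --iam-instance-profile Name=m8trx-customer-host \
    --metadata-options "HttpTokens=required,InstanceMetadataTags=enabled" \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Customer,Value=cust_acme}]' \
    --user-data file://agent-artifacts/cloud-init/bootstrap.sh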

Error handling

Cloud-init runs once at first boot. Fail-loud, no idempotency:

Customer= tag missing: [ -n "$CUSTOMER_ID" ] fails → exit 1 "Customer tag missing". Operator: terminate + relaunch with the tag set.
MetadataOptions.InstanceMetadataTags=enabled not set: IMDS tag fetch returns 404 → CUSTOMER_ID empty → same as above. Operator: same; fix the launch template.
IAM role lacks ssm:GetParameter: aws ssm get-parameter exits non-zero (AccessDenied) → set -e aborts. Operator: bootstrap log shows the AWS error; fix IAM.
SSM param missing (e.g. customer key not minted): aws ssm get-parameter exits with ParameterNotFound → set -e aborts. Operator: aws ssm put-parameter ... then relaunch.
Tailscale install fetch fails (no internet pre-tailnet): curl ... | sh exits non-zero → set -e aborts. Operator: subnet has no NAT/IGW route; fix networking.
tailscale up fails (bad auth key, ACL rejects tag): tailscale CLI exits non-zero → set -e aborts. Operator: rotate the SSM key or fix the Tailscale ACL.
apt-get fails (transient mirror issue): set -e aborts. Operator: usually transient; relaunch.
systemd commands fail: set -e aborts. Operator: genuine systemd issue; investigate via Session Manager.

Logging: all bootstrap output is duplicated to /var/log/m8trx-bootstrap.log (via the exec/tee line at the top of bootstrap.sh) and also appears in the standard /var/log/cloud-init-output.log.

Debug access to a half-bootstrapped host:

tailscale up --ssh enables Tailscale SSH so once Tailscale is up (step 3 of bootstrap), the operator can tailscale ssh <ec2-tailscale-name> to investigate. If Tailscale itself failed to come up, fall back to AWS Session Manager via the EC2 console.
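A short debug sketch once Tailscale is up (the device name is a placeholder; the unit names are the B.3 trio installed in step 6):

# From the operator's machine:
tailscale ssh ubuntu@cust-acme-host        # placeholder device name

# On the customer host:
tail -n 50 /var/log/m8trx-bootstrap.log              # full bootstrap output
systemctl status m8trx-brain-heartbeat.timer         # enabled? next trigger?
journalctl -u m8trx-brain-heartbeat.service -n 20    # recent heartbeat runs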

Testing

Local cheap checks (bin/test-cloud-init.sh, in scope)

shellcheck agent-artifacts/cloud-init/bootstrap.sh: catches syntax errors, unquoted vars, classic shell pitfalls. If shellcheck is missing on the host, the test prints (shellcheck not installed — skip) and continues; the rest of the suite is the gate.
jq . agent-artifacts/cloud-init/iam-policy.json: catches malformed IAM JSON.
Embedded heartbeat drift check: extracts the three heredoc'd files from bootstrap.sh (HEARTBEAT_SH, HEARTBEAT_SERVICE, HEARTBEAT_TIMER heredoc tags) into /tmp/, then diffs against the canonical agent-artifacts/heartbeat/* files. Exits non-zero on any byte difference.
README sanity (grep -c '^##'): checks for the right number of h2 sections (6: SSM params, IAM, launch settings, Terraform snippet, smoke procedure, debug recipe).

The drift check is the load-bearing one. Without it, a future update to the canonical heartbeat script would silently leave the embedded copy stale; new customer EC2s would ship the old version forever.

Extraction approach (POSIX awk):

awk '/^cat > .* <<.HEARTBEAT_SH./,/^HEARTBEAT_SH$/' bootstrap.sh \
    | sed '1d;$d'

Strips the opening cat > line and the closing tag line; emits just the body. Repeat for HEARTBEAT_SERVICE and HEARTBEAT_TIMER.
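A hedged sketch of how bin/test-cloud-init.sh can drive that extraction for all three heredocs; the tag-to-file pairing follows the filenames referenced in step 6 of bootstrap.sh:

for pair in \
    "HEARTBEAT_SH:agent-artifacts/heartbeat/m8trx-brain-heartbeat.sh" \
    "HEARTBEAT_SERVICE:agent-artifacts/heartbeat/m8trx-brain-heartbeat.service" \
    "HEARTBEAT_TIMER:agent-artifacts/heartbeat/m8trx-brain-heartbeat.timer"; do
    tag=${pair%%:*}; canon=${pair#*:}
    # Extract the heredoc body: range-match, then strip the cat line and closing tag.
    awk "/^cat > .* <<.${tag}./,/^${tag}\$/" agent-artifacts/cloud-init/bootstrap.sh \
        | sed '1d;$d' > "/tmp/${tag}.extracted"
    diff -u "$canon" "/tmp/${tag}.extracted" || exit 1   # fail on any byte difference
done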

Manual smoke test (operator-side, documented in README)

Honest end-to-end requires a real EC2:

  1. Mint a fresh brain key for a throwaway customer (e.g. cust_b4_smoke):
    KEY=$(node bin/mint-key.js cust_b4_smoke "B.4 smoke test")
    aws ssm put-parameter --name /m8trx/cust_b4_smoke/brain-key \
        --type SecureString --value "$KEY"
    
  2. Launch a t3.micro Ubuntu 22 EC2 with: Customer=cust_b4_smoke tag, InstanceMetadataTags=enabled, instance profile attached, user-data = bootstrap.sh.
  3. Wait ~3 min, check "Get System Log" in EC2 console → expect m8trx-bootstrap: complete for cust_b4_smoke.
  4. Wait another ~5 min for the first heartbeat.
  5. On brain EC2: psql -tAc "select payload->>'hostname' from events where customer_id='cust_b4_smoke' and event_type='heartbeat' order by ts desc limit 1" → expect a real hostname.
  6. (Optional) tailscale ssh <ec2-tailscale-name> and run an agent task via paperclipai — expect tool_call events to land. Skip if paperclipai isn't installed on the smoke host.
  7. Terminate the smoke EC2.
  8. Optionally delete the cust_b4_smoke SSM param + the brain row (see the cleanup sketch below).
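A hedged cleanup sketch for steps 7-8 (the instance ID is a placeholder; the delete assumes the events table queried in step 5):

aws ec2 terminate-instances --instance-ids i-PLACEHOLDER
aws ssm delete-parameter --name /m8trx/cust_b4_smoke/brain-key
# On the brain EC2:
psql -c "delete from events where customer_id='cust_b4_smoke'"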

What we deliberately don't do

Out of scope

Open questions

None at design-approval time. All seven clarifying questions were resolved interactively before this spec was written.