Phase C — Deployment runbook

Date: 2026-05-03
Phase: C (operator-facing onboarding documentation)
Predecessor: B.4 (Tailscale + cloud-init bootstrap)
Successor: none — final MVP phase

Goal

Ship docs/runbook-connect-customer-ec2.md: a single-page operator runbook that synthesizes the four B.x phases (wrapper telemetry, Claude Code hooks, host heartbeat, cloud-init bootstrap) into an end-to-end "from zero to telemetry-arriving" sequence for connecting a brand-new customer EC2 to the M8trx brain.

The runbook is the single entry point an internal operator opens when a new customer needs onboarding. It covers what to do, in what order, and with what prerequisites; the details (script behaviour, debug recipes, IAM policy text) live in the per-phase READMEs the runbook links out to.

Why now

All four B.x phases shipped artifacts that work in isolation. The per-phase READMEs (agent-artifacts/<phase>/README.md) cover their own slice well. What's missing is the operator-facing "where do I start" entry point — the doc you open the first time you need to connect a real customer EC2 and want a single sequenced checklist instead of stitching four READMEs together yourself.

This is the last MVP phase. Once it's in, "first customer connect" is operationally documented end-to-end.

Audience

Internal M8trx operator only. Assumes working familiarity with SSM, Tailscale, and Terraform.

The runbook does not explain what SSM, Tailscale, or Terraform are. If/when M8trx ever needs a customer-DevOps-facing variant (for customers running their own AWS accounts), it'll fork from this internal version. Building both now is YAGNI.

Scope

Onboard only.

Out of scope (each handled separately when needed):

Document structure

Single file docs/runbook-connect-customer-ec2.md. Six h2 sections, in operator-execution order:

1. Prerequisites (one-time fleet setup)

What must already exist before any customer onboard. Each item is a checklist line with a "verify by" hint and a pointer to the source-of-truth doc:

2. Per-customer onboard checklist

The 4-step sequence for every new customer:

  1. Mint a brain bearer key and capture it for step 2:
    KEY=$(docker compose -f /home/ubuntu/brain/server/docker-compose.yml \
        exec -T brain-api node bin/mint-key.js cust_<id> "<Display Name>")
    
  2. Store it in SSM:
    aws ssm put-parameter --name /m8trx/cust_<id>/brain-key \
        --type SecureString --value "$KEY"
    
  3. Update the Tailscale ACL — only needed for the first customer on a fresh tailnet (usually a no-op, since the tag:m8trx-customer-host rule is part of the prerequisites).
  4. Apply Terraform (or paste user-data into the EC2 console) to launch the customer EC2 with the right tag, metadata-options, instance-profile, and user-data. Reference agent-artifacts/cloud-init/README.md § Terraform launch snippet.
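Steps 1 and 2 are the only per-customer commands with substitutions to get right. As a sketch (a hypothetical helper, not a shipped script), they can be rendered for a given customer so the operator can eyeball the exact commands before running them — the mint-key and put-parameter invocations mirror the checklist above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print steps 1-2 of the onboard checklist with the
# customer id and display name substituted, for eyeball review before running.
set -euo pipefail

onboard_cmds() {
  local cust_id="$1" display_name="$2"
  # Unquoted heredoc: ${cust_id}/${display_name} expand; \$KEY stays literal
  # so the printed commands capture and reuse the key at run time.
  cat <<EOF
KEY=\$(docker compose -f /home/ubuntu/brain/server/docker-compose.yml \\
    exec -T brain-api node bin/mint-key.js cust_${cust_id} "${display_name}")
aws ssm put-parameter --name /m8trx/cust_${cust_id}/brain-key \\
    --type SecureString --value "\$KEY"
EOF
}

# Example: onboard_cmds 42 "Acme Corp"
```

Piping the output through manual review (rather than executing it blindly) keeps the human-in-the-loop property the checklist assumes.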

3. Validation (happy-path smoke)

The 3-command success check:

4. If something failed

Pointer-only section. Don't restate the full failure → fix mapping; just link to it:

See agent-artifacts/cloud-init/README.md § Operator debug recipe for the full failure → fix mapping (Customer tag missing, AccessDeniedException, ParameterNotFound, tailscale auth fail, no events arriving).

5. What this runbook deliberately does not cover

Restate the out-of-scope items above so a reader landing on the runbook knows where else to look. Particularly important: "updating an existing customer EC2" → terminate + relaunch.

6. References

Per-phase doc layers, README first (most operationally relevant), spec second:

Plans (docs/superpowers/plans/*) are intentionally omitted from the runbook — they're implementation history, not operational reference.
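Pulling sections 1–6 together, the runbook's heading skeleton (section titles as specified above; the top-level title is illustrative) would look like:

```markdown
# Connect a customer EC2 to the M8trx brain

## 1. Prerequisites (one-time fleet setup)
## 2. Per-customer onboard checklist
## 3. Validation (happy-path smoke)
## 4. If something failed
## 5. What this runbook deliberately does not cover
## 6. References
```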

Cross-cutting changes

In addition to the new runbook file, one small change to docs/RESUME.md: add a top-level pointer at the start of "What's running right now" or as a new "Operator runbook" section so a returning operator/contributor sees:

For onboarding a new customer EC2, read docs/runbook-connect-customer-ec2.md first.

This is the runbook's discoverability hook from the doc operators already know to open.

Validation approach

The runbook itself is doc-only — there's no test suite that runs against it. Validation is empirical: the first real-customer connect IS the runbook's validation. If a reasonably-careful operator follows the runbook end-to-end and gets cust_<id> heartbeats arriving at brain, the runbook works. If they get stuck, that's a runbook bug to fix in a follow-up.

To set the runbook up for that test, the spec mandates:

Out of scope

Open questions

None at design-approval time. All three clarifying questions were resolved interactively before this spec was written.