# Kartā agent-led install playbook

Read this end-to-end before doing anything. Run every pre-flight gate
in order. Report each as `[OK]` or `[FAIL: reason]`. Do **not** tell the
operator to launch `kartaa-install` until every check is green. When the
operator launches the TUI, monitor it in a side terminal. After the TUI
exits, run the post-flight verification.
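
The `[OK]` / `[FAIL: reason]` convention above can be wrapped in a small helper so every gate is reported the same way. This is a minimal sketch: `gate` and `FAILED` are hypothetical names for illustration, not part of `kartaa-install`.

```bash
# Hypothetical helper: run one pre-flight check and print the result in
# the playbook's [OK] / [FAIL: reason] format. Not part of kartaa-install.
FAILED=0

gate() {
  local name="$1"; shift
  # Run the remaining arguments as the check; its own output is discarded.
  if "$@" >/dev/null 2>&1; then
    echo "[OK] $name"
  else
    echo "[FAIL: $name]"
    FAILED=$((FAILED + 1))
  fi
}
```

For example, `gate "aws cli present" aws --version` prints one status line. Call `gate` directly rather than inside `$(...)`, since a subshell would discard the `FAILED` counter.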

---

## Context

**kartaa-install** is a Go TUI. It provisions a single-tenant cockpit on
a Linux host, joins it to the operator's tailnet, runs the bootstrap,
and wires up Codex device-auth + (optionally) GitHub App + Telegram.

Two host modes are available (a BYO mode is planned but currently
disabled in the TUI, so the agent should not offer it as a choice):

- **`aws`** — provisions a fresh EC2 instance in the operator's account.
- **`hetzner`** — provisions a fresh Hetzner Cloud server.

Phase numbering varies slightly across the two modes, but the
pre-flight gates below cover both. Skip the AWS section for
`hetzner`, the Hetzner section for `aws`.

Each run persists under `~/.config/kartaa-install/runs/<timestamp>-<uuid>/`:
live `state.toml` (phase + error) and, on failure, `report.json`. The
newest run dir is the live one — sort by mtime to find it.
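
The run-directory lookup described above can be sketched as two small functions. The function names are hypothetical; the only facts taken from this playbook are the base path and that `report.json` appears on failure.

```bash
# Print the most recently modified run directory under BASE
# (defaults to the path this playbook documents).
newest_run_dir() {
  local base="${1:-$HOME/.config/kartaa-install/runs}"
  ls -dt "$base"/*/ 2>/dev/null | head -n 1
}

# Succeed if the given run directory contains a failure report.
run_failed() {
  local run="$1"
  [ -n "$run" ] && [ -f "${run%/}/report.json" ]
}
```

Usage: `run=$(newest_run_dir); run_failed "$run" && echo "[FAIL: see $run/report.json]"`.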

---

## Step 0 — get the latest installer

The kartaa repo is private; download from the public Pages mirror.

```bash
LATEST=$(curl -fsSL https://kartaa.pages.dev/version.json | jq -r .version)
ARCH=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/')  # Linux reports aarch64, macOS arm64
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
URL="https://kartaa.pages.dev/binaries/${LATEST}/kartaa-install_${LATEST#v}_${OS}_${ARCH}"

curl -fsSL "$URL" -o /tmp/kartaa-install.new
curl -fsSL "https://kartaa.pages.dev/binaries/${LATEST}/checksums.txt" -o /tmp/checksums.txt

ASSET="kartaa-install_${LATEST#v}_${OS}_${ARCH}"
EXPECTED=$(grep " $ASSET\$" /tmp/checksums.txt | awk '{print $1}')
ACTUAL=$(shasum -a 256 /tmp/kartaa-install.new | awk '{print $1}')
# Fail if no checksum entry was found as well as on a mismatch.
test -n "$EXPECTED" && test "$EXPECTED" = "$ACTUAL" || { echo "[FAIL: sha mismatch]"; exit 1; }

chmod +x /tmp/kartaa-install.new
DEST="$(command -v kartaa-install || echo /usr/local/bin/kartaa-install)"
sudo mv /tmp/kartaa-install.new "$DEST"
"$DEST" --version
```
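
The checksum step uses `shasum`, which may be missing on some Linux workstations. A minimal sketch of a portable variant follows, assuming either `sha256sum` or `shasum` is installed; `sha256_of` and `verify_sha256` are hypothetical helper names.

```bash
# Print the SHA-256 of a file using whichever tool is available.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Compare a file against an expected digest; an empty expected value
# (asset missing from checksums.txt) is treated as a failure.
verify_sha256() {
  local file="$1" expected="$2" actual
  [ -n "$expected" ] || { echo "[FAIL: no checksum entry]"; return 1; }
  actual=$(sha256_of "$file")
  [ "$expected" = "$actual" ] || { echo "[FAIL: sha mismatch]"; return 1; }
  echo "[OK] sha256 verified"
}
```

Usage: `verify_sha256 /tmp/kartaa-install.new "$EXPECTED" || exit 1`.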

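**Operator workstation must be on the tailnet.** Post-flight verifies
the cockpit at `https://<HOSTNAME>.<TAILNET-SUFFIX>.ts.net`, which
resolves and routes only from a tailnet-joined device. Live monitoring
over tailnet SSH to the new host has the same requirement. The
`MagicDNSSuffix` value captured below as `$TAILNET_SUFFIX` already ends
in `.ts.net`, so do not append `.ts.net` a second time when substituting
it into URLs. Verify before launching: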

```bash
tailscale --version >/dev/null 2>&1 || echo "[FAIL: install Tailscale on this workstation first]"
tailscale status --json 2>/dev/null \
  | jq -e '.BackendState == "Running"' >/dev/null \
  && echo "[OK] tailnet up" \
  || echo "[FAIL: tailscale up + sign in to the same tailnet first]"
TAILNET_SUFFIX=$(tailscale status --json 2>/dev/null | jq -r '.MagicDNSSuffix // empty')
test -n "$TAILNET_SUFFIX" \
  && echo "[OK] tailnet suffix: $TAILNET_SUFFIX" \
  || echo "[FAIL: MagicDNS not enabled — see step 2]"
```

Ask the operator which **host mode** they want: `aws` or `hetzner`.
Run only the matching section in step 1.

---

## Step 1a — AWS pre-flight (skip if mode != aws)

**Reference docs**

- [AWS CLI install](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
- [AWS named profiles](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)
- [`sts:GetCallerIdentity`](https://docs.aws.amazon.com/cli/latest/reference/sts/get-caller-identity.html)
- [EC2 default VPC](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html)
- [EC2 service quotas](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) — `L-1216C47A` is on-demand standard vCPUs
- [`ec2:ImportKeyPair`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)
- [Free Tier auto-suspend plan](https://aws.amazon.com/free/)
- [`ec2:RunInstances` dry-run](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_RunInstances.html#API_RunInstances_RequestParameters) (`--dry-run`)

**Gates**

1. `aws --version` → v2.x present.
2. `aws sts get-caller-identity --region <REGION>` returns the expected
   Account ID + ARN. Operator must confirm.
3. Default VPC exists in `<REGION>` and has tenancy `default` (not
   `dedicated`):
   ```bash
   aws ec2 describe-vpcs --region <REGION> \
     --filters Name=is-default,Values=true \
     --query 'Vpcs[0].{Id:VpcId,Tenancy:InstanceTenancy}'
   ```
4. ≥ 1 default subnet present:
   ```bash
   aws ec2 describe-subnets --region <REGION> \
     --filters Name=defaultForAz,Values=true --query 'length(Subnets)'
   ```
5. EC2 vCPU on-demand quota ≥ 4:
   ```bash
   aws service-quotas get-service-quota --region <REGION> \
     --service-code ec2 --quota-code L-1216C47A --query 'Quota.Value'
   ```
6. Instance size + Free-Tier eligibility. The installer ships a
   picker so you don't have to memorize type names — list the curated
   8 GB-floor options (with vCPU/RAM/arch/price/free-tier flag) for the
   region:
   ```bash
   kartaa-install instance-types --provider aws --region <REGION> --format json
   ```
   The TUI's server-type step reads the same data. Default is
   `m7i-flex.large` (2 vCPU / 8 GB). Free-Tier eligibility is
   informational, not a gate: it moves (e.g. `m7i-flex.large` is
   eligible for accounts created on/after 2025-07-15, not earlier).
   If the operator's account is on the post-2025 Free Plan
   (auto-suspend) and no 8 GB type is eligible, warn that the launch
   may fail; the fix is a Paid account with `--server-type t3a.large`.
   Kartā needs 8 GB, so a 1 GB type like `t3.micro` can't run it.
7. SSH key pair: import **only if missing**. `import-key-pair` is
   NOT idempotent: re-importing an existing name returns
   `InvalidKeyPair.Duplicate`. Always describe first and import only
   when the key pair is not found (`$PUBKEY` is the path to the
   operator's public key file):
   ```bash
   if aws ec2 describe-key-pairs --region <REGION> \
        --key-names "<KEY_NAME>" >/dev/null 2>&1; then
     echo "[OK] key pair '<KEY_NAME>' exists in <REGION>"
   else
     aws ec2 import-key-pair --region <REGION> \
       --key-name "<KEY_NAME>" --public-key-material "fileb://$PUBKEY"
   fi
   ```
8. **`RunInstances` dry-run** — proves the credentials can actually
   launch an instance without launching one. The AMI architecture
   must match the chosen instance type (here `m7i-flex.large` is
   x86_64-only, so we hardcode the amd64 SSM path regardless of the
   operator's local `uname -m`):
   ```bash
   AMI=$(aws ssm get-parameters --region <REGION> \
     --names /aws/service/canonical/ubuntu/server/24.04/stable/current/amd64/hvm/ebs-gp3/ami-id \
     --query 'Parameters[0].Value' --output text)
   aws ec2 run-instances --region <REGION> --dry-run \
     --image-id "$AMI" --instance-type m7i-flex.large \
     --key-name "<KEY_NAME>" 2>&1 | grep -q 'DryRunOperation' \
     && echo "[OK] RunInstances dry-run authorized" \
     || echo "[FAIL: dry-run rejected]"
   ```
   AWS returns `DryRunOperation` only if the call would have succeeded.
   Anything else means a real IAM / quota / config gap.
9. No stale `kartaa-*` instance with the chosen hostname:
   ```bash
   aws ec2 describe-instances --region <REGION> \
     --filters 'Name=tag:Name,Values=<HOSTNAME>' \
              'Name=instance-state-name,Values=pending,running,stopping,stopped' \
     --query 'Reservations[].Instances[].[InstanceId,State.Name]'
   ```
   If non-empty, ask the operator before terminating.
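
Gates 4 and 5 compare a number against a minimum, and the quota value is returned as a decimal such as `32.0`, which the shell's integer-only `test -ge` rejects. A minimal sketch of a comparison helper follows; `at_least` is a hypothetical name, and `$REGION` is assumed to hold the operator's chosen region.

```bash
# Hypothetical helper: succeed if VALUE >= MIN, allowing fractional
# values. Empty or non-numeric input is treated as 0 and therefore fails.
at_least() {
  awk -v v="$1" -v min="$2" 'BEGIN { exit !(v + 0 >= min + 0) }'
}

# Usage against gate 5 (REGION must be supplied by the operator):
#   quota=$(aws service-quotas get-service-quota --region "$REGION" \
#     --service-code ec2 --quota-code L-1216C47A \
#     --query 'Quota.Value' --output text)
#   at_least "$quota" 4 && echo "[OK] vCPU quota: $quota" \
#     || echo "[FAIL: vCPU quota $quota < 4]"
```

The same helper covers gate 4 with `at_least "$subnet_count" 1`.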

---

## Step 1b — Hetzner pre-flight (skip if mode != hetzner)

**Reference docs**

- [Hetzner Cloud API token](https://docs.hetzner.com/cloud/api/getting-started/generating-api-token/)
- [Hetzner CPX server types](https://docs.hetzner.com/cloud/servers/overview)

**Gates**

1. `hcloud` CLI installed, or the operator has the API token in the
   environment: `echo "${HCLOUD_TOKEN:0:8}…"` should print a non-empty
   prefix before the ellipsis.
2. Token has Read & Write scope (Hetzner tokens are either Read or
   Read & Write; there are no finer-grained scopes).
3. Probe the API:
   `curl -fsSL -H "Authorization: Bearer $HCLOUD_TOKEN" https://api.hetzner.cloud/v1/servers | jq '.meta'`.
   A non-error response means the token is valid.
4. Confirm the chosen server type (default `cpx32`) is available in the
   operator's preferred location.
5. SSH pubkey selected for upload — same as the AWS case.
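
Gate 4 can be checked against the Hetzner Cloud API rather than by eye. The sketch below assumes the `GET /v1/datacenters` response lists available server type ids under `server_types.available` and names the location under `location.name`, as the public API reference described at the time of writing. Verify those field names before relying on it. `type_available_in` is a hypothetical helper name.

```bash
# Hypothetical helper: read the JSON body of GET /v1/datacenters on
# stdin and succeed if server type id TYPE_ID is listed as available in
# LOCATION (e.g. fsn1). Field names are assumptions from the API docs.
type_available_in() {
  local type_id="$1" location="$2"
  jq -e --argjson id "$type_id" --arg loc "$location" '
    any(.datacenters[]
        | select(.location.name == $loc)
        | .server_types.available[];
        . == $id)
  ' >/dev/null
}

# Usage (LOCATION chosen by the operator; cpx32 is the playbook default):
#   api=https://api.hetzner.cloud/v1
#   auth="Authorization: Bearer $HCLOUD_TOKEN"
#   id=$(curl -fsSL -H "$auth" "$api/server_types?name=cpx32" \
#     | jq '.server_types[0].id')
#   curl -fsSL -H "$auth" "$api/datacenters" \
#     | type_available_in "$id" "$LOCATION" \
#     && echo "[OK] cpx32 available in $LOCATION" \
#     || echo "[FAIL: cpx32 not available in $LOCATION]"
```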

---

## Step 2 — Tailscale

**Reference docs**

- [Auth keys overview](https://tailscale.com/kb/1085/auth-keys)
- [Generating an auth key](https://login.tailscale.com/admin/settings/keys)
- [Tagging nodes via auth keys (ACL)](https://tailscale.com/kb/1068/acl-tags)
- [API access tokens](https://tailscale.com/kb/1101/api)
- [MagicDNS](https://tailscale.com/kb/1081/magicdns)
- [HTTPS certificates](https://tailscale.com/kb/1153/enabling-https)

**Auth key (used by the cockpit at first boot to join the tailnet)**

Have the operator generate at <https://login.tailscale.com/admin/settings/keys>:

- **Reusable**: ON (a single-use key gets consumed by the first attempt; if
  the install fails midway, a re-run can't reuse it and the new EC2/VPS
  silently times out at "Waiting for Tailscale join").
- **Ephemeral**: OFF (we want a real, persistent node).
- **Pre-approved**: ON (only matters if the tailnet has device approval).
- **Expiration**: ≥ 7 days (cushion for retries; default 90 d is fine).
- **Tags**: at minimum a tag the operator's ACL whitelists for new
  nodes (commonly `tag:server` or `tag:kartaa`). Confirm the tag exists
  in the [tailnet policy file](https://login.tailscale.com/admin/acls)
  under `tagOwners`.

Verify format on paste:

```bash
test "${TS_AUTHKEY:0:11}" = "tskey-auth-" || echo "[FAIL: bad prefix]"
test ${#TS_AUTHKEY} -ge 50 || echo "[FAIL: key too short]"
```

**API token (used to sweep stale ghost nodes if the install ever
hits a hostname collision)**

Generate at <https://login.tailscale.com/admin/settings/keys> in the
"API access tokens" section. Same minimum 7-day TTL. Format:

```bash
test "${TS_API_TOKEN:0:10}" = "tskey-api-" || echo "[FAIL: bad prefix]"
```

**Admin toggles**

Open <https://login.tailscale.com/admin/dns> and confirm:

- **MagicDNS**: enabled.
- **HTTPS Certificates**: enabled.

Both are required: `tailscale serve`, later in the install, issues a
Let's Encrypt cert via the tailnet's HTTPS path. Without these, the
cockpit URL will work over IP but not over the friendly hostname.

**Hostname collision check**

If the operator's chosen hostname is already on the tailnet,
the install will register as `<hostname>-1` and Phase 6 will time
out polling for `<hostname>` itself. Pre-check:

```bash
dig +short <HOSTNAME>.<TAILNET-SUFFIX>.ts.net
```

If it resolves, the operator must delete the existing node at
<https://login.tailscale.com/admin/machines> before launching.
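
A DNS lookup may come back empty even when the name is taken, for example if `dig` bypasses the resolver that MagicDNS uses on the workstation. As a cross-check, the peer list from `tailscale status --json` can be searched directly. This is a minimal sketch: `tailnet_has_host` is a hypothetical helper, and it assumes the JSON exposes `Self` and `Peer` entries with `HostName` and `DNSName` fields.

```bash
# Hypothetical helper: read `tailscale status --json` on stdin and
# succeed if HOST matches any node's HostName or the first label of its
# DNSName (case-insensitive).
tailnet_has_host() {
  local host="$1"
  jq -e --arg h "$host" '
    ($h | ascii_downcase) as $want
    | [ .Self, (.Peer // {})[] ]
    | any(
        ((.HostName // "") | ascii_downcase) == $want
        or (((.DNSName // "") | split(".")[0] // "") | ascii_downcase) == $want
      )
  ' >/dev/null
}

# Usage:
#   tailscale status --json | tailnet_has_host "<HOSTNAME>" \
#     && echo "[FAIL: hostname already on the tailnet]" \
#     || echo "[OK] hostname free"
```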

---

## Step 3 — Codex

**Reference docs**

- [Codex CLI](https://github.com/openai/codex)
- [Codex authentication (device-auth on headless host)](https://github.com/openai/codex/blob/main/docs/authentication.md#login-on-a-headless-server)
- [ChatGPT Plus / Pro plans](https://openai.com/chatgpt/pricing/)

**Gates**

1. Operator has an active **ChatGPT Plus** ($20/mo) or **Pro** ($200/mo)
   subscription with Codex quota. Kartā uses the operator's subscription
   quota — there's no separate API billing.
2. Operator understands the device-auth flow: the TUI runs
   `codex login --device-auth` on the host and prints a short code. The
   operator opens <https://chatgpt.com/codex/auth/link> on any device
   already signed in, pastes the code, approves. Takes ~30 s. The host
   then has a long-lived OAuth token in `~/.config/codex/auth.json`.

No agent action other than confirming the operator is ready for the
device-auth step when the TUI gets there.

---

## Step 4 — Telegram (optional)

**Reference docs**

- [@BotFather](https://t.me/BotFather)
- [@userinfobot](https://t.me/userinfobot)
- [Telegram Bot API](https://core.telegram.org/bots/api)

**Gates** (only if operator wants push notifications from the cockpit)

1. Bot token: in Telegram, message `@BotFather`, run `/newbot`, follow
   the prompts. Save the token (looks like `123456789:ABC-DEF…`). Format
   (must use `[[ ... ]]` — `=~` is not a `test`/`[` operator):
   ```bash
   [[ "${TG_BOT_TOKEN}" =~ ^[0-9]+:.+$ ]] || echo "[FAIL: bad token shape]"
   ```
2. chat_id: message `@userinfobot`, copy the numeric ID it prints.
3. Smoke-test the bot can DM the operator (the operator must `/start`
   the bot first):
   ```bash
   curl -fsSL "https://api.telegram.org/bot${TG_BOT_TOKEN}/sendMessage" \
     --data-urlencode "chat_id=${TG_CHAT_ID}" \
     --data-urlencode "text=Kartā install pre-flight ✓"
   ```
   If this returns `{"ok":true,…}`, push notifications will work
   end-to-end after install.

---

## GO / NO-GO

If every gate above for the chosen host mode is `[OK]`, tell the
operator:

> **READY TO LAUNCH.** Run `kartaa-install` in your terminal. I'll
> monitor in a second terminal.

Otherwise, list every `[FAIL]` with the exact remediation command and
hold.
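
If the gate results were collected as text, the decision can be derived mechanically. This is a minimal sketch with a hypothetical `go_no_go` helper that reads gate lines on stdin; it lists the failures but leaves the remediation commands to the agent.

```bash
# Hypothetical helper: read gate lines on stdin, echo every [FAIL ...]
# line, and return non-zero if any were found.
go_no_go() {
  local fails
  fails=$(grep -F '[FAIL' || true)
  if [ -n "$fails" ]; then
    printf '%s\n' "$fails"
    echo "NO-GO: resolve the failures above before launching."
    return 1
  fi
  echo "READY TO LAUNCH."
}
```

Usage: `go_no_go < gates.log`, where `gates.log` is an assumed transcript of the `[OK]` / `[FAIL: …]` lines printed during pre-flight.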

---

## Live monitoring during the run

Have the operator launch `kartaa-install` in their terminal. In a
second terminal, you (the agent) watch `state.toml` and the
provider-side resource status for the chosen host mode:

```bash
while sleep 3; do
  clear
  echo "=== state.toml (newest run) ==="
  run=$(ls -dt ~/.config/kartaa-install/runs/*/ 2>/dev/null | head -1)
  awk '/^status|^last_completed_phase|^error|^name|^state/{print}' \
    "$run/state.toml" 2>/dev/null | tail -30
  echo
  echo "=== resource ==="
  # AWS:
  aws ec2 describe-instances --region <REGION> \
    --filters 'Name=tag:Name,Values=<HOSTNAME>' \
              'Name=instance-state-name,Values=pending,running' \
    --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name,IP:PublicIpAddress}' \
    --output table 2>/dev/null
  # Hetzner:
  # hcloud server list -o columns=id,name,status,ipv4
  # BYO:
  # ssh <USER>@<HOST> 'uptime; systemctl is-active kartaa.service codex-app-server.service'
done
```

If Phase 6 ("Waiting for Tailscale join") sits >3 min, fetch the
host's console output (AWS only) or SSH in and read
`/var/log/cloud-init-output.log`. Look for: cloud-init crashes,
`tailscale up` rejected (auth key invalid/exhausted), apt mirror flake.
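
The log triage above can be sketched as a filter over the boot log. The match strings below are assumptions about typical wording for the three failure families, not messages guaranteed by cloud-init, Tailscale, or apt. `scan_boot_log`, `$REGION`, `$INSTANCE_ID`, and `$HOST` are hypothetical names.

```bash
# Hypothetical helper: scan a boot log on stdin for the failure
# families named above and print matching lines with line numbers.
scan_boot_log() {
  grep -E -i -n \
    -e 'Traceback \(most recent call last\)' \
    -e 'invalid key' \
    -e 'Failed to fetch|Temporary failure resolving' \
    || echo "no known failure signature found"
}

# AWS: pull the latest console output without SSH.
#   aws ec2 get-console-output --region "$REGION" \
#     --instance-id "$INSTANCE_ID" --latest --output text | scan_boot_log
# Any host reachable over SSH:
#   ssh ubuntu@"$HOST" 'sudo cat /var/log/cloud-init-output.log' | scan_boot_log
```

A clean scan does not prove the boot succeeded; it only means none of the assumed signatures appeared, so read the tail of the log as well.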

If the TUI dies at any phase, dump the failure report:

```bash
run=$(ls -dt ~/.config/kartaa-install/runs/*/ 2>/dev/null | head -1)
jq -r '.logs[] | select(.stream=="stderr") | .line' \
  "$run/report.json" 2>/dev/null | tail -30
```

The rc36+ TUI emits operator-facing remediation hints in the die
message itself — read it carefully before suggesting next steps.

---

## Post-flight (after TUI exits 0)

```bash
# 1. Cockpit reachable via tailnet HTTPS
curl -sI https://<HOSTNAME>.<TAILNET-SUFFIX>.ts.net | head -3

# 2. Tailnet membership
tailscale status | grep <HOSTNAME>

# 3. Optional: verify systemd units on the host
ssh ubuntu@<HOSTNAME>.<TAILNET-SUFFIX>.ts.net \
  'systemctl is-active kartaa.service codex-app-server.service codex-tee.service gh-token-broker.service'
```
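
`systemctl is-active` prints one state per unit, in argument order, so the output of check 3 can be paired back with the unit names. This is a minimal sketch with a hypothetical `check_units` helper.

```bash
# Hypothetical helper: pair unit names (arguments) with the states that
# `systemctl is-active` printed (stdin, one per line, same order) and
# flag anything that is not "active".
check_units() {
  local unit state rc=0
  for unit in "$@"; do
    # Fewer lines than units means the state for this unit is missing.
    IFS= read -r state || [ -n "$state" ] || state="missing"
    if [ "$state" = "active" ]; then
      echo "[OK] $unit"
    else
      echo "[FAIL: $unit is $state]"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: pipe the `ssh … 'systemctl is-active …'` output from check 3 into `check_units kartaa.service codex-app-server.service codex-tee.service gh-token-broker.service`.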

If checks 1 and 2 pass and, when check 3 is run, all four units report
`active`, tell the operator:

> **Install complete.** Cockpit live at
> `https://<HOSTNAME>.<TAILNET-SUFFIX>.ts.net`.

---

## Hard rules for you (the agent)

- **Do not run `kartaa-install` yourself.** The operator must launch
  the TUI from their terminal because it's interactive (PlanInput,
  device-auth code display, Done-screen actions).
- **Do not modify** `~/.aws/config`, `~/.aws/credentials`, the
  tailnet ACL policy file, or any `~/.config/kartaa-install/...` state
  file unless explicitly asked.
- **Do not delete** EC2/Hetzner/BYO resources without operator
  confirmation, even visibly stale ones.
- **Do not advance to "READY TO LAUNCH"** with a single `[FAIL]`
  outstanding. Hold and surface the gap.
- **Do not invent values.** If a region/hostname/tag isn't
  specified, ask the operator.
