openova/scripts/operator-recover-sovereign.sh
hatiyildiz 9e3268f2c5 docs(ops): comprehensive operator runbook + remediation playbook + idempotent recovery script
Adds docs/RUNBOOK-OPERATIONS.md as the single operator-facing entry point for
provisioning, troubleshooting, and recovering Catalyst Sovereigns:

A. Pre-provision checklist — Hetzner project + token, Dynadot pool zones +
   credentials, GHCR pull token (cross-link SECRET-ROTATION.md), PowerDNS pool
   zones bootstrapped, PDM healthy, bp-* chart versions, subchart-guard CI green.
B. Step-by-step walkthrough with timing — Phase 0 OpenTofu (30-60s plan +
   60-120s apply), PDM /commit (~5s), cloud-init (3-5min), Phase 1
   bootstrap-kit (10-15min), cert-manager + Cilium Gateway (1-2min). Total
   15-25min for a solo Sovereign.
C. 18 known failure modes with SYMPTOM / ROOT CAUSE / DIAGNOSIS / RECOVERY,
   each pinned to the canonical fix commit (c6cbfe68, e571ec7a, 54872009,
   2022e1af, 34c8de84, dddbab4b, 43aff202, 418cead0, 64d7de97, 330211d2,
   41c7ac13) or marked fix-in-flight where applicable.
D. Idempotent recovery script (Hetzner purge with DELETE-204-but-resource-
   persists verification sweep, PDM allocation release, catalyst-api
   deployment-record cancel). Dry-run by default; --apply gates real deletes
   on a validated HETZNER_API_TOKEN.
E. Cross-links to INVIOLABLE-PRINCIPLES, SOVEREIGN-PROVISIONING,
   RUNBOOK-PROVISIONING, BLUEPRINT-AUTHORING, CHART-AUTHORING, SECRET-ROTATION,
   PLATFORM-POWERDNS, IMPLEMENTATION-STATUS — references, doesn't duplicate.
F. Mermaid phase timeline diagram at the top showing ownership boundaries
   (catalyst-provisioner -> cloud-init -> Sovereign cluster) and hand-off points.
G. Mermaid failure decision tree at the end — operators land at the right §C
   entry in 4-6 yes/no questions.

Recovery script gracefully degrades to a name-only preview when
HETZNER_API_TOKEN is unset in dry-run mode (apply mode still hard-fails on
missing/invalid token), so operators can review what WOULD happen before
exporting the token.

Verified dry-run output against the live omantel.omani.works Sovereign:
- Step 1 lists 8 Hetzner kinds + 8 verification-sweep targets to inspect
- Step 2 confirms PDM reports the subdomain currently RESERVED (live state)
- Step 3 correctly identifies catalyst-api deployment 6274daeb7a9873cd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:29 +02:00

376 lines
15 KiB
Bash
Executable File

#!/usr/bin/env bash
# operator-recover-sovereign.sh — idempotent recovery of a partially-provisioned Sovereign.
#
# When a Sovereign provisioning run fails partway (Phase 0 OpenTofu, Phase 1
# bootstrap-kit, or anywhere in between), this script returns the system to a
# clean slate so the operator can re-run Launch in the wizard with the same
# FQDN. It does THREE things, in order:
#
# 1. Purge every Hetzner Cloud resource tagged for that Sovereign.
# 2. Release the PDM allocation for the Sovereign's pool subdomain (if any).
# 3. Mark the catalyst-api deployment record as `cancelled` so the wizard
# stops streaming events for it and the operator can re-create cleanly.
#
# Resource names in the OpenTofu module are deterministic:
# catalyst-${replace(sovereign_fqdn, ".", "-")}-{role}
# so re-running Launch with the same FQDN after this script is fully idempotent.
#
# DRY-RUN by default. Pass --apply to actually delete.
#
# Anchored to the canonical purge logic in:
# /home/openova/.claude/projects/-home-openova-repos-openova-private/memory/feedback_idempotent_iac_purge.md
# and the runbook at:
# docs/RUNBOOK-OPERATIONS.md §"Recovery procedure"
#
# Usage:
# ./scripts/operator-recover-sovereign.sh <sovereign-fqdn> # dry-run
# ./scripts/operator-recover-sovereign.sh <sovereign-fqdn> --apply # destructive
#
# Required tools: bash, curl, python3, kubectl
# Required env vars: HETZNER_API_TOKEN (read+write project token)
# Optional env vars: PDM_BASE_URL (default: derived from --pool-domain or omani.works)
# POOL_DOMAIN (default: derived from FQDN's parent zone)
# CATALYST_NAMESPACE (default: catalyst)
set -euo pipefail
# ── Argument parsing ──────────────────────────────────────────────────
FQDN="${1:-}"
MODE="dry-run"
shift || true
while [ "$#" -gt 0 ]; do
case "$1" in
--apply) MODE="apply" ;;
--dry-run) MODE="dry-run" ;;
*) echo "ERR: unknown flag: $1" >&2; exit 2 ;;
esac
shift
done
if [ -z "$FQDN" ]; then
echo "Usage: $0 <sovereign-fqdn> [--apply]" >&2
echo "Example: $0 omantel.omani.works --apply" >&2
exit 2
fi
# Slug used by the OpenTofu module to name resources, matching:
# catalyst-${replace(sovereign_fqdn, ".", "-")}-{role}
SLUG=$(echo "$FQDN" | tr . -)
LABEL_KEY="catalyst.openova.io/sovereign"
NS="${CATALYST_NAMESPACE:-catalyst}"
# Pool domain inference: the parent zone of the FQDN.
# omantel.omani.works -> omani.works
# acme.openova.io -> openova.io
# acme.bank.com -> bank.com (would only matter if PDM manages it)
POOL_DOMAIN="${POOL_DOMAIN:-$(echo "$FQDN" | cut -d. -f2-)}"
# ── Output helpers ────────────────────────────────────────────────────
bold() { printf "\033[1m%s\033[0m\n" "$*"; }
red() { printf "\033[31m%s\033[0m\n" "$*"; }
green() { printf "\033[32m%s\033[0m\n" "$*"; }
yellow() { printf "\033[33m%s\033[0m\n" "$*"; }
cyan() { printf "\033[36m%s\033[0m\n" "$*"; }
prefix() {
if [ "$MODE" = "dry-run" ]; then
yellow "[DRY-RUN] $*"
else
cyan "[APPLY] $*"
fi
}
# ── Banner ────────────────────────────────────────────────────────────
bold "==================================================================="
bold " OpenOva Catalyst — Operator Sovereign Recovery"
bold "==================================================================="
echo " Sovereign FQDN: $FQDN"
echo " Resource slug: catalyst-${SLUG}-*"
echo " Hetzner label: ${LABEL_KEY}=${FQDN}"
echo " Pool parent zone: ${POOL_DOMAIN}"
echo " Catalyst NS: ${NS}"
echo " Mode: $MODE"
bold "==================================================================="
echo
if [ "$MODE" = "dry-run" ]; then
yellow " Running in DRY-RUN mode. Nothing will be deleted."
yellow " Re-run with --apply to actually purge resources."
else
red " Running in APPLY mode. Resources WILL be deleted. CTRL-C now to abort."
sleep 3
fi
echo
# ── Pre-flight ────────────────────────────────────────────────────────
# In dry-run mode we tolerate a missing/invalid token — the operator is just
# previewing what would happen. In apply mode we hard-fail.
HAVE_HETZNER_TOKEN=0
if [ -n "${HETZNER_API_TOKEN:-}" ]; then
HTTP_CODE=$(curl -sS -o /dev/null -w '%{http_code}' \
-H "Authorization: Bearer ${HETZNER_API_TOKEN}" \
"https://api.hetzner.cloud/v1/servers?per_page=1" || true)
if [ "$HTTP_CODE" = "200" ]; then
HAVE_HETZNER_TOKEN=1
green " HETZNER_API_TOKEN validated — Hetzner inventory will be queried live."
else
if [ "$MODE" = "apply" ]; then
red "ERR: HETZNER_API_TOKEN rejected by Hetzner API (HTTP $HTTP_CODE). Aborting."
exit 3
else
yellow " HETZNER_API_TOKEN rejected by Hetzner (HTTP $HTTP_CODE) — Step 1 will be a name-only preview."
fi
fi
else
if [ "$MODE" = "apply" ]; then
red "ERR: HETZNER_API_TOKEN is not set. Export the read+write token for the Sovereign's project."
exit 3
else
yellow " HETZNER_API_TOKEN is not set — Step 1 will be a name-only preview (set the token to enumerate resources for real)."
fi
fi
echo
# ── Step 1 — Hetzner purge ────────────────────────────────────────────
bold "── Step 1 / 3 — Hetzner Cloud resource purge ─────────────────────"
H="Authorization: Bearer ${HETZNER_API_TOKEN:-NO_TOKEN}"
SEL="label_selector=${LABEL_KEY}=${FQDN}"
KINDS_LABELED="servers load_balancers networks firewalls volumes primary_ips floating_ips"
list_ids_labelled() {
local kind="$1"
curl -sS -H "$H" "https://api.hetzner.cloud/v1/${kind}?${SEL}" |
python3 -c "import json,sys; d=json.load(sys.stdin); [print(s['id'], s.get('name','')) for s in d.get('${kind}',[])]"
}
list_ids_by_name() {
# Hetzner ssh_keys (and a few other kinds) don't accept label selectors.
# Match by deterministic resource-name slug instead.
local kind="$1"
curl -sS -H "$H" "https://api.hetzner.cloud/v1/${kind}" |
python3 -c "
import json, sys
d = json.load(sys.stdin)
items = d.get('${kind}', [])
slug = '${SLUG}'
for s in items:
name = s.get('name', '')
if slug in name:
print(s['id'], name)
"
}
delete_resource() {
local kind="$1" id="$2" name="$3"
prefix "DELETE ${kind}/${id} (${name})"
if [ "$MODE" = "apply" ]; then
curl -sS -o /dev/null -X DELETE -H "$H" \
"https://api.hetzner.cloud/v1/${kind}/${id}" || red " WARN: delete returned non-zero for ${kind}/${id}"
fi
}
ANY_FOUND=0
if [ "$HAVE_HETZNER_TOKEN" = "1" ]; then
# Pass 1: label-selected resources, in dependency order (servers first so
# LB/network/firewall freeing isn't blocked).
for kind in $KINDS_LABELED ; do
while IFS= read -r line; do
[ -z "$line" ] && continue
ANY_FOUND=1
id=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | cut -d' ' -f2-)
delete_resource "$kind" "$id" "$name"
done < <(list_ids_labelled "$kind")
done
# Pass 2: ssh_keys (no labels — match by name slug).
while IFS= read -r line; do
[ -z "$line" ] && continue
ANY_FOUND=1
id=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | cut -d' ' -f2-)
delete_resource "ssh_keys" "$id" "$name"
done < <(list_ids_by_name "ssh_keys")
# Pass 3: verification sweep — Hetzner DELETE often returns 204 even when
# the resource persists (notably firewalls right after a server delete).
# Re-query without the label filter and re-delete by-id any that linger.
# Skipping this caused "name is already used (uniqueness_error)" on the
# next provision attempt. See feedback_idempotent_iac_purge.md.
echo
yellow "── Verification sweep (catches Hetzner DELETE-returns-204-but-keeps-resource quirk) ──"
for kind in firewalls networks load_balancers ssh_keys servers volumes primary_ips floating_ips ; do
while IFS= read -r line; do
[ -z "$line" ] && continue
id=$(echo "$line" | awk '{print $1}')
name=$(echo "$line" | cut -d' ' -f2-)
yellow " Lingering ${kind}/${id} (${name}) — re-deleting"
delete_resource "$kind" "$id" "$name"
done < <(list_ids_by_name "$kind")
done
if [ "$ANY_FOUND" = "0" ]; then
green " No Hetzner resources found for ${FQDN}. Already clean."
fi
else
yellow " (Token not available — listing the resource names that WOULD be inspected.)"
for kind in $KINDS_LABELED ssh_keys ; do
prefix "QUERY ${kind}?label_selector=${LABEL_KEY}=${FQDN} (would DELETE any matches: catalyst-${SLUG}-*)"
done
echo
yellow "── Verification sweep (would also re-list without label filter and re-delete catalyst-${SLUG}-* names) ──"
for kind in firewalls networks load_balancers ssh_keys servers volumes primary_ips floating_ips ; do
prefix "QUERY ${kind} (would re-DELETE any name matching slug ${SLUG})"
done
fi
echo
# ── Step 2 — PDM allocation release ───────────────────────────────────
bold "── Step 2 / 3 — Pool-domain-manager allocation release ───────────"
# Sub-label = first label of the FQDN (omantel.omani.works -> omantel).
SUB=$(echo "$FQDN" | cut -d. -f1)
# Locate PDM. Prefer in-cluster service; fall back to env override.
PDM_BASE_URL="${PDM_BASE_URL:-http://pool-domain-manager.openova-system.svc.cluster.local:8080}"
prefix "DELETE ${PDM_BASE_URL}/api/v1/pool/${POOL_DOMAIN}/release?sub=${SUB}"
if [ "$MODE" = "apply" ]; then
# Run the curl from inside the cluster so the in-cluster DNS resolves.
if kubectl -n openova-system get deploy pool-domain-manager >/dev/null 2>&1; then
kubectl -n openova-system exec deploy/pool-domain-manager -- \
sh -c "wget -q -O - --method=DELETE --header='Content-Type: application/json' \
'http://localhost:8080/api/v1/pool/${POOL_DOMAIN}/release?sub=${SUB}' || true" \
2>/dev/null || true
else
yellow " pool-domain-manager Deployment not found in openova-system; skipping PDM release."
yellow " If PDM lives elsewhere, set PDM_BASE_URL and re-run."
fi
else
# Dry-run: just check whether the allocation exists.
if kubectl -n openova-system get deploy pool-domain-manager >/dev/null 2>&1; then
OUT=$(kubectl -n openova-system exec deploy/pool-domain-manager -- \
sh -c "wget -q -O - 'http://localhost:8080/api/v1/pool/${POOL_DOMAIN}/check?sub=${SUB}'" \
2>/dev/null || echo '{}')
AVAIL=$(echo "$OUT" | python3 -c "import json,sys; d=json.load(sys.stdin) if sys.stdin else {}; print(d.get('available','unknown'))" 2>/dev/null || echo unknown)
case "$AVAIL" in
true) green " PDM reports ${SUB}.${POOL_DOMAIN} is already AVAILABLE — nothing to release." ;;
false) yellow " PDM reports ${SUB}.${POOL_DOMAIN} is currently RESERVED or COMMITTED — release will free it." ;;
*) yellow " PDM check returned: ${OUT}" ;;
esac
else
yellow " pool-domain-manager Deployment not found in openova-system (skipping check)."
fi
fi
echo
# ── Step 3 — catalyst-api deployment record ───────────────────────────
bold "── Step 3 / 3 — catalyst-api deployment record cancellation ──────"
if ! kubectl -n "$NS" get deploy catalyst-api >/dev/null 2>&1; then
yellow " catalyst-api Deployment not found in namespace ${NS}. Skipping."
else
# Find the record(s) for this FQDN. Records live at:
# /var/lib/catalyst/deployments/<id>.json
# and contain { "request": { "sovereignFQDN": "...", ... }, "status": "...", ... }
POD=$(kubectl -n "$NS" get pod -l app.kubernetes.io/name=catalyst-api \
-o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
if [ -z "$POD" ]; then
yellow " No catalyst-api Pod found. Skipping deployment-record cancellation."
else
# The catalyst-api container is scratch-based — no python3, no jq, only
# /bin/sh + the catalyst-api binary. We pull each record file out via
# `kubectl exec ... cat` and parse the JSON locally.
DEPLOY_FILES=$(kubectl -n "$NS" exec "$POD" -- sh -c \
"ls /var/lib/catalyst/deployments/*.json 2>/dev/null || true" 2>/dev/null || true)
MATCHED_IDS=""
for f in $DEPLOY_FILES; do
[ -z "$f" ] && continue
JSON=$(kubectl -n "$NS" exec "$POD" -- sh -c "cat '$f'" 2>/dev/null || true)
RECORD_FQDN=$(echo "$JSON" | python3 -c "
import json, sys
try:
d = json.load(sys.stdin)
print(d.get('request', {}).get('sovereignFQDN', ''))
except Exception:
print('')
" 2>/dev/null || echo "")
RECORD_STATUS=$(echo "$JSON" | python3 -c "
import json, sys
try:
d = json.load(sys.stdin)
print(d.get('status', ''))
except Exception:
print('')
" 2>/dev/null || echo "")
if [ "$RECORD_FQDN" = "$FQDN" ]; then
DID=$(basename "$f" .json)
MATCHED_IDS="${MATCHED_IDS}${DID} ${RECORD_STATUS}\n"
fi
done
if [ -z "$MATCHED_IDS" ]; then
green " No catalyst-api deployment records reference ${FQDN}. Nothing to cancel."
else
printf '%b' "$MATCHED_IDS" | while IFS= read -r line; do
[ -z "$line" ] && continue
DID=$(echo "$line" | awk '{print $1}')
DSTATUS=$(echo "$line" | awk '{print $2}')
prefix "Mark deployment ${DID} (current status: ${DSTATUS}) -> status=cancelled"
if [ "$MODE" = "apply" ]; then
# Pull, mutate locally on the host (where python3 exists), push back.
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
ORIG=$(kubectl -n "$NS" exec "$POD" -- sh -c "cat /var/lib/catalyst/deployments/${DID}.json" 2>/dev/null || true)
NEW=$(echo "$ORIG" | python3 -c "
import json, sys
d = json.load(sys.stdin)
d['status'] = 'cancelled'
d['finishedAt'] = '${NOW}'
d.setdefault('events', []).append({
'time': '${NOW}',
'phase': 'operator-recovery',
'level': 'warn',
'message': 'cancelled by scripts/operator-recover-sovereign.sh'
})
print(json.dumps(d, indent=2))
" 2>/dev/null || true)
if [ -n "$NEW" ]; then
# Pipe new JSON back into the Pod via stdin -> tee.
echo "$NEW" | kubectl -n "$NS" exec -i "$POD" -- \
sh -c "cat > /var/lib/catalyst/deployments/${DID}.json" \
|| red " WARN: could not rewrite ${DID}.json"
else
red " WARN: could not parse ${DID}.json — skipping rewrite"
fi
fi
done
fi
fi
fi
echo
# ── Done ──────────────────────────────────────────────────────────────
bold "==================================================================="
if [ "$MODE" = "dry-run" ]; then
yellow " DRY-RUN complete. No changes made."
yellow " Re-run with --apply to actually purge."
else
green " RECOVERY APPLIED. The operator may now re-run Launch in the wizard"
green " with sovereign-fqdn=${FQDN}. Re-runs are fully idempotent because"
green " every Hetzner resource is named deterministically off the FQDN."
fi
bold "==================================================================="
exit 0