Commit Graph

2 Commits

Author SHA1 Message Date
e3mrah
0289f0388d
feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259)
Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml,
the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3.

The script parses every clusters/_template/bootstrap-kit/*.yaml, extracts
metadata.name + spec.dependsOn for the HelmRelease document(s), and
mechanically verifies the actual graph against the expected DAG declared
in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's
algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch
(W2.K1-K4) on success.
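
For readers unfamiliar with the cycle check, here is a minimal Python
sketch of Kahn's algorithm over a dependsOn map. It is an illustration
only, not the contents of scripts/check-bootstrap-deps.sh; the function
name and the input shape (release name -> list of dependency names) are
assumptions made for this example.

```python
from collections import deque


def unorderable_nodes(depends_on):
    """Return the nodes Kahn's algorithm cannot place in a topological order.

    depends_on maps a release name to the names it depends on (hypothetical
    input shape). An empty result means the graph is acyclic; a non-empty
    result holds every node that sits on a cycle or downstream of one.
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    # remaining[n] counts the dependencies of n that are not yet resolved.
    remaining = {n: len(set(depends_on.get(n, ()))) for n in nodes}

    # dependents[d] lists the nodes that are waiting on d.
    dependents = {n: [] for n in nodes}
    for name, deps in depends_on.items():
        for dep in set(deps):
            dependents[dep].append(name)

    # Start from nodes with no dependencies and peel the graph layer by layer.
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    ordered = []
    while ready:
        node = ready.popleft()
        ordered.append(node)
        for child in dependents[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    return nodes - set(ordered)
```

If every node can be ordered, the graph is a DAG; anything left over
indicates at least one cycle among the returned names or upstream of them.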

Behaviour against the in-flight expansion: HelmReleases (HRs) that are
declared in the expected DAG but are not yet on disk are reported as
"deferred" (informational, not an error), so that the expected DAG in
scripts/expected-bootstrap-deps.yaml can serve as the static
authoritative list while the W2.K1-K4 PRs land their HR files in series.
After all four W2 PRs merge, the "deferred" count drops to 0 and the
audit goes 100% green.

Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a
new dependency-graph-audit job that runs on every PR touching:
  - clusters/** (any HR file edit)
  - scripts/check-bootstrap-deps.sh
  - scripts/expected-bootstrap-deps.yaml
  - .github/workflows/test-bootstrap-kit.yaml
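
The "deferred" handling described earlier in this message can be
sketched as a set comparison between the expected and actual graphs.
This is a Python illustration under stated assumptions, not the script
itself: the function name, the result keys, and the treatment of
unexpected or mismatched releases as errors are all hypothetical; the
message only states that deferred entries are informational.

```python
def audit(expected, actual):
    """Compare two dependsOn maps (release name -> iterable of dependency names).

    Hypothetical result shape:
      deferred   - expected but not yet on disk (informational only)
      unexpected - on disk but absent from the expected DAG (assumed to be an error)
      mismatched - present in both, with differing dependsOn sets (assumed to be an error)
    """
    expected_names = set(expected)
    actual_names = set(actual)

    deferred = sorted(expected_names - actual_names)
    unexpected = sorted(actual_names - expected_names)
    mismatched = sorted(
        name
        for name in expected_names & actual_names
        if set(expected[name]) != set(actual[name])
    )

    return {
        "deferred": deferred,
        "unexpected": unexpected,
        "mismatched": mismatched,
        # Deferred entries never fail the audit in this sketch.
        "ok": not unexpected and not mismatched,
    }
```

With this shape, an audit run while some expected releases are still
missing reports them under "deferred" and still passes, which matches the
in-series landing of the W2.K1-K4 PRs.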

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:16:16 +04:00
hatiyildiz
9e3268f2c5
docs(ops): comprehensive operator runbook + remediation playbook + idempotent recovery script
Adds docs/RUNBOOK-OPERATIONS.md as the single operator-facing entry point for
provisioning, troubleshooting, and recovering Catalyst Sovereigns:

A. Pre-provision checklist — Hetzner project + token, Dynadot pool zones +
   credentials, GHCR pull token (cross-link SECRET-ROTATION.md), PowerDNS pool
   zones bootstrapped, PDM healthy, bp-* chart versions, subchart-guard CI green.
B. Step-by-step walkthrough with timing — Phase 0 OpenTofu (30-60s plan +
   60-120s apply), PDM /commit (~5s), cloud-init (3-5min), Phase 1
   bootstrap-kit (10-15min), cert-manager + Cilium Gateway (1-2min). Total
   15-25min for a solo Sovereign.
C. 18 known failure modes with SYMPTOM / ROOT CAUSE / DIAGNOSIS / RECOVERY,
   each pinned to the canonical fix commit (c6cbfe68, e571ec7a, 54872009,
   2022e1af, 34c8de84, dddbab4b, 43aff202, 418cead0, 64d7de97, 330211d2,
   41c7ac13) or marked fix-in-flight where applicable.
D. Idempotent recovery script (Hetzner purge with a verification sweep for
   resources that persist after DELETE returns 204, PDM allocation release,
   catalyst-api deployment-record cancel). Dry-run by default; --apply gates
   real deletes on a validated HETZNER_API_TOKEN.
E. Cross-links to INVIOLABLE-PRINCIPLES, SOVEREIGN-PROVISIONING,
   RUNBOOK-PROVISIONING, BLUEPRINT-AUTHORING, CHART-AUTHORING, SECRET-ROTATION,
   PLATFORM-POWERDNS, IMPLEMENTATION-STATUS — references, doesn't duplicate.
F. Mermaid phase timeline diagram at the top showing ownership boundaries
   (catalyst-provisioner -> cloud-init -> Sovereign cluster) and hand-off points.
G. Mermaid failure decision tree at the end — operators land at the right §C
   entry in 4-6 yes/no questions.

Recovery script gracefully degrades to a name-only preview when
HETZNER_API_TOKEN is unset in dry-run mode (apply mode still hard-fails on
missing/invalid token), so operators can review what WOULD happen before
exporting the token.
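
The mode selection described above can be sketched in Python as follows.
This is an illustration only, not the recovery script: the function name
and the mode labels are hypothetical, and the sketch checks only that the
token is present, whereas the real script is described as validating it.

```python
import argparse


def resolve_mode(argv, env):
    """Pick the run mode from CLI arguments and an environment mapping.

    Hypothetical mode labels:
      "apply"             - real deletes; requires HETZNER_API_TOKEN
      "dry-run"           - token present, nothing is deleted
      "name-only-preview" - no token, dry-run degrades to a name-only listing
    """
    parser = argparse.ArgumentParser(description="recovery sketch")
    parser.add_argument("--apply", action="store_true",
                        help="perform real deletes (default is dry-run)")
    args = parser.parse_args(argv)

    token = env.get("HETZNER_API_TOKEN", "")

    if args.apply:
        # Apply mode hard-fails without a token; token validation against
        # the Hetzner API is omitted from this sketch.
        if not token:
            raise SystemExit("--apply requires HETZNER_API_TOKEN")
        return "apply"

    # Dry-run never deletes; without a token it can only preview names.
    return "dry-run" if token else "name-only-preview"
```

Calling resolve_mode(sys.argv[1:], os.environ) at the start of a script
would keep the destructive path behind an explicit flag and a token.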

Verified dry-run output against the live omantel.omani.works Sovereign:
- Step 1 lists 8 Hetzner kinds + 8 verification-sweep targets to inspect
- Step 2 confirms PDM reports the subdomain currently RESERVED (live state)
- Step 3 correctly identifies catalyst-api deployment 6274daeb7a9873cd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:29 +02:00