openova/docs/RUNBOOK-PROVISIONING.md


Runbook — Provisioning a New Sovereign

Status: Operator-level procedure. Updated: 2026-04-29. Audience: Sovereign cloud team (e.g. omantel-cloud) onboarding their first Sovereign via Catalyst-Zero. Read this with SOVEREIGN-PROVISIONING.md (the architectural contract) and PROVISIONING-PLAN.md (the Catalyst-Zero waterfall).


What this runbook gets you

A new Sovereign — a self-sufficient deployed Catalyst — provisioned end-to-end on Hetzner from Catalyst-Zero (console.openova.io/sovereign). At the end:

  • A k3s cluster running on Hetzner Cloud servers in your chosen region
  • Cilium CNI + Gateway API as ingress, Flux as GitOps reconciler, Crossplane as day-2 IaC
  • The 12-component bootstrap kit installed and reconciling cleanly: cilium → cert-manager → flux → crossplane → sealed-secrets → spire → nats-jetstream → openbao → keycloak → gitea → powerdns → bp-catalyst-platform
  • Reachable URLs: console.<your-fqdn>, gitea.<your-fqdn>, admin.<your-fqdn> (TLS via cert-manager + Let's Encrypt)
  • Initial sovereign-admin user in Keycloak's catalyst-admin realm
  • The Sovereign is now self-sufficient — the catalyst-provisioner has zero ongoing connection to it (Phase 1 hand-off complete)

This runbook does NOT cover Day-1 setup (cert-manager issuers, backup destination, Org onboarding) — see SOVEREIGN-PROVISIONING.md §5 for that.


Before you start — what you need

Gather all of the following BEFORE opening the wizard. The wizard does not save partial input across sessions.

| Item | Where to get it | Validation |
| --- | --- | --- |
| Hetzner Cloud account + project | https://console.hetzner.cloud → Projects → New Project | Project ID visible in Cloud Console URL after selection |
| Hetzner Cloud API token | Inside the project: Security → API Tokens → New Token, scope Read & Write | Save it once — it is shown only at creation |
| Hetzner region | One of: fsn1 (Falkenstein), nbg1 (Nuremberg), hel1 (Helsinki), ash (Ashburn US East), hil (Hillsboro US West) | Wizard validates against this list |
| SSH public key | Your sovereign-admin break-glass keypair — generate with `ssh-keygen -t ed25519 -C "sovereign-admin@<your-org>" -f ~/.ssh/sovereign_admin` | The PUBLIC half (`*.pub`) is what the wizard takes |
| Sovereign domain | Three modes (post-#169): (a) Pool — pick a subdomain under omani.works / openova.io (the wizard reserves it via PDM /v1/reserve and creates the per-Sovereign PowerDNS zone on commit); (b) BYO with manual NS-flip (byo-manual) — bring your own registered domain; the wizard shows the OpenOva NS records you paste into your registrar UI; (c) BYO with API NS-flip (byo-api) — bring your own domain plus a registrar API token (Cloudflare / Namecheap / GoDaddy / OVH / Dynadot) and OpenOva flips NS for you. Captured at Step 6 (after sizing + creds + components) so the wizard can pair the domain with the deployed footprint | Wizard validates registrar tokens read-only (POST /api/v1/registrars/validate) before accepting |
| Organisation profile | Org name, industry, size, HQ, compliance frame; the sovereign-admin email is captured at Step 6 (Domain) so it pairs with the Sovereign's external surface | Email must be deliverable — Keycloak sends the password reset there |
| Topology choice | Single-region (SME default); 1-CP-1-worker minimal vs ha_enabled=true (3-CP HA) + worker_count ≥ 1; control-plane + worker SKU pickers driven by PROVIDER_NODE_SIZES[provider] (#176) | Wizard surfaces these as form fields |

Cost estimate for a default single-region run: 1× control-plane CPX21 (~€8/mo) + 1× worker CPX31 (~€16/mo) + 1× lb11 (~€6/mo) + ~€1 storage = ~€31/mo before workload growth. HA topology (3 CPs + 2 workers) is closer to ~€80/mo.
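The estimate is simple addition of the per-resource figures quoted above (rounded approximations, not live Hetzner pricing):

```shell
# Approximate monthly cost of the default single-region footprint,
# using the rounded per-resource figures from the estimate above.
cp=8        # 1x control-plane CPX21, ~EUR/mo
worker=16   # 1x worker CPX31
lb=6        # 1x lb11 load balancer
storage=1   # ~EUR 1 storage
echo "total=EUR $((cp + worker + lb + storage))/mo"
```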


Step-by-step

1. Open the provisioning wizard

https://console.openova.io/sovereign

Log in as a Catalyst-Zero user (your existing OpenOva-issued credentials) and click New Sovereign.

2. Walk the 7-step wizard

The wizard's Vite scaffold lives at products/catalyst/bootstrap/ui/. Each step writes its inputs into the wizard's local store; nothing is sent to the catalyst-api until Review + Provision. The 7-step indicator lives in the page header (per #174); per-step ordering is canonical from STEPS in src/pages/wizard/WizardPage.tsx. The canonical order — operator picks workload sizing, then provider, then credentials, then components, then names the Sovereign in DNS — is:

| Step | What it captures | Notes |
| --- | --- | --- |
| 1. Organisation | Org profile: name, industry, size, HQ, compliance frame | No email or domain capture here — the sovereign-admin email pairs with the Sovereign's external surface and is captured at Step 6 (Domain) |
| 2. Topology | Regions, building blocks (mgt + rtz/dmz), HA toggle, control-plane + worker SKU + worker count | Single-region is the supported path at first launch — multi-region remains design-only. Per #176 the SKU pickers are driven by PROVIDER_NODE_SIZES[provider] so the catalog stays per-provider correct (no Hetzner-only literals leaking into the AWS/Azure/OCI paths) |
| 3. Provider | Cloud per region (Hetzner today; AWS / GCP / Azure / OCI / Huawei per PLATFORM-TECH-STACK.md §9.1 are design-only) | |
| 4. Credentials | Provider API token + project ID (when applicable), SSH public key | Validated read-only via POST /api/v1/credentials/validate before advancing; the token is sent once over TLS, never logged, redacted from the SSE event stream |
| 5. Components | Single flat marketplace card grid (#162, #b0ec0c43) with family chips on each card and search + product-family chip filter at the top. Two tabs: Choose Your Stack (recommended + optional, default-on for recommended) and Always Included (the post-promotion mandatory closure, read-only) | Apps can be added post-provisioning too — only pre-select the must-haves. Per #175 dependency-aware cascades pull transitive deps automatically (e.g. picking Harbor pulls in cnpg + seaweedfs + valkey); per #d3346441 each card's family chip is clickable and routes to the family portfolio page, the card body routes to the product detail page, and only the explicit Select / Selected button toggles the wizard store |
| 6. Domain | Pool subdomain OR BYO (manual NS / registrar API), per #169's three-mode flow, plus the sovereign-admin email | Pool = PDM /v1/reserve. BYO byo-api = registrar token (Cloudflare/Namecheap/GoDaddy/OVH/Dynadot, #170). BYO byo-manual = wizard surfaces NS list to paste at customer registrar |
| 7. Review | Shows every captured value, plus the Provision button | Click → catalyst-api accepts the request and starts streaming |

3. Watch the SSE event stream

Once you click Provision, the wizard's progress page shows a live event log streamed from the catalyst-api /v1/sovereigns/{id}/events endpoint. Phases you will see:

tofu-init       Initialising OpenTofu working directory
tofu-plan       Planning Hetzner resources (network, firewall, server, LB, DNS)
tofu-apply      Applying — this provisions real Hetzner resources, please wait
tofu-output     Reading OpenTofu outputs (control_plane_ip, load_balancer_ip)
flux-bootstrap  Cloud-init has bootstrapped Flux + Crossplane in the new
                cluster — Flux will now reconcile clusters/<sovereign-fqdn>/
                from the public OpenOva monorepo, installing the 12-component
                bootstrap kit and bp-catalyst-platform umbrella in dependency
                order.

After flux-bootstrap, the wizard polls Flux Kustomizations on the new cluster (via the catalyst-api, which holds a temporary kubeconfig from the OpenTofu output) and shows a per-Kustomization readiness grid. Steady state takes 25–55 minutes from tofu-apply to bp-catalyst-platform: Ready=True.

If the SSE stream goes silent for >60s: the catalyst-api connection may have dropped (browser refresh recovers; events queue server-side). If it is silent for >5 minutes during tofu-apply, check the Hetzner Cloud Console for stuck server creation — most often this is API rate-limiting under your project; it resolves itself.
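The stream is plain SSE, so it can also be captured outside the wizard (e.g. with `curl -N` against the events endpoint) and the payloads extracted from the `data:` lines for an issue report. A minimal sketch; the event field shapes below are illustrative, not the API contract:

```shell
# Sketch: a saved SSE stream, then extraction of the event payloads.
cat > /tmp/sovereign-events.sse <<'EOF'
event: phase
data: tofu-init

event: phase
data: tofu-apply

event: phase
data: flux-bootstrap
EOF
# A live capture would look something like (URL/auth illustrative):
#   curl -N -H "Authorization: Bearer $TOKEN" \
#     https://console.openova.io/v1/sovereigns/$ID/events > /tmp/sovereign-events.sse
grep '^data: ' /tmp/sovereign-events.sse | sed 's/^data: //'
```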

4. First login

When the wizard shows Done — your Sovereign is ready, navigate to:

https://console.<sovereign-fqdn>

(For pool domains, this is e.g. console.omantel.omani.works. For BYO, you must first add a CNAME from *.<your-fqdn> to the load-balancer DNS name shown on the success screen.)

Sign in with the sovereign-admin email you provided at Step 6 (Domain). Keycloak's catalyst-admin realm sends a password-reset email; click the link, set a strong password (24+ chars per feedback_passwords.md), then complete the realm flow.

5. Day-1 setup checklist

Per SOVEREIGN-PROVISIONING.md §5:

  • Configure cert-manager Issuer (Let's Encrypt prod or your corporate CA)
  • Configure Velero backup destination (cloud object storage)
  • Configure Harbor image-scanning policies + retention
  • (Optional) Federate Keycloak to your corporate IdP (Azure AD / Okta / Google)
  • (Optional) Configure observability exports (datadog, SIEM)
  • Onboard your first Catalyst Organization
  • Create your first Environment in that Organization
  • Install your first Application from the marketplace

What can go wrong, and what to do

The catalyst-api retains OpenTofu state per Sovereign in /tmp/catalyst/tofu/<sovereign-fqdn>/. The CATALYST_TOFU_WORKDIR env var on the catalyst-api Deployment (commit 27527e4c; see products/catalyst/chart/templates/api-deployment.yaml and its comment block explaining why /var/lib/catalyst/... is unwritable for UID 65534) points the provisioner at the Pod's writable /tmp emptyDir (2 Gi sizeLimit), so each Sovereign run gets its own subdirectory. Re-running with the same Sovereign FQDN is idempotent (tofu apply on existing state), which means most failures are recoverable without manual cleanup of Hetzner resources.
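For reference when collecting diagnostics, the per-Sovereign workdir looks roughly like this (file names are the usual OpenTofu defaults and are illustrative; exact contents depend on the provisioner):

```
/tmp/catalyst/tofu/
└── <sovereign-fqdn>/
    ├── main.tf                  # generated provider + resource config
    ├── terraform.tfstate        # per-Sovereign state (OpenTofu keeps the tf name)
    └── tofu.auto.tfvars.json    # contains the Hetzner token; never attach to issues
```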

| Symptom | Most likely cause | What to do |
| --- | --- | --- |
| tofu plan fails with 403 Forbidden from hcloud | Hetzner token has only Read scope, or expired | Generate a new Read+Write token; re-run wizard with same FQDN |
| tofu apply fails with quota exceeded | Hetzner project default limits (typically 10 servers, 1 LB) | Open a Hetzner support ticket to raise limits; re-run when granted |
| tofu apply hangs at hcloud_server.control_plane[0]: Still creating... for >10 min | Hetzner regional capacity transient | Wait 15 min total; if still stuck, cancel + re-run with a different region |
| flux-bootstrap shows connection refused from kubectl | Cilium CNI not yet up (chicken-and-egg with API server readiness) | Wait — k3s + Cilium + Flux take ~5 min to converge before kubectl works through Flux |
| bp-cilium Kustomization stuck at Ready=Unknown for >10 min | Network configuration mismatch (most likely cloud-init didn't pass --flannel-backend=none correctly) | SSH into the control-plane node (the IP is visible in the Hetzner Cloud Console; SSH key is the one you provided) and run journalctl -u k3s -n 100; share the output with OpenOva support |
| bp-cert-manager reconciles but cert issuance fails | Let's Encrypt rate-limit (50 certs / week / domain) or DNS records not propagated | Check cert-manager events: kubectl -n cert-manager describe challenge; for rate-limit, wait. For DNS, dig the records: dig console.<your-fqdn> +short should return the LB IP |
| console.<sovereign-fqdn> returns 404 / connection-refused | Per-Sovereign PowerDNS zone records not yet visible to public resolvers (parent-zone NS-delegation TTL ~15 min for pool, customer-registrar TTL for BYO byo-manual / byo-api) | dig <sovereign-fqdn> NS should return OpenOva NS; dig console.<sovereign-fqdn> should return the LB IP. Allow up to 30 min for DNS propagation |
| Keycloak reset-password email never arrives | SMTP not configured in Keycloak realm yet | Reset via the catalyst-admin realm-admin flow inside the cluster: kubectl -n catalyst-system exec -it keycloak-0 -- /opt/keycloak/bin/kcadm.sh ... (the catalyst-admin path is documented in clusters/<sovereign-fqdn>/keycloak/README.md) |
| Bootstrap-kit Kustomization stuck Ready=False; PVCs (bp-spire, bp-keycloak postgres, bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres) all Pending indefinitely | StorageClass missing — k3s started without local-path-provisioner and the cluster has no default class for HelmReleases that don't pin storageClassName | See StorageClass missing below |
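For the DNS symptoms above, a small wait loop saves repeated manual digs. This is a hedged helper, not shipped tooling; dig is stubbed here so the sketch is self-contained, and you would delete the stub line to use it for real:

```shell
# Poll until console.<fqdn> resolves to the expected LB IP (values illustrative).
dig() { echo "203.0.113.10"; }   # stub standing in for real dig; remove to use live
FQDN=console.example.omani.works
LB_IP=203.0.113.10               # the LB IP from the wizard success screen
resolved=""
for attempt in 1 2 3 4 5; do
  got=$(dig +short "$FQDN" | head -n1)
  if [ "$got" = "$LB_IP" ]; then resolved="after attempt $attempt"; break; fi
  sleep 60                       # propagation can take up to ~30 min
done
echo "resolved ${resolved:-never}"
```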

Escalation: if the runbook doesn't unblock you, file an issue against github.com/openova-io/openova with the area/platform and kind/provisioning labels, including: Sovereign FQDN, region, last 50 SSE events, last 100 lines of kubectl -n flux-system get events, and the OpenTofu workdir contents (excluding tofu.auto.tfvars.json which contains the Hetzner token).

StorageClass missing

Symptom. A fresh Sovereign reaches flux-bootstrap and the bootstrap-kit Kustomization stays Ready=False for 10+ minutes. kubectl get pvc -A shows every PVC in Pending:

$ kubectl get pvc -A
NAMESPACE      NAME                         STATUS    VOLUME   CAPACITY   ...
keycloak       data-keycloak-postgresql-0   Pending   ...
spire-system   spire-data-spire-server-0    Pending   ...
openbao        data-openbao-0               Pending   ...

kubectl describe pvc <name> reports no persistent volumes available for this claim and no storage class is set. kubectl get sc returns No resources found.

Root cause. Pre-2026-04-29 the cloud-init template passed --disable=local-storage to the k3s installer, on the assumption that Crossplane would install hcloud-csi day-2 and register the StorageClass after bp-crossplane reconciled. That created a circular dependency: every PVC-using HelmRelease in the bootstrap-kit blocks waiting on a StorageClass that would only exist AFTER the bootstrap-kit had finished installing. Result: Sovereign deadlocks on first boot.

Resolution (current code). Cloud-init keeps k3s' built-in local-path-provisioner and marks local-path as the default StorageClass BEFORE applying the Flux bootstrap manifest — see infra/hetzner/cloudinit-control-plane.tftpl. PVCs without an explicit storageClassName bind immediately to node-local storage on the control-plane node. For a single-node Sovereign on a CPX21/CPX31, that is the correct shape: data lives on the node, capacity is bounded by the disk (200 GB+ on the supported SKUs), and there are no other nodes for volumes to migrate to anyway.

Recovery for a Sovereign already provisioned without the fix (e.g. omantel.omani.works on commit d311d439 or earlier). Apply local-path-provisioner directly to the running cluster, then mark the class default — no reprovision required:

KUBECONFIG=/path/to/sovereign-kubeconfig

# Install local-path-provisioner v0.0.30 (matches what k3s ships).
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.30/deploy/local-path-storage.yaml

# Wait for the controller pod.
kubectl -n local-path-storage wait --for=condition=Ready pod -l app=local-path-provisioner --timeout=60s

# Mark the local-path StorageClass default.
kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Verify; should print local-path with (default).
kubectl get sc

# Pending PVCs in the bootstrap-kit will bind on the next provisioner sync (~30s).
# Watch them flip to Bound:
kubectl get pvc -A -w

Migrating to multi-node / hcloud-csi. The local-path solution is correct for the solo-Sovereign target. Operators stepping up to multi-node (HA control-plane + workers carrying stateful workloads) migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate, deliberate operation — local-path PVs are node-pinned and won't migrate when a Pod reschedules across nodes. Track this on the Sovereign's roadmap; it is not part of the cloud-init bootstrap.


bp-flux double-install — version-pin invariant

Live incident: omantel.omani.works, 2026-04-29 — Flux controllers deleted by the FIRST reconcile of bp-flux. Cluster lost its GitOps engine in-place; the only recovery is a full reprovision.

What happened

  1. Cloud-init runs early in the bootstrap sequence and installs Flux core via:

    curl -fsSL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml \
      | kubectl apply -f -
    

    This is intentional — Flux must exist BEFORE the flux-system/GitRepository + Kustomization that pulls clusters/<sovereign-fqdn>/bootstrap-kit/ can be reconciled.

  2. Cloud-init then applies the GitRepository + Kustomization. Flux begins reconciling clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml, which is a HelmRelease for bp-flux.

  3. helm-controller runs helm install for bp-flux against the running cluster. The chart's umbrella declares dependencies: [{ name: flux2, version: <X> }] — the upstream community chart that ships its own copies of Flux's CRDs and controller Deployments.

  4. If the chart's subchart version ships a DIFFERENT upstream Flux release than cloud-init installed, Helm tries to update the existing Flux CRDs to a new schema. The apiserver rejects the update with:

    status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions
    

    because the version stored in the existing CRDs (from cloud-init's install) isn't in the new chart's spec.versions.

  5. Helm rolls back the failed install. The rollback deletes the existing Flux controller Deployments (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller).

  6. The cluster has no Flux. Every subsequent HelmRelease in the bootstrap kit halts. The cluster is unrecoverable in-place — the only fix is tofu destroy + reprovision.

The invariant

Cloud-init's flux2 v<X.Y.Z>/install.yaml URL pin and the bp-flux umbrella chart's flux2 subchart appVersion MUST be the same upstream Flux version. They cannot drift.

The fluxcd-community chart's appVersion field is the upstream Flux release tag the chart ships. Mapping:

| cloud-init URL | community chart (flux2 dep) | upstream appVersion |
| --- | --- | --- |
| v2.4.0 | 2.14.1 | 2.4.0 (current) |
| v2.3.0 | 2.13.0 | 2.3.0 |

Where the invariant is enforced

  • infra/hetzner/cloudinit-control-plane.tftpl — pins the install.yaml URL (currently v2.4.0).
  • platform/flux/chart/Chart.yaml — pins the subchart (currently flux2: 2.14.1).
  • platform/flux/chart/values.yaml — catalystBlueprint.upstream.version mirrors the dep pin (provenance metadata).
  • platform/flux/chart/tests/version-pin-replay.sh — CI gate; replays the catastrophic precondition and FAILS the build if the two pins ever drift.
  • clusters/_template/bootstrap-kit/03-flux.yaml and clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml — the HelmRelease declares install.disableTakeOwnership: false, upgrade.disableTakeOwnership: false, and upgrade.preserveValues: true so helm-controller adopts the cloud-init-installed Flux objects rather than re-creating them and rolling back on conflict.
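In spirit, the gate extracts both pins and compares them. The sketch below mimics that against stand-in files; the paths, file shapes, and grep patterns here are illustrative, and the real logic lives in platform/flux/chart/tests/version-pin-replay.sh:

```shell
# Hypothetical pin-drift check with stand-in files; the real gate is
# platform/flux/chart/tests/version-pin-replay.sh.
mkdir -p /tmp/pins
cat > /tmp/pins/cloudinit.tftpl <<'EOF'
runcmd:
  - curl -fsSL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml | kubectl apply -f -
EOF
cat > /tmp/pins/flux2-Chart.yaml <<'EOF'
name: flux2
version: 2.14.1
appVersion: 2.4.0
EOF
# Extract the version pinned in the cloud-init install.yaml URL...
cloudinit_ver=$(grep -o 'download/v[0-9.]*' /tmp/pins/cloudinit.tftpl | sed 's|download/v||')
# ...and the upstream Flux release the subchart ships (its appVersion).
chart_ver=$(awk '$1 == "appVersion:" { print $2 }' /tmp/pins/flux2-Chart.yaml)
if [ "$cloudinit_ver" = "$chart_ver" ]; then
  echo "pins match: $cloudinit_ver"
else
  echo "PIN DRIFT: cloud-init $cloudinit_ver vs chart appVersion $chart_ver" >&2
  exit 1
fi
```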

How to bump Flux version safely

When an upgrade to a newer Flux release is desired, the bump must land in one PR and touch all four pin sites at once:

  1. Pick the target upstream Flux version (e.g. v2.5.1).
  2. Find the matching community chart version from https://fluxcd-community.github.io/helm-charts/index.yaml — match on appVersion: 2.5.1.
  3. Update infra/hetzner/cloudinit-control-plane.tftpl install.yaml URL → v2.5.1.
  4. Update platform/flux/chart/Chart.yaml flux2 dep → the matching community chart version.
  5. Update platform/flux/chart/values.yaml catalystBlueprint.upstream.version to match.
  6. Bump platform/flux/chart/Chart.yaml version: (semver patch).
  7. Update clusters/_template/bootstrap-kit/03-flux.yaml and every clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml to the new bp-flux version.
  8. Run bash platform/flux/chart/tests/version-pin-replay.sh locally — must pass.
  9. PR; blueprint-release.yaml rebuilds bp-flux; subchart-guard CI must be green.

The version-pin-replay.sh test is the gate. CI rejects any PR that bumps one pin without the other.
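Step 2 (finding the community chart version whose appVersion matches the target upstream release) can be scripted against the index. The snippet below runs on an abbreviated stand-in of index.yaml so it stays self-contained:

```shell
# Look up the flux2 community chart version for a target upstream appVersion.
# index-snippet.yaml is an abbreviated stand-in for the real
# https://fluxcd-community.github.io/helm-charts/index.yaml.
cat > /tmp/index-snippet.yaml <<'EOF'
entries:
  flux2:
    - version: 2.14.1
      appVersion: 2.4.0
    - version: 2.13.0
      appVersion: 2.3.0
EOF
target=2.3.0
chart=$(awk -v t="$target" '
  $1 == "-" && $2 == "version:" { v = $3 }     # remember the chart version
  $1 == "appVersion:" && $2 == t { print v }   # emit it when appVersion matches
' /tmp/index-snippet.yaml)
echo "flux2 chart version for appVersion $target: $chart"
```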

Existing Sovereigns

Sovereigns provisioned before this fix (any cluster running bp-flux:1.1.1 or earlier with the flux2: 2.13.0 subchart against a v2.4.0 cloud-init install) are at risk on the next bp-flux reconcile and may already be broken. The recovery procedure is a full reprovision (tofu destroy → tofu apply with the corrected manifests). There is no in-place recovery for a cluster whose Flux controllers have been deleted by a Helm rollback.

The omantel.omani.works cluster used to live-verify the failure mode is currently in this state and is being held for reprovision against bp-flux:1.1.2.


Phase 1 watch shows 0 HelmReleases

Symptom. The wizard's progress page reaches flux-bootstrap successfully, then the Sovereign Admin banner shows the warning:

Phase 1 watch saw 0 HelmReleases in 15m0s; the bootstrap-kit Kustomization may not be reconciling. Operator: run flux get kustomization -n flux-system on the new cluster.

The deployment status flips to failed with Phase1Outcome=flux-not-reconciling and the error message names this runbook section.

What this means. Phase 0 (tofu apply + cloud-init) succeeded — the new k3s cluster is up and Flux is installed. But the Phase-1 catalyst-api watcher, which observes bp-* HelmReleases in flux-system via a read-only client-go informer, never saw a single HelmRelease appear within the first-seen window (CATALYST_PHASE1_FIRST_SEEN_TIMEOUT, default 15 minutes). That means Flux on the new Sovereign isn't materialising the bootstrap-kit Kustomization — typically because the Kustomization itself can't reach its Git source, can't decrypt a SOPS secret, or its dependencies haven't reconciled yet.

This is not a "wait it out" condition: the watcher continues running so a late HelmRelease still flows, but the cluster needs operator inspection before the install can complete.

Operator playbook. SSH into the control-plane node (the IP is in the Hetzner Cloud Console; the SSH key is the one you supplied at Step 4 of the wizard) and walk these in order:

  1. Confirm the catalyst-api Pod actually has the kubeconfig. This eliminates the "watcher misconfigured" branch before you go hunting on the new cluster.

    # On the catalyst-zero cluster (where catalyst-api runs):
    kubectl -n openova-system get deployment catalyst-api -o jsonpath='{.spec.template.spec.containers[0].env}' \
      | jq '.[] | select(.name=="CATALYST_PHASE1_FIRST_SEEN_TIMEOUT" or .name=="CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS" or .name=="CATALYST_PHASE1_WATCH_TIMEOUT")'
    

    The defaults (15m / 11 / 60m) are fine for a normal run — only override for diagnostic re-runs.

  2. Check the GitRepository on the new Sovereign. Flux's source-controller fetches the OpenOva monorepo; if it can't, every downstream Kustomization is starved.

    # On the new Sovereign (KUBECONFIG=<the kubeconfig captured at Phase 0>):
    kubectl get gitrepository -n flux-system -o wide
    kubectl describe gitrepository -n flux-system openova-public
    

    Look for Conditions[type=Ready].status=True and a recent lastAppliedRevision. Common failures: 401/403 (deploy-key missing or wrong scope), 404 (branch / path mismatch), connection refused (DNS / firewall egress).

  3. Check the bootstrap-kit Kustomization. This is what materialises the 11 bp-* HelmRelease objects.

    kubectl get kustomization -n flux-system
    kubectl describe kustomization -n flux-system <sovereign-fqdn>-bootstrap-kit
    

    If Ready=False, the Message field names the cause: missing CRD (HelmRelease), unrecognised apiVersion (Flux upgrade lockstep), path not found in the Git source, or dependsOn unresolved.

  4. Inspect source-controller and kustomize-controller logs. When the GitRepository looks healthy but no Kustomization fires, these are the next layers down.

    kubectl -n flux-system logs deploy/source-controller --tail=200
    kubectl -n flux-system logs deploy/kustomize-controller --tail=200
    

    A clean log shows a periodic reconcile loop with revision SHAs. A stuck log shows the same error repeating every reconcile interval — that error is the root cause.

  5. Re-run reconciliation manually once the cause is fixed:

    flux reconcile source git openova-public -n flux-system
    flux reconcile kustomization <sovereign-fqdn>-bootstrap-kit -n flux-system
    

    The catalyst-api watcher is still running on the wizard side (the flux-not-reconciling warn event does NOT terminate the watch loop — it just surfaces the banner). Once HelmReleases start appearing, normal per-component pills resume in the Sovereign Admin UI.

If the watcher has already terminated (overall CATALYST_PHASE1_WATCH_TIMEOUT of 60m elapsed): the watch goroutine has exited. Start a new wizard run — the Hetzner side is idempotent (tofu apply on existing state) so you keep the cluster, but the per-deployment HelmRelease watch is owned by the old deployment id. A fresh run is the cleanest path until the wizard surfaces a "rejoin watch" button.

Why this is a dedicated symptom. Earlier builds misread an empty informer cache as "all components done" and reported finalStatus: ready one second after flux-bootstrap. The current build refuses to consider termination until at least one bp-* HelmRelease has been observed AND the count meets CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS, so the only way to land here is a real Flux-side problem on the new cluster — not a timing race in the watcher. Trust the diagnostic and walk the playbook above.
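The termination rule reduces to a two-clause predicate. A sketch (MIN_HRS mirrors the CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS default of 11; the logic is illustrative, not the watcher's actual Go code):

```shell
# Phase-1 termination predicate: at least one bp-* HelmRelease has been
# observed, AND the observed count has reached the minimum bootstrap-kit size.
MIN_HRS=11   # mirrors CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS default
may_terminate() {
  observed=$1
  [ "$observed" -ge 1 ] && [ "$observed" -ge "$MIN_HRS" ]
}
for n in 0 5 11; do
  if may_terminate "$n"; then
    echo "observed=$n -> may terminate"
  else
    echo "observed=$n -> keep watching"
  fi
done
```

An empty informer cache (observed=0) can therefore never satisfy the predicate, which is exactly what closed the earlier false-ready bug.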


Re-runs and idempotency

tofu apply on an existing state is idempotent: rerunning the wizard with the same Sovereign FQDN updates only what changed (worker count up/down, k3s version upgrade, new firewall rules from a new cloud-init template). The cluster's running pods are untouched.

To intentionally re-run cloud-init on the control-plane (e.g. to apply a new Flux GitRepository config), the cleanest path is via Crossplane Compositions in clusters/<sovereign-fqdn>/, NOT by re-running cloud-init directly. Cloud-init runs once per server lifetime by default; replacing it requires either:

  1. A Crossplane-driven server replacement (preferred — drains the old node, brings up a new one, lets Flux reconcile fresh)
  2. SSH + manual cloud-init clean && cloud-init init (allowed only as break-glass)

Decommissioning

If you need to tear down a Sovereign you just provisioned (e.g. test run):

1. From Catalyst console: Admin → Sovereign → Decommission
   → Crossplane begins teardown of host clusters
   → OpenBao final state exported and stored encrypted (download link in admin UI)
   → DNS records removed
   → Cloud resources reclaimed
2. (For pool domains only) PDM releases the subdomain reservation and prunes the per-Sovereign PowerDNS zone; the parent-zone NS-delegation update at the registrar (Dynadot for pool) propagates within ~15 min TTL
3. (Manual cleanup) tofu destroy -auto-approve in the catalyst-api workdir for that Sovereign

This is the same flow as SOVEREIGN-PROVISIONING.md §10.2.



Part of OpenOva. Operator-facing companion to SOVEREIGN-PROVISIONING.md (the architectural contract) and PROVISIONING-PLAN.md (the Catalyst-Zero waterfall).