Sovereign Provisioning

Status: Authoritative procedure. Updated: 2026-04-29. Implementation: §3 below now reflects the deployed shape — the Go provisioner, OpenTofu module, 12 G2 wrapper Helm charts (the original 11 plus bp-powerdns at #167), the per-Sovereign PowerDNS zone model (#167/#168), and the pool-domain-manager (PDM) with registrar adapters (#163/#170) all exist in this monorepo today (per IMPLEMENTATION-STATUS.md §7). End-to-end DoD against a real Hetzner project is pending Group M of PROVISIONING-PLAN.md. Catalyst-Zero (Contabo k3s, namespace catalyst) is the running catalyst-provisioner today.

How to provision a new Sovereign — a self-sufficient deployed instance of Catalyst. Defer to GLOSSARY.md for terminology and ARCHITECTURE.md for the model.


1. Inputs

  • Cloud provider: Hetzner / AWS / GCP / Azure / OCI / Huawei. Hetzner is the most-tested path.
  • Cloud credentials: provider API token. Used by OpenTofu (one-shot bootstrap) and Crossplane (ongoing).
  • Sovereign name: e.g. omantel, bankdhofar. Slug, lowercase, 3–32 chars.
  • Sovereign domain: e.g. omantel.omani.works, acme.bank.com. Three modes (#169): pool (subdomain under omani.works / openova.io, allocated by pool-domain-manager); byo-manual (customer pastes OpenOva NS records into their own registrar UI); byo-api (customer pastes a registrar API token, OpenOva flips NS via the registrar adapter). Supported registrars for byo-api: Cloudflare, Namecheap, GoDaddy, OVH, Dynadot (#170).
  • Region(s): 1+. Single-region is simplest for SME; 2+ for regulated/HA.
  • Building blocks per region: typically mgt + rtz (+ dmz). At minimum mgt + rtz.
  • Keycloak topology: per-organization (SME) or shared-sovereign (corporate). Determines the Keycloak deployment shape.
  • Federation IdP (optional): Azure AD / Okta / Google / etc. For corporate; the SME tier defers to per-Org Org-IdP federation.
  • TLS strategy: Let's Encrypt / cert-manager / corporate CA. cert-manager-managed, Let's Encrypt by default.
  • Object storage: cloud-provider native. Used as the cold-tier backend behind SeaweedFS (which is the in-cluster S3 encapsulation layer that all consumers — Velero, Harbor, CNPG WAL, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg — talk to).
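The name and domain-mode constraints above can be sketched as a pre-flight check. This is a sketch under stated assumptions: the regex (must start with a letter, no leading/trailing hyphen) and the error strings are illustrative, not the wizard's actual validation code.

```go
package main

import (
	"fmt"
	"regexp"
)

// slugPattern encodes the Sovereign-name rule from §1: lowercase slug,
// 3-32 chars. The "starts with a letter, no edge hyphens" details are
// assumptions layered on top of the prose.
var slugPattern = regexp.MustCompile(`^[a-z][a-z0-9-]{1,30}[a-z0-9]$`)

// domainModes are the three domain modes from #169 listed in §1.
var domainModes = map[string]bool{
	"pool":       true,
	"byo-manual": true,
	"byo-api":    true,
}

// ValidateInputs runs the shape checks a wizard backend might perform
// before touching any cloud API.
func ValidateInputs(name, domainMode string) error {
	if !slugPattern.MatchString(name) {
		return fmt.Errorf("sovereign name %q: must be a lowercase slug, 3-32 chars", name)
	}
	if !domainModes[domainMode] {
		return fmt.Errorf("domain mode %q: want pool, byo-manual or byo-api", domainMode)
	}
	return nil
}

func main() {
	fmt.Println(ValidateInputs("omantel", "pool"))        // nil: both inputs pass
	fmt.Println(ValidateInputs("Bad_Name", "byo-manual")) // non-nil: not a lowercase slug
}
```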

2. Provisioning runs from catalyst-provisioner

The bootstrap is performed by catalyst-provisioner.openova.io, an always-on provisioning service operated by OpenOva. It is not part of any Sovereign at runtime — once a Sovereign is up, it is fully self-sufficient.

Why a permanent provisioner instead of "boot from your laptop":

  • OpenTofu state must be durably stored — keeping it on a single person's laptop is fragile and a security risk.
  • Provider credentials are scoped, stored in OpenBao on the provisioner, and never leave it.
  • New Sovereigns can be created without a manual installer dance — the same machinery serves the next Sovereign provisioning request, regardless of who initiates it.

A self-host route exists for organizations that want zero OpenOva involvement: catalyst-provisioner is itself a Blueprint (bp-catalyst-provisioner) and can be deployed in a customer's own infrastructure. From there it bootstraps further Sovereigns. This is the air-gap path.


3. Phase 0 — Bootstrap

The implementation maps cleanly onto two artifacts in this monorepo:

1. Wizard input → tofu vars (products/catalyst/bootstrap/api/internal/provisioner/). The Go service writes tofu.auto.tfvars.json from validated wizard input, runs tofu init && tofu plan && tofu apply -auto-approve against the canonical OpenTofu module, and streams stdout/stderr lines to the wizard via SSE. No cloud APIs are called from Go (per INVIOLABLE-PRINCIPLES.md #3).

2. Cloud resources (infra/hetzner/main.tf). OpenTofu provisions: hcloud_network (10.0.0.0/16) + subnet (10.0.1.0/24); hcloud_firewall (80/443/6443/ICMP open; 22 closed by default, the operator adds a source-CIDR rule via Crossplane post-bootstrap); hcloud_ssh_key from wizard input; 1 control-plane server (or 3 if ha_enabled) on Ubuntu 24.04 with cloud-init; worker_count worker servers; hcloud_load_balancer (lb11) targeting NodePorts 31080/31443. DNS is authoritative on PowerDNS (#167/#168): the per-Sovereign PowerDNS zone is created by pool-domain-manager (PDM) /v1/commit once the LB IP is known. For pool Sovereigns PDM also writes the parent-zone delegation; for byo-api Sovereigns the matching registrar adapter (Cloudflare / Namecheap / GoDaddy / OVH / Dynadot, #170) flips the NS records at the customer's registrar; byo-manual Sovereigns instead show the OpenOva NS list in the wizard and poll until the customer's own registrar propagates the delegation.

3. k3s + Flux bootstrap (infra/hetzner/cloudinit-control-plane.tftpl). cloud-init on the control-plane node installs k3s v1.31.4+k3s1 with --flannel-backend=none --disable-network-policy --disable=traefik --disable=servicelb --disable=local-storage --tls-san=<sovereign-fqdn>, then installs Flux v2.4.0 core, then applies the Flux GitRepository + Kustomization pointing at clusters/<sovereign-fqdn>/ in the public OpenOva monorepo. From this point Flux owns the cluster. Workers join via cloudinit-worker.tftpl using the project-derived k3s_token.

4. Bootstrap-kit install (clusters/<sovereign-fqdn>/, Flux-reconciled). Flux installs the 12 G2 wrapper Helm charts (each a bp-<name>:<semver> OCI artifact published by .github/workflows/blueprint-release.yaml) in dependency order: cilium → cert-manager → flux (host-level reconciler for the cluster's own Kustomizations) → crossplane → sealed-secrets (transient) → spire (server + agent) → nats-jetstream → openbao (3-node Raft) → keycloak (per topology choice) → gitea (with public Blueprint mirror) → bp-powerdns (per-Sovereign authoritative zone, #167) → bp-catalyst-platform (umbrella).

5. Crossplane adoption (Crossplane Compositions in clusters/<sovereign-fqdn>/). Crossplane adopts management of all infrastructure created by OpenTofu in step 2; sealed-secrets is decommissioned in favour of ESO + OpenBao for day-2 secret distribution; further DNS records (gitea/admin/api/harbor) are written by external-dns against the per-Sovereign PowerDNS zone via the PowerDNS REST API (NOT against the registrar). Phase 1 begins (see §4).

The wizard's progress page polls the Flux Kustomizations on the new cluster and reports steady state to the user once every Kustomization is Ready=True.
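That readiness check can be sketched as follows, assuming the status conditions have already been fetched from the cluster. The Ready=True convention follows Flux; the types and helper are illustrative:

```go
package main

import "fmt"

// Condition is the minimal shape of a Kustomization status condition
// as reported by the Kubernetes API.
type Condition struct {
	Type   string
	Status string
}

// AllReady reports steady state the way the progress page does: every
// Kustomization must carry a Ready condition with Status=True. A
// Kustomization with no conditions yet counts as not ready.
func AllReady(kustomizations map[string][]Condition) bool {
	for _, conds := range kustomizations {
		ready := false
		for _, c := range conds {
			if c.Type == "Ready" && c.Status == "True" {
				ready = true
			}
		}
		if !ready {
			return false
		}
	}
	return true
}

func main() {
	k := map[string][]Condition{
		"cilium":       {{Type: "Ready", Status: "True"}},
		"cert-manager": {{Type: "Ready", Status: "False"}},
	}
	fmt.Println(AllReady(k)) // false while cert-manager still reconciles
}
```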

DNS records written in Phase 0 — into the per-Sovereign PowerDNS zone (<sovereign-fqdn>.), see PLATFORM-POWERDNS.md §"Per-Sovereign zone model":

@                A → load balancer IP
*                A → load balancer IP
console          A → load balancer IP
api              A → load balancer IP
gitea            A → load balancer IP
harbor           A → load balancer IP

The PDM /v1/commit endpoint writes the canonical 6-record set into the freshly-created Sovereign zone via the PowerDNS REST API. The wildcard A record covers every additional subdomain a Sovereign might add at runtime (axon, umami, langfuse, etc.) without re-issuing certificates. Per NAMING §5.1 the canonical control-plane DNS pattern is {component}.{location-code}.{sovereign-domain} — the wildcard handles per-Application records under per-Environment subdomains.
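The canonical record set is small enough to sketch. The ARecord shape below is illustrative, not the PowerDNS REST payload PDM actually sends:

```go
package main

import "fmt"

// ARecord is one entry in the canonical Phase-0 set.
type ARecord struct {
	Name    string // relative to the Sovereign zone apex
	Content string // the load-balancer IP
}

// CanonicalRecords builds the 6-record set that PDM /v1/commit writes
// into the fresh Sovereign zone once the LB IP is known.
func CanonicalRecords(lbIP string) []ARecord {
	names := []string{"@", "*", "console", "api", "gitea", "harbor"}
	recs := make([]ARecord, 0, len(names))
	for _, n := range names {
		recs = append(recs, ARecord{Name: n, Content: lbIP})
	}
	return recs
}

func main() {
	// 203.0.113.10 is a documentation-range placeholder IP.
	for _, r := range CanonicalRecords("203.0.113.10") {
		fmt.Printf("%-8s A -> %s\n", r.Name, r.Content)
	}
}
```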

OpenTofu state: kept in the catalyst-api Pod under /tmp/catalyst/tofu/<sovereign-fqdn>/ — pinned via the CATALYST_TOFU_WORKDIR env var on the catalyst-api Deployment (commit 27527e4c) and backed by the Pod's writable /tmp emptyDir (2 Gi sizeLimit; the in-code default /var/lib/catalyst/... is unwritable for UID 65534, hence the override). Re-running with the same FQDN is idempotent (tofu apply on existing state). For air-gap installs the operator MUST configure a remote backend with encryption-at-rest so the Hetzner token isn't carried only on Pod ephemeral storage.

Implementation status: the Go wrapper, OpenTofu module, and 12 G2 wrapper charts (the original 11 + bp-powerdns added at #167) all exist today (verified at IMPLEMENTATION-STATUS.md §7). The pool-domain-manager (core/pool-domain-manager/) and its 5 registrar adapters are deployed and running in openova-system. End-to-end DoD against a real Hetzner project is pending Group M of the Catalyst-Zero Provisioning Plan.

Total Phase 0 time: 30–60 minutes for a single-region Hetzner Sovereign once DoD lands.


4. Phase 1 — Hand-off

After Phase 0 completes:

  1. Crossplane in the new Sovereign adopts management of all infrastructure created by OpenTofu. From this point forward, all infrastructure changes go through Crossplane.
  2. The bootstrap k3s nodes are not "thrown away" — they are claimed by Crossplane via the cloud provider's adoption mechanism.
  3. OpenTofu state is archived and read-only. It is never touched again.
  4. catalyst-provisioner no longer has any active connection to the new Sovereign.

The Sovereign is now self-sufficient. It has the full Catalyst control-plane set per PLATFORM-TECH-STACK.md §2.3:

  • Its own Crossplane managing further infrastructure.
  • Its own OpenBao for secrets.
  • Its own JetStream as event spine.
  • Its own Keycloak for users.
  • Its own SPIFFE/SPIRE for workload identity (5-min rotating SVIDs).
  • Its own Gitea (with mirror of the public Blueprint catalog).
  • Its own observability stack (Grafana + Alloy + Loki + Mimir + Tempo) for self-monitoring.
  • Its own Catalyst control plane (console, marketplace, admin, projector, catalog-svc, provisioning, environment-controller, blueprint-controller, billing).

5. Phase 2 — Day-1 setup

The first sovereign-admin logs into console.<location-code>.<sovereign-domain>:

Day-1 actions
──────────────────────────────────────────────────────────────────
1. Configure cert-manager issuers (Let's Encrypt / corporate CA).
2. Configure backup destination (cloud object storage for Velero).
3. Configure Harbor with image-scanning policies.
4. (Optional) Federate Keycloak's catalyst-admin realm to corporate IdP.
5. (Optional) Configure observability exports (SIEM, Datadog, etc.).
6. Onboard the first Organization:
     Catalyst console → Admin → Organizations → New
     Provide: name, contact, plan.
   Environment-controller does NOT create vclusters yet.
   They are created when the first Environment is provisioned.
7. Create the first Environment in that Organization:
     Console → switch to Org context → Environments → New
     Environment-controller spins up a vcluster on the chosen host cluster
     and bootstraps Flux inside (watching the env-appropriate branch on
     every Application repo within this Org's Gitea Org). Apps not yet
     installed have no repos yet; repos are created on demand by the
     provisioning-service when each App is installed.
     Ready in ~60 seconds.

6. Phase 3 — Steady-state operation

From here on, the Sovereign runs autonomously. Sovereign-admins use the Catalyst admin UI for:

  • Onboarding more Organizations
  • Adding host clusters in new regions (Crossplane provisions them, environment-controller adopts them)
  • Updating Catalyst itself (umbrella Blueprint version bumps, applied via Flux PR)
  • Configuring SecretPolicies and EnvironmentPolicies
  • Monitoring the Sovereign's own observability stack
  • Reviewing audit logs

Everyday Application installs and configurations are done by org-admins and org-developers within their Organizations — see PERSONAS-AND-JOURNEYS.md.


7. Multi-region topology

7.1 Single-region (SME default)

Region A
└── Host cluster: hz-fsn-mgt-prod    ← Catalyst control plane + per-Org vclusters
    └── all building blocks collapse onto one cluster (mgt + rtz + dmz workloads
        in separate namespaces, with Cilium NetworkPolicies enforcing isolation)

Cheapest topology. Single-region failure = Sovereign down. Acceptable for SME tier where customers also accept SME-tier SLAs.

7.2 Multi-region (corporate default)

Region A (primary mgt)              Region B                       Region C (DR)
─────────────────                  ─────────────                  ─────────────
hz-nbg-mgt-prod                    hz-fsn-rtz-prod                hz-hel-rtz-prod
  Catalyst control plane             per-Org vclusters              per-Org vclusters
  Gitea, JetStream, OpenBao,         (sibling realizations          (sibling realizations
  Keycloak, projector,               of each Org's Environment)     of each Org's Environment)
  catalog-svc, marketplace,
  console, admin, billing
hz-nbg-dmz-prod                    hz-fsn-dmz-prod                hz-hel-dmz-prod
  ingress, WAF, PowerDNS            ingress, WAF, PowerDNS          ingress, WAF, PowerDNS

The mgt building block is typically NOT replicated (one Catalyst control plane per Sovereign). The rtz and dmz blocks ARE replicated for workload HA.

OpenBao runs in BOTH the mgt cluster (primary) and each rtz region (replica) — see SECURITY.md §5 for replication semantics.


8. Adding a region post-provisioning

sovereign-admin in Catalyst admin UI:
  Admin → Infrastructure → Add Region
    Provider: Hetzner
    Region: hel
    Building blocks: rtz, dmz
    Apply

Catalyst:

  1. Crossplane provisions the new VPC, hosts, k3s cluster, etc.
  2. Cluster registered in Catalyst's cluster registry.
  3. cert-manager + Cilium + Flux + Crossplane + SPIRE + ESO + OpenBao replica deployed via the cluster's Flux Kustomization.
  4. New region available as a Placement target for new and existing Environments.

Existing Applications with placement.mode: single-region do not migrate automatically. To extend an existing Application to the new region, the user explicitly switches Placement to active-active (or active-hotstandby) and adds the new region to placement.regions — that's a one-line edit in the Application's Gitea repo on the appropriate branch (or a click in the Topology tab).
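The placement rules implied above can be sketched as a validation helper. The field and mode names follow the prose (single-region, active-active, active-hotstandby); the Validate logic itself is an assumption, not the real schema check:

```go
package main

import "fmt"

// Placement mirrors the placement fields referenced in §8.
type Placement struct {
	Mode    string
	Regions []string
}

// Validate applies the obvious consistency rules: single-region means
// exactly one region; the replicated modes need at least two.
func (p Placement) Validate() error {
	switch p.Mode {
	case "single-region":
		if len(p.Regions) != 1 {
			return fmt.Errorf("single-region wants exactly 1 region, got %d", len(p.Regions))
		}
	case "active-active", "active-hotstandby":
		if len(p.Regions) < 2 {
			return fmt.Errorf("%s wants at least 2 regions, got %d", p.Mode, len(p.Regions))
		}
	default:
		return fmt.Errorf("unknown placement mode %q", p.Mode)
	}
	return nil
}

func main() {
	// Extending to a new region: switch mode and append the region.
	p := Placement{Mode: "active-active", Regions: []string{"fsn", "hel"}}
	fmt.Println(p.Validate()) // nil: two regions satisfy active-active
}
```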


9. Air-gap deployment

Connected zone (one-time)             Air-gapped Sovereign
──────────────────────────            ───────────────────────────────
1. Mirror public Blueprint OCI       Harbor receives blobs via physical
   artifacts to portable media.      transfer / data diode.
2. Mirror Catalyst control-plane     Sovereign's Gitea adopts blobs as
   container images.                 OCI manifests in local registry.
3. Mirror cert-manager root +        cert-manager configured with
   organization CA bundle.           internal CA only.
4. Configure Keycloak to local LDAP  Keycloak federates to internal AD/LDAP.
   (no external IdPs).

Catalyst is air-gap-ready by construction: every artifact (Blueprints, Catalyst code, base images) is OCI-signed. Mirror once, run forever.


10. Migration and decommission

10.1 Migrating an Organization between Sovereigns

Rare but supported. Example: a Bank Dhofar Organization started life on the openova Sovereign (paid SaaS), now wants to move to its own bankdhofar Sovereign (self-host).

1. Provision bankdhofar Sovereign (Phases 0–2).
2. On openova Sovereign: Admin → Organization → Export
     Catalyst produces an export bundle:
       - Org metadata
       - All Application Gitea repos under this Org (cloned + bundled, including all branches)
       - The Org's `shared-blueprints` repo
       - Keycloak realm export (users, federated identities)
       - OpenBao export (sealed secrets only)
3. On bankdhofar Sovereign: Admin → Organization → Import
     Environment-controller recreates Environments → vclusters.
     Flux pulls manifests, reconciles.
     Apps come up.
4. Final cutover: DNS swap.
5. Verify, then decommission on openova side.

Time depends on data volume; typically minutes to hours per Org.
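A hypothetical shape for the export bundle, naming the five artifacts listed in step 2 above. The struct and the completeness check are illustrative; the real bundle format is not specified in this document:

```go
package main

import "fmt"

// ExportBundle gathers the artifacts the Export action produces.
type ExportBundle struct {
	OrgMetadata      map[string]string
	AppRepos         []string // cloned Application Gitea repos, all branches
	SharedBlueprints string   // the Org's shared-blueprints repo
	KeycloakRealm    []byte   // realm export: users, federated identities
	OpenBaoSecrets   []byte   // sealed secrets only
}

// Complete checks the bundle before import on the target Sovereign.
// AppRepos may legitimately be empty (an Org with no Apps installed).
func (b ExportBundle) Complete() error {
	switch {
	case len(b.OrgMetadata) == 0:
		return fmt.Errorf("missing org metadata")
	case b.SharedBlueprints == "":
		return fmt.Errorf("missing shared-blueprints repo")
	case len(b.KeycloakRealm) == 0:
		return fmt.Errorf("missing keycloak realm export")
	case len(b.OpenBaoSecrets) == 0:
		return fmt.Errorf("missing openbao export")
	}
	return nil
}

func main() {
	b := ExportBundle{
		OrgMetadata:      map[string]string{"name": "bankdhofar"},
		SharedBlueprints: "shared-blueprints.bundle",
		KeycloakRealm:    []byte("{...}"),
		OpenBaoSecrets:   []byte("sealed"),
	}
	fmt.Println(b.Complete()) // nil: every required part is present
}
```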

10.2 Decommissioning a Sovereign

Reverse of provisioning:

1. Migrate all Organizations off (Section 10.1).
2. Catalyst admin → Sovereign → Decommission
3. Crossplane begins teardown of host clusters.
4. OpenBao final state exported and stored encrypted.
5. DNS records removed.
6. Cloud resources reclaimed.

The customer keeps the OpenBao export and Gitea bundles for whatever retention period their compliance demands.


Cross-reference ARCHITECTURE.md and SECURITY.md. For day-to-day operation see SRE.md.