- Wizard step canonical order updated to Org → Topology → Provider → Credentials → Components → Domain → Review (RUNBOOK-PROVISIONING, DEMO-RUNBOOK, IMPLEMENTATION-STATUS); SKU pickers cross-ref the PROVIDER_NODE_SIZES per-provider catalog (#176). - StepComponents UX rewritten: single flat marketplace card grid with family chips + product/family routes, two tabs (Choose Your Stack + Always Included) — replaces the prior "two-tab Mandatory infra/Apps" + "grouped by product header" prose (PRODUCT-FAMILIES, RUNBOOK- PROVISIONING, DEMO-RUNBOOK, COMPONENT-LOGOS). - CORTEX familyDependencies = [] reflected in PRODUCT-FAMILIES; the Specter / BGE cascade narratives rewritten to component-level-only resolution (langfuse → cnpg, librechat → ferretdb → cnpg) — fixes the "selecting Spector pulls entire FABRIC" over-broad claim. - catalyst-api OpenTofu workdir realigned from /var/lib/catalyst/... to /tmp/catalyst/tofu/<fqdn>/ via CATALYST_TOFU_WORKDIR env var (commit27527e4c) — fixes runtime drift in RUNBOOK-PROVISIONING, SOVEREIGN-PROVISIONING, DEMO-RUNBOOK; DEMO-RUNBOOK kubectl exec ns corrected from catalyst-system to catalyst. - Logo asset story rewritten: 58 logos (44 SVG + 14 PNG) sourced from CNCF artwork + project repos at #169b1d1c/#30ff318d, replacing the prior 62 stylised in-house marks; CI smoke-test (#6a7d2dd8) cross-referenced. - 12 G2 bootstrap-kit charts (original 11 + bp-powerdns #167) aligned in PROVISIONING-PLAN Group F + blueprint-release.yaml comment + SOVEREIGN-PROVISIONING header; previously stale at 11. - README repo-structure note updated: 12-component bootstrap kit + axon + external-dns leaf chart are built; 45 platform / 4 product folders remain README-only (was: "every folder except axon"). - ORCHESTRATOR-STATE main-tip SHA advanced fromdd578d1c→6afdb303with one-line summary of the post-Pass-1 batch. - VALIDATION-LOG: Reconcile Pass 2 entry appended (drift fixed across 10 files; six-category rubric). Reconcile Pass 2 against main @6afdb303— 10 files patched plus VALIDATION-LOG entry. Doc patches are landing first so the in-flight wizard step-reorder branch will merge into a doc set that already names the canonical order, avoiding a second drift round. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
258 lines
17 KiB
Markdown
258 lines
17 KiB
Markdown
# Sovereign Provisioning
|
||
|
||
**Status:** Authoritative procedure. **Updated:** 2026-04-29.
|
||
**Implementation:** §3 below now reflects the deployed shape — the Go provisioner, OpenTofu module, 12 G2 wrapper Helm charts (the original 11 plus bp-powerdns at #167), the per-Sovereign PowerDNS zone model (#167/#168), and the pool-domain-manager (PDM) with registrar adapters (#163/#170) all exist in this monorepo today (per [`IMPLEMENTATION-STATUS.md`](IMPLEMENTATION-STATUS.md) §7). End-to-end DoD against a real Hetzner project is pending Group M of [`PROVISIONING-PLAN.md`](PROVISIONING-PLAN.md). Catalyst-Zero (Contabo k3s, namespace `catalyst`) is the running catalyst-provisioner today.
|
||
|
||
How to provision a new **Sovereign** — a self-sufficient deployed instance of Catalyst. Defer to [`GLOSSARY.md`](GLOSSARY.md) for terminology and [`ARCHITECTURE.md`](ARCHITECTURE.md) for the model.
|
||
|
||
---
|
||
|
||
## 1. Inputs
|
||
|
||
| Input | Required | Notes |
|
||
|---|---|---|
|
||
| Cloud provider | Hetzner / AWS / GCP / Azure / OCI / Huawei | Hetzner is the most-tested path. |
|
||
| Cloud credentials | Provider API token | Used by OpenTofu (one-shot bootstrap) and Crossplane (ongoing). |
|
||
| Sovereign name | e.g. `omantel`, `bankdhofar` | Slug, lowercase, 3–32 chars. |
|
||
| Sovereign domain | e.g. `omantel.omani.works`, `acme.bank.com` | Three modes (#169): **pool** (subdomain under `omani.works` / `openova.io`, allocated by pool-domain-manager); **byo-manual** (customer pastes OpenOva NS records into their own registrar UI); **byo-api** (customer pastes a registrar API token, OpenOva flips NS via the registrar adapter). Supported registrars for byo-api: Cloudflare, Namecheap, GoDaddy, OVH, Dynadot (#170). |
|
||
| Region(s) | 1+ | Single-region simplest for SME; 2+ for regulated/HA. |
|
||
| Building blocks per region | typically `mgt` + `rtz` (+ `dmz`) | At minimum `mgt` + `rtz`. |
|
||
| Keycloak topology | `per-organization` (SME) / `shared-sovereign` (corporate) | Determines Keycloak deployment shape. |
|
||
| Federation IdP (optional) | Azure AD / Okta / Google / etc. | For corporate; SME tier defers to per-Org Org-IdP federation. |
|
||
| TLS strategy | Let's Encrypt / cert-manager / corporate CA | cert-manager-managed, Let's Encrypt by default. |
|
||
| Object storage | Cloud-provider native | Used as the cold-tier backend behind SeaweedFS (which is the in-cluster S3 encapsulation layer that all consumers — Velero, Harbor, CNPG WAL, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg — talk to). |
|
||
|
||
---
|
||
|
||
## 2. Provisioning runs from `catalyst-provisioner`
|
||
|
||
The bootstrap is performed by `catalyst-provisioner.openova.io`, an always-on provisioning service operated by OpenOva. It is **not** part of any Sovereign at runtime — once a Sovereign is up, it is fully self-sufficient.
|
||
|
||
Why a permanent provisioner instead of "boot from your laptop":
|
||
- OpenTofu state must be durably stored — keeping it on a single person's laptop is fragile and a security risk.
|
||
- Provider credentials are scoped, stored in OpenBao on the provisioner, and never leave it.
|
||
- New Sovereigns can be created without a manual installer dance — the same machinery serves the next Sovereign provisioning request, regardless of who initiates it.
|
||
|
||
A self-host route exists for organizations that want zero OpenOva involvement: `catalyst-provisioner` is itself a Blueprint (`bp-catalyst-provisioner`) and can be deployed in a customer's own infrastructure. From there it bootstraps further Sovereigns. This is the air-gap path.
|
||
|
||
---
|
||
|
||
## 3. Phase 0 — Bootstrap
|
||
|
||
The implementation maps cleanly onto two artifacts in this monorepo:
|
||
|
||
| Step | Lives in | What runs |
|
||
|---|---|---|
|
||
| 1. Wizard input → tofu vars | [`products/catalyst/bootstrap/api/internal/provisioner/`](../products/catalyst/bootstrap/api/internal/provisioner/) | Go service writes `tofu.auto.tfvars.json` from validated wizard input, runs `tofu init && tofu plan && tofu apply -auto-approve` against the canonical OpenTofu module, streams stdout/stderr lines to the wizard via SSE. No cloud APIs called from Go (per [`INVIOLABLE-PRINCIPLES.md`](INVIOLABLE-PRINCIPLES.md) #3). |
|
||
| 2. Cloud resources | [`infra/hetzner/main.tf`](../infra/hetzner/main.tf) | OpenTofu provisions: hcloud_network (10.0.0.0/16) + subnet (10.0.1.0/24), hcloud_firewall (80/443/6443/ICMP open; 22 closed by default — operator adds source-CIDR rule via Crossplane post-bootstrap), hcloud_ssh_key from wizard input, 1 control-plane server (or 3 if `ha_enabled`) on Ubuntu 24.04 with cloud-init, `worker_count` worker servers, hcloud_load_balancer (lb11) targeting NodePorts 31080/31443. **DNS is authoritative on PowerDNS (#167/#168)** — the per-Sovereign PowerDNS zone is created by pool-domain-manager (PDM) `/v1/commit` once the LB IP is known; for pool sovereigns PDM also writes the parent-zone delegation, and for `byo-api` Sovereigns the matching registrar adapter (Cloudflare / Namecheap / GoDaddy / OVH / Dynadot, #170) flips the NS records at the customer's registrar. `byo-manual` Sovereigns instead show the OpenOva NS list in the wizard and poll until the customer's own registrar propagates the delegation. |
|
||
| 3. k3s + Flux bootstrap | [`infra/hetzner/cloudinit-control-plane.tftpl`](../infra/hetzner/cloudinit-control-plane.tftpl) | cloud-init on the control-plane node installs k3s v1.31.4+k3s1 with `--flannel-backend=none --disable-network-policy --disable=traefik --disable=servicelb --disable=local-storage --tls-san=<sovereign-fqdn>`, then installs Flux v2.4.0 core, then applies the Flux GitRepository + Kustomization pointing at `clusters/<sovereign-fqdn>/` in the public OpenOva monorepo. From this point Flux owns the cluster. Workers join via [`cloudinit-worker.tftpl`](../infra/hetzner/cloudinit-worker.tftpl) using the project-derived k3s_token. |
|
||
| 4. Bootstrap-kit install | `clusters/<sovereign-fqdn>/` (Flux-reconciled) | Flux installs the 12 G2 wrapper Helm charts (each a `bp-<name>:<semver>` OCI artifact published by [`.github/workflows/blueprint-release.yaml`](../.github/workflows/blueprint-release.yaml)) in dependency order: cilium → cert-manager → flux (host-level reconciler for the cluster's own Kustomizations) → crossplane → sealed-secrets (transient) → spire (server + agent) → nats-jetstream → openbao (3-node Raft) → keycloak (per topology choice) → gitea (with public Blueprint mirror) → bp-powerdns (per-Sovereign authoritative zone, #167) → bp-catalyst-platform (umbrella). |
|
||
| 5. Crossplane adoption | Crossplane Compositions in `clusters/<sovereign-fqdn>/` | Crossplane adopts management of all infrastructure created by OpenTofu in step 2; sealed-secrets is decommissioned in favour of ESO + OpenBao for day-2 secret distribution; further DNS records (gitea/admin/api/harbor) are written by `external-dns` against the per-Sovereign PowerDNS zone via the PowerDNS REST API (NOT against the registrar). Phase 1 begins (see §4). |
|
||
|
||
The wizard's progress page polls Flux Kustomizations on the new cluster and renders steady-state to the user when every Kustomization is `Ready=True`.
|
||
|
||
**DNS records written in Phase 0** — into the per-Sovereign PowerDNS zone (`<sovereign-fqdn>.`), see [`PLATFORM-POWERDNS.md`](PLATFORM-POWERDNS.md) §"Per-Sovereign zone model":
|
||
|
||
```
|
||
@ A → load balancer IP
|
||
* A → load balancer IP
|
||
console A → load balancer IP
|
||
api A → load balancer IP
|
||
gitea A → load balancer IP
|
||
harbor A → load balancer IP
|
||
```
|
||
|
||
The PDM `/v1/commit` endpoint writes the canonical 6-record set into the freshly-created Sovereign zone via the PowerDNS REST API. The wildcard A record covers every additional subdomain a Sovereign might add at runtime (`axon`, `umami`, `langfuse`, etc.) without re-issuing certificates. Per NAMING §5.1 the canonical control-plane DNS pattern is `{component}.{location-code}.{sovereign-domain}` — the wildcard handles per-Application records under per-Environment subdomains.
|
||
|
||
**OpenTofu state:** kept in the catalyst-api Pod under `/tmp/catalyst/tofu/<sovereign-fqdn>/` — pinned via the `CATALYST_TOFU_WORKDIR` env var on the catalyst-api Deployment (commit `27527e4c`) and backed by the Pod's writable `/tmp` emptyDir (2 Gi sizeLimit; the in-code default `/var/lib/catalyst/...` is unwritable for UID 65534, hence the override). Re-running with the same FQDN is idempotent (`tofu apply` on existing state). For air-gap installs the operator MUST configure a remote backend with encryption-at-rest so the Hetzner token isn't carried only on Pod ephemeral storage.
|
||
|
||
**Implementation status:** the Go wrapper, OpenTofu module, and 12 G2 wrapper charts (the original 11 + bp-powerdns added at #167) all exist today (verified at [`IMPLEMENTATION-STATUS.md`](IMPLEMENTATION-STATUS.md) §7). The pool-domain-manager (`core/pool-domain-manager/`) and its 5 registrar adapters are deployed and running in `openova-system`. End-to-end DoD against a real Hetzner project is pending Group M of the [Catalyst-Zero Provisioning Plan](PROVISIONING-PLAN.md).
|
||
|
||
Total Phase 0 time: 30–60 minutes for a single-region Hetzner Sovereign once DoD lands.
|
||
|
||
---
|
||
|
||
## 4. Phase 1 — Hand-off
|
||
|
||
After Phase 0 completes:
|
||
|
||
1. Crossplane in the new Sovereign **adopts** management of all infrastructure created by OpenTofu. From this point forward, all infrastructure changes go through Crossplane.
|
||
2. The bootstrap k3s nodes are not "thrown away" — they are claimed by Crossplane via the cloud provider's adoption mechanism.
|
||
3. OpenTofu state is archived and read-only. It is never touched again.
|
||
4. `catalyst-provisioner` no longer has any active connection to the new Sovereign.
|
||
|
||
The Sovereign is now self-sufficient. It has the full Catalyst control-plane set per [`PLATFORM-TECH-STACK.md`](PLATFORM-TECH-STACK.md) §2.3:
|
||
|
||
- Its own Crossplane managing further infrastructure.
|
||
- Its own OpenBao for secrets.
|
||
- Its own JetStream as event spine.
|
||
- Its own Keycloak for users.
|
||
- Its own SPIFFE/SPIRE for workload identity (5-min rotating SVIDs).
|
||
- Its own Gitea (with mirror of the public Blueprint catalog).
|
||
- Its own observability stack (Grafana + Alloy + Loki + Mimir + Tempo) for self-monitoring.
|
||
- Its own Catalyst control plane (console, marketplace, admin, projector, catalog-svc, provisioning, environment-controller, blueprint-controller, billing).
|
||
|
||
---
|
||
|
||
## 5. Phase 2 — Day-1 setup
|
||
|
||
The first `sovereign-admin` logs into `console.<location-code>.<sovereign-domain>`:
|
||
|
||
```
|
||
Day-1 actions
|
||
──────────────────────────────────────────────────────────────────
|
||
1. Configure cert-manager issuers (Let's Encrypt / corporate CA).
|
||
2. Configure backup destination (cloud object storage for Velero).
|
||
3. Configure Harbor with image-scanning policies.
|
||
4. (Optional) Federate Keycloak's catalyst-admin realm to corporate IdP.
|
||
5. (Optional) Configure observability exports (SIEM, datadog, etc.).
|
||
6. Onboard the first Organization:
|
||
Catalyst console → Admin → Organizations → New
|
||
Provide: name, contact, plan.
|
||
Environment-controller does NOT create vclusters yet.
|
||
They are created when the first Environment is provisioned.
|
||
7. Create the first Environment in that Organization:
|
||
Console → switch to Org context → Environments → New
|
||
Environment-controller spins up a vcluster on the chosen host cluster
|
||
and bootstraps Flux inside (watching the env-appropriate branch on
|
||
every Application repo within this Org's Gitea Org). Apps not yet
|
||
installed have no repos yet; repos are created on demand by the
|
||
provisioning-service when each App is installed.
|
||
Ready in ~60 seconds.
|
||
```
|
||
|
||
---
|
||
|
||
## 6. Phase 3 — Steady-state operation
|
||
|
||
From here on, the Sovereign runs autonomously. Sovereign-admins use the Catalyst admin UI for:
|
||
|
||
- Onboarding more Organizations
|
||
- Adding host clusters in new regions (Crossplane provisions them, environment-controller adopts them)
|
||
- Updating Catalyst itself (umbrella Blueprint version bumps, applied via Flux PR)
|
||
- Configuring SecretPolicies and EnvironmentPolicies
|
||
- Monitoring the Sovereign's own observability stack
|
||
- Reviewing audit logs
|
||
|
||
Everyday Application installs and configurations are done by `org-admins` and `org-developers` within their Organizations — see [`PERSONAS-AND-JOURNEYS.md`](PERSONAS-AND-JOURNEYS.md).
|
||
|
||
---
|
||
|
||
## 7. Multi-region topology
|
||
|
||
### 7.1 Single-region (SME default)
|
||
|
||
```
|
||
Region A
|
||
└── Host cluster: hz-fsn-mgt-prod ← Catalyst control plane + per-Org vclusters
|
||
└── all building blocks collapse onto one cluster (mgt + rtz + dmz workloads
|
||
in separate namespaces, with Cilium NetworkPolicies enforcing isolation)
|
||
```
|
||
|
||
Cheapest topology. Single-region failure = Sovereign down. Acceptable for SME tier where customers also accept SME-tier SLAs.
|
||
|
||
### 7.2 Multi-region (corporate default)
|
||
|
||
```
|
||
Region A (primary mgt) Region B Region C (DR)
|
||
───────────────── ───────────── ─────────────
|
||
hz-nbg-mgt-prod hz-fsn-rtz-prod hz-hel-rtz-prod
|
||
Catalyst control plane per-Org vclusters per-Org vclusters
|
||
Gitea, JetStream, OpenBao, (sibling realizations (sibling realizations
|
||
Keycloak, projector, of each Org's Environment) of each Org's Environment)
|
||
catalog-svc, marketplace,
|
||
console, admin, billing
|
||
hz-nbg-dmz-prod hz-fsn-dmz-prod hz-hel-dmz-prod
|
||
ingress, WAF, PowerDNS ingress, WAF, PowerDNS ingress, WAF, PowerDNS
|
||
```
|
||
|
||
The `mgt` building block is typically NOT replicated (one Catalyst control plane per Sovereign). The `rtz` and `dmz` blocks ARE replicated for workload HA.
|
||
|
||
OpenBao runs in BOTH the mgt cluster (primary) and each rtz region (replica) — see [`SECURITY.md`](SECURITY.md) §5 for replication semantics.
|
||
|
||
---
|
||
|
||
## 8. Adding a region post-provisioning
|
||
|
||
```
|
||
sovereign-admin in Catalyst admin UI:
|
||
Admin → Infrastructure → Add Region
|
||
Provider: Hetzner
|
||
Region: hel
|
||
Building blocks: rtz, dmz
|
||
Apply
|
||
```
|
||
|
||
Catalyst:
|
||
1. Crossplane provisions the new VPC, hosts, k3s cluster, etc.
|
||
2. Cluster registered in Catalyst's cluster registry.
|
||
3. cert-manager + Cilium + Flux + Crossplane + SPIRE + ESO + OpenBao replica deployed via the cluster's Flux Kustomization.
|
||
4. New region available as a Placement target for new and existing Environments.
|
||
|
||
Existing Applications with `placement.mode: single-region` do not migrate automatically. To extend an existing Application to the new region, the user explicitly switches Placement to `active-active` (or `active-hotstandby`) and adds the new region to `placement.regions` — that's a one-line edit in the Application's Gitea repo on the appropriate branch (or a click in the Topology tab).
|
||
|
||
---
|
||
|
||
## 9. Air-gap deployment
|
||
|
||
```
|
||
Connected zone (one-time) Air-gapped Sovereign
|
||
────────────────────────── ───────────────────────────────
|
||
1. Mirror public Blueprint OCI Harbor receives blobs via physical
|
||
artifacts to portable media. transfer / data diode.
|
||
2. Mirror Catalyst control-plane Sovereign's Gitea adopts blobs as
|
||
container images. OCI manifests in local registry.
|
||
3. Mirror cert-manager root + cert-manager configured with
|
||
organization CA bundle. internal CA only.
|
||
4. Configure Keycloak to local LDAP Keycloak federates to internal AD/LDAP.
|
||
(no external IdPs).
|
||
```
|
||
|
||
Catalyst is air-gap-ready by construction: every artifact (Blueprints, Catalyst code, base images) is OCI-signed. Mirror once, run forever.
|
||
|
||
---
|
||
|
||
## 10. Migration and decommission
|
||
|
||
### 10.1 Migrating an Organization between Sovereigns
|
||
|
||
Rare but supported. Example: a Bank Dhofar Organization started life on the openova Sovereign (paid SaaS), now wants to move to its own bankdhofar Sovereign (self-host).
|
||
|
||
```
|
||
1. Provision bankdhofar Sovereign (Phases 0–2).
|
||
2. On openova Sovereign: Admin → Organization → Export
|
||
Catalyst produces an export bundle:
|
||
- Org metadata
|
||
- All Application Gitea repos under this Org (cloned + bundled, including all branches)
|
||
- The Org's `shared-blueprints` repo
|
||
- Keycloak realm export (users, federated identities)
|
||
- OpenBao export (sealed secrets only)
|
||
3. On bankdhofar Sovereign: Admin → Organization → Import
|
||
Environment-controller recreates Environments → vclusters.
|
||
Flux pulls manifests, reconciles.
|
||
Apps come up.
|
||
4. Final cutover: DNS swap.
|
||
5. Verify, then decommission on openova side.
|
||
```
|
||
|
||
Time depends on data volume; typically minutes to hours per Org.
|
||
|
||
### 10.2 Decommissioning a Sovereign
|
||
|
||
Reverse of provisioning:
|
||
|
||
```
|
||
1. Migrate all Organizations off (Section 10.1).
|
||
2. Catalyst admin → Sovereign → Decommission
|
||
3. Crossplane begins teardown of host clusters.
|
||
4. OpenBao final state exported and stored encrypted.
|
||
5. DNS records removed.
|
||
6. Cloud resources reclaimed.
|
||
```
|
||
|
||
The customer keeps the OpenBao export and Gitea bundles for whatever retention period their compliance demands.
|
||
|
||
---
|
||
|
||
*Cross-reference [`ARCHITECTURE.md`](ARCHITECTURE.md) and [`SECURITY.md`](SECURITY.md). For day-to-day operation see [`SRE.md`](SRE.md).*
|