# infra/hetzner/ — Catalyst Sovereign provisioning module
Canonical Phase 0 OpenTofu module that provisions a single-region OR multi-region Catalyst Sovereign on Hetzner Cloud and bootstraps it onto Flux-driven GitOps. After tofu apply finishes, every subsequent change to the Sovereign goes through Crossplane (cloud resources) and Flux (Kubernetes resources). OpenTofu state is archived and never touched again.
This module is the implementation of docs/SOVEREIGN-PROVISIONING.md §3 (Phase 0 — Bootstrap) and follows docs/INVIOLABLE-PRINCIPLES.md — every value the wizard or operator picks is a variable; nothing is hardcoded.
## What this module creates

| Resource | Purpose |
|---|---|
| `hcloud_network` + `hcloud_network_subnet` | Private `10.0.0.0/16` with `10.0.1.0/24` reserved for control-plane and workers. |
| `hcloud_firewall` | Inbound rules for 80/443 (HTTPS), 6443 (k3s API), ICMP, and an opt-in SSH rule keyed to operator CIDRs. |
| `hcloud_ssh_key` | The operator's existing SSH key (from their Hetzner project) — never auto-generated. |
| `hcloud_server` (control plane) | 1 node by default (`ha_enabled=false`); 3 nodes when HA is on. Cloud-init installs k3s + Flux + the bootstrap kit pointer. |
| `hcloud_server` (workers) | `worker_count` nodes (default 2 — issue #733 multi-node Sovereign). Set to 0 explicitly for solo dev/POC. |
| `hcloud_load_balancer` (lb11) | Public IPv4; forwards 80→31080 and 443→31443 (Cilium Gateway NodePorts post-bootstrap). |
| `null_resource.dns_pool` | Calls `/usr/local/bin/catalyst-dns` (a helper inside the catalyst-api container) when `domain_mode=pool` to write Dynadot A records for the new sovereign FQDN. |
After Phase 0, the cluster's Flux pulls clusters/<sovereign_fqdn>/ from the public OpenOva monorepo and installs the 11-component bootstrap kit (Cilium → cert-manager → Crossplane → ESO → SPIRE → NATS → OpenBao → Keycloak → Gitea → catalyst-platform). Hetzner adoption by Crossplane happens once provider-hcloud is up.
## Multi-region wiring (slice G1, EPIC-0 #1095)

The module accepts a `var.regions[]` list-of-objects payload that captures the wizard's per-region sizing. Slice G1 wires every entry in that list end-to-end:

| `var.regions[i]` | Realised by | Notes |
|---|---|---|
| `regions[0]` | Legacy singular path (`hcloud_server.control_plane[0]`, `hcloud_load_balancer.main`, …) | Identity-preserving — no resource-address change for any Sovereign provisioned before slice G1. The catalyst-api provisioner mirrors `regions[0]` into `var.region` / `var.control_plane_size` / `var.worker_size` / `var.worker_count` before `tofu apply`. |
| `regions[1+]` | Multi-region overlay (`hcloud_server.secondary_control_plane["fsn1-1"]`, `hcloud_load_balancer.secondary["hel1-2"]`, …) | New resources keyed by `for_each = local.secondary_regions`. Same `hcloud_network.main` + `hcloud_firewall.main` + `hcloud_ssh_key.main` (one tenant boundary per Sovereign). Each secondary region gets its own /24 subnet inside the shared /16 and its own lb11. |

The hybrid (singular path + secondary-region overlay) is purely additive: no existing Sovereign state has entries in `local.secondary_regions` (the iteration filter `if i > 0` excludes `regions[0]`), so legacy `tofu plan` output is unchanged for any Sovereign whose request body has `len(regions) ≤ 1`. No `tofu state mv` is required for any pre-G1 state.
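A minimal sketch of what that iteration filter could look like (illustrative only; the authoritative local lives in main.tf, and the key shape and filters are the ones documented in this README):

```hcl
# Illustrative sketch — see main.tf for the authoritative definition.
# Keys follow the documented "{cloudRegion}-{index}" shape; regions[0]
# (the legacy singular path) and non-Hetzner entries are excluded.
locals {
  secondary_regions = {
    for i, r in var.regions :
    "${r.cloudRegion}-${i}" => r
    if i > 0 && r.provider == "hetzner"
  }
}
```

With the EPIC-6 example payload below, this yields exactly the keys `fsn1-1` and `hel1-2` surfaced by `secondary_region_keys`.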
### EPIC-6 (#1101) example: 3-region Continuum DR shape
Per docs/EPICS-1-6-unified-design.md §3.8 + §11, the EPIC-6 demo brings up one mgmt cluster + two data-plane clusters with Cilium ClusterMesh between them. Slice G1 provisions the cloud substrate; slice G3 wires ClusterMesh.
```jsonc
{
  "sovereign_fqdn": "demo.example.com",
  "org_name": "Demo Org",
  "org_email": "ops@example.com",
  "hcloud_token": "<rotate>",
  "hcloud_project_id": "12345",
  "ssh_public_key": "ssh-ed25519 AAAA... operator@laptop",

  // Legacy singular fields — derived from regions[0] by the catalyst-api
  // provisioner before tofu apply. No need to set these by hand when
  // regions[] is supplied; they're shown here for reference.
  "region": "nbg1",
  "control_plane_size": "cpx32",
  "worker_size": "cpx32",
  "worker_count": 1,

  // Per-region payload — slice G1 wires every entry in this list.
  // regions[0] = mgmt (Nuremberg), regions[1] = fsn data plane (Falkenstein),
  // regions[2] = hel data plane (Helsinki) per the EPIC-6 §3.8 cluster table.
  "regions": [
    { "provider": "hetzner", "cloudRegion": "nbg1", "controlPlaneSize": "cpx32", "workerSize": "cpx32", "workerCount": 1 },
    { "provider": "hetzner", "cloudRegion": "fsn1", "controlPlaneSize": "cpx32", "workerSize": "cpx32", "workerCount": 2 },
    { "provider": "hetzner", "cloudRegion": "hel1", "controlPlaneSize": "cpx32", "workerSize": "cpx32", "workerCount": 2 }
  ],

  "k3s_version": "v1.31.4+k3s1",
  "object_storage_region": "nbg1",
  "object_storage_access_key": "<from Hetzner Console>",
  "object_storage_secret_key": "<from Hetzner Console>",
  "object_storage_bucket_name": "catalyst-demo-example-com",
  "domain_mode": "byo"
}
```
Outputs after `tofu apply`:

| Output | Shape | EPIC-6 example value |
|---|---|---|
| `control_plane_ip` | string | `203.0.113.10` (mgmt CP, Nuremberg) |
| `load_balancer_ip` | string | `203.0.113.11` (mgmt LB, Nuremberg) |
| `secondary_region_keys` | list(string) | `["fsn1-1", "hel1-2"]` |
| `control_plane_ips_by_region` | map(string) | `{"fsn1-1": "203.0.113.20", "hel1-2": "203.0.113.30"}` |
| `load_balancer_ips_by_region` | map(string) | `{"fsn1-1": "203.0.113.21", "hel1-2": "203.0.113.31"}` |
The catalyst-api joins secondary_region_keys with Request.Regions[1+] to project per-region status into the deployment record. PowerDNS lua-records (docs/MULTI-REGION-DNS.md) aggregate every LB IP in load_balancer_ips_by_region into the Sovereign FQDN's A-record set so a single hostname spans every region with ifurlup health checking.
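For illustration (the record name, TTL, URL, and grouping below are assumptions; docs/MULTI-REGION-DNS.md is authoritative), a PowerDNS LUA A record that answers only with healthy LB IPs from the example above might look like:

```
demo.example.com. 60 IN LUA A "ifurlup('https://demo.example.com/', {{'203.0.113.11', '203.0.113.21', '203.0.113.31'}})"
```

`ifurlup` probes the URL against each candidate address and drops unhealthy ones from the answer set, which is what gives the single hostname its cross-region failover behaviour.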
### Out of scope for slice G1

- Cilium ClusterMesh wiring — slice G3 (joins separate clusters into a single mesh).
- Per-cluster GitOps differentiation — every secondary CP today renders an identical Flux Kustomization pointed at `clusters/<sovereign_fqdn>/`. Per-cluster paths (`clusters/hz-fsn-rtz-prod/`, etc., per docs/NAMING-CONVENTION.md §4.1) ship in slice G3 alongside ClusterMesh.
- Non-Hetzner regions — `var.regions[]` may carry `oci`/`aws`/`huawei`/`azure` entries; the Hetzner overlay filters them out (`if r.provider == "hetzner"`). Sister-provider modules (slice G2 / G4 / …) own their own iteration.
## Resource address contract

Every legacy resource keeps its existing address. New addresses introduced by slice G1, all keyed `for_each = local.secondary_regions` (key shape: `{cloudRegion}-{index}`, where `index` is the position in `var.regions[]`):

```
hcloud_network_subnet.secondary["{key}"]
hcloud_server.secondary_control_plane["{key}"]
hcloud_server.secondary_worker["{key}-w{i}"]            # i = 1..workerCount
hcloud_load_balancer.secondary["{key}"]
hcloud_load_balancer_network.secondary["{key}"]
hcloud_load_balancer_target.secondary_control_plane["{key}"]
hcloud_load_balancer_target.secondary_workers["{key}-w{i}"]
hcloud_load_balancer_service.secondary_http["{key}"]
hcloud_load_balancer_service.secondary_https["{key}"]
hcloud_load_balancer_service.secondary_dns["{key}"]
```
## Tests

Module-local tests live under tests/multi_region.tftest.hcl. They exercise five scenarios offline (no real Hetzner) via `mock_provider` + `override_resource`:

```sh
cd infra/hetzner
tofu init -backend=false
tofu validate
tofu fmt -check
tofu test
```

CI runs these on every PR touching infra/hetzner/** via .github/workflows/infra-hetzner-tofu.yaml.
## Why cpx21 / cpx31 are NOT the default (issue #752)

Both cpx21 (3 vCPU / 4 GB / €10.99/mo) and cpx31 (4 vCPU / 8 GB / €20.49/mo) appear cheaper than the chosen cpx22 / cpx32 defaults and are LISTED in Hetzner's GET /v1/server_types response with full EU pricing (fsn1, nbg1, hel1). They are NOT orderable.

```sh
$ HCLOUD_TOKEN=...
$ for SKU in cpx21 cpx31; do for LOC in fsn1 nbg1 hel1; do
    curl -sH "Authorization: Bearer $HCLOUD_TOKEN" -X POST \
      "https://api.hetzner.cloud/v1/servers" \
      -H "Content-Type: application/json" \
      -d "{\"name\":\"probe-$SKU-$LOC\",\"server_type\":\"$SKU\",\"image\":\"ubuntu-24.04\",\"location\":\"$LOC\",\"start_after_create\":false}" \
    | jq -r '.error.message // "ORDERED"'
  done; done
unsupported location for server type   # cpx21/fsn1
unsupported location for server type   # cpx21/nbg1
unsupported location for server type   # cpx21/hel1
unsupported location for server type   # cpx31/fsn1
unsupported location for server type   # cpx31/nbg1
unsupported location for server type   # cpx31/hel1
```
cpx22 and cpx32 return ORDERED (verified 2026-05-04 against a real project — server IDs cleaned up immediately after provisioning).
The /v1/server_types price entry is misleading: Hetzner advertises a price for every (SKU, location) pair regardless of whether new orders are accepted. The authoritative source for "can I order this?" is POST /v1/servers itself. The cpx (no-letter) generation is being phased out in favour of the cpx22/cpx32/cpx52 generation across EU DCs; cpx11/cpx21/cpx31/cpx41 are NOT orderable in fsn1/nbg1/hel1 as of 2026-05-04.
PR #741 attempted a default of cpx21 CP + cpx31 workers based on the listed prices and got blocked at tofu apply time with the same "unsupported location" error. PR #744 reverted to the orderable cpx22 + cpx32. Issue #752 documented the gap between the listed prices and the orderability constraint; this section is the durable record so future engineers don't re-attempt.
If Hetzner ever opens cpx21/cpx31 ordering in EU DCs (re-probe with the script above), the saving is ~€4/mo per Sovereign on CP + ~€11/mo per Sovereign per worker. Until then, cpx22/cpx32 is the floor.
## Sizing rationale — why cpx32 × 3 is the default (issue #733)

docs/PLATFORM-TECH-STACK.md §7.1 sets the RAM budget for a Catalyst-only mgmt cluster at ~11.3 GB, and §7.4 adds ~8.8 GB for per-host-cluster infrastructure that runs on every host cluster including mgmt (Cilium, Flux, Crossplane, cert-manager, ESO, Kyverno, Trivy Operator, Falco, Harbor, SeaweedFS, Velero, plus small operators).
The total Sovereign footprint is ~20 GB RAM, ~10 vCPU minimum. There are two ways to land that:
- Vertical scale — single CPX52 node (12 vCPU / 24 GB) hosts everything.
- Horizontal scale (default) — 1× CPX32 control plane + 2× CPX32 workers (3 nodes × 4 vCPU / 8 GB = 12 vCPU / 24 GB total). Same aggregate footprint, multi-node fault tolerance, real horizontal scale for workloads with `replicas: 2`.
The horizontal-scale shape is the canonical Catalyst architecture — clusters/_template/ was designed for it. The previous single-node default was a regression that discarded horizontal scalability; this module restores the multi-node default per issue #733.
| Hetzner type | RAM | vCPU | Disk | Default role |
|---|---|---|---|---|
| `cx22` | 4 GB | 2 | 40 GB | Insufficient — OOM during Cilium install. |
| `cx32` | 8 GB | 4 | 80 GB | Too small for a solo Sovereign on its own. |
| `cpx32` | 8 GB | 4 (AMD) | 160 GB | Default control plane AND default worker. Multi-node — pair with `worker_count ≥ 2` for the canonical 3-node topology (12 vCPU / 24 GB total). |
| `cpx42` | 16 GB | 8 (AMD) | 320 GB | Mid-tier worker for trimmed component sets. |
| `cpx52` | 24 GB | 12 (AMD) | 480 GB | Solo dev/POC starter when `worker_count=0` (single-node mode). |
| `cx42` | 16 GB | 8 | 160 GB | Legacy single-node default — still allowed, no longer default. |
| `cx52` | 32 GB | 16 | 320 GB | Heavy single-node Sovereign with many Blueprints. |
| `ccx33` | 32 GB | 8 dedicated | 240 GB | Production dedicated-vCPU control plane — avoids noisy-neighbour latency on the API server. |
| `cax41` | 32 GB | 16 ARM | 320 GB | Cheapest path to 32 GB. Confirm all upstream Blueprint container images are multi-arch before using (most are; a handful aren't). |
### Upgrade path
Resizing is non-destructive on Hetzner — tofu apply -var control_plane_size=ccx33 will trigger a hcloud_server resize. The node reboots once. On a single-node Sovereign that means ~60 seconds of console downtime; the LB health-check covers it. For HA Sovereigns (ha_enabled=true), the resize is rolling — no externally-visible downtime.
For a multi-node Sovereign, prefer adding workers (worker_count) before upsizing the control plane. The control plane's job is k3s + control-plane services; workers absorb the per-host-infra and application load.
## Firewall rules
The Phase-0 firewall is intentionally minimal. All long-term policy is enforced by Cilium NetworkPolicies (in-cluster) and tightened by Crossplane Compositions (cloud edge) once Phase 1 completes.
### Inbound (Phase-0 baseline)

| Port | Protocol | Source | Why |
|---|---|---|---|
| 80 | TCP | `0.0.0.0/0`, `::/0` | HTTP — for ACME HTTP-01 challenges and the cert-manager bootstrap. Cilium Gateway terminates. |
| 443 | TCP | `0.0.0.0/0`, `::/0` | HTTPS — the only port end-users reach. All Catalyst surfaces (console, gitea, harbor, admin, api) are served behind 443 via Cilium Gateway and SNI routing. |
| 6443 | TCP | `0.0.0.0/0`, `::/0` | k3s API server. Open to allow the wizard to fetch the kubeconfig and confirm the cluster is healthy. Crossplane Composition tightens this to operator-owned CIDRs in Phase 2. |
| ICMP | ICMP | `0.0.0.0/0`, `::/0` | Diagnostics (Path MTU Discovery, traceroute). Open by default; closing it is a foot-gun that breaks PMTU. |
| 22 | TCP | `var.ssh_allowed_cidrs` (default: empty) | SSH break-glass. Off by default — the rule is omitted entirely when the list is empty. Operators add their own CIDRs at provisioning time or via a Crossplane Composition later. |
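A sketch of how "omitted entirely when the list is empty" can be expressed (illustrative; the real rule lives in main.tf):

```hcl
# Illustrative sketch — see main.tf for the authoritative firewall.
# The dynamic block emits the SSH rule only when ssh_allowed_cidrs is non-empty,
# so an empty list removes port 22 from the firewall altogether.
dynamic "rule" {
  for_each = length(var.ssh_allowed_cidrs) > 0 ? [1] : []
  content {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.ssh_allowed_cidrs
  }
}
```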
### Outbound (Hetzner default — open)

Hetzner's hcloud_firewall does not enforce egress unless you write explicit deny rules. We rely on the open-egress default plus in-cluster Cilium NetworkPolicies for fine-grained control. The egress flows the bootstrap requires:

| Destination | Why |
|---|---|
| get.k3s.io, github.com/k3s-io/k3s/releases | k3s installer + binary download. |
| pool.ntp.org (UDP 123) | Time sync — required for SPIRE workload identity (5-min SVID rotation). |
| 1.1.1.1, 8.8.8.8 (UDP/TCP 53) | DNS until the Sovereign's own DNS lands. |
| ghcr.io (TCP 443) | Container images for Catalyst services + bootstrap kit (bp-* Blueprints). |
| github.com/openova-io/openova (TCP 443) | Flux GitRepository pull. |
### Deliberately blocked
| Port | Why blocked |
|---|---|
| 22 (SSH) | Default-closed at the firewall. Break-glass is via Hetzner Console (out-of-band, password-less) when no ssh_allowed_cidrs is set. Removing the world-open SSH attack surface is the largest single hardening win. |
| 10250 (kubelet) | Never exposed publicly. Cluster-internal only. |
| 2379/2380 (etcd) | Embedded in k3s; never exposed publicly. |
| 8472 (flannel VXLAN) | We disable flannel; Cilium uses geneve/wireguard within the cluster network. |
## k3s flags + rationale

k3s is installed via `curl get.k3s.io | sh -` from cloud-init. The INSTALL_K3S_EXEC argument carries the flag set required by the rest of the Catalyst stack. Each flag below maps to a specific architectural decision in docs/PLATFORM-TECH-STACK.md §8.

| Flag | Why |
|---|---|
| `--cluster-init` | Initialise embedded etcd. Required for the Phase-1 hand-off to add additional control-plane nodes (`ha_enabled=true`) without re-bootstrapping. |
| `--flannel-backend=none` | k3s ships with flannel; we replace the CNI with Cilium (Gateway API, eBPF, mTLS via WireGuard). Setting `none` keeps k3s from racing flannel against Cilium during boot. |
| `--disable=traefik` | k3s ships with Traefik; we use Cilium Gateway API (already part of the Cilium install). Catalyst's Gateway/HTTPRoute manifests assume Gateway API, not Traefik IngressRoute. |
| `--disable=servicelb` | k3s ships with klipper-lb; we use the Hetzner load balancer for ingress (`hcloud_load_balancer.main`) and PowerDNS lua-records (ifurlup) for cross-region failover. klipper-lb would steal the NodePort 80/443 binding. |
| `--disable=local-storage` | k3s ships local-path-provisioner; we use hcloud-csi (provisioned by Crossplane after Phase 1) so PVCs survive node deletion and can be migrated across regions via Velero. |
| `--disable-network-policy` | k3s ships kube-router NetworkPolicy; Cilium handles NetworkPolicy. Two NetworkPolicy controllers fight each other. |
| `--tls-san=<sovereign_fqdn>` | The API server TLS cert must be valid for the public sovereign FQDN, otherwise the wizard's kubeconfig fetch and any operator running `kubectl --server=https://<fqdn>:6443` get a SAN mismatch. |
| `--node-label catalyst.openova.io/role=control-plane` | Used by NodeAffinity on Catalyst control-plane services (Console, projector, etc.) to pin them off worker nodes. |
| `--write-kubeconfig-mode=0644` | Lets the catalyst-api fetch the kubeconfig over the wizard channel without sudo. The kubeconfig is rotated and replaced with a SPIFFE-issued identity in Phase 2. |
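One way the template could assemble that flag set into the INSTALL_K3S_EXEC string (a sketch only; the authoritative assembly lives in cloudinit-control-plane.tftpl):

```hcl
# Illustrative sketch — the authoritative template is cloudinit-control-plane.tftpl.
# Joins the documented flag set into the string cloud-init hands to the k3s
# installer; --tls-san is filled from the per-Sovereign FQDN variable.
locals {
  k3s_exec = join(" ", [
    "--cluster-init",
    "--flannel-backend=none",
    "--disable=traefik",
    "--disable=servicelb",
    "--disable=local-storage",
    "--disable-network-policy",
    "--tls-san=${var.sovereign_fqdn}",
    "--node-label catalyst.openova.io/role=control-plane",
    "--write-kubeconfig-mode=0644",
  ])
}
```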
The INSTALL_K3S_VERSION environment variable is var.k3s_version (default v1.31.4+k3s1). Pinned so a Sovereign provisioned today and one provisioned next month land on the same Kubernetes minor — the Catalyst compatibility matrix in docs/PLATFORM-TECH-STACK.md §8.1 is keyed to k3s minor versions.
## SSH key management — why no auto-generated keys
The module requires the operator to provide their own SSH public key via var.ssh_public_key. We never generate an ephemeral keypair. Rationale:
- Break-glass continuity. A Sovereign lives for years. An ephemeral key generated at provisioning time disappears the moment the catalyst-provisioner container restarts; at that point the only way back into the cluster is via Hetzner Console password-reset, which itself disrupts the in-cluster SPIRE identity if it forces a kubelet restart. Operator-owned keys (rooted in their corporate identity provider or hardware token) survive provisioner restarts.
- Audit trail. Hetzner logs every `hcloud_ssh_key` create and every login that uses it. With operator-owned keys, that log directly traces back to a named human in the operator's IdP. With auto-generated keys, the log says "catalyst-provisioner did it" — useless for incident forensics.
- No private-key custody problem. Catalyst would have to store the auto-generated private key somewhere to give the operator break-glass. Either we put it in OpenBao (chicken-and-egg: OpenBao isn't running yet during Phase 0), or we ship it back to the wizard (we're now responsible for the key never leaking through the browser, the catalyst-provisioner logs, the OpenTofu state file, ...). Operator-owned keys move that custody problem to whoever's already responsible for it (the operator).
- Compliance. Most enterprise frameworks (SOC 2 CC6.1, ISO 27001 A.9.4.3) require keys to trace back to a named individual. Auto-generated, vendor-held keys fail this.
The validation regex on var.ssh_public_key accepts ssh-rsa, ssh-ed25519, and ecdsa-sha2-nistp256 formats. Recommend ssh-ed25519 from a YubiKey-resident key for production.
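A sketch of what that validation might look like (illustrative; the description text and exact regex in variables.tf are authoritative):

```hcl
# Illustrative sketch — see variables.tf for the authoritative validation.
variable "ssh_public_key" {
  description = "Operator-owned OpenSSH public key (never auto-generated)."
  type        = string

  validation {
    condition = can(regex(
      "^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp256) ", var.ssh_public_key
    ))
    error_message = "ssh_public_key must be an ssh-rsa, ssh-ed25519, or ecdsa-sha2-nistp256 OpenSSH key."
  }
}
```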
## OS hardening (cloud-init)
Both cloudinit-control-plane.tftpl and cloudinit-worker.tftpl apply the same baseline. Each item is a template-conditional driven by a variable so an operator can disable it for a short-lived test Sovereign.
| Item | Variable (default) | What happens |
|---|---|---|
| sshd drop-in | always on | /etc/ssh/sshd_config.d/99-catalyst-hardening.conf sets `PasswordAuthentication no`, `KbdInteractiveAuthentication no`, `PermitRootLogin prohibit-password`, disables forwarding, tightens `MaxAuthTries=3` and `LoginGraceTime=30`. The ssh-rsa/ssh-ed25519 key Hetzner injects via `ssh_keys[]` is the only path in. |
| unattended-upgrades | `enable_unattended_upgrades=true` | Daily security-only upgrades on Ubuntu, restricted to the *-security pocket. Auto-reboot at 02:30 if a kernel upgrade requires it; the LB health check covers the ~60 s window. Removes unused kernels to keep /boot from filling. |
| fail2ban (sshd jail) | `enable_fail2ban=true` | Defence-in-depth in case ssh_allowed_cidrs is later widened. `maxretry=5`, `findtime=10m`, `bantime=1h`, systemd backend. |
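As an illustration of "each item is a template-conditional" (a sketch; the real fragments live in the two .tftpl files, and the jail values shown are the ones documented above):

```
# Illustrative .tftpl fragment — cloud-init YAML with OpenTofu template directives.
# The fail2ban stanza renders only when enable_fail2ban is true.
%{ if enable_fail2ban ~}
packages:
  - fail2ban
write_files:
  - path: /etc/fail2ban/jail.local
    content: |
      [sshd]
      enabled  = true
      maxretry = 5
      findtime = 10m
      bantime  = 1h
      backend  = systemd
%{ endif ~}
```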
The hardening explicitly does not include AppArmor profile authoring, kernel-module blacklisting, or a CIS Level-2 sweep. Those are a Phase-2 task delivered by a Kyverno policy + a privileged DaemonSet (bp-cis-hardening), not Phase-0 cloud-init.
## Variables — reference

See variables.tf for the authoritative source. Highlights:

| Variable | Default | Validation |
|---|---|---|
| `region` | (required) | `fsn1`, `nbg1`, `hel1`, `ash`, `hil` |
| `control_plane_size` | `cx42` | `^(cx[0-9]+\|…` |
| `worker_size` | `cx32` | `^(cx[0-9]+\|…` |
| `worker_count` | `0` | 0 ≤ n ≤ 50 |
| `ha_enabled` | `false` | bool |
| `k3s_version` | `v1.31.4+k3s1` | `^v\d+\.\d+\.\d+\+k3s\d+$` |
| `ssh_public_key` | (required) | OpenSSH formats only |
| `ssh_allowed_cidrs` | `[]` | every entry must be a valid CIDR |
| `enable_unattended_upgrades` | `true` | bool |
| `enable_fail2ban` | `true` | bool |
| `domain_mode` | `pool` | `pool` or `byo` |
| `gitops_repo_url` | public OpenOva monorepo | string |
| `gitops_branch` | `main` | string |
Every default is the common case for a solo Sovereign. The waterfall doctrine (docs/INVIOLABLE-PRINCIPLES.md §1) means the defaults must produce a working production-shape Sovereign, not a "demo it first" scaffold.
## How to invoke this module standalone
Most operators reach this module through the Catalyst console wizard, which writes a tofu.auto.tfvars.json, runs tofu init && tofu apply, and ships the outputs back to the user. The wizard path is the supported one.
If you need to drive provisioning by CLI (air-gapped sites, debugging, or a CI pipeline you own), the module accepts a flat -var-file= invocation:
```sh
# 1. Clone the module
git clone https://github.com/openova-io/openova.git
cd openova/infra/hetzner

# 2. Write a tfvars file (NEVER commit this — it contains the hcloud_token).
#    File ownership 0600, on an encrypted disk.
cat > sovereign.tfvars.json <<EOF
{
  "sovereign_fqdn": "omantel.omani.works",
  "sovereign_subdomain": "omantel",
  "org_name": "Omantel",
  "org_email": "ops@omantel.om",
  "hcloud_token": "<rotate after run>",
  "hcloud_project_id": "<your project id>",
  "region": "fsn1",
  "control_plane_size": "cx42",
  "worker_count": 0,
  "ha_enabled": false,
  "k3s_version": "v1.31.4+k3s1",
  "ssh_public_key": "ssh-ed25519 AAAA... operator@laptop",
  "ssh_allowed_cidrs": ["203.0.113.7/32"],
  "domain_mode": "byo",
  "gitops_repo_url": "https://github.com/openova-io/openova",
  "gitops_branch": "main"
}
EOF
chmod 0600 sovereign.tfvars.json

# 3. Init + plan + apply
tofu init
tofu plan -var-file=sovereign.tfvars.json -out=plan.bin
tofu apply plan.bin

# 4. Read outputs
tofu output -json
```
Outputs:

| Name | Use |
|---|---|
| `control_plane_ip` | First control-plane node's public IPv4. |
| `load_balancer_ip` | Public IPv4 the customer points DNS A records at (when `domain_mode=byo`). |
| `console_url` | `https://console.<sovereign_fqdn>` — usable once Flux finishes the bootstrap (~30 min). |
| `gitops_repo_url` | Path Flux on the new cluster watches; useful for audit. |
After tofu apply finishes, archive the OpenTofu state file and the tfvars file. Per docs/SOVEREIGN-PROVISIONING.md §4, the state is read-only from this point forward — Crossplane has adopted the cloud resources and any further change goes through it.
## What this module does NOT do

Out of scope by design — these are Crossplane / Flux territory:

- Cilium + Hubble installation (handled by `bp-cilium` reconciled by Flux).
- cert-manager issuers (handled by `bp-cert-manager` + Phase-2 day-1 setup).
- Keycloak realm provisioning (handled by `bp-keycloak` + Phase-2 day-1 setup).
- Object-storage bucket creation for Velero backups (Crossplane `provider-hcloud` + an `hcloud-storage-volume` Composition).
- DNS records beyond the Phase-0 wildcard (handled by External-DNS in the Sovereign once the bootstrap kit comes up).
- Day-2 cluster ops (node addition/removal — Crossplane Composition).
If you find yourself adding any of these to main.tf, you're violating docs/INVIOLABLE-PRINCIPLES.md §3 — stop and route the work to Crossplane / Flux instead.
## Files

| File | Role |
|---|---|
| `main.tf` | Resources + locals (network, firewall, SSH key, servers, LB, DNS hook). |
| `variables.tf` | Wizard inputs as variables, with validation blocks. |
| `outputs.tf` | What the catalyst-api provisioner reads back after `tofu apply`. |
| `versions.tf` | OpenTofu + provider version constraints. |
| `cloudinit-control-plane.tftpl` | cloud-init for the first / HA control-plane nodes. Installs hardening, k3s, Flux, bootstrap pointer. |
| `cloudinit-worker.tftpl` | cloud-init for `worker_count` nodes. Installs hardening + joins the cluster. |
Part of the public OpenOva Catalyst monorepo. See docs/SOVEREIGN-PROVISIONING.md for the end-to-end provisioning narrative and docs/PLATFORM-TECH-STACK.md for the resource budget that drives the sizing defaults.