openova/infra/hetzner
e3mrah 40ca4e4d50
fix(infra): registries.yaml mirror needs rewrite — Harbor proxy is /v2/proj/, not /proj/v2/ (#640)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during the otech23 first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.
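The boot-time resolution can be sketched as a cloud-init fragment (a sketch only: inside the real .tftpl the shell `$` would need `$$` escaping, and the kubeconfig path assumes the k3s default):

```yaml
#cloud-config
runcmd:
  - |
    # Resolve this node's own public IPv4 at boot instead of templating it
    # at plan time (the templated form is what created the tofu graph cycle).
    CP_IPV4="$(curl -fsS http://169.254.169.254/hetzner/v1/metadata/public-ipv4)"
    # Rewrite the kubeconfig server: line to the CP public IP, not the LB IP.
    sed -i "s|server: https://[^:]*:6443|server: https://$CP_IPV4:6443|" /etc/rancher/k3s/k3s.yaml
```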

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).
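On the tofu side, the wiring amounts to one more entry in the templatefile() vars map; a sketch with the surrounding locals elided:

```hcl
locals {
  control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
    # ...existing vars...
    # default "" preserves the no-signer path described above
    handover_jwt_public_key = var.handover_jwt_public_key
  })
}
```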

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.
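The optional Secret mount from step 5 looks roughly like this (a sketch; the Secret key name is a placeholder):

```yaml
# api-deployment.yaml (fragment)
env:
  - name: CATALYST_HARBOR_ROBOT_TOKEN
    valueFrom:
      secretKeyRef:
        name: harbor-robot-token   # mirrored from openova-harbor
        key: token                 # key name is a placeholder
        optional: true             # degraded paths still come up without it
```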

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name follows /v2/ as part of the image path, not as a URL
prefix before it).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
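The corrected shape, as a registries.yaml sketch (the auth username is a placeholder; the mirror/rewrite shape is the point):

```yaml
# /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"    # bare host; containerd appends /v2/...
    rewrite:
      "^(.*)$": "proxy-dockerhub/$1"   # final URL: /v2/proxy-dockerhub/<repo>/...
configs:
  "harbor.openova.io":
    auth:
      username: "robot$proxy"          # placeholder robot-account name
      password: "${harbor_robot_token}"
```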

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:22:21 +04:00
cloudinit-control-plane.tftpl fix(infra): registries.yaml mirror needs rewrite — Harbor proxy is /v2/proj/, not /proj/v2/ (#640) 2026-05-02 23:22:21 +04:00
cloudinit-worker.tftpl feat(infra/hetzner): bump default to cx42, add OS hardening + operator README 2026-04-28 13:54:15 +02:00
main.tf fix(infra+api): wire handover_jwt_public_key end-to-end through tofu provisioning (#636) 2026-05-02 22:28:44 +04:00
outputs.tf feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409) 2026-05-01 16:54:22 +04:00
README.md refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171) 2026-04-29 08:51:09 +02:00
variables.tf fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623) 2026-05-02 19:21:16 +04:00
versions.tf feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409) 2026-05-01 16:54:22 +04:00

infra/hetzner/ — Catalyst Sovereign provisioning module

Canonical Phase 0 OpenTofu module that provisions a single-region Catalyst Sovereign on Hetzner Cloud and bootstraps it onto Flux-driven GitOps. After tofu apply finishes, every subsequent change to the Sovereign goes through Crossplane (cloud resources) and Flux (Kubernetes resources). OpenTofu state is archived and never touched again.

This module is the implementation of docs/SOVEREIGN-PROVISIONING.md §3 (Phase 0 — Bootstrap) and follows docs/INVIOLABLE-PRINCIPLES.md — every value the wizard or operator picks is a variable; nothing is hardcoded.


What this module creates

Resource Purpose
hcloud_network + hcloud_network_subnet Private 10.0.0.0/16 with 10.0.1.0/24 reserved for control-plane and workers.
hcloud_firewall Inbound rules for 80/443 (HTTPS), 6443 (k3s API), ICMP, and an opt-in SSH rule keyed to operator CIDRs.
hcloud_ssh_key The operator's existing SSH key (from their Hetzner project) — never auto-generated.
hcloud_server (control plane) 1 node by default (ha_enabled=false); 3 nodes when HA is on. Cloud-init installs k3s + Flux + the bootstrap kit pointer.
hcloud_server (workers) worker_count nodes (default 0 — solo Sovereign).
hcloud_load_balancer (lb11) Public IPv4; forwards 80→31080 and 443→31443 (Cilium Gateway NodePorts post-bootstrap).
null_resource.dns_pool Calls /usr/local/bin/catalyst-dns (a helper inside the catalyst-api container) when domain_mode=pool to write Dynadot A records for the new sovereign FQDN.

After Phase 0, the cluster's Flux pulls clusters/<sovereign_fqdn>/ from the public OpenOva monorepo and installs the 11-component bootstrap kit (Cilium → cert-manager → Crossplane → ESO → SPIRE → NATS → OpenBao → Keycloak → Gitea → catalyst-platform). Hetzner adoption by Crossplane happens once provider-hcloud is up.


Sizing rationale — why cx42 is the default

docs/PLATFORM-TECH-STACK.md §7.1 sets the RAM budget for a Catalyst-only mgt cluster at ~11.3 GB, and §7.4 adds ~8.8 GB for per-host-cluster infrastructure that runs on every host cluster including mgt (Cilium, Flux, Crossplane, cert-manager, ESO, Kyverno, Trivy Operator, Falco, Harbor, SeaweedFS, Velero, plus small operators).

For a solo Sovereign (single node hosting both the Catalyst control plane and the per-host-cluster infra), the floor is therefore ~20 GB RAM minimum, before adding any application Blueprints.

Hetzner type RAM vCPU Disk Verdict for solo Sovereign
cx22 4 GB 2 40 GB Insufficient — OOM during Cilium install.
cx32 8 GB 4 80 GB Insufficient; the former default. Bootstrap kit OOMs around the OpenBao + Keycloak step (~12-15 GB working set).
cx42 16 GB 8 160 GB Default. Smallest viable size for a solo Sovereign with no Blueprints. Leaves ~5 GB headroom for the first 1-2 Application Blueprints before scaling.
cx52 32 GB 16 320 GB Recommended for a solo Sovereign that will also host workloads (10+ Blueprints).
ccx33 32 GB 8 dedicated 240 GB Recommended for production solo Sovereign — dedicated vCPUs avoid noisy-neighbour latency on the API server.
cax41 32 GB 16 ARM 320 GB Cheapest path to 32 GB. Confirm all upstream Blueprint container images are multi-arch before using (most are; a handful aren't).

The original cx32 default was carried over from a development scratchpad; on a real provisioning run it OOMs during the bootstrap. The default is now cx42, validated against the §7.1 + §7.4 budget, and the variable's validation regex rejects anything outside the cxNN | ccxNN | caxNN families.
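The size guard can be sketched as a standard OpenTofu validation block (error-message wording assumed):

```hcl
variable "control_plane_size" {
  type    = string
  default = "cx42"
  validation {
    condition     = can(regex("^(cx[0-9]+|ccx[0-9]+|cax[0-9]+)$", var.control_plane_size))
    error_message = "control_plane_size must be a Hetzner cxNN, ccxNN, or caxNN type."
  }
}
```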

Upgrade path

Resizing is non-destructive on Hetzner — tofu apply -var control_plane_size=cx52 will trigger a hcloud_server resize. The node reboots once. On a single-node Sovereign that means ~60 seconds of console downtime; the LB health-check covers it. For HA Sovereigns (ha_enabled=true), the resize is rolling — no externally-visible downtime.

For a multi-node Sovereign, prefer adding workers (worker_count) before upsizing the control plane. The control plane's job is k3s + control-plane services; workers absorb the per-host-infra and application load.


Firewall rules

The Phase-0 firewall is intentionally minimal. All long-term policy is enforced by Cilium NetworkPolicies (in-cluster) and tightened by Crossplane Compositions (cloud edge) once Phase 1 completes.

Inbound (Phase-0 baseline)

Port Protocol Source Why
80 TCP 0.0.0.0/0, ::/0 HTTP — for ACME HTTP-01 challenges and the cert-manager bootstrap. Cilium Gateway terminates.
443 TCP 0.0.0.0/0, ::/0 HTTPS — the only port end-users reach. All Catalyst surfaces (console, gitea, harbor, admin, api) are served behind 443 via Cilium Gateway and SNI routing.
6443 TCP 0.0.0.0/0, ::/0 k3s API server. Open to allow the wizard to fetch the kubeconfig and confirm the cluster is healthy. Crossplane Composition tightens this to operator-owned CIDRs in Phase 2.
ICMP ICMP 0.0.0.0/0, ::/0 Diagnostics (Path MTU Discovery, traceroute). Open by default; closing it is a foot-gun that breaks PMTU.
22 TCP var.ssh_allowed_cidrs (default: empty) SSH break-glass. Off by default — the rule is omitted entirely when the list is empty. Operators add their own CIDRs at provisioning time or via a Crossplane Composition later.
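The opt-in SSH rule can be sketched with a dynamic block (attribute names per the hcloud provider; the surrounding resource layout is assumed):

```hcl
resource "hcloud_firewall" "sovereign" {
  name = "catalyst-baseline"

  # The rule is omitted entirely when ssh_allowed_cidrs is empty.
  dynamic "rule" {
    for_each = length(var.ssh_allowed_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_allowed_cidrs
    }
  }
}
```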

Outbound (Hetzner default — open)

Hetzner's hcloud_firewall does not enforce egress unless you write explicit deny rules. We rely on the open-egress default plus in-cluster Cilium NetworkPolicies for fine-grained control. The egress flows the bootstrap requires:

Destination Why
get.k3s.io, github.com/k3s-io/k3s/releases k3s installer + binary download.
pool.ntp.org (UDP 123) Time sync — required for SPIRE workload identity (5-min SVID rotation).
1.1.1.1, 8.8.8.8 (UDP/TCP 53) DNS until the Sovereign's own DNS lands.
ghcr.io (TCP 443) Container images for Catalyst services + bootstrap kit (bp-* Blueprints).
github.com/openova-io/openova (TCP 443) Flux GitRepository pull.

Deliberately blocked

Port Why blocked
22 (SSH) Default-closed at the firewall. Break-glass is via Hetzner Console (out-of-band, password-less) when no ssh_allowed_cidrs is set. Removing the world-open SSH attack surface is the largest single hardening win.
10250 (kubelet) Never exposed publicly. Cluster-internal only.
2379/2380 (etcd) Embedded in k3s; never exposed publicly.
8472 (flannel VXLAN) We disable flannel; Cilium uses geneve/wireguard within the cluster network.

k3s flags + rationale

k3s is installed via curl get.k3s.io | sh - from cloud-init. The INSTALL_K3S_EXEC argument carries the flag set required by the rest of the Catalyst stack. Each flag below maps to a specific architectural decision in docs/PLATFORM-TECH-STACK.md §8.

Flag Why
--cluster-init Initialise embedded etcd. Required for Phase-1 hand-off to add additional control-plane nodes (ha_enabled=true) without re-bootstrapping.
--flannel-backend=none k3s ships with flannel; we replace the CNI with Cilium (gateway API, eBPF, mTLS via wireguard). Setting none keeps k3s from racing flannel against Cilium during boot.
--disable=traefik k3s ships with Traefik; we use Cilium Gateway API (already part of the Cilium install). Catalyst's Gateway/HTTPRoute manifests assume Gateway API, not Traefik IngressRoute.
--disable=servicelb k3s ships with klipper-lb; we use the Hetzner load balancer for ingress (hcloud_load_balancer.main) and PowerDNS lua-records (ifurlup) for cross-region failover. klipper-lb would steal the NodePort 80/443 binding.
--disable=local-storage k3s ships local-path-provisioner; we use hcloud-csi (provisioned by Crossplane after Phase 1) so PVCs survive node deletion and can be migrated across regions via Velero.
--disable-network-policy k3s ships kube-router NetworkPolicy; Cilium handles NetworkPolicy. Two NetworkPolicy controllers fight each other.
--tls-san=<sovereign_fqdn> API server TLS cert must be valid for the public sovereign FQDN, otherwise the wizard's kubeconfig fetch and any operator running kubectl --server=https://<fqdn>:6443 get a SAN mismatch.
--node-label catalyst.openova.io/role=control-plane Used by NodeAffinity on Catalyst control-plane services (Console, projector, etc.) to pin them off worker nodes.
--write-kubeconfig-mode=0644 Lets the catalyst-api fetch the kubeconfig over the wizard channel without sudo. The kubeconfig is rotated and replaced with a SPIFFE-issued identity in Phase 2.
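Assembled, the install line in cloud-init looks roughly like this (a sketch; the template substitutes the real FQDN and pinned version):

```sh
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION="v1.31.4+k3s1" \
  INSTALL_K3S_EXEC="server \
    --cluster-init \
    --flannel-backend=none \
    --disable=traefik \
    --disable=servicelb \
    --disable=local-storage \
    --disable-network-policy \
    --tls-san=<sovereign_fqdn> \
    --node-label catalyst.openova.io/role=control-plane \
    --write-kubeconfig-mode=0644" \
  sh -
```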

The INSTALL_K3S_VERSION environment variable is var.k3s_version (default v1.31.4+k3s1). Pinned so a Sovereign provisioned today and one provisioned next month land on the same Kubernetes minor — the Catalyst compatibility matrix in docs/PLATFORM-TECH-STACK.md §8.1 is keyed to k3s minor versions.


SSH key management — why no auto-generated keys

The module requires the operator to provide their own SSH public key via var.ssh_public_key. We never generate an ephemeral keypair. Rationale:

  1. Break-glass continuity. A Sovereign lives for years. An ephemeral key generated at provisioning time disappears the moment the catalyst-provisioner container restarts; at that point the only way back into the cluster is via Hetzner Console password-reset, which itself disrupts the in-cluster SPIRE identity if it forces a kubelet restart. Operator-owned keys (rooted in their corporate identity provider or hardware token) survive provisioner restarts.
  2. Audit trail. Hetzner logs every hcloud_ssh_key.create and every login that uses it. With operator-owned keys, that log directly traces back to a named human in the operator's IdP. With auto-generated keys, the log says "catalyst-provisioner did it" — useless for incident forensics.
  3. No private-key custody problem. Catalyst would have to store the auto-generated private key somewhere to give the operator break-glass. Either we put it in OpenBao (chicken-and-egg: OpenBao isn't running yet during Phase 0), or we ship it back to the wizard (we're now responsible for the key never leaking through the browser, the catalyst-provisioner logs, the OpenTofu state file, ...). Operator-owned keys move that custody problem to whoever's already responsible for it (the operator).
  4. Compliance. Most enterprise frameworks (SOC 2 CC6.1, ISO 27001 A.9.4.3) require keys to trace back to a named individual. Auto-generated, vendor-held keys fail this.

The validation regex on var.ssh_public_key accepts ssh-rsa, ssh-ed25519, and ecdsa-sha2-nistp256 formats. Recommend ssh-ed25519 from a YubiKey-resident key for production.


OS hardening (cloud-init)

Both cloudinit-control-plane.tftpl and cloudinit-worker.tftpl apply the same baseline. Each item is a template-conditional driven by a variable so an operator can disable it for a short-lived test Sovereign.

Item Variable (default) What happens
sshd drop-in always on /etc/ssh/sshd_config.d/99-catalyst-hardening.conf sets PasswordAuthentication no, KbdInteractiveAuthentication no, PermitRootLogin prohibit-password, disables forwarding, tightens MaxAuthTries=3 and LoginGraceTime=30. The ssh-rsa/ssh-ed25519 key Hetzner injects via ssh_keys[] is the only path in.
unattended-upgrades enable_unattended_upgrades=true Daily security-only upgrades on Ubuntu, restricted to the *-security pocket. Auto-reboot at 02:30 if a kernel upgrade requires it; the LB health check covers the ~60 s window. Removes unused kernels to keep /boot from filling.
fail2ban (sshd jail) enable_fail2ban=true Defence-in-depth in case ssh_allowed_cidrs is later widened. maxretry=5, findtime=10m, bantime=1h, systemd backend.
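The sshd drop-in, spelled out (settings from the row above; the specific forwarding directives are assumed from "disables forwarding"):

```
# /etc/ssh/sshd_config.d/99-catalyst-hardening.conf
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin prohibit-password
AllowTcpForwarding no
AllowAgentForwarding no
X11Forwarding no
MaxAuthTries 3
LoginGraceTime 30
```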

The hardening explicitly does not include AppArmor profile authoring, kernel-module blacklisting, or a CIS Level-2 sweep. Those are a Phase-2 task delivered by a Kyverno policy + a privileged DaemonSet (bp-cis-hardening), not Phase-0 cloud-init.


Variables — reference

See variables.tf for the authoritative source. Highlights:

Variable Default Validation
region (required) fsn1, nbg1, hel1, ash, hil
control_plane_size cx42 ^(cx[0-9]+|ccx[0-9]+|cax[0-9]+)$
worker_size cx32 ^(cx[0-9]+|ccx[0-9]+|cax[0-9]+)$
worker_count 0 0 ≤ n ≤ 50
ha_enabled false bool
k3s_version v1.31.4+k3s1 ^v\d+\.\d+\.\d+\+k3s\d+$
ssh_public_key (required) OpenSSH formats only
ssh_allowed_cidrs [] every entry must be a valid CIDR
enable_unattended_upgrades true bool
enable_fail2ban true bool
domain_mode pool pool or byo
gitops_repo_url public OpenOva monorepo string
gitops_branch main string

Every default is the common case for a solo Sovereign. The waterfall doctrine (docs/INVIOLABLE-PRINCIPLES.md §1) means the defaults must produce a working production-shape Sovereign, not a "demo it first" scaffold.


How to invoke this module standalone

Most operators reach this module through the Catalyst console wizard, which writes a tofu.auto.tfvars.json, runs tofu init && tofu apply, and ships the outputs back to the user. The wizard path is the supported one.

If you need to drive provisioning by CLI (air-gapped sites, debugging, or a CI pipeline you own), the module accepts a flat -var-file= invocation:

# 1. Clone the module
git clone https://github.com/openova-io/openova.git
cd openova/infra/hetzner

# 2. Write a tfvars file (NEVER commit this — it contains the hcloud_token).
#    File ownership 0600, on an encrypted disk.
cat > sovereign.tfvars.json <<EOF
{
  "sovereign_fqdn":     "omantel.omani.works",
  "sovereign_subdomain": "omantel",
  "org_name":           "Omantel",
  "org_email":          "ops@omantel.om",
  "hcloud_token":       "<rotate after run>",
  "hcloud_project_id":  "<your project id>",
  "region":             "fsn1",
  "control_plane_size": "cx42",
  "worker_count":       0,
  "ha_enabled":         false,
  "k3s_version":        "v1.31.4+k3s1",
  "ssh_public_key":     "ssh-ed25519 AAAA... operator@laptop",
  "ssh_allowed_cidrs":  ["203.0.113.7/32"],
  "domain_mode":        "byo",
  "gitops_repo_url":    "https://github.com/openova-io/openova",
  "gitops_branch":      "main"
}
EOF
chmod 0600 sovereign.tfvars.json

# 3. Init + plan + apply
tofu init
tofu plan  -var-file=sovereign.tfvars.json -out=plan.bin
tofu apply plan.bin

# 4. Read outputs
tofu output -json

Outputs:

Name Use
control_plane_ip First control-plane node's public IPv4.
load_balancer_ip Public IPv4 the customer points DNS A records at (when domain_mode=byo).
console_url https://console.<sovereign_fqdn> — usable once Flux finishes the bootstrap (~30 min).
gitops_repo_url Path Flux on the new cluster watches; useful for audit.

After tofu apply finishes, archive the OpenTofu state file and the tfvars file. Per docs/SOVEREIGN-PROVISIONING.md §4, the state is read-only from this point forward — Crossplane has adopted the cloud resources and any further change goes through it.


What this module does NOT do

Out of scope by design — these are Crossplane / Flux territory:

  • Cilium + Hubble installation (handled by bp-cilium reconciled by Flux).
  • cert-manager issuers (handled by bp-cert-manager + Phase-2 day-1 setup).
  • Keycloak realm provisioning (handled by bp-keycloak + Phase-2 day-1 setup).
  • Object-storage bucket creation for Velero backups (Crossplane provider-hcloud + an hcloud-storage-volume Composition).
  • DNS records beyond the Phase-0 wildcard (handled by External-DNS in the Sovereign once the bootstrap kit comes up).
  • Day-2 cluster ops (node addition/removal — Crossplane Composition).

If you find yourself adding any of these to main.tf, you're violating docs/INVIOLABLE-PRINCIPLES.md §3 — stop and route the work to Crossplane / Flux instead.


Files

File Role
main.tf Resources + locals (network, firewall, SSH key, servers, LB, DNS hook).
variables.tf Wizard inputs as variables, with validation blocks.
outputs.tf What the catalyst-api provisioner reads back after tofu apply.
versions.tf OpenTofu + provider version constraints.
cloudinit-control-plane.tftpl cloud-init for the first / HA control-plane nodes. Installs hardening, k3s, Flux, bootstrap pointer.
cloudinit-worker.tftpl cloud-init for worker_count nodes. Installs hardening + joins the cluster.

Part of the public OpenOva Catalyst monorepo. See docs/SOVEREIGN-PROVISIONING.md for the end-to-end provisioning narrative and docs/PLATFORM-TECH-STACK.md for the resource budget that drives the sizing defaults.