Commit Graph

22 Commits

e3mrah
7bfd6df588
fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
Five stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a
fresh post-handover Sovereign — surfaced live on otech103, 2026-05-05 — plus
a sixth gap (ghcr-pull reflector for catalyst-system). All six are fixed in
one PR, so a single chart bump + cloud-init re-render closes the gap
end-to-end.

Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL=
https://pool.openova.io. The in-cluster Service default only resolves on
contabo; on Sovereigns every Day-2 POST died with NXDOMAIN.

Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env from a
new pdm-basicauth Secret, and have pdmFlipNS SetBasicAuth from those envs.
The PDM public ingress at pool.openova.io is gated by Traefik basicAuth;
calls without Authorization: Basic returned 401. optional=true so contabo
+ CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable
Principle #10, the credentials only ever live in Pod env + are read once
per call by pdmFlipNS — never enter a logged struct or persisted record.

Bug 3 (code, parent_domains.go): pdmFlipNS body now includes the required
nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema
requires it; the previous body got 422 missing-nameservers.
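
For illustration, a minimal Go sketch of the corrected body shape — only
pdmFlipNS, SetNSRequest and expectedNSFor are named above; the field set and
the helper's output here are assumptions:

  package main

  import (
      "encoding/json"
      "fmt"
  )

  // SetNSRequest sketch: the nameservers field is required by the PDM
  // schema, so a body without it comes back 422.
  type SetNSRequest struct {
      Domain      string   `json:"domain"`
      Nameservers []string `json:"nameservers"`
  }

  // expectedNSFor returns the NS set the Sovereign should serve for a
  // parent domain (illustrative values).
  func expectedNSFor(domain string) []string {
      return []string{"ns1." + domain, "ns2." + domain}
  }

  func main() {
      body, _ := json.Marshal(SetNSRequest{
          Domain:      "omani.trade",
          Nameservers: expectedNSFor("omani.trade"),
      })
      fmt.Println(string(body))
  }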

Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to
SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover
Sovereign no Deployment record is persisted, so without this fallback GET
/parent-domains returned {"items":[]} and the propagation panel showed
expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml
from the sovereign-fqdn ConfigMap.
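
A rough sketch of that fallback order (env var names per the text; the
helper name and the Deployment-record lookup it sits behind are elided):

  package main

  import (
      "fmt"
      "os"
  )

  // primaryDomainFromEnv: CATALYST_PRIMARY_DOMAIN wins, SOVEREIGN_FQDN is
  // the post-handover fallback when no Deployment record is persisted.
  func primaryDomainFromEnv() string {
      if v := os.Getenv("CATALYST_PRIMARY_DOMAIN"); v != "" {
          return v
      }
      return os.Getenv("SOVEREIGN_FQDN")
  }

  func main() {
      fmt.Println(primaryDomainFromEnv())
  }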

Bug 5 (chart, httproute.yaml): catalyst-ui /auth/* PathPrefix narrowed to
Exact /auth/handover. The previous PathPrefix collided with the OIDC PKCE
redirect_uri /auth/callback — catalyst-api 404s on that path because it
only registers /api/v1/auth/callback, which broke login once the
handover-JWT cookie expired. The Exact match keeps /auth/handover routed
to catalyst-api while every other /auth/* path falls through to
catalyst-ui's React Router for client-side OIDC.

Bug 6 (cloud-init): the ghcr-pull + harbor-robot-token + new pdm-basicauth
Reflector annotations now enumerate explicit allowed/auto-namespaces (sme,
catalyst, catalyst-system, gitea, harbor) instead of an empty string. The
ambiguous empty-string interpretation forced a manual catalyst-system
mirror creation on otech103; the explicit list back-ports the verified
working state.

Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields
+ tfvars emission so the contabo catalyst-api can stamp the credentials
onto every Sovereign provision request. variables.tf adds matching
pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default
empty) so older provisioner builds that pre-date this change keep
rendering valid cloud-init (the Secret renders with empty values and
Pod start is unaffected).
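
A hedged sketch of that flow — the Request fields and tofu var names follow
the text, the tfvars emission shape is assumed:

  package main

  import (
      "encoding/json"
      "fmt"
  )

  // Request (trimmed): the contabo catalyst-api stamps the PDM basic-auth
  // credentials onto every Sovereign provision request.
  type Request struct {
      SovereignFQDN    string
      PDMBasicAuthUser string
      PDMBasicAuthPass string
  }

  // tfvars emission: empty values stay valid so older provisioner builds
  // keep rendering cloud-init (the Secret just renders empty).
  func tfvars(r Request) ([]byte, error) {
      return json.MarshalIndent(map[string]string{
          "sovereign_fqdn":      r.SovereignFQDN,
          "pdm_basic_auth_user": r.PDMBasicAuthUser,
          "pdm_basic_auth_pass": r.PDMBasicAuthPass,
      }, "", "  ")
  }

  func main() {
      out, _ := tfvars(Request{SovereignFQDN: "otech.omani.works"})
      fmt.Println(string(out))
  }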

Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes
the architectural blockers tracked in #879; the catalyst-api image
rebuild + chart republish run via the existing CI pipelines (services-
build.yaml + blueprint-release.yaml) on this commit's SHA.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 09:02:39 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).
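
For illustration, a minimal sketch of the idempotent create behaviour (the
endpoint path is the standard PowerDNS API; the client shape and zone-kind
value are assumptions, not the actual internal/powerdns code):

  package main

  import (
      "bytes"
      "context"
      "encoding/json"
      "fmt"
      "net/http"
  )

  type Client struct {
      baseURL, apiKey, serverID string
      httpc                     *http.Client
  }

  // CreateZone POSTs the zone and treats 409/412 (zone already present) as
  // success, so hook-Job re-runs and chart bumps never fail.
  func (c *Client) CreateZone(ctx context.Context, name string, ns []string) error {
      body, _ := json.Marshal(map[string]any{
          "name": name + ".", "kind": "Native", "nameservers": ns,
      })
      url := fmt.Sprintf("%s/api/v1/servers/%s/zones", c.baseURL, c.serverID)
      req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
      if err != nil {
          return err
      }
      req.Header.Set("X-API-Key", c.apiKey)
      req.Header.Set("Content-Type", "application/json")
      resp, err := c.httpc.Do(req)
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      switch resp.StatusCode {
      case http.StatusCreated, http.StatusConflict, http.StatusPreconditionFailed:
          return nil // created, or already exists — idempotent
      default:
          return fmt.Errorf("powerdns: create zone %s: HTTP %d", name, resp.StatusCode)
      }
  }

  func main() {
      c := &Client{baseURL: "http://powerdns-api:8081", apiKey: "redacted",
          serverID: "localhost", httpc: http.DefaultClient}
      _ = c.CreateZone(context.Background(), "omani.trade", []string{"ns1.omani.works."})
  }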

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
7ec25b9736
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single
CPX52 control-plane and zero workers — completely discarding horizontal
scalability. Restore the originally agreed shape: 1 CPX32 control plane
+ 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total — same
aggregate footprint as a CPX52 vertical-scale, but with multi-node fault
tolerance and the architectural shape clusters/_template/ was designed
for).

Changes:
- infra/hetzner/variables.tf — defaults: control_plane_size cx42→cpx32,
  worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf — add hcloud_load_balancer_target.workers so the
  Hetzner LB targets every node (CP + workers); Cilium Gateway DaemonSet
  on every node serves ingress on its NodePort, so any node can absorb
  traffic for genuine horizontal scale.
- infra/hetzner/README.md — sizing rationale rewritten around horizontal
  scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model — INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider — first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes — `recommended: true` flag moves cpx52→cpx32; CPX52
  description updated to "solo dev when worker_count=0".

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit
  workerCount: 0 keep working — only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path — unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:57:53 +04:00
e3mrah
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)
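
A rough sketch of that record-set shape — only the "marketplace" prefix is
stated above; the other prefixes and the return type are illustrative:

  package main

  import "fmt"

  // canonicalRecordSet sketch: the explicit marketplace entry makes
  // marketplace.<sov> resolve at zone-commit time without leaning on the
  // *.<sov> wildcard alone.
  func canonicalRecordSet(sovereignFQDN, lbIP string) map[string]string {
      records := map[string]string{}
      for _, prefix := range []string{"console", "*", "marketplace"} {
          records[prefix+"."+sovereignFQDN] = lbIP
      }
      return records
  }

  func main() {
      for name, ip := range canonicalRecordSet("otech.omani.works", "203.0.113.10") {
          fmt.Printf("A %s -> %s\n", name, ip)
      }
  }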

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:47:37 +04:00
e3mrah
d0b574bd68
fix(hetzner-tofu): add powerdns_api_key to templatefile() vars (#687)
PR #686 added var.powerdns_api_key to variables.tf and referenced it as
${powerdns_api_key} in cloudinit-control-plane.tftpl, but missed wiring
it into the templatefile() vars dict in main.tf. Result on otech48:

  Invalid value for "vars" parameter: vars map does not contain key
  "powerdns_api_key", referenced at ./cloudinit-control-plane.tftpl:273

This commit closes the gap: powerdns_api_key now flows from var ->
templatefile vars -> cloud-init -> Secret manifest.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:34:36 +04:00
e3mrah
369c229408
fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget (#685)
cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Cilium 1.19.3 → 1.16.5 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
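
To illustrate just the kernel gate (nothing Cilium-specific — a plain Go
sketch run as an unprivileged process with the default sysctl):

  package main

  import (
      "fmt"
      "net"
  )

  // Ports below net.ipv4.ip_unprivileged_port_start (default 1024) need
  // CAP_NET_BIND_SERVICE; 30080/30443 do not.
  func tryBind(addr string) {
      ln, err := net.Listen("tcp", addr)
      if err != nil {
          fmt.Printf("%s: %v\n", addr, err) // expect "permission denied" on :80
          return
      }
      fmt.Printf("%s: bound\n", addr)
      ln.Close()
  }

  func main() {
      tryBind("0.0.0.0:80")
      tryBind("0.0.0.0:30080")
  }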

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:14:32 +04:00
e3mrah
dd4148acb6
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on
otech45: TCP to LB:443 succeeds, but TLS handshake never completes.
Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match
what cilium-envoy actually listens on (verified via /proc/net/tcp on
the cilium-envoy pod — port 12869 not in listening sockets). The
nodePort indirection (31443→envoy:12869) is broken at the redirect
step.

Fix: bind cilium-envoy directly to the host's :80 and :443 via
gatewayAPI.hostNetwork.enabled=true. Hetzner LB forwards public
80→private:80 and 443→private:443 directly (no nodePort indirection).

Two coordinated changes:
  1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
  2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)

bp-cilium chart bumped to 1.1.5.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 15:22:51 +04:00
e3mrah
0ee309aa8b
fix(infra+api): wire handover_jwt_public_key end-to-end through tofu provisioning (#636)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:28:44 +04:00
e3mrah
96a5e3a20e
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:14:23 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format
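
A minimal sketch of the corrected response shape behind fixes 2 and 5 —
ResponseCode/Status sit directly under SetDnsResponse; the exact Go field
list in the real client may differ:

  package main

  import (
      "encoding/json"
      "fmt"
  )

  type setDNSResponse struct {
      SetDnsResponse struct {
          ResponseCode json.Number `json:"ResponseCode"` // integer 0, not "0"
          Status       string      `json:"Status"`
      } `json:"SetDnsResponse"`
  }

  func main() {
      raw := []byte(`{"SetDnsResponse":{"ResponseCode":0,"Status":"success"}}`)
      var r setDNSResponse
      if err := json.Unmarshal(raw, &r); err != nil {
          panic(err)
      }
      fmt.Println(r.SetDnsResponse.ResponseCode, r.SetDnsResponse.Status)
  }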

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is
exposed directly on the CP node via firewall rule (main.tf:51-56,
0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to
the LB's public IPv4, which silently failed with "connect: connection
refused" — catalyst-api helmwatch could never observe HelmReleases on
the new Sovereign, so the wizard jobs page stayed PENDING for every
install-* job for 50+ minutes after the cluster was actually healthy.

Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address)
through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to
that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly
on the CP, so this is reachable from contabo without any LB / firewall
changes.

Permanent: every otechN provisioning from this commit forward will PUT
back a kubeconfig that catalyst-api can actually connect to.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 11:55:48 +04:00
e3mrah
7e35040e29
fix(infra): cloud-init strip regex must preserve #cloud-config (Phase-8a bug #5 follow-up) (#482)
#477 introduced a regex "/(?m)^[ ]{0,2}#[^!].*\n/" to strip YAML-block
comments and fit Hetzner's 32KiB user_data cap. The [^!] guard preserved
shebangs like #!/bin/bash but DID NOT preserve cloud-init directives
like #cloud-config, #include, #cloud-boothook (none have ! after #).

Result: cloud-init received user_data with the #cloud-config first-line
DIRECTIVE stripped, didn't recognise the YAML body, and emitted:
  recoverable_errors:
  WARNING: Unhandled non-multipart (text/x-not-multipart) userdata

→ k3s never installed
→ Flux never bootstrapped
→ kubeconfig never PUT to catalyst-api
→ every Phase-8a provision since #477 has silently failed at boot

Live evidence: deployment a76e3fec8566add9 SSH'd 2026-05-01 18:30 UTC,
cloud-init status 'degraded done', /etc/systemd/system/k3s.service
absent, no flux binary.

Fix: require a SPACE after the '#' in the strip regex. YAML comments
ARE typically '# foo bar' (with space). cloud-init directives are
'#cloud-config' / '#include' / '#cloud-boothook' (no space) — the new
regex preserves them.
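
For concreteness, both patterns in RE2 (the same engine OpenTofu's replace()
uses); the fixed pattern below is an assumption — the commit only states
"space after '#'":

  package main

  import (
      "fmt"
      "regexp"
  )

  func main() {
      userData := "#cloud-config\n# operator note\n  # indented note\nwrite_files:\n"

      // #477 pattern: strips any 0-2-space-indented '#' line except shebangs —
      // which also eats the #cloud-config directive.
      old := regexp.MustCompile(`(?m)^[ ]{0,2}#[^!].*\n`)
      // #482 fix: require a space after '#', so '# comment' lines go but
      // '#cloud-config' / '#include' / '#cloud-boothook' survive.
      fixed := regexp.MustCompile(`(?m)^[ ]{0,2}# .*\n`)

      fmt.Print("old:\n", old.ReplaceAllString(userData, ""))
      fmt.Print("fixed:\n", fixed.ReplaceAllString(userData, ""))
  }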

Out of scope: validating that ALL existing comments in the tftpl had
a space after #. They do — verified by sed pre-render passing the
sanity test (file shrinks 38KB → 13KB AND first line is #cloud-config).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:30:51 +04:00
e3mrah
e35729ad78
fix(infra): strip YAML-block comments from cloud-init to fit Hetzner 32KiB cap (Phase-8a bug #5) (#477)
Phase-8a-preflight deployment 3c158f712d564d84 failed at tofu apply with:

  Error: invalid input in field 'user_data'
    [user_data => [Length must be between 0 and 32768.]]
    on main.tf line 214, in resource "hcloud_server" "control_plane"

The rendered cloudinit-control-plane.tftpl is 38,085 bytes — 5,317
bytes over the Hetzner cap. The source template ships ~16 KB of
indent-0 and indent-2 documentation comments (YAML-level) that are
operationally inert at cloud-init boot.

Fix: wrap templatefile() in replace() with a RE2 regex that strips
lines whose first 0-2 chars are spaces followed by '#' (preserves
shebangs via [^!]). After strip, rendered cloud-init drops to ~13 KB.

Indent-4+ comments live INSIDE heredoc `content: |` blocks
(embedded shell scripts, kubeconfig fragments). Those are preserved.

Same fix applied to worker_cloud_init for parity.

Refs:
- Live evidence: deployment 3c158f712d564d84, tofu apply error 16:38:26 UTC
- Bug #5 in the Phase-8a-preflight tally
- #471: prior tftpl escape fix ($${SOVEREIGN_FQDN})
- #472: catalyst-build watches infra/hetzner/**

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:43:42 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture + bucket
provisioning Phase 0b path described in the omantel handover WBS §5.
Hybrid Option A+B: wizard collects operator-issued S3 credentials (Hetzner
exposes no Cloud API to mint them — they're issued once in the Hetzner
Console and the secret half is shown exactly once), and OpenTofu
auto-provisions the per-Sovereign bucket via the aminueza/minio provider
+ writes a flux-system/hetzner-object-storage Secret into the new
Sovereign at cloud-init time so Harbor (#383) and Velero (#384) find
their backing-store credentials already in the cluster from Phase 1
onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries / scripts
/ operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
  - versions.tf: declare aminueza/minio provider (Hetzner's official
    recommendation for S3-compatible bucket creation per
    docs.hetzner.com/storage/object-storage/getting-started/...)
  - variables.tf: 4 sensitive vars — region (validated against
    fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
    access_key, secret_key, bucket_name (RFC-compliant S3 naming)
  - main.tf: minio_s3_bucket.main resource — idempotent on re-apply,
    no force_destroy (Velero archive must survive a control-plane
    reinstall), object_locking=false (content-addressed digests are
    the immutability guarantee for Harbor; Velero uses S3 versioning)
  - cloudinit-control-plane.tftpl: write
    flux-system/hetzner-object-storage Secret with the canonical
    s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
    Harbor + Velero charts consume via existingSecret refs
  - outputs.tf: surface endpoint/region/bucket back to catalyst-api
    for the deployment record (credentials NEVER returned)

products/catalyst/bootstrap/api/ (Go):
  - internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
    ListBuckets validator. Distinguishes auth failure ("rejected") from
    network failure ("unreachable") so the wizard renders the right
    error card. NOT a parallel cloud-resource path — the existing
    purge.go handles hcloud purge; objectstorage.go handles a separate
    API surface (S3-compatible) that has no equivalent client today.
  - internal/handler/credentials.go: extend with
    ValidateObjectStorageCredentials handler — same wire shape
    (200 valid:true / 200 valid:false / 503 unreachable / 400 bad
    input) as the existing token validator so the wizard's failure-
    card machinery handles both without per-endpoint switches.
  - cmd/api/main.go: wire POST
    /api/v1/credentials/object-storage/validate
  - internal/provisioner/provisioner.go: extend Request with
    ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate()
    rejects empty/malformed values fail-fast at /api/v1/deployments
    POST time; writeTfvars() emits the 4 new tfvars.
  - internal/handler/deployments.go: derive bucket name from FQDN slug
    pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>) so
    Hetzner's globally-namespaced bucket pool gets a deterministic,
    collision-resistant per-Sovereign name without operator input.
  - internal/store/store.go: redact access/secret keys; preserve
    region+bucket plain (they're public in tofu outputs anyway).
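
A hedged sketch of the rejected-vs-unreachable split (the minio-go/v7 calls
are real; the endpoint value and error-code handling are illustrative):

  package main

  import (
      "context"
      "fmt"

      "github.com/minio/minio-go/v7"
      "github.com/minio/minio-go/v7/pkg/credentials"
  )

  // An S3 error response means the endpoint answered and refused the keys
  // ("rejected"); any other failure is treated as "unreachable".
  func validateObjectStorage(ctx context.Context, endpoint, accessKey, secretKey string) (bool, string, error) {
      client, err := minio.New(endpoint, &minio.Options{
          Creds:  credentials.NewStaticV4(accessKey, secretKey, ""),
          Secure: true,
      })
      if err != nil {
          return false, "bad input", err
      }
      if _, err := client.ListBuckets(ctx); err != nil {
          if minio.ToErrorResponse(err).Code != "" {
              return false, "rejected", nil
          }
          return false, "unreachable", err
      }
      return true, "", nil
  }

  func main() {
      ok, reason, err := validateObjectStorage(context.Background(),
          "fsn1.object-storage.example", "ACCESS", "SECRET")
      fmt.Println(ok, reason, err)
  }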

products/catalyst/bootstrap/ui/ (TypeScript / React):
  - entities/deployment/model.ts + store.ts: 4 new wizard fields
    (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
    coercion for legacy persisted state.
  - pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
    region picker (fsn1/nbg1/hel1), masked secret-key input,
    Validate button gating Next. Same FailureCard taxonomy
    (rejected/too-short/unreachable/network/parse/http) the existing
    TokenSection uses, so the operator UX is consistent. Section
    only renders when Hetzner is among chosen providers — non-Hetzner
    Sovereigns skip Phase 0b until their own backing-store path lands.
  - pages/wizard/steps/StepReview.tsx: include
    objectStorageRegion/AccessKey/SecretKey in the
    POST /v1/deployments payload (bucket derived server-side).

Tests:
  - api: 7 new provisioner Validate tests (region/keys/bucket
    required + RFC-compliant + valid-region acceptance), 5 handler
    tests for the new endpoint (bad JSON / missing region / invalid
    region / short keys), 4 hetzner/objectstorage_test.go tests
    (endpoint composition + early input rejection), 1 handler test
    for the bucket-name derivation. Existing tests updated to supply
    the new required fields.
  - ui: StepCredentials.test.tsx pre-populates objectStorageValidated
    in beforeEach so the existing 11 SSH-section tests aren't gated
    on Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references.

Closes #371.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST capture
them; OpenTofu auto-provisions the bucket + cloud-init writes the
flux-system/hetzner-object-storage Secret with the canonical s3-* keys
Harbor + Velero consume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:54:22 +04:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init POSTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd.

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT.

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 {"error":
   "kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.
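
A compact sketch of steps 1 and 4 — newBearerToken / hashBearerToken match
the test names below; signatures and the compare helper are assumptions:

  package main

  import (
      "crypto/rand"
      "crypto/sha256"
      "crypto/subtle"
      "encoding/hex"
      "fmt"
  )

  // Mint 32 random bytes, hex-encode; persist only the SHA-256 of the result.
  func newBearerToken() (string, error) {
      b := make([]byte, 32)
      if _, err := rand.Read(b); err != nil {
          return "", err
      }
      return hex.EncodeToString(b), nil
  }

  func hashBearerToken(token string) string {
      sum := sha256.Sum256([]byte(token))
      return hex.EncodeToString(sum[:])
  }

  // On PUT: hash the inbound bearer and compare in constant time against
  // the persisted hash.
  func bearerMatches(inbound, persistedHash string) bool {
      return subtle.ConstantTimeCompare(
          []byte(hashBearerToken(inbound)), []byte(persistedHash)) == 1
  }

  func main() {
      tok, _ := newBearerToken()
      h := hashBearerToken(tok) // only this hash lands on the Deployment record
      fmt.Println(bearerMatches(tok, h), bearerMatches("wrong", h))
  }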

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.
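
A rough sketch of the dockerconfigjson payload that Secret carries — the
registry key and auth construction are the standard format; the username
placeholder is hypothetical:

  package main

  import (
      "encoding/base64"
      "encoding/json"
      "fmt"
  )

  // auth is base64(username:token), keyed by the registry host.
  func dockerConfigJSON(username, token string) ([]byte, error) {
      auth := base64.StdEncoding.EncodeToString([]byte(username + ":" + token))
      return json.Marshal(map[string]any{
          "auths": map[string]any{
              "ghcr.io": map[string]string{
                  "username": username,
                  "password": token,
                  "auth":     auth,
              },
          },
      })
  }

  func main() {
      out, _ := dockerConfigJSON("<GHCR_PULL_USERNAME>", "<GHCR_PULL_TOKEN>")
      fmt.Println(string(out))
  }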

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
330211d275 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168 PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it dual-wrote DNS records
(when it worked) and broke the entire provision flow (when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:57 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open port (80, 443, 6443, ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30.
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called Hetzner Cloud API directly + exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
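
For illustration, the invoker shape in a few lines (stderr handling and the
SSE plumbing elided; the workdir path is hypothetical):

  package main

  import (
      "bufio"
      "context"
      "fmt"
      "os/exec"
  )

  // Run one tofu subcommand in the per-deployment workdir and hand each
  // stdout line to a callback (the real provisioner forwards them as SSE).
  func runTofu(ctx context.Context, workdir string, onLine func(string), args ...string) error {
      cmd := exec.CommandContext(ctx, "tofu", args...)
      cmd.Dir = workdir
      stdout, err := cmd.StdoutPipe()
      if err != nil {
          return err
      }
      if err := cmd.Start(); err != nil {
          return err
      }
      sc := bufio.NewScanner(stdout)
      for sc.Scan() {
          onLine(sc.Text())
      }
      return cmd.Wait()
  }

  func main() {
      dir := "/var/lib/catalyst/deployments/example"
      for _, step := range [][]string{{"init"}, {"plan", "-out=tfplan"}, {"apply", "tfplan"}} {
          if err := runTofu(context.Background(), dir, func(l string) { fmt.Println(l) }, step...); err != nil {
              fmt.Println("tofu", step[0], "failed:", err)
              return
          }
      }
  }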

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00