Commit Graph

36 Commits

Author SHA1 Message Date
e3mrah
96a5e3a20e
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23's first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
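
A minimal Go sketch of the same lookup (the commit only shows the curl form; the timeout and error handling here are illustrative):

  package main

  import (
      "fmt"
      "io"
      "net/http"
      "strings"
      "time"
  )

  // fetchPublicIPv4 asks the Hetzner metadata service for the node's own
  // public IPv4 — the same endpoint the cloud-init curl hits at boot.
  func fetchPublicIPv4() (string, error) {
      client := &http.Client{Timeout: 5 * time.Second}
      resp, err := client.Get("http://169.254.169.254/hetzner/v1/metadata/public-ipv4")
      if err != nil {
          return "", err
      }
      defer resp.Body.Close()
      body, err := io.ReadAll(resp.Body)
      if err != nil {
          return "", err
      }
      return strings.TrimSpace(string(body)), nil
  }

  func main() {
      ip, err := fetchPublicIPv4()
      if err != nil {
          fmt.Println("metadata lookup failed:", err)
          return
      }
      fmt.Println("control-plane public IPv4:", ip)
  }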

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:14:23 +04:00
e3mrah
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
The PR #611 squash accidentally reverted the Phase-8b infra additions from
PR #615 (92fdda42). Restores:
- cloudinit-control-plane.tftpl: write_files entry for /var/lib/catalyst/handover-jwt-public.jwk (mode 0600)
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Without these, new Sovereign provisioning runs will not write the public key
to disk and auth/handover on the Sovereign will return 503 (key unavailable).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:21:16 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow
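
A minimal, stdlib-only sketch of RS256 minting as used for the handover token (the real handoverjwt package, key loading, and claim set are not shown; the claim names below are illustrative):

  package main

  import (
      "crypto"
      "crypto/rand"
      "crypto/rsa"
      "crypto/sha256"
      "encoding/base64"
      "encoding/json"
      "fmt"
      "time"
  )

  // mintRS256 signs header.payload with RSASSA-PKCS1-v1_5 over SHA-256,
  // which is what the RS256 JOSE algorithm specifies.
  func mintRS256(key *rsa.PrivateKey, claims map[string]any) (string, error) {
      enc := base64.RawURLEncoding
      header := enc.EncodeToString([]byte(`{"alg":"RS256","typ":"JWT"}`))
      payload, err := json.Marshal(claims)
      if err != nil {
          return "", err
      }
      signingInput := header + "." + enc.EncodeToString(payload)
      digest := sha256.Sum256([]byte(signingInput))
      sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
      if err != nil {
          return "", err
      }
      return signingInput + "." + enc.EncodeToString(sig), nil
  }

  func main() {
      key, _ := rsa.GenerateKey(rand.Reader, 2048)
      token, err := mintRS256(key, map[string]any{
          "sub": "operator", // illustrative claim
          "exp": time.Now().Add(10 * time.Minute).Unix(),
      })
      if err != nil {
          panic(err)
      }
      fmt.Println(token)
  }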

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards have a pre-existing failure on main (unrelated to this PR). Resolves #605.
2026-05-02 19:07:27 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format
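
A sketch of the corrected response shape (only SetDnsResponse, ResponseCode, and Status come from this fix; the wrapper type name and sample values are assumptions, and json.Number tolerates the bare integer the live API returns):

  package main

  import (
      "encoding/json"
      "fmt"
  )

  // SetDNSReply models the real api3.json shape: ResponseCode and Status live
  // directly under SetDnsResponse, not under a ResponseHeader wrapper, and
  // ResponseCode arrives as a bare integer (hence json.Number).
  type SetDNSReply struct {
      SetDnsResponse struct {
          ResponseCode json.Number `json:"ResponseCode"`
          Status       string      `json:"Status"`
      } `json:"SetDnsResponse"`
  }

  func main() {
      raw := []byte(`{"SetDnsResponse":{"ResponseCode":0,"Status":"success"}}`)
      var reply SetDNSReply
      if err := json.Unmarshal(raw, &reply); err != nil {
          panic(err)
      }
      fmt.Println(reply.SetDnsResponse.ResponseCode, reply.SetDnsResponse.Status)
  }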

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   Gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a,
  dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288
  via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml):
  single replica, 32Mi memory, ServiceMonitor off by default.

Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector
  annotations to the ghcr-pull Secret written at cloud-init time so
  Reflector auto-mirrors it to every namespace on first boot (annotation
  keys sketched after this list).
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml,
  marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml
  (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit
  HelmRelease version reference to match.
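
A sketch of the four source-side annotations (the commit does not list the keys; these are Reflector's standard mirroring annotations, and treating the empty namespace lists as "all namespaces" is an assumption):

  package main

  import "fmt"

  // Source-side annotations that let Reflector mirror a Secret into other
  // namespaces; the empty allow-lists are assumed to mean "all namespaces".
  var reflectorAnnotations = map[string]string{
      "reflector.v1.k8s.emberstack.com/reflection-allowed":            "true",
      "reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces": "",
      "reflector.v1.k8s.emberstack.com/reflection-auto-enabled":       "true",
      "reflector.v1.k8s.emberstack.com/reflection-auto-namespaces":    "",
  }

  func main() {
      for k, v := range reflectorAnnotations {
          fmt.Printf("%s: %q\n", k, v)
      }
  }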

Root cause: the canonical secret name is ghcr-pull (written by
cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were
referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff
on all Catalyst pods on every new Sovereign.

Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret
propagated to 33 namespaces via kubectl; non-Running pods bounced.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:17:51 +04:00
e3mrah
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is
exposed directly on the CP node via firewall rule (main.tf:51-56,
0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to
the LB's public IPv4, which silently failed with "connect: connection
refused" — catalyst-api helmwatch could never observe HelmReleases on
the new Sovereign, so the wizard jobs page stayed PENDING for every
install-* job for 50+ minutes after the cluster was actually healthy.

Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address)
through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to
that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly
on the CP, so this is reachable from contabo without any LB / firewall
changes.
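
The rewrite itself is a sed in cloud-init; a Go equivalent of the substitution (the sample kubeconfig fragment and IP are illustrative):

  package main

  import (
      "fmt"
      "strings"
  )

  // k3s.yaml always points kubectl at https://127.0.0.1:6443 on the node
  // itself; the provision flow needs it pointed at the CP's public IPv4.
  func rewriteServer(kubeconfig, controlPlaneIPv4 string) string {
      return strings.ReplaceAll(kubeconfig,
          "https://127.0.0.1:6443",
          "https://"+controlPlaneIPv4+":6443")
  }

  func main() {
      k3sYAML := "clusters:\n- cluster:\n    server: https://127.0.0.1:6443\n"
      fmt.Print(rewriteServer(k3sYAML, "203.0.113.10")) // documentation-range example IP
  }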

Permanent: every otechN provisioning from this commit forward will PUT
back a kubeconfig that catalyst-api can actually connect to.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 11:55:48 +04:00
e3mrah
66ff717fbc
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01):
when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop
from issue #491), kustomize-controller held the revision lock for the
full 30m health-check timeout and refused to pick up new GitRepository
revisions. Even though Flux fetched fix `66ea39f0` from main within 1
minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD
SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait
would never finish, no new revision would ever apply, and the operator
was forced to wipe + reprovision from scratch. The same pathology
would repeat on every iteration unless the timeout shape changed.

Approach: Option A (timeout reduction). Drops `spec.timeout` on all
three Flux Kustomizations in the cloud-init template — bootstrap-kit,
sovereign-tls, infrastructure-config — from 30m to 5m. We KEEP
`wait: true` so downstream `dependsOn: bootstrap-kit` declarations
still get a consolidated "every HR Ready=True" signal. We do NOT
adjust `interval` (5m is correct).

Why 5m specifically: matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case so a fresh
fix on main gets applied on the next poll. Anything shorter risks
tripping legitimately-slow CRD installs; anything longer re-introduces
the iteration-stall pathology #492 documents.

Why not Option B (wait: false): would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready
before it applies Provider/ProviderConfig manifests that talk to
Hetzner. Flipping wait: false would let infra-config apply prematurely.

Why not Option C (tighter retryInterval): doesn't address the root
cause. retryInterval governs how often to retry AFTER a failure;
spec.timeout is what holds the revision lock during a failed wait.

Test: kustomization_timeout_test.go (new) locks all three timeouts at
exactly 5m AND blocks any operative `timeout: 30m` regression AND
asserts wait: true is retained. Three assertions, one for each failure
mode (regression to 30m, accidental 4th Kustomization without test
update, drive-by flip to wait: false).
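
A sketch of that render-assertion shape (the real kustomization_timeout_test.go is not reproduced here; the template path and exact assertions are illustrative):

  package provisioner

  import (
      "os"
      "regexp"
      "strings"
      "testing"
  )

  func TestKustomizationTimeouts(t *testing.T) {
      raw, err := os.ReadFile("../../../../../infra/hetzner/cloudinit-control-plane.tftpl")
      if err != nil {
          t.Fatal(err)
      }
      tpl := string(raw)

      // Exactly three Kustomizations, each pinned to a 5m timeout.
      if got := strings.Count(tpl, "timeout: 5m"); got != 3 {
          t.Fatalf("expected 3 Kustomization timeouts of 5m, found %d", got)
      }
      // No operative 30m timeout may creep back in.
      if regexp.MustCompile(`(?m)^\s*timeout: 30m`).MatchString(tpl) {
          t.Fatal("operative 'timeout: 30m' regression detected")
      }
      // wait: true must be retained so dependsOn consumers still gate on readiness.
      if !strings.Contains(tpl, "wait: true") {
          t.Fatal("wait: true was dropped from a Kustomization")
      }
  }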

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:34:35 +04:00
e3mrah
141dc9dfba
fix(infra): cloud-init helm install cilium values parity with Flux bp-cilium HR (closes #491) (#496)
Phase-8a bug #16: every fresh Hetzner Sovereign deadlocked at Phase 1
because the bootstrap helm install in cloud-init used a MINIMAL set of
--set flags (kubeProxyReplacement, k8sService*, tunnelProtocol,
bpf.masquerade) while the Flux bp-cilium HelmRelease curated a much
fuller value set. The drift was fatal:

  1. cilium-agent waits forever for the operator to register
     ciliumenvoyconfigs + ciliumclusterwideenvoyconfigs CRDs.
  2. The upstream chart only registers them when envoyConfig.enabled=true.
  3. With the bootstrap install missing that flag, the agent crash-looped,
     the node taint node.cilium.io/agent-not-ready never lifted, and the
     bootstrap-kit Kustomization (wait: true, 30 min timeout — issue #492)
     never reconciled the upgrade that would have fixed the values.

The fix is single-source-of-truth via a new write_files entry that lays
down /var/lib/catalyst/cilium-values.yaml at cloud-init time, plus a -f
flag on the bootstrap helm install that consumes it. The values mirror
platform/cilium/chart/values.yaml's `cilium:` block PLUS the overlay
in clusters/_template/bootstrap-kit/01-cilium.yaml (envoyConfig.enabled,
l7Proxy). A new parity test (cilium_values_parity_test.go) locks the
two files together so a future commit cannot change one without the
other.
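
A sketch of the parity check's shape (the real cilium_values_parity_test.go extracts the bootstrap values out of the tftpl; the paths and extraction below are simplified):

  package provisioner

  import (
      "os"
      "testing"

      "gopkg.in/yaml.v3"
  )

  // TestCiliumValuesParity asserts every operator-curated key in the umbrella
  // chart's `cilium:` block also appears in the cloud-init bootstrap values,
  // so the two installs cannot drift apart silently.
  func TestCiliumValuesParity(t *testing.T) {
      chart := loadYAML(t, "../../../../../platform/cilium/chart/values.yaml")
      boot := loadYAML(t, "testdata/cilium-values.yaml") // extracted from the tftpl in the real test

      chartCilium, ok := chart["cilium"].(map[string]any)
      if !ok {
          t.Fatal("chart values.yaml has no cilium: block")
      }
      for key := range chartCilium {
          if _, present := boot[key]; !present {
              t.Errorf("bootstrap cilium values missing key %q", key)
          }
      }
  }

  func loadYAML(t *testing.T, path string) map[string]any {
      t.Helper()
      raw, err := os.ReadFile(path)
      if err != nil {
          t.Fatal(err)
      }
      out := map[string]any{}
      if err := yaml.Unmarshal(raw, &out); err != nil {
          t.Fatal(err)
      }
      return out
  }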

Approach: hybrid — keep the chart values.yaml as the umbrella source
of truth, render the merged effective values inline in cloud-init's
write_files block (the umbrella's `cilium:` subchart wrapper is
unwrapped because the bootstrap install targets cilium/cilium upstream
chart directly, not the bp-cilium umbrella). Test enforces presence
of every operator-curated key + load-bearing values.

Files modified:
  infra/hetzner/cloudinit-control-plane.tftpl
  products/catalyst/bootstrap/api/internal/provisioner/cilium_values_parity_test.go (new)

Refs: #491, #492 (bootstrap-kit wait timeout), 66ea39f0 (envoyConfig in HR)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:09:10 +04:00
e3mrah
0d75ae354f
fix(infra): split Cilium-Gateway Certificate into sovereign-tls Kustomization (Phase-8a bug #13) (#484)
Phase-8a-preflight live deployment 93161846839dc2e1: bootstrap-kit Flux
Kustomization fails server-side dry-run with

  Certificate/kube-system/sovereign-wildcard-tls dry-run failed:
  no matches for kind 'Certificate' in version 'cert-manager.io/v1'

→ entire Kustomization apply aborts → ZERO HelmReleases reconcile.

Fix: split the Certificate into its own Flux Kustomization sovereign-tls
that dependsOn bootstrap-kit (whose Ready gates on every HR including
bp-cert-manager). Gateway stays in 01-cilium.yaml because Gateway API
CRDs ship with Cilium itself.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:48:18 +04:00
e3mrah
7e35040e29
fix(infra): cloud-init strip regex must preserve #cloud-config (Phase-8a bug #5 follow-up) (#482)
#477 introduced a regex "/(?m)^[ ]{0,2}#[^!].*\n/" to strip YAML-block
comments and fit Hetzner's 32KiB user_data cap. The [^!] guard preserved
shebangs like #!/bin/bash but DID NOT preserve cloud-init directives
like #cloud-config, #include, #cloud-boothook (none have ! after #).

Result: cloud-init received user_data with the #cloud-config first-line
DIRECTIVE stripped, didn't recognise the YAML body, and emitted:
  recoverable_errors:
  WARNING: Unhandled non-multipart (text/x-not-multipart) userdata

→ k3s never installed
→ Flux never bootstrapped
→ kubeconfig never PUT to catalyst-api
→ every Phase-8a provision since #477 has silently failed at boot

Live evidence: deployment a76e3fec8566add9, SSH'd at 2026-05-01 18:30 UTC:
cloud-init status 'degraded done', /etc/systemd/system/k3s.service
absent, no flux binary.

Fix: require a SPACE after the '#' in the strip regex. YAML comments
ARE typically '# foo bar' (with space). cloud-init directives are
'#cloud-config' / '#include' / '#cloud-boothook' (no space) — the new
regex preserves them.
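
A minimal Go demonstration of the tightened pattern (the exact replace() expression in main.tf is not reproduced; this assumes the space-required form of the strip regex):

  package main

  import (
      "fmt"
      "regexp"
  )

  func main() {
      // Old pattern: [^!] spares shebangs but not '#cloud-config' (no '!' follows '#').
      // New pattern: requiring a space after '#' keeps directives AND shebangs intact.
      strip := regexp.MustCompile(`(?m)^[ ]{0,2}# .*\n`)

      in := "#cloud-config\n" +
          "# documentation comment, safe to strip\n" +
          "write_files:\n" +
          "  # another comment\n" +
          "  - path: /usr/local/bin/boot.sh\n"

      fmt.Print(strip.ReplaceAllString(in, ""))
      // Output keeps '#cloud-config' and the YAML body; both comments are gone.
  }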

Out of scope: validating that ALL existing comments in the tftpl had
a space after #. They do — verified by sed pre-render passing the
sanity test (file shrinks 38KB → 13KB AND first line is #cloud-config).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:30:51 +04:00
e3mrah
e35729ad78
fix(infra): strip YAML-block comments from cloud-init to fit Hetzner 32KiB cap (Phase-8a bug #5) (#477)
Phase-8a-preflight deployment 3c158f712d564d84 failed at tofu apply with:

  Error: invalid input in field 'user_data'
    [user_data => [Length must be between 0 and 32768.]]
    on main.tf line 214, in resource "hcloud_server" "control_plane"

The rendered cloudinit-control-plane.tftpl is 38,085 bytes — 5,317
bytes over the Hetzner cap. The source template ships ~16 KB of
indent-0 and indent-2 documentation comments (YAML-level) that are
operationally inert at cloud-init boot.

Fix: wrap templatefile() in replace() with a RE2 regex that strips
lines whose first 0-2 chars are spaces followed by '#' (preserves
shebangs via [^!]). After strip, rendered cloud-init drops to ~13 KB.

Indent-4+ comments live INSIDE heredoc `content: |` blocks
(embedded shell scripts, kubeconfig fragments). Those are preserved.

Same fix applied to worker_cloud_init for parity.

Refs:
- Live evidence: deployment 3c158f712d564d84, tofu apply error 16:38:26 UTC
- Bug #5 in the Phase-8a-preflight tally
- #471: prior tftpl escape fix ($${SOVEREIGN_FQDN})
- #472: catalyst-build watches infra/hetzner/**

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:43:42 +04:00
e3mrah
03b1469331
fix(infra): escape ${SOVEREIGN_FQDN} in cloudinit-control-plane.tftpl comments (#471)
Phase-8a-preflight bug surfaced by first live provision attempt
(deployment febeeb888debf477, 2026-05-01 16:30 UTC):

  Error: Invalid function argument
    on main.tf line 140, in locals:
    140:   control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
  Invalid value for "vars" parameter: vars map does not contain key
  "SOVEREIGN_FQDN", referenced at ./cloudinit-control-plane.tftpl:12,37-51.

Tofu's templatefile() interprets ${...} ANYWHERE in the file (including
inside shell '#' comments), since the file is a template not a shell
script. Five lines in cloudinit-control-plane.tftpl reference
${SOVEREIGN_FQDN} as part of documentation prose explaining how
Flux postBuild.substitute interpolates the value at Flux apply time.

The Tofu vars map passed by main.tf:140 uses the canonical lowercase
HCL convention (sovereign_fqdn = var.sovereign_fqdn), not the uppercase
envsubst convention SOVEREIGN_FQDN. So Tofu fails: 'vars map does not
contain key SOVEREIGN_FQDN'.

The latest reference (line 12) was added by #326 (commit 20b89607); the
four older references predate that and were never exercised because no
live provision had ever been attempted before this Phase-8a run.

Fix: escape with double-dollar ($$) so Tofu emits a literal ${...}
in the rendered cloudinit file. The 5 comments now read $${SOVEREIGN_FQDN}
in source, render as ${SOVEREIGN_FQDN} in the user_data output —
preserving documentation intent without breaking templatefile().

Refs:
- Live provision: console.openova.io/sovereign/provision/febeeb888debf477
- Diagnostic: tofu plan exit 1 — vars map does not contain key SOVEREIGN_FQDN
- Out of scope: any other latent templatefile() escape issues — those
  surface as their own Phase-8a iterations

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:33:21 +04:00
e3mrah
20b896070f
feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326) (#448)
Wires the per-Sovereign K8s api-server's --oidc-* validator to the
per-Sovereign Keycloak realm so customer admins can authenticate
kubectl directly against their Sovereign — no static admin-kubeconfig
handoff, no rotated bearer-token exchange.

infra (cloud-init):
  - Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in
    infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed
    from sovereign_fqdn (https://auth.\${sovereign_fqdn}/realms/sovereign)
    per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups
    prefixes scope OIDC subjects under "oidc:" so RoleBindings reference
    e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509.

Canonical seam (anti-duplication rule, ADR-0001 §11.3):
  - The bp-keycloak chart already bundles bitnami/keycloak's
    keycloakConfigCli post-install Helm hook Job, which imports realms
    declared under values.keycloak.keycloakConfigCli.configuration. We
    enable the existing seam — no bespoke kubectl-exec realm-creation
    script, no custom Admin-API call from catalyst-api.

bp-keycloak chart (1.1.2 → 1.2.0):
  - Enable keycloakConfigCli + ship inline sovereign-realm.json with:
    realm "sovereign" (invariant per Sovereign — Keycloak resolves the
    issuer claim from the request hostname, so no per-FQDN realm
    rename), default groups sovereign-admins/-ops/-viewers,
    oidc-group-membership-mapper emitting "groups" claim, public OIDC
    client "kubectl" with localhost:8000 + OOB redirect URIs
    (kubectl-oidc-login defaults), publicClient=true (kubectl runs
    locally and cannot safely hold a secret), PKCE S256 enforced.
  - Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape).
  - Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/,
    otech.omani.works/ to version: 1.2.0.
  - New chart test tests/oidc-kubectl-client.sh (4 cases) — all green.
  - Existing tests/observability-toggle.sh — still green.

Documentation:
  - Add §11 "kubectl OIDC for customer admins" runbook to
    docs/omantel-handover-wbs.md with one-time workstation setup
    (kubectl krew install oidc-login + config set-credentials),
    sovereign-admin RBAC binding (oidc:sovereign-admins →
    cluster-admin), and 401-debugging table mapping common symptoms
    to root causes.
  - Carve #326 out of §7 "Out of scope" — it is shipped.
  - Add §9 status row.

Validation:
  - grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl
    → 2 (comment + the actual flag in the curl line)
  - grep -c 'oidc-username-claim' → 2
  - helm template platform/keycloak/chart → renders post-install
    keycloak-config-cli Job + ConfigMap with kubectl client (3 hits
    on grep "kubectl"; 1 hit on "clientId": "kubectl")
  - bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode)
  - 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle
    gates green

Out of scope (deferred to follow-up tickets):
  - Per-Sovereign user provisioning UI (#322, #323)
  - Refresh-token revocation on RoleBinding deletion (#324)
  - provider-kubernetes Crossplane ProviderConfig per Sovereign (#321)
  - omantel migration / Phase 8 live execution

NO catalyst-api or UI source files touched (those are #319/#322/#323
agents' territories per agent brief).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:07:52 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture + bucket
provisioning Phase 0b path described in the omantel handover WBS §5.
Hybrid Option A+B: wizard collects operator-issued S3 credentials (Hetzner
exposes no Cloud API to mint them — they're issued once in the Hetzner
Console and the secret half is shown exactly once), and OpenTofu
auto-provisions the per-Sovereign bucket via the aminueza/minio provider
+ writes a flux-system/hetzner-object-storage Secret into the new
Sovereign at cloud-init time so Harbor (#383) and Velero (#384) find
their backing-store credentials already in the cluster from Phase 1
onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries / scripts
/ operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
  - versions.tf: declare aminueza/minio provider (Hetzner's official
    recommendation for S3-compatible bucket creation per
    docs.hetzner.com/storage/object-storage/getting-started/...)
  - variables.tf: 4 sensitive vars — region (validated against
    fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
    access_key, secret_key, bucket_name (RFC-compliant S3 naming)
  - main.tf: minio_s3_bucket.main resource — idempotent on re-apply,
    no force_destroy (Velero archive must survive a control-plane
    reinstall), object_locking=false (content-addressed digests are
    the immutability guarantee for Harbor; Velero uses S3 versioning)
  - cloudinit-control-plane.tftpl: write
    flux-system/hetzner-object-storage Secret with the canonical
    s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
    Harbor + Velero charts consume via existingSecret refs
  - outputs.tf: surface endpoint/region/bucket back to catalyst-api
    for the deployment record (credentials NEVER returned)

products/catalyst/bootstrap/api/ (Go):
  - internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
    ListBuckets validator. Distinguishes auth failure ("rejected") from
    network failure ("unreachable") so the wizard renders the right
    error card. NOT a parallel cloud-resource path — the existing
    purge.go handles hcloud purge; objectstorage.go handles a separate
    API surface (S3-compatible) that has no equivalent client today.
  - internal/handler/credentials.go: extend with
    ValidateObjectStorageCredentials handler — same wire shape
    (200 valid:true / 200 valid:false / 503 unreachable / 400 bad
    input) as the existing token validator so the wizard's failure-
    card machinery handles both without per-endpoint switches.
  - cmd/api/main.go: wire POST
    /api/v1/credentials/object-storage/validate
  - internal/provisioner/provisioner.go: extend Request with
    ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate()
    rejects empty/malformed values fail-fast at /api/v1/deployments
    POST time; writeTfvars() emits the 4 new tfvars.
  - internal/handler/deployments.go: derive bucket name from FQDN slug
    pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>) so
    Hetzner's globally-namespaced bucket pool gets a deterministic,
    collision-resistant per-Sovereign name without operator input.
  - internal/store/store.go: redact access/secret keys; preserve
    region+bucket plain (they're public in tofu outputs anyway).
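
A sketch of the ListBuckets validator described under internal/hetzner/objectstorage.go above (assuming minio-go/v7; the rejected/unreachable classification is simplified to whether the endpoint returned an S3 error response at all):

  package objectstorage

  import (
      "context"
      "fmt"
      "time"

      "github.com/minio/minio-go/v7"
      "github.com/minio/minio-go/v7/pkg/credentials"
  )

  // Validate distinguishes "rejected" (credentials refused) from "unreachable"
  // (endpoint not answering) so the wizard can render the right failure card.
  func Validate(ctx context.Context, endpoint, accessKey, secretKey string) error {
      client, err := minio.New(endpoint, &minio.Options{
          Creds:  credentials.NewStaticV4(accessKey, secretKey, ""),
          Secure: true,
      })
      if err != nil {
          return fmt.Errorf("unreachable: %w", err)
      }
      ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
      defer cancel()
      if _, err := client.ListBuckets(ctx); err != nil {
          // A populated S3 error response means the endpoint answered but
          // refused the credentials; anything else is a connectivity problem.
          if resp := minio.ToErrorResponse(err); resp.Code != "" {
              return fmt.Errorf("rejected: %w", err)
          }
          return fmt.Errorf("unreachable: %w", err)
      }
      return nil
  }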

products/catalyst/bootstrap/ui/ (TypeScript / React):
  - entities/deployment/model.ts + store.ts: 4 new wizard fields
    (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
    coercion for legacy persisted state.
  - pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
    region picker (fsn1/nbg1/hel1), masked secret-key input,
    Validate button gating Next. Same FailureCard taxonomy
    (rejected/too-short/unreachable/network/parse/http) the existing
    TokenSection uses, so the operator UX is consistent. Section
    only renders when Hetzner is among chosen providers — non-Hetzner
    Sovereigns skip Phase 0b until their own backing-store path lands.
  - pages/wizard/steps/StepReview.tsx: include
    objectStorageRegion/AccessKey/SecretKey in the
    POST /v1/deployments payload (bucket derived server-side).

Tests:
  - api: 7 new provisioner Validate tests (region/keys/bucket
    required + RFC-compliant + valid-region acceptance), 5 handler
    tests for the new endpoint (bad JSON / missing region / invalid
    region / short keys), 4 hetzner/objectstorage_test.go tests
    (endpoint composition + early input rejection), 1 handler test
    for the bucket-name derivation. Existing tests updated to supply
    the new required fields.
  - ui: StepCredentials.test.tsx pre-populates objectStorageValidated
    in beforeEach so the existing 11 SSH-section tests aren't gated
    on Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references.

Closes #371.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST capture
them; OpenTofu auto-provisions the bucket + cloud-init writes the
flux-system/hetzner-object-storage Secret with the canonical s3-* keys
Harbor + Velero consume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:54:22 +04:00
e3mrah
d2ada908c9
feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316) (#408)
Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns
(no managed-KMS available). Selected **Option A — Shamir + cloud-init
seed** because:

  - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C)
    is structurally unavailable.
  - Transit-seal (Option B) requires a peer OpenBao cluster, only
    applicable to multi-region tier-1; out of scope for single-region
    omantel.
  - Manual unseal (Option D) violates the "first sovereign-admin lands
    on console.<sovereign-fqdn> ready to use" goal in
    SOVEREIGN-PROVISIONING.md §5.

Architecture (per issue #316 spec + acceptance criteria 1-6):

  1. Cloud-init on the control-plane node generates a 32-byte recovery
     seed from /dev/urandom and writes it to a single-use K8s Secret
     `openbao-recovery-seed` in the openbao namespace, with annotation
     `openbao.openova.io/single-use: "true"`. Pre-creates the openbao
     namespace to eliminate the race with Flux's HelmRelease apply
     (see the seed sketch after this list).
  2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks:
       - `templates/init-job.yaml` (hook weight 5): consumes the seed,
         calls `bao operator init -recovery-shares=1 -recovery-threshold=1`,
         persists the recovery key inside OpenBao's auto-unseal config,
         deletes the seed Secret on success. Idempotent — re-runs detect
         Initialized=true and exit 0.
       - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables
         the Kubernetes auth method, mounts kv-v2 at `secret/`, writes
         the `external-secrets-read` policy, binds the `external-secrets`
         role to the ESO ServiceAccount in `external-secrets-system`.
  3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA
     + Role + RoleBinding the Jobs need (Secret get/list/delete in the
     openbao namespace; create/get/patch on the openbao-init-marker).
     Also emits the permanent `system:auth-delegator` ClusterRoleBinding
     bound to the OpenBao ServiceAccount so the Kubernetes auth method
     can call tokenreviews.authentication.k8s.io.
  4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml`
     bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true`
     per-Sovereign.
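
A sketch of the seed-to-Secret step from item 1 (the real implementation is a cloud-init shell block; the data key name and encoding below are assumptions, while the Secret name and single-use annotation match the description above):

  package main

  import (
      "crypto/rand"
      "encoding/base64"
      "fmt"
      "strings"
  )

  // buildRecoverySeedSecret renders the single-use Secret manifest that the
  // post-install init Job consumes and then deletes.
  func buildRecoverySeedSecret() (string, error) {
      seed := make([]byte, 32) // 32-byte recovery seed, as in the cloud-init step
      if _, err := rand.Read(seed); err != nil {
          return "", err
      }
      lines := []string{
          "apiVersion: v1",
          "kind: Secret",
          "metadata:",
          "  name: openbao-recovery-seed",
          "  namespace: openbao",
          "  annotations:",
          `    openbao.openova.io/single-use: "true"`,
          "type: Opaque",
          "data:",
          "  seed: " + base64.StdEncoding.EncodeToString(seed),
      }
      return strings.Join(lines, "\n") + "\n", nil
  }

  func main() {
      manifest, err := buildRecoverySeedSecret()
      if err != nil {
          panic(err)
      }
      fmt.Print(manifest)
  }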

Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }}
{{- end }}`) used throughout — never `{{ fail }}`. Default `helm
template` render emits NOTHING new; opt-in via autoUnseal.enabled=true.

Acceptance criteria coverage:
  1. Provision fresh Sovereign — cloud-init writes seed, Flux installs
     bp-openbao 1.2.0, post-install Jobs run automatically. 
  2. bp-openbao HR Ready=True without manual intervention — install
     keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the
     init Job drives initialisation out-of-band on the same install). 
  3. `bao status` shows Sealed=false, Initialized=true within 5 minutes
     — init Job polls + retries up to 60×5s. 
  4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the
     auth-bootstrap Job binds the `external-secrets` role to ESO's SA
     before the Job exits. 
  5. Seed Secret deleted post-init — init Job deletes it via K8s API
     after consuming. 
  6. No openbao-root-token Secret in K8s — root token captured to
     /tmp/.root-token in the Job pod's tmpfs only; never written to a
     K8s Secret. The recovery key persists ONLY inside OpenBao's Raft
     state (auto-unseal config). 

Tests:
  - tests/auto-unseal-toggle.sh — 4 cases:
    * default render → no auto-unseal artefacts (skip-render works)
    * autoUnseal.enabled=true → both Jobs + correct hook weights
    * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap
    * idempotency annotations present on all 5 hook objects
  - tests/observability-toggle.sh — unchanged, all 3 cases green.
  - helm lint . — clean.

Files:
  - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/chart/values.yaml — `autoUnseal.*` block
  - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new
  - platform/openbao/chart/templates/init-job.yaml — new
  - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new
  - platform/openbao/chart/tests/auto-unseal-toggle.sh — new
  - platform/openbao/README.md — bootstrap procedure §2-3 expanded;
    auto-unseal alternatives table added.
  - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 →
    1.2.0, autoUnseal.enabled=true.
  - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block
    inserted between ghcr-pull-secret apply and flux-bootstrap apply.
  - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released.

Canonical seam used: extended existing `platform/openbao/chart/` per
the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud
calls. NO `{{ fail }}`. All knobs configurable via values.yaml per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode).

Co-authored-by: hatiyildiz <hat.yil@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:45:44 +04:00
e3mrah
8781aa3bc4
fix(provisioner): cloud-init bootstrap-kit path matches per-FQDN cluster dir (resolves #218) (#256)
The cloud-init template selected a per-FQDN GitRepository tree
(`!/clusters/${sovereign_fqdn}`) and pointed both bootstrap-kit
and infrastructure-config Flux Kustomizations at
`./clusters/${sovereign_fqdn}/{bootstrap-kit,infrastructure}` —
directories the wizard never commits before provisioning. Every
fresh Sovereign stalled Phase-1 with `kustomization path not found:
.../clusters/<fqdn>/bootstrap-kit: no such file or directory`
(live evidence on otech.omani.works deployment ce476aaf80731a46).

Canonical fix:
- GitRepository.spec.ignore selects the shared `_template` tree
  (`!/clusters/_template`).
- Both Kustomizations point at `./clusters/_template/bootstrap-kit`
  and `./clusters/_template/infrastructure`.
- Flux postBuild.substitute.SOVEREIGN_FQDN: ${sovereign_fqdn}
  interpolates the Sovereign's FQDN into the rendered manifests
  (envsubst replaces `${SOVEREIGN_FQDN}` in label values, ingress
  hostnames, HelmRelease values).
- clusters/_template/bootstrap-kit/*.yaml + kustomization.yaml
  switch their bare `SOVEREIGN_FQDN_PLACEHOLDER` markers to
  `${SOVEREIGN_FQDN}` so Flux's envsubst-based substitute can
  actually replace them.

Locked by 5 unit tests in
products/catalyst/bootstrap/api/internal/provisioner/cloudinit_path_test.go
that read the template and assert: GitRepository ignore selects
_template, both Kustomization paths point at _template subdirs,
both carry the postBuild.substitute hook, and no operative YAML
line carries `clusters/${sovereign_fqdn}`.

Closes #218

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:11:44 +04:00
e3mrah
5aee6aa737
fix(cloudinit): poll for local-path StorageClass instead of pod Ready (closes #207) (#209)
The previous fix for #189 wrote `kubectl wait --for=condition=Ready pod
-l app=local-path-provisioner --timeout=60s`. That cannot succeed
pre-Cilium: k3s runs with --flannel-backend=none, the node stays
Ready=False until Cilium installs (much later in cloud-init), and the
not-ready taint blocks every untolerated pod. The wait timed out at
60s, scripts_user failed, and the Flux-bootstrap + kubeconfig POST-back
sections never executed. Every fresh Sovereign provision was stuck
"before Cilium" with no error signal in the wizard.

Replace the impossible Pod-Ready wait with a poll for the StorageClass
object itself, which k3s registers independently of CNI within ~3s of
service start.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 21:30:27 +02:00
hatiyildiz
3b5fca2033 merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189) 2026-04-29 19:43:59 +02:00
hatiyildiz
4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Before this fix, the cloud-init template passed --disable=local-storage to
the k3s installer, with the design intent that Crossplane would install
hcloud-csi on day 2 and register a StorageClass after bp-crossplane
reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
blocks Pending on a StorageClass that would only exist after bp-crossplane
finished installing — but they ARE in the bootstrap-kit Kustomization
that needs to converge before the day-2 path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init POSTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd
   (see the sketch after this list).

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT.

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 {"error":
   "kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request just before tofu.auto.tfvars.json is written. Validate()
  rejects an empty token for domain_mode=pool with a pointer to
  docs/SECRET-ROTATION.md (sketched below, after this list).
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts that the rendered cloud-init contains the Secret, that it is
  applied before flux-bootstrap, and that the dockerconfigjson
  round-trips the sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.
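
A minimal Go sketch of the json:"-" plus env-stamping pattern from the
first bullet above. Field names beyond GHCRPullToken and the exact
method shapes are illustrative, not the package's verbatim API:

  package provisioner

  import (
      "errors"
      "os"
  )

  // Request carries the wizard inputs; GHCRPullToken is json:"-" so it
  // is never serialized into persisted deployment records.
  type Request struct {
      DomainMode    string `json:"domainMode"` // illustrative field/tag
      GHCRPullToken string `json:"-"`
  }

  type Provisioner struct {
      ghcrPullToken string
  }

  // New reads CATALYST_GHCR_PULL_TOKEN once at startup; a missing value
  // is tolerated here (the Pod still starts) and Validate owns the gate.
  func New() *Provisioner {
      return &Provisioner{ghcrPullToken: os.Getenv("CATALYST_GHCR_PULL_TOKEN")}
  }

  // Validate rejects an empty token for pool-mode deployments with an
  // operator-facing pointer to the rotation runbook.
  func (p *Provisioner) Validate(req *Request) error {
      if req.DomainMode == "pool" && p.ghcrPullToken == "" {
          return errors.New("CATALYST_GHCR_PULL_TOKEN is empty; see docs/SECRET-ROTATION.md")
      }
      return nil
  }

  // Provision stamps the token onto the Request just before the
  // per-deployment tofu.auto.tfvars.json is rendered (rendering elided).
  func (p *Provisioner) Provision(req *Request) error {
      req.GHCRPullToken = p.ghcrPullToken
      // ... write tofu.auto.tfvars.json from req, then exec tofu ...
      return nil
  }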

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
34c8de84c0 fix(cloudinit): split flux-bootstrap into bootstrap-kit + infrastructure-config Kustomizations
The single 'catalyst-bootstrap' Flux Kustomization at clusters/<fqdn>/
applied bootstrap-kit/ AND infrastructure/ together. infrastructure/
declares ProviderConfig with kind hcloud.crossplane.io/v1beta1, but
that CRD is registered only after Crossplane core (bp-crossplane) is
reconciled AND the Provider package (provider-hcloud) is installed
inside the cluster. Flux dry-run-applied ProviderConfig before any of
that finished and surfaced the failure at the omantel cluster:

  ProviderConfig/default dry-run failed: no matches for kind
  ProviderConfig in version hcloud.crossplane.io/v1beta1

Resolution: emit two Flux Kustomizations from cloud-init's
flux-bootstrap.yaml, with infrastructure-config declaring
dependsOn: [name: bootstrap-kit] + wait: true. Flux now waits for the
bootstrap-kit HelmReleases (including bp-crossplane registering the
Crossplane core CRDs and reconciling the provider-hcloud package
which then registers hcloud.crossplane.io/v1beta1) to be Ready before
the infrastructure-config Kustomization applies ProviderConfig.

Verified live on the omantel control-plane (kubectl delete the old
single Kustomization + apply the two-Kustomization split): bootstrap-kit
moved to Reconciliation in progress, infrastructure-config correctly
showed False / dependency 'flux-system/bootstrap-kit' is not ready,
which is the desired ordered-bootstrap behaviour.
2026-04-29 16:11:33 +02:00
hatiyildiz
548720095a fix(cloudinit): use 127.0.0.1 for Cilium k8sServiceHost (host's local apiserver)
Cilium with --set k8sServiceHost=10.0.1.2 (the cp1 private NIC IP) sat
in init phase forever — the agent's API client kept logging
"Establishing connection to apiserver host=https://10.0.1.2:6443" and
never got a response, even though `curl https://10.0.1.2:6443/healthz`
from the host returned 401 (TLS+auth challenge = endpoint reachable).

Switching to k8sServiceHost=127.0.0.1 brought the DaemonSet up
immediately. Verified end-to-end on the live cluster:

  $ kubectl get nodes
  catalyst-omantel-omani-works-cp1   Ready   ...   32m   v1.31.4+k3s1

The node's local apiserver always binds 127.0.0.1:6443; using that as
the bootstrap apiserver endpoint sidesteps whatever was rejecting the
private-NIC IP route during Cilium's pre-CNI bring-up. Once Cilium is
the CNI and the cluster has real Service VIPs, every other component
reaches the apiserver via the kubernetes.default service as usual.
2026-04-29 15:31:21 +02:00
hatiyildiz
e571ec7aa2 fix(cloudinit): install Cilium BEFORE Flux to break CNI bootstrap deadlock
omantel.omani.works deployment 5cd1bceaaacb71f6 reached Phase 0 success
(10 Hetzner resources up, LB IP 49.12.16.160, DNS committed via PDM)
but stayed silent for 25 minutes — `https://console.omantel.omani.works`
returned no response, every Flux pod was Pending, and the node was
NotReady. SSH'd into the cp1 box (firewall opened temporarily for the
operator IP) and found the canonical CNI bootstrap deadlock:

  Ready: False  (KubeletNotReady)
  message: container runtime network not ready: NetworkReady=false
   reason:NetworkPluginNotReady cni plugin not initialized

cloud-init started k3s with --flannel-backend=none + --disable-network-policy
(the right Cilium-ready posture), then immediately applied the Flux
install.yaml. Flux pods are Pending because there is no CNI yet, so
Flux never starts → never reconciles bp-cilium → CNI never installs →
deadlock. The "wait for deployment Available --timeout=300s" line
silently times out and cloud-init proceeds anyway with the Flux
GitRepository + Kustomization that nothing reconciles.

Resolution: install Cilium ONCE in cloud-init via the canonical Helm
chart at the SAME version (1.16.5) that platform/cilium/blueprint.yaml
declares for bp-cilium. When Flux later reconciles
clusters/<sovereign_fqdn>/bootstrap-kit/01-cilium.yaml it adopts the
existing Helm release (release name + namespace match), so the wizard's
ownership model stays single-source-of-truth (Flux + Blueprints) after
the bootstrap exception.

Per INVIOLABLE-PRINCIPLES.md #3, this Helm install is the one-shot
bootstrap exception authorised by "the GitOps engine is Flux —
everything ELSE gets installed by Flux". Cilium IS the CNI Flux needs,
so it cannot be installed by Flux without bootstrapping itself first.
Every other component still flows through the Blueprint pipeline.

Verified: ssh'd into the running omantel cp1 (firewall opened for the
operator IP), ran the same `helm install cilium ...` command this
patch encodes, and the cluster recovered — node Ready, Flux pods
scheduling, GitRepository pulling. Will redeploy from scratch with
the patched cloud-init to validate the full unattended path.

Cloud-init is the Phase-0 OpenTofu artifact baked into the Hetzner
server's user_data, so this change activates on the NEXT `tofu apply`
that creates a new control-plane server. The existing omantel cp1 is
already manually unblocked; new Sovereigns provisioned after the
catalyst-api image containing this template is rolled out will not hit
the deadlock.
2026-04-29 15:29:10 +02:00
hatiyildiz
330211d275 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168, PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it meant DNS records were
written twice (when it worked) and the entire provision flow broke
(when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:57 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both are fixed by extending the regex to include the cpx* family AND by
permitting an empty worker_size when the operator runs a solo Sovereign,
as sketched below.
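
An equivalent check expressed in Go (the actual rules live in HCL
validation blocks in variables.tf, and the exact pattern there may
differ; this is only a sketch of the accepted shapes):

  package main

  import (
      "fmt"
      "regexp"
  )

  // Server-type names must match a Hetzner family, now including cpx
  // (AMD shared); worker_size additionally accepts the empty string
  // for solo Sovereigns (worker_count = 0).
  var (
      controlPlaneSizeRe = regexp.MustCompile(`^(cx|cpx|ccx|cax)[0-9]+$`)
      workerSizeRe       = regexp.MustCompile(`^((cx|cpx|ccx|cax)[0-9]+)?$`)
  )

  func main() {
      fmt.Println(controlPlaneSizeRe.MatchString("cpx32")) // true: the wizard's recommended SKU
      fmt.Println(workerSizeRe.MatchString(""))            // true: solo Sovereign
      fmt.Println(workerSizeRe.MatchString("t3.micro"))    // false: not a Hetzner server type
  }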

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire — failure now requires a real
Hetzner token instead of being blocked at module-validation time.
2026-04-29 14:43:52 +02:00
hatiyildiz
4ee9e7dd6f fix(wizard): topology before provider; per-provider SKU catalog; per-region sizing
The wizard step order was inverted: it asked for the provider before the
topology, then put hetzner-only SKUs inside the topology step. Topology
decides how many regions exist; provider is a per-region property; SKU
vocabulary is per-provider (cx32 means nothing on Azure). Fixes all three.

New step order (WIZARD_STEPS + WizardPage STEPS): Org -> Topology ->
Provider -> Credentials -> Components -> Domain -> Review.

Per-provider SKU catalog at products/catalyst/bootstrap/ui/src/shared/
constants/providerSizes.ts replaces the legacy hetzner-only HETZNER_NODE_SIZES.
Five providers (hetzner, huawei, oci, aws, azure), each with realistic SKU
options drawn from that vendor's native instance-type vocabulary. Every
SKU read in the wizard goes through PROVIDER_NODE_SIZES[provider] -- no
SKU literal lives anywhere else.

StepProvider now renders one card per topology slot. Each card carries:
provider chooser, that provider's region picker, that provider's
control-plane SKU, that provider's worker SKU + count. Cost rollup sums
each region's (cp + worker*count) at its OWN provider's pricing, so a
mixed-cloud topology computes correctly.

StepTopology drops the SkuCard + NodeSizingPanel; it now captures only
the topology template, HA flag, and AIR-GAP add-on.

Per-region store fields (regionControlPlaneSizes, regionWorkerSizes,
regionWorkerCounts) replace the singular controlPlaneSize/workerSize/
workerCount as the canonical shape. Migration in store.merge() hydrates
the arrays from any persisted singular fields; the cx22 legacy default
is treated as "no selection" so a hetzner-only id never leaks into a
non-hetzner region.

Backend Request gains an optional Regions []RegionSpec field. Validate
mirrors Regions[0] into the legacy singular fields for the existing
solo-Hetzner writeTfvars path. infra/hetzner/variables.tf accepts the
list-of-objects shape; the for_each iteration that activates the rest
of the regions is the multi-region tofu wiring follow-up. Door open
structurally; no shape compromised.

Dead code removed: StepInfrastructure and shared/constants/hetzner.ts
(both orphaned, contained the only HETZNER_NODE_SIZES reference outside
the catalog).

Gates: tsc --noEmit, vite build, vitest (149 tests), go vet, go test
(provisioner + handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:33 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open rule (80, 443, 6443, and ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30.
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called the Hetzner Cloud API directly and exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
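
A minimal Go sketch of the exec-and-stream step described above: run one
tofu subcommand in the per-deployment workdir and forward each output
line to the wizard's event stream. The name, signature, and channel
plumbing are illustrative (the real SSE wiring and stderr handling are
elided), not the module's exact API:

  package provisioner

  import (
      "bufio"
      "context"
      "fmt"
      "os/exec"
  )

  // runTofu executes `tofu <args...>` in workdir and forwards each
  // stdout line to events; the handler turns those lines into SSE events.
  func runTofu(ctx context.Context, workdir string, events chan<- string, args ...string) error {
      cmd := exec.CommandContext(ctx, "tofu", args...)
      cmd.Dir = workdir

      stdout, err := cmd.StdoutPipe()
      if err != nil {
          return err
      }
      if err := cmd.Start(); err != nil {
          return err
      }

      sc := bufio.NewScanner(stdout)
      for sc.Scan() {
          events <- sc.Text() // one streamed line per tofu output line
      }
      if err := sc.Err(); err != nil {
          return fmt.Errorf("stream tofu output: %w", err)
      }
      return cmd.Wait()
  }

  // A provisioning run chains the three subcommands listed above:
  //   runTofu(ctx, workdir, events, "init")
  //   runTofu(ctx, workdir, events, "plan", "-out=tfplan")
  //   runTofu(ctx, workdir, events, "apply", "tfplan")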

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00