Commit Graph

15 Commits

Author SHA1 Message Date
e3mrah
5aee6aa737
fix(cloudinit): poll for local-path StorageClass instead of pod Ready (closes #207) (#209)
The previous fix for #189 wrote `kubectl wait --for=condition=Ready pod
-l app=local-path-provisioner --timeout=60s`. That cannot succeed
pre-Cilium: k3s runs with --flannel-backend=none, the node stays
Ready=False until Cilium installs (much later in cloud-init), and the
not-ready taint blocks every untolerated pod. The wait timed out at
60s, scripts_user failed, and the Flux-bootstrap + kubeconfig POST-back
sections never executed. Every fresh Sovereign provision was stuck
"before Cilium" with no error signal in the wizard.

Replace the impossible Pod-Ready wait with a poll for the StorageClass
object itself, which k3s registers independently of CNI within ~3s of
service start.
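
A minimal sketch of such a poll, assuming the cloud-init shell context
(the template's exact loop bounds and failure handling may differ):

  # Poll for the StorageClass object itself; it appears regardless of CNI state.
  for i in $(seq 1 30); do
    kubectl get storageclass local-path >/dev/null 2>&1 && break
    sleep 2
  done
  # Fail loudly (non-zero exit) if it never appeared.
  kubectl get storageclass local-path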

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 21:30:27 +02:00
hatiyildiz
3b5fca2033 merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189) 2026-04-29 19:43:59 +02:00
hatiyildiz
4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Before this fix, the cloud-init template passed --disable=local-storage
to the k3s installer, with the design intent that Crossplane would
install hcloud-csi day-2 and register a StorageClass after bp-crossplane
reconciled. That created a circular dependency on a fresh Sovereign:
every PVC-using HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak
postgres, bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform
postgres) sits Pending, waiting on a StorageClass that would only exist
after bp-crossplane finished installing, yet those HelmReleases ARE in
the bootstrap-kit Kustomization that must converge before the day-2
path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply (sketched after this list) that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.
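
A minimal sketch of the wait/patch/verify sequence from item 2, assuming
the cloud-init shell context and k3s's built-in kube-system deployment
(timeouts and error text illustrative):

  # (a) wait for the provisioner pod to be Ready
  kubectl -n kube-system wait --for=condition=Ready pod \
    -l app=local-path-provisioner --timeout=60s
  # (b) mark the k3s-shipped StorageClass as the cluster default
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  # (c) safety gate: abort before Flux if the StorageClass is still missing
  kubectl get storageclass local-path || { echo "local-path StorageClass missing"; exit 1; }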

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — the `catalystBlueprint.upstream.version`
    mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.
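
A minimal sketch of how that invariant (Case 3 above) can be checked,
assuming the fluxcd-community Helm repo is reachable; the authoritative
assertions live in version-pin-replay.sh and may differ in detail:

  # v-tag pinned in cloud-init's install.yaml URL, e.g. v2.4.0
  pinned=$(grep -oE 'download/v[0-9.]+/install\.yaml' \
    infra/hetzner/cloudinit-control-plane.tftpl | grep -oE 'v[0-9.]+')
  # flux2 subchart version declared by the umbrella chart
  dep=$(awk '/name: flux2/{f=1} f && /version:/{print $2; exit}' \
    platform/flux/chart/Chart.yaml)
  # upstream appVersion that subchart ships
  helm repo add fluxcd-community https://fluxcd-community.github.io/helm-charts --force-update >/dev/null
  app=$(helm show chart fluxcd-community/flux2 --version "$dep" | awk '/^appVersion:/{print $2}')
  [ "v$app" = "$pinned" ] || { echo "FAIL: chart appVersion v$app != cloud-init pin $pinned"; exit 1; }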

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init POSTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd.

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT (see the
   sketch after this list).

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 {"error":
   "kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.
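
A minimal sketch of the item-3 postback as it reads after templatefile()
substitution, with illustrative shell variable names and paths (the
template's exact file locations and retry budget are authoritative):

  # Rewrite the apiserver endpoint to the LB IP, PUT it back, then clean up.
  sed "s#https://127.0.0.1:6443#https://$LB_IPV4:6443#" \
    /etc/rancher/k3s/k3s.yaml > /root/kubeconfig.postback
  chmod 0600 /root/kubeconfig.postback
  curl --fail -X PUT \
    -H "Authorization: Bearer $KUBECONFIG_BEARER_TOKEN" \
    --retry 60 --retry-delay 10 --retry-all-errors \
    --data-binary @/root/kubeconfig.postback \
    "$CATALYST_API_URL/api/v1/deployments/$DEPLOYMENT_ID/kubeconfig"
  rm -f /root/kubeconfig.postback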

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land in a cluster that already
  has working GHCR creds (see the sketch after this list).
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.
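
A minimal sketch of the Secret the template emits, shown here as the
equivalent kubectl one-liner (the real template renders the
dockerconfigjson from locals; username variable and placeholder are
illustrative):

  kubectl -n flux-system create secret docker-registry ghcr-pull \
    --docker-server=ghcr.io \
    --docker-username="$GHCR_PULL_USERNAME" \
    --docker-password="<GHCR_PULL_TOKEN>" \
    --dry-run=client -o yaml > /var/lib/catalyst/ghcr-pull-secret.yaml
  # runcmd applies it after Flux core install, before flux-bootstrap.yaml
  kubectl apply -f /var/lib/catalyst/ghcr-pull-secret.yaml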

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
34c8de84c0 fix(cloudinit): split flux-bootstrap into bootstrap-kit + infrastructure-config Kustomizations
The single 'catalyst-bootstrap' Flux Kustomization at clusters/<fqdn>/
applied bootstrap-kit/ AND infrastructure/ together. infrastructure/
declares a ProviderConfig with apiVersion hcloud.crossplane.io/v1beta1, but
that CRD is registered only after Crossplane core (bp-crossplane) is
reconciled AND the Provider package (provider-hcloud) is installed
inside the cluster. Flux dry-run-applied ProviderConfig before any of
that finished and surfaced the failure at the omantel cluster:

  ProviderConfig/default dry-run failed: no matches for kind
  ProviderConfig in version hcloud.crossplane.io/v1beta1

Resolution: emit two Flux Kustomizations from cloud-init's
flux-bootstrap.yaml, with infrastructure-config declaring
dependsOn: [name: bootstrap-kit] + wait: true. Flux now waits for the
bootstrap-kit HelmReleases (including bp-crossplane registering the
Crossplane core CRDs and reconciling the provider-hcloud package
which then registers hcloud.crossplane.io/v1beta1) to be Ready before
the infrastructure-config Kustomization applies ProviderConfig.
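
A minimal sketch of that dependency edge, rendered with the flux CLI for
illustration (cloud-init writes the equivalent YAML directly; the source
name and path here are assumptions):

  flux create kustomization infrastructure-config \
    --namespace=flux-system \
    --source=GitRepository/flux-system \
    --path="./clusters/<sovereign_fqdn>/infrastructure" \
    --depends-on=bootstrap-kit \
    --prune=true \
    --wait=true \
    --interval=10m \
    --export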

Verified live on the omantel control-plane (kubectl delete the old
single Kustomization + apply the two-Kustomization split): bootstrap-kit
moved to Reconciliation in progress, infrastructure-config correctly
showed False / dependency 'flux-system/bootstrap-kit' is not ready,
which is the desired ordered-bootstrap behaviour.
2026-04-29 16:11:33 +02:00
hatiyildiz
548720095a fix(cloudinit): use 127.0.0.1 for Cilium k8sServiceHost (host's local apiserver)
Cilium with --set k8sServiceHost=10.0.1.2 (the cp1 private NIC IP) sat
in init phase forever — the agent's API client kept logging
"Establishing connection to apiserver host=https://10.0.1.2:6443" and
never got a response, even though `curl https://10.0.1.2:6443/healthz`
from the host returned 401 (TLS+auth challenge = endpoint reachable).

Switching to k8sServiceHost=127.0.0.1 brought the DaemonSet up
immediately. Verified end-to-end on the live cluster:

  $ kubectl get nodes
  catalyst-omantel-omani-works-cp1   Ready   ...   32m   v1.31.4+k3s1

The node's local apiserver always binds 127.0.0.1:6443; using that as
the bootstrap apiserver endpoint sidesteps whatever was rejecting the
private-NIC IP route during Cilium's pre-CNI bring-up. Once Cilium is
the CNI and the cluster has real Service VIPs, every other component
reaches the apiserver via the kubernetes.default service as usual.
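
A minimal sketch of the bootstrap install with the corrected endpoint
(repo alias illustrative; the template's other --set values are omitted):

  helm repo add cilium https://helm.cilium.io/ >/dev/null
  helm install cilium cilium/cilium \
    --version 1.16.5 \
    --namespace kube-system \
    --set k8sServiceHost=127.0.0.1 \
    --set k8sServicePort=6443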
2026-04-29 15:31:21 +02:00
hatiyildiz
e571ec7aa2 fix(cloudinit): install Cilium BEFORE Flux to break CNI bootstrap deadlock
omantel.omani.works deployment 5cd1bceaaacb71f6 reached Phase 0 success
(10 Hetzner resources up, LB IP 49.12.16.160, DNS committed via PDM)
but stayed silent for 25 minutes — `https://console.omantel.omani.works`
returned no response, every Flux pod was Pending, and the node was
NotReady. SSH'd into the cp1 box (firewall opened temporarily for the
operator IP) and found the canonical CNI bootstrap deadlock:

  Ready: False  (KubeletNotReady)
  message: container runtime network not ready: NetworkReady=false
   reason:NetworkPluginNotReady cni plugin not initialized

cloud-init started k3s with --flannel-backend=none + --disable-network-policy
(the right Cilium-ready posture), then immediately applied the Flux
install.yaml. Flux pods are Pending because there is no CNI yet, so
Flux never starts → never reconciles bp-cilium → CNI never installs →
deadlock. The "wait for deployment Available --timeout=300s" line
silently times out and cloud-init proceeds anyway with the Flux
GitRepository + Kustomization that nothing reconciles.

Resolution: install Cilium ONCE in cloud-init via the canonical Helm
chart at the SAME version (1.16.5) that platform/cilium/blueprint.yaml
declares for bp-cilium. When Flux later reconciles
clusters/<sovereign_fqdn>/bootstrap-kit/01-cilium.yaml it adopts the
existing Helm release (release name + namespace match), so the wizard's
ownership model stays single-source-of-truth (Flux + Blueprints) after
the bootstrap exception.

Per INVIOLABLE-PRINCIPLES.md #3, this Helm install is the one-shot
bootstrap exception authorised by "the GitOps engine is Flux —
everything ELSE gets installed by Flux". Cilium IS the CNI Flux needs,
so it cannot be installed by Flux without bootstrapping itself first.
Every other component still flows through the Blueprint pipeline.

Verified: ssh'd into the running omantel cp1 (firewall opened for the
operator IP), ran the same `helm install cilium ...` command this
patch encodes, and the cluster recovered — node Ready, Flux pods
scheduling, GitRepository pulling. Will redeploy from scratch with
the patched cloud-init to validate the full unattended path.

Cloud-init is the Phase-0 OpenTofu artifact baked into the Hetzner
server's user_data, so this change activates on the NEXT `tofu apply`
that creates a new control-plane server. Existing omantel cp1 is
manually unblocked already; new Sovereigns provisioned after the
catalyst-api image carrying this template is rolled out will not hit
the deadlock.
2026-04-29 15:29:10 +02:00
hatiyildiz
330211d275 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168, PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it meant DNS records were
written twice (when it worked) and the provision flow broke entirely
(when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:57 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both fixed by extending the regex with the cpx* family AND permitting
the empty string on worker_size when the operator runs a solo Sovereign.
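
A minimal sketch of the widened check, expressed in shell for
illustration (the authoritative rule is the HCL validation block in
variables.tf):

  size_ok()   { [[ "$1" =~ ^c(c|a|p)?x[0-9]+$ ]]; }   # cx | ccx | cax | cpx
  worker_ok() { [[ -z "$1" ]] || size_ok "$1"; }      # empty = solo Sovereign
  size_ok cpx32 && echo "control_plane_size cpx32 accepted"
  worker_ok ""  && echo "empty worker_size accepted"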

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire; the run is no longer blocked at
module-validation time and proceeds until it needs a real Hetzner token.
2026-04-29 14:43:52 +02:00
hatiyildiz
4ee9e7dd6f fix(wizard): topology before provider; per-provider SKU catalog; per-region sizing
The wizard step order was inverted: it asked for the provider before the
topology, then put hetzner-only SKUs inside the topology step. Topology
decides how many regions exist; provider is a per-region property; SKU
vocabulary is per-provider (cx32 means nothing on Azure). Fixes all three.

New step order (WIZARD_STEPS + WizardPage STEPS): Org -> Topology ->
Provider -> Credentials -> Components -> Domain -> Review.

Per-provider SKU catalog at products/catalyst/bootstrap/ui/src/shared/
constants/providerSizes.ts replaces the legacy hetzner-only HETZNER_NODE_SIZES.
Five providers (hetzner, huawei, oci, aws, azure), each with realistic SKU
options drawn from that vendor's native instance-type vocabulary. Every
SKU read in the wizard goes through PROVIDER_NODE_SIZES[provider] -- no
SKU literal lives anywhere else.

StepProvider now renders one card per topology slot. Each card carries:
provider chooser, that provider's region picker, that provider's
control-plane SKU, that provider's worker SKU + count. Cost rollup sums
each region's (cp + worker*count) at its OWN provider's pricing, so a
mixed-cloud topology computes correctly.

StepTopology drops the SkuCard + NodeSizingPanel; it now captures only
the topology template, HA flag, and AIR-GAP add-on.

Per-region store fields (regionControlPlaneSizes, regionWorkerSizes,
regionWorkerCounts) replace the singular controlPlaneSize/workerSize/
workerCount as the canonical shape. Migration in store.merge() hydrates
the arrays from any persisted singular fields; the cx22 legacy default
is treated as "no selection" so a hetzner-only id never leaks into a
non-hetzner region.

Backend Request gains an optional Regions []RegionSpec field. Validate
mirrors Regions[0] into the legacy singular fields for the existing
solo-Hetzner writeTfvars path. infra/hetzner/variables.tf accepts the
list-of-objects shape; the for_each iteration that activates the rest
of the regions is the multi-region tofu wiring follow-up. Door open
structurally; no shape compromised.

Dead code removed: StepInfrastructure and shared/constants/hetzner.ts
(both orphaned, contained the only HETZNER_NODE_SIZES reference outside
the catalog).

Gates: tsc --noEmit, vite build, vitest (149 tests), go vet, go test
(provisioner + handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:33 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.
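
A minimal sketch of the health-checked failover shape such a lua-record
takes (zone, name, and IPs illustrative; docs/MULTI-REGION-DNS.md is the
canonical reference):

  # requires enable-lua-records=yes in pdns.conf
  pdnsutil add-record omani.works console LUA \
    "A \"ifportup(443, {'192.0.2.10', '198.51.100.20'})\""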

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open port (80, 443, 6443, ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30
  (sketched after this list).
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.
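
A minimal sketch of the sshd drop-in the templates write (path, file
name, and exact directive set per the templates; shown for illustration):

  printf '%s\n' \
    'PasswordAuthentication no' \
    'PermitRootLogin prohibit-password' \
    'AllowTcpForwarding no' \
    'X11Forwarding no' \
    'MaxAuthTries 3' \
    'LoginGraceTime 30' \
    > /etc/ssh/sshd_config.d/99-catalyst.conf
  systemctl reload ssh 2>/dev/null || systemctl reload sshd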

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called the Hetzner Cloud API directly + exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
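
A minimal sketch of the command sequence the invoker drives (workdir path
illustrative; the real code streams each output line as an SSE event):

  cd /var/lib/catalyst/deployments/<id>/workdir   # staged copy of infra/hetzner/
  tofu init -input=false
  tofu plan -input=false -out=tfplan
  tofu apply -input=false tfplan
  tofu output -json   # control_plane_ip, load_balancer_ip, console_url, ...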

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00