3dea4e2cd8
64 Commits

---

`fcfed6408c` feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101) (#1226)
* feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101)

  Follow-up to #1223. The Flux Kustomization on every Sovereign points at clusters/_template/bootstrap-kit/ and post-build-substitutes per-Sovereign vars (SOVEREIGN_FQDN, MARKETPLACE_ENABLED, ...). The per-Sovereign overlay file at clusters/<sov>/bootstrap-kit/01-cilium.yaml that #1223 added is therefore dead code (Flux doesn't read that path). The canonical mechanism is to extend the template with envsubst placeholders and thread the values through tofu vars.

  Wires five layers end-to-end:

  1. clusters/_template/bootstrap-kit/01-cilium.yaml — adds `cluster.name: ${CLUSTER_MESH_NAME:=}` and `cluster.id: ${CLUSTER_MESH_ID:=0}` plus `clustermesh.useAPIServer: true` and NodePort 32379. Empty defaults = single-cluster Sovereign (no peer connects); the cilium subchart accepts an empty cluster.name when id=0.
  2. infra/hetzner/cloudinit-control-plane.tftpl — adds CLUSTER_MESH_NAME / CLUSTER_MESH_ID to the bootstrap-kit Kustomization's postBuild.substitute block (alongside SOVEREIGN_FQDN, MARKETPLACE_ENABLED, PARENT_DOMAINS_YAML).
  3. infra/hetzner/variables.tf — declares cluster_mesh_name (string, default "") and cluster_mesh_id (number, default 0, validated 0-255).
  4. infra/hetzner/main.tf — the primary cloud-init passes var.cluster_mesh_{name,id} verbatim. Secondary regions (when var.regions[i>0] is non-empty per slice G3) auto-derive each peer's name as `<sovereign-stem>-<region-code-no-digits>` and increment the id from var.cluster_mesh_id+1. Per-region override via the new RegionSpec.ClusterMeshName field.
  5. products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — adds ClusterMeshName + ClusterMeshID to Request and threads them into writeTfvars(); RegionSpec gains ClusterMeshName for per-peer override.

  Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the chart-side default is intentionally empty — the operator request OR a per-Sovereign overlay must supply the values when ClusterMesh is enabled. The allocation registry lives at docs/CLUSTERMESH-CLUSTER-IDS.md (introduced in #1223).

  Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33 follow-up to #1223

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): escape $ in tftpl comments referencing envsubst placeholders

  `tofu validate` reads `${CLUSTER_MESH_NAME}` inside YAML comments as a template variable reference; the comment was meant to refer to the Flux envsubst placeholder consumed downstream by the bootstrap-kit cilium HelmRelease. Escaped both refs with `$$` per Terraform's templatefile escape syntax so the comment renders verbatim.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): replace coalesce with conditional in secondary_region_cluster_mesh_name

  coalesce errors when every arg is empty (the not-in-mesh path). Switch to a conditional that yields '' when both the per-region override AND var.cluster_mesh_name are empty.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
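The layer-1 anchor can be sketched as the following values fragment. Flux's post-build substitution accepts bash-style `${VAR:=default}` defaults, which is what makes the empty single-cluster fallback work; the key names below follow the commit, but the surrounding file layout is an assumption, not a copy of the repo:

```yaml
# Sketch of the _template chart values (shape assumed). A Sovereign whose
# Kustomization supplies no CLUSTER_MESH_* substitutes renders the
# single-cluster defaults.
cluster:
  name: ${CLUSTER_MESH_NAME:=}   # empty => not a mesh member
  id: ${CLUSTER_MESH_ID:=0}      # 0 => single-cluster default
clustermesh:
  useAPIServer: true
  apiserver:
    service:
      type: NodePort
      nodePort: 32379
```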

---

`7ca4abddd2` feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)

  Implements the server side of the Cloudflare KV lease-witness pattern that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare Workers KV namespace with read-then-CAS-write semantics enforced via the If-Match header — exact contract per K-Cont-3 #1158 report (item d) and the canonical-seams "Cloudflare KV Worker contract" entry.

  Routes:

  - GET /lease/<slot-url-encoded> → 200 + LeaseState | 404 | 401
  - PUT /lease/<slot> → 200 + LeaseState | 412 + state | 401
  - DELETE /lease/<slot> → 204 | 412 | 401

  All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:

  1. If-Match: 0 = first-acquire-on-empty-slot
  2. Generation increments unconditionally (incl. Release)
  3. 412 includes current state body
  4. TTL eviction is server-authoritative in stamping (Worker doesn't auto-evict — controller's IsHeldBy decides)
  5. X-Holder mismatch on DELETE returns 412 (stale region can't evict new primary)
  6. Bearer token validation against env-bound allow-list
  7. Optional X-Lease-Slot header logged for KV granularity

  Files:

  - products/continuum/cloudflare-worker/{package.json, tsconfig.json, wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore, DESIGN.md, src/{index,auth,kv,types}.ts, src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts}
  - infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf + README.md
  - .github/workflows/cloudflare-worker-leases-build.yaml (event-driven, NO cron — push-on-paths + PR + workflow_dispatch)

  Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean. tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle. Per the brief: tofu module ships ready for operator action — no auto-deploy. Operator runbook in DESIGN.md §"Operator runbook — deploy a new Sovereign".

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)

  `tofu validate` failed on `cloudflare_workers_secret` — that resource was REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline `bindings = [...]` array on `cloudflare_workers_script` with `type = "secret_text"`). Same security guarantee — encrypted at rest in CF, never visible via the dashboard read API once written. `tofu fmt` also wanted versions.tf alignment plus the .terraform.lock.hcl pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/, which commits its lock file). Per Inviolable Principle #5 the bearer token value still flows from TF_VAR_bearer_tokens_csv, extracted at apply time from a K8s SealedSecret — never inlined here.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
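The read-then-CAS rule the Worker enforces can be sketched as a small decision function. `LeaseState` and its fields here are illustrative stand-ins, not the Worker's actual TypeScript types (those live in src/types.ts); the sketch only captures trap behaviors 1-3 from the list above:

```go
package main

import "fmt"

// LeaseState is a hypothetical stand-in for the Worker's stored lease value.
type LeaseState struct {
	Holder     string
	Generation int
}

// putLease models the PUT /lease/<slot> decision: If-Match must equal the
// current generation ("0" acquires an empty slot); on mismatch the caller
// gets 412 plus the current state so it can re-read and retry.
func putLease(current *LeaseState, ifMatch int, next LeaseState) (int, *LeaseState) {
	switch {
	case current == nil && ifMatch == 0:
		next.Generation = 1 // first acquire on an empty slot
		return 200, &next
	case current != nil && ifMatch == current.Generation:
		next.Generation = current.Generation + 1 // generation increments unconditionally
		return 200, &next
	case current == nil:
		return 412, nil
	default:
		return 412, current // 412 includes the current state body
	}
}

func main() {
	code, state := putLease(nil, 0, LeaseState{Holder: "fsn1"})
	fmt.Println(code, state.Holder, state.Generation) // 200 fsn1 1
	code2, cur := putLease(state, 0, LeaseState{Holder: "hel1"})
	fmt.Println(code2, cur.Holder) // 412 fsn1: stale writer sees the real holder
}
```

The same compare-then-write shape is what makes a stale region's DELETE safe: the X-Holder check is just this predicate applied to the holder field instead of the generation.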

---

`8988cd9e4f` feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard payload's regions[1..N] entries silently no-op. The EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap.

Architecture: hybrid singular-path + secondary-region overlay.

- The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched — every existing Sovereign state (omantel, otech*) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc.) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The hcloud_firewall + hcloud_ssh_key stay shared (one tenant boundary per Sovereign).

Why hybrid, not full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive — secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook required. Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR.

Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/.

Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner):

- legacy_no_regions_payload (var.regions=[])
- single_region_entry_does_not_double_provision (len==1)
- three_region_mgmt_fsn_hel (EPIC-6 shape)
- same_region_duplicates_produce_distinct_keys
- non_hetzner_regions_are_filtered_out (oci entries skipped)

All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/**. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.

Validation:

    $ tofu validate
    Success! The configuration is valid.
    $ tofu fmt -check -recursive
    exit=0
    $ tofu test
    tests/multi_region.tftest.hcl... pass
      run "legacy_no_regions_payload"... pass
      run "single_region_entry_does_not_double_provision"... pass
      run "three_region_mgmt_fsn_hel"... pass
      run "same_region_duplicates_produce_distinct_keys"... pass
      run "non_hetzner_regions_are_filtered_out"... pass
    Success! 5 passed, 0 failed.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
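The "{cloudRegion}-{index}" keying can be sketched in HCL. The attribute names (`cloud_region`, `provider`) and resource shape are assumptions for illustration; the real main.tf carries many more attributes per server:

```hcl
# Sketch of the secondary-region overlay (attribute names assumed).
# regions[0] stays on the legacy count-based path; regions[1+] become a
# parallel for_each map whose keys embed the list index, so duplicate
# cloud regions (fsn1 twice) still produce distinct resource addresses.
locals {
  secondary_regions = {
    for idx, r in var.regions :
    "${r.cloud_region}-${idx}" => r
    if idx > 0 && r.provider == "hetzner"
  }
}

resource "hcloud_server" "secondary_control_plane" {
  for_each = local.secondary_regions
  name     = "cp-${each.key}" # e.g. cp-fsn1-1, cp-hel1-2
  # ... server_type, image, subnet attachment per region ...
}
```

Because `hcloud_server.secondary_control_plane` is a new resource address, no existing state file carries it, which is exactly why the hybrid approach avoids `tofu state mv`.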

---

`8e312cd244` fix(infra/hetzner): strip any-indent comments, gate user_data ≤ 30 KiB at plan-time (#966) (#967)
Live blocker. Provisioning otech114 (deployment 5c3eea37d3aacda6, fsn1) failed at `tofu apply` with:

    Error: invalid input in field 'user_data' (invalid_input):
    [user_data => [Length must be between 0 and 32768.]]
      with hcloud_server.control_plane[0] on main.tf line 309

Hetzner Cloud's HARD 32 KiB cap on user_data was breached after #921 inlined a base64-encoded worker cloud-init (~4.8 KB) into the CP cloud-init for cluster-autoscaler's HCLOUD_CLOUD_INIT key, on top of #827's multi-domain substitutions. Rendered size: ~37 KB.

Root cause: the prior strip regex `(?m)^[ ]{0,2}# .*\n` was scoped to indent-0/2 comments only — leaving ~14 KB of indent-6+ comments INSIDE write_files content blocks (e.g. flux-bootstrap.yaml's triplicate Kustomization documentation). Those comments are inert: every write_files entry is YAML / JSON / key=value config (no shell scripts), and parsers ignore `#`-prefixed lines entirely.

Changes:

1. New strip regex `(?m)^[ ]*#( |$).*\n` strips ANY-indent comment lines that start with `#` followed by a space or EOL. Preserves:
   - `#cloud-config` on line 1 (no space after `#`)
   - `#!` shebangs (no space after `#`)
   - `#pragma`-style directives (`#` followed by non-space non-EOL)
   Applied to both `local.control_plane_cloud_init` and `local.worker_cloud_init`.
2. Plan-time guardrail via `lifecycle.precondition` on `hcloud_server.control_plane` and `hcloud_server.worker`. Fails plan (not apply) when `length(local.<*>_cloud_init) > 30720` bytes (30 KiB = 32 KiB hard cap minus future-additions buffer). Bloat-creep that silently re-eats the headroom now fails fast at plan-time BEFORE the network/LB/firewall/SSH-key resources get created.

Verified rendered sizes (Python simulation of templatefile + strip, substitutions match real otech114 inputs):

- CP cloud-init: 79404 bytes raw → 21144 bytes stripped (margin: 11624 under hard cap, 9576 under guardrail)
- Worker cloud-init: 3254 bytes raw → 2410 bytes stripped (b64-encoded for HCLOUD_CLOUD_INIT: 3216 bytes)

The `#cloud-config` first line is preserved. All 18 write_files entries and 43 runcmd entries parse intact. YAML/JSON/conf contents remain valid post-strip (comments are documentation only at the file-format level).

Closes #966

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
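Terraform's regex engine is RE2, so the new strip regex behaves identically under Go's regexp package, which makes the preserve/strip rules easy to check directly. The input below is illustrative, not a real cloud-init file:

```go
package main

import (
	"fmt"
	"regexp"
)

// commentLine is the any-indent strip pattern from the commit: a line of
// optional spaces, then "#" followed by a space or end-of-line. This leaves
// "#cloud-config", "#!"-shebangs, and "#pragma"-style lines untouched
// because their "#" is followed by a non-space character.
var commentLine = regexp.MustCompile(`(?m)^[ ]*#( |$).*\n`)

func stripComments(s string) string {
	return commentLine.ReplaceAllString(s, "")
}

func main() {
	in := "#cloud-config\n# top-level doc comment\nwrite_files:\n      # indent-6 comment\n  - path: /etc/example\n"
	fmt.Print(stripComments(in))
	// prints:
	// #cloud-config
	// write_files:
	//   - path: /etc/example
}
```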

---

`d1431bed09` fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — the bp-cluster-autoscaler-hcloud chart shipped without HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is not specified" on every Sovereign (otech112 evidence). The HelmRelease reports Ready=True (the Helm install succeeded) but the Pod CrashLoopBackOffs invisibly behind the false-positive condition.

Closes #916 — the wizard let operators dispatch unbuildable topologies (otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not encode regional orderability. Hetzner rejected the worker creation 41s into `tofu apply`, after Phase-0 had already created the CP + network + LB + firewall.

Chart fix (issue #921):

- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the umbrella chart (base64-encoded per the upstream contract).
- Render the `hetzner-node-config` Secret unconditionally with both keys so the upstream Deployment's secretKeyRef references resolve cleanly during `helm template` AND in the live cluster, regardless of overlay state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps it under `flux-system/cloud-credentials.hcloud-cloud-init`; the bootstrap-kit overlay lifts that key via Flux `valuesFrom` into `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump the bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies the Secret + env var wiring + no regression of HCLOUD_TOKEN — runs in CI's blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):

- Add `availableRegions?: string[]` to the NodeSize interface; encode cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere new) per the Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by the selected region; auto-swaps the current SKU to the recommended default when a region change drops it out of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate `provisioner.Request.Validate()` with the same predicate so a stale wizard build OR a direct API caller bypassing the UI cannot dispatch otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the legacy singular path.

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API side. The chart smoke render + helm template gate the env wiring at publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
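The shared predicate mirrored into sku_availability.go can be sketched as below. The matrix values come from the commit; the map shape and the "unlisted SKU = unconstrained" convention are assumptions made for the sketch, not the file's actual layout:

```go
package main

import "fmt"

// availableRegions is a hypothetical mirror of the wizard's
// availableRegions field: an empty list means "orderable nowhere new".
var availableRegions = map[string][]string{
	"cpx21": {},
	"cpx31": {},
	"cpx32": {"fsn1", "nbg1", "hel1"},
}

// isSkuAvailableInRegion sketches the two-sided predicate; SKUs absent
// from the matrix are treated as unconstrained so new catalog entries
// don't get rejected by default.
func isSkuAvailableInRegion(sku, region string) bool {
	regions, constrained := availableRegions[sku]
	if !constrained {
		return true
	}
	for _, r := range regions {
		if r == region {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSkuAvailableInRegion("cpx32", "ash"))  // false: otech109's failure mode
	fmt.Println(isSkuAvailableInRegion("cpx32", "fsn1")) // true
}
```

Gating `Request.Validate()` with the same predicate is the part that protects against a stale wizard build: the UI filter alone only helps clients that loaded the new matrix.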

---

`2ff50f0591` fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on a fresh Sovereign):

#952 — the bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls the PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} images anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on the Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update the _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field).
- Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug.

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting that the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting that both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that a helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
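The strict match-or-refuse bump can be sketched as follows. The workflow itself uses sed in shell; this Go sketch only demonstrates the guard logic, and the exact indentation/regex shape is inferred from the commit's description:

```go
package main

import (
	"fmt"
	"regexp"
)

// smeTagLine matches exactly the expected values.yaml line shape. If a
// contributor renames or re-indents the field, zero (or multiple) matches
// trigger a refusal instead of a silent no-op bump.
var smeTagLine = regexp.MustCompile(`(?m)^  smeTag: "[0-9a-f]{7,40}"$`)

func bumpSmeTag(values, sha string) (string, error) {
	if len(smeTagLine.FindAllString(values, -1)) != 1 {
		return "", fmt.Errorf("smeTag line shape changed; refusing to auto-bump")
	}
	return smeTagLine.ReplaceAllString(values, `  smeTag: "`+sha+`"`), nil
}

func main() {
	in := "images:\n  smeTag: \"abc1234\"\n"
	out, err := bumpSmeTag(in, "def5678")
	fmt.Println(out, err)
}
```

Refusing on shape change is the important design choice: a rewrite step that silently matches nothing is exactly how bug #953 stayed invisible through a merge + chart-bump chain.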

---

`e08d8721e1` fix(pdm/dynadot): pre-register glue records before set_ns (#900) (#906)
Multi-domain Day-2 add-domain on a Sovereign was failing with Dynadot's "'ns1.<sov>.omani.works' needs to be registered with an ip address before it can be used" error. Dynadot rejects set_ns whenever the NS hostnames aren't registered as account-level "host records" first.

This change wires the glue pre-registration into the PDM dynadot adapter as an optional registrar.GlueRegistrar interface, threads the Sovereign's load-balancer IPv4 from cloud-init through Flux postBuild into the chart's `global.sovereignLBIP`, and forwards it via catalyst-api's pdmFlipNS to PDM's /set-ns endpoint as a new `glueIP` field.

PDM's SetNS handler calls RegisterGlueRecord for each out-of-bailiwick NS before SetNameservers, with idempotent get_ns → register_ns / set_ns_ip semantics so retries are free.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
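The register-before-set ordering with free retries can be sketched as below. `GlueRegistrar` and `RegisterGlueRecord` are named in the commit, but the method signature and the in-memory fake are assumptions for illustration:

```go
package main

import "fmt"

// GlueRegistrar is the optional seam the adapter implements; the signature
// here is a sketch, not the real interface.
type GlueRegistrar interface {
	RegisterGlueRecord(host, ip string) error
}

// fakeDynadot models idempotent glue registration: re-registering an
// existing host with the same IP is a no-op, so NS-flip retries are free.
type fakeDynadot struct{ glue map[string]string }

func (d *fakeDynadot) RegisterGlueRecord(host, ip string) error {
	if existing, ok := d.glue[host]; ok && existing == ip {
		return nil // already registered: nothing to do
	}
	d.glue[host] = ip
	return nil
}

// setNS sketches the handler's ordering: pre-register every NS host with
// the Sovereign's LB IP, and only then perform the actual set_ns call.
func setNS(reg GlueRegistrar, nameservers []string, glueIP string) error {
	for _, ns := range nameservers {
		if err := reg.RegisterGlueRecord(ns, glueIP); err != nil {
			return err
		}
	}
	return nil // then: SetNameservers(...)
}

func main() {
	d := &fakeDynadot{glue: map[string]string{}}
	_ = setNS(d, []string{"ns1.sov.example", "ns2.sov.example"}, "203.0.113.7")
	_ = setNS(d, []string{"ns1.sov.example", "ns2.sov.example"}, "203.0.113.7") // retry is free
	fmt.Println(len(d.glue)) // 2
}
```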

---

`7bfd6df588` fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
5 stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a fresh post-handover Sovereign — surfaced live on otech103, 2026-05-05 — plus a 6th gap (the ghcr-pull reflector for catalyst-system). All six are fixed in one PR so a single chart bump + cloud-init re-render closes the gap end-to-end.

Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL=https://pool.openova.io. The in-cluster Service default only resolves on contabo; on Sovereigns every Day-2 POST died with NXDOMAIN.

Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env from a new pdm-basicauth Secret, and have pdmFlipNS SetBasicAuth from those envs. The PDM public ingress at pool.openova.io is gated by Traefik basicAuth; calls without Authorization: Basic returned 401. optional=true so contabo + CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable Principle #10, the credentials only ever live in Pod env and are read once per call by pdmFlipNS — they never enter a logged struct or persisted record.

Bug 3 (code, parent_domains.go): the pdmFlipNS body now includes the required nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema requires it; the previous body got 422 missing-nameservers.

Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to the SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover Sovereign no Deployment record is persisted, so without this fallback GET /parent-domains returned {"items":[]} and the propagation panel showed expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml from the sovereign-fqdn ConfigMap.

Bug 5 (chart, httproute.yaml): catalyst-ui's /auth/* PathPrefix narrowed to Exact /auth/handover. The previous PathPrefix collided with the OIDC PKCE redirect_uri /auth/callback — catalyst-api 404s on that path because it only registers /api/v1/auth/callback, breaking login after handover-JWT-cookie expiry. The Exact match keeps /auth/handover routed to catalyst-api while every other /auth/* path falls through to catalyst-ui's React Router for client-side OIDC.

Bug 6 (cloud-init): the ghcr-pull + harbor-robot-token + new pdm-basicauth Reflector annotations now enumerate explicit allowed/auto-namespaces (sme, catalyst, catalyst-system, gitea, harbor) instead of an empty string. The ambiguous empty-string interpretation caused otech103 to require a manual catalyst-system mirror creation; the explicit list back-ports the verified working state.

Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields + tfvars emission so the contabo catalyst-api can stamp the credentials onto every Sovereign provision request. variables.tf adds matching pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default empty) so older provisioner builds that pre-date this change keep rendering valid cloud-init (the Secret renders with empty values and Pod start is unaffected).

Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes the architectural blockers tracked in #879; the catalyst-api image rebuild + chart republish run via the existing CI pipelines (services-build.yaml + blueprint-release.yaml) on this commit's SHA.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
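The bug-5 fix hinges on Gateway API path match types: an `Exact` match is more specific than any `PathPrefix`, so /auth/handover wins its route while /auth/callback falls through. A sketch of the corrected rule shape (backend names and ports are assumptions, not copied from httproute.yaml):

```yaml
# Hypothetical HTTPRoute rules fragment illustrating the fix.
rules:
  - matches:
      - path:
          type: Exact
          value: /auth/handover   # only this path reaches catalyst-api
    backendRefs:
      - name: catalyst-api
        port: 8080
  - matches:
      - path:
          type: PathPrefix
          value: /                # /auth/callback and everything else
    backendRefs:
      - name: catalyst-ui        # client-side React Router handles OIDC
        port: 80
```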

---

`e96741a0ca` feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The operator brings 1+ parent domains at signup (`omani.works` for own use, `omani.trade` for the SME pool, etc.) and may add more post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):

- New `zones: []` values key listing parent domains to bootstrap.
- New Helm post-install/post-upgrade hook Job (templates/zone-bootstrap-job.yaml) that POSTs each entry to /api/v1/servers/localhost/zones at install time. Idempotent on HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):

- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}` values.
- New templates/sovereign-wildcard-certs.yaml renders one cert-manager.io/v1 Certificate per zone (each `*.<zone>` + apex) via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert renews independently. Skips entirely when parentZones is empty so the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml retains ownership of `sovereign-wildcard-tls` (avoids a helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded into the catalyst-api Pod as CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):

- New internal/powerdns package with a typed Client (CreateZone, ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the typed client when wired via SetPowerDNSZoneClient — the admin-console "Add another parent domain" flow now creates real zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl. 201/409/412/500 + custom NS + custom serverID).

Bootstrap-kit slot integration:

- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bumps to bp-catalyst-platform 1.4.0 and threads `parentZones: ${PARENT_DOMAINS_YAML}` (the same source-of-truth string, so the two slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable (defaults to a single-zone array derived from sovereign_fqdn) → cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:

- The multi-zone overlay (omani.works + omani.trade) renders 2 PowerDNS zone-create API calls in the bootstrap Job AND 2 Certificate resources (`*.omani.works`, `*.omani.trade`) in bp-catalyst-platform.
- The single-zone fallback (PARENT_DOMAINS_YAML defaults to `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
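The "idempotent on HTTP 409/412" rule that both the hook Job and the typed client rely on reduces to a single status-code decision, sketched here (function name and error text are illustrative, not the client's actual API):

```go
package main

import "fmt"

// zoneCreateOutcome sketches the idempotency rule: 201 means the zone was
// created, 409/412 mean it already exists, so re-running the bootstrap
// hook Job after an upgrade or chart bump never fails.
func zoneCreateOutcome(status int) error {
	switch status {
	case 201, 409, 412:
		return nil
	default:
		return fmt.Errorf("powerdns zone create failed: HTTP %d", status)
	}
}

func main() {
	for _, s := range []int{201, 409, 500} {
		fmt.Println(s, zoneCreateOutcome(s))
	}
}
```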

---

`05065b66d6` fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs

Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04. GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in those DCs with:

    {"error":{"code":"invalid_input","message":"unsupported location for server type"}}

Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate DELETE. cpx22 + cpx32 were also probed as a sanity check and returned ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises prices for every (SKU, location) pair regardless of orderability.

Conclusion: NO SKU bump-back. cpx22 + cpx32 (PR #744) remain the floor. The README + variables.tf docstrings now carry the durable reproducer so future engineers don't re-attempt cpx21/cpx31.

#753 — kubectl retry / LKG observer reliability

/tmp/autopilot.sh updated (the script lives outside the repo, on the VPS):

- Every kubectl call carries --request-timeout=8s so a hung TLS handshake surfaces as a fast empty result rather than a 30s+ stall.
- Last-known-good (LKG) state is held across transient flakes: hr/cert/nodes no longer flip to "0/0 nodes=0" on a single failed poll.
- Only 3 consecutive transients count as a real failure; below the threshold the observer prints "hr=<LKG> (transient N/3)".

UI side: the wizard's StatusPill / ApplicationPage drive off SSE from catalyst-api (useDeploymentEvents.ts), not direct kubectl polling, so no UI change is needed. catalyst-api itself uses client-go (helmwatch / phase1_watch), not exec kubectl, so its observer is not subject to the same shell-out flake.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
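The LKG-with-threshold behavior is a small state machine; a sketch follows, assuming the 3-consecutive-transients threshold from the commit (the real logic lives in a shell script outside the repo, so the types here are purely illustrative):

```go
package main

import "fmt"

// observer sketches the last-known-good logic: a polled value only
// replaces the LKG on success, and only 3 consecutive failed polls
// count as a real failure.
type observer struct {
	lkg        string
	transients int
}

func (o *observer) poll(value string, ok bool) string {
	if ok {
		o.lkg, o.transients = value, 0
		return value
	}
	o.transients++
	if o.transients >= 3 {
		return "FAILED"
	}
	return fmt.Sprintf("%s (transient %d/3)", o.lkg, o.transients)
}

func main() {
	o := &observer{}
	fmt.Println(o.poll("hr=37/37", true)) // hr=37/37
	fmt.Println(o.poll("", false))        // hr=37/37 (transient 1/3)
	fmt.Println(o.poll("", false))        // hr=37/37 (transient 2/3)
	fmt.Println(o.poll("", false))        // FAILED
}
```

Resetting the counter on any success is what keeps a flaky-but-alive apiserver from ever reaching the threshold.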

---

`e855ab0dfe` fix(k3s): taint CP node-role.kubernetes.io/control-plane:NoSchedule when workers exist (#751) (#755)
Root cause of the "apiserver flake / cpx22 too small / 8 stuck HRs"
chain: the k3s server install in cloudinit-control-plane.tftpl set
--node-label but no --node-taint. By k3s default the server node is
fully schedulable, so on a 1-CP + N-worker Sovereign with the
37-HelmRelease bootstrap-kit + guest workloads (bp-keycloak / bp-cnpg /
bp-harbor / bp-catalyst-platform / SME microservices), the scheduler
distributes guest pods onto the CP. They eat its memory, crowd
kubelet/etcd/apiserver, kubectl flakes, Helm post-install hooks time
out, HelmReleases get stuck mid-reconcile.
Fix: add --node-taint node-role.kubernetes.io/control-plane=true:NoSchedule
to the INSTALL_K3S_EXEC string, so the CP is reserved for system +
bootstrap controllers. cilium agent (DaemonSet) and cilium-operator
default to {operator: Exists} tolerations upstream — they tolerate
the taint and continue to run on the CP. cert-manager and flux2 default
to tolerations: [] — on multi-node Sovereigns they correctly land on
workers, which is the desired separation. Guest workloads do not
tolerate the taint and are pushed to workers where they belong.
Conditional on worker_count > 0: a Catalyst-Zero / solo Sovereign has
only the CP, so tainting NoSchedule there leaves no schedulable node
and the cluster never becomes ready. The Tofu inline ternary
"\${worker_count > 0 ? \"--node-taint ...\" : \"\"}" omits the flag
entirely in solo mode — k3s default (CP fully schedulable) carries
everything.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
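The conditional flag can be sketched in HCL; the variable and local names here are assumptions, since only the ternary itself is quoted in the commit:

```hcl
# Sketch of the conditional taint (names assumed). Solo Sovereigns
# (worker_count == 0) omit the flag entirely so the lone CP stays
# schedulable; multi-node Sovereigns reserve the CP for system pods.
locals {
  cp_taint_flag = var.worker_count > 0 ? "--node-taint node-role.kubernetes.io/control-plane=true:NoSchedule" : ""
}
```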

---

`ceeefd7829` fix(cloud-init): quote MARKETPLACE_ENABLED so postBuild.substitute is map[string]string (#746)
ROOT CAUSE FOUND for the post-PR-#710 zero-touch handover stall (otech85
through otech89). Cloud-init template emitted:
    postBuild:
      substitute:
        SOVEREIGN_FQDN: otech89.omani.works
        MARKETPLACE_ENABLED: false   # ← UNQUOTED YAML BOOL
Tofu interpolates `${marketplace_enabled}` (a string variable holding
"true"|"false") into the rendered cloud-init. Without quotes, kubectl's
YAML parser converts `false`/`true` into BOOL, so the rendered
Kustomization manifest violates the kustomize.toolkit.fluxcd.io/v1
postBuild.substitute schema (map[string]string).
Live evidence on otech89 (and earlier otech85-88 with same SHA):
- GitRepository CRD apply → succeeds (no postBuild, no schema issue)
- 3× Kustomization apply → silently rejected by validator
- flux-system kustomize-controller has 0 reconciliable Kustomizations
- bootstrap-kit never lands → 0 HRs ever Ready → wizard stalls forever
Quote the value: `MARKETPLACE_ENABLED: "${marketplace_enabled}"` so it
renders as `MARKETPLACE_ENABLED: "false"` (string) and passes the CRD
validator.
This is the bug that has been blocking the 2-cycle zero-touch
verification since PR #719 introduced MARKETPLACE_ENABLED. Six
provisioning cycles burned (otech85-89 + retries) chasing it. Closes
#733 cycle-verification (the SKU work itself was correct end-to-end).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
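The fixed render looks like this: quoting forces the substituted value to a YAML string, which satisfies the Kustomization CRD's `postBuild.substitute` map[string]string schema regardless of whether Tofu interpolates "true" or "false":

```yaml
postBuild:
  substitute:
    SOVEREIGN_FQDN: otech89.omani.works
    MARKETPLACE_ENABLED: "false"   # quoted: a string, not a YAML bool
```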

---

`468c3badf8` fix(cloud-init): tolerate Crossplane Provider apply failure + retry in background (#745)
Live observation on otech88 (DID b2c528023b50ec45, 2026-05-04
11:40:42Z): the new Sovereign's flux-system reaches Ready (GitRepository
artifact stored, all 6 Flux deployments Available) but no Kustomization
CRs appear — kustomize-controller has nothing to reconcile and
hr=True=0/0 forever.
The cloud-init runcmd applies in this order:
1. cloud-credentials-secret.yaml
2. crossplane-provider-hcloud.yaml — `pkg.crossplane.io/v1 Provider`
CRD doesn't exist yet (bp-crossplane is installed by Flux below),
so this apply errors with "no matches for kind Provider in version
pkg.crossplane.io/v1"
3. flux-bootstrap.yaml — should apply 1× GitRepository + 4×
Kustomization
Empirically, only the GitRepository lands. The four Kustomization
documents in the same multi-doc YAML are not created. The exact
mechanism of failure is on-host (cloud-init runcmd output is at
/var/log/cloud-init-output.log on the Sovereign — out of reach per
"no SSH" rule), but the symptom is consistent across otech87 and
otech88 reprovisions on the new cost-optimised SKUs.
This patch is a belt-and-braces hardening:
1. Tolerate the Crossplane Provider apply's failure (`|| true`) so
the runcmd cannot propagate a non-zero exit through to whatever
downstream step is failing.
2. Add a background retry for the Crossplane Provider CR. Polls
every 30s up to 30m for the Provider CRD to appear (i.e.
bp-crossplane reconciled by Flux), then `kubectl apply` succeeds
and the loop exits. Detached via `&` so cloud-init runcmd
completes without waiting for Crossplane to be Ready.
The intent is to remove any chance the Provider apply blocks Flux
bootstrap. If Kustomizations still don't appear after this fix, the
root cause is elsewhere and a follow-up patch will land.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
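The two hardening steps can be sketched as a cloud-init fragment. The manifest path is an assumption, and the 60 × 30s loop follows the commit's "every 30s up to 30m" description:

```yaml
# Sketch of the hardened runcmd (manifest path assumed).
runcmd:
  # 1. tolerate the early apply failing while the Provider CRD is absent
  - kubectl apply -f /opt/crossplane-provider-hcloud.yaml || true
  # 2. background retry: poll up to 30m for the CRD, then re-apply and exit.
  #    Detached with & so runcmd completes without waiting on Crossplane.
  - |
    ( for i in $(seq 1 60); do
        if kubectl get crd providers.pkg.crossplane.io >/dev/null 2>&1; then
          kubectl apply -f /opt/crossplane-provider-hcloud.yaml && break
        fi
        sleep 30
      done ) &
```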

---

`b02fc3788a` fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request

Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z). After PR #742 fixed the empty SKU strings in tfvars, the next blocker appeared: writeTfvars was emitting `"regions": null` (a Go nil slice marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

    validation {
      condition = alltrue([
        for r in var.regions : contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
      ])
    }

The `for r in var.regions` iteration fails on null with:

    Error: Iteration over null value
    on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit that shape explicitly via a coalesceRegions(req.Regions) helper that turns nil into an empty slice. Operator overrides round-trip unchanged.

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions serialises as JSON `[]`, never `null`, when the request has no per-region overrides.

Builds on PR #742.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)

Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the cpx21 CP default from PR #741 fell apart at apply time —

    Error: Server Type "cpx21" is unavailable in "fsn1" and can no longer be ordered

Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog (`/v1/server_types`) but are NOT in the per-DC orderable list (`available_for_migration` on `/v1/datacenters`) for any EU DC (fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on for new Sovereigns in those regions.

Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
• cpx11 (2 vCPU / 2 GB) — too small for the CP working set
• cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
• cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
• cpx42, cpx52, cpx62 — bigger and more expensive

New default per Sovereign:

| Component     | Old             | New             | Savings |
|---------------|-----------------|-----------------|---------|
| Control plane | CPX32 (€16.49)  | CPX22 (€9.49)   | €7.00   |
| Worker × 2    | CPX32 × 2 (€33) | CPX32 × 2 (€33) | €0      |
| TOTAL         | €49.47/mo       | €42.47/mo       | 14%     |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo) assumed those SKUs were orderable. They aren't in EU DCs. The 14% saving from the cpx22 CP is the largest concrete optimisation that ships TODAY without compromising the multi-node horizontal-scale agreement (issue #733): still 1 CP + 2 workers from day one.

Files changed:
- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size default cpx31 → cpx32 (back to the prior orderable choice)
- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 pricing (€5.49/mo) and CPX31 pricing (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49). Mark both as "listed but NOT orderable in EU DCs" so the wizard surfaces the constraint instead of letting operators pick a non-orderable SKU. Move recommended:true from CPX21 → CPX22. defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').
- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.
- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
994c2d1c2a
fix(provisioner): cost-optimized default sizes — cpx21 CP + cpx31 workers (38% saving) (#741)
The new Sovereign default after PR #736 / #738 / #739 was 1× CPX32 control plane + 2× CPX32 workers — €33/mo per Sovereign.

CPX32 is over-provisioned for the CP working set: the CP carries only k3s (apiserver/etcd/scheduler/controller-manager) + cilium-operator + flux controllers + cert-manager + sealed-secrets — NOT the heavy bp-keycloak/cnpg/harbor/openbao/grafana stack (those land on workers because the bootstrap-kit explicitly schedules them off the CP taint).

CP RAM budget: etcd ~512 MB + control plane ~1.5 GB + cilium/flux/cert-manager/sealed-secrets ~1 GB + OS ~512 MB ≈ 3.5 GB — fits CPX21's 4 GB. Workers stay at 8 GB on CPX31 since RAM is the binding constraint for the bootstrap-kit's worker pods, not vCPU.

New default per Sovereign:

| Component     | Old             | New             | Savings |
|---------------|-----------------|-----------------|---------|
| Control plane | CPX32 (€11/mo)  | CPX21 (€5.5/mo) | €5.5    |
| Worker × 2    | CPX32 × 2 (€22) | CPX31 × 2 (€15) | €7      |
| TOTAL         | €33/mo          | €20.5/mo        | 38%     |

Multi-node horizontal-scale agreement (issue #733) preserved: still 1 CP + 2 workers minimum from day one.

Files changed:
- infra/hetzner/variables.tf
  control_plane_size default cpx32 → cpx21
  worker_size default cpx32 → cpx31
  Validation regex unchanged (cxNN | cpxNN | ccxNN | caxNN).
- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Add CPX11, CPX21, CPX31 catalog entries. Move recommended:true from CPX32 → CPX21 (control-plane default). Add defaultWorkerSizeId() — Hetzner returns 'cpx31', other providers fall through to the defaultNodeSizeId() symmetric default.
- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  First-visit useEffect + handleSelectProvider now call defaultWorkerSizeId(provider) for the worker SKU instead of mirroring the CP SKU. Comment updated naming the cost-optimised pair.
- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx32'] → ['cpx21'].

If a Sovereign exhibits CP RAM pressure with this default, the next safe stop UP is cpx31 (4 vCPU / 8 GB, ~€7.5/mo) — never back to cpx32.

Closes #740.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e085a68585
fix(k3s): add 10.0.1.2 to --tls-san so Cilium can verify CP cert from workers (#739)
Issue #733 follow-up #2. After #738 changed Cilium's k8sServiceHost
from 127.0.0.1 to the CP private IP 10.0.1.2, Cilium's TLS verification
fails with:
Get "https://10.0.1.2:6443/api/v1/namespaces/kube-system":
tls: failed to verify certificate: x509: certificate is valid for
10.43.0.1, 127.0.0.1, 178.104.211.206, 2a01:..., ::1, not 10.0.1.2
k3s auto-generates the apiserver TLS cert with SANs covering the public
IP, the cluster service IP (10.43.0.1), and localhost — but NOT the
private subnet IP 10.0.1.2. Adding `--tls-san=10.0.1.2` to the k3s
server install command makes the cert valid for the address Cilium
(and any other in-cluster client) reaches the apiserver via.
The sovereign FQDN is already in --tls-san; this change just adds the
private-subnet anchor that the multi-node Cilium config in #738
introduced.
Verified live on otech51 (deploy SHA
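The SAN addition can be expressed either as an extra install-time flag or in k3s config form; a sketch of the latter (the FQDN placeholder is illustrative — the real value is substituted at cloud-init render time):

```yaml
# /etc/rancher/k3s/config.yaml — equivalent of the `--tls-san` install flags
tls-san:
  - ${SOVEREIGN_FQDN}   # already present before this fix
  - 10.0.1.2            # new: CP private IP that workers' Cilium dials
```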
69de64ba19
fix(cilium): k8sServiceHost 127.0.0.1 → 10.0.1.2 so workers' Cilium can reach apiserver (#738)
Issue #733 follow-up. The default cpx32 multi-node Sovereign (1 CP + 2 workers) provisioned successfully, but worker nodes stuck NotReady because cilium-agent on workers crashloop'd:

    Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system":
    dial tcp 127.0.0.1:6443: connect: connection refused

Root cause: `k8sServiceHost: 127.0.0.1` works on the k3s SERVER node (the supervisor binds localhost:6443) but FAILS on every k3s AGENT node (the agent does NOT expose the apiserver on localhost — only the supervisor on :6444). Pre-#733 every Sovereign was solo (worker_count=0), so this never fired.

Fix: point Cilium at `10.0.1.2`, the CP's stable private IP on the Sovereign's 10.0.1.0/24 subnet (cp1=10.0.1.2 per the main.tf network block). No-op on the CP (10.0.1.2 IS its own private IP) and works on workers (which already join the cluster via the same address per cloudinit-worker.tftpl `K3S_URL=https://${cp_private_ip}:6443`).

Files:
- infra/hetzner/cloudinit-control-plane.tftpl — bootstrap helm install values file written to /var/lib/catalyst/cilium-values.yaml
- platform/cilium/chart/values.yaml — Flux bp-cilium HelmRelease values (cilium_values_parity_test.go enforces the two stay aligned)

Verified live on otech50: 3× CPX32 servers running, 1 CP Ready, 2 workers registered with k3s but NotReady due to cilium init failure. After this fix workers should reach Ready, and the Phase-1 watcher sees all components Ready=True across the multi-node cluster.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
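A sketch of the relevant values keys (both copies — the bootstrap values file and the chart values — must stay aligned per the parity test; surrounding keys omitted):

```yaml
# platform/cilium/chart/values.yaml and the cloud-init-written
# /var/lib/catalyst/cilium-values.yaml (sketch of the two changed keys)
k8sServiceHost: 10.0.1.2   # CP private IP — reachable from CP and workers alike
k8sServicePort: 6443       # was 127.0.0.1, which only resolves on the k3s server node
```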
7ec25b9736
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single CPX52 control plane and zero workers — completely discarding horizontal scalability. Restore the originally agreed shape: 1 CPX32 control plane + 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total — the same aggregate footprint as a CPX52 vertical scale, but with multi-node fault tolerance and the architectural shape clusters/_template/ was designed for).

Changes:
- infra/hetzner/variables.tf — defaults: control_plane_size cx42→cpx32, worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf — add hcloud_load_balancer_target.workers so the Hetzner LB targets every node (CP + workers); the Cilium Gateway DaemonSet on every node serves ingress on its NodePort, so any node can absorb traffic for genuine horizontal scale.
- infra/hetzner/README.md — sizing rationale rewritten around horizontal scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model — INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider — first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes — `recommended: true` flag moves cpx52→cpx32; CPX52 description updated to "solo dev when worker_count=0".

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit workerCount: 0 keep working — only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path — unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
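The worker LB-target wiring could look roughly like this (a sketch — the resource type and attributes follow the hcloud provider schema, but the referenced resource and variable names here are illustrative):

```hcl
# Target every worker in addition to the CP so any node can absorb
# public traffic via the Cilium Gateway DaemonSet's NodePort.
resource "hcloud_load_balancer_target" "workers" {
  count            = var.worker_count
  type             = "server"
  load_balancer_id = hcloud_load_balancer.main.id
  server_id        = hcloud_server.worker[count.index].id
  use_private_ip   = true
}
```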
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.
Changes
=======
products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
/ → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
*.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
{{ if .Values.ingress.marketplace.enabled }} so non-marketplace
Sovereigns render the chart unchanged
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}
infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations
products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"
core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
resolves via PDM at zone-commit time (PR #710 explicit record so
caches don't depend on the *.<sov> wildcard alone)
DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
resources: 13 sme-services workloads + 2 marketplace-api + 1
HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
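The marketplace HTTPRoute shape described above could be sketched as follows (hostname, namespaces, and backend ports are illustrative; in the real chart the whole resource is gated by `.Values.ingress.marketplace.enabled` and the cross-namespace backendRefs require the accompanying ReferenceGrant):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: marketplace
  namespace: catalyst-system
spec:
  hostnames: ["marketplace.example.omani.works"]   # marketplace.<sov>
  rules:
    - matches: [{ path: { type: PathPrefix, value: /api/ } }]
      backendRefs: [{ name: marketplace-api, namespace: sme, port: 8080 }]
    - matches: [{ path: { type: PathPrefix, value: /back-office/ } }]
      backendRefs: [{ name: admin, namespace: sme, port: 8080 }]
    - matches: [{ path: { type: PathPrefix, value: / } }]
      backendRefs: [{ name: marketplace, namespace: sme, port: 8080 }]
```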
6f3e15b1ec
fix(handover): provision JWK Secret on Sovereign + inject SOVEREIGN_FQDN env (Phase-8b followup) (#692)
Two handover bugs caught live on otech48 (2026-05-03):

1. Sovereign-side catalyst-api responded to GET /auth/handover with "server misconfiguration: public key unavailable". Root cause: the K8s Secret `catalyst-handover-jwt-public` (referenced by the chart's optional Secret volume) was never materialised on the Sovereign, so the optional volume mount fell through and the JWK file was absent inside the container. 1.2.0 wired the mount but no provisioning step created the Secret.

   Fix mirrors the canonical pattern from PR #543 (ghcr-pull) and PR #680 (harbor-robot-token): cloud-init now writes the Secret manifest into the catalyst-system NS and runcmd applies it BEFORE flux-bootstrap, so the Secret exists by the time bp-catalyst-platform reconciles. Also moves the chart volume mount off the catalyst-api PVC (mountPath /etc/catalyst/handover-jwt-public, no subPath) so a leftover empty directory in the PVC from pre-#606 installs cannot collide with the re-provisioned Secret mount.

2. The /auth/handover validator rejected every valid JWT with 401 "invalid audience" because SOVEREIGN_FQDN was unset on Sovereigns — the audience check collapsed to the literal "https://console." prefix. The bp-catalyst-platform HelmRelease overlay was already setting `global.sovereignFQDN` but the chart template never plumbed it through to the Pod env. Added a SOVEREIGN_FQDN env reading `.Values.global.sovereignFQDN` (default "" so Catalyst-Zero installs, where catalyst-api is the SIGNER not the validator, stay clean).

Bumps:
- bp-catalyst-platform 1.2.4 -> 1.2.5
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease pin

Will be verified live on otech49 — a fresh provision should reach https://console.otech49.omani.works/auth/handover?token=... and exchange to a Keycloak session WITHOUT manual Secret creation.

Issue #606 followup.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d0b574bd68
fix(hetzner-tofu): add powerdns_api_key to templatefile() vars (#687)
PR #686 added var.powerdns_api_key to variables.tf and referenced it as ${powerdns_api_key} in cloudinit-control-plane.tftpl, but missed wiring it into the templatefile() vars dict in main.tf. Result on otech48:

    Invalid value for "vars" parameter: vars map does not contain key
    "powerdns_api_key", referenced at ./cloudinit-control-plane.tftpl:273

This commit closes the gap: powerdns_api_key now flows from var -> templatefile vars -> cloud-init -> Secret manifest.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
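The class of bug is worth a sketch: every `${var}` placeholder in the .tftpl must appear as a key in the `templatefile()` vars map, or `tofu plan` fails exactly as above (surrounding keys here are elided, not real):

```hcl
locals {
  control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
    # ... existing keys unchanged ...
    powerdns_api_key = var.powerdns_api_key   # the key PR #686 declared but never passed
  })
}
```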
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns even with all of:
- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in the cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

    listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed to
    bind or apply socket options: cannot bind '0.0.0.0:80': Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB program in a way that does not honour container capabilities. Even PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets "Permission denied". Cilium 1.19.3 → 1.16.5 made no difference (F1, PR #684 still ships — the version bump is sound for other reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets the Hetzner LB do the public-facing port translation:

    HCLB :80  → CP node :30080 (cilium-gateway HTTP listener)
    HCLB :443 → CP node :30443 (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover` on port 443; the high port is invisible. The high-port bind succeeds without NET_BIND_SERVICE because the kernel only gates ports below `net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve console.otech48/auth/handover end-to-end without the 502/timeout chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681) calls contabo's authoritative PowerDNS at pdns.openova.io to write DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook needs an X-API-Key Secret in the Sovereign's cert-manager namespace — PR #681 didn't ship the materialization seam, so on otech43..otech47 the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:
1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on openova-system/powerdns-api-credentials extended from "external-dns" to "external-dns,catalyst" so contabo catalyst-api can mount the API key.
2. bp-powerdns: api.basicAuth.enabled default flips true to false. Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that blocked machine-to-machine API access from Sovereigns. The X-API-Key contract is unchanged.
3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds a CATALYST_POWERDNS_API_KEY env from the powerdns-api-credentials/api-key secret (optional=true so Sovereign-side catalyst-api Pods that don't reflect this still start clean).
4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field reads from the CATALYST_POWERDNS_API_KEY env at New(). Stamped onto every Request before Validate(). Forwarded as tofu var powerdns_api_key.
5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive, default "").
6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct dynadot-api-credentials Secret block (PR #681 dropped bp-cert-manager-dynadot-webhook) with a new cert-manager/powerdns-api-credentials Secret block. runcmd applies it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

The end-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token. Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
369c229408
fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget (#685)
cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns even with all of:
- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in the cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

    listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed to
    bind or apply socket options: cannot bind '0.0.0.0:80': Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB program in a way that does not honour container capabilities. Even PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets "Permission denied". Cilium 1.19.3 → 1.16.5 made no difference (F1, PR #684 still ships — the version bump is sound for other reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets the Hetzner LB do the public-facing port translation:

    HCLB :80  → CP node :30080 (cilium-gateway HTTP listener)
    HCLB :443 → CP node :30443 (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover` on port 443; the high port is invisible. The high-port bind succeeds without NET_BIND_SERVICE because the kernel only gates ports below `net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve console.otech48/auth/handover end-to-end without the 502/timeout chain seen on otech45–47.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
affcf37923
fix(bp-catalyst-platform): provision harbor-robot-token automatically on Sovereign install (RCA + permanent fix) (#680)
Caught live on otech43–46 — a manual placeholder Secret was being created each iteration.

RCA: The catalyst-api Pod template references the `harbor-robot-token` Secret via a REQUIRED (non-optional) secretKeyRef. On Sovereign clusters that Secret was never materialised — only `ghcr-pull` had the canonical cloud-init + Reflector auto-mirror seam (PR #543). The chart's old comment said "Reflector mirrors from openova-harbor namespace into catalyst" but `openova-harbor` doesn't exist on Sovereigns; that namespace lives only on contabo where the central Harbor source Secret is administered. Result: every fresh Sovereign's catalyst-api Pod stuck in CreateContainerConfigError until the operator hand-created a placeholder Secret.

The token VALUE was already arriving on the Sovereign — Tofu var.harbor_robot_token is interpolated into /etc/rancher/k3s/registries.yaml at cloud-init time so containerd can authenticate against harbor.openova.io. We just never materialised the same value as a Kubernetes Secret for catalyst-api to mount.

Permanent fix mirrors the canonical `ghcr-pull` seam:
1. The infra/hetzner/cloudinit-control-plane.tftpl write_files block emits /var/lib/catalyst/harbor-robot-token-secret.yaml — a Secret in the flux-system ns with auto-mirror Reflector annotations (`reflection-auto-enabled: "true"`).
2. runcmd applies it BEFORE flux-bootstrap, so the Secret exists before any Helm release reconciles.
3. bp-reflector (slot 05a, already deployed) propagates the Secret into every namespace — including catalyst-system — on the first reconcile tick. catalyst-api's secretKeyRef resolves cleanly, the Pod starts.
4. Token rotation flows through `var.harbor_robot_token` → re-render Tofu → re-apply cloud-init; Reflector propagates the rotation to all mirrored copies on the next watch tick.

`harbor-robot-token` stays NOT optional in the chart: the architecture mandate is that every Sovereign image pull goes through harbor.openova.io; falling through to docker.io is forbidden (the anonymous rate limit makes a fresh Hetzner IP unbootable). A missing token must surface immediately as a Pod start failure, never silently mid-provision.

Bumps:
- bp-catalyst-platform 1.2.2 → 1.2.3 (the chart-side change is a comment-only update on the secretKeyRef explaining the new seam; the Pod spec still references the same Secret name and key).
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease version pin → 1.2.3.

No bootstrap-kit dependency changes — bp-reflector's slot-05a position is unchanged and was already a dependency for ghcr-pull. No expected-bootstrap-deps.yaml edits needed.

Issue #557 follow-up. Closes the per-Sovereign manual workaround.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
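The cloud-init-written Secret could look roughly like this (a sketch — the annotation keys follow emberstack/reflector's convention, but the data key name is illustrative and the token value is interpolated at Tofu render time):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: harbor-robot-token
  namespace: flux-system
  annotations:
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
stringData:
  token: ${harbor_robot_token}   # same value containerd gets via registries.yaml
```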
dd4148acb6
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on otech45: TCP to LB:443 succeeds, but the TLS handshake never completes.

Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match what cilium-envoy actually listens on (verified via /proc/net/tcp on the cilium-envoy pod — port 12869 not in the listening sockets). The nodePort indirection (31443→envoy:12869) is broken at the redirect step.

Fix: bind cilium-envoy directly to the host's :80 and :443 via gatewayAPI.hostNetwork.enabled=true. The Hetzner LB forwards public 80→private:80 and 443→private:443 directly (no nodePort indirection).

Two coordinated changes:
1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)

bp-cilium chart bumped to 1.1.5.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
1734979d74
fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)
* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)
The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):
    componentGroups.ts          Flux HelmRelease.dependsOn
    ------------------          --------------------------
    keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
    openbao:  []                openbao:  [spire, gateway-api, cnpg]
    harbor:   [cnpg,            harbor:   [cnpg, cert-manager,
               seaweedfs,                  gateway-api]
               valkey]
Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03
This commit:
1. Adds scripts/generate-blueprint-deps.sh that parses every
bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
keyed by bare component id (bp- prefix stripped on both source
and target side).
2. Commits the generated JSON.
3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
4. Patches componentGroups.ts so every RAW_COMPONENT's
`dependencies` field is OVERRIDDEN at module load with the
Flux-canonical list (the inline `dependencies: [...]` literals
are now ignored — Flux is canonical).
Follow-ups (not in this PR):
- CI drift check that re-runs the script and diffs the JSON.
- Strip the inline `dependencies: [...]` arrays entirely once the
drift check is green.
- Wire the FlowPage edge-rendering to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT
PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
keycloak: ['cert-manager', 'openbao'] ← FALSE; Flux says no openbao
The reason the founder kept seeing the spurious arrow on the Flow page.
Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): don't regress status to pending after exec started
helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).
Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.
Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): cascade Failed status through dependsOn (fail-fast)
Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.
Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'
Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.
Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.
Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576
fs.inotify.max_queued_events = 16384
Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
40ca4e4d50
fix(infra): registries.yaml mirror needs rewrite — Harbor proxy is /v2/proj/, not /proj/v2/ (#640)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

  PR #546 (Closes #542) introduced a dependency cycle:
    hcloud_server.control_plane.user_data → local.control_plane_cloud_init
    local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
  `tofu plan` failed with:
    Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
  Caught live during the otech23 first end-to-end provisioning attempt.
  Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP
  node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
    curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
  Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP,
  not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

  The OpenTofu cloud-init template references ${handover_jwt_public_key}
  (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the
  variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not contain
    key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired
  Caught live during otech23 provisioning, immediately after the tofu-cycle fix
  landed. tofu plan failed with:
    Error: Invalid function argument
    on main.tf line 170, in locals:
    170: control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
  Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped,
    never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the
    signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json
  variables.tf default "" preserves the no-signer path: cloud-init writes an empty
  handover-jwt-public.jwk and the new Sovereign is provisioned without the
  handover-validation surface (the handover flow is simply not wired on that
  Sovereign — degraded gracefully, not a hard failure).
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

  The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
  RequireSession-gated chi.Group, so every cloud-init postback was rejected with
  HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init
  has no browser session cookie — it authenticates with the SHA-256-hashed bearer
  token PutKubeconfig already verifies internally.
  Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init
  `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauthenticated.
  catalyst-api never received the kubeconfig, Phase 1 helmwatch never started,
  and the wizard's Jobs page stayed in PENDING forever.
  Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth
  path is the only gate. The matching GET stays inside session auth — the
  operator's "Download kubeconfig" button needs the session cookie.
  Caught live during the otech23 first end-to-end provisioning. Per the new
  "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS +
  on-disk state) and the next provision will use otech24.
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

  PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and
  declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of
  "". The catalyst-api side never set it, so every Sovereign so far provisioned
  with an empty token in registries.yaml — containerd's auth to harbor.openova.io's
  proxy projects failed silently and pulls fell through to docker.io. On a fresh
  Hetzner IP, Docker Hub returns rate-limit HTML and:
    Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...
  cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods
  stay Pending; no HelmReleases ever land; the wizard's job stream shows
  everything PENDING because there's nothing to watch. Caught live during otech24.
  Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
  2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the
     wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored
     from openova-harbor — Reflector-managed on Sovereign clusters; copied
     per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN,
     optional=true so degraded paths still come up.
  variables.tf default "" preserves graceful fall-through if the operator hasn't
  issued a robot token yet, and the architecture rule is now enforced end-to-end:
  every image on every Sovereign goes through harbor.openova.io.
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

  PR #638 added Validate() rejection for a missing harbor_robot_token, but the
  handler only stamped req.HarborRobotToken from p.HarborRobotToken inside
  Provision() — and Validate() runs in the handler BEFORE Provision() gets the
  chance to stamp. Result: every wizard launch returned
    Provisioning rejected: Harbor robot token is required
    (CATALYST_HARBOR_ROBOT_TOKEN missing)
  even though the env var is set on the Pod. Caught immediately on the otech25
  launch attempt.
  Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment
  handler. The provisioner-level stamp in Provision() stays as defense-in-depth.
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

  PR #557 wrote registries.yaml with mirror endpoints like
    https://harbor.openova.io/proxy-dockerhub
  hoping containerd would build URLs like
    https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6
  But Harbor proxy-cache projects expose their API at
    https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  (the project name lives after /v2/ and before the image path — not as a path
  prefix in front of /v2/). Harbor returns its SPA UI HTML (status 200,
  content-type text/html) for the wrong shape; containerd then errors with
    "unexpected media type text/html for sha256:... not found"
  and pause-image / cilium / coredns pulls fail forever — caught live during
  otech24 and otech25.
  Fix: switch to the k3s registries.yaml `rewrite` syntax. The endpoint is the
  bare Harbor host; a per-mirror rewrite re-maps the image path so containerd's
  final URL is correctly project-prefixed.
  Verified manually:
    curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
    -> 200 application/vnd.docker.distribution.manifest.list.v2+json
  This unblocks every Sovereign image pull through the central Harbor.
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
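The corrected shape can be sketched as a k3s /etc/rancher/k3s/registries.yaml fragment. The Harbor host, proxy-dockerhub project, and robot$openova-bot account come from the message above; the exact rewrite regex and the token placeholder are assumptions:

```yaml
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"      # bare Harbor host — containerd appends /v2/<path>
    rewrite:
      # assumed regex: prefix every image path with the proxy-cache project, so the
      # final URL becomes https://harbor.openova.io/v2/proxy-dockerhub/<image>/...
      "^(.*)$": "proxy-dockerhub/$1"
configs:
  "harbor.openova.io":
    auth:
      username: "robot$openova-bot"
      password: "<harbor_robot_token>"   # placeholder — injected at provision time
```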
0ee309aa8b
fix(infra+api): wire handover_jwt_public_key end-to-end through tofu provisioning (#636)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

  PR #546 (Closes #542) introduced a dependency cycle:
    hcloud_server.control_plane.user_data → local.control_plane_cloud_init
    local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
  `tofu plan` failed with:
    Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
  Caught live during the otech23 first end-to-end provisioning attempt.
  Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP
  node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
    curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
  Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP,
  not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

  The OpenTofu cloud-init template references ${handover_jwt_public_key}
  (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the
  variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not contain
    key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired
  Caught live during otech23 provisioning, immediately after the tofu-cycle fix
  landed. tofu plan failed with:
    Error: Invalid function argument
    on main.tf line 170, in locals:
    170: control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
  Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped,
    never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the
    signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json
  variables.tf default "" preserves the no-signer path: cloud-init writes an empty
  handover-jwt-public.jwk and the new Sovereign is provisioned without the
  handover-validation surface (the handover flow is simply not wired on that
  Sovereign — degraded gracefully, not a hard failure).
  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
96a5e3a20e
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during the otech23 first end-to-end provisioning attempt.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP
node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP,
not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
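The boot-time resolution can be sketched in shell. The metadata URL and the 127.0.0.1:6443 kubeconfig rewrite are from the message; the exact cloud-init wording and sed invocation are assumptions (the placeholder IP stands in for the live metadata lookup):

```shell
# In cloud-init this would be:
#   PUBLIC_IPV4="$(curl -s http://169.254.169.254/hetzner/v1/metadata/public-ipv4)"
# Here a placeholder IP stands in for the metadata service.
PUBLIC_IPV4="203.0.113.10"

# k3s writes its kubeconfig pointing at the loopback API endpoint;
# rewrite it to the node's public IPv4 so catalyst-api can connect.
KUBECONFIG_LINE='    server: https://127.0.0.1:6443'
REWRITTEN=$(printf '%s\n' "$KUBECONFIG_LINE" \
  | sed "s|https://127.0.0.1:6443|https://${PUBLIC_IPV4}:6443|")
echo "$REWRITTEN"
```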
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
PR #611 squash accidentally reverted the Phase-8b infra additions from PR #615
(
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605). Phase-8b deliverables:
UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting the operator to the Sovereign console (falls back to a plain
  URL on error)
API:
- main.go: wires the handoverjwt.LoadOrGenerate signer from the
  CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into the Tofu vars JSON
- auth.go: /auth/handover endpoint for the seamless single-identity flow
Infra:
- cloudinit-control-plane.tftpl: writes the handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)
Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars
Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard
Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).
Co-authored-by: e3mrah <e3mrah@openova.io>
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards
show a pre-existing failure on main (unrelated to this PR). Resolves #605.
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:
FATAL: database "registry" does not exist (SQLSTATE 3D000)
Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.
Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.
Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix
Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:
1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
- values.yaml: `webhook.solverName: powerdns` → `pdns`
- The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
"powerdns" cert-manager gets 404 → "server could not find the resource".
2. cert-manager-dynadot-webhook solver_test.go mock format:
- writeOK() and error injection used old ResponseHeader-wrapped format
- Real api3.json returns ResponseCode/Status directly in SetDnsResponse
- This caused the image build to fail at
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): the Cilium operator checks for Gateway API CRDs
at startup and disables its gateway controller if they are absent — a static,
one-shot decision. Cloud-init installs k3s + Cilium first, then Flux reconciles
bp-gateway-api minutes later, so the operator always starts without the CRDs and
never recovers. All 8 HTTPRoutes were orphaned.
Three-part permanent fix:
1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl. TLSRoute) BEFORE
   the Cilium helm install. Cilium 1.16.x requires the TLSRoute CRD to be present;
   without it the operator's capability check fails entirely and disables the
   gateway controller.
2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true" to force
   GatewayClass creation regardless of CRD presence at Helm render time. The
   upstream default "auto" skips the GatewayClass when the Gateway API CRDs are
   absent at install time (Capabilities check).
3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0 and ship
   the experimental channel (TLSRoute, TCPRoute, UDPRoute, BackendLBPolicy,
   BackendTLSPolicy). Gateway API v1.2.0 changed status.supportedFeatures from
   string[] to object[]; Cilium 1.16.5 writes the old string format and the
   v1.2.0 CRD rejects the status patch with "must be of type object: string",
   leaving the GatewayClass permanently Unknown/Pending. v1.1.0 retains the
   string schema.
Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17 adopts
the v1.2.0 object schema for supportedFeatures.
Closes #503
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
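Part 1 of the fix can be sketched as a cloud-init fragment. The CRD version and the ordering requirement come from the message; the manifest URL, values-file path, and exact helm invocation are assumptions:

```yaml
runcmd:
  # order matters: the operator's Gateway API capability check runs once, at
  # startup — the CRDs (incl. TLSRoute, from the experimental channel) must
  # already exist when Cilium is installed
  - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.1.0/experimental-install.yaml
  - helm install cilium cilium/cilium --namespace kube-system -f /var/lib/catalyst/cilium-values.yaml
```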
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

  Per founder corrective: the existing diagram missed the real blockers surfaced
  during the otech10..otech22 burns. The image-pull-through gap (#557) and the
  cross-namespace secret gap (#543, #544) gate every workload pull from a public
  registry — without them, a Sovereign hits DockerHub's anonymous rate limit on
  first provision and 30+ HRs are ImagePullBackOff/CreateContainerConfigError.
  Adds:
  - Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap + #557C
    charts global.imageRegistry templating). Edges to NATS / Gitea / Harbor /
    Grafana / Loki / Mimir / PowerDNS / Crossplane /
    cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
  - Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
    powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
    bp-cert-manager-powerdns-webhook
  - Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch 38-HR
    threshold both gate the Phase 8a integration test
  - Phase 0b → Phase 8b edge: the post-handover Sovereign-Harbor swap is what
    makes the "zero contabo dependency" DoD possible
  The WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium / cert-manager /
  cert-manager-powerdns-webhook / sealed-secrets (PR 1/3, #560)

  - bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; the upstream cilium
    subchart does not expose a single registry knob — per-Sovereign overlays wire
    specific image.repository fields alongside this value.
  - bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; the upstream
    chart exposes per-component image.registry knobs documented in the comment.
  - bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub added
    + deployment.yaml templated to prefix the webhook image repository when the
    value is non-empty. Verified: helm template with
    --set global.imageRegistry=harbor.openova.io produces
    harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
  - bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; the upstream
    subchart exposes sealed-secrets.image.registry for overlay wiring.
  All four charts render clean with default values (empty imageRegistry).
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var
  (openova-io/openova#557)

  Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
  transparently routes all five public-registry pulls through the central
  harbor.openova.io pull-through proxy (Option A of #557).
  - cloudinit-control-plane.tftpl: new write_files entry for
    /etc/rancher/k3s/registries.yaml (written BEFORE the k3s install so
    containerd reads the mirror config at startup). Mirrors docker.io, quay.io,
    gcr.io, registry.k8s.io, ghcr.io through the respective
    harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
  - variables.tf: new harbor_robot_token variable (sensitive, default "") for the
    robot account token stored in the openova-harbor/harbor-robot-token K8s
    Secret on contabo and forwarded by catalyst-api at provision time.
  - main.tf: wire harbor_robot_token into the templatefile() call.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
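The "prefix the image repository when the value is non-empty" templating described for bp-cert-manager-powerdns-webhook might look like this hypothetical Helm-template sketch (the `.Values.image.*` field names are assumptions, not taken from the chart):

```yaml
# deployment.yaml (sketch) — prepend global.imageRegistry only when it is set,
# so default (empty) values render the upstream repository unchanged
image: "{{ if .Values.global.imageRegistry }}{{ .Values.global.imageRegistry }}/{{ end }}{{ .Values.image.repository }}:{{ .Chart.AppVersion }}"
```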
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.
Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
API returns SetDnsResponse); change ResponseCode to json.Number (API returns
integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
- rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
- values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
and privateKeySecretRefName; add rbac.create comment for domain-solver
- certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
- clusterissuer.yaml: new template (skip-render default, enabled via overlay)
- deployment.yaml: add imagePullSecrets support (required for private GHCR)
- Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
- 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
- kustomization.yaml: add 49b entry
- infra/hetzner:
- variables.tf: add dynadot_managed_domains variable
- main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
- cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
Secret + apply it before Flux reconciles bootstrap-kit
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a, dependsOn
  bp-cert-manager) — installs emberstack/reflector v7.1.288 via the bp-reflector
  OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml): single
  replica, 32Mi memory, ServiceMonitor off by default.
Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector annotations to
  the ghcr-pull Secret written at cloud-init time so Reflector auto-mirrors it to
  every namespace on first boot.
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml, marketplace-api/deployment.yaml, and
  all 11 sme-services/*.yaml (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update the bootstrap-kit
  HelmRelease version reference to match.
Root cause: the canonical secret name is ghcr-pull (written by cloud-init as
/var/lib/catalyst/ghcr-pull-secret.yaml). Charts were referencing ghcr-pull-secret
(the wrong name), causing ImagePullBackOff on all Catalyst pods on every new
Sovereign.
Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret propagated
to 33 namespaces via kubectl; non-Running pods bounced.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
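The "four Reflector annotations" on the ghcr-pull Secret can be sketched as below — these are Reflector's standard reflection controls; the empty allow-lists (meaning "all namespaces") are an assumption, since the message does not spell out the values:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ghcr-pull
  namespace: flux-system
  annotations:
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: ""   # "" = any namespace
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"     # mirror without opt-in
    reflector.v1.k8s.emberstack.com/reflection-auto-namespaces: ""
type: kubernetes.io/dockerconfigjson
```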
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is exposed
directly on the CP node via a firewall rule (main.tf:51-56, 0.0.0.0/0 → CP:6443).
The previous cloud-init rewrote the kubeconfig server: to the LB's public IPv4,
which silently failed with "connect: connection refused" — catalyst-api helmwatch
could never observe HelmReleases on the new Sovereign, so the wizard jobs page
stayed PENDING for every install-* job for 50+ minutes after the cluster was
actually healthy.
Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address) through
the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to that IP instead.
The same firewall already opens 6443 to 0.0.0.0/0 directly on the CP, so this is
reachable from contabo without any LB / firewall changes.
Permanent: every otechN provisioning from this commit forward will PUT back a
kubeconfig that catalyst-api can actually connect to.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
66ff717fbc
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01): when the FIRST
apply of bootstrap-kit was unhealthy (cilium crash-loop from issue #491),
kustomize-controller held the revision lock for the full 30m health-check timeout
and refused to pick up new GitRepository revisions. Even though Flux fetched fix
`66ea39f0` from main within 1 minute, bootstrap-kit's lastAttemptedRevision
stayed pinned to the OLD SHA `0765e89a` for the full 30 minutes. With cilium
broken, the wait would never finish, no new revision would ever apply, and the
operator was forced to wipe + reprovision from scratch. The same pathology would
repeat on every iteration unless the timeout shape changed.
Approach: Option A (timeout reduction). Drops `spec.timeout` on all three Flux
Kustomizations in the cloud-init template — bootstrap-kit, sovereign-tls,
infrastructure-config — from 30m to 5m. We KEEP `wait: true` so downstream
`dependsOn: bootstrap-kit` declarations still get a consolidated "every HR
Ready=True" signal. We do NOT adjust `interval` (5m is correct).
Why 5m specifically: it matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case, so a fresh fix on
main gets applied on the next poll. Anything shorter risks tripping
legitimately-slow CRD installs; anything longer re-introduces the
iteration-stall pathology #492 documents.
Why not Option B (wait: false): it would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready before it
applies Provider/ProviderConfig manifests that talk to Hetzner. Flipping
wait: false would let infra-config apply prematurely.
Why not Option C (tighter retryInterval): it doesn't address the root cause.
retryInterval governs how often to retry AFTER a failure; spec.timeout is what
holds the revision lock during a failed wait.
Test: kustomization_timeout_test.go (new) locks all three timeouts at exactly 5m
AND blocks any operative `timeout: 30m` regression AND asserts wait: true is
retained. Three assertions, one for each failure mode (regression to 30m, an
accidental 4th Kustomization without a test update, a drive-by flip to
wait: false).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
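The resulting Kustomization shape can be sketched as follows — `timeout`, `wait`, `interval`, and the name come from the message; `path`, `prune`, and `sourceRef` are illustrative:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: bootstrap-kit
  namespace: flux-system
spec:
  interval: 5m        # unchanged — matches the GitRepository poll interval
  timeout: 5m         # was 30m; a failed health-check wait now releases the
                      # revision lock in time for the next poll to apply a fix
  wait: true          # kept so dependsOn consumers gate on every HR Ready=True
  prune: true
  path: ./clusters/_template/bootstrap-kit   # illustrative
  sourceRef:
    kind: GitRepository
    name: flux-system
```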
141dc9dfba
fix(infra): cloud-init helm install cilium values parity with Flux bp-cilium HR (closes #491) (#496)
Phase-8a bug #16: every fresh Hetzner Sovereign deadlocked at Phase 1
because the bootstrap helm install in cloud-init used a MINIMAL set of
--set flags (kubeProxyReplacement, k8sService*, tunnelProtocol,
bpf.masquerade) while the Flux bp-cilium HelmRelease curated a much
fuller value set. The drift was fatal:
1. cilium-agent waits forever for the operator to register
ciliumenvoyconfigs + ciliumclusterwideenvoyconfigs CRDs.
2. The upstream chart only registers them when envoyConfig.enabled=true.
3. With the bootstrap install missing that flag, the agent crash-looped,
the node taint node.cilium.io/agent-not-ready never lifted, and the
bootstrap-kit Kustomization (wait: true, 30 min timeout — issue #492)
never reconciled the upgrade that would have fixed the values.
The fix is single-source-of-truth via a new write_files entry that lays
down /var/lib/catalyst/cilium-values.yaml at cloud-init time, plus a -f
flag on the bootstrap helm install that consumes it. The values mirror
platform/cilium/chart/values.yaml's `cilium:` block PLUS the overlay
in clusters/_template/bootstrap-kit/01-cilium.yaml (envoyConfig.enabled,
l7Proxy). A new parity test (cilium_values_parity_test.go) locks the
two files together so a future commit cannot change one without the
other.
Approach: hybrid — keep the chart values.yaml as the umbrella source
of truth, render the merged effective values inline in cloud-init's
write_files block (the umbrella's `cilium:` subchart wrapper is
unwrapped because the bootstrap install targets cilium/cilium upstream
chart directly, not the bp-cilium umbrella). Test enforces presence
of every operator-curated key + load-bearing values.
Files modified:
infra/hetzner/cloudinit-control-plane.tftpl
products/catalyst/bootstrap/api/internal/provisioner/cilium_values_parity_test.go (new)
Refs: #491, #492 (bootstrap-kit wait timeout),
0d75ae354f
fix(infra): split Cilium-Gateway Certificate into sovereign-tls Kustomization (Phase-8a bug #13) (#484)
Phase-8a-preflight live deployment 93161846839dc2e1: the bootstrap-kit Flux
Kustomization fails server-side dry-run with
  Certificate/kube-system/sovereign-wildcard-tls dry-run failed: no matches for
  kind 'Certificate' in version 'cert-manager.io/v1'
→ the entire Kustomization apply aborts → ZERO HelmReleases reconcile.
Fix: split the Certificate into its own Flux Kustomization sovereign-tls that
dependsOn bootstrap-kit (whose Ready gates on every HR including bp-cert-manager).
The Gateway stays in 01-cilium.yaml because Gateway API CRDs ship with Cilium
itself.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
7e35040e29
fix(infra): cloud-init strip regex must preserve #cloud-config (Phase-8a bug #5 follow-up) (#482)
#477 introduced a regex "/(?m)^[ ]{0,2}#[^!].*\n/" to strip YAML-block comments
and fit Hetzner's 32KiB user_data cap. The [^!] guard preserved shebangs like
#!/bin/bash but DID NOT preserve cloud-init directives like #cloud-config,
#include, #cloud-boothook (none have ! after #).
Result: cloud-init received user_data with the #cloud-config first-line DIRECTIVE
stripped, didn't recognise the YAML body, and emitted:
  recoverable_errors: WARNING: Unhandled non-multipart (text/x-not-multipart) userdata
→ k3s never installed → Flux never bootstrapped → the kubeconfig was never PUT to
catalyst-api → every Phase-8a provision since #477 has silently failed at boot.
Live evidence: deployment a76e3fec8566add9, SSH'd 2026-05-01 18:30 UTC,
cloud-init status 'degraded done', /etc/systemd/system/k3s.service absent, no
flux binary.
Fix: require a SPACE after the '#' in the strip regex. YAML comments ARE
typically '# foo bar' (with a space); cloud-init directives are '#cloud-config' /
'#include' / '#cloud-boothook' (no space) — the new regex preserves them.
Out of scope: validating that ALL existing comments in the tftpl had a space
after #. They do — verified by the sed pre-render passing the sanity test (the
file shrinks 38KB → 13KB AND the first line is #cloud-config).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
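The corrected strip rule can be illustrated in shell. The real implementation is an RE2 regex inside Tofu's replace(); this sed approximation and the sample lines are for illustration only:

```shell
# A comment is '#' followed by a SPACE at indent 0-2, so '#cloud-config',
# '#include', and shebangs like '#!/bin/bash' (no space after '#') survive,
# while documentation comments are stripped.
sample=$(printf '%s\n' \
  '#cloud-config' \
  '# a YAML doc comment' \
  'runcmd:' \
  '  # indented doc comment' \
  '  - echo ok')
stripped=$(printf '%s\n' "$sample" | sed -E '/^ {0,2}# /d')
printf '%s\n' "$stripped"
```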
e35729ad78
fix(infra): strip YAML-block comments from cloud-init to fit Hetzner 32KiB cap (Phase-8a bug #5) (#477)
Phase-8a-preflight deployment 3c158f712d564d84 failed at tofu apply with:
Error: invalid input in field 'user_data'
[user_data => [Length must be between 0 and 32768.]]
on main.tf line 214, in resource "hcloud_server" "control_plane"
The rendered cloudinit-control-plane.tftpl is 38,085 bytes — 5,317
bytes over the Hetzner cap. The source template ships ~16 KB of
indent-0 and indent-2 documentation comments (YAML-level) that are
operationally inert at cloud-init boot.
Fix: wrap templatefile() in replace() with a RE2 regex that strips
lines whose first 0-2 chars are spaces followed by '#' (preserves
shebangs via [^!]). After strip, rendered cloud-init drops to ~13 KB.
Indent-4+ comments live INSIDE heredoc `content: |` blocks
(embedded shell scripts, kubeconfig fragments). Those are preserved.
Same fix applied to worker_cloud_init for parity.
Refs:
- Live evidence: deployment 3c158f712d564d84, tofu apply error 16:38:26 UTC
- Bug #5 in the Phase-8a-preflight tally
- #471: prior tftpl escape fix ($${SOVEREIGN_FQDN})
- #472: catalyst-build watches infra/hetzner/**
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
03b1469331
fix(infra): escape ${SOVEREIGN_FQDN} in cloudinit-control-plane.tftpl comments (#471)
Phase-8a-preflight bug surfaced by first live provision attempt
(deployment febeeb888debf477, 2026-05-01 16:30 UTC):
Error: Invalid function argument
on main.tf line 140, in locals:
140: control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
Invalid value for "vars" parameter: vars map does not contain key
"SOVEREIGN_FQDN", referenced at ./cloudinit-control-plane.tftpl:12,37-51.
Tofu's templatefile() interprets ${...} ANYWHERE in the file (including
inside shell '#' comments), since the file is a template not a shell
script. Five lines in cloudinit-control-plane.tftpl reference
${SOVEREIGN_FQDN} as part of documentation prose explaining how
Flux postBuild.substitute interpolates the value at Flux apply time.
The Tofu vars map passed by main.tf:140 uses the canonical lowercase
HCL convention (sovereign_fqdn = var.sovereign_fqdn), not the uppercase
envsubst convention SOVEREIGN_FQDN. So Tofu fails: 'vars map does not
contain key SOVEREIGN_FQDN'.
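The escaping fix can be sketched as follows — in an OpenTofu template file a literal `${` is written `$${`; the lines below are illustrative, not the actual template content:

```hcl
# cloudinit-control-plane.tftpl (illustrative lines)
#
# Prose that merely MENTIONS a Flux envsubst variable must escape the dollar,
# because templatefile() interpolates interpolation sequences everywhere,
# including inside '#' comments:
# ... Flux postBuild.substitute later fills in $${SOVEREIGN_FQDN} ...
#     ('$$' renders as a literal '$', so the comment survives tofu plan)

hostname: ${sovereign_fqdn}   # a real key in main.tf's templatefile() vars map
```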
Latest reference (line 12) added by #326 (commit
20b896070f
feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326) (#448)
Wires the per-Sovereign K8s api-server's --oidc-* validator to the
per-Sovereign Keycloak realm so customer admins can authenticate
kubectl directly against their Sovereign — no static admin-kubeconfig
handoff, no rotated bearer-token exchange.
infra (cloud-init):
- Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in
infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed
from sovereign_fqdn (https://auth.\${sovereign_fqdn}/realms/sovereign)
per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups
prefixes scope OIDC subjects under "oidc:" so RoleBindings reference
e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509.
Canonical seam (anti-duplication rule, ADR-0001 §11.3):
- The bp-keycloak chart already bundles bitnami/keycloak's
keycloakConfigCli post-install Helm hook Job, which imports realms
declared under values.keycloak.keycloakConfigCli.configuration. We
enable the existing seam — no bespoke kubectl-exec realm-creation
script, no custom Admin-API call from catalyst-api.
bp-keycloak chart (1.1.2 → 1.2.0):
- Enable keycloakConfigCli + ship inline sovereign-realm.json with:
realm "sovereign" (invariant per Sovereign — Keycloak resolves the
issuer claim from the request hostname, so no per-FQDN realm
rename), default groups sovereign-admins/-ops/-viewers, oidc-group
-membership-mapper emitting "groups" claim, public OIDC client
"kubectl" with localhost:8000 + OOB redirect URIs (kubectl-oidc
-login defaults), publicClient=true (kubectl runs locally and
cannot safely hold a secret), PKCE S256 enforced.
- Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape).
- Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/,
otech.omani.works/ to version: 1.2.0.
- New chart test tests/oidc-kubectl-client.sh (4 cases) — all green.
- Existing tests/observability-toggle.sh — still green.
Documentation:
- Add §11 "kubectl OIDC for customer admins" runbook to
docs/omantel-handover-wbs.md with one-time workstation setup
(kubectl krew install oidc-login + config set-credentials),
sovereign-admin RBAC binding (oidc:sovereign-admins → cluster
-admin), and 401-debugging table mapping common symptoms to
root causes.
- Carve #326 out of §7 "Out of scope" — it is shipped.
- Add §9 status row.
Validation:
- grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl
→ 2 (comment + the actual flag in the curl line)
- grep -c 'oidc-username-claim' → 2
- helm template platform/keycloak/chart → renders post-install
keycloak-config-cli Job + ConfigMap with kubectl client (3 hits
on grep "kubectl"; 1 hit on "clientId": "kubectl")
- bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode)
- 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle
gates green
Out of scope (deferred to follow-up tickets):
- Per-Sovereign user provisioning UI (#322, #323)
- Refresh-token revocation on RoleBinding deletion (#324)
- provider-kubernetes Crossplane ProviderConfig per Sovereign (#321)
- omantel migration / Phase 8 live execution
NO catalyst-api or UI source files touched (those are #319/#322/#323
agents' territories per agent brief).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
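The one-time workstation setup the §11 runbook describes looks roughly like this — a sketch assembled from the commit's description (issuer path, client id "kubectl", "groups" claim); `<sovereign-fqdn>` and the credential name are placeholders, not verbatim runbook content:

```shell
# One-time workstation setup (sketch): install the oidc-login krew
# plugin, then register an exec credential pointing at the Sovereign's
# Keycloak realm. <sovereign-fqdn> is a per-Sovereign placeholder.
kubectl krew install oidc-login
kubectl config set-credentials sovereign-oidc \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://auth.<sovereign-fqdn>/realms/sovereign \
  --exec-arg=--oidc-client-id=kubectl \
  --exec-arg=--oidc-extra-scope=groups
```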
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.
Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture +
bucket provisioning Phase 0b path described in the omantel handover
WBS §5. Hybrid Option A+B: the wizard collects operator-issued S3
credentials (Hetzner exposes no Cloud API to mint them — they're
issued once in the Hetzner Console and the secret half is shown
exactly once), and OpenTofu auto-provisions the per-Sovereign bucket
via the aminueza/minio provider + writes a
flux-system/hetzner-object-storage Secret into the new Sovereign at
cloud-init time, so Harbor (#383) and Velero (#384) find their
backing-store credentials already in the cluster from Phase 1 onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries /
scripts / operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
- versions.tf: declare the aminueza/minio provider (Hetzner's official
  recommendation for S3-compatible bucket creation per
  docs.hetzner.com/storage/object-storage/getting-started/...)
- variables.tf: 4 sensitive vars — region (validated against
  fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
  access_key, secret_key, bucket_name (RFC-compliant S3 naming)
- main.tf: minio_s3_bucket.main resource — idempotent on re-apply, no
  force_destroy (the Velero archive must survive a control-plane
  reinstall), object_locking=false (content-addressed digests are the
  immutability guarantee for Harbor; Velero uses S3 versioning)
- cloudinit-control-plane.tftpl: write the
  flux-system/hetzner-object-storage Secret with the canonical
  s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
  Harbor + Velero charts consume via existingSecret refs
- outputs.tf: surface endpoint/region/bucket back to catalyst-api for
  the deployment record (credentials NEVER returned)

products/catalyst/bootstrap/api/ (Go):
- internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
  ListBuckets validator. Distinguishes auth failure ("rejected") from
  network failure ("unreachable") so the wizard renders the right
  error card. NOT a parallel cloud-resource path — the existing
  purge.go handles hcloud purge; objectstorage.go handles a separate
  API surface (S3-compatible) that has no equivalent client today.
- internal/handler/credentials.go: extend with
  ValidateObjectStorageCredentials handler — same wire shape
  (200 valid:true / 200 valid:false / 503 unreachable / 400 bad input)
  as the existing token validator so the wizard's failure-card
  machinery handles both without per-endpoint switches.
- cmd/api/main.go: wire POST /api/v1/credentials/object-storage/validate
- internal/provisioner/provisioner.go: extend Request with
  ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate() rejects
  empty/malformed values fail-fast at /api/v1/deployments POST time;
  writeTfvars() emits the 4 new tfvars.
- internal/handler/deployments.go: derive the bucket name from the
  FQDN slug pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>)
  so Hetzner's globally-namespaced bucket pool gets a deterministic,
  collision-resistant per-Sovereign name without operator input.
- internal/store/store.go: redact access/secret keys; preserve
  region+bucket plain (they're public in tofu outputs anyway).

products/catalyst/bootstrap/ui/ (TypeScript / React):
- entities/deployment/model.ts + store.ts: 4 new wizard fields
  (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
  coercion for legacy persisted state.
- pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
  region picker (fsn1/nbg1/hel1), masked secret-key input, Validate
  button gating Next. Same FailureCard taxonomy
  (rejected/too-short/unreachable/network/parse/http) the existing
  TokenSection uses, so the operator UX is consistent. The section
  only renders when Hetzner is among the chosen providers —
  non-Hetzner Sovereigns skip Phase 0b until their own backing-store
  path lands.
- pages/wizard/steps/StepReview.tsx: include
  objectStorageRegion/AccessKey/SecretKey in the POST /v1/deployments
  payload (bucket derived server-side).

Tests:
- api: 7 new provisioner Validate tests (region/keys/bucket required +
  RFC-compliant + valid-region acceptance), 5 handler tests for the
  new endpoint (bad JSON / missing region / invalid region / short
  keys), 4 hetzner/objectstorage_test.go tests (endpoint composition +
  early input rejection), 1 handler test for the bucket-name
  derivation. Existing tests updated to supply the new required fields.
- ui: StepCredentials.test.tsx pre-populates objectStorageValidated in
  beforeEach so the existing 11 SSH-section tests aren't gated on
  Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references. Closes #371.

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST
capture them; OpenTofu auto-provisions the bucket + cloud-init writes
the flux-system/hetzner-object-storage Secret with the canonical s3-*
keys Harbor + Velero consume).

--------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
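The server-side bucket-name derivation in deployments.go follows the documented catalyst-<fqdn-with-dots-replaced-by-dashes> rule, which can be sketched in shell (the real helper is Go and its name isn't shown in the commit; lowercasing is added here because S3 bucket names must be lowercase — the commit only documents the dots→dashes rule):

```shell
# Sketch of the documented derivation: dots -> dashes, lowercased,
# prefixed with "catalyst-". The actual implementation lives in
# internal/handler/deployments.go (Go).
bucket_name_from_fqdn() {
  printf 'catalyst-%s' "$(printf '%s' "$1" | tr 'A-Z.' 'a-z-')"
}

bucket_name_from_fqdn omantel.omani.works   # → catalyst-omantel-omani-works
```

Because the name is a pure function of the FQDN, re-provisioning the same Sovereign targets the same bucket — the deterministic, collision-resistant property the commit describes.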
d2ada908c9
feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316) (#408)
Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns
(no managed-KMS available). Selected **Option A — Shamir + cloud-init
seed** because:
- Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C)
is structurally unavailable.
- Transit-seal (Option B) requires a peer OpenBao cluster, only
applicable to multi-region tier-1; out of scope for single-region
omantel.
- Manual unseal (Option D) violates the "first sovereign-admin lands
on console.<sovereign-fqdn> ready to use" goal in
SOVEREIGN-PROVISIONING.md §5.
Architecture (per issue #316 spec + acceptance criteria 1-6):
1. Cloud-init on the control-plane node generates a 32-byte recovery
seed from /dev/urandom and writes it to a single-use K8s Secret
`openbao-recovery-seed` in the openbao namespace, with annotation
`openbao.openova.io/single-use: "true"`. Pre-creates the openbao
namespace to eliminate the race with Flux's HelmRelease apply.
2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks:
- `templates/init-job.yaml` (hook weight 5): consumes the seed,
calls `bao operator init -recovery-shares=1 -recovery-threshold=1`,
persists the recovery key inside OpenBao's auto-unseal config,
deletes the seed Secret on success. Idempotent — re-runs detect
Initialized=true and exit 0.
- `templates/auth-bootstrap-job.yaml` (hook weight 10): enables
the Kubernetes auth method, mounts kv-v2 at `secret/`, writes
the `external-secrets-read` policy, binds the `external-secrets`
role to the ESO ServiceAccount in `external-secrets-system`.
3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA
+ Role + RoleBinding the Jobs need (Secret get/list/delete in the
openbao namespace; create/get/patch on the openbao-init-marker).
Also emits the permanent `system:auth-delegator` ClusterRoleBinding
bound to the OpenBao ServiceAccount so the Kubernetes auth method
can call tokenreviews.authentication.k8s.io.
4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml`
bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true`
per-Sovereign.
Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }}
{{- end }}`) used throughout — never `{{ fail }}`. Default `helm
template` render emits NOTHING new; opt-in via autoUnseal.enabled=true.
Acceptance criteria coverage:
1. Provision fresh Sovereign — cloud-init writes seed, Flux installs
bp-openbao 1.2.0, post-install Jobs run automatically. ✅
2. bp-openbao HR Ready=True without manual intervention — install
keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the
init Job drives initialisation out-of-band on the same install). ✅
3. `bao status` shows Sealed=false, Initialized=true within 5 minutes
— init Job polls + retries up to 60×5s. ✅
4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the
auth-bootstrap Job binds the `external-secrets` role to ESO's SA
before the Job exits. ✅
5. Seed Secret deleted post-init — init Job deletes it via K8s API
after consuming. ✅
6. No openbao-root-token Secret in K8s — root token captured to
/tmp/.root-token in the Job pod's tmpfs only; never written to a
K8s Secret. The recovery key persists ONLY inside OpenBao's Raft
state (auto-unseal config). ✅
Tests:
- tests/auto-unseal-toggle.sh — 4 cases:
* default render → no auto-unseal artefacts (skip-render works)
* autoUnseal.enabled=true → both Jobs + correct hook weights
* kubernetesAuth.enabled=false → init Job only, no auth-bootstrap
* idempotency annotations present on all 5 hook objects
- tests/observability-toggle.sh — unchanged, all 3 cases green.
- helm lint . — clean.
Files:
- platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0
- platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0
- platform/openbao/chart/values.yaml — `autoUnseal.*` block
- platform/openbao/chart/templates/auto-unseal-rbac.yaml — new
- platform/openbao/chart/templates/init-job.yaml — new
- platform/openbao/chart/templates/auth-bootstrap-job.yaml — new
- platform/openbao/chart/tests/auto-unseal-toggle.sh — new
- platform/openbao/README.md — bootstrap procedure §2-3 expanded;
auto-unseal alternatives table added.
- clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 →
1.2.0, autoUnseal.enabled=true.
- infra/hetzner/cloudinit-control-plane.tftpl — seed-token block
inserted between ghcr-pull-secret apply and flux-bootstrap apply.
- docs/omantel-handover-wbs.md §9 — #316 ticked chart-released.
Canonical seam used: extended existing `platform/openbao/chart/` per
the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud
calls. NO `{{ fail }}`. All knobs configurable via values.yaml per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode).
Co-authored-by: hatiyildiz <hat.yil@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
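Step 1 of the architecture above (the cloud-init seed) can be sketched as follows — a hedged approximation of the tftpl block, not its verbatim contents; only the Secret name, namespace, and single-use annotation are taken from the commit:

```shell
# Sketch of the cloud-init seed step: generate a 32-byte recovery seed
# from /dev/urandom and render the single-use Secret manifest. The
# real cloud-init pre-creates the openbao namespace first (to avoid
# racing Flux's HelmRelease apply) and then applies this manifest.
SEED="$(head -c 32 /dev/urandom | base64 | tr -d '\n')"
cat > /tmp/openbao-recovery-seed.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: openbao-recovery-seed
  namespace: openbao
  annotations:
    openbao.openova.io/single-use: "true"
stringData:
  seed: "${SEED}"
EOF
# kubectl create namespace openbao --dry-run=client -o yaml | kubectl apply -f -
# kubectl apply -f /tmp/openbao-recovery-seed.yaml
```

The init Job (hook weight 5) later consumes this Secret and deletes it, satisfying acceptance criterion 5.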
8781aa3bc4
fix(provisioner): cloud-init bootstrap-kit path matches per-FQDN cluster dir (resolves #218) (#256)
The cloud-init template selected a per-FQDN GitRepository tree
(`!/clusters/${sovereign_fqdn}`) and pointed both bootstrap-kit
and infrastructure-config Flux Kustomizations at
`./clusters/${sovereign_fqdn}/{bootstrap-kit,infrastructure}` —
directories the wizard never commits before provisioning. Every
fresh Sovereign stalled Phase-1 with `kustomization path not found:
.../clusters/<fqdn>/bootstrap-kit: no such file or directory`
(live evidence on otech.omani.works deployment ce476aaf80731a46).
Canonical fix:
- GitRepository.spec.ignore selects the shared `_template` tree
(`!/clusters/_template`).
- Both Kustomizations point at `./clusters/_template/bootstrap-kit`
and `./clusters/_template/infrastructure`.
- Flux postBuild.substitute.SOVEREIGN_FQDN: ${sovereign_fqdn}
interpolates the Sovereign's FQDN into the rendered manifests
(envsubst replaces `${SOVEREIGN_FQDN}` in label values, ingress
hostnames, HelmRelease values).
- clusters/_template/bootstrap-kit/*.yaml + kustomization.yaml
switch their bare `SOVEREIGN_FQDN_PLACEHOLDER` markers to
`${SOVEREIGN_FQDN}` so Flux's envsubst-based substitute can
actually replace them.
Locked by 5 unit tests in
products/catalyst/bootstrap/api/internal/provisioner/cloudinit_path_test.go
that read the template and assert: GitRepository ignore selects
_template, both Kustomization paths point at _template subdirs,
both carry the postBuild.substitute hook, and no operative YAML
line carries `clusters/${sovereign_fqdn}`.
Closes #218
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
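The postBuild.substitute mechanism can be approximated locally with sed — Flux uses its own envsubst implementation internally, so this is only an illustration of the `${SOVEREIGN_FQDN}` replacement, with a hypothetical ingress fragment as input:

```shell
# Approximate what Flux's postBuild.substitute does to the shared
# _template manifests: replace ${SOVEREIGN_FQDN} with the per-
# Sovereign value. sed stands in for Flux's envsubst here.
SOVEREIGN_FQDN="otech.omani.works"
cat > /tmp/ingress-fragment.yaml <<'EOF'
host: console.${SOVEREIGN_FQDN}
EOF
sed "s/\${SOVEREIGN_FQDN}/${SOVEREIGN_FQDN}/g" /tmp/ingress-fragment.yaml
```

This is why the template tree had to switch from bare SOVEREIGN_FQDN_PLACEHOLDER markers to `${SOVEREIGN_FQDN}`: the substitute step only rewrites the envsubst-shaped form.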
5aee6aa737
fix(cloudinit): poll for local-path StorageClass instead of pod Ready (closes #207) (#209)

The previous fix for #189 wrote `kubectl wait --for=condition=Ready
pod -l app=local-path-provisioner --timeout=60s`. That cannot succeed
pre-Cilium: k3s runs with --flannel-backend=none, the node stays
Ready=False until Cilium installs (much later in cloud-init), and the
not-ready taint blocks every untolerated pod. The wait timed out at
60s, scripts_user failed, and the Flux-bootstrap + kubeconfig
POST-back sections never executed. Every fresh Sovereign provision
was stuck "before Cilium" with no error signal in the wizard.

Replace the impossible Pod-Ready wait with a poll for the
StorageClass object itself, which k3s registers independently of CNI
within ~3s of service start.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
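The replacement wait can be sketched as a generic poll helper — the actual cloud-init line may differ; `kubectl get storageclass local-path` is the probe the commit describes, and the helper name is illustrative:

```shell
# Generic poll helper: retry a command until it succeeds or the
# deadline passes. In cloud-init the probe would be roughly:
#   poll 60 3 kubectl get storageclass local-path
poll() {
  deadline=$(( $(date +%s) + $1 )); interval=$2; shift 2
  until "$@" >/dev/null 2>&1; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

poll 5 1 true && echo "probe succeeded"
```

Polling an API object instead of pod readiness is what breaks the circular dependency: the StorageClass exists as soon as the k3s apiserver registers it, regardless of whether any pod can schedule pre-CNI.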
The previous fix for #189 wrote `kubectl wait --for=condition=Ready pod -l app=local-path-provisioner --timeout=60s`. That cannot succeed pre-Cilium: k3s runs with --flannel-backend=none, the node stays Ready=False until Cilium installs (much later in cloud-init), and the not-ready taint blocks every untolerated pod. The wait timed out at 60s, scripts_user failed, and the Flux-bootstrap + kubeconfig POST-back sections never executed. Every fresh Sovereign provision was stuck "before Cilium" with no error signal in the wizard. Replace the impossible Pod-Ready wait with a poll for the StorageClass object itself, which k3s registers independently of CNI within ~3s of service start. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |