Commit Graph

21 Commits

e3mrah
7bfd6df588
fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
Five stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a
fresh post-handover Sovereign (surfaced live on otech103, 2026-05-05), plus
a sixth gap (ghcr-pull reflector for catalyst-system). All six are fixed in
one PR so that a single chart bump + cloud-init re-render closes the gap
end-to-end.

Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL=
https://pool.openova.io. The in-cluster Service default only resolves on
contabo; on Sovereigns every Day-2 POST died with NXDOMAIN.

Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env vars from
a new pdm-basicauth Secret, and have pdmFlipNS call SetBasicAuth with those envs.
The PDM public ingress at pool.openova.io is gated by Traefik basicAuth;
calls without Authorization: Basic returned 401. optional=true so contabo
+ CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable
Principle #10, the credentials only ever live in Pod env + are read once
per call by pdmFlipNS — never enter a logged struct or persisted record.

Bug 3 (code, parent_domains.go): pdmFlipNS body now includes the required
nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema
requires it; the previous body got 422 missing-nameservers.

Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to
SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover
Sovereign no Deployment record is persisted, so without this fallback GET
/parent-domains returned {"items":[]} and the propagation panel showed
expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml
from the sovereign-fqdn ConfigMap.

Bug 5 (chart, httproute.yaml): catalyst-ui /auth/* PathPrefix narrowed to
Exact /auth/handover. The previous PathPrefix collided with the OIDC PKCE
redirect_uri /auth/callback — catalyst-api 404s on that path because it
only registers /api/v1/auth/callback, breaking login once the handover-JWT
cookie expires. The Exact match keeps /auth/handover routed to catalyst-api
while every other /auth/* path falls through to catalyst-ui's React Router
for client-side OIDC.

Bug 6 (cloud-init): ghcr-pull + harbor-robot-token + new pdm-basicauth
Reflector annotations enumerate explicit allowed/auto-namespaces (sme,
catalyst, catalyst-system, gitea, harbor) instead of empty-string. The
ambiguous empty-string interpretation caused otech103 to require a manual
catalyst-system mirror creation; explicit list back-ports the verified
working state.

Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields
+ tfvars emission so the contabo catalyst-api can stamp the credentials
onto every Sovereign provision request. variables.tf adds matching
pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default
empty) so older provisioner builds that pre-date this change keep
rendering valid cloud-init (the Secret renders with empty values and
Pod start is unaffected).

Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes
the architectural blockers tracked in #879; the catalyst-api image
rebuild + chart republish run via the existing CI pipelines
(services-build.yaml + blueprint-release.yaml) on this commit's SHA.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 09:02:39 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).
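The client's idempotency contract can be sketched as follows; the signature and helper names are illustrative (the real client lives in internal/powerdns), but the URL layout matches the PowerDNS API endpoint named above:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// zoneCreateOK encodes the idempotency rule: 201 means created, while
// 409/412 ("zone already exists" on different PowerDNS versions) also
// count as success so re-runs after upgrades never fail.
func zoneCreateOK(status int) bool {
	switch status {
	case http.StatusCreated, http.StatusConflict, http.StatusPreconditionFailed:
		return true
	}
	return false
}

// CreateZone POSTs a zone body to /api/v1/servers/<id>/zones with the
// X-API-Key header. Sketch only — error shapes in the real client differ.
func CreateZone(c *http.Client, baseURL, serverID, apiKey string, body []byte) error {
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/servers/"+serverID+"/zones", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("X-API-Key", apiKey)
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if !zoneCreateOK(resp.StatusCode) {
		return fmt.Errorf("powerdns: unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Fake PowerDNS that answers 409 (zone already exists).
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusConflict)
	}))
	defer srv.Close()
	err := CreateZone(srv.Client(), srv.URL, "localhost", "test-key",
		[]byte(`{"name":"omani.trade."}`))
	fmt.Println(err == nil) // true — 409 treated as already-created
}
```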

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
05065b66d6
fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs

Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04.
GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in
fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in
those DCs with:

  {"error":{"code":"invalid_input",
            "message":"unsupported location for server type"}}

Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate
DELETE. cpx22 + cpx32 were also probed as a sanity check and returned
ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises
prices for every (SKU, location) pair regardless of orderability.
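The rejection shape quoted above can be recognised mechanically. This Go sketch (type names illustrative) models just enough of the Hetzner error envelope to classify a probe result — a listed (SKU, DC) pair is only orderable if POST /v1/servers does NOT answer with it:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hcloudError models the error envelope quoted in the reproducer above.
type hcloudError struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// unorderable reports whether a server-create response body is the
// listed-but-not-orderable rejection seen for cpx21/cpx31 in EU DCs.
func unorderable(body []byte) bool {
	var e hcloudError
	if json.Unmarshal(body, &e) != nil {
		return false
	}
	return e.Error.Code == "invalid_input" &&
		e.Error.Message == "unsupported location for server type"
}

func main() {
	reject := []byte(`{"error":{"code":"invalid_input",
		"message":"unsupported location for server type"}}`)
	fmt.Println(unorderable(reject)) // true for cpx21/cpx31 in fsn1/nbg1/hel1
}
```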

Conclusion: NO SKU bump-back. cpx22 + cpx32 (PR #744) remain the floor.
README + variables.tf docstrings now carry the durable reproducer so future
engineers don't re-attempt cpx21/cpx31.

#753 — kubectl retry / LKG observer reliability

/tmp/autopilot.sh updated (script lives outside the repo, on the VPS):
  • Every kubectl call carries --request-timeout=8s so a hung TLS handshake
    surfaces as a fast empty rather than a 30s+ stall.
  • Last-known-good (LKG) state held across transient flakes: hr/cert/nodes
    no longer flip to "0/0 nodes=0" on a single failed poll.
  • Only 3 consecutive transients count as a real failure; below the
    threshold the observer prints "hr=<LKG> (transient N/3)".
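The script itself is shell and lives outside the repo, but the LKG + threshold rule can be sketched in Go (names illustrative; threshold 3 and the "transient N/3" line match the description above):

```go
package main

import "fmt"

// observer holds last-known-good state across transient poll failures:
// only a successful poll replaces the LKG value, and only 3 consecutive
// transients count as a real failure.
type observer struct {
	lkg        string
	transients int
}

func (o *observer) poll(val string, ok bool) string {
	if ok {
		o.lkg, o.transients = val, 0
		return val
	}
	o.transients++
	if o.transients < 3 {
		return fmt.Sprintf("hr=%s (transient %d/3)", o.lkg, o.transients)
	}
	return "FAILED" // threshold reached: report a real failure
}

func main() {
	o := &observer{lkg: "0/0"}
	fmt.Println(o.poll("38/38", true)) // 38/38
	fmt.Println(o.poll("", false))     // hr=38/38 (transient 1/3)
	fmt.Println(o.poll("", false))     // hr=38/38 (transient 2/3)
	fmt.Println(o.poll("", false))     // FAILED
}
```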

UI side: the wizard's StatusPill / ApplicationPage drive off SSE from
catalyst-api (useDeploymentEvents.ts), not direct kubectl polling, so no UI
change needed. catalyst-api itself uses client-go (helmwatch / phase1_watch),
not exec kubectl, so its observer is not subject to the same shell-out flake.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:11:44 +04:00
e3mrah
b02fc3788a
fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request

Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

  validation {
    condition = alltrue([
      for r in var.regions :
      contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
    ])
  }

The `for r in var.regions` iteration fails on null with:

  Error: Iteration over null value
  on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
  serialises as JSON `[]`, never `null`, when the request has no
  per-region overrides.

Builds on PR #742.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)

Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the
cpx21 CP default from PR #741 fell apart at apply time —

  Error: Server Type "cpx21" is unavailable in "fsn1" and can no
  longer be ordered

Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog
(`/v1/server_types`) but are NOT in the per-DC orderable list
(`available_for_migration` on `/v1/datacenters`) for any EU DC
(fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on
for new Sovereigns in those regions.

Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
  • cpx11 (2 vCPU / 2 GB) — too small for the CP working set
  • cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
  • cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
  • cpx42, cpx52, cpx62 — bigger and more expensive

New default per Sovereign:

| Component       | Old             | New              | Savings |
|-----------------|-----------------|------------------|---------|
| Control plane   | CPX32 (€16.49)  | CPX22 (€9.49)    | €7.00   |
| Worker × 2      | CPX32 × 2 (€33) | CPX32 × 2 (€33)  | €0      |
| TOTAL           | €49.47/mo       | €42.47/mo        | 14%     |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo)
assumed those SKUs were orderable. They aren't in EU DCs. The 14%
saving from cpx22 CP is the largest concrete optimisation that
ships TODAY without compromising the multi-node horizontal-scale
agreement (issue #733): still 1 CP + 2 workers from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size        default cpx31 → cpx32 (back to the prior orderable choice)

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing
  (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49).
  Mark both as "listed but NOT orderable in EU DCs" so the wizard
  surfaces the constraint instead of letting operators pick a
  non-orderable SKU.
  Move recommended:true from CPX21 → CPX22.
  defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:35:55 +04:00
e3mrah
994c2d1c2a
fix(provisioner): cost-optimized default sizes — cpx21 CP + cpx31 workers (38% saving) (#741)
The new Sovereign default after PR #736 / #738 / #739 was 1× CPX32 control
plane + 2× CPX32 workers — €33/mo per Sovereign. CPX32 is over-provisioned
for the CP working set: the CP carries only k3s (apiserver/etcd/scheduler/
controller-manager) + cilium-operator + flux controllers + cert-manager +
sealed-secrets — NOT the heavy bp-keycloak/cnpg/harbor/openbao/grafana
stack (those land on workers because the bootstrap-kit explicitly schedules
them off the CP taint).

CP RAM budget: etcd ~512 MB + control plane ~1.5 GB + cilium/flux/
cert-manager/sealed-secrets ~1 GB + OS ~512 MB ≈ 3.5 GB — fits CPX21's
4 GB. Workers stay at 8 GB on CPX31 since RAM is the binding constraint
for the bootstrap-kit's worker pods, not vCPU.

New default per Sovereign:

| Component       | Old             | New             | Savings |
|-----------------|-----------------|-----------------|---------|
| Control plane   | CPX32 (€11/mo)  | CPX21 (€5.5/mo) | €5.5    |
| Worker × 2      | CPX32 × 2 (€22) | CPX31 × 2 (€15) | €7      |
| TOTAL           | €33/mo          | €20.5/mo        | 38%     |

Multi-node horizontal-scale agreement (issue #733) preserved: still
1 CP + 2 workers minimum from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx32 → cpx21
  worker_size        default cpx32 → cpx31
  Validation regex unchanged (cxNN | cpxNN | ccxNN | caxNN).

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Add CPX11, CPX21, CPX31 catalog entries.
  Move recommended:true from CPX32 → CPX21 (control-plane default).
  Add defaultWorkerSizeId() — Hetzner returns 'cpx31', other providers
  fall through to defaultNodeSizeId() symmetric default.

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  First-visit useEffect + handleSelectProvider now call
  defaultWorkerSizeId(provider) for the worker SKU instead of mirroring
  the CP SKU. Comment updated naming the cost-optimised pair.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx32'] → ['cpx21'].

If a Sovereign exhibits CP RAM pressure with this default, the next safe
stop UP is cpx31 (4 vCPU / 8 GB, ~€7.5/mo) — never back to cpx32.

Closes #740.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:00:01 +04:00
e3mrah
7ec25b9736
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single
CPX52 control plane and zero workers — completely discarding horizontal
scalability. Restore the originally agreed shape: 1 CPX32 control plane
+ 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total — same
aggregate footprint as a CPX52 vertical-scale, but with multi-node fault
tolerance and the architectural shape clusters/_template/ was designed
for).

Changes:
- infra/hetzner/variables.tf — defaults: control_plane_size cx42→cpx32,
  worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf — add hcloud_load_balancer_target.workers so the
  Hetzner LB targets every node (CP + workers); Cilium Gateway DaemonSet
  on every node serves ingress on its NodePort, so any node can absorb
  traffic for genuine horizontal scale.
- infra/hetzner/README.md — sizing rationale rewritten around horizontal
  scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model — INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider — first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes — `recommended: true` flag moves cpx52→cpx32; CPX52
  description updated to "solo dev when worker_count=0".

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit
  workerCount: 0 keep working — only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path — unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:57:53 +04:00
e3mrah
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:47:37 +04:00
e3mrah
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Cilium 1.19.3 → 1.16.5 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
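The kernel rule reduces to a one-line predicate (defaultStart mirrors the sysctl's default of 1024; names are illustrative):

```go
package main

import "fmt"

// needsCapNetBind mirrors the kernel gate referenced above: bind() is
// only privileged for ports below net.ipv4.ip_unprivileged_port_start,
// so 30080/30443 bind without NET_BIND_SERVICE while the Hetzner LB
// keeps :80/:443 public-facing.
func needsCapNetBind(port, unprivilegedPortStart int) bool {
	return port < unprivilegedPortStart
}

func main() {
	const defaultStart = 1024
	fmt.Println(needsCapNetBind(80, defaultStart))    // true — old listener
	fmt.Println(needsCapNetBind(30080, defaultStart)) // false — new listener
}
```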

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681)
calls contabo's authoritative PowerDNS at pdns.openova.io to write
DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook
needs an X-API-Key Secret in the Sovereign's cert-manager namespace —
PR #681 didn't ship the materialization seam, so on otech43..otech47
the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
   openova-system/powerdns-api-credentials extended from "external-dns"
   to "external-dns,catalyst" so contabo catalyst-api can mount the
   API key.

2. bp-powerdns: api.basicAuth.enabled flips default true to false.
   Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that
   blocked machine-to-machine API access from Sovereigns. The X-API-Key
   contract is unchanged.

3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
   CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key
   secret (optional=true so Sovereign-side catalyst-api Pods that don't
   reflect this still start clean).

4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field
   reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every
   Request before Validate(). Forwards as tofu var powerdns_api_key.

5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
   default "").

6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
   dynadot-api-credentials Secret block (PR #681 dropped
   bp-cert-manager-dynadot-webhook) with a new
   cert-manager/powerdns-api-credentials Secret block. runcmd applies
   it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:23:27 +04:00
e3mrah
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
PR #611 squash accidentally reverted the Phase-8b infra additions from PR #615
(92fdda42). Restores:
- cloudinit-control-plane.tftpl: write_files entry for /var/lib/catalyst/handover-jwt-public.jwk (mode 0600)
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Without these, new Sovereign provisioning runs will not write the public key
to disk and auth/handover on the Sovereign will return 503 (key unavailable).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:21:16 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards have a pre-existing failure on main (unrelated to this PR). Resolves #605.
2026-05-02 19:07:27 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture + bucket
provisioning Phase 0b path described in the omantel handover WBS §5.
Hybrid Option A+B: wizard collects operator-issued S3 credentials (Hetzner
exposes no Cloud API to mint them — they're issued once in the Hetzner
Console and the secret half is shown exactly once), and OpenTofu
auto-provisions the per-Sovereign bucket via the aminueza/minio provider
+ writes a flux-system/hetzner-object-storage Secret into the new
Sovereign at cloud-init time so Harbor (#383) and Velero (#384) find
their backing-store credentials already in the cluster from Phase 1
onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries / scripts
/ operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
  - versions.tf: declare aminueza/minio provider (Hetzner's official
    recommendation for S3-compatible bucket creation per
    docs.hetzner.com/storage/object-storage/getting-started/...)
  - variables.tf: 4 sensitive vars — region (validated against
    fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
    access_key, secret_key, bucket_name (RFC-compliant S3 naming)
  - main.tf: minio_s3_bucket.main resource — idempotent on re-apply,
    no force_destroy (Velero archive must survive a control-plane
    reinstall), object_locking=false (content-addressed digests are
    the immutability guarantee for Harbor; Velero uses S3 versioning)
  - cloudinit-control-plane.tftpl: write
    flux-system/hetzner-object-storage Secret with the canonical
    s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
    Harbor + Velero charts consume via existingSecret refs
  - outputs.tf: surface endpoint/region/bucket back to catalyst-api
    for the deployment record (credentials NEVER returned)
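
A minimal sketch of the variables.tf shape described above (the validation
expressions are illustrative, not the module's exact rules):

```hcl
variable "region" {
  type      = string
  sensitive = true
  validation {
    condition     = contains(["fsn1", "nbg1", "hel1"], var.region)
    error_message = "region must be one of the Hetzner Object Storage locations: fsn1, nbg1, hel1."
  }
}

variable "bucket_name" {
  type      = string
  sensitive = true
  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$", var.bucket_name))
    error_message = "bucket_name must follow RFC-compliant S3 naming."
  }
}
```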

products/catalyst/bootstrap/api/ (Go):
  - internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
    ListBuckets validator. Distinguishes auth failure ("rejected") from
    network failure ("unreachable") so the wizard renders the right
    error card. NOT a parallel cloud-resource path — the existing
    purge.go handles hcloud purge; objectstorage.go handles a separate
    API surface (S3-compatible) that has no equivalent client today.
  - internal/handler/credentials.go: extend with
    ValidateObjectStorageCredentials handler — same wire shape
    (200 valid:true / 200 valid:false / 503 unreachable / 400 bad
    input) as the existing token validator so the wizard's failure-
    card machinery handles both without per-endpoint switches.
  - cmd/api/main.go: wire POST
    /api/v1/credentials/object-storage/validate
  - internal/provisioner/provisioner.go: extend Request with
    ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate() fails
    fast on empty/malformed values at POST /api/v1/deployments time;
    writeTfvars() emits the 4 new tfvars.
  - internal/handler/deployments.go: derive bucket name from FQDN slug
    pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>) so
    Hetzner's globally-namespaced bucket pool gets a deterministic,
    collision-resistant per-Sovereign name without operator input.
  - internal/store/store.go: redact access/secret keys; preserve
    region+bucket plain (they're public in tofu outputs anyway).
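
The bucket-name derivation amounts to roughly this (helper name is
illustrative; lowercasing is an added assumption since S3 bucket names must
be lowercase, and the shipped code may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// deriveBucketName sketches the server-side derivation: a
// deterministic catalyst-<fqdn-with-dots-replaced-by-dashes> name,
// so Hetzner's globally-namespaced bucket pool gets a
// collision-resistant per-Sovereign name without operator input.
func deriveBucketName(fqdn string) string {
	return "catalyst-" + strings.ReplaceAll(strings.ToLower(fqdn), ".", "-")
}

func main() {
	fmt.Println(deriveBucketName("otech.omani.works"))
}
```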

products/catalyst/bootstrap/ui/ (TypeScript / React):
  - entities/deployment/model.ts + store.ts: 4 new wizard fields
    (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
    coercion for legacy persisted state.
  - pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
    region picker (fsn1/nbg1/hel1), masked secret-key input,
    Validate button gating Next. Same FailureCard taxonomy
    (rejected/too-short/unreachable/network/parse/http) the existing
    TokenSection uses, so the operator UX is consistent. Section
    only renders when Hetzner is among chosen providers — non-Hetzner
    Sovereigns skip Phase 0b until their own backing-store path lands.
  - pages/wizard/steps/StepReview.tsx: include
    objectStorageRegion/AccessKey/SecretKey in the
    POST /v1/deployments payload (bucket derived server-side).

Tests:
  - api: 7 new provisioner Validate tests (region/keys/bucket
    required + RFC-compliant + valid-region acceptance), 5 handler
    tests for the new endpoint (bad JSON / missing region / invalid
    region / short keys), 4 hetzner/objectstorage_test.go tests
    (endpoint composition + early input rejection), 1 handler test
    for the bucket-name derivation. Existing tests updated to supply
    the new required fields.
  - ui: StepCredentials.test.tsx pre-populates objectStorageValidated
    in beforeEach so the existing 11 SSH-section tests aren't gated
    on Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references.
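
The resulting in-cluster Secret would look roughly like this (endpoint URL
and credential values are placeholders, not real output):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hetzner-object-storage
  namespace: flux-system
type: Opaque
stringData:
  s3-endpoint: https://fsn1.your-objectstorage.com
  s3-region: fsn1
  s3-bucket: catalyst-otech-omani-works
  s3-access-key: <ACCESS_KEY>
  s3-secret-key: <SECRET_KEY>
```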

Closes #371.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST capture
them; OpenTofu auto-provisions the bucket + cloud-init writes the
flux-system/hetzner-object-storage Secret with the canonical s3-* keys
Harbor + Velero consume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:54:22 +04:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init PUTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd.

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT.

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 with
   {"error":"kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.
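
Steps 1 and 4 can be sketched in a few lines (the bodies here are an
illustrative reconstruction, not the shipped code):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newBearerToken mints a 32-byte random bearer, hex-encoded (step 1).
func newBearerToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashBearerToken returns the hex SHA-256, the only value persisted
// on the catalyst-api side.
func hashBearerToken(tok string) string {
	sum := sha256.Sum256([]byte(tok))
	return hex.EncodeToString(sum[:])
}

// bearerMatches hashes the inbound bearer and compares it to the
// persisted hash in constant time (step 4).
func bearerMatches(inbound, storedHash string) bool {
	return subtle.ConstantTimeCompare(
		[]byte(hashBearerToken(inbound)), []byte(storedHash)) == 1
}

func main() {
	tok, _ := newBearerToken()
	fmt.Println(len(tok), bearerMatches(tok, hashBearerToken(tok)))
}
```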

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both fixed by extending the regex with the cpx* family AND permitting
the empty string on worker_size when the operator runs a solo Sovereign.
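
The extended rule amounts to roughly this, shown as a Go regexp stand-in for
the HCL validation (the exact pattern in variables.tf may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// Stand-in for the variables.tf validation: the cpx (AMD shared)
// family joins cx/ccx/cax, and worker_size may be empty for a solo
// Sovereign (worker_count = 0).
var skuRe = regexp.MustCompile(`^(cx|cpx|ccx|cax)[0-9]+$`)

func validControlPlaneSize(s string) bool {
	return skuRe.MatchString(s)
}

func validWorkerSize(s string) bool {
	return s == "" || skuRe.MatchString(s)
}

func main() {
	fmt.Println(validControlPlaneSize("cpx32"), validWorkerSize(""))
}
```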

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire — failure now requires a real
Hetzner token instead of being blocked at module-validation time.
2026-04-29 14:43:52 +02:00
hatiyildiz
4ee9e7dd6f fix(wizard): topology before provider; per-provider SKU catalog; per-region sizing
The wizard step order was inverted: it asked for the provider before the
topology, then put hetzner-only SKUs inside the topology step. Topology
decides how many regions exist; provider is a per-region property; SKU
vocabulary is per-provider (cx32 means nothing on Azure). Fixes all three.

New step order (WIZARD_STEPS + WizardPage STEPS): Org -> Topology ->
Provider -> Credentials -> Components -> Domain -> Review.

Per-provider SKU catalog at products/catalyst/bootstrap/ui/src/shared/
constants/providerSizes.ts replaces the legacy hetzner-only HETZNER_NODE_SIZES.
Five providers (hetzner, huawei, oci, aws, azure), each with realistic SKU
options drawn from that vendor's native instance-type vocabulary. Every
SKU read in the wizard goes through PROVIDER_NODE_SIZES[provider] -- no
SKU literal lives anywhere else.

StepProvider now renders one card per topology slot. Each card carries:
provider chooser, that provider's region picker, that provider's
control-plane SKU, that provider's worker SKU + count. Cost rollup sums
each region's (cp + worker*count) at its OWN provider's pricing, so a
mixed-cloud topology computes correctly.
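
The rollup logic amounts to roughly this (type names and the hourly price
table are invented for the example):

```go
package main

import "fmt"

// region mirrors one topology slot: provider, SKUs, worker count.
type region struct {
	provider, cpSKU, workerSKU string
	workers                    int
}

// Illustrative per-provider hourly rates; real values live in the
// per-provider SKU catalog.
var hourly = map[string]map[string]float64{
	"hetzner": {"cpx32": 0.0232, "cx42": 0.047},
	"aws":     {"m5.large": 0.096},
}

// rollup prices each region at its OWN provider's catalog, so a
// mixed-cloud topology sums correctly.
func rollup(regions []region) float64 {
	total := 0.0
	for _, r := range regions {
		p := hourly[r.provider]
		total += p[r.cpSKU] + p[r.workerSKU]*float64(r.workers)
	}
	return total
}

func main() {
	fmt.Printf("%.4f\n", rollup([]region{
		{"hetzner", "cpx32", "cx42", 2},
		{"aws", "m5.large", "m5.large", 1},
	}))
}
```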

StepTopology drops the SkuCard + NodeSizingPanel; it now captures only
the topology template, HA flag, and AIR-GAP add-on.

Per-region store fields (regionControlPlaneSizes, regionWorkerSizes,
regionWorkerCounts) replace the singular controlPlaneSize/workerSize/
workerCount as the canonical shape. Migration in store.merge() hydrates
the arrays from any persisted singular fields; the cx22 legacy default
is treated as "no selection" so a hetzner-only id never leaks into a
non-hetzner region.

Backend Request gains an optional Regions []RegionSpec field. Validate
mirrors Regions[0] into the legacy singular fields for the existing
solo-Hetzner writeTfvars path. infra/hetzner/variables.tf accepts the
list-of-objects shape; the for_each iteration that activates the rest
of the regions is the multi-region tofu wiring follow-up. Door open
structurally; no shape compromised.

Dead code removed: StepInfrastructure and shared/constants/hetzner.ts
(both orphaned, contained the only HETZNER_NODE_SIZES reference outside
the catalog).

Gates: tsc --noEmit, vite build, vitest (149 tests), go vet, go test
(provisioner + handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:33 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open port (80, 443, 6443, ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30.
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.
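
The sshd drop-in would look roughly like this (the exact directive set is a
sketch of the settings listed above):

```
PasswordAuthentication no
PermitRootLogin prohibit-password
AllowTcpForwarding no
AllowAgentForwarding no
X11Forwarding no
MaxAuthTries 3
LoginGraceTime 30
```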

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called Hetzner Cloud API directly + exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
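
The streaming exec loop can be sketched like this (demonstrated with a
stand-in command; names are illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// streamCmd runs one step (in catalyst-api: tofu init/plan/apply
// inside the per-deployment workdir) and hands each stdout line to
// emit, which in the real handler feeds the SSE channel.
func streamCmd(name string, args []string, emit func(string)) error {
	cmd := exec.Command(name, args...)
	pipe, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	sc := bufio.NewScanner(pipe)
	for sc.Scan() {
		emit(sc.Text())
	}
	return cmd.Wait()
}

func main() {
	// Stand-in command; the real invoker would set cmd.Dir to the
	// staged workdir and run the tofu binary.
	_ = streamCmd("echo", []string{"Plan: 3 to add"}, func(l string) {
		fmt.Println("SSE:", l)
	})
}
```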

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00