Commit Graph

21 Commits

e3mrah
7bfd6df588
fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
Five stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a
fresh post-handover Sovereign (surfaced live on otech103, 2026-05-05), plus
a sixth gap (ghcr-pull reflector for catalyst-system). All six are fixed in
one PR so that a single chart bump + cloud-init re-render closes the gap
end-to-end.

Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL=
https://pool.openova.io. The in-cluster Service default only resolves on
contabo; on Sovereigns every Day-2 POST died with NXDOMAIN.

Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env vars from
a new pdm-basicauth Secret, and have pdmFlipNS call SetBasicAuth with those envs.
The PDM public ingress at pool.openova.io is gated by Traefik basicAuth;
calls without Authorization: Basic returned 401. optional=true so contabo
+ CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable
Principle #10, the credentials only ever live in Pod env + are read once
per call by pdmFlipNS — never enter a logged struct or persisted record.

Bug 3 (code, parent_domains.go): pdmFlipNS body now includes the required
nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema
requires it; the previous body got 422 missing-nameservers.

Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to
SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover
Sovereign no Deployment record is persisted, so without this fallback GET
/parent-domains returned {"items":[]} and the propagation panel showed
expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml
from the sovereign-fqdn ConfigMap.

Bug 5 (chart, httproute.yaml): catalyst-ui /auth/* PathPrefix narrowed to
Exact /auth/handover. The previous PathPrefix collided with the OIDC PKCE
redirect_uri /auth/callback — catalyst-api 404s on that path because it
only registers /api/v1/auth/callback, breaking login once the handover-JWT
cookie expires. The Exact match keeps /auth/handover routed to catalyst-api
while every other /auth/* path falls through to catalyst-ui's React Router
for client-side OIDC.

Bug 6 (cloud-init): ghcr-pull + harbor-robot-token + new pdm-basicauth
Reflector annotations enumerate explicit allowed/auto-namespaces (sme,
catalyst, catalyst-system, gitea, harbor) instead of empty-string. The
ambiguous empty-string interpretation caused otech103 to require a manual
catalyst-system mirror creation; explicit list back-ports the verified
working state.

Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields
+ tfvars emission so the contabo catalyst-api can stamp the credentials
onto every Sovereign provision request. variables.tf adds matching
pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default
empty) so older provisioner builds that pre-date this change keep
rendering valid cloud-init (the Secret renders with empty values and
Pod start is unaffected).

Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes
the architectural blockers tracked in #879; the catalyst-api image
rebuild + chart republish run via the existing CI pipelines
(services-build.yaml + blueprint-release.yaml) on this commit's SHA.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 09:02:39 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).
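The client's idempotency contract can be sketched as follows; the signature and helper names are illustrative (the real client lives in internal/powerdns), but the URL layout matches the PowerDNS API endpoint named above:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// zoneCreateOK encodes the idempotency rule: 201 means created, while
// 409/412 ("zone already exists" on different PowerDNS versions) also
// count as success so re-runs after upgrades never fail.
func zoneCreateOK(status int) bool {
	switch status {
	case http.StatusCreated, http.StatusConflict, http.StatusPreconditionFailed:
		return true
	}
	return false
}

// CreateZone POSTs a zone body to /api/v1/servers/<id>/zones with the
// X-API-Key header. Sketch only — error shapes in the real client differ.
func CreateZone(c *http.Client, baseURL, serverID, apiKey string, body []byte) error {
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/servers/"+serverID+"/zones", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("X-API-Key", apiKey)
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if !zoneCreateOK(resp.StatusCode) {
		return fmt.Errorf("powerdns: unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Fake PowerDNS that answers 409 (zone already exists).
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusConflict)
	}))
	defer srv.Close()
	err := CreateZone(srv.Client(), srv.URL, "localhost", "test-key",
		[]byte(`{"name":"omani.trade."}`))
	fmt.Println(err == nil) // true — 409 treated as already-created
}
```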

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
05065b66d6
fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs

Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04.
GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in
fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in
those DCs with:

  {"error":{"code":"invalid_input",
            "message":"unsupported location for server type"}}

Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate
DELETE. cpx22 + cpx32 were also probed as a sanity check and returned
ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises
prices for every (SKU, location) pair regardless of orderability.
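The rejection shape quoted above can be recognised mechanically. This Go sketch (type names illustrative) models just enough of the Hetzner error envelope to classify a probe result — a listed (SKU, DC) pair is only orderable if POST /v1/servers does NOT answer with it:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hcloudError models the error envelope quoted in the reproducer above.
type hcloudError struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// unorderable reports whether a server-create response body is the
// listed-but-not-orderable rejection seen for cpx21/cpx31 in EU DCs.
func unorderable(body []byte) bool {
	var e hcloudError
	if json.Unmarshal(body, &e) != nil {
		return false
	}
	return e.Error.Code == "invalid_input" &&
		e.Error.Message == "unsupported location for server type"
}

func main() {
	reject := []byte(`{"error":{"code":"invalid_input",
		"message":"unsupported location for server type"}}`)
	fmt.Println(unorderable(reject)) // true for cpx21/cpx31 in fsn1/nbg1/hel1
}
```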

Conclusion: NO SKU bump-back. cpx22 + cpx32 (PR #744) remain the floor.
README + variables.tf docstrings now carry the durable reproducer so future
engineers don't re-attempt cpx21/cpx31.

#753 — kubectl retry / LKG observer reliability

/tmp/autopilot.sh updated (script lives outside the repo, on the VPS):
  • Every kubectl call carries --request-timeout=8s so a hung TLS handshake
    surfaces as a fast empty rather than a 30s+ stall.
  • Last-known-good (LKG) state held across transient flakes: hr/cert/nodes
    no longer flip to "0/0 nodes=0" on a single failed poll.
  • Only 3 consecutive transients count as a real failure; below the
    threshold the observer prints "hr=<LKG> (transient N/3)".
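The script itself is shell and lives outside the repo, but the LKG + threshold rule can be sketched in Go (names illustrative; threshold 3 and the "transient N/3" line match the description above):

```go
package main

import "fmt"

// observer holds last-known-good state across transient poll failures:
// only a successful poll replaces the LKG value, and only 3 consecutive
// transients count as a real failure.
type observer struct {
	lkg        string
	transients int
}

func (o *observer) poll(val string, ok bool) string {
	if ok {
		o.lkg, o.transients = val, 0
		return val
	}
	o.transients++
	if o.transients < 3 {
		return fmt.Sprintf("hr=%s (transient %d/3)", o.lkg, o.transients)
	}
	return "FAILED" // threshold reached: report a real failure
}

func main() {
	o := &observer{lkg: "0/0"}
	fmt.Println(o.poll("38/38", true)) // 38/38
	fmt.Println(o.poll("", false))     // hr=38/38 (transient 1/3)
	fmt.Println(o.poll("", false))     // hr=38/38 (transient 2/3)
	fmt.Println(o.poll("", false))     // FAILED
}
```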

UI side: the wizard's StatusPill / ApplicationPage drive off SSE from
catalyst-api (useDeploymentEvents.ts), not direct kubectl polling, so no UI
change needed. catalyst-api itself uses client-go (helmwatch / phase1_watch),
not exec kubectl, so its observer is not subject to the same shell-out flake.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:11:44 +04:00
e3mrah
b02fc3788a
fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request

Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

  validation {
    condition = alltrue([
      for r in var.regions :
      contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
    ])
  }

The `for r in var.regions` iteration fails on null with:

  Error: Iteration over null value
  on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
  serialises as JSON `[]`, never `null`, when the request has no
  per-region overrides.

Builds on PR #742.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)

Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the
cpx21 CP default from PR #741 fell apart at apply time —

  Error: Server Type "cpx21" is unavailable in "fsn1" and can no
  longer be ordered

Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog
(`/v1/server_types`) but are NOT in the per-DC orderable list
(`available_for_migration` on `/v1/datacenters`) for any EU DC
(fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on
for new Sovereigns in those regions.

Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
  • cpx11 (2 vCPU / 2 GB) — too small for the CP working set
  • cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
  • cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
  • cpx42, cpx52, cpx62 — bigger and more expensive

New default per Sovereign:

| Component       | Old             | New              | Savings |
|-----------------|-----------------|------------------|---------|
| Control plane   | CPX32 (€16.49)  | CPX22 (€9.49)    | €7.00   |
| Worker × 2      | CPX32 × 2 (€33) | CPX32 × 2 (€33)  | €0      |
| TOTAL           | €49.47/mo       | €42.47/mo        | 14%     |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo)
assumed those SKUs were orderable. They aren't in EU DCs. The 14%
saving from cpx22 CP is the largest concrete optimisation that
ships TODAY without compromising the multi-node horizontal-scale
agreement (issue #733): still 1 CP + 2 workers from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size        default cpx31 → cpx32 (back to the prior orderable choice)

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing
  (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49).
  Mark both as "listed but NOT orderable in EU DCs" so the wizard
  surfaces the constraint instead of letting operators pick a
  non-orderable SKU.
  Move recommended:true from CPX21 → CPX22.
  defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:35:55 +04:00
e3mrah
994c2d1c2a
fix(provisioner): cost-optimized default sizes — cpx21 CP + cpx31 workers (38% saving) (#741)
The new Sovereign default after PR #736 / #738 / #739 was 1× CPX32 control
plane + 2× CPX32 workers — €33/mo per Sovereign. CPX32 is over-provisioned
for the CP working set: the CP carries only k3s (apiserver/etcd/scheduler/
controller-manager) + cilium-operator + flux controllers + cert-manager +
sealed-secrets — NOT the heavy bp-keycloak/cnpg/harbor/openbao/grafana
stack (those land on workers because the bootstrap-kit explicitly schedules
them off the CP taint).

CP RAM budget: etcd ~512 MB + control plane ~1.5 GB + cilium/flux/
cert-manager/sealed-secrets ~1 GB + OS ~512 MB ≈ 3.5 GB — fits CPX21's
4 GB. Workers stay at 8 GB on CPX31 since RAM is the binding constraint
for the bootstrap-kit's worker pods, not vCPU.

New default per Sovereign:

| Component       | Old             | New             | Savings |
|-----------------|-----------------|-----------------|---------|
| Control plane   | CPX32 (€11/mo)  | CPX21 (€5.5/mo) | €5.5    |
| Worker × 2      | CPX32 × 2 (€22) | CPX31 × 2 (€15) | €7      |
| TOTAL           | €33/mo          | €20.5/mo        | 38%     |

Multi-node horizontal-scale agreement (issue #733) preserved: still
1 CP + 2 workers minimum from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx32 → cpx21
  worker_size        default cpx32 → cpx31
  Validation regex unchanged (cxNN | cpxNN | ccxNN | caxNN).

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Add CPX11, CPX21, CPX31 catalog entries.
  Move recommended:true from CPX32 → CPX21 (control-plane default).
  Add defaultWorkerSizeId() — Hetzner returns 'cpx31', other providers
  fall through to defaultNodeSizeId() symmetric default.

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  First-visit useEffect + handleSelectProvider now call
  defaultWorkerSizeId(provider) for the worker SKU instead of mirroring
  the CP SKU. Comment updated naming the cost-optimised pair.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx32'] → ['cpx21'].

If a Sovereign exhibits CP RAM pressure with this default, the next safe
stop UP is cpx31 (4 vCPU / 8 GB, ~€7.5/mo) — never back to cpx32.

Closes #740.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:00:01 +04:00
e3mrah
7ec25b9736
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single
CPX52 control plane and zero workers — completely discarding horizontal
scalability. Restore the originally agreed shape: 1 CPX32 control plane
+ 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total — same
aggregate footprint as a CPX52 vertical-scale, but with multi-node fault
tolerance and the architectural shape clusters/_template/ was designed
for).

Changes:
- infra/hetzner/variables.tf — defaults: control_plane_size cx42→cpx32,
  worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf — add hcloud_load_balancer_target.workers so the
  Hetzner LB targets every node (CP + workers); Cilium Gateway DaemonSet
  on every node serves ingress on its NodePort, so any node can absorb
  traffic for genuine horizontal scale.
- infra/hetzner/README.md — sizing rationale rewritten around horizontal
  scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model — INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider — first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes — `recommended: true` flag moves cpx52→cpx32; CPX52
  description updated to "solo dev when worker_count=0".

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit
  workerCount: 0 keep working — only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path — unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:57:53 +04:00
e3mrah
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:47:37 +04:00
e3mrah
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Cilium 1.19.3 → 1.16.5 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
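The kernel rule reduces to a one-line predicate (defaultStart mirrors the sysctl's default of 1024; names are illustrative):

```go
package main

import "fmt"

// needsCapNetBind mirrors the kernel gate referenced above: bind() is
// only privileged for ports below net.ipv4.ip_unprivileged_port_start,
// so 30080/30443 bind without NET_BIND_SERVICE while the Hetzner LB
// keeps :80/:443 public-facing.
func needsCapNetBind(port, unprivilegedPortStart int) bool {
	return port < unprivilegedPortStart
}

func main() {
	const defaultStart = 1024
	fmt.Println(needsCapNetBind(80, defaultStart))    // true — old listener
	fmt.Println(needsCapNetBind(30080, defaultStart)) // false — new listener
}
```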

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681)
calls contabo's authoritative PowerDNS at pdns.openova.io to write
DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook
needs an X-API-Key Secret in the Sovereign's cert-manager namespace —
PR #681 didn't ship the materialization seam, so on otech43..otech47
the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
   openova-system/powerdns-api-credentials extended from "external-dns"
   to "external-dns,catalyst" so contabo catalyst-api can mount the
   API key.

2. bp-powerdns: api.basicAuth.enabled flips default true to false.
   Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that
   blocked machine-to-machine API access from Sovereigns. The X-API-Key
   contract is unchanged.

3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
   CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key
   secret (optional=true so Sovereign-side catalyst-api Pods that don't
   reflect this still start clean).

4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field
   reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every
   Request before Validate(). Forwards as tofu var powerdns_api_key.

5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
   default "").

6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
   dynadot-api-credentials Secret block (PR #681 dropped
   bp-cert-manager-dynadot-webhook) with a new
   cert-manager/powerdns-api-credentials Secret block. runcmd applies
   it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:23:27 +04:00
e3mrah
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
PR #611 squash accidentally reverted the Phase-8b infra additions from PR #615
(92fdda42). Restores:
- cloudinit-control-plane.tftpl: write_files entry for /var/lib/catalyst/handover-jwt-public.jwk (mode 0600)
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Without these, new Sovereign provisioning runs will not write the public key
to disk and auth/handover on the Sovereign will return 503 (key unavailable).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:21:16 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards have a pre-existing failure on main (unrelated to this PR). Resolves #605.
2026-05-02 19:07:27 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture + bucket
provisioning Phase 0b path described in the omantel handover WBS §5.
Hybrid Option A+B: wizard collects operator-issued S3 credentials (Hetzner
exposes no Cloud API to mint them — they're issued once in the Hetzner
Console and the secret half is shown exactly once), and OpenTofu
auto-provisions the per-Sovereign bucket via the aminueza/minio provider
+ writes a flux-system/hetzner-object-storage Secret into the new
Sovereign at cloud-init time so Harbor (#383) and Velero (#384) find
their backing-store credentials already in the cluster from Phase 1
onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries / scripts
/ operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
  - versions.tf: declare aminueza/minio provider (Hetzner's official
    recommendation for S3-compatible bucket creation per
    docs.hetzner.com/storage/object-storage/getting-started/...)
  - variables.tf: 4 sensitive vars — region (validated against
    fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
    access_key, secret_key, bucket_name (RFC-compliant S3 naming)
  - main.tf: minio_s3_bucket.main resource — idempotent on re-apply,
    no force_destroy (Velero archive must survive a control-plane
    reinstall), object_locking=false (content-addressed digests are
    the immutability guarantee for Harbor; Velero uses S3 versioning)
  - cloudinit-control-plane.tftpl: write
    flux-system/hetzner-object-storage Secret with the canonical
    s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
    Harbor + Velero charts consume via existingSecret refs
  - outputs.tf: surface endpoint/region/bucket back to catalyst-api
    for the deployment record (credentials NEVER returned)
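
A minimal sketch of the variables.tf shape described above (the validation
expressions are illustrative, not the module's exact rules):

```hcl
variable "region" {
  type      = string
  sensitive = true
  validation {
    condition     = contains(["fsn1", "nbg1", "hel1"], var.region)
    error_message = "region must be one of the Hetzner Object Storage locations: fsn1, nbg1, hel1."
  }
}

variable "bucket_name" {
  type      = string
  sensitive = true
  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$", var.bucket_name))
    error_message = "bucket_name must follow RFC-compliant S3 naming."
  }
}
```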

products/catalyst/bootstrap/api/ (Go):
  - internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
    ListBuckets validator. Distinguishes auth failure ("rejected") from
    network failure ("unreachable") so the wizard renders the right
    error card. NOT a parallel cloud-resource path — the existing
    purge.go handles hcloud purge; objectstorage.go handles a separate
    API surface (S3-compatible) that has no equivalent client today.
  - internal/handler/credentials.go: extend with
    ValidateObjectStorageCredentials handler — same wire shape
    (200 valid:true / 200 valid:false / 503 unreachable / 400 bad
    input) as the existing token validator so the wizard's failure-
    card machinery handles both without per-endpoint switches.
  - cmd/api/main.go: wire POST
    /api/v1/credentials/object-storage/validate
  - internal/provisioner/provisioner.go: extend Request with
    ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate() fails
    fast on empty/malformed values at POST /api/v1/deployments time;
    writeTfvars() emits the 4 new tfvars.
  - internal/handler/deployments.go: derive bucket name from FQDN slug
    pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>) so
    Hetzner's globally-namespaced bucket pool gets a deterministic,
    collision-resistant per-Sovereign name without operator input.
  - internal/store/store.go: redact access/secret keys; preserve
    region+bucket plain (they're public in tofu outputs anyway).
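
The bucket-name derivation amounts to roughly this (helper name is
illustrative; lowercasing is an added assumption since S3 bucket names must
be lowercase, and the shipped code may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// deriveBucketName sketches the server-side derivation: a
// deterministic catalyst-<fqdn-with-dots-replaced-by-dashes> name,
// so Hetzner's globally-namespaced bucket pool gets a
// collision-resistant per-Sovereign name without operator input.
func deriveBucketName(fqdn string) string {
	return "catalyst-" + strings.ReplaceAll(strings.ToLower(fqdn), ".", "-")
}

func main() {
	fmt.Println(deriveBucketName("otech.omani.works"))
}
```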

products/catalyst/bootstrap/ui/ (TypeScript / React):
  - entities/deployment/model.ts + store.ts: 4 new wizard fields
    (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
    coercion for legacy persisted state.
  - pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
    region picker (fsn1/nbg1/hel1), masked secret-key input,
    Validate button gating Next. Same FailureCard taxonomy
    (rejected/too-short/unreachable/network/parse/http) the existing
    TokenSection uses, so the operator UX is consistent. Section
    only renders when Hetzner is among chosen providers — non-Hetzner
    Sovereigns skip Phase 0b until their own backing-store path lands.
  - pages/wizard/steps/StepReview.tsx: include
    objectStorageRegion/AccessKey/SecretKey in the
    POST /v1/deployments payload (bucket derived server-side).

Tests:
  - api: 7 new provisioner Validate tests (region/keys/bucket
    required + RFC-compliant + valid-region acceptance), 5 handler
    tests for the new endpoint (bad JSON / missing region / invalid
    region / short keys), 4 hetzner/objectstorage_test.go tests
    (endpoint composition + early input rejection), 1 handler test
    for the bucket-name derivation. Existing tests updated to supply
    the new required fields.
  - ui: StepCredentials.test.tsx pre-populates objectStorageValidated
    in beforeEach so the existing 11 SSH-section tests aren't gated
    on Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references.
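
The resulting in-cluster Secret would look roughly like this (endpoint URL
and credential values are placeholders, not real output):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hetzner-object-storage
  namespace: flux-system
type: Opaque
stringData:
  s3-endpoint: https://fsn1.your-objectstorage.com
  s3-region: fsn1
  s3-bucket: catalyst-otech-omani-works
  s3-access-key: <ACCESS_KEY>
  s3-secret-key: <SECRET_KEY>
```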

Closes #371.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST capture
them; OpenTofu auto-provisions the bucket + cloud-init writes the
flux-system/hetzner-object-storage Secret with the canonical s3-* keys
Harbor + Velero consume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:54:22 +04:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init PUTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd.

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT.

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 with
   {"error":"kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.
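
Steps 1 and 4 can be sketched in a few lines (the bodies here are an
illustrative reconstruction, not the shipped code):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newBearerToken mints a 32-byte random bearer, hex-encoded (step 1).
func newBearerToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashBearerToken returns the hex SHA-256, the only value persisted
// on the catalyst-api side.
func hashBearerToken(tok string) string {
	sum := sha256.Sum256([]byte(tok))
	return hex.EncodeToString(sum[:])
}

// bearerMatches hashes the inbound bearer and compares it to the
// persisted hash in constant time (step 4).
func bearerMatches(inbound, storedHash string) bool {
	return subtle.ConstantTimeCompare(
		[]byte(hashBearerToken(inbound)), []byte(storedHash)) == 1
}

func main() {
	tok, _ := newBearerToken()
	fmt.Println(len(tok), bearerMatches(tok, hashBearerToken(tok)))
}
```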

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both fixed by extending the regex with the cpx* family AND permitting
the empty string on worker_size when the operator runs a solo Sovereign.
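
The extended rule amounts to roughly this, shown as a Go regexp stand-in for
the HCL validation (the exact pattern in variables.tf may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// Stand-in for the variables.tf validation: the cpx (AMD shared)
// family joins cx/ccx/cax, and worker_size may be empty for a solo
// Sovereign (worker_count = 0).
var skuRe = regexp.MustCompile(`^(cx|cpx|ccx|cax)[0-9]+$`)

func validControlPlaneSize(s string) bool {
	return skuRe.MatchString(s)
}

func validWorkerSize(s string) bool {
	return s == "" || skuRe.MatchString(s)
}

func main() {
	fmt.Println(validControlPlaneSize("cpx32"), validWorkerSize(""))
}
```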

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire — failure now requires a real
Hetzner token instead of being blocked at module-validation time.
2026-04-29 14:43:52 +02:00
hatiyildiz
4ee9e7dd6f fix(wizard): topology before provider; per-provider SKU catalog; per-region sizing
The wizard step order was inverted: it asked for the provider before the
topology, then put hetzner-only SKUs inside the topology step. Topology
decides how many regions exist; provider is a per-region property; SKU
vocabulary is per-provider (cx32 means nothing on Azure). Fixes all three.

New step order (WIZARD_STEPS + WizardPage STEPS): Org -> Topology ->
Provider -> Credentials -> Components -> Domain -> Review.

Per-provider SKU catalog at products/catalyst/bootstrap/ui/src/shared/
constants/providerSizes.ts replaces the legacy hetzner-only HETZNER_NODE_SIZES.
Five providers (hetzner, huawei, oci, aws, azure), each with realistic SKU
options drawn from that vendor's native instance-type vocabulary. Every
SKU read in the wizard goes through PROVIDER_NODE_SIZES[provider] -- no
SKU literal lives anywhere else.

StepProvider now renders one card per topology slot. Each card carries:
provider chooser, that provider's region picker, that provider's
control-plane SKU, that provider's worker SKU + count. Cost rollup sums
each region's (cp + worker*count) at its OWN provider's pricing, so a
mixed-cloud topology computes correctly.
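
The rollup logic amounts to roughly this (type names and the hourly price
table are invented for the example):

```go
package main

import "fmt"

// region mirrors one topology slot: provider, SKUs, worker count.
type region struct {
	provider, cpSKU, workerSKU string
	workers                    int
}

// Illustrative per-provider hourly rates; real values live in the
// per-provider SKU catalog.
var hourly = map[string]map[string]float64{
	"hetzner": {"cpx32": 0.0232, "cx42": 0.047},
	"aws":     {"m5.large": 0.096},
}

// rollup prices each region at its OWN provider's catalog, so a
// mixed-cloud topology sums correctly.
func rollup(regions []region) float64 {
	total := 0.0
	for _, r := range regions {
		p := hourly[r.provider]
		total += p[r.cpSKU] + p[r.workerSKU]*float64(r.workers)
	}
	return total
}

func main() {
	fmt.Printf("%.4f\n", rollup([]region{
		{"hetzner", "cpx32", "cx42", 2},
		{"aws", "m5.large", "m5.large", 1},
	}))
}
```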

StepTopology drops the SkuCard + NodeSizingPanel; it now captures only
the topology template, HA flag, and AIR-GAP add-on.

Per-region store fields (regionControlPlaneSizes, regionWorkerSizes,
regionWorkerCounts) replace the singular controlPlaneSize/workerSize/
workerCount as the canonical shape. Migration in store.merge() hydrates
the arrays from any persisted singular fields; the cx22 legacy default
is treated as "no selection" so a hetzner-only id never leaks into a
non-hetzner region.

Backend Request gains an optional Regions []RegionSpec field. Validate
mirrors Regions[0] into the legacy singular fields for the existing
solo-Hetzner writeTfvars path. infra/hetzner/variables.tf accepts the
list-of-objects shape; the for_each iteration that activates the rest
of the regions is the multi-region tofu wiring follow-up. Door open
structurally; no shape compromised.

Dead code removed: StepInfrastructure and shared/constants/hetzner.ts
(both orphaned, contained the only HETZNER_NODE_SIZES reference outside
the catalog).

Gates: tsc --noEmit, vite build, vitest (149 tests), go vet, go test
(provisioner + handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:33 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open port (80, 443, 6443, ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30.
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.
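
The sshd drop-in would look roughly like this (the exact directive set is a
sketch of the settings listed above):

```
PasswordAuthentication no
PermitRootLogin prohibit-password
AllowTcpForwarding no
AllowAgentForwarding no
X11Forwarding no
MaxAuthTries 3
LoginGraceTime 30
```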

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called Hetzner Cloud API directly + exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
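
The streaming exec loop can be sketched like this (demonstrated with a
stand-in command; names are illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// streamCmd runs one step (in catalyst-api: tofu init/plan/apply
// inside the per-deployment workdir) and hands each stdout line to
// emit, which in the real handler feeds the SSE channel.
func streamCmd(name string, args []string, emit func(string)) error {
	cmd := exec.Command(name, args...)
	pipe, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	sc := bufio.NewScanner(pipe)
	for sc.Scan() {
		emit(sc.Text())
	}
	return cmd.Wait()
}

func main() {
	// Stand-in command; the real invoker would set cmd.Dir to the
	// staged workdir and run the tofu binary.
	_ = streamCmd("echo", []string{"Plan: 3 to add"}, func(l string) {
		fmt.Println("SSE:", l)
	})
}
```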

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00