Commit Graph

36 Commits

Author SHA1 Message Date
e3mrah
96a5e3a20e
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23's first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
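
A minimal Go sketch of the same lookup (the commit only shows the curl form; the timeout and error handling here are illustrative):

  package main

  import (
      "fmt"
      "io"
      "net/http"
      "strings"
      "time"
  )

  // fetchPublicIPv4 asks the Hetzner metadata service for the node's own
  // public IPv4 — the same endpoint the cloud-init curl hits at boot.
  func fetchPublicIPv4() (string, error) {
      client := &http.Client{Timeout: 5 * time.Second}
      resp, err := client.Get("http://169.254.169.254/hetzner/v1/metadata/public-ipv4")
      if err != nil {
          return "", err
      }
      defer resp.Body.Close()
      body, err := io.ReadAll(resp.Body)
      if err != nil {
          return "", err
      }
      return strings.TrimSpace(string(body)), nil
  }

  func main() {
      ip, err := fetchPublicIPv4()
      if err != nil {
          fmt.Println("metadata lookup failed:", err)
          return
      }
      fmt.Println("control-plane public IPv4:", ip)
  }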

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:14:23 +04:00
e3mrah
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
The PR #611 squash accidentally reverted the Phase-8b infra additions from
PR #615 (92fdda42). Restores:
- cloudinit-control-plane.tftpl: write_files entry for /var/lib/catalyst/handover-jwt-public.jwk (mode 0600)
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Without these, new Sovereign provisioning runs will not write the public key
to disk and auth/handover on the Sovereign will return 503 (key unavailable).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:21:16 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow
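
A minimal, stdlib-only sketch of RS256 minting as used for the handover token (the real handoverjwt package, key loading, and claim set are not shown; the claim names below are illustrative):

  package main

  import (
      "crypto"
      "crypto/rand"
      "crypto/rsa"
      "crypto/sha256"
      "encoding/base64"
      "encoding/json"
      "fmt"
      "time"
  )

  // mintRS256 signs header.payload with RSASSA-PKCS1-v1_5 over SHA-256,
  // which is what the RS256 JOSE algorithm specifies.
  func mintRS256(key *rsa.PrivateKey, claims map[string]any) (string, error) {
      enc := base64.RawURLEncoding
      header := enc.EncodeToString([]byte(`{"alg":"RS256","typ":"JWT"}`))
      payload, err := json.Marshal(claims)
      if err != nil {
          return "", err
      }
      signingInput := header + "." + enc.EncodeToString(payload)
      digest := sha256.Sum256([]byte(signingInput))
      sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
      if err != nil {
          return "", err
      }
      return signingInput + "." + enc.EncodeToString(sig), nil
  }

  func main() {
      key, _ := rsa.GenerateKey(rand.Reader, 2048)
      token, err := mintRS256(key, map[string]any{
          "sub": "operator", // illustrative claim
          "exp": time.Now().Add(10 * time.Minute).Unix(),
      })
      if err != nil {
          panic(err)
      }
      fmt.Println(token)
  }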

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards have a pre-existing failure on main (unrelated to this PR). Resolves #605.
2026-05-02 19:07:27 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format
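
A sketch of the corrected response shape (only SetDnsResponse, ResponseCode, and Status come from this fix; the wrapper type name and sample values are assumptions, and json.Number tolerates the bare integer the live API returns):

  package main

  import (
      "encoding/json"
      "fmt"
  )

  // SetDNSReply models the real api3.json shape: ResponseCode and Status live
  // directly under SetDnsResponse, not under a ResponseHeader wrapper, and
  // ResponseCode arrives as a bare integer (hence json.Number).
  type SetDNSReply struct {
      SetDnsResponse struct {
          ResponseCode json.Number `json:"ResponseCode"`
          Status       string      `json:"Status"`
      } `json:"SetDnsResponse"`
  }

  func main() {
      raw := []byte(`{"SetDnsResponse":{"ResponseCode":0,"Status":"success"}}`)
      var reply SetDNSReply
      if err := json.Unmarshal(raw, &reply); err != nil {
          panic(err)
      }
      fmt.Println(reply.SetDnsResponse.ResponseCode, reply.SetDnsResponse.Status)
  }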

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   Gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a,
  dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288
  via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml):
  single replica, 32Mi memory, ServiceMonitor off by default.

Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector
  annotations to the ghcr-pull Secret written at cloud-init time so
  Reflector auto-mirrors it to every namespace on first boot (annotation
  keys sketched after this list).
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml,
  marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml
  (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit
  HelmRelease version reference to match.
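
A sketch of the four source-side annotations (the commit does not list the keys; these are Reflector's standard mirroring annotations, and treating the empty namespace lists as "all namespaces" is an assumption):

  package main

  import "fmt"

  // Source-side annotations that let Reflector mirror a Secret into other
  // namespaces; the empty allow-lists are assumed to mean "all namespaces".
  var reflectorAnnotations = map[string]string{
      "reflector.v1.k8s.emberstack.com/reflection-allowed":            "true",
      "reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces": "",
      "reflector.v1.k8s.emberstack.com/reflection-auto-enabled":       "true",
      "reflector.v1.k8s.emberstack.com/reflection-auto-namespaces":    "",
  }

  func main() {
      for k, v := range reflectorAnnotations {
          fmt.Printf("%s: %q\n", k, v)
      }
  }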

Root cause: the canonical secret name is ghcr-pull (written by
cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were
referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff
on all Catalyst pods on every new Sovereign.

Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret
propagated to 33 namespaces via kubectl; non-Running pods bounced.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:17:51 +04:00
e3mrah
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is
exposed directly on the CP node via firewall rule (main.tf:51-56,
0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to
the LB's public IPv4, which silently failed with "connect: connection
refused" — catalyst-api helmwatch could never observe HelmReleases on
the new Sovereign, so the wizard jobs page stayed PENDING for every
install-* job for 50+ minutes after the cluster was actually healthy.

Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address)
through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to
that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly
on the CP, so this is reachable from contabo without any LB / firewall
changes.
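
The rewrite itself is a sed in cloud-init; a Go equivalent of the substitution (the sample kubeconfig fragment and IP are illustrative):

  package main

  import (
      "fmt"
      "strings"
  )

  // k3s.yaml always points kubectl at https://127.0.0.1:6443 on the node
  // itself; the provision flow needs it pointed at the CP's public IPv4.
  func rewriteServer(kubeconfig, controlPlaneIPv4 string) string {
      return strings.ReplaceAll(kubeconfig,
          "https://127.0.0.1:6443",
          "https://"+controlPlaneIPv4+":6443")
  }

  func main() {
      k3sYAML := "clusters:\n- cluster:\n    server: https://127.0.0.1:6443\n"
      fmt.Print(rewriteServer(k3sYAML, "203.0.113.10")) // documentation-range example IP
  }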

Permanent: every otechN provisioning from this commit forward will PUT
back a kubeconfig that catalyst-api can actually connect to.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 11:55:48 +04:00
e3mrah
66ff717fbc
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01):
when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop
from issue #491), kustomize-controller held the revision lock for the
full 30m health-check timeout and refused to pick up new GitRepository
revisions. Even though Flux fetched fix `66ea39f0` from main within 1
minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD
SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait
would never finish, no new revision would ever apply, and the operator
was forced to wipe + reprovision from scratch. The same pathology
would repeat on every iteration unless the timeout shape changed.

Approach: Option A (timeout reduction). Drops `spec.timeout` on all
three Flux Kustomizations in the cloud-init template — bootstrap-kit,
sovereign-tls, infrastructure-config — from 30m to 5m. We KEEP
`wait: true` so downstream `dependsOn: bootstrap-kit` declarations
still get a consolidated "every HR Ready=True" signal. We do NOT
adjust `interval` (5m is correct).

Why 5m specifically: matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case so a fresh
fix on main gets applied on the next poll. Anything shorter risks
tripping legitimately-slow CRD installs; anything longer re-introduces
the iteration-stall pathology #492 documents.

Why not Option B (wait: false): would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready
before it applies Provider/ProviderConfig manifests that talk to
Hetzner. Flipping wait: false would let infra-config apply prematurely.

Why not Option C (tighter retryInterval): doesn't address the root
cause. retryInterval governs how often to retry AFTER a failure;
spec.timeout is what holds the revision lock during a failed wait.

Test: kustomization_timeout_test.go (new) locks all three timeouts at
exactly 5m AND blocks any operative `timeout: 30m` regression AND
asserts wait: true is retained. Three assertions, one for each failure
mode (regression to 30m, accidental 4th Kustomization without test
update, drive-by flip to wait: false).
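
A sketch of that render-assertion shape (the real kustomization_timeout_test.go is not reproduced here; the template path and exact assertions are illustrative):

  package provisioner

  import (
      "os"
      "regexp"
      "strings"
      "testing"
  )

  func TestKustomizationTimeouts(t *testing.T) {
      raw, err := os.ReadFile("../../../../../infra/hetzner/cloudinit-control-plane.tftpl")
      if err != nil {
          t.Fatal(err)
      }
      tpl := string(raw)

      // Exactly three Kustomizations, each pinned to a 5m timeout.
      if got := strings.Count(tpl, "timeout: 5m"); got != 3 {
          t.Fatalf("expected 3 Kustomization timeouts of 5m, found %d", got)
      }
      // No operative 30m timeout may creep back in.
      if regexp.MustCompile(`(?m)^\s*timeout: 30m`).MatchString(tpl) {
          t.Fatal("operative 'timeout: 30m' regression detected")
      }
      // wait: true must be retained so dependsOn consumers still gate on readiness.
      if !strings.Contains(tpl, "wait: true") {
          t.Fatal("wait: true was dropped from a Kustomization")
      }
  }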

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:34:35 +04:00
e3mrah
141dc9dfba
fix(infra): cloud-init helm install cilium values parity with Flux bp-cilium HR (closes #491) (#496)
Phase-8a bug #16: every fresh Hetzner Sovereign deadlocked at Phase 1
because the bootstrap helm install in cloud-init used a MINIMAL set of
--set flags (kubeProxyReplacement, k8sService*, tunnelProtocol,
bpf.masquerade) while the Flux bp-cilium HelmRelease curated a much
fuller value set. The drift was fatal:

  1. cilium-agent waits forever for the operator to register
     ciliumenvoyconfigs + ciliumclusterwideenvoyconfigs CRDs.
  2. The upstream chart only registers them when envoyConfig.enabled=true.
  3. With the bootstrap install missing that flag, the agent crash-looped,
     the node taint node.cilium.io/agent-not-ready never lifted, and the
     bootstrap-kit Kustomization (wait: true, 30 min timeout — issue #492)
     never reconciled the upgrade that would have fixed the values.

The fix is single-source-of-truth via a new write_files entry that lays
down /var/lib/catalyst/cilium-values.yaml at cloud-init time, plus a -f
flag on the bootstrap helm install that consumes it. The values mirror
platform/cilium/chart/values.yaml's `cilium:` block PLUS the overlay
in clusters/_template/bootstrap-kit/01-cilium.yaml (envoyConfig.enabled,
l7Proxy). A new parity test (cilium_values_parity_test.go) locks the
two files together so a future commit cannot change one without the
other.
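
A sketch of the parity check's shape (the real cilium_values_parity_test.go extracts the bootstrap values out of the tftpl; the paths and extraction below are simplified):

  package provisioner

  import (
      "os"
      "testing"

      "gopkg.in/yaml.v3"
  )

  // TestCiliumValuesParity asserts every operator-curated key in the umbrella
  // chart's `cilium:` block also appears in the cloud-init bootstrap values,
  // so the two installs cannot drift apart silently.
  func TestCiliumValuesParity(t *testing.T) {
      chart := loadYAML(t, "../../../../../platform/cilium/chart/values.yaml")
      boot := loadYAML(t, "testdata/cilium-values.yaml") // extracted from the tftpl in the real test

      chartCilium, ok := chart["cilium"].(map[string]any)
      if !ok {
          t.Fatal("chart values.yaml has no cilium: block")
      }
      for key := range chartCilium {
          if _, present := boot[key]; !present {
              t.Errorf("bootstrap cilium values missing key %q", key)
          }
      }
  }

  func loadYAML(t *testing.T, path string) map[string]any {
      t.Helper()
      raw, err := os.ReadFile(path)
      if err != nil {
          t.Fatal(err)
      }
      out := map[string]any{}
      if err := yaml.Unmarshal(raw, &out); err != nil {
          t.Fatal(err)
      }
      return out
  }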

Approach: hybrid — keep the chart values.yaml as the umbrella source
of truth, render the merged effective values inline in cloud-init's
write_files block (the umbrella's `cilium:` subchart wrapper is
unwrapped because the bootstrap install targets cilium/cilium upstream
chart directly, not the bp-cilium umbrella). Test enforces presence
of every operator-curated key + load-bearing values.

Files modified:
  infra/hetzner/cloudinit-control-plane.tftpl
  products/catalyst/bootstrap/api/internal/provisioner/cilium_values_parity_test.go (new)

Refs: #491, #492 (bootstrap-kit wait timeout), 66ea39f0 (envoyConfig in HR)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:09:10 +04:00
e3mrah
0d75ae354f
fix(infra): split Cilium-Gateway Certificate into sovereign-tls Kustomization (Phase-8a bug #13) (#484)
Phase-8a-preflight live deployment 93161846839dc2e1: bootstrap-kit Flux
Kustomization fails server-side dry-run with

  Certificate/kube-system/sovereign-wildcard-tls dry-run failed:
  no matches for kind 'Certificate' in version 'cert-manager.io/v1'

→ entire Kustomization apply aborts → ZERO HelmReleases reconcile.

Fix: split the Certificate into its own Flux Kustomization sovereign-tls
that dependsOn bootstrap-kit (whose Ready gates on every HR including
bp-cert-manager). Gateway stays in 01-cilium.yaml because Gateway API
CRDs ship with Cilium itself.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:48:18 +04:00
e3mrah
7e35040e29
fix(infra): cloud-init strip regex must preserve #cloud-config (Phase-8a bug #5 follow-up) (#482)
#477 introduced a regex "/(?m)^[ ]{0,2}#[^!].*\n/" to strip YAML-block
comments and fit Hetzner's 32KiB user_data cap. The [^!] guard preserved
shebangs like #!/bin/bash but DID NOT preserve cloud-init directives
like #cloud-config, #include, #cloud-boothook (none have ! after #).

Result: cloud-init received user_data with the #cloud-config first-line
DIRECTIVE stripped, didn't recognise the YAML body, and emitted:
  recoverable_errors:
  WARNING: Unhandled non-multipart (text/x-not-multipart) userdata

→ k3s never installed
→ Flux never bootstrapped
→ kubeconfig never PUT to catalyst-api
→ every Phase-8a provision since #477 has silently failed at boot

Live evidence: deployment a76e3fec8566add9, SSH'd at 2026-05-01 18:30 UTC:
cloud-init status 'degraded done', /etc/systemd/system/k3s.service
absent, no flux binary.

Fix: require a SPACE after the '#' in the strip regex. YAML comments
ARE typically '# foo bar' (with space). cloud-init directives are
'#cloud-config' / '#include' / '#cloud-boothook' (no space) — the new
regex preserves them.
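
A minimal Go demonstration of the tightened pattern (the exact replace() expression in main.tf is not reproduced; this assumes the space-required form of the strip regex):

  package main

  import (
      "fmt"
      "regexp"
  )

  func main() {
      // Old pattern: [^!] spares shebangs but not '#cloud-config' (no '!' follows '#').
      // New pattern: requiring a space after '#' keeps directives AND shebangs intact.
      strip := regexp.MustCompile(`(?m)^[ ]{0,2}# .*\n`)

      in := "#cloud-config\n" +
          "# documentation comment, safe to strip\n" +
          "write_files:\n" +
          "  # another comment\n" +
          "  - path: /usr/local/bin/boot.sh\n"

      fmt.Print(strip.ReplaceAllString(in, ""))
      // Output keeps '#cloud-config' and the YAML body; both comments are gone.
  }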

Out of scope: validating that ALL existing comments in the tftpl had
a space after #. They do — verified by sed pre-render passing the
sanity test (file shrinks 38KB → 13KB AND first line is #cloud-config).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:30:51 +04:00
e3mrah
e35729ad78
fix(infra): strip YAML-block comments from cloud-init to fit Hetzner 32KiB cap (Phase-8a bug #5) (#477)
Phase-8a-preflight deployment 3c158f712d564d84 failed at tofu apply with:

  Error: invalid input in field 'user_data'
    [user_data => [Length must be between 0 and 32768.]]
    on main.tf line 214, in resource "hcloud_server" "control_plane"

The rendered cloudinit-control-plane.tftpl is 38,085 bytes — 5,317
bytes over the Hetzner cap. The source template ships ~16 KB of
indent-0 and indent-2 documentation comments (YAML-level) that are
operationally inert at cloud-init boot.

Fix: wrap templatefile() in replace() with a RE2 regex that strips
lines whose first 0-2 chars are spaces followed by '#' (preserves
shebangs via [^!]). After strip, rendered cloud-init drops to ~13 KB.

Indent-4+ comments live INSIDE heredoc `content: |` blocks
(embedded shell scripts, kubeconfig fragments). Those are preserved.

Same fix applied to worker_cloud_init for parity.

Refs:
- Live evidence: deployment 3c158f712d564d84, tofu apply error 16:38:26 UTC
- Bug #5 in the Phase-8a-preflight tally
- #471: prior tftpl escape fix ($${SOVEREIGN_FQDN})
- #472: catalyst-build watches infra/hetzner/**

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:43:42 +04:00
e3mrah
03b1469331
fix(infra): escape ${SOVEREIGN_FQDN} in cloudinit-control-plane.tftpl comments (#471)
Phase-8a-preflight bug surfaced by first live provision attempt
(deployment febeeb888debf477, 2026-05-01 16:30 UTC):

  Error: Invalid function argument
    on main.tf line 140, in locals:
    140:   control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
  Invalid value for "vars" parameter: vars map does not contain key
  "SOVEREIGN_FQDN", referenced at ./cloudinit-control-plane.tftpl:12,37-51.

Tofu's templatefile() interprets ${...} ANYWHERE in the file (including
inside shell '#' comments), since the file is a template not a shell
script. Five lines in cloudinit-control-plane.tftpl reference
${SOVEREIGN_FQDN} as part of documentation prose explaining how
Flux postBuild.substitute interpolates the value at Flux apply time.

The Tofu vars map passed by main.tf:140 uses the canonical lowercase
HCL convention (sovereign_fqdn = var.sovereign_fqdn), not the uppercase
envsubst convention SOVEREIGN_FQDN. So Tofu fails: 'vars map does not
contain key SOVEREIGN_FQDN'.

The latest reference (line 12) was added by #326 (commit 20b89607); the
four older references predate that and were never exercised because no
live provision had ever been attempted before this Phase-8a run.

Fix: escape with double-dollar ($$) so Tofu emits a literal ${...}
in the rendered cloudinit file. The 5 comments now read $${SOVEREIGN_FQDN}
in source, render as ${SOVEREIGN_FQDN} in the user_data output —
preserving documentation intent without breaking templatefile().

Refs:
- Live provision: console.openova.io/sovereign/provision/febeeb888debf477
- Diagnostic: tofu plan exit 1 — vars map does not contain key SOVEREIGN_FQDN
- Out of scope: any other latent templatefile() escape issues — those
  surface as their own Phase-8a iterations

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:33:21 +04:00
e3mrah
20b896070f
feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326) (#448)
Wires the per-Sovereign K8s api-server's --oidc-* validator to the
per-Sovereign Keycloak realm so customer admins can authenticate
kubectl directly against their Sovereign — no static admin-kubeconfig
handoff, no rotated bearer-token exchange.

infra (cloud-init):
  - Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in
    infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed
    from sovereign_fqdn (https://auth.\${sovereign_fqdn}/realms/sovereign)
    per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups
    prefixes scope OIDC subjects under "oidc:" so RoleBindings reference
    e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509.

Canonical seam (anti-duplication rule, ADR-0001 §11.3):
  - The bp-keycloak chart already bundles bitnami/keycloak's
    keycloakConfigCli post-install Helm hook Job, which imports realms
    declared under values.keycloak.keycloakConfigCli.configuration. We
    enable the existing seam — no bespoke kubectl-exec realm-creation
    script, no custom Admin-API call from catalyst-api.

bp-keycloak chart (1.1.2 → 1.2.0):
  - Enable keycloakConfigCli + ship inline sovereign-realm.json with:
    realm "sovereign" (invariant per Sovereign — Keycloak resolves the
    issuer claim from the request hostname, so no per-FQDN realm
    rename), default groups sovereign-admins/-ops/-viewers,
    oidc-group-membership-mapper emitting "groups" claim, public OIDC
    client "kubectl" with localhost:8000 + OOB redirect URIs
    (kubectl-oidc-login defaults), publicClient=true (kubectl runs
    locally and cannot safely hold a secret), PKCE S256 enforced.
  - Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape).
  - Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/,
    otech.omani.works/ to version: 1.2.0.
  - New chart test tests/oidc-kubectl-client.sh (4 cases) — all green.
  - Existing tests/observability-toggle.sh — still green.

Documentation:
  - Add §11 "kubectl OIDC for customer admins" runbook to
    docs/omantel-handover-wbs.md with one-time workstation setup
    (kubectl krew install oidc-login + config set-credentials),
    sovereign-admin RBAC binding (oidc:sovereign-admins →
    cluster-admin), and 401-debugging table mapping common symptoms
    to root causes.
  - Carve #326 out of §7 "Out of scope" — it is shipped.
  - Add §9 status row.

Validation:
  - grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl
    → 2 (comment + the actual flag in the curl line)
  - grep -c 'oidc-username-claim' → 2
  - helm template platform/keycloak/chart → renders post-install
    keycloak-config-cli Job + ConfigMap with kubectl client (3 hits
    on grep "kubectl"; 1 hit on "clientId": "kubectl")
  - bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode)
  - 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle
    gates green

Out of scope (deferred to follow-up tickets):
  - Per-Sovereign user provisioning UI (#322, #323)
  - Refresh-token revocation on RoleBinding deletion (#324)
  - provider-kubernetes Crossplane ProviderConfig per Sovereign (#321)
  - omantel migration / Phase 8 live execution

NO catalyst-api or UI source files touched (those are #319/#322/#323
agents' territories per agent brief).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:07:52 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture + bucket
provisioning Phase 0b path described in the omantel handover WBS §5.
Hybrid Option A+B: wizard collects operator-issued S3 credentials (Hetzner
exposes no Cloud API to mint them — they're issued once in the Hetzner
Console and the secret half is shown exactly once), and OpenTofu
auto-provisions the per-Sovereign bucket via the aminueza/minio provider
+ writes a flux-system/hetzner-object-storage Secret into the new
Sovereign at cloud-init time so Harbor (#383) and Velero (#384) find
their backing-store credentials already in the cluster from Phase 1
onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries / scripts
/ operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
  - versions.tf: declare aminueza/minio provider (Hetzner's official
    recommendation for S3-compatible bucket creation per
    docs.hetzner.com/storage/object-storage/getting-started/...)
  - variables.tf: 4 sensitive vars — region (validated against
    fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
    access_key, secret_key, bucket_name (RFC-compliant S3 naming)
  - main.tf: minio_s3_bucket.main resource — idempotent on re-apply,
    no force_destroy (Velero archive must survive a control-plane
    reinstall), object_locking=false (content-addressed digests are
    the immutability guarantee for Harbor; Velero uses S3 versioning)
  - cloudinit-control-plane.tftpl: write
    flux-system/hetzner-object-storage Secret with the canonical
    s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
    Harbor + Velero charts consume via existingSecret refs
  - outputs.tf: surface endpoint/region/bucket back to catalyst-api
    for the deployment record (credentials NEVER returned)

products/catalyst/bootstrap/api/ (Go):
  - internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
    ListBuckets validator. Distinguishes auth failure ("rejected") from
    network failure ("unreachable") so the wizard renders the right
    error card. NOT a parallel cloud-resource path — the existing
    purge.go handles hcloud purge; objectstorage.go handles a separate
    API surface (S3-compatible) that has no equivalent client today.
  - internal/handler/credentials.go: extend with
    ValidateObjectStorageCredentials handler — same wire shape
    (200 valid:true / 200 valid:false / 503 unreachable / 400 bad
    input) as the existing token validator so the wizard's failure-
    card machinery handles both without per-endpoint switches.
  - cmd/api/main.go: wire POST
    /api/v1/credentials/object-storage/validate
  - internal/provisioner/provisioner.go: extend Request with
    ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate()
    rejects empty/malformed values fail-fast at /api/v1/deployments
    POST time; writeTfvars() emits the 4 new tfvars.
  - internal/handler/deployments.go: derive bucket name from FQDN slug
    pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>) so
    Hetzner's globally-namespaced bucket pool gets a deterministic,
    collision-resistant per-Sovereign name without operator input.
  - internal/store/store.go: redact access/secret keys; preserve
    region+bucket plain (they're public in tofu outputs anyway).
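
A sketch of the ListBuckets validator described under internal/hetzner/objectstorage.go above (assuming minio-go/v7; the rejected/unreachable classification is simplified to whether the endpoint returned an S3 error response at all):

  package objectstorage

  import (
      "context"
      "fmt"
      "time"

      "github.com/minio/minio-go/v7"
      "github.com/minio/minio-go/v7/pkg/credentials"
  )

  // Validate distinguishes "rejected" (credentials refused) from "unreachable"
  // (endpoint not answering) so the wizard can render the right failure card.
  func Validate(ctx context.Context, endpoint, accessKey, secretKey string) error {
      client, err := minio.New(endpoint, &minio.Options{
          Creds:  credentials.NewStaticV4(accessKey, secretKey, ""),
          Secure: true,
      })
      if err != nil {
          return fmt.Errorf("unreachable: %w", err)
      }
      ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
      defer cancel()
      if _, err := client.ListBuckets(ctx); err != nil {
          // A populated S3 error response means the endpoint answered but
          // refused the credentials; anything else is a connectivity problem.
          if resp := minio.ToErrorResponse(err); resp.Code != "" {
              return fmt.Errorf("rejected: %w", err)
          }
          return fmt.Errorf("unreachable: %w", err)
      }
      return nil
  }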

products/catalyst/bootstrap/ui/ (TypeScript / React):
  - entities/deployment/model.ts + store.ts: 4 new wizard fields
    (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
    coercion for legacy persisted state.
  - pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
    region picker (fsn1/nbg1/hel1), masked secret-key input,
    Validate button gating Next. Same FailureCard taxonomy
    (rejected/too-short/unreachable/network/parse/http) the existing
    TokenSection uses, so the operator UX is consistent. Section
    only renders when Hetzner is among chosen providers — non-Hetzner
    Sovereigns skip Phase 0b until their own backing-store path lands.
  - pages/wizard/steps/StepReview.tsx: include
    objectStorageRegion/AccessKey/SecretKey in the
    POST /v1/deployments payload (bucket derived server-side).

Tests:
  - api: 7 new provisioner Validate tests (region/keys/bucket
    required + RFC-compliant + valid-region acceptance), 5 handler
    tests for the new endpoint (bad JSON / missing region / invalid
    region / short keys), 4 hetzner/objectstorage_test.go tests
    (endpoint composition + early input rejection), 1 handler test
    for the bucket-name derivation. Existing tests updated to supply
    the new required fields.
  - ui: StepCredentials.test.tsx pre-populates objectStorageValidated
    in beforeEach so the existing 11 SSH-section tests aren't gated
    on Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references.

Closes #371.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST capture
them; OpenTofu auto-provisions the bucket + cloud-init writes the
flux-system/hetzner-object-storage Secret with the canonical s3-* keys
Harbor + Velero consume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:54:22 +04:00
e3mrah
d2ada908c9
feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316) (#408)
Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns
(no managed-KMS available). Selected **Option A — Shamir + cloud-init
seed** because:

  - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C)
    is structurally unavailable.
  - Transit-seal (Option B) requires a peer OpenBao cluster, only
    applicable to multi-region tier-1; out of scope for single-region
    omantel.
  - Manual unseal (Option D) violates the "first sovereign-admin lands
    on console.<sovereign-fqdn> ready to use" goal in
    SOVEREIGN-PROVISIONING.md §5.

Architecture (per issue #316 spec + acceptance criteria 1-6):

  1. Cloud-init on the control-plane node generates a 32-byte recovery
     seed from /dev/urandom and writes it to a single-use K8s Secret
     `openbao-recovery-seed` in the openbao namespace, with annotation
     `openbao.openova.io/single-use: "true"`. Pre-creates the openbao
     namespace to eliminate the race with Flux's HelmRelease apply
     (see the seed sketch after this list).
  2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks:
       - `templates/init-job.yaml` (hook weight 5): consumes the seed,
         calls `bao operator init -recovery-shares=1 -recovery-threshold=1`,
         persists the recovery key inside OpenBao's auto-unseal config,
         deletes the seed Secret on success. Idempotent — re-runs detect
         Initialized=true and exit 0.
       - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables
         the Kubernetes auth method, mounts kv-v2 at `secret/`, writes
         the `external-secrets-read` policy, binds the `external-secrets`
         role to the ESO ServiceAccount in `external-secrets-system`.
  3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA
     + Role + RoleBinding the Jobs need (Secret get/list/delete in the
     openbao namespace; create/get/patch on the openbao-init-marker).
     Also emits the permanent `system:auth-delegator` ClusterRoleBinding
     bound to the OpenBao ServiceAccount so the Kubernetes auth method
     can call tokenreviews.authentication.k8s.io.
  4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml`
     bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true`
     per-Sovereign.
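
A sketch of the seed-to-Secret step from item 1 (the real implementation is a cloud-init shell block; the data key name and encoding below are assumptions, while the Secret name and single-use annotation match the description above):

  package main

  import (
      "crypto/rand"
      "encoding/base64"
      "fmt"
      "strings"
  )

  // buildRecoverySeedSecret renders the single-use Secret manifest that the
  // post-install init Job consumes and then deletes.
  func buildRecoverySeedSecret() (string, error) {
      seed := make([]byte, 32) // 32-byte recovery seed, as in the cloud-init step
      if _, err := rand.Read(seed); err != nil {
          return "", err
      }
      lines := []string{
          "apiVersion: v1",
          "kind: Secret",
          "metadata:",
          "  name: openbao-recovery-seed",
          "  namespace: openbao",
          "  annotations:",
          `    openbao.openova.io/single-use: "true"`,
          "type: Opaque",
          "data:",
          "  seed: " + base64.StdEncoding.EncodeToString(seed),
      }
      return strings.Join(lines, "\n") + "\n", nil
  }

  func main() {
      manifest, err := buildRecoverySeedSecret()
      if err != nil {
          panic(err)
      }
      fmt.Print(manifest)
  }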

Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }}
{{- end }}`) used throughout — never `{{ fail }}`. Default `helm
template` render emits NOTHING new; opt-in via autoUnseal.enabled=true.

Acceptance criteria coverage:
  1. Provision fresh Sovereign — cloud-init writes seed, Flux installs
     bp-openbao 1.2.0, post-install Jobs run automatically. 
  2. bp-openbao HR Ready=True without manual intervention — install
     keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the
     init Job drives initialisation out-of-band on the same install). 
  3. `bao status` shows Sealed=false, Initialized=true within 5 minutes
     — init Job polls + retries up to 60×5s. 
  4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the
     auth-bootstrap Job binds the `external-secrets` role to ESO's SA
     before the Job exits. 
  5. Seed Secret deleted post-init — init Job deletes it via K8s API
     after consuming. 
  6. No openbao-root-token Secret in K8s — root token captured to
     /tmp/.root-token in the Job pod's tmpfs only; never written to a
     K8s Secret. The recovery key persists ONLY inside OpenBao's Raft
     state (auto-unseal config). 

Tests:
  - tests/auto-unseal-toggle.sh — 4 cases:
    * default render → no auto-unseal artefacts (skip-render works)
    * autoUnseal.enabled=true → both Jobs + correct hook weights
    * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap
    * idempotency annotations present on all 5 hook objects
  - tests/observability-toggle.sh — unchanged, all 3 cases green.
  - helm lint . — clean.

Files:
  - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/chart/values.yaml — `autoUnseal.*` block
  - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new
  - platform/openbao/chart/templates/init-job.yaml — new
  - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new
  - platform/openbao/chart/tests/auto-unseal-toggle.sh — new
  - platform/openbao/README.md — bootstrap procedure §2-3 expanded;
    auto-unseal alternatives table added.
  - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 →
    1.2.0, autoUnseal.enabled=true.
  - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block
    inserted between ghcr-pull-secret apply and flux-bootstrap apply.
  - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released.

Canonical seam used: extended existing `platform/openbao/chart/` per
the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud
calls. NO `{{ fail }}`. All knobs configurable via values.yaml per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode).

Co-authored-by: hatiyildiz <hat.yil@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:45:44 +04:00
e3mrah
8781aa3bc4
fix(provisioner): cloud-init bootstrap-kit path matches per-FQDN cluster dir (resolves #218) (#256)
The cloud-init template selected a per-FQDN GitRepository tree
(`!/clusters/${sovereign_fqdn}`) and pointed both bootstrap-kit
and infrastructure-config Flux Kustomizations at
`./clusters/${sovereign_fqdn}/{bootstrap-kit,infrastructure}` —
directories the wizard never commits before provisioning. Every
fresh Sovereign stalled Phase-1 with `kustomization path not found:
.../clusters/<fqdn>/bootstrap-kit: no such file or directory`
(live evidence on otech.omani.works deployment ce476aaf80731a46).

Canonical fix:
- GitRepository.spec.ignore selects the shared `_template` tree
  (`!/clusters/_template`).
- Both Kustomizations point at `./clusters/_template/bootstrap-kit`
  and `./clusters/_template/infrastructure`.
- Flux postBuild.substitute.SOVEREIGN_FQDN: ${sovereign_fqdn}
  interpolates the Sovereign's FQDN into the rendered manifests
  (envsubst replaces `${SOVEREIGN_FQDN}` in label values, ingress
  hostnames, HelmRelease values).
- clusters/_template/bootstrap-kit/*.yaml + kustomization.yaml
  switch their bare `SOVEREIGN_FQDN_PLACEHOLDER` markers to
  `${SOVEREIGN_FQDN}` so Flux's envsubst-based substitute can
  actually replace them.

Locked by 5 unit tests in
products/catalyst/bootstrap/api/internal/provisioner/cloudinit_path_test.go
that read the template and assert: GitRepository ignore selects
_template, both Kustomization paths point at _template subdirs,
both carry the postBuild.substitute hook, and no operative YAML
line carries `clusters/${sovereign_fqdn}`.

Closes #218

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:11:44 +04:00
e3mrah
5aee6aa737
fix(cloudinit): poll for local-path StorageClass instead of pod Ready (closes #207) (#209)
The previous fix for #189 wrote `kubectl wait --for=condition=Ready pod
-l app=local-path-provisioner --timeout=60s`. That cannot succeed
pre-Cilium: k3s runs with --flannel-backend=none, the node stays
Ready=False until Cilium installs (much later in cloud-init), and the
not-ready taint blocks every untolerated pod. The wait timed out at
60s, scripts_user failed, and the Flux-bootstrap + kubeconfig POST-back
sections never executed. Every fresh Sovereign provision was stuck
"before Cilium" with no error signal in the wizard.

Replace the impossible Pod-Ready wait with a poll for the StorageClass
object itself, which k3s registers independently of CNI within ~3s of
service start.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 21:30:27 +02:00
hatiyildiz
3b5fca2033 merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189) 2026-04-29 19:43:59 +02:00
hatiyildiz
4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Before this fix, the cloud-init template passed --disable=local-storage to
the k3s installer, with the design intent that Crossplane would install
hcloud-csi on day 2 and register a StorageClass after bp-crossplane
reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
blocks Pending on a StorageClass that would only exist after bp-crossplane
finished installing — but they ARE in the bootstrap-kit Kustomization
that needs to converge before the day-2 path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init POSTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd
   (see the sketch after this list).

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT.

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 {"error":
   "kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request just before tofu.auto.tfvars.json is written. Validate()
  rejects an empty token for domain_mode=pool with a pointer to
  docs/SECRET-ROTATION.md (sketched below, after this list).
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts that the rendered cloud-init contains the Secret, that it is
  applied before flux-bootstrap, and that the dockerconfigjson
  round-trips the sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.
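
A minimal Go sketch of the json:"-" plus env-stamping pattern from the
first bullet above. Field names beyond GHCRPullToken and the exact
method shapes are illustrative, not the package's verbatim API:

  package provisioner

  import (
      "errors"
      "os"
  )

  // Request carries the wizard inputs; GHCRPullToken is json:"-" so it
  // is never serialized into persisted deployment records.
  type Request struct {
      DomainMode    string `json:"domainMode"` // illustrative field/tag
      GHCRPullToken string `json:"-"`
  }

  type Provisioner struct {
      ghcrPullToken string
  }

  // New reads CATALYST_GHCR_PULL_TOKEN once at startup; a missing value
  // is tolerated here (the Pod still starts) and Validate owns the gate.
  func New() *Provisioner {
      return &Provisioner{ghcrPullToken: os.Getenv("CATALYST_GHCR_PULL_TOKEN")}
  }

  // Validate rejects an empty token for pool-mode deployments with an
  // operator-facing pointer to the rotation runbook.
  func (p *Provisioner) Validate(req *Request) error {
      if req.DomainMode == "pool" && p.ghcrPullToken == "" {
          return errors.New("CATALYST_GHCR_PULL_TOKEN is empty; see docs/SECRET-ROTATION.md")
      }
      return nil
  }

  // Provision stamps the token onto the Request just before the
  // per-deployment tofu.auto.tfvars.json is rendered (rendering elided).
  func (p *Provisioner) Provision(req *Request) error {
      req.GHCRPullToken = p.ghcrPullToken
      // ... write tofu.auto.tfvars.json from req, then exec tofu ...
      return nil
  }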

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
34c8de84c0 fix(cloudinit): split flux-bootstrap into bootstrap-kit + infrastructure-config Kustomizations
The single 'catalyst-bootstrap' Flux Kustomization at clusters/<fqdn>/
applied bootstrap-kit/ AND infrastructure/ together. infrastructure/
declares ProviderConfig with kind hcloud.crossplane.io/v1beta1, but
that CRD is registered only after Crossplane core (bp-crossplane) is
reconciled AND the Provider package (provider-hcloud) is installed
inside the cluster. Flux dry-run-applied ProviderConfig before any of
that finished and surfaced the failure at the omantel cluster:

  ProviderConfig/default dry-run failed: no matches for kind
  ProviderConfig in version hcloud.crossplane.io/v1beta1

Resolution: emit two Flux Kustomizations from cloud-init's
flux-bootstrap.yaml, with infrastructure-config declaring
dependsOn: [name: bootstrap-kit] + wait: true. Flux now waits for the
bootstrap-kit HelmReleases (including bp-crossplane registering the
Crossplane core CRDs and reconciling the provider-hcloud package
which then registers hcloud.crossplane.io/v1beta1) to be Ready before
the infrastructure-config Kustomization applies ProviderConfig.

Verified live on the omantel control-plane (kubectl delete the old
single Kustomization + apply the two-Kustomization split): bootstrap-kit
moved to Reconciliation in progress, infrastructure-config correctly
showed False / dependency 'flux-system/bootstrap-kit' is not ready,
which is the desired ordered-bootstrap behaviour.
2026-04-29 16:11:33 +02:00
hatiyildiz
548720095a fix(cloudinit): use 127.0.0.1 for Cilium k8sServiceHost (host's local apiserver)
Cilium with --set k8sServiceHost=10.0.1.2 (the cp1 private NIC IP) sat
in init phase forever — the agent's API client kept logging
"Establishing connection to apiserver host=https://10.0.1.2:6443" and
never got a response, even though `curl https://10.0.1.2:6443/healthz`
from the host returned 401 (TLS+auth challenge = endpoint reachable).

Switching to k8sServiceHost=127.0.0.1 brought the DaemonSet up
immediately. Verified end-to-end on the live cluster:

  $ kubectl get nodes
  catalyst-omantel-omani-works-cp1   Ready   ...   32m   v1.31.4+k3s1

The node's local apiserver always binds 127.0.0.1:6443; using that as
the bootstrap apiserver endpoint sidesteps whatever was rejecting the
private-NIC IP route during Cilium's pre-CNI bring-up. Once Cilium is
the CNI and the cluster has real Service VIPs, every other component
reaches the apiserver via the kubernetes.default service as usual.
2026-04-29 15:31:21 +02:00
hatiyildiz
e571ec7aa2 fix(cloudinit): install Cilium BEFORE Flux to break CNI bootstrap deadlock
omantel.omani.works deployment 5cd1bceaaacb71f6 reached Phase 0 success
(10 Hetzner resources up, LB IP 49.12.16.160, DNS committed via PDM)
but stayed silent for 25 minutes — `https://console.omantel.omani.works`
returned no response, every Flux pod was Pending, and the node was
NotReady. SSH'd into the cp1 box (firewall opened temporarily for the
operator IP) and found the canonical CNI bootstrap deadlock:

  Ready: False  (KubeletNotReady)
  message: container runtime network not ready: NetworkReady=false
   reason:NetworkPluginNotReady cni plugin not initialized

cloud-init started k3s with --flannel-backend=none + --disable-network-policy
(the right Cilium-ready posture), then immediately applied the Flux
install.yaml. Flux pods are Pending because there is no CNI yet, so
Flux never starts → never reconciles bp-cilium → CNI never installs →
deadlock. The "wait for deployment Available --timeout=300s" line
silently times out and cloud-init proceeds anyway with the Flux
GitRepository + Kustomization that nothing reconciles.

Resolution: install Cilium ONCE in cloud-init via the canonical Helm
chart at the SAME version (1.16.5) that platform/cilium/blueprint.yaml
declares for bp-cilium. When Flux later reconciles
clusters/<sovereign_fqdn>/bootstrap-kit/01-cilium.yaml it adopts the
existing Helm release (release name + namespace match), so the wizard's
ownership model stays single-source-of-truth (Flux + Blueprints) after
the bootstrap exception.

Per INVIOLABLE-PRINCIPLES.md #3, this Helm install is the one-shot
bootstrap exception authorised by "the GitOps engine is Flux —
everything ELSE gets installed by Flux". Cilium IS the CNI Flux needs,
so it cannot be installed by Flux without bootstrapping itself first.
Every other component still flows through the Blueprint pipeline.

Verified: ssh'd into the running omantel cp1 (firewall opened for the
operator IP), ran the same `helm install cilium ...` command this
patch encodes, and the cluster recovered — node Ready, Flux pods
scheduling, GitRepository pulling. Will redeploy from scratch with
the patched cloud-init to validate the full unattended path.

Cloud-init is the Phase-0 OpenTofu artifact baked into the Hetzner
server's user_data, so this change activates on the NEXT `tofu apply`
that creates a new control-plane server. The existing omantel cp1 is
already manually unblocked; new Sovereigns provisioned after the
catalyst-api image containing this template is rolled out will not hit
the deadlock.
2026-04-29 15:29:10 +02:00
hatiyildiz
330211d275 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168, PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it meant DNS records were
written twice (when it worked) and the entire provision flow broke
(when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:57 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both are fixed by extending the regex to include the cpx* family AND by
permitting an empty worker_size when the operator runs a solo Sovereign,
as sketched below.
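
An equivalent check expressed in Go (the actual rules live in HCL
validation blocks in variables.tf, and the exact pattern there may
differ; this is only a sketch of the accepted shapes):

  package main

  import (
      "fmt"
      "regexp"
  )

  // Server-type names must match a Hetzner family, now including cpx
  // (AMD shared); worker_size additionally accepts the empty string
  // for solo Sovereigns (worker_count = 0).
  var (
      controlPlaneSizeRe = regexp.MustCompile(`^(cx|cpx|ccx|cax)[0-9]+$`)
      workerSizeRe       = regexp.MustCompile(`^((cx|cpx|ccx|cax)[0-9]+)?$`)
  )

  func main() {
      fmt.Println(controlPlaneSizeRe.MatchString("cpx32")) // true: the wizard's recommended SKU
      fmt.Println(workerSizeRe.MatchString(""))            // true: solo Sovereign
      fmt.Println(workerSizeRe.MatchString("t3.micro"))    // false: not a Hetzner server type
  }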

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire — failure now requires a real
Hetzner token instead of being blocked at module-validation time.
2026-04-29 14:43:52 +02:00
hatiyildiz
4ee9e7dd6f fix(wizard): topology before provider; per-provider SKU catalog; per-region sizing
The wizard step order was inverted: it asked for the provider before the
topology, then put hetzner-only SKUs inside the topology step. Topology
decides how many regions exist; provider is a per-region property; SKU
vocabulary is per-provider (cx32 means nothing on Azure). Fixes all three.

New step order (WIZARD_STEPS + WizardPage STEPS): Org -> Topology ->
Provider -> Credentials -> Components -> Domain -> Review.

Per-provider SKU catalog at products/catalyst/bootstrap/ui/src/shared/
constants/providerSizes.ts replaces the legacy hetzner-only HETZNER_NODE_SIZES.
Five providers (hetzner, huawei, oci, aws, azure), each with realistic SKU
options drawn from that vendor's native instance-type vocabulary. Every
SKU read in the wizard goes through PROVIDER_NODE_SIZES[provider] -- no
SKU literal lives anywhere else.

StepProvider now renders one card per topology slot. Each card carries:
provider chooser, that provider's region picker, that provider's
control-plane SKU, that provider's worker SKU + count. Cost rollup sums
each region's (cp + worker*count) at its OWN provider's pricing, so a
mixed-cloud topology computes correctly.

StepTopology drops the SkuCard + NodeSizingPanel; it now captures only
the topology template, HA flag, and AIR-GAP add-on.

Per-region store fields (regionControlPlaneSizes, regionWorkerSizes,
regionWorkerCounts) replace the singular controlPlaneSize/workerSize/
workerCount as the canonical shape. Migration in store.merge() hydrates
the arrays from any persisted singular fields; the cx22 legacy default
is treated as "no selection" so a hetzner-only id never leaks into a
non-hetzner region.

Backend Request gains an optional Regions []RegionSpec field. Validate
mirrors Regions[0] into the legacy singular fields for the existing
solo-Hetzner writeTfvars path. infra/hetzner/variables.tf accepts the
list-of-objects shape; the for_each iteration that activates the rest
of the regions is the multi-region tofu wiring follow-up. Door open
structurally; no shape compromised.

Dead code removed: StepInfrastructure and shared/constants/hetzner.ts
(both orphaned, contained the only HETZNER_NODE_SIZES reference outside
the catalog).

Gates: tsc --noEmit, vite build, vitest (149 tests), go vet, go test
(provisioner + handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:33 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open rule (80, 443, 6443, and ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30.
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called the Hetzner Cloud API directly and exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
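
A minimal Go sketch of the exec-and-stream step described above: run one
tofu subcommand in the per-deployment workdir and forward each output
line to the wizard's event stream. The name, signature, and channel
plumbing are illustrative (the real SSE wiring and stderr handling are
elided), not the module's exact API:

  package provisioner

  import (
      "bufio"
      "context"
      "fmt"
      "os/exec"
  )

  // runTofu executes `tofu <args...>` in workdir and forwards each
  // stdout line to events; the handler turns those lines into SSE events.
  func runTofu(ctx context.Context, workdir string, events chan<- string, args ...string) error {
      cmd := exec.CommandContext(ctx, "tofu", args...)
      cmd.Dir = workdir

      stdout, err := cmd.StdoutPipe()
      if err != nil {
          return err
      }
      if err := cmd.Start(); err != nil {
          return err
      }

      sc := bufio.NewScanner(stdout)
      for sc.Scan() {
          events <- sc.Text() // one streamed line per tofu output line
      }
      if err := sc.Err(); err != nil {
          return fmt.Errorf("stream tofu output: %w", err)
      }
      return cmd.Wait()
  }

  // A provisioning run chains the three subcommands listed above:
  //   runTofu(ctx, workdir, events, "init")
  //   runTofu(ctx, workdir, events, "plan", "-out=tfplan")
  //   runTofu(ctx, workdir, events, "apply", "tfplan")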

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00