Commit Graph

15 Commits

Author SHA1 Message Date
e3mrah
5aee6aa737
fix(cloudinit): poll for local-path StorageClass instead of pod Ready (closes #207) (#209)
The previous fix for #189 wrote `kubectl wait --for=condition=Ready pod
-l app=local-path-provisioner --timeout=60s`. That cannot succeed
pre-Cilium: k3s runs with --flannel-backend=none, the node stays
Ready=False until Cilium installs (much later in cloud-init), and the
not-ready taint blocks every untolerated pod. The wait timed out at
60s, scripts_user failed, and the Flux-bootstrap + kubeconfig POST-back
sections never executed. Every fresh Sovereign provision was stuck
"before Cilium" with no error signal in the wizard.

Replace the impossible Pod-Ready wait with a poll for the StorageClass
object itself, which k3s registers independently of CNI within ~3s of
service start.
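
A minimal sketch of such a poll, assuming the cloud-init shell context
(the template's exact loop bounds and failure handling may differ):

  # Poll for the StorageClass object itself; it appears regardless of CNI state.
  for i in $(seq 1 30); do
    kubectl get storageclass local-path >/dev/null 2>&1 && break
    sleep 2
  done
  # Fail loudly (non-zero exit) if it never appeared.
  kubectl get storageclass local-path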

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 21:30:27 +02:00
hatiyildiz
3b5fca2033 merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189) 2026-04-29 19:43:59 +02:00
hatiyildiz
4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Before this fix, the cloud-init template passed --disable=local-storage
to the k3s installer, with the design intent that Crossplane would
install hcloud-csi day-2 and register a StorageClass after bp-crossplane
reconciled. That created a circular dependency on a fresh Sovereign:
every PVC-using HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak
postgres, bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform
postgres) sits Pending, waiting on a StorageClass that would only exist
after bp-crossplane finished installing, yet those HelmReleases ARE in
the bootstrap-kit Kustomization that must converge before the day-2
path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply (sketched after this list) that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.
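
A minimal sketch of the wait/patch/verify sequence from item 2, assuming
the cloud-init shell context and k3s's built-in kube-system deployment
(timeouts and error text illustrative):

  # (a) wait for the provisioner pod to be Ready
  kubectl -n kube-system wait --for=condition=Ready pod \
    -l app=local-path-provisioner --timeout=60s
  # (b) mark the k3s-shipped StorageClass as the cluster default
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  # (c) safety gate: abort before Flux if the StorageClass is still missing
  kubectl get storageclass local-path || { echo "local-path StorageClass missing"; exit 1; }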

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — the `catalystBlueprint.upstream.version`
    mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.
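
A minimal sketch of how that invariant (Case 3 above) can be checked,
assuming the fluxcd-community Helm repo is reachable; the authoritative
assertions live in version-pin-replay.sh and may differ in detail:

  # v-tag pinned in cloud-init's install.yaml URL, e.g. v2.4.0
  pinned=$(grep -oE 'download/v[0-9.]+/install\.yaml' \
    infra/hetzner/cloudinit-control-plane.tftpl | grep -oE 'v[0-9.]+')
  # flux2 subchart version declared by the umbrella chart
  dep=$(awk '/name: flux2/{f=1} f && /version:/{print $2; exit}' \
    platform/flux/chart/Chart.yaml)
  # upstream appVersion that subchart ships
  helm repo add fluxcd-community https://fluxcd-community.github.io/helm-charts --force-update >/dev/null
  app=$(helm show chart fluxcd-community/flux2 --version "$dep" | awk '/^appVersion:/{print $2}')
  [ "v$app" = "$pinned" ] || { echo "FAIL: chart appVersion v$app != cloud-init pin $pinned"; exit 1; }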

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
acf426c5a9 feat(catalyst-api): cloud-init POSTs kubeconfig back via bearer token (closes #183)
Implement Option D from issue #183: the new Sovereign's cloud-init
PUTs its rewritten kubeconfig (server URL pinned to the LB public
IP, k3s service-account token in the body) to catalyst-api over
HTTPS using a per-deployment bearer token. catalyst-api never SSHs
into the Sovereign — by design, it does not hold the SSH private
key (the wizard returns it once to the browser and does not
persist it on the catalyst-api side).

How the bearer flow works
-------------------------
1. CreateDeployment mints a 32-byte random bearer (crypto/rand,
   hex-encoded), computes its SHA-256, and persists ONLY the
   hash on Deployment.kubeconfigBearerHash. Plaintext is stamped
   onto provisioner.Request just long enough for writeTfvars to
   render it into the per-deployment OpenTofu workdir, then GC'd.

2. infra/hetzner/variables.tf adds three variables — deployment_id,
   kubeconfig_bearer_token (sensitive), catalyst_api_url. main.tf
   passes them through templatefile() with load_balancer_ipv4 read
   from hcloud_load_balancer.main.ipv4.

3. cloudinit-control-plane.tftpl, after `kubectl --raw /healthz`
   succeeds, sed-rewrites k3s.yaml's https://127.0.0.1:6443 to the
   LB's public IPv4, writes the result to a 0600 file, and curls
   PUT to {catalyst_api_url}/api/v1/deployments/{deployment_id}/
   kubeconfig with `Authorization: Bearer {token}`. --retry 60
   --retry-delay 10 --retry-all-errors handles transient
   reachability gaps. The 0600 file is removed after the PUT (see the
   sketch after this list).

4. PUT /api/v1/deployments/{id}/kubeconfig:
   - Reads `Authorization: Bearer <token>` (RFC 6750).
   - Computes SHA-256 of the inbound bearer, constant-time-compares
     to the persisted hash via subtle.ConstantTimeCompare.
   - 401 on missing/malformed Authorization, 403 on bearer
     mismatch, 403 if no hash on record, 403 if KubeconfigPath
     already set (single-use replay defence), 422 on empty/oversize
     body, 503 if the kubeconfigs directory is unwritable.
   - On 204: writes the body to /var/lib/catalyst/kubeconfigs/
     <id>.yaml at mode 0600 (atomic temp+rename), sets
     Result.KubeconfigPath, persistDeployment, then `go
     runPhase1Watch(dep)`.

5. GET /api/v1/deployments/{id}/kubeconfig now reads the file at
   Result.KubeconfigPath. 409 with {"error":"not-implemented"} when
   the postback hasn't happened yet (preserves the wizard's
   existing StepSuccess fallback). 409 {"error":
   "kubeconfig-file-missing"} on PVC drift.

6. internal/store: Record carries KubeconfigBearerHash. The path
   pointer round-trips via Result.KubeconfigPath; the JSON record
   NEVER contains the kubeconfig plaintext (test grep on the on-
   disk JSON for the kubeconfig sentinels asserts zero matches).

7. restoreFromStore relaunches helmwatch on Pod restart for any
   rehydrated deployment whose Result.KubeconfigPath points at an
   existing file AND Phase1FinishedAt is nil AND the original
   status was not in-flight (the existing
   in-flight-status-rewrite-to-failed contract is preserved).
   Channels are re-allocated for resumed deployments because the
   fromRecord-loaded ones are closed.

8. internal/handler/phase1_watch.go reads kubeconfig YAML from
   the file at Result.KubeconfigPath (not from a string field on
   Result). The Result.Kubeconfig field is removed entirely; the
   on-disk JSON only carries kubeconfigPath.
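
A minimal sketch of the item-3 postback as it reads after templatefile()
substitution, with illustrative shell variable names and paths (the
template's exact file locations and retry budget are authoritative):

  # Rewrite the apiserver endpoint to the LB IP, PUT it back, then clean up.
  sed "s#https://127.0.0.1:6443#https://$LB_IPV4:6443#" \
    /etc/rancher/k3s/k3s.yaml > /root/kubeconfig.postback
  chmod 0600 /root/kubeconfig.postback
  curl --fail -X PUT \
    -H "Authorization: Bearer $KUBECONFIG_BEARER_TOKEN" \
    --retry 60 --retry-delay 10 --retry-all-errors \
    --data-binary @/root/kubeconfig.postback \
    "$CATALYST_API_URL/api/v1/deployments/$DEPLOYMENT_ID/kubeconfig"
  rm -f /root/kubeconfig.postback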

Tests
-----
internal/handler/kubeconfig_test.go covers every spec gate:
- PUT 401 missing/malformed Authorization
- PUT 403 bearer mismatch / no-bearer-hash / already-set
- PUT 422 empty body / oversize body
- PUT 404 deployment not found
- PUT 204 first success, file at <dir>/<id>.yaml mode 0600,
  Result.KubeconfigPath set, on-disk JSON has kubeconfigPath
  pointer with no plaintext leak
- PUT triggers Phase 1 helmwatch goroutine
- GET reads from path-pointer
- GET 409 path-pointer-set-but-file-missing
- newBearerToken / hashBearerToken round-trip + entropy
- subtle.ConstantTimeCompare correctness
- shouldResumePhase1 gates every branch
- restoreFromStore re-launches helmwatch on rehydrated deployments
- phase1Started guard prevents double watch (PUT then runProvisioning)
- extractBearer RFC 6750 case-insensitive scheme

Chart
-----
products/catalyst/chart/templates/api-deployment.yaml mounts the
existing catalyst-api-deployments PVC at /var/lib/catalyst (one
level up) so deployments/<id>.json and kubeconfigs/<id>.yaml live
on the same single-attach volume — no second PVC. Adds env vars
CATALYST_KUBECONFIGS_DIR=/var/lib/catalyst/kubeconfigs and
CATALYST_API_PUBLIC_URL=https://console.openova.io/sovereign.

Per docs/INVIOLABLE-PRINCIPLES.md
- #3: OpenTofu is still the only Phase-0 IaC; cloud-init is part of
  the OpenTofu module's templated user_data, not a separate code
  path. catalyst-api never execs helm/kubectl/ssh.
- #4: catalyst_api_url is runtime-configurable
  (CATALYST_API_PUBLIC_URL env var), so air-gapped franchises
  override without code changes.
- #10: Bearer plaintext NEVER lands on disk on the catalyst-api
  side (only the SHA-256 hash). Kubeconfig plaintext NEVER lands
  in the JSON record (only the file path). The kubeconfig file is
  chmod 0600 and the directory 0700 owned by the catalyst-api UID.

Closes #183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:53 +02:00
hatiyildiz
dddbab4b80 fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land in a cluster that already
  has working GHCR creds (see the sketch after this list).
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.
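
A minimal sketch of the Secret the template emits, shown here as the
equivalent kubectl one-liner (the real template renders the
dockerconfigjson from locals; username variable and placeholder are
illustrative):

  kubectl -n flux-system create secret docker-registry ghcr-pull \
    --docker-server=ghcr.io \
    --docker-username="$GHCR_PULL_USERNAME" \
    --docker-password="<GHCR_PULL_TOKEN>" \
    --dry-run=client -o yaml > /var/lib/catalyst/ghcr-pull-secret.yaml
  # runcmd applies it after Flux core install, before flux-bootstrap.yaml
  kubectl apply -f /var/lib/catalyst/ghcr-pull-secret.yaml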

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
34c8de84c0 fix(cloudinit): split flux-bootstrap into bootstrap-kit + infrastructure-config Kustomizations
The single 'catalyst-bootstrap' Flux Kustomization at clusters/<fqdn>/
applied bootstrap-kit/ AND infrastructure/ together. infrastructure/
declares a ProviderConfig with apiVersion hcloud.crossplane.io/v1beta1, but
that CRD is registered only after Crossplane core (bp-crossplane) is
reconciled AND the Provider package (provider-hcloud) is installed
inside the cluster. Flux dry-run-applied ProviderConfig before any of
that finished and surfaced the failure at the omantel cluster:

  ProviderConfig/default dry-run failed: no matches for kind
  ProviderConfig in version hcloud.crossplane.io/v1beta1

Resolution: emit two Flux Kustomizations from cloud-init's
flux-bootstrap.yaml, with infrastructure-config declaring
dependsOn: [name: bootstrap-kit] + wait: true. Flux now waits for the
bootstrap-kit HelmReleases (including bp-crossplane registering the
Crossplane core CRDs and reconciling the provider-hcloud package
which then registers hcloud.crossplane.io/v1beta1) to be Ready before
the infrastructure-config Kustomization applies ProviderConfig.
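
A minimal sketch of that dependency edge, rendered with the flux CLI for
illustration (cloud-init writes the equivalent YAML directly; the source
name and path here are assumptions):

  flux create kustomization infrastructure-config \
    --namespace=flux-system \
    --source=GitRepository/flux-system \
    --path="./clusters/<sovereign_fqdn>/infrastructure" \
    --depends-on=bootstrap-kit \
    --prune=true \
    --wait=true \
    --interval=10m \
    --export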

Verified live on the omantel control-plane (kubectl delete the old
single Kustomization + apply the two-Kustomization split): bootstrap-kit
moved to Reconciliation in progress, infrastructure-config correctly
showed False / dependency 'flux-system/bootstrap-kit' is not ready,
which is the desired ordered-bootstrap behaviour.
2026-04-29 16:11:33 +02:00
hatiyildiz
548720095a fix(cloudinit): use 127.0.0.1 for Cilium k8sServiceHost (host's local apiserver)
Cilium with --set k8sServiceHost=10.0.1.2 (the cp1 private NIC IP) sat
in init phase forever — the agent's API client kept logging
"Establishing connection to apiserver host=https://10.0.1.2:6443" and
never got a response, even though `curl https://10.0.1.2:6443/healthz`
from the host returned 401 (TLS+auth challenge = endpoint reachable).

Switching to k8sServiceHost=127.0.0.1 brought the DaemonSet up
immediately. Verified end-to-end on the live cluster:

  $ kubectl get nodes
  catalyst-omantel-omani-works-cp1   Ready   ...   32m   v1.31.4+k3s1

The node's local apiserver always binds 127.0.0.1:6443; using that as
the bootstrap apiserver endpoint sidesteps whatever was rejecting the
private-NIC IP route during Cilium's pre-CNI bring-up. Once Cilium is
the CNI and the cluster has real Service VIPs, every other component
reaches the apiserver via the kubernetes.default service as usual.
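
A minimal sketch of the bootstrap install with the corrected endpoint
(repo alias illustrative; the template's other --set values are omitted):

  helm repo add cilium https://helm.cilium.io/ >/dev/null
  helm install cilium cilium/cilium \
    --version 1.16.5 \
    --namespace kube-system \
    --set k8sServiceHost=127.0.0.1 \
    --set k8sServicePort=6443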
2026-04-29 15:31:21 +02:00
hatiyildiz
e571ec7aa2 fix(cloudinit): install Cilium BEFORE Flux to break CNI bootstrap deadlock
omantel.omani.works deployment 5cd1bceaaacb71f6 reached Phase 0 success
(10 Hetzner resources up, LB IP 49.12.16.160, DNS committed via PDM)
but stayed silent for 25 minutes — `https://console.omantel.omani.works`
returned no response, every Flux pod was Pending, and the node was
NotReady. SSH'd into the cp1 box (firewall opened temporarily for the
operator IP) and found the canonical CNI bootstrap deadlock:

  Ready: False  (KubeletNotReady)
  message: container runtime network not ready: NetworkReady=false
   reason:NetworkPluginNotReady cni plugin not initialized

cloud-init started k3s with --flannel-backend=none + --disable-network-policy
(the right Cilium-ready posture), then immediately applied the Flux
install.yaml. Flux pods are Pending because there is no CNI yet, so
Flux never starts → never reconciles bp-cilium → CNI never installs →
deadlock. The "wait for deployment Available --timeout=300s" line
silently times out and cloud-init proceeds anyway with the Flux
GitRepository + Kustomization that nothing reconciles.

Resolution: install Cilium ONCE in cloud-init via the canonical Helm
chart at the SAME version (1.16.5) that platform/cilium/blueprint.yaml
declares for bp-cilium. When Flux later reconciles
clusters/<sovereign_fqdn>/bootstrap-kit/01-cilium.yaml it adopts the
existing Helm release (release name + namespace match), so the wizard's
ownership model stays single-source-of-truth (Flux + Blueprints) after
the bootstrap exception.

Per INVIOLABLE-PRINCIPLES.md #3, this Helm install is the one-shot
bootstrap exception authorised by "the GitOps engine is Flux —
everything ELSE gets installed by Flux". Cilium IS the CNI Flux needs,
so it cannot be installed by Flux without bootstrapping itself first.
Every other component still flows through the Blueprint pipeline.

Verified: ssh'd into the running omantel cp1 (firewall opened for the
operator IP), ran the same `helm install cilium ...` command this
patch encodes, and the cluster recovered — node Ready, Flux pods
scheduling, GitRepository pulling. Will redeploy from scratch with
the patched cloud-init to validate the full unattended path.

Cloud-init is the Phase-0 OpenTofu artifact baked into the Hetzner
server's user_data, so this change activates on the NEXT `tofu apply`
that creates a new control-plane server. Existing omantel cp1 is
manually unblocked already; new Sovereigns provisioned after the
catalyst-api image carrying this template is rolled out will not hit
the deadlock.
2026-04-29 15:29:10 +02:00
hatiyildiz
330211d275 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168, PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it meant DNS records were
written twice (when it worked) and the provision flow broke entirely
(when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:57 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both fixed by extending the regex with the cpx* family AND permitting
the empty string on worker_size when the operator runs a solo Sovereign.
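
A minimal sketch of the widened check, expressed in shell for
illustration (the authoritative rule is the HCL validation block in
variables.tf):

  size_ok()   { [[ "$1" =~ ^c(c|a|p)?x[0-9]+$ ]]; }   # cx | ccx | cax | cpx
  worker_ok() { [[ -z "$1" ]] || size_ok "$1"; }      # empty = solo Sovereign
  size_ok cpx32 && echo "control_plane_size cpx32 accepted"
  worker_ok ""  && echo "empty worker_size accepted"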

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire; the run is no longer blocked at
module-validation time and proceeds until it needs a real Hetzner token.
2026-04-29 14:43:52 +02:00
hatiyildiz
4ee9e7dd6f fix(wizard): topology before provider; per-provider SKU catalog; per-region sizing
The wizard step order was inverted: it asked for the provider before the
topology, then put hetzner-only SKUs inside the topology step. Topology
decides how many regions exist; provider is a per-region property; SKU
vocabulary is per-provider (cx32 means nothing on Azure). Fixes all three.

New step order (WIZARD_STEPS + WizardPage STEPS): Org -> Topology ->
Provider -> Credentials -> Components -> Domain -> Review.

Per-provider SKU catalog at products/catalyst/bootstrap/ui/src/shared/
constants/providerSizes.ts replaces the legacy hetzner-only HETZNER_NODE_SIZES.
Five providers (hetzner, huawei, oci, aws, azure), each with realistic SKU
options drawn from that vendor's native instance-type vocabulary. Every
SKU read in the wizard goes through PROVIDER_NODE_SIZES[provider] -- no
SKU literal lives anywhere else.

StepProvider now renders one card per topology slot. Each card carries:
provider chooser, that provider's region picker, that provider's
control-plane SKU, that provider's worker SKU + count. Cost rollup sums
each region's (cp + worker*count) at its OWN provider's pricing, so a
mixed-cloud topology computes correctly.

StepTopology drops the SkuCard + NodeSizingPanel; it now captures only
the topology template, HA flag, and AIR-GAP add-on.

Per-region store fields (regionControlPlaneSizes, regionWorkerSizes,
regionWorkerCounts) replace the singular controlPlaneSize/workerSize/
workerCount as the canonical shape. Migration in store.merge() hydrates
the arrays from any persisted singular fields; the cx22 legacy default
is treated as "no selection" so a hetzner-only id never leaks into a
non-hetzner region.

Backend Request gains an optional Regions []RegionSpec field. Validate
mirrors Regions[0] into the legacy singular fields for the existing
solo-Hetzner writeTfvars path. infra/hetzner/variables.tf accepts the
list-of-objects shape; the for_each iteration that activates the rest
of the regions is the multi-region tofu wiring follow-up. Door open
structurally; no shape compromised.

Dead code removed: StepInfrastructure and shared/constants/hetzner.ts
(both orphaned, contained the only HETZNER_NODE_SIZES reference outside
the catalog).

Gates: tsc --noEmit, vite build, vitest (149 tests), go vet, go test
(provisioner + handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:44:33 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.
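
A minimal sketch of the health-checked failover shape such a lua-record
takes (zone, name, and IPs illustrative; docs/MULTI-REGION-DNS.md is the
canonical reference):

  # requires enable-lua-records=yes in pdns.conf
  pdnsutil add-record omani.works console LUA \
    "A \"ifportup(443, {'192.0.2.10', '198.51.100.20'})\""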

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.

Firewall
- Document each open port (80, 443, 6443, ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30
  (sketched after this list).
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.
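
A minimal sketch of the sshd drop-in the templates write (path, file
name, and exact directive set per the templates; shown for illustration):

  printf '%s\n' \
    'PasswordAuthentication no' \
    'PermitRootLogin prohibit-password' \
    'AllowTcpForwarding no' \
    'X11Forwarding no' \
    'MaxAuthTries 3' \
    'LoginGraceTime 30' \
    > /etc/ssh/sshd_config.d/99-catalyst.conf
  systemctl reload ssh 2>/dev/null || systemctl reload sshd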

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e668637bc9 feat(provisioner): replace bespoke Hetzner+helm-exec code with OpenTofu→Crossplane→Flux
Per docs/INVIOLABLE-PRINCIPLES.md Lesson #24 — the previous commits 915c467 + 07b4bcf shipped bespoke Go code that called the Hetzner Cloud API directly + exec'd helm/kubectl, which violates principle #3 (OpenTofu provisions Phase 0, Crossplane is the ONLY day-2 IaC, Flux is the ONLY GitOps reconciler, Blueprints are the ONLY install unit). This commit reverts all of that and replaces it with the canonical architecture.

REVERTED (deleted):
- products/catalyst/bootstrap/api/internal/hetzner/resources.go (379 lines bespoke Hetzner API client)
- products/catalyst/bootstrap/api/internal/hetzner/cloudinit.go (bespoke cloud-init builder)
- products/catalyst/bootstrap/api/internal/hetzner/provisioner.go (306 lines orchestrator)
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go (helm-exec installer for 11 components)
- products/catalyst/bootstrap/api/internal/bootstrap/exec.go (kubectl/helm exec wrappers)

KEPT:
- products/catalyst/bootstrap/api/internal/hetzner/client.go — fast token validity probe used by StepCredentials wizard step. NOT architectural drift; just a UX pre-flight check.
- products/catalyst/bootstrap/api/internal/dynadot/dynadot.go — DNS API client. Will be invoked by the OpenTofu module via local-exec (the catalyst-dns helper binary).

NEW (canonical architecture):

infra/hetzner/ — OpenTofu module per docs/SOVEREIGN-PROVISIONING.md §3 Phase 0:
- versions.tf: hetznercloud/hcloud provider ~> 1.49
- variables.tf: 17 typed variables matching wizard inputs (sovereign_fqdn, hcloud_token, region, control_plane_size, ssh_public_key, domain_mode, gitops_repo_url, etc.) — all runtime parameters, none hardcoded per principle #4
- main.tf: hcloud_network + subnet + firewall + ssh_key + control-plane server(s) with cloud-init + worker servers + load_balancer with services + null_resource calling /usr/local/bin/catalyst-dns for pool-domain DNS writes
- outputs.tf: control_plane_ip, load_balancer_ip, sovereign_fqdn, console_url, gitops_repo_url
- cloudinit-control-plane.tftpl: installs k3s with --flannel-backend=none --disable=traefik --disable=servicelb (Cilium replaces all of these), then installs Flux core, then applies a GitRepository pointing at clusters/${sovereign_fqdn}/ in the public OpenOva monorepo. From this point Flux is the GitOps engine — it reconciles bp-cilium → bp-cert-manager → bp-crossplane → ... → bp-catalyst-platform via the Kustomization tree the cluster directory ships. NO bespoke helm install from outside the cluster. NO direct kubectl apply. Flux is the install layer.
- cloudinit-worker.tftpl: k3s agent join via private-IP control plane

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go — thin OpenTofu invoker:
- Validates wizard inputs
- Stages the canonical infra/hetzner/ module into a per-deployment workdir
- Writes tofu.auto.tfvars.json from the wizard request
- Execs `tofu init`, `tofu plan -out=tfplan`, `tofu apply tfplan`, streaming stdout/stderr lines as SSE events to the wizard
- Reads tofu output -json for control_plane_ip + load_balancer_ip
- Returns Result. Flux on the new cluster takes over from here.
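
A minimal sketch of the command sequence the invoker drives (workdir path
illustrative; the real code streams each output line as an SSE event):

  cd /var/lib/catalyst/deployments/<id>/workdir   # staged copy of infra/hetzner/
  tofu init -input=false
  tofu plan -input=false -out=tfplan
  tofu apply -input=false tfplan
  tofu output -json   # control_plane_ip, load_balancer_ip, console_url, ...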

products/catalyst/bootstrap/api/internal/handler/deployments.go — rewritten:
- Uses provisioner.Request and provisioner.New() (no more hetzner.Provisioner)
- Same SSE/poll endpoints; same Dynadot env-var injection for pool-domain mode

What this commit DOES NOT yet include (intentionally — separate work):
- clusters/${sovereign_fqdn}/ Kustomization tree in the monorepo that Flux will reconcile (each Sovereign gets its own cluster directory). Tracked separately as part of the bp-catalyst-platform umbrella work.
- /usr/local/bin/catalyst-dns helper binary in the catalyst-api Containerfile. Tracked as ticket [G] dns Dynadot client.
- Crossplane Compositions for hcloud resources at platform/crossplane/compositions/. Tracked as part of [F] crossplane chart.

Lesson #24 closed. Architecture now matches docs/ARCHITECTURE.md §10 + SOVEREIGN-PROVISIONING.md §3-§4 exactly.
2026-04-28 13:38:56 +02:00