Commit Graph

64 Commits

e3mrah
fcfed6408c
feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101) (#1226)
* feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101)

Follow-up to #1223. The Flux Kustomization on every Sovereign points
at clusters/_template/bootstrap-kit/ and post-build-substitutes per-
Sovereign vars (SOVEREIGN_FQDN, MARKETPLACE_ENABLED, ...). The
per-Sovereign overlay file at clusters/<sov>/bootstrap-kit/01-cilium.yaml
that #1223 added is therefore dead code (Flux doesn't read that
path). The canonical mechanism is to extend the template with
envsubst placeholders + thread the values through tofu vars.

Wires five layers end-to-end:

1. clusters/_template/bootstrap-kit/01-cilium.yaml — adds
   `cluster.name: ${CLUSTER_MESH_NAME:=}` and
   `cluster.id: ${CLUSTER_MESH_ID:=0}` plus
   `clustermesh.useAPIServer: true` + NodePort 32379. Empty defaults
   = single-cluster Sovereign (no peer connects); the cilium subchart
   accepts empty cluster.name when id=0.

2. infra/hetzner/cloudinit-control-plane.tftpl — adds
   CLUSTER_MESH_NAME / CLUSTER_MESH_ID to the bootstrap-kit
   Kustomization's postBuild.substitute block (alongside
   SOVEREIGN_FQDN, MARKETPLACE_ENABLED, PARENT_DOMAINS_YAML).

3. infra/hetzner/variables.tf — declares cluster_mesh_name (string,
   default "") and cluster_mesh_id (number, default 0, validated 0-255).

4. infra/hetzner/main.tf — primary cloud-init passes
   var.cluster_mesh_{name,id} verbatim. Secondary regions (when
   var.regions[i>0] is non-empty per slice G3) auto-derive each
   peer's name as `<sovereign-stem>-<region-code-no-digits>` and
   increment id from var.cluster_mesh_id+1. Per-region override via
   the new RegionSpec.ClusterMeshName field.

5. products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
   — adds ClusterMeshName + ClusterMeshID to Request and threads them
   into writeTfvars(); RegionSpec gains ClusterMeshName for per-peer
   override.
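The layers above compose into one substitution chain: tofu vars → cloud-init-rendered Flux Kustomization → envsubst placeholders in the template HelmRelease. A sketch of the two ends of that chain — values and file layout are illustrative, not the literal template contents:

```yaml
# clusters/_template/bootstrap-kit/01-cilium.yaml (HelmRelease values, sketch)
cluster:
  name: ${CLUSTER_MESH_NAME:=}   # empty default = single-cluster Sovereign
  id: ${CLUSTER_MESH_ID:=0}
clustermesh:
  useAPIServer: true             # NodePort 32379 exposes the mesh apiserver
---
# Flux Kustomization rendered by cloudinit-control-plane.tftpl (sketch)
spec:
  postBuild:
    substitute:
      SOVEREIGN_FQDN: "example.sovereign"   # illustrative values
      CLUSTER_MESH_NAME: "example-fsn"
      CLUSTER_MESH_ID: "1"
```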

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the chart-side
default is intentionally empty — operator request OR per-Sovereign
overlay must supply the values when ClusterMesh is enabled. The
allocation registry lives at docs/CLUSTERMESH-CLUSTER-IDS.md
(introduced in #1223).

Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33 follow-up to #1223

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): escape $ in tftpl comments referencing envsubst placeholders

`tofu validate` reads `${CLUSTER_MESH_NAME}` inside YAML comments as a
template variable reference; the comment was meant to refer to the Flux
envsubst placeholder consumed downstream by the bootstrap-kit cilium
HelmRelease. Escaped both refs with `$$` per Terraform's templatefile
escape syntax so the comment renders verbatim.
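A minimal illustration of the escape rule in a hypothetical template (not the actual tftpl file): `templatefile` interpolates every `${…}` it sees, even inside YAML comments, while `$$` renders a literal `$`.

```hcl
# demo.tftpl contents (sketch):
#   value: ${sovereign_fqdn}          # interpolated at tofu render time
#   # refers to $${CLUSTER_MESH_NAME} # escaped → literal ${CLUSTER_MESH_NAME}
locals {
  rendered = templatefile("${path.module}/demo.tftpl", {
    sovereign_fqdn = "example.sovereign"
  })
}
```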

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): replace coalesce with conditional in secondary_region_cluster_mesh_name

coalesce errors when every arg is empty (the not-in-mesh path). Switch
to a conditional that yields '' when both the per-region override AND
var.cluster_mesh_name are empty.
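The shape of that fix, with hypothetical variable names: `coalesce()` errors when every argument is null or an empty string, whereas a conditional degrades cleanly to `""` on the not-in-mesh path.

```hcl
variable "per_region_override" { default = "" }
variable "cluster_mesh_name"   { default = "" }

locals {
  # coalesce(var.per_region_override, var.cluster_mesh_name) errors when
  # both are "" — the conditional instead yields "" (not in a mesh):
  effective_mesh_name = (
    var.per_region_override != ""
      ? var.per_region_override
      : var.cluster_mesh_name
  )
}
```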

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:19:53 +04:00
e3mrah
7ca4abddd2
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)

Implements the server side of the Cloudflare KV lease-witness pattern
that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/
witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare
Workers KV namespace with read-then-CAS-write semantics enforced via
the If-Match header — exact contract per K-Cont-3 #1158 report (item d)
and the canonical-seams "Cloudflare KV Worker contract" entry.

Routes:
  GET    /lease/<slot-url-encoded>  → 200 + LeaseState | 404 | 401
  PUT    /lease/<slot>              → 200 + LeaseState | 412 + state | 401
  DELETE /lease/<slot>              → 204 | 412 | 401

All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:
  1. If-Match: 0 = first-acquire-on-empty-slot
  2. Generation increments unconditionally (incl. Release)
  3. 412 includes current state body
  4. TTL eviction is server-authoritative in stamping (Worker doesn't
     auto-evict — controller's IsHeldBy decides)
  5. X-Holder mismatch on DELETE returns 412 (stale region can't
     evict new primary)
  6. Bearer token validation against env-bound allow-list
  7. Optional X-Lease-Slot header logged for KV granularity
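The read-then-CAS-write core (trap behaviors 1-3) can be sketched in Go; the real Worker is TypeScript, and the types and field names here are illustrative, not the wire schema.

```go
package main

import "fmt"

// LeaseState stands in for the body exchanged on GET/PUT.
type LeaseState struct {
	Holder     string
	Generation int64
}

// kv models the Worker's CAS semantics: a PUT succeeds only when the
// caller's If-Match equals the stored generation (0 = empty slot).
type kv struct{ slots map[string]LeaseState }

// Put returns (new state, 200) on success, or (current state, 412) on a
// generation mismatch — the 412 body lets the stale writer resync.
func (s *kv) Put(slot, holder string, ifMatch int64) (LeaseState, int) {
	cur := s.slots[slot] // zero value = empty slot, Generation 0
	if ifMatch != cur.Generation {
		return cur, 412
	}
	next := LeaseState{Holder: holder, Generation: cur.Generation + 1}
	s.slots[slot] = next
	return next, 200
}

func main() {
	s := &kv{slots: map[string]LeaseState{}}
	st, code := s.Put("sov/primary", "fsn1", 0) // If-Match: 0 = first acquire
	fmt.Println(code, st.Generation)            // 200 1
	_, code = s.Put("sov/primary", "hel1", 0)   // stale If-Match → 412
	fmt.Println(code)                           // 412
}
```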

Files:
  products/continuum/cloudflare-worker/{package.json, tsconfig.json,
    wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore,
    DESIGN.md, src/{index,auth,kv,types}.ts,
    src/handlers/{get,put,delete}.ts,
    test/{handlers,contract,env.d}.ts}
  infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf
    + README.md
  .github/workflows/cloudflare-worker-leases-build.yaml
    (event-driven, NO cron — push-on-paths + PR + workflow_dispatch)

Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean.
tsc --noEmit clean. wrangler deploy --dry-run produces a 9.47 KiB
bundle.


Per the brief: tofu module ships ready for operator action — no
auto-deploy. Operator runbook in DESIGN.md §"Operator runbook —
deploy a new Sovereign".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)

`tofu validate` failed on `cloudflare_workers_secret` — that resource
was REMOVED in cloudflare/cloudflare v5 (it consolidated into the
inline `bindings = [...]` array on `cloudflare_workers_script` with
`type = "secret_text"`). Same security guarantee — encrypted at rest
in CF, never visible via dashboard read API once written. `tofu fmt`
also wanted versions.tf alignment + the .terraform.lock.hcl pinning
the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/
which commits its lock file).

Per Inviolable Principle #5 the bearer token value still flows from
TF_VAR_bearer_tokens_csv extracted at apply time from a K8s
SealedSecret — never inlined here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:01:44 +04:00
e3mrah
8988cd9e4f
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today
infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard
payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum
DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md
§3.8 + §11), so this slice closes the gap.

Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count)
  STAYS untouched — every existing Sovereign state (omantel, otech*) keeps
  its resource addresses (hcloud_server.control_plane[0],
  hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed
  by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region
  gets its own /24 subnet inside the shared /16 hcloud_network, its own
  CP server, its own workers, and its own lb11 load balancer. The shared
  hcloud_firewall + hcloud_ssh_key (one tenant boundary per Sovereign).
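The overlay's shape can be sketched in HCL (attribute names illustrative, not the real main.tf): regions[0] stays on the legacy count-based path, while regions[1+] feed a parallel `for_each` whose composite key keeps duplicate cloud regions at distinct resource addresses.

```hcl
locals {
  # "fsn1-1", "hel1-2", ... — index in the key disambiguates duplicates
  secondary_regions = {
    for i, r in var.regions : "${r.cloudRegion}-${i}" => r
    if i > 0 && r.provider == "hetzner"
  }
}

resource "hcloud_network_subnet" "secondary" {
  for_each = local.secondary_regions
  # one /24 per secondary region inside the shared /16, e.g.
  # 10.0.0.0/16 → 10.0.1.0/24, 10.0.2.0/24, ...
  ip_range = cidrsubnet(var.network_cidr, 8,
    index(keys(local.secondary_regions), each.key) + 1)
  # network_id, network_zone, type omitted in this sketch
}
```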

Why hybrid not full for_each: a wholesale refactor would change every
existing resource address (hcloud_server.control_plane[0] →
hcloud_server.control_plane["mgmt"]), forcing every running Sovereign
to run `tofu state mv` for ~12 resources or face destructive recreates.
The brief explicitly bans that. Hybrid is purely additive — secondary
resources are NEW addresses no existing state carries.

No `tofu state mv` runbook required. Existing Sovereigns provisioned
with var.regions = [] or len(var.regions) == 1 produce identical plans
before and after this PR.

Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary
regions and adds per-cluster GitOps path differentiation; today every
secondary CP renders an identical Flux Kustomization pointed at
clusters/<sovereign_fqdn>/.

Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via
mock_provider + override_resource (no real Hetzner):
  - legacy_no_regions_payload (var.regions=[])
  - single_region_entry_does_not_double_provision (len==1)
  - three_region_mgmt_fsn_hel (EPIC-6 shape)
  - same_region_duplicates_produce_distinct_keys
  - non_hetzner_regions_are_filtered_out (oci entries skipped)
All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check
+ test on every PR touching infra/hetzner/**.

Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled":
push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.

Validation:
  $ tofu validate
  Success! The configuration is valid.
  $ tofu fmt -check -recursive
  exit=0
  $ tofu test
  tests/multi_region.tftest.hcl... pass
    run "legacy_no_regions_payload"... pass
    run "single_region_entry_does_not_double_provision"... pass
    run "three_region_mgmt_fsn_hel"... pass
    run "same_region_duplicates_produce_distinct_keys"... pass
    run "non_hetzner_regions_are_filtered_out"... pass
  Success! 5 passed, 0 failed.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:29:44 +04:00
e3mrah
8e312cd244
fix(infra/hetzner): strip any-indent comments, gate user_data ≤ 30 KiB at plan-time (#966) (#967)
Live blocker. Provisioning otech114 (deployment 5c3eea37d3aacda6, fsn1)
failed at `tofu apply` with:

  Error: invalid input in field 'user_data' (invalid_input):
  [user_data => [Length must be between 0 and 32768.]]
  with hcloud_server.control_plane[0]
  on main.tf line 309

Hetzner Cloud's HARD 32 KiB cap on user_data was breached after #921
inlined a base64-encoded worker cloud-init (~4.8 KB) into the CP cloud-
init for cluster-autoscaler's HCLOUD_CLOUD_INIT key, on top of #827's
multi-domain substitutions. Rendered size: ~37 KB.

Root cause: the prior strip regex `(?m)^[ ]{0,2}# .*\n` was scoped to
indent-0/2 comments only — leaving ~14 KB of indent-6+ comments INSIDE
write_files content blocks (e.g. flux-bootstrap.yaml's triplicate
Kustomization documentation). Those comments are inert: every write_files
entry is YAML / JSON / key=value config (no shell scripts), and parsers
ignore `#`-prefixed lines entirely.

Changes:

1. New strip regex `(?m)^[ ]*#( |$).*\n` strips ANY-indent comment lines
   that start with `#` followed by space or EOL. Preserves:
   - `#cloud-config` line 1 (no space after `#`)
   - `#!`-shebangs (no space after `#`)
   - `#pragma`-style directives (`#` followed by non-space non-EOL)
   Applied to both `local.control_plane_cloud_init` and
   `local.worker_cloud_init`.

2. Plan-time guardrail via `lifecycle.precondition` on
   `hcloud_server.control_plane` and `hcloud_server.worker`. Fails plan
   (not apply) when `length(local.<*>_cloud_init) > 30720` bytes (30 KiB
   = 32 KiB hard cap minus 10% future-additions buffer). Future bloat-
   creep that silently re-eats the headroom now fails fast at plan-time
   BEFORE the network/LB/firewall/SSH-key resources get created.
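The strip rule is straightforward to sanity-check; a Go sketch using the same RE2 regex as change 1 (the real code applies it inside the Tofu locals, not in Go):

```go
package main

import (
	"fmt"
	"regexp"
)

// commentLine matches any-indent comment lines where `#` is followed by
// a space or end-of-line. `#cloud-config`, `#!` shebangs and
// `#pragma`-style directives survive because `#` is followed by a
// non-space, non-EOL character.
var commentLine = regexp.MustCompile(`(?m)^[ ]*#( |$).*\n`)

func stripComments(s string) string {
	return commentLine.ReplaceAllString(s, "")
}

func main() {
	in := "#cloud-config\n" +
		"# top-level doc comment\n" + // stripped
		"      # indent-6 comment inside write_files content\n" + // stripped
		"write_files: []\n"
	fmt.Print(stripComments(in))
	// prints:
	// #cloud-config
	// write_files: []
}
```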

Verified rendered sizes (Python simulation of templatefile + strip,
substitutions match real otech114 inputs):

  CP cloud-init:     79404 bytes raw → 21144 bytes stripped
                     (margin: 11624 under hard cap, 9576 under guardrail)
  Worker cloud-init:  3254 bytes raw →  2410 bytes stripped
                     (b64-encoded for HCLOUD_CLOUD_INIT: 3216 bytes)

`#cloud-config` first-line preserved. All 18 write_files entries and
43 runcmd entries parse intact. YAML/JSON/conf contents valid post-strip
(comments are documentation only at the file-format level).

Closes #966

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 17:58:44 +04:00
e3mrah
d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (the Helm install itself succeeded) while the Pod
CrashLoopBackOffs invisibly behind that false-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere
  new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.
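The shared predicate can be sketched in Go; the matrix entries below come from the commit message, but the function and variable names are illustrative, not the literal sku_availability.go API.

```go
package main

import "fmt"

// skuRegions mirrors the orderability matrix: /v1/server_types advertises
// prices everywhere, but only some (SKU, region) pairs actually order via
// POST /v1/servers. An empty slice = orderable nowhere new.
var skuRegions = map[string][]string{
	"cpx32": {"fsn1", "nbg1", "hel1"},
	"cpx21": {},
	"cpx31": {},
}

// skuAvailableInRegion gates both the wizard dropdown and the API-side
// Request.Validate(). SKUs absent from the matrix are unrestricted.
func skuAvailableInRegion(sku, region string) bool {
	regions, known := skuRegions[sku]
	if !known {
		return true
	}
	for _, r := range regions {
		if r == region {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(skuAvailableInRegion("cpx32", "ash"))  // false — otech109's failure mode
	fmt.Println(skuAvailableInRegion("cpx32", "fsn1")) // true
	fmt.Println(skuAvailableInRegion("cpx22", "ash"))  // true — not in matrix
}
```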

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.
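The refuse-on-drift guard can be sketched in Go (the real workflow step is sed; function name and error text here are illustrative): the bump only fires when the `smeTag` line has the exact expected shape.

```go
package main

import (
	"fmt"
	"regexp"
)

// smeTagLine matches only the exact expected line shape — two-space
// indent, hex SHA in quotes. Anything else (renamed field, re-indent)
// fails the match and the bump is refused instead of silently skipped.
var smeTagLine = regexp.MustCompile(`(?m)^  smeTag: "[0-9a-f]{7,40}"$`)

func bumpSmeTag(values, sha string) (string, error) {
	if !smeTagLine.MatchString(values) {
		return "", fmt.Errorf("smeTag line shape changed; refusing to auto-bump")
	}
	return smeTagLine.ReplaceAllString(values, `  smeTag: "`+sha+`"`), nil
}

func main() {
	in := "images:\n  smeTag: \"abc1234\"\n"
	out, err := bumpSmeTag(in, "deadbee")
	fmt.Println(out, err)
}
```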

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00
e3mrah
e08d8721e1
fix(pdm/dynadot): pre-register glue records before set_ns (#900) (#906)
Multi-domain Day-2 add-domain on a Sovereign was failing with Dynadot's
"'ns1.<sov>.omani.works' needs to be registered with an ip address
before it can be used" error. Dynadot rejects set_ns whenever the NS
hostnames aren't registered as account-level "host records" first.

This change wires the glue pre-registration into the PDM dynadot
adapter as an optional registrar.GlueRegistrar interface, threads the
Sovereign's load-balancer IPv4 from cloud-init through Flux postBuild
into the chart's `global.sovereignLBIP`, and forwards it via
catalyst-api's pdmFlipNS to PDM's /set-ns endpoint as a new `glueIP`
field. PDM's SetNS handler calls RegisterGlueRecord for each
out-of-bailiwick NS before SetNameservers, with idempotent get_ns →
register_ns / set_ns_ip semantics so retries are free.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:00:45 +04:00
e3mrah
7bfd6df588
fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
5 stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a
fresh post-handover Sovereign — surfaced live on otech103, 2026-05-05 — plus
a 6th gap (ghcr-pull reflector for catalyst-system). All six fixed in one PR
so a single chart bump + cloud-init re-render closes the gap end-to-end.

Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL=
https://pool.openova.io. The in-cluster Service default only resolves on
contabo; on Sovereigns every Day-2 POST died with NXDOMAIN.

Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env from a
new pdm-basicauth Secret, and have pdmFlipNS SetBasicAuth from those envs.
The PDM public ingress at pool.openova.io is gated by Traefik basicAuth;
calls without Authorization: Basic returned 401. optional=true so contabo
+ CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable
Principle #10, the credentials only ever live in Pod env + are read once
per call by pdmFlipNS — never enter a logged struct or persisted record.

Bug 3 (code, parent_domains.go): pdmFlipNS body now includes the required
nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema
requires it; the previous body got 422 missing-nameservers.

Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to
SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover
Sovereign no Deployment record is persisted, so without this fallback GET
/parent-domains returned {"items":[]} and the propagation panel showed
expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml
from the sovereign-fqdn ConfigMap.

Bug 5 (chart, httproute.yaml): catalyst-ui /auth/* PathPrefix narrowed to
Exact /auth/handover. The previous PathPrefix collided with OIDC PKCE
redirect_uri /auth/callback — catalyst-api 404s on that path because it
only registers /api/v1/auth/callback, breaking login post-handover-JWT-
cookie expiry. Exact match keeps /auth/handover routed to catalyst-api
while every other /auth/* path falls through to catalyst-ui's React
Router for client-side OIDC.
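The narrowed match, sketched as a Gateway API HTTPRoute fragment (backend names and ports illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
spec:
  rules:
    - matches:
        - path:
            type: Exact            # was PathPrefix /auth — collided with
            value: /auth/handover  # the OIDC redirect_uri /auth/callback
      backendRefs:
        - name: catalyst-api
          port: 8080
    # every other /auth/* path falls through to the catalyst-ui route
    # and is handled client-side by React Router
```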

Bug 6 (cloud-init): ghcr-pull + harbor-robot-token + new pdm-basicauth
Reflector annotations enumerate explicit allowed/auto-namespaces (sme,
catalyst, catalyst-system, gitea, harbor) instead of empty-string. The
ambiguous empty-string interpretation caused otech103 to require a manual
catalyst-system mirror creation; explicit list back-ports the verified
working state.

Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields
+ tfvars emission so the contabo catalyst-api can stamp the credentials
onto every Sovereign provision request. variables.tf adds matching
pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default
empty) so older provisioner builds that pre-date this change keep
rendering valid cloud-init (the Secret renders with empty values and
Pod start is unaffected).

Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes
the architectural blockers tracked in #879; the catalyst-api image
rebuild + chart republish run via the existing CI pipelines
(services-build.yaml + blueprint-release.yaml) on this commit's SHA.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 09:02:39 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
05065b66d6
fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs

Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04.
GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in
fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in
those DCs with:

  {"error":{"code":"invalid_input",
            "message":"unsupported location for server type"}}

Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate
DELETE. cpx22 + cpx32 were also probed as a sanity check and returned
ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises
prices for every (SKU, location) pair regardless of orderability.

Conclusion: NO SKU bump-back. cpx22 + cpx32 (PR #744) remain the floor.
README + variables.tf docstrings now carry the durable reproducer so future
engineers don't re-attempt cpx21/cpx31.

#753 — kubectl retry / LKG observer reliability

/tmp/autopilot.sh updated (script lives outside the repo, on the VPS):
  • Every kubectl call carries --request-timeout=8s so a hung TLS handshake
    surfaces as a fast empty rather than a 30s+ stall.
  • Last-known-good (LKG) state held across transient flakes: hr/cert/nodes
    no longer flip to "0/0 nodes=0" on a single failed poll.
  • Only 3 consecutive transients count as a real failure; below the
    threshold the observer prints "hr=<LKG> (transient N/3)".

UI side: the wizard's StatusPill / ApplicationPage drive off SSE from
catalyst-api (useDeploymentEvents.ts), not direct kubectl polling, so no UI
change needed. catalyst-api itself uses client-go (helmwatch / phase1_watch),
not exec kubectl, so its observer is not subject to the same shell-out flake.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:11:44 +04:00
e3mrah
e855ab0dfe
fix(k3s): taint CP node-role.kubernetes.io/control-plane:NoSchedule when workers exist (#751) (#755)
Root cause of the "apiserver flake / cpx22 too small / 8 stuck HRs"
chain: the k3s server install in cloudinit-control-plane.tftpl set
--node-label but no --node-taint. By k3s default the server node is
fully schedulable, so on a 1-CP + N-worker Sovereign with the
37-HelmRelease bootstrap-kit + guest workloads (bp-keycloak / bp-cnpg /
bp-harbor / bp-catalyst-platform / SME microservices), the scheduler
distributes guest pods onto the CP. They eat its memory, crowd
kubelet/etcd/apiserver, kubectl flakes, Helm post-install hooks time
out, HelmReleases get stuck mid-reconcile.

Fix: add --node-taint node-role.kubernetes.io/control-plane=true:NoSchedule
to the INSTALL_K3S_EXEC string, so the CP is reserved for system +
bootstrap controllers. cilium agent (DaemonSet) and cilium-operator
default to {operator: Exists} tolerations upstream — they tolerate
the taint and continue to run on the CP. cert-manager and flux2 default
to tolerations: [] — on multi-node Sovereigns they correctly land on
workers, which is the desired separation. Guest workloads do not
tolerate the taint and are pushed to workers where they belong.

Conditional on worker_count > 0: a Catalyst-Zero / solo Sovereign has
only the CP, so tainting NoSchedule there leaves no schedulable node
and the cluster never becomes ready. The Tofu inline ternary
"\${worker_count > 0 ? \"--node-taint ...\" : \"\"}" omits the flag
entirely in solo mode — k3s default (CP fully schedulable) carries
everything.
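The conditional flag, sketched as the HCL local feeding the cloud-init template (variable and label names illustrative):

```hcl
locals {
  k3s_exec = join(" ", compact([
    "--node-label openova.io/role=control-plane",
    # Solo Sovereigns (worker_count == 0) omit the taint entirely so the
    # lone CP stays schedulable; multi-node Sovereigns reserve the CP:
    var.worker_count > 0
      ? "--node-taint node-role.kubernetes.io/control-plane=true:NoSchedule"
      : "",
  ]))
}
```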

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:07:34 +04:00
e3mrah
ceeefd7829
fix(cloud-init): quote MARKETPLACE_ENABLED so postBuild.substitute is map[string]string (#746)
ROOT CAUSE FOUND for the post-PR-#710 zero-touch handover stall (otech85
through otech89). Cloud-init template emitted:

  postBuild:
    substitute:
      SOVEREIGN_FQDN: otech89.omani.works
      MARKETPLACE_ENABLED: false      ← UNQUOTED YAML BOOL

Tofu interpolates `${marketplace_enabled}` (a string variable holding
"true"|"false") into the rendered cloud-init. Without quotes, kubectl's
YAML parser converts `false`/`true` into BOOL, so the rendered Kustomi-
zation manifest violates the kustomize.toolkit.fluxcd.io/v1
postBuild.substitute schema (map[string]string).

Live evidence on otech89 (and earlier otech85-88 with same SHA):
  GitRepository CRD apply  → succeeds (no postBuild, no schema issue)
  3× Kustomization apply   → silently rejected by validator
  flux-system kustomize-controller has 0 reconciliable Kustomizations
  bootstrap-kit never lands → 0 HRs ever Ready → wizard stalls forever

Quote the value: `MARKETPLACE_ENABLED: "${marketplace_enabled}"` so it
renders as `MARKETPLACE_ENABLED: "false"` (string) and passes the CRD
validator.

This is the bug that has been blocking the 2-cycle zero-touch
verification since PR #719 introduced MARKETPLACE_ENABLED. Six
provisioning cycles burned (otech85-89 + retries) chasing it. Closes
#733 cycle-verification (the SKU work itself was correct end-to-end).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 16:01:19 +04:00
e3mrah
468c3badf8
fix(cloud-init): tolerate Crossplane Provider apply failure + retry in background (#745)
Live observation on otech88 (DID b2c528023b50ec45, 2026-05-04
11:40:42Z): the new Sovereign's flux-system reaches Ready (GitRepository
artifact stored, all 6 Flux deployments Available) but no Kustomization
CRs appear — kustomize-controller has nothing to reconcile and
hr=True=0/0 forever.

The cloud-init runcmd applies in this order:
  1. cloud-credentials-secret.yaml
  2. crossplane-provider-hcloud.yaml — `pkg.crossplane.io/v1 Provider`
     CRD doesn't exist yet (bp-crossplane is installed by Flux below),
     so this apply errors with "no matches for kind Provider in version
     pkg.crossplane.io/v1"
  3. flux-bootstrap.yaml — should apply 1× GitRepository + 4×
     Kustomization

Empirically, only the GitRepository lands. The four Kustomization
documents in the same multi-doc YAML are not created. The exact
mechanism of failure is on-host (cloud-init runcmd output is at
/var/log/cloud-init-output.log on the Sovereign — out of reach per
"no SSH" rule), but the symptom is consistent across otech87 and
otech88 reprovisions on the new cost-optimised SKUs.

This patch is a belt-and-braces hardening:

1. Tolerate the Crossplane Provider apply's failure (`|| true`) so
   the runcmd cannot propagate a non-zero exit through to whatever
   downstream step is failing.

2. Add a background retry for the Crossplane Provider CR. Polls
   every 30s up to 30m for the Provider CRD to appear (i.e.
   bp-crossplane reconciled by Flux), then `kubectl apply` succeeds
   and the loop exits. Detached via `&` so cloud-init runcmd
   completes without waiting for Crossplane to be Ready.
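
A sketch of the hardened runcmd steps (the manifest path and log file
are assumptions for illustration; the 30 s × 60 cadence matches the
30-minute budget described above):

```yaml
runcmd:
  # 1. First apply may fail — the Provider CRD isn't installed yet.
  - kubectl apply -f /var/lib/catalyst/crossplane-provider-hcloud.yaml || true
  # 2. Background retry: wait for bp-crossplane's CRD, then re-apply.
  - |
    nohup sh -c 'for i in $(seq 1 60); do
      if kubectl get crd providers.pkg.crossplane.io >/dev/null 2>&1; then
        kubectl apply -f /var/lib/catalyst/crossplane-provider-hcloud.yaml && break
      fi
      sleep 30
    done' >/var/log/crossplane-provider-retry.log 2>&1 &
```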

The intent is to remove any chance the Provider apply blocks Flux
bootstrap. If Kustomizations still don't appear after this fix, the
root cause is elsewhere and a follow-up patch will land.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:50:55 +04:00
e3mrah
b02fc3788a
fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request

Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

  validation {
    condition = alltrue([
      for r in var.regions :
      contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
    ])
  }

The `for r in var.regions` iteration fails on null with:

  Error: Iteration over null value
  on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
  serialises as JSON `[]`, never `null`, when the request has no
  per-region overrides.

Builds on PR #742.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)

Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the
cpx21 CP default from PR #741 fell apart at apply time —

  Error: Server Type "cpx21" is unavailable in "fsn1" and can no
  longer be ordered

Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog
(`/v1/server_types`) but are NOT in the per-DC orderable list
(`available_for_migration` on `/v1/datacenters`) for any EU DC
(fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on
for new Sovereigns in those regions.

Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
  • cpx11 (2 vCPU / 2 GB) — too small for the CP working set
  • cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
  • cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
  • cpx42, cpx52, cpx62 — bigger and more expensive

New default per Sovereign:

| Component       | Old             | New              | Savings |
|-----------------|-----------------|------------------|---------|
| Control plane   | CPX32 (€16.49)  | CPX22 (€9.49)    | €7.00   |
| Worker × 2      | CPX32 × 2 (€33) | CPX32 × 2 (€33)  | €0      |
| TOTAL           | €49.47/mo       | €42.47/mo        | 14%     |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo)
assumed those SKUs were orderable. They aren't in EU DCs. The 14%
saving from cpx22 CP is the largest concrete optimisation that
ships TODAY without compromising the multi-node horizontal-scale
agreement (issue #733): still 1 CP + 2 workers from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size        default cpx31 → cpx32 (back to the prior orderable choice)

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing
  (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49).
  Mark both as "listed but NOT orderable in EU DCs" so the wizard
  surfaces the constraint instead of letting operators pick a
  non-orderable SKU.
  Move recommended:true from CPX21 → CPX22.
  defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:35:55 +04:00
e3mrah
994c2d1c2a
fix(provisioner): cost-optimized default sizes — cpx21 CP + cpx31 workers (38% saving) (#741)
The new Sovereign default after PR #736 / #738 / #739 was 1× CPX32 control
plane + 2× CPX32 workers — €33/mo per Sovereign. CPX32 is over-provisioned
for the CP working set: the CP carries only k3s (apiserver/etcd/scheduler/
controller-manager) + cilium-operator + flux controllers + cert-manager +
sealed-secrets — NOT the heavy bp-keycloak/cnpg/harbor/openbao/grafana
stack (those land on workers because the bootstrap-kit explicitly schedules
them off the CP taint).

CP RAM budget: etcd ~512 MB + control plane ~1.5 GB + cilium/flux/
cert-manager/sealed-secrets ~1 GB + OS ~512 MB ≈ 3.5 GB — fits CPX21's
4 GB. Workers stay at 8 GB on CPX31 since RAM is the binding constraint
for the bootstrap-kit's worker pods, not vCPU.

New default per Sovereign:

| Component       | Old             | New             | Savings |
|-----------------|-----------------|-----------------|---------|
| Control plane   | CPX32 (€11/mo)  | CPX21 (€5.5/mo) | €5.5    |
| Worker × 2      | CPX32 × 2 (€22) | CPX31 × 2 (€15) | €7      |
| TOTAL           | €33/mo          | €20.5/mo        | 38%     |

Multi-node horizontal-scale agreement (issue #733) preserved: still
1 CP + 2 workers minimum from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx32 → cpx21
  worker_size        default cpx32 → cpx31
  Validation regex unchanged (cxNN | cpxNN | ccxNN | caxNN).

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Add CPX11, CPX21, CPX31 catalog entries.
  Move recommended:true from CPX32 → CPX21 (control-plane default).
  Add defaultWorkerSizeId() — Hetzner returns 'cpx31', other providers
  fall through to defaultNodeSizeId() symmetric default.

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  First-visit useEffect + handleSelectProvider now call
  defaultWorkerSizeId(provider) for the worker SKU instead of mirroring
  the CP SKU. Comment updated naming the cost-optimised pair.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx32'] → ['cpx21'].

If a Sovereign exhibits CP RAM pressure with this default, the next safe
stop UP is cpx31 (4 vCPU / 8 GB, ~€7.5/mo) — never back to cpx32.

Closes #740.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:00:01 +04:00
e3mrah
e085a68585
fix(k3s): add 10.0.1.2 to --tls-san so Cilium can verify CP cert from workers (#739)
Issue #733 follow-up #2. After #738 changed Cilium's k8sServiceHost
from 127.0.0.1 to the CP private IP 10.0.1.2, Cilium's TLS verification
fails with:

  Get "https://10.0.1.2:6443/api/v1/namespaces/kube-system":
    tls: failed to verify certificate: x509: certificate is valid for
    10.43.0.1, 127.0.0.1, 178.104.211.206, 2a01:..., ::1, not 10.0.1.2

k3s auto-generates the apiserver TLS cert with SANs covering the public
IP, the cluster service IP (10.43.0.1), and localhost — but NOT the
private subnet IP 10.0.1.2. Adding `--tls-san=10.0.1.2` to the k3s
server install command makes the cert valid for the address Cilium
(and any other in-cluster client) reaches the apiserver via.
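
Sketched as the install line (only the --tls-san flags are from this
commit; everything else about the real invocation is elided):

```shell
curl -sfL https://get.k3s.io | sh -s - server \
  --tls-san="${sovereign_fqdn}" \
  --tls-san=10.0.1.2
```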

The sovereign FQDN is already in --tls-san; this change just adds the
private-subnet anchor that the multi-node Cilium config in #738
introduced.

Verified live on otech51 (deploy SHA 69de64b): Cilium reached
"Establishing connection to apiserver host=https://10.0.1.2:6443"
correctly with the new k8sServiceHost, but TLS handshake failed on
cert SAN mismatch. After this fix the SAN list will include 10.0.1.2.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:35:20 +04:00
e3mrah
69de64ba19
fix(cilium): k8sServiceHost 127.0.0.1 → 10.0.1.2 so workers' Cilium can reach apiserver (#738)
Issue #733 follow-up. The default cpx32 multi-node Sovereign (1 CP + 2
workers) provisioned successfully, but worker nodes were stuck NotReady
because cilium-agent on the workers crash-looped:

  Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system":
    dial tcp 127.0.0.1:6443: connect: connection refused

Root cause: `k8sServiceHost: 127.0.0.1` works on the k3s SERVER node
(supervisor binds localhost:6443) but FAILS on every k3s AGENT node
(agent does NOT expose apiserver on localhost — only the supervisor
on :6444). Pre-#733 every Sovereign was solo (worker_count=0), so
this never fired.

Fix: point Cilium at `10.0.1.2`, the CP's stable private IP on the
Sovereign's 10.0.1.0/24 subnet (cp1=10.0.1.2 per main.tf network
block). No-op on the CP (10.0.1.2 IS its own private IP) and works
on workers (which already join the cluster via the same address per
cloudinit-worker.tftpl `K3S_URL=https://${cp_private_ip}:6443`).
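
The two keys in question, as they would now appear in both values files
(a sketch; the rest of the Cilium values are unchanged):

```yaml
k8sServiceHost: 10.0.1.2 # CP private IP — valid on the CP and every worker
k8sServicePort: 6443
```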

Files:
- infra/hetzner/cloudinit-control-plane.tftpl — bootstrap helm install
  values file written to /var/lib/catalyst/cilium-values.yaml
- platform/cilium/chart/values.yaml — Flux bp-cilium HelmRelease
  values (cilium_values_parity_test.go enforces the two stay aligned)

Verified live on otech50: 3× CPX32 servers running, 1 CP Ready, 2
workers registered with k3s but NotReady due to cilium init failure.
After this fix workers should reach Ready, and the Phase-1 watcher
sees all components Ready=True across the multi-node cluster.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:23:51 +04:00
e3mrah
7ec25b9736
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single
CPX52 control-plane and zero workers — completely discarded horizontal
scalability. Restore the originally agreed shape: 1 CPX32 control plane
+ 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total — same
aggregate footprint as a CPX52 vertical-scale, but with multi-node fault
tolerance and the architectural shape clusters/_template/ was designed
for).

Changes:
- infra/hetzner/variables.tf — defaults: control_plane_size cx42→cpx32,
  worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf — add hcloud_load_balancer_target.workers so the
  Hetzner LB targets every node (CP + workers); Cilium Gateway DaemonSet
  on every node serves ingress on its NodePort, so any node can absorb
  traffic for genuine horizontal scale.
- infra/hetzner/README.md — sizing rationale rewritten around horizontal
  scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model — INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider — first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes — `recommended: true` flag moves cpx52→cpx32; CPX52
  description updated to "solo dev when worker_count=0".
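
The new LB target block can be sketched like this (resource and
variable names are assumptions, not necessarily what main.tf uses):

```hcl
resource "hcloud_load_balancer_target" "workers" {
  count            = var.worker_count
  type             = "server"
  load_balancer_id = hcloud_load_balancer.main.id
  server_id        = hcloud_server.workers[count.index].id
  use_private_ip   = true
}
```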

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit
  workerCount: 0 keep working — only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path — unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:57:53 +04:00
e3mrah
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged
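
The gating wrapper applied to each of those manifests, sketched:

```yaml
{{- if .Values.ingress.marketplace.enabled }}
# ...the original manifest body, rendered only when the toggle is on...
{{- end }}
```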

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:47:37 +04:00
e3mrah
6f3e15b1ec
fix(handover): provision JWK Secret on Sovereign + inject SOVEREIGN_FQDN env (Phase-8b followup) (#692)
Two handover bugs caught live on otech48 (2026-05-03):

1. Sovereign-side catalyst-api responded to GET /auth/handover with
   "server misconfiguration: public key unavailable". Root cause: the
   K8s Secret `catalyst-handover-jwt-public` (referenced by the chart's
   optional Secret-volume) was never materialised on the Sovereign,
   so the optional volume mount fell through and the JWK file was
   absent inside the container. 1.2.0 wired the mount but no
   provisioning step created the Secret. Fix mirrors the canonical
   pattern from PR #543 (ghcr-pull) and PR #680 (harbor-robot-token):
   cloud-init now writes the Secret manifest into catalyst-system NS
   and runcmd applies it BEFORE flux-bootstrap, so the Secret exists
   by the time bp-catalyst-platform reconciles. Also moves the chart
   volume mount off the catalyst-api PVC (mountPath
   /etc/catalyst/handover-jwt-public, no subPath) so a leftover empty
   directory in the PVC from pre-#606 installs cannot collide with
   the re-provisioned Secret mount.

2. /auth/handover validator rejected every valid JWT with 401
   "invalid audience" because SOVEREIGN_FQDN was unset on Sovereigns
   — the audience check collapsed to the literal "https://console."
   prefix. The bp-catalyst-platform HelmRelease overlay was already
   setting `global.sovereignFQDN` but the chart template never plumbed
   it through to the Pod env. Added a SOVEREIGN_FQDN env reading
   `.Values.global.sovereignFQDN` (default "" so Catalyst-Zero
   installs, where catalyst-api is the SIGNER not the validator,
   stay clean).

Bumps:
- bp-catalyst-platform 1.2.4 -> 1.2.5
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease pin

Will be verified live on otech49 — fresh provision should reach
https://console.otech49.omani.works/auth/handover?token=... and
exchange to a Keycloak session WITHOUT manual Secret creation.

Issue #606 followup.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:47:21 +04:00
e3mrah
d0b574bd68
fix(hetzner-tofu): add powerdns_api_key to templatefile() vars (#687)
PR #686 added var.powerdns_api_key to variables.tf and referenced it as
${powerdns_api_key} in cloudinit-control-plane.tftpl, but missed wiring
it into the templatefile() vars dict in main.tf. Result on otech48:

  Invalid value for "vars" parameter: vars map does not contain key
  "powerdns_api_key", referenced at ./cloudinit-control-plane.tftpl:273

This commit closes the gap: powerdns_api_key now flows from var ->
templatefile vars -> cloud-init -> Secret manifest.
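
The closed gap, sketched (surrounding vars abridged and names
illustrative):

```hcl
user_data = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
  sovereign_fqdn   = var.sovereign_fqdn   # existing key, shown for context
  powerdns_api_key = var.powerdns_api_key # the key PR #686 forgot to pass
})
```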

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:34:36 +04:00
e3mrah
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Cilium 1.19.3 → 1.16.5 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681)
calls contabo's authoritative PowerDNS at pdns.openova.io to write
DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook
needs an X-API-Key Secret in the Sovereign's cert-manager namespace —
PR #681 didn't ship the materialization seam, so on otech43..otech47
the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
   openova-system/powerdns-api-credentials extended from "external-dns"
   to "external-dns,catalyst" so contabo catalyst-api can mount the
   API key.

2. bp-powerdns: api.basicAuth.enabled's default flips from true to false.
   Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that
   blocked machine-to-machine API access from Sovereigns. The X-API-Key
   contract is unchanged.

3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
   CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key
   secret (optional=true so Sovereign-side catalyst-api Pods that don't
   reflect this still start clean).

4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field
   reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every
   Request before Validate(). Forwards as tofu var powerdns_api_key.

5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
   default "").

6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
   dynadot-api-credentials Secret block (PR #681 dropped
   bp-cert-manager-dynadot-webhook) with a new
   cert-manager/powerdns-api-credentials Secret block. runcmd applies
   it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:23:27 +04:00
e3mrah
369c229408
fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget (#685)
cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Cilium 1.19.3 → 1.16.5 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
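
The retargeted listeners, sketched as the Gateway object (metadata
matches the names in the error log; the TLS block is elided since this
commit only moves ports):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cilium-gateway
  namespace: kube-system
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      protocol: HTTP
      port: 30080 # HCLB :80 forwards here
    - name: https
      protocol: HTTPS
      port: 30443 # HCLB :443 forwards here
```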

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:14:32 +04:00
e3mrah
affcf37923
fix(bp-catalyst-platform): provision harbor-robot-token automatically on Sovereign install (RCA + permanent fix) (#680)
Caught live on otech43–46 — manual placeholder Secret was being created
each iteration. RCA:

The catalyst-api Pod template references the `harbor-robot-token`
Secret via a REQUIRED (non-optional) secretKeyRef. On Sovereign
clusters that Secret was never materialised — only `ghcr-pull` had
the canonical cloud-init + Reflector auto-mirror seam (PR #543). The
chart's old comment said "Reflector mirrors from openova-harbor
namespace into catalyst" but `openova-harbor` doesn't exist on
Sovereigns; that namespace lives only on contabo where the central
Harbor source Secret is administered. Result: every fresh Sovereign's
catalyst-api Pod stuck in CreateContainerConfigError until the
operator hand-created a placeholder Secret.

The token VALUE was already arriving on the Sovereign — Tofu
var.harbor_robot_token is interpolated into
/etc/rancher/k3s/registries.yaml at cloud-init time so containerd
can authenticate against harbor.openova.io. We just never materialised
the same value as a Kubernetes Secret for catalyst-api to mount.

Permanent fix mirrors the canonical `ghcr-pull` seam:

  1. infra/hetzner/cloudinit-control-plane.tftpl write_files block
     emits /var/lib/catalyst/harbor-robot-token-secret.yaml — a
     Secret in flux-system ns with auto-mirror Reflector annotations
     (`reflection-auto-enabled: "true"`).
  2. runcmd applies it BEFORE flux-bootstrap, so the Secret exists
     before any Helm release reconciles.
  3. bp-reflector (slot 05a, already deployed) propagates the Secret
     into every namespace — including catalyst-system — on first
     reconcile tick. catalyst-api's secretKeyRef resolves cleanly,
     Pod starts.
  4. Token rotation flows through `var.harbor_robot_token` →
     re-render Tofu → re-apply cloud-init; Reflector propagates the
     rotation to all mirrored copies on the next watch tick.
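
Step 1's Secret can be sketched like this (annotation keys follow the
emberstack Reflector convention; the data key name is an assumption):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: harbor-robot-token
  namespace: flux-system
  annotations:
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
type: Opaque
stringData:
  token: ${harbor_robot_token} # interpolated by Tofu at render time
```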

`harbor-robot-token` stays NOT optional in the chart: the architecture
mandate is every Sovereign image pull goes through harbor.openova.io;
falling through to docker.io is forbidden (anonymous rate-limit makes
a fresh Hetzner IP unbootable). A missing token must surface
immediately as Pod start failure, never silently mid-provision.

Bumps:
  - bp-catalyst-platform 1.2.2 → 1.2.3 (chart-side change is a
    comment-only update on the secretKeyRef explaining the new seam;
    the Pod spec still references the same Secret name and key).
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
    HelmRelease version pin → 1.2.3.

No bootstrap-kit dependency changes — bp-reflector's slot-05a position
is unchanged and was already a dependency for ghcr-pull. No
expected-bootstrap-deps.yaml edits needed.

Issue #557 follow-up. Closes the per-Sovereign manual workaround.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:54:37 +04:00
e3mrah
dd4148acb6
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on
otech45: TCP to LB:443 succeeds, but TLS handshake never completes.
Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match
what cilium-envoy actually listens on (verified via /proc/net/tcp on
the cilium-envoy pod — port 12869 not in listening sockets). The
nodePort indirection (31443→envoy:12869) is broken at the redirect
step.

Fix: bind cilium-envoy directly to the host's :80 and :443 via
gatewayAPI.hostNetwork.enabled=true. Hetzner LB forwards public
80→private:80 and 443→private:443 directly (no nodePort indirection).

Two coordinated changes:
  1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
  2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)

bp-cilium chart bumped to 1.1.5.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 15:22:51 +04:00
e3mrah
1734979d74
fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)
* fix(bp-harbor): grep -oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
  keycloak: ['cert-manager', 'openbao']  ← FALSE; Flux says no openbao
This is why the founder kept seeing the spurious arrow on the Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
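
The guard reduces to a few lines; this sketch uses illustrative names,
not the bridge's actual types:

```go
package main

import "fmt"

const (
	StatusPending = "pending"
	StatusRunning = "running"
)

// nextJobStatus applies the monotonic guard: once an Execution has been
// allocated for a component (activeExecID non-empty), a Flux oscillation
// back to pending must not regress the row — report it as still running.
func nextJobStatus(activeExecID map[string]string, component, mapped string) string {
	if mapped == StatusPending && activeExecID[component] != "" {
		return StatusRunning
	}
	return mapped
}

func main() {
	active := map[string]string{"install-external-secrets": "exec-42"}

	// DependencyNotReady mapped to pending while an exec log streams:
	fmt.Println(nextJobStatus(active, "install-external-secrets", StatusPending)) // running

	// No Execution allocated yet — pending is genuine:
	fmt.Println(nextJobStatus(active, "install-openbao", StatusPending)) // pending
}
```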

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. The sweep is idempotent on read, and it reverses
if openbao recovers because it operates on the live snapshot.
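The sweep can be sketched as follows (job and field names are illustrative; the real deriveTreeView operates on richer rows):

```go
package main

import "fmt"

// Hypothetical minimal row shape for illustration.
type job struct {
	status    string
	dependsOn []string
}

// cascadeFailed propagates "failed" through the dependsOn graph: a job
// still "pending" whose upstream has failed is marked "failed" too.
// Up to 8 sweeps cover the deepest chain; the loop stops early once a
// sweep changes nothing, so the pass is idempotent on repeated reads.
func cascadeFailed(jobs map[string]*job) {
	for i := 0; i < 8; i++ {
		changed := false
		for _, j := range jobs {
			if j.status != "pending" {
				continue
			}
			for _, dep := range j.dependsOn {
				if up, ok := jobs[dep]; ok && up.status == "failed" {
					j.status = "failed"
					changed = true
					break
				}
			}
		}
		if !changed {
			return
		}
	}
}

func main() {
	jobs := map[string]*job{
		"install-openbao":                 {status: "failed"},
		"install-external-secrets":        {status: "pending", dependsOn: []string{"install-openbao"}},
		"install-external-secrets-stores": {status: "pending", dependsOn: []string{"install-external-secrets"}},
	}
	cascadeFailed(jobs)
	fmt.Println(jobs["install-external-secrets"].status, jobs["install-external-secrets-stores"].status) // failed failed
}
```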

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
  failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
budget is exhausted within minutes; the next process to ask for an
inotify slot gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches   = 1048576
  fs.inotify.max_queued_events  = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.
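Cloud-init sketch of the shape described above (assumed layout, not the verbatim template entry):

```yaml
write_files:
  - path: /etc/sysctl.d/99-catalyst-inotify.conf
    content: |
      fs.inotify.max_user_instances = 8192
      fs.inotify.max_user_watches   = 1048576
      fs.inotify.max_queued_events  = 16384
runcmd:
  - sysctl --system   # re-reads all sysctl.d drop-ins, so the limits apply immediately
```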

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 10:32:38 +04:00
e3mrah
40ca4e4d50
fix(infra): registries.yaml mirror needs rewrite — Harbor proxy is /v2/proj/, not /proj/v2/ (#640)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name sits after /v2/ and before the image path, not as a
prefix in front of /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
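Sketch of the rewrite shape — one mirror shown; the regex, project names, and auth block are illustrative of the mechanism, not the verbatim file:

```yaml
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"    # bare host, no project path
    rewrite:
      "^(.*)$": "proxy-dockerhub/$1"   # final URL: /v2/proxy-dockerhub/<repo>/...
configs:
  "harbor.openova.io":
    auth:
      username: "robot$openova-bot"
      password: "<harbor_robot_token>"
```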

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:22:21 +04:00
e3mrah
0ee309aa8b
fix(infra+api): wire handover_jwt_public_key end-to-end through tofu provisioning (#636)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:28:44 +04:00
e3mrah
96a5e3a20e
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.
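Runcmd-style sketch of the boot-time rewrite. A stand-in IP and temp path keep the snippet self-contained; at boot the commented curl line supplies CP_IP, and the sed expression targets the real k3s.yaml:

```shell
# at boot: CP_IP="$(curl -fsS http://169.254.169.254/hetzner/v1/metadata/public-ipv4)"
CP_IP="203.0.113.7"
printf 'server: https://127.0.0.1:6443\n' > /tmp/k3s.yaml
# rewrite the kubeconfig server endpoint to the CP's own public IP
sed -i "s#https://127.0.0.1:6443#https://${CP_IP}:6443#" /tmp/k3s.yaml
cat /tmp/k3s.yaml   # server: https://203.0.113.7:6443
```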

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:14:23 +04:00
e3mrah
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
PR #611 squash accidentally reverted the Phase-8b infra additions from PR #615
(92fdda42). Restores:
- cloudinit-control-plane.tftpl: write_files entry for /var/lib/catalyst/handover-jwt-public.jwk (mode 0600)
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Without these, new Sovereign provisioning runs will not write the public key
to disk and auth/handover on the Sovereign will return 503 (key unavailable).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:21:16 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards pre-existing failure on main (unrelated to this PR). Resolves #605.
2026-05-02 19:07:27 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format
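The solverName contract from fix 1, in overlay form (the groupName shown is illustrative):

```yaml
webhook:
  groupName: acme.openova.io   # illustrative; must match the APIService group
  solverName: pdns             # must equal the binary's hardcoded Name();
                               # cert-manager POSTs /apis/<groupName>/v1alpha1/pdns
```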

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a,
  dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288
  via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml):
  single replica, 32Mi memory, ServiceMonitor off by default.

Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector
  annotations to the ghcr-pull Secret written at cloud-init time so
  Reflector auto-mirrors it to every namespace on first boot.
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml,
  marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml
  (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit
  HelmRelease version reference to match.

Root cause: the canonical secret name is ghcr-pull (written by
cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were
referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff
on all Catalyst pods on every new Sovereign.

Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret
propagated to 33 namespaces via kubectl; non-Running pods bounced.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:17:51 +04:00
e3mrah
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is
exposed directly on the CP node via firewall rule (main.tf:51-56,
0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to
the LB's public IPv4, which silently failed with "connect: connection
refused" — catalyst-api helmwatch could never observe HelmReleases on
the new Sovereign, so the wizard jobs page stayed PENDING for every
install-* job for 50+ minutes after the cluster was actually healthy.

Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address)
through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to
that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly
on the CP, so this is reachable from contabo without any LB / firewall
changes.

Permanent: every otechN provisioning from this commit forward will PUT
back a kubeconfig that catalyst-api can actually connect to.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 11:55:48 +04:00
e3mrah
66ff717fbc
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01):
when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop
from issue #491), kustomize-controller held the revision lock for the
full 30m health-check timeout and refused to pick up new GitRepository
revisions. Even though Flux fetched fix `66ea39f0` from main within 1
minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD
SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait
would never finish, no new revision would ever apply, and the operator
was forced to wipe + reprovision from scratch. The same pathology
would repeat on every iteration unless the timeout shape changed.

Approach: Option A (timeout reduction). Drops `spec.timeout` on all
three Flux Kustomizations in the cloud-init template — bootstrap-kit,
sovereign-tls, infrastructure-config — from 30m to 5m. We KEEP
`wait: true` so downstream `dependsOn: bootstrap-kit` declarations
still get a consolidated "every HR Ready=True" signal. We do NOT
adjust `interval` (5m is correct).
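The resulting shape on each of the three Kustomizations (operative fields only):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: bootstrap-kit          # same fields on sovereign-tls and infrastructure-config
  namespace: flux-system
spec:
  interval: 5m                 # unchanged: matches the GitRepository poll interval
  timeout: 5m                  # was 30m: failed waits release the revision lock quickly
  wait: true                   # retained: dependsOn consumers still get the Ready-all signal
```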

Why 5m specifically: matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case so a fresh
fix on main gets applied on the next poll. Anything shorter risks
tripping legitimately-slow CRD installs; anything longer re-introduces
the iteration-stall pathology #492 documents.

Why not Option B (wait: false): would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready
before it applies Provider/ProviderConfig manifests that talk to
Hetzner. Flipping wait: false would let infra-config apply prematurely.

Why not Option C (tighter retryInterval): doesn't address the root
cause. retryInterval governs how often to retry AFTER a failure;
spec.timeout is what holds the revision lock during a failed wait.

Test: kustomization_timeout_test.go (new) locks all three timeouts at
exactly 5m AND blocks any operative `timeout: 30m` regression AND
asserts wait: true is retained. Three assertions, one for each failure
mode (regression to 30m, accidental 4th Kustomization without test
update, drive-by flip to wait: false).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:34:35 +04:00
e3mrah
141dc9dfba
fix(infra): cloud-init helm install cilium values parity with Flux bp-cilium HR (closes #491) (#496)
Phase-8a bug #16: every fresh Hetzner Sovereign deadlocked at Phase 1
because the bootstrap helm install in cloud-init used a MINIMAL set of
--set flags (kubeProxyReplacement, k8sService*, tunnelProtocol,
bpf.masquerade) while the Flux bp-cilium HelmRelease curated a much
fuller value set. The drift was fatal:

  1. cilium-agent waits forever for the operator to register
     ciliumenvoyconfigs + ciliumclusterwideenvoyconfigs CRDs.
  2. The upstream chart only registers them when envoyConfig.enabled=true.
  3. With the bootstrap install missing that flag, the agent crash-looped,
     the node taint node.cilium.io/agent-not-ready never lifted, and the
     bootstrap-kit Kustomization (wait: true, 30 min timeout — issue #492)
     never reconciled the upgrade that would have fixed the values.

The fix is single-source-of-truth via a new write_files entry that lays
down /var/lib/catalyst/cilium-values.yaml at cloud-init time, plus a -f
flag on the bootstrap helm install that consumes it. The values mirror
platform/cilium/chart/values.yaml's `cilium:` block PLUS the overlay
in clusters/_template/bootstrap-kit/01-cilium.yaml (envoyConfig.enabled,
l7Proxy). A new parity test (cilium_values_parity_test.go) locks the
two files together so a future commit cannot change one without the
other.

Approach: hybrid — keep the chart values.yaml as the umbrella source
of truth, render the merged effective values inline in cloud-init's
write_files block (the umbrella's `cilium:` subchart wrapper is
unwrapped because the bootstrap install targets cilium/cilium upstream
chart directly, not the bp-cilium umbrella). The test enforces the
presence of every operator-curated key and every load-bearing value.

Files modified:
  infra/hetzner/cloudinit-control-plane.tftpl
  products/catalyst/bootstrap/api/internal/provisioner/cilium_values_parity_test.go (new)

Refs: #491, #492 (bootstrap-kit wait timeout), 66ea39f0 (envoyConfig in HR)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:09:10 +04:00
e3mrah
0d75ae354f
fix(infra): split Cilium-Gateway Certificate into sovereign-tls Kustomization (Phase-8a bug #13) (#484)
Phase-8a-preflight live deployment 93161846839dc2e1: bootstrap-kit Flux
Kustomization fails server-side dry-run with

  Certificate/kube-system/sovereign-wildcard-tls dry-run failed:
  no matches for kind 'Certificate' in version 'cert-manager.io/v1'

→ entire Kustomization apply aborts → ZERO HelmReleases reconcile.

Fix: split the Certificate into its own Flux Kustomization sovereign-tls
that dependsOn bootstrap-kit (whose Ready gates on every HR including
bp-cert-manager). Gateway stays in 01-cilium.yaml because Gateway API
CRDs ship with Cilium itself.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:48:18 +04:00
e3mrah
7e35040e29
fix(infra): cloud-init strip regex must preserve #cloud-config (Phase-8a bug #5 follow-up) (#482)
#477 introduced a regex "/(?m)^[ ]{0,2}#[^!].*\n/" to strip YAML-block
comments and fit Hetzner's 32KiB user_data cap. The [^!] guard preserved
shebangs like #!/bin/bash but DID NOT preserve cloud-init directives
like #cloud-config, #include, #cloud-boothook (none have ! after #).

Result: cloud-init received user_data with the #cloud-config first-line
DIRECTIVE stripped, didn't recognise the YAML body, and emitted:
  recoverable_errors:
  WARNING: Unhandled non-multipart (text/x-not-multipart) userdata

→ k3s never installed
→ Flux never bootstrapped
→ kubeconfig never PUT to catalyst-api
→ every Phase-8a provision since #477 has silently failed at boot

Live evidence: deployment a76e3fec8566add9 SSH'd 2026-05-01 18:30 UTC,
cloud-init status 'degraded done', /etc/systemd/system/k3s.service
absent, no flux binary.

Fix: require a SPACE after the '#' in the strip regex. YAML comments
ARE typically '# foo bar' (with space). cloud-init directives are
'#cloud-config' / '#include' / '#cloud-boothook' (no space) — the new
regex preserves them.

Out of scope: validating that ALL existing comments in the tftpl had
a space after #. They do — verified by sed pre-render passing the
sanity test (file shrinks 38KB → 13KB AND first line is #cloud-config).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:30:51 +04:00
e3mrah
e35729ad78
fix(infra): strip YAML-block comments from cloud-init to fit Hetzner 32KiB cap (Phase-8a bug #5) (#477)
Phase-8a-preflight deployment 3c158f712d564d84 failed at tofu apply with:

  Error: invalid input in field 'user_data'
    [user_data => [Length must be between 0 and 32768.]]
    on main.tf line 214, in resource "hcloud_server" "control_plane"

The rendered cloudinit-control-plane.tftpl is 38,085 bytes — 5,317
bytes over the Hetzner cap. The source template ships ~16 KB of
indent-0 and indent-2 documentation comments (YAML-level) that are
operationally inert at cloud-init boot.

Fix: wrap templatefile() in replace() with a RE2 regex that strips
lines whose first 0-2 chars are spaces followed by '#' (preserves
shebangs via [^!]). After strip, rendered cloud-init drops to ~13 KB.

Indent-4+ comments live INSIDE heredoc `content: |` blocks
(embedded shell scripts, kubeconfig fragments). Those are preserved.

Same fix applied to worker_cloud_init for parity.

Refs:
- Live evidence: deployment 3c158f712d564d84, tofu apply error 16:38:26 UTC
- Bug #5 in the Phase-8a-preflight tally
- #471: prior tftpl escape fix ($${SOVEREIGN_FQDN})
- #472: catalyst-build watches infra/hetzner/**

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:43:42 +04:00
e3mrah
03b1469331
fix(infra): escape ${SOVEREIGN_FQDN} in cloudinit-control-plane.tftpl comments (#471)
Phase-8a-preflight bug surfaced by first live provision attempt
(deployment febeeb888debf477, 2026-05-01 16:30 UTC):

  Error: Invalid function argument
    on main.tf line 140, in locals:
    140:   control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
  Invalid value for "vars" parameter: vars map does not contain key
  "SOVEREIGN_FQDN", referenced at ./cloudinit-control-plane.tftpl:12,37-51.

Tofu's templatefile() interprets ${...} ANYWHERE in the file (including
inside shell '#' comments), since the file is a template not a shell
script. Five lines in cloudinit-control-plane.tftpl reference
${SOVEREIGN_FQDN} as part of documentation prose explaining how
Flux postBuild.substitute interpolates the value at Flux apply time.

The Tofu vars map passed by main.tf:140 uses the canonical lowercase
HCL convention (sovereign_fqdn = var.sovereign_fqdn), not the uppercase
envsubst convention SOVEREIGN_FQDN. So Tofu fails: 'vars map does not
contain key SOVEREIGN_FQDN'.

The latest reference (line 12) was added by #326 (commit 20b89607);
the 4 older references predate it and were never exercised because no
live provision had ever been attempted before this Phase-8a run.

Fix: escape with double-dollar ($$) so Tofu emits a literal ${...}
in the rendered cloudinit file. The 5 comments now read $${SOVEREIGN_FQDN}
in source, render as ${SOVEREIGN_FQDN} in the user_data output —
preserving documentation intent without breaking templatefile().

Refs:
- Live provision: console.openova.io/sovereign/provision/febeeb888debf477
- Diagnostic: tofu plan exit 1 — vars map does not contain key SOVEREIGN_FQDN
- Out of scope: any other latent templatefile() escape issues — those
  surface as their own Phase-8a iterations

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 20:33:21 +04:00
e3mrah
20b896070f
feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326) (#448)
Wires the per-Sovereign K8s api-server's --oidc-* validator to the
per-Sovereign Keycloak realm so customer admins can authenticate
kubectl directly against their Sovereign — no static admin-kubeconfig
handoff, no rotated bearer-token exchange.

infra (cloud-init):
  - Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in
    infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed
    from sovereign_fqdn (https://auth.${sovereign_fqdn}/realms/sovereign)
    per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups
    prefixes scope OIDC subjects under "oidc:" so RoleBindings reference
    e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509.

Canonical seam (anti-duplication rule, ADR-0001 §11.3):
  - The bp-keycloak chart already bundles bitnami/keycloak's
    keycloakConfigCli post-install Helm hook Job, which imports realms
    declared under values.keycloak.keycloakConfigCli.configuration. We
    enable the existing seam — no bespoke kubectl-exec realm-creation
    script, no custom Admin-API call from catalyst-api.

bp-keycloak chart (1.1.2 → 1.2.0):
  - Enable keycloakConfigCli + ship inline sovereign-realm.json with:
    realm "sovereign" (invariant per Sovereign — Keycloak resolves the
    issuer claim from the request hostname, so no per-FQDN realm
    rename), default groups sovereign-admins/-ops/-viewers, an
    oidc-group-membership-mapper emitting the "groups" claim, public
    OIDC client "kubectl" with localhost:8000 + OOB redirect URIs
    (kubectl-oidc-login defaults), publicClient=true (kubectl runs
    locally and cannot safely hold a secret), PKCE S256 enforced.
  - Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape).
  - Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/,
    otech.omani.works/ to version: 1.2.0.
  - New chart test tests/oidc-kubectl-client.sh (4 cases) — all green.
  - Existing tests/observability-toggle.sh — still green.

Documentation:
  - Add §11 "kubectl OIDC for customer admins" runbook to
    docs/omantel-handover-wbs.md with one-time workstation setup
    (kubectl krew install oidc-login + config set-credentials),
    sovereign-admin RBAC binding (oidc:sovereign-admins →
    cluster-admin), and a 401-debugging table mapping common
    symptoms to root causes.
  - Carve #326 out of §7 "Out of scope" — it is shipped.
  - Add §9 status row.

Validation:
  - grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl
    → 2 (comment + the actual flag in the curl line)
  - grep -c 'oidc-username-claim' → 2
  - helm template platform/keycloak/chart → renders post-install
    keycloak-config-cli Job + ConfigMap with kubectl client (3 hits
    on grep "kubectl"; 1 hit on "clientId": "kubectl")
  - bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode)
  - 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle
    gates green

Out of scope (deferred to follow-up tickets):
  - Per-Sovereign user provisioning UI (#322, #323)
  - Refresh-token revocation on RoleBinding deletion (#324)
  - provider-kubernetes Crossplane ProviderConfig per Sovereign (#321)
  - omantel migration / Phase 8 live execution

NO catalyst-api or UI source files touched (those are #319/#322/#323
agents' territories per agent brief).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:07:52 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
1e17668055
feat(catalyst): Hetzner Object Storage credential pattern — Phase 0b (#371) (#409)
* feat(catalyst): Hetzner Object Storage credential pattern (Phase 0b, #371)

Adds the per-Sovereign Hetzner Object Storage credential capture + bucket
provisioning Phase 0b path described in the omantel handover WBS §5.
Hybrid Option A+B: wizard collects operator-issued S3 credentials (Hetzner
exposes no Cloud API to mint them — they're issued once in the Hetzner
Console and the secret half is shown exactly once), and OpenTofu
auto-provisions the per-Sovereign bucket via the aminueza/minio provider
+ writes a flux-system/hetzner-object-storage Secret into the new
Sovereign at cloud-init time so Harbor (#383) and Velero (#384) find
their backing-store credentials already in the cluster from Phase 1
onwards.

Extends the EXISTING canonical seam at every layer (per the founder's
anti-duplication rule for #371's session): the existing Tofu module at
infra/hetzner/, the existing handler/credentials.go validator, the
existing provisioner.Request struct, the existing store.Redact path,
and the existing wizard StepCredentials. No parallel binaries / scripts
/ operators introduced.

infra/hetzner/ (Tofu module — Phase 0):
  - versions.tf: declare aminueza/minio provider (Hetzner's official
    recommendation for S3-compatible bucket creation per
    docs.hetzner.com/storage/object-storage/getting-started/...)
  - variables.tf: 4 sensitive vars — region (validated against
    fsn1/nbg1/hel1, the European-only OS regions as of 2026-04),
    access_key, secret_key, bucket_name (RFC-compliant S3 naming)
  - main.tf: minio_s3_bucket.main resource — idempotent on re-apply,
    no force_destroy (Velero archive must survive a control-plane
    reinstall), object_locking=false (content-addressed digests are
    the immutability guarantee for Harbor; Velero uses S3 versioning)
  - cloudinit-control-plane.tftpl: write
    flux-system/hetzner-object-storage Secret with the canonical
    s3-endpoint/s3-region/s3-bucket/s3-access-key/s3-secret-key keys
    Harbor + Velero charts consume via existingSecret refs
  - outputs.tf: surface endpoint/region/bucket back to catalyst-api
    for the deployment record (credentials NEVER returned)

products/catalyst/bootstrap/api/ (Go):
  - internal/hetzner/objectstorage.go: NEW — minio-go/v7-based
    ListBuckets validator. Distinguishes auth failure ("rejected") from
    network failure ("unreachable") so the wizard renders the right
    error card. NOT a parallel cloud-resource path — the existing
    purge.go handles hcloud purge; objectstorage.go handles a separate
    API surface (S3-compatible) that has no equivalent client today.
  - internal/handler/credentials.go: extend with
    ValidateObjectStorageCredentials handler — same wire shape
    (200 valid:true / 200 valid:false / 503 unreachable / 400 bad
    input) as the existing token validator so the wizard's failure-card
    machinery handles both without per-endpoint switches.
  - cmd/api/main.go: wire POST
    /api/v1/credentials/object-storage/validate
  - internal/provisioner/provisioner.go: extend Request with
    ObjectStorageRegion/AccessKey/SecretKey/Bucket; Validate()
    rejects empty/malformed values fail-fast at /api/v1/deployments
    POST time; writeTfvars() emits the 4 new tfvars.
  - internal/handler/deployments.go: derive bucket name from FQDN slug
    pre-Validate (catalyst-<fqdn-with-dots-replaced-by-dashes>) so
    Hetzner's globally-namespaced bucket pool gets a deterministic,
    collision-resistant per-Sovereign name without operator input.
  - internal/store/store.go: redact access/secret keys; preserve
    region+bucket plain (they're public in tofu outputs anyway).

products/catalyst/bootstrap/ui/ (TypeScript / React):
  - entities/deployment/model.ts + store.ts: 4 new wizard fields
    (objectStorageRegion/AccessKey/SecretKey/Validated) with merge()
    coercion for legacy persisted state.
  - pages/wizard/steps/StepCredentials.tsx: ObjectStorageSection —
    region picker (fsn1/nbg1/hel1), masked secret-key input,
    Validate button gating Next. Same FailureCard taxonomy
    (rejected/too-short/unreachable/network/parse/http) the existing
    TokenSection uses, so the operator UX is consistent. Section
    only renders when Hetzner is among chosen providers — non-Hetzner
    Sovereigns skip Phase 0b until their own backing-store path lands.
  - pages/wizard/steps/StepReview.tsx: include
    objectStorageRegion/AccessKey/SecretKey in the
    POST /v1/deployments payload (bucket derived server-side).

Tests:
  - api: 7 new provisioner Validate tests (region/keys/bucket
    required + RFC-compliant + valid-region acceptance), 5 handler
    tests for the new endpoint (bad JSON / missing region / invalid
    region / short keys), 4 hetzner/objectstorage_test.go tests
    (endpoint composition + early input rejection), 1 handler test
    for the bucket-name derivation. Existing tests updated to supply
    the new required fields.
  - ui: StepCredentials.test.tsx pre-populates objectStorageValidated
    in beforeEach so the existing 11 SSH-section tests aren't gated
    on Object Storage validation.

DoD: a fresh Sovereign provision results in a usable S3 endpoint URL +
access/secret keys available as a K8s Secret in the Sovereign's home
cluster (flux-system/hetzner-object-storage), ready for consumption by
Harbor + Velero charts via existingSecret references.

Closes #371.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): #371 done — Hetzner Object Storage Phase 0b shipped (#409)

Marks #371 done with the architectural rationale (hybrid Option A + B —
Hetzner exposes no Cloud API to mint S3 keys, so the wizard MUST capture
them; OpenTofu auto-provisions the bucket + cloud-init writes the
flux-system/hetzner-object-storage Secret with the canonical s3-* keys
Harbor + Velero consume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:54:22 +04:00
e3mrah
d2ada908c9
feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316) (#408)
Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns
(no managed-KMS available). Selected **Option A — Shamir + cloud-init
seed** because:

  - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C)
    is structurally unavailable.
  - Transit-seal (Option B) requires a peer OpenBao cluster, only
    applicable to multi-region tier-1; out of scope for single-region
    omantel.
  - Manual unseal (Option D) violates the "first sovereign-admin lands
    on console.<sovereign-fqdn> ready to use" goal in
    SOVEREIGN-PROVISIONING.md §5.

Architecture (per issue #316 spec + acceptance criteria 1-6):

  1. Cloud-init on the control-plane node generates a 32-byte recovery
     seed from /dev/urandom and writes it to a single-use K8s Secret
     `openbao-recovery-seed` in the openbao namespace, with annotation
     `openbao.openova.io/single-use: "true"`. Pre-creates the openbao
     namespace to eliminate the race with Flux's HelmRelease apply.
  2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks:
       - `templates/init-job.yaml` (hook weight 5): consumes the seed,
         calls `bao operator init -recovery-shares=1 -recovery-threshold=1`,
         persists the recovery key inside OpenBao's auto-unseal config,
         deletes the seed Secret on success. Idempotent — re-runs detect
         Initialized=true and exit 0.
       - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables
         the Kubernetes auth method, mounts kv-v2 at `secret/`, writes
         the `external-secrets-read` policy, binds the `external-secrets`
         role to the ESO ServiceAccount in `external-secrets-system`.
  3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA
     + Role + RoleBinding the Jobs need (Secret get/list/delete in the
     openbao namespace; create/get/patch on the openbao-init-marker).
     Also emits the permanent `system:auth-delegator` ClusterRoleBinding
     bound to the OpenBao ServiceAccount so the Kubernetes auth method
     can call tokenreviews.authentication.k8s.io.
  4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml`
     bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true`
     per-Sovereign.

Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }}
{{- end }}`) used throughout — never `{{ fail }}`. Default `helm
template` render emits NOTHING new; opt-in via autoUnseal.enabled=true.

Acceptance criteria coverage:
  1. Provision fresh Sovereign — cloud-init writes seed, Flux installs
     bp-openbao 1.2.0, post-install Jobs run automatically. 
  2. bp-openbao HR Ready=True without manual intervention — install
     keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the
     init Job drives initialisation out-of-band on the same install). 
  3. `bao status` shows Sealed=false, Initialized=true within 5 minutes
     — init Job polls + retries up to 60×5s. 
  4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the
     auth-bootstrap Job binds the `external-secrets` role to ESO's SA
     before the Job exits. 
  5. Seed Secret deleted post-init — init Job deletes it via K8s API
     after consuming. 
  6. No openbao-root-token Secret in K8s — root token captured to
     /tmp/.root-token in the Job pod's tmpfs only; never written to a
     K8s Secret. The recovery key persists ONLY inside OpenBao's Raft
     state (auto-unseal config). 

Tests:
  - tests/auto-unseal-toggle.sh — 4 cases:
    * default render → no auto-unseal artefacts (skip-render works)
    * autoUnseal.enabled=true → both Jobs + correct hook weights
    * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap
    * idempotency annotations present on all 5 hook objects
  - tests/observability-toggle.sh — unchanged, all 3 cases green.
  - helm lint . — clean.

Files:
  - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/chart/values.yaml — `autoUnseal.*` block
  - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new
  - platform/openbao/chart/templates/init-job.yaml — new
  - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new
  - platform/openbao/chart/tests/auto-unseal-toggle.sh — new
  - platform/openbao/README.md — bootstrap procedure §2-3 expanded;
    auto-unseal alternatives table added.
  - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 →
    1.2.0, autoUnseal.enabled=true.
  - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block
    inserted between ghcr-pull-secret apply and flux-bootstrap apply.
  - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released.

Canonical seam used: extended existing `platform/openbao/chart/` per
the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud
calls. NO `{{ fail }}`. All knobs configurable via values.yaml per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode).

Co-authored-by: hatiyildiz <hat.yil@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:45:44 +04:00
e3mrah
8781aa3bc4
fix(provisioner): cloud-init bootstrap-kit path matches per-FQDN cluster dir (resolves #218) (#256)
The cloud-init template selected a per-FQDN GitRepository tree
(`!/clusters/${sovereign_fqdn}`) and pointed both bootstrap-kit
and infrastructure-config Flux Kustomizations at
`./clusters/${sovereign_fqdn}/{bootstrap-kit,infrastructure}` —
directories the wizard never commits before provisioning. Every
fresh Sovereign stalled Phase-1 with `kustomization path not found:
.../clusters/<fqdn>/bootstrap-kit: no such file or directory`
(live evidence on otech.omani.works deployment ce476aaf80731a46).

Canonical fix:
- GitRepository.spec.ignore selects the shared `_template` tree
  (`!/clusters/_template`).
- Both Kustomizations point at `./clusters/_template/bootstrap-kit`
  and `./clusters/_template/infrastructure`.
- Flux postBuild.substitute.SOVEREIGN_FQDN: ${sovereign_fqdn}
  interpolates the Sovereign's FQDN into the rendered manifests
  (envsubst replaces `${SOVEREIGN_FQDN}` in label values, ingress
  hostnames, HelmRelease values).
- clusters/_template/bootstrap-kit/*.yaml + kustomization.yaml
  switch their bare `SOVEREIGN_FQDN_PLACEHOLDER` markers to
  `${SOVEREIGN_FQDN}` so Flux's envsubst-based substitute can
  actually replace them.

Locked by 5 unit tests in
products/catalyst/bootstrap/api/internal/provisioner/cloudinit_path_test.go
that read the template and assert: GitRepository ignore selects
_template, both Kustomization paths point at _template subdirs,
both carry the postBuild.substitute hook, and no operative YAML
line carries `clusters/${sovereign_fqdn}`.

Closes #218

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:11:44 +04:00
e3mrah
5aee6aa737
fix(cloudinit): poll for local-path StorageClass instead of pod Ready (closes #207) (#209)
The previous fix for #189 wrote `kubectl wait --for=condition=Ready pod
-l app=local-path-provisioner --timeout=60s`. That cannot succeed
pre-Cilium: k3s runs with --flannel-backend=none, the node stays
Ready=False until Cilium installs (much later in cloud-init), and the
not-ready taint blocks every untolerated pod. The wait timed out at
60s, scripts_user failed, and the Flux-bootstrap + kubeconfig POST-back
sections never executed. Every fresh Sovereign provision was stuck
"before Cilium" with no error signal in the wizard.

Replace the impossible Pod-Ready wait with a poll for the StorageClass
object itself, which k3s registers independently of CNI within ~3s of
service start.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 21:30:27 +02:00