PR #5e0646e0 added `server.ha.replicas: 1` + `server.affinity: ""` at the
TOP LEVEL of the bp-openbao HR values block. platform/openbao/chart/
Chart.yaml declares the upstream openbao chart as a Helm SUBCHART under
`dependencies:`, so Helm umbrella-chart convention requires those values
nested under the `openbao:` key. Top-level keys are silently ignored.
Result on otech17: StatefulSet stayed at replicas=3, openbao-1/openbao-2
Pending forever (required pod-anti-affinity by hostname on a single
node), openbao-init Job DeadlineExceeded, HR Stalled.
Verified with `helm template`:
- top-level `server.ha.replicas=1` → STS renders replicas: 3
- nested `openbao.server.ha.replicas=1` → STS renders replicas: 1
Same fix for `server.affinity: ""` — the upstream chart's helper
`{{- if and (ne .mode "dev") .Values.server.affinity }}` treats empty
string as falsy and skips the affinity block entirely.
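The nesting difference, sketched as values YAML (keys from the upstream
openbao chart; comments ours):

```yaml
# WRONG — top-level keys are never passed to the subchart:
# server:
#   ha:
#     replicas: 1
#   affinity: ""

# RIGHT — nested under the subchart key declared in Chart.yaml dependencies:
openbao:
  server:
    ha:
      replicas: 1
    affinity: ""
```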
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
otech17 (6b17518f12d529ea, 2026-05-02): bp-openbao StatefulSet defaults to 3 replicas with required pod-anti-affinity by hostname. On a single-node Phase-8a Sovereign (cpx52, workerCount=0), 2/3 pods stay Pending forever, the openbao-init Job's wait-for-Ready loop times out, and the entire HR fails post-install.
Fix: override server.ha.replicas=1 and clear server.affinity until the worker-pool provisioning path is wired up. autoUnseal does not require a quorum to bootstrap (single-replica Raft init works the same shape).
Phase-8a-preflight otech16 (2026-05-02): bp-cnpg, bp-spire, and
bp-crossplane-claims intermittently failed chart pulls with i/o timeout
against `source-controller.catalyst-system.svc.cluster.local` — a
duplicate of the canonical source-controller already running in
flux-system NS (installed by cloud-init + bootstrap-kit slot 03).
Root cause: the bp-catalyst-platform umbrella chart declared the 10
foundation Blueprints (bp-cilium, bp-cert-manager, bp-flux,
bp-crossplane, bp-sealed-secrets, bp-spire, bp-nats-jetstream,
bp-openbao, bp-keycloak, bp-gitea) as Helm subchart dependencies. With
`targetNamespace: catalyst-system` the helm-controller rendered every
subchart's templates into catalyst-system — including the entire flux2
stack (source-controller, helm-controller, kustomize-controller,
notification-controller). Other HRs whose `sourceRef.namespace:
flux-system` references are resolved by the Flux service account in
catalyst-system were intermittently routed to the duplicate controller
via service discovery and timed out.
Fix shape: the umbrella ships ONLY Catalyst-Zero control-plane
workloads (catalyst-ui, catalyst-api, ProvisioningState CRD, Sovereign
HTTPRoute). The foundation layer is owned end-to-end by
clusters/_template/bootstrap-kit/ at slots 01..10, where each
Blueprint is a top-level Flux HelmRelease in its own canonical
namespace (flux-system, cert-manager, kube-system, etc.) with
explicit dependsOn ordering.
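Shape of a bootstrap-kit slot under the split — an illustrative sketch
only (slot/Blueprint names from the list above; the repository sourceRef
name and apiVersion are assumptions, not taken from this commit):

```yaml
# clusters/_template/bootstrap-kit/08-openbao.yaml (illustrative shape)
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-openbao
  namespace: flux-system          # canonical namespace, not catalyst-system
spec:
  dependsOn:
    - name: bp-spire              # explicit ordering replaces umbrella deps
  chart:
    spec:
      chart: bp-openbao
      sourceRef:
        kind: HelmRepository
        name: catalyst-blueprints # assumed name for illustration
        namespace: flux-system
```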
Changes:
- products/catalyst/chart/Chart.yaml: bump 1.1.8 → 1.1.9. Drop all 10
`dependencies:` entries. Add `annotations.catalyst.openova.io/no-upstream: "true"`
to opt out of the blueprint-release hollow-chart guard (issue #181)
— this umbrella legitimately ships only Catalyst-authored CRs.
- products/catalyst/chart/values.yaml: drop bp-keycloak.keycloak.postgresql
and bp-gitea.gitea.postgresql fullnameOverride blocks (no longer
applicable; bp-keycloak and bp-gitea are top-level HelmReleases in
separate namespaces, no postgresql collision possible).
- products/catalyst/chart/Chart.lock + charts/*.tgz removed (no deps).
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
chart version reference 1.1.8 → 1.1.9.
`helm template products/catalyst/chart/ --namespace catalyst-system`
emits ONLY catalyst-{ui,api} Deployments + Services + 2 PVCs (and
HTTPRoute when ingress.hosts.*.host is set). No Flux controllers,
no NetworkPolicies, no upstream-chart bytes. Verified.
Closes #510
Co-authored-by: e3mrah <emrah@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a-preflight live deployment otech16 (9e14dcc0d2de7586, 2026-05-02):
even after bumping install/upgrade timeout to 15m (commit f47948e7), the
post-install hooks for bp-openbao and bp-catalyst-platform STILL race their
dependencies. The hooks need workload pods Ready before they can do their
work — bp-openbao 3-node Raft init waits for cnpg-postgres + Cilium L7,
and bp-catalyst-platform umbrella init waits for keycloak + cnpg.
Fix (Option C — explicit dependsOn):
- bp-openbao: add bp-cnpg (already had bp-spire, bp-gateway-api)
- bp-catalyst-platform: add bp-keycloak + bp-cnpg (already had bp-gitea, bp-gateway-api)
This makes Flux wait for those HRs Ready=True BEFORE starting the install,
so the post-install hooks run after deps are warm. Eliminates the race.
Updated scripts/expected-bootstrap-deps.yaml to match. Verified:
- bash scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles
- go test ./tests/e2e/bootstrap-kit/... -run TestBootstrapKit_DependencyOrderMatchesCanonical — PASS
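The Option-C shape as an HR fragment (sketch; entry names from this
commit, surrounding fields elided):

```yaml
# bp-openbao HelmRelease fragment — Flux blocks the install until every
# listed HR reports Ready=True, so the post-install hook starts warm.
spec:
  dependsOn:
    - name: bp-spire        # pre-existing
    - name: bp-gateway-api  # pre-existing
    - name: bp-cnpg         # NEW — Raft init needs cnpg-postgres up
```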
Closes #512
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pattern as bp-keycloak in commit ac276f06: post-install hooks need >5m
on first-install. otech16 (9e14dcc0d2de7586) hit:
- bp-openbao: failed post-install: timed out waiting for the condition
- bp-catalyst-platform: failed post-install: timed out waiting for the condition
disableWait: true governs resource Ready wait, NOT hook timeout. Helm hook
timeout defaults to 5m. OpenBao 3-node Raft init + catalyst-platform
umbrella init Jobs both legitimately need ~5-10min on first install.
Phase-8a-preflight live deployment otech14 (7bbd66f49fa1d07d, 2026-05-02)
exposed: keycloak-config-cli post-install hook fails to connect to
keycloak-headless:8080 within Helm's default 5m hook timeout.
Root cause: keycloak server cold-start takes ~2.5min (PostgreSQL schema
migration + 100+ Liquibase changesets). The keycloak-config-cli hook
then waits up to 120s for the keycloak HTTP API to respond. Total wall
time = ~4.5min — RIGHT at the edge of Helm's 5m default. Cilium L7 init
plus first-time pod scheduling pushes it over.
Fix: set an explicit install/upgrade timeout: 15m on the HR. disableWait
already skips resource-readiness blocking, so the longer timeout
effectively governs only the post-install hook (a Helm-tracked Job).
This also matches PR #221's original 15m setting that was reverted by
the disableWait refactor — disableWait turns off resource-readiness
wait but does NOT govern hook timeout, which remained at the 5m default.
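A sketch of the resulting HR fragment (field placement per the Flux
HelmRelease API; values from this commit):

```yaml
# disableWait skips resource-readiness tracking only; the per-action
# timeout still bounds the whole Helm action, including hook Jobs.
spec:
  install:
    disableWait: true
    timeout: 15m
  upgrade:
    disableWait: true
    timeout: 15m
```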
The chart's CA Certificate template generated a `spec.commonName` of
`ca.<fullname>.cert-manager` where `<fullname>` is the Helm fullname
(release name + chart name). With the bootstrap-kit's release name
`cert-manager-powerdns-webhook`, the rendered CN landed at 78 bytes:
ca.cert-manager-powerdns-webhook-bp-cert-manager-powerdns-webhook.cert-manager
cert-manager's admission webhook rejects this against the RFC 5280
ub-common-name-length=64 PKIX upper bound, breaking otech11
(ac90a3ea12954e7d, chart 1.0.1, 2026-05-02) at install time.
Fix: collapse the CN onto the chart `name` helper (always
`bp-cert-manager-powerdns-webhook`, ≤63 chars) instead of the
release-prefixed `fullname`. The CA cert's CN is opaque identity only —
no client validates by hostname against this CN — so the shortening is
behaviour-preserving and stable across any operator-chosen releaseName.
Rendered CN with this fix:
ca.bp-cert-manager-powerdns-webhook.cert-manager (48 bytes)
Bumps chart 1.0.1 → 1.0.2 and updates the bootstrap-kit slot reference
in clusters/_template/bootstrap-kit/49-bp-cert-manager-powerdns-webhook.yaml.
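The fix as a template fragment (sketch — the helper include names are
assumed; the real chart's helpers may differ):

```yaml
# templates/ca-certificate.yaml (fragment, illustrative)
spec:
  # was: ca.{{ include "bp-cert-manager-powerdns-webhook.fullname" . }}.cert-manager
  #   → release-prefixed, 78 bytes under the bootstrap-kit release name
  commonName: ca.{{ include "bp-cert-manager-powerdns-webhook.name" . }}.cert-manager
  #   → always 48 bytes, inside RFC 5280's ub-common-name-length=64 bound
```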
Closes #508.
The pod template's metadata.labels block in the upstream Deployment
template included BOTH the `selectorLabels` helper AND the `labels`
helper. Since `labels` already emits app.kubernetes.io/name and
app.kubernetes.io/instance, the rendered YAML had those keys twice in
a single mapping, which Helm v3 post-render rejects with:
yaml: unmarshal errors:
line 29: mapping key "app.kubernetes.io/name" already defined at line 26
line 30: mapping key "app.kubernetes.io/instance" already defined at line 27
Surfaced live on Phase-8a-preflight otech11 (ac90a3ea12954e7d, on
catalyst-api:c148ef3, 2026-05-01).
Fix: drop the redundant `selectorLabels` include — `labels` is a
superset. Bump chart version 1.0.0 → 1.0.1 and update the bootstrap-kit
HR reference accordingly.
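The pod-template fix, sketched (helper names assumed for illustration):

```yaml
# Deployment pod template fragment after the fix:
template:
  metadata:
    labels:
      {{- include "chart.labels" . | nindent 8 }}
      # dropped: {{- include "chart.selectorLabels" . | nindent 8 }}
      # `labels` already emits app.kubernetes.io/name and .../instance,
      # so including both duplicated those keys in a single mapping.
```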
Closes openova#506.
Co-authored-by: e3mrah <emrah@openova.io>
Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream
Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them
ahead of every chart that ships HTTPRoute templates: bp-openbao,
bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor,
bp-grafana.
Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on
catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before
stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to
InstallFailed with `no matches for kind "HTTPRoute" in version
"gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI.
enabled=true` flag wires up the cilium gateway controller and creates
the `cilium` GatewayClass, but does NOT install the
gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no
`installCRDs`-equivalent knob for gateway-api so the upstream CRDs must
ship via a separate Blueprint.
Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by
the founder for ALL similar future cases: intra-chart CRD-ordering
breaks → split into two charts + Flux dependsOn. Mirrors the
bp-crossplane/bp-crossplane-claims and bp-external-secrets/
bp-external-secrets-stores splits.
Files:
- platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with
per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0
standard-install.yaml; helm.sh/resource-policy: keep on every CRD so
Helm uninstall does not orphan every HTTPRoute on the cluster
- platform/gateway-api/chart/scripts/regenerate.sh — developer tool
for re-vendoring on upstream version bump (annotation-driven)
- platform/gateway-api/chart/tests/crd-render.sh — chart integration
test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin)
- clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease
+ HelmRepository, dependsOn bp-cilium
- clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea,
11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml —
add `dependsOn: bp-gateway-api`
- clusters/_template/bootstrap-kit/kustomization.yaml — register
01a-gateway-api.yaml between 01-cilium and 02-cert-manager
- scripts/expected-bootstrap-deps.yaml — declare slot 1a + add
bp-gateway-api to depends_on of every HTTPRoute-using slot
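Why the keep annotation matters, as a vendored-CRD fragment (illustrative;
one of the five Standard-channel CRDs):

```yaml
# Without helm.sh/resource-policy: keep, a Helm uninstall of
# bp-gateway-api would delete the CRD and cascade-delete every
# HTTPRoute on the cluster.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: httproutes.gateway.networking.k8s.io
  annotations:
    helm.sh/resource-policy: keep
```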
Closes #503
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340)
The upstream seaweedfs/seaweedfs 4.22.0 chart now ships
templates/shared/security-configmap.yaml which calls fromToml — a Sprig
function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm
SDK older than 3.13 and PARSES every template before any
{{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's
mere presence breaks install on every Sovereign with:
parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21):
function "fromToml" not defined
even though enableSecurity defaults to false. Setting the gate value
does NOT skip parsing — only deleting / never-shipping the file does.
Fix shape (per ticket #340):
1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/
(committed bytes, not auto-pulled at build time). Required because the
upstream Helm repo overwrites 4.22.0 in place — re-pulling would
re-introduce the broken file.
2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml.
Every other template that references the deleted ConfigMap is gated
under {{- if enableSecurity }} so removing it is a no-op for our
default deployment shape (Catalyst SeaweedFS auth happens at the S3
layer via IAM creds from External Secrets, not via the upstream
chart's TLS/JWT machinery).
3. Drop the dependencies: block from chart/Chart.yaml; add
annotations.catalyst.openova.io/no-upstream=true so the
blueprint-release workflow's hollow-chart guard (issue #181) skips
the auto-pull/round-trip checks for this chart.
4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the
vendored bytes are tracked.
5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled).
6. Add tests/no-fromtoml.sh — chart-test that asserts the offending
file stays deleted across future re-vendors. Runs in
.github/workflows/blueprint-release.yaml as a publish-gating check.
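Steps 3 and 5 combined, as a Chart.yaml fragment (sketch; surrounding
fields elided):

```yaml
# platform/seaweedfs/chart/Chart.yaml (fragment, illustrative) — no
# dependencies: block, since the subchart bytes are vendored under
# charts/; the annotation opts out of the hollow-chart guard (#181).
apiVersion: v2
name: bp-seaweedfs
version: 1.1.0
annotations:
  catalyst.openova.io/no-upstream: "true"
```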
Unblocks Phase-8a observability + storage chain on otech (bp-loki,
bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn
bp-seaweedfs).
Closes #340
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps
The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines
35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct
architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud
Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG
in scripts/expected-bootstrap-deps.yaml was never updated to match.
Pre-existing drift on main; surfaced by the dependency-graph-audit
check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the
audit passes on the same PR — the two changes are both about the
storage chain on Sovereigns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): release PDM subdomain on Pod-restart orphan + add explicit release endpoint
Each failed provision permanently consumed its pool subdomain in PDM —
otech, otech1..otech9 stayed locked because two release seams were
missing:
1. Pod-restart orphan: when catalyst-api dies mid-provisioning, the
runProvisioning goroutine that would have called pdm.Release on
Phase-0 failure dies with the Pod. fromRecord rewrites the
rehydrated status to "failed" but nothing reaps the still-active
reservation. restoreFromStore now fires a best-effort
pdm.Release for every record it rewrites from in-flight to failed,
gated on AdoptedAt==nil so customer-owned Sovereigns are protected.
2. Abandoned-deployment retries: the only operator-driven release path
was Cancel & Wipe, which requires re-entering the HetznerToken.
Franchise customers retrying under the same subdomain after a
botched provision shouldn't need a Hetzner credential roundtrip
for a PDM-only fix. New endpoint
DELETE /api/v1/deployments/{id}/release-subdomain releases the
PDM allocation only — no Hetzner work, no record deletion. Refuses
in-flight (409), wiped (410), and adopted (422) deployments.
Tests cover: failed-deployment release, idempotent ErrNotFound, in-flight
refusal across all in-flight statuses, adopted protection, BYO no-op,
404 on unknown id, 502 on PDM transient, Pod-restart orphan release on
restoreFromStore, and the negative-path proof that a clean-failed record
on disk does NOT trigger a duplicate Release at restart.
Closes #489
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(catalyst-api): fix data race in fakePDM around orphan-release goroutine
The Pod-restart orphan-release path (issue #489) fires pdm.Release in a
goroutine spawned by restoreFromStore. The race detector flagged the
test's read of fpdm.releases against the goroutine's append. Adding a
sync.Mutex to fakePDM + a snapshotReleases() accessor closes the race
without changing the surface that 30+ other tests already use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a-preflight first live provision (febeeb888debf477) failed at
tofu plan, so catalyst-api recorded zero jobs. The wizard renders
synthetic phase rows from the local event stream regardless (per
INVIOLABLE-PRINCIPLES.md #1). Pre-fix the synthetic IDs collided with
bare phase slugs (e.g. id was `infrastructure` instead of
`infrastructure:tofu-init`), so clicking navigated to /jobs/infrastructure
which JobDetail's local jobsById couldn't resolve → "Job not found".
Cumulative resolution shipped earlier: PR #480 renamed cluster-bootstrap
group slug to phase-1-bootstrap (no longer collides with bare leaf id);
PR #498 routes catalyst-ui fetches through API_BASE so /jobs/{id} routes
work under /sovereign/*; jobs.ts always emits prefixed `infrastructure:tofu-*`
ids for the synthetic phase rows.
This commit adds 4 vitest cases asserting the contract:
- No row id is a forbidden bare slug (infrastructure / phase / cluster).
- Every row id matches one of the well-known shapes (group slug, tofu
phase id, cluster-bootstrap leaf, or application id).
- No row id contains "/" that would break the /jobs/$jobId route param.
- Every leaf's parentId resolves to a row in the same flat list (no
orphans → no un-clickable rows).
Live verification: console.openova.io/sovereign/provision/d198b513476df186/jobs
on catalyst-ui:141dc9d renders 50+ rows linking to either a /jobs/applications
group or a /jobs/bp-* leaf — every URL resolves. Bare /jobs/infrastructure
or /jobs/phase no longer appear.
Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01):
when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop
from issue #491), kustomize-controller held the revision lock for the
full 30m health-check timeout and refused to pick up new GitRepository
revisions. Even though Flux fetched fix `66ea39f0` from main within 1
minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD
SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait
would never finish, no new revision would ever apply, and the operator
was forced to wipe + reprovision from scratch. The same pathology
would repeat on every iteration unless the timeout shape changed.
Approach: Option A (timeout reduction). Drops `spec.timeout` on all
three Flux Kustomizations in the cloud-init template — bootstrap-kit,
sovereign-tls, infrastructure-config — from 30m to 5m. We KEEP
`wait: true` so downstream `dependsOn: bootstrap-kit` declarations
still get a consolidated "every HR Ready=True" signal. We do NOT
adjust `interval` (5m is correct).
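The resulting shape on each of the three Kustomizations (fragment;
values from this commit):

```yaml
# Flux Kustomization fragment — applied identically to bootstrap-kit,
# sovereign-tls, and infrastructure-config:
spec:
  interval: 5m   # unchanged — matches the GitRepository poll cadence
  timeout: 5m    # was 30m; frees the revision lock within ~6m worst case
  wait: true     # kept — dependsOn consumers still get "all HRs Ready"
```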
Why 5m specifically: matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case so a fresh
fix on main gets applied on the next poll. Anything shorter risks
tripping legitimately-slow CRD installs; anything longer re-introduces
the iteration-stall pathology #492 documents.
Why not Option B (wait: false): would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready
before it applies Provider/ProviderConfig manifests that talk to
Hetzner. Flipping wait: false would let infra-config apply prematurely.
Why not Option C (tighter retryInterval): doesn't address the root
cause. retryInterval governs how often to retry AFTER a failure;
spec.timeout is what holds the revision lock during a failed wait.
Test: kustomization_timeout_test.go (new) locks all three timeouts at
exactly 5m, blocks any uncommented `timeout: 30m` regression, and
asserts wait: true is retained. Three assertions, one for each failure
mode (regression to 30m, accidental 4th Kustomization without test
update, drive-by flip to wait: false).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a-preflight live screenshot (.playwright-mcp/otech9-cluster-bootstrap-2026-05-01.png)
showed the JobDetail flow canvas rendering as yellow line trails with
zero visible bubbles on a 50+ node provisioning graph. PR #486 passed
bounded tests for 5/8/12/15 nodes but never covered production scale
(~50 blueprint installs as siblings of one parent).
Root cause: every sibling at the same depth was anchored to one X
coordinate (depth*PER_DEPTH_X) and Y-clamped at ±Y_SCATTER_PX*2 (±160).
With 50 nodes × 92px collision pitch, the natural cluster wanted 4600px
height — but viewBox.MAX_VBOX_H=700 capped the visible window. Only
~15% of node centroids landed inside.
Fix: gridTargets useMemo pre-pass. For each depth bucket whose sibling
count exceeds the viewBox's vertical capacity (~7 at MAX_VBOX_H=700),
lay siblings out in a sub-column grid. Each node anchors to its
(subColX, subRowY) cell instead of the shared depth anchor. Sparse
depths fall through to the original force behaviour.
Forces wired through the grid:
- forceX target = cell.tx (or depthX for sparse depths)
- forceY target = regionYMid + cell.ty (or regionYMid + jitter)
- Per-tick clamp: cell-bounded for high-fan-out nodes, depth-bounded
for sparse nodes
- Initial seed positions placed at cell centers so the simulation
converges quickly without oscillating
Tests:
- New bounded cases for 30/50/80 siblings asserting ≥95% of node
centroids land inside the viewBox at first paint (was ~15% pre-fix)
- New 60-node case asserting viewBox stays bounded AND every bubble
retains radius ≥40 (visible)
- All 11 bounded tests pass; tsc --noEmit clean
Live verification deferred to next fresh Hetzner provision.
Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
Issue #494 — JobDetail page surfaced a 404 in the otech9 cluster-bootstrap
screenshot because a tier-naive `/api/...` path can bypass the
`/sovereign/` Vite base. While the audit confirmed every existing
fetch / EventSource in the catalyst-ui already routes through
`API_BASE`, the antipattern had reappeared once before and lacked a
guardrail to keep it from sneaking back in.
Changes:
• src/shared/config/urls.ts — add `apiUrl()` helper that normalises
a path which may begin with `/api/...` (e.g. the `streamURL` echoed
by the catalyst-api `POST /api/v1/deployments` response) into the
tier-correct `${API_BASE}/...` form. Idempotent; absolute http(s)
URLs pass through untouched. Doc-comment now records why the rule
exists for future readers.
• src/shared/lib/useProvisioningStream.ts — pipe the server-provided
`streamURL` through `apiUrl()` before opening the EventSource so
the wizard's live SSE reaches Traefik via the strip-sovereign
middleware regardless of the base path.
• src/test/no-hardcoded-api.test.ts — vitest regression guardrail:
walks every `.ts`/`.tsx` source file (excluding tests), strips
comments, fails CI if any `fetch( '/api/...`, `new EventSource(
'/api/...`, or `axios.<m>( '/api/...` literal slips in. Verified by
injecting a temporary violation file (caught) then removing it.
• src/shared/config/urls.test.ts — unit tests for `apiUrl()` covering
`/api/...`, `/v1/...`, `v1/...`, absolute http(s), and idempotency.
The 404 on the deployed otech9 deployment turned out to be a legitimate
backend response (`{"error":"job-not-found"}`) — the deployment had
zero jobs because the job-recorder wasn't backfilled — but the rule
this PR encodes is the correct invariant: the UI must never depend on
its host page resolving a relative path.
Per docs/INVIOLABLE-PRINCIPLES.md:
• #2 (no compromise) — full guardrail in CI, not a TODO.
• #4 (never hardcode) — every URL derives from `API_BASE`.
• #8 (24-hour-no-stop) — gate added so this exact bug can't
silently regress.
Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a bug #16: every fresh Hetzner Sovereign deadlocked at Phase 1
because the bootstrap helm install in cloud-init used a MINIMAL set of
--set flags (kubeProxyReplacement, k8sService*, tunnelProtocol,
bpf.masquerade) while the Flux bp-cilium HelmRelease curated a much
fuller value set. The drift was fatal:
1. cilium-agent waits forever for the operator to register
ciliumenvoyconfigs + ciliumclusterwideenvoyconfigs CRDs.
2. The upstream chart only registers them when envoyConfig.enabled=true.
3. With the bootstrap install missing that flag, the agent crash-looped,
the node taint node.cilium.io/agent-not-ready never lifted, and the
bootstrap-kit Kustomization (wait: true, 30 min timeout — issue #492)
never reconciled the upgrade that would have fixed the values.
The fix is single-source-of-truth via a new write_files entry that lays
down /var/lib/catalyst/cilium-values.yaml at cloud-init time, plus a -f
flag on the bootstrap helm install that consumes it. The values mirror
platform/cilium/chart/values.yaml's `cilium:` block PLUS the overlay
in clusters/_template/bootstrap-kit/01-cilium.yaml (envoyConfig.enabled,
l7Proxy). A new parity test (cilium_values_parity_test.go) locks the
two files together so a future commit cannot change one without the
other.
Approach: hybrid — keep the chart values.yaml as the umbrella source
of truth, render the merged effective values inline in cloud-init's
write_files block (the umbrella's `cilium:` subchart wrapper is
unwrapped because the bootstrap install targets cilium/cilium upstream
chart directly, not the bp-cilium umbrella). Test enforces presence
of every operator-curated key + load-bearing values.
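A fragment of the laid-down values file (sketch — only the load-bearing
keys named in this commit; the real file mirrors the full `cilium:`
block of platform/cilium/chart/values.yaml plus the 01-cilium.yaml
overlay, unwrapped from the umbrella key):

```yaml
# /var/lib/catalyst/cilium-values.yaml (fragment, illustrative)
kubeProxyReplacement: true
bpf:
  masquerade: true
l7Proxy: true
envoyConfig:
  enabled: true   # registers the CiliumEnvoyConfig CRDs the agent waits on
```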
Files modified:
infra/hetzner/cloudinit-control-plane.tftpl
products/catalyst/bootstrap/api/internal/provisioner/cilium_values_parity_test.go (new)
Refs: #491, #492 (bootstrap-kit wait timeout), 66ea39f0 (envoyConfig in HR)
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a-preflight live deployments otech1..otech9 (2026-05-01) consistently
flipped status: ready and phase1FinishedAt seconds after Phase-0 completed,
even though no kubeconfig PUT had been received and the new Sovereign was
still mid-cloud-init. The wizard banner read "Sovereign ready" while
catalyst-api had observed precisely zero HelmReleases. The screenshot at
.playwright-mcp/otech9-cluster-bootstrap-2026-05-01.png even logs:
"Phase-1 watch skipped: no kubeconfig is available on the
catalyst-api side."
…on a deployment whose status was simultaneously "ready". The UI lied to
the operator on every iteration today.
Root cause: markPhase1Done(dep, nil, "") was called from two short-circuit
paths (kubeconfig missing + watcher-start failure). Empty outcome fell
through the switch's default branch which set Status="ready". With no
observed components and no terminal classification there is nothing
truthful catalyst-api can say about the new Sovereign except "I don't know"
— which means failed, with an operator-actionable diagnostic.
Fix:
- Add helmwatch.OutcomeKubeconfigMissing + OutcomeWatcherStartFailed
outcome constants.
- Replace the two markPhase1Done(_, nil, "") call sites with explicit
outcomes.
- Add explicit cases in the switch that set Status="failed" with errors
pointing the operator at cloud-init logs / informer factory init.
- Keep a defensive "outcome empty AND len(finalStates)==0" trap so any
future caller that forgets to pass a non-empty outcome surfaces as a
programming-error failure rather than silently flipping ready.
- Strengthen TestRunPhase1Watch_EmptyKubeconfigShortCircuits to assert
Status=="failed", a non-empty Error mentioning kubeconfig, and the
exact OutcomeKubeconfigMissing on Result.Phase1Outcome. Pre-fix the
test only asserted "not stuck at phase1-watching" — too weak to catch
the false-ready regression.
go test ./products/catalyst/bootstrap/api/... — all green.
Phase-8a-preflight live deployment 1bfc46347564467b confirmed cilium-agent
crash-loops forever waiting for envoyconfig CRDs that the operator never
registers:
Still waiting for Cilium Operator to register the following CRDs:
[crd:ciliumclusterwideenvoyconfigs.cilium.io
crd:ciliumenvoyconfigs.cilium.io]
Root cause: upstream Cilium 1.16 chart has TWO separate envoy toggles:
- cilium.envoy.enabled — runs Envoy as a separate DaemonSet (was set)
- cilium.envoyConfig.enabled — registers CRDs + agent/operator controllers
for CiliumEnvoyConfig (was NOT set)
The chart values.yaml only sets envoy.enabled=true. Operator finishes CRD
registration with 11 of 13 CRDs, missing the two envoy ones, and
cilium-agent's node taint never lifts. All 37 dependent HelmReleases
block forever on the dependsOn chain.
Fix: set the missing toggle in the HR values (no chart rebuild needed;
lands via Flux directly on the next sovereign provision).
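The two toggles side by side, as an HR values fragment (keys from the
upstream Cilium 1.16 chart):

```yaml
# Both are required — envoy.enabled alone runs the DaemonSet but never
# registers the envoyconfig CRDs the agent waits for.
cilium:
  envoy:
    enabled: true        # Envoy as a separate DaemonSet (was already set)
  envoyConfig:
    enabled: true        # CRDs + agent/operator controllers (the fix)
```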
Founder #475 — the "Provisioning failed" / "Cancel & Wipe" / "Per-component
install monitoring is unavailable" banners pollute the Apps page. They render
above the apps grid, forcing operators onto the Apps tab to read terminal
deployment status, and crowd out the actual catalog.
Replaces the inline banners with a global toast surface:
• new shared/ui/notifications.tsx — NotificationProvider + useNotifications()
seam. Bottom-right stacked tray, fixed positioning so it's visible on
every tab (Apps / Jobs / Dashboard / Cloud / Users). Toasts replace
in-place by id so a deployment-failure update edits the existing card
rather than stacking duplicates.
• RootLayout — mounts NotificationProvider once at the top of the tree.
• AppsPage — strips FailureCard + Phase1UnavailableBanner. Two new
useEffects mirror the same copy + the same retry / wipe / back-to-wizard
actions through notify(). WipeDeploymentModal stays page-scoped so the
toast action can flip it open.
• useDeploymentEvents — wraps `retry` in useCallback so the AppsPage
notification effect doesn't re-fire every render (would otherwise loop
notify → re-render → notify).
Vitest:
• 8 cases on the notification surface (push, replace-by-id, dismiss,
role=alert vs role=status, action dismissOnClick semantics, provider
guard).
• 2 new cases on AppsPage that gate any future regression: main element
has zero role="alert" / role="status" children on first paint, and the
legacy banner test ids never render.
Acceptance vs founder ask:
• Apps page in failed state renders ONLY apps grid + tabs + search box.
• Same status content fires as a bottom-right toast with Retry stream /
Cancel & Wipe / Back to wizard actions.
• Notifications stay visible across Apps / Jobs / Dashboard / Cloud /
Users tabs because the tray is mounted in RootLayout above Outlet.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a-preflight live deployment a56961fbd5ae6003 confirmed bootstrap-kit
Kustomization still fails dry-run after #484 — same pattern, different CRD:
Gateway/kube-system/cilium-gateway dry-run failed: no matches for kind
'Gateway' in version 'gateway.networking.k8s.io/v1'
The Gateway API CRDs ARE installed by the Cilium HelmRelease (gatewayAPI.enabled=true)
but Flux validates ALL resources in the Kustomization BEFORE applying any HR. So at
validation time, Cilium hasn't installed yet → no CRDs → Gateway dry-run fails.
Same fix shape as #484 (Cert split): move Gateway into sovereign-tls Kustomization
which dependsOn bootstrap-kit Ready (i.e. Cilium HR is up + CRDs registered).
Updated:
- clusters/_template/sovereign-tls/cilium-gateway.yaml (NEW)
- clusters/_template/sovereign-tls/kustomization.yaml (resources list)
- clusters/_template/bootstrap-kit/01-cilium.yaml (Gateway block removed)
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Phase-8a-preflight live deployment 93161846839dc2e1: bootstrap-kit Flux
Kustomization fails server-side dry-run with
Certificate/kube-system/sovereign-wildcard-tls dry-run failed:
no matches for kind 'Certificate' in version 'cert-manager.io/v1'
→ entire Kustomization apply aborts → ZERO HelmReleases reconcile.
Fix: split the Certificate into its own Flux Kustomization sovereign-tls
that dependsOn bootstrap-kit (whose Ready gates on every HR including
bp-cert-manager). Gateway stays in 01-cilium.yaml because Gateway API
CRDs ship with Cilium itself.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Bug A — Flow physics scattered + tiny + km-long edges:
• forceY strength 0.05→0.22, forceLink strength 0.08→0.45 so siblings
cluster around the host instead of drifting to canvas edges.
• Initial Y scatter ±140→±60, X scatter ±40→±40 (kept), forceY target
scatter ±180→±60. Steady-state edges now ~110px.
• New MAX_VBOX (1600×900) ceiling on the SVG viewBox + per-tick x/y
clamp keep nodes inside the viewport regardless of force quirks.
Bug B — LogPane empty for derived (Phase-0 / cluster-bootstrap) jobs:
• useJobDetail returns 404 for derived jobs because the catalyst-api
Bridge has no Execution rows for them — but the SSE event reducer
DOES have the captured events in DerivedJob.steps[].
• LogPane gains a `fallbackLines: LogLine[]` prop; when executionId
is null AND fallbackLines is non-empty, renders inline through the
same dark-theme list as ExecutionLogs (no polling).
• JobDetail maps derivedJobsById[selectedJobId].steps → LogLine[]
via stepsToLogLines() and threads it through CanvasLogBridge.
Tests: FlowCanvasOrganic.bounded.test.tsx (viewBox + per-node clamp)
LogPane.fallback.test.tsx (3 paths: lines / empty / unset)
Pre-existing 11 cycle-protection + JobDetail tests still pass.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#477 introduced a regex "/(?m)^[ ]{0,2}#[^!].*\n/" to strip YAML-block
comments and fit Hetzner's 32KiB user_data cap. The [^!] guard preserved
shebangs like #!/bin/bash but DID NOT preserve cloud-init directives
like #cloud-config, #include, #cloud-boothook (none have ! after #).
Result: cloud-init received user_data with the #cloud-config first-line
DIRECTIVE stripped, didn't recognise the YAML body, and emitted:
recoverable_errors:
WARNING: Unhandled non-multipart (text/x-not-multipart) userdata
→ k3s never installed
→ Flux never bootstrapped
→ kubeconfig never PUT to catalyst-api
→ every Phase-8a provision since #477 has silently failed at boot
Live evidence: deployment a76e3fec8566add9 SSH'd 2026-05-01 18:30 UTC,
cloud-init status 'degraded done', /etc/systemd/system/k3s.service
absent, no flux binary.
Fix: require a SPACE after the '#' in the strip regex. YAML comments
ARE typically '# foo bar' (with space). cloud-init directives are
'#cloud-config' / '#include' / '#cloud-boothook' (no space) — the new
regex preserves them.
Out of scope: validating that ALL existing comments in the tftpl had
a space after #. They do — verified by sed pre-render passing the
sanity test (file shrinks 38KB → 13KB AND first line is #cloud-config).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Root cause: adaptDerivedJobsToFlat synthesised a "Cluster Bootstrap"
group whose slug ('cluster-bootstrap') equalled the bare leaf job's
id, also 'cluster-bootstrap' (jobs.ts line 210). byId.set(j.id, j)
in flowLayoutOrganic is last-wins, so the leaf overwrote the group
in the index. The leaf's parentId then pointed at itself, and
isVisible()/visibleRepresentative()/defaultFoldedAtDepth() walked
that self-reference forever — Chrome hung the moment the operator
clicked any job in the JobsTable.
Two-layer fix:
1. PREVENT — Rename GROUP_CLUSTER_BOOTSTRAP slug from
'cluster-bootstrap' to 'phase-1-bootstrap' so it cannot collide
with any leaf id. Parallel to the existing 'phase-0-infra' slug.
2. DEFEND — Cycle-protect every parent-chain walk in
flowLayoutOrganic.ts (isVisible, visibleRepresentative,
defaultFoldedAtDepth) by tracking visited ids. Malformed input
now degrades gracefully instead of freezing the browser.
Regression tests:
- flowLayoutOrganic.test.ts — locks each cycle case (self-cycle,
id-collision, multi-step a→b→a) to a 100ms budget.
- jobsAdapter.test.ts — asserts no group slug collides with any
leaf id from the default wizard state, plus the post-rename leaf
invariant (parentId !== id).
- JobDetail.hang.regression.test.tsx — mounts JobDetail with the
exact `infrastructure:tofu-apply` URL the live deployment hung
on, asserts < 2s.
- JobDetail.test.tsx — refreshed for the v3 surface (full-bleed
canvas + LogPane); the v2 tab-strip assertions are gone because
PR #353 retired that layout.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Phase-8a-preflight first live provision (deployment febeeb888debf477)
caught the wizard letting an operator click 'Validate' on the Object
Storage section before picking a region. The S3 ListBuckets call
succeeded (regionless), but the deployment-create POST failed at
server-side with `object storage region is required`, forcing a
Back -> fsn1 -> re-Validate -> Continue cycle.
Fix: when ObjectStorageSection mounts and store.objectStorageRegion is
empty, mirror Region 1's cloud-region (regionCloudRegions[0]) into
objectStorageRegion if it's one of fsn1/nbg1/hel1; otherwise fall back
to fsn1 (Object Storage is European-only, ash/hil compute Sovereigns
still pick a European S3 zone per model.ts §160). Pre-existing values
are never overridden, so operator overrides via the fsn1/nbg1/hel1
buttons survive across step navigation.
UX: the Validate button is enabled from first paint once keys are
filled in; no more dead-end click on a regionless state.
Tests: 6 new vitest cases covering the fsn1/nbg1/hel1 mirror,
ash fallback, pre-existing-value preservation, and operator override.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-8a-preflight deployment 520e7b7a217b226c surfaced this when
operator clicked Decommission Sovereign on a deployment whose
Phase-1 watch had already terminated:
panic: send on closed channel
-> handler.(*Handler).WipeDeployment.func1
-> /app/internal/handler/wipe.go:156
Returned HTTP 500 with empty body (panic recovery middleware ate the
detail). The wipe handler's emit() closure sends on dep.eventsCh
inside a select-with-default — but select-with-default does NOT
catch send-on-closed, only send-would-block.
Root cause: the prior 'if dep.eventsCh == nil' guard treated CLOSED
channels as healthy. Go has no portable check-without-receive for
closed, and a closed channel is non-nil. Phase-1 watch terminated
on this deployment because no kubeconfig arrived (Phase-8a bug #8 —
separate issue), and its terminal goroutine closed the channel
(deployments.go:575). Wipe then inherited the closed channel, the
guard skipped recreation, first emit() panicked.
Fix: always replace dep.eventsCh in WipeDeployment instead of guarding
on nil. Any stragglers reading from the old channel will see
end-of-stream (which is what closed already conveyed); the wipe emit
goroutine writes to the fresh channel.
Refs:
- Live evidence: deployment 520e7b7a217b226c, POST /wipe → 500 + panic in pod logs
- Companion bug #8: phase-1 watch terminates with componentCount=0 when no kubeconfig (separate ticket)
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Phase-8a-preflight deployment 3c158f712d564d84 failed at tofu apply with:
Error: invalid input in field 'user_data'
[user_data => [Length must be between 0 and 32768.]]
on main.tf line 214, in resource "hcloud_server" "control_plane"
The rendered cloudinit-control-plane.tftpl is 38,085 bytes — 5,317
bytes over the Hetzner cap. The source template ships ~16 KB of
indent-0 and indent-2 documentation comments (YAML-level) that are
operationally inert at cloud-init boot.
Fix: wrap templatefile() in replace() with a RE2 regex that strips
lines whose first 0-2 chars are spaces followed by '#' (preserves
shebangs via [^!]). After strip, rendered cloud-init drops to ~13 KB.
Indent-4+ comments live INSIDE heredoc `content: |` blocks
(embedded shell scripts, kubeconfig fragments). Those are preserved.
Same fix applied to worker_cloud_init for parity.
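The wrap looks roughly like this (vars map name hypothetical; regex
string verbatim from this change):

```hcl
locals {
  # Strip indent-0 / indent-2 YAML comment lines to fit Hetzner's
  # 32 KiB user_data cap; [^!] preserves shebang lines.
  control_plane_cloud_init = replace(
    templatefile("${path.module}/cloudinit-control-plane.tftpl", local.cloudinit_vars),
    "/(?m)^[ ]{0,2}#[^!].*\n/",
    ""
  )
}
```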
Refs:
- Live evidence: deployment 3c158f712d564d84, tofu apply error 16:38:26 UTC
- Bug #5 in the Phase-8a-preflight tally
- #471: prior tftpl escape fix ($${SOVEREIGN_FQDN})
- #472: catalyst-build watches infra/hetzner/**
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Phase-8a-preflight bug #2 (after #471's tftpl escape fix): catalyst-api
Docker image bakes /infra/hetzner/cloudinit-control-plane.tftpl. Without
this path in the build trigger, fixes to that file do NOT rebuild the
image — the running pod keeps using the stale tftpl and provisioning
keeps failing with the same Tofu error.
Per CLAUDE.md Rule 4a (GitHub Actions is the only build path), the path
filter MUST cover every directory the image depends on. Missing
infra/hetzner/** was a long-standing latent CI bug — surfaced by
Phase-8a #454 first live provision attempt.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Phase-8a-preflight bug surfaced by first live provision attempt
(deployment febeeb888debf477, 2026-05-01 16:30 UTC):
Error: Invalid function argument
on main.tf line 140, in locals:
140: control_plane_cloud_init = templatefile("${path.module}/cloudinit-control-plane.tftpl", {
Invalid value for "vars" parameter: vars map does not contain key
"SOVEREIGN_FQDN", referenced at ./cloudinit-control-plane.tftpl:12,37-51.
Tofu's templatefile() interprets ${...} ANYWHERE in the file (including
inside shell '#' comments), since the file is a template not a shell
script. Five lines in cloudinit-control-plane.tftpl reference
${SOVEREIGN_FQDN} as part of documentation prose explaining how
Flux postBuild.substitute interpolates the value at Flux apply time.
The Tofu vars map passed by main.tf:140 uses the canonical lowercase
HCL convention (sovereign_fqdn = var.sovereign_fqdn), not the uppercase
envsubst convention SOVEREIGN_FQDN. So Tofu fails: 'vars map does not
contain key SOVEREIGN_FQDN'.
The latest reference (line 12) was added by #326 (commit 20b89607); the
4 older references predate that and were never exercised because no live
provision had ever been attempted before this Phase-8a run.
Fix: escape with double-dollar ($$) so Tofu emits a literal ${...}
in the rendered cloudinit file. The 5 comments now read $${SOVEREIGN_FQDN}
in source, render as ${SOVEREIGN_FQDN} in the user_data output —
preserving documentation intent without breaking templatefile().
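Before/after, as it appears in the tftpl comments (wording illustrative):

```text
# BROKEN: templatefile() tries to resolve SOVEREIGN_FQDN from its vars map
#   Flux postBuild.substitute interpolates ${SOVEREIGN_FQDN} at apply time.
# FIXED: $$ escapes to a literal ${...} in the rendered user_data
#   Flux postBuild.substitute interpolates $${SOVEREIGN_FQDN} at apply time.
```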
Refs:
- Live provision: console.openova.io/sovereign/provision/febeeb888debf477
- Diagnostic: tofu plan exit 1 — vars map does not contain key SOVEREIGN_FQDN
- Out of scope: any other latent templatefile() escape issues — those
surface as their own Phase-8a iterations
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
First runs of preflight A (bootstrap-kit) and E (Keycloak) failed with the
same error: helm OCI pull from ghcr.io/openova-io/bp-* returning 401
'unauthorized: authentication required'. bp-* are PRIVATE GHCR packages.
#460's agent fixed it for B in c26fbcaf; #461's workflow already had GHCR login.
This commit applies the same helm-registry-login pattern to A and E.
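The pattern is a single step before any helm pull (sketch; assumes the
job's GITHUB_TOKEN can read the private bp-* packages):

```yaml
- name: Log in to GHCR for helm OCI pulls
  run: |
    echo "${{ secrets.GITHUB_TOKEN }}" \
      | helm registry login ghcr.io --username "${{ github.actor }}" --password-stdin
```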
WBS state on main after this commit:
- done (35): all chart-level + #317 + #319 + #453 + 4 preflights
- wip (0)
- blocked (3): 454, 455, 456 (Phase-8 live runs, operator-driven)
The preflights' first runs ALREADY surfaced a real CI bug pattern that
would have hit Phase 8a — exactly what they're for.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
PR #465 merged at 48b73af6 ships
.github/workflows/preflight-cilium-httproute.yaml — Phase-8a Risk R3
preflight (Cilium Gateway HTTPRoute admission for bp-catalyst-platform
on kind). Update §9 status row from "in flight" to "done".
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Surfaces Risk R6 (docs/omantel-handover-wbs.md §9a — Keycloak
realm-import config-CLI bootstrap timing untested). bp-keycloak 1.2.0
ships a sovereign realm + a public kubectl OIDC client via the
upstream bitnami/keycloak chart's keycloakConfigCli post-install Helm
hook (issue #326); this workflow proves it actually wires up on a
clean cluster before we run it on a real Sovereign.
Workflow installs bp-keycloak 1.2.0 on a kind cluster (helm/kind-action
v1, kindest/node:v1.30.6 — same versions as test-bootstrap-kit), waits
for the keycloak StatefulSet to roll out, polls for the
keycloakConfigCli post-install Job by label
(app.kubernetes.io/component=keycloak-config-cli), waits for it to
Complete, port-forwards svc/keycloak and asserts:
1. /realms/sovereign returns 200 (realm exists in Keycloak's DB).
2. The kubectl OIDC client is provisioned with publicClient=true,
redirectUris contains http://localhost:8000 (kubectl-oidc-login
default), and the groups client scope is wired with the
oidc-group-membership-mapper (the per-Sovereign k3s api-server's
--oidc-groups-claim flag depends on this).
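Because the Job is created by a post-install hook, a bare kubectl wait
races its creation, so the wait is two-phase (sketch; timeout value
illustrative):

```yaml
- name: Wait for keycloakConfigCli post-install Job
  run: |
    # Poll until the hook has created the Job at all...
    until kubectl get job -l app.kubernetes.io/component=keycloak-config-cli \
        -o name | grep -q job; do sleep 5; done
    # ...then block on completion.
    kubectl wait --for=condition=complete --timeout=300s \
      job -l app.kubernetes.io/component=keycloak-config-cli
```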
Acceptance per ticket: if the post-install Job fails, the workflow
summary captures Job logs + StatefulSet logs + cluster state via
GITHUB_STEP_SUMMARY so a failed run is debuggable without re-running.
Triggers are event-driven only per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled" rule — push on the workflow file itself
plus workflow_dispatch for ad-hoc re-runs.
Closes #462.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>