Commit Graph

1018 Commits

Author SHA1 Message Date
e3mrah
6441825dae
fix(catalyst-ui): Flow canvas drag-to-pin + dep-order Y + homogeneous spread (Closes #532) (#533)
Founder verbatim 2026-05-02:
> "the bubbles must be using the space properly and they should not
>  overlap, following the dependency order in the y axis they must
>  homogenously spread considering the edge cases such as max bubble
>  size max wire length etc. And also when the user drags and drop a
>  bubble to specific position it needs to respect by opening it a
>  room in case overlapping condition is there and it should stay
>  where user put it"

Five acceptance criteria:

1. **No overlap** — forceCollide(NODE_RADIUS+COLLIDE_PADDING).strength(.95)
   guarantees minimum pairwise spacing of 92px at sim convergence.
2. **Y = dependency order** — flowLayoutOrganic now emits a global
   topological-sort `depRank` (0..N-1) on every node. FlowCanvasOrganic
   uses depRank as the forceY target so root sits at top, deepest leaf
   at bottom.
3. **Homogeneous spread** — yForDepRank(rank) maps depRank evenly across
   [Y_MARGIN, MAX_VBOX_H - Y_MARGIN]. The Y axis fills the viewBox
   regardless of node count.
4. **Edge case bounds** — NODE_RADIUS=40 fixed, render-time clamp keeps
   every centroid inside the viewBox so no edge can exceed the viewBox
   diagonal.
5. **Drag-to-pin** — dragstart resets tickCountRef to 0 and re-heats
   the sim with alphaTarget(0.3).restart(); dragend keeps fx/fy set
   forever (until next drag). The per-tick depth-window clamp now
   skips pinned nodes so the operator's chosen position is never
   overridden.
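
The homogeneous-spread mapping in criterion 3 can be sketched as a pure function. This is an illustrative sketch, not the shipped flowLayoutOrganic code; the Y_MARGIN and MAX_VBOX_H values here are assumptions:

```typescript
// Illustrative constants — the real values live in FlowCanvasOrganic.
const Y_MARGIN = 60;
const MAX_VBOX_H = 800;

// Map a node's topological depRank (0..maxRank) evenly across the
// usable vertical band [Y_MARGIN, MAX_VBOX_H - Y_MARGIN], so the root
// sits at the top and the deepest leaf at the bottom regardless of
// node count.
function yForDepRank(rank: number, maxRank: number): number {
  if (maxRank <= 0) return MAX_VBOX_H / 2; // single node: centre it
  const t = rank / maxRank;
  return Y_MARGIN + t * (MAX_VBOX_H - 2 * Y_MARGIN);
}
```

The function is used as the forceY target per node, which is why the Y axis fills the viewBox for any node count.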

Critical fix wrt commit d81effc2: that commit caps the sim at
MAX_TICKS=120 then permanently calls sim.stop(). Without resetting
tickCount on dragstart, the sim is dead by the time the operator
drags and neighbours can't move out of the way of the pinned bubble.
This commit moves tickCount onto a useRef so the drag handler can
reset it to 0 each dragstart, giving every drag a fresh 2s
re-flow budget.
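
The tick-budget mechanic described above can be sketched as a small stateful helper. Names here are illustrative; in the real component the counter lives on a React useRef and dragstart re-heats the sim via `sim.alphaTarget(0.3).restart()`:

```typescript
const MAX_TICKS = 120; // ~2 seconds at 60fps

// A mutable counter shared between the tick handler and the drag
// handlers — kept on a ref in the component so resetting it does not
// trigger a re-render.
function makeTickBudget() {
  let count = 0;
  return {
    // Called once per simulation tick; returns true while budget
    // remains. A false return is the caller's cue to sim.stop().
    tick(): boolean {
      count += 1;
      return count < MAX_TICKS;
    },
    // Called on dragstart: every drag gets a fresh ~2s re-flow budget.
    reset(): void {
      count = 0;
    },
  };
}
```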

Tests:
- 14 existing bounded tests still pass (edge-length cap relaxed from
  an arbitrary 300px to the viewBox diagonal — the structural guarantee
  after the render-time clamp).

- 3 new tests added (drag-to-pin contract, dep-order Y, no-overlap
  pairwise spacing).
- 11 flowLayoutOrganic cycle-protection tests still pass.

Closes #532

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
2026-05-02 10:07:52 +04:00
github-actions[bot]
273a2ef8d0 deploy: update catalyst images to d81effc 2026-05-02 05:43:46 +00:00
alierenbaysal
d81effc2bc fix(catalyst-ui): cap Flow simulation at 120 ticks (~2s) — stop dynamic re-render (#481 round 3)
Founder verbatim: 'Physic is better now, but the problem is still not fully resolved, it keep invistely and dynamically trying, it should finish the physics max in 2 second after the page is opened'

Default d3-force alphaDecay=0.025 + alphaMin=0.001 → ~300 ticks of motion (~5s at 60fps). Bump decay to 0.06 + alphaMin to 0.01 → ~60 ticks (~1s). Hard MAX_TICKS=120 guard stops the sim deterministically even on slower devices.
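
The tick arithmetic follows from d3-force's geometric alpha decay: with alphaTarget = 0, each tick multiplies alpha by (1 − alphaDecay), and the simulation self-stops once alpha drops below alphaMin. A quick sketch of the estimate, using the values quoted in the commit (the ~300 / ~60 figures above are the same estimate, rounded):

```typescript
// Ticks until d3-force self-stops: alpha_n = (1 - alphaDecay)^n, stop
// when alpha_n < alphaMin  =>  n ≈ log(alphaMin) / log(1 - alphaDecay).
function ticksUntilStop(alphaMin: number, alphaDecay: number): number {
  return Math.ceil(Math.log(alphaMin) / Math.log(1 - alphaDecay));
}

const before = ticksUntilStop(0.001, 0.025); // 273 ticks (~4.5s at 60fps)
const after = ticksUntilStop(0.01, 0.06);    // 75 ticks (~1.25s at 60fps)
```

The MAX_TICKS=120 guard still matters because tick rate, not tick count, varies on slower devices.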

Visual: bubbles settle within 2 seconds, no more 'forever dynamic' look.
2026-05-02 07:41:44 +02:00
github-actions[bot]
cdf4af4421 deploy: update catalyst images to 41c69ba 2026-05-02 05:33:03 +00:00
e3mrah
41c69bae30
fix(catalyst-ui): parent-elision pass for unfolded groups (Closes #481) (#529)
Round 2 of bug #481. PR #521 hard-clamped centroids inside the viewBox
but the visual was still broken on otech17: 59 bubbles squeezed into a
single vertical column on the left, edges stretching across the canvas.

Root cause: the layout still emitted both the unfolded "Applications"
group AND its 50+ children, with parent→child structural edges. With
nested unfolded groups, the longest-path depth blew up to ~190; the
viewBox compression then squashed everything into a thin column.

Founder directive 2026-05-02:
  "if there is parent-child relation between tasks and when the
   child is expanded disappear the parent process from the canvas
   since all the children are visible, but it would require rewiring
   of the children to other jobs and parent calling their parents"

Implementation in flowLayoutOrganic.ts:
  - Mark every unfolded group with at least one visible child as
    elided. Elided groups emit no bubble.
  - Drop parent→child structural edges from elided groups.
  - Rewire inbound deps: when X depended on an elided group,
    fan out to every visible (non-elided) child of that group.
  - Lift outbound deps: when an elided group depended on Y, every
    visible child of the group now depends on Y. Hints are lifted
    the same way.
  - Cycle-safe: only elide when byId.get(j.id) === j (the canonical
    entry under #476 id-collision shape).
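
A minimal sketch of the elision + rewiring rules above, over hypothetical job shapes (the real flowLayoutOrganic types differ; this only illustrates the inbound fan-out and outbound lift):

```typescript
interface Job {
  id: string;
  isGroup?: boolean;
  unfolded?: boolean;
  children?: string[]; // ids of visible children
  deps: string[];      // ids this job depends on
}

// Elide every unfolded group that has visible children: the group emits
// no bubble, inbound deps fan out to its children, and its own deps are
// lifted onto every child.
function applyParentElision(jobs: Job[]): Job[] {
  const byId = new Map(jobs.map((j) => [j.id, j]));
  const elided = new Set(
    jobs
      .filter((j) => j.isGroup && j.unfolded && (j.children ?? []).length > 0)
      .map((j) => j.id),
  );
  // Expand a dep target: an elided group resolves (recursively, for
  // nested unfolded groups) to its visible children.
  const expand = (id: string): string[] =>
    elided.has(id) ? (byId.get(id)!.children ?? []).flatMap(expand) : [id];

  // Outbound lift: children of an elided group inherit the group's deps.
  const lifted = new Map<string, string[]>();
  for (const g of jobs) {
    if (!elided.has(g.id)) continue;
    for (const childId of expand(g.id)) {
      lifted.set(childId, [
        ...(lifted.get(childId) ?? []),
        ...g.deps.flatMap(expand),
      ]);
    }
  }

  return jobs
    .filter((j) => !elided.has(j.id))
    .map((j) => ({
      ...j,
      deps: [...new Set([...j.deps.flatMap(expand), ...(lifted.get(j.id) ?? [])])],
    }));
}
```

On the real-shape reduction from the tests (foundation → apps[c1..cN] → sentinel), the group disappears, sentinel fans out to the children, and each child inherits the dep on foundation, collapsing depth 12 to 2.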

Defence-in-depth: MAX_VISIBLE_DEPTH = 8. Any node still landing past
this after elision is clamped, so the natural-bbox horizontal span
can never grow past 8 * PER_DEPTH_X = 1280px.

Tests:
  - 7 new flowLayoutOrganic.test.ts cases: elision triggers under
    unfolded+visible-children, folded groups still render their
    bubble, inbound/outbound dep rewiring, depth cap, real-shape
    reduction (foundation→apps[c1..c10]→sentinel collapses to ≤2
    depth instead of 12), empty-group fallback.
  - 2 new FlowCanvasOrganic.bounded.test.tsx cases: parent bubble
    is NOT rendered when children are visible, parent IS rendered
    when folded.

All 25 layout+canvas-bounded tests pass. tsc clean.

Co-authored-by: alierenbaysal <aliebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:31:05 +04:00
e3mrah
d90abb1e85
fix(bp-openbao): unseal vault after init in chart Job (Closes #527) (#528)
The init Job ran `bao operator init -key-shares=1 -key-threshold=1`
which leaves the cluster Initialized=true but Sealed=true. Without
an explicit `bao operator unseal <key>` call the StatefulSet pod
stays sealed forever, the bp-openbao HelmRelease never reports
Ready=True, and every dependent blueprint (bp-external-secrets,
bp-external-secrets-stores) blocks on this dep.

This was the 5th and final latent bug in the chart's auto-unseal
flow (after PRs #518 #520 #523 #524 #525). On otech17
(6b17518f12d529ea, 2026-05-02) the init Job completed cleanly but
`bao status` reported Sealed=true forever.

Fix: parse `unseal_threshold` and `unseal_keys_b64` from the init
JSON, call `bao operator unseal <key>` $threshold times (1 with
the current key-shares=1 / key-threshold=1 config), then assert
`bao status -format=json | grep '"sealed":false'` before the Job
exits success. Bumps chart 1.2.2 -> 1.2.3 and HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:24:57 +04:00
github-actions[bot]
b8cdeaeb03 deploy: update catalyst images to 4e88abe 2026-05-02 05:17:32 +00:00
e3mrah
4e88abeace
fix(catalyst-ui): Phase-0 jobs stuck Running on failed deployments — converge banner from helmwatch outcome (Closes #519) (#526)
REGRESSION ROOT CAUSE — POST-PR #495

Pre-PR #495 (closes #488), every Phase-1 short-circuit path called
markPhase1Done with an empty outcome, falling through to the
default branch that flipped Status="ready". The wizard's
useDeploymentEvents hook took the `markAllReady` branch on every
terminal deployment, regardless of why it terminated. markAllReady
converged the Phase-0 / cluster-bootstrap banners to "done" (unless
they had been explicitly failed by streaming events).

Post-PR #495, Phase-1 short-circuits correctly flip Status="failed"
with `phase1Outcome` set to a precise classification — but the
wizard's `failed` branch did NOT call any banner-convergence
function. It only set streamStatus="failed" + streamError, leaving
the Phase-0 banner pinned at "running" forever.

The pin manifests because the catalyst-api producer channel
(internal/provisioner/provisioner.go:520, cap 256) overflows on
the high-throughput tofu-apply burst (200+ events in 10 seconds),
silently dropping the `tofu-output` line that drives the
hetznerInfra banner from "running" to "done" in the reducer
(eventReducer.ts:257). With markAllReady never called, the banner
is stuck.

LIVE EVIDENCE — otech17 deployment 6b17518f12d529ea (2026-05-02)

  • Started 02:08:13Z, ran for 1h 1min, finished 03:09:28Z with
    status="failed", phase1Outcome="flux-not-reconciling"
  • Total events captured: 237 — first event 02:08:14Z, last
    02:08:46Z. After +33s, the producer channel back-pressured
    and tofu-output / flux-bootstrap / component events were all
    dropped on the floor.
  • Wizard at /jobs displayed Phase-0 jobs as "Running" for
    2h 42m on a deployment that had finished an hour ago.

FIX — HYBRID OPTION B+C (CLIENT-SIDE PRIMARY)

(B) Server side — lift `phase1Outcome` to the top level of the
    /deployments/{id} JSON response. The field already lived on
    `result.phase1Outcome`; lifting it matches the existing pattern
    for `componentStates` + `phase1FinishedAt` so the wizard reads
    a flat shape.

(C) Client side — new exported reducer helper `markFailedTerminal`
    converges Phase-0 / cluster-bootstrap banners using the durable
    helmwatch outcome:

      • outcome ∈ {ready, failed, timeout, flux-not-reconciling,
                   kubeconfig-missing, watcher-start-failed}
        ⇒ Phase 0 finished. Hetzner-infra banner → done (unless
        already failed via streaming events).

      • outcome != "" but outcome != "ready"
        ⇒ Phase 1 failed. cluster-bootstrap banner → failed (the
        operator's eye snaps to the actual failing phase, not
        Phase 0).

      • outcome == "" (Phase 0 itself failed)
        ⇒ banners untouched. Streaming events have already
        recorded the truthful state; we don't have ground truth
        to flip them.
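
The three buckets can be sketched as a pure reducer helper. This is a sketch under assumed banner/state shapes, not the shipped eventReducer.ts signature:

```typescript
type Banner = "running" | "done" | "failed";

// Every durable helmwatch classification — any of these means Phase 1
// ran at all, i.e. Phase 0 must have finished.
const PHASE1_OUTCOMES = new Set([
  "ready", "failed", "timeout", "flux-not-reconciling",
  "kubeconfig-missing", "watcher-start-failed",
]);

interface Banners { hetznerInfra: Banner; clusterBootstrap: Banner }

function markFailedTerminal(outcome: string, b: Banners): Banners {
  const next = { ...b };
  if (PHASE1_OUTCOMES.has(outcome)) {
    // Phase 0 finished — converge its banner unless streaming events
    // already failed it explicitly.
    if (next.hetznerInfra !== "failed") next.hetznerInfra = "done";
  }
  if (outcome !== "" && outcome !== "ready") {
    // Phase 1 itself failed — point the operator's eye at that phase.
    next.clusterBootstrap = "failed";
  }
  // outcome === "": Phase 0 itself failed; streaming events hold the
  // ground truth, so banners are left untouched.
  return next;
}
```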

`useDeploymentEvents` calls markFailedTerminal on both the GET
/events terminal-snapshot path AND the SSE `done` event path so
the convergence happens whether the operator deep-links to a
finished deployment or stays on the page through completion.

PER-APPLICATION CARD GROUNDING PRESERVED

markFailedTerminal mirrors markAllReady's grounding rule: cards
are seeded ONLY from the durable componentStates map; no
auto-promotion to "installed". When the map is empty AND Phase 0
succeeded (i.e., we expected helmwatch ground truth and didn't
get any), `phase1WatchSkipped=true` so the AdminPage banner reads
"Phase-1 install state not available" instead of pretending
everything is fine.

TESTS — vitest + go test all green

  • eventReducer.test.ts — 9 new cases covering every outcome
    bucket, the "Phase 0 itself failed" preserve-truth case, the
    no-auto-promote contract, and the phase1WatchSkipped flag.
  • jobs.test.ts — direct regression repro: feed the exact
    otech17 event sequence (no tofu-output), assert pre-fix
    Phase-0 jobs are stuck Running, then assert
    `markFailedTerminal('flux-not-reconciling')` flips ALL four
    Phase-0 jobs to "succeeded" + cluster-bootstrap to "failed".
  • Go tests in handler package — all 26 tests pass; the
    State() lift of phase1Outcome doesn't disturb existing
    snapshot contracts.

Closes #519

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:15:34 +04:00
e3mrah
ba5a1929f1
fix(bp-openbao): use shamir-compatible init flags + bump 1.2.1→1.2.2 (refs #517) (#525)
The chart's init Job called `bao operator init -recovery-shares=1
-recovery-threshold=1` which only works with auto-unseal seal types
(gcpckms/awskms/transit). The upstream openbao chart's default config
uses `seal "shamir"` (no auto-unseal stanza in
values.standalone.config / values.ha.config), so the OpenBao API
returns 400: "parameters recovery_shares,recovery_threshold not
applicable to seal type shamir".

Switch to -key-shares=1 -key-threshold=1, the correct init flags for a
shamir seal. Operators wiring up auto-unseal seals later will need to
flip back via a chart-values toggle.

Bumps chart 1.2.1→1.2.2 + matches HR ref so Sovereigns pull the new
artifact on next reconcile.

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:14:05 +04:00
github-actions[bot]
5f5dc840e2 deploy: update catalyst images to 96dc2dc 2026-05-02 05:12:02 +00:00
alierenbaysal
96dc2dc76e deploy: update catalyst images to d28f8f7 2026-05-02 07:10:15 +02:00
e3mrah
6e3d3d281e
fix(bp-openbao): bump chart 1.2.0→1.2.1 + HR ref for busybox-wget fix (refs #517) (#524)
Bumps platform/openbao/chart/Chart.yaml version to 1.2.1 carrying the
busybox-compatible wget flag fix (PR #523). Also bumps the HR's
chart.spec.version in clusters/_template/bootstrap-kit/08-openbao.yaml
so Sovereigns pull the new bytes once blueprint-release publishes
ghcr.io/openova-io/bp-openbao:1.2.1.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:09:06 +04:00
e3mrah
5c0618d920
fix(bp-openbao): use busybox-compatible wget flag in init Job (refs #517) (#523)
The chart's init Job runs inside the openbao image (quay.io/openbao/
openbao:2.1.0) which uses busybox wget. The script's wget calls used
`--ca-certificate=$CACERT` which busybox wget does not support, causing
wget to print its usage page and fail with "seed Secret has no key
recovery-seed" (false negative — the parsing pipeline saw the usage
text instead of JSON).

Replace with `--no-check-certificate`. The Secret still requires the
Bearer token for auth — the lack of CA verification only affects
TLS handshake validation against an in-cluster API server reached via
the well-known kubernetes.default.svc DNS name (out-of-band attack
surface is negligible inside the pod network).

The `--method=DELETE` line for cleaning up the seed Secret remains —
busybox wget doesn't support method override either, but that line
is wrapped in `|| true` so the seed deletion failure doesn't block
the init Job from succeeding. Seed is single-use anyway and harmless
post-init (the recovery key is the OUTPUT of bao operator init, not
this seed).

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:07:52 +04:00
e3mrah
d28f8f7e53
fix(catalyst-ui): replace Settings divert-to-wizard with deployment-scoped Settings page (#522)
Founder ask (issue #516):
"currently setting button diverting user back to wizard, he is supposed to see
all relevant settings related information permanently in the settings page"

Fix:
- Sidebar Settings link now targets /provision/$deploymentId/settings (was /wizard)
- New route in app/router.tsx: provisionSettingsRoute
- New SettingsPage with 9 industry-standard SaaS-admin sections, in-page TOC
  left rail + section cards on the right
  1. Organization     2. Sovereign      3. API tokens
  4. Cloud creds      5. DNS            6. Domain mode
  7. Notifications    8. Members        9. Danger zone
- Read-only sections (Organization / Sovereign / DNS / Domain mode) wired to
  live useDeploymentEvents snapshot + useWizardStore so the page is grounded
  on real Sovereign state, not a placeholder.
- Sections without a backend API yet (api-tokens, cloud-credentials,
  notifications, danger-zone wipe/transfer) are flagged with an 'API pending'
  pill + data-pending-api='true' so the operator sees the surface but
  can't be misled into thinking it's wired.
- Per inviolable principle #10 (credential hygiene), tokens render as a fixed
  mask; plaintext is never read into the DOM.
- Members section links to the existing User Access page (/provision/$id/users).
- Danger zone Decommission CTA reuses the existing /decommission/$id route.

Tests:
- New SettingsPage.test.tsx covers chrome, all 9 sections, TOC anchors,
  org/sovereign/dns wiring to store + snapshot, regression guard against the
  /wizard divert, members link target, decommission link target, pending-api
  metadata.
- Sidebar.test.tsx adds a 3-test 'Settings entry' block asserting the link
  targets /provision/$id/settings (NOT /wizard), is highlighted on the new
  route, and is NOT highlighted on /wizard.

Closes #516

Co-authored-by: alierenbaysal <alieren.baysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:06:42 +04:00
github-actions[bot]
2f50f85d2b deploy: update catalyst images to 7acd7d7 2026-05-02 05:06:39 +00:00
e3mrah
7acd7d720d
fix(catalyst-ui): hard-clamp Flow node positions inside viewBox (Closes #481) (#521)
Live failure on otech17/cluster-bootstrap (2026-05-01): the JobDetail
flow canvas rendered as yellow horizontal lines with zero visible
bubbles. Investigation showed nodes drifted to x=30,400+ in viewBox
coordinates because the dependency graph had longest-path depth ~190
(bp-* leaves chained through "applications"). At PER_DEPTH_X=160 that
placed nodes far outside the MAX_VBOX_W=1200 ceiling. The viewBox
captured only a 1200px slice of a 30,000px cluster, so 99% of bubbles
rendered off-canvas. The few yellow lines visible were edges from the
selection job (openJobId) that happened to cross the visible window.

Pre-existing bounded tests modelled depth=0/1 stars only (#486 #499) so
this pathology slipped through.

Operator's two explicit asks for this fix:

  1. "No single bubble could be outside of the canvas."
  2. "Max distance of a line cannot be longer than a percentage of canvas."

Implementation — Constraint A + Constraint B as a render-time projection:

* Compute the natural cluster bbox from livePos as before, clamp to
  MIN/MAX viewBox.
* When natural bbox exceeds the viewBox, anchor vbX/vbY at the
  left-most / top-most cluster point (instead of centring on the
  cluster centroid which placed depth 0 at x=-15,000).
* Linear-scale every render position so the cluster fits inside an
  inset rectangle (vbX+CLAMP_INSET .. vbX+vbW-CLAMP_INSET).
  Pathological depth=190 chains compress to fit; sparse graphs with
  scale=1 are unchanged.
* Hard-clamp every position into the inset rectangle as a final safety
  net (FP drift, partial-tick frames). No bubble can ever sit outside.
* Edges read renderPos so they're drawn between already-clamped
  endpoints — line length is bounded by the viewBox diagonal, no
  "kilometers of edges" possible regardless of what the simulation
  produces.
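
The projection can be sketched as a pure function over node positions. This is illustrative only — the shipped code also handles the min-viewBox path, and the CLAMP_INSET value is assumed:

```typescript
interface Pt { x: number; y: number }
interface ViewBox { x: number; y: number; w: number; h: number }

// Constraint A + B as a render-time projection: anchor at the left-most
// / top-most cluster point, linear-scale the cluster into the inset
// rectangle (scale down only), then hard-clamp as a final safety net.
function projectToViewBox(pts: Pt[], vb: ViewBox, inset: number): Pt[] {
  const minX = Math.min(...pts.map((p) => p.x));
  const maxX = Math.max(...pts.map((p) => p.x));
  const minY = Math.min(...pts.map((p) => p.y));
  const maxY = Math.max(...pts.map((p) => p.y));
  const innerW = vb.w - 2 * inset;
  const innerH = vb.h - 2 * inset;
  // Sparse graphs keep scale = 1 and are unchanged.
  const sx = Math.min(1, maxX > minX ? innerW / (maxX - minX) : 1);
  const sy = Math.min(1, maxY > minY ? innerH / (maxY - minY) : 1);
  const clamp = (v: number, lo: number, hi: number) =>
    Math.max(lo, Math.min(hi, v));
  return pts.map((p) => ({
    x: clamp(vb.x + inset + (p.x - minX) * sx, vb.x + inset, vb.x + vb.w - inset),
    y: clamp(vb.y + inset + (p.y - minY) * sy, vb.y + inset, vb.y + vb.h - inset),
  }));
}
```

Because edges are drawn between already-projected endpoints, no line can exceed the viewBox diagonal whatever the simulation emits.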

Test:

* New `keeps every bubble inside the viewBox for a deep dependency
  chain` — 50-node depth chain (each at depth=i, mirroring production
  shape). Asserts every centroid inside [vbX, vbX+vbW] × [vbY, vbY+vbH]
  AND every line length <= viewBox diagonal. Strict — no overshoot
  tolerance. Fails on main, passes after the fix.
* All 11 pre-existing bounded tests still pass; tsc clean.

Live verification + Playwright screenshot to follow on the deployed SHA.

Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:04:37 +04:00
e3mrah
8ee647a21c
fix(bootstrap-kit): override bp-openbao autoUnseal.baoAddress to match actual Service name (refs #517) (#520)
The chart's init-job.yaml + auth-bootstrap-job.yaml default baoAddress
to `http://<release>-openbao:8200`. With spec.releaseName=openbao the
upstream openbao chart's fullname helper returns just `openbao` (not
`openbao-openbao`) because Release.Name CONTAINS chart name — see
upstream openbao chart _helpers.tpl `define "openbao.fullname"`. The
rendered Service is therefore `openbao` in the openbao namespace, not
`openbao-openbao`. The init Job's `bao status` calls therefore target a
DNS name that does not resolve (NXDOMAIN), the until loop runs out of
attempts, and the HR's post-install hook fails.

Override autoUnseal.baoAddress to the actual Service FQDN so the post-
install Jobs can reach the openbao server.

This is a fast-follow on #518 (subchart values nesting). Both issues
were latent because the previous Phase-8a sessions never reached the
auto-unseal step on a working 1-replica cluster.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:03:19 +04:00
e3mrah
585317b99e
fix(bootstrap-kit): nest bp-openbao single-replica overrides under openbao subchart key (Closes #517) (#518)
Commit 5e0646e0 added `server.ha.replicas: 1` + `server.affinity: ""` at the
TOP LEVEL of the bp-openbao HR values block. platform/openbao/chart/
Chart.yaml declares the upstream openbao chart as a Helm SUBCHART under
`dependencies:`, so Helm umbrella-chart convention requires those values
nested under the `openbao:` key. Top-level keys are silently ignored.

Result on otech17: StatefulSet stayed at replicas=3, openbao-1/openbao-2
Pending forever (required pod-anti-affinity by hostname on a single
node), openbao-init Job DeadlineExceeded, HR Stalled.

Verified with `helm template`:
- top-level `server.ha.replicas=1` → STS renders replicas: 3
- nested `openbao.server.ha.replicas=1` → STS renders replicas: 1

Same fix for `server.affinity: ""` — the upstream chart's helper
`{{- if and (ne .mode "dev") .Values.server.affinity }}` treats empty
string as falsy and skips the affinity block entirely.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 08:53:21 +04:00
e3mrah
5e0646e083 fix(bootstrap-kit): bp-openbao single-replica + no anti-affinity for single-node Sovereigns
otech17 (6b17518f12d529ea, 2026-05-02): bp-openbao StatefulSet defaults to 3 replicas with required pod-anti-affinity by hostname. On a single-node Phase-8a Sovereign (cpx52, workerCount=0), 2/3 pods stay Pending forever, the openbao-init Job's wait-for-Ready loop times out, and the entire HR fails post-install.

Fix: override server.ha.replicas=1 and clear server.affinity until the worker-pool provisioning path is wired up. autoUnseal does not require a quorum to bootstrap (single-replica Raft init works the same shape).
2026-05-02 04:45:58 +02:00
github-actions[bot]
e26b673031 deploy: update catalyst images to a542572 2026-05-02 02:07:50 +00:00
e3mrah
a54257212f
fix(bp-catalyst-platform): drop 10 foundation Blueprint subchart deps to stop duplicate source-controller in catalyst-system NS (#510) (#514)
Phase-8a-preflight otech16 (2026-05-02): bp-cnpg, bp-spire, and
bp-crossplane-claims intermittently failed chart pulls with i/o timeout
against `source-controller.catalyst-system.svc.cluster.local` — a
duplicate of the canonical source-controller already running in
flux-system NS (installed by cloud-init + bootstrap-kit slot 03).

Root cause: the bp-catalyst-platform umbrella chart declared the 10
foundation Blueprints (bp-cilium, bp-cert-manager, bp-flux,
bp-crossplane, bp-sealed-secrets, bp-spire, bp-nats-jetstream,
bp-openbao, bp-keycloak, bp-gitea) as Helm subchart dependencies. With
`targetNamespace: catalyst-system` the helm-controller rendered every
subchart's templates into catalyst-system — including the entire flux2
stack (source-controller, helm-controller, kustomize-controller,
notification-controller). Other HRs whose `sourceRef.namespace:
flux-system` reference is resolved by the Flux service-account in
catalyst-system intermittently routed to the duplicate via
service-discovery and timed out.

Fix shape: the umbrella ships ONLY Catalyst-Zero control-plane
workloads (catalyst-ui, catalyst-api, ProvisioningState CRD, Sovereign
HTTPRoute). The foundation layer is owned end-to-end by
clusters/_template/bootstrap-kit/ at slots 01..10, where each
Blueprint is a top-level Flux HelmRelease in its own canonical
namespace (flux-system, cert-manager, kube-system, etc.) with
explicit dependsOn ordering.

Changes:
- products/catalyst/chart/Chart.yaml: bump 1.1.8 → 1.1.9. Drop all 10
  `dependencies:` entries. Add `annotations.catalyst.openova.io/no-upstream: "true"`
  to opt out of the blueprint-release hollow-chart guard (issue #181)
  — this umbrella legitimately ships only Catalyst-authored CRs.
- products/catalyst/chart/values.yaml: drop bp-keycloak.keycloak.postgresql
  and bp-gitea.gitea.postgresql fullnameOverride blocks (no longer
  applicable; bp-keycloak and bp-gitea are top-level HelmReleases in
  separate namespaces, no postgresql collision possible).
- products/catalyst/chart/Chart.lock + charts/*.tgz removed (no deps).
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
  chart version reference 1.1.8 → 1.1.9.

`helm template products/catalyst/chart/ --namespace catalyst-system`
emits ONLY catalyst-{ui,api} Deployments + Services + 2 PVCs (and
HTTPRoute when ingress.hosts.*.host is set). No Flux controllers,
no NetworkPolicies, no upstream-chart bytes. Verified.

Closes #510

Co-authored-by: e3mrah <emrah@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:05:52 +04:00
e3mrah
f689766615
fix(infra): add explicit dependsOn to bp-openbao + bp-catalyst-platform (#512) (#513)
Phase-8a-preflight live deployment otech16 (9e14dcc0d2de7586, 2026-05-02):
even after bumping install/upgrade timeout to 15m (commit f47948e7), the
post-install hooks for bp-openbao and bp-catalyst-platform STILL race their
dependencies. The hooks need workload pods Ready before they can do their
work — bp-openbao 3-node Raft init waits for cnpg-postgres + Cilium L7,
and bp-catalyst-platform umbrella init waits for keycloak + cnpg.

Fix (Option C — explicit dependsOn):
- bp-openbao: add bp-cnpg (already had bp-spire, bp-gateway-api)
- bp-catalyst-platform: add bp-keycloak + bp-cnpg (already had bp-gitea, bp-gateway-api)

This makes Flux wait for those HRs Ready=True BEFORE starting the install,
so the post-install hooks run after deps are warm. Eliminates the race.

Updated scripts/expected-bootstrap-deps.yaml to match. Verified:
- bash scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles
- go test ./tests/e2e/bootstrap-kit/... -run TestBootstrapKit_DependencyOrderMatchesCanonical — PASS

Closes #512

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:00:56 +04:00
e3mrah
f47948e7a5 fix(bootstrap-kit): bp-openbao and bp-catalyst-platform install/upgrade timeout 5m→15m for post-install hooks
Same pattern as bp-keycloak in commit ac276f06: post-install hooks need >5m
on first-install. otech16 (9e14dcc0d2de7586) hit:
- bp-openbao: failed post-install: timed out waiting for the condition
- bp-catalyst-platform: failed post-install: timed out waiting for the condition

disableWait: true governs resource Ready wait, NOT hook timeout. Helm hook
timeout defaults to 5m. OpenBao 3-node Raft init + catalyst-platform
umbrella init Jobs both legitimately need ~5-10min on first install.
2026-05-02 03:39:02 +02:00
e3mrah
ac276f0670 fix(bootstrap-kit): bp-keycloak install/upgrade timeout 5m→15m for post-install hook
Phase-8a-preflight live deployment otech14 (7bbd66f49fa1d07d, 2026-05-02)
exposed: keycloak-config-cli post-install hook fails to connect to
keycloak-headless:8080 within Helm's default 5m hook timeout.

Root cause: keycloak server cold-start takes ~2.5min (PostgreSQL schema
migration + 100+ Liquibase changesets). The keycloak-config-cli hook
then waits up to 120s for the keycloak HTTP API to respond. Total wall
time = ~4.5min — RIGHT at the edge of Helm's 5m default. Cilium L7 init
plus first-time pod scheduling pushes it over.

Fix: set explicit install/upgrade timeout: 15m on the HR. disableWait
already prevents readiness blocking; this only governs the post-install
hook (Helm-tracked Job).

This also matches PR #221's original 15m setting that was reverted by
the disableWait refactor — disableWait turns off resource-readiness
wait but does NOT govern hook timeout, which remained at the 5m default.
2026-05-02 02:01:50 +02:00
e3mrah
7931e695b0
fix(cert-manager-powerdns-webhook): cap CA Certificate CN at 64 bytes (#509)
The chart's CA Certificate template generated a `spec.commonName` of
`ca.<fullname>.cert-manager` where `<fullname>` is the Helm fullname
(release name + chart name). With the bootstrap-kit's release name
`cert-manager-powerdns-webhook`, the rendered CN landed at 78 bytes:

  ca.cert-manager-powerdns-webhook-bp-cert-manager-powerdns-webhook.cert-manager

cert-manager's admission webhook rejects this against the RFC 5280
ub-common-name-length=64 PKIX upper bound, breaking otech11
(ac90a3ea12954e7d, chart 1.0.1, 2026-05-02) at install time.

Fix: collapse the CN onto the chart `name` helper (always
`bp-cert-manager-powerdns-webhook`, ≤63 chars) instead of the
release-prefixed `fullname`. The CA cert's CN is opaque identity only —
no client validates by hostname against this CN — so the shortening is
behaviour-preserving and stable across any operator-chosen releaseName.

Rendered CN with this fix:

  ca.bp-cert-manager-powerdns-webhook.cert-manager  (48 bytes)

Bumps chart 1.0.1 → 1.0.2 and updates the bootstrap-kit slot reference
in clusters/_template/bootstrap-kit/49-bp-cert-manager-powerdns-webhook.yaml.

Closes #508.
2026-05-02 02:09:41 +04:00
e3mrah
eeba0d90cc
fix(infra): dedupe labels in bp-cert-manager-powerdns-webhook deployment template (#507)
The pod template's metadata.labels block in the upstream Deployment
template included BOTH the `selectorLabels` helper AND the `labels`
helper. Since `labels` already emits app.kubernetes.io/name and
app.kubernetes.io/instance, the rendered YAML had those keys twice in
a single mapping, which Helm v3 post-render rejects with:

  yaml: unmarshal errors:
    line 29: mapping key "app.kubernetes.io/name" already defined at line 26
    line 30: mapping key "app.kubernetes.io/instance" already defined at line 27

Surfaced live on Phase-8a-preflight otech11 (ac90a3ea12954e7d, on
catalyst-api:c148ef3, 2026-05-01).

Fix: drop the redundant `selectorLabels` include — `labels` is a
superset. Bump chart version 1.0.0 → 1.0.1 and update the bootstrap-kit
HR reference accordingly.

Closes openova#506.

Co-authored-by: e3mrah <emrah@openova.io>
2026-05-02 01:52:50 +04:00
e3mrah
a292dedc52 fix(bootstrap-kit): bump bp-seaweedfs 1.0.1→1.1.0 to pick up #340 fromToml fix 2026-05-01 23:48:48 +02:00
e3mrah
e1f7d22f3c
fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503) (#505)
Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream
Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them
ahead of every chart that ships HTTPRoute templates: bp-openbao,
bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor,
bp-grafana.

Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on
catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before
stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to
InstallFailed with `no matches for kind "HTTPRoute" in version
"gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI.
enabled=true` flag wires up the cilium gateway controller and creates
the `cilium` GatewayClass, but does NOT install the
gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no
`installCRDs`-equivalent knob for gateway-api so the upstream CRDs must
ship via a separate Blueprint.

Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by
the founder for ALL similar future cases: intra-chart CRD-ordering
breaks → split into two charts + Flux dependsOn. Mirrors the
bp-crossplane/bp-crossplane-claims and bp-external-secrets/
bp-external-secrets-stores splits.

Files:
- platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with
  per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0
  standard-install.yaml; helm.sh/resource-policy: keep on every CRD so
  Helm uninstall does not orphan every HTTPRoute on the cluster
- platform/gateway-api/chart/scripts/regenerate.sh — developer tool
  for re-vendoring on upstream version bump (annotation-driven)
- platform/gateway-api/chart/tests/crd-render.sh — chart integration
  test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin)
- clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease
  + HelmRepository, dependsOn bp-cilium
- clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea,
  11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml —
  add `dependsOn: bp-gateway-api`
- clusters/_template/bootstrap-kit/kustomization.yaml — register
  01a-gateway-api.yaml between 01-cilium and 02-cert-manager
- scripts/expected-bootstrap-deps.yaml — declare slot 1a + add
  bp-gateway-api to depends_on of every HTTPRoute-using slot

Closes #503

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:30:50 +04:00
e3mrah
1865ac8975
fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) (#504)
* fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340)

The upstream seaweedfs/seaweedfs 4.22.0 chart now ships
templates/shared/security-configmap.yaml which calls fromToml — a Sprig
function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm
SDK older than 3.13 and PARSES every template before any
{{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's
mere presence breaks install on every Sovereign with:

  parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21):
    function "fromToml" not defined

even though enableSecurity defaults to false. Setting the gate value
does NOT skip parsing — only deleting / never-shipping the file does.
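
A minimal hypothetical reproduction of the parse-before-gate behaviour
(not the upstream file — just the shape that trips the older SDK):

```yaml
# On a Helm SDK older than 3.13, rendering a chart containing this file
# fails at PARSE time with `function "fromToml" not defined`.
# The gate evaluating to false never matters: parsing precedes evaluation.
{{- if .Values.enableSecurity }}
security.toml: |
  {{ fromToml .Values.securityToml | toYaml }}
{{- end }}
```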

Fix shape (per ticket #340):

1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/
   (committed bytes, not auto-pulled at build time). Required because the
   upstream Helm repo overwrites 4.22.0 in place — re-pulling would
   re-introduce the broken file.
2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml.
   Every other template that references the deleted ConfigMap is gated
   under {{- if enableSecurity }} so removing it is a no-op for our
   default deployment shape (Catalyst SeaweedFS auth happens at the S3
   layer via IAM creds from External Secrets, not via the upstream
   chart's TLS/JWT machinery).
3. Drop the dependencies: block from chart/Chart.yaml; add
   annotations.catalyst.openova.io/no-upstream=true so the
   blueprint-release workflow's hollow-chart guard (issue #181) skips
   the auto-pull/round-trip checks for this chart.
4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the
   vendored bytes are tracked.
5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled).
6. Add tests/no-fromtoml.sh — chart-test that asserts the offending
   file stays deleted across future re-vendors. Runs in
   .github/workflows/blueprint-release.yaml as a publish-gating check.

Unblocks Phase-8a observability + storage chain on otech (bp-loki,
bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn
bp-seaweedfs).

Closes #340

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps

The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines
35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct
architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud
Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG
in scripts/expected-bootstrap-deps.yaml was never updated to match.

Pre-existing drift on main; surfaced by the dependency-graph-audit
check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the
audit passes on the same PR — the two changes are both about the
storage chain on Sovereigns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:20:59 +04:00
github-actions[bot]
2f4c624bb9 deploy: update catalyst images to c148ef3 2026-05-01 20:50:37 +00:00
e3mrah
c148ef36ff
fix(catalyst-api): release PDM subdomain on Pod-restart orphan + add explicit release endpoint (closes #489) (#502)
* fix(catalyst-api): release PDM subdomain on Pod-restart orphan + add explicit release endpoint

Each failed provision permanently consumed its pool subdomain in PDM —
otech, otech1..otech9 stayed locked because two release seams were
missing:

1. Pod-restart orphan: when catalyst-api dies mid-provisioning, the
   runProvisioning goroutine that would have called pdm.Release on
   Phase-0 failure dies with the Pod. fromRecord rewrites the
   rehydrated status to "failed" but nothing reaps the still-active
   reservation. restoreFromStore now fires a best-effort
   pdm.Release for every record it rewrites from in-flight to failed,
   gated on AdoptedAt==nil so customer-owned Sovereigns are protected.

2. Abandoned-deployment retries: the only operator-driven release path
   was Cancel & Wipe, which requires re-entering the HetznerToken.
   Franchise customers retrying under the same subdomain after a
   botched provision shouldn't need a Hetzner credential roundtrip
   for a PDM-only fix. New endpoint
   DELETE /api/v1/deployments/{id}/release-subdomain releases the
   PDM allocation only — no Hetzner work, no record deletion. Refuses
   in-flight (409), wiped (410), and adopted (422) deployments.
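
The refusal matrix above, as a hedged sketch — the real handler is Go
inside catalyst-api; the state names and helper are assumptions for
illustration only:

```typescript
// Sketch of DELETE /api/v1/deployments/{id}/release-subdomain refusal
// logic. State names are assumptions; only the status codes are from
// the commit message.
type DeploymentState = "in-flight" | "wiped" | "adopted" | "failed";

function releaseSubdomainStatus(state: DeploymentState): number {
  switch (state) {
    case "in-flight": return 409; // provisioning still running — refuse
    case "wiped":     return 410; // deployment already wiped — refuse
    case "adopted":   return 422; // customer-owned Sovereign — refuse
    case "failed":    return 204; // release PDM allocation, touch nothing else
  }
}
```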

Tests cover: failed-deployment release, idempotent ErrNotFound, in-flight
refusal across all in-flight statuses, adopted protection, BYO no-op,
404 on unknown id, 502 on PDM transient, Pod-restart orphan release on
restoreFromStore, and the negative-path proof that a clean-failed record
on disk does NOT trigger a duplicate Release at restart.

Closes #489

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(catalyst-api): fix data race in fakePDM around orphan-release goroutine

The Pod-restart orphan-release path (issue #489) fires pdm.Release in a
goroutine spawned by restoreFromStore. The race detector flagged the
test's read of fpdm.releases against the goroutine's append. Adding a
sync.Mutex to fakePDM + a snapshotReleases() accessor closes the race
without changing the surface that 30+ other tests already use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:48:36 +04:00
github-actions[bot]
b8c639127a deploy: update catalyst images to bd9103a 2026-05-01 20:40:08 +00:00
github-actions[bot]
bd9103aadc deploy: update catalyst images to 66ff717 2026-05-01 22:38:03 +02:00
e3mrah
d6caeddf5d
test(catalyst-ui): lock in JobsTable row-id contract — no dead phase slugs (closes #474) (#501)
Phase-8a-preflight first live provision (febeeb888debf477) failed at
tofu plan, so catalyst-api recorded zero jobs. The wizard renders
synthetic phase rows from the local event stream regardless (per
INVIOLABLE-PRINCIPLES.md #1). Pre-fix the synthetic IDs collided with
bare phase slugs (e.g. id was `infrastructure` instead of
`infrastructure:tofu-init`), so clicking navigated to /jobs/infrastructure
which JobDetail's local jobsById couldn't resolve → "Job not found".

Cumulative resolution shipped earlier: PR #480 renamed cluster-bootstrap
group slug to phase-1-bootstrap (no longer collides with bare leaf id);
PR #498 routes catalyst-ui fetches through API_BASE so /jobs/{id} routes
work under /sovereign/*; jobs.ts always emits prefixed `infrastructure:tofu-*`
ids for the synthetic phase rows.

This commit adds 4 vitest cases asserting the contract:
- No row id is a forbidden bare slug (infrastructure / phase / cluster).
- Every row id matches one of the well-known shapes (group slug, tofu
  phase id, cluster-bootstrap leaf, or application id).
- No row id contains "/" that would break the /jobs/$jobId route param.
- Every leaf's parentId resolves to a row in the same flat list (no
  orphans → no un-clickable rows).
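
The contract reduces to a predicate like the following (a sketch — the
helper name is invented; the forbidden slugs and the no-"/" rule come
from the cases above):

```typescript
// Row ids that are bare phase slugs navigate to /jobs/<slug>, which the
// local jobsById cannot resolve → "Job not found".
const FORBIDDEN_BARE_SLUGS = new Set(["infrastructure", "phase", "cluster"]);

function isRoutableRowId(id: string): boolean {
  if (FORBIDDEN_BARE_SLUGS.has(id)) return false; // dead bare slug
  if (id.includes("/")) return false;             // breaks the /jobs/$jobId param
  return true;
}
```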

Live verification: console.openova.io/sovereign/provision/d198b513476df186/jobs
on catalyst-ui:141dc9d renders 50+ rows linking to either a /jobs/applications
group or a /jobs/bp-* leaf — every URL resolves. Bare /jobs/infrastructure
or /jobs/phase no longer appear.

Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
2026-05-02 00:35:52 +04:00
e3mrah
66ff717fbc
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01):
when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop
from issue #491), kustomize-controller held the revision lock for the
full 30m health-check timeout and refused to pick up new GitRepository
revisions. Even though Flux fetched fix `66ea39f0` from main within 1
minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD
SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait
would never finish, no new revision would ever apply, and the operator
was forced to wipe + reprovision from scratch. The same pathology
would repeat on every iteration unless the timeout shape changed.

Approach: Option A (timeout reduction). Drops `spec.timeout` on all
three Flux Kustomizations in the cloud-init template — bootstrap-kit,
sovereign-tls, infrastructure-config — from 30m to 5m. We KEEP
`wait: true` so downstream `dependsOn: bootstrap-kit` declarations
still get a consolidated "every HR Ready=True" signal. We do NOT
adjust `interval` (5m is correct).

Why 5m specifically: matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case so a fresh
fix on main gets applied on the next poll. Anything shorter risks
tripping legitimately-slow CRD installs; anything longer re-introduces
the iteration-stall pathology #492 documents.

Why not Option B (wait: false): would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready
before it applies Provider/ProviderConfig manifests that talk to
Hetzner. Flipping wait: false would let infra-config apply prematurely.

Why not Option C (tighter retryInterval): doesn't address the root
cause. retryInterval governs how often to retry AFTER a failure;
spec.timeout is what holds the revision lock during a failed wait.

Test: kustomization_timeout_test.go (new) locks all three timeouts at
exactly 5m AND blocks any operative `timeout: 30m` regression AND
asserts wait: true is retained. Three assertions, one for each failure
mode (regression to 30m, accidental 4th Kustomization without test
update, drive-by flip to wait: false).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:34:35 +04:00
github-actions[bot]
8457bf775e deploy: update catalyst images to a363f34 2026-05-01 20:32:14 +00:00
e3mrah
a363f340bc
fix(catalyst-ui): grid-layout high-fan-out depths so 50+ siblings fit visible viewBox (closes #493) (#499)
Phase-8a-preflight live screenshot (.playwright-mcp/otech9-cluster-bootstrap-2026-05-01.png)
showed the JobDetail flow canvas rendering as yellow line trails with
zero visible bubbles on a 50+ node provisioning graph. PR #486 passed
bounded tests for 5/8/12/15 nodes but never covered production scale
(~50 blueprint installs as siblings of one parent).

Root cause: every sibling at the same depth was anchored to one X
coordinate (depth*PER_DEPTH_X) and Y-clamped at ±Y_SCATTER_PX*2 (±160).
With 50 nodes × 92px collision pitch, the natural cluster wanted 4600px
height — but viewBox.MAX_VBOX_H=700 capped the visible window. Only
~15% of node centroids landed inside.

Fix: gridTargets useMemo pre-pass. For each depth bucket whose sibling
count exceeds the viewBox's vertical capacity (~7 at MAX_VBOX_H=700),
lay siblings out in a sub-column grid. Each node anchors to its
(subColX, subRowY) cell instead of the shared depth anchor. Sparse
depths fall through to the original force behaviour.

Forces wired through the grid:
- forceX target = cell.tx (or depthX for sparse depths)
- forceY target = regionYMid + cell.ty (or regionYMid + jitter)
- Per-tick clamp: cell-bounded for high-fan-out nodes, depth-bounded
  for sparse nodes
- Initial seed positions placed at cell centers so the simulation
  converges quickly without oscillating
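
The sub-column cell assignment can be sketched as follows (constants
copied from this message; the helper name and exact maths are
assumptions):

```typescript
// ~7 rows of 92px collision pitch fit inside the 700px viewBox, so
// sibling 7 wraps to a second sub-column instead of spilling off-screen.
const COLLISION_PITCH = 92;  // NODE_RADIUS + COLLIDE_PADDING pairwise spacing
const MAX_VBOX_H = 700;
const ROWS_PER_SUBCOL = Math.floor(MAX_VBOX_H / COLLISION_PITCH); // = 7

function gridCell(siblingIndex: number): { subCol: number; subRow: number } {
  return {
    subCol: Math.floor(siblingIndex / ROWS_PER_SUBCOL),
    subRow: siblingIndex % ROWS_PER_SUBCOL,
  };
}
```

Under this sketch, 50 siblings occupy 8 sub-columns of at most 7 rows,
all inside the vertical viewBox instead of a 4600px column.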

Tests:
- New bounded cases for 30/50/80 siblings asserting ≥95% of node
  centroids land inside the viewBox at first paint (was ~15% pre-fix)
- New 60-node case asserting viewBox stays bounded AND every bubble
  retains radius ≥40 (visible)
- All 11 bounded tests pass; tsc --noEmit clean

Live verification deferred to next fresh Hetzner provision.

Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
2026-05-02 00:29:23 +04:00
e3mrah
a5f5a37e99
fix(catalyst-ui): route every fetch through API_BASE + add regression guardrail (closes #494) (#498)
Issue #494 — JobDetail page surfaced a 404 in the otech9 cluster-bootstrap
screenshot because a tier-naive `/api/...` path can bypass the
`/sovereign/` Vite base. While the audit confirmed every existing
fetch / EventSource in the catalyst-ui already routes through
`API_BASE`, the antipattern had reappeared once before and lacked a
guardrail to keep it from sneaking back in.

Changes:

  • src/shared/config/urls.ts — add `apiUrl()` helper that normalises
    a path which may begin with `/api/...` (e.g. the `streamURL` echoed
    by the catalyst-api `POST /api/v1/deployments` response) into the
    tier-correct `${API_BASE}/...` form. Idempotent; absolute http(s)
    URLs pass through untouched. Doc-comment now records why the rule
    exists for future readers.
  • src/shared/lib/useProvisioningStream.ts — pipe the server-provided
    `streamURL` through `apiUrl()` before opening the EventSource so
    the wizard's live SSE reaches Traefik via the strip-sovereign
    middleware regardless of the base path.
  • src/test/no-hardcoded-api.test.ts — vitest regression guardrail:
    walks every `.ts`/`.tsx` source file (excluding tests), strips
    comments, fails CI if any `fetch( '/api/...`, `new EventSource(
    '/api/...`, or `axios.<m>( '/api/...` literal slips in. Verified by
    injecting a temporary violation file (caught) then removing it.
  • src/shared/config/urls.test.ts — unit tests for `apiUrl()` covering
    `/api/...`, `/v1/...`, `v1/...`, absolute http(s), and idempotency.
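
A hedged sketch of the `apiUrl()` normalisation rules listed above (the
API_BASE value and exact precedence are assumptions; only the
behaviours named in the bullets are modelled):

```typescript
// Assumed base for illustration — the real value derives from the Vite base.
const API_BASE = "/sovereign/api";

function apiUrl(path: string): string {
  if (/^https?:\/\//.test(path)) return path;      // absolute: pass through
  if (path.startsWith(API_BASE)) return path;      // already tier-correct: idempotent
  if (path.startsWith("/api/")) {
    return API_BASE + path.slice("/api".length);   // server-echoed /api/... paths
  }
  return `${API_BASE}/${path.replace(/^\//, "")}`; // bare v1/... or /v1/... paths
}
```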

The 404 on the deployed otech9 deployment turned out to be a legitimate
backend response (`{"error":"job-not-found"}`) — the deployment had
zero jobs because the job-recorder wasn't backfilled — but the rule
this PR encodes is the correct invariant: the UI must never depend on
its host page resolving a relative path.

Per docs/INVIOLABLE-PRINCIPLES.md:
  • #2 (no compromise) — full guardrail in CI, not a TODO.
  • #4 (never hardcode) — every URL derives from `API_BASE`.
  • #8 (24-hour-no-stop) — gate added so this exact bug can't
    silently regress.

Co-authored-by: alierenbaysal <alierenbaysal@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:26:21 +04:00
github-actions[bot]
c76b409c64 deploy: update catalyst images to 141dc9d 2026-05-01 20:11:03 +00:00
e3mrah
141dc9dfba
fix(infra): cloud-init helm install cilium values parity with Flux bp-cilium HR (closes #491) (#496)
Phase-8a bug #16: every fresh Hetzner Sovereign deadlocked at Phase 1
because the bootstrap helm install in cloud-init used a MINIMAL set of
--set flags (kubeProxyReplacement, k8sService*, tunnelProtocol,
bpf.masquerade) while the Flux bp-cilium HelmRelease curated a much
fuller value set. The drift was fatal:

  1. cilium-agent waits forever for the operator to register
     ciliumenvoyconfigs + ciliumclusterwideenvoyconfigs CRDs.
  2. The upstream chart only registers them when envoyConfig.enabled=true.
  3. With the bootstrap install missing that flag, the agent crash-looped,
     the node taint node.cilium.io/agent-not-ready never lifted, and the
     bootstrap-kit Kustomization (wait: true, 30 min timeout — issue #492)
     never reconciled the upgrade that would have fixed the values.

The fix is single-source-of-truth via a new write_files entry that lays
down /var/lib/catalyst/cilium-values.yaml at cloud-init time, plus a -f
flag on the bootstrap helm install that consumes it. The values mirror
platform/cilium/chart/values.yaml's `cilium:` block PLUS the overlay
in clusters/_template/bootstrap-kit/01-cilium.yaml (envoyConfig.enabled,
l7Proxy). A new parity test (cilium_values_parity_test.go) locks the
two files together so a future commit cannot change one without the
other.

Approach: hybrid — keep the chart values.yaml as the umbrella source
of truth, render the merged effective values inline in cloud-init's
write_files block (the umbrella's `cilium:` subchart wrapper is
unwrapped because the bootstrap install targets cilium/cilium upstream
chart directly, not the bp-cilium umbrella). Test enforces presence
of every operator-curated key + load-bearing values.
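
Illustrative shape of the written values file after unwrapping the
umbrella `cilium:` key (only keys named in this message are shown; the
values are placeholders, not the committed ones):

```yaml
# /var/lib/catalyst/cilium-values.yaml — sketch only
kubeProxyReplacement: true
tunnelProtocol: geneve
bpf:
  masquerade: true
envoyConfig:
  enabled: true   # registers the two envoyconfig CRDs (issue #491)
l7Proxy: true
```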

Files modified:
  infra/hetzner/cloudinit-control-plane.tftpl
  products/catalyst/bootstrap/api/internal/provisioner/cilium_values_parity_test.go (new)

Refs: #491, #492 (bootstrap-kit wait timeout), 66ea39f0 (envoyConfig in HR)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:09:10 +04:00
e3mrah
e2f8df7430
fix(catalyst-api): Phase-1 short-circuit must NOT flip Status to ready (closes #488) (#495)
Phase-8a-preflight live deployments otech1..otech9 (2026-05-01) consistently
flipped status: ready and phase1FinishedAt seconds after Phase-0 completed,
even though no kubeconfig PUT had been received and the new Sovereign was
still mid-cloud-init. The wizard banner read "Sovereign ready" while
catalyst-api had observed precisely zero HelmReleases. The screenshot at
.playwright-mcp/otech9-cluster-bootstrap-2026-05-01.png even logs:

    "Phase-1 watch skipped: no kubeconfig is available on the
    catalyst-api side."

…on a deployment whose status was simultaneously "ready". The UI lied to
the operator on every iteration today.

Root cause: markPhase1Done(dep, nil, "") was called from two short-circuit
paths (kubeconfig missing + watcher-start failure). Empty outcome fell
through the switch's default branch which set Status="ready". With no
observed components and no terminal classification there is nothing
truthful catalyst-api can say about the new Sovereign except "I don't know"
— which means failed, with an operator-actionable diagnostic.

Fix:
- Add helmwatch.OutcomeKubeconfigMissing + OutcomeWatcherStartFailed
  outcome constants.
- Replace the two markPhase1Done(_, nil, "") call sites with explicit
  outcomes.
- Add explicit cases in the switch that set Status="failed" with errors
  pointing the operator at cloud-init logs / informer factory init.
- Keep a defensive "outcome empty AND len(finalStates)==0" trap so any
  future caller that forgets to pass a non-empty outcome surfaces as a
  programming-error failure rather than silently flipping ready.
- Strengthen TestRunPhase1Watch_EmptyKubeconfigShortCircuits to assert
  Status=="failed", a non-empty Error mentioning kubeconfig, and the
  exact OutcomeKubeconfigMissing on Result.Phase1Outcome. Pre-fix the
  test only asserted "not stuck at phase1-watching" — too weak to catch
  the false-ready regression.
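
The classification can be sketched like this (the real code is Go in
catalyst-api; the constant strings and helper shape here are
assumptions):

```typescript
// Explicit outcomes for the two short-circuit paths that previously
// fell through to Status="ready" on an empty outcome.
const OutcomeKubeconfigMissing = "kubeconfig-missing";
const OutcomeWatcherStartFailed = "watcher-start-failed";

function phase1Status(outcome: string, observedComponents: number): "ready" | "failed" {
  switch (outcome) {
    case OutcomeKubeconfigMissing:   // no kubeconfig PUT ever arrived
    case OutcomeWatcherStartFailed:  // informer factory never started
      return "failed";
    case "":
      // Defensive trap: empty outcome + zero observed components is a
      // programming error, never a truthful "ready".
      return observedComponents === 0 ? "failed" : "ready";
    default:
      return "ready";
  }
}
```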

go test ./products/catalyst/bootstrap/api/... — all green.
2026-05-02 00:07:38 +04:00
hatiyildiz
66ea39f091 fix(infra): set envoyConfig.enabled=true so cilium-operator registers envoyconfig CRDs (Phase-8a bug #15)
Phase-8a-preflight live deployment 1bfc46347564467b confirmed cilium-agent
crash-loops forever waiting for envoyconfig CRDs that the operator never
registers:

  Still waiting for Cilium Operator to register the following CRDs:
  [crd:ciliumclusterwideenvoyconfigs.cilium.io
   crd:ciliumenvoyconfigs.cilium.io]

Root cause: upstream Cilium 1.16 chart has TWO separate envoy toggles:
- cilium.envoy.enabled — runs Envoy as a separate DaemonSet (was set)
- cilium.envoyConfig.enabled — registers CRDs + agent/operator controllers
  for CiliumEnvoyConfig (was NOT set)

The chart values.yaml only sets envoy.enabled=true. Operator finishes CRD
registration with 11 of 13 CRDs, missing the two envoy ones, and
cilium-agent's node taint never lifts. All 37 dependent HelmReleases
block forever on the dependsOn chain.

Fix lands in the HR values (no chart rebuild needed; Flux applies it
directly on the next sovereign provision).
2026-05-01 21:38:33 +02:00

github-actions[bot]
0765e89ac6 deploy: update catalyst images to e6663f1 2026-05-01 19:26:11 +00:00
e3mrah
e6663f169d
fix(catalyst-ui): remove status banners from Apps page; surface as global notifications (closes #475) (#487)
Founder #475 — the "Provisioning failed" / "Cancel & Wipe" / "Per-component
install monitoring is unavailable" banners pollute the Apps page. They render
above the apps grid, forcing operators onto the Apps tab to read terminal
deployment status, and crowd out the actual catalog.

Replaces the inline banners with a global toast surface:

  • new shared/ui/notifications.tsx — NotificationProvider + useNotifications()
    seam. Bottom-right stacked tray, fixed positioning so it's visible on
    every tab (Apps / Jobs / Dashboard / Cloud / Users). Toasts replace
    in-place by id so a deployment-failure update edits the existing card
    rather than stacking duplicates.
  • RootLayout — mounts NotificationProvider once at the top of the tree.
  • AppsPage — strips FailureCard + Phase1UnavailableBanner. Two new
    useEffects mirror the same copy + the same retry / wipe / back-to-wizard
    actions through notify(). WipeDeploymentModal stays page-scoped so the
    toast action can flip it open.
  • useDeploymentEvents — wraps `retry` in useCallback so the AppsPage
    notification effect doesn't re-fire every render (would otherwise loop
    notify → re-render → notify).
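
The replace-by-id tray semantics, as a minimal sketch (type shape and
helper name assumed):

```typescript
type Toast = { id: string; text: string };

// A toast with a known id edits the existing card in place; a new id
// appends — so deployment-failure updates never stack duplicates.
function pushToast(tray: Toast[], next: Toast): Toast[] {
  const i = tray.findIndex((t) => t.id === next.id);
  if (i === -1) return [...tray, next];
  return tray.map((t, j) => (j === i ? next : t));
}
```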

Vitest:
  • 8 cases on the notification surface (push, replace-by-id, dismiss,
    role=alert vs role=status, action dismissOnClick semantics, provider
    guard).
  • 2 new cases on AppsPage that gate any future regression: main element
    has zero role="alert" / role="status" children on first paint, and the
    legacy banner test ids never render.

Acceptance vs founder ask:
  • Apps page in failed state renders ONLY apps grid + tabs + search box.
  • Same status content fires as a bottom-right toast with Retry stream /
    Cancel & Wipe / Back to wizard actions.
  • Notifications stay visible across Apps / Jobs / Dashboard / Cloud /
    Users tabs because the tray is mounted in RootLayout above Outlet.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:23:12 +04:00
e3mrah
62e03ae129
fix(catalyst-ui): re-tune physics so bubbles stay visible (#481 follow-up) (#486)
PR #483 over-corrected the physics tuning — the operator reported
"infinitely stretching lines, can't see a single bubble in the canvas".
Three structural defects:

  (1) NODE_RADIUS stayed at 22 → diameter 44px. Combined with
      MAX_VBOX 1600x900 and a typical canvas-host of 600-800px wide
      (LogPane covers ~30% of the screen), preserveAspectRatio meet
      scaled the SVG to ~0.4x → bubbles rendered at 16-22px wide.
      Effectively invisible.

  (2) MIN_VBOX floors at 1200x700 forced sparse graphs (4-6 nodes
      across a ~200x100 layout space) into a viewBox 6x larger than
      the cluster, scaling bubbles down even further.

  (3) FORCE_X_STRENGTH=0.55 + FORCE_LINK_STRENGTH=0.45 fought hard on
      depth-disparate dependencies (depth-0 root wired to depth-5
      leaf), producing oscillation that read as "infinite stretch"
      in mid-tick frames.

The fix:
  - NODE_RADIUS 22 → 40 (diameter 80px — meets acceptance criterion)
  - GROUP_RADIUS 28 → 48
  - MIN_VBOX 1200x700 → 400x280 (sparse graphs render at native scale)
  - MAX_VBOX 1600x900 → 1200x700 (effective render scale stays ~1:1)
  - FORCE_X_STRENGTH 0.55 → 0.12 (gentle depth anchor, no oscillation)
  - FORCE_Y_STRENGTH 0.22 → 0.10
  - FORCE_LINK_STRENGTH 0.45 → 0.18
  - LINK_DISTANCE NODE_RADIUS*4 → NODE_RADIUS*2.5 (100px, edges <140px)
  - PER_DEPTH_X NODE_RADIUS*5 → NODE_RADIUS*4 (with bigger nodes)
  - Per-tick X clamp tightened from ±1.5×PER_DEPTH_X to ±1.0×
  - Per-tick Y clamp tightened from MAX_VBOX_H/2 to ±Y_SCATTER_PX*2
  - Initial seed X scatter scales with NODE_RADIUS

Tests:
  - FlowCanvasOrganic.bounded.test.tsx — 7 cases, locks viewBox ≤
    1200x700, bubble radius ≥40 (diameter ≥80), edge length <300px,
    every node centroid strictly inside viewBox for 5/8/12/15-node
    graphs.
  - All pre-existing tests pass: flowLayoutOrganic.test (cycle
    protection #476), FlowPage.test, JobDetail.test, JobDetail.hang
    regression, LogPane.fallback (the #483 LogPane work is unaffected).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:22:39 +04:00
e3mrah
a5f3ec900a
fix(infra): move Cilium Gateway to sovereign-tls Kustomization too (Phase-8a bug #14) (#485)
Phase-8a-preflight live deployment a56961fbd5ae6003 confirmed bootstrap-kit
Kustomization still fails dry-run after #484 — same pattern, different CRD:

  Gateway/kube-system/cilium-gateway dry-run failed: no matches for kind
  'Gateway' in version 'gateway.networking.k8s.io/v1'

The Gateway API CRDs ARE installed by the Cilium HelmRelease (gatewayAPI.enabled=true)
but Flux validates ALL resources in the Kustomization BEFORE applying any HR. So at
validation time, Cilium hasn't installed yet → no CRDs → Gateway dry-run fails.

Same fix shape as #484 (Cert split): move Gateway into sovereign-tls Kustomization
which dependsOn bootstrap-kit Ready (i.e. Cilium HR is up + CRDs registered).
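
Sketch of the resulting ordering (names from this message; field shape
follows the Flux kustomize.toolkit API):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sovereign-tls
spec:
  dependsOn:
    # bootstrap-kit Ready implies the Cilium HR applied and the Gateway
    # API CRDs are registered, so the Gateway passes server-side dry-run.
    - name: bootstrap-kit
  path: ./clusters/_template/sovereign-tls
```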

Updated:
- clusters/_template/sovereign-tls/cilium-gateway.yaml (NEW)
- clusters/_template/sovereign-tls/kustomization.yaml (resources list)
- clusters/_template/bootstrap-kit/01-cilium.yaml (Gateway block removed)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 23:01:53 +04:00
github-actions[bot]
5debb7dd8a deploy: update catalyst images to 0d75ae3 2026-05-01 18:50:32 +00:00
e3mrah
0d75ae354f
fix(infra): split Cilium-Gateway Certificate into sovereign-tls Kustomization (Phase-8a bug #13) (#484)
Phase-8a-preflight live deployment 93161846839dc2e1: bootstrap-kit Flux
Kustomization fails server-side dry-run with

  Certificate/kube-system/sovereign-wildcard-tls dry-run failed:
  no matches for kind 'Certificate' in version 'cert-manager.io/v1'

→ entire Kustomization apply aborts → ZERO HelmReleases reconcile.

Fix: split the Certificate into its own Flux Kustomization sovereign-tls
that dependsOn bootstrap-kit (whose Ready gates on every HR including
bp-cert-manager). Gateway stays in 01-cilium.yaml because Gateway API
CRDs ship with Cilium itself.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 22:48:18 +04:00
github-actions[bot]
5da604595d deploy: update catalyst images to 67a408f 2026-05-01 18:43:13 +00:00
e3mrah
67a408f66d
fix(catalyst-ui): JobDetail flow physics + exec-logs viewer (closes #481) (#483)
Bug A — Flow physics scattered + tiny + km-long edges:
  • forceY strength 0.05→0.22, forceLink strength 0.08→0.45 so siblings
    cluster around the host instead of drifting to canvas edges.
  • Initial Y scatter ±140→±60, X scatter ±40→±40 (kept), forceY target
    scatter ±180→±60. Steady-state edges now ~110px.
  • New MAX_VBOX (1600×900) ceiling on the SVG viewBox + per-tick x/y
    clamp keep nodes inside the viewport regardless of force quirks.

Bug B — LogPane empty for derived (Phase-0 / cluster-bootstrap) jobs:
  • useJobDetail returns 404 for derived jobs because the catalyst-api
    Bridge has no Execution rows for them — but the SSE event reducer
    DOES have the captured events in DerivedJob.steps[].
  • LogPane gains a `fallbackLines: LogLine[]` prop; when executionId
    is null AND fallbackLines is non-empty, renders inline through the
    same dark-theme list as ExecutionLogs (no polling).
  • JobDetail maps derivedJobsById[selectedJobId].steps → LogLine[]
    via stepsToLogLines() and threads it through CanvasLogBridge.
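
The fallback decision reduces to a three-way choice, sketched here
(prop names from this message; the helper is invented for
illustration):

```typescript
type LogLine = { text: string };

// LogPane source selection: real Execution rows win; derived jobs with
// SSE-captured steps render inline; otherwise the pane is empty.
function logSource(
  executionId: string | null,
  fallbackLines: LogLine[],
): "execution-api" | "fallback" | "empty" {
  if (executionId !== null) return "execution-api"; // normal polling path
  if (fallbackLines.length > 0) return "fallback";  // inline, no polling
  return "empty";
}
```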

Tests: FlowCanvasOrganic.bounded.test.tsx (viewBox + per-node clamp)
       LogPane.fallback.test.tsx (3 paths: lines / empty / unset)
       Pre-existing 11 cycle-protection + JobDetail tests still pass.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:41:13 +04:00