Commit Graph

1044 Commits

github-actions[bot]
a78b4e2e51 deploy: update catalyst images to dad5ead 2026-05-04 07:54:28 +00:00
e3mrah
dad5ead534
feat(wizard): Marketplace mode step (#710 wave 3a) (#725)
Inserts StepMarketplace between StepComponents and StepDomain so the
operator can opt the new Sovereign into a multi-tenant SaaS platform
during provisioning. The toggle drives store.marketplaceEnabled, which
StepReview now ships in the POST /v1/deployments body — the catalyst-api
Request struct + OpenTofu var.marketplace_enabled + cloud-init Flux
substitute + bp-catalyst-platform ingress.marketplace.enabled values
were all wired earlier (PR #719); this PR is the missing UI seam.

Brand fields (name / tagline / primary colour) persist on the wizard
state so a future settings page can read them without re-prompting on
every wizard run. The chart only consumes the enabled flag for now.

Wizard step list grows from 7 to 8 steps (StepMarketplace at id=6,
shifting Domain → 7 and Review → 8). WizardLayout test updated to
assert the new count; the pre-existing StepComponents test failures
(CORTEX cascade) and the @tabler/icons-react typecheck error are
untouched and unrelated.

Companion PRs (other agents): post-launch settings page + catalog
publish/unpublish admin. This is 1 of 3 parallel pieces on #710 wave 3.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 11:52:17 +04:00
github-actions[bot]
f7365de162 deploy: update sme service images to 2a034a0 2026-05-04 07:38:18 +00:00
github-actions[bot]
84d40a58c7 deploy: update Catalyst marketplace image to 2a034a0 2026-05-04 07:37:45 +00:00
e3mrah
2a034a0959
feat(catalog): unified catalog with Published flag — operator curates marketplace (#710 wave 2) (#724)
Single source of truth for apps; Sovereign-console operator decides which
apps marketplace customers see; marketplace storefront filters by
Published. Per founder rule 2026-05-04: unpublish is a marketplace-
visibility toggle, not a deployment-lifecycle action — existing tenant
deployments of an unpublished app keep running unaffected.

core/services/catalog/store/store.go
====================================
- App.Published bool — operator-controlled visibility
- ListPublishedApps: marketplace-storefront subset
  (Published=true AND System=false AND Deployable=true).
  System and Deployable are catalog-team-controlled; Published is the
  operator's curation knob.
- SetAppPublished(slug, bool) — hot-path one-bit write the Sovereign
  console hits per row toggle. Cheaper than UpdateApp; slug-keyed so
  the UI doesn't need the internal Mongo _id.
- UpdateApp: thread published through full-update path too.
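
In miniature, the storefront subset plus the one-bit toggle look like
this (in-memory sketch; only the field names and the predicate come
from this commit — the Mongo-backed store is elided):

```go
package main

import "fmt"

// App carries only the catalog fields named in this commit.
type App struct {
	Slug       string
	Published  bool // operator's curation knob
	System     bool // catalog-team-controlled
	Deployable bool // catalog-team-controlled
}

// listPublishedApps is the marketplace-storefront subset:
// Published=true AND System=false AND Deployable=true.
func listPublishedApps(apps []App) []App {
	var out []App
	for _, a := range apps {
		if a.Published && !a.System && a.Deployable {
			out = append(out, a)
		}
	}
	return out
}

// setAppPublished is the slug-keyed one-bit write; no Mongo _id needed.
func setAppPublished(apps []App, slug string, published bool) bool {
	for i := range apps {
		if apps[i].Slug == slug {
			apps[i].Published = published
			return true
		}
	}
	return false
}

func main() {
	apps := []App{
		{Slug: "crm", Published: true, Deployable: true},
		{Slug: "ops-agent", Published: true, System: true, Deployable: true},
	}
	setAppPublished(apps, "crm", false)
	// crm is now unpublished and ops-agent is a system app, so both drop out
	fmt.Println(len(listPublishedApps(apps)))
}
```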

core/services/catalog/handlers/handlers.go + routes.go
======================================================
- ListApps now honours ?published=true query param:
    GET /catalog/apps                  → operator view: every app
    GET /catalog/apps?published=true   → marketplace view: filtered
- New PATCH /catalog/admin/apps/{slug}/publish?value={true|false}
  for the Sovereign-console operator's row toggle.
- requireAdmin gating preserved on the admin endpoint.

core/services/catalog/handlers/seed.go
======================================
- migrateAppPublished: defaults Published=true on every existing app
  on the day Catalyst 1.3.x ships. Operators opt OUT of marketplace
  visibility per app, not IN — matches how a real SaaS storefront is
  curated and prevents an empty marketplace on flag-introduction day.
  Idempotent on re-run.
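
The migration's shape, sketched against plain maps instead of Mongo
documents; the only-if-missing guard is what makes re-runs no-ops:

```go
package main

import "fmt"

// migrateAppPublished defaults Published=true on docs that predate the
// flag. Touching only missing fields leaves operator opt-outs alone and
// makes re-runs idempotent.
func migrateAppPublished(docs []map[string]any) int {
	n := 0
	for _, d := range docs {
		if _, ok := d["published"]; !ok {
			d["published"] = true
			n++
		}
	}
	return n
}

func main() {
	docs := []map[string]any{
		{"slug": "crm"},                      // pre-1.3.x doc: no flag yet
		{"slug": "beta", "published": false}, // operator already opted out
	}
	fmt.Println(migrateAppPublished(docs)) // first run touches the flagless doc
	fmt.Println(migrateAppPublished(docs)) // re-run touches nothing
}
```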

core/marketplace/src/lib/api.ts
================================
- getApps() now hits /catalog/apps?published=true so the marketplace
  storefront only renders the operator-curated subset.

DoD pending wave 2.5
====================
The Sovereign-console "Catalog & publishing" admin page (per-row
toggle UI) is the next chunk and ships in a follow-up — backend +
storefront filter are the load-bearing change here. Catalog admins
can flip the flag today via the PATCH endpoint; the per-row UI is
quality-of-life on top.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 11:37:03 +04:00
github-actions[bot]
52f68420ac deploy: update Catalyst marketplace image to 73d68d9 2026-05-04 07:31:20 +00:00
e3mrah
73d68d99c1
fix(auth-ux): HTML PIN email + copyable email pill + 6-box marketplace PIN + drop UI debris (#721) (#723)
Wave 1 of #721 — what the founder actually saw on console.openova.io
and marketplace.openova.io / marketplace.<sov>.

PIN email rewrite (catalyst-api auth.go)
========================================
Was: plaintext "Your OpenOva sign-in code:\n\n    9 6 5 1 2 8\n…"
Now: multipart/alternative MIME with a polished HTML alternative —
white card on neutral background, OpenOva mark + wordmark,
"Your sign-in code" heading, big tinted code block (34px monospaced,
10px letter-spacing, one-tap copy on iOS Mail), expiration + ignore
notice, footer credit. Inline styles only — Gmail/Outlook web strip
<style>. Card pinned at 480px so narrow webmail panes render correctly.
text/plain fallback kept for clients without HTML.

Catalyst-Zero verify page (VerifyPinPage.tsx)
=============================================
- Email shown as a copyable PILL with copy icon — click copies to
  clipboard, icon flips to a check for 1.5s. Selection-fallback for
  browsers without clipboard API.
- Centered title + subtitle (was left-aligned in 1.2.x).
- Microcopy: "Codes expire after 10 minutes — check your spam folder."

Marketplace checkout sign-in (CheckoutStep.svelte)
==================================================
- One <input maxlength=6> → 6 separate <input maxlength=1> boxes with
  auto-advance, paste fan-out (paste a 6-digit code anywhere on the
  row and all 6 boxes fill and autosubmit), backspace-back,
  ArrowLeft/Right navigation, autocomplete=one-time-code on the first
  box for iOS SMS autofill, caret-transparent so the digit IS the caret.
- Email shown as the same copyable pill pattern (svg copy/check icons,
  hover-to-brand affordance).
- Dropped "Use a different email" link (browser back works).
- Added expire/spam microcopy below button.

Header + wayfinding cleanup
===========================
- Header.svelte: top-right "Sign in" button hidden when pathname is
  /checkout or /login. Two sign-in CTAs on the same screen were the UI
  debris caught live 2026-05-04.
- CheckoutStep.svelte: "← Back to Review" moved from bottom-left
  (where users don't look) to top-left above the Checkout heading,
  rendered with a chevron icon.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 11:30:24 +04:00
github-actions[bot]
f375533ffa deploy: update catalyst images to 88bfa34 2026-05-04 05:44:50 +00:00
e3mrah
88bfa347d4
fix(auth): sign-out actually signs out + iCloud-style PIN UX (closes #721) (#722)
* feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710)

Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

* fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712)

The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub
Actions design, commits authored by GITHUB_TOKEN don't re-trigger
workflows. blueprint-release.yaml's `on.push.paths: products/*/chart/**`
filter matches the deploy commit's diff (chart/values.yaml +
chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire,
but doesn't — leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck
on whatever catalyst-api SHA was current at the last manual chart-
touching PR.

Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb
six PRs after the SHA was superseded — every fresh Sovereign installed
the buggy pre-#701 image and rejected handover with 401 unauthenticated.

Fix: after `git push` succeeds in the deploy job, dispatch
blueprint-release explicitly via `gh workflow run`. The dispatched run
re-renders + re-publishes the chart with the just-pushed values.yaml.

Closes #712.

* fix(auth): sign-out actually signs out + iCloud-style PIN UX (closes #721)

Sign-out
========
1. Cookie-clear Domain mismatch
   PIN-verify SETS catalyst_session with Domain:$CATALYST_SESSION_COOKIE_DOMAIN
   so the cookie carries across console.<sov> and marketplace.<sov>.
   HandleAuthLogout was clearing WITHOUT the Domain attribute. Browsers
   require an exact-match Set-Cookie (Path + Domain + SameSite) to
   actually drop a cookie — a mismatched Domain creates a new empty
   cookie scoped to the current host while the original parent-domain
   cookie stays alive. Next /whoami picks it up and the operator looks
   "still signed in".

   Fix: mirror the EXACT Domain/Path/Secure/SameSite the cookie was
   set with. Same fix on catalyst_refresh.

2. Keycloak SSO session survives local cookie drop
   Even if the local cookie clear worked, the upstream KC SSO session
   stayed alive. The next OIDC PKCE auth-guard fetch silently re-
   authenticated against KC and the operator landed back as the same
   identity.

   Fix: HandleAuthLogout returns 200 with
   { ok: true, keycloakLogoutURL: "<kc>/realms/<realm>/protocol/
     openid-connect/logout?client_id=...&post_logout_redirect_uri=
     <origin>/login" }.
   UI's signOut() hard-navigates to keycloakLogoutURL so KC drops the
   SSO session and 302s back to /login. qc.clear() flushes all
   TanStack Query caches before the navigation.
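
The logout-URL assembly, sketched (host, realm, and client_id values
are placeholders):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildKeycloakLogoutURL sketches the RP-initiated-logout URL the
// handler returns in its JSON body.
func buildKeycloakLogoutURL(kcBase, realm, clientID, origin string) string {
	q := url.Values{}
	q.Set("client_id", clientID)
	q.Set("post_logout_redirect_uri", origin+"/login")
	return fmt.Sprintf("%s/realms/%s/protocol/openid-connect/logout?%s",
		kcBase, realm, q.Encode())
}

func main() {
	fmt.Println(buildKeycloakLogoutURL(
		"https://auth.otech62.example", "sovereign",
		"catalyst-ui", "https://console.otech62.example"))
}
```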

PIN UX (iCloud reference)
=========================
PinInput6.tsx
  - Box size 48×56 → 56×64 (sm: 64×72)
  - Border 1px → 1.5px, rounded-lg → rounded-xl
  - Soft inner-shadow on top + bottom
  - Filled box gets a brand-tinted border (operator sees progress)
  - Focus: scale 1.04 + 3px ring at 30% brand alpha
  - text-xl → text-2xl (sm: text-3xl), tracking-tight, tabular-nums
  - caret-transparent — the digit IS the caret (matches iOS native)
  - Webkit autofill background normalised

VerifyPinPage.tsx
  - Title + subtitle centered (was left-aligned)
  - Title 20px → 24px, semibold, tracking-tight
  - Subtitle in two lines: "A 6-digit code was sent to" / email
  - "Didn't get a code? Send a new one" + spam-folder microcopy below
  - Error message centered

LoginPage.tsx
  - Centered title + subtitle to match
  - Copy: "We'll email you a 6-digit code to verify it's you."

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 09:41:49 +04:00
github-actions[bot]
4c7e1e6d4c deploy: update catalyst images to 35183af 2026-05-04 03:51:04 +00:00
e3mrah
35183af5be
fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) (#720)
* feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710)

Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

* fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712)

The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub
Actions design, commits authored by GITHUB_TOKEN don't re-trigger
workflows. blueprint-release.yaml's `on.push.paths: products/*/chart/**`
filter matches the deploy commit's diff (chart/values.yaml +
chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire,
but doesn't — leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck
on whatever catalyst-api SHA was current at the last manual chart-
touching PR.

Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb
six PRs after the SHA was superseded — every fresh Sovereign installed
the buggy pre-#701 image and rejected handover with 401 unauthenticated.

Fix: after `git push` succeeds in the deploy job, dispatch
blueprint-release explicitly via `gh workflow run`. The dispatched run
re-renders + re-publishes the chart with the just-pushed values.yaml.

Closes #712.

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:49:03 +04:00
e3mrah
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:47:37 +04:00
github-actions[bot]
3a7fdad13f deploy: update catalyst images to 1b1ea52 2026-05-03 22:47:22 +00:00
e3mrah
1b1ea52c39
fix(bp-catalyst-platform): emit sovereign-fqdn ConfigMap atomically in chart (closes #717) (#718)
* fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713)

Closes #713

Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing
401 on /auth/handover:

1. SOVEREIGN_FQDN race
   api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn"
   with optional:true. On Sovereigns, that ConfigMap is rendered by the
   sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform
   HelmRelease. When the Pod starts first, valueFrom collapses to "" and
   stays empty — audience check rejects every valid token as "invalid
   audience". Fix: add Reloader annotations so the Pod rolls when the
   ConfigMap (and the handover-jwt-public Secret) appears.

2. catalyst-api-server SA missing user-level realm-management role mappings
   bp-keycloak realm import granted roles via clientScopeMappings — wrong
   level. The actual service-account user had no clientRoles entry, so KC
   rejected GET /users with 403 when catalyst-api tried to ensure the
   operator user during handover. Fix: add explicit "users" array binding
   service-account-catalyst-api-server to realm-management.{impersonation,
   manage-users, view-users, query-users}.

* fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (#715)

Closes #715

Two architectural bugs surfaced live on otech64 (2026-05-03), both leading
to a healthy-looking Sovereign that the operator could not reach.

1. catalyst-api tofu workdir on emptyDir
   CATALYST_TOFU_WORKDIR=/tmp/catalyst/tofu (emptyDir). When contabo's
   catalyst-api Pod rolled mid-apply (the PR #714 deploy commit triggered
   a rolling restart 3 minutes into otech64's tofu run), in-progress state
   was lost. Tofu had created LB/network/server/services but not the
   hcloud_load_balancer_target.control_plane resource yet — the cluster
   came up at the k3s level but the public LB had no targets, returning
   TLS handshake failure for every console.<sov> request.

   Move CATALYST_TOFU_WORKDIR to /var/lib/catalyst/tofu (PVC-backed,
   fsGroup=65534 already wires write access). tofu apply resumes from
   where it left off after any Pod restart.

2. bp-reloader env-vars strategy
   reloadStrategy=env-vars only injects checksum env vars for ConfigMaps
   referenced via envFrom. Workloads using valueFrom: configMapKeyRef
   (catalyst-api's SOVEREIGN_FQDN) are silently not reloaded — the
   configmap.reloader.stakater.com/reload annotation added in PR #714
   was a no-op under env-vars.

   Switch to reloadStrategy=annotations. Reloader bumps a pod-template
   annotation, triggering rollout regardless of how the CM/Secret is
   referenced.

* fix(bp-catalyst-platform): emit sovereign-fqdn ConfigMap inside chart, drop sovereign-tls duplicate (#717)

Closes #717

Reloader v1.4.16 is silent on the SOVEREIGN_FQDN race (#713). Tried all
annotation forms (configmap.reloader.stakater.com/reload, reloader/auto)
and both reload strategies (env-vars, annotations). RBAC is correct, watch
coverage is global, but manual CM patches produce zero Reloader log output
and zero Pod rollouts. Abandoning Reloader as the race fix.

Move the sovereign-fqdn ConfigMap into bp-catalyst-platform chart
templates, guarded by {{ if .Values.global.sovereignFQDN }}. Helm install
applies chart manifests in kind-sorted order (ConfigMaps before
Deployments), so the ConfigMap commits before the Pod schedules.
valueFrom resolves correctly the first time. No race possible.

Drop the duplicate from clusters/_template/sovereign-tls/ to avoid
Helm-vs-Flux ownership flapping. The Kustomize path on contabo enumerates
files in templates/kustomization.yaml so this Helm-templated file is never
parsed by Kustomize.

Verified live: deleting the existing CM and re-running Helm install
produced an immediately-correct catalyst-api Pod with SOVEREIGN_FQDN
populated, where the same install with the previous out-of-chart CM had
left the env empty for the Pod's lifetime.

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 02:45:24 +04:00
github-actions[bot]
b2f78a81e1 deploy: update catalyst images to 9a58289 2026-05-03 22:06:35 +00:00
e3mrah
9a58289786
fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (closes #715) (#716)
* fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713)

Closes #713

Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing
401 on /auth/handover:

1. SOVEREIGN_FQDN race
   api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn"
   with optional:true. On Sovereigns, that ConfigMap is rendered by the
   sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform
   HelmRelease. When the Pod starts first, valueFrom collapses to "" and
   stays empty — audience check rejects every valid token as "invalid
   audience". Fix: add Reloader annotations so the Pod rolls when the
   ConfigMap (and the handover-jwt-public Secret) appears.

2. catalyst-api-server SA missing user-level realm-management role mappings
   bp-keycloak realm import granted roles via clientScopeMappings — wrong
   level. The actual service-account user had no clientRoles entry, so KC
   rejected GET /users with 403 when catalyst-api tried to ensure the
   operator user during handover. Fix: add explicit "users" array binding
   service-account-catalyst-api-server to realm-management.{impersonation,
   manage-users, view-users, query-users}.

* fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (#715)

Closes #715

Two architectural bugs surfaced live on otech64 (2026-05-03), both leading
to a healthy-looking Sovereign that the operator could not reach.

1. catalyst-api tofu workdir on emptyDir
   CATALYST_TOFU_WORKDIR=/tmp/catalyst/tofu (emptyDir). When contabo's
   catalyst-api Pod rolled mid-apply (the PR #714 deploy commit triggered
   a rolling restart 3 minutes into otech64's tofu run), in-progress state
   was lost. Tofu had created LB/network/server/services but not the
   hcloud_load_balancer_target.control_plane resource yet — the cluster
   came up at the k3s level but the public LB had no targets, returning
   TLS handshake failure for every console.<sov> request.

   Move CATALYST_TOFU_WORKDIR to /var/lib/catalyst/tofu (PVC-backed,
   fsGroup=65534 already wires write access). tofu apply resumes from
   where it left off after any Pod restart.

2. bp-reloader env-vars strategy
   reloadStrategy=env-vars only injects checksum env vars for ConfigMaps
   referenced via envFrom. Workloads using valueFrom: configMapKeyRef
   (catalyst-api's SOVEREIGN_FQDN) are silently not reloaded — the
   configmap.reloader.stakater.com/reload annotation added in PR #714
   was a no-op under env-vars.

   Switch to reloadStrategy=annotations. Reloader bumps a pod-template
   annotation, triggering rollout regardless of how the CM/Secret is
   referenced.

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 02:04:26 +04:00
github-actions[bot]
c179cba12a deploy: update catalyst images to e96e31a 2026-05-03 21:39:29 +00:00
e3mrah
e96e31a781
fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713) (#714)
Closes #713

Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing
401 on /auth/handover:

1. SOVEREIGN_FQDN race
   api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn"
   with optional:true. On Sovereigns, that ConfigMap is rendered by the
   sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform
   HelmRelease. When the Pod starts first, valueFrom collapses to "" and
   stays empty — audience check rejects every valid token as "invalid
   audience". Fix: add Reloader annotations so the Pod rolls when the
   ConfigMap (and the handover-jwt-public Secret) appears.

2. catalyst-api-server SA missing user-level realm-management role mappings
   bp-keycloak realm import granted roles via clientScopeMappings — wrong
   level. The actual service-account user had no clientRoles entry, so KC
   rejected GET /users with 403 when catalyst-api tried to ensure the
   operator user during handover. Fix: add explicit "users" array binding
   service-account-catalyst-api-server to realm-management.{impersonation,
   manage-users, view-users, query-users}.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 01:37:36 +04:00
github-actions[bot]
2eb499e9d7 deploy: update catalyst images to f254ff1 2026-05-03 20:27:20 +00:00
e3mrah
f254ff1f8d
fix(catalyst-ui): auth-guard honors catalyst_session cookie before OIDC PKCE fallback (Phase-8b followup) (#711)
The wizard handover lands the operator at
  GET https://console.<sov>.omani.works/auth/handover?token=<jwt>
which the Sovereign-side catalyst-api validates and 302-redirects to
/console/dashboard with a fresh `catalyst_session` HttpOnly Secure
SameSite=Lax cookie. Verified live with curl on otech49:

  HTTP/1.1 302 Found
  location: /console/dashboard
  set-cookie: catalyst_session=eyJhbGciOiJSUzI1NiI...; HttpOnly; Secure; SameSite=Lax

The browser arrived at /console/dashboard with the cookie attached but
SovereignConsoleLayout went straight from "no sessionStorage tokens"
to initiateLogin() (PKCE redirect to Keycloak). Operators landed on
auth.<sov>.../auth?response_type=code&client_id=catalyst-ui&... — a
username/password screen. User from the field on otech49 + otech52
today: "fuck, this is asking username password!!!"

Fix: probe GET /api/v1/whoami (with credentials:'include') BEFORE
considering Keycloak. The whoami handler is gated by the catalyst-api
session middleware, which validates the cookie's RS256 signature
against the local handover signer's public key. On 200, the layout
enters a new `cookie-authenticated` AuthState and renders the console
shell directly. On 401, the existing OIDC flow runs unchanged so
returning users with an expired cookie still get the silent refresh
plus PKCE fallback. 5xx is treated like 401 (fall through to OIDC) so
a flaky API never traps an authenticated user behind a Keycloak
login they don't need.

Sign-out is also branch-aware: the cookie path DELETEs
/api/v1/auth/session and reloads to '/'; the OIDC path keeps calling
initiateLogout() so the Keycloak end-session URL is still reached.

File changed: products/catalyst/bootstrap/ui/src/app/layouts/SovereignConsoleLayout.tsx
Tests added:  products/catalyst/bootstrap/ui/src/app/layouts/SovereignConsoleLayout.test.tsx

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 00:25:19 +04:00
github-actions[bot]
4984488b41 deploy: update catalyst images to 4a9b2b2 2026-05-03 20:01:47 +00:00
e3mrah
4a9b2b2bff
fix(catalyst-api/wipe): retry firewall delete + purge Hetzner S3 buckets (closes #706) (#709)
* fix(catalyst-api/wipe): retry firewall delete on 422 resource_in_use

Hetzner server delete is asynchronous — returns 200 'action started'
while the firewall stays attached for 5-30s. Single-shot delete saw
422, swallowed it, reported '0 firewalls deleted' while leaving the
firewall live (verified on otech50 2026-05-03).

Adds deleteFirewallWithRetry with exponential backoff (6s/12s/24s/48s,
5 attempts). PurgeReport gains FirewallsRetried + S3Buckets fields.

Issue #706.

* feat(catalyst-api/wipe): add Hetzner Object Storage bucket purge

Adds PurgeBuckets() that empties + deletes the per-Sovereign Hetzner
Object Storage bucket via the S3 API. tofu destroy can't remove
`minio_s3_bucket` while objects are present, so 28 orphan buckets
accumulated from otech23..otech50 (audit 2026-05-03).

Sequence: BucketExists → ListObjectVersions → RemoveObjects (batch
1000) → ListIncompleteUploads → RemoveIncompleteUpload → RemoveBucket.
404 anywhere is idempotent success.

Issue #706.
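
The purge sequence in sketch form, behind a tiny interface rather than
a real S3 client (method names mirror the sequence above; signatures
and the fake store are assumptions):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("404 not found") // idempotent-success signal

// objectStore abstracts the handful of S3 calls the purge needs.
type objectStore interface {
	BucketExists(bucket string) (bool, error)
	ListObjectVersions(bucket string) ([]string, error)
	RemoveObjects(bucket string, keys []string) error // batched 1000/call for real
	ListIncompleteUploads(bucket string) ([]string, error)
	RemoveIncompleteUpload(bucket, key string) error
	RemoveBucket(bucket string) error
}

// purgeBucket runs the sequence; any 404 counts as idempotent success
// so re-running a wipe never fails.
func purgeBucket(s objectStore, bucket string) error {
	ok, err := s.BucketExists(bucket)
	if err != nil || !ok {
		return err // already gone: nothing to purge
	}
	keys, err := s.ListObjectVersions(bucket)
	if err != nil {
		return ignore404(err)
	}
	if len(keys) > 0 {
		if err := s.RemoveObjects(bucket, keys); err != nil {
			return ignore404(err)
		}
	}
	uploads, err := s.ListIncompleteUploads(bucket)
	if err != nil {
		return ignore404(err)
	}
	for _, u := range uploads {
		if err := s.RemoveIncompleteUpload(bucket, u); err != nil {
			return ignore404(err)
		}
	}
	return ignore404(s.RemoveBucket(bucket))
}

func ignore404(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// fakeStore is an in-memory stand-in used only for the demo.
type fakeStore struct {
	objects []string
	gone    bool
}

func (f *fakeStore) BucketExists(string) (bool, error)           { return !f.gone, nil }
func (f *fakeStore) ListObjectVersions(string) ([]string, error) { return f.objects, nil }
func (f *fakeStore) RemoveObjects(string, []string) error        { f.objects = nil; return nil }
func (f *fakeStore) ListIncompleteUploads(string) ([]string, error) { return nil, nil }
func (f *fakeStore) RemoveIncompleteUpload(string, string) error { return nil }
func (f *fakeStore) RemoveBucket(string) error                   { f.gone = true; return nil }

func main() {
	s := &fakeStore{objects: []string{"backup/a", "backup/b"}}
	fmt.Println(purgeBucket(s, "sov-otech50"), s.gone)
}
```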

* test(catalyst-api/wipe): firewall retry + bucket purge regression coverage

Adds purge_firewall_retry_test.go with three cases:
- TestFirewallRetry_Server_Detach_Async: 422 twice then 204 → 1 fw deleted
- TestFirewallRetry_Exhausted: always 422 → no fw deleted, error reported
- TestFirewallRetry_AlreadyGone_404: idempotent success path

Adds buckets_test.go with stubbed S3 endpoints exercising:
- BucketNameForSovereign/HetznerObjectStorageEndpoint contract
- empty bucket, 1500-version bucket (3 keys, multi-delete batches),
  in-progress multipart upload abort, 404 idempotent, progress callback

Issue #706.

* fix(catalyst-api/wipe): wire bucket purge into WipeDeployment handler

After hetzner.Purge() returns (which now retries firewall delete on
422), call hetzner.PurgeBuckets() with the per-Sovereign Object Storage
credentials from dep.Request. Runs AFTER tofu destroy so it doesn't
fight tofu state, and BEFORE local-record cleanup so the wizard banner
shows the count.

Skips with a logged warning when in-memory credentials are unavailable
(Pod restart between provision and wipe). The SSE log + UI banner now
report the s3-buckets count alongside the existing resource tallies.

Issue #706.

* feat(catalyst-ui): wipe banner now reports S3 buckets + firewall retries

Adds s3_buckets and firewalls_retried fields to the WipeReport
TypeScript shape and renders the new bucket count alongside the
existing servers/lbs/networks/firewalls/ssh-keys tally. When the
firewall retry counter is non-zero, surfaces it in a parenthetical so
operators see why the wipe took an extra few seconds.

Both the AppsPage Cancel & Wipe modal and the DecommissionPage success
view consume the same WipeReport interface so this single update
covers both surfaces.

Issue #706.

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-03 23:59:48 +04:00
github-actions[bot]
cdbb617231 deploy: update catalyst images to e4ef4c0 2026-05-03 19:56:21 +00:00
e3mrah
e4ef4c0671
fix(catalyst-api/jobs): bridge subscribes to helmwatch transition events (closes #695) (#708)
* fix(bp-external-dns): livenessProbe.initialDelaySeconds=180 for cold-cluster cache-sync (closes #700)

PR #679 added --request-timeout=120s but external-dns has TWO timeouts:
RequestTimeout (per-API-call, controlled by --request-timeout) and
WaitForCacheSync (initial informer sync, hardcoded 60s in upstream binary,
NOT exposed as a flag). On a fresh Sovereign with k3s apiserver
CPU-saturated, the cache sync misses the 60s deadline -> fatal: failed
to sync *v1.Node: context deadline exceeded -> CrashLoopBackOff 5-10
times. Caught live on otech49+ (2026-05-03), 5 restarts before
stabilizing.

Bump livenessProbe.initialDelaySeconds from upstream 10s default to 180s
so kubelet does NOT restart the Pod while the initial cache sync runs
against a CPU-saturated freshly-provisioned k3s apiserver. The Sovereign
apiserver reaches steady-state within ~2 min so 3 min comfortably covers
cold starts. Also bumps periodSeconds=30 + failureThreshold=3 so a
genuinely-hung pod is still killed within ~90s once steady-state is
reached.
readinessProbe gets a corresponding initialDelaySeconds=30 so endpoint
flapping during sync doesn't churn services.

Helm overrides REPLACE whole maps (not merge), so the override preserves
the upstream httpGet.path: /healthz + port: http shape verbatim.

Bumps:
- platform/external-dns/chart/Chart.yaml: 1.1.5 -> 1.1.6
- clusters/_template/bootstrap-kit/12-external-dns.yaml: HelmRelease pin 1.1.5 -> 1.1.6

* fix(catalyst-api/jobs): bridge subscribes to helmwatch transition events (closes #695)

Wires the per-deployment jobs.Bridge directly to the helmwatch
Watcher's runtime event stream so every per-component HelmRelease
transition observed AFTER the initial-list seed advances the per-Job
state map. The wizard's /jobs page now reflects the live cluster state
instead of pinning Install rows to whatever the initial-list snapshot
saw at attach time.

Symptom (verified on otech48/49/50/52, 2026-05-03 14:40-19:20):
the wizard rendered Install rows as "running"/"pending" even after
`kubectl --context=otech<N> -n flux-system get hr` showed every
bp-* HelmRelease at Ready=True.

Wiring change:

  helmwatch.Watcher.Subscribe(fn func(provisioner.Event)) — fan-out
  callback registered alongside the primary `emit` Emit. Every event
  the Watcher dispatches reaches both sinks. Used by the handler at
  attachBridgeSeederHook + RefreshWatch construction sites:

    watcher.Subscribe(func(ev provisioner.Event) {
        if err := bridge.OnProvisionerEvent(ev); err != nil {
            h.log.Warn("jobs bridge: runtime event forward failed",
                "id", depID, "phase", ev.Phase,
                "component", ev.Component, "err", err)
        }
    })

Tests:

  - internal/jobs/helmwatch_bridge_test.go::TestBridge_SeedThenRuntimeTransitions
    seeds 3 pending HRs, asserts 3 pending jobs; emits Ready=True for
    HR-1 → asserts 1 succeeded + 2 pending; emits Ready=Unknown for
    HR-2 → asserts 1 succeeded + 1 running + 1 pending. Verifies
    StartedAt / FinishedAt / DurationMs / LatestExecutionID stamps
    too.

  - internal/helmwatch/helmwatch_test.go::TestWatch_SubscribeFanOut
    proves a Subscribe callback receives the same set of per-component
    events as the primary emit, including the "ready for handover"
    terminal event.

  - internal/helmwatch/helmwatch_test.go::TestWatch_SubscribeNilIsNoop
    guards against panic on nil callback.
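
The fan-out seam can be sketched as follows — a subscriber slice
dispatched alongside the primary emit, with the nil-callback guard the
test asserts. Type and field names are illustrative stand-ins for the
real helmwatch/provisioner types:

```go
package main

import "fmt"

// Event is a stand-in for provisioner.Event.
type Event struct{ Component, Phase string }

// Watcher fans every dispatched event out to the primary emit sink plus
// any Subscribe callbacks.
type Watcher struct {
	emit func(Event)
	subs []func(Event)
}

// Subscribe registers a fan-out callback; nil is a no-op (no panic).
func (w *Watcher) Subscribe(fn func(Event)) {
	if fn == nil {
		return
	}
	w.subs = append(w.subs, fn)
}

// dispatch delivers one event to every sink.
func (w *Watcher) dispatch(ev Event) {
	if w.emit != nil {
		w.emit(ev)
	}
	for _, fn := range w.subs {
		fn(ev)
	}
}

func main() {
	var primary, bridged []string
	w := &Watcher{emit: func(ev Event) { primary = append(primary, ev.Component) }}
	w.Subscribe(func(ev Event) { bridged = append(bridged, ev.Component) })
	w.Subscribe(nil) // must not panic
	w.dispatch(Event{Component: "bp-keycloak", Phase: "Ready"})
	fmt.Println(primary, bridged) // both sinks saw the same event
}
```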

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 23:54:20 +04:00
e3mrah
c5ffaa2fd7
fix(bp-external-dns): livenessProbe.initialDelaySeconds=180 for cold-cluster cache-sync (closes #700) (#707)
PR #679 added --request-timeout=120s but external-dns has TWO timeouts:
RequestTimeout (per-API-call, controlled by --request-timeout) and
WaitForCacheSync (initial informer sync, hardcoded 60s in upstream binary,
NOT exposed as a flag). On a fresh Sovereign with k3s apiserver
CPU-saturated, the cache sync misses the 60s deadline -> fatal: failed
to sync *v1.Node: context deadline exceeded -> CrashLoopBackOff 5-10
times. Caught live on otech49+ (2026-05-03), 5 restarts before
stabilizing.

Bump livenessProbe.initialDelaySeconds from upstream 10s default to 180s
so kubelet does NOT restart the Pod while the initial cache sync runs
against a CPU-saturated freshly-provisioned k3s apiserver. The Sovereign
apiserver reaches steady-state within ~2 min so 3 min comfortably covers
cold starts. Also bumps periodSeconds=30 + failureThreshold=3 so a
genuinely-hung pod is still killed within ~90s once steady-state is
reached.
readinessProbe gets a corresponding initialDelaySeconds=30 so endpoint
flapping during sync doesn't churn services.

Helm overrides REPLACE whole maps (not merge), so the override preserves
the upstream httpGet.path: /healthz + port: http shape verbatim.

Bumps:
- platform/external-dns/chart/Chart.yaml: 1.1.5 -> 1.1.6
- clusters/_template/bootstrap-kit/12-external-dns.yaml: HelmRelease pin 1.1.5 -> 1.1.6
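
A sketch of the override shape this implies. Because Helm value overrides
replace the whole probe map, the upstream httpGet block must be restated
verbatim; field names follow the standard Kubernetes probe schema, but the
exact values-key path in the external-dns chart is assumed:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: http
  initialDelaySeconds: 180   # cover cold-cluster cache sync (~2 min to steady-state)
  periodSeconds: 30
  failureThreshold: 3        # genuinely-hung pod still killed within ~90s
readinessProbe:
  httpGet:
    path: /healthz
    port: http
  initialDelaySeconds: 30    # don't churn endpoints while the informer syncs
```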

Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-03 23:39:36 +04:00
github-actions[bot]
6df37b032c deploy: update catalyst images to 0238a2b 2026-05-03 18:53:12 +00:00
e3mrah
0238a2bde0
fix(flow-canvas): round-5 — variable slots + fit-to-host + zigzag + 60ms resize (#669) (#705)
Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-03 22:51:10 +04:00
github-actions[bot]
21122116dd deploy: update catalyst images to bceaa20 2026-05-03 18:03:55 +00:00
e3mrah
bceaa20c43
fix(catalyst-api): mint local session JWT in auth_handover (PR #694 pattern) (#703)
Keycloak v26 dropped legacy 'requested_subject' token-exchange. The
auth_handover.go path still called kc.ImpersonateToken() which uses
that parameter, returning 400 'invalid_request'. PR #694 already
moved PIN-verify to local JWT minting via handoverSigner.SignCustomClaims;
apply the same pattern to /auth/handover.

Caught live on otech49 (2026-05-03):
  ERROR auth_handover: ImpersonateToken failed
  err=token endpoint 400: Parameter 'requested_subject' is not
  supported for standard token exchange

Sovereign Keycloak still owns the canonical user record (created via
EnsureUser before token mint) — only the session-cookie minting
moves local. IdP brokering and federation paths are unaffected.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:01:06 +04:00
github-actions[bot]
4ba39c2d60 deploy: update catalyst images to 3144eed 2026-05-03 17:42:30 +00:00
e3mrah
3144eedd5e
fix(catalyst-api): read CATALYST_HANDOVER_JWT_PUBLIC_KEY_PATH env (PR #692 followup) (#702)
PR #692 moved the Sovereign-side JWK volume mount from
/var/lib/catalyst/handover-jwt-public.jwk (subPath, conflicted with
the catalyst-api PVC) to /etc/catalyst/handover-jwt-public/public.jwk
(directory mount). The chart sets CATALYST_HANDOVER_JWT_PUBLIC_KEY_PATH
to the new path, but the AuthHandover handler never read that env.
Result: auth_handover.go used the hardcoded default
/var/lib/catalyst/handover-jwt-public.jwk which no longer exists,
returning 401 'public key unavailable' on every handover.

Caught live on otech49 (2026-05-03):
  ERROR auth_handover: load public key failed
  err=read /var/lib/catalyst/handover-jwt-public.jwk: no such file
  path=/var/lib/catalyst/handover-jwt-public.jwk

Fix:
- Resolution order: handler field -> env var -> default const
- Default const updated to the new path so cold-starts work without
  the env var (defence in depth)

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 21:40:39 +04:00
github-actions[bot]
0e6ac5cd29 deploy: update catalyst images to ed2b374 2026-05-03 17:36:22 +00:00
e3mrah
ed2b374b5e
fix(catalyst-api): move /auth/handover OUTSIDE the session-gate (Phase-8b followup) (#701)
The Sovereign-side /auth/handover handler is the ENTRY POINT that
establishes the session. The operator's browser arrives with the
handover JWT in the URL query and zero cookies. Putting the route
inside the RequireSession middleware group rejects every handover
with 401 {error:unauthenticated} before AuthHandover ever runs.

Caught live on otech49 (2026-05-03):
  GET /auth/handover?token=<valid-jwt> -> 401 in 43us (middleware
  rejection, no body log line emitted).

This was working on otech48 only because catalyst-api there had no
Keycloak credentials wired (kc-sa-credentials Secret was missing) so
GetAuthConfig() returned nil and RequireSession became a passthrough.
Once PR #691 wired the credentials cleanly on otech49, the gate
activated and broke the handover.

Fix: register the route at the top-level mux outside the auth group,
mirroring the same pattern as /api/v1/deployments/{id}/kubeconfig
(cloud-init postback that also has no cookies). The handler's own
JWT validation IS the authentication.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 21:33:14 +04:00
github-actions[bot]
cf9946f4f1 deploy: update catalyst images to 2146deb 2026-05-03 17:10:05 +00:00
e3mrah
2146deb427
fix(catalyst-platform): escape literal Helm-curly in api-deployment.yaml comment (#699)
Helm parses the entire file (including YAML comments) for template
directives BEFORE YAML parsing strips comments. Literal '{{ ... }}'
inside a # comment was treated as a template directive and failed
with 'unexpected <.> in operand' at line 419.

PR #698 introduced this in the explanatory comment for the
SOVEREIGN_FQDN ConfigMap workaround. Reword to avoid the literal
double-curlies — the comment still describes the constraint without
tripping the Helm parser.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 21:08:13 +04:00
github-actions[bot]
7edc4370a3 deploy: update catalyst images to 74d08eb 2026-05-03 16:51:31 +00:00
e3mrah
74d08eb5a6
fix(catalyst-api+sovereign-tls): SOVEREIGN_FQDN via ConfigMap, not Helm template (PR #692 followup) (#698)
PR #692 added an inline Helm-template `value:` for SOVEREIGN_FQDN in
api-deployment.yaml. That broke contabo-mkt's catalyst-platform Flux
Kustomization (path: ./products/catalyst/chart/templates) because Kustomize
parses raw YAML and Helm `{{ ... }}` is not valid YAML syntax. Live error
on contabo at adf8dc7d:

  kustomize build failed: yaml: invalid map key:
  map[string]interface {}{".Values.global.sovereignFQDN | default \"\" | quote":""}

Replace the Helm-template form with `valueFrom.configMapKeyRef.optional:
true` so the same template renders cleanly under both consumers:

- contabo-mkt (Kustomize): ConfigMap `sovereign-fqdn` doesn't exist →
  optional ref → env stays empty → catalyst-api on contabo never validates
  handover JWTs anyway (it's the SIGNER, not the validator). Correct.

- Sovereigns (Helm via bp-catalyst-platform OCI chart): on apply, the
  sovereign-tls Kustomization renders `sovereign-fqdn-configmap.yaml` with
  envsubst on ${SOVEREIGN_FQDN}, creating the ConfigMap with the per-
  Sovereign FQDN. catalyst-api Pod resolves the ref → env populated →
  audience check works.

This restores the bridge between the two consumers without forking the
template. The bp-catalyst-platform 1.2.5 → 1.2.7 bump publishes the new
chart; bootstrap-kit overlay pin updated.

Will be verified on otech49 (next provision after this lands).

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:49:36 +04:00
github-actions[bot]
01a2e3bdb4 deploy: update catalyst images to 1946e0a 2026-05-03 16:40:41 +00:00
e3mrah
1946e0a46e
fix(flow-canvas): variable-width depth columns + ResizeObserver debounce (#669 round 3) (#693)
* fix(flow-canvas): variable-width depth columns + ResizeObserver debounce (#669 round 3)

Round-2 UAT showed:
1. Dense bucket of 30+ siblings piled at the right edge while 60% of
   canvas (left side) sat empty with one bubble per depth.
2. Sim "trying never stabilizing" during pane-transition animations.

Root cause #1: round-2 used a constant `perDepthX` for every depth.
With one-bubble depths next to a 30+ sibling depth, the dense bucket
got 80% × perDepthX (~128 px) of horizontal room and had to pile into
8+ sub-columns; sparse depths each got the same perDepthX (~160 px)
for a single bubble. Net: 60% canvas unused on the left, dense
cluster jammed at right.

Round-3 fix #1: variable-width depth columns. Each depth gets a slot
whose width tracks its bucket's natural extent at radius R:
sparse buckets need 2R + small gap; dense buckets need
(totalCols - 1) * (2R + COLLIDE_PADDING) to fit sub-columns
side-by-side. depthToX returns the centerline of slot[depth];
adjacent slots are separated by `gap = clamp(r*4, MIN, MAX)`. Total
layout width = sum(slots) + gaps.
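
The slot math above can be sketched as follows. Constants and the exact
width formula are illustrative (the real code clamps the gap and snaps
widths); the point is that dense buckets claim width proportional to
their sub-column count while sparse depths shrink to one diameter:

```go
package main

import "fmt"

// slotWidths gives each depth a width tracking its bucket's natural
// extent at radius r: a single bubble needs 2r; a dense bucket needs
// (cols-1)*(2r+pad) of sub-column pitch plus the end bubbles' diameter.
func slotWidths(bucketCols []int, r, pad float64) []float64 {
	ws := make([]float64, len(bucketCols))
	for i, cols := range bucketCols {
		if cols <= 1 {
			ws[i] = 2 * r
		} else {
			ws[i] = float64(cols-1)*(2*r+pad) + 2*r
		}
	}
	return ws
}

// depthToX returns the centerline of slot[depth]; adjacent slots are
// separated by a fixed gap (the real code uses clamp(r*4, MIN, MAX)).
func depthToX(widths []float64, gap float64) []float64 {
	xs := make([]float64, len(widths))
	x := 0.0
	for i, w := range widths {
		xs[i] = x + w/2
		x += w + gap
	}
	return xs
}

func main() {
	// UAT shape: one bubble, one bubble, then a dense bucket in 8 sub-columns.
	widths := slotWidths([]int{1, 1, 8}, 40, 12)
	fmt.Println(widths)            // dense slot dominates total layout width
	fmt.Println(depthToX(widths, 60))
}
```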

Root cause #2: ResizeObserver fired on every animation frame during
the 220ms padding-right transition (pane open/close). Every firing
called setHostSize, which retriggered layoutMetrics → R changed by
1-2 px → all node targets shifted → sim re-seeded → never settled.

Round-3 fix #2: 180ms debounce on the observer + 8 px epsilon gate
(sub-pixel changes ignored entirely). Combined with snap-to-4 on R
and snap-to-8 on slot widths in layoutMetrics, the metrics now hold
constant during pane-transition animations and the sim converges
once.

Tests: bounded layout (17) + JobDetail (5) all green; tsc -b clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-canvas): sqrt-aspect dense buckets + tight grid clamps (#669 round 4)

Round-3 still piled the dense bucket at the right edge. Distribution
test on the founder's exact screenshot shape (1+1+30) showed the dense
slot occupied only 28% of total X-extent — better than round-2 (~13%)
but not enough.

Round-4 fix:
1. layoutMetrics targets a sqrt-aspect-ratio for dense buckets:
   targetRows = round(sqrt(count / 1.6))
   30 leaves → 4 rows × 8 cols → ~700 px slot at R=40, occupying
   >50% of total X-extent. The densest bucket's targetRows now sets
   R via vertical-fit, so wide buckets actually claim X-room rather
   than collapsing into thin tall columns.
2. gridTargets reads cols/rows from layoutMetrics.slotInfo instead
   of recomputing — guarantees the per-tick clamp uses the same
   sub-grid dimensions as the slot-width math.
3. Per-cell clamp window narrowed to ±(pitch/2 - R) so the bubble
   edge can never reach a neighbour's centre. Old clamp used the
   full pitch which let forceCollide push bubbles into a neighbour's
   territory and then ratcheted them in — centres could collapse to
   <2R apart.
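
The sqrt-aspect formula reproduces the commit's 30-leaf example; a
minimal sketch (colsFor is an assumed helper, not necessarily how the
real gridTargets derives columns):

```go
package main

import (
	"fmt"
	"math"
)

// targetRows is the round-4 formula: round(sqrt(count / 1.6)).
func targetRows(count int) int {
	return int(math.Round(math.Sqrt(float64(count) / 1.6)))
}

// colsFor is ceiling division: enough columns to hold count at rows.
func colsFor(count, rows int) int {
	return (count + rows - 1) / rows
}

func main() {
	rows := targetRows(30)
	fmt.Println(rows, colsFor(30, rows)) // 30 leaves -> 4 rows x 8 cols
}
```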

Adds FlowCanvasOrganic.distribution.test.tsx replicating the founder's
UAT screenshot (depth 0: 1, depth 1: 1, depth 2: 30). Asserts:
- depth-0 X < depth-1 X < depth-2 X (left-to-right)
- dense leafSpan ≥ 30% of total layout extent
- no centre-to-centre distance < 2R

All tests green: distribution (2/2), bounded (17/17), JobDetail (5/5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:38:44 +04:00
github-actions[bot]
3da196ec42 deploy: update catalyst images to 46c956b 2026-05-03 16:36:40 +00:00
e3mrah
46c956b21e
feat(catalyst-ui+api): wizard guest mode + ownership check (#689) (#696)
The wizard surface is now anonymous-first. A visitor lands on
console.openova.io and runs the entire 7-step provisioning flow
without a session; auth fires only when they click Launch.

Frontend (catalyst-ui):
- Drop the wizardAuthGuard so the wizard route renders for anonymous
  visitors. The existing zustand+persist store already keeps every
  form field in localStorage with credential-hygiene partitioning
  (Hetzner token, SSH private key, registrar token NEVER persisted),
  so the guest-mode hydration on refresh works for free.
- New shared/lib/useSession hook polls /api/v1/whoami via React
  Query; exposes signedIn / email / refetch / signOut.
- New widgets/auth/ProfileMenu in the wizard header — Sign in button
  for anonymous, email-initial avatar with sign-out dropdown for
  signed-in.
- New widgets/auth/PinSignInModal — two-stage email → 6-digit PIN
  modal that POSTs /auth/pin/issue + /auth/pin/verify (issue #688).
  Falls back to /auth/magic-link when the PIN endpoint is not
  available, so this PR is shippable independent of #688's merge
  order.
- StepReview Launch handler routes anonymous through the PIN modal;
  on verify it stamps the verified email into orgEmail and POSTs
  the deployment immediately.
- New /provision/* beforeLoad guard: anonymous → redirect to wizard
  with a sessionStorage flash banner; signed-in cross-tenant gets
  the canonical 404 from the API (no UI-side branch).
- New shared/lib/flashBanner — sessionStorage seam for the guard →
  wizard banner hand-off.

Backend (catalyst-api):
- Add OwnerEmail to store.Record and handler.Deployment, stamped
  from X-User-Email at CreateDeployment.
- New checkOwnership helper enforces 404 (NEVER 403) on cross-tenant
  access — never leak existence of someone else's deployment via
  the response code. Legacy records (OwnerEmail == "") pass through
  with a warning so in-place upgrade does not lock operators out.
- Wired into GetDeployment, StreamLogs, GetDeploymentEvents,
  WipeDeployment, GetKubeconfig, MintHandoverToken, ListJobs, and
  GetJob. PutKubeconfig keeps its bearer-token auth (cloud-init
  postback path).

Tests:
- Backend: deployments_owner_test.go covers legacy passthrough,
  no-session passthrough, owner match (case-insensitive), the
  load-bearing 404-not-403 cross-tenant assertion, and end-to-end
  proof through GetDeployment + GetDeploymentEvents.
- Frontend: flashBanner round-trip + clear-on-read; useSession
  signed-in / 401 / signOut paths; WizardLayout guest-mode
  [Sign in] button + flash banner rendering.

Closes #689.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:34:38 +04:00
e3mrah
4764b69e4c
fix(catalyst-api): Phase-1 watcher transitions status to ready when all HRs Ready (#697)
otech48 incident (2026-05-03): all 37 bp-* HelmReleases on the Sovereign
cluster reached Ready=True, but the catalyst-api deployment record stayed
status=phase1-watching. Wizard's POST /mint-handover-token returned 409
not-handover-ready, blocking the auto-redirect to console.<sov>/auth/handover.

Root cause: helmwatch's terminate-on-all-done gate required len(observed) >=
MinBootstrapKitHRs. Chart shipped CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS=38,
but the actual bootstrap-kit cardinality had drifted to 37 — making the
gate permanently unsatisfiable. Watch ran until 60-minute WatchTimeout fired.

Fix: gate terminate-on-all-done on the informer's HasSynced signal instead
of the brittle count. After WaitForCacheSync returns the full bp-* set is
in the cache regardless of cardinality. MinBootstrapKitHRs stays as a
defence-in-depth floor (default lowered 11 → 1) for the empty-cache
footgun. Chart env CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS dropped to 1.

Implementation:
- helmwatch.Watcher: new informerSynced bool gate, set after
  WaitForCacheSync. processEvent refuses to consider terminate-on-all-done
  while informerSynced=false. After WaitForCacheSync, re-evaluate the
  all-terminal check once on the synced cache (handles the rehydrate-
  after-restart path where every HR is already Ready=True at attach).
- helmwatch.maybeEmitReadyTransition: emits the operator-visible
  "All N blueprints reconciled. Sovereign ready for handover." SSE event
  exactly once when the gate fires (idempotency guard against flicker
  re-triggering the gate).
- handler.markPhase1Done: persistDeployment after status flip so the
  on-disk JSON reflects status=ready before any wizard poll. Also
  refuses to downgrade an already-adopted deployment if a late watcher
  event tries to flap it.
- Tests: new transition_test.go with happy-path, idempotency, partial-
  ready, realistic 37-HR convergence, and empty-cache scenarios. New
  TestMarkPhase1Done_RefusesToDowngradeAdopted in phase1_watch_test.go.
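
The gate logic can be sketched as below — terminate-on-all-done refuses
to fire before the informer syncs, and the ready transition emits exactly
once. Types and method names are illustrative, not the real helmwatch
code:

```go
package main

import "fmt"

// watcher tracks per-HelmRelease Ready state plus the two gates from
// the fix: informerSynced and the one-shot emitted flag.
type watcher struct {
	informerSynced bool
	ready          map[string]bool // HR name -> Ready=True observed
	emitted        bool
	emit           func(msg string)
}

func (w *watcher) observe(hr string, readyTrue bool) {
	w.ready[hr] = readyTrue
	w.maybeEmitReadyTransition()
}

// maybeEmitReadyTransition fires the operator-visible event exactly once,
// and never while the cache might still be partial.
func (w *watcher) maybeEmitReadyTransition() {
	if !w.informerSynced || w.emitted || len(w.ready) == 0 {
		return // refuse to terminate on a possibly-partial cache
	}
	for _, ok := range w.ready {
		if !ok {
			return
		}
	}
	w.emitted = true
	w.emit(fmt.Sprintf("All %d blueprints reconciled. Sovereign ready for handover.", len(w.ready)))
}

func main() {
	var events []string
	w := &watcher{ready: map[string]bool{}, emit: func(m string) { events = append(events, m) }}
	w.observe("bp-keycloak", true) // pre-sync: the gate holds
	w.informerSynced = true
	w.maybeEmitReadyTransition() // re-evaluate once on the synced cache
	w.maybeEmitReadyTransition() // idempotent: no second emit
	fmt.Println(len(events))
}
```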

Will be verified live on otech49 (next provision after this lands):
- Wizard auto-shows "Open your Sovereign Console" button within 30s of
  all HRs reaching Ready
- No manual API calls or kubectl exec needed to flip status
- catalyst-api logs show "All 37 blueprints reconciled" event in SSE buffer

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:34:26 +04:00
github-actions[bot]
8afb667da9 deploy: update catalyst images to ba31f24 2026-05-03 16:28:50 +00:00
e3mrah
ba31f24922
feat(catalyst-ui+api): replace magic-link with 6-digit PIN auth (#688) (#694)
Replace the magic-link login flow on console.openova.io with a paste-friendly
6-digit numeric PIN, modelled on bank/Google verification screens. Founder
rejected magic links because they look like phishing (2026-05-03).

## Backend (products/catalyst/bootstrap/api)

- New handler/pinstore.go — sync.Mutex-guarded in-memory map keyed by email
  with 10-minute TTL, 60-second per-email rate limit, 3-attempt lockout, and
  a background goroutine that sweeps expired entries every minute.
  PINs are NEVER persisted to disk per credential-hygiene rules.

- handler/auth.go rewritten:
  * POST /api/v1/auth/pin/issue — body {email}. EnsureUser in openova realm,
    generate 6-digit PIN with crypto/rand (NEVER math/rand), store, send
    plaintext email with prominent "3 7 2 4 5 8" code and NO clickable URL,
    return {ok, requestId, expiresInSec}. Rate-limit 60s.
  * POST /api/v1/auth/pin/verify — body {email, pin, requestId}. Atomic
    verify+decrement, on match mint self-signed session JWT (same handover
    signer; KC 24.7 removed legacy token-exchange) and set HttpOnly Secure
    SameSite=Lax cookie. Wrong: 401 with attemptsRemaining. Locked/expired:
    410. Stable error codes: pin-invalid / pin-expired / attempts-exceeded /
    email-required / pin-rate-limited.

- Routes wired in cmd/api/main.go. Legacy /auth/magic and /auth/callback
  redirect to /login?error=flow_changed for stale bookmarks.

- Handler struct gets a pinStore field; openovaKC keycloakClient kept for
  the EnsureUser call.

- Tests: auth_pin_test.go (14 tests covering happy path, all error codes,
  SMTP rollback, rate limit, request-mismatch) + pinstore_test.go (12 tests
  on the store invariants).

## Frontend (products/catalyst/bootstrap/ui)

- New PinInput6.tsx component — 6 inputs, inputmode=numeric, maxlength=1,
  auto-advance focus, Backspace steps back, paste-anywhere splits clipboard
  digits across boxes (extracts /\d/g), auto-submits on the 6th digit or
  Enter. one-time-code autocomplete on box 0 for SMS prefill.

- LoginPage rewritten — single email field, "Send code" button, on success
  navigates to /login/verify with email + requestId in the URL. PIN never
  enters the URL.

- New VerifyPinPage — renders PinInput6, calls /pin/verify, on 401 shows
  "Code incorrect, X attempts remaining", on 410 routes back to /login
  with the error code, on 200 navigates to /wizard (or ?next=...).

- AuthCallbackPage stripped of magic-link code path; Catalyst-Zero branch
  is now a 302 safety net for stale Keycloak redirect URIs.

- Router gets /login/verify route.

- 17 vitest cases on PinInput6 covering paste, typing, backspace, Enter,
  pasting alphanumerics/long strings, controlled value, disabled state.

## DoD verification

- go test ./internal/handler/... -run "Pin|Handover|Auth" → PASS
  (12 pinstore_test + 14 auth_pin_test + handover/auth tests)
- npm test src/components/PinInput6.test.tsx → 17 passed
- helm template products/catalyst/chart → renders without error
- Email body contains zero clickable URLs: TestSendPinEmail_NoMagicLinkURL
  asserts ?token=, &token=, magic-link substrings absent

Closes #688

Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-03 20:26:05 +04:00
e3mrah
7ca9541ef9
fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup) (#691)
* fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup)

Sovereign-side catalyst-api needs Keycloak service-account credentials
to provision the operator's user during /auth/handover. Today the chart
references K8s Secret `catalyst-kc-sa-credentials` with keys addr/realm/
client-id/client-secret in the catalyst-system namespace — but no
zero-touch path materialised it. The dead SealedSecret template at
09a-keycloak-catalyst-api-secret.yaml had a different name AND different
keys (CATALYST_KC_*), used PLACEHOLDER_SEALED_VALUE markers no
provisioner replaced, and wasn't even listed in the bootstrap-kit
kustomization.

Symptom on otech48: GET /auth/handover?token=<valid-jwt> returns
"server misconfiguration: keycloak not configured"
(auth_handover.go:169).

Fix: bp-keycloak chart's configmap-sovereign-realm.yaml template now
emits the realm-import ConfigMap AND the catalyst-kc-sa-credentials
Secret in a single template scope so they share the same generated
client secret. Pattern mirrors platform/powerdns/chart/templates/
api-credentials-secret.yaml (canonical seam, ADR-0001 §11.3
anti-duplication).

Secret-value resolution order (first match wins):
  1. operator-supplied .Values.catalystApiServerClientSecret
  2. helm `lookup` of existing Secret in keycloak ns (idempotent)
  3. fresh randAlphaNum 32 (zero-touch on first install)

The Secret carries the four keys exactly as the catalyst-api Pod's
secretKeyRef expects — addr / realm / client-id / client-secret —
with addr derived from gateway.host (https://auth.<sovereignFQDN>).
Reflector annotations auto-mirror the Secret to catalyst-system as
soon as that namespace materialises (bootstrap-kit slot 13).

The realm import already creates the catalyst-api-server client with
serviceAccountsEnabled + impersonation/manage-users/view-users/
query-users role mappings — so once Keycloak is Ready and the realm
imports, the SA is fully provisioned and the K8s Secret carries a
matching client secret. No post-install Job, no Admin-API script,
no out-of-band SealedSecret ceremony.

Cleanup: removes the dead 09a SealedSecret template (not in
kustomization, never produced a working Secret).

Bumps:
  - bp-keycloak chart 1.3.0 -> 1.3.1
  - clusters/_template/bootstrap-kit/09-keycloak.yaml HelmRelease
    pin 1.3.0 -> 1.3.1

Existing per-Sovereign overlays (clusters/otech.omani.works/,
clusters/omantel.omani.works/) intentionally remain on 1.3.0 — fresh
otechN provisioning consumes _template at provision time.

Will be verified live on otech49 — handover end-to-end without ANY
manual Secret creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(keycloak): bump blueprint.yaml spec.version to match chart 1.3.1

TestBootstrapKit_BlueprintCardsHaveRequiredFields/keycloak asserts
Chart.yaml.version == blueprint.yaml.spec.version. Forgot to bump
blueprint.yaml in the previous commit.

Note: 8 other blueprints (cert-manager, flux, crossplane, sealed-secrets,
spire, nats-jetstream, openbao, gitea) carry the same pre-existing
mismatch and the test fails on main too. Out of scope for this PR;
fixing the keycloak case to keep the new chart version internally
consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:50:06 +04:00
github-actions[bot]
2146279083 deploy: update catalyst images to 6f3e15b 2026-05-03 15:49:28 +00:00
e3mrah
6f3e15b1ec
fix(handover): provision JWK Secret on Sovereign + inject SOVEREIGN_FQDN env (Phase-8b followup) (#692)
Two handover bugs caught live on otech48 (2026-05-03):

1. Sovereign-side catalyst-api responded to GET /auth/handover with
   "server misconfiguration: public key unavailable". Root cause: the
   K8s Secret `catalyst-handover-jwt-public` (referenced by the chart's
   optional Secret-volume) was never materialised on the Sovereign,
   so the optional volume mount fell through and the JWK file was
   absent inside the container. 1.2.0 wired the mount but no
   provisioning step created the Secret. Fix mirrors the canonical
   pattern from PR #543 (ghcr-pull) and PR #680 (harbor-robot-token):
   cloud-init now writes the Secret manifest into catalyst-system NS
   and runcmd applies it BEFORE flux-bootstrap, so the Secret exists
   by the time bp-catalyst-platform reconciles. Also moves the chart
   volume mount off the catalyst-api PVC (mountPath
   /etc/catalyst/handover-jwt-public, no subPath) so a leftover empty
   directory in the PVC from pre-#606 installs cannot collide with
   the re-provisioned Secret mount.

2. /auth/handover validator rejected every valid JWT with 401
   "invalid audience" because SOVEREIGN_FQDN was unset on Sovereigns
   — the audience check collapsed to the literal "https://console."
   prefix. The bp-catalyst-platform HelmRelease overlay was already
   setting `global.sovereignFQDN` but the chart template never plumbed
   it through to the Pod env. Added a SOVEREIGN_FQDN env reading
   `.Values.global.sovereignFQDN` (default "" so Catalyst-Zero
   installs, where catalyst-api is the SIGNER not the validator,
   stay clean).

Bumps:
- bp-catalyst-platform 1.2.4 -> 1.2.5
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease pin

Will be verified live on otech49 — fresh provision should reach
https://console.otech49.omani.works/auth/handover?token=... and
exchange to a Keycloak session WITHOUT manual Secret creation.

Issue #606 followup.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:47:21 +04:00
github-actions[bot]
adf8dc7ded deploy: update catalyst images to d0b574b 2026-05-03 14:36:29 +00:00
e3mrah
d0b574bd68
fix(hetzner-tofu): add powerdns_api_key to templatefile() vars (#687)
PR #686 added var.powerdns_api_key to variables.tf and referenced it as
${powerdns_api_key} in cloudinit-control-plane.tftpl, but missed wiring
it into the templatefile() vars dict in main.tf. Result on otech48:

  Invalid value for "vars" parameter: vars map does not contain key
  "powerdns_api_key", referenced at ./cloudinit-control-plane.tftpl:273

This commit closes the gap: powerdns_api_key now flows from var ->
templatefile vars -> cloud-init -> Secret manifest.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:34:36 +04:00
github-actions[bot]
351ab9b584 deploy: update catalyst images to 6847595 2026-05-03 14:25:30 +00:00