Commit Graph

235 Commits

Author SHA1 Message Date
e3mrah
25f14469d3
fix(provisioner): map wizard's three-mode domain selector to tofu's binary pool/byo enum (#1069)
Caught live on omantel.biz re-provision (deploymentId ab0bf689620f4102):
tofu plan failed at exit 1 with:

  Error: Invalid value for variable
    on variables.tf line 296:
   296: variable "domain_mode" {
      ├────────────────
      │ var.domain_mode is "byo-manual"
    Domain mode must be 'pool' or 'byo'.

The wizard's StepDomain has three options (pool / byo-manual /
byo-api) so the UX can branch the operator into the right flow:

  - pool:        OpenOva owns the parent zone via Dynadot+PDM
  - byo-manual:  operator pastes NS records into their registrar
  - byo-api:     operator's registrar API drives NS automatically

The OpenTofu module's `variable "domain_mode"` validation only
accepts the binary pool/byo distinction — from the cloud-infra layer
(Hetzner servers, network, LB) NONE of those wizard distinctions
matter; tofu only needs to know whether to call Dynadot at apply
time. The three-mode wizard value was being written verbatim to the
tfvars without mapping.

Add `mapDomainModeForTofu(wizardMode)` helper:
  - "pool"       → "pool"
  - "byo-manual" → "byo"
  - "byo-api"    → "byo"
  - empty        → "byo"  (test path that doesn't set the field)
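The mapping collapses to a one-liner. A minimal TypeScript sketch, assuming this shape (the actual `mapDomainModeForTofu` lives provisioner-side; types here are illustrative):

```typescript
// Hypothetical sketch of the mapping above; not the real provisioner code.
type WizardDomainMode = "pool" | "byo-manual" | "byo-api" | "";

function mapDomainModeForTofu(wizardMode: WizardDomainMode): "pool" | "byo" {
  // tofu only needs to know whether to call Dynadot at apply time, so
  // every byo-* flavour (and the unset test path) collapses to "byo".
  return wizardMode === "pool" ? "pool" : "byo";
}
```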

Bump chart 1.4.83 → 1.4.84.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:26:50 +04:00
e3mrah
0a0b912e0d
fix(wizard): KServe was wrongly under Always Included on every Sovereign (#1068)
* fix(hetzner-purge): close volumes/primary_ips/floating_ips gap — wipe was leaving Crossplane orphans

Founder caught the gap on omantel.biz post-decommission: Hetzner
console showed 0 servers/LBs/IPs but 1 Volume + 2 Networks + 1
Firewall lingering. Networks/Firewall were the existing async-detach
window (handled by name-prefix fallback in the next provision); the
**Volume** was a hard miss — Purge() never called /v1/volumes.

Root cause: post-handover, the Hetzner Cloud Volume CSI driver
allocates Hetzner Volumes for every CNPG/Harbor/Loki/Mimir
StatefulSet PVC. tofu state never tracks them. When the operator
decommissions, `tofu destroy` is a no-op for the Volume and the
existing label-sweep didn't list /v1/volumes either. Result: orphan
volumes accrue cloud cost across re-provision cycles.

Same architectural gap for primary_ips (CCM-allocated for LoadBalancer
services since Hetzner's 2023 IP-decoupling) and floating_ips
(rare in Catalyst stack but listed for completeness).

Fix: extend Purge() + purgeByNamePrefix() to walk three additional
endpoints in dependency order:

  servers → load_balancers → firewalls → networks → ssh_keys
  → volumes (after servers detach)
  → primary_ips (after LBs free their IPs)
  → floating_ips

Both label-pass AND name-prefix-pass cover all 8 kinds. PurgeReport
extended with Volumes/PrimaryIPs/FloatingIPs slices; Total() updated.

CSI-named volumes (`pvc-<uid>` form) won't match either pass — those
need the canonical `catalyst.openova.io/sovereign=<fqdn>` label which
the Crossplane composition for VolumeClaim must apply. That
composition-layer fix is tracked separately; this PR closes the wipe
gap for everything labelled OR name-prefixed.

Bump chart 1.4.80 → 1.4.81.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(wizard): KServe was wrongly under Always Included on every Sovereign

Founder caught on console.openova.io/sovereign/wizard step 4: KServe
appeared in the "Always Included" section as if every Sovereign had
to install it. False positive — KServe is conditionally mandatory
ONLY when the operator opts into the CORTEX (AI/ML) product family.

Two coupled bugs:

(1) Data model: kserve was tagged tier:'mandatory' inside the CORTEX
    product family, but tier:'mandatory' is consumed everywhere in
    the wizard as "always-on regardless of family selection":
      - componentGroups.ts:543 — seedIds.add(c.id) → auto-selected at
        wizard init for every Sovereign
      - applicationCatalog.ts:97 — seeded into the apps grid
      - store.ts:642 — special-cased as undeselectable
      - StepComponents.tsx — surfaced under "Always Included" tab
    Demote to tier:'recommended'. CORTEX has
    cascadeOnMemberSelection:true so picking any CORTEX member (vLLM,
    Specter, BGE, Milvus, …) still auto-pulls KServe via the cascade
    — that's the right semantics. KServe stays visible under CORTEX
    in Tab 1 ("Choose Your Stack") and locks in once CORTEX is
    selected.

(2) UI filter: AlwaysIncludedTab was iterating every PRODUCTS entry
    regardless of product.tier and listing every member with
    component.tier === 'mandatory'. That mixes the platform-mandatory
    layer (PILOT/SPINE/SURGE/SILO/GUARDIAN tier:'mandatory' families)
    with conditional-mandatory members of opt-in families
    (CORTEX/RELAY tier:'optional', INSIGHTS/FABRIC tier:'recommended').
    Filter by product.tier === 'mandatory' so only the always-on
    families' mandatory members appear. Defence-in-depth — even if a
    new opt-in family ships with internal-mandatory members, they
    won't leak into "Always Included".
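The corrected filter in (2) boils down to requiring tier:'mandatory' at the product level before looking at members. A simplified TypeScript sketch (shapes are illustrative, not the real componentGroups types) — note the filter alone excludes an internal-mandatory member of an opt-in family:

```typescript
type Tier = "mandatory" | "recommended" | "optional";
interface Component { id: string; tier: Tier; }
interface Product { id: string; tier: Tier; components: Component[]; }

// Only always-on families contribute to "Always Included": the product
// itself must be tier:'mandatory', not just the member component.
function alwaysIncluded(products: Product[]): Component[] {
  return products
    .filter((p) => p.tier === "mandatory")
    .flatMap((p) => p.components.filter((c) => c.tier === "mandatory"));
}
```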

Audit confirmed kserve was the only offender across all 9 product
families today. PILOT/SPINE/SURGE/SILO/GUARDIAN remain unchanged
(their members rightfully tier:'mandatory'); CORTEX kserve fixed;
others have no internal mandatories.

Bump chart 1.4.81 → 1.4.82.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 00:33:19 +04:00
e3mrah
b233202b65
fix(hetzner-purge): close volumes/primary_ips/floating_ips gap — wipe was leaving Crossplane orphans (#1067)
Founder caught the gap on omantel.biz post-decommission: Hetzner
console showed 0 servers/LBs/IPs but 1 Volume + 2 Networks + 1
Firewall lingering. Networks/Firewall were the existing async-detach
window (handled by name-prefix fallback in the next provision); the
**Volume** was a hard miss — Purge() never called /v1/volumes.

Root cause: post-handover, the Hetzner Cloud Volume CSI driver
allocates Hetzner Volumes for every CNPG/Harbor/Loki/Mimir
StatefulSet PVC. tofu state never tracks them. When the operator
decommissions, `tofu destroy` is a no-op for the Volume and the
existing label-sweep didn't list /v1/volumes either. Result: orphan
volumes accrue cloud cost across re-provision cycles.

Same architectural gap for primary_ips (CCM-allocated for LoadBalancer
services since Hetzner's 2023 IP-decoupling) and floating_ips
(rare in Catalyst stack but listed for completeness).

Fix: extend Purge() + purgeByNamePrefix() to walk three additional
endpoints in dependency order:

  servers → load_balancers → firewalls → networks → ssh_keys
  → volumes (after servers detach)
  → primary_ips (after LBs free their IPs)
  → floating_ips

Both label-pass AND name-prefix-pass cover all 8 kinds. PurgeReport
extended with Volumes/PrimaryIPs/FloatingIPs slices; Total() updated.

CSI-named volumes (`pvc-<uid>` form) won't match either pass — those
need the canonical `catalyst.openova.io/sovereign=<fqdn>` label which
the Crossplane composition for VolumeClaim must apply. That
composition-layer fix is tracked separately; this PR closes the wipe
gap for everything labelled OR name-prefixed.

Bump chart 1.4.80 → 1.4.81.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 00:08:50 +04:00
e3mrah
daeff32cbe
fix(cloudpage): hoist k8sStream above ctx — TS use-before-declaration broke build-ui (#1066)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.
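The detection rule reduces to a small predicate; sketched here in TypeScript for illustration (the real check is Go-side in catalyst-api, and the parameter names are hypothetical):

```typescript
// true ⇒ build the dynamic client from in-cluster credentials
// (rest.InClusterConfig() on the Go side) instead of a posted-back kubeconfig.
function useInClusterConfig(
  env: { SOVEREIGN_FQDN?: string },
  depSovereignFqdn: string,
): boolean {
  // Mother: SOVEREIGN_FQDN unset → false, behaviour unchanged.
  // Chroot: env matches the only deployment it serves → in-cluster creds.
  return !!env.SOVEREIGN_FQDN && env.SOVEREIGN_FQDN === depSovereignFqdn;
}
```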

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].
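The not-found fold in (2) can be sketched like this (the real handler is Go and checks apierrors.IsNotFound; `listOrEmpty` is a hypothetical name):

```typescript
// CRD absent → empty list (page renders its empty state, not a 500 toast).
// Any other error (incl. the 403 seen before the RBAC fix) still surfaces.
function listOrEmpty<T>(list: () => T[], isNotFound: (e: unknown) => boolean): T[] {
  try {
    return list(); // CRD installed: normal list result
  } catch (e) {
    if (isNotFound(e)) return []; // CRD not installed: 200 + []
    throw e;
  }
}
```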

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.
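The seed-once gate in (2) can be sketched as follows; names and shapes are hypothetical simplifications of the Go-side chrootSeedJobsStoreIfEmpty:

```typescript
// Handler-time lazy seed: only on the chroot, only when the per-deployment
// store has 0 records. Subsequent calls read the populated store directly.
function maybeSeedJobsStore<T>(
  isChrootSelf: boolean,    // SOVEREIGN_FQDN matches dep.Request.SovereignFQDN
  store: T[],               // per-deployment jobs store (empty after import)
  snapshotLive: () => T[],  // one-shot HelmRelease list via in-cluster client
): T[] {
  if (isChrootSelf && store.length === 0) {
    store.push(...snapshotLive());
  }
  return store;
}
```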

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
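Both fixes reduce to a few lines. A simplified TypeScript sketch (`jobLinkId` and `indexJobs` are illustrative names for the useJobLinkBuilder/JobDetail logic):

```typescript
// Fix 1: strip the "<deploymentId>:" prefix before encoding so no ':'
// is left to percent-encode; strict proxies can't 404 on %3A.
function jobLinkId(canonicalId: string): string {
  const i = canonicalId.indexOf(":");
  return encodeURIComponent(i >= 0 ? canonicalId.slice(i + 1) : canonicalId);
}

// Fix 2: index by BOTH canonical id and bare jobName so the URL param
// resolves regardless of which format the link emitted.
function indexJobs(jobs: { id: string }[]): Map<string, { id: string }> {
  const byId = new Map<string, { id: string }>();
  for (const j of jobs) {
    byId.set(j.id, j);
    const i = j.id.indexOf(":");
    if (i >= 0) byId.set(j.id.slice(i + 1), j);
  }
  return byId;
}
```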

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back from URL params. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by founder asking if PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel.
Answer was: nothing reached anyone except omantel via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the "freeze contabo" intent of the earlier comment:
that comment argued contabo auto-roll on every PR was bad because
PR #975's image broke contabo (k8scache startup loop). The right
response is to fix the bug in the code, not freeze contabo.
Freezing masked real divergence — the reason the founder caught
this is that manual omantel patches were the only thing keeping
omantel current while contabo + every other fresh Sovereign quietly
ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane

Founder asked: "make the real-time k8s information propagation
development reused — find the reverted prior work and implement the
final working one."

History:
- PR #358 (May 1) shipped the full informer + SSE data plane:
  internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics}
  + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) +
  UI hook lib/useK8sStream.ts + widget useK8sCacheStream.
- PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream
  with kinds=namespace,node,pv,pod,deployment,...,server.hcloud,
  volume.hcloud and `&initialState=1` for live cloud-graph deltas.
- PR #981 hotfix dropped the synchronous discovery probe in
  factory.go:AddCluster (it was calling
  core.Discovery().ServerResourcesForGroupVersion(gv) with NO context
  timeout — on a kubeconfig pointing at a decommissioned otech the
  call hung the catalyst-api startup for minutes per dead cluster).

After #981 the discovery-probe surgery was clean — nothing broke
afterwards. The data plane code stayed in the codebase. The remaining
gap was operational, not architectural:

  On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>),
  the catalyst-api boots without a posted-back kubeconfig in
  /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns []
  → factory has zero clusters → every
  /api/v1/sovereigns/{depId}/k8s/* request 404s with
  "sovereign \"...\" not registered". The architecture-graph
  in-flight call confirmed live on omantel.biz today.

Fix in this PR:

1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN
   env is set (chroot mode), build a ClusterRef with id resolved from
   CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by
   scanning /var/lib/catalyst/deployments/*.json for a record matching
   the FQDN (mirrors HandleSovereignSelf's store-fallback path for
   consistency). DynamicClient + CoreClient built from
   rest.InClusterConfig(). Append to the cluster list. Mother behavior
   unchanged — SOVEREIGN_FQDN unset → branch is a no-op.

2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide
   get/list/watch on every kind in the k8scache registry (pods,
   deployments, statefulsets, daemonsets, replicasets, services,
   endpointslices, ingresses, configmaps, secrets, persistentvolumes,
   persistentvolumeclaims, hcloud.crossplane.io managed resources,
   vclusters), plus authorization.k8s.io/subjectaccessreviews so the
   per-event SAR gating in the SSE handler doesn't 403 silently.

3. Bump chart 1.4.70 → 1.4.71.
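The id-resolution order in (1) can be sketched as follows (env var names come from the commit; the store-scan callback is an illustrative stand-in for the deployments-dir scan):

```typescript
// null ⇒ mother mode, the self-register branch is a no-op.
function resolveSelfClusterId(
  env: { CATALYST_SELF_DEPLOYMENT_ID?: string; SOVEREIGN_FQDN?: string },
  scanDeploymentsForFqdn: (fqdn: string) => string | null,
): string | null {
  if (!env.SOVEREIGN_FQDN) return null; // mother: SOVEREIGN_FQDN unset
  // Prefer the orchestrator-stamped env; fall back to scanning the
  // persisted deployment records for one matching the FQDN.
  return env.CATALYST_SELF_DEPLOYMENT_ID
    ?? scanDeploymentsForFqdn(env.SOVEREIGN_FQDN);
}
```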

The discovery-probe failure mode that triggered the original revert
(synchronous ServerResourcesForGroupVersion blocking startup) does
NOT recur here — InClusterConfig() returns immediately, NewForConfig
is lazy, and the first network call happens inside the informer
goroutine after Start, off the boot critical path. Mother-side
LoadClustersFromDir behavior is untouched (no probe, just kubeconfig
file parsing as it has been since #981).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): + More popover escapes overflow clip + graph centers via gravity force

Two cloud-page bugs caught live on omantel.biz:

(1) /cloud?view=list&kind=clusters → +More popover non-functional.
    The popover renders at its anchor coords but pointer events pass
    through to the toolbar below it. Diagnosis:
        .cloud-page-toolbar > [data-testid="cloud-kind-chips"] {
          overflow-x: auto;
        }
    Per the CSS Overflow spec, if one axis is given a non-visible
    value, a computed `visible` on the other axis becomes `auto`.
    So overflow-x:auto on the chips strip silently sets
    overflow-y:auto, which clips the absolutely-positioned popover
    that hangs DOWN from the +More button.

    Fix: render the popover via React.createPortal to document.body
    so it's outside any overflow ancestor. Position via fixed
    coordinates computed from the +More button's
    getBoundingClientRect, recomputed on resize/scroll. Click-outside
    dismissal updated to check both wrapper AND portaled popover.
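The positioning math for the portaled popover might look like this — a sketch assuming a simple clamp-to-viewport policy, not the exact production code:

```typescript
// Inputs come from the +More button's getBoundingClientRect(),
// recomputed on resize/scroll.
interface AnchorRect { bottom: number; left: number; }

function popoverPosition(anchor: AnchorRect, popoverWidth: number, viewportWidth: number) {
  // Hang down from the button; clamp left so the popover never
  // overflows the right edge of the viewport.
  const left = Math.min(anchor.left, Math.max(0, viewportWidth - popoverWidth));
  return { position: "fixed" as const, top: anchor.bottom + 4, left };
}
```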

(2) /cloud?view=graph → bubbles drift to canvas edges, leaving the
    centre empty until enough nodes (e.g. worker nodes) are added
    to anchor things via link tension.

    Two coupled root causes:

    a) `forceCenter` only adjusts the centroid — it shifts ALL
       nodes uniformly so their average sits at (cx, cy). It does
       NOT pull individual nodes inward. With small node counts
       and high charge repulsion (-160 for ≤50 nodes), nothing
       opposes outward drift.

    b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x =
       minX`. Nodes that hit the wall get arrested with their
       velocity preserved on the perpendicular axis but no inward
       impulse → they slide along the wall and stack at corners.
       The simulation never relaxes back to the centre.

    Fix:
    a) Add forceX(cx) + forceY(cy) with `centerGravity` strength
       per node-count tier (0.08 for ≤50, scaling down with
       larger graphs where link tension is sufficient). This pulls
       every individual node toward the centre proportional to its
       offset.
    b) Replace the hard clamp with an elastic bounce: when a node
       hits the boundary, reverse its velocity component (×0.4
       damping) instead of zeroing it. Energy returns to the
       system, the simulation actually relaxes.
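The clamp-vs-bounce difference in (b) can be sketched without d3 (`SimNode` mirrors d3's `{x, vx}` node shape; the 0.4 damping factor is from the fix above, the rest is illustrative):

```typescript
interface SimNode { x: number; vx: number; }

const DAMPING = 0.4;

// Elastic bounce: on wall contact, reverse the velocity component
// (with damping) instead of zeroing it. Energy returns to the system,
// so nodes relax back toward the centre rather than stacking in corners.
function bounceX(n: SimNode, minX: number, maxX: number): void {
  if (n.x < minX) {
    n.x = minX;
    n.vx = Math.abs(n.vx) * DAMPING;  // push back inward (rightward)
  } else if (n.x > maxX) {
    n.x = maxX;
    n.vx = -Math.abs(n.vx) * DAMPING; // push back inward (leftward)
  }
}
```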

Bump chart 1.4.72 → 1.4.73.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloud): expose all live K8s kinds in +More popover + chip counts + tighter graph centering

Founder feedback (after PR #1062 lit up the data plane):
1. The +More popover was missing pods, deployments, statefulsets,
   daemonsets, configmaps, secrets, namespaces, etc. — it only
   carried the 6 placeholder kinds the legacy topology API knew
   about.
2. Several chips (Services, Ingresses, Storage Classes) showed "—"
   for count even though the data IS in the live cluster (visible in
   the graph view).
3. The graph view still pushed bubbles to canvas edges; only adding
   worker nodes brought things back. The previous gravity tuning
   wasn't strong enough for ~300 nodes.

This PR addresses all three.

(1) Eleven new K8s-backed list pages exposed in +More:
    Pods, Deployments, StatefulSets, DaemonSets, ReplicaSets,
    ConfigMaps, Secrets, Namespaces, Nodes, PersistentVolumes,
    EndpointSlices.
    Plus replaced the placeholder Services and Ingresses pages with
    live K8s tables.

    All built on a new generic K8sListPage that subscribes to
    /api/v1/sovereigns/{depId}/k8s/stream (same SSE channel the
    architecture-graph already uses) and renders a typed-column
    table per kind. Columns are declared once per kind in
    kindsPages.tsx; the rendering is uniform so adding a kind is a
    ~12-line wrapper.

(2) CloudPage.kindCounts now folds the live K8s snapshot into the
    chip-count map. KIND_TO_REGISTRY in kinds.ts maps each chip id
    to the registry kind name (pods → 'pod' etc). Counts that came
    from null (data not available) flip to live counts the moment
    the SSE stream's initialState=1 arrives.

(3) GraphCanvas physics retuned for live-data scale:
    - centerGravity: 0.08→0.18 for ≤50 nodes, 0.06→0.16 for ≤200,
      0.04→0.14 for ≤1000, 0.03→0.10 for ≤5000, 0.02→0.08 for >5000.
      The forceX/forceY pulls every individual node toward (cx,cy)
      proportional to its offset — 2-3× stronger than the original
      tuning so the canvas centre stays populated.
    - Charge softened: -160→-90 for ≤50 nodes, scaled down through
      every tier. The previous values were calibrated against a
      ~20-node topology stub; live data delivers 10-50× more nodes
      per Sovereign so charge needs to relax proportionally.
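The chip-count fold in (2) above can be sketched as follows (only the pods → 'pod' mapping is stated above; the services entry and the shapes are illustrative):

```typescript
// Chip id → k8scache registry kind name (illustrative subset).
const KIND_TO_REGISTRY: Record<string, string> = { pods: "pod", services: "service" };

// null ⇒ "—" in the UI; flips to the live count once the SSE stream's
// initialState=1 snapshot arrives.
function kindCounts(
  base: Record<string, number | null>,
  liveCounts: Record<string, number>,
): Record<string, number | null> {
  const out = { ...base };
  for (const [chip, registryKind] of Object.entries(KIND_TO_REGISTRY)) {
    const live = liveCounts[registryKind];
    if (live !== undefined) out[chip] = live;
  }
  return out;
}
```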

Bump chart 1.4.74 → 1.4.75.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-list): share single SSE subscription via CloudContext — list pages were stuck connecting

After PR #1064 the +More popover was correctly populated and chip
counts were live, but clicking through to a list page (e.g.
/cloud?view=list&kind=pods) hung at "Connecting to live cluster
stream…" while the chip count beside the same kind already showed
the right number (110 pods).

Diagnosis: the K8sListPage was calling useK8sCacheStream with kinds:[kind],
opening its OWN EventSource. The parent CloudPage already had an
EventSource open (subscribing to all kinds — the source of the chip
counts). Two long-lived SSE streams from the same browser to the
same origin starve the connection budget; the second connection
hangs at "connecting" while the first holds the slot.

Fix: hoist the snapshot via CloudContext. CloudPage is already the
owner of the page-level useK8sCacheStream invocation; expose its
snapshot/status/revision through the existing useCloud() context.
K8sListPage now reads from useCloud() instead of opening a duplicate
stream. Single subscription, single source of truth for both chip
counts AND list rows.
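
The single-subscription idea reduces to a refcounted holder; an illustrative sketch only (not the actual CloudContext/useK8sCacheStream code — all names here are invented):

```typescript
// Hypothetical sketch: at most one live stream no matter how many
// consumers (chip counts, list pages) attach; the last release closes it.
type Subscription = { close(): void };

function makeSharedStream(open: () => Subscription) {
  let sub: Subscription | null = null;
  let refs = 0;
  return {
    // Every consumer calls acquire(); only the first actually opens.
    acquire(): () => void {
      refs += 1;
      if (sub === null) sub = open();
      let released = false;
      return () => {
        if (released) return; // double-release is a no-op
        released = true;
        refs -= 1;
        if (refs === 0 && sub !== null) {
          sub.close();
          sub = null;
        }
      };
    },
    get active(): boolean {
      return sub !== null;
    },
  };
}
```

In the real fix the holder is CloudPage's existing useK8sCacheStream invocation and consumers read through useCloud(); the sketch only shows the shape — one EventSource, N readers.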

Bump chart 1.4.76 → 1.4.77.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloudpage): hoist k8sStream above ctx — was used before declaration

PR #1065 added k8sStream into the ctx useMemo deps but the
useK8sCacheStream() call was at line 396, well after the ctx build at
line 290. tsc -b caught it: TS2448/TS2454 use-before-declaration. CI
build-ui failed.

Move the useK8sCacheStream invocation to immediately precede the ctx
build. No behaviour change.

Bump chart 1.4.78 → 1.4.79.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:58:25 +04:00
e3mrah
f02136a89c
fix(cloud-list): share single SSE via CloudContext — list pages were stuck connecting (#1065)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.
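
The handler-time gate in (2) boils down to a seed-once pattern. A hedged TypeScript sketch (the real helper is Go — chrootSeedJobsStoreIfEmpty — and these names are invented):

```typescript
// Invented names; mirrors only the gate: if the per-deployment store is
// empty, take ONE live snapshot and seed it; later calls read the store.
type Job = { id: string };

interface JobStore {
  count(): number;
  seed(jobs: Job[]): void;
}

async function seedIfEmpty(
  store: JobStore,
  snapshotLive: () => Promise<Job[]>, // one-shot list, not an informer
): Promise<void> {
  if (store.count() > 0) return; // already seeded: no live call
  store.seed(await snapshotLive());
}
```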

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
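
The prefix strip in (1) is tiny; a sketch under the same assumption (a colon-free segment needs no percent-decoding, so a strict proxy cannot break it — helper name invented):

```typescript
// Hypothetical helper: emit the bare jobName as the path segment so the
// URL never contains a ':' that an RFC 3986-strict proxy would refuse
// to decode from %3A.
function jobUrlSegment(canonicalId: string): string {
  const sep = canonicalId.indexOf(":");
  const bare = sep >= 0 ? canonicalId.slice(sep + 1) : canonicalId;
  return encodeURIComponent(bare); // no-op for plain jobNames
}
```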

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back to URL params. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).
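
The nullable three-state join (published / unpublished / no chip) can be sketched like this — assumed shapes, not the real HandleSovereignApps code:

```typescript
// Assumed shapes. null ⇒ slug unknown to the SME catalog (or the
// catalog tier unreachable) ⇒ the FE renders no publish chip at all.
type SmeCatalogEntry = { slug: string; published: boolean };

function decoratePublishState(
  appSlugs: string[],
  catalog: SmeCatalogEntry[] | null, // null ⇒ SME catalog unreachable
): Record<string, boolean | null> {
  const bySlug = new Map<string, boolean>();
  for (const e of catalog ?? []) bySlug.set(e.slug, e.published);
  const out: Record<string, boolean | null> = {};
  for (const slug of appSlugs) out[slug] = bySlug.get(slug) ?? null;
  return out;
}
```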

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by founder asking if PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel.
Answer was: nothing reached anyone except omantel via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the "freeze contabo" intent of the previous comment:
The previous comment said contabo auto-roll on every PR was bad
because PR #975's image broke contabo (k8scache startup loop).
The solution there is to fix the bug in the code, not to freeze contabo.
Freezing masked real divergence — the reason the founder caught
this is that manual omantel patches were the only thing keeping
omantel current while contabo + every other fresh Sovereign quietly
ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane

Founder asked: "make the real-time k8s information propagation
development reused — find the reverted prior work and implement the
final working one."

History:
- PR #358 (May 1) shipped the full informer + SSE data plane:
  internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics}
  + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) +
  UI hook lib/useK8sStream.ts + widget useK8sCacheStream.
- PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream
  with kinds=namespace,node,pv,pod,deployment,...,server.hcloud,
  volume.hcloud and `&initialState=1` for live cloud-graph deltas.
- PR #981 hotfix dropped the synchronous discovery probe in
  factory.go:AddCluster (it was calling
  core.Discovery().ServerResourcesForGroupVersion(gv) with NO context
  timeout — on a kubeconfig pointing at a decommissioned otech the
  call hung the catalyst-api startup for minutes per dead cluster).

After #981 the discovery-probe surgery was clean — no follow-up
broke. The data plane code stayed in the codebase. The remaining
gap was operational, not architectural:

  On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>),
  the catalyst-api boots without a posted-back kubeconfig in
  /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns []
  → factory has zero clusters → every
  /api/v1/sovereigns/{depId}/k8s/* request 404s with
  "sovereign \"...\" not registered". The architecture-graph
  in-flight call confirmed live on omantel.biz today.

Fix in this PR:

1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN
   env is set (chroot mode), build a ClusterRef with id resolved from
   CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by
   scanning /var/lib/catalyst/deployments/*.json for a record matching
   the FQDN (mirrors HandleSovereignSelf's store-fallback path for
   consistency). DynamicClient + CoreClient built from
   rest.InClusterConfig(). Append to the cluster list. Mother behavior
   unchanged — SOVEREIGN_FQDN unset → branch is a no-op.

2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide
   get/list/watch on every kind in the k8scache registry (pods,
   deployments, statefulsets, daemonsets, replicasets, services,
   endpointslices, ingresses, configmaps, secrets, persistentvolumes,
   persistentvolumeclaims, hcloud.crossplane.io managed resources,
   vclusters), plus authorization.k8s.io/subjectaccessreviews so the
   per-event SAR gating in the SSE handler doesn't 403 silently.

3. Bump chart 1.4.70 → 1.4.71.

The discovery-probe failure mode that triggered the original revert
(synchronous ServerResourcesForGroupVersion blocking startup) does
NOT recur here — InClusterConfig() returns immediately, NewForConfig
is lazy, and the first network call happens inside the informer
goroutine after Start, off the boot critical path. Mother-side
LoadClustersFromDir behavior is untouched (no probe, just kubeconfig
file parsing as it has been since #981).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): + More popover escapes overflow clip + graph centers via gravity force

Two cloud-page bugs caught live on omantel.biz:

(1) /cloud?view=list&kind=clusters → +More popover non-functional.
    The popover renders at its anchor coords but pointer events pass
    through to the toolbar below it. Diagnosis:
        .cloud-page-toolbar > [data-testid="cloud-kind-chips"] {
          overflow-x: auto;
        }
    Per the CSS Overflow spec, when one overflow axis computes to a
    non-visible value, a `visible` value on the other axis is computed
    as `auto`. So overflow-x:auto on the chips strip silently sets
    overflow-y:auto, which clips the absolutely-positioned popover
    that hangs DOWN from the +More button.

    Fix: render the popover via React.createPortal to document.body
    so it's outside any overflow ancestor. Position via fixed
    coordinates computed from the +More button's
    getBoundingClientRect, recomputed on resize/scroll. Click-outside
    dismissal updated to check both wrapper AND portaled popover.
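
    The viewport math behind the portal fix, as a standalone sketch (names invented; the real code recomputes this on resize/scroll):

```typescript
// Invented names. Once the popover is portaled to document.body it is
// positioned in fixed (viewport) coordinates computed from the +More
// button's getBoundingClientRect, clamped to stay on screen.
type AnchorRect = { left: number; bottom: number; width: number };

function popoverPosition(
  anchor: AnchorRect,
  popover: { width: number; height: number },
  viewport: { width: number; height: number },
): { top: number; left: number } {
  const top = anchor.bottom + 4; // hang 4px below the button
  // clamp so the popover's right edge stays inside the viewport
  const left = Math.max(
    0,
    Math.min(anchor.left, viewport.width - popover.width),
  );
  return { top, left };
}
```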

(2) /cloud?view=graph → bubbles drift to canvas edges, leaving the
    centre empty until enough nodes (e.g. worker nodes) are added
    to anchor things via link tension.

    Two coupled root causes:

    a) `forceCenter` only adjusts the centroid — it shifts ALL
       nodes uniformly so their average sits at (cx, cy). It does
       NOT pull individual nodes inward. With small node counts
       and high charge repulsion (-160 for ≤50 nodes), nothing
       opposes outward drift.

    b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x =
       minX`. Nodes that hit the wall get arrested with their
       velocity preserved on the perpendicular axis but no inward
       impulse → they slide along the wall and stack at corners.
       The simulation never relaxes back to the centre.

    Fix:
    a) Add forceX(cx) + forceY(cy) with `centerGravity` strength
       per node-count tier (0.08 for ≤50, scaling down with
       larger graphs where link tension is sufficient). This pulls
       every individual node toward the centre proportional to its
       offset.
    b) Replace the hard clamp with an elastic bounce: when a node
       hits the boundary, reverse its velocity component (×0.4
       damping) instead of zeroing it. The node keeps an inward
       impulse instead of dying against the wall, so the simulation
       actually relaxes.
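
    One dimension of the two force changes, sketched without d3 (not
    the actual GraphCanvas code; the tier-scaled gravity constants
    live in the real tuning tables):

```typescript
// Sketch only: (a) a per-node pull toward cx proportional to offset —
// what forceX/forceY add on top of forceCenter's centroid shift — and
// (b) an elastic bounce replacing the hard wall clamp.
type SimNode = { x: number; vx: number };

function tick1D(
  n: SimNode,
  cx: number,
  gravity: number, // the per-tier centerGravity strength
  minX: number,
  maxX: number,
): void {
  n.vx += (cx - n.x) * gravity; // (a) inward impulse, grows with offset
  n.x += n.vx;
  if (n.x < minX) {
    n.x = minX;
    n.vx = -n.vx * 0.4; // (b) bounce with 0.4 damping, not a dead stop
  } else if (n.x > maxX) {
    n.x = maxX;
    n.vx = -n.vx * 0.4;
  }
}
```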

Bump chart 1.4.72 → 1.4.73.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloud): expose all live K8s kinds in +More popover + chip counts + tighter graph centering

Founder feedback (after PR #1062 lit up the data plane):
1. The +More popover was missing pods, deployments, statefulsets,
   daemonsets, configmaps, secrets, namespaces, etc. — it only
   carried the 6 placeholder kinds the legacy topology API knew
   about.
2. Several chips (Services, Ingresses, Storage Classes) showed "—"
   for count even though the data IS in the live cluster (visible in
   the graph view).
3. The graph view still pushed bubbles to canvas edges; only adding
   worker nodes brought things back. The previous gravity tuning
   wasn't strong enough for ~300 nodes.

This PR addresses all three.

(1) Eleven new K8s-backed list pages exposed in +More:
    Pods, Deployments, StatefulSets, DaemonSets, ReplicaSets,
    ConfigMaps, Secrets, Namespaces, Nodes, PersistentVolumes,
    EndpointSlices.
    Plus replaced the placeholder Services and Ingresses pages with
    live K8s tables.

    All built on a new generic K8sListPage that subscribes to
    /api/v1/sovereigns/{depId}/k8s/stream (same SSE channel the
    architecture-graph already uses) and renders a typed-column
    table per kind. Columns are declared once per kind in
    kindsPages.tsx; the rendering is uniform so adding a kind is a
    ~12-line wrapper.

(2) CloudPage.kindCounts now folds the live K8s snapshot into the
    chip-count map. KIND_TO_REGISTRY in kinds.ts maps each chip id
    to the registry kind name (pods → 'pod' etc). Counts that came
    from null (data not available) flip to live counts the moment
    the SSE stream's initialState=1 arrives.
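
    The fold in (2) amounts to this (assumed shapes; the
    KIND_TO_REGISTRY name is from the commit, the rest is invented):

```typescript
// Assumed shapes: chip ids map to registry kind names (pods → 'pod');
// a chip whose count was null flips to the live length the moment the
// snapshot carries that kind.
function foldLiveCounts(
  base: Record<string, number | null>,
  kindToRegistry: Record<string, string>,
  snapshot: Record<string, unknown[] | undefined>,
): Record<string, number | null> {
  const out = { ...base };
  for (const [chip, kind] of Object.entries(kindToRegistry)) {
    const objs = snapshot[kind];
    if (objs !== undefined) out[chip] = objs.length; // live count wins
  }
  return out;
}
```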

(3) GraphCanvas physics retuned for live-data scale:
    - centerGravity: 0.08→0.18 for ≤50 nodes, 0.06→0.16 for ≤200,
      0.04→0.14 for ≤1000, 0.03→0.10 for ≤5000, 0.02→0.08 for >5000.
      The forceX/forceY pulls every individual node toward (cx,cy)
      proportional to its offset — 2-3× stronger than the original
      tuning so the canvas centre stays populated.
    - Charge softened: -160→-90 for ≤50 nodes, scaled down through
      every tier. The previous values were calibrated against a
      ~20-node topology stub; live data delivers 10-50× more nodes
      per Sovereign so charge needs to relax proportionally.

Bump chart 1.4.74 → 1.4.75.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-list): share single SSE subscription via CloudContext — list pages were stuck connecting

After PR #1064 the +More popover was correctly populated and chip
counts were live, but clicking through to a list page (e.g.
/cloud?view=list&kind=pods) hung at "Connecting to live cluster
stream…" while the chip count beside the same kind already showed
the right number (110 pods).

Diagnosis: the K8sListPage was calling useK8sCacheStream with kinds:[kind],
opening its OWN EventSource. The parent CloudPage already had an
EventSource open (subscribing to all kinds — the source of the chip
counts). Two long-lived SSE streams from the same browser to the
same origin starve the connection budget; the second connection
hangs at "connecting" while the first holds the slot.

Fix: hoist the snapshot via CloudContext. CloudPage is already the
owner of the page-level useK8sCacheStream invocation; expose its
snapshot/status/revision through the existing useCloud() context.
K8sListPage now reads from useCloud() instead of opening a duplicate
stream. Single subscription, single source of truth for both chip
counts AND list rows.

Bump chart 1.4.76 → 1.4.77.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:34:16 +04:00
e3mrah
2604c9cf36
feat(cloud): all live K8s kinds in +More + chip counts + tighter graph centering (#1064)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
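Both fixes reduce to a small amount of string handling. A hedged TypeScript sketch — helper names here are illustrative, not the real `useJobLinkBuilder` API:

```typescript
// Strip the "<deploymentId>:" prefix so no ":" (and therefore no "%3A")
// ever appears in the path segment for a proxy to mishandle.
function bareJobName(canonicalId: string): string {
  // Canonical ids look like "69e73b3abe673840:install-keycloak";
  // everything before the first ":" is the deployment id.
  const sep = canonicalId.indexOf(":");
  return sep === -1 ? canonicalId : canonicalId.slice(sep + 1);
}

function jobDetailUrl(canonicalId: string): string {
  // encodeURIComponent is still applied, but after prefix-stripping the
  // name contains no ":" so the encoded form is proxy-safe.
  return `/jobs/${encodeURIComponent(bareJobName(canonicalId))}`;
}
```

Store.GetJob's dual-lookup (bare name or canonical id) is what makes the lossy URL safe on the server side.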

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back from URL params. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.
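The resolution order can be sketched as a pure function — names below are illustrative, not the real hook's signature:

```typescript
// Prefer a deploymentId from the route params (mother surface: the URL
// carries /sovereign/provision/$id); fall back to the deployment_id
// claim surfaced by /api/v1/sovereign/self (chroot surface: URL is bare).
interface IdSources {
  urlParam?: string;   // from the route on the mother
  selfClaim?: string;  // from the JWT cookie via /api/v1/sovereign/self
}

function resolveDeploymentId(src: IdSources): string | null {
  return src.urlParam ?? src.selfClaim ?? null;
}

// React Query gating: enabled only once an id exists, so the topology
// query never fires against /deployments/undefined/... mid-resolution.
function topologyQueryEnabled(deploymentId: string | null): boolean {
  return deploymentId !== null && deploymentId !== "";
}
```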

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.
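The tri-state join is the load-bearing detail: `true`/`false` means the slug is in the SME catalog, `null` means there is nothing to toggle. A sketch of that decoration (the real code is Go; TypeScript here for illustration, with invented type names):

```typescript
interface CatalogEntry { slug: string; published: boolean }
interface AppCard { slug: string; marketplacePublished: boolean | null }

function decorateApps(
  slugs: string[],
  catalog: CatalogEntry[] | null, // null ⇒ SME catalog unreachable / not deployed
): AppCard[] {
  const bySlug = new Map<string, boolean>();
  for (const e of catalog ?? []) bySlug.set(e.slug, e.published);
  return slugs.map((slug) => {
    const published = bySlug.get(slug);
    return {
      slug,
      // undefined ⇒ slug not in SME catalog ⇒ null ⇒ FE suppresses the chip
      marketplacePublished: published === undefined ? null : published,
    };
  });
}
```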

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by founder asking if PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel.
Answer was: nothing reached anyone except omantel via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the "freeze contabo" intent of the previous comment:
The previous comment said contabo auto-roll on every PR was bad
because PR #975's image broke contabo (k8scache startup loop).
The solution there is to fix the bug in the code, not freeze contabo.
Freezing masked real divergence — the reason the founder caught
this is that manual omantel patches were the only thing keeping
omantel current while contabo + every other fresh Sovereign quietly
ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane

Founder asked: "make the real-time k8s information propagation
development reused — find the reverted prior work and implement the
final working one."

History:
- PR #358 (May 1) shipped the full informer + SSE data plane:
  internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics}
  + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) +
  UI hook lib/useK8sStream.ts + widget useK8sCacheStream.
- PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream
  with kinds=namespace,node,pv,pod,deployment,...,server.hcloud,
  volume.hcloud and `&initialState=1` for live cloud-graph deltas.
- PR #981 hotfix dropped the synchronous discovery probe in
  factory.go:AddCluster (it was calling
  core.Discovery().ServerResourcesForGroupVersion(gv) with NO context
  timeout — on a kubeconfig pointing at a decommissioned otech the
  call hung the catalyst-api startup for minutes per dead cluster).

After #981 the discovery-probe surgery was clean — no follow-up
broke. The data plane code stayed in the codebase. The remaining
gap was operational, not architectural:

  On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>),
  the catalyst-api boots without a posted-back kubeconfig in
  /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns []
  → factory has zero clusters → every
  /api/v1/sovereigns/{depId}/k8s/* request 404s with
  "sovereign \"...\" not registered". The architecture-graph
  in-flight call confirmed live on omantel.biz today.

Fix in this PR:

1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN
   env is set (chroot mode), build a ClusterRef with id resolved from
   CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by
   scanning /var/lib/catalyst/deployments/*.json for a record matching
   the FQDN (mirrors HandleSovereignSelf's store-fallback path for
   consistency). DynamicClient + CoreClient built from
   rest.InClusterConfig(). Append to the cluster list. Mother behavior
   unchanged — SOVEREIGN_FQDN unset → branch is a no-op.

2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide
   get/list/watch on every kind in the k8scache registry (pods,
   deployments, statefulsets, daemonsets, replicasets, services,
   endpointslices, ingresses, configmaps, secrets, persistentvolumes,
   persistentvolumeclaims, hcloud.crossplane.io managed resources,
   vclusters), plus authorization.k8s.io/subjectaccessreviews so the
   per-event SAR gating in the SSE handler doesn't 403 silently.

3. Bump chart 1.4.70 → 1.4.71.

The discovery-probe failure mode that triggered the original revert
(synchronous ServerResourcesForGroupVersion blocking startup) does
NOT recur here — InClusterConfig() returns immediately, NewForConfig
is lazy, and the first network call happens inside the informer
goroutine after Start, off the boot critical path. Mother-side
LoadClustersFromDir behavior is untouched (no probe, just kubeconfig
file parsing as it has been since #981).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): + More popover escapes overflow clip + graph centers via gravity force

Two cloud-page bugs caught live on omantel.biz:

(1) /cloud?view=list&kind=clusters → +More popover non-functional.
    The popover renders at its anchor coords but pointer events pass
    through to the toolbar below it. Diagnosis:
        .cloud-page-toolbar > [data-testid="cloud-kind-chips"] {
          overflow-x: auto;
        }
    Per the CSS Overflow spec, when overflow on one axis is
    non-visible, a `visible` value on the other axis computes to
    `auto`. So overflow-x:auto on the chips strip silently sets
    overflow-y:auto, which clips the absolutely-positioned popover
    that hangs DOWN from the +More button.

    Fix: render the popover via React.createPortal to document.body
    so it's outside any overflow ancestor. Position via fixed
    coordinates computed from the +More button's
    getBoundingClientRect, recomputed on resize/scroll. Click-outside
    dismissal updated to check both wrapper AND portaled popover.
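    The positioning math for the portaled popover is small; a hedged
    sketch (function and field names are illustrative):

```typescript
// The popover lives in document.body (outside every overflow ancestor)
// and is placed with position:fixed coordinates derived from the +More
// button's getBoundingClientRect(), clamped to the viewport.
interface AnchorRect { left: number; bottom: number }

function popoverPosition(
  anchor: AnchorRect,      // +More button's bounding rect
  popWidth: number,        // measured popover width
  viewportWidth: number,
  gap = 4,
): { top: number; left: number } {
  const top = anchor.bottom + gap; // hang below the button
  // Shift left if left-alignment would clip at the viewport edge.
  const left = Math.max(0, Math.min(anchor.left, viewportWidth - popWidth));
  return { top, left };
}
```

    Recomputing this on resize/scroll keeps the fixed-position popover
    glued to its anchor.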

(2) /cloud?view=graph → bubbles drift to canvas edges, leaving the
    centre empty until enough nodes (e.g. worker nodes) are added
    to anchor things via link tension.

    Two coupled root causes:

    a) `forceCenter` only adjusts the centroid — it shifts ALL
       nodes uniformly so their average sits at (cx, cy). It does
       NOT pull individual nodes inward. With small node counts
       and high charge repulsion (-160 for ≤50 nodes), nothing
       opposes outward drift.

    b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x =
       minX`. Nodes that hit the wall get arrested with their
       velocity preserved on the perpendicular axis but no inward
       impulse → they slide along the wall and stack at corners.
       The simulation never relaxes back to the centre.

    Fix:
    a) Add forceX(cx) + forceY(cy) with `centerGravity` strength
       per node-count tier (0.08 for ≤50, scaling down with
       larger graphs where link tension is sufficient). This pulls
       every individual node toward the centre proportional to its
       offset.
    b) Replace the hard clamp with an elastic bounce: when a node
       hits the boundary, reverse its velocity component (×0.4
       damping) instead of zeroing it. Energy returns to the
       system, and the simulation actually relaxes.
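    The bounce in (b) is a few lines; a sketch in the shape of a
    d3-force custom force (names and the node shape are illustrative):

```typescript
// Elastic boundary replacing the hard clamp: on wall contact, place the
// node on the wall and reverse the velocity component with 0.4 damping
// so nodes rebound inward instead of stacking at corners.
interface SimNode { x: number; y: number; vx: number; vy: number }

const DAMPING = 0.4;

function elasticBound(nodes: SimNode[], minX: number, maxX: number,
                      minY: number, maxY: number): void {
  for (const n of nodes) {
    if (n.x < minX) { n.x = minX; n.vx = -n.vx * DAMPING; }
    if (n.x > maxX) { n.x = maxX; n.vx = -n.vx * DAMPING; }
    if (n.y < minY) { n.y = minY; n.vy = -n.vy * DAMPING; }
    if (n.y > maxY) { n.y = maxY; n.vy = -n.vy * DAMPING; }
  }
}
```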

Bump chart 1.4.72 → 1.4.73.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloud): expose all live K8s kinds in +More popover + chip counts + tighter graph centering

Founder feedback (after PR #1062 lit up the data plane):
1. The +More popover was missing pods, deployments, statefulsets,
   daemonsets, configmaps, secrets, namespaces, etc. — it only
   carried the 6 placeholder kinds the legacy topology API knew
   about.
2. Several chips (Services, Ingresses, Storage Classes) showed "—"
   for count even though the data IS in the live cluster (visible in
   the graph view).
3. The graph view still pushed bubbles to canvas edges; only adding
   worker nodes brought things back. The previous gravity tuning
   wasn't strong enough for ~300 nodes.

This PR addresses all three.

(1) Eleven new K8s-backed list pages exposed in +More:
    Pods, Deployments, StatefulSets, DaemonSets, ReplicaSets,
    ConfigMaps, Secrets, Namespaces, Nodes, PersistentVolumes,
    EndpointSlices.
    Plus replaced the placeholder Services and Ingresses pages with
    live K8s tables.

    All built on a new generic K8sListPage that subscribes to
    /api/v1/sovereigns/{depId}/k8s/stream (same SSE channel the
    architecture-graph already uses) and renders a typed-column
    table per kind. Columns are declared once per kind in
    kindsPages.tsx; the rendering is uniform so adding a kind is a
    ~12-line wrapper.

(2) CloudPage.kindCounts now folds the live K8s snapshot into the
    chip-count map. KIND_TO_REGISTRY in kinds.ts maps each chip id
    to the registry kind name (pods → 'pod' etc). Counts that came
    from null (data not available) flip to live counts the moment
    the SSE stream's initialState=1 arrives.
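    The fold itself is a sketchable one-liner per chip — a hedged
    illustration with a trimmed KIND_TO_REGISTRY (the real map in
    kinds.ts carries every chip):

```typescript
// Chip counts start as null ("—") from the legacy topology response and
// flip to live numbers the moment the k8s snapshot arrives.
const KIND_TO_REGISTRY: Record<string, string> = {
  pods: "pod",
  services: "service",
  ingresses: "ingress",
};

function foldKindCounts(
  base: Record<string, number | null>,        // null ⇒ not available yet
  liveByRegistryKind: Record<string, number>, // from the SSE initialState
): Record<string, number | null> {
  const out: Record<string, number | null> = { ...base };
  for (const [chipId, kind] of Object.entries(KIND_TO_REGISTRY)) {
    if (kind in liveByRegistryKind) out[chipId] = liveByRegistryKind[kind];
  }
  return out;
}
```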

(3) GraphCanvas physics retuned for live-data scale:
    - centerGravity: 0.08→0.18 for ≤50 nodes, 0.06→0.16 for ≤200,
      0.04→0.14 for ≤1000, 0.03→0.10 for ≤5000, 0.02→0.08 for >5000.
      The forceX/forceY pulls every individual node toward (cx,cy)
      proportional to its offset — 2-3× stronger than the original
      tuning so the canvas centre stays populated.
    - Charge softened: -160→-90 for ≤50 nodes, scaled down through
      every tier. The previous values were calibrated against a
      ~20-node topology stub; live data delivers 10-50× more nodes
      per Sovereign so charge needs to relax proportionally.
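    The gravity tiers above, as a lookup. These are the post-retune
    strengths the message states; the per-tier charge values (other
    than -90 for ≤50 nodes) aren't spelled out here, so only gravity
    is sketched:

```typescript
// forceX/forceY strength per node-count tier — stronger pull on small
// graphs where link tension alone can't hold the centre.
function centerGravityFor(nodeCount: number): number {
  if (nodeCount <= 50) return 0.18;
  if (nodeCount <= 200) return 0.16;
  if (nodeCount <= 1000) return 0.14;
  if (nodeCount <= 5000) return 0.10;
  return 0.08;
}
```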

Bump chart 1.4.74 → 1.4.75.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:15:25 +04:00
e3mrah
167d09348e
fix(cloud): +More popover escapes overflow clip + graph centers via gravity force (#1063)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back from URL params. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by founder asking if PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel.
Answer was: nothing reached anyone except omantel via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the "freeze contabo" intent of the previous comment:
The previous comment said contabo auto-roll on every PR was bad
because PR #975's image broke contabo (k8scache startup loop).
The solution there is to fix the bug in the code, not freeze contabo.
Freezing masked real divergence — the reason the founder caught
this is that manual omantel patches were the only thing keeping
omantel current while contabo + every other fresh Sovereign quietly
ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane

Founder asked: "make the real-time k8s information propagation
development reused — find the reverted prior work and implement the
final working one."

History:
- PR #358 (May 1) shipped the full informer + SSE data plane:
  internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics}
  + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) +
  UI hook lib/useK8sStream.ts + widget useK8sCacheStream.
- PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream
  with kinds=namespace,node,pv,pod,deployment,...,server.hcloud,
  volume.hcloud and `&initialState=1` for live cloud-graph deltas.
- PR #981 hotfix dropped the synchronous discovery probe in
  factory.go:AddCluster (it was calling
  core.Discovery().ServerResourcesForGroupVersion(gv) with NO context
  timeout — on a kubeconfig pointing at a decommissioned otech the
  call hung the catalyst-api startup for minutes per dead cluster).

After #981 the discovery-probe surgery was clean — no follow-up
broke. The data plane code stayed in the codebase. The remaining
gap was operational, not architectural:

  On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>),
  the catalyst-api boots without a posted-back kubeconfig in
  /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns []
  → factory has zero clusters → every
  /api/v1/sovereigns/{depId}/k8s/* request 404s with
  "sovereign \"...\" not registered". An in-flight
  architecture-graph call confirmed this live on omantel.biz today.

Fix in this PR:

1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN
   env is set (chroot mode), build a ClusterRef with id resolved from
   CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by
   scanning /var/lib/catalyst/deployments/*.json for a record matching
   the FQDN (mirrors HandleSovereignSelf's store-fallback path for
   consistency). DynamicClient + CoreClient built from
   rest.InClusterConfig(). Append to the cluster list. Mother behavior
   unchanged — SOVEREIGN_FQDN unset → branch is a no-op.

2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide
   get/list/watch on every kind in the k8scache registry (pods,
   deployments, statefulsets, daemonsets, replicasets, services,
   endpointslices, ingresses, configmaps, secrets, persistentvolumes,
   persistentvolumeclaims, hcloud.crossplane.io managed resources,
   vclusters), plus authorization.k8s.io/subjectaccessreviews so the
   per-event SAR gating in the SSE handler doesn't 403 silently.

3. Bump chart 1.4.70 → 1.4.71.
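
The id-resolution order in fix 1 can be sketched as follows. Only the env names and the scan directory come from this commit; the JSON field names ("id", "sovereignFqdn") and the function shape are assumptions for illustration (the real code is Go):

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Resolve the chroot's own deployment id: orchestrator-stamped env first,
// else scan persisted deployment records for one matching the FQDN
// (mirroring HandleSovereignSelf's store-fallback path).
function resolveSelfDeploymentId(fqdn: string, dir: string): string | null {
  const stamped = process.env.CATALYST_SELF_DEPLOYMENT_ID;
  if (stamped) return stamped; // fast path
  for (const f of readdirSync(dir).filter((n) => n.endsWith(".json"))) {
    const rec = JSON.parse(readFileSync(join(dir, f), "utf8"));
    if (rec?.sovereignFqdn === fqdn && typeof rec.id === "string") return rec.id;
  }
  return null; // no match — caller skips self-register
}
```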

The discovery-probe failure mode that triggered the original revert
(synchronous ServerResourcesForGroupVersion blocking startup) does
NOT recur here — InClusterConfig() returns immediately, NewForConfig
is lazy, and the first network call happens inside the informer
goroutine after Start, off the boot critical path. Mother-side
LoadClustersFromDir behavior is untouched (no probe, just kubeconfig
file parsing as it has been since #981).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): + More popover escapes overflow clip + graph centers via gravity force

Two cloud-page bugs caught live on omantel.biz:

(1) /cloud?view=list&kind=clusters → +More popover non-functional.
    The popover renders at its anchor coords but pointer events pass
    through to the toolbar below it. Diagnosis:
        .cloud-page-toolbar > [data-testid="cloud-kind-chips"] {
          overflow-x: auto;
        }
    Per the CSS overflow spec, when one axis is given a non-visible
    value, a `visible` value on the other axis computes to `auto`.
    So overflow-x:auto on the chips strip silently sets
    overflow-y:auto, which clips the absolutely-positioned popover
    that hangs DOWN from the +More button.

    Fix: render the popover via React.createPortal to document.body
    so it's outside any overflow ancestor. Position via fixed
    coordinates computed from the +More button's
    getBoundingClientRect, recomputed on resize/scroll. Click-outside
    dismissal updated to check both wrapper AND portaled popover.
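
    A minimal sketch of the fixed-coordinate computation for the
    portaled popover (the 4px gap, the viewport clamp, and all names
    are illustrative, not the shipped component):

```typescript
// Fixed-position coords for a popover portaled to document.body,
// hanging down from the anchor button's bounding rect. Recompute on
// resize/scroll since fixed coords don't track the anchor.
interface AnchorRect { left: number; bottom: number }

function popoverPosition(
  anchor: AnchorRect,
  popoverWidth: number,
  viewportWidth: number,
): { position: "fixed"; top: number; left: number } {
  // Clamp so a popover anchored near the right edge stays on-screen.
  const left = Math.max(0, Math.min(anchor.left, viewportWidth - popoverWidth));
  return { position: "fixed", top: anchor.bottom + 4, left };
}
```

    Because the element lives under document.body with fixed coords,
    no overflow ancestor can clip it.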

(2) /cloud?view=graph → bubbles drift to canvas edges, leaving the
    centre empty until enough nodes (e.g. worker nodes) are added
    to anchor things via link tension.

    Two coupled root causes:

    a) `forceCenter` only adjusts the centroid — it shifts ALL
       nodes uniformly so their average sits at (cx, cy). It does
       NOT pull individual nodes inward. With small node counts
       and high charge repulsion (-160 for ≤50 nodes), nothing
       opposes outward drift.

    b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x =
       minX`. Nodes that hit the wall get arrested with their
       velocity preserved on the perpendicular axis but no inward
       impulse → they slide along the wall and stack at corners.
       The simulation never relaxes back to the centre.

    Fix:
    a) Add forceX(cx) + forceY(cy) with `centerGravity` strength
       per node-count tier (0.08 for ≤50, scaling down with
       larger graphs where link tension is sufficient). This pulls
       every individual node toward the centre proportional to its
       offset.
    b) Replace the hard clamp with an elastic bounce: when a node
       hits the boundary, reverse its velocity component (×0.4
       damping) instead of zeroing it. The node gets an inward
       impulse, so the simulation actually relaxes back toward
       the centre instead of stacking along the walls.
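
    The bounce in fix (b) can be sketched as a custom d3-style force.
    The node shape and the 0.4 damping factor follow this commit; the
    rest is illustrative, not the shipped makeForceBound:

```typescript
// Elastic-bounce boundary force replacing the hard clamp: on wall
// contact, snap the node back inside the bounds and send its velocity
// component inward at 0.4× magnitude, so it rebounds toward the centre
// instead of sliding along the wall and stacking at corners.
interface SimNode { x: number; y: number; vx: number; vy: number }

function makeForceBound(minX: number, maxX: number, minY: number, maxY: number) {
  const damping = 0.4;
  return (nodes: SimNode[]) => {
    for (const n of nodes) {
      if (n.x < minX) { n.x = minX; n.vx = Math.abs(n.vx) * damping; }
      else if (n.x > maxX) { n.x = maxX; n.vx = -Math.abs(n.vx) * damping; }
      if (n.y < minY) { n.y = minY; n.vy = Math.abs(n.vy) * damping; }
      else if (n.y > maxY) { n.y = maxY; n.vy = -Math.abs(n.vy) * damping; }
    }
  };
}
```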

Bump chart 1.4.72 → 1.4.73.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:51:07 +04:00
e3mrah
2ad31b4481
feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes real-time data plane (#1062)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.
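
The detection predicate is small enough to sketch. Env and field names are from this commit; the function shape is illustrative (the real code is Go):

```typescript
// Chroot iff SOVEREIGN_FQDN is set AND matches the deployment's
// requested FQDN. Mother: env unset → always false → unchanged behavior.
function useInClusterClient(depSovereignFqdn: string): boolean {
  const env = process.env.SOVEREIGN_FQDN;
  return env !== undefined && env !== "" && env === depSovereignFqdn;
}
```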

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.
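
   The guard's control flow reduces to the sketch below. The store and
   lister interfaces are assumed shapes (the real code is Go against
   jobs.Store and helmwatch); only the env gating and the empty-store
   check come from this commit. The lister is sync here for brevity —
   the described implementation does a one-shot HelmRelease list call:

```typescript
interface JobsStore {
  count(depId: string): number;
  seed(depId: string, jobs: unknown[]): void;
}

// Lazy handler-time seed: only on the chroot, only while the
// per-deployment store is empty. Later calls read the populated store.
function chrootSeedJobsStoreIfEmpty(
  store: JobsStore,
  depId: string,
  depSovereignFqdn: string,
  listHelmReleases: () => unknown[],
): void {
  if (process.env.SOVEREIGN_FQDN !== depSovereignFqdn) return; // mother: no-op
  if (store.count(depId) > 0) return; // already seeded
  store.seed(depId, listHelmReleases()); // one-shot list, no informer spin-up
}
```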

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik won't decode %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
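
Fix 1's prefix strip, sketched (the id format is from this commit; the function name is assumed):

```typescript
// Strip the "<deploymentId>:" prefix before encoding so the emitted
// URL never contains %3A, which a 3986-strict proxy won't decode.
function jobPath(deploymentId: string, jobId: string): string {
  const bare = jobId.startsWith(`${deploymentId}:`)
    ? jobId.slice(deploymentId.length + 1)
    : jobId;
  return `/jobs/${encodeURIComponent(bare)}`;
}

// jobPath("69e73b3abe673840", "69e73b3abe673840:install-keycloak")
//   → "/jobs/install-keycloak"
```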

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back from URL params. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.
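
  The slug-join amounts to the sketch below (shapes assumed; the
  null ⇒ no-chip rule is from this commit):

```typescript
// Join marketplacePublished onto each app by slug. null ⇒ slug not in
// the SME catalog, or the catalog is unreachable (marketplace tier not
// deployed) — the frontend suppresses the chip in both cases.
interface SovereignApp { slug: string; marketplacePublished: boolean | null }

function decorateWithPublishState(
  slugs: string[],
  catalog: Map<string, boolean> | null, // null ⇒ SME services tier absent
): SovereignApp[] {
  return slugs.map((slug) => ({
    slug,
    marketplacePublished:
      catalog !== null && catalog.has(slug) ? catalog.get(slug)! : null,
  }));
}
```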

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by founder asking if PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel.
Answer was: nothing reached anyone except omantel via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the "freeze contabo" intent of the previous comment:
The previous comment said contabo auto-roll on every PR was bad
because PR #975's image broke contabo (k8scache startup loop).
Solution there is: fix the bug in the code, not freeze contabo.
Freezing masked real divergence — the reason the founder caught
this is that manual omantel patches were the only thing keeping
omantel current while contabo + every other fresh Sovereign quietly
ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane

Founder asked: "make the real-time k8s information propagation
development reused — find the reverted prior work and implement the
final working one."

History:
- PR #358 (May 1) shipped the full informer + SSE data plane:
  internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics}
  + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) +
  UI hook lib/useK8sStream.ts + widget useK8sCacheStream.
- PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream
  with kinds=namespace,node,pv,pod,deployment,...,server.hcloud,
  volume.hcloud and `&initialState=1` for live cloud-graph deltas.
- PR #981 hotfix dropped the synchronous discovery probe in
  factory.go:AddCluster (it was calling
  core.Discovery().ServerResourcesForGroupVersion(gv) with NO context
  timeout — on a kubeconfig pointing at a decommissioned otech the
  call hung the catalyst-api startup for minutes per dead cluster).

After #981 the discovery-probe surgery was clean — no follow-up
broke. The data plane code stayed in the codebase. The remaining
gap was operational, not architectural:

  On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>),
  the catalyst-api boots without a posted-back kubeconfig in
  /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns []
  → factory has zero clusters → every
  /api/v1/sovereigns/{depId}/k8s/* request 404s with
  "sovereign \"...\" not registered". An in-flight
  architecture-graph call confirmed this live on omantel.biz today.

Fix in this PR:

1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN
   env is set (chroot mode), build a ClusterRef with id resolved from
   CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by
   scanning /var/lib/catalyst/deployments/*.json for a record matching
   the FQDN (mirrors HandleSovereignSelf's store-fallback path for
   consistency). DynamicClient + CoreClient built from
   rest.InClusterConfig(). Append to the cluster list. Mother behavior
   unchanged — SOVEREIGN_FQDN unset → branch is a no-op.

2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide
   get/list/watch on every kind in the k8scache registry (pods,
   deployments, statefulsets, daemonsets, replicasets, services,
   endpointslices, ingresses, configmaps, secrets, persistentvolumes,
   persistentvolumeclaims, hcloud.crossplane.io managed resources,
   vclusters), plus authorization.k8s.io/subjectaccessreviews so the
   per-event SAR gating in the SSE handler doesn't 403 silently.

3. Bump chart 1.4.70 → 1.4.71.

The discovery-probe failure mode that triggered the original revert
(synchronous ServerResourcesForGroupVersion blocking startup) does
NOT recur here — InClusterConfig() returns immediately, NewForConfig
is lazy, and the first network call happens inside the informer
goroutine after Start, off the boot critical path. Mother-side
LoadClustersFromDir behavior is untouched (no probe, just kubeconfig
file parsing as it has been since #981).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:26:59 +04:00
e3mrah
eb6a3c1812
fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik won't decode %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
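The prefix-strip in fix 1 is language-agnostic; here it is as a Go sketch (the real code lives in the TypeScript useJobLinkBuilder hook, and `bareJobName` is an illustrative name):

```go
package main

import (
	"fmt"
	"strings"
)

// bareJobName drops the "<deploymentId>:" prefix from a canonical Job id
// so the resulting path segment never contains a colon — and therefore
// never needs the %3A percent-encoding that a strict upstream proxy
// refuses to decode inside path segments.
func bareJobName(canonicalID string) string {
	if i := strings.IndexByte(canonicalID, ':'); i >= 0 {
		return canonicalID[i+1:]
	}
	return canonicalID // already bare
}

func main() {
	fmt.Println(bareJobName("69e73b3abe673840:install-keycloak")) // prints install-keycloak
	fmt.Println(bareJobName("install-keycloak"))                  // prints install-keycloak
}
```

This works only because the server side is tolerant: Store.GetJob accepts both the bare jobName and the canonical id, so emitting the bare form loses nothing.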

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() takes the
deploymentId URL param when present and otherwise reads the JWT
cookie's deployment_id claim via /api/v1/sovereign/self. The topology
query also gates on `!!deploymentId` so it doesn't waste a 404
round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by the founder asking whether PRs #1051..#1059 reach
NEW Sovereigns or just my manual `kubectl set image` patches on
omantel. The answer: nothing reached anyone except omantel, and only
via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8 — the
SHA frozen at PR #1040's last manual chart-touch on May 6 morning.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
  LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
  not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both ended up frozen at whatever literal was committed at the last
  manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
   template files AND the unused-but-kept-for-SME-services
   values.yaml. Sed-patches the literal directly so contabo's Kustomize
   path keeps working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to <sha>"
   commit propagates to contabo (10-min reconcile) AND Sovereigns
   (next OCI chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with
   the latest literal (currently :8361df4) gets republished and
   pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the previous comment's "freeze contabo" intent:
that comment argued contabo auto-roll on every PR was bad because
PR #975's image broke contabo (k8scache startup loop). The right
answer is to fix the bug in the code, not to freeze contabo.
Freezing masked real divergence — the founder caught this precisely
because manual omantel patches were the only thing keeping omantel
current while contabo and every other fresh Sovereign quietly ran
9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:10:31 +04:00
e3mrah
8361df46ac
feat(apps): publish chip on each card — replaces deleted /catalog page (#1059)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() takes the
deploymentId URL param when present and otherwise reads the JWT
cookie's deployment_id claim via /api/v1/sovereign/self. The topology
query also gates on `!!deploymentId` so it doesn't waste a 404
round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" — drop
the standalone /catalog page (#1058), put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache,
  1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier
  not deployed on this Sovereign — common when marketplace.enabled
  is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
  /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME
  catalog. Surfaces upstream status verbatim. Invalidates the cache
  so the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of
  the bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click →
  PATCH → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil →
  no chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
  catalog unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 20:43:59 +04:00
e3mrah
aed0a81f75
fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page (#1058)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() takes the
deploymentId URL param when present and otherwise reads the JWT
cookie's deployment_id claim via /api/v1/sovereign/self. The topology
query also gates on `!!deploymentId` so it doesn't waste a 404
round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header — visible look-and-feel break.

  /settings/marketplace   → MarketplaceSettings  (wrapped in PortalShell)
  /parent-domains         → ParentDomainsPage    (wrapped in PortalShell)
  /catalog                → CatalogAdminPage     (deleted)

Drop /catalog entirely per founder direction: a separate page just
to flip a "publish to marketplace" boolean per app is the wrong
shape. The natural place for that toggle is on each /apps card
(future PR — needs HandleSovereignApps to join publish state from
the SME catalog microservice). Removed:
  - /catalog route registration in router.tsx
  - 'Catalog' entry in SovereignSidebar's FLAT_NAV
  - CatalogAdminPage.tsx (525 lines)
  - 'catalog' from ActiveSection union + deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 20:28:11 +04:00
e3mrah
8c8ccfbfed
fix(chroot): single chrome — no frame in frame, no mother handover banner (#1057)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.
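The seed-if-empty gating in point 2 can be sketched as follows. `jobStore` and `seedIfEmpty` are hypothetical stand-ins for the per-deployment jobs.Store and `chrootSeedJobsStoreIfEmpty`; `snapshot` stands in for the one-shot HelmRelease list (helmwatch.ListAndSnapshotHelmReleases → snapshotsToSeeds → Bridge.SeedJobsFromInformerList):

```go
package main

import "fmt"

// jobStore is a stand-in for the per-deployment jobs.Store.
type jobStore struct{ jobs []string }

// seedIfEmpty mirrors the handler-time gate described above: only when this
// binary serves its own Sovereign (SOVEREIGN_FQDN matches) AND the store has
// 0 records does the one-shot snapshot run. Once seeded, reads go straight
// to the store and the snapshot path is never taken again.
func (s *jobStore) seedIfEmpty(onSovereign bool, snapshot func() []string) {
	if !onSovereign || len(s.jobs) > 0 {
		return // mother, or already seeded: serve the rich store as-is
	}
	s.jobs = snapshot()
}

func main() {
	s := &jobStore{}
	calls := 0
	snap := func() []string {
		calls++
		return []string{"install-keycloak", "install-traefik"}
	}
	s.seedIfEmpty(true, snap)
	s.seedIfEmpty(true, snap) // no-op: store already populated
	fmt.Println(len(s.jobs), calls)
}
```

The cheap length check up front is what keeps this safe to run on every request: after the first seed it costs one comparison, no informer, no extra API traffic.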

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.
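Both coupled fixes can be expressed in Go terms (the link builder itself is frontend code; `jobLinkPath` and `lookupJob` here are illustrative helpers, not the real functions):

```go
package main

import (
	"fmt"
	"strings"
)

// jobLinkPath sketches fix 1: strip the "<deploymentId>:" prefix so the path
// segment never contains a colon that an RFC 3986 §3.3-strict proxy would
// leave percent-encoded as %3A.
func jobLinkPath(canonicalID, deploymentID string) string {
	return "/jobs/" + strings.TrimPrefix(canonicalID, deploymentID+":")
}

// lookupJob sketches fix 2: the index carries BOTH the canonical
// "<deploymentId>:<jobName>" id and the bare jobName as keys, so the URL
// param resolves regardless of which form the link emitted.
func lookupJob(index map[string]string, key string) (string, bool) {
	v, ok := index[key]
	return v, ok
}

func main() {
	dep := "69e73b3abe673840"
	canonical := dep + ":install-keycloak"
	fmt.Println(jobLinkPath(canonical, dep))

	index := map[string]string{
		canonical:          "keycloak job",
		"install-keycloak": "keycloak job", // bare-name alias
	}
	_, ok := lookupJob(index, "install-keycloak")
	fmt.Println(ok)
}
```

Because Store.GetJob already accepts both forms server-side, the bare-name URL is backward-compatible with links emitted before this change.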

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self when no
deploymentId URL param is present. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the
chroot Sovereign Console at console.<sov-fqdn>:

1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
   SovereignConsoleLayout rendered its own sidebar+header AND the page
   inside rendered PortalShell which rendered ANOTHER header (its
   sidebar was already skipped for chroot per a prior fix). User saw
   two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders <Outlet/> with NO chrome. PortalShell is now the single
   chrome owner on both surfaces:
     - Mother (/sovereign/provision/$id): renders Sidebar with
       /provision/$id/X URLs + its header.
     - Chroot (console.<sov-fqdn>):       renders SovereignSidebar
       with clean /X URLs + the same header.
   One sidebar, one header, byte-identical to mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration that
   tells the operator "you can now jump to your new Sovereign". On
   the chroot the operator IS already on the Sovereign Console; the
   banner bleeds through because the imported deployment record
   carries the mother's handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. Chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 20:05:15 +04:00
e3mrah
933b321890
fix(cloud): resolve deploymentId from cookie on chroot (#1056)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined

CloudPage's topology query fired against /deployments/undefined/...
on the chroot (URL is /cloud, no deploymentId path segment), so the
page showed "Couldn't load architecture" with all node counts at 0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self when no
deploymentId URL param is present. Topology query also gates on `!!deploymentId`
so it doesn't waste a 404 round-trip during cookie resolution.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:12:50 +04:00
e3mrah
fb7cfbcf8e
fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s (#1055)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw
the literal "%3A" and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding,
   producing /jobs/install-keycloak (Traefik-safe) instead of
   /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already
   accepts both bare jobName and canonical id (see store.go:781-789).

2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
   the URL param resolves regardless of which format the link emitted.

Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:05:12 +04:00
e3mrah
ee8d2e2b0e
fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store, single endpoint (#1054)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api
binary as the mother. When that binary runs ON the Sovereign cluster
(catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs
to talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back" — including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.

Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.

Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call
   was returning 403 from the apiserver because the SA had no rule
   covering this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself
   is not installed (apierrors.IsNotFound). The access.openova.io
   CRD ships via a separate blueprint that may not yet be installed
   on a fresh Sovereign — the page should render its empty state,
   not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent).
Both now resolve to a 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when
   the topology query errored — because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/`. That is
   a test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no
   placeholder data — empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state
   render. The mother shows the same empty state when its loader
   returns nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green-grid because the chroot
   was hitting `/api/v1/sovereign/jobs` which returns a minimal shape
   (no dependsOn, no parentId, no exec records). Mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild only ships the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per-
   deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here, mirrors Watcher.SnapshotComponents without
   spinning up an informer), pass through snapshotsToSeeds +
   Bridge.SeedJobsFromInformerList. Subsequent calls read directly
   from the now-populated store and return rich Job records with
   dependsOn / parentId / status — exactly like the mother.

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI
   uses the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 18:57:01 +04:00
e3mrah
9ec32e3311
fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 (#1051)
PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still
referenced those handler methods. The catalyst-api build for the merged
revert (run 25439549879) failed with:

  cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
  cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
  cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
  cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.

Also revert two more parallel-baby fragments still on main:
  - getHierarchicalInfrastructure mode-aware fetcher → single mother
    URL (the chroot resolves deploymentId from the cookie and the
    mother-side topology handler serves byte-identical data once
    cutover-import has persisted the deployment record on the
    Sovereign's local store)
  - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere

Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 18:00:41 +04:00
e3mrah
366395c9d1
fix(graphcanvas): defensive label render + adapter never-undefined labels (#1049)
Crash on omantel.biz /cloud: 'TypeError: Cannot read properties of
undefined (reading length)' at GraphCanvas line 975 — n.label was
undefined when adapter produced a Region node from a topology where
region.name was empty AND region.providerRegion was undefined
(legacy mother-side adapter assumed both were populated).

Two-layer fix:
  1. GraphCanvas — coerce label to '' before .length / .slice.
  2. adapter.ts — addRegion / addCluster fall back to id then a
     literal placeholder so the produced node always has a non-
     empty label.
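
The two layers can be sketched together as one guard (a minimal TypeScript sketch; the function name and truncation threshold are assumptions, not the shipped GraphCanvas/adapter code):

```typescript
// Hypothetical sketch of the two-layer label guard; names are illustrative.
function safeLabel(label?: string, id?: string): string {
  // Layer 2 (adapter): fall back to the node id, then a literal
  // placeholder, so the produced node always carries a non-empty label.
  const text = label || id || "(unnamed)";
  // Layer 1 (GraphCanvas): text is now a string before .length / .slice.
  return text.length > 12 ? text.slice(0, 12) + "…" : text;
}
```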

Bumps bp-catalyst-platform 1.4.54 → 1.4.55.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 17:27:24 +04:00
e3mrah
959879a7e4
fix(architecture-graph): try/catch hierarchyToGraph + k8sToGraph (#1048)
The Sovereign-mode /api/v1/sovereign/topology shape lacks some fields
the legacy hierarchyToGraph adapter dereferences (skuCp, skuWorker,
providerRegion etc.). Wrap both adapter calls in try/catch so a
missing field falls through to an empty graph rather than crashing
the entire /cloud page via the React error boundary. Caught on
omantel.biz 2026-05-06.
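
The wrapper shape, as a hedged sketch (Graph type and adapter signature are assumptions, not the actual hierarchyToGraph/k8sToGraph contracts):

```typescript
// Minimal sketch of the try/catch wrapper around an adapter call.
type Graph = { nodes: unknown[]; edges: unknown[] };
const EMPTY_GRAPH: Graph = { nodes: [], edges: [] };

function safeToGraph(topology: unknown, adapter: (t: unknown) => Graph): Graph {
  try {
    return adapter(topology);
  } catch {
    // A missing field now yields an empty graph instead of letting the
    // React error boundary take down the whole /cloud page.
    return EMPTY_GRAPH;
  }
}
```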

Bumps bp-catalyst-platform 1.4.53 → 1.4.54.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 17:20:31 +04:00
e3mrah
28d2cf17df
fix(cloud-page): defensive normalize + try/catch fallback to empty topology (#1047)
CloudPage threw 'Cannot read properties of undefined (reading length)'
on omantel.biz because the Sovereign-mode topology shape carried
slimmer fields than the wizard mother-side shape (region.id/name
empty, node.region missing, etc). Add per-field nullish defaults at
each level of the normalize + a try/catch fallback that renders an
empty topology instead of crashing the entire page via the React
error boundary.

Bumps bp-catalyst-platform 1.4.52 → 1.4.53.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 17:14:39 +04:00
e3mrah
862c77be1b
fix(jobs/jobdetail): URL-encode multi-segment live job ids + strict:false params (#1046)
The live /api/v1/sovereign/jobs endpoint returns job ids like
'job/syft-grype/syft-grype-bp-syft-grype-29633910' that contain '/'.
tan-stack's '/jobs/$jobId' route matches a single segment so links
to multi-segment ids 404'd. Encode the id in the link builder + decode
in JobDetail.
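
The encode/decode pair, sketched (route prefix illustrative; the live id below is the one from this commit):

```typescript
// '/' becomes %2F, so the whole id stays one path segment.
const toJobLink = (jobId: string): string =>
  `/jobs/${encodeURIComponent(jobId)}`;

// JobDetail side: recover the original id from the route param.
const jobIdFromParam = (param: string): string => decodeURIComponent(param);
```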

Also switches JobDetail's strict-mode useParams (the
'/provision/$deploymentId/jobs/$jobId' from-clause) to strict:false +
useResolvedDeploymentId fallback so it works on the chroot Sovereign
route too. Caught on omantel.biz 2026-05-06.

Bumps bp-catalyst-platform 1.4.51 → 1.4.52.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 17:10:10 +04:00
e3mrah
fe4aa109d5
fix(sovereign-topology): return CloudSpec[] not object — CloudPage iterates (#1045)
CloudPage threw 'TypeError: e.cloud is not iterable' on omantel.biz
because /api/v1/sovereign/topology returned cloud as a JSON object
{provider, providerRegion} but the UI's HierarchicalInfrastructure
contract is cloud: CloudSpec[] (CloudPage runs for-of and useMemo
over it). Fixed: shape cloud as a single-element array of CloudSpec
(id/name/provider/regionCount/quotaUsed/quotaLimit) and add the
missing storage block (storageClasses/pools/volumes/buckets) the
UI also expects.
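
The reshape, as a sketch (CloudSpec fields are the ones named in this commit; id/count/quota derivations are assumptions):

```typescript
// Hypothetical sketch of shaping the object response into CloudSpec[].
interface CloudSpec {
  id: string; name: string; provider: string;
  regionCount: number; quotaUsed: number; quotaLimit: number;
}

function shapeCloud(raw: { provider: string; providerRegion: string }): CloudSpec[] {
  // CloudPage runs for-of / useMemo over `cloud`, so return a
  // single-element array rather than a bare object.
  return [{
    id: raw.provider,
    name: raw.providerRegion,
    provider: raw.provider,
    regionCount: 1,
    quotaUsed: 0,
    quotaLimit: 0,
  }];
}
```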

Bumps bp-catalyst-platform 1.4.50 → 1.4.51.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 17:07:55 +04:00
e3mrah
15ae8796bc
fix(sovereign-console): close DoD gaps — Invariant + missing endpoints + chroot fetchers (#1044)
This is the comprehensive fix for the chroot Sovereign Console DoD
gaps caught on omantel.biz 2026-05-06. Eight pages were broken with
"Something went wrong!" / "Invariant failed" / "Couldn't load" /
"Not Found"; root causes traced to (a) /api/v1/sovereign/self
returning 503 because env vars weren't populated post-handover,
(b) several Sovereign endpoints (/users, /catalog, /settings,
/topology) didn't exist server-side, and (c) several pages used
strict-mode useParams against the mother-side /provision/$id/...
route which throws Invariant on the chroot /apps, /users, /settings,
/app/$id routes.

Server changes:
  - auth.Claims gains SovereignFQDN + DeploymentID fields.
  - auth_handover.go authHandoverClaims gains the same; the minted
    Sovereign session JWT now carries them so downstream handlers
    can resolve identity without env or store-fallback.
  - sovereign_self.go reads sovereign_fqdn / deployment_id from the
    catalyst_session cookie payload (best-effort base64 decode; no
    signature check needed since this catalyst-api minted the cookie
    in the first place). Resolution order: env → cookie → store →
    503/404.
  - new handlers in sovereign_more.go:
      GET /api/v1/sovereign/users     — Keycloak realm users
      GET /api/v1/sovereign/catalog   — embedded blueprints catalog
      GET /api/v1/sovereign/settings  — tenant identity + features
      GET /api/v1/sovereign/topology  — hierarchical infra view
        for CloudPage's getHierarchicalInfrastructure()
    All return well-shaped empty responses on any error (no 500s
    that bubble into UI error boundaries).

UI changes:
  - SettingsPage / AppDetail / UserAccessListPage replace strict-mode
    useParams({ from: '/provision/$deploymentId/...' }) with
    useParams({ strict: false }) + useResolvedDeploymentId() fall-
    back. Now works on BOTH the mother route AND the chroot
    Sovereign route without throwing Invariant.
  - CatalogAdminPage's fetchApps swaps /catalog/apps → /api/v1/
    sovereign/catalog when window.location.hostname is not
    console.openova.io.
  - getHierarchicalInfrastructure (CloudPage's source) swaps
    /api/v1/deployments/{id}/infrastructure/topology → /api/v1/
    sovereign/topology under the same chroot guard.

Bumps bp-catalyst-platform 1.4.49 → 1.4.50.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 16:58:00 +04:00
e3mrah
68e61eb306
fix(jobs): coerce Sovereign live response into full Job shape (#1042)
The /api/v1/sovereign/jobs endpoint returns a minimal shape
{id, name, namespace, kind, status, startedAt, finishedAt} — no
appId, parentId, dependsOn, childIds. JobsTable iterates
`for (const d of job.dependsOn)` and reads
`job.appId.toLowerCase()` etc., which throws TypeError
'Cannot read properties of undefined (reading length)' and
breaks page render entirely (0 rows shown).

Coerce missing fields to safe defaults in defaultFetchJobs so
the table renders. Followup: server-side handler should return
the full Job shape with empty arrays for missing fields.
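
The coercion, sketched (fields are the ones named in this commit; the exact UI Job type is an assumption):

```typescript
// Minimal sketch of defaulting the minimal server shape to a full Job.
interface Job {
  id: string; name: string; status: string;
  appId: string; parentId: string | null;
  dependsOn: string[]; childIds: string[];
}

function coerceJob(raw: Partial<Job> & { id: string }): Job {
  return {
    id: raw.id,
    name: raw.name ?? raw.id,
    status: raw.status ?? "pending",
    appId: raw.appId ?? "",         // .toLowerCase() is now safe
    parentId: raw.parentId ?? null,
    dependsOn: raw.dependsOn ?? [], // for-of / .length are now safe
    childIds: raw.childIds ?? [],
  };
}
```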

Bumps bp-catalyst-platform 1.4.48 → 1.4.49.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 10:20:12 +04:00
e3mrah
8638613225
fix(useLiveJobsBackfill): enable query on Sovereign mode even when deploymentId empty (#1041)
The useLiveJobsBackfill hook gates with `enabled: enabled && !!deploymentId`.
On chroot Sovereign Console where /sovereign/self returns 503
(deployment-id-not-yet-stamped) and the route doesn't carry an
:deploymentId param, deploymentId is the empty string and the query
NEVER mounts. Live jobs always remained empty, mergeJobs fell
through to reducer-derived imported snapshot (every job pinned at
'pending').

Fix: when DETECTED_MODE.mode === 'sovereign', enable the query
regardless of deploymentId emptiness. The URL is FQDN-scoped via
the session cookie, no deploymentId needed in the path.
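
The widened gate, as a sketch (DETECTED_MODE's shape here is an assumption):

```typescript
// Hypothetical distillation of the old vs. new `enabled` gate.
type DetectedMode = { mode: "mother" | "sovereign" };

function liveJobsQueryEnabled(
  enabled: boolean,
  deploymentId: string,
  detected: DetectedMode,
): boolean {
  // Old gate: enabled && !!deploymentId — never mounts on a chroot
  // route that carries no :deploymentId param.
  // New gate: Sovereign mode bypasses the deploymentId check.
  return enabled && (detected.mode === "sovereign" || !!deploymentId);
}
```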

Bumps bp-catalyst-platform 1.4.47 → 1.4.48.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 10:16:36 +04:00
e3mrah
6f64753ea9
fix(cloud-page): defensive slice guard + bump chart 1.4.47 with literal :2122fb8 (#1040)
CloudPage's switcher rendered `d.id.slice(0, 8)` without a nullish
guard. When listDeployments returns an entry with undefined id (e.g.
malformed/legacy record), this throws TypeError 'Cannot read
properties of undefined (reading slice)' which the React error
boundary catches as 'Invariant failed', breaking all of /cloud.
Caught on omantel.biz 2026-05-06.
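
The guard is one nullish coalesce (sketch; helper name illustrative):

```typescript
// Empty string instead of a TypeError when a legacy record has no id.
const shortId = (d: { id?: string }): string => (d.id ?? "").slice(0, 8);
```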

Also bumps the literal :91eeeed → :2122fb8 in api-deployment.yaml /
ui-deployment.yaml so freshly provisioned Sovereigns pick up the
JobsPage+AppsPage live-status fix from PR #1039 (chart 1.4.46's
values.yaml had :2122fb8 but the templated literals didn't).

Bumps bp-catalyst-platform 1.4.46 → 1.4.47.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 09:57:20 +04:00
e3mrah
2122fb81c0
fix(sovereign-console): jobs + apps pages show LIVE status (not imported snapshot Pending) (#1039)
Symptom on omantel.biz 2026-05-06: every job and every app on the
Sovereign Console showed "Pending" forever, even when the underlying
HelmReleases were Ready=True and the cluster was fully operational.

Root cause:
- JobsPage's useLiveJobsBackfill was gated by `inFlight =
  streamStatus !== 'completed' && streamStatus !== 'failed'`. The
  imported snapshot the mother POSTs at handover ALWAYS arrives with
  streamStatus="completed" (mother considered phase-1 done before
  firing the JWT). So inFlight=false and disablePolling=true on
  Sovereign mode → liveJobs.length=0 → mergeJobs returns the
  reducer-derived imported snapshot (every job pinned at "pending").
- AppsPage read `state.apps[id].status` from the same imported
  reducer state. No live-status overlay.

Fix:
- JobsPage: bypass the inFlight gate when DETECTED_MODE.mode ===
  'sovereign'. Live polling /api/v1/sovereign/jobs is the
  authoritative source on chroot Sovereign Console.
- AppsPage: add a useQuery polling /api/v1/sovereign/apps every 5s
  on Sovereign mode, mapping the server's status enum
  (installed | installing | bootstrap | available) to the UI's
  ApplicationStatus vocabulary, and overlay it on top of the
  reducer-derived status.
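
The overlay mapping, sketched (the server enum is from this commit; the UI-side ApplicationStatus members are assumptions):

```typescript
// Hypothetical mapping from the server's status enum to the UI vocabulary,
// with live status winning over the imported-snapshot status.
type ServerStatus = "installed" | "installing" | "bootstrap" | "available";
type ApplicationStatus = "ready" | "installing" | "pending" | "available";

const STATUS_MAP: Record<ServerStatus, ApplicationStatus> = {
  installed: "ready",
  installing: "installing",
  bootstrap: "pending",
  available: "available",
};

const overlayStatus = (
  imported: ApplicationStatus,
  live?: ServerStatus,
): ApplicationStatus => (live ? STATUS_MAP[live] : imported);
```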

Bumps bp-catalyst-platform 1.4.45 → 1.4.46.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 09:51:17 +04:00
e3mrah
838094348a
fix(rbac): grant catalyst-api SA cluster reads for /sovereign/cloud + /apps (#1038)
The Sovereign Console's chroot /cloud and /apps panes back onto
HandleSovereignCloud / HandleSovereignApps in catalyst-api, which
use the in-cluster client to enumerate cluster-wide K8s resources
(Nodes, Namespaces, Services, PVCs, StorageClasses, Ingresses,
HTTPRoutes, HelmReleases). The pre-existing ClusterRole only
covered the cutover-step Job-driving verbs (configmaps/jobs/pods).
Caught on otech130 2026-05-06: /api/v1/sovereign/cloud returned
{nodes:[], namespaces:[], …} because every List call hit a silent
apiserver Forbidden, and the handler's err branch falls through
to an empty response shape.

Adds get/list/watch on:
- core: nodes, namespaces, services, persistentvolumes,
  persistentvolumeclaims
- networking.k8s.io: ingresses
- gateway.networking.k8s.io: httproutes, gateways
- storage.k8s.io: storageclasses
- helm.toolkit.fluxcd.io: helmreleases

Bumps bp-catalyst-platform 1.4.44 → 1.4.45.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 04:20:47 +04:00
e3mrah
d2ca2d492b
chore(bp-catalyst-platform): bump 1.4.43 → 1.4.44 + literal :ff864e9 → :91eeeed (#1032 PortalShell sidebar fix) (#1037)
Chart 1.4.43 was built before PR #1032 bumped chart Chart.yaml in
the same commit, so its values.yaml had tag :91eeeed but the
hardcoded image refs in templates/api-deployment.yaml and
templates/ui-deployment.yaml stayed at :ff864e9 (the previous
bump from PR #1030). Sovereigns provisioned with chart 1.4.43
therefore still have the duplicate-sidebar bug — caught on
otech129 2026-05-05.

This bump pins the literal refs to :91eeeed, which is PR #1032's
commit SHA. Bootstrap-kit pin moves 1.4.43 → 1.4.44 so otech130+
get the PortalShell skip-inner-Sidebar logic.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 04:03:15 +04:00
e3mrah
fc36731b4a
chore(bootstrap-kit): pin bp-catalyst-platform 1.4.41 → 1.4.43 (PR #1032 PortalShell sidebar fix) (#1035)
PR #1032's sed target was '1.4.42' but the in-tree pin was still
1.4.41 (chart Chart.yaml had been bumped 1.4.42 by the deploy job
but the bootstrap-kit YAML file pinning the chart version for
freshly provisioned Sovereigns was untouched). Picked up live on
otech128 2026-05-05 — it provisioned with chart 1.4.41 and still
exhibited the duplicate sidebar bug PR #1032 was meant to fix.
This commit bumps the pin so otech129+ get chart 1.4.43.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 03:32:04 +04:00
e3mrah
a6fb97f2ef
fix(cutover step-01): clone+push (regular repo) instead of pull-mirror (#1033)
PR #1029 added a step-06 PATCH to flip mirror=false before push so
the cutover-helmrepository-patches Job could write HelmRepository
URL pivots to local Gitea. On Gitea 1.22.3 the PATCH returns 200
but silently no-ops — `mirror_interval` updates but `mirror: true`
stays. The repo remains read-only and step-06 still hits HTTP 403
"remote: mirror repository is read-only". Reproduced on otech127
2026-05-05 with chart 0.1.22 deployed.

Per ADR (cutover ends upstream tracking — Sovereign goes
self-hosted from this point), the architecturally correct fix is
to never create the mirror in the first place. Step-01 now creates
a regular Gitea repo and bare-clones+pushes upstream content. All
refs (branches+tags) replicate via `git push --mirror --force`,
which is idempotent on re-runs.

Trade-off: post-cutover Sovereigns no longer auto-sync from
upstream — that's the intended cutover semantics anyway. Operator
re-runs this Job manually for chart rollouts (next-session
follow-up: dedicated post-cutover sync mechanism, perhaps a
periodic CronJob the operator can opt into).

Bumps:
- bp-self-sovereign-cutover chart 0.1.22 → 0.1.23
- bootstrap-kit pin 0.1.22 → 0.1.23

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 03:19:05 +04:00
e3mrah
a070808eda
fix(cutover step-06): convert pull-mirror to standalone before pushing patches (#1029)
Step-01 creates openova/openova on the Sovereign's local Gitea as a
pull mirror so it tracks upstream openova-public during early
bootstrap. After cutover, the Sovereign is self-hosted and MUST
diverge from upstream — but Gitea blocks pushes to a mirror with
HTTP 403 "remote: mirror repository is read-only".

Step-06 adds a Phase-1.5 PATCH /api/v1/repos/{owner}/{repo}
{"mirror": false, "mirror_interval": "0"} BEFORE attempting to
clone+push the HelmRepository URL pivot. This converts the
pull-mirror into a standalone writable repo — the way the post-
cutover Sovereign architecture expects it.

Caught on otech125 2026-05-05: cutover-helmrepository-patches Job
returned "FATAL: git push failed" with no upstream stderr (chart
0.1.20 lacks the printf '%s\n' "$push_err" fix from PR #1022, which
was published in 0.1.21 only). Reproduced by cloning openova/openova
from a debug pod and running git push: "remote: mirror repository
is read-only / fatal: ... HTTP 403". Without the demirror step,
EVERY Sovereign provisioned fails handover at this step.

Bumps:
- bp-self-sovereign-cutover chart 0.1.21 → 0.1.22
- bootstrap-kit pin 0.1.20 → 0.1.22

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 02:53:45 +04:00
e3mrah
4e2192ef4a
fix(deployments-list): row click goes to that row's dashboard, not the current one (#1026)
The Sovereign Console at /sovereign/deployments rendered every row's FQDN
as a Link to=`/dashboard` regardless of which row was clicked. On contabo
(mother) this resolved to /sovereign/dashboard (the CURRENT user's
Sovereign), so clicking ANY row in the deployments list always
navigated to the same dashboard — breaking the operator's expectation
that "click row X to see deployment X's pages."

Fix: route each row to /provision/<row-id>/dashboard on the mother view
(Catalyst-Zero), and to /dashboard on the chroot Sovereign view (where
each Sovereign sees only its own deployment, so /dashboard is correct).

Mode resolved via the existing DETECTED_MODE singleton.
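
The per-row link target, as a sketch (route strings are from this commit; the function name is illustrative):

```typescript
// Row X now links to deployment X's dashboard on the mother; on the
// chroot each Sovereign sees only its own deployment, so /dashboard holds.
const rowDashboardLink = (rowId: string, mode: "mother" | "sovereign"): string =>
  mode === "sovereign" ? "/dashboard" : `/provision/${rowId}/dashboard`;
```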

Bumps bp-catalyst-platform chart 1.4.40 → 1.4.41.

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 02:34:06 +04:00
e3mrah
aba77c09a1
chore(bp-catalyst-platform): bump 1.4.39 → 1.4.40 + literal :1b62da7 → :074d65c (#1023 store-fallback) (#1024)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-06 02:18:28 +04:00
e3mrah
362a377dc3
chore(bp-catalyst-platform): bump 1.4.38 → 1.4.39 + literal :69f3be2 → :1b62da7 (#1017 LIVE jobs) (#1020)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-06 02:09:54 +04:00
e3mrah
b8ef07def4
chore(bp-catalyst-platform): bump 1.4.37 → 1.4.38 + literal :32d4a87 → :69f3be2 (#1014 sidebar redux) (#1015)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-06 01:28:14 +04:00
e3mrah
4f3cce668d
chore(bp-catalyst-platform): bump 1.4.36 → 1.4.37 + literal :a1b30cc → :32d4a87 (#1012 wizard validators public) (#1013)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-06 00:53:18 +04:00
e3mrah
78fe10aa87
chore(bp-catalyst-platform): bump 1.4.35 → 1.4.36 + literal :8ec8c01 → :a1b30cc (#1008 public subdomains/check) (#1009)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 23:59:50 +04:00
e3mrah
b887f95d29
chore(bp-catalyst-platform): bump 1.4.34 → 1.4.35 + literal :b45a49f → :8ec8c01 (#1005)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 23:49:58 +04:00
e3mrah
1b85ab9227
chore(bp-catalyst-platform): bump 1.4.33 → 1.4.34 + literal :11dd19e → :b45a49f (#1000 cloud chroot + wizard banner) (#1003)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 23:44:03 +04:00
e3mrah
b15f08bc1e
chore(bp-catalyst-platform): bump 1.4.32 → 1.4.33 + literal :1af1c0d → :11dd19e (#998 chroot fix) (#999)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 23:27:12 +04:00
e3mrah
2e493fc4f7
chore(bp-catalyst-platform): bump 1.4.31 → 1.4.32 + literal :ffe3607 → :1af1c0d (#996 redirect fixes) (#997)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 23:07:04 +04:00
e3mrah
498a02549a
chore(bp-catalyst-platform): bump 1.4.30 → 1.4.31 + literal :019309f → :ffe3607 (#995)
Lands #994's wizard redirect fix on contabo + Sovereigns.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:00:33 +04:00
e3mrah
92f1eb8468
chore(bp-catalyst-platform): bump 1.4.29 → 1.4.30 + chart literal :8a1fe04 → :019309f (#993)
Lands the clean post-revert image on Sovereigns:

- :019309f is the catalyst-build output for commit 019309f9 (the revert
  merge of #984/#987/#989), which carries PR #983's URL contract fix
  WITHOUT the broken / → /nova/ redirect chain.
- Chart version bumped 1.4.29 → 1.4.30 to invalidate Flux source-controller's
  OCI tag cache (otherwise Sovereigns stay on the first 1.4.29 digest they
  pulled — verified live on otech117).
- Chart template literal bumped because PR #980 stops CI from auto-bumping
  it; this commit IS the operator-approved manual bump.

Contabo stays on :8a1fe04 (manifest at clusters/contabo-mkt unaffected by
the chart literal change since contabo's Kustomize path reads its own copy
of the deployment manifests). When the operator validates :019309f on
Sovereigns, contabo can be re-pinned in a follow-up.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 22:41:42 +04:00
e3mrah
e8fcd66a2b
chore(bp-catalyst-platform): bump 1.4.28 → 1.4.29 — pulls in #983 URL contract (#986)
Bumps the chart version + the per-Sovereign HelmRelease pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml so all
Sovereigns reconciling against the template (otech117 et al.) pick up
PR #983's fixes:

- /dashboard /apps /jobs /cloud … render at clean roots; no /console/
  prefix and no /provision/<id>/ prefix on Sovereign mode.
- sovereign_self.go store fallback — data flows on clean URLs the
  moment fireHandover POSTs the deployment record to /api/v1/internal/
  deployments/import; no waiting for a chart-values overlay roundtrip.
- Sidebar links land on clean roots — no more /provision//cloud.
- Auth handover redirect target → /dashboard (was /console/dashboard).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 22:04:39 +04:00
e3mrah
ed8872a15b
feat(catalyst-api): mother→child cutover data transfer at handover (#977)
The data half of the mother→child contract that PR #976 set up the
URL routing for. At handover the mother POSTs the full deployment
record (events, jobs history, HRs, cloud topology, kubeconfig meta)
to the child's POST /api/v1/internal/deployments/import — the child
persists it locally so its /api/v1/deployments/{id}/* endpoints
answer with byte-byte-identical data the operator sees on the mother
view at /sovereign/provision/<id>/<page>.

Result: on the child cluster, clean URLs (/dashboard, /apps, /jobs,
/cloud) render with REAL data (events, exec logs, job statuses,
treemap utilisation) instead of empty arrays.

- New endpoint: POST /api/v1/internal/deployments/import (child)
  Validates by FQDN match against CATALYST_OTECH_FQDN. Idempotent.
- Mother fireHandover() now posts the record to the child after the
  JWT mint as a fire-and-forget goroutine. Failure logs loudly per
  INVIOLABLE-PRINCIPLES #3 but does not block SSE emit.

Bumped: bp-catalyst-platform 1.4.27 → 1.4.28.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 20:51:03 +04:00
e3mrah
6ec7851bc2
feat(sovereign-console): kill duplicate /console/* pages, redirect to canonical /provision/$id/* (Iteration 1) (#972)
* feat(sovereign-console): kill duplicate /console/* pages, redirect to canonical /provision/$id/* (Iteration 1)

Founder-reported on otech116/117: the /console/dashboard, /console/apps,
/console/jobs, /console/cloud, /console/users, /console/settings pages
are STUBS that look completely different from the canonical Sovereign
Console operators see at console.openova.io/sovereign/provision/$id/*.

Investigation: 6 duplicate Console*Page React components were shipped in
PR #937 — separate stub implementations of pages that already exist as
the canonical Dashboard / AppsPage / JobsPage / CloudPage /
UserAccessListPage / SettingsPage components used by the
/provision/$deploymentId/* route tree (the same the wizard renders).

Fix (Iteration 1):
  - DELETE the 6 duplicate Console*Page components.
  - Replace the /console/* router routes with SovereignConsoleRedirect:
    a tiny component that fetches /api/v1/sovereign/self for the
    Sovereign's own deployment id, then router-navigates to the
    canonical /provision/<self-id>/<page>. Same components, same data,
    pixel- and byte-identical UI to the mothership view.
  - Add catalyst-api endpoint GET /api/v1/sovereign/self that returns
    the deployment id from CATALYST_SELF_DEPLOYMENT_ID env. Mothership
    (env unset) → 404. Sovereign with stamped id → 200. Sovereign
    pre-handover → 503 deployment-id-not-yet-stamped.
  - Wire env via the existing sovereign-fqdn ConfigMap (B1 PR #912):
    new key `selfDeploymentId`, sourced from
    .Values.global.sovereignSelfDeploymentId. Empty until the
    orchestrator's per-Sovereign overlay writer stamps it.
  - Add useResolvedDeploymentId React hook (URL params first, then
    /sovereign/self fallback) — wires Iteration 2 (clean URLs) below.

Iteration 2 (next PR — out of scope here):
  - Drop the /sovereign/provision/<id>/ URL prefix on Sovereign by
    refactoring 6 canonical components to use useResolvedDeploymentId
    instead of strict useParams. Then /console/dashboard renders the
    canonical Dashboard at the clean URL with deployment id resolved
    from /sovereign/self.

Iteration 3 (next PR after — also out of scope):
  - Handover history transfer: contabo's catalyst-api at handover POSTs
    the full deployment record (events, jobs, HRs, cloud topology) to
    the Sovereign's catalyst-api so /provision/<id>/* on the Sovereign
    answers with byte-identical data.

Bumped: bp-catalyst-platform 1.4.26 → 1.4.27.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sovereign-console): clean URLs — /console/* mounts canonical components directly

Removes the SovereignConsoleRedirect indirection. The 6 canonical
operator components (Dashboard, AppsPage, JobsPage, JobDetail,
CloudPage, AppDetail, UserAccessListPage, UserAccessEditPage,
SettingsPage) now render at clean /console/<page> URLs on Sovereign,
NOT under /sovereign/provision/<id>/<page>.

Pages that previously hard-coupled to the URL via
  useParams({ from: '/provision/$deploymentId/...' })
now use useResolvedDeploymentId() which:
  1. reads URL params (when on the legacy /provision/$id/* tree on
     contabo's mothership wizard)
  2. falls back to GET /api/v1/sovereign/self (Sovereign self-discovery)

Refactored: Dashboard, AppsPage, JobsPage, SettingsPage, UserAccessListPage.
CloudPage already used strict:false — no change needed.

Wires the /console/* router subtree to the canonical components +
adds the missing children routes (/jobs/$jobId, /users/new,
/users/$name, /app/$componentId) so the canonical UI's deep-links
work on the clean URL surface too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 20:17:36 +04:00
e3mrah
608db53a25
fix(cutover 0.1.20): Step-06 pushes YAML edit to local Gitea so patches survive Flux reconcile (#970) (#971)
## Root cause (live on otech116 2026-05-05 14:38)

After the #968 fix shipped (0.1.19), the cutover engine reached Step-7
(87%) successfully — Step-01..07 all completed. Then Step-08 (egress-
block-test) caught 38/38 HelmRepositories had reverted to upstream:

```
external HelmRepositories still pointing at ghcr.io/openova-io: 38
  OFFENDER flux-system/bp-cilium=oci://ghcr.io/openova-io
  ... (37 more)
FAIL — at least one HelmRepository did not pivot
```

But Step-06's job logs say:
```
[helmrepository-patches] OK bp-cilium -> oci://harbor.otech116.omani.works/openova-io
... (37 more OK)
ok=38 skip=0 fail=0
```

So Step-06 thought it succeeded — and it had, momentarily. But then
the bootstrap-kit Kustomization (which had successfully pivoted to
local Gitea via Step-05) reconciled its YAML from local Gitea, where
the YAML still declared `url: oci://ghcr.io/openova-io`. Within ~30s
every kubectl patch was undone. The cutover engine then aborted at
Step-8 verification.

## Fix

Step-06 now runs in two phases:
1. **Live K8s patches** (existing behaviour) — flips spec.url on every
   HelmRepository immediately. Useful for the cluster between cutover
   and the next reconcile.
2. **NEW — Push YAML edit to local Gitea** — clones `openova/openova`
   from the local Gitea over basic-auth, sed-rewrites every
   `clusters/_template/bootstrap-kit/*.yaml` declaration of `url:
   oci://ghcr.io/openova-io` → `oci://harbor.<sov-fqdn>/openova-io`,
   commits with a clear message, pushes back. Subsequent reconciles
   see local Harbor as the steady-state.

After the push, the script annotates `flux-system/openova` GitRepository
to trigger immediate reconciliation so the new YAML lands without
waiting for the polling interval.

## Image change

Step-06 image bumped from `bitnami/kubectl:1.31.4` to `alpine/k8s:1.31.4`
because the new phase needs both `kubectl` and `git` in one image
(verified live on otech116 — both binaries present).

## Acceptance gate

Test case 16 added to cutover-contract.sh — guards against future
regressions that remove the `git clone`, the `git push origin main`,
or the `clusters/_template/bootstrap-kit` target dir reference.

## Live verification

Will fire on otech117 (next provision). Expected:
- Step-06 logs `cloning gitea-http.gitea.../openova/openova.git` then `pushed to ...`
- Step-08 verify PASSES (38/38 HelmRepositories pivoted in K8s + Gitea)
- self-sovereign-cutover-status `cutoverComplete: "true"`
- Egress block to ghcr.io safely activates

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:55:22 +04:00
e3mrah
3db19b76b1
fix(cutover 0.1.19): Step-01 gitea-mirror DNS readiness probe + backoffLimit=3 (#968) (#969)
## Root cause (live on otech115 2026-05-05 14:15)

After PR #959 (0.1.18) unblocked the auto-trigger to actually call
/internal/cutover/trigger, the cutover engine fired Step-01 within ~8s
of bp-self-sovereign-cutover Helm-install completing. The gitea Pod
had only just reached Ready state — cluster-DNS endpoint publication
for the headless service `gitea-http` was still in flight. One wget
returned `bad address gitea-http.gitea.svc.cluster.local` and exited
non-zero. Catalyst-api's cutover engine stamped Jobs with backoffLimit=0
(cutover.go:584), so a single DNS miss was terminal and aborted all 8
cutover steps. otech115 finished provisioning with cutoverComplete=false
and tethered to upstream github.com/ghcr.io.

## Fix (dual-layer)

**Layer A — catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3.
A single transient miss is recoverable (4 attempts over each step's
activeDeadlineSeconds) without burning operator attention. Hard failures
still surface within budget.
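In the Job manifests the engine stamps, the change amounts to (sketch; only the `backoffLimit` value is from the commit, the surrounding fields are abbreviated):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 3   # was 0: a single transient DNS miss was terminal
```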

**Layer B — chart Step-01 (01-gitea-mirror-job.yaml)**: explicit
nslookup readiness probe at the top of the bash script, before any
wget call. 30 attempts × 5s = 150s budget; alpine/git ships nslookup
in /usr/bin (verified live on otech115). Layer B is faster than Layer A
(in-script DNS retry vs Pod recreate); Layer A is the safety net for
any other transient pre-cluster-stable race we haven't yet enumerated.
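The Layer-B gate can be sketched as a function. The log format matches the expected Step-01 line; the function name and the probe-injection seam are illustration only (the real script calls nslookup directly), and the demo injects `true` so the sketch runs without cluster DNS.

```shell
#!/bin/sh
# Hedged sketch of the Layer-B DNS readiness gate.
# wait_for_dns <host> <attempts> <sleep_s> [probe]; probe defaults to nslookup.
wait_for_dns() {
  host="$1"; attempts="$2"; sleep_s="$3"; probe="${4:-nslookup}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$probe" "$host" >/dev/null 2>&1; then
      echo "[gitea-mirror] DNS ready for $host (attempt $i)"
      return 0
    fi
    i=$((i + 1))
    sleep "$sleep_s"
  done
  echo "[gitea-mirror] DNS never resolved for $host" >&2
  return 1
}
# In-cluster budget: 30 attempts x 5s = 150s. Demo uses an injected
# always-true probe instead of real nslookup:
wait_for_dns gitea-http.gitea.svc.cluster.local 30 5 true
```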

## Acceptance gate

Test case 15 added to platform/self-sovereign-cutover/chart/tests/
cutover-contract.sh — guards against future regressions that drop
either the gitea_host extraction or the nslookup loop.

## Live verification

Will fire on the next provision (otech116). Expected:
- Step-01 logs `[gitea-mirror] DNS ready for gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:25:15 +04:00
e3mrah
d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the false-positive Ready condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.
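The chart-side wiring can be sketched as follows. Resource shapes are abbreviated and partly assumed; the key and target names (`hetzner-node-config`, `cloud-credentials`, `hcloud-cloud-init`, `clusterAutoscalerHcloud.cloudInit`) and the two env-var keys are the ones named above.

```yaml
# Sketch only — abbreviated shapes, not the rendered chart output.
apiVersion: v1
kind: Secret
metadata:
  name: hetzner-node-config
data:                              # both keys rendered unconditionally so
  HCLOUD_CLUSTER_CONFIG: <b64>     # the upstream secretKeyRef always resolves
  HCLOUD_CLOUD_INIT: <b64>
---
# Flux overlay: lift the tofu-stamped cloud-init key into chart values
spec:
  valuesFrom:
    - kind: Secret
      name: cloud-credentials
      valuesKey: hcloud-cloud-init
      targetPath: clusterAutoscalerHcloud.cloudInit
```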

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to the NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (no longer orderable
  anywhere for new servers) per the Hetzner /v1/server_types vs POST
  /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.
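The two-sided predicate can be sketched language-neutrally. The matrix entries come from the commit text; the shell encoding is an illustration only (the real helpers live in the wizard's TypeScript and in sku_availability.go).

```shell
#!/bin/sh
# Hedged sketch of the orderability predicate mirrored on both sides.
# Matrix per the commit: cpx32 only in fsn1/nbg1/hel1; cpx21/cpx31 nowhere.
sku_available_in_region() {
  case "$1:$2" in
    cpx32:fsn1|cpx32:nbg1|cpx32:hel1) return 0 ;;
    *) return 1 ;;
  esac
}
sku_available_in_region cpx32 fsn1 && echo "cpx32@fsn1: orderable"
sku_available_in_region cpx32 ash || echo "cpx32@ash: rejected (otech109 failure mode)"
```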

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00
e3mrah
ae5766f2d0
fix(bp-catalyst-platform 1.4.26): grant catalyst-api TokenReview RBAC for cutover trigger (#957) (#962)
Chart 0.1.18 fixed the readiness-probe loop on the auto-trigger Job
(was 401-looping forever on /sovereign/cutover/status). The trigger
now reaches /api/v1/internal/cutover/trigger — but every call returns
502 "token-review-failed" in <10ms because the catalyst-api SA does
not have permission to create TokenReviews against the apiserver.

PR #947 wired the endpoint but not its RBAC. The ClusterRole
catalyst-api-cutover-driver had every verb the cutover engine needs
(configmaps, jobs, events, deployments, daemonsets) EXCEPT
authentication.k8s.io/tokenreviews — which the in-cluster trigger
endpoint depends on for SA bearer-token validation.

Live evidence on otech113 2026-05-05 12:02:55:
  GET /healthz → 200  (probe success — 0.1.18 fix working)
  POST /api/v1/internal/cutover/trigger → 502 in 8.879ms

  $ kubectl auth can-i create tokenreviews \
      --as=system:serviceaccount:catalyst-system:catalyst-api-cutover-driver
  no

Fix: add a separate Rule in clusterrole-cutover-driver.yaml for
authentication.k8s.io/tokenreviews verbs=[create]. Per
feedback_rbac_create_no_resourcenames.md the create verb stays in
its own Rule (TokenReview is a virtual sub-resource with no name to
scope to anyway).
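The added rule amounts to the fragment below (sketch; surrounding rules in clusterrole-cutover-driver.yaml are elided):

```yaml
# The new, deliberately separate rule — create carries no resourceNames
# per feedback_rbac_create_no_resourcenames.md
- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]
```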

Bumped:
  - products/catalyst/chart/Chart.yaml: 1.4.25 → 1.4.26
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: pin 1.4.26

Closes the #957 follow-up RBAC gap; PR #959 fixed the readiness loop.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:08:00 +04:00