From eb6a3c1812f74d9dc34bc6c2a9bad917c9df06ce Mon Sep 17 00:00:00 2001
From: e3mrah <81884938+emrahbaysal@users.noreply.github.com>
Date: Wed, 6 May 2026 21:10:31 +0400
Subject: [PATCH] =?UTF-8?q?fix(chart,ci):=20auto-bump=20literal=20catalyst?=
 =?UTF-8?q?-{api,ui}=20SHAs=20=E2=80=94=20Sovereigns=20+=20contabo=20were?=
 =?UTF-8?q?=20frozen=20at=20:2122fb8=20(#1060)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56

PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology)
but left four route registrations in cmd/api/main.go that still referenced
those handler methods. The catalyst-api build for the merged revert
(run 25439549879) failed with:

    cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
    cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
    cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
    cmd/api/main.go:693:42: h.HandleSovereignTopology undefined

That is why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published — only the UI image rolled. Result: the omantel.biz catalyst-api
pod was stuck in ImagePullBackOff.

Drop the four route registrations. Same baby, new address — the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the
JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints.

Also revert two more parallel-baby fragments still on main:

- getHierarchicalInfrastructure: mode-aware fetcher → single mother URL
  (the chroot resolves deploymentId from the cookie, and the mother-side
  topology handler serves byte-identical data once cutover-import has
  persisted the deployment record in the Sovereign's local store)
- CatalogAdminPage.fetchApps: mode-aware → /catalog/apps everywhere

Bump the bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign

The chroot Sovereign Console at console. is the SAME catalyst-api binary
as the mother. When that binary runs ON the Sovereign cluster (the
catalyst-system namespace on the Sovereign itself), there is no
posted-back kubeconfig — the catalyst-api IS in the cluster it needs to
talk to, and rest.InClusterConfig() returns the right credentials.

Without this, every endpoint that needs the Sovereign-side dynamic client
returned 503 with "sovereign cluster kubeconfig not yet posted back" —
including ListUserAccess (/users page), CreateUserAccess, infrastructure
CRUD, etc.

Caught on omantel.biz 2026-05-06: /users rendered "list user-access:
HTTP 503" because the Sovereign-side catalyst-api was looking for a
kubeconfig that doesn't exist on the chroot side of the cutover boundary.

Detection: the SOVEREIGN_FQDN env var (set on every Sovereign-side
catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN.
On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the
chroot, SOVEREIGN_FQDN matches the only deployment served (its own) →
use in-cluster.

The same fallback is applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.
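For reference, a minimal sketch of the fallback (identifiers are
illustrative, not the real catalyst-api names; assumes client-go):

    package main

    import (
        "errors"
        "os"

        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/clientcmd"
    )

    // The 503 body every Sovereign-side endpoint used to surface.
    var errKubeconfigNotPostedBack = errors.New(
        "sovereign cluster kubeconfig not yet posted back")

    // runningOnSovereign: the chart sets SOVEREIGN_FQDN on every
    // Sovereign-side catalyst-api deployment; the mother leaves it unset.
    func runningOnSovereign(deploymentFQDN string) bool {
        env := os.Getenv("SOVEREIGN_FQDN")
        return env != "" && env == deploymentFQDN
    }

    // sovereignDynamicClient prefers in-cluster credentials when this
    // binary is the chroot-side instance serving its own deployment;
    // otherwise it requires the posted-back kubeconfig, as before.
    func sovereignDynamicClient(deploymentFQDN string, postedKubeconfig []byte) (dynamic.Interface, error) {
        if runningOnSovereign(deploymentFQDN) {
            cfg, err := rest.InClusterConfig() // already in the target cluster
            if err != nil {
                return nil, err
            }
            return dynamic.NewForConfig(cfg)
        }
        if len(postedKubeconfig) == 0 {
            return nil, errKubeconfigNotPostedBack
        }
        cfg, err := clientcmd.RESTConfigFromKubeConfig(postedKubeconfig)
        if err != nil {
            return nil, err
        }
        return dynamic.NewForConfig(cfg)
    }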
Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(user-access): empty list when CRD absent + RBAC for chroot

Two coupled fixes for the /users page on the chroot Sovereign Console:

1. catalyst-api-cutover-driver ClusterRole: grant read/write on
   useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
   uses the in-cluster ServiceAccount (per PR #1052). The list call was
   returning 403 from the apiserver because the SA had no rule covering
   this CRD.

2. ListUserAccess: return 200 with empty items when the CRD itself is
   not installed (apierrors.IsNotFound). The access.openova.io CRD ships
   via a separate blueprint that may not yet be installed on a fresh
   Sovereign — the page should render its empty state, not a 500 toast.

Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: the list call surfaced first as 403 (RBAC), then
as 500 "server could not find the requested resource" (CRD absent). Both
now resolve to 200 + [].

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint

Two parallel-baby paths still made the chroot diverge from the mother on
/cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.

1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
   omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) whenever
   the topology query errored, because it fell back to
   `infrastructureTopologyFixture` from `src/test/fixtures/` — a
   test-only file leaking into production via the production import
   tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder
   data; show the empty state when you don't know).

   Fix: drop the fixture fallback. On error → null → empty-state render.
   The mother shows the same empty state when its loader returns
   nothing; byte-identical.

2. JobsTable + JobDetail rendered a flat green grid because the chroot
   was hitting `/api/v1/sovereign/jobs`, which returns a minimal shape
   (no dependsOn, no parentId, no exec records). The mother's
   `/api/v1/deployments/{depId}/jobs` returns the rich shape from a
   per-deployment jobs.Store, which on the chroot starts empty (the
   mother's exportDeploymentToChild ships only the deployment record,
   not the jobs.Store contents).

   Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`.
   Add `chrootSeedJobsStoreIfEmpty`, which runs at handler time when
   SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the
   per-deployment jobs.Store has 0 records: do a one-shot HelmRelease
   list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases
   — exported here; it mirrors Watcher.SnapshotComponents without
   spinning up an informer), then pass the result through
   snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls
   read directly from the now-populated store and return rich Job
   records with dependsOn / parentId / status — exactly like the mother.
   (See the sketch after this list.)

   useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses
   the same `/api/v1/deployments/{id}/jobs` URL as the mother.

3. HandleDeploymentImport now also loads the imported record into the
   in-memory deployments map immediately, so `/deployments/{id}/*`
   handlers don't need a pod restart's restoreFromStore to see the
   chroot-imported deployment.

Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).
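The seeding guard, sketched (names follow the text above; the signatures
and types are assumptions standing in for the real helmwatch and jobs
packages, not the actual code):

    package main

    import (
        "context"
        "os"
    )

    // Stand-ins for the real helmwatch snapshot and jobs seed types.
    type HelmReleaseSnapshot struct{ Name, Status string }
    type JobSeed struct{ JobName, Status string }

    type JobsStore interface {
        Len() int
        SeedJobsFromInformerList(seeds []JobSeed)
    }

    // chrootSeedJobsStoreIfEmpty lazily backfills the per-deployment
    // jobs store on the chroot: the mother's export ships only the
    // deployment record, so the first /deployments/{id}/jobs call on
    // the Sovereign finds an empty store and seeds it from a one-shot
    // HelmRelease list (no informer needed for a single snapshot).
    func chrootSeedJobsStoreIfEmpty(ctx context.Context, sovereignFQDN string,
        store JobsStore, listSnapshots func(context.Context) ([]HelmReleaseSnapshot, error)) error {

        env := os.Getenv("SOVEREIGN_FQDN")
        if env == "" || env != sovereignFQDN || store.Len() > 0 {
            return nil // mother side, or already seeded: nothing to do
        }
        // helmwatch.ListAndSnapshotHelmReleases in the actual patch.
        snaps, err := listSnapshots(ctx)
        if err != nil {
            return err
        }
        // snapshotsToSeeds in the actual patch.
        seeds := make([]JobSeed, 0, len(snaps))
        for _, s := range snaps {
            seeds = append(seeds, JobSeed{JobName: s.Name, Status: s.Status})
        }
        store.SeedJobsFromInformerList(seeds)
        return nil
    }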
Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(jobdetail): bare-jobName URL — %3A survives Traefik undecoded, so the canonical id 404s

JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak"),
and Traefik (or any upstream proxy that is RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router therefore
saw the literal "%3A", and Store.GetJob's exact-match path missed.

Two coupled fixes:

1. useJobLinkBuilder strips everything up to and including the ":"
   before encoding, producing /jobs/install-keycloak (Traefik-safe)
   instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob
   already accepts both the bare jobName and the canonical id (see
   store.go:781-789).

2. JobDetail.jobsById indexes by BOTH the canonical id AND the bare
   jobName, so the URL param resolves regardless of which format the
   link emitted. (The store-side contract is sketched below.)
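The dual-format lookup the link change leans on, as a hedged sketch (an
assumed shape mirroring the store.go:781-789 behavior cited above; this
PR does not touch that code):

    package main

    import "sync"

    type Job struct{ ID, JobName string }

    type Store struct {
        mu            sync.RWMutex
        byCanonicalID map[string]*Job // "69e73b3abe673840:install-keycloak"
        byJobName     map[string]*Job // "install-keycloak"
    }

    // GetJob resolves either format: exact canonical id first, then the
    // bare jobName, so a bare-name URL param still finds the record.
    func (s *Store) GetJob(idOrName string) (*Job, bool) {
        s.mu.RLock()
        defer s.mu.RUnlock()
        if j, ok := s.byCanonicalID[idOrName]; ok {
            return j, true
        }
        j, ok := s.byJobName[idOrName]
        return j, ok
    }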
Bump chart 1.4.58 → 1.4.59.

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(cloud): resolve deploymentId from cookie on chroot — topology was firing against undefined

CloudPage's topology query fired against /deployments/undefined/... on
the chroot (the URL there is /cloud, with no deploymentId path segment),
so the page showed "Couldn't load architecture" with all node counts at
0/0.

Fix: same pattern as JobDetail — useResolvedDeploymentId() takes the
deploymentId from URL params when present and otherwise reads the JWT
cookie's deployment_id claim via /api/v1/sovereign/self. The topology
query also gates on `!!deploymentId` so it doesn't waste a 404
round-trip while the cookie is being resolved.

Bump chart 1.4.60 → 1.4.61.

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(chroot): single chrome — no frame in frame, no mother handover banner

Two visible bleed-throughs from the mother's wizard UX onto the chroot
Sovereign Console at console.:

1. **Two stacked headers + a sidebar inside a sidebar ("frame in
   frame").** SovereignConsoleLayout rendered its own sidebar + header,
   AND the page inside rendered PortalShell, which rendered ANOTHER
   header (its sidebar was already skipped for the chroot per a prior
   fix). The user saw two horizontal title bars stacked.

   Resolution: SovereignConsoleLayout becomes auth-only on the chroot.
   It runs the cookie/OIDC auth gate + RequiredActionsModal, then
   renders with NO chrome. PortalShell is now the single chrome owner on
   both surfaces:

   - Mother (/sovereign/provision/$id): renders Sidebar with
     /provision/$id/X URLs + its header.
   - Chroot (console.): renders SovereignSidebar with clean /X URLs +
     the same header.

   One sidebar, one header, byte-identical to the mother layout.

2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
   banner on /apps.** This is the mother's wizard celebration telling
   the operator "you can now jump to your new Sovereign". On the chroot
   the operator IS already on the Sovereign Console; the banner bleeds
   through because the imported deployment record carries the mother's
   handover-ready event in its history.

   Resolution: AppsPage gates the banner, the toast, and the
   auto-redirect timer on `!isSovereignMode`. The chroot stays clean.

Bump chart 1.4.62 → 1.4.63.

Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page

Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed
with no sidebar and no header — a visible look-and-feel break.

    /settings/marketplace → MarketplaceSettings (wrapped in PortalShell)
    /parent-domains       → ParentDomainsPage   (wrapped in PortalShell)
    /catalog              → CatalogAdminPage    (deleted)

Drop /catalog entirely, per founder direction: a separate page just to
flip a "publish to marketplace" boolean per app is the wrong shape. The
natural place for that toggle is on each /apps card (future PR — needs
HandleSovereignApps to join publish state from the SME catalog
microservice).

Removed:
- the /catalog route registration in router.tsx
- the 'Catalog' entry in SovereignSidebar's FLAT_NAV
- CatalogAdminPage.tsx (525 lines)
- 'catalog' from the ActiveSection union + the deriveActiveSection regex

The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at marketplace.,
not console., and the future apps-card toggle will call it via the same
path.

Bump chart 1.4.64 → 1.4.65.

Co-Authored-By: Claude Opus 4.7 (1M context)

* feat(apps): publish chip on each card — replaces deleted /catalog page

Per founder direction ("if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?"): drop the
standalone /catalog page (#1058) and put the publish toggle on each
/apps card.

Backend (catalyst-api):
- New file sme_catalog_client.go — a best-effort client for the
  in-cluster SME catalog microservice at
  http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s
  probe budget; returns nil on DNS NXDOMAIN (SME services tier not
  deployed on this Sovereign — common when marketplace.enabled is
  false). A sketch of the client follows at the end of this entry.
- HandleSovereignApps decorates each app with a `marketplacePublished`
  *bool joined by slug from the SME catalog. nil ⇒ slug not in the SME
  catalog (bootstrap component, or marketplace not deployed) ⇒ the FE
  suppresses the chip.
- New handler HandleSovereignAppPublish at
  PATCH /api/v1/sovereign/apps/{slug}/publish. Body: {"published": bool}.
  Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog
  and surfaces the upstream status verbatim. Invalidates the cache so
  the next /apps poll reflects the change immediately.

Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of the
  bare status map.
- Each AppCard with a non-null marketplacePublished renders a
  PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH
  → optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil → no
  chip (correct: nothing to toggle).
- With marketplace.enabled=false, cards render no chips at all (the SME
  catalog is unreachable → nil for every slug).

Bump chart 1.4.66 → 1.4.67.
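Shape of the best-effort client, as a hedged sketch. Only the base URL,
the 30s cache, the 1.5s probe budget, and nil-on-NXDOMAIN come from the
entry above; the list-endpoint path, field names, and everything else
here are assumptions:

    package main

    import (
        "context"
        "encoding/json"
        "errors"
        "net"
        "net/http"
        "sync"
        "time"
    )

    const smeCatalogBase = "http://catalog.sme.svc.cluster.local:8082"

    type smeCatalogClient struct {
        mu        sync.Mutex
        cached    map[string]bool // slug -> published
        fetchedAt time.Time
    }

    // publishedBySlug returns nil (not an error) when the SME services
    // tier is absent: DNS NXDOMAIN means marketplace.enabled is false
    // on this Sovereign, and callers treat nil as "no chip to render".
    func (c *smeCatalogClient) publishedBySlug(ctx context.Context) map[string]bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.cached != nil && time.Since(c.fetchedAt) < 30*time.Second {
            return c.cached // 30s response cache
        }
        ctx, cancel := context.WithTimeout(ctx, 1500*time.Millisecond) // probe budget
        defer cancel()
        // The list path is hypothetical; only the PATCH publish path is
        // named in the entry above.
        req, err := http.NewRequestWithContext(ctx, http.MethodGet,
            smeCatalogBase+"/catalog/admin/apps", nil)
        if err != nil {
            return nil
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            var dnsErr *net.DNSError
            if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
                return nil // NXDOMAIN: SME tier not deployed
            }
            return nil // any probe failure degrades to "no publish state"
        }
        defer resp.Body.Close()
        var apps []struct {
            Slug      string `json:"slug"`
            Published bool   `json:"published"`
        }
        if json.NewDecoder(resp.Body).Decode(&apps) != nil {
            return nil
        }
        out := make(map[string]bool, len(apps))
        for _, a := range apps {
            out[a.Slug] = a.Published
        }
        c.cached, c.fetchedAt = out, time.Now()
        return out
    }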
Co-Authored-By: Claude Opus 4.7 (1M context)

* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code

Audit triggered by the founder asking whether PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel. The
answer: nothing reached anyone except omantel, and only via the manual
patches. Both contabo AND every fresh Sovereign would install :2122fb8 —
the SHA frozen at PR #1040's last manual chart-touch on the morning of
May 6.

Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL
  image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not a
  Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
  on every push — but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
  ./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
  through unchanged.
- Both therefore ended up frozen at whatever literal was committed by
  the last manual chart-touching PR.

Fix:
1. CI's deploy step now bumps both the literal SHAs in the two template
   files AND the unused-but-kept-for-SME-services values.yaml. It
   sed-patches the literal directly, so contabo's Kustomize path keeps
   working.
2. The commit step adds the two templates to the staged set alongside
   values.yaml, so every "deploy: update catalyst images to " commit
   propagates to contabo (10-min reconcile) AND to Sovereigns (next OCI
   chart publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the
   latest literal (currently :8361df4) gets republished and pinned in
   clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.

Why drop the "freeze contabo" intent of the previous comment: that
comment argued that auto-rolling contabo on every PR was bad because
PR #975's image broke contabo (k8scache startup loop on dead Sovereign
kubeconfigs). The right response is to fix the bug in the code, not to
freeze contabo. Freezing masked real divergence — the founder caught
this precisely because manual omantel patches were the only thing
keeping omantel current while contabo + every other fresh Sovereign
quietly ran 9 PRs behind.

Co-Authored-By: Claude Opus 4.7 (1M context)

---------

Co-authored-by: Claude Opus 4.7 (1M context)
---
 .github/workflows/catalyst-build.yaml         | 50 ++++++++++++-------
 .../13-bp-catalyst-platform.yaml              |  2 +-
 products/catalyst/chart/Chart.yaml            |  4 +-
 .../chart/templates/api-deployment.yaml       | 10 +++-
 .../chart/templates/ui-deployment.yaml        |  7 ++-
 5 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/.github/workflows/catalyst-build.yaml b/.github/workflows/catalyst-build.yaml
index 65c8b83e..bdfe7abb 100644
--- a/.github/workflows/catalyst-build.yaml
+++ b/.github/workflows/catalyst-build.yaml
@@ -339,17 +339,29 @@ jobs:
           echo "values.yaml after update:"
           grep -A2 "catalystUi\|catalystApi" "${VALUES}" | head -10
 
-          # NOTE: the literal image refs in templates/api-deployment.yaml and
-          # templates/ui-deployment.yaml are deliberately NOT auto-bumped here.
-          # Those manifests are what contabo's Kustomize-path Flux reconciles —
-          # auto-bumping them auto-rolls contabo on every PR, which broke
-          # contabo on 2026-05-05 (k8scache startup loop on dead Sovereign
-          # kubeconfigs in PR #975's image). contabo rolls ONLY when an
-          # operator manually edits + commits those files (see
-          # docs/RUNBOOK-CONTABO-IMAGE-BUMP.md). Sovereigns are unaffected:
-          # they install via OCI chart whose values.yaml carries the new SHA
-          # (bumped above), which gets picked up at the next blueprint-release
-          # publish below.
+          # ALSO bump the literal image refs in the chart templates.
+          # Sovereigns Helm-install this chart and contabo applies it
+          # via Kustomize — both consume the literal directly because
+          # kustomize-controller can't render Helm templates. Without
+          # this auto-bump, every Sovereign provisioned after 2026-05-06
+          # was installing :2122fb8 (frozen at PR #1040's chart-touch),
+          # so PRs #1051..#1059 never reached anyone except via manual
+          # `kubectl set image` patches on omantel.
+          API_TPL="products/catalyst/chart/templates/api-deployment.yaml"
+          UI_TPL="products/catalyst/chart/templates/ui-deployment.yaml"
+          sed -i -E "s|(image: \"ghcr\.io/openova-io/openova/catalyst-api:)[^\"]*\"|\1${SHA_SHORT}\"|" "${API_TPL}"
+          sed -i -E "s|(image: \"ghcr\.io/openova-io/openova/catalyst-ui:)[^\"]*\"|\1${SHA_SHORT}\"|" "${UI_TPL}"
+          echo "templates after update:"
+          grep -E "image: \".*catalyst-(api|ui):" "${API_TPL}" "${UI_TPL}"
+
+          # contabo's catalyst-platform Kustomization at
+          # ./products/catalyst/chart/templates reconciles every 10 min
+          # — it will pick up the bumped literal on the next interval.
+          # If the new image breaks contabo, an operator can revert the
+          # template SHA via a follow-up PR; the previous "freeze"
+          # behaviour was masking real bugs (contabo silently ran an
+          # old image while the Sovereign provisioning churned through
+          # the same SHA being fixed downstream).
 
       - name: Commit and push manifest updates
         id: deploy_commit
@@ -358,12 +370,16 @@
         run: |
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
-          # Only the chart's values.yaml is auto-bumped — Sovereigns consume
-          # the OCI chart bump via blueprint-release. Contabo's Kustomize-path
-          # deployment manifests are intentionally NOT touched here so contabo
-          # never auto-rolls; an operator manually bumps those files when a
-          # specific catalyst-api/ui SHA has been validated against contabo.
-          git add products/catalyst/chart/values.yaml
+          # values.yaml + the two literal-image templates (api-deployment,
+          # ui-deployment) are bumped together so:
+          #   - Sovereigns get the new SHA via the next OCI chart publish
+          #     (blueprint-release fires below).
+          #   - contabo's Kustomize-path Flux reconciles the bumped literal
+          #     within 10 min.
+          # Both surfaces converge on the same SHA on every push.
+          git add products/catalyst/chart/values.yaml \
+            products/catalyst/chart/templates/api-deployment.yaml \
+            products/catalyst/chart/templates/ui-deployment.yaml
           if git diff --staged --quiet; then
             echo "No changes to commit"
             echo "pushed=false" >> "$GITHUB_OUTPUT"
diff --git a/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml b/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
index 14d92f06..8cf193e1 100644
--- a/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
+++ b/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
@@ -231,7 +231,7 @@ spec:
       # fallback (data renders the moment cutover-import lands without
       # waiting for the orchestrator's chart-values overlay write).
       # 2026-05-05.
-      version: 1.4.68
+      version: 1.4.70
       sourceRef:
         kind: HelmRepository
         name: bp-catalyst-platform
diff --git a/products/catalyst/chart/Chart.yaml b/products/catalyst/chart/Chart.yaml
index 70209aa7..982b8b71 100644
--- a/products/catalyst/chart/Chart.yaml
+++ b/products/catalyst/chart/Chart.yaml
@@ -124,8 +124,8 @@ name: bp-catalyst-platform
 # otech113 2026-05-05 — chart 0.1.18 fixed the readiness-probe loop
 # but every trigger immediately got 502 in <10ms (synchronous
 # apiserver permission rejection). 2026-05-05.
-version: 1.4.68
-appVersion: 1.4.68
+version: 1.4.70
+appVersion: 1.4.70
 description: |
   Catalyst Platform — the unified Catalyst control plane umbrella chart for Catalyst-Zero.
   Composes the catalyst-{ui,api}, console, admin, marketplace UI modules and the marketplace-api backend.
diff --git a/products/catalyst/chart/templates/api-deployment.yaml b/products/catalyst/chart/templates/api-deployment.yaml
index 8d0840a8..4dec6203 100644
--- a/products/catalyst/chart/templates/api-deployment.yaml
+++ b/products/catalyst/chart/templates/api-deployment.yaml
@@ -152,7 +152,15 @@ spec:
         fsGroupChangePolicy: OnRootMismatch
       containers:
         - name: catalyst-api
-          image: "ghcr.io/openova-io/openova/catalyst-api:2122fb8"
+          # Literal image ref — required for the contabo-mkt Kustomize
+          # path (kustomize-controller doesn't render Helm templates).
+          # Auto-bumped by .github/workflows/catalyst-build.yaml's deploy
+          # step on every push to main, so Sovereigns AND contabo both
+          # roll to the latest catalyst-api SHA. The matching
+          # values.yaml `images.catalystApi.tag` is also bumped (but
+          # unused for catalyst-api; kept for SME services that DO read
+          # from values).
+          image: "ghcr.io/openova-io/openova/catalyst-api:8361df4"
           imagePullPolicy: IfNotPresent
           ports:
             - containerPort: 8080
diff --git a/products/catalyst/chart/templates/ui-deployment.yaml b/products/catalyst/chart/templates/ui-deployment.yaml
index 2c05a350..f164fb32 100644
--- a/products/catalyst/chart/templates/ui-deployment.yaml
+++ b/products/catalyst/chart/templates/ui-deployment.yaml
@@ -19,7 +19,12 @@ spec:
         - name: ghcr-pull
       containers:
         - name: catalyst-ui
-          image: "ghcr.io/openova-io/openova/catalyst-ui:2122fb8"
+          # Literal image ref — required for the contabo-mkt Kustomize
+          # path (kustomize-controller doesn't render Helm templates).
+          # Auto-bumped by .github/workflows/catalyst-build.yaml's deploy
+          # step on every push to main, so Sovereigns AND contabo both
+          # roll to the latest catalyst-ui SHA.
+          image: "ghcr.io/openova-io/openova/catalyst-ui:8361df4"
           imagePullPolicy: IfNotPresent
           ports:
             - containerPort: 8080