1cde1a085f
1457 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
1cde1a085f |
deploy: update catalyst images to b004820
|
||
|
|
b00482007e
|
fix(catalyst): /jobs/timeline page renders without crash (#1076)
* fix(catalyst): /jobs/timeline page renders without crash
Root cause: JobsTimeline used a strict useParams({ from:
'/provision/$deploymentId/jobs/timeline' }) call, which threw "Invariant
failed" inside useSyncExternalStoreWithSelector when the actual route
tree-match was the chroot consoleJobsTimelineRoute (path '/jobs/timeline'
— added in PR #1073). The throw bubbled into the React Error Boundary
and replaced the entire surface with the "Something went wrong! Show
Error" overlay.
Fix: switch to the canonical useResolvedDeploymentId() pattern that
JobsPage / NotificationsPage / Dashboard use — it reads the URL
:deploymentId param when present (mothership tenant route) and falls
back to /api/v1/sovereign/self when absent (chroot Sovereign route).
Same module owns both topologies; no behaviour change for the
mothership tenant route.
Caught on console.omantel.biz QA pass 2026-05-07 (TC-050).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(catalyst): JobsTimeline header notes both routes
Refer to both /provision/$deploymentId/jobs/timeline (mothership) and
/jobs/timeline (Sovereign chroot) so future readers understand the
component is shared across topologies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3fa187bc35 |
deploy: update catalyst images to 76830d9
|
||
|
|
76830d9c62
|
fix(catalyst): chroot — skip tenantDiscover polling, /auth/handover redirects authed user to / (#1077)
Two bugs surfaced live on console.omantel.biz on 2026-05-07. TC-229 (P0) — chroot continuous /api/v1/tenant/discover 404 polling. The Sovereign chroot's catalyst-api does not register the tenant/discover endpoint (it is mother-only — only the Catalyst-Zero apex `console.openova.io` knows about the tenant registry). The SPA's bootstrapTenant() at app boot still ran on the chroot, returned 404, and the SPA's React-Query layer kept re-issuing the call as the Dashboard mounted/unmounted. 50+ HTTP 404 lines were captured during a single Dashboard navigation. Fix: short-circuit bootstrapTenant() at the single tenantDiscover.ts seam when DETECTED_MODE.mode === 'sovereign'. Returns the existing 'unwired' status (no registry available; proceed on the host's own identity), caches it so a second call is a no-op, and never touches the network. Tenant identity on chroot is already encoded in the session JWT (sovereign_fqdn / deployment_id claims) so no registry payload is needed. TC-004 (P1) — /auth/handover authenticated visit shows error page. Fix #2 PR #1075 added the SPA-friendly handover-error page for browser visits with no token. That branch fired even when the operator already had a live catalyst_session cookie, so an authed user pasting the bare /auth/handover URL saw "Handover incomplete" copy that confuses people who are already logged in. Fix: add a three-way branch on no-token visits — authenticated browser (302 to authHandoverRedirect, default /dashboard), unauthenticated browser (existing 302 to handover-error page from PR #1075), programmatic caller (existing 401 JSON contract from auth_handover_test.go). New helper hasValidCatalystSession reads the session token via auth.Config.ReadSessionToken (cookie / Bearer / ?access_token query — same channels RequireSession honours) and validates it via auth.Config.ValidateToken (same path RequireSession uses, including LocalPublicKey fallback for self-signed handover- session JWTs). Returns false when authConfig is nil so unconfigured Sovereigns / CI keep working unchanged. Tests: TestAuthHandover_MissingTokenAuthedRedirectsToDashboard (raw-JWT cookie + Bearer header), MissingTokenExpiredSessionFalls- Through (expired session falls through to error page), MissingTokenNoAuthConfigKeepsHTMLBranch (nil authConfig keeps the existing branches working). Existing missing-token tests unchanged. Files touched (per Fix Author #6 brief): - products/catalyst/bootstrap/ui/src/shared/lib/tenantDiscover.ts - products/catalyst/bootstrap/api/internal/handler/auth_handover.go - products/catalyst/bootstrap/api/internal/handler/auth_handover_test.go Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
56a568dc1c |
deploy: update catalyst images to 3dc9f42
|
||
|
|
3dc9f42c95
|
fix(catalyst): chroot SPA 404s for /cloud/legacy + /notifications + /readyz shadow + /auth/handover html error (#1075)
Five live bugs surfaced on console.omantel.biz 2026-05-07:
TC-090..092 /cloud/architecture, /cloud/compute, /cloud/network/ingresses
returned the SPA shell with TanStack Router default 404 in
sovereign mode. The legacy redirects (LEGACY_CLOUD_REDIRECTS)
were only mounted under the mothership /provision/$id/cloud
subtree, never at root for sovereign mode.
TC-160 /notifications returned the SPA shell + 404 because the only
notifications route was /provision/$id/notifications and
NotificationsPage hard-required the URL :deploymentId param
via useParams({ from: '/provision/$deploymentId/notifications' }).
TC-211 /readyz returned the SPA shell (HTTP 200 + index.html)
instead of a real Go-handler probe response, because no
Gateway rule routed it to catalyst-api — nginx try_files
and the SPA catch-all both shadowed the path.
TC-004 /auth/handover with no token returned raw 401 JSON
{"error":"missing token parameter"} to browser visits,
breaking the seamless-handover UX promise for stale
email-link clicks.
Fixes:
* products/catalyst/chart/templates/httproute.yaml — Exact matches
for /readyz and /healthz on the console hostname route to catalyst-api.
External monitors pointing at console.<sov>/readyz now hit the real
Go probe; pod-level k8s probes still hit nginx-internal /healthz.
* products/catalyst/bootstrap/api/internal/handler/auth_handover.go —
Browser visits (Accept: text/html or Sec-Fetch-Mode: navigate) on
the missing-token path 302-redirect to /auth/handover-error?reason=
missing_token. Programmatic callers (Accept: application/json or no
Accept header) keep the legacy 401 JSON contract that the test
matrix pins. New tests cover both branches.
* products/catalyst/bootstrap/ui/src/app/router.tsx — Adds
authHandoverErrorRoute (/auth/handover-error) with a friendly
error surface; consoleNotificationsRoute (/notifications under the
Sovereign console layout); consoleLegacyCloudRedirectRoutes
(sovereign-mode siblings of legacyCloudRedirectRoutes, reusing
LEGACY_CLOUD_REDIRECTS verbatim so the two redirect sets cannot
drift). consoleCloudRoute gains validateSearch matching
provisionCloudRoute.
* products/catalyst/bootstrap/ui/src/pages/sovereign/NotificationsPage.tsx —
Replaces strict useParams({ from: '/provision/$deploymentId/...' })
with useResolvedDeploymentId so the page works on both /provision/$id/
notifications (URL param) and sovereign-mode /notifications
(/api/v1/sovereign/self self-discovery). Mirrors the pattern used by
JobsPage / SettingsPage / Dashboard.
Verification:
helm template products/catalyst/chart — clean
npm run build — clean (1.88MB bundle, vite v8)
npx tsc --noEmit — clean
go build ./... — clean
go test -run TestAuthHandover_MissingToken — PASS (legacy + new HTML branch)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
5a1216992d |
deploy: update catalyst images to 369b60e
|
||
|
|
369b60ec5c
|
fix(catalyst): chroot EventSource auth via access_token query param — unblocks 13 cloud list views (#1074)
The chroot Sovereign Console SPA performs its own PKCE OIDC flow with Keycloak and stores the access_token in sessionStorage. installFetchAuthInterceptor patches window.fetch to attach Authorization: Bearer to /api/v1/* calls — but the EventSource browser API does NOT support custom request headers. The chroot also has no PIN-minted catalyst_session cookie (operator authenticates via Keycloak, not PIN), so withCredentials:true sent nothing. Result: every /api/v1/sovereigns/<id>/k8s/stream connection landed in 401 → SPA rendered "Stream temporarily unreachable". Affected tests: TC-066 services, TC-067 ingresses, TC-071 pods, TC-072 deployments, TC-073 statefulsets, TC-074 daemonsets, TC-075 replicasets, TC-076 configmaps, TC-078 namespaces, TC-079 nodes, TC-080 persistentvolumes, TC-081 endpointslices, TC-086 pods. Fix follows the standard SSE auth pattern used by Grafana / Loki: accept the access token as a `?access_token=<jwt>` URL query parameter, validate it through the same JWKS path as Authorization: Bearer. BE — products/catalyst/bootstrap/api/internal/auth/session.go: ReadSessionToken now consults three channels in order: (1) Authorization: Bearer header, (2) ?access_token=<jwt> query parameter, (3) catalyst_session cookie. Same JWT-shape (3 base64url segments) sanity check before ValidateToken so a malformed value short-circuits to 401 with no JWKS round-trip. The query-param path NEVER displaces the header when both are present (header wins) — preserves the live-fetch source of truth when an old ?access_token= is left in the address bar after a refresh. BE — products/catalyst/bootstrap/api/cmd/api/main.go: Replaced chi's middleware.Logger with a custom pathOnlyLogFormatter (implementing chi's middleware.LogFormatter) that emits r.URL.Path only — never r.RequestURI. Critical for credential hygiene per CLAUDE.md §10: chi.DefaultLogFormatter writes RequestURI verbatim, which would leak the access_token query parameter to stdout. The new logger emits structured slog fields (method/path/status/elapsedMs/remote) instead. FE — useK8sCacheStream.ts + useK8sStream.ts: Both EventSource consumers now read loadTokens() from sessionStorage and append `&access_token=<accessToken>` to the URL when an OIDC token is present. Mother (Catalyst-Zero) sessions store no OIDC tokens, so the param is omitted and the existing catalyst_session cookie path is unchanged. Tests: - 8 new Go tests in session_test.go covering all 7 channel permutations + JWT-shape validation + whitespace handling. - 2 new vitest cases in useK8sStream.test.ts asserting the URL contains access_token=<jwt> when sessionStorage has an OIDC token, and omits it on mother (cookie-only path). Verification: $ go build ./... && go test ./internal/auth/... → ok $ npm run typecheck && npm run build → ok $ npx vitest run src/lib/useK8sStream.test.ts → 11/11 passing $ curl -i 'https://console.omantel.biz/.../k8s/stream?kinds=pod' → 401 (will return 200 + SSE frames after deploy) Risk surface: a stale ?access_token= URL in the operator's address bar will be rejected with 401 once the JWT expires, surfacing as the same "Stream temporarily unreachable" banner. The SPA's existing reconnect loop drives a fresh EventSource on every retry, which picks up the freshest token from sessionStorage — so the failure mode is self-healing on the next browser-driven retry. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
23558f90a7 |
deploy: update catalyst images to 67e55eb
|
||
|
|
67e55ebb0b
|
fix(catalyst): /jobs/timeline router precedence + bp-spire/keycloak detail copy (#1073)
Sovereign Console (chroot, console.<sov-fqdn>) was missing the static /jobs/timeline route entirely — TanStack Router fell through to the dynamic /jobs/$jobId route with jobId='timeline', rendering the 'Job not found' surface. The mothership /provision/$deploymentId/jobs tree already had the correct precedence (timeline before $jobId); this PR ports the same pattern to consoleLayoutRoute children. Also corrects a stale comment in applicationCatalog.ts that listed bp-spire among the bootstrap kit. The generated BOOTSTRAP_KIT (sourced from clusters/_template/bootstrap-kit/) does not include spire — it is a tier-up selection. Documents that /app/bp-spire correctly renders 'App not found' on Sovereigns where the operator did not select it. Caught on console.omantel.biz QA pass 2026-05-07 (TC-050). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
a8da886a18 |
deploy: update catalyst images to 0286276
|
||
|
|
02862769cf |
fix(catalyst): JobDetail crash on Phase-0 jobs (undefined appId.startsWith)
The Phase-0 lifecycle jobs I added in PR #1072 have empty appId (they are NOT Sovereign components). The Job struct serialises appId with omitempty → undefined on the wire. FlowPage.tsx (the canvas embedded inside JobDetail) called j.appId.startsWith('bp-') unguarded, throwing TypeError 'Cannot read properties of undefined (reading startsWith)' the moment any Phase-0 job appeared in the merged jobs list. The whole JobDetail page crashed under the React Error Boundary — exactly what the founder caught on /jobs/install- tempo and /jobs/install-catalyst-platform. Fix: coerce j.appId to '' before .startsWith and fall back to j.jobName when bare is empty. Also skip empty-bare entries from the liveIdByBare map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
cbb653a938 |
deploy: update catalyst images to 0316c44
|
||
|
|
0316c444e1 |
fix(catalyst): chroot JobDetail 'Job not found' + graph WorkerNode duplicates
User found two bugs after the previous round, both verified live: 1. /jobs/install-tempo (and every other deep-link) rendered "Job not found" because useLiveJobsBackfill keyed its React Query on a constant 'sovereign' string. First render fired with empty deploymentId (useResolvedDeploymentId hadn't resolved yet) → /api/v1/deployments//jobs → 400. When the real id arrived, the query key DIDN'T change, so React Query kept the failed cache and never refetched. JobDetail's jobsById stayed empty → Job not found banner. Fix: include resolved deploymentId in the queryKey AND gate enabled on !!deploymentId so the first fetch waits. 2. /cloud?view=graph showed duplicate WorkerNodes (8 instead of 4) because the cloud-side topology synth emitted node id 'node-<k8s-name>' while the k8sAdapter emits bare '<k8s-name>'. mergeGraphs couldn't dedupe across the prefix mismatch. Fix: topology_loader synth now uses the bare K8s node name as the topology id so WorkerNode composite ids match exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
46d868738e |
deploy: update catalyst images to d7c8c47
|
||
|
|
d7c8c47f8c |
fix(catalyst): apps status — ignore reducer's default-pending init on chroot
Previous fix's fallback chain skipped to state.apps[app.id]?.status which is 'pending' by default for every app at reducer init, never reaching the 'available' fallback. Now: live API status wins; SSE reducer state honoured only when it's an explicit non-pending transition; on Sovereign mode with live query loaded, missing app.id falls to 'available' (AVAILABLE pill) instead of 'pending'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
de309e149a |
deploy: update catalyst images to 2f97710
|
||
|
|
2f97710be4 |
fix(catalyst): apps fallback to AVAILABLE not PENDING when no API entry
componentGroups.ts references blueprints not in blueprints.json (KEDA, Axon, Debezium, Envoy, frpc, NetBird, etc) — data drift between the two catalog sources. The FE was rendering these as PENDING (implying install in progress) instead of AVAILABLE (implying not yet deployed). Default to 'available' when no API or reducer state exists so the operator sees the right call-to- action pill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f376ee4551 |
deploy: update catalyst images to 1a85a9b
|
||
|
|
1a85a9b226 |
fix(catalyst): chroot /jobs lifecycle seed runs even when bootstrap-kit children already in store
The early-return guard (existing>0) short-circuited the lifecycle seed on every Sovereign that had previously seeded the bootstrap-kit children. Split the guard so the provisioner-group seed fires independently when missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
15bf2f28cc |
deploy: update catalyst images to 4a171b0
|
||
|
|
4a171b00d8
|
fix(catalyst): chroot /jobs Phase-0 + /cloud topology synth + AVAILABLE pill (#1072)
Three issues raised on console.omantel.biz, each verified live in Playwright BEFORE this fix and to be re-verified after deploy: 1. /jobs missing Phase-0 lifecycle rows. Only the 40 install-* rows from bootstrap-kit children showed; tofu-init/plan/apply/output and cluster-bootstrap rows were absent because those Job records live on the mother only. Fix: chrootSeedJobsStoreIfEmpty now also calls bridge.SeedProvisionerJobs() + MarkProvisionerComplete() so the chroot view shows the full deployment history under a "Provision Hetzner" group, all stamped Succeeded. 2. /cloud kind=clusters / node-pools / vclusters / load-balancers rendered "No clusters yet". The topology loader required the deployment record's Regions to be non-empty; the chroot's synthesised Deployment has empty Regions. Fix: topology_loader.buildTopology now falls through to a chroot path that lists live K8s Nodes via the in-cluster dynamic client, groups them by `node.kubernetes.io/instance-type` to derive NodePools, and emits one Region/Cluster carrying every real Node. lookupDeploymentForInfra now also calls chrootEnsureDeployment so the chroot path actually fires. 3. KEDA (and 14 other catalog items) showed "PENDING" pill with no install affordance — confusing because PENDING is what in-flight installs render. Fix: introduce ApplicationStatus='available' as a distinct value; map API status="available" to it; render an "AVAILABLE" pill (accent-tinted, distinct from neutral PENDING) so the operator sees the right call-to-action. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d45fa4a8b4 |
deploy: update catalyst images to 8e631eb
|
||
|
|
8e631ebd05
|
fix(catalyst): chroot Sovereign Console OIDC bearer auth + self synth id (#1071)
The chroot Sovereign Console SPA performs its own PKCE OIDC flow
(client-side token exchange — no server-minted catalyst_session
cookie). Until now, every /api/v1/* fetch from the chroot 401'd
because the BE's session middleware ONLY read catalyst_session
cookie. The user observed: /apps showed all 36 apps as "pending"
(liveAppsQuery 401 → fell back to wizard frozen state); /jobs
appeared limited; /cloud, /dashboard etc all degraded.
Three coupled fixes:
1. BE session middleware now ALSO accepts Authorization: Bearer
<jwt>. ValidateToken handles signature verification against the
same JWKS regardless of whether the JWT arrived via cookie or
header. (auth/session.go: ReadSessionToken)
2. FE installs a global window.fetch interceptor at boot
(main.tsx → installFetchAuthInterceptor). When the SPA holds an
OIDC access_token in sessionStorage (Sovereign Console only,
never on mother), every /api/v1/ fetch automatically picks up
Authorization: Bearer. Mother (cookie-based) is a transparent
no-op since sessionStorage has no token.
3. HandleSovereignSelf now also reads SOVEREIGN_FQDN env (the
chroot's standard sovereign-fqdn ConfigMap entry — same name
used by k8scache.factory.go). When no deployment id resolves
from any source, synthesise "sovereign-<fqdn>" — matching the
k8scache self-register convention so /api/v1/sovereigns/{id}/*
handlers' chroot-aliasing finds the same single registered
cluster the FE is targeting.
End-to-end: a fresh-cutover Sovereign Console serves real-time
apps + jobs + cloud data to operators who logged in via direct
Keycloak (no handover JWT), no per-deployment cutover-import
step required.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
deaf74270a |
deploy: update catalyst images to 118b9eb
|
||
|
|
118b9eb67d
|
fix(catalyst): durable Phase-0 jobs + chroot post-cutover live data (#1070)
Three coupled fixes for what the user observed post-cutover on console.omantel.biz: 1. JobsTable rows for tofu-init/plan/apply/output/cluster-bootstrap disappeared the moment bootstrap-kit children landed. Root cause: those rows were synthesised on the FE from the SSE event reducer; when liveJobs from the BE arrived, mergeJobs() switched to backend- only and the reducer-derived rows vanished. Fix: register the 5 Phase-0 lifecycle phases as durable Job records under a new "provisioner" group inside jobs.Store. The bridge now transitions them through Pending → Running → Succeeded/Failed as the provisioner emits its named-phase events; "tofu" stdout/stderr stream lines append to the currently-active phase's Execution. /jobs/tofu-apply (and the four siblings) now resolve from the very first emit and never disappear when the BE feed takes over. 2. /api/v1/sovereigns/<id>/k8s/stream returned 404 on every chroot post-cutover, so /cloud?view=list&kind=services and every other k8scache-backed view rendered "Stream temporarily unreachable". Root cause: the chroot's k8scache.Factory.FromEnv self-register path needed a deployment id, but cutover never imports the mother's record AND step-07 only patches CATALYST_GITOPS_REPO_URL — not CATALYST_SELF_DEPLOYMENT_ID. Result: chroot deferred forever, no informers, no clusters registered. Fix: factory.go now derives a stable "sovereign-<fqdn>" id from SOVEREIGN_FQDN when no other id resolves, so the chroot self- registers exactly one cluster on every Sovereign. The k8s handlers alias any incoming URL cluster id onto that single chroot cluster when SOVEREIGN_FQDN is set, so existing FE that targets the mother's deployment id keeps working byte-identically. 3. /api/v1/deployments/<id>/jobs returned every job as Pending with no Started/Duration/exec-logs because chrootSeedJobsStoreIfEmpty's in-memory ownership-check gate never matched (no deployment record imported). Fix: jobs.go now synthesises an in-memory Deployment record from SOVEREIGN_FQDN on first read, so the lazy seed fires and converts the live HelmRelease state into rich Job records. Together these mean post-cutover Sovereign Consoles serve real-time data for ALL future Sovereigns without any per-deployment cutover import step required. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3b930793c5 |
deploy: update catalyst images to 25f1446
|
||
|
|
25f14469d3
|
fix(provisioner): map wizard's three-mode domain selector to tofu's binary pool/byo enum (#1069)
Caught live on omantel.biz re-provision (deploymentId ab0bf689620f4102):
tofu plan failed at exit 1 with:
Error: Invalid value for variable
on variables.tf line 296:
296: variable "domain_mode" {
├────────────────
│ var.domain_mode is "byo-manual"
Domain mode must be 'pool' or 'byo'.
The wizard's StepDomain has three options (pool / byo-manual /
byo-api) so the UX can branch the operator into the right flow:
- pool: OpenOva owns the parent zone via Dynadot+PDM
- byo-manual: operator pastes NS records into their registrar
- byo-api: operator's registrar API drives NS automatically
The OpenTofu module's `variable "domain_mode"` validation only
accepts the binary pool/byo distinction — from the cloud-infra layer
(Hetzner servers, network, LB) NONE of those wizard distinctions
matter; tofu only needs to know whether to call Dynadot at apply
time. The three-mode wizard value was being written verbatim to the
tfvars without mapping.
Add `mapDomainModeForTofu(wizardMode)` helper:
- "pool" → "pool"
- "byo-manual"→ "byo"
- "byo-api" → "byo"
- empty → "byo" (test path that doesn't set the field)
Bump chart 1.4.83 → 1.4.84.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
adda972dd8 |
deploy: update catalyst images to 0a0b912
|
||
|
|
0a0b912e0d
|
fix(wizard): KServe was wrongly under Always Included on every Sovereign (#1068)
* fix(hetzner-purge): close volumes/primary_ips/floating_ips gap — wipe was leaving Crossplane orphans
Founder caught the gap on omantel.biz post-decommission: Hetzner
console showed 0 servers/LBs/IPs but 1 Volume + 2 Networks + 1
Firewall lingering. Networks/Firewall were the existing async-detach
window (handled by name-prefix fallback in the next provision); the
**Volume** was a hard miss — Purge() never called /v1/volumes.
Root cause: post-handover, the Hetzner Cloud Volume CSI driver
allocates Hetzner Volumes for every CNPG/Harbor/Loki/Mimir
StatefulSet PVC. tofu state never tracks them. When the operator
decommissions, `tofu destroy` is a no-op for the Volume and the
existing label-sweep didn't list /v1/volumes either. Result: orphan
volumes accrue cloud cost across re-provision cycles.
Same architectural gap for primary_ips (CCM-allocated for LoadBalancer
services since Hetzner's 2023 IP-decoupling) and floating_ips
(rare in Catalyst stack but listed for completeness).
Fix: extend Purge() + purgeByNamePrefix() to walk three additional
endpoints in dependency order:
servers → load_balancers → firewalls → networks → ssh_keys
→ volumes (after servers detach)
→ primary_ips (after LBs free their IPs)
→ floating_ips
Both label-pass AND name-prefix-pass cover all 8 kinds. PurgeReport
extended with Volumes/PrimaryIPs/FloatingIPs slices; Total() updated.
CSI-named volumes (`pvc-<uid>` form) won't match either pass — those
need the canonical `catalyst.openova.io/sovereign=<fqdn>` label which
the Crossplane composition for VolumeClaim must apply. That's a
separate composition-layer fix tracked separately; this PR closes
the wipe gap for everything labelled OR name-prefixed.
Bump chart 1.4.80 → 1.4.81.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(wizard): KServe was wrongly under Always Included on every Sovereign
Founder caught on console.openova.io/sovereign/wizard step 4: KServe
appeared in the "Always Included" section as if every Sovereign had
to install it. False positive — KServe is conditionally mandatory
ONLY when the operator opts into the CORTEX (AI/ML) product family.
Two coupled bugs:
(1) Data model: kserve was tagged tier:'mandatory' inside the CORTEX
product family, but tier:'mandatory' is consumed everywhere in
the wizard as "always-on regardless of family selection":
- componentGroups.ts:543 — seedIds.add(c.id) → auto-selected at
wizard init for every Sovereign
- applicationCatalog.ts:97 — seeded into the apps grid
- store.ts:642 — special-cased as undeselectable
- StepComponents.tsx — surfaced under "Always Included" tab
Demote to tier:'recommended'. CORTEX has
cascadeOnMemberSelection:true so picking any CORTEX member (vLLM,
Specter, BGE, Milvus, …) still auto-pulls KServe via the cascade
— that's the right semantics. KServe stays visible under CORTEX
in Tab 1 ("Choose Your Stack") and locks-in once CORTEX is
selected.
(2) UI filter: AlwaysIncludedTab was iterating every PRODUCTS entry
regardless of product.tier and listing every member with
component.tier === 'mandatory'. That mixes the platform-mandatory
layer (PILOT/SPINE/SURGE/SILO/GUARDIAN tier:'mandatory' families)
with conditional-mandatory members of opt-in families
(CORTEX/RELAY tier:'optional', INSIGHTS/FABRIC tier:'recommended').
Filter by product.tier === 'mandatory' so only the always-on
families' mandatory members appear. Defence-in-depth — even if a
new opt-in family ships with internal-mandatory members, they
won't leak into "Always Included".
Audit confirmed kserve was the only offender across all 9 product
families today. PILOT/SPINE/SURGE/SILO/GUARDIAN remain unchanged
(their members rightfully tier:'mandatory'); CORTEX kserve fixed;
others have no internal mandatories.
Bump chart 1.4.81 → 1.4.82.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9b4376fba7 |
deploy: update catalyst images to b233202
|
||
|
|
b233202b65
|
fix(hetzner-purge): close volumes/primary_ips/floating_ips gap — wipe was leaving Crossplane orphans (#1067)
Founder caught the gap on omantel.biz post-decommission: Hetzner console showed 0 servers/LBs/IPs but 1 Volume + 2 Networks + 1 Firewall lingering. Networks/Firewall were the existing async-detach window (handled by name-prefix fallback in the next provision); the **Volume** was a hard miss — Purge() never called /v1/volumes. Root cause: post-handover, the Hetzner Cloud Volume CSI driver allocates Hetzner Volumes for every CNPG/Harbor/Loki/Mimir StatefulSet PVC. tofu state never tracks them. When the operator decommissions, `tofu destroy` is a no-op for the Volume and the existing label-sweep didn't list /v1/volumes either. Result: orphan volumes accrue cloud cost across re-provision cycles. Same architectural gap for primary_ips (CCM-allocated for LoadBalancer services since Hetzner's 2023 IP-decoupling) and floating_ips (rare in Catalyst stack but listed for completeness). Fix: extend Purge() + purgeByNamePrefix() to walk three additional endpoints in dependency order: servers → load_balancers → firewalls → networks → ssh_keys → volumes (after servers detach) → primary_ips (after LBs free their IPs) → floating_ips Both label-pass AND name-prefix-pass cover all 8 kinds. PurgeReport extended with Volumes/PrimaryIPs/FloatingIPs slices; Total() updated. CSI-named volumes (`pvc-<uid>` form) won't match either pass — those need the canonical `catalyst.openova.io/sovereign=<fqdn>` label which the Crossplane composition for VolumeClaim must apply. That's a separate composition-layer fix tracked separately; this PR closes the wipe gap for everything labelled OR name-prefixed. Bump chart 1.4.80 → 1.4.81. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f958643dc7 |
deploy: update catalyst images to daeff32
|
||
|
|
daeff32cbe
|
fix(cloudpage): hoist k8sStream above ctx — TS use-before-declaration broke build-ui (#1066)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane Founder asked: "make the real-time k8s information propagation development reused — find the reverted prior work and implement the final working one." History: - PR #358 (May 1) shipped the full informer + SSE data plane: internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics} + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) + UI hook lib/useK8sStream.ts + widget useK8sCacheStream. - PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream with kinds=namespace,node,pv,pod,deployment,...,server.hcloud, volume.hcloud and `&initialState=1` for live cloud-graph deltas. - PR #981 hotfix dropped the synchronous discovery probe in factory.go:AddCluster (it was calling core.Discovery().ServerResourcesForGroupVersion(gv) with NO context timeout — on a kubeconfig pointing at a decommissioned otech the call hung the catalyst-api startup for minutes per dead cluster). After #981 the discovery-probe surgery was clean — no follow-up broke. The data plane code stayed in the codebase. The remaining gap was operational, not architectural: On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>), the catalyst-api boots without a posted-back kubeconfig in /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns [] → factory has zero clusters → every /api/v1/sovereigns/{depId}/k8s/* request 404s with "sovereign \"...\" not registered". The architecture-graph in-flight call confirmed live on omantel.biz today. Fix in this PR: 1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN env is set (chroot mode), build a ClusterRef with id resolved from CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by scanning /var/lib/catalyst/deployments/*.json for a record matching the FQDN (mirrors HandleSovereignSelf's store-fallback path for consistency). DynamicClient + CoreClient built from rest.InClusterConfig(). Append to the cluster list. Mother behavior unchanged — SOVEREIGN_FQDN unset → branch is a no-op. 2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide get/list/watch on every kind in the k8scache registry (pods, deployments, statefulsets, daemonsets, replicasets, services, endpointslices, ingresses, configmaps, secrets, persistentvolumes, persistentvolumeclaims, hcloud.crossplane.io managed resources, vclusters), plus authorization.k8s.io/subjectaccessreviews so the per-event SAR gating in the SSE handler doesn't 403 silently. 3. Bump chart 1.4.70 → 1.4.71. The discovery-probe failure mode that triggered the original revert (synchronous ServerResourcesForGroupVersion blocking startup) does NOT recur here — InClusterConfig() returns immediately, NewForConfig is lazy, and the first network call happens inside the informer goroutine after Start, off the boot critical path. Mother-side LoadClustersFromDir behavior is untouched (no probe, just kubeconfig file parsing as it has been since #981). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): + More popover escapes overflow clip + graph centers via gravity force Two cloud-page bugs caught live on omantel.biz: (1) /cloud?view=list&kind=clusters → +More popover non-functional. The popover renders at its anchor coords but pointer events pass through to the toolbar below it. Diagnosis: .cloud-page-toolbar > [data-testid="cloud-kind-chips"] { overflow-x: auto; } Per CSS spec, when one overflow axis is non-visible, the OTHER axis becomes auto/hidden too. So overflow-x:auto on the chips strip silently sets overflow-y:auto, which clips the absolutely- positioned popover that hangs DOWN from the +More button. Fix: render the popover via React.createPortal to document.body so it's outside any overflow ancestor. Position via fixed coordinates computed from the +More button's getBoundingClientRect, recomputed on resize/scroll. Click-outside dismissal updated to check both wrapper AND portaled popover. (2) /cloud?view=graph → bubbles drift to canvas edges, leaving the centre empty until enough nodes (e.g. worker nodes) are added to anchor things via link tension. Two coupled root causes: a) `forceCenter` only adjusts the centroid — it shifts ALL nodes uniformly so their average sits at (cx, cy). It does NOT pull individual nodes inward. With small node counts and high charge repulsion (-160 for ≤50 nodes), nothing opposes outward drift. b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x = minX`. Nodes that hit the wall get arrested with their velocity preserved on the perpendicular axis but no inward impulse → they slide along the wall and stack at corners. The simulation never relaxes back to the centre. Fix: a) Add forceX(cx) + forceY(cy) with `centerGravity` strength per node-count tier (0.08 for ≤50, scaling down with larger graphs where link tension is sufficient). This pulls every individual node toward the centre proportional to its offset. b) Replace the hard clamp with an elastic bounce: when a node hits the boundary, reverse its velocity component (×0.4 damping) instead of zeroing it. Energy returns to the system, the simulation actually relaxes. Bump chart 1.4.72 → 1.4.73. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cloud): expose all live K8s kinds in +More popover + chip counts + tighter graph centering Founder feedback (after PR #1062 lit up the data plane): 1. The +More popover was missing pods, deployments, statefulsets, daemonsets, configmaps, secrets, namespaces, etc. — it only carried the 6 placeholder kinds the legacy topology API knew about. 2. Several chips (Services, Ingresses, Storage Classes) showed "—" for count even though the data IS in the live cluster (visible in the graph view). 3. The graph view still pushed bubbles to canvas edges; only adding worker nodes brought things back. The previous gravity tuning wasn't strong enough for ~300 nodes. This PR addresses all three. (1) Eleven new K8s-backed list pages exposed in +More: Pods, Deployments, StatefulSets, DaemonSets, ReplicaSets, ConfigMaps, Secrets, Namespaces, Nodes, PersistentVolumes, EndpointSlices. Plus replaced the placeholder Services and Ingresses pages with live K8s tables. All built on a new generic K8sListPage that subscribes to /api/v1/sovereigns/{depId}/k8s/stream (same SSE channel the architecture-graph already uses) and renders a typed-column table per kind. Columns are declared once per kind in kindsPages.tsx; the rendering is uniform so adding a kind is a ~12-line wrapper. (2) CloudPage.kindCounts now folds the live K8s snapshot into the chip-count map. KIND_TO_REGISTRY in kinds.ts maps each chip id to the registry kind name (pods → 'pod' etc). Counts that came from null (data not available) flip to live counts the moment the SSE stream's initialState=1 arrives. (3) GraphCanvas physics retuned for live-data scale: - centerGravity: 0.08→0.18 for ≤50 nodes, 0.06→0.16 for ≤200, 0.04→0.14 for ≤1000, 0.03→0.10 for ≤5000, 0.02→0.08 for >5000. The forceX/forceY pulls every individual node toward (cx,cy) proportional to its offset — 2-3× stronger than the original tuning so the canvas centre stays populated. - Charge softened: -160→-90 for ≤50 nodes, scaled down through every tier. The previous values were calibrated against a ~20-node topology stub; live data delivers 10-50× more nodes per Sovereign so charge needs to relax proportionally. Bump chart 1.4.74 → 1.4.75. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-list): share single SSE subscription via CloudContext — list pages were stuck connecting After PR #1064 the +More popover was correctly populated and chip counts were live, but clicking through to a list page (e.g. /cloud?view=list&kind=pods) hung at "Connecting to live cluster stream…" while the chip count beside the same kind already showed the right number (110 pods). Diagnosis: the K8sListPage was calling useK8sCacheStream with kinds:[kind], opening its OWN EventSource. The parent CloudPage already had an EventSource open (subscribing to all kinds — the source of the chip counts). Two long-lived SSE streams from the same browser to the same origin starve the connection budget; the second connection hangs at "connecting" while the first holds the slot. Fix: hoist the snapshot via CloudContext. CloudPage is already the owner of the page-level useK8sCacheStream invocation; expose its snapshot/status/revision through the existing useCloud() context. K8sListPage now reads from useCloud() instead of opening a duplicate stream. Single subscription, single source of truth for both chip counts AND list rows. Bump chart 1.4.76 → 1.4.77. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloudpage): hoist k8sStream above ctx — was used before declaration PR #1065 added k8sStream into the ctx useMemo deps but the useK8sCacheStream() call was at line 396, well after the ctx build at line 290. tsc -b caught it: TS2448/TS2454 use-before-declaration. CI build-ui failed. Move the useK8sCacheStream invocation to immediately precede the ctx build. No behaviour change. Bump chart 1.4.78 → 1.4.79. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f02136a89c
|
fix(cloud-list): share single SSE via CloudContext — list pages were stuck connecting (#1065)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane Founder asked: "make the real-time k8s information propagation development reused — find the reverted prior work and implement the final working one." History: - PR #358 (May 1) shipped the full informer + SSE data plane: internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics} + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) + UI hook lib/useK8sStream.ts + widget useK8sCacheStream. - PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream with kinds=namespace,node,pv,pod,deployment,...,server.hcloud, volume.hcloud and `&initialState=1` for live cloud-graph deltas. - PR #981 hotfix dropped the synchronous discovery probe in factory.go:AddCluster (it was calling core.Discovery().ServerResourcesForGroupVersion(gv) with NO context timeout — on a kubeconfig pointing at a decommissioned otech the call hung the catalyst-api startup for minutes per dead cluster). After #981 the discovery-probe surgery was clean — no follow-up broke. The data plane code stayed in the codebase. The remaining gap was operational, not architectural: On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>), the catalyst-api boots without a posted-back kubeconfig in /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns [] → factory has zero clusters → every /api/v1/sovereigns/{depId}/k8s/* request 404s with "sovereign \"...\" not registered". The architecture-graph in-flight call confirmed live on omantel.biz today. Fix in this PR: 1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN env is set (chroot mode), build a ClusterRef with id resolved from CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by scanning /var/lib/catalyst/deployments/*.json for a record matching the FQDN (mirrors HandleSovereignSelf's store-fallback path for consistency). DynamicClient + CoreClient built from rest.InClusterConfig(). Append to the cluster list. Mother behavior unchanged — SOVEREIGN_FQDN unset → branch is a no-op. 2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide get/list/watch on every kind in the k8scache registry (pods, deployments, statefulsets, daemonsets, replicasets, services, endpointslices, ingresses, configmaps, secrets, persistentvolumes, persistentvolumeclaims, hcloud.crossplane.io managed resources, vclusters), plus authorization.k8s.io/subjectaccessreviews so the per-event SAR gating in the SSE handler doesn't 403 silently. 3. Bump chart 1.4.70 → 1.4.71. The discovery-probe failure mode that triggered the original revert (synchronous ServerResourcesForGroupVersion blocking startup) does NOT recur here — InClusterConfig() returns immediately, NewForConfig is lazy, and the first network call happens inside the informer goroutine after Start, off the boot critical path. Mother-side LoadClustersFromDir behavior is untouched (no probe, just kubeconfig file parsing as it has been since #981). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): + More popover escapes overflow clip + graph centers via gravity force Two cloud-page bugs caught live on omantel.biz: (1) /cloud?view=list&kind=clusters → +More popover non-functional. The popover renders at its anchor coords but pointer events pass through to the toolbar below it. Diagnosis: .cloud-page-toolbar > [data-testid="cloud-kind-chips"] { overflow-x: auto; } Per CSS spec, when one overflow axis is non-visible, the OTHER axis becomes auto/hidden too. So overflow-x:auto on the chips strip silently sets overflow-y:auto, which clips the absolutely- positioned popover that hangs DOWN from the +More button. Fix: render the popover via React.createPortal to document.body so it's outside any overflow ancestor. Position via fixed coordinates computed from the +More button's getBoundingClientRect, recomputed on resize/scroll. Click-outside dismissal updated to check both wrapper AND portaled popover. (2) /cloud?view=graph → bubbles drift to canvas edges, leaving the centre empty until enough nodes (e.g. worker nodes) are added to anchor things via link tension. Two coupled root causes: a) `forceCenter` only adjusts the centroid — it shifts ALL nodes uniformly so their average sits at (cx, cy). It does NOT pull individual nodes inward. With small node counts and high charge repulsion (-160 for ≤50 nodes), nothing opposes outward drift. b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x = minX`. Nodes that hit the wall get arrested with their velocity preserved on the perpendicular axis but no inward impulse → they slide along the wall and stack at corners. The simulation never relaxes back to the centre. Fix: a) Add forceX(cx) + forceY(cy) with `centerGravity` strength per node-count tier (0.08 for ≤50, scaling down with larger graphs where link tension is sufficient). This pulls every individual node toward the centre proportional to its offset. b) Replace the hard clamp with an elastic bounce: when a node hits the boundary, reverse its velocity component (×0.4 damping) instead of zeroing it. Energy returns to the system, the simulation actually relaxes. Bump chart 1.4.72 → 1.4.73. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cloud): expose all live K8s kinds in +More popover + chip counts + tighter graph centering Founder feedback (after PR #1062 lit up the data plane): 1. The +More popover was missing pods, deployments, statefulsets, daemonsets, configmaps, secrets, namespaces, etc. — it only carried the 6 placeholder kinds the legacy topology API knew about. 2. Several chips (Services, Ingresses, Storage Classes) showed "—" for count even though the data IS in the live cluster (visible in the graph view). 3. The graph view still pushed bubbles to canvas edges; only adding worker nodes brought things back. The previous gravity tuning wasn't strong enough for ~300 nodes. This PR addresses all three. (1) Eleven new K8s-backed list pages exposed in +More: Pods, Deployments, StatefulSets, DaemonSets, ReplicaSets, ConfigMaps, Secrets, Namespaces, Nodes, PersistentVolumes, EndpointSlices. Plus replaced the placeholder Services and Ingresses pages with live K8s tables. All built on a new generic K8sListPage that subscribes to /api/v1/sovereigns/{depId}/k8s/stream (same SSE channel the architecture-graph already uses) and renders a typed-column table per kind. Columns are declared once per kind in kindsPages.tsx; the rendering is uniform so adding a kind is a ~12-line wrapper. (2) CloudPage.kindCounts now folds the live K8s snapshot into the chip-count map. KIND_TO_REGISTRY in kinds.ts maps each chip id to the registry kind name (pods → 'pod' etc). Counts that came from null (data not available) flip to live counts the moment the SSE stream's initialState=1 arrives. (3) GraphCanvas physics retuned for live-data scale: - centerGravity: 0.08→0.18 for ≤50 nodes, 0.06→0.16 for ≤200, 0.04→0.14 for ≤1000, 0.03→0.10 for ≤5000, 0.02→0.08 for >5000. The forceX/forceY pulls every individual node toward (cx,cy) proportional to its offset — 2-3× stronger than the original tuning so the canvas centre stays populated. - Charge softened: -160→-90 for ≤50 nodes, scaled down through every tier. The previous values were calibrated against a ~20-node topology stub; live data delivers 10-50× more nodes per Sovereign so charge needs to relax proportionally. Bump chart 1.4.74 → 1.4.75. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-list): share single SSE subscription via CloudContext — list pages were stuck connecting After PR #1064 the +More popover was correctly populated and chip counts were live, but clicking through to a list page (e.g. /cloud?view=list&kind=pods) hung at "Connecting to live cluster stream…" while the chip count beside the same kind already showed the right number (110 pods). Diagnosis: the K8sListPage was calling useK8sCacheStream with kinds:[kind], opening its OWN EventSource. The parent CloudPage already had an EventSource open (subscribing to all kinds — the source of the chip counts). Two long-lived SSE streams from the same browser to the same origin starve the connection budget; the second connection hangs at "connecting" while the first holds the slot. Fix: hoist the snapshot via CloudContext. CloudPage is already the owner of the page-level useK8sCacheStream invocation; expose its snapshot/status/revision through the existing useCloud() context. K8sListPage now reads from useCloud() instead of opening a duplicate stream. Single subscription, single source of truth for both chip counts AND list rows. Bump chart 1.4.76 → 1.4.77. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0cfbb106dc |
deploy: update catalyst images to 2604c9c
|
||
|
|
2604c9cf36
|
feat(cloud): all live K8s kinds in +More + chip counts + tighter graph centering (#1064)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane Founder asked: "make the real-time k8s information propagation development reused — find the reverted prior work and implement the final working one." History: - PR #358 (May 1) shipped the full informer + SSE data plane: internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics} + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) + UI hook lib/useK8sStream.ts + widget useK8sCacheStream. - PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream with kinds=namespace,node,pv,pod,deployment,...,server.hcloud, volume.hcloud and `&initialState=1` for live cloud-graph deltas. - PR #981 hotfix dropped the synchronous discovery probe in factory.go:AddCluster (it was calling core.Discovery().ServerResourcesForGroupVersion(gv) with NO context timeout — on a kubeconfig pointing at a decommissioned otech the call hung the catalyst-api startup for minutes per dead cluster). After #981 the discovery-probe surgery was clean — no follow-up broke. The data plane code stayed in the codebase. The remaining gap was operational, not architectural: On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>), the catalyst-api boots without a posted-back kubeconfig in /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns [] → factory has zero clusters → every /api/v1/sovereigns/{depId}/k8s/* request 404s with "sovereign \"...\" not registered". The architecture-graph in-flight call confirmed live on omantel.biz today. Fix in this PR: 1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN env is set (chroot mode), build a ClusterRef with id resolved from CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by scanning /var/lib/catalyst/deployments/*.json for a record matching the FQDN (mirrors HandleSovereignSelf's store-fallback path for consistency). DynamicClient + CoreClient built from rest.InClusterConfig(). Append to the cluster list. Mother behavior unchanged — SOVEREIGN_FQDN unset → branch is a no-op. 2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide get/list/watch on every kind in the k8scache registry (pods, deployments, statefulsets, daemonsets, replicasets, services, endpointslices, ingresses, configmaps, secrets, persistentvolumes, persistentvolumeclaims, hcloud.crossplane.io managed resources, vclusters), plus authorization.k8s.io/subjectaccessreviews so the per-event SAR gating in the SSE handler doesn't 403 silently. 3. Bump chart 1.4.70 → 1.4.71. The discovery-probe failure mode that triggered the original revert (synchronous ServerResourcesForGroupVersion blocking startup) does NOT recur here — InClusterConfig() returns immediately, NewForConfig is lazy, and the first network call happens inside the informer goroutine after Start, off the boot critical path. Mother-side LoadClustersFromDir behavior is untouched (no probe, just kubeconfig file parsing as it has been since #981). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): + More popover escapes overflow clip + graph centers via gravity force Two cloud-page bugs caught live on omantel.biz: (1) /cloud?view=list&kind=clusters → +More popover non-functional. The popover renders at its anchor coords but pointer events pass through to the toolbar below it. Diagnosis: .cloud-page-toolbar > [data-testid="cloud-kind-chips"] { overflow-x: auto; } Per CSS spec, when one overflow axis is non-visible, the OTHER axis becomes auto/hidden too. So overflow-x:auto on the chips strip silently sets overflow-y:auto, which clips the absolutely- positioned popover that hangs DOWN from the +More button. Fix: render the popover via React.createPortal to document.body so it's outside any overflow ancestor. Position via fixed coordinates computed from the +More button's getBoundingClientRect, recomputed on resize/scroll. Click-outside dismissal updated to check both wrapper AND portaled popover. (2) /cloud?view=graph → bubbles drift to canvas edges, leaving the centre empty until enough nodes (e.g. worker nodes) are added to anchor things via link tension. Two coupled root causes: a) `forceCenter` only adjusts the centroid — it shifts ALL nodes uniformly so their average sits at (cx, cy). It does NOT pull individual nodes inward. With small node counts and high charge repulsion (-160 for ≤50 nodes), nothing opposes outward drift. b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x = minX`. Nodes that hit the wall get arrested with their velocity preserved on the perpendicular axis but no inward impulse → they slide along the wall and stack at corners. The simulation never relaxes back to the centre. Fix: a) Add forceX(cx) + forceY(cy) with `centerGravity` strength per node-count tier (0.08 for ≤50, scaling down with larger graphs where link tension is sufficient). This pulls every individual node toward the centre proportional to its offset. b) Replace the hard clamp with an elastic bounce: when a node hits the boundary, reverse its velocity component (×0.4 damping) instead of zeroing it. Energy returns to the system, the simulation actually relaxes. Bump chart 1.4.72 → 1.4.73. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cloud): expose all live K8s kinds in +More popover + chip counts + tighter graph centering Founder feedback (after PR #1062 lit up the data plane): 1. The +More popover was missing pods, deployments, statefulsets, daemonsets, configmaps, secrets, namespaces, etc. — it only carried the 6 placeholder kinds the legacy topology API knew about. 2. Several chips (Services, Ingresses, Storage Classes) showed "—" for count even though the data IS in the live cluster (visible in the graph view). 3. The graph view still pushed bubbles to canvas edges; only adding worker nodes brought things back. The previous gravity tuning wasn't strong enough for ~300 nodes. This PR addresses all three. (1) Eleven new K8s-backed list pages exposed in +More: Pods, Deployments, StatefulSets, DaemonSets, ReplicaSets, ConfigMaps, Secrets, Namespaces, Nodes, PersistentVolumes, EndpointSlices. Plus replaced the placeholder Services and Ingresses pages with live K8s tables. All built on a new generic K8sListPage that subscribes to /api/v1/sovereigns/{depId}/k8s/stream (same SSE channel the architecture-graph already uses) and renders a typed-column table per kind. Columns are declared once per kind in kindsPages.tsx; the rendering is uniform so adding a kind is a ~12-line wrapper. (2) CloudPage.kindCounts now folds the live K8s snapshot into the chip-count map. KIND_TO_REGISTRY in kinds.ts maps each chip id to the registry kind name (pods → 'pod' etc). Counts that came from null (data not available) flip to live counts the moment the SSE stream's initialState=1 arrives. (3) GraphCanvas physics retuned for live-data scale: - centerGravity: 0.08→0.18 for ≤50 nodes, 0.06→0.16 for ≤200, 0.04→0.14 for ≤1000, 0.03→0.10 for ≤5000, 0.02→0.08 for >5000. The forceX/forceY pulls every individual node toward (cx,cy) proportional to its offset — 2-3× stronger than the original tuning so the canvas centre stays populated. - Charge softened: -160→-90 for ≤50 nodes, scaled down through every tier. The previous values were calibrated against a ~20-node topology stub; live data delivers 10-50× more nodes per Sovereign so charge needs to relax proportionally. Bump chart 1.4.74 → 1.4.75. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9d60bbab91 |
deploy: update catalyst images to 167d093
|
||
|
|
167d09348e
|
fix(cloud): +More popover escapes overflow clip + graph centers via gravity force (#1063)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane Founder asked: "make the real-time k8s information propagation development reused — find the reverted prior work and implement the final working one." History: - PR #358 (May 1) shipped the full informer + SSE data plane: internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics} + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) + UI hook lib/useK8sStream.ts + widget useK8sCacheStream. - PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream with kinds=namespace,node,pv,pod,deployment,...,server.hcloud, volume.hcloud and `&initialState=1` for live cloud-graph deltas. - PR #981 hotfix dropped the synchronous discovery probe in factory.go:AddCluster (it was calling core.Discovery().ServerResourcesForGroupVersion(gv) with NO context timeout — on a kubeconfig pointing at a decommissioned otech the call hung the catalyst-api startup for minutes per dead cluster). After #981 the discovery-probe surgery was clean — no follow-up broke. The data plane code stayed in the codebase. The remaining gap was operational, not architectural: On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>), the catalyst-api boots without a posted-back kubeconfig in /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns [] → factory has zero clusters → every /api/v1/sovereigns/{depId}/k8s/* request 404s with "sovereign \"...\" not registered". The architecture-graph in-flight call confirmed live on omantel.biz today. Fix in this PR: 1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN env is set (chroot mode), build a ClusterRef with id resolved from CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by scanning /var/lib/catalyst/deployments/*.json for a record matching the FQDN (mirrors HandleSovereignSelf's store-fallback path for consistency). DynamicClient + CoreClient built from rest.InClusterConfig(). Append to the cluster list. Mother behavior unchanged — SOVEREIGN_FQDN unset → branch is a no-op. 2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide get/list/watch on every kind in the k8scache registry (pods, deployments, statefulsets, daemonsets, replicasets, services, endpointslices, ingresses, configmaps, secrets, persistentvolumes, persistentvolumeclaims, hcloud.crossplane.io managed resources, vclusters), plus authorization.k8s.io/subjectaccessreviews so the per-event SAR gating in the SSE handler doesn't 403 silently. 3. Bump chart 1.4.70 → 1.4.71. The discovery-probe failure mode that triggered the original revert (synchronous ServerResourcesForGroupVersion blocking startup) does NOT recur here — InClusterConfig() returns immediately, NewForConfig is lazy, and the first network call happens inside the informer goroutine after Start, off the boot critical path. Mother-side LoadClustersFromDir behavior is untouched (no probe, just kubeconfig file parsing as it has been since #981). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): + More popover escapes overflow clip + graph centers via gravity force Two cloud-page bugs caught live on omantel.biz: (1) /cloud?view=list&kind=clusters → +More popover non-functional. The popover renders at its anchor coords but pointer events pass through to the toolbar below it. Diagnosis: .cloud-page-toolbar > [data-testid="cloud-kind-chips"] { overflow-x: auto; } Per CSS spec, when one overflow axis is non-visible, the OTHER axis becomes auto/hidden too. So overflow-x:auto on the chips strip silently sets overflow-y:auto, which clips the absolutely- positioned popover that hangs DOWN from the +More button. Fix: render the popover via React.createPortal to document.body so it's outside any overflow ancestor. Position via fixed coordinates computed from the +More button's getBoundingClientRect, recomputed on resize/scroll. Click-outside dismissal updated to check both wrapper AND portaled popover. (2) /cloud?view=graph → bubbles drift to canvas edges, leaving the centre empty until enough nodes (e.g. worker nodes) are added to anchor things via link tension. Two coupled root causes: a) `forceCenter` only adjusts the centroid — it shifts ALL nodes uniformly so their average sits at (cx, cy). It does NOT pull individual nodes inward. With small node counts and high charge repulsion (-160 for ≤50 nodes), nothing opposes outward drift. b) `makeForceBound` was a HARD clamp: `if (n.x < minX) n.x = minX`. Nodes that hit the wall get arrested with their velocity preserved on the perpendicular axis but no inward impulse → they slide along the wall and stack at corners. The simulation never relaxes back to the centre. Fix: a) Add forceX(cx) + forceY(cy) with `centerGravity` strength per node-count tier (0.08 for ≤50, scaling down with larger graphs where link tension is sufficient). This pulls every individual node toward the centre proportional to its offset. b) Replace the hard clamp with an elastic bounce: when a node hits the boundary, reverse its velocity component (×0.4 damping) instead of zeroing it. Energy returns to the system, the simulation actually relaxes. Bump chart 1.4.72 → 1.4.73. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
eca1e00ab7 |
deploy: update catalyst images to 2ad31b4
|
||
|
|
2ad31b4481
|
feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes real-time data plane (#1062)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(k8scache): chroot Sovereign self-registers via in-cluster config — completes the real-time data plane Founder asked: "make the real-time k8s information propagation development reused — find the reverted prior work and implement the final working one." History: - PR #358 (May 1) shipped the full informer + SSE data plane: internal/k8scache/{factory,kinds,sar,redact,snapshot,hydrate,metrics} + handler/k8s.go (HandleK8sList, HandleK8sStream, HandleK8sSync) + UI hook lib/useK8sStream.ts + widget useK8sCacheStream. - PR #978 (May 5) wired ArchitectureGraphPage to useK8sCacheStream with kinds=namespace,node,pv,pod,deployment,...,server.hcloud, volume.hcloud and `&initialState=1` for live cloud-graph deltas. - PR #981 hotfix dropped the synchronous discovery probe in factory.go:AddCluster (it was calling core.Discovery().ServerResourcesForGroupVersion(gv) with NO context timeout — on a kubeconfig pointing at a decommissioned otech the call hung the catalyst-api startup for minutes per dead cluster). After #981 the discovery-probe surgery was clean — no follow-up broke. The data plane code stayed in the codebase. The remaining gap was operational, not architectural: On a chroot Sovereign Console (post-cutover, console.<sov-fqdn>), the catalyst-api boots without a posted-back kubeconfig in /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir returns [] → factory has zero clusters → every /api/v1/sovereigns/{depId}/k8s/* request 404s with "sovereign \"...\" not registered". The architecture-graph in-flight call confirmed live on omantel.biz today. Fix in this PR: 1. **k8scache.FactoryFromEnv chroot self-register**: when SOVEREIGN_FQDN env is set (chroot mode), build a ClusterRef with id resolved from CATALYST_SELF_DEPLOYMENT_ID env (orchestrator-stamped) or by scanning /var/lib/catalyst/deployments/*.json for a record matching the FQDN (mirrors HandleSovereignSelf's store-fallback path for consistency). DynamicClient + CoreClient built from rest.InClusterConfig(). Append to the cluster list. Mother behavior unchanged — SOVEREIGN_FQDN unset → branch is a no-op. 2. **ClusterRole catalyst-api-cutover-driver**: grant cluster-wide get/list/watch on every kind in the k8scache registry (pods, deployments, statefulsets, daemonsets, replicasets, services, endpointslices, ingresses, configmaps, secrets, persistentvolumes, persistentvolumeclaims, hcloud.crossplane.io managed resources, vclusters), plus authorization.k8s.io/subjectaccessreviews so the per-event SAR gating in the SSE handler doesn't 403 silently. 3. Bump chart 1.4.70 → 1.4.71. The discovery-probe failure mode that triggered the original revert (synchronous ServerResourcesForGroupVersion blocking startup) does NOT recur here — InClusterConfig() returns immediately, NewForConfig is lazy, and the first network call happens inside the informer goroutine after Start, off the boot critical path. Mother-side LoadClustersFromDir behavior is untouched (no probe, just kubeconfig file parsing as it has been since #981). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f88da5ff6e |
deploy: update catalyst images to eb6a3c1
|
||
|
|
eb6a3c1812
|
fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
66eca90c16 |
deploy: update catalyst images to 8361df4
|
||
|
|
8361df46ac
|
feat(apps): publish chip on each card — replaces deleted /catalog page (#1059)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` *bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
45b73651f8 |
deploy: update catalyst images to aed0a81
|
||
|
|
aed0a81f75
|
fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page (#1058)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5d9fa2a5e7 |
deploy: update catalyst images to 8c8ccfb
|
||
|
|
8c8ccfbfed
|
fix(chroot): single chrome — no frame in frame, no mother handover banner (#1057)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/*` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. **Two stacked headers + sidebar inside sidebar** ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. **"✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps.** This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bda5617aed |
deploy: update catalyst images to 933b321
|