openova

Author	SHA1	Message	Date
e3mrah	b52fc45c37	fix(bp-catalyst-platform): cutover-driver RBAC dual-mode render (#830) (#839) Chart 1.3.2 shipped serviceaccount-cutover-driver.yaml + clusterrole-cutover-driver.yaml + clusterrolebinding-cutover-driver.yaml with `{{ .Release.Namespace }}` directives that rendered fine via Helm on Sovereigns but BROKE the Kustomize-mode contabo-mkt deploy: the directives made Kustomize parse the files as invalid YAML and silently skip them. Worse, the new files were never added to templates/kustomization.yaml's resources list. Result on contabo: catalyst-api Pod's spec.serviceAccountName references a non-existent SA — the Pod fails ContainerCreating with the same RBAC forbidden error #830 was meant to fix. Fix: - Strip `{{ .Release.Namespace }}` directives from the SA + ClusterRole files. metadata.namespace auto-fills from Helm's --namespace flag and from Kustomize's `namespace:` directive. - For ClusterRoleBinding: Helm does NOT auto-inject subjects[0].namespace the way it does metadata.namespace, so the apiserver rejects bindings without it. Split into two files: * clusterrolebinding-cutover-driver.yaml — Helm-only, uses {{ .Release.Namespace }} (correctly resolves to catalyst-system on Sovereigns). * clusterrolebinding-cutover-driver-kustomize.yaml — Kustomize-only, omits subjects[0].namespace and relies on Kustomize's native injection (resolves to `catalyst` on contabo). The .helmignore excludes the Kustomize-only file from Sovereign chart packaging; templates/kustomization.yaml's resources list references the Kustomize-only file, NOT the Helm-only one. - Add the new RBAC files to templates/kustomization.yaml's resources list so contabo's Flux Kustomization actually renders them. Verified live with `helm template` (subjects[0].namespace=catalyst-system) and `kubectl kustomize` (subjects[0].namespace=catalyst). Bumps bp-catalyst-platform 1.3.2 → 1.3.3. 
Issue: openova-io/openova#830 (Bug 1 follow-up) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 23:54:03 +04:00
github-actions[bot]	fb9c9b72d9	deploy: update catalyst images to `772d159`	2026-05-04 19:50:19 +00:00
e3mrah	772d159691	feat(sme-tenant): multi-domain Sovereign support — parent-domain dropdown + free-subdomain-under-any-pool-domain (#828 ) (#836 ) Extends the SME tenant provisioning pipeline (#804) for the multi-domain Sovereign (epic #825). The SME tenant create form now lets the operator pick which sme-pool parent zone hosts the tenant; the orchestrator writes DNS records under the chosen parent (not a hardcoded primary). Backend (Go): - store.SMETenantProvisionRecord.ParentDomain — captured at create - handler.SMETenantParentDomain + SMETenantDeps.ParentDomains — pool wiring - POST /api/v1/sme/tenants accepts parent_domain; defaults to the first NS-flip-ready sme-pool entry; rejects unknown parents (400) and not-yet-flipped parents (503 + Retry-After) - DNS provisioner ProvisionFreeSubdomain takes a parentZone parameter; ValidateBYOCNAME accepts a multi-target candidate list (any parent) - Pipeline: writes A records under the chosen parent zone; realm URL, console host, and gitops template hostnames all derive from ParentDomain (data-driven; never hardcoded) - New GET /api/v1/sovereign/parent-domains?role= read-only endpoint with env stub (CATALYST_SME_POOL_DOMAINS) that integrates cleanly with MD-1 (#826) when its data model lands UI (React + TanStack Router + Vitest + Playwright): - New /console/sme/tenants/new — CreateTenantPage with domain-mode radio, parent-domain <select> populated from the new endpoint, per-option NS-flip-ready disabled state, live console URL preview, CNAME validation hint for BYO mode, post-submit progress timeline - 7 Vitest unit tests + 2 Playwright E2E specs (free-subdomain + BYO), 5 1440px screenshots emitted under e2e/screenshots/828-*.png Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent-domain pool is fully data-driven; the UI consumes the same wire shape MD-1 will surface. Per #2 (never compromise on quality) the page paints partial state on hook failure with per-step badges from the response. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 23:48:10 +04:00
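The parent_domain rules described for POST /api/v1/sme/tenants (default to the first NS-flip-ready sme-pool entry, 400 for an unknown parent, 503 for a parent that has not flipped yet) can be sketched as a small pure function. This is an illustrative sketch only, not the handler's actual code: the PoolDomain type, the ResolveParentDomain name, and the choice to answer 503 when no entry is ready are all assumptions.

```go
package main

import "net/http"

// PoolDomain is a hypothetical view of one sme-pool parent zone.
type PoolDomain struct {
	Name        string
	NSFlipReady bool // true once the registrar NS flip has completed
}

// ResolveParentDomain returns the parent zone to use and the HTTP status
// the create handler would answer with. An empty request falls back to the
// first NS-flip-ready entry in pool order.
func ResolveParentDomain(requested string, pool []PoolDomain) (string, int) {
	if requested == "" {
		for _, d := range pool {
			if d.NSFlipReady {
				return d.Name, http.StatusOK
			}
		}
		// Assumption: with no ready entry the caller should retry later.
		return "", http.StatusServiceUnavailable
	}
	for _, d := range pool {
		if d.Name != requested {
			continue
		}
		if !d.NSFlipReady {
			// Known parent, but its NS flip is not done: 503 + Retry-After.
			return "", http.StatusServiceUnavailable
		}
		return d.Name, http.StatusOK
	}
	// Parent is not part of the pool at all.
	return "", http.StatusBadRequest
}
```

A caller would set the Retry-After header itself whenever the returned status is 503; the sketch leaves the header value out because the source does not state it.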
github-actions[bot]	090e1f6a34	deploy: update catalyst images to `e96741a`	2026-05-04 19:44:11 +00:00
e3mrah	e96741a0ca	feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838) A franchised Sovereign now supports N parent zones, NOT one. The operator brings 1+ parent domains at signup (`omani.works` for own use, `omani.trade` for the SME pool, etc.) and may add more post-handover via the admin console (#829). bp-powerdns 1.2.0 (platform/powerdns/chart): - New `zones: []` values key listing parent domains to bootstrap - New Helm post-install/post-upgrade hook Job (templates/zone-bootstrap-job.yaml) that POSTs each entry to /api/v1/servers/localhost/zones at install time. Idempotent on HTTP 409 — re-runs after upgrades or chart bumps never fail. - Default-values render skips when zones is empty (legacy behavior). bp-catalyst-platform 1.4.0 (products/catalyst/chart): - New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}` values - New templates/sovereign-wildcard-certs.yaml renders one cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex) via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert renews independently. Skips entirely when parentZones is empty so the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml retains ownership of `sovereign-wildcard-tls` (avoids helm-vs-kustomize ownership flap). - New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded into the catalyst-api Pod as CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_SERVER_ID env vars. catalyst-api (products/catalyst/bootstrap/api): - New internal/powerdns package with typed Client (CreateZone, ZoneExists). Idempotent on HTTP 409/412. - handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the typed client when wired via SetPowerDNSZoneClient — the admin-console "Add another parent domain" flow now creates real zones in the Sovereign's PowerDNS at runtime. - main.go wires the client when CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_API_KEY are set. - Comprehensive unit tests (client_test.go: 9 cases incl. 
201/409/412/500 + custom NS + custom serverID). Bootstrap-kit slot integration: - clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from Flux postBuild.substitute. - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bumps to bp-catalyst-platform 1.4.0 and threads `parentZones: ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two slots stay in lockstep). - infra/hetzner: new `parent_domains_yaml` Terraform variable (defaults to single-zone array derived from sovereign_fqdn) → cloud-init renders the PARENT_DOMAINS_YAML Flux substitute. DoD verified end-to-end with helm template + envsubst: - Multi-zone overlay (omani.works + omani.trade) renders 2 PowerDNS zone-create API calls in the bootstrap Job AND 2 Certificate resources (`*.omani.works`, `*.omani.trade`) in bp-catalyst-platform. - Single-zone fallback (PARENT_DOMAINS_YAML defaults to `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy provisioning paths working without per-overlay edits. Closes #827. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-04 23:42:00 +04:00
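The "idempotent on HTTP 409/412" behaviour of the zone-create call can be illustrated with a short sketch. It is not the internal/powerdns client from this commit: the function names, the "Native" zone kind, and the request body shape are assumptions for illustration. Only the /api/v1/servers/<id>/zones path and the 201/409/412 handling come from the message above; X-API-Key is the header the PowerDNS HTTP API uses for authentication.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// zoneCreateOK reports whether a zone-create response counts as success.
// 201 means the zone was created; 409 and 412 are treated as "already
// there", which is what makes re-runs harmless.
func zoneCreateOK(status int) bool {
	switch status {
	case http.StatusCreated, http.StatusConflict, http.StatusPreconditionFailed:
		return true
	}
	return false
}

// canonicalZone lower-cases the name and guarantees one trailing dot.
func canonicalZone(name string) string {
	return strings.TrimSuffix(strings.ToLower(name), ".") + "."
}

// CreateZone POSTs one zone to the PowerDNS API (hypothetical signature).
func CreateZone(ctx context.Context, hc *http.Client, baseURL, serverID, apiKey, zone string) error {
	// Assumption: a minimal body with kind "Native"; real charts may send more.
	body, err := json.Marshal(map[string]string{"name": canonicalZone(zone), "kind": "Native"})
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/api/v1/servers/%s/zones", strings.TrimRight(baseURL, "/"), serverID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("X-API-Key", apiKey)
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if !zoneCreateOK(resp.StatusCode) {
		return fmt.Errorf("create zone %q: unexpected status %d", zone, resp.StatusCode)
	}
	return nil
}
```

Keeping the status classification in its own function makes the idempotency rule testable without a live PowerDNS server.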
github-actions[bot]	92e712a8a6	deploy: update catalyst images to `0bf7b3b`	2026-05-04 19:38:24 +00:00
e3mrah	0bf7b3b16d	feat(provisioner): parentDomains[] data model + per-domain abstraction (#826 ) (#835 ) Sub-1 of epic #825 (Multi-domain Sovereign). Backend-only per the SCOPE CORRECTION on issue #826: the wizard stays single-FQDN, multi- domain capability is a Day-2 admin-console action (#829, already merged with an in-memory stub waiting on this PR's persistence layer). What this PR adds: - provisioner.ParentDomain struct (Name, Role, RegistrarKind, RegistrarCredsRef, AddedAt) with role constants ParentDomainRolePrimary \| ParentDomainRoleSMEPool. Wire shape matches the handler-layer ParentDomain in handler/parent_domains.go (#829), so the handler's swap from in-memory store → Deployment.parentDomains[] is a one-line change in a follow-up PR. - Request.ParentDomains []ParentDomain field. Backward-compatible: when the slice is empty, Validate() synthesises a single primary entry from SovereignPoolDomain (or SovereignFQDN) so legacy single-FQDN payloads + on-disk records read cleanly. The next Save() round-trips the array form — transparent migration with no one-shot script. - validateParentDomains: enforces "exactly one primary", role enum, FQDN regex (RFC 1035, mirrors wizard isValidDomain), duplicate- name dedupe, lowercase normalisation in place. - ProvisionParentDomain / ProvisionParentDomains: the per-domain abstraction the issue's DoD calls out as "reusable function ready for #829". Day-2 add-domain calls this with the same step list (registrar-flip → powerdns-zone-create → cert-manager-cert) the Day-1 path uses; idempotent, stops on first error, emits per-step SSE events for the admin panel. - Request.PrimaryParentDomain() / SMEPoolParentDomains() lookup helpers so the catalyst-api handler + SME signup wizard read the primary / sme-pool subset without re-iterating at every call site. 
- writeTfvars emits parent_domains as a JSON array (never null) so a future OpenTofu module's `for pd in var.parent_domains` validator accepts the input — same nil-trap fix the regions slice already carries. - store.RedactedRequest + ToProvisionerRequest round-trip the slice verbatim. Fields are non-secret (RegistrarCredsRef points at a SealedSecret name; plaintext registrar credentials never live on the deployment record). - store.crdStore mirrors the slice into the ProvisioningState CRD spec so admin tooling reading via the K8s API sees the live pool. What this PR does NOT touch (explicit scope): - products/catalyst/bootstrap/ui/src/pages/wizard/** — wizard UI stays single-FQDN per the issue's SCOPE CORRECTION. - products/catalyst/bootstrap/api/internal/handler/parent_domains.go — the #829-merged Day-2 admin handler keeps its in-memory store; a one-line follow-up PR swaps to Deployment.parentDomains[]. Inviolable Principle #4: defaultRegistrarKindFromEnv reads CATALYST_DEFAULT_REGISTRAR_KIND so operators on registrars other than Dynadot override the synthesis path without code changes. No TLD or count is hardcoded. Tests: - 14 new unit tests across two new files (parent_domains_test.go in provisioner + store packages). Cover: synthesis from SovereignFQDN + SovereignPoolDomain, "exactly one primary" invariant (rejects 2 + 0), unknown role, empty role, malformed FQDN, duplicate names, uppercase normalisation, lookup helpers, step-runner ordering + first-error halt, slice-flavour multi-domain iteration, JSON round-trip through Redact + Save + LoadAll, empty-slice omitempty, legacy on-disk record loads cleanly + migration synthesises primary on Validate. - Pre-existing Harbor-token + AuthHandover-signer-nil failures persist on origin/main; this PR introduces no new failures. Closes #826. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 23:36:28 +04:00
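The validation rules listed for validateParentDomains (exactly one primary, role enum, FQDN shape, duplicate names, lowercase normalisation in place) can be sketched as follows. This is a simplified stand-in, not the provisioner's code: the struct is reduced to two fields, the regular expression is an approximation of the RFC 1035 label rules, and rejecting duplicates (rather than silently dropping them) is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const (
	RolePrimary = "primary"
	RoleSMEPool = "sme-pool"
)

// ParentDomain is a reduced, illustrative version of the struct in the PR.
type ParentDomain struct {
	Name string
	Role string
}

// fqdnRE approximates RFC 1035: dot-separated labels of letters, digits and
// hyphens that do not start or end with a hyphen, followed by an alphabetic TLD.
var fqdnRE = regexp.MustCompile(`^([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$`)

// ValidateParentDomains normalises names in place and enforces the invariants.
func ValidateParentDomains(pds []ParentDomain) error {
	seen := make(map[string]bool, len(pds))
	primaries := 0
	for i := range pds {
		pds[i].Name = strings.ToLower(strings.TrimSpace(pds[i].Name))
		name := pds[i].Name
		if len(name) > 253 || !fqdnRE.MatchString(name) {
			return fmt.Errorf("parent domain %q is not a valid FQDN", name)
		}
		if seen[name] {
			return fmt.Errorf("duplicate parent domain %q", name)
		}
		seen[name] = true
		switch pds[i].Role {
		case RolePrimary:
			primaries++
		case RoleSMEPool:
			// valid, nothing to count
		default:
			return fmt.Errorf("parent domain %q has unknown role %q", name, pds[i].Role)
		}
	}
	if primaries != 1 {
		return fmt.Errorf("want exactly one primary parent domain, got %d", primaries)
	}
	return nil
}
```

Because the slice elements are rewritten through their index, callers see the lower-cased names after a successful call.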
github-actions[bot]	4cacbc2c17	deploy: update catalyst images to `620d8b6`	2026-05-04 19:33:09 +00:00
e3mrah	620d8b6c13	feat(admin-console): add-domain flow + DNS propagation status panel (#829 ) (#834 ) * feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802) Implements the SME-tier extension to the existing Sovereign Console SPA per [Q-mine-1] of #795: same React bundle serves both otech-admin and SME-admin views, tenant context discovered via window.location.host against a back-end registry — not from path/subdomain string parsing. Backend (catalyst-api / unified-rbac slice): - Tenant registry (store.TenantRegistry) — flat-file host → tenant lookup table backing the public discovery endpoint. Host normalised to lowercase; case-insensitive lookups. - GET /api/v1/tenant/discover (public, no auth gate) — returns {tenant_id, tenant_kind, keycloak_realm_url, keycloak_client_id} on 200, 404 on unknown host, 503 if registry unwired. Admin URLs are NEVER on this wire. - POST /api/v1/sme/users — fires ADR-0003 3-step hook (Keycloak → NewAPI → K8s Secret SSA with field manager `unified-rbac`). Each step idempotent; persisted state machine in store.UserProvisionStore per ADR-0003 §3.4. Returns 202 with steps[] progress array so the SPA can render the 3-step indicator even on partial failure. - GET /api/v1/sme/users / DELETE /api/v1/sme/users/{uuid} — list + inverse rollback per ADR-0003 §3.7. - internal/newapi.Client — minimal NewAPI admin REST client; 201 happy-path + 409 idempotent recovery via GET ?external_id=<uuid> per ADR-0003 §3.2 (NewAPI does NOT rotate api_key on conflict). Frontend (Sovereign Console SPA): - Branded TenantID + TenantKind types (shared/types/tenant.ts) — same pattern as DeploymentID (#749). - shared/lib/tenantDiscover.ts — fire-and-forget discovery in main.tsx; result cached in module state for sidebar nav + OIDC bootstrap. - pages/sme/UsersPage.tsx — user CRUD UI with 3-step KC/NewAPI/Secret progress indicator wired off the API response shape. 
- pages/sme/RolesPage.tsx — canonical Keycloak group → app role map (wordpress / openclaw / stalwart / rbac) per #795 [B]. - pages/sme/sme.api.ts — typed REST client; X-Tenant-Host header carries window.location.host on every call. - Routes mounted at /console/sme/users + /console/sme/roles under the existing SovereignConsoleLayout — same SPA bundle, different route tree per discovered tenant_kind. Tests: 22 new UI tests (4 files), 33 new Go tests (4 files). All green: branded type parsers reject empty/non-string inputs, tenant discovery handles 200/404/503/network-error paths, the 3-step hook runs end-to-end against fake KC/NewAPI/SSA stubs, partial-failure states surface verbatim through the steps[] response field, public discovery endpoint never leaks admin URLs. Per docs/INVIOLABLE-PRINCIPLES.md #4 every URL goes through apiUrl() in shared/config/urls; per #2 wire shapes parse through branded-type parsers at the boundary; per #3 K8s Secret apply uses client-go SSA (field manager `unified-rbac`) — no exec.Command kubectl shell-out. Closes #802. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(unified-rbac): add Playwright E2E for SME-tier UI (#802) Three specs covering: - SME UsersPage: empty state → create form → 3-step progress indicator (KC done / NewAPI done / Secret done) — proves the page is wired to the API response shape. - SME RolesPage: canonical group → app-role table renders the full 7-row mapping locked in #795 [B]. - OTECH tenant: same SPA bundle navigates /console/dashboard for the otech discovery payload — proves [Q-mine-1] of #795 (one bundle, two route trees, host-driven discovery). Backend mocks: route fulfillers stub /tenant/discover, /sme/users, and /whoami so the dev-server harness can drive the SPA without the catalyst-api backend or a live SME vcluster. The full live cross-cluster E2E gates on bp-newapi (#799) seeding the tenant registry at SME-onboarding time, which lands in #804. 
1440 px screenshots captured at e2e/screenshots/802-*.png: - 802-sme-users-empty-1440.png - 802-sme-users-create-form-1440.png - 802-sme-users-after-create-1440.png - 802-sme-roles-1440.png - 802-otech-dashboard-same-bundle-1440.png Run: VITE_CATALYST_MODE=sovereign VITE_SOVEREIGN_FQDN=acme.otech.example npm run dev npx playwright test e2e/sme-tier-rbac.spec.ts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(admin-console): add-domain flow + DNS propagation status panel (#829) Multi-domain Sovereign — operator-admin "Add another parent domain" surface in the Sovereign Console + live DNS propagation status panel. Closes the MD-4 sub-ticket of epic #825. Backend (catalyst-api/internal/handler/parent_domains.go): - GET /api/v1/sovereign/parent-domains — list pool - POST /api/v1/sovereign/parent-domains — add domain - DELETE /api/v1/sovereign/parent-domains/{name} — remove - GET /api/v1/sovereign/parent-domains/{name}/propagation — fan-out to 5+ public DNS resolvers The Add pipeline calls PDM /set-ns (sister #826), creates the PowerDNS zone (sister #827, env-gated stub until that PR lands), and issues a wildcard cert via cert-manager (also sister #827, env-gated stub). All three steps update the same store row so the UI can render per-step progress. DNS propagation panel uses Go's net.Resolver with a custom Dial that routes lookups through a SPECIFIC resolver IP (8.8.8.8, 1.1.1.1, 9.9.9.9, 208.67.222.222, 4.2.2.1) rather than the system resolver. Per inviolable principle #4, the resolver list, expected NS records, and per-query timeout are all env-overridable. 
Frontend (ui/src/pages/admin/parent-domains/): - ParentDomainsPage.tsx — list view + Add Domain modal + per-row inline drawer with PropagationPanel - PropagationPanel.tsx — polls /propagation every 60s, renders green/yellow/red pills per resolver + rolling % propagated number - parentDomains.api.ts — typed REST client wrappers, no inline /api/ Routing: - /console/parent-domains registered under SovereignConsoleLayout - Added to Settings sub-nav for operator-admin reachability Tests: - 6 vitest cases (empty state, populated rows, modal open, drawer toggle, primary lock, propagation panel mount) - 13 Go cases covering list/add/delete/validation/propagation wire shape against a stub PDM - 3 Playwright E2E + 1440x900 screenshots: e2e/screenshots/829-1-just-flipped.png (0% propagated) e2e/screenshots/829-2-partially-propagated.png (40%) e2e/screenshots/829-3-fully-propagated.png (100%) Per inviolable principle #10 (credential hygiene) the registrarToken field is forwarded byte-for-byte to PDM and never enters a logged struct; the modal input uses type="password". Refs: #825 (parent epic), #826 (sister MD-1), #827 (sister MD-2) --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 23:31:03 +04:00
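The propagation check in this commit (a net.Resolver whose Dial is pinned to one specific resolver IP, fanned out across several public resolvers, then reduced to a percentage) can be sketched like this. It is a hedged illustration rather than the handler's code: every function name here is hypothetical, and the rule "a resolver counts as propagated only when it returns all expected NS records" is an assumption.

```go
package main

import (
	"context"
	"net"
	"strings"
	"sync"
	"time"
)

// resolverFor returns a resolver that sends every query to serverIP:53
// instead of the system resolver. PreferGo is required so the Go resolver
// (and therefore the custom Dial) is used.
func resolverFor(serverIP string, timeout time.Duration) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: timeout}
			return d.DialContext(ctx, network, net.JoinHostPort(serverIP, "53"))
		},
	}
}

// canonicalHost lower-cases a host name and drops the trailing dot.
func canonicalHost(h string) string {
	return strings.TrimSuffix(strings.ToLower(h), ".")
}

// hasAllNS reports whether every expected nameserver appears in got.
func hasAllNS(got, expected []string) bool {
	seen := make(map[string]bool, len(got))
	for _, h := range got {
		seen[canonicalHost(h)] = true
	}
	for _, e := range expected {
		if !seen[canonicalHost(e)] {
			return false
		}
	}
	return len(expected) > 0
}

// PropagatedPercent turns per-resolver results into a whole percentage.
func PropagatedPercent(results []bool) int {
	if len(results) == 0 {
		return 0
	}
	ok := 0
	for _, r := range results {
		if r {
			ok++
		}
	}
	return ok * 100 / len(results)
}

// CheckPropagation queries each resolver concurrently; a failed or timed-out
// lookup simply counts as "not propagated" for that resolver.
func CheckPropagation(ctx context.Context, zone string, resolverIPs, expectedNS []string, timeout time.Duration) []bool {
	results := make([]bool, len(resolverIPs))
	var wg sync.WaitGroup
	for i, ip := range resolverIPs {
		wg.Add(1)
		go func(i int, ip string) {
			defer wg.Done()
			qctx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			records, err := resolverFor(ip, timeout).LookupNS(qctx, zone)
			if err != nil {
				return
			}
			hosts := make([]string, 0, len(records))
			for _, r := range records {
				hosts = append(hosts, r.Host)
			}
			results[i] = hasAllNS(hosts, expectedNS)
		}(i, ip)
	}
	wg.Wait()
	return results
}
```

Each goroutine writes only its own slice element, so no mutex is needed. With five resolvers, two positive answers give the 40% shown in the partially-propagated screenshot.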
github-actions[bot]	ec07488226	deploy: update catalyst images to `c9507c8`	2026-05-04 19:29:59 +00:00
e3mrah	c9507c8369	fix(catalyst-api): durable Phase-1 watcher across Pod restart (#830 ) (#833 ) The Phase-1 helmwatch watcher used to lose state on every catalyst-api Pod roll. fromRecord rewrote any "phase1-watching" status to "failed" on the next Pod start — even though Phase 0 had already committed its tofu state, the Sovereign cluster was healthy, the kubeconfig was on the PVC, and the bootstrap-kit HelmReleases kept reconciling regardless of whether catalyst-api's in-memory watcher was alive. Caught live on otech102 (2026-05-04): a transient catalyst-api roll mid-Phase-1 latched the deployment record to status=failed, the auto- fire handover never triggered, and the operator was stranded on the wizard page. Manual workaround was patching the record back to status=ready + minting handover token by hand. Fix: split the in-flight rewrite into two cases: - Phase-0 in-flight (pending/provisioning/tofu-applying/flux- bootstrapping) — STILL rewritten to failed (tofu workdir on /tmp emptyDir died with the Pod, Hetzner resources orphaned). - phase1-watching — preserved across restart so the post-restart resume path picks it up via shouldResumePhase1 + resumePhase1Watch (already wired). The on-disk store record stays consistent with the in-memory state during rehydrate. Helmwatch's existing resume path (jobs_backfill.go) is idempotent — it just observes HelmRelease.status, never patches/applies, so a fresh informer over the same kubeconfig produces the same per-component events the previous Pod was streaming. Also: - Added isPhase0InFlightStatus helper to distinguish the two semantics; isInFlightStatus retained for release-subdomain conflict check (still includes phase1-watching — won't release a slot mid- Phase-1). - Updated TestPodRestart_StuckPhase1WatchingRewrittenToFailed → TestPodRestart_Phase1WatchingPreservedNotRewrittenToFailed (now asserts the new correct behavior). 
- New test TestPodRestart_Phase1WatchingResumesWithKubeconfig proves the gating decision (shouldResumePhase1=true) and the preserved Status value. - New parameterized test TestPodRestart_Phase0InFlightStillRewrittenToFailed proves the Phase-0 carve-out still works for all four Phase-0 statuses. - Updated TestShouldResumePhase1_GatesProperly cases to reflect the new phase1-watching=resumable / Phase-0=non-resumable split. Issue: openova-io/openova#830 (Bug 3) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 23:28:07 +04:00
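The restart split described in this commit (Phase-0 in-flight statuses are still rewritten to failed, phase1-watching is preserved) reduces to a small decision function. The sketch below reuses the helper names and status strings from the message; statusAfterRestart is a hypothetical name for the rewrite step, and the real fromRecord logic may carry more state.

```go
package main

const (
	statusPending           = "pending"
	statusProvisioning      = "provisioning"
	statusTofuApplying      = "tofu-applying"
	statusFluxBootstrapping = "flux-bootstrapping"
	statusPhase1Watching    = "phase1-watching"
	statusFailed            = "failed"
)

// isPhase0InFlightStatus is true for statuses whose work died with the Pod
// (the tofu workdir lived on an emptyDir), so they cannot be resumed.
func isPhase0InFlightStatus(s string) bool {
	switch s {
	case statusPending, statusProvisioning, statusTofuApplying, statusFluxBootstrapping:
		return true
	}
	return false
}

// isInFlightStatus keeps the wider meaning used by the release-subdomain
// conflict check: Phase 0 in flight OR Phase 1 being watched.
func isInFlightStatus(s string) bool {
	return isPhase0InFlightStatus(s) || s == statusPhase1Watching
}

// statusAfterRestart (hypothetical name) is the rewrite applied when a
// record is rehydrated on Pod start.
func statusAfterRestart(s string) string {
	if isPhase0InFlightStatus(s) {
		return statusFailed
	}
	// phase1-watching and every terminal status are left untouched.
	return s
}
```

A record that comes back as phase1-watching would then be handed to the resume path, which the message says is gated by shouldResumePhase1.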
e3mrah	f75f3e79b4	fix(bp-catalyst-platform): add cutover-driver RBAC for catalyst-api (#830) (#831) The /api/v1/sovereign/cutover/start handler was returning 502 status-read-failed because catalyst-api ran under the catalyst-system/default ServiceAccount with no RBAC binding to read/patch the cutover ConfigMaps + create/watch Jobs in the `catalyst` namespace. Add a dedicated ServiceAccount + ClusterRole + ClusterRoleBinding so catalyst-api can drive the cutover state machine. Per feedback_rbac_create_no_resourcenames.md the `create` verbs are split into their own Rule WITHOUT resourceNames; combining create with resourceNames produces a 403 on every POST. Bumps bp-catalyst-platform 1.3.1 → 1.3.2. Issue: openova-io/openova#830 (Bug 1) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 23:26:51 +04:00
github-actions[bot]	1631c0b86c	deploy: update catalyst images to `da3f679`	2026-05-04 18:57:19 +00:00
e3mrah	da3f6797b7	feat(sme-tenant): tenant provisioning pipeline (#804 ) (#824 ) Wire all bp-* charts at vcluster creation time so the SME experience is turnkey from marketplace signup forward. The orchestrator owns a 7-state machine (pending → vcluster_created → bp_charts_installed → dns_provisioned → certs_issued → keycloak_clients_provisioned → tenant_registered → done) persisted in a flat-file store; each step is independently idempotent so a Pod restart never strands a half-provisioned tenant. HTTP surface: - POST /api/v1/sme/tenants — create + start pipeline - GET /api/v1/sme/tenants — list - GET /api/v1/sme/tenants/{id} — read - POST /api/v1/sme/tenants/{id}/reconcile — operator-triggered re-run - DELETE /api/v1/sme/tenants/{id} — inverse pipeline Per Inviolable Principle 3 the orchestrator NEVER calls kubectl apply. Per-tenant overlays are committed to the GitOps repo at clusters/<otech>/sme-tenants/<sme_tenant_id>/ via a Kustomize layout listing every bp-* HelmRelease (bp-keycloak per-organization, bp-cnpg, bp-wordpress-tenant, bp-openclaw, bp-stalwart-tenant) plus the per-host Certificate (BYO mode only — free-subdomain is covered by the otech-wide wildcard). Flux on the OTECH cluster reconciles within ~1 min. Per Inviolable Principle 4 every chart version, image tag, OTECH FQDN, PowerDNS endpoint, and Keycloak SA token is runtime-configurable via env (CATALYST_SME_BP__VER, CATALYST_OTECH_FQDN, CATALYST_OTECH_INGRESS_IPV4, CATALYST_POWERDNS_URL, CATALYST_POWERDNS_API_KEY, CATALYST_SME_KC_SA_TOKEN). Empty chart versions fall back to "" so Flux pulls the latest matching chart. DNS provisioning: - Free-subdomain mode: PowerDNS PATCH writes A records for console/wordpress/openclaw/mail/keycloak.<sub>.<otech>. - BYO mode: net.LookupCNAME resolves console.<byo_domain> and confirms the target ends with the otech FQDN; mismatched CNAMEs surface as terminal errors so the wizard can show "your CNAME doesn't point here yet" without a chat-with-support loop. 
Keycloak SSO clients (catalyst-ui, wordpress, openclaw, stalwart) + group templates (sme-admin, sme-user) are declared in the bp-keycloak HelmRelease's bootstrap values block; the orchestrator verifies them via the SME-vcluster Keycloak admin API and re-runs the step on transient failures. Tenant registry insertion (per #802 SME-7) uses the existing store.TenantRegistry — host → {tenant_id, keycloak_realm_url, keycloak_client_id, tenant_kind=sme} — so the SPA's /api/v1/tenant/discover endpoint resolves the new tenant on first hit without any further orchestration. The user-create hook (POST /api/v1/sme/users) from #802 already fires the ADR-0003 3-step orchestration (Keycloak → NewAPI → K8s Secret); this PR's tenant pipeline lights up the back end #802 needs to scope every per-user call. Tests: - 14 handler-level table tests covering happy path (free-subdomain + BYO), validation errors, gitops transient retry, registry population, deletion, render correctness for both modes, chart version threading, Keycloak client verification, BYO CNAME resolution. - 5 store tests for state-machine persistence. Live test deferred to #805 E2E demo. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:55:06 +04:00
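Two ideas from this commit lend themselves to a short sketch: resuming the tenant state machine from its persisted state, and the BYO check that a CNAME target ends with the otech FQDN. The state names below are copied from the message; RunFrom, Step, and CNAMEPointsAtOtech are hypothetical names, and requiring the match to fall on a label boundary is an added assumption, not something the message states.

```go
package main

import (
	"fmt"
	"strings"
)

// tenantStates lists the pipeline states in the order given in the message.
var tenantStates = []string{
	"pending", "vcluster_created", "bp_charts_installed", "dns_provisioned",
	"certs_issued", "keycloak_clients_provisioned", "tenant_registered", "done",
}

// Step performs the work needed to enter one state; it must be idempotent.
type Step func() error

// RunFrom resumes after the persisted state `current`, running the step for
// each later state in order. It returns the last state reached and the first
// error, so the caller can persist progress and retry from there.
func RunFrom(current string, steps map[string]Step) (string, error) {
	idx := -1
	for i, s := range tenantStates {
		if s == current {
			idx = i
			break
		}
	}
	if idx < 0 {
		return current, fmt.Errorf("unknown state %q", current)
	}
	for _, next := range tenantStates[idx+1:] {
		if step, ok := steps[next]; ok {
			if err := step(); err != nil {
				return current, fmt.Errorf("step %s: %w", next, err)
			}
		}
		current = next
	}
	return current, nil
}

// CNAMEPointsAtOtech reports whether a resolved CNAME target is the otech
// FQDN or a name underneath it (case-insensitive, trailing dot ignored).
func CNAMEPointsAtOtech(target, otechFQDN string) bool {
	t := strings.TrimSuffix(strings.ToLower(target), ".")
	o := strings.TrimSuffix(strings.ToLower(otechFQDN), ".")
	return o != "" && (t == o || strings.HasSuffix(t, "."+o))
}
```

Because steps for states at or before the persisted state are never called, a restart after dns_provisioned would not repeat the vcluster or chart steps.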
github-actions[bot]	b003cd80c6	deploy: update catalyst images to `1d93b6c`	2026-05-04 18:54:14 +00:00
e3mrah	1d93b6c5af	feat(e2e): SME demo Playwright spec — full 6-step happy path (#805 ) (#823 ) Authors the load-bearing investor-demo proof artefact for the SME-tenant turnkey experience epic (#795). The spec walks the FULL happy path against the catalyst-ui SPA and emits 1440×900 screenshots at every assertion so the DoD checklist is satisfied with visual evidence rather than narrative. What landed: - products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear spec covering Step 1 (marketplace signup) → Step 2 (provisioning) → Step 3 (SME admin first login + dashboard) → Step 4 (create alice via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a (alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to unblocking issues. - products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry of every URL, hostname, fixture user, and UUID the spec uses. Per feedback_never_hardcode_urls.md, no test inlines a hostname; every asserted host derives from OTECH_FQDN + SME_SLUG. - products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape- faithful page.route mocks for tenant discovery, /api/v1/whoami, /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment endpoints, app placeholders for WordPress/OpenClaw/webmail, and the /api/v1/sme/billing/ledger surface. Each helper is the seam between mock-mode (today) and live-mode (post-#804) so the spec opts out of any single mock by simply not calling that helper. - .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger that runs the spec against a freshly-installed dev tree with VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the SovereignConsoleLayout's auth gate has a non-null sovereignFQDN. Uploads the 805-* screenshot evidence as a 30-day artefact. 
Run today on a fresh checkout: cd products/catalyst/bootstrap/ui VITE_CATALYST_MODE=sovereign \ VITE_SOVEREIGN_FQDN=acme.otech.example \ npm run dev & PLAYWRIGHT_HOST=http://localhost:5173 \ npx playwright test e2e/sme-demo.spec.ts Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 / #798 / #802-followup). Live-mode follow-up (after #804 lands a fresh otech with the SME tenant pipeline wired): drop the mock installers from beforeEach and flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper calls change. Per docs/INVIOLABLE-PRINCIPLES.md: #1 (waterfall): the canonical 6-step contract from #805 is asserted in this first cut, not staged across cycles. #2 (never compromise): every step that's deferred is fixme'd with a blocker link, never silently skipped. #4 (never hardcode): every URL routes through e2e/lib/config.ts. Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-04 22:52:07 +04:00
github-actions[bot]	0cee06161a	deploy: update sme service images to `5cdb738`	2026-05-04 18:37:08 +00:00
e3mrah	01022e8c52	feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802 ) (#816 ) * feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802) Implements the SME-tier extension to the existing Sovereign Console SPA per [Q-mine-1] of #795: same React bundle serves both otech-admin and SME-admin views, tenant context discovered via window.location.host against a back-end registry — not from path/subdomain string parsing. Backend (catalyst-api / unified-rbac slice): - Tenant registry (store.TenantRegistry) — flat-file host → tenant lookup table backing the public discovery endpoint. Host normalised to lowercase; case-insensitive lookups. - GET /api/v1/tenant/discover (public, no auth gate) — returns {tenant_id, tenant_kind, keycloak_realm_url, keycloak_client_id} on 200, 404 on unknown host, 503 if registry unwired. Admin URLs are NEVER on this wire. - POST /api/v1/sme/users — fires ADR-0003 3-step hook (Keycloak → NewAPI → K8s Secret SSA with field manager `unified-rbac`). Each step idempotent; persisted state machine in store.UserProvisionStore per ADR-0003 §3.4. Returns 202 with steps[] progress array so the SPA can render the 3-step indicator even on partial failure. - GET /api/v1/sme/users / DELETE /api/v1/sme/users/{uuid} — list + inverse rollback per ADR-0003 §3.7. - internal/newapi.Client — minimal NewAPI admin REST client; 201 happy-path + 409 idempotent recovery via GET ?external_id=<uuid> per ADR-0003 §3.2 (NewAPI does NOT rotate api_key on conflict). Frontend (Sovereign Console SPA): - Branded TenantID + TenantKind types (shared/types/tenant.ts) — same pattern as DeploymentID (#749). - shared/lib/tenantDiscover.ts — fire-and-forget discovery in main.tsx; result cached in module state for sidebar nav + OIDC bootstrap. - pages/sme/UsersPage.tsx — user CRUD UI with 3-step KC/NewAPI/Secret progress indicator wired off the API response shape. 
- pages/sme/RolesPage.tsx — canonical Keycloak group → app role map (wordpress / openclaw / stalwart / rbac) per #795 [B]. - pages/sme/sme.api.ts — typed REST client; X-Tenant-Host header carries window.location.host on every call. - Routes mounted at /console/sme/users + /console/sme/roles under the existing SovereignConsoleLayout — same SPA bundle, different route tree per discovered tenant_kind. Tests: 22 new UI tests (4 files), 33 new Go tests (4 files). All green: branded type parsers reject empty/non-string inputs, tenant discovery handles 200/404/503/network-error paths, the 3-step hook runs end-to-end against fake KC/NewAPI/SSA stubs, partial-failure states surface verbatim through the steps[] response field, public discovery endpoint never leaks admin URLs. Per docs/INVIOLABLE-PRINCIPLES.md #4 every URL goes through apiUrl() in shared/config/urls; per #2 wire shapes parse through branded-type parsers at the boundary; per #3 K8s Secret apply uses client-go SSA (field manager `unified-rbac`) — no exec.Command kubectl shell-out. Closes #802. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(unified-rbac): add Playwright E2E for SME-tier UI (#802) Three specs covering: - SME UsersPage: empty state → create form → 3-step progress indicator (KC done / NewAPI done / Secret done) — proves the page is wired to the API response shape. - SME RolesPage: canonical group → app-role table renders the full 7-row mapping locked in #795 [B]. - OTECH tenant: same SPA bundle navigates /console/dashboard for the otech discovery payload — proves [Q-mine-1] of #795 (one bundle, two route trees, host-driven discovery). Backend mocks: route fulfillers stub /tenant/discover, /sme/users, and /whoami so the dev-server harness can drive the SPA without the catalyst-api backend or a live SME vcluster. The full live cross-cluster E2E gates on bp-newapi (#799) seeding the tenant registry at SME-onboarding time, which lands in #804. 
1440 px screenshots captured at e2e/screenshots/802-*.png: - 802-sme-users-empty-1440.png - 802-sme-users-create-form-1440.png - 802-sme-users-after-create-1440.png - 802-sme-roles-1440.png - 802-otech-dashboard-same-bundle-1440.png Run: VITE_CATALYST_MODE=sovereign VITE_SOVEREIGN_FQDN=acme.otech.example npm run dev npx playwright test e2e/sme-tier-rbac.spec.ts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:34:11 +04:00
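The host-driven discovery described in this commit can be sketched as follows. This is only an illustration in TypeScript of the lookup and status-code behaviour the message describes (lowercased hosts, 200 / 404 / 503); the real registry is the Go `store.TenantRegistry`, and the class shape, method names, and `discover` helper here are assumptions.

```typescript
// Response fields are the ones named in the commit message.
type TenantRecord = {
  tenant_id: string;
  tenant_kind: string;
  keycloak_realm_url: string;
  keycloak_client_id: string;
};

type DiscoverResult =
  | { status: 200; body: TenantRecord }
  | { status: 404 }
  | { status: 503 };

// Hypothetical in-memory stand-in for the flat-file host -> tenant table.
class TenantRegistry {
  private readonly byHost = new Map<string, TenantRecord>();

  // Hosts are normalised to lowercase on write...
  register(host: string, record: TenantRecord): void {
    this.byHost.set(host.trim().toLowerCase(), record);
  }

  // ...and on read, so lookups are case-insensitive.
  lookup(host: string): TenantRecord | undefined {
    return this.byHost.get(host.trim().toLowerCase());
  }
}

// Assumption: a null registry models the "registry unwired" case.
function discover(registry: TenantRegistry | null, host: string): DiscoverResult {
  if (registry === null) return { status: 503 };
  const record = registry.lookup(host);
  return record === undefined ? { status: 404 } : { status: 200, body: record };
}
```

Only the four public fields appear in the record type, which mirrors the commit's note that admin URLs never travel on this endpoint.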
github-actions[bot]	e30a5c34c0	deploy: update catalyst images to `e85035c`	2026-05-04 18:09:28 +00:00
e3mrah	e85035cf9b	wip(console-ui): sovereignty preview stub + e2e spec scaffold (#793) (#809) Partial work from prior session. Adds: - SovereigntyPreviewPage.tsx (stub) - e2e/sovereignty.spec.ts (472 lines) - router + dashboard wiring Full implementation (button, progress card, SSE) to follow. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>	2026-05-04 22:06:34 +04:00
github-actions[bot]	43e88d5f35	deploy: update catalyst images to `f716fdd`	2026-05-04 17:37:47 +00:00
e3mrah	0382864143	feat(catalyst-api): self-sovereignty cutover endpoints (#792) (#806) Adds three operator-admin-gated endpoints for orchestrating the post-handover Self-Sovereignty Cutover (parent epic #790): POST /api/v1/sovereign/cutover/start GET /api/v1/sovereign/cutover/status GET /api/v1/sovereign/cutover/events (SSE) The cutover engine consumes the PodSpec ConfigMaps that bp-self-sovereign-cutover (issue #791, sister chart) installs in the catalyst namespace, sequences them by `bp.openova.io/cutover-order`, creates a fresh batchv1.Job per `mode=job` step (8 steps: gitea-mirror, harbor-projects, harbor-prewarm, registry-pivot, flux-gitrepository-patch, helmrepository-patches, catalyst-api-env-patch, egress-block-test), waits for `mode=daemonset-wait` steps to reach `numberReady == desiredNumberScheduled`, and patches the `self-sovereign-cutover-status` ConfigMap with per-step timestamps plus an overall progress counter on every state transition. Endpoints are idempotent — when the status ConfigMap reports `cutoverComplete=true` POST /start returns 200 with the durable snapshot and does NOT re-run. A failed step latches the engine on the failed step (no auto-continue); operator inspects the failure on /status and re-runs once the chart values are corrected, at which point already-successful steps are skipped on resume. Constraints honoured: * IaC-first — every cluster mutation goes through the in-cluster kubernetes.Interface (Create Job / Patch ConfigMap / Get DaemonSet / List ConfigMaps). Zero bespoke cloud-API calls. * Event-driven — Job completion uses the apiserver Watch verb, not periodic GET polling. * Credential hygiene — the handler reads no secrets directly; the chart's PodSpecs reference secrets via envFrom secretRef so each Job's credentials are mounted fresh. * Runtime configurable — namespace, status ConfigMap name, per-step timeouts all read from env per principle #4. 
Tests: 14 new unit tests in cutover_test.go covering parse/list/ordering, end-to-end success run with a fake clientset, idempotency, fail-halt semantics, no-steps-found, status JSON shape, and SSE replay-on-connect. Refs: #790, #791 Closes: #792 Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 21:30:57 +04:00
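The ordering and resume behaviour described in this commit (sort steps by `bp.openova.io/cutover-order`, skip steps that already succeeded) can be sketched roughly as below. The actual engine is Go code in catalyst-api; the `CutoverStep` shape, the `planRun` name, and the choice to treat the order key as a string-valued metadata entry are assumptions made for illustration.

```typescript
// Key quoted from the commit message; whether it is a label or an annotation
// on the ConfigMap is not stated there, so "metadata" is a neutral stand-in.
const ORDER_KEY = 'bp.openova.io/cutover-order';

// Hypothetical shape for one step ConfigMap.
type CutoverStep = { name: string; metadata: Record<string, string> };

// Steps with a missing or non-numeric order sort last (assumption).
function orderOf(step: CutoverStep): number {
  const parsed = Number.parseInt(step.metadata[ORDER_KEY] ?? '', 10);
  return Number.isNaN(parsed) ? Number.MAX_SAFE_INTEGER : parsed;
}

// Returns the step names still to run, in order, skipping ones that
// already succeeded on a previous attempt.
function planRun(steps: CutoverStep[], succeeded: ReadonlySet<string>): string[] {
  return [...steps]
    .sort((a, b) => orderOf(a) - orderOf(b))
    .filter((step) => !succeeded.has(step.name))
    .map((step) => step.name);
}
```

Because the plan is recomputed from the recorded successes, a re-run after a latched failure naturally resumes at the failed step.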
github-actions[bot]	10d0201a81	deploy: update catalyst images to `ccfe1d4`	2026-05-04 16:42:38 +00:00
e3mrah	ccfe1d42e8	fix(provision-page): re-fetch deployment state on SSE close before showing failure (closes #782 ) (#789 ) The provision page (AppsPage via useDeploymentEvents) treated any SSE close without a terminal `event: done` as a "Provisioning failed" event, hard-coding the message: > Deployment ended with status=phase1-watching But `phase1-watching` is an in-flight phase, not a terminal outcome. The founder repeatedly saw this banner on otech93/otech94 (2026-05-04) while the canonical /deployments/{id} record showed status=ready and handoverFiredAt populated — the SSE was simply dropped by the reverse proxy mid-stream. This change replaces the SSE-close failure path with a single re-fetch of /deployments/{id} that switches on the canonical status: • ready → success banner with handoverURL (existing #764 path) • failed → real error from snapshot.error, never the stale "Deployment ended with status=<phase>" copy • in-flight statuses → keep the streaming spinner up and reconnect SSE with exponential backoff (max 5 attempts) Also surfaces handoverURL recovered from the canonical poll so a backgrounded tab that lost the SSE during the handover-mint window still renders the "Open your Sovereign console →" affordance. Tests added cover all three branches plus the hard regression that "Deployment ended with status=phase1-watching" can never appear in streamError under any SSE-close path. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 20:40:32 +04:00
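The three-way branch this commit describes for an SSE close can be sketched as a pure decision function. The branch outcomes and the 5-attempt cap come from the message; the type names, the 1-second base delay, and the `give-up` outcome after the cap are assumptions, since the message does not say what happens once attempts run out.

```typescript
// Hypothetical slice of the /deployments/{id} snapshot.
type Snapshot = { status: string; error: string | undefined; handoverURL: string | undefined };

type CloseAction =
  | { kind: 'success'; handoverURL: string | undefined }
  | { kind: 'failure'; message: string }
  | { kind: 'reconnect'; delayMs: number }
  | { kind: 'give-up' };

const MAX_ATTEMPTS = 5;     // from the commit message
const BASE_DELAY_MS = 1000; // assumption: base of the exponential backoff

// `attempt` is the number of reconnects already made (0 on the first close).
function onSseClose(snapshot: Snapshot, attempt: number): CloseAction {
  if (snapshot.status === 'ready') {
    return { kind: 'success', handoverURL: snapshot.handoverURL };
  }
  if (snapshot.status === 'failed') {
    // Surface the real error, never a "status=<phase>" placeholder.
    return { kind: 'failure', message: snapshot.error ?? 'deployment failed' };
  }
  // Any other status is treated as in-flight: keep streaming and retry.
  if (attempt >= MAX_ATTEMPTS) return { kind: 'give-up' };
  return { kind: 'reconnect', delayMs: BASE_DELAY_MS * 2 ** attempt };
}
```

Keeping the decision free of timers and network calls makes each branch straightforward to unit-test, which matches the commit's note that all three branches are covered.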
github-actions[bot]	ecaef7c17f	deploy: update catalyst images to `2e981f3`	2026-05-04 16:36:27 +00:00
e3mrah	2e981f36a5	fix(bp-keycloak): catalyst-kc-sa-credentials addr → in-cluster Service URL (closes #781 ) (#788 ) Sovereign-side catalyst-api Pod's intra-cluster Keycloak calls (token mint, EnsureUser) were failing with `dial tcp: lookup auth.<sov-fqdn> on 10.43.0.10:53: no such host`. The Sovereign's CoreDNS resolves *.<sov-fqdn> via upstream resolvers — it does NOT forward to the in-cluster PowerDNS that holds those records. Public DNS works (PowerDNS authoritative), but Pod-side lookups of auth.<sov-fqdn> return NXDOMAIN. Live evidence — otech94 2026-05-04: handover URL returned `{"error":"keycloak error: ensure user"}` from a DNS lookup failure inside the catalyst-api Pod. Fix: bp-keycloak chart now writes the in-cluster Service URL (http://<release>.<namespace>.svc.cluster.local) into the catalyst-kc-sa-credentials Secret's `addr` key instead of the public gateway host (https://auth.<sov-fqdn>). This Secret is consumed EXCLUSIVELY by the in-cluster catalyst-api Pod via reflector mirror into catalyst-system; it is NEVER exposed to browsers. The HTTPRoute hostname (.Values.gateway.host) stays at auth.<sov-fqdn> for operator browsers — only the Pod's intra-cluster OAuth client_credentials calls switch to the Service URL. Catalyst-Zero (contabo) is unaffected: it runs `keycloak-zero` (separate chart in openova-private), not bp-keycloak. Changes: - platform/keycloak/chart/templates/configmap-sovereign-realm.yaml: Secret's $kcAddr unconditionally uses http://<release>.<namespace>.svc.cluster.local - platform/keycloak/chart/Chart.yaml: 1.3.1 → 1.3.2 - clusters/_template/bootstrap-kit/09-keycloak.yaml: chart version 1.3.1 → 1.3.2 - products/catalyst/chart/Chart.yaml: 1.3.0 → 1.3.1 (changelog entry only) - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: 1.3.0 → 1.3.1 Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 20:34:22 +04:00
github-actions[bot]	eb9c935ab5	deploy: update catalyst images to `fc2c198`	2026-05-04 15:53:08 +00:00
e3mrah	fc2c198c90	feat(handover): auto-fire on Phase1 Ready + UI redirect (#778 ) When the Phase-1 helmwatch terminates with OutcomeReady, catalyst-api now mints the handover JWT immediately, persists handoverFiredAt + handoverURL on the deployment record, and emits a typed SSE event `event: handover-ready, data: { handoverURL, expiresAt }` so the wizard's provision page can render the "Open your Sovereign console →" CTA + auto-redirect after 5s. Until this landed, the operator was stranded on the apps grid in terminal-completed state — the manual mint endpoint existed but no UI surface ever invoked it. Server (issue #768): - provisioner.Result gains HandoverFiredAt + HandoverURL. - phase1_watch.go: markPhase1Done's Ready transition calls a new fireHandover helper which mints via h.handoverSigner (RS256 5min TTL) and emits onto the durable buffer + live SSE channel. - StreamLogs renders Phase=="handover-ready" events as the typed SSE shape so a browser using addEventListener('handover-ready') receives the JSON payload directly. Idempotent under double- fire (informer reattach scenarios). No-op when handoverSigner is nil — the existing manual-mint path on the AdminPage button remains the fallback. - Lifted HandoverURL + HandoverFiredAt to /deployments/{id} top level so a GET-replay also drives the redirect when the SSE event was missed. UI (issue #764): - useDeploymentEvents subscribes via EventSource.addEventListener ('handover-ready', …) and surfaces the payload as a new `handoverReady` return value. Same value populated from the /events GET-replay snapshot's handoverURL field for the SSE-missed case. - AppsPage renders a prominent green "Sovereign is ready" banner above the apps grid with an "Open your Sovereign console →" anchor link, fires a global success toast with the same CTA, starts a 5s redirect timer (window.location.href = handoverURL), and flips the document title to "✓ Sovereign ready — <fqdn>" so backgrounded tabs surface completion. 
Tests: - Backend: 6 tests covering auto-fire on Ready, no-fire on failure, idempotency, no-signer no-op, typed-SSE-shape, and /deployments/{id} field lifting. - Frontend: 4 tests covering banner render, FQDN inclusion, 5s auto-redirect, and document.title flip. Closes #764, #768. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:50:09 +04:00
e3mrah	53bc4357ca	feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767 ) (#776 ) * feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB): 1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate" Section with: bootstrap-kit baseline (sum of mandatory-tier component footprints), selected components delta, control-plane overhead, and a "Recommended N x <SKU>" line that turns amber when the operator's chosen worker count is below the rollup. Backed by per-component RAM/CPU floors in components/wizard/steps/componentFootprints.ts (covered by 12 unit tests including the otech92 reproduction). 2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart 9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired from the canonical flux-system/cloud-credentials.hcloud-token Secret cloud-init writes (mirrors the velero/harbor object-storage pattern). Pinned to the control-plane node so the autoscaler never schedules onto a worker it could itself terminate. 10-minute scale-down idle as the cost-saving default. Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA / KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over KEDA for cluster scaling, and the bounds + safety story. Per the issue's MVP scope, this PR ships the blueprint + StepReview estimate WITHOUT the wizard StepProvider min/max pair refactor or the tofu node-pool template restructuring. Those are tracked as a follow-up issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected-bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776 because the file existed without a matching entry in the expected DAG, AND collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort + slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to the expected-bootstrap-deps.yaml so the audit passes. `scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:49:44 +04:00
e3mrah	905319cc14	feat(catalyst): one-click kubeconfig download + merge for k9s parity (closes #765 ) (#775 ) The catalyst-api GET /kubeconfig endpoint now rewrites k3s's hardcoded `default` cluster / context / user names to the Sovereign's subdomain (e.g. `otech94`) before serving the YAML, so the operator can run `k9s --context=otech94` immediately after a single `kubectl config view --flatten` merge — no more manual sed pipeline between every Phase-1 Ready and the next k9s session. Backend (catalyst-api): - New helpers `rewriteKubeconfigContext`, `preferredContextName`, and `kubeconfigDownloadFilename` in internal/handler/kubeconfig.go. - Rewriter uses yaml.v3 Node round-trip so cert-authority-data + token bytes are preserved verbatim. Idempotent — re-applying to an already renamed file is a no-op. Refuses non-kubeconfig YAML so a hand-edited file is never silently corrupted. - Context name resolution: SovereignSubdomain → first FQDN label → literal "sovereign" fallback. Sanitised to RFC-1123 lowercase label charset. - Content-Disposition filename is now `<subdomain>.yaml` (matches operator mental model + makes the merge command shell-friendly). UI (catalyst wizard StepSuccess): - New "Step 1 / Step 2" cluster-access surface on the success step: download button (unchanged endpoint) plus a copy-to-clipboard merge one-liner (`KUBECONFIG=$HOME/.kube/config:$HOME/Downloads/<file> kubectl config view --flatten > config.tmp && mv config.tmp config && chmod 600 && k9s --context=<name>`). - Atomic temp-file move instead of a direct redirect to ~/.kube/config so a Ctrl-C mid-pipe never corrupts the operator's existing config. - Helpers `sovereignContextName` + `buildKubeconfigMergeCommand` exported so the test file (and a future Operator-Tools page on the Sovereign console) can re-use them with no logic drift. 
Tests: - 6 new Go tests covering the rewriter (idempotence, k3s default, mixed-name file, empty target rejection, malformed YAML rejection, non-kubeconfig rejection) + GET-handler integration test that exercises the subdomain → context-name path on a real fixture. - 3 new vitest tests covering the merge-command UI block + 5 new helper-pure tests for `sovereignContextName` / `buildKubeconfigMergeCommand`. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:48:31 +04:00
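The context-name resolution order from this commit (subdomain, then first FQDN label, then the literal `sovereign`) can be sketched like this. The function names and the exact sanitising rules are assumptions; the commit only states that the result is restricted to the RFC-1123 lowercase label charset.

```typescript
// Assumption: invalid characters become '-', the label is capped at 63
// characters, and leading/trailing '-' are trimmed.
function sanitizeLabel(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, '-')
    .slice(0, 63)
    .replace(/^-+|-+$/g, '');
}

// Hypothetical helper mirroring the described fallback chain.
function contextNameFor(subdomain: string | undefined, fqdn: string | undefined): string {
  const firstLabel = (fqdn ?? '').split('.')[0] ?? '';
  for (const candidate of [subdomain ?? '', firstLabel]) {
    const label = sanitizeLabel(candidate);
    if (label !== '') return label;
  }
  return 'sovereign'; // literal fallback named in the commit message
}
```

For example, `contextNameFor('otech94', undefined)` yields `otech94`, the name the operator would then pass to `k9s --context=`.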
github-actions[bot]	116233be51	deploy: update catalyst images to `c4e2c10`	2026-05-04 15:43:52 +00:00
e3mrah	c4e2c10587	fix(wizard): drop redundant 'locked to your sign-in' email microcopy (closes #762 ) (#774 ) PR #759 enforces `req.OrgEmail == session.email` in the catalyst-api on POST /v1/deployments, which means the operator IS the Sovereign owner by definition. Asking again in the wizard, locking the field, and explaining the lock with `Admin contact email · locked to your sign-in` was redundant chrome that made StepDomain feel like a sign-up form for the second time. Changes: - StepDomain: remove the AdminEmailField sub-component entirely (the "locked to your sign-in" microcopy + Lock icon + read-only input + isValidAdminEmail validator + the orgEmail clause in computeNextDisabled). Drop now-unused useSession + Lock + useEffect imports. - StepReview: stamp `orgEmail` from `session.email` at submit time (with the wizard store as a fallback for the brief window between PIN-verify and the next session refetch). Rename the review-page row from "Admin email" to "Sovereign owner" to mirror the new UI vocabulary; the row now reads `session.email` so the operator sees exactly which identity the Sovereign will be owned by. - StepDomain.test: keep the fresh-QueryClient-per-test wrapper but drop the seedSessionEmail plumbing (no longer needed). Add three regression tests confirming the field, the microcopy, and the orgEmail-gate on Continue are all gone. - WizardLayout / WizardPage / StepOrg / StepReview: update doc comments that referenced the now-removed admin-email field. Per docs/INVIOLABLE-PRINCIPLES.md #1 (never trust the client) the load-bearing fix is still on the server (PR #759). This PR removes the redundant client-side defense + the noisy chrome that explained it. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:40:43 +04:00
e3mrah	6a6b502008	fix(decommission): live exec-log view (unified) — was 'stuck' banner (closes #766 ) (#773 ) The `/sovereign/decommission/<id>` page used to render a static "Decommissioning…" button label with no progress signal — operators thought the page was stuck while `tofu destroy` and the Hetzner orphan purge were running for 30+ minutes. The wipe handler in `api/internal/handler/wipe.go` ALREADY emits a per-resource SSE event stream on the same `dep.eventsCh` channel that provisioning uses (surfaced at `GET /api/v1/deployments/{id}/logs`). Every "tofu destroy" tick, every Hetzner DELETE response, every S3 bucket purge step, every PDM release call, every local-state cleanup is already a discrete event with `phase="wipe"`. The UI just wasn't subscribing. Fix is purely UI: • DecommissionPage subscribes to the same SSE via `useDeploymentEvents` once the wipe POST is in flight (`disableStream: false`), flattens every recorded event into `LogLine`, and feeds the unified `LogPane` (the same component `/provision/<id>` JobDetail uses for per-job logs). • Streaming layout replaces the form once submit fires: STREAMING chip, scrolling exec-log, full-screen toggle, search filter — all threaded through the existing LogPane primitives. • On wipe completion: COMPLETE chip + green checkmark + verbatim Hetzner-sweep summary block ("servers: 0 removed, load_balancers: 0 removed, …" — the founder DoD is "0 of every kind on the Hetzner side") + 10s countdown back to /wizard. Operator can scroll back through every deletion at any time. • No backend change — the SSE plumbing is already there. Tests: 7/7 pass (5 original + 2 new for #766). Per #1 (waterfall — target shape on first commit) the streaming view ships with full scrollback, search, full-screen, summary, and countdown in one PR. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:37:27 +04:00
github-actions[bot]	a29238d217	deploy: update catalyst images to `fa58cc3`	2026-05-04 13:46:18 +00:00
e3mrah	fa58cc32b5	fix(catalyst-api): validate orgEmail matches session.email + tighten list cross-tenant policy (closes #748 ) (#759 ) Server-side enforcement is the load-bearing fix per docs/INVIOLABLE-PRINCIPLES.md #1 (never trust the client). Until this lands a signed-in operator could POST a deployment whose req.OrgEmail belonged to some other identity — the catalyst- api accepted the body verbatim and stamped the wrong identity onto the Sovereign-admin / Catalyst-Organization owner. Server changes (deployments.go): - CreateDeployment now reads claims from context (auth.RequireSession populates) with X-User-Email as the off-prod fallback. When a session is present, req.OrgEmail MUST EqualFold session.email — mismatch returns 403. OwnerEmail is stamped from the session-derived value, not request body — a future client-side bug cannot poison the durable owner field. - ListDeployments (issue #747) tightened: when a session is present AND a ?owner= query param is also supplied AND ?owner != session.email, return 200 + empty list rather than silently collapsing to session-only rows. Mirrors the issue #689 404-not-403 rule on /deployments/{id} — the response shape MUST NOT differentiate "exists but not yours" from "doesn't exist". Now also reads ClaimsFromContext as the canonical session source (X-User-Email fallback). Tests: - 4 new tests in deployments_test.go (all pass): - TestCreateDeployment_RejectsMismatchedOrgEmail (403 + no PDM Reserve + no row stored) - TestCreateDeployment_AcceptsMatchingOrgEmail (case-insensitive match, OwnerEmail derived from session not request) - TestListDeployments_FiltersByOwnerSession (cross-tenant row hidden) - TestListDeployments_OwnerQueryParam (cross-tenant ?owner returns empty list, never 403) - deployments_list_test.go: existing TestListDeployments_FilterBySessionEmail rewritten to match the tightened cross-tenant policy (empty list, not silent override). 
New TestListDeployments_CrossTenantOwnerQueryReturnsEmpty added to assert the explicit boundary. UI changes: - ui/src/pages/wizard/steps/StepDomain.tsx — defense-in-depth UX: AdminEmailField pre-fills orgEmail from useSession() and renders read-only with a Lock icon and tooltip "Sovereigns are owned by the email you signed in with." A useEffect mirrors session.email into the wizard store so a stale value from a previous sign-in cannot survive into the current session. - ui/src/pages/wizard/steps/StepDomain.test.tsx — wraps every render in a fresh QueryClientProvider (AdminEmailField now consumes useSession via TanStack Query). All 15 existing UI tests pass. Co-authored-by: hatiyildiz <hatice@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:43:58 +04:00
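The two server-side rules in this commit can be sketched as small pure functions. The real checks live in Go (`deployments.go`) and use `EqualFold`; the lowercase comparison below is only an approximation of that, and the function and type names are hypothetical. Only the session-present case is modelled.

```typescript
type OwnerCheck = { ok: true; ownerEmail: string } | { ok: false; status: 403 };

// Approximation of a case-insensitive match (Go's EqualFold uses Unicode
// simple folding, which toLowerCase does not reproduce exactly).
function emailsMatch(a: string, b: string): boolean {
  return a.toLowerCase() === b.toLowerCase();
}

// Create path: a mismatch is rejected with 403, and the stored owner is
// taken from the session, never from the request body.
function resolveOwner(sessionEmail: string, requestOrgEmail: string): OwnerCheck {
  if (!emailsMatch(sessionEmail, requestOrgEmail)) return { ok: false, status: 403 };
  return { ok: true, ownerEmail: sessionEmail };
}

// List path: a foreign ?owner= yields an empty list, so the response does
// not distinguish "exists but not yours" from "doesn't exist".
function visibleRows<T extends { ownerEmail: string }>(
  rows: T[],
  sessionEmail: string,
  ownerQuery: string | undefined,
): T[] {
  if (ownerQuery !== undefined && !emailsMatch(ownerQuery, sessionEmail)) return [];
  return rows.filter((row) => emailsMatch(row.ownerEmail, sessionEmail));
}
```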
github-actions[bot]	407f37944b	deploy: update catalyst images to `35569e2`	2026-05-04 13:40:49 +00:00
e3mrah	35569e2344	fix(types): DeploymentID branded type — kill 15-char truncation forever (closes #749 , #754 ) (#760 ) The "deployment ID truncated by one char" bug recurred multiple times because every UI code path treated the id as a free-form `string`. Any new error template, toast, or URL builder could (and did) introduce another truncation. This change makes the truncation impossible at compile time: - Adds `shared/types/deployment.ts` with a branded `DeploymentID` type (`string & { readonly __brand: 'DeploymentID' }`) plus `parseDeploymentID()` / `isDeploymentID()` validators. The regex enforces the canonical 16 lowercase hex chars catalyst-api emits. - Updates `entities/deployment/model.ts` to type `WizardState.deploymentId` as `DeploymentID \| null`. Re-exports the brand from the model so existing imports keep working. - Updates `entities/deployment/store.ts` to route `setDeploymentId()` and the persistence `merge()` path through `parseDeploymentID()`. A bad id in localStorage gets wiped rather than rendered as a misleading "<truncated>-is-unknown-to-backend" error. - Updates `pages/sovereign/AppsPage.tsx` to validate the route param at the page boundary via `isDeploymentID()`, and emits a dedicated malformed-id notification when the URL value isn't 16 lowercase hex chars (so the operator sees the FULL invalid value, not a hidden off-by-one). - Adds 25 unit tests covering the parser (valid/invalid lengths, uppercase, non-string types, error-message hygiene) plus the `isDeploymentID` type guard. - Adds an integration test (`ProvisionPage.sse-url.test.tsx`) that mounts the page with a 16-char hex route param, installs a recording EventSource shim, and asserts the constructed URL is exactly `${API_BASE}/v1/deployments/<FULL_16_CHAR_ID>/logs` — including the exact `eeb34ecd1414a505` id from issue #749's live evidence. - Updates `StepSuccess.test.tsx` fixture to a real 16-char hex id so the wizard store accepts it through the new typed setter. 
Audit findings — search across the entire UI src for `slice(0, 15..19)`, `substring(0, 15..19)`, and `[a-f0-9]{15}` patterns turned up NO direct truncation site in production code. The root cause of the 2026-05-04 incident was that every consumer trusted a raw `string` route param without validation, so a URL with a manually-truncated id fed straight into both the SSE URL builder and the error message verbatim. The branded-type contract is now the structural fix: any future code that tries to assign an unvalidated string to a `DeploymentID` field fails compilation, and any URL with the wrong shape surfaces a clear malformed-id banner instead of "deployment <wrong> is unknown". Closes #749, #754. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:38:27 +04:00
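A minimal sketch of the branded-type pattern this commit describes. The brand shape and the "16 lowercase hex chars" rule are quoted from the message; the regex spelling, the throwing behaviour of `parseDeploymentID`, and the `logsUrl` helper are assumptions (the real parser may return `null` or a result object instead of throwing).

```typescript
// Brand shape as quoted in the commit message.
type DeploymentID = string & { readonly __brand: 'DeploymentID' };

// Assumption: the canonical-id rule expressed as a regex.
const DEPLOYMENT_ID_RE = /^[0-9a-f]{16}$/;

// Type guard: narrows arbitrary input to DeploymentID.
function isDeploymentID(value: unknown): value is DeploymentID {
  return typeof value === 'string' && DEPLOYMENT_ID_RE.test(value);
}

// Assumption: throws on bad input and echoes the full invalid value.
function parseDeploymentID(value: unknown): DeploymentID {
  if (!isDeploymentID(value)) {
    throw new Error(`malformed deployment id: ${String(value)}`);
  }
  return value;
}

// Hypothetical URL builder: a plain string is not assignable to `id`,
// so callers have to validate first.
function logsUrl(apiBase: string, id: DeploymentID): string {
  return `${apiBase}/v1/deployments/${id}/logs`;
}
```

With this in place, `logsUrl(base, routeParam)` fails to compile for a raw `string`, which is the structural guarantee the commit relies on.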
github-actions[bot]	b1915a9e14	deploy: update catalyst images to `8e57abe`	2026-05-04 13:32:38 +00:00
e3mrah	8e57abe9d0	fix(wizard): auto-redirect signed-in user to in-flight /sovereign/provision/<id> (closes #747 ) (#758 ) A signed-in operator who refreshed /sovereign/wizard during a 15-minute provisioning run lost the progress page and landed on Step 1 of an empty form (caught live with otech90 on 2026-05-04). Wires the wizard route to call the new GET /api/v1/deployments?owner=<email> endpoint and redirect to /sovereign/provision/<id> when an in-flight deployment is found. Backend - Add ListDeployments handler returning the slim shape (id, status, sovereignFQDN, region, startedAt, finishedAt, ownerEmail, adoptedAt, error). Filtered server-side by the X-User-Email header injected by RequireSession; ?owner= is a client hint that is silently overridden when the session header is set so a signed-in attacker cannot list someone else's rows. Adopted deployments are excluded — once the customer's Sovereign owns the cluster, the wizard redirect must not pull the operator back to Catalyst-Zero. - Register GET /api/v1/deployments inside the RequireSession group. - 5 new handler tests covering session-override, adopted exclusion, legacy-row exclusion, no-session passthrough, and ?owner= filtering. Frontend - New useInflightDeployment hook (TanStack Query, 30s stale time) returning {inflight, completed, all} buckets. inflight matches pending/provisioning/tofu-applying/tofu-plan/tofu-apply/ flux-bootstrapping/cloud-init-waiting/phase1-watching plus ready-but-not-adopted. Picks the most-recent by startedAt. - WizardPage redirect effect: when session.signedIn && inflight, navigate replace=true to /provision/<id> and render null while the redirect resolves. When the operator has only completed/wiped/failed rows, render a banner with a "View your previous deployments" link. - New DeploymentsList page at /deployments (browser path /sovereign/deployments behind the Traefik strip-prefix). Single table: FQDN, status, started, finished, region. 
Each FQDN links back to /provision/<id>. - 6 hook unit tests covering most-recent picking, ready-not-adopted, adopted exclusion (defense-in-depth), 401 graceful degrade, and enabled=false short-circuit. Tests - 5 backend handler tests pass (TestListDeployments_*) - 6 frontend hook tests pass (useInflightDeployment.test.tsx) - TS typecheck + Vite build clean - Pre-existing TestAuthHandover_HappyPath panic + StepComponents catalog-data failures verified unrelated (fail on bare main) Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:30:36 +04:00
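The in-flight selection rule from this commit can be sketched as below. The status list and the "ready but not adopted" and "most recent by startedAt" rules come from the message; the row shape, the function names, and the assumption that `startedAt` is an ISO-8601 string are illustrative.

```typescript
// Hypothetical slice of the slim list-response row.
type DeploymentRow = {
  id: string;
  status: string;
  startedAt: string; // assumption: ISO-8601 timestamp
  adoptedAt?: string | null;
};

// In-flight statuses as listed in the commit message.
const IN_FLIGHT = new Set([
  'pending', 'provisioning', 'tofu-applying', 'tofu-plan', 'tofu-apply',
  'flux-bootstrapping', 'cloud-init-waiting', 'phase1-watching',
]);

function isInflight(row: DeploymentRow): boolean {
  if (row.adoptedAt) return false; // adopted rows never trigger the redirect
  return IN_FLIGHT.has(row.status) || row.status === 'ready';
}

// Picks the most recently started in-flight row, or null if there is none.
function pickInflight(rows: DeploymentRow[]): DeploymentRow | null {
  const candidates = rows
    .filter(isInflight)
    .sort((a, b) => Date.parse(b.startedAt) - Date.parse(a.startedAt));
  return candidates[0] ?? null;
}
```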
github-actions[bot]	5bb7d45647	deploy: update catalyst images to `5decebf`	2026-05-04 13:17:56 +00:00
e3mrah	5decebf801	fix(provision): drop bespoke 'Operator' widget, use ProfileMenu top-right (closes #750) (#757) The /sovereign/provision/<id> page rendered a bespoke "Operator / Provisioning session" card in the bottom-left of its Sidebar. Two problems: 1. Identity placement was inconsistent with the rest of the app (wizard, Sovereign-console, marketplace all place identity top-right). The provisioning surface was the lone outlier. 2. The label "Operator" was hard-coded and never reflected the signed-in user's email — it ignored useSession() entirely. This drops the bespoke card from Sidebar.tsx and renders the canonical <ProfileMenu /> (the same widget WizardLayout uses) in PortalShell's top-right slot. ProfileMenu reads useSession() so anonymous visitors get a [Sign in] button and signed-in operators get an email-initial avatar that opens a "Signed in as <email>" + "Sign out" dropdown. Because PortalShell wraps every /sovereign/provision/* route (apps, jobs, dashboard, cloud, users, settings), this fix touches all of them in one place. Test updates: - Sidebar.test.tsx now asserts the bespoke widget is GONE rather than asserting it renders, locking in the regression guard. No backend / API surface changes. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>	2026-05-04 17:15:46 +04:00
github-actions[bot]	c69e4987da	deploy: update catalyst images to `05065b6`	2026-05-04 13:13:50 +00:00
github-actions[bot]	4b659ced17	deploy: update catalyst images to `e855ab0`	2026-05-04 13:09:40 +00:00
github-actions[bot]	87ffe512c5	deploy: update catalyst images to `ceeefd7`	2026-05-04 12:03:20 +00:00
github-actions[bot]	fea00720f7	deploy: update catalyst images to `468c3ba`	2026-05-04 11:53:06 +00:00
github-actions[bot]	9ee3b2e911	deploy: update catalyst images to `b02fc37`	2026-05-04 11:37:57 +00:00
e3mrah	b02fc3788a	fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744 ) * fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z). After PR #742 fixed the empty SKU strings in tfvars, the next blocker appeared: writeTfvars was emitting `"regions": null` (Go nil slice marshals to JSON null) when the request had no per-region overrides. OpenTofu's variables.tf carries a validation block: validation { condition = alltrue([ for r in var.regions : contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider) ]) } The `for r in var.regions` iteration fails on null with: Error: Iteration over null value on variables.tf line 217, in variable "regions": The variables.tf default `[]` is what the validator expects; emit that shape explicitly via a coalesceRegions(req.Regions) helper that turns nil into an empty slice. Operator overrides round-trip unchanged. Tests: - TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions serialises as JSON `[]`, never `null`, when the request has no per-region overrides. Builds on PR #742. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving) Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the cpx21 CP default from PR #741 fell apart at apply time — Error: Server Type "cpx21" is unavailable in "fsn1" and can no longer be ordered Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog (`/v1/server_types`) but are NOT in the per-DC orderable list (`available_for_migration` on `/v1/datacenters`) for any EU DC (fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on for new Sovereigns in those regions. 
Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
• cpx11 (2 vCPU / 2 GB) — too small for the CP working set
• cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
• cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
• cpx42, cpx52, cpx62 — bigger and more expensive
New default per Sovereign:
| Component | Old | New | Savings |
|-----------------|-----------------|------------------|---------|
| Control plane | CPX32 (€16.49) | CPX22 (€9.49) | €7.00 |
| Worker × 2 | CPX32 × 2 (€33) | CPX32 × 2 (€33) | €0 |
| TOTAL | €49.47/mo | €42.47/mo | 14% |
The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo) assumed those SKUs were orderable. They aren't in EU DCs. The 14% saving from cpx22 CP is the largest concrete optimisation that ships TODAY without compromising the multi-node horizontal-scale agreement (issue #733): still 1 CP + 2 workers from day one.
Files changed:
- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size default cpx31 → cpx32 (back to the prior orderable choice)
- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49). Mark both as "listed but NOT orderable in EU DCs" so the wizard surfaces the constraint instead of letting operators pick a non-orderable SKU. Move recommended:true from CPX21 → CPX22. defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').
- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.
- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].
Builds on PR #741 (issue #740 chain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 15:35:55 +04:00
github-actions[bot]	20c839efc4	deploy: update catalyst images to `8989ce7`	2026-05-04 11:29:07 +00:00
e3mrah	8989ce7659	fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request (#743) Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z). After PR #742 fixed the empty SKU strings in tfvars, the next blocker appeared: writeTfvars was emitting `"regions": null` (Go nil slice marshals to JSON null) when the request had no per-region overrides. OpenTofu's variables.tf carries a validation block: validation { condition = alltrue([ for r in var.regions : contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider) ]) } The `for r in var.regions` iteration fails on null with: Error: Iteration over null value on variables.tf line 217, in variable "regions": The variables.tf default `[]` is what the validator expects; emit that shape explicitly via a coalesceRegions(req.Regions) helper that turns nil into an empty slice. Operator overrides round-trip unchanged. Tests: - TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions serialises as JSON `[]`, never `null`, when the request has no per-region overrides. Builds on PR #742. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 15:26:58 +04:00
github-actions[bot]	10d1af8c91	deploy: update catalyst images to `7ef5af7`	2026-05-04 11:11:10 +00:00

615 Commits