Commit Graph

615 Commits

Author SHA1 Message Date
e3mrah
b52fc45c37
fix(bp-catalyst-platform): cutover-driver RBAC dual-mode render (#830) (#839)
Chart 1.3.2 shipped serviceaccount-cutover-driver.yaml +
clusterrole-cutover-driver.yaml + clusterrolebinding-cutover-driver.yaml
with `{{ .Release.Namespace }}` directives that rendered fine via Helm
on Sovereigns but BROKE the Kustomize-mode contabo-mkt deploy: the
directives made Kustomize parse the files as invalid YAML and silently
skip them. Worse, the new files were never added to the resources list
in templates/kustomization.yaml.

Result on contabo: catalyst-api Pod's spec.serviceAccountName references
a non-existent SA — the Pod fails ContainerCreating with the same RBAC
forbidden error #830 was meant to fix.

Fix:
  - Strip `{{ .Release.Namespace }}` directives from the SA + ClusterRole
    files. metadata.namespace auto-fills from Helm's --namespace flag
    and from Kustomize's `namespace:` directive.
  - For ClusterRoleBinding: Helm does NOT auto-inject
    subjects[0].namespace the way it does metadata.namespace, so the
    apiserver rejects bindings without it. Split into two files:
      * clusterrolebinding-cutover-driver.yaml — Helm-only, uses
        {{ .Release.Namespace }} (correctly resolves to catalyst-system
        on Sovereigns).
      * clusterrolebinding-cutover-driver-kustomize.yaml — Kustomize-
        only, omits subjects[0].namespace and relies on Kustomize's
        native injection (resolves to `catalyst` on contabo).
    The .helmignore excludes the Kustomize-only file from Sovereign
    chart packaging; templates/kustomization.yaml's resources list
    references the Kustomize-only file, NOT the Helm-only one.
  - Add the new RBAC files to templates/kustomization.yaml's resources
    list so contabo's Flux Kustomization actually renders them.

Verified live with `helm template` (subjects[0].namespace=catalyst-system)
and `kubectl kustomize` (subjects[0].namespace=catalyst).

Bumps bp-catalyst-platform 1.3.2 → 1.3.3.

Issue: openova-io/openova#830 (Bug 1 follow-up)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:54:03 +04:00
github-actions[bot]
fb9c9b72d9 deploy: update catalyst images to 772d159 2026-05-04 19:50:19 +00:00
e3mrah
772d159691
feat(sme-tenant): multi-domain Sovereign support — parent-domain dropdown + free-subdomain-under-any-pool-domain (#828) (#836)
Extends the SME tenant provisioning pipeline (#804) for the multi-domain
Sovereign (epic #825). The SME tenant create form now lets the operator
pick which sme-pool parent zone hosts the tenant; the orchestrator
writes DNS records under the chosen parent (not a hardcoded primary).

Backend (Go):
- store.SMETenantProvisionRecord.ParentDomain — captured at create
- handler.SMETenantParentDomain + SMETenantDeps.ParentDomains — pool wiring
- POST /api/v1/sme/tenants accepts parent_domain; defaults to the first
  NS-flip-ready sme-pool entry; rejects unknown parents (400) and
  not-yet-flipped parents (503 + Retry-After)
- DNS provisioner ProvisionFreeSubdomain takes a parentZone parameter;
  ValidateBYOCNAME accepts a multi-target candidate list (any parent)
- Pipeline: writes A records under the chosen parent zone; realm URL,
  console host, and gitops template hostnames all derive from
  ParentDomain (data-driven; never hardcoded)
- New GET /api/v1/sovereign/parent-domains?role= read-only endpoint
  with env stub (CATALYST_SME_POOL_DOMAINS) that integrates cleanly
  with MD-1 (#826) when its data model lands

UI (React + TanStack Router + Vitest + Playwright):
- New /console/sme/tenants/new — CreateTenantPage with domain-mode
  radio, parent-domain <select> populated from the new endpoint,
  per-option NS-flip-ready disabled state, live console URL preview,
  CNAME validation hint for BYO mode, post-submit progress timeline
- 7 Vitest unit tests + 2 Playwright E2E specs (free-subdomain + BYO),
  5 1440px screenshots emitted under e2e/screenshots/828-*.png

Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent-domain pool is fully
data-driven; the UI consumes the same wire shape MD-1 will surface.
Per #2 (never compromise on quality) the page paints partial state on
hook failure with per-step badges from the response.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:48:10 +04:00
github-actions[bot]
090e1f6a34 deploy: update catalyst images to e96741a 2026-05-04 19:44:11 +00:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- With default values (`zones` empty) the hook Job is not rendered,
  which preserves the legacy behavior.

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
github-actions[bot]
92e712a8a6 deploy: update catalyst images to 0bf7b3b 2026-05-04 19:38:24 +00:00
e3mrah
0bf7b3b16d
feat(provisioner): parentDomains[] data model + per-domain abstraction (#826) (#835)
Sub-1 of epic #825 (Multi-domain Sovereign). Backend-only per the
SCOPE CORRECTION on issue #826: the wizard stays single-FQDN, multi-
domain capability is a Day-2 admin-console action (#829, already
merged with an in-memory stub waiting on this PR's persistence
layer).

What this PR adds:

  - provisioner.ParentDomain struct (Name, Role, RegistrarKind,
    RegistrarCredsRef, AddedAt) with role constants
    ParentDomainRolePrimary | ParentDomainRoleSMEPool. Wire shape
    matches the handler-layer ParentDomain in
    handler/parent_domains.go (#829), so the handler's swap from
    in-memory store → Deployment.parentDomains[] is a one-line
    change in a follow-up PR.
  - Request.ParentDomains []ParentDomain field. Backward-compatible:
    when the slice is empty, Validate() synthesises a single primary
    entry from SovereignPoolDomain (or SovereignFQDN) so legacy
    single-FQDN payloads + on-disk records read cleanly. The next
    Save() round-trips the array form — transparent migration with
    no one-shot script.
  - validateParentDomains: enforces "exactly one primary", role enum,
    FQDN regex (RFC 1035, mirrors wizard isValidDomain), duplicate-
    name dedupe, lowercase normalisation in place.
  - ProvisionParentDomain / ProvisionParentDomains: the per-domain
    abstraction the issue's DoD calls out as "reusable function ready
    for #829". Day-2 add-domain calls this with the same step list
    (registrar-flip → powerdns-zone-create → cert-manager-cert) the
    Day-1 path uses; idempotent, stops on first error, emits per-step
    SSE events for the admin panel.
  - Request.PrimaryParentDomain() / SMEPoolParentDomains() lookup
    helpers so the catalyst-api handler + SME signup wizard read the
    primary / sme-pool subset without re-iterating at every call site.
  - writeTfvars emits parent_domains as a JSON array (never null) so
    a future OpenTofu module's `for pd in var.parent_domains`
    validator accepts the input — same nil-trap fix the regions slice
    already carries.
  - store.RedactedRequest + ToProvisionerRequest round-trip the slice
    verbatim. Fields are non-secret (RegistrarCredsRef points at a
    SealedSecret name; plaintext registrar credentials never live on
    the deployment record).
  - store.crdStore mirrors the slice into the ProvisioningState CRD
    spec so admin tooling reading via the K8s API sees the live pool.

What this PR does NOT touch (explicit scope):

  - products/catalyst/bootstrap/ui/src/pages/wizard/** — wizard UI
    stays single-FQDN per the issue's SCOPE CORRECTION.
  - products/catalyst/bootstrap/api/internal/handler/parent_domains.go
    — the #829-merged Day-2 admin handler keeps its in-memory store;
    a one-line follow-up PR swaps to Deployment.parentDomains[].

Inviolable Principle #4: defaultRegistrarKindFromEnv reads
CATALYST_DEFAULT_REGISTRAR_KIND so operators on registrars other
than Dynadot override the synthesis path without code changes. No
TLD or count is hardcoded.

Tests:

  - 14 new unit tests across two new files (parent_domains_test.go in
    provisioner + store packages). Cover: synthesis from
    SovereignFQDN + SovereignPoolDomain, "exactly one primary"
    invariant (rejects 2 + 0), unknown role, empty role, malformed
    FQDN, duplicate names, uppercase normalisation, lookup helpers,
    step-runner ordering + first-error halt, slice-flavour
    multi-domain iteration, JSON round-trip through Redact + Save +
    LoadAll, empty-slice omitempty, legacy on-disk record loads
    cleanly + migration synthesises primary on Validate.
  - Pre-existing Harbor-token + AuthHandover-signer-nil failures
    persist on origin/main; this PR introduces no new failures.

Closes #826.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:36:28 +04:00
github-actions[bot]
4cacbc2c17 deploy: update catalyst images to 620d8b6 2026-05-04 19:33:09 +00:00
e3mrah
620d8b6c13
feat(admin-console): add-domain flow + DNS propagation status panel (#829) (#834)
* feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802)

Implements the SME-tier extension to the existing Sovereign Console SPA
per [Q-mine-1] of #795: same React bundle serves both otech-admin and
SME-admin views, tenant context discovered via window.location.host
against a back-end registry — not from path/subdomain string parsing.

Backend (catalyst-api / unified-rbac slice):
- Tenant registry (store.TenantRegistry) — flat-file host → tenant
  lookup table backing the public discovery endpoint. Host normalised
  to lowercase; case-insensitive lookups.
- GET /api/v1/tenant/discover (public, no auth gate) — returns
  {tenant_id, tenant_kind, keycloak_realm_url, keycloak_client_id} on
  200, 404 on unknown host, 503 if registry unwired. Admin URLs are
  NEVER on this wire.
- POST /api/v1/sme/users — fires ADR-0003 3-step hook (Keycloak →
  NewAPI → K8s Secret SSA with field manager `unified-rbac`). Each
  step idempotent; persisted state machine in store.UserProvisionStore
  per ADR-0003 §3.4. Returns 202 with steps[] progress array so the
  SPA can render the 3-step indicator even on partial failure.
- GET /api/v1/sme/users / DELETE /api/v1/sme/users/{uuid} — list +
  inverse rollback per ADR-0003 §3.7.
- internal/newapi.Client — minimal NewAPI admin REST client; 201
  happy-path + 409 idempotent recovery via GET ?external_id=<uuid>
  per ADR-0003 §3.2 (NewAPI does NOT rotate api_key on conflict).

Frontend (Sovereign Console SPA):
- Branded TenantID + TenantKind types (shared/types/tenant.ts) — same
  pattern as DeploymentID (#749).
- shared/lib/tenantDiscover.ts — fire-and-forget discovery in main.tsx;
  result cached in module state for sidebar nav + OIDC bootstrap.
- pages/sme/UsersPage.tsx — user CRUD UI with 3-step KC/NewAPI/Secret
  progress indicator wired off the API response shape.
- pages/sme/RolesPage.tsx — canonical Keycloak group → app role map
  (wordpress / openclaw / stalwart / rbac) per #795 [B].
- pages/sme/sme.api.ts — typed REST client; X-Tenant-Host header
  carries window.location.host on every call.
- Routes mounted at /console/sme/users + /console/sme/roles under the
  existing SovereignConsoleLayout — same SPA bundle, different route
  tree per discovered tenant_kind.

Tests: 22 new UI tests (4 files), 33 new Go tests (4 files). All
green: branded type parsers reject empty/non-string inputs, tenant
discovery handles 200/404/503/network-error paths, the 3-step hook
runs end-to-end against fake KC/NewAPI/SSA stubs, partial-failure
states surface verbatim through the steps[] response field, public
discovery endpoint never leaks admin URLs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 every URL goes through apiUrl()
in shared/config/urls; per #2 wire shapes parse through branded-type
parsers at the boundary; per #3 K8s Secret apply uses client-go SSA
(field manager `unified-rbac`) — no exec.Command kubectl shell-out.

Closes #802.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(unified-rbac): add Playwright E2E for SME-tier UI (#802)

Three specs covering:
- SME UsersPage: empty state → create form → 3-step progress
  indicator (KC done / NewAPI done / Secret done) — proves the
  page is wired to the API response shape.
- SME RolesPage: canonical group → app-role table renders the
  full 7-row mapping locked in #795 [B].
- OTECH tenant: same SPA bundle navigates /console/dashboard for
  the otech discovery payload — proves [Q-mine-1] of #795
  (one bundle, two route trees, host-driven discovery).

Backend mocks: route fulfillers stub /tenant/discover, /sme/users,
and /whoami so the dev-server harness can drive the SPA without
the catalyst-api backend or a live SME vcluster. The full live
cross-cluster E2E gates on bp-newapi (#799) seeding the tenant
registry at SME-onboarding time, which lands in #804.

1440 px screenshots captured at e2e/screenshots/802-*.png:
- 802-sme-users-empty-1440.png
- 802-sme-users-create-form-1440.png
- 802-sme-users-after-create-1440.png
- 802-sme-roles-1440.png
- 802-otech-dashboard-same-bundle-1440.png

Run: VITE_CATALYST_MODE=sovereign VITE_SOVEREIGN_FQDN=acme.otech.example
     npm run dev
     npx playwright test e2e/sme-tier-rbac.spec.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(admin-console): add-domain flow + DNS propagation status panel (#829)

Multi-domain Sovereign — operator-admin "Add another parent domain"
surface in the Sovereign Console + live DNS propagation status panel.
Closes the MD-4 sub-ticket of epic #825.

Backend (catalyst-api/internal/handler/parent_domains.go):
- GET    /api/v1/sovereign/parent-domains             — list pool
- POST   /api/v1/sovereign/parent-domains             — add domain
- DELETE /api/v1/sovereign/parent-domains/{name}      — remove
- GET    /api/v1/sovereign/parent-domains/{name}/propagation
                                                      — fan-out to 5+
                                                        public DNS resolvers

The Add pipeline calls PDM /set-ns (sister #826), creates the PowerDNS
zone (sister #827, env-gated stub until that PR lands), and issues a
wildcard cert via cert-manager (also sister #827, env-gated stub). All
three steps update the same store row so the UI can render per-step
progress.

DNS propagation panel uses Go's net.Resolver with a custom Dial that
routes lookups through a SPECIFIC resolver IP (8.8.8.8, 1.1.1.1,
9.9.9.9, 208.67.222.222, 4.2.2.1) rather than the system resolver.
Per inviolable principle #4, the resolver list, expected NS records,
and per-query timeout are all env-overridable.

Frontend (ui/src/pages/admin/parent-domains/):
- ParentDomainsPage.tsx — list view + Add Domain modal + per-row
  inline drawer with PropagationPanel
- PropagationPanel.tsx — polls /propagation every 60s, renders
  green/yellow/red pills per resolver + rolling % propagated number
- parentDomains.api.ts — typed REST client wrappers, no inline /api/

Routing:
- /console/parent-domains registered under SovereignConsoleLayout
- Added to Settings sub-nav for operator-admin reachability

Tests:
- 6 vitest cases (empty state, populated rows, modal open, drawer
  toggle, primary lock, propagation panel mount)
- 13 Go cases covering list/add/delete/validation/propagation wire
  shape against a stub PDM
- 3 Playwright E2E + 1440x900 screenshots:
  e2e/screenshots/829-1-just-flipped.png       (0% propagated)
  e2e/screenshots/829-2-partially-propagated.png (40%)
  e2e/screenshots/829-3-fully-propagated.png   (100%)

Per inviolable principle #10 (credential hygiene) the registrarToken
field is forwarded byte-for-byte to PDM and never enters a logged
struct; the modal input uses type="password".

Refs: #825 (parent epic), #826 (sister MD-1), #827 (sister MD-2)

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:31:03 +04:00
github-actions[bot]
ec07488226 deploy: update catalyst images to c9507c8 2026-05-04 19:29:59 +00:00
e3mrah
c9507c8369
fix(catalyst-api): durable Phase-1 watcher across Pod restart (#830) (#833)
The Phase-1 helmwatch watcher used to lose state on every catalyst-api
Pod roll. fromRecord rewrote any "phase1-watching" status to "failed"
on the next Pod start — even though Phase 0 had already committed its
tofu state, the Sovereign cluster was healthy, the kubeconfig was on
the PVC, and the bootstrap-kit HelmReleases kept reconciling regardless
of whether catalyst-api's in-memory watcher was alive.

Caught live on otech102 (2026-05-04): a transient catalyst-api roll
mid-Phase-1 latched the deployment record to status=failed, the auto-
fire handover never triggered, and the operator was stranded on the
wizard page. The manual workaround was to patch the record back to
status=ready and mint a handover token by hand.

Fix: split the in-flight rewrite into two cases:
  - Phase-0 in-flight (pending/provisioning/tofu-applying/
    flux-bootstrapping): STILL rewritten to failed (the tofu workdir on
    the /tmp emptyDir died with the Pod, leaving Hetzner resources
    orphaned).
  - phase1-watching — preserved across restart so the post-restart
    resume path picks it up via shouldResumePhase1 + resumePhase1Watch
    (already wired). The on-disk store record stays consistent with
    the in-memory state during rehydrate.
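
The two-case split reads roughly as follows in code. This is a sketch
of the decision only, not the fromRecord implementation:
statusAfterRestart is a hypothetical name, while
isPhase0InFlightStatus and the status strings are taken from this
message.

```go
package main

import "fmt"

// isPhase0InFlightStatus reports whether a status belongs to Phase 0,
// whose working state does not survive a Pod restart.
func isPhase0InFlightStatus(s string) bool {
	switch s {
	case "pending", "provisioning", "tofu-applying", "flux-bootstrapping":
		return true
	}
	return false
}

// statusAfterRestart returns the status a stored record should carry
// once it is rehydrated by a fresh Pod.
func statusAfterRestart(s string) string {
	if isPhase0InFlightStatus(s) {
		return "failed" // the tofu workdir is gone, Phase 0 cannot continue
	}
	return s // phase1-watching and terminal statuses are kept as-is
}

func main() {
	for _, s := range []string{"tofu-applying", "phase1-watching", "ready"} {
		fmt.Printf("%s -> %s\n", s, statusAfterRestart(s))
	}
}
```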

Helmwatch's existing resume path (jobs_backfill.go) is idempotent —
it just observes HelmRelease.status, never patches/applies, so a fresh
informer over the same kubeconfig produces the same per-component
events the previous Pod was streaming.

Also:
  - Added isPhase0InFlightStatus helper to distinguish the two
    semantics; isInFlightStatus retained for release-subdomain conflict
    check (still includes phase1-watching — won't release a slot mid-
    Phase-1).
  - Updated TestPodRestart_StuckPhase1WatchingRewrittenToFailed →
    TestPodRestart_Phase1WatchingPreservedNotRewrittenToFailed (now
    asserts the new correct behavior).
  - New test TestPodRestart_Phase1WatchingResumesWithKubeconfig proves
    the gating decision (shouldResumePhase1=true) and the preserved
    Status value.
  - New parameterized test
    TestPodRestart_Phase0InFlightStillRewrittenToFailed proves the
    Phase-0 carve-out still works for all four Phase-0 statuses.
  - Updated TestShouldResumePhase1_GatesProperly cases to reflect the
    new phase1-watching=resumable / Phase-0=non-resumable split.

Issue: openova-io/openova#830 (Bug 3)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:28:07 +04:00
e3mrah
f75f3e79b4
fix(bp-catalyst-platform): add cutover-driver RBAC for catalyst-api (#830) (#831)
The /api/v1/sovereign/cutover/start handler was returning 502
status-read-failed because catalyst-api ran under the catalyst-system/
default ServiceAccount with no RBAC binding to read/patch the cutover
ConfigMaps + create/watch Jobs in the `catalyst` namespace.

Add a dedicated ServiceAccount + ClusterRole + ClusterRoleBinding so
catalyst-api can drive the cutover state machine. Per
feedback_rbac_create_no_resourcenames.md the `create` verbs are split
into their own Rule WITHOUT resourceNames; combining create with
resourceNames produces a 403 on every POST.

Bumps bp-catalyst-platform 1.3.1 → 1.3.2.

Issue: openova-io/openova#830 (Bug 1)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:26:51 +04:00
github-actions[bot]
1631c0b86c deploy: update catalyst images to da3f679 2026-05-04 18:57:19 +00:00
e3mrah
da3f6797b7
feat(sme-tenant): tenant provisioning pipeline (#804) (#824)
Wire all bp-* charts at vcluster creation time so the SME experience
is turnkey from marketplace signup forward. The orchestrator owns a
7-transition state machine (pending → vcluster_created → bp_charts_installed
→ dns_provisioned → certs_issued → keycloak_clients_provisioned
→ tenant_registered → done) persisted in a flat-file store; each
step is independently idempotent so a Pod restart never strands a
half-provisioned tenant.

HTTP surface:
- POST   /api/v1/sme/tenants            — create + start pipeline
- GET    /api/v1/sme/tenants            — list
- GET    /api/v1/sme/tenants/{id}       — read
- POST   /api/v1/sme/tenants/{id}/reconcile — operator-triggered re-run
- DELETE /api/v1/sme/tenants/{id}       — inverse pipeline

Per Inviolable Principle 3 the orchestrator NEVER calls kubectl apply.
Per-tenant overlays are committed to the GitOps repo at
clusters/<otech>/sme-tenants/<sme_tenant_id>/ via a Kustomize layout
listing every bp-* HelmRelease (bp-keycloak per-organization, bp-cnpg,
bp-wordpress-tenant, bp-openclaw, bp-stalwart-tenant) plus the per-host
Certificate (BYO mode only — free-subdomain is covered by the otech-wide
wildcard). Flux on the OTECH cluster reconciles within ~1 min.

Per Inviolable Principle 4 every chart version, image tag, OTECH FQDN,
PowerDNS endpoint, and Keycloak SA token is runtime-configurable via
env (CATALYST_SME_BP_*_VER, CATALYST_OTECH_FQDN,
CATALYST_OTECH_INGRESS_IPV4, CATALYST_POWERDNS_URL,
CATALYST_POWERDNS_API_KEY, CATALYST_SME_KC_SA_TOKEN). Empty chart
versions fall back to "*" so Flux pulls the latest matching chart.

DNS provisioning:
- Free-subdomain mode: PowerDNS PATCH writes A records for
  console/wordpress/openclaw/mail/keycloak.<sub>.<otech>.
- BYO mode: net.LookupCNAME resolves console.<byo_domain> and
  confirms the target ends with the otech FQDN; mismatched CNAMEs
  surface as terminal errors so the wizard can show "your CNAME
  doesn't point here yet" without a chat-with-support loop.
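
The BYO-mode check can be illustrated with a small helper.
cnamePointsAt is a hypothetical name and the exact comparison the
orchestrator performs is not shown in this message; the sketch assumes
a label-boundary suffix match, and strips the trailing dot that
net.LookupCNAME returns on canonical names.

```go
package main

import (
	"fmt"
	"strings"
)

// cnamePointsAt reports whether target is otechFQDN itself or a name
// beneath it. Both inputs may carry a trailing dot.
func cnamePointsAt(target, otechFQDN string) bool {
	t := strings.ToLower(strings.TrimSuffix(target, "."))
	o := strings.ToLower(strings.TrimSuffix(otechFQDN, "."))
	if t == "" || o == "" {
		return false
	}
	// Require a dot before the suffix so look-alike names do not match.
	return t == o || strings.HasSuffix(t, "."+o)
}

func main() {
	fmt.Println(cnamePointsAt("console.acme.otech.example.", "otech.example")) // true
	fmt.Println(cnamePointsAt("evilotech.example.", "otech.example"))          // false
}
```

Matching on "." + FQDN rather than a bare suffix is a deliberate
choice in this sketch: a plain "ends with" test would accept
evilotech.example for otech.example.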

Keycloak SSO clients (catalyst-ui, wordpress, openclaw, stalwart) +
group templates (sme-admin, sme-user) are declared in the
bp-keycloak HelmRelease's bootstrap values block; the orchestrator
verifies them via the SME-vcluster Keycloak admin API and re-runs
the step on transient failures.

Tenant registry insertion (per #802 SME-7) uses the existing
store.TenantRegistry — host → {tenant_id, keycloak_realm_url,
keycloak_client_id, tenant_kind=sme} — so the SPA's
/api/v1/tenant/discover endpoint resolves the new tenant on first
hit without any further orchestration.

The user-create hook (POST /api/v1/sme/users) from #802 already
fires the ADR-0003 3-step orchestration (Keycloak → NewAPI → K8s
Secret); this PR's tenant pipeline lights up the back end #802
needs to scope every per-user call.

Tests:
- 14 handler-level table tests covering happy path (free-subdomain
  + BYO), validation errors, gitops transient retry, registry
  population, deletion, render correctness for both modes, chart
  version threading, Keycloak client verification, BYO CNAME
  resolution.
- 5 store tests for state-machine persistence.

Live test deferred to #805 E2E demo.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:55:06 +04:00
github-actions[bot]
b003cd80c6 deploy: update catalyst images to 1d93b6c 2026-05-04 18:54:14 +00:00
e3mrah
1d93b6c5af
feat(e2e): SME demo Playwright spec — full 6-step happy path (#805) (#823)
Authors the load-bearing investor-demo proof artefact for the
SME-tenant turnkey experience epic (#795). The spec walks the FULL
happy path against the catalyst-ui SPA and emits 1440×900 screenshots
at every assertion so the DoD checklist is satisfied with visual
evidence rather than narrative.

What landed:

- products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear
  spec covering Step 1 (marketplace signup) → Step 2 (provisioning) →
  Step 3 (SME admin first login + dashboard) → Step 4 (create alice
  via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a
  (alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to
  unblocking issues.

- products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry
  of every URL, hostname, fixture user, and UUID the spec uses. Per
  feedback_never_hardcode_urls.md, no test inlines a hostname; every
  asserted host derives from OTECH_FQDN + SME_SLUG.

- products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape-
  faithful page.route mocks for tenant discovery, /api/v1/whoami,
  /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment
  endpoints, app placeholders for WordPress/OpenClaw/webmail, and the
  /api/v1/sme/billing/ledger surface. Each helper is the seam between
  mock-mode (today) and live-mode (post-#804) so the spec opts out of
  any single mock by simply not calling that helper.

- .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger
  that runs the spec against a freshly-installed dev tree with
  VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the
  SovereignConsoleLayout's auth gate has a non-null sovereignFQDN.
  Uploads the 805-* screenshot evidence as a 30-day artefact.

Run today on a fresh checkout:

    cd products/catalyst/bootstrap/ui
    VITE_CATALYST_MODE=sovereign \
      VITE_SOVEREIGN_FQDN=acme.otech.example \
      npm run dev &
    PLAYWRIGHT_HOST=http://localhost:5173 \
      npx playwright test e2e/sme-demo.spec.ts

Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 /
#798 / #802-followup).

Live-mode follow-up (after #804 lands a fresh otech with the SME
tenant pipeline wired): drop the mock installers from beforeEach and
flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper
calls change.

Per docs/INVIOLABLE-PRINCIPLES.md:
  #1 (waterfall): the canonical 6-step contract from #805 is asserted
     in this first cut, not staged across cycles.
  #2 (never compromise): every step that's deferred is fixme'd with a
     blocker link, never silently skipped.
  #4 (never hardcode): every URL routes through e2e/lib/config.ts.

Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 22:52:07 +04:00
github-actions[bot]
0cee06161a deploy: update sme service images to 5cdb738 2026-05-04 18:37:08 +00:00
e3mrah
01022e8c52
feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802) (#816)
* feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802)

Implements the SME-tier extension to the existing Sovereign Console SPA
per [Q-mine-1] of #795: same React bundle serves both otech-admin and
SME-admin views, tenant context discovered via window.location.host
against a back-end registry — not from path/subdomain string parsing.

Backend (catalyst-api / unified-rbac slice):
- Tenant registry (store.TenantRegistry) — flat-file host → tenant
  lookup table backing the public discovery endpoint. Host normalised
  to lowercase; case-insensitive lookups.
- GET /api/v1/tenant/discover (public, no auth gate) — returns
  {tenant_id, tenant_kind, keycloak_realm_url, keycloak_client_id} on
  200, 404 on unknown host, 503 if registry unwired. Admin URLs are
  NEVER on this wire.
- POST /api/v1/sme/users — fires ADR-0003 3-step hook (Keycloak →
  NewAPI → K8s Secret SSA with field manager `unified-rbac`). Each
  step idempotent; persisted state machine in store.UserProvisionStore
  per ADR-0003 §3.4. Returns 202 with steps[] progress array so the
  SPA can render the 3-step indicator even on partial failure.
- GET /api/v1/sme/users / DELETE /api/v1/sme/users/{uuid} — list +
  inverse rollback per ADR-0003 §3.7.
- internal/newapi.Client — minimal NewAPI admin REST client; 201
  happy-path + 409 idempotent recovery via GET ?external_id=<uuid>
  per ADR-0003 §3.2 (NewAPI does NOT rotate api_key on conflict).

Frontend (Sovereign Console SPA):
- Branded TenantID + TenantKind types (shared/types/tenant.ts) — same
  pattern as DeploymentID (#749).
- shared/lib/tenantDiscover.ts — fire-and-forget discovery in main.tsx;
  result cached in module state for sidebar nav + OIDC bootstrap.
- pages/sme/UsersPage.tsx — user CRUD UI with 3-step KC/NewAPI/Secret
  progress indicator wired off the API response shape.
- pages/sme/RolesPage.tsx — canonical Keycloak group → app role map
  (wordpress / openclaw / stalwart / rbac) per #795 [B].
- pages/sme/sme.api.ts — typed REST client; X-Tenant-Host header
  carries window.location.host on every call.
- Routes mounted at /console/sme/users + /console/sme/roles under the
  existing SovereignConsoleLayout — same SPA bundle, different route
  tree per discovered tenant_kind.

Tests: 22 new UI tests (4 files), 33 new Go tests (4 files). All
green: branded type parsers reject empty/non-string inputs, tenant
discovery handles 200/404/503/network-error paths, the 3-step hook
runs end-to-end against fake KC/NewAPI/SSA stubs, partial-failure
states surface verbatim through the steps[] response field, public
discovery endpoint never leaks admin URLs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 every URL goes through apiUrl()
in shared/config/urls; per #2 wire shapes parse through branded-type
parsers at the boundary; per #3 K8s Secret apply uses client-go SSA
(field manager `unified-rbac`) — no exec.Command kubectl shell-out.

Closes #802.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(unified-rbac): add Playwright E2E for SME-tier UI (#802)

Three specs covering:
- SME UsersPage: empty state → create form → 3-step progress
  indicator (KC done / NewAPI done / Secret done) — proves the
  page is wired to the API response shape.
- SME RolesPage: canonical group → app-role table renders the
  full 7-row mapping locked in #795 [B].
- OTECH tenant: same SPA bundle navigates /console/dashboard for
  the otech discovery payload — proves [Q-mine-1] of #795
  (one bundle, two route trees, host-driven discovery).

Backend mocks: route fulfillers stub /tenant/discover, /sme/users,
and /whoami so the dev-server harness can drive the SPA without
the catalyst-api backend or a live SME vcluster. The full live
cross-cluster E2E gates on bp-newapi (#799) seeding the tenant
registry at SME-onboarding time, which lands in #804.

1440 px screenshots captured at e2e/screenshots/802-*.png:
- 802-sme-users-empty-1440.png
- 802-sme-users-create-form-1440.png
- 802-sme-users-after-create-1440.png
- 802-sme-roles-1440.png
- 802-otech-dashboard-same-bundle-1440.png

Run: VITE_CATALYST_MODE=sovereign VITE_SOVEREIGN_FQDN=acme.otech.example
     npm run dev
     npx playwright test e2e/sme-tier-rbac.spec.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:34:11 +04:00
github-actions[bot]
e30a5c34c0 deploy: update catalyst images to e85035c 2026-05-04 18:09:28 +00:00
e3mrah
e85035cf9b
wip(console-ui): sovereignty preview stub + e2e spec scaffold (#793) (#809)
Partial work from prior session. Adds:
- SovereigntyPreviewPage.tsx (stub)
- e2e/sovereignty.spec.ts (472 lines)
- router + dashboard wiring

Full implementation (button, progress card, SSE) to follow.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-04 22:06:34 +04:00
github-actions[bot]
43e88d5f35 deploy: update catalyst images to f716fdd 2026-05-04 17:37:47 +00:00
e3mrah
0382864143
feat(catalyst-api): self-sovereignty cutover endpoints (#792) (#806)
Adds three operator-admin-gated endpoints for orchestrating the
post-handover Self-Sovereignty Cutover (parent epic #790):

  POST /api/v1/sovereign/cutover/start
  GET  /api/v1/sovereign/cutover/status
  GET  /api/v1/sovereign/cutover/events  (SSE)

The cutover engine consumes the PodSpec ConfigMaps that
bp-self-sovereign-cutover (issue #791, sister chart) installs in
the catalyst namespace, sequences them by `bp.openova.io/cutover-order`,
creates a fresh batchv1.Job per `mode=job` step (8 steps:
gitea-mirror, harbor-projects, harbor-prewarm, registry-pivot,
flux-gitrepository-patch, helmrepository-patches, catalyst-api-env-patch,
egress-block-test), waits for `mode=daemonset-wait` steps to reach
`numberReady == desiredNumberScheduled`, and patches the
`self-sovereign-cutover-status` ConfigMap with per-step timestamps
plus an overall progress counter on every state transition.

Endpoints are idempotent: when the status ConfigMap reports
`cutoverComplete=true`, POST /start returns 200 with the durable
snapshot and does NOT re-run. A failed step latches the engine on
that step (no auto-continue); the operator inspects the failure on
/status and re-runs once the chart values are corrected, at which
point already-successful steps are skipped on resume.
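
A minimal sketch of the order-then-skip behaviour described above, not the
engine's actual code. The `Step` struct, `PlanResume`, and the order values
are hypothetical; the real engine reads the order from the
`bp.openova.io/cutover-order` annotation and the per-step state from the
status ConfigMap.

```go
package main

import (
	"fmt"
	"sort"
)

// Step is a hypothetical, simplified view of one cutover step.
type Step struct {
	Name  string
	Order int  // assumed to be parsed from bp.openova.io/cutover-order
	Done  bool // assumed to mirror a recorded success in the status ConfigMap
}

// PlanResume returns the names of the steps that still have to run,
// sorted by cutover order. Steps already marked Done are skipped.
func PlanResume(steps []Step) []string {
	sorted := append([]Step(nil), steps...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Order < sorted[j].Order
	})
	pending := []string{}
	for _, s := range sorted {
		if !s.Done {
			pending = append(pending, s.Name)
		}
	}
	return pending
}

func main() {
	steps := []Step{
		{Name: "harbor-projects", Order: 2},
		{Name: "gitea-mirror", Order: 1, Done: true},
		{Name: "harbor-prewarm", Order: 3},
	}
	fmt.Println(PlanResume(steps)) // [harbor-projects harbor-prewarm]
}
```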

Constraints honoured:
  * IaC-first — every cluster mutation goes through the in-cluster
    kubernetes.Interface (Create Job / Patch ConfigMap / Get DaemonSet
    / List ConfigMaps).  Zero bespoke cloud-API calls.
  * Event-driven — Job completion uses the apiserver Watch verb,
    not periodic GET polling.
  * Credential hygiene — the handler reads no secrets directly;
    the chart's PodSpecs reference secrets via envFrom secretRef
    so each Job's credentials are mounted fresh.
  * Runtime configurable — namespace, status ConfigMap name, per-
    step timeouts all read from env per principle #4.

Tests: 14 new unit tests in cutover_test.go covering parse/list/
ordering, end-to-end success run with a fake clientset, idempotency,
fail-halt semantics, no-steps-found, status JSON shape, and
SSE replay-on-connect.

Refs: #790, #791
Closes: #792

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:30:57 +04:00
github-actions[bot]
10d0201a81 deploy: update catalyst images to ccfe1d4 2026-05-04 16:42:38 +00:00
e3mrah
ccfe1d42e8
fix(provision-page): re-fetch deployment state on SSE close before showing failure (closes #782) (#789)
The provision page (AppsPage via useDeploymentEvents) treated any SSE
close without a terminal `event: done` as a "Provisioning failed"
event, hard-coding the message:

  > Deployment ended with status=phase1-watching

But `phase1-watching` is an in-flight phase, not a terminal outcome.
The founder repeatedly saw this banner on otech93/otech94 (2026-05-04)
while the canonical /deployments/{id} record showed status=ready and
handoverFiredAt populated — the SSE was simply dropped by the reverse
proxy mid-stream.

This change replaces the SSE-close failure path with a single
re-fetch of /deployments/{id} that switches on the canonical status:

  • ready              → success banner with handoverURL (existing #764 path)
  • failed             → real error from snapshot.error, never the stale
                         "Deployment ended with status=<phase>" copy
  • in-flight statuses → keep the streaming spinner up and reconnect SSE
                         with exponential backoff (max 5 attempts)

Also surfaces handoverURL recovered from the canonical poll so a
backgrounded tab that lost the SSE during the handover-mint window
still renders the "Open your Sovereign console →" affordance.
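
The three-way decision above, sketched in Go for illustration only (the real
logic lives in the TypeScript `useDeploymentEvents` hook). Assumptions not
taken from this commit: the 1s base delay, the doubling factor, treating any
status other than ready/failed as in-flight, and the `GiveUp` action once the
5 attempts are used up.

```go
package main

import (
	"fmt"
	"time"
)

// Action is what the page should do after re-fetching /deployments/{id}.
type Action string

const (
	ShowSuccess Action = "show-success"
	ShowFailure Action = "show-failure"
	Reconnect   Action = "reconnect"
	GiveUp      Action = "give-up" // assumption: behaviour after the last attempt
)

const (
	maxAttempts = 5           // from the commit message
	baseDelay   = time.Second // assumption: the real base delay is not stated
)

// NextAction maps the canonical status plus the number of reconnects
// already made to the next action and the delay before it.
func NextAction(status string, attempt int) (Action, time.Duration) {
	switch status {
	case "ready":
		return ShowSuccess, 0
	case "failed":
		return ShowFailure, 0
	}
	// Everything else is treated as in-flight (assumption).
	if attempt >= maxAttempts {
		return GiveUp, 0
	}
	return Reconnect, baseDelay << uint(attempt) // 1s, 2s, 4s, 8s, 16s
}

func main() {
	for attempt := 0; attempt <= maxAttempts; attempt++ {
		action, delay := NextAction("phase1-watching", attempt)
		fmt.Println(attempt, action, delay)
	}
}
```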

Tests added cover all three branches plus the hard regression that
"Deployment ended with status=phase1-watching" can never appear in
streamError under any SSE-close path.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:40:32 +04:00
github-actions[bot]
ecaef7c17f deploy: update catalyst images to 2e981f3 2026-05-04 16:36:27 +00:00
e3mrah
2e981f36a5
fix(bp-keycloak): catalyst-kc-sa-credentials addr → in-cluster Service URL (closes #781) (#788)
Sovereign-side catalyst-api Pod's intra-cluster Keycloak calls (token
mint, EnsureUser) were failing with `dial tcp: lookup
auth.<sov-fqdn> on 10.43.0.10:53: no such host`. The Sovereign's
CoreDNS resolves *.<sov-fqdn> via upstream resolvers — it does NOT
forward to the in-cluster PowerDNS that holds those records. Public
DNS works (PowerDNS authoritative), but Pod-side lookups of
auth.<sov-fqdn> return NXDOMAIN.

Live evidence — otech94 2026-05-04: handover URL returned
`{"error":"keycloak error: ensure user"}` from a DNS lookup failure
inside the catalyst-api Pod.

Fix: bp-keycloak chart now writes the in-cluster Service URL
(http://<release>.<namespace>.svc.cluster.local) into the
catalyst-kc-sa-credentials Secret's `addr` key instead of the public
gateway host (https://auth.<sov-fqdn>). This Secret is consumed
EXCLUSIVELY by the in-cluster catalyst-api Pod via reflector mirror
into catalyst-system; it is NEVER exposed to browsers.

The HTTPRoute hostname (.Values.gateway.host) stays at auth.<sov-fqdn>
for operator browsers — only the Pod's intra-cluster OAuth
client_credentials calls switch to the Service URL.

Catalyst-Zero (contabo) is unaffected: it runs `keycloak-zero`
(separate chart in openova-private), not bp-keycloak.

Changes:
- platform/keycloak/chart/templates/configmap-sovereign-realm.yaml:
  Secret's $kcAddr unconditionally uses
  http://<release>.<namespace>.svc.cluster.local
- platform/keycloak/chart/Chart.yaml: 1.3.1 → 1.3.2
- clusters/_template/bootstrap-kit/09-keycloak.yaml: chart version 1.3.1 → 1.3.2
- products/catalyst/chart/Chart.yaml: 1.3.0 → 1.3.1 (changelog entry only)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: 1.3.0 → 1.3.1

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:34:22 +04:00
github-actions[bot]
eb9c935ab5 deploy: update catalyst images to fc2c198 2026-05-04 15:53:08 +00:00
e3mrah
fc2c198c90
feat(handover): auto-fire on Phase1 Ready + UI redirect (#778)
When the Phase-1 helmwatch terminates with OutcomeReady, catalyst-api
now mints the handover JWT immediately, persists handoverFiredAt +
handoverURL on the deployment record, and emits a typed SSE event
`event: handover-ready, data: { handoverURL, expiresAt }` so the
wizard's provision page can render the "Open your Sovereign console
→" CTA + auto-redirect after 5s. Until this landed, the operator was
stranded on the apps grid in terminal-completed state — the manual
mint endpoint existed but no UI surface ever invoked it.

Server (issue #768):
  - provisioner.Result gains HandoverFiredAt + HandoverURL.
  - phase1_watch.go: markPhase1Done's Ready transition calls a new
    fireHandover helper which mints via h.handoverSigner (RS256 5min
    TTL) and emits onto the durable buffer + live SSE channel.
  - StreamLogs renders Phase=="handover-ready" events as the typed
    SSE shape so a browser using addEventListener('handover-ready')
    receives the JSON payload directly. Idempotent under double-
    fire (informer reattach scenarios). No-op when handoverSigner
    is nil — the existing manual-mint path on the AdminPage button
    remains the fallback.
  - Lifted HandoverURL + HandoverFiredAt to /deployments/{id} top
    level so a GET-replay also drives the redirect when the SSE
    event was missed.

UI (issue #764):
  - useDeploymentEvents subscribes via EventSource.addEventListener
    ('handover-ready', …) and surfaces the payload as a new
    `handoverReady` return value. Same value populated from the
    /events GET-replay snapshot's handoverURL field for the
    SSE-missed case.
  - AppsPage renders a prominent green "Sovereign is ready" banner
    above the apps grid with an "Open your Sovereign console →"
    anchor link, fires a global success toast with the same CTA,
    starts a 5s redirect timer (window.location.href =
    handoverURL), and flips the document title to "✓ Sovereign
    ready — <fqdn>" so backgrounded tabs surface completion.

Tests:
  - Backend: 6 tests covering auto-fire on Ready, no-fire on
    failure, idempotency, no-signer no-op, typed-SSE-shape, and
    /deployments/{id} field lifting.
  - Frontend: 4 tests covering banner render, FQDN inclusion, 5s
    auto-redirect, and document.title flip.

Closes #764, #768.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:50:09 +04:00
e3mrah
53bc4357ca
feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767)

Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers
couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB):

1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate"
   Section with: bootstrap-kit baseline (sum of mandatory-tier component
   footprints), selected components delta, control-plane overhead, and a
   "Recommended N x <SKU>" line that turns amber when the operator's chosen
   worker count is below the rollup. Backed by per-component RAM/CPU floors
   in components/wizard/steps/componentFootprints.ts (covered by 12 unit
   tests including the otech92 reproduction).

2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at
   bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart
   9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired
   from the canonical flux-system/cloud-credentials.hcloud-token Secret
   cloud-init writes (mirrors the velero/harbor object-storage pattern).
   Pinned to the control-plane node so the autoscaler never schedules onto
   a worker it could itself terminate. 10-minute scale-down idle as the
   cost-saving default.

Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA /

KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over
KEDA for cluster scaling, and the bounds + safety story.

Per the issue's MVP scope, this PR ships the blueprint + StepReview
estimate WITHOUT the wizard StepProvider min/max pair refactor or the
tofu node-pool template restructuring. Those are tracked as a follow-up
issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps

Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected-
bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776
because the file existed without a matching entry in the expected DAG, AND
collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort +
slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to
the expected-bootstrap-deps.yaml so the audit passes.

`scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:49:44 +04:00
e3mrah
905319cc14
feat(catalyst): one-click kubeconfig download + merge for k9s parity (closes #765) (#775)
The catalyst-api GET /kubeconfig endpoint now rewrites k3s's hardcoded
`default` cluster / context / user names to the Sovereign's subdomain
(e.g. `otech94`) before serving the YAML, so the operator can run
`k9s --context=otech94` immediately after a single
`kubectl config view --flatten` merge — no more manual sed pipeline
between every Phase-1 Ready and the next k9s session.

Backend (catalyst-api):
- New helpers `rewriteKubeconfigContext`, `preferredContextName`, and
  `kubeconfigDownloadFilename` in internal/handler/kubeconfig.go.
- Rewriter uses yaml.v3 Node round-trip so cert-authority-data + token
  bytes are preserved verbatim. Idempotent — re-applying to an already
  renamed file is a no-op. Refuses non-kubeconfig YAML so a hand-edited
  file is never silently corrupted.
- Context name resolution: SovereignSubdomain → first FQDN label →
  literal "sovereign" fallback. Sanitised to RFC-1123 lowercase label
  charset.
- Content-Disposition filename is now `<subdomain>.yaml` (matches
  operator mental model + makes the merge command shell-friendly).
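
A rough sketch of the resolution order above (subdomain, then first FQDN
label, then the literal fallback). `contextName` and `sanitizeLabel` are
hypothetical names, not the helpers added in kubeconfig.go, and dropping
invalid characters (rather than replacing them) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabel lowercases s and keeps only RFC 1123 label characters
// (a-z, 0-9, '-'), trims leading/trailing dashes, and caps the length at 63.
func sanitizeLabel(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			b.WriteRune(r)
		}
	}
	out := strings.Trim(b.String(), "-")
	if len(out) > 63 {
		out = strings.Trim(out[:63], "-")
	}
	return out
}

// contextName picks the subdomain, else the first FQDN label,
// else the literal "sovereign".
func contextName(subdomain, fqdn string) string {
	if n := sanitizeLabel(subdomain); n != "" {
		return n
	}
	first, _, _ := strings.Cut(fqdn, ".")
	if n := sanitizeLabel(first); n != "" {
		return n
	}
	return "sovereign"
}

func main() {
	fmt.Println(contextName("otech94", "otech94.example.org")) // otech94
	fmt.Println(contextName("", "Acme.otech.example"))         // acme
	fmt.Println(contextName("", ""))                           // sovereign
}
```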

UI (catalyst wizard StepSuccess):
- New "Step 1 / Step 2" cluster-access surface on the success step:
  download button (unchanged endpoint) plus a copy-to-clipboard merge
  one-liner (`KUBECONFIG=$HOME/.kube/config:$HOME/Downloads/<file> kubectl
  config view --flatten > config.tmp && mv config.tmp config && chmod
  600 && k9s --context=<name>`).
- Atomic temp-file move instead of a direct redirect to ~/.kube/config
  so a Ctrl-C mid-pipe never corrupts the operator's existing config.
- Helpers `sovereignContextName` + `buildKubeconfigMergeCommand`
  exported so the test file (and a future Operator-Tools page on the
  Sovereign console) can re-use them with no logic drift.

Tests:
- 6 new Go tests covering the rewriter (idempotence, k3s default,
  mixed-name file, empty target rejection, malformed YAML rejection,
  non-kubeconfig rejection) + GET-handler integration test that
  exercises the subdomain → context-name path on a real fixture.
- 3 new vitest tests covering the merge-command UI block + 5 new
  helper-pure tests for `sovereignContextName` /
  `buildKubeconfigMergeCommand`.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:48:31 +04:00
github-actions[bot]
116233be51 deploy: update catalyst images to c4e2c10 2026-05-04 15:43:52 +00:00
e3mrah
c4e2c10587
fix(wizard): drop redundant 'locked to your sign-in' email microcopy (closes #762) (#774)
PR #759 enforces `req.OrgEmail == session.email` in the catalyst-api on
POST /v1/deployments, which means the operator IS the Sovereign owner
by definition. Asking again in the wizard, locking the field, and
explaining the lock with `Admin contact email · locked to your
sign-in` was redundant chrome that made StepDomain feel like a sign-up
form for the second time.

Changes:
- StepDomain: remove the AdminEmailField sub-component entirely (the
  "locked to your sign-in" microcopy + Lock icon + read-only input +
  isValidAdminEmail validator + the orgEmail clause in
  computeNextDisabled). Drop now-unused useSession + Lock + useEffect
  imports.
- StepReview: stamp `orgEmail` from `session.email` at submit time
  (with the wizard store as a fallback for the brief window between
  PIN-verify and the next session refetch). Rename the review-page
  row from "Admin email" to "Sovereign owner" to mirror the new UI
  vocabulary; the row now reads `session.email` so the operator sees
  exactly which identity the Sovereign will be owned by.
- StepDomain.test: keep the fresh-QueryClient-per-test wrapper but
  drop the seedSessionEmail plumbing (no longer needed). Add three
  regression tests confirming the field, the microcopy, and the
  orgEmail-gate on Continue are all gone.
- WizardLayout / WizardPage / StepOrg / StepReview: update doc
  comments that referenced the now-removed admin-email field.

Per docs/INVIOLABLE-PRINCIPLES.md #1 (never trust the client) the
load-bearing fix is still on the server (PR #759). This PR removes
the redundant client-side defense + the noisy chrome that explained it.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:40:43 +04:00
e3mrah
6a6b502008
fix(decommission): live exec-log view (unified) — was 'stuck' banner (closes #766) (#773)
The `/sovereign/decommission/<id>` page used to render a static
"Decommissioning…" button label with no progress signal — operators
thought the page was stuck while `tofu destroy` and the Hetzner orphan
purge were running for 30+ minutes.

The wipe handler in `api/internal/handler/wipe.go` ALREADY emits a
per-resource SSE event stream on the same `dep.eventsCh` channel that
provisioning uses (surfaced at `GET /api/v1/deployments/{id}/logs`).
Every "tofu destroy" tick, every Hetzner DELETE response, every S3
bucket purge step, every PDM release call, every local-state cleanup
is already a discrete event with `phase="wipe"`. The UI just wasn't
subscribing.

Fix is purely UI:

  • DecommissionPage subscribes to the same SSE via `useDeploymentEvents`
    once the wipe POST is in flight (`disableStream: false`), flattens
    every recorded event into `LogLine`, and feeds the unified
    `LogPane` (the same component `/provision/<id>` JobDetail uses for
    per-job logs).
  • Streaming layout replaces the form once submit fires: STREAMING
    chip, scrolling exec-log, full-screen toggle, search filter — all
    threaded through the existing LogPane primitives.
  • On wipe completion: COMPLETE chip + green checkmark + verbatim
    Hetzner-sweep summary block ("servers: 0 removed, load_balancers:
    0 removed, …" — the founder DoD is "0 of every kind on the
    Hetzner side") + 10s countdown back to /wizard. Operator can scroll
    back through every deletion at any time.
  • No backend change — the SSE plumbing is already there.

Tests: 7/7 pass (5 original + 2 new for #766). Per #1 (waterfall —
target shape on first commit) the streaming view ships with full
scrollback, search, full-screen, summary, and countdown in one PR.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:37:27 +04:00
github-actions[bot]
a29238d217 deploy: update catalyst images to fa58cc3 2026-05-04 13:46:18 +00:00
e3mrah
fa58cc32b5
fix(catalyst-api): validate orgEmail matches session.email + tighten list cross-tenant policy (closes #748) (#759)
Server-side enforcement is the load-bearing fix per docs/INVIOLABLE-PRINCIPLES.md
#1 (never trust the client). Until this lands a signed-in operator could POST
a deployment whose req.OrgEmail belonged to some other identity — the catalyst-
api accepted the body verbatim and stamped the wrong identity onto the
Sovereign-admin / Catalyst-Organization owner.

Server changes (deployments.go):
- CreateDeployment now reads claims from context (auth.RequireSession populates)
  with X-User-Email as the off-prod fallback. When a session is present,
  req.OrgEmail MUST EqualFold session.email — mismatch returns 403.
  OwnerEmail is stamped from the session-derived value, not request body —
  a future client-side bug cannot poison the durable owner field.
- ListDeployments (issue #747) tightened: when a session is present AND a
  ?owner= query param is also supplied AND ?owner != session.email, return
  200 + empty list rather than silently collapsing to session-only rows.
  Mirrors the issue #689 404-not-403 rule on /deployments/{id} — the
  response shape MUST NOT differentiate "exists but not yours" from
  "doesn't exist". Now also reads ClaimsFromContext as the canonical
  session source (X-User-Email fallback).
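
The two rules above, reduced to a sketch. `ownerForCreate`, `ownerFilter`,
and `ErrOwnerMismatch` are hypothetical names; the no-session passthrough and
the case-insensitive comparison of `?owner=` are assumptions, and the real
handlers in deployments.go map the error to a 403.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrOwnerMismatch is the condition a handler would translate into 403.
var ErrOwnerMismatch = errors.New("orgEmail does not match session email")

// ownerForCreate returns the email to stamp as owner. With a session, the
// request value must match case-insensitively and the session value wins.
func ownerForCreate(sessionEmail, reqOrgEmail string) (string, error) {
	if sessionEmail == "" {
		return reqOrgEmail, nil // assumption: off-prod, no session present
	}
	if !strings.EqualFold(reqOrgEmail, sessionEmail) {
		return "", ErrOwnerMismatch
	}
	return sessionEmail, nil
}

// ownerFilter returns the owner to list rows for, and false when the
// caller must receive an empty list (cross-tenant ?owner= query).
func ownerFilter(sessionEmail, ownerParam string) (string, bool) {
	if sessionEmail == "" {
		return ownerParam, true // assumption: no-session passthrough
	}
	if ownerParam != "" && !strings.EqualFold(ownerParam, sessionEmail) {
		return "", false // 200 + empty list, never 403
	}
	return sessionEmail, true
}

func main() {
	owner, err := ownerForCreate("ops@example.com", "OPS@example.com")
	fmt.Println(owner, err) // ops@example.com <nil>

	_, err = ownerForCreate("ops@example.com", "other@example.com")
	fmt.Println(err) // orgEmail does not match session email

	_, ok := ownerFilter("ops@example.com", "other@example.com")
	fmt.Println(ok) // false
}
```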

Tests:
- 4 new tests in deployments_test.go (all pass):
  - TestCreateDeployment_RejectsMismatchedOrgEmail (403 + no PDM Reserve
    + no row stored)
  - TestCreateDeployment_AcceptsMatchingOrgEmail (case-insensitive match,
    OwnerEmail derived from session not request)
  - TestListDeployments_FiltersByOwnerSession (cross-tenant row hidden)
  - TestListDeployments_OwnerQueryParam (cross-tenant ?owner returns
    empty list, never 403)
- deployments_list_test.go: existing TestListDeployments_FilterBySessionEmail
  rewritten to match the tightened cross-tenant policy (empty list, not
  silent override). New TestListDeployments_CrossTenantOwnerQueryReturnsEmpty
  added to assert the explicit boundary.

UI changes:
- ui/src/pages/wizard/steps/StepDomain.tsx — defense-in-depth UX:
  AdminEmailField pre-fills orgEmail from useSession() and renders
  read-only with a Lock icon and tooltip "Sovereigns are owned by the
  email you signed in with." A useEffect mirrors session.email into
  the wizard store so a stale value from a previous sign-in cannot
  survive into the current session.
- ui/src/pages/wizard/steps/StepDomain.test.tsx — wraps every render
  in a fresh QueryClientProvider (AdminEmailField now consumes
  useSession via TanStack Query). All 15 existing UI tests pass.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:43:58 +04:00
github-actions[bot]
407f37944b deploy: update catalyst images to 35569e2 2026-05-04 13:40:49 +00:00
e3mrah
35569e2344
fix(types): DeploymentID branded type — kill 15-char truncation forever (closes #749, #754) (#760)
The "deployment ID truncated by one char" bug recurred multiple times because
every UI code path treated the id as a free-form `string`. Any new error
template, toast, or URL builder could (and did) introduce another truncation.

This change makes the truncation impossible at compile time:

- Adds `shared/types/deployment.ts` with a branded `DeploymentID` type
  (`string & { readonly __brand: 'DeploymentID' }`) plus
  `parseDeploymentID()` / `isDeploymentID()` validators. The regex
  enforces the canonical 16 lowercase hex chars catalyst-api emits.
- Updates `entities/deployment/model.ts` to type `WizardState.deploymentId`
  as `DeploymentID | null`. Re-exports the brand from the model so
  existing imports keep working.
- Updates `entities/deployment/store.ts` to route `setDeploymentId()` and
  the persistence `merge()` path through `parseDeploymentID()`. A bad id
  in localStorage gets wiped rather than rendered as a misleading
  "<truncated>-is-unknown-to-backend" error.
- Updates `pages/sovereign/AppsPage.tsx` to validate the route param at
  the page boundary via `isDeploymentID()`, and emits a dedicated
  malformed-id notification when the URL value isn't 16 lowercase hex
  chars (so the operator sees the FULL invalid value, not a hidden
  off-by-one).
- Adds 25 unit tests covering the parser (valid/invalid lengths,
  uppercase, non-string types, error-message hygiene) plus the
  `isDeploymentID` type guard.
- Adds an integration test (`ProvisionPage.sse-url.test.tsx`) that
  mounts the page with a 16-char hex route param, installs a recording
  EventSource shim, and asserts the constructed URL is exactly
  `${API_BASE}/v1/deployments/<FULL_16_CHAR_ID>/logs` — including the
  exact `eeb34ecd1414a505` id from issue #749's live evidence.
- Updates `StepSuccess.test.tsx` fixture to a real 16-char hex id so
  the wizard store accepts it through the new typed setter.

Audit findings — search across the entire UI src for `slice(0, 15..19)`,
`substring(0, 15..19)`, and `[a-f0-9]{15}` patterns turned up NO direct
truncation site in production code. The root cause of the 2026-05-04
incident was that every consumer trusted a raw `string` route param
without validation, so a URL with a manually-truncated id fed straight
into both the SSE URL builder and the error message verbatim. The
branded-type contract is now the structural fix: any future code that
tries to assign an unvalidated string to a `DeploymentID` field fails
compilation, and any URL with the wrong shape surfaces a clear
malformed-id banner instead of "deployment <wrong> is unknown".
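
For comparison, the same validation rule (16 lowercase hex chars) sketched in
Go. This is an illustrative analogue only: the change itself is the
TypeScript brand in `shared/types/deployment.ts`, and `ParseDeploymentID`
below is a hypothetical Go counterpart. A Go defined type gives a weaker
guarantee than the brand, since an explicit `DeploymentID(s)` conversion
still bypasses the parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// DeploymentID is a distinct type so a plain string variable cannot be
// assigned to it without an explicit conversion or the parser below.
type DeploymentID string

// 16 lowercase hex characters, the canonical shape named in the commit.
var deploymentIDPattern = regexp.MustCompile(`^[0-9a-f]{16}$`)

// ParseDeploymentID validates raw and echoes the FULL invalid value in the
// error so an off-by-one truncation is visible to the operator.
func ParseDeploymentID(raw string) (DeploymentID, error) {
	if !deploymentIDPattern.MatchString(raw) {
		return "", fmt.Errorf("malformed deployment id %q: want 16 lowercase hex chars", raw)
	}
	return DeploymentID(raw), nil
}

func main() {
	id, err := ParseDeploymentID("eeb34ecd1414a505")
	fmt.Println(id, err) // eeb34ecd1414a505 <nil>

	_, err = ParseDeploymentID("eeb34ecd1414a50") // 15 chars
	fmt.Println(err)
}
```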

Closes #749, #754.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:38:27 +04:00
github-actions[bot]
b1915a9e14 deploy: update catalyst images to 8e57abe 2026-05-04 13:32:38 +00:00
e3mrah
8e57abe9d0
fix(wizard): auto-redirect signed-in user to in-flight /sovereign/provision/<id> (closes #747) (#758)
A signed-in operator who refreshed /sovereign/wizard during a 15-minute
provisioning run lost the progress page and landed on Step 1 of an empty
form (caught live with otech90 on 2026-05-04). Wires the wizard route
to call the new GET /api/v1/deployments?owner=<email> endpoint and
redirect to /sovereign/provision/<id> when an in-flight deployment is
found.

Backend
- Add ListDeployments handler returning the slim shape (id, status,
  sovereignFQDN, region, startedAt, finishedAt, ownerEmail, adoptedAt,
  error). Filtered server-side by the X-User-Email header injected by
  RequireSession; ?owner= is a client hint that is silently overridden
  when the session header is set so a signed-in attacker cannot list
  someone else's rows. Adopted deployments are excluded — once the
  customer's Sovereign owns the cluster, the wizard redirect must not
  pull the operator back to Catalyst-Zero.
- Register GET /api/v1/deployments inside the RequireSession group.
- 5 new handler tests covering session-override, adopted exclusion,
  legacy-row exclusion, no-session passthrough, and ?owner= filtering.

Frontend
- New useInflightDeployment hook (TanStack Query, 30s stale time)
  returning {inflight, completed, all} buckets. inflight matches
  pending/provisioning/tofu-applying/tofu-plan/tofu-apply/
  flux-bootstrapping/cloud-init-waiting/phase1-watching plus
  ready-but-not-adopted. Picks the most-recent by startedAt.
- WizardPage redirect effect: when session.signedIn && inflight,
  navigate replace=true to /provision/<id> and render null while the
  redirect resolves. When the operator has only completed/wiped/failed
  rows, render a banner with a "View your previous deployments" link.
- New DeploymentsList page at /deployments (browser path
  /sovereign/deployments behind the Traefik strip-prefix). Single table:
  FQDN, status, started, finished, region. Each FQDN links back to
  /provision/<id>.
- 6 hook unit tests covering most-recent picking, ready-not-adopted,
  adopted exclusion (defense-in-depth), 401 graceful degrade, and
  enabled=false short-circuit.

Tests
- 5 backend handler tests pass (TestListDeployments_*)
- 6 frontend hook tests pass (useInflightDeployment.test.tsx)
- TS typecheck + Vite build clean
- Pre-existing TestAuthHandover_HappyPath panic + StepComponents
  catalog-data failures verified unrelated (fail on bare main)

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:30:36 +04:00
github-actions[bot]
5bb7d45647 deploy: update catalyst images to 5decebf 2026-05-04 13:17:56 +00:00
e3mrah
5decebf801
fix(provision): drop bespoke 'Operator' widget, use ProfileMenu top-right (closes #750) (#757)
The /sovereign/provision/<id> page rendered a bespoke "Operator /
Provisioning session" card in the bottom-left of its Sidebar. Two
problems:

  1. Identity placement was inconsistent with the rest of the app
     (wizard, Sovereign-console, marketplace all place identity
     top-right). The provisioning surface was the lone outlier.

  2. The label "Operator" was hard-coded and never reflected the
     signed-in user's email — it ignored useSession() entirely.

This drops the bespoke card from Sidebar.tsx and renders the canonical
<ProfileMenu /> (the same widget WizardLayout uses) in PortalShell's
top-right slot. ProfileMenu reads useSession() so anonymous visitors
get a [Sign in] button and signed-in operators get an email-initial
avatar that opens a "Signed in as <email>" + "Sign out" dropdown.

Because PortalShell wraps every /sovereign/provision/* route (apps,
jobs, dashboard, cloud, users, settings), this fix touches all of
them in one place.

Test updates:
  - Sidebar.test.tsx now asserts the bespoke widget is GONE rather
    than asserting it renders, locking in the regression guard.

No backend / API surface changes.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
2026-05-04 17:15:46 +04:00
github-actions[bot]
c69e4987da deploy: update catalyst images to 05065b6 2026-05-04 13:13:50 +00:00
github-actions[bot]
4b659ced17 deploy: update catalyst images to e855ab0 2026-05-04 13:09:40 +00:00
github-actions[bot]
87ffe512c5 deploy: update catalyst images to ceeefd7 2026-05-04 12:03:20 +00:00
github-actions[bot]
fea00720f7 deploy: update catalyst images to 468c3ba 2026-05-04 11:53:06 +00:00
github-actions[bot]
9ee3b2e911 deploy: update catalyst images to b02fc37 2026-05-04 11:37:57 +00:00
e3mrah
b02fc3788a
fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request

Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

  validation {
    condition = alltrue([
      for r in var.regions :
      contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
    ])
  }

The `for r in var.regions` iteration fails on null with:

  Error: Iteration over null value
  on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.
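
The nil-vs-empty distinction can be reproduced in a few lines. A sketch, with
a simplified `Region` struct and a `render` helper that are hypothetical
stand-ins for the real request type and writeTfvars:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Region is a simplified stand-in for the per-region override type.
type Region struct {
	Provider string `json:"provider"`
}

// coalesceRegions turns a nil slice into an empty one so it marshals
// as [] instead of null; non-nil input is returned unchanged.
func coalesceRegions(in []Region) []Region {
	if in == nil {
		return []Region{}
	}
	return in
}

type tfvars struct {
	Regions []Region `json:"regions"`
}

// render marshals the tfvars document (hypothetical helper).
func render(regions []Region) string {
	out, err := json.Marshal(tfvars{Regions: regions})
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Println(render(nil))                  // {"regions":null}
	fmt.Println(render(coalesceRegions(nil))) // {"regions":[]}
}
```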

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
  serialises as JSON `[]`, never `null`, when the request has no
  per-region overrides.

Builds on PR #742.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)

Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the
cpx21 CP default from PR #741 fell apart at apply time —

  Error: Server Type "cpx21" is unavailable in "fsn1" and can no
  longer be ordered

Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog
(`/v1/server_types`) but are NOT in the per-DC orderable list
(`available_for_migration` on `/v1/datacenters`) for any EU DC
(fsn1/nbg1/hel1). New Sovereigns in those regions therefore cannot
provision the SKUs the wizard's catalog offers.

Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
  • cpx11 (2 vCPU / 2 GB) — too small for the CP working set
  • cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
  • cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
  • cpx42, cpx52, cpx62 — bigger and more expensive

New default per Sovereign:

| Component       | Old                | New                | Savings     |
|-----------------|--------------------|--------------------|-------------|
| Control plane   | CPX32 (€16.49)     | CPX22 (€9.49)      | €7.00       |
| Worker × 2      | CPX32 × 2 (€32.98) | CPX32 × 2 (€32.98) | €0          |
| TOTAL           | €49.47/mo          | €42.47/mo          | €7.00 (14%) |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo)
assumed those SKUs were orderable. They aren't in EU DCs. The 14%
saving from cpx22 CP is the largest concrete optimisation that
ships TODAY without compromising the multi-node horizontal-scale
agreement (issue #733): still 1 CP + 2 workers from day one.
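
The totals above can be re-derived with a short calculation. This is an illustrative sketch only: the prices are the fsn1 figures quoted in this message, held in euro cents to avoid floating-point drift, and the function names are hypothetical.

```go
package main

import "fmt"

// Monthly fsn1 prices in euro cents, as quoted in this commit message.
const (
	cpx22Cents = 949
	cpx32Cents = 1649
)

// monthlyCents returns the cost of one control plane plus `workers` workers.
func monthlyCents(controlPlane, worker, workers int) int {
	return controlPlane + worker*workers
}

// savingPercent returns how much cheaper newCost is than oldCost, in percent.
func savingPercent(oldCost, newCost int) float64 {
	return float64(oldCost-newCost) / float64(oldCost) * 100
}

func main() {
	oldCost := monthlyCents(cpx32Cents, cpx32Cents, 2) // 4947 cents
	newCost := monthlyCents(cpx22Cents, cpx32Cents, 2) // 4247 cents

	fmt.Printf("old=€%.2f new=€%.2f saving=%.0f%%\n",
		float64(oldCost)/100, float64(newCost)/100,
		savingPercent(oldCost, newCost))
}
```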

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size        default cpx31 → cpx32 (back to the prior orderable choice)

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace the fictional CPX21 (€5.49/mo) and CPX31 (€7.49/mo) prices
  with the actual fsn1 Hetzner API prices (€10.99/mo and €20.49/mo).
  Mark both as "listed but NOT orderable in EU DCs" so the wizard
  surfaces the constraint instead of letting operators pick a
  non-orderable SKU.
  Move recommended:true from CPX21 → CPX22.
  defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:35:55 +04:00
github-actions[bot]
20c839efc4 deploy: update catalyst images to 8989ce7 2026-05-04 11:29:07 +00:00
e3mrah
8989ce7659
fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request (#743)
Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

  validation {
    condition = alltrue([
      for r in var.regions :
      contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
    ])
  }

The `for r in var.regions` iteration fails on null with:

  Error: Iteration over null value
  on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
  serialises as JSON `[]`, never `null`, when the request has no
  per-region overrides.
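
A rough sketch of the kind of check such a test could make. The `tfvars` struct and `renderTfvars` function are hypothetical stand-ins for the real writeTfvars, whose signature is not shown here; the point is only to inspect the raw JSON value of `regions` rather than the decoded slice, since `null` and `[]` both decode to a zero-length slice.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// tfvars is a hypothetical, trimmed-down shape of the rendered tfvars JSON.
type tfvars struct {
	Regions []string `json:"regions"`
}

// renderTfvars is a stand-in for writeTfvars: it normalises a nil slice to
// an empty one before marshalling.
func renderTfvars(regions []string) ([]byte, error) {
	if regions == nil {
		regions = []string{}
	}
	return json.Marshal(tfvars{Regions: regions})
}

// regionsIsEmptyArray reports whether the raw "regions" value is exactly [].
func regionsIsEmptyArray(doc []byte) (bool, error) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(doc, &raw); err != nil {
		return false, err
	}
	return bytes.Equal(raw["regions"], []byte("[]")), nil
}

func main() {
	doc, err := renderTfvars(nil) // request with no per-region overrides
	if err != nil {
		panic(err)
	}
	ok, err := regionsIsEmptyArray(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // true
}
```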

Builds on PR #742.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:26:58 +04:00
github-actions[bot]
10d1af8c91 deploy: update catalyst images to 7ef5af7 2026-05-04 11:11:10 +00:00