Chart 1.3.2 shipped serviceaccount-cutover-driver.yaml +
clusterrole-cutover-driver.yaml + clusterrolebinding-cutover-driver.yaml
with `{{ .Release.Namespace }}` directives that rendered fine via Helm
on Sovereigns but BROKE the Kustomize-mode contabo-mkt deploy: the
directives made Kustomize parse the files as invalid YAML and silently
skip them. Worse, the new files were never added to
templates/kustomization.yaml's resources list.
Result on contabo: catalyst-api Pod's spec.serviceAccountName references
a non-existent SA — the Pod fails ContainerCreating with the same RBAC
forbidden error #830 was meant to fix.
Fix:
- Strip `{{ .Release.Namespace }}` directives from the SA + ClusterRole
files. metadata.namespace auto-fills from Helm's --namespace flag
and from Kustomize's `namespace:` directive.
- For ClusterRoleBinding: Helm does NOT auto-inject
subjects[0].namespace the way it does metadata.namespace, so the
apiserver rejects bindings without it. Split into two files:
* clusterrolebinding-cutover-driver.yaml — Helm-only, uses
{{ .Release.Namespace }} (correctly resolves to catalyst-system
on Sovereigns).
* clusterrolebinding-cutover-driver-kustomize.yaml — Kustomize-
only, omits subjects[0].namespace and relies on Kustomize's
native injection (resolves to `catalyst` on contabo).
The .helmignore excludes the Kustomize-only file from Sovereign
chart packaging; templates/kustomization.yaml's resources list
references the Kustomize-only file, NOT the Helm-only one.
- Add the new RBAC files to templates/kustomization.yaml's resources
list so contabo's Flux Kustomization actually renders them.
Verified live with `helm template` (subjects[0].namespace=catalyst-system)
and `kubectl kustomize` (subjects[0].namespace=catalyst).
Bumps bp-catalyst-platform 1.3.2 → 1.3.3.
Issue: openova-io/openova#830 (Bug 1 follow-up)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the SME tenant provisioning pipeline (#804) for the multi-domain
Sovereign (epic #825). The SME tenant create form now lets the operator
pick which sme-pool parent zone hosts the tenant; the orchestrator
writes DNS records under the chosen parent (not a hardcoded primary).
Backend (Go):
- store.SMETenantProvisionRecord.ParentDomain — captured at create
- handler.SMETenantParentDomain + SMETenantDeps.ParentDomains — pool wiring
- POST /api/v1/sme/tenants accepts parent_domain; defaults to the first
NS-flip-ready sme-pool entry; rejects unknown parents (400) and
not-yet-flipped parents (503 + Retry-After)
- DNS provisioner ProvisionFreeSubdomain takes a parentZone parameter;
ValidateBYOCNAME accepts a multi-target candidate list (any parent)
- Pipeline: writes A records under the chosen parent zone; realm URL,
console host, and gitops template hostnames all derive from
ParentDomain (data-driven; never hardcoded)
- New GET /api/v1/sovereign/parent-domains?role= read-only endpoint
with env stub (CATALYST_SME_POOL_DOMAINS) that integrates cleanly
with MD-1 (#826) when its data model lands
UI (React + TanStack Router + Vitest + Playwright):
- New /console/sme/tenants/new — CreateTenantPage with domain-mode
radio, parent-domain <select> populated from the new endpoint,
per-option NS-flip-ready disabled state, live console URL preview,
CNAME validation hint for BYO mode, post-submit progress timeline
- 7 Vitest unit tests + 2 Playwright E2E specs (free-subdomain + BYO),
5 1440px screenshots emitted under e2e/screenshots/828-*.png
Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent-domain pool is fully
data-driven; the UI consumes the same wire shape MD-1 will surface.
Per #2 (never compromise on quality) the page paints partial state on
hook failure with per-step badges from the response.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).
bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
(templates/zone-bootstrap-job.yaml) that POSTs each entry to
/api/v1/servers/localhost/zones at install time. Idempotent on
HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).
bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
values
- New templates/sovereign-wildcard-certs.yaml renders one
cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
renews independently. Skips entirely when parentZones is empty so
the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
retains ownership of `sovereign-wildcard-tls` (avoids
helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
CATALYST_POWERDNS_SERVER_ID env vars.
catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
typed client when wired via SetPowerDNSZoneClient — the
admin-console "Add another parent domain" flow now creates real
zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
201/409/412/500 + custom NS + custom serverID).
Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
(defaults to single-zone array derived from sovereign_fqdn) →
cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.
DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
PowerDNS zone-create API calls in the bootstrap Job AND 2
Certificate resources (`*.omani.works`, `*.omani.trade`) in
bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
`[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
provisioning paths working without per-overlay edits.
Closes #827.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Sub-1 of epic #825 (Multi-domain Sovereign). Backend-only per the
SCOPE CORRECTION on issue #826: the wizard stays single-FQDN, multi-
domain capability is a Day-2 admin-console action (#829, already
merged with an in-memory stub waiting on this PR's persistence
layer).
What this PR adds:
- provisioner.ParentDomain struct (Name, Role, RegistrarKind,
RegistrarCredsRef, AddedAt) with role constants
ParentDomainRolePrimary | ParentDomainRoleSMEPool. Wire shape
matches the handler-layer ParentDomain in
handler/parent_domains.go (#829), so the handler's swap from
in-memory store → Deployment.parentDomains[] is a one-line
change in a follow-up PR.
- Request.ParentDomains []ParentDomain field. Backward-compatible:
when the slice is empty, Validate() synthesises a single primary
entry from SovereignPoolDomain (or SovereignFQDN) so legacy
single-FQDN payloads + on-disk records read cleanly. The next
Save() round-trips the array form — transparent migration with
no one-shot script.
- validateParentDomains: enforces "exactly one primary", role enum,
FQDN regex (RFC 1035, mirrors wizard isValidDomain), duplicate-
name dedupe, lowercase normalisation in place.
- ProvisionParentDomain / ProvisionParentDomains: the per-domain
abstraction the issue's DoD calls out as "reusable function ready
for #829". Day-2 add-domain calls this with the same step list
(registrar-flip → powerdns-zone-create → cert-manager-cert) the
Day-1 path uses; idempotent, stops on first error, emits per-step
SSE events for the admin panel.
- Request.PrimaryParentDomain() / SMEPoolParentDomains() lookup
helpers so the catalyst-api handler + SME signup wizard read the
primary / sme-pool subset without re-iterating at every call site.
- writeTfvars emits parent_domains as a JSON array (never null) so
a future OpenTofu module's `for pd in var.parent_domains`
validator accepts the input — same nil-trap fix the regions slice
already carries.
- store.RedactedRequest + ToProvisionerRequest round-trip the slice
verbatim. Fields are non-secret (RegistrarCredsRef points at a
SealedSecret name; plaintext registrar credentials never live on
the deployment record).
- store.crdStore mirrors the slice into the ProvisioningState CRD
spec so admin tooling reading via the K8s API sees the live pool.
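The validateParentDomains rules listed above might look roughly like the
sketch below. The struct is trimmed to two fields, the regex is an
illustrative RFC 1035-style pattern rather than the project's own, and
treating a duplicate name as an error (instead of silently dropping it)
is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

type ParentDomain struct {
	Name string
	Role string
}

const (
	rolePrimary = "primary"
	roleSMEPool = "sme-pool"
)

// fqdnRe is an illustrative pattern: dot-separated LDH labels plus an alphabetic TLD.
var fqdnRe = regexp.MustCompile(`^([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$`)

// validateParentDomains lowercases names in place, then checks FQDN shape,
// duplicates, the role enum, and the "exactly one primary" invariant.
func validateParentDomains(pds []ParentDomain) error {
	seen := make(map[string]bool, len(pds))
	primaries := 0
	for i := range pds {
		pds[i].Name = strings.ToLower(strings.TrimSpace(pds[i].Name))
		pd := pds[i]
		if !fqdnRe.MatchString(pd.Name) {
			return fmt.Errorf("invalid FQDN %q", pd.Name)
		}
		if seen[pd.Name] {
			return fmt.Errorf("duplicate parent domain %q", pd.Name)
		}
		seen[pd.Name] = true
		switch pd.Role {
		case rolePrimary:
			primaries++
		case roleSMEPool:
		default:
			return fmt.Errorf("unknown role %q", pd.Role)
		}
	}
	if primaries != 1 {
		return errors.New("exactly one primary parent domain is required")
	}
	return nil
}

func main() {
	pds := []ParentDomain{
		{Name: "Omani.Works", Role: rolePrimary},
		{Name: "omani.trade", Role: roleSMEPool},
	}
	fmt.Println(validateParentDomains(pds), pds[0].Name)
}
```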
What this PR does NOT touch (explicit scope):
- products/catalyst/bootstrap/ui/src/pages/wizard/** — wizard UI
stays single-FQDN per the issue's SCOPE CORRECTION.
- products/catalyst/bootstrap/api/internal/handler/parent_domains.go
— the #829-merged Day-2 admin handler keeps its in-memory store;
a one-line follow-up PR swaps to Deployment.parentDomains[].
Inviolable Principle #4: defaultRegistrarKindFromEnv reads
CATALYST_DEFAULT_REGISTRAR_KIND so operators on registrars other
than Dynadot override the synthesis path without code changes. No
TLD or count is hardcoded.
Tests:
- 14 new unit tests across two new files (parent_domains_test.go in
provisioner + store packages). Cover: synthesis from
SovereignFQDN + SovereignPoolDomain, "exactly one primary"
invariant (rejects 2 + 0), unknown role, empty role, malformed
FQDN, duplicate names, uppercase normalisation, lookup helpers,
step-runner ordering + first-error halt, slice-flavour
multi-domain iteration, JSON round-trip through Redact + Save +
LoadAll, empty-slice omitempty, legacy on-disk record loads
cleanly + migration synthesises primary on Validate.
- Pre-existing Harbor-token + AuthHandover-signer-nil failures
persist on origin/main; this PR introduces no new failures.
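The nil-trap that the writeTfvars change guards against comes from
encoding/json: a nil Go slice marshals to `null`, while an empty non-nil
slice marshals to `[]`. A small sketch, with marshalParentDomains as a
hypothetical helper name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ParentDomain struct {
	Name string `json:"name"`
	Role string `json:"role"`
}

// marshalParentDomains always emits a JSON array, never null, by swapping
// a nil slice for an empty one before encoding.
func marshalParentDomains(pds []ParentDomain) ([]byte, error) {
	if pds == nil {
		pds = []ParentDomain{}
	}
	return json.Marshal(pds)
}

func main() {
	var none []ParentDomain
	raw, _ := json.Marshal(none)
	fixed, _ := marshalParentDomains(none)
	fmt.Println(string(raw), string(fixed)) // prints: null []
}
```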
Closes #826.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(unified-rbac): SME-tier extension + host-header tenant discovery (#802)
Implements the SME-tier extension to the existing Sovereign Console SPA
per [Q-mine-1] of #795: same React bundle serves both otech-admin and
SME-admin views, tenant context discovered via window.location.host
against a back-end registry — not from path/subdomain string parsing.
Backend (catalyst-api / unified-rbac slice):
- Tenant registry (store.TenantRegistry) — flat-file host → tenant
lookup table backing the public discovery endpoint. Host normalised
to lowercase; case-insensitive lookups.
- GET /api/v1/tenant/discover (public, no auth gate) — returns
{tenant_id, tenant_kind, keycloak_realm_url, keycloak_client_id} on
200, 404 on unknown host, 503 if registry unwired. Admin URLs are
NEVER on this wire.
- POST /api/v1/sme/users — fires ADR-0003 3-step hook (Keycloak →
NewAPI → K8s Secret SSA with field manager `unified-rbac`). Each
step idempotent; persisted state machine in store.UserProvisionStore
per ADR-0003 §3.4. Returns 202 with steps[] progress array so the
SPA can render the 3-step indicator even on partial failure.
- GET /api/v1/sme/users / DELETE /api/v1/sme/users/{uuid} — list +
inverse rollback per ADR-0003 §3.7.
- internal/newapi.Client — minimal NewAPI admin REST client; 201
happy-path + 409 idempotent recovery via GET ?external_id=<uuid>
per ADR-0003 §3.2 (NewAPI does NOT rotate api_key on conflict).
Frontend (Sovereign Console SPA):
- Branded TenantID + TenantKind types (shared/types/tenant.ts) — same
pattern as DeploymentID (#749).
- shared/lib/tenantDiscover.ts — fire-and-forget discovery in main.tsx;
result cached in module state for sidebar nav + OIDC bootstrap.
- pages/sme/UsersPage.tsx — user CRUD UI with 3-step KC/NewAPI/Secret
progress indicator wired off the API response shape.
- pages/sme/RolesPage.tsx — canonical Keycloak group → app role map
(wordpress / openclaw / stalwart / rbac) per #795 [B].
- pages/sme/sme.api.ts — typed REST client; X-Tenant-Host header
carries window.location.host on every call.
- Routes mounted at /console/sme/users + /console/sme/roles under the
existing SovereignConsoleLayout — same SPA bundle, different route
tree per discovered tenant_kind.
Tests: 22 new UI tests (4 files), 33 new Go tests (4 files). All
green: branded type parsers reject empty/non-string inputs, tenant
discovery handles 200/404/503/network-error paths, the 3-step hook
runs end-to-end against fake KC/NewAPI/SSA stubs, partial-failure
states surface verbatim through the steps[] response field, public
discovery endpoint never leaks admin URLs.
Per docs/INVIOLABLE-PRINCIPLES.md #4 every URL goes through apiUrl()
in shared/config/urls; per #2 wire shapes parse through branded-type
parsers at the boundary; per #3 K8s Secret apply uses client-go SSA
(field manager `unified-rbac`) — no exec.Command kubectl shell-out.
Closes #802.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(unified-rbac): add Playwright E2E for SME-tier UI (#802)
Three specs covering:
- SME UsersPage: empty state → create form → 3-step progress
indicator (KC done / NewAPI done / Secret done) — proves the
page is wired to the API response shape.
- SME RolesPage: canonical group → app-role table renders the
full 7-row mapping locked in #795 [B].
- OTECH tenant: same SPA bundle navigates /console/dashboard for
the otech discovery payload — proves [Q-mine-1] of #795
(one bundle, two route trees, host-driven discovery).
Backend mocks: route fulfillers stub /tenant/discover, /sme/users,
and /whoami so the dev-server harness can drive the SPA without
the catalyst-api backend or a live SME vcluster. The full live
cross-cluster E2E gates on bp-newapi (#799) seeding the tenant
registry at SME-onboarding time, which lands in #804.
1440 px screenshots captured at e2e/screenshots/802-*.png:
- 802-sme-users-empty-1440.png
- 802-sme-users-create-form-1440.png
- 802-sme-users-after-create-1440.png
- 802-sme-roles-1440.png
- 802-otech-dashboard-same-bundle-1440.png
Run: VITE_CATALYST_MODE=sovereign VITE_SOVEREIGN_FQDN=acme.otech.example
npm run dev
npx playwright test e2e/sme-tier-rbac.spec.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(admin-console): add-domain flow + DNS propagation status panel (#829)
Multi-domain Sovereign — operator-admin "Add another parent domain"
surface in the Sovereign Console + live DNS propagation status panel.
Closes the MD-4 sub-ticket of epic #825.
Backend (catalyst-api/internal/handler/parent_domains.go):
- GET /api/v1/sovereign/parent-domains — list pool
- POST /api/v1/sovereign/parent-domains — add domain
- DELETE /api/v1/sovereign/parent-domains/{name} — remove
- GET /api/v1/sovereign/parent-domains/{name}/propagation — fan-out
to 5+ public DNS resolvers
The Add pipeline calls PDM /set-ns (sister #826), creates the PowerDNS
zone (sister #827, env-gated stub until that PR lands), and issues a
wildcard cert via cert-manager (also sister #827, env-gated stub). All
three steps update the same store row so the UI can render per-step
progress.
DNS propagation panel uses Go's net.Resolver with a custom Dial that
routes lookups through a SPECIFIC resolver IP (8.8.8.8, 1.1.1.1,
9.9.9.9, 208.67.222.222, 4.2.2.1) rather than the system resolver.
Per inviolable principle #4, the resolver list, expected NS records,
and per-query timeout are all env-overridable.
Frontend (ui/src/pages/admin/parent-domains/):
- ParentDomainsPage.tsx — list view + Add Domain modal + per-row
inline drawer with PropagationPanel
- PropagationPanel.tsx — polls /propagation every 60s, renders
green/yellow/red pills per resolver + rolling % propagated number
- parentDomains.api.ts — typed REST client wrappers, no inline /api/
Routing:
- /console/parent-domains registered under SovereignConsoleLayout
- Added to Settings sub-nav for operator-admin reachability
Tests:
- 6 vitest cases (empty state, populated rows, modal open, drawer
toggle, primary lock, propagation panel mount)
- 13 Go cases covering list/add/delete/validation/propagation wire
shape against a stub PDM
- 3 Playwright E2E + 1440x900 screenshots:
e2e/screenshots/829-1-just-flipped.png (0% propagated)
e2e/screenshots/829-2-partially-propagated.png (40%)
e2e/screenshots/829-3-fully-propagated.png (100%)
Per inviolable principle #10 (credential hygiene) the registrarToken
field is forwarded byte-for-byte to PDM and never enters a logged
struct; the modal input uses type="password".
Refs: #825 (parent epic), #826 (sister MD-1), #827 (sister MD-2)
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase-1 helmwatch watcher used to lose state on every catalyst-api
Pod roll. fromRecord rewrote any "phase1-watching" status to "failed"
on the next Pod start — even though Phase 0 had already committed its
tofu state, the Sovereign cluster was healthy, the kubeconfig was on
the PVC, and the bootstrap-kit HelmReleases kept reconciling regardless
of whether catalyst-api's in-memory watcher was alive.
Caught live on otech102 (2026-05-04): a transient catalyst-api roll
mid-Phase-1 latched the deployment record to status=failed, the auto-
fire handover never triggered, and the operator was stranded on the
wizard page. The manual workaround was to patch the record back to
status=ready and mint a handover token by hand.
Fix: split the in-flight rewrite into two cases:
- Phase-0 in-flight (pending/provisioning/tofu-applying/
flux-bootstrapping) — STILL rewritten to failed (tofu workdir on /tmp
emptyDir died with the Pod, Hetzner resources orphaned).
- phase1-watching — preserved across restart so the post-restart
resume path picks it up via shouldResumePhase1 + resumePhase1Watch
(already wired). The on-disk store record stays consistent with
the in-memory state during rehydrate.
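The two-case split can be sketched as below. The helper names follow the
commit text, but their signatures and bodies are assumptions;
statusAfterRestart is a hypothetical stand-in for the rewrite performed
in fromRecord.

```go
package main

import "fmt"

// The four Phase-0 statuses whose tofu workdir dies with the Pod.
var phase0InFlight = map[string]bool{
	"pending":            true,
	"provisioning":       true,
	"tofu-applying":      true,
	"flux-bootstrapping": true,
}

func isPhase0InFlightStatus(s string) bool { return phase0InFlight[s] }

// isInFlightStatus also counts phase1-watching; it backs the
// release-subdomain conflict check.
func isInFlightStatus(s string) bool {
	return isPhase0InFlightStatus(s) || s == "phase1-watching"
}

// statusAfterRestart rewrites only Phase-0 in-flight statuses to failed;
// phase1-watching and terminal statuses are preserved.
func statusAfterRestart(s string) string {
	if isPhase0InFlightStatus(s) {
		return "failed"
	}
	return s
}

// shouldResumePhase1 gates the resume path: the watcher can only be
// rebuilt when a kubeconfig is available (assumed signature).
func shouldResumePhase1(status string, hasKubeconfig bool) bool {
	return status == "phase1-watching" && hasKubeconfig
}

func main() {
	for _, s := range []string{"tofu-applying", "phase1-watching", "ready"} {
		after := statusAfterRestart(s)
		fmt.Println(s, "->", after, "resume:", shouldResumePhase1(after, true))
	}
}
```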
Helmwatch's existing resume path (jobs_backfill.go) is idempotent —
it just observes HelmRelease.status, never patches/applies, so a fresh
informer over the same kubeconfig produces the same per-component
events the previous Pod was streaming.
Also:
- Added isPhase0InFlightStatus helper to distinguish the two
semantics; isInFlightStatus retained for release-subdomain conflict
check (still includes phase1-watching — won't release a slot mid-
Phase-1).
- Updated TestPodRestart_StuckPhase1WatchingRewrittenToFailed →
TestPodRestart_Phase1WatchingPreservedNotRewrittenToFailed (now
asserts the new correct behavior).
- New test TestPodRestart_Phase1WatchingResumesWithKubeconfig proves
the gating decision (shouldResumePhase1=true) and the preserved
Status value.
- New parameterized test
TestPodRestart_Phase0InFlightStillRewrittenToFailed proves the Phase-0
carve-out still works for all four Phase-0 statuses.
- Updated TestShouldResumePhase1_GatesProperly cases to reflect the
new phase1-watching=resumable / Phase-0=non-resumable split.
Issue: openova-io/openova#830 (Bug 3)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/v1/sovereign/cutover/start handler was returning 502
status-read-failed because catalyst-api ran under the
catalyst-system/default ServiceAccount with no RBAC binding to read/patch
the cutover ConfigMaps + create/watch Jobs in the `catalyst` namespace.
Add a dedicated ServiceAccount + ClusterRole + ClusterRoleBinding so
catalyst-api can drive the cutover state machine. Per
feedback_rbac_create_no_resourcenames.md the `create` verbs are split
into their own Rule WITHOUT resourceNames; combining create with
resourceNames produces a 403 on every POST.
Bumps bp-catalyst-platform 1.3.1 → 1.3.2.
Issue: openova-io/openova#830 (Bug 1)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire all bp-* charts at vcluster creation time so the SME experience
is turnkey from marketplace signup forward. The orchestrator owns a
7-state machine (pending → vcluster_created → bp_charts_installed
→ dns_provisioned → certs_issued → keycloak_clients_provisioned
→ tenant_registered → done) persisted in a flat-file store; each
step is independently idempotent so a Pod restart never strands a
half-provisioned tenant.
HTTP surface:
- POST /api/v1/sme/tenants — create + start pipeline
- GET /api/v1/sme/tenants — list
- GET /api/v1/sme/tenants/{id} — read
- POST /api/v1/sme/tenants/{id}/reconcile — operator-triggered re-run
- DELETE /api/v1/sme/tenants/{id} — inverse pipeline
Per Inviolable Principle 3 the orchestrator NEVER calls kubectl apply.
Per-tenant overlays are committed to the GitOps repo at
clusters/<otech>/sme-tenants/<sme_tenant_id>/ via a Kustomize layout
listing every bp-* HelmRelease (bp-keycloak per-organization, bp-cnpg,
bp-wordpress-tenant, bp-openclaw, bp-stalwart-tenant) plus the per-host
Certificate (BYO mode only — free-subdomain is covered by the otech-wide
wildcard). Flux on the OTECH cluster reconciles within ~1 min.
Per Inviolable Principle 4 every chart version, image tag, OTECH FQDN,
PowerDNS endpoint, and Keycloak SA token is runtime-configurable via
env (CATALYST_SME_BP_*_VER, CATALYST_OTECH_FQDN,
CATALYST_OTECH_INGRESS_IPV4, CATALYST_POWERDNS_URL,
CATALYST_POWERDNS_API_KEY, CATALYST_SME_KC_SA_TOKEN). Empty chart
versions fall back to "*" so Flux pulls the latest matching chart.
DNS provisioning:
- Free-subdomain mode: PowerDNS PATCH writes A records for
console/wordpress/openclaw/mail/keycloak.<sub>.<otech>.
- BYO mode: net.LookupCNAME resolves console.<byo_domain> and
confirms the target ends with the otech FQDN; mismatched CNAMEs
surface as terminal errors so the wizard can show "your CNAME
doesn't point here yet" without a chat-with-support loop.
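The BYO check can be sketched as a lookup plus a suffix test. Both
function names are hypothetical. Note one deliberate difference from the
description: this version matches on a label boundary (the target equals
the otech FQDN or ends with "." + FQDN), which is slightly stricter than
a plain "ends with" and is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// cnamePointsAt reports whether target is otechFQDN itself or a name under it.
// Trailing dots and case are ignored.
func cnamePointsAt(target, otechFQDN string) bool {
	t := strings.ToLower(strings.TrimSuffix(target, "."))
	o := strings.ToLower(strings.TrimSuffix(otechFQDN, "."))
	return o != "" && (t == o || strings.HasSuffix(t, "."+o))
}

// validateBYOCNAME resolves console.<byoDomain> and checks where it points.
// It needs live DNS, so main below only exercises the pure check.
func validateBYOCNAME(byoDomain, otechFQDN string) error {
	target, err := net.LookupCNAME("console." + byoDomain)
	if err != nil {
		return fmt.Errorf("lookup CNAME: %w", err)
	}
	if !cnamePointsAt(target, otechFQDN) {
		return fmt.Errorf("CNAME %q does not point at %q", target, otechFQDN)
	}
	return nil
}

func main() {
	fmt.Println(cnamePointsAt("console.acme.otech.example.", "otech.example"))
	fmt.Println(cnamePointsAt("console.evilotech.example.", "otech.example"))
}
```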
Keycloak SSO clients (catalyst-ui, wordpress, openclaw, stalwart) +
group templates (sme-admin, sme-user) are declared in the
bp-keycloak HelmRelease's bootstrap values block; the orchestrator
verifies them via the SME-vcluster Keycloak admin API and re-runs
the step on transient failures.
Tenant registry insertion (per #802 SME-7) uses the existing
store.TenantRegistry — host → {tenant_id, keycloak_realm_url,
keycloak_client_id, tenant_kind=sme} — so the SPA's
/api/v1/tenant/discover endpoint resolves the new tenant on first
hit without any further orchestration.
The user-create hook (POST /api/v1/sme/users) from #802 already
fires the ADR-0003 3-step orchestration (Keycloak → NewAPI → K8s
Secret); this PR's tenant pipeline lights up the back end #802
needs to scope every per-user call.
Tests:
- 14 handler-level table tests covering happy path (free-subdomain
+ BYO), validation errors, gitops transient retry, registry
population, deletion, render correctness for both modes, chart
version threading, Keycloak client verification, BYO CNAME
resolution.
- 5 store tests for state-machine persistence.
Live test deferred to #805 E2E demo.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Authors the load-bearing investor-demo proof artefact for the
SME-tenant turnkey experience epic (#795). The spec walks the FULL
happy path against the catalyst-ui SPA and emits 1440×900 screenshots
at every assertion so the DoD checklist is satisfied with visual
evidence rather than narrative.
What landed:
- products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear
spec covering Step 1 (marketplace signup) → Step 2 (provisioning) →
Step 3 (SME admin first login + dashboard) → Step 4 (create alice
via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a
(alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to
unblocking issues.
- products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry
of every URL, hostname, fixture user, and UUID the spec uses. Per
feedback_never_hardcode_urls.md, no test inlines a hostname; every
asserted host derives from OTECH_FQDN + SME_SLUG.
- products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape-
faithful page.route mocks for tenant discovery, /api/v1/whoami,
/api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment
endpoints, app placeholders for WordPress/OpenClaw/webmail, and the
/api/v1/sme/billing/ledger surface. Each helper is the seam between
mock-mode (today) and live-mode (post-#804) so the spec opts out of
any single mock by simply not calling that helper.
- .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger
that runs the spec against a freshly-installed dev tree with
VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the
SovereignConsoleLayout's auth gate has a non-null sovereignFQDN.
Uploads the 805-* screenshot evidence as a 30-day artefact.
Run today on a fresh checkout:
cd products/catalyst/bootstrap/ui
VITE_CATALYST_MODE=sovereign \
VITE_SOVEREIGN_FQDN=acme.otech.example \
npm run dev &
PLAYWRIGHT_HOST=http://localhost:5173 \
npx playwright test e2e/sme-demo.spec.ts
Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 /
#798 / #802-followup).
Live-mode follow-up (after #804 lands a fresh otech with the SME
tenant pipeline wired): drop the mock installers from beforeEach and
flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper
calls change.
Per docs/INVIOLABLE-PRINCIPLES.md:
#1 (waterfall): the canonical 6-step contract from #805 is asserted
in this first cut, not staged across cycles.
#2 (never compromise): every step that's deferred is fixme'd with a
blocker link, never silently skipped.
#4 (never hardcode): every URL routes through e2e/lib/config.ts.
Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three operator-admin-gated endpoints for orchestrating the
post-handover Self-Sovereignty Cutover (parent epic #790):
POST /api/v1/sovereign/cutover/start
GET /api/v1/sovereign/cutover/status
GET /api/v1/sovereign/cutover/events (SSE)
The cutover engine consumes the PodSpec ConfigMaps that
bp-self-sovereign-cutover (issue #791, sister chart) installs in
the catalyst namespace, sequences them by `bp.openova.io/cutover-order`,
creates a fresh batchv1.Job per `mode=job` step (8 steps:
gitea-mirror, harbor-projects, harbor-prewarm, registry-pivot,
flux-gitrepository-patch, helmrepository-patches, catalyst-api-env-patch,
egress-block-test), waits for `mode=daemonset-wait` steps to reach
`numberReady == desiredNumberScheduled`, and patches the
`self-sovereign-cutover-status` ConfigMap with per-step timestamps
plus an overall progress counter on every state transition.
Endpoints are idempotent — when the status ConfigMap reports
`cutoverComplete=true` POST /start returns 200 with the durable
snapshot and does NOT re-run. A failed step latches the engine on
the failed step (no auto-continue); operator inspects the failure on
/status and re-runs once the chart values are corrected, at which
point already-successful steps are skipped on resume.
Constraints honoured:
* IaC-first — every cluster mutation goes through the in-cluster
kubernetes.Interface (Create Job / Patch ConfigMap / Get DaemonSet
/ List ConfigMaps). Zero bespoke cloud-API calls.
* Event-driven — Job completion uses the apiserver Watch verb,
not periodic GET polling.
* Credential hygiene — the handler reads no secrets directly;
the chart's PodSpecs reference secrets via envFrom secretRef
so each Job's credentials are mounted fresh.
* Runtime configurable — namespace, status ConfigMap name, per-
step timeouts all read from env per principle #4.
Tests: 14 new unit tests in cutover_test.go covering parse/list/
ordering, end-to-end success run with a fake clientset, idempotency,
fail-halt semantics, no-steps-found, status JSON shape, and
SSE replay-on-connect.
Refs: #790, #791
Closes: #792
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The provision page (AppsPage via useDeploymentEvents) treated any SSE
close without a terminal `event: done` as a "Provisioning failed"
event, hard-coding the message:
> Deployment ended with status=phase1-watching
But `phase1-watching` is an in-flight phase, not a terminal outcome.
The founder repeatedly saw this banner on otech93/otech94 (2026-05-04)
while the canonical /deployments/{id} record showed status=ready and
handoverFiredAt populated — the SSE was simply dropped by the reverse
proxy mid-stream.
This change replaces the SSE-close failure path with a single
re-fetch of /deployments/{id} that switches on the canonical status:
• ready → success banner with handoverURL (existing #764 path)
• failed → real error from snapshot.error, never the stale
"Deployment ended with status=<phase>" copy
• in-flight statuses → keep the streaming spinner up and reconnect SSE
with exponential backoff (max 5 attempts)
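The shipped change lives in the TypeScript hook; the decision table itself is language-neutral and can be sketched in Go as a pure function. The names `Action`, `onStreamClosed`, and `backoff` are hypothetical, and the doubling schedule with a caller-supplied base delay is an assumption: the commit only states "exponential backoff (max 5 attempts)".

```go
package main

import (
	"fmt"
	"time"
)

// Action is what the page should do after the SSE stream closes without a
// terminal `done` event and the canonical record has been re-fetched.
type Action int

const (
	ShowSuccess Action = iota // status=ready: success banner with handoverURL
	ShowError                 // status=failed: surface snapshot.error
	Reconnect                 // any in-flight status: keep spinner, reconnect
)

// onStreamClosed maps the canonical deployment status to an action.
// Anything that is neither ready nor failed is treated as in-flight.
func onStreamClosed(status string) Action {
	switch status {
	case "ready":
		return ShowSuccess
	case "failed":
		return ShowError
	default:
		return Reconnect
	}
}

const maxAttempts = 5

// backoff returns the delay before reconnect attempt n (1-based) and false
// once the attempt budget is exhausted. Doubling per attempt is an assumption.
func backoff(attempt int, base time.Duration) (time.Duration, bool) {
	if attempt < 1 || attempt > maxAttempts {
		return 0, false
	}
	return base << (attempt - 1), true
}

func main() {
	fmt.Println(onStreamClosed("phase1-watching") == Reconnect)
	for n := 1; n <= maxAttempts+1; n++ {
		d, ok := backoff(n, time.Second)
		fmt.Println(n, d, ok)
	}
}
```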
Also surfaces handoverURL recovered from the canonical poll so a
backgrounded tab that lost the SSE during the handover-mint window
still renders the "Open your Sovereign console →" affordance.
Tests added cover all three branches plus the hard regression that
"Deployment ended with status=phase1-watching" can never appear in
streamError under any SSE-close path.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sovereign-side catalyst-api Pod's intra-cluster Keycloak calls (token
mint, EnsureUser) were failing with `dial tcp: lookup
auth.<sov-fqdn> on 10.43.0.10:53: no such host`. The Sovereign's
CoreDNS resolves *.<sov-fqdn> via upstream resolvers — it does NOT
forward to the in-cluster PowerDNS that holds those records. Public
DNS works (PowerDNS authoritative), but Pod-side lookups of
auth.<sov-fqdn> return NXDOMAIN.
Live evidence — otech94 2026-05-04: handover URL returned
`{"error":"keycloak error: ensure user"}` from a DNS lookup failure
inside the catalyst-api Pod.
Fix: bp-keycloak chart now writes the in-cluster Service URL
(http://<release>.<namespace>.svc.cluster.local) into the
catalyst-kc-sa-credentials Secret's `addr` key instead of the public
gateway host (https://auth.<sov-fqdn>). This Secret is consumed
EXCLUSIVELY by the in-cluster catalyst-api Pod via reflector mirror
into catalyst-system; it is NEVER exposed to browsers.
The HTTPRoute hostname (.Values.gateway.host) stays at auth.<sov-fqdn>
for operator browsers — only the Pod's intra-cluster OAuth
client_credentials calls switch to the Service URL.
Catalyst-Zero (contabo) is unaffected: it runs `keycloak-zero`
(separate chart in openova-private), not bp-keycloak.
Changes:
- platform/keycloak/chart/templates/configmap-sovereign-realm.yaml:
Secret's $kcAddr unconditionally uses
http://<release>.<namespace>.svc.cluster.local
- platform/keycloak/chart/Chart.yaml: 1.3.1 → 1.3.2
- clusters/_template/bootstrap-kit/09-keycloak.yaml: chart version 1.3.1 → 1.3.2
- products/catalyst/chart/Chart.yaml: 1.3.0 → 1.3.1 (changelog entry only)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: 1.3.0 → 1.3.1
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the Phase-1 helmwatch terminates with OutcomeReady, catalyst-api
now mints the handover JWT immediately, persists handoverFiredAt +
handoverURL on the deployment record, and emits a typed SSE event
`event: handover-ready, data: { handoverURL, expiresAt }` so the
wizard's provision page can render the "Open your Sovereign console
→" CTA + auto-redirect after 5s. Until this landed, the operator was
stranded on the apps grid in terminal-completed state — the manual
mint endpoint existed but no UI surface ever invoked it.
Server (issue #768):
- provisioner.Result gains HandoverFiredAt + HandoverURL.
- phase1_watch.go: markPhase1Done's Ready transition calls a new
fireHandover helper which mints via h.handoverSigner (RS256 5min
TTL) and emits onto the durable buffer + live SSE channel.
- StreamLogs renders Phase=="handover-ready" events as the typed
SSE shape so a browser using addEventListener('handover-ready')
receives the JSON payload directly. Idempotent under double-
fire (informer reattach scenarios). No-op when handoverSigner
is nil — the existing manual-mint path on the AdminPage button
remains the fallback.
- Lifted HandoverURL + HandoverFiredAt to /deployments/{id} top
level so a GET-replay also drives the redirect when the SSE
event was missed.
UI (issue #764):
- useDeploymentEvents subscribes via EventSource.addEventListener
('handover-ready', …) and surfaces the payload as a new
`handoverReady` return value. Same value populated from the
/events GET-replay snapshot's handoverURL field for the
SSE-missed case.
- AppsPage renders a prominent green "Sovereign is ready" banner
above the apps grid with an "Open your Sovereign console →"
anchor link, fires a global success toast with the same CTA,
starts a 5s redirect timer (window.location.href =
handoverURL), and flips the document title to "✓ Sovereign
ready — <fqdn>" so backgrounded tabs surface completion.
Tests:
- Backend: 6 tests covering auto-fire on Ready, no-fire on
failure, idempotency, no-signer no-op, typed-SSE-shape, and
/deployments/{id} field lifting.
- Frontend: 4 tests covering banner render, FQDN inclusion, 5s
auto-redirect, and document.title flip.
Closes #764, #768.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767)
Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers
couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB):
1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate"
Section with: bootstrap-kit baseline (sum of mandatory-tier component
footprints), selected components delta, control-plane overhead, and a
"Recommended N x <SKU>" line that turns amber when the operator's chosen
worker count is below the rollup. Backed by per-component RAM/CPU floors
in components/wizard/steps/componentFootprints.ts (covered by 12 unit
tests including the otech92 reproduction).
2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at
bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart
9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired
from the canonical flux-system/cloud-credentials.hcloud-token Secret
cloud-init writes (mirrors the velero/harbor object-storage pattern).
Pinned to the control-plane node so the autoscaler never schedules onto
a worker it could itself terminate. 10-minute scale-down idle as the
cost-saving default.
Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA /
KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over
KEDA for cluster scaling, and the bounds + safety story.
Per the issue's MVP scope, this PR ships the blueprint + StepReview
estimate WITHOUT the wizard StepProvider min/max pair refactor or the
tofu node-pool template restructuring. Those are tracked as a follow-up
issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps
Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected-
bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776
because the file existed without a matching entry in the expected DAG, AND
collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort +
slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to
the expected-bootstrap-deps.yaml so the audit passes.
`scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The catalyst-api GET /kubeconfig endpoint now rewrites k3s's hardcoded
`default` cluster / context / user names to the Sovereign's subdomain
(e.g. `otech94`) before serving the YAML, so the operator can run
`k9s --context=otech94` immediately after a single
`kubectl config view --flatten` merge — no more manual sed pipeline
between every Phase-1 Ready and the next k9s session.
Backend (catalyst-api):
- New helpers `rewriteKubeconfigContext`, `preferredContextName`, and
`kubeconfigDownloadFilename` in internal/handler/kubeconfig.go.
- Rewriter uses yaml.v3 Node round-trip so cert-authority-data + token
bytes are preserved verbatim. Idempotent — re-applying to an already
renamed file is a no-op. Refuses non-kubeconfig YAML so a hand-edited
file is never silently corrupted.
- Context name resolution: SovereignSubdomain → first FQDN label →
literal "sovereign" fallback. Sanitised to RFC-1123 lowercase label
charset.
- Content-Disposition filename is now `<subdomain>.yaml` (matches
operator mental model + makes the merge command shell-friendly).
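The context-name fallback chain above can be sketched as follows. `contextName` and `sanitizeLabel` are hypothetical stand-ins, not the helpers in internal/handler/kubeconfig.go. Dropping disallowed characters (rather than replacing them) and the 63-character cap are assumptions about how "sanitised to RFC-1123 lowercase label charset" is realised.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabel lowercases s and keeps only RFC 1123 label characters
// (a-z, 0-9, '-'), then trims leading and trailing hyphens.
func sanitizeLabel(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			b.WriteRune(r)
		}
	}
	out := strings.Trim(b.String(), "-")
	if len(out) > 63 { // RFC 1123 labels are at most 63 characters
		out = strings.Trim(out[:63], "-")
	}
	return out
}

// contextName resolves the kubeconfig context name in the order the commit
// describes: subdomain, then first FQDN label, then the literal "sovereign".
func contextName(subdomain, fqdn string) string {
	if n := sanitizeLabel(subdomain); n != "" {
		return n
	}
	first, _, _ := strings.Cut(fqdn, ".")
	if n := sanitizeLabel(first); n != "" {
		return n
	}
	return "sovereign"
}

func main() {
	fmt.Println(contextName("otech94", "otech94.example.test"))
	fmt.Println(contextName("", "Acme.otech.example"))
	fmt.Println(contextName("", ""))
}
```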
UI (catalyst wizard StepSuccess):
- New "Step 1 / Step 2" cluster-access surface on the success step:
download button (unchanged endpoint) plus a copy-to-clipboard merge
one-liner (`KUBECONFIG=$HOME/.kube/config:$HOME/Downloads/<file> kubectl
config view --flatten > config.tmp && mv config.tmp config && chmod
600 && k9s --context=<name>`).
- Atomic temp-file move instead of a direct redirect to ~/.kube/config
so a Ctrl-C mid-pipe never corrupts the operator's existing config.
- Helpers `sovereignContextName` + `buildKubeconfigMergeCommand`
exported so the test file (and a future Operator-Tools page on the
Sovereign console) can re-use them with no logic drift.
Tests:
- 6 new Go tests covering the rewriter (idempotence, k3s default,
mixed-name file, empty target rejection, malformed YAML rejection,
non-kubeconfig rejection) + GET-handler integration test that
exercises the subdomain → context-name path on a real fixture.
- 3 new vitest tests covering the merge-command UI block + 5 new
helper-pure tests for `sovereignContextName` /
`buildKubeconfigMergeCommand`.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #759 enforces `req.OrgEmail == session.email` in the catalyst-api on
POST /v1/deployments, which means the operator IS the Sovereign owner
by definition. Asking again in the wizard, locking the field, and
explaining the lock with `Admin contact email · locked to your
sign-in` was redundant chrome that made StepDomain feel like a sign-up
form for the second time.
Changes:
- StepDomain: remove the AdminEmailField sub-component entirely (the
"locked to your sign-in" microcopy + Lock icon + read-only input +
isValidAdminEmail validator + the orgEmail clause in
computeNextDisabled). Drop now-unused useSession + Lock + useEffect
imports.
- StepReview: stamp `orgEmail` from `session.email` at submit time
(with the wizard store as a fallback for the brief window between
PIN-verify and the next session refetch). Rename the review-page
row from "Admin email" to "Sovereign owner" to mirror the new UI
vocabulary; the row now reads `session.email` so the operator sees
exactly which identity the Sovereign will be owned by.
- StepDomain.test: keep the fresh-QueryClient-per-test wrapper but
drop the seedSessionEmail plumbing (no longer needed). Add three
regression tests confirming the field, the microcopy, and the
orgEmail-gate on Continue are all gone.
- WizardLayout / WizardPage / StepOrg / StepReview: update doc
comments that referenced the now-removed admin-email field.
Per docs/INVIOLABLE-PRINCIPLES.md #1 (never trust the client) the
load-bearing fix is still on the server (PR #759). This PR removes
the redundant client-side defense + the noisy chrome that explained it.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `/sovereign/decommission/<id>` page used to render a static
"Decommissioning…" button label with no progress signal — operators
thought the page was stuck while `tofu destroy` and the Hetzner orphan
purge were running for 30+ minutes.
The wipe handler in `api/internal/handler/wipe.go` ALREADY emits a
per-resource SSE event stream on the same `dep.eventsCh` channel that
provisioning uses (surfaced at `GET /api/v1/deployments/{id}/logs`).
Every "tofu destroy" tick, every Hetzner DELETE response, every S3
bucket purge step, every PDM release call, every local-state cleanup
is already a discrete event with `phase="wipe"`. The UI just wasn't
subscribing.
Fix is purely UI:
• DecommissionPage subscribes to the same SSE via `useDeploymentEvents`
once the wipe POST is in flight (`disableStream: false`), flattens
every recorded event into `LogLine`, and feeds the unified
`LogPane` (the same component `/provision/<id>` JobDetail uses for
per-job logs).
• Streaming layout replaces the form once submit fires: STREAMING
chip, scrolling exec-log, full-screen toggle, search filter — all
threaded through the existing LogPane primitives.
• On wipe completion: COMPLETE chip + green checkmark + verbatim
Hetzner-sweep summary block ("servers: 0 removed, load_balancers:
0 removed, …" — the founder DoD is "0 of every kind on the
Hetzner side") + 10s countdown back to /wizard. Operator can scroll
back through every deletion at any time.
• No backend change — the SSE plumbing is already there.
Tests: 7/7 pass (5 original + 2 new for #766). Per #1 (waterfall —
target shape on first commit) the streaming view ships with full
scrollback, search, full-screen, summary, and countdown in one PR.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server-side enforcement is the load-bearing fix per docs/INVIOLABLE-PRINCIPLES.md
#1 (never trust the client). Until this lands a signed-in operator could POST
a deployment whose req.OrgEmail belonged to some other identity — the catalyst-
api accepted the body verbatim and stamped the wrong identity onto the
Sovereign-admin / Catalyst-Organization owner.
Server changes (deployments.go):
- CreateDeployment now reads claims from context (auth.RequireSession populates)
with X-User-Email as the off-prod fallback. When a session is present,
req.OrgEmail MUST EqualFold session.email — mismatch returns 403.
OwnerEmail is stamped from the session-derived value, not request body —
a future client-side bug cannot poison the durable owner field.
- ListDeployments (issue #747) tightened: when a session is present AND a
?owner= query param is also supplied AND ?owner != session.email, return
200 + empty list rather than silently collapsing to session-only rows.
Mirrors the issue #689 404-not-403 rule on /deployments/{id} — the
response shape MUST NOT differentiate "exists but not yours" from
"doesn't exist". Now also reads ClaimsFromContext as the canonical
session source (X-User-Email fallback).
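A minimal sketch of the two server-side rules above, written as pure functions so the policy is visible without the HTTP plumbing. `ownerFor`, `listOwner`, and `ErrOwnerMismatch` are hypothetical names. The pass-through behaviour when no session is present is an assumption, as is using a case-insensitive comparison for the `?owner=` check.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrOwnerMismatch is what a handler would translate into HTTP 403.
var ErrOwnerMismatch = errors.New("orgEmail does not match the session email")

// ownerFor returns the email to stamp as the durable owner. With a session,
// the request value must match case-insensitively and the session value is
// the one stored, never the request body.
func ownerFor(sessionEmail, reqOrgEmail string) (string, error) {
	if sessionEmail == "" {
		// Assumption: no session (off-prod), so the request value passes through.
		return reqOrgEmail, nil
	}
	if !strings.EqualFold(reqOrgEmail, sessionEmail) {
		return "", ErrOwnerMismatch
	}
	return sessionEmail, nil
}

// listOwner returns the owner to filter by, and whether the response must be
// an empty list (cross-tenant ?owner= with a session present).
func listOwner(sessionEmail, ownerParam string) (owner string, forceEmpty bool) {
	if sessionEmail == "" {
		return ownerParam, false
	}
	if ownerParam != "" && !strings.EqualFold(ownerParam, sessionEmail) {
		return "", true
	}
	return sessionEmail, false
}

func main() {
	owner, err := ownerFor("Ops@Example.com", "ops@example.com")
	fmt.Println(owner, err)
	_, err = ownerFor("ops@example.com", "other@example.com")
	fmt.Println(errors.Is(err, ErrOwnerMismatch))
	fmt.Println(listOwner("ops@example.com", "other@example.com"))
}
```

Returning an empty list rather than an error for the cross-tenant case keeps the response shape identical to "no deployments", which is the non-differentiation property the commit calls out.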
Tests:
- 4 new tests in deployments_test.go (all pass):
- TestCreateDeployment_RejectsMismatchedOrgEmail (403 + no PDM Reserve
+ no row stored)
- TestCreateDeployment_AcceptsMatchingOrgEmail (case-insensitive match,
OwnerEmail derived from session not request)
- TestListDeployments_FiltersByOwnerSession (cross-tenant row hidden)
- TestListDeployments_OwnerQueryParam (cross-tenant ?owner returns
empty list, never 403)
- deployments_list_test.go: existing TestListDeployments_FilterBySessionEmail
rewritten to match the tightened cross-tenant policy (empty list, not
silent override). New TestListDeployments_CrossTenantOwnerQueryReturnsEmpty
added to assert the explicit boundary.
UI changes:
- ui/src/pages/wizard/steps/StepDomain.tsx — defense-in-depth UX:
AdminEmailField pre-fills orgEmail from useSession() and renders
read-only with a Lock icon and tooltip "Sovereigns are owned by the
email you signed in with." A useEffect mirrors session.email into
the wizard store so a stale value from a previous sign-in cannot
survive into the current session.
- ui/src/pages/wizard/steps/StepDomain.test.tsx — wraps every render
in a fresh QueryClientProvider (AdminEmailField now consumes
useSession via TanStack Query). All 15 existing UI tests pass.
Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "deployment ID truncated by one char" bug recurred multiple times because
every UI code path treated the id as a free-form `string`. Any new error
template, toast, or URL builder could (and did) introduce another truncation.
This change makes the truncation impossible at compile time:
- Adds `shared/types/deployment.ts` with a branded `DeploymentID` type
(`string & { readonly __brand: 'DeploymentID' }`) plus
`parseDeploymentID()` / `isDeploymentID()` validators. The regex
enforces the canonical 16 lowercase hex chars catalyst-api emits.
- Updates `entities/deployment/model.ts` to type `WizardState.deploymentId`
as `DeploymentID | null`. Re-exports the brand from the model so
existing imports keep working.
- Updates `entities/deployment/store.ts` to route `setDeploymentId()` and
the persistence `merge()` path through `parseDeploymentID()`. A bad id
in localStorage gets wiped rather than rendered as a misleading
"<truncated>-is-unknown-to-backend" error.
- Updates `pages/sovereign/AppsPage.tsx` to validate the route param at
the page boundary via `isDeploymentID()`, and emits a dedicated
malformed-id notification when the URL value isn't 16 lowercase hex
chars (so the operator sees the FULL invalid value, not a hidden
off-by-one).
- Adds 25 unit tests covering the parser (valid/invalid lengths,
uppercase, non-string types, error-message hygiene) plus the
`isDeploymentID` type guard.
- Adds an integration test (`ProvisionPage.sse-url.test.tsx`) that
mounts the page with a 16-char hex route param, installs a recording
EventSource shim, and asserts the constructed URL is exactly
`${API_BASE}/v1/deployments/<FULL_16_CHAR_ID>/logs` — including the
exact `eeb34ecd1414a505` id from issue #749's live evidence.
- Updates `StepSuccess.test.tsx` fixture to a real 16-char hex id so
the wizard store accepts it through the new typed setter.
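The shipped code is a TypeScript branded type; the validation rule it enforces can be mirrored in Go as a rough illustration. `DeploymentID` and `ParseDeploymentID` below are hypothetical Go analogues, not part of the change. The pattern is the "16 lowercase hex chars" rule stated above, and the sample id is the one quoted from issue #749.

```go
package main

import (
	"fmt"
	"regexp"
)

// DeploymentID is a distinct named type: a plain string variable cannot be
// assigned to it without an explicit conversion, which nudges callers
// toward ParseDeploymentID.
type DeploymentID string

var deploymentIDPattern = regexp.MustCompile(`^[a-f0-9]{16}$`)

// ParseDeploymentID validates raw and reports the FULL offending value on
// failure, so a truncated id is visible rather than hidden.
func ParseDeploymentID(raw string) (DeploymentID, error) {
	if !deploymentIDPattern.MatchString(raw) {
		return "", fmt.Errorf("malformed deployment id %q: want 16 lowercase hex chars", raw)
	}
	return DeploymentID(raw), nil
}

func main() {
	id, err := ParseDeploymentID("eeb34ecd1414a505")
	fmt.Println(id, err)
	_, err = ParseDeploymentID("eeb34ecd1414a50") // one char short
	fmt.Println(err)
}
```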
Audit findings — search across the entire UI src for `slice(0, 15..19)`,
`substring(0, 15..19)`, and `[a-f0-9]{15}` patterns turned up NO direct
truncation site in production code. The root cause of the 2026-05-04
incident was that every consumer trusted a raw `string` route param
without validation, so a URL with a manually-truncated id fed straight
into both the SSE URL builder and the error message verbatim. The
branded-type contract is now the structural fix: any future code that
tries to assign an unvalidated string to a `DeploymentID` field fails
compilation, and any URL with the wrong shape surfaces a clear
malformed-id banner instead of "deployment <wrong> is unknown".
Closes #749, #754.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A signed-in operator who refreshed /sovereign/wizard during a 15-minute
provisioning run lost the progress page and landed on Step 1 of an empty
form (caught live with otech90 on 2026-05-04). Wires the wizard route
to call the new GET /api/v1/deployments?owner=<email> endpoint and
redirect to /sovereign/provision/<id> when an in-flight deployment is
found.
Backend
- Add ListDeployments handler returning the slim shape (id, status,
sovereignFQDN, region, startedAt, finishedAt, ownerEmail, adoptedAt,
error). Filtered server-side by the X-User-Email header injected by
RequireSession; ?owner= is a client hint that is silently overridden
when the session header is set so a signed-in attacker cannot list
someone else's rows. Adopted deployments are excluded — once the
customer's Sovereign owns the cluster, the wizard redirect must not
pull the operator back to Catalyst-Zero.
- Register GET /api/v1/deployments inside the RequireSession group.
- 5 new handler tests covering session-override, adopted exclusion,
legacy-row exclusion, no-session passthrough, and ?owner= filtering.
Frontend
- New useInflightDeployment hook (TanStack Query, 30s stale time)
returning {inflight, completed, all} buckets. inflight matches
pending/provisioning/tofu-applying/tofu-plan/tofu-apply/
flux-bootstrapping/cloud-init-waiting/phase1-watching plus
ready-but-not-adopted. Picks the most-recent by startedAt.
- WizardPage redirect effect: when session.signedIn && inflight,
navigate replace=true to /provision/<id> and render null while the
redirect resolves. When the operator has only completed/wiped/failed
rows, render a banner with a "View your previous deployments" link.
- New DeploymentsList page at /deployments (browser path
/sovereign/deployments behind the Traefik strip-prefix). Single table:
FQDN, status, started, finished, region. Each FQDN links back to
/provision/<id>.
- 6 hook unit tests covering most-recent picking, ready-not-adopted,
adopted exclusion (defense-in-depth), 401 graceful degrade, and
enabled=false short-circuit.
Tests
- 5 backend handler tests pass (TestListDeployments_*)
- 6 frontend hook tests pass (useInflightDeployment.test.tsx)
- TS typecheck + Vite build clean
- Pre-existing TestAuthHandover_HappyPath panic + StepComponents
catalog-data failures verified unrelated (fail on bare main)
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /sovereign/provision/<id> page rendered a bespoke "Operator /
Provisioning session" card in the bottom-left of its Sidebar. Two
problems:
1. Identity placement was inconsistent with the rest of the app
(wizard, Sovereign-console, marketplace all place identity
top-right). The provisioning surface was the lone outlier.
2. The label "Operator" was hard-coded and never reflected the
signed-in user's email — it ignored useSession() entirely.
This drops the bespoke card from Sidebar.tsx and renders the canonical
<ProfileMenu /> (the same widget WizardLayout uses) in PortalShell's
top-right slot. ProfileMenu reads useSession() so anonymous visitors
get a [Sign in] button and signed-in operators get an email-initial
avatar that opens a "Signed in as <email>" + "Sign out" dropdown.
Because PortalShell wraps every /sovereign/provision/* route (apps,
jobs, dashboard, cloud, users, settings), this fix touches all of
them in one place.
Test updates:
- Sidebar.test.tsx now asserts the bespoke widget is GONE rather
than asserting it renders, locking in the regression guard.
No backend / API surface changes.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request
Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.
OpenTofu's variables.tf carries a validation block:
    validation {
      condition = alltrue([
        for r in var.regions :
        contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
      ])
    }
The `for r in var.regions` iteration fails on null with:
    Error: Iteration over null value
      on variables.tf line 217, in variable "regions":
The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.
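A minimal reproduction of the nil-versus-empty distinction, assuming a simplified `Region` type with only the `provider` field the validator reads. The real struct and the real `coalesceRegions` signature may differ; the one below is a sketch of the same idea.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Region is a simplified stand-in; only `provider` is known from the
// validation block quoted above.
type Region struct {
	Provider string `json:"provider"`
}

// coalesceRegions turns a nil slice into an empty, non-nil slice so that
// encoding/json emits [] instead of null. Non-nil input is returned as-is.
func coalesceRegions(in []Region) []Region {
	if in == nil {
		return []Region{}
	}
	return in
}

func main() {
	var none []Region // nil: the request carried no per-region overrides

	raw, _ := json.Marshal(map[string]any{"regions": none})
	fmt.Println(string(raw)) // {"regions":null}

	fixed, _ := json.Marshal(map[string]any{"regions": coalesceRegions(none)})
	fmt.Println(string(fixed)) // {"regions":[]}
}
```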
Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
serialises as JSON `[]`, never `null`, when the request has no
per-region overrides.
Builds on PR #742.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)
Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the
cpx21 CP default from PR #741 fell apart at apply time —
    Error: Server Type "cpx21" is unavailable in "fsn1" and can no
    longer be ordered
Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog
(`/v1/server_types`) but are NOT in the per-DC orderable list
(`available_for_migration` on `/v1/datacenters`) for any EU DC
(fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on
for new Sovereigns in those regions.
Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
• cpx11 (2 vCPU / 2 GB) — too small for the CP working set
• cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
• cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
• cpx42, cpx52, cpx62 — bigger and more expensive
New default per Sovereign:
| Component | Old | New | Savings |
|-----------------|-----------------|------------------|---------|
| Control plane | CPX32 (€16.49) | CPX22 (€9.49) | €7.00 |
| Worker × 2 | CPX32 × 2 (€33) | CPX32 × 2 (€33) | €0 |
| TOTAL | €49.47/mo | €42.47/mo | 14% |
The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo)
assumed those SKUs were orderable. They aren't in EU DCs. The 14%
saving from cpx22 CP is the largest concrete optimisation that
ships TODAY without compromising the multi-node horizontal-scale
agreement (issue #733): still 1 CP + 2 workers from day one.
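The totals in the table can be re-derived from the two per-node prices quoted above (CPX22 at €9.49/mo, CPX32 at €16.49/mo, both fsn1). The sketch works in euro cents to avoid floating-point drift; `monthlyTotalCents` is a hypothetical helper.

```go
package main

import "fmt"

// monthlyTotalCents returns the monthly cost in euro cents of one
// control-plane node plus `workers` worker nodes.
func monthlyTotalCents(controlPlane, worker, workers int) int {
	return controlPlane + worker*workers
}

func main() {
	const cpx22, cpx32 = 949, 1649 // euro cents per month, fsn1, from the list above

	oldTotal := monthlyTotalCents(cpx32, cpx32, 2) // CPX32 CP + 2 x CPX32 workers
	newTotal := monthlyTotalCents(cpx22, cpx32, 2) // CPX22 CP + 2 x CPX32 workers
	saving := oldTotal - newTotal

	fmt.Printf("old: EUR %d.%02d/mo\n", oldTotal/100, oldTotal%100)
	fmt.Printf("new: EUR %d.%02d/mo\n", newTotal/100, newTotal%100)
	fmt.Printf("saving: EUR %d.%02d/mo (%.0f%%)\n",
		saving/100, saving%100, 100*float64(saving)/float64(oldTotal))
}
```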
Files changed:
- infra/hetzner/variables.tf
control_plane_size default cpx21 → cpx22
worker_size default cpx31 → cpx32 (back to the prior orderable choice)
- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing
(€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49).
Mark both as "listed but NOT orderable in EU DCs" so the wizard
surfaces the constraint instead of letting operators pick a
non-orderable SKU.
Move recommended:true from CPX21 → CPX22.
defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').
- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
Comment refresh — names the new orderable defaults.
- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].
Builds on PR #741 (issue #740 chain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>