338 Commits

Author SHA1 Message Date
hatiyildiz
e7caa0696f feat(catalyst-wizard): SSH keypair UX in StepCredentials — auto-generate + paste-existing
Closes #160 ([I] ux: SSH keypair UX in wizard).

Backend (Go):
  - Add POST /api/v1/sshkey/generate handler at
    products/catalyst/bootstrap/api/internal/handler/sshkey.go.
  - Generates an Ed25519 keypair via crypto/ed25519 + rand.Reader,
    encodes the public half to OpenSSH authorized_keys wire format
    and the private half to PEM-armoured openssh-key-v1 (no passphrase),
    returns SHA256 fingerprint matching `ssh-keygen -lf`.
  - Logs ONLY the fingerprint per credential-hygiene principle #10 —
    private key never written to disk; comment derived from caller-
    supplied FQDN, never hardcoded.
  - Wire into chi router in cmd/api/main.go.
  - sshkey_test.go covers response shape, authorized_keys format, PEM
    decode + openssh-key-v1 magic header, fingerprint length/format,
    two-call uniqueness, default comment fallback.

Frontend (React + Zustand):
  - Extend StepCredentials with an SSHKeySection — two-mode UX:
      Mode A (Generate keypair) — POST /api/v1/sshkey/generate, capture
        public key + fingerprint into store, trigger Blob-URL download
        of the private key as `<fqdn-or-catalyst>.pem`, show one-time
        warning banner ("Private key shown once. Save it now or you
        lose access.") with re-download + re-generate buttons.
      Mode B (Paste existing public key) — textarea, RFC validation
        regex matching infra/hetzner/variables.tf (ssh-ed25519 / ssh-rsa
        / ecdsa-sha2-nistp256/384/521), inline error on malformed input.
  - Wizard's Continue button is now gated on isValidSSHPublicKey(store.sshPublicKey).
  - Wire store.sshPublicKey into the StepReview deployment payload —
    replaces the previous `sshPublicKey: ''` TODO.
  - Store extension: sshPublicKey, sshKeyGeneratedThisSession,
    sshPrivateKeyOnce, sshFingerprint + setSshPublicKey,
    setSshGenerated, clearSshPrivateKey actions; partialize() strips
    the private blob + session flag from localStorage so a fresh
    tab always re-prompts (credential hygiene #10).
  - Vitest (StepCredentials.test.tsx) covers both modes:
    request shape, store population, download trigger (URL.createObjectURL
    + anchor.click spies), one-time-warning render, HTTP-500 path leaves
    store empty, paste validation accepts/rejects per algorithm whitelist.
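
The algorithm whitelist used by the paste-mode validation can be sketched as a shape check. This is a Go rendering for illustration only: the actual frontend helper is TypeScript, the exact regex in infra/hetzner/variables.tf is not shown in this message, and the pattern below is an assumption that accepts the five listed algorithms followed by a base64 body and an optional comment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sshPublicKeyRe is a hypothetical pattern: algorithm, one space, base64 body,
// optional comment. It checks shape only and does not decode the key material.
var sshPublicKeyRe = regexp.MustCompile(
	`^(ssh-ed25519|ssh-rsa|ecdsa-sha2-nistp(256|384|521)) [A-Za-z0-9+/]+={0,2}( \S.*)?$`)

// isValidSSHPublicKey reports whether s looks like a whitelisted OpenSSH public key line.
func isValidSSHPublicKey(s string) bool {
	return sshPublicKeyRe.MatchString(strings.TrimSpace(s))
}

func main() {
	for _, candidate := range []string{
		"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIexample user@host",
		"ecdsa-sha2-nistp384 AAAAE2VjZHNh",
		"ssh-dss AAAAB3NzaC1kc3M",
		"",
	} {
		fmt.Printf("%q valid=%v\n", candidate, isValidSSHPublicKey(candidate))
	}
}
```

Because the check is purely syntactic, a key with a well-formed prefix but corrupt base64 payload would still pass; a stricter validator would decode the body and compare the embedded algorithm name.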

OpenTofu integration:
  - provisioner.Request.SSHPublicKey was already declared from group J;
    StepReview now feeds it the captured public half so `tofu apply`
    receives a non-empty value and the variables.tf regex validator
    accepts the run.

Tests:
  - npm run typecheck PASS (zero errors).
  - npm run test PASS (27/27 tests across 2 files).
  - npm run build PASS (vite production bundle 862 kB).
  - Go unit tests run in CI (no Go toolchain on the build host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:00:37 +02:00
hatiyildiz
2ee2e8d24d fix(catalyst-platform): unblock cutover Kustomization — revert Helm templating
919514c added Helm template expressions (`{{ .Values.* }}`) into
products/catalyst/chart/templates/ingress.yaml + ui-deployment.yaml +
ui-configmap.yaml + values.yaml. These files are consumed by the
catalyst-platform Flux Kustomization on Catalyst-Zero (Contabo), which
goes through kustomize-controller — not helm-controller — so the
template expressions are NOT rendered.

Failure observed in production:
  catalyst-platform kustomize build failed: updating name reference in
  spec/ingressClassName field of Ingress.networking/console-sovereign:
  path config error; no name field in node

The ingressClassName template expression broke kustomize's name-reference
resolver. The ConfigMap with Helm expressions in nginx config strings
would have left nginx unable to resolve upstreams at runtime.

Surgical revert:
- ingress.yaml, ui-deployment.yaml: back to pre-919514c plain YAML
- ui-configmap.yaml, values.yaml: deleted (had no plain-YAML predecessor)

The values-driven /sovereign nginx routing remains the right target
state — but the path forward is to convert catalyst-platform to a Flux
HelmRelease (helm-controller renders templates), not to mix Helm
templates into a kustomize-applied directory. Tracking ticket follows.
2026-04-28 22:48:02 +02:00
hatiyildiz
2323e74048 merge: Group L — Playwright UI smoke tests (#142, #143, #144) 2026-04-28 19:54:28 +02:00
hatiyildiz
55b8a18b32 test(e2e): #142, #143, #144 — Playwright UI smoke tests for sovereign wizard, admin vouchers, marketplace bp-<x> grid
Group L closes the three UI smoke-test gaps the verify-sweep flagged:

  #142 sovereign wizard       — tests/e2e/playwright/tests/sovereign-wizard.spec.ts
  #143 admin voucher UI       — tests/e2e/playwright/tests/admin-vouchers.spec.ts
  #144 unified bp-<x> grid    — tests/e2e/playwright/tests/marketplace-cards.spec.ts

Tests target the actual shipped UI shape (Pass 105+):

* Wizard step model is StepOrg → StepTopology → StepProvider →
  StepCredentials → StepComponents → StepReview, not the original ticket's
  StepDomain/StepHetzner draft from before the unified-Blueprints refactor.
* Admin voucher model uses an `active` toggle, not ISSUED/REVOKED status.
* "Marketplace card grid" = the Catalyst wizard's StepComponents (bp-<x>
  Blueprints), NOT the SME marketplace at core/marketplace (which is for
  SaaS Apps). Today every Blueprint is `visibility: unlisted`, so the test
  asserts the data layer (catalog.generated.ts) plus the documented
  EmptyState; once `visibility: listed` lands, the third assertion
  auto-extends to the rendered card grid.

Per principle #4 ("never hardcode"), all URLs come from env vars with
sensible local-dev defaults. Per principle #1 ("never speculate"), tests
self-skip with explicit reasons when their target app isn't reachable
instead of failing noisily.

CI: .github/workflows/playwright-smoke.yaml boots the Catalyst UI in the
background and runs the suite on PRs touching UI sources or tests; admin
and marketplace specs self-skip in that workflow because spinning up all
three Astro apps + catalyst-api + Postgres is the full E2E pipeline's
job, not this smoke.

Local run (Catalyst UI on :4399, admin on :4398): 5 passed, 2 skipped
(skip reasons: marketplace #3 needs StepComponents reachable past
required-field gating; admin #2 needs ADMIN_TEST_COOKIE for an
authenticated session).

Refs: #142, #143, #144
2026-04-28 19:54:04 +02:00
hatiyildiz
919514ca78 merge: /sovereign nginx routing — values-driven /sovereign + /api/v1 (a35da92) 2026-04-28 19:50:39 +02:00
hatiyildiz
a35da929f1 feat(sovereign-route): values-driven /sovereign + /api/v1 routing
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the catalyst-ui
nginx config now flows from values.yaml at chart-render time:

- routing.basePath (/sovereign) — also drives ingress strip-prefix
- routing.catalystApi.serviceDNS — in-cluster reverse-proxy target
- routing.catalystApi.port — upstream port
- dns.resolverIP — CoreDNS for proxy-time resolution (avoids stale
  ClusterIP after catalyst-api restarts)
- ingress.host / ingress.priority / ingress.className

Files:
- products/catalyst/chart/values.yaml — new, documents every default
- products/catalyst/chart/templates/ui-configmap.yaml — new, nginx
  reverse-proxies /api/* to catalyst-api Service DNS
- products/catalyst/chart/templates/ui-deployment.yaml — mounts the
  ConfigMap at /etc/nginx/conf.d/default.conf
- products/catalyst/chart/templates/ingress.yaml — values-driven host
  + path + priority + class
- tests/e2e/sovereign-routing/* — Playwright smoke for the routing

Captured from stalled agent /tmp/agent-sovereign-route-finish — agent
stream watchdog timed out after the work was authored but before commit.
2026-04-28 19:48:40 +02:00
hatiyildiz
8886eff708 Merge branch 'feat/group-g-dns-finish-v3'
Group G DNS finish (v3): #110 (Dynadot multi-domain table-driven tests),
#112 (catalyst-dns httptest-mocked Dynadot coverage), #113 (cert-manager
LE DNS-01 + HTTP-01 ClusterIssuer templates with operator runbook for
the cert-manager-dynadot-webhook gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:45:35 +02:00
hatiyildiz
dd8b16f0c5 merge: feat/group-i-success-state-126-v2 — Group I StepSuccess (#126)
Adds wizard StepSuccess terminal step with sovereign console URL,
first-time admin login flow, kubeconfig download, voucher CTA, SSE log
tail, and docs link. All URLs derived from wizard state — never
hardcoded. 16 / 16 vitest tests green; tsc -b --noEmit clean.

Closes #126.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:44:59 +02:00
hatiyildiz
97e942e0bc feat(cert-manager): #113 — Let's Encrypt DNS-01 + HTTP-01 ClusterIssuers
Adds platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml
with two ClusterIssuers, both Catalyst-curated, rendered conditionally
from values.yaml:

- letsencrypt-dns01-prod (TARGET STATE, default disabled) — ACME DNS-01
  via the cert-manager webhook solver, pointing at a future
  `cert-manager-dynadot-webhook` Catalyst binary that will implement the
  webhook.acme.cert-manager.io/v1alpha1 contract against the existing
  internal/dynadot/ package. Shipping the issuer template ahead of the
  webhook so cluster overlays only need a values flip + secret ref —
  no template edits — once the webhook lands.

- letsencrypt-http01-prod (INTERIM, default enabled) — ACME HTTP-01
  via the cilium ingress class. Issues certs for the explicit hostnames
  (console, gitea, harbor, admin, api) but NOT for wildcards; the
  canonical *.<sub>.<domain> record needs DNS-01.

Header comment explains the gap: the Catalyst external-dns webhook
(products/catalyst/bootstrap/api/cmd/external-dns-dynadot-webhook/)
implements a DIFFERENT RPC contract (records.list/add/delete) than what
cert-manager DNS-01 expects (Present/CleanUp on ChallengeRequest CRD),
so it cannot be reused; a dedicated cmd/cert-manager-dynadot-webhook/
must be built. Operator runbook for cutover is in the file header.

values.yaml gains a `certManager.issuers.{email,acmeServer,dns01,http01}`
section so all knobs are runtime-configurable per
docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode); cluster overlays in
clusters/<sovereign>/ can flip dns01.enabled via the bp-catalyst-platform
umbrella's values without rebuilding the Blueprint OCI artifact.

blueprint.yaml gains a spec.outputs section advertising:
- issuerName: letsencrypt-http01-prod (default)
- wildcardIssuerName: letsencrypt-dns01-prod (target state)
- issuerKind: ClusterIssuer

so dependent Blueprints (cilium-gateway, harbor, gitea) can consume the
issuer name without hardcoding it.

Closes #113.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:44:56 +02:00
hatiyildiz
7af848e2bd feat(catalyst-bootstrap): #126 add wizard StepSuccess terminal state
Group I — closes the success-state UX gap after the 11-phase
bootstrap kit finishes green:

  - Primary CTA opens https://console.<sovereign-fqdn>/ — domain
    derived from wizard state (resolveSovereignDomain) or from the
    catalyst-api `done` event payload (lastProvisionResult.consoleURL).
    No hardcoded URLs (Inviolable-Principle #4).
  - First-time admin login: username = admin@<sovereign-fqdn>; the
    "Mint one-time login URL" button calls
    GET /api/v1/deployments/<id>/admin-login-url and falls back to a
    documented Keycloak realm-master + reset-password flow when the
    endpoint returns 404/501 (RUNBOOK-PROVISIONING.md §First login).
  - kubeconfig download fetches /api/v1/deployments/<id>/kubeconfig,
    falls back to "Coming soon — fetch via SSH" + runbook link when
    the endpoint isn't implemented.
  - Voucher-issuance shortcut (secondary CTA) →
    https://admin.<sovereign-fqdn>/billing/vouchers/new
  - SSE final-state log tail (last 20 lines) collapsed/expandable.
  - Sovereign /docs link as second tile next to voucher CTA.

Wires StepSuccess as the 7th step in WizardPage.STEPS so the wizard's
existing currentStep navigation can land on it once provisioning
completes (lastProvisionResult populated by StepProvisioning's `done`
SSE event handler — to be wired in a separate ticket).

Test coverage (vitest + @testing-library/react, 16 cases): every CTA's
href is asserted against a fixture FQDN, including a BYO domain switch
to prove no hardcoded hostname leaks. Adds devDeps vitest, jsdom,
@testing-library/react, @testing-library/jest-dom, plus npm scripts
test/test:watch/typecheck.

Files:
  products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepSuccess.tsx (new)
  products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepSuccess.test.tsx (new)
  products/catalyst/bootstrap/ui/src/pages/wizard/WizardPage.tsx
  products/catalyst/bootstrap/ui/vite.config.ts (vitest config)
  products/catalyst/bootstrap/ui/package.json (test scripts + devDeps)

Verification:
  npm run typecheck  → green
  npm run test       → 16 / 16 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:44:31 +02:00
hatiyildiz
77a3014f74 fix(workflow): blueprint-release supports products/ tree on workflow_dispatch
Adds a `tree` input (default `platform`) so manual triggers can build
umbrella charts under products/ — e.g.
  gh workflow run blueprint-release.yaml -f blueprint=catalyst -f tree=products
will dispatch a build of products/catalyst/chart.

Push-triggered builds already detect both platform/* and products/* via
the diff filter; this only fixes the workflow_dispatch path which was
hardcoded to platform/.
2026-04-28 19:43:47 +02:00
hatiyildiz
8643b0fb9e Merge branch 'feat/bp-external-dns-leaf-chart'
Authors the bp-external-dns leaf chart so the umbrella bp-catalyst-platform's
dependency block (11 leaves) resolves — closes the Group F gap that surfaced
in workflow run 25068433765.
2026-04-28 19:42:30 +02:00
hatiyildiz
c07e0ad1ee feat(external-dns): #109 — author bp-external-dns leaf chart for OCI publish
The bp-catalyst-platform umbrella (issue #104) declares a dependency on
bp-external-dns:1.0.0 — but the chart didn't exist; only README + Dynadot
multi-domain policy lived under platform/external-dns/. Without this leaf
the umbrella's `helm dependency build` fails (verified in run 25068433765).

This commit authors the minimal target-state leaf:
- Chart.yaml: name=bp-external-dns, version=1.0.0
- values.yaml: catalystBlueprint.upstream metadata (external-dns 1.15.0
  from kubernetes-sigs/external-dns Helm repo) + Catalyst-curated values
  overlay (sources, txtOwnerId, ServiceMonitor, RBAC, resources)

Per BLUEPRINT-AUTHORING.md §3, leaf charts are pure values-overlay wrappers:
no templates dir, just Chart.yaml + values.yaml with the catalystBlueprint
metadata block read by the bootstrap-kit installer at helm-install time.

Per-Sovereign provider/zone/credential overrides are overlaid by the
Crossplane Composition that materializes the HelmRelease — keeping this
chart provider-agnostic (no hardcoded Cloudflare/Dynadot/Hetzner choice
per INVIOLABLE-PRINCIPLES.md §4).

After this lands, blueprint-release.yaml will publish
ghcr.io/openova-io/bp-external-dns:1.0.0 and the next umbrella push will
resolve all 11 leaf deps successfully.
2026-04-28 19:42:23 +02:00
hatiyildiz
dc3a2b306e test(catalyst-dns): #112 — provisioning DNS write coverage with mocked Dynadot
Refactors catalyst-dns/main.go to expose a testable run() core (validate +
AddSovereignRecords loop) so the binary can be exercised against an
httptest.Server without touching api.dynadot.com.

Adds main_test.go with six scenarios:

- TestRun_WritesSixCanonicalARecords — the headline assertion: a single
  invocation produces exactly six POSTs against the mocked Dynadot
  endpoint, one per canonical subdomain (*.<sub>, console, gitea, harbor,
  admin, api), all A records pointing at the LB IP, all carrying
  add_dns_to_current_setting=yes.
- TestRun_NeverWipesZone — strict regression guard for the cardinal
  rule from feedback_dynadot_dns.md (a single missing flag wipes the
  zone). Asserted on every iteration of the loop.
- TestRun_ValidationErrors — table-driven coverage of every input
  contract failure (missing key/secret/domain/subdomain/lb-ip,
  unmanaged-domain rejection); zero Dynadot calls happen on validation
  failure so the OpenTofu module gets a deterministic fast-fail.
- TestRun_FailsFastOnDynadotError — when Dynadot rejects the first
  record, run() returns immediately rather than leaving a partial zone.
- TestRun_NeverHitsRealDynadot — paranoia guard proving the rewrite
  transport is in place; a guarded transport refuses any non-loopback
  host so a regression in the rewrite would surface immediately.
- TestReadInputsFromEnv — env-var contract coverage.

Per docs/INVIOLABLE-PRINCIPLES.md #2 (no compromise on quality), the HTTP
client, URL encoding, and JSON parsing are real package code paths;
only the upstream Dynadot endpoint is substituted with httptest.Server.
Hitting the real api.dynadot.com would write real records and burn real
quota every CI run, which is exactly the failure the never-mock
principle is designed to prevent in this case.
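
The substitution described above (real client code paths, mocked upstream) can be sketched with httptest.Server and a rewriting transport. This is not the code in main_test.go: the endpoint URL, form field names other than add_dns_to_current_setting, and the helpers `rewriteTransport`, `writeRecords`, and `runAgainstMock` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync"
)

// rewriteTransport sends every request to the mock server, whatever host was asked for.
type rewriteTransport struct{ target *url.URL }

func (t rewriteTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	clone := req.Clone(req.Context())
	clone.URL.Scheme = t.target.Scheme
	clone.URL.Host = t.target.Host
	clone.Host = t.target.Host
	return http.DefaultTransport.RoundTrip(clone)
}

// writeRecords posts one A record per canonical subdomain and stops at the first rejection.
func writeRecords(client *http.Client, endpoint, lbIP string) error {
	for _, sub := range []string{"*", "console", "gitea", "harbor", "admin", "api"} {
		form := url.Values{
			"subdomain":                  {sub},
			"type":                       {"A"},
			"value":                      {lbIP},
			"add_dns_to_current_setting": {"yes"}, // without this flag the zone would be replaced
		}
		resp, err := client.PostForm(endpoint, form)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("record %q rejected: %s", sub, resp.Status)
		}
	}
	return nil
}

// runAgainstMock counts POSTs and checks that every one carried the never-wipe flag.
func runAgainstMock() (posts int, allKeepZone bool, err error) {
	var mu sync.Mutex
	allKeepZone = true
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		if r.Method == http.MethodPost {
			posts++
		}
		if r.ParseForm() != nil || r.PostForm.Get("add_dns_to_current_setting") != "yes" {
			allKeepZone = false
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	target, err := url.Parse(srv.URL)
	if err != nil {
		return 0, false, err
	}
	client := &http.Client{Transport: rewriteTransport{target: target}}
	// The hostname is a placeholder; the transport never lets the request leave loopback.
	err = writeRecords(client, "https://dns-api.invalid/records", "203.0.113.10")
	mu.Lock()
	defer mu.Unlock()
	return posts, allKeepZone, err
}

func main() {
	posts, ok, err := runAgainstMock()
	fmt.Println(posts, ok, err)
}
```

Putting the substitution in the transport rather than in the client code keeps URL construction, form encoding, and status handling on the real code paths.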

Closes #112.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:41:31 +02:00
hatiyildiz
a1673af401 Merge branch 'feat/group-f-umbrella-chart-fix-v2'
Group F: bp-catalyst-platform umbrella chart (#104) + 11th OCI artifact (#107).
Renames products/catalyst/chart from `catalyst-platform` to `bp-catalyst-platform`,
bumps to 1.0.1, declares dependencies on the 11 leaf Blueprints.
Workflow blueprint-release.yaml now reads chart name from Chart.yaml instead of
deriving from folder basename, and adds helm registry login for OCI deps.

Disclosed in commit 497643a: bp-external-dns:1.0.0 dep is declared but not
yet published — gates on issue #109.
2026-04-28 19:40:20 +02:00
hatiyildiz
497643a4bf fix(catalyst): #104 #107 — bp-catalyst-platform umbrella chart with 11 leaf deps
Issue #104: products/catalyst/chart/Chart.yaml had `name: catalyst-platform`
(missing the `bp-` prefix required by BLUEPRINT-AUTHORING.md §3) and no
`dependencies:` block. The Catalyst umbrella must depend on the 11 bootstrap-kit
leaf Blueprints so a single Flux HelmRelease at the umbrella OCI ref pulls in
the full Catalyst-Zero control plane.

Issue #107: bp-catalyst-platform was the missing 11th OCI artifact at
ghcr.io/openova-io. With this fix, blueprint-release.yaml will publish
ghcr.io/openova-io/bp-catalyst-platform:1.0.1 on push.

Changes:
- Rename chart to `bp-catalyst-platform`, bump version 1.0.0 -> 1.0.1
- Add `dependencies:` block listing all 11 leaves
  (cilium, cert-manager, flux, crossplane, sealed-secrets, spire,
   nats-jetstream, openbao, keycloak, gitea, external-dns), each
  pinned to 1.0.0 at oci://ghcr.io/openova-io
- Workflow blueprint-release.yaml: read chart name from Chart.yaml `name:`
  field instead of deriving `bp-<basename>` from the folder. The umbrella
  folder is `catalyst` but the chart name is `bp-catalyst-platform` —
  basename-derivation is wrong for any chart whose name doesn't equal
  `bp-<folder>`. Removes the implicit `bp-` prefix in the push step;
  Chart.yaml carries the full canonical name.
- Workflow: add `helm registry login ghcr.io` step before `helm dependency
  build` so OCI-hosted leaf deps resolve. The pre-existing docker login
  is for cosign/syft only; helm has its own auth store.
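
To show why reading the `name:` field differs from basename derivation, here is a small Go sketch. The workflow itself presumably does this in a shell step; the `chartName` helper and the sample Chart.yaml content are assumptions, and only an unindented top-level `name:` line is considered so that dependency entries are ignored.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// chartName returns the value of the top-level `name:` key in a Chart.yaml document.
// Indented lines (such as dependency entries) never match because the prefix
// check is applied to the raw, untrimmed line.
func chartName(chartYAML string) (string, error) {
	sc := bufio.NewScanner(strings.NewReader(chartYAML))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "name:") {
			continue
		}
		name := strings.Trim(strings.TrimSpace(strings.TrimPrefix(line, "name:")), `"'`)
		if name != "" {
			return name, nil
		}
	}
	return "", errors.New("no top-level name field found")
}

func main() {
	// Hypothetical Chart.yaml for the umbrella chart in the `catalyst` folder.
	sample := "apiVersion: v2\n" +
		"dependencies:\n" +
		"  - name: bp-cilium\n" +
		"    version: 1.0.0\n" +
		"name: bp-catalyst-platform\n" +
		"version: 1.0.1\n"
	name, err := chartName(sample)
	fmt.Println(name, err)
	fmt.Println("basename-derived:", "bp-"+"catalyst") // differs from the chart name
}
```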

Disclosure (per INVIOLABLE-PRINCIPLES.md §8):
- bp-external-dns:1.0.0 is listed as a dependency but is not yet published;
  platform/external-dns/ has README + policies but no chart/ dir (issue #109
  scope). The umbrella build will fail on `helm dependency build` until #109
  authors the chart and publishes bp-external-dns:1.0.0. The dependency is
  declared anyway because the target-state contract per #104 is exactly 11
  leaves — partial declaration would be a quality compromise (principle #2).

Verified leaf chart names (platform/<x>/chart/Chart.yaml, all `bp-<x>`):
  cilium, cert-manager, flux, crossplane, sealed-secrets, spire,
  nats-jetstream, openbao, keycloak, gitea — all match.
Verified published OCI tags (10/11 at ghcr.io/openova-io/bp-<name>:1.0.0).
2026-04-28 19:39:48 +02:00
hatiyildiz
7fd24fb1c1 test(dynadot): #110 — add table-driven multi-domain ManagedDomains test matrix
Augments the existing #108-landed test suite with:
- TestManagedDomains_TableDriven — a single matrix asserting all seven
  resolution-order scenarios (canonical multi, whitespace-separated,
  case-insensitive, whitespace-trimmed query, legacy fallback, defaults
  fallback, canonical-precedence-over-legacy) in one place.
- TestAddSovereignRecords_AllUseAddDNSToCurrentSetting — explicit
  regression guard that EVERY one of the six AddSovereignRecords loop
  iterations carries add_dns_to_current_setting=yes (per
  feedback_dynadot_dns.md: a single missing flag wipes the zone).

The dynadot.go client itself was already complete after #108/921eabd —
ManagedDomains() reads DYNADOT_MANAGED_DOMAINS canonical, falls back to
DYNADOT_DOMAIN legacy single-value, then to built-in defaults. This
commit adds the consolidated table-driven coverage requested for #110.

Closes #110.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:37:49 +02:00
hatiyildiz
4554bd6d5d feat(dod): #149-#157 — Group M DoD scaffolding (DEMO-RUNBOOK + dod_test.go + dod.yaml)
Manual-dispatch-only DoD scaffolding for the omantel.omani.works
end-to-end test. Operator-gated; the test t.Skip()s when
HETZNER_TEST_TOKEN env var is missing so CI stays green.

- docs/DEMO-RUNBOOK.md: 9-step operator runbook covering Group C
  cutover, wizard provision, voucher issuance, tenant redemption.
- tests/dod/dod_test.go: HTTP-driven E2E that streams SSE through
  all 11 phases, asserts cert + DNS + voucher + redemption flow.
- .github/workflows/dod.yaml: workflow_dispatch only — never
  on-push (Hetzner cost gating).

Cherry-picked additive files from /tmp/agent-group-m-dod (a40b495);
the agent's branch had stale-base deletions of #108/#109/Pass-107
that we drop.
2026-04-28 19:34:46 +02:00
e3mrah
c3d6385974 provision: deploy tenant bakkal (plan: m, apps: 5) 2026-04-28 21:20:56 +04:00
hatiyildiz
e5bf5baab1 Merge branch 'docs/validation-log-pass-107' — Pass 107 audit-log entry 2026-04-28 14:52:38 +02:00
hatiyildiz
628b6a6bff docs(validation-log): pass 107 — Lessons #24/#25/#26 closures + waterfall completion snapshot
13 acceptance greps re-run on 14ff252; verdict NIRVANA. Cross-attests
Lesson #24 (bespoke Hetzner+helm-exec replaced with OpenTofu→Crossplane→Flux),
Lesson #25 (catalystBlueprint.upstream metadata block in all 10 G2 wrappers),
Lesson #26 (INVIOLABLE-PRINCIPLES.md anchored in 3 places). Records live
waterfall progress (~88%): A/B/D/F/H/I/J/L closed; C ready; E mostly closed;
K 7/8; G in-flight; M scaffolding. No new violations; no new lessons.
2026-04-28 14:51:50 +02:00
hatiyildiz
7d359668b3 fix(catalyst-api): #148 — eliminate race in CreateDeployment status read
Race detector caught a write/read race between the response writer's
read of dep.Status (line 101) and the runProvisioning goroutine's
mu-locked write at line 166. The reader doesn't take dep.mu, so
even though the goroutine writes under the lock the read isn't
synchronised. Capturing the status into a local before launching
the goroutine eliminates the race — the response carries the
known-just-set "provisioning" value verbatim.

Closes the recurring TestLoad_TenConcurrentDeploymentsAreIsolated
failure on cf60bd7, 333b859, f0fe300.
2026-04-28 14:49:02 +02:00
hatiyildiz
f0fe3006ba feat(external-dns): #109 — Catalyst-curated dynadot-multi-domain policy
Adds platform/external-dns/policies/dynadot-multi-domain.yaml — the
canonical external-dns + dynadot webhook deployment that ships in every
Sovereign on an OpenOva pool domain.

Why a webhook: external-dns has no upstream Dynadot provider; the
canonical pattern is the webhook RPC contract, with a sidecar that
implements the provider in our preferred language. We reuse the same
internal/dynadot/ package the catalyst-api uses, so the never-wipe rule,
record encoding, and managed-domain allowlist are identical on both
write paths (per docs/INVIOLABLE-PRINCIPLES.md #2 — no duplicate
implementations of the same concern).

Multi-domain:
- One --domain-filter per zone in the external-dns args; adding a third
  pool domain (e.g. acme.io) is a one-line edit here PLUS a one-key edit
  on dynadot-api-credentials' `domains` field. No webhook rebuild.
- Webhook reads DYNADOT_MANAGED_DOMAINS from the same secret with
  optional=true, preserving backward compatibility with the legacy
  single-`domain` secret shape (pre-#108).

TXT registry:
- --txt-owner-id=$(SOVEREIGN_FQDN), --txt-prefix=_externaldns.<sub>.
- Cluster overlays substitute SOVEREIGN_FQDN via the bp-catalyst-platform
  umbrella so two clusters sharing a parent zone (alpha.omani.works,
  beta.omani.works) cannot collide.

Closes #109.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:45:53 +02:00
hatiyildiz
921eabdc47 feat(dynadot): #108 — multi-domain secret support (omani.works + openova.io + future)
The dynadot-api-credentials K8s secret in openova-system used to carry a
single `domain=openova.io` field. Per docs/INVIOLABLE-PRINCIPLES.md #4
("never hardcode") and the design constraint that adding a third pool
domain (e.g. acme.io) must NOT require a code change, the secret now
carries a `domains` field — a comma- or whitespace-separated list — and
the catalyst-api reads it at runtime via DYNADOT_MANAGED_DOMAINS.

Resolution order in dynadot.IsManagedDomain:
  1. DYNADOT_MANAGED_DOMAINS env (canonical, multi-domain)
  2. DYNADOT_DOMAIN env (legacy single-value, backward-compat)
  3. Built-in defaults (openova.io, omani.works) — defensive fail-closed
     fallback if the secret was not mounted.
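
The three-step resolution order above can be sketched like this. It is an illustration rather than the package's code: the function names and the injected `getenv` parameter are assumptions, while the env var names, the comma-or-whitespace separator, and the default domains come from this message.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"unicode"
)

// splitDomains splits a comma- or whitespace-separated list and lowercases each entry.
func splitDomains(raw string) []string {
	fields := strings.FieldsFunc(raw, func(r rune) bool {
		return r == ',' || unicode.IsSpace(r)
	})
	out := make([]string, 0, len(fields))
	for _, f := range fields {
		out = append(out, strings.ToLower(f))
	}
	return out
}

// managedDomains applies the resolution order: canonical list, legacy single value, defaults.
func managedDomains(getenv func(string) string) []string {
	if v := splitDomains(getenv("DYNADOT_MANAGED_DOMAINS")); len(v) > 0 {
		return v
	}
	if v := splitDomains(getenv("DYNADOT_DOMAIN")); len(v) > 0 {
		return v
	}
	return []string{"openova.io", "omani.works"}
}

// isManagedDomain compares case-insensitively after trimming the query.
func isManagedDomain(getenv func(string) string, domain string) bool {
	want := strings.ToLower(strings.TrimSpace(domain))
	for _, d := range managedDomains(getenv) {
		if d == want {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(managedDomains(os.Getenv))
	fmt.Println(isManagedDomain(os.Getenv, " OpenOva.io "))
}
```

Passing `getenv` as a parameter is a choice made for this sketch so that the precedence rules can be exercised from a table of maps without mutating the process environment.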

The deployment manifest mounts both env vars from the secret with
optional=true, so existing clusters whose secret only carries the legacy
`domain` key keep working; migration is a one-key secret update with no
deployment edit required.

Closes #108.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:45:53 +02:00
hatiyildiz
14ff25214a docs(orchestrator): persist orchestration state for parallel-agent coordination
ORCHESTRATOR-STATE.md is the durable hand-off record for the multi-agent waterfall. Captures:
- Live ticket counts (74 closed / 43 open as of cf60bd7)
- Per-group status with branch/commit references
- Architectural-compliance verification (Lesson #24 closed)
- DoD checklist (what still needs operator action)
- Active parallel work + resume protocol

Companion durable memory at ~/.claude/projects/.../memory/catalyst-bootstrap-plan.md points here so a fresh session re-loads orchestrator state without losing context.
2026-04-28 14:30:26 +02:00
hatiyildiz
cf60bd77dd feat(wizard): #125 retry-phase endpoint + UX for failed bootstrap-kit phases
Group I leftover. New POST /api/v1/deployments/{id}/phases/{phase}/retry endpoint distinguishes:
- Phase 0 (tofu-*) → catalyst-api re-runs tofu apply against the existing workdir (idempotent per OpenTofu state model)
- Phase 1 (bootstrap-kit HelmReleases) → Flux owns reconciliation per Lesson #24; HelmRelease.spec.install.remediation.retries=3 handles transient failures automatically; operator-driven retries go via the Flux Receiver webhook published by bp-catalyst-platform (NEVER kubectl/helm exec from catalyst-api)
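
The distinction the endpoint draws can be sketched as a small classifier. The type, constant, and function names are assumptions, as are the example phase names; the only rule taken from this message is that phases prefixed `tofu-` are re-run by catalyst-api while everything else is left to Flux.

```go
package main

import (
	"fmt"
	"strings"
)

// retryAction says who is responsible for retrying a failed phase.
type retryAction int

const (
	rerunTofuApply retryAction = iota // catalyst-api re-runs tofu apply on the existing workdir
	deferToFlux                       // Flux remediation or the Receiver webhook handles it
)

func (a retryAction) String() string {
	if a == rerunTofuApply {
		return "rerun-tofu-apply"
	}
	return "defer-to-flux"
}

// classifyRetry maps a phase name to the retry path described in the commit.
func classifyRetry(phase string) retryAction {
	if strings.HasPrefix(phase, "tofu-") {
		return rerunTofuApply
	}
	return deferToFlux
}

func main() {
	// Phase names below are placeholders for illustration.
	for _, phase := range []string{"tofu-apply", "cilium", "keycloak"} {
		fmt.Printf("%s -> %s\n", phase, classifyRetry(phase))
	}
}
```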

BootstrapProgress.tsx extended:
- Failed-phase rendering (red border, error message from event stream)
- "Retry phase" button (only Phase 0 phases) calling the new endpoint
- "View runbook" link to docs/RUNBOOK-PROVISIONING.md for operator-driven retries

Closes #125 — failed-phase UX.
2026-04-28 14:29:17 +02:00
github-actions[bot]
7ef93f4d06 deploy: update catalyst images to 333b859 2026-04-28 12:24:15 +00:00
hatiyildiz
333b8593b8 fix(catalyst-api): reuse alpine's UID 65534 nobody account in Containerfile
Alpine 3.20 already provisions UID 65534 as the 'nobody' user, so the
explicit 'adduser -D -u 65534 nonroot' step failed with 'uid 65534 in
use' and broke the catalyst-build CI. Drop the adduser and rely on the
existing system account; the numeric USER directive still satisfies
runAsNonRoot.
2026-04-28 14:23:17 +02:00
github-actions[bot]
5ff2c8b0a6 deploy: update sme service images to 046e5eb 2026-04-28 12:10:52 +00:00
hatiyildiz
e87913a7d7 feat(wizard): StepComponents → unified bp-<x> marketplace card grid
Group D deliverable. Replaces the legacy 353-line category-grouped checkbox tree (PILOT/SPINE/SURGE/...) with a 457-line unified marketplace card grid driven by every platform/<name>/blueprint.yaml + products/<name>/blueprint.yaml in the monorepo.

Per docs/INVIOLABLE-PRINCIPLES.md #2 + Pass 103/104 unification: every installable Catalyst Application is `bp-<name>` shape regardless of category. One catalog, one selection model, one card surface. The user-facing grid now mirrors the SME marketplace surface in core/marketplace/.

Visibility filter:
- listed   → renders as card; user opts in/out
- unlisted → mandatory infra (cilium, flux, crossplane, openbao, cert-manager, ...) auto-installed by bootstrap-kit, NEVER appears in this grid
- private  → org-private

Data source: src/shared/constants/catalog.generated.ts — auto-generated by scripts/build-catalog.mjs from every blueprint.yaml at build time. Re-runs on `npm run build:catalog` (invoked by `npm run dev` + `npm run build` prebuild hook). Never hardcoded; per principle #4.

New files:
- scripts/build-catalog.mjs — generator
- src/shared/constants/catalog.generated.ts — generated catalog data (committed for repro builds; regenerated on each build)
- src/shared/constants/{components,env,hetzner}.ts — supporting data tables

Modified:
- src/pages/wizard/steps/StepComponents.tsx — full rewrite (353 → 457 lines)
- src/entities/deployment/{model,store}.ts — selection state shape extended
- vite.config.ts — prebuild script wiring
- package.json — build:catalog script + prebuild hook

Also recovers products/catalyst/bootstrap/api/internal/handler/load_test.go — load test scaffold from Group L's testing work, untracked since the L merge.
2026-04-28 14:10:45 +02:00
github-actions[bot]
fd8228c2a1 deploy: update Catalyst admin image to 046e5eb 2026-04-28 12:10:23 +00:00
github-actions[bot]
629d67b6a5 deploy: update Catalyst marketplace image to 046e5eb 2026-04-28 12:10:09 +00:00
hatiyildiz
046e5ebc18 feat(day2-iac): Crossplane Compositions + per-Sovereign Flux cluster tree + catalyst-dns binary
Group F deliverables — completes the day-2 IaC layer that takes over after OpenTofu's Phase 0 hand-off (per docs/SOVEREIGN-PROVISIONING.md §4).

Four artifacts:

1. platform/crossplane/compositions/ — XRDs + Compositions for canonical Hetzner resources
   under the canonical compose.openova.io/v1alpha1 group (per BLUEPRINT-AUTHORING.md §8):
   - XHetznerNetwork + composition-network.yaml — wraps hcloud_network + subnet
   - XHetznerFirewall + composition-firewall.yaml
   - XHetznerServer + composition-server.yaml
   - XHetznerLoadBalancer + composition-loadbalancer.yaml (lb11, 80→31080, 443→31443)
   - README documenting the canonical pattern

2. clusters/_template/ — the canonical per-Sovereign Flux Kustomization tree.
   Copied to clusters/<sovereign-fqdn>/ at provisioning time; cloud-init's
   GitRepository points at the result.
   - kustomization.yaml (root: flux-system + infrastructure + bootstrap-kit)
   - flux-system/ (placeholder for Flux self-config customization)
   - infrastructure/ (provider-hcloud + ProviderConfig referencing hcloud-credentials secret OpenTofu writes)
   - bootstrap-kit/ — 11 HelmRelease manifests in dependency order:
     01-cilium → 02-cert-manager → 03-flux → 04-crossplane → 05-sealed-secrets
     → 06-spire → 07-nats-jetstream → 08-openbao → 09-keycloak → 10-gitea → 11-bp-catalyst-platform
     Each pulls from oci://ghcr.io/openova-io/bp-<name>:1.0.0 — the wrapper charts published by blueprint-release CI.
     dependsOn declarations enforce the canonical install order at runtime.

3. clusters/omantel.omani.works/ — the first concrete Sovereign instance.
   Mirror of _template with SOVEREIGN_FQDN_PLACEHOLDER substituted to omantel.omani.works.
   This is what the wizard's first omantel.omani.works run will actually reconcile.

4. products/catalyst/bootstrap/api/cmd/catalyst-dns/main.go — small Go binary the
   OpenTofu module's null_resource.dns_pool invokes via local-exec at Phase-0 apply time.
   Reads DYNADOT_API_KEY/SECRET/DOMAIN/SUBDOMAIN/LB_IP env vars; calls existing dynadot.Client.AddSovereignRecords. Containerfile already builds + ships it at /usr/local/bin/catalyst-dns.
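A minimal sketch of the fail-fast environment check such a binary might perform before calling the DNS client. The `requireEnv` helper is hypothetical and not taken from the repository; only `DYNADOT_API_KEY` is spelled out in full in the message above, so it is the only variable name used here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requireEnv (hypothetical helper) returns the values of the named variables,
// or a single error listing every variable that is unset or empty.
func requireEnv(lookup func(string) (string, bool), names ...string) (map[string]string, error) {
	vals := make(map[string]string, len(names))
	var missing []string
	for _, n := range names {
		v, ok := lookup(n)
		if !ok || v == "" {
			missing = append(missing, n)
			continue
		}
		vals[n] = v
	}
	if len(missing) > 0 {
		return nil, fmt.Errorf("missing required env vars: %s", strings.Join(missing, ", "))
	}
	return vals, nil
}

func main() {
	// The remaining variable names are abbreviated in the commit message,
	// so they are deliberately not guessed here.
	if _, err := requireEnv(os.LookupEnv, "DYNADOT_API_KEY"); err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("environment ok")
}
```

Reporting every missing variable in one error keeps a failed local-exec run diagnosable from a single log line.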

Architectural compliance (Lesson #24 closed):
- No bespoke Go cloud-API calls (Crossplane Compositions are the canonical day-2 IaC)
- No exec.Command("helm", ...) (Flux HelmReleases are the canonical install unit)
- No kubectl apply from outside (cloud-init kubectl-applies one Flux GitRepository, then Flux owns everything)

After this commit, the path is end-to-end: wizard → catalyst-api → tofu apply (with infra/hetzner/) → cloud-init installs k3s + Flux + applies GitRepository pointing at clusters/omantel.omani.works/ → Flux reconciles bootstrap-kit (11 HelmReleases in dependency order) → Crossplane adopts day-2 management.
2026-04-28 14:09:29 +02:00
Emrah Baysal
9519c1ef00 merge: Group L testing (Playwright e2e smoke tests, Hetzner provisioning test scaffold gated on HETZNER_TEST_TOKEN secret, integration tests for bootstrap installer + Dynadot + voucher) 2026-04-28 14:05:59 +02:00
Emrah Baysal
2bcf5644cb merge: Group I wizard UX (11-bootstrap-phase progress indicator, SSE log pane, error handling for token/subdomain/phase failure, pre-submit subdomain check) 2026-04-28 14:05:58 +02:00
Emrah Baysal
f2951afd08 merge: Group H franchise + vouchers (real /billing/vouchers backend, public /redeem page, sovereign-admin role wiring, GLOSSARY+BUSINESS-STRATEGY updates) 2026-04-28 14:05:50 +02:00
Emrah Baysal
e5550d784d merge: Group J Hetzner infra (cx32→cx42 sizing fix, OS hardening cloud-init, operator README) 2026-04-28 14:05:50 +02:00
Emrah Baysal
dc3f50d738 merge: Group K docs (component count 53→56, RUNBOOK-PROVISIONING.md, IMPLEMENTATION-STATUS updates, VALIDATION-LOG Pass 105/106) 2026-04-28 14:05:42 +02:00
hatiyildiz
e0dc23a818 feat(catalyst): pre-submit subdomain availability check (#124)
Adds POST /api/v1/subdomains/check on the catalyst-api side and a
debounced React hook on the wizard side, so collisions on pool
subdomains are caught BEFORE the user clicks Submit instead of
failing at provisioning time when Dynadot rejects the duplicate
record.

Backend (handler/subdomains.go):
  - Validates subdomain syntax (RFC 1035 label).
  - Rejects unsupported pool domains (defence-in-depth — wizard
    already filters its own dropdown but the handler never trusts
    client input).
  - Rejects reserved control-plane names (api, admin, console,
    gitea, harbor, keycloak, www, mail, smtp, vpn, openova,
    catalyst, docs, status, app, system, openbao, vault, flux,
    k8s) — these are auto-allocated by the Sovereign provisioner.
  - Resolves <subdomain>.<pool> via the system DNS resolver with a
    2-second timeout. NXDOMAIN ⇒ available; any address record
    returned ⇒ taken; other errors ⇒ surfaced as lookup-error
    (transient — user retries).
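The backend checks above could be sketched roughly as follows. This is an illustration under assumptions, not the contents of handler/subdomains.go: the function names, the status strings other than lookup-error, the label regex, and the shortened reserved list are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"regexp"
	"time"
)

// labelRE approximates an RFC 1035 label: starts with a letter, ends with a
// letter or digit, hyphens only in the middle, at most 63 characters.
var labelRE = regexp.MustCompile(`^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$`)

// reserved is a shortened, illustrative subset of the reserved names.
var reserved = map[string]bool{"api": true, "admin": true, "console": true, "www": true}

// validate returns "" when the label may proceed to the DNS lookup.
func validate(sub string) string {
	switch {
	case !labelRE.MatchString(sub):
		return "invalid"
	case reserved[sub]:
		return "reserved"
	}
	return ""
}

// classify maps a resolver result onto the three outcomes described above.
func classify(addrs []string, err error) string {
	var dnsErr *net.DNSError
	switch {
	case err == nil && len(addrs) > 0:
		return "taken"
	case errors.As(err, &dnsErr) && dnsErr.IsNotFound:
		return "available"
	default:
		return "lookup-error"
	}
}

// check resolves <sub>.<pool> via the system resolver with a 2-second timeout.
func check(ctx context.Context, sub, pool string) string {
	if s := validate(sub); s != "" {
		return s
	}
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	addrs, err := net.DefaultResolver.LookupHost(ctx, sub+"."+pool)
	return classify(addrs, err)
}

// nxdomain builds the error shape the Go resolver reports for a missing name.
func nxdomain(name string) error {
	return &net.DNSError{Err: "no such host", Name: name, IsNotFound: true}
}

func main() {
	fmt.Println(validate("api"))
	fmt.Println(classify(nil, nxdomain("free.example.org")))
	fmt.Println(classify([]string{"203.0.113.7"}, nil))
	_ = check // check performs a real DNS lookup, so it is not called here
}
```

Keeping `classify` separate from the lookup lets the three outcomes be unit-tested without touching the network.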

Per the auto-memory feedback_dynadot_dns.md the handler deliberately
does NOT call Dynadot's API for the availability check — Dynadot's
set_dns2 is write-only-safe; the global DNS resolver is the
eventually-consistent source of truth for what names already point
somewhere.

Wizard (useSubdomainAvailability + StepOrg):
  - Debounces by 400 ms so fast typists don't trigger fetches per
    keystroke.
  - Renders a live status pill next to the Subdomain label
    (checking… / available / taken / invalid / check failed).
  - On taken/reserved/invalid/error, surfaces the backend's detail
    string verbatim in an inline-error card directly under the
    input.
  - Blocks the wizard's Next button while the check is in flight,
    when the subdomain is taken/invalid, or when the check itself
    failed (operator must resolve before proceeding).

Closes #124.
2026-04-28 14:02:17 +02:00
hatiyildiz
7c7c46bc62 test: Hetzner Sovereign end-to-end provisioning test (#141)
Closes the Group L "end-to-end provisioning test on Hetzner test project"
ticket. Per the ticket's exact wording: scaffolding + harness + CI
workflow, gated on HETZNER_TEST_TOKEN, NEVER mocked.

Lifecycle when HETZNER_TEST_TOKEN is set:
  1. Generate unique sovereign FQDN (e2e-<run-id>.openova.io)
  2. Stage canonical infra/hetzner/ OpenTofu module into temp dir
  3. Render tofu.auto.tfvars.json with test inputs (BYO domain mode so
     Dynadot isn't touched; region runtime-configurable; SSH key minted
     by CI per-run)
  4. tofu init && tofu apply -auto-approve (30m timeout)
  5. Assert outputs: control_plane_ip + load_balancer_ip are valid IPv4
  6. Assert TCP/22 reachable on control plane (5m await)
  7. Assert TCP/443 reachable on LB after Cilium + Flux land (15m await,
     soft-failure since the Catalyst control plane install is the long
     tail and partial-bootstrap is acceptable proof of OpenTofu + Flux)
  8. tofu destroy -auto-approve (always — t.Cleanup, runs even on fail)
  9. Verify state list is empty after destroy (no leaked resources)
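Steps 5 to 7 reduce to two small checks, sketched here under assumptions: the helper names `isIPv4` and `awaitTCP` are hypothetical, and the real harness in main_test.go may structure this differently.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/netip"
	"time"
)

// isIPv4 reports whether s is a plain IPv4 address (step 5).
func isIPv4(s string) bool {
	addr, err := netip.ParseAddr(s)
	return err == nil && addr.Is4()
}

// awaitTCP dials addr until a connection succeeds or ctx expires (steps 6-7).
func awaitTCP(ctx context.Context, addr string, interval time.Duration) error {
	d := net.Dialer{Timeout: interval}
	for {
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%s not reachable: %w", addr, ctx.Err())
		case <-time.After(interval):
			// retry after the polling interval
		}
	}
}

func main() {
	fmt.Println(isIPv4("203.0.113.7"), isIPv4("2001:db8::1"))

	// Demonstrate awaitTCP against a local listener instead of a real server.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer ln.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(awaitTCP(ctx, ln.Addr().String(), 500*time.Millisecond))
}
```

In a test, the port-22 check would pass a 5-minute context and fail hard, while the port-443 check would pass a 15-minute context and only log on timeout, matching the soft-failure rule in step 7.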

When HETZNER_TEST_TOKEN is absent, the test SKIPS — does not mock, does
not fall through to a stub. Per docs/INVIOLABLE-PRINCIPLES.md #2,
mocking the cloud would tell us nothing about whether the OpenTofu module,
hcloud provider, cloud-init scripts, or k3s actually work. A second test
(TestHarness_NoHetznerCredsSkips) explicitly verifies the skip semantics
so future refactors don't accidentally land mocking.

CI workflow (.github/workflows/test-hetzner-e2e.yaml):
  - Triggers on workflow_dispatch (operator initiates real run) or PR
    labeled `test/hetzner-e2e`, NOT on every push (each run costs real
    Hetzner minutes, ~EUR 0.005/run).
  - Generates a per-run throwaway Ed25519 SSH keypair so that no
    long-lived secret key lands in any logs.
  - Installs OpenTofu via opentofu/setup-opentofu@v1.
  - Reads HETZNER_TEST_TOKEN + HETZNER_TEST_PROJECT_ID from repo secrets;
    operator populates them out-of-band (per the ticket: "operator will
    populate later").
  - 55m job timeout, plus the test itself uses contexts of 30m apply
    + 20m destroy.

Files:
  - tests/e2e/hetzner-provisioning/main_test.go (the harness)
  - tests/e2e/hetzner-provisioning/go.mod (separate module, stdlib-only)
  - .github/workflows/test-hetzner-e2e.yaml (gated CI)

Refs #141

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:00:29 +02:00
hatiyildiz
7edf63ca7e docs(franchise),test(billing): voucher CRD propagation invariant
#118 verifies that the voucher shape on a franchised Sovereign is
identical to Catalyst-Zero. Two artefacts:

1. New §"Voucher shape propagates automatically" in
   docs/FRANCHISE-MODEL.md explaining WHY there is no propagation
   problem to solve: vouchers are not a CRD. They are rows in the
   per-Sovereign billing service's Postgres database, and every
   Sovereign runs the same SHA-pinned core/services/billing image.
   Same image → same migration → same schema → same handlers → same
   shape. The doc lists which file owns each part of the shape and
   includes a 4-step curl smoke test to run on any Sovereign at
   first-provisioning to confirm the invariant holds.

2. New core/services/billing/handlers/vouchers_test.go covering the
   public POST /billing/vouchers/redeem-preview endpoint added in
   #117. Four cases:
   - 404 on unknown / soft-deleted code (no tombstone leak)
   - 200 on a valid live code, asserting the public shape excludes
     times_redeemed and max_redemptions (defence-in-depth against
     enumeration)
   - 410 Gone on a code that exists but has hit its cap, with the
     credit/description still in the response so the landing page can
     show "campaign ended"
   - 400 on whitespace-only input

The tests run on every CI build of the billing service, on every
Sovereign that builds from this repo. If a future change drifts the
preview endpoint's shape, the tests fail before the regression can
ship.
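A shape guard of this kind can be sketched as a helper that marshals the response and reports any forbidden key. The helper name `leakedFields` and the `credit_omr` field are assumptions for illustration; vouchers_test.go may assert the shape differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leakedFields (hypothetical helper) marshals v to JSON and returns every
// forbidden top-level key that appears in the encoded object.
func leakedFields(v any, forbidden ...string) ([]string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, err
	}
	var leaked []string
	for _, f := range forbidden {
		if _, ok := obj[f]; ok {
			leaked = append(leaked, f)
		}
	}
	return leaked, nil
}

func main() {
	// A response that accidentally exposes the redemption counter.
	bad := map[string]int{"credit_omr": 5, "times_redeemed": 1}
	leaked, err := leakedFields(bad, "times_redeemed", "max_redemptions")
	fmt.Println(leaked, err)
}
```

Checking the encoded JSON rather than the Go struct catches drift introduced by a changed struct tag as well as by a new field.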

Also tidies vouchers.go imports (removed two unused stdlib imports
that were placeholders).

Closes #118.
2026-04-28 13:59:31 +02:00
hatiyildiz
3dced3fdda test: bootstrap-kit Flux Kustomization integration test (#145)
Closes the Group L "integration test — provisioner backend bootstrap-kit
installer — all 11 phases install in sequence on a kind cluster" ticket.

Per the ticket note, the bootstrap installer is now Flux-driven from
clusters/<sovereign-fqdn>/ — NOT the bespoke Go-based installer that was
reverted in commit e668637. The test verifies that Flux reconciles the
right Kustomizations rather than that Go code helm-installs anything.

Two layers of validation:

1. Static manifest layer (runs on every push, cheap)
   - All 11 platform/<x>/blueprint.yaml + chart/Chart.yaml exist
   - Each blueprint.yaml satisfies catalyst.openova.io/v1alpha1 schema
     (apiVersion/kind/metadata.name/spec.version/card.title/card.summary)
   - Chart.yaml name matches "bp-<x>" and version matches blueprint.yaml
     spec.version
   - clusters/_template/ YAMLs parse after SOVEREIGN_FQDN_PLACEHOLDER
     substitution (when the template tree is on the branch — Group J/M
     ticket lands the per-Sovereign template)
   - The dependency order matches the canonical 11-phase sequence from
     SOVEREIGN-PROVISIONING.md §3 (cilium → cert-manager → flux →
     crossplane → sealed-secrets → spire → nats-jetstream → openbao →
     keycloak → gitea → bp-catalyst-platform)

2. Kind-cluster layer (runs on main pushes, gated on
   BOOTSTRAP_KIT_KIND_TEST=1)
   - Brings up kubernetes-in-docker
   - Installs Flux CRDs + source/kustomize controllers
   - Registers a GitRepository pointing at this monorepo
   - Synthesizes the 11 bootstrap-kit Kustomizations and applies them
   - Asserts the API server accepts all 11 (manifests are valid, schema
     satisfied) — this is the test's narrow scope per the ticket
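The dependency-order assertion in the static layer could look roughly like this. The sequence is the one quoted above; the `checkOrder` function and the shape of its input are hypothetical, since the message does not show how the test reads dependsOn.

```go
package main

import "fmt"

// canonical is the 11-phase bootstrap-kit order quoted in the commit message.
var canonical = []string{
	"cilium", "cert-manager", "flux", "crossplane", "sealed-secrets", "spire",
	"nats-jetstream", "openbao", "keycloak", "gitea", "bp-catalyst-platform",
}

// checkOrder (hypothetical) verifies that every dependsOn entry names a
// release that comes earlier in the canonical order.
func checkOrder(dependsOn map[string][]string) error {
	pos := make(map[string]int, len(canonical))
	for i, n := range canonical {
		pos[n] = i
	}
	for name, deps := range dependsOn {
		p, ok := pos[name]
		if !ok {
			return fmt.Errorf("unknown release %q", name)
		}
		for _, d := range deps {
			q, ok := pos[d]
			if !ok {
				return fmt.Errorf("%s depends on unknown release %q", name, d)
			}
			if q >= p {
				return fmt.Errorf("%s depends on %s, which installs later", name, d)
			}
		}
	}
	return nil
}

func main() {
	good := map[string][]string{"cert-manager": {"cilium"}, "flux": {"cert-manager"}}
	bad := map[string][]string{"cilium": {"keycloak"}}
	fmt.Println(checkOrder(good))
	fmt.Println(checkOrder(bad))
}
```

This only checks that dependencies point backwards in the sequence; whether each release must depend on its immediate predecessor is a stricter rule the sketch does not assume.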

The test deliberately does NOT wait for the kit to fully install upstream
charts or reach steady-state reconciliation. That belongs to #141 (real
Hetzner E2E with cloud credentials and outbound network), not a kind
cluster test in CI.

Files:
  - tests/e2e/bootstrap-kit/main_test.go (Go test, 11 subtests + 4 main)
  - tests/e2e/bootstrap-kit/go.mod (separate module — keeps test deps
    isolated from the production Go modules)
  - .github/workflows/test-bootstrap-kit.yaml (kind-action + flux2/action)

Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:18 +02:00
hatiyildiz
a0ff764736 feat(catalyst-ui): inline error UX when Hetzner rejects token (#123)
Replaces the branch in TokenSection.validate() that silently swallowed
errors with a real failure-mode taxonomy and an inline error card that surfaces
the exact reason the token failed plus a remediation hint and a retry
button.

Failure modes the validator now distinguishes:
  - rejected      backend confirmed token is wrong (read-only, expired, …)
  - too-short     client- or server-side length validation
  - unreachable   could not reach the cloud provider's API (HTTP 503)
  - network       could not reach catalyst-api (offline, CORS, DNS)
  - parse         backend response was malformed
  - http          any other unhandled non-2xx status

Each kind has its own remediation hint pre-baked into FAILURE_HINTS;
the inline ValidationErrorCard renders kind + summary + HTTP status +
hint + raw backend message verbatim + retry / copy-diagnostic buttons.

The previous implementation flipped to state=valid on network failure
("backend doesn't reach Hetzner → assume token is good"), violating
docs/INVIOLABLE-PRINCIPLES.md #1 ("never compromise from quality"):
the wizard would let the user proceed with a token that may or may not
work, then fail at provisioning time. Now any non-success path surfaces
a specific, actionable error and blocks Next.

Closes #123.
2026-04-28 13:57:00 +02:00
hatiyildiz
9404632830 feat(marketplace): public /redeem?code=... voucher landing flow
#116 adds the public landing page that the franchise model relies on
to convert voucher distribution into Catalyst signups (per
docs/FRANCHISE-MODEL.md §3, "redemption flow end-to-end").

New page core/marketplace/src/pages/redeem.astro:

- Reads ?code=... from the URL (or accepts manual entry if absent).
- POSTs to /api/billing/vouchers/redeem-preview (added in #117) — does
  NOT consume the voucher, just validates it.
- Renders one of four states:
  * Valid (200): "X OMR credit" + description + "Sign up to redeem"
    CTA. The CTA stashes the code in localStorage under
    `sme-pending-voucher` and routes to /plans (the start of the
    existing signup wizard).
  * Campaign ended (410): inactive or capped — shows the credit that
    was offered + a path to sign up without a voucher.
  * Not valid (404): never existed or soft-deleted (#91 tombstone-leak
    protection — the two are indistinguishable on the public surface).
  * No code present: a manual input form so a redeemer who landed on
    /redeem without a query string can paste their code.

CheckoutStep wiring (core/marketplace/src/components/CheckoutStep.svelte):

- The `promoCode` $state now hydrates from `sme-pending-voucher` so a
  redeemer arriving via /redeem reaches /checkout with the field
  pre-filled. They can still edit or clear it.
- After submitting to /billing/checkout, we clear the localStorage
  stash. This prevents a second signup on the same browser from
  silently carrying over the previous voucher.

The actual redemption (insert into promo_redemptions, increment
times_redeemed, credit_ledger entry) still happens transactionally
inside POST /billing/checkout — splitting it out would risk a
partially-redeemed code with no Order to show for it (the same
class of bug #91 fixed).

Per docs/INVIOLABLE-PRINCIPLES.md §1: target-state shape, not MVP.
The page handles all four observable backend states; manual-entry
fallback is included; the "campaign ended" path keeps the user moving
into signup rather than dead-ending.

Closes #116.
2026-04-28 13:56:54 +02:00
hatiyildiz
d6c1d3fbeb docs(validation-log): Pass 105 + Pass 106 entries documenting consolidation + Group K work
Closes #140.

Two new audit-log entries appended to docs/VALIDATION-LOG.md:

**Pass 105 — Catalyst-Zero consolidation + 11 G2 wrapper charts**
Records the cross-cutting work landed across commits 3c2f7e4 (Group A
code consolidation), 7646840 (Group B SME services), and 8c0f766 (Group F
G2 wrapper charts). Critically documents the +3 new platform/ folders
(spire, nats-jetstream, sealed-secrets) that raised the count from 53
to 56. Per Lesson #26, recorded as 🚧 rather than as complete; runtime DoD is Group M.

**Pass 106 — Group K documentation reconciliation**
Records the 5 commits this branch lands:
  224d81e — component-count anchor refresh 53 → 56 across CLAUDE.md,
            AUDIT-PROCEDURE, BUSINESS-STRATEGY, PROVISIONING-PLAN, TF
  7b24f96 — PLATFORM-TECH-STACK §1+§2.3+§3.2 cross-doc consistency
  ab456d4 — IMPLEMENTATION-STATUS §7 catalyst-provisioner 📐🚧
  3a7ec9e — SOVEREIGN-PROVISIONING §3 deployed-reality rewrite
  e8c3f6f — RUNBOOK-PROVISIONING new operator-level doc

Acceptance greps recorded:
- '\\b53 components\\b|\\b53 platform components\\b|\\b53 curated\\b|\\b53-component\\b'
  → empty (excluding VALIDATION-LOG self-references)
- ls -d platform/*/ | wc -l → 56
- BUSINESS-STRATEGY '\\b56\\b' count → 26 (consistent across the canon)

Pass 106 explicitly notes #134 is NOT closed (omantel 📐 requires
Group M DoD per INVIOLABLE-PRINCIPLES.md #7) and the omantel row in
IMPLEMENTATION-STATUS.md §6 was correctly left as 📐.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:56:08 +02:00
hatiyildiz
3440bf70f0 feat(catalyst-ui): SSE log-stream widget — tail -f equivalent (#122)
Live log viewer that consumes the catalyst-api SSE event stream and
renders it as a tail-style pane during StepProvisioning.

Features:
  - Auto-scroll to newest line, with a follow/paused toggle that
    auto-disengages when the user scrolls up to inspect history.
  - Per-phase filter — clicking a phase row in the bootstrap-progress
    widget passes its id here, scoping the log to that phase.
  - Per-level filter — info / warn / error toggle chips with running
    counts for the current scope.
  - Live free-text grep across visible window (case-insensitive,
    matches both phase id and message).
  - Copy-all-visible button (always copies the currently filtered view
    in tail-style "<time>  [<phase>] <LEVEL>  <msg>" format).
  - Connection-state pill — connecting / streaming / completed / failed,
    bound 1:1 to the underlying EventSource.readyState.

The widget is presentational and consumes the real ProvisioningEvent
stream from useProvisioningStream — no mock data, per
docs/INVIOLABLE-PRINCIPLES.md #1 ("waterfall is the contract").
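For context, the server side of such a stream can be sketched in Go as below. This is an assumption-laden illustration, not catalyst-api code: the `Event` fields mirror the time, phase, level, and message parts of the copy format above, but the JSON keys, the `streamEvents` name, and the `tofu-apply` phase id are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Event is a hypothetical wire shape for one provisioning log line.
type Event struct {
	Time    string `json:"time"`
	Phase   string `json:"phase"`
	Level   string `json:"level"`
	Message string `json:"message"`
}

// streamEvents writes each event as one SSE "data:" frame and flushes it,
// so an EventSource client receives lines as they are produced.
func streamEvents(w http.ResponseWriter, events <-chan Event) error {
	flusher, ok := w.(http.Flusher)
	if !ok {
		return fmt.Errorf("streaming unsupported by this ResponseWriter")
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	for ev := range events {
		b, err := json.Marshal(ev)
		if err != nil {
			return err
		}
		if _, err := fmt.Fprintf(w, "data: %s\n\n", b); err != nil {
			return err
		}
		flusher.Flush()
	}
	return nil
}

// demo streams a single event into an in-memory recorder and returns the body.
func demo() string {
	ch := make(chan Event, 1)
	ch <- Event{Time: "13:54:26", Phase: "tofu-apply", Level: "info", Message: "applying"}
	close(ch)
	rec := httptest.NewRecorder()
	if err := streamEvents(rec, ch); err != nil {
		return err.Error()
	}
	return rec.Body.String()
}

func main() {
	fmt.Print(demo())
}
```

Each frame ends with a blank line, which is what tells the browser's EventSource that one message is complete.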

Closes #122.
2026-04-28 13:54:26 +02:00
hatiyildiz
12387a4a74 feat(billing): /billing/vouchers/{issue,list,revoke,redeem-preview} surface
#117 adds a franchise-aligned URL surface for the existing PromoCode
voucher implementation, plus one new endpoint (redeem-preview) for the
public landing flow described in docs/FRANCHISE-MODEL.md §3.

The orchestrator's hint was right — the issue/list/revoke handlers
already exist (AdminUpsertPromo / AdminListPromos / AdminDeletePromo
on the legacy /billing/admin/promos surface). This commit:

1. Adds new endpoint handlers in core/services/billing/handlers/vouchers.go:
   - POST   /billing/vouchers/issue          (superadmin or sovereign-admin)
   - GET    /billing/vouchers/list           (superadmin or sovereign-admin)
   - DELETE /billing/vouchers/revoke/{code}  (superadmin or sovereign-admin)
   - POST   /billing/vouchers/redeem-preview (unauthenticated; public)

   The first three reuse the existing store-layer methods. The last is
   new — it validates a code without consuming it, returning a safe
   shape (no times_redeemed, no max_redemptions exposure) so an
   attacker scraping the public endpoint cannot enumerate cap status.

2. Distinguishes 404 (code never existed or soft-deleted — same
   tombstone-leak protection as #91) from 410 Gone (code exists but is
   inactive or capped). The 410 body still includes the credit and
   description so the landing page can show "this campaign has ended".

3. Keeps the legacy /billing/admin/promos endpoints in place — the
   existing admin UI continues to work without any breaking change.
   New code should target /billing/vouchers/...

4. Updates docs/FRANCHISE-MODEL.md to point to the new URL surface.
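The status mapping in items 1 and 2 can be sketched as a pure function. The struct fields, JSON keys, and the rule that a zero cap means "unlimited" are assumptions for illustration; the real handler in vouchers.go is not shown in this log.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Voucher is a hypothetical store-layer row.
type Voucher struct {
	Code, Description string
	CreditOMR         int
	Active, Deleted   bool
	TimesRedeemed     int
	MaxRedemptions    int // assumption: 0 means no cap
}

// Preview is the public shape: no times_redeemed, no max_redemptions.
type Preview struct {
	Code        string `json:"code"`
	CreditOMR   int    `json:"credit_omr"`
	Description string `json:"description"`
}

// preview validates a code without consuming it and picks the HTTP status.
func preview(find func(string) (Voucher, bool), raw string) (int, *Preview) {
	code := strings.TrimSpace(raw)
	if code == "" {
		return http.StatusBadRequest, nil // whitespace-only input
	}
	v, ok := find(code)
	if !ok || v.Deleted {
		return http.StatusNotFound, nil // unknown and soft-deleted look identical
	}
	body := &Preview{Code: v.Code, CreditOMR: v.CreditOMR, Description: v.Description}
	capped := v.MaxRedemptions > 0 && v.TimesRedeemed >= v.MaxRedemptions
	if !v.Active || capped {
		return http.StatusGone, body // body kept so the page can show "campaign ended"
	}
	return http.StatusOK, body
}

func main() {
	store := map[string]Voucher{
		"LIVE": {Code: "LIVE", CreditOMR: 10, Active: true},
		"DONE": {Code: "DONE", CreditOMR: 5, Active: true, TimesRedeemed: 3, MaxRedemptions: 3},
	}
	find := func(c string) (Voucher, bool) { v, ok := store[c]; return v, ok }
	for _, c := range []string{"LIVE", "DONE", "NOPE", "  "} {
		status, _ := preview(find, c)
		fmt.Printf("%q -> %d\n", c, status)
	}
}
```

Returning the same 404 for unknown and soft-deleted codes is what keeps deleted vouchers from being distinguishable on the public surface.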

The actual REDEMPTION still happens transactionally inside POST
/billing/checkout via the `promo_code` field — that path locks the
promo row, inserts the promo_redemptions edge, increments
times_redeemed, and adds the credit_ledger entry in one transaction.
Splitting it into a separate /redeem endpoint would break that
atomicity, so we deliberately do not add one. The public redeem flow
is preview → signup → checkout-with-promo_code.

Closes #117.
2026-04-28 13:54:19 +02:00
hatiyildiz
e7a74f0eef feat(infra/hetzner): bump default to cx42, add OS hardening + operator README
Group J — closes #127, #128, #129, #130, #131, #132.

Defaults
- control_plane_size default cx42 (16 GB) — cx32 (8 GB) is INSUFFICIENT
  for a solo Sovereign per PLATFORM-TECH-STACK.md §7.1 (~11.3 GB Catalyst)
  + §7.4 (~8.8 GB per-host-cluster) = ~20 GB minimum. The previous cx32
  default would OOM during the OpenBao + Keycloak step of bootstrap.
- New k3s_version variable (v1.31.4+k3s1) — pinned, validated against
  the INSTALL_K3S_VERSION format. Previously hardcoded inside the
  cloud-init templates, in violation of INVIOLABLE-PRINCIPLES.md §4.

Validation
- Region restricted to the 5 known Hetzner locations.
- control_plane_size + worker_size restricted to the cxNN | ccxNN | caxNN
  namespace (blocks tiny dev sizes that would OOM at runtime).
- k3s_version regex matches the upstream installer's version format.
- ssh_allowed_cidrs validated as proper CIDRs.
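The module expresses these rules as OpenTofu variable validations; the Go sketch below mirrors three of them so the intent is easy to test in isolation. The regular expressions are assumptions inferred from the description (for example, the k3s pattern is fitted to v1.31.4+k3s1), not copies of variables.tf, and the region list is omitted because the message does not enumerate it.

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

var (
	// sizeRE assumes server types of the form cxNN, ccxNN, or caxNN.
	sizeRE = regexp.MustCompile(`^(cx|ccx|cax)[0-9]+$`)
	// k3sRE assumes versions shaped like v1.31.4+k3s1.
	k3sRE = regexp.MustCompile(`^v[0-9]+\.[0-9]+\.[0-9]+\+k3s[0-9]+$`)
)

// validate mirrors the size, k3s_version, and ssh_allowed_cidrs checks.
func validate(size, k3sVersion string, sshCIDRs []string) error {
	if !sizeRE.MatchString(size) {
		return fmt.Errorf("server size %q is outside cxNN | ccxNN | caxNN", size)
	}
	if !k3sRE.MatchString(k3sVersion) {
		return fmt.Errorf("k3s_version %q does not look like vX.Y.Z+k3sN", k3sVersion)
	}
	for _, c := range sshCIDRs {
		if _, _, err := net.ParseCIDR(c); err != nil {
			return fmt.Errorf("ssh_allowed_cidrs: %w", err)
		}
	}
	return nil
}

func main() {
	fmt.Println(validate("cx42", "v1.31.4+k3s1", []string{"203.0.113.0/24"}))
	fmt.Println(validate("cpx11", "v1.31.4+k3s1", nil))
	fmt.Println(validate("cx42", "v1.31.4+k3s1", []string{"10.0.0.0/33"}))
}
```

An empty CIDR list passes, which matches the documented default of no SSH rule at the firewall.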

Firewall
- Document each open port (80, 443, 6443, ICMP) and each blocked port
  (22, 10250, 2379/2380, 8472) in README.md §"Firewall rules".
- SSH (22) is now a dynamic rule keyed off ssh_allowed_cidrs (default
  empty = no SSH at the firewall, break-glass via Hetzner Console).

OS hardening (cloudinit-*.tftpl)
- sshd drop-in: PasswordAuthentication no, PermitRootLogin
  prohibit-password, no forwarding, MaxAuthTries=3, LoginGraceTime=30.
- enable_unattended_upgrades (default true): security-only pocket,
  auto-reboot at 02:30, removes unused kernels.
- enable_fail2ban (default true): sshd jail, systemd backend.
- Both control-plane and worker templates carry the same baseline.

Documentation
- New infra/hetzner/README.md (operator-facing) covers:
  * What the module creates + Phase-0/Phase-1 boundary.
  * Sizing rationale with the §7.1+§7.4 RAM math + upgrade path.
  * Firewall rules: every open port, every blocked port, every
    deliberate egress flow.
  * k3s flag-by-flag rationale tied to PLATFORM-TECH-STACK.md §8.
  * SSH key management: why no auto-generated keys (break-glass +
    audit-trail + custody + compliance).
  * OS hardening table.
  * Standalone CLI invocation pattern (tofu apply -var-file=...).
  * What the module does NOT do (Crossplane / Flux territory).

Closes #127 #128 #129 #130 #131 #132
2026-04-28 13:54:15 +02:00
hatiyildiz
e8c3f6fd05 docs(runbook-provisioning): operator-level guide for sovereign-cloud teams
Closes #136.

New runbook companion to SOVEREIGN-PROVISIONING.md (the architectural
contract) and PROVISIONING-PLAN.md (the Catalyst-Zero waterfall).
Audience: a Sovereign cloud team (e.g. omantel-cloud) onboarding their
first Sovereign via Catalyst-Zero at console.openova.io/sovereign.

Sections:
1. What you get end-to-end
2. Pre-flight checklist (Hetzner project, API token, SSH key, region,
   domain mode, org name+email, topology) with cost estimate
3. Step-by-step:
   a. Open the wizard
   b. Walk the 7 steps with what each captures and why
   c. Watch the SSE event log (5 phases: tofu-init/plan/apply/output/flux-bootstrap)
   d. First login + DNS / cert-manager / CNAME caveats
   e. Day-1 setup checklist linked to SOVEREIGN-PROVISIONING.md §5
4. Troubleshooting matrix with 8 common failure modes mapped to recovery
   steps (token scope, hcloud quota, regional capacity, Cilium readiness
   chicken-and-egg, Let's Encrypt rate-limit, DNS propagation, Keycloak SMTP)
5. Re-runs + idempotency notes (tofu apply on existing state is safe)
6. Decommission flow tying back to SOVEREIGN-PROVISIONING.md §10.2

All claims about runtime behaviour cross-link to the canonical artifacts:
provisioner.go for the SSE phases, infra/hetzner/main.tf for resource
shape, cloudinit-control-plane.tftpl for the k3s+Flux bootstrap. Per
INVIOLABLE-PRINCIPLES.md #7 the runbook flags Group M DoD as pending —
it is operator-facing documentation of the deployed shape, not a claim
of end-to-end runtime verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:54:14 +02:00
hatiyildiz
171ff9c883 feat(catalyst-ui): bootstrap-progress widget for 11-phase indicator (#121)
Adds the canonical phase list (5 OpenTofu Phase 0 + 11 bootstrap-kit) as
the single source of truth, the SSE hook that consumes the catalyst-api
provisioning stream, and the vertical step-progress indicator widget.

The phase list is keyed off the actual provisioner.go emit() phase ids
and the documented bootstrap-kit dependency order from PROVISIONING-PLAN.md
"Phase 5 — Bootstrap kit" + SOVEREIGN-PROVISIONING.md §3-§4. No hardcoded
component versions or provider URLs — every phase entry is a configuration
record consumed both by the indicator widget and the log-stream filter.

The indicator renders a checkpoint per component installed, splits Layer A
(OpenTofu) from Layer B (bootstrap-kit) with section headers, and exposes
status + duration + sovereign-state markers so operators can correlate
with backend logs.

Closes #121.
2026-04-28 13:54:14 +02:00