Commit Graph

662 Commits

Author SHA1 Message Date
hatiyildiz
e28b16935a merge: cloudinit Cilium k8sServiceHost=127.0.0.1 (verified working on omantel) 2026-04-29 15:31:34 +02:00
hatiyildiz
548720095a fix(cloudinit): use 127.0.0.1 for Cilium k8sServiceHost (host's local apiserver)
Cilium with --set k8sServiceHost=10.0.1.2 (the cp1 private NIC IP) sat
in init phase forever — the agent's API client kept logging
"Establishing connection to apiserver host=https://10.0.1.2:6443" and
never got a response, even though `curl https://10.0.1.2:6443/healthz`
from the host returned 401 (TLS+auth challenge = endpoint reachable).

Switching to k8sServiceHost=127.0.0.1 brought the DaemonSet up
immediately. Verified end-to-end on the live cluster:

  $ kubectl get nodes
  catalyst-omantel-omani-works-cp1   Ready   ...   32m   v1.31.4+k3s1

The node's local apiserver always binds 127.0.0.1:6443; using that as
the bootstrap apiserver endpoint sidesteps whatever was rejecting the
private-NIC IP route during Cilium's pre-CNI bring-up. Once Cilium is
the CNI and the cluster has real Service VIPs, every other component
reaches the apiserver via the kubernetes.default service as usual.
2026-04-29 15:31:21 +02:00
github-actions[bot]
9f2f3416f5 deploy: update catalyst images to f0f2513 2026-04-29 13:30:31 +00:00
hatiyildiz
f0f2513c3d merge: cloudinit installs Cilium before Flux (fix CNI bootstrap deadlock) 2026-04-29 15:29:20 +02:00
hatiyildiz
e571ec7aa2 fix(cloudinit): install Cilium BEFORE Flux to break CNI bootstrap deadlock
omantel.omani.works deployment 5cd1bceaaacb71f6 reached Phase 0 success
(10 Hetzner resources up, LB IP 49.12.16.160, DNS committed via PDM)
but stayed silent for 25 minutes — `https://console.omantel.omani.works`
returned no response, every Flux pod was Pending, and the node was
NotReady. SSH'd into the cp1 box (firewall opened temporarily for the
operator IP) and found the canonical CNI bootstrap deadlock:

  Ready: False  (KubeletNotReady)
  message: container runtime network not ready: NetworkReady=false
   reason:NetworkPluginNotReady cni plugin not initialized

cloud-init started k3s with --flannel-backend=none + --disable-network-policy
(the right Cilium-ready posture), then immediately applied the Flux
install.yaml. Flux pods are Pending because there is no CNI yet, so
Flux never starts → never reconciles bp-cilium → CNI never installs →
deadlock. The "wait for deployment Available --timeout=300s" line
silently times out and cloud-init proceeds anyway with the Flux
GitRepository + Kustomization that nothing reconciles.

Resolution: install Cilium ONCE in cloud-init via the canonical Helm
chart at the SAME version (1.16.5) that platform/cilium/blueprint.yaml
declares for bp-cilium. When Flux later reconciles
clusters/<sovereign_fqdn>/bootstrap-kit/01-cilium.yaml it adopts the
existing Helm release (release name + namespace match), so the wizard's
ownership model stays single-source-of-truth (Flux + Blueprints) after
the bootstrap exception.

Per INVIOLABLE-PRINCIPLES.md #3, this Helm install is the one-shot
bootstrap exception authorised by "the GitOps engine is Flux —
everything ELSE gets installed by Flux". Cilium IS the CNI Flux needs,
so it cannot be installed by Flux without bootstrapping itself first.
Every other component still flows through the Blueprint pipeline.

Verified: ssh'd into the running omantel cp1 (firewall opened for the
operator IP), ran the same `helm install cilium ...` command this
patch encodes, and the cluster recovered — node Ready, Flux pods
scheduling, GitRepository pulling. Will redeploy from scratch with
the patched cloud-init to validate the full unattended path.

Cloud-init is the Phase-0 OpenTofu artifact baked into the Hetzner
server's user_data, so this change activates on the NEXT `tofu apply`
that creates a new control-plane server. Existing omantel cp1 is
manually unblocked already; new Sovereigns provisioned after the
catalyst-api image with this template is rolled out will not hit the
deadlock.
2026-04-29 15:29:10 +02:00
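The deadlock described in e571ec7aa2 is a dependency cycle, and the fix breaks it by adding a step that does not depend on Flux. A minimal sketch of that reasoning, using Python's standard graphlib; the step names are illustrative labels and are not identifiers from the repository:

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on every step in its set. Step names are made-up labels.
before_fix = {
    "flux-pods-running": {"cni-ready"},
    "bp-cilium-reconciled": {"flux-pods-running"},
    "cni-ready": {"bp-cilium-reconciled"},
}

# After the fix, cloud-init installs Cilium via Helm, so the CNI no longer
# waits on Flux; Flux adopts the existing release later.
after_fix = {
    "cilium-helm-install": {"k3s-up"},
    "cni-ready": {"cilium-helm-install"},
    "flux-pods-running": {"cni-ready"},
    "bp-cilium-adopted": {"flux-pods-running"},
}


def boot_order(deps):
    """Return a valid start order, or None when the dependencies form a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None
```

Calling boot_order(before_fix) reports the cycle by returning None, while boot_order(after_fix) yields an order in which the Helm install precedes the Flux pods.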
hatiyildiz
7a10ae6c4e merge: SSE events buffer + replay endpoint for completed deployments 2026-04-29 15:27:21 +02:00
hatiyildiz
29fcb9a8db fix(catalyst-api): buffer SSE events on Deployment + replay on connect for ProvisionPage history
Closes the user-reported regression "this is empty are you sure this is
progressing?" — `/sovereign/provision/<id>` rendered `0 events · done`
even when the deployment succeeded with 10 Hetzner resources, because a
browser that connected after `event: done` arrived at an already-closed
channel with nothing to replay.

API:
- Add `eventsBuf` durable slice (mutex-guarded) on `Deployment`, capped
  at 10,000 events with FIFO eviction so a runaway producer cannot OOM.
- Tee every emit through `recordEvent` — single source of truth for the
  buffer + the live channel, so they cannot diverge.
- StreamLogs replays the buffer on connect; if the deployment is already
  done, replays + emits `event: done` and closes.
- New `GET /api/v1/deployments/{id}/events` returns slice + state JSON
  for stateless reconnect / fast-path render.
- `Deployment.State()` includes `numEvents` summary.
- New tests prove buffer fill, replay-on-completed, GET endpoint shape,
  and FIFO eviction at cap.

UI:
- ProvisionPage fetches GET /events on mount BEFORE attaching the SSE
  stream; replays through `applyEventToContext()` so a deep-link to a
  completed deployment renders the FULL history of bubbles + log
  entries instead of an empty shell.
- Live SSE `seen` counter de-duplicates the SSE replay-on-connect
  against the GET fetch we already applied.
- Elapsed clock anchors on first event time for completed deployments.
- 4 new vitest tests (153 total) cover the GET fetch, completed-state
  bubble flip, 404 graceful handling, and elapsed-clock anchor.

Closes #180.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:23:39 +02:00
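The buffering and replay behaviour described in 29fcb9a8db can be sketched roughly as follows. This is an illustration in Python, not the catalyst-api Go code; the class and function names are hypothetical, and the index-based skip assumes no eviction happened between the GET fetch and the SSE connect:

```python
import threading
from collections import deque


class EventBuffer:
    """Capped, lock-guarded event history with FIFO eviction."""

    def __init__(self, cap=10_000):
        self._lock = threading.Lock()
        # A deque with maxlen silently drops the oldest item once full.
        self._events = deque(maxlen=cap)
        self._done = False

    def record(self, event):
        with self._lock:
            self._events.append(event)

    def mark_done(self):
        with self._lock:
            self._done = True

    def snapshot(self):
        """Return a copy of the buffered events plus the done flag."""
        with self._lock:
            return list(self._events), self._done


def replay(buffer, already_seen=0):
    """Yield events the client has not applied yet, then 'done' if finished."""
    events, done = buffer.snapshot()
    for event in events[already_seen:]:
        yield event
    if done:
        yield "done"
```

A client that already applied N events from the GET endpoint would pass already_seen=N, which plays the role of the `seen` counter mentioned in the UI notes.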
github-actions[bot]
6f9c0b261a deploy: update catalyst images to b3497e6 2026-04-29 13:15:38 +00:00
hatiyildiz
b3497e6a16 merge: smoke-test bp-* via fixed-string grep on extracted bundle (false-negative fix) 2026-04-29 15:14:27 +02:00
hatiyildiz
19e7ba14e8 ci(catalyst-build): smoke-test bp-* bundle presence via fixed-string grep on extracted bundle 2026-04-29 15:14:16 +02:00
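For 19e7ba14e8, a fixed-string presence check can be sketched as below. The log does not show the CI step itself, so the function name and the sample ids are assumptions; the point is only that a literal substring test gives regex metacharacters no special meaning:

```python
def missing_ids(bundle_text, ids):
    """Return every id that does not occur literally in the bundle text."""
    # The `in` operator on strings is a plain substring test, not a regex.
    return [blueprint_id for blueprint_id in ids if blueprint_id not in bundle_text]
```

A smoke step built on this would fail the build whenever the returned list is non-empty.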
hatiyildiz
a336a3315c merge: bundle bootstrap-kit + platform + products into catalyst-ui build (fix empty Provision DAG) 2026-04-29 15:10:09 +02:00
hatiyildiz
0898a0dfd9 fix(ui): bundle bootstrap-kit + platform + products into catalyst-ui build
The wizard's /sovereign/provision/<id> page rendered only 2 supernodes
(Hetzner-infra + Flux-bootstrap) instead of the 11 bootstrap-kit
Blueprints + the user's selected components. Verified by grepping the
deployed bundle:

  $ kubectl exec -n catalyst <ui-pod> -- \
      grep -c "bp-cilium\|bp-cert-manager" /usr/share/nginx/html/assets/index-*.js
  0

Root cause: scripts/build-catalog.mjs computes REPO_ROOT relative to the
script's own location and walks platform/<name>/blueprint.yaml,
products/<name>/blueprint.yaml, clusters/_template/bootstrap-kit/. The
docker build context for catalyst-ui was set to
products/catalyst/bootstrap/ui/, so REPO_ROOT in the container resolved
to a directory ABOVE the build context that holds nothing. The script
silently emitted catalog.generated.ts with BOOTSTRAP_KIT = [] and
ALL_BLUEPRINTS = [], shipping an empty bundle.

Three coupled fixes (no band-aid):

1. scripts/build-catalog.mjs — accept OPENOVA_REPO_ROOT env override AND
   fail loudly with a clear message if any of platform/, products/,
   clusters/_template/bootstrap-kit/ is missing. A future
   misconfigured context cannot silently regress the bundle.

2. products/catalyst/bootstrap/ui/Containerfile — build context is now
   /repo (the OpenOva repo root). Containerfile COPYs the four needed
   subtrees explicitly (platform/, products/, clusters/_template/
   bootstrap-kit/, products/catalyst/bootstrap/ui/) and exports
   OPENOVA_REPO_ROOT=/repo so the prebuild script picks them up.

3. .github/workflows/catalyst-build.yaml — UI build context flipped from
   openova-src/products/catalyst/bootstrap/ui to openova-src. Plus a new
   bootstrap-kit smoke test that asserts every bp-* id (cilium,
   cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream,
   openbao, keycloak, gitea) is present in the built bundle. Failure of
   this step fails the build — the regression is now caught in CI, not
   by the user staring at an empty progress page.

Verified locally: `node scripts/build-catalog.mjs` still emits 11
blueprints when run from the dev path (with no env override set, the
script falls back to the relative-resolve mode).
2026-04-29 15:09:58 +02:00
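The first fix in 0898a0dfd9 (env override plus a loud failure) might look like the sketch below. It is written in Python for illustration, whereas the real script is build-catalog.mjs; the function name and the number of parent directories climbed in the fallback are assumptions:

```python
import os
from pathlib import Path

# Directories named in the commit message as required under the repo root.
REQUIRED = ("platform", "products", "clusters/_template/bootstrap-kit")


def resolve_repo_root(script_path, env=None):
    """Prefer OPENOVA_REPO_ROOT, else resolve relative to the script; fail loudly."""
    env = os.environ if env is None else env
    override = env.get("OPENOVA_REPO_ROOT")
    if override:
        root = Path(override)
    else:
        # Assumption: the script sits one directory below the root it walks.
        root = Path(script_path).resolve().parents[1]
    missing = [name for name in REQUIRED if not (root / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"repo root {root} is missing: {', '.join(missing)}")
    return root
```

Raising instead of returning an empty catalog is the design choice the commit describes: a misconfigured build context stops the build rather than shipping an empty bundle.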
github-actions[bot]
364a4903d4 deploy: update catalyst images to 7b1eb7b 2026-04-29 13:04:35 +00:00
hatiyildiz
7b1eb7badf merge: per-brand-colour logo tiles (Alloy orange, FerretDB navy, Temporal blue, etc.) 2026-04-29 15:01:28 +02:00
hatiyildiz
8d99acf38c fix(wizard): logo tiles use each project's canonical brand colour as backplate
Replaces the synthetic 2-tone classification (light=slate-900,
color=slate-100) with a per-brand surface map keyed by each project's
canonical homepage / press-kit colour. Every component's logo tile now
renders against its own brand surface — exactly how each project
displays its mark on its own homepage:

  - Alloy → Grafana orange (#FF671D), white wordmark crisp
  - FerretDB → navy (#042B41), fawn glyph clearly visible
  - Temporal → signature blue (#127ED1), white logo crisp
  - Cilium → navy (#1A2236), hexagon mosaic visible
  - Grafana → dark navy (#0B0F19), orange-yellow gradient pops
  - Cert-manager / OpenSearch → white tile (matches their on-white brand)
  - Stalwart → navy (#100E42), coral red wordmark
  - Strimzi → navy (#192C47), cyan accent visible

Per-brand surface is theme-INDEPENDENT — homepage logos look the same
regardless of viewer theme, and the wizard mirrors that. The card
BODY surrounding the tile still flips with the wizard theme; only the
LOGO TILE is brand-locked.

Internal letter-mark components without a finalized upstream brand
mark (axon, bge, continuum, specter, powerdns) are assigned distinct
slate / navy tones from the OpenOva platform palette so the letter
reads cleanly and the tile doesn't visually clash with neighbouring
brand tiles in the same family.

Backwards-compatibility shim retained: `getLogoToneStyle` aliases
`getLogoSurface`, so the four call sites (StepComponents, StepReview,
MarketplaceFamilyPage, MarketplaceProductPage) work unchanged. Their
descriptive comments are updated to reflect the per-brand semantics.

Refs #179

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:55:54 +02:00
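The per-brand surface map and its compatibility alias from 8d99acf38c can be pictured as a plain lookup table. The hex values below come from the commit message; the component id keys, the fallback colour, and the Python naming are assumptions, since the real code is TypeScript:

```python
# Hex values are quoted from the commit message; the keys are assumed ids.
BRAND_SURFACES = {
    "alloy": "#FF671D",
    "ferretdb": "#042B41",
    "temporal": "#127ED1",
    "cilium": "#1A2236",
    "grafana": "#0B0F19",
    "stalwart": "#100E42",
    "strimzi": "#192C47",
}

# Placeholder fallback; the commit does not state a default tone.
DEFAULT_SURFACE = "#0F172A"


def get_logo_surface(component_id):
    """Return the tile backplate colour for a component, independent of theme."""
    return BRAND_SURFACES.get(component_id, DEFAULT_SURFACE)


# Backwards-compatibility shim: the old name points at the new function.
get_logo_tone_style = get_logo_surface
```

Because the old name is a direct alias, existing call sites keep working without any change.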
github-actions[bot]
cbc09d1109 deploy: update catalyst images to 58b5d6d 2026-04-29 12:54:22 +00:00
hatiyildiz
58b5d6d6f4 merge: drop redundant null_resource.dns_pool — PDM owns DNS writes 2026-04-29 14:53:06 +02:00
hatiyildiz
330211d275 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168 PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it dual-wrote DNS records
(when it worked) and broke the entire provision flow (when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:57 +02:00
hatiyildiz
132d3dcd38 fix(tofu): drop redundant null_resource.dns_pool — PDM owns DNS writes
Every tofu apply on a pool deployment was hitting:

  null_resource.dns_pool[0]: Provisioning with 'local-exec'...
  null_resource.dns_pool[0] (local-exec): (output suppressed due to sensitive value in config)
  Error: Invalid field in API request
  catalyst-dns: write DNS: add *.omantel record: dynadot api error: code=

Two separate code paths were both writing Dynadot records for the same
deployment:

  1. The OpenTofu module's null_resource.dns_pool — a local-exec that
     shells out to /usr/local/bin/catalyst-dns inside the catalyst-api
     container. The binary's request payload is rejected by Dynadot.
  2. catalyst-api's pool-domain-manager call — pdm.Commit() at
     handler/deployments.go:247 writes the canonical record set with the
     LB IP after tofu apply returns. This path works.

Per #168 PDM is the single owner of all pool-domain Dynadot writes.
The null_resource path is a pre-#168 artifact that should have been
removed when PDM took ownership; keeping it dual-wrote DNS records
(when it worked) and broke the entire provision flow (when it didn't).

Verified end-to-end against the live catalyst-api at
console.openova.io: tofu apply created 7 of 11 Hetzner resources
(network, firewall, subnet, LB, 2 LB services, ssh_key) before
failing at null_resource.dns_pool[0]. With this commit the DNS-write
step disappears from the plan, and PDM /commit handles record
creation after the LB IP is known.

The dynadot_key + dynadot_secret variables in variables.tf remain
declared (provisioner.go still passes them through tfvars.json) but
are no longer referenced by any resource. Removing them is a separate
sweep — left for a follow-up to keep this commit narrowly scoped to
the failure path.
2026-04-29 14:52:24 +02:00
github-actions[bot]
96f4fe9265 deploy: update catalyst images to 80b86a1 2026-04-29 12:45:06 +00:00
hatiyildiz
80b86a14ac merge: accept cpx* SKU family + empty worker_size for solo Sovereigns 2026-04-29 14:44:02 +02:00
hatiyildiz
c6cbfe684c fix(tofu): accept cpx* SKU family + empty worker_size for solo Sovereigns
The wizard's recommended Hetzner SKU is CPX32 (4 vCPU AMD / 8 GB / €0.0232/hr)
but the module's variables.tf validation rule only accepted the cx / ccx /
cax families — CPX (AMD shared) was missing entirely. Every Launch through
the wizard hit:

  Error: Invalid value for variable
  on variables.tf line 68: variable "control_plane_size" {
  var.control_plane_size is "cpx32"
  control_plane_size must match Hetzner server-type naming (cxNN | ccxNN | caxNN)

Solo Sovereigns (worker_count = 0) also legitimately have an empty
worker_size — the validation rejected that too:

  Error: Invalid value for variable
  on variables.tf line 91: variable "worker_size" {
  var.worker_size is ""

Both fixed by extending the regex with the cpx* family AND permitting
the empty string on worker_size when the operator runs a solo Sovereign.

Reproduced end-to-end against the deployed catalyst-api before the fix:
the SSE stream surfaced exactly these two validation errors. With the
regex updated they no longer fire; a run now clears module validation
and fails only if it lacks a real Hetzner token.
2026-04-29 14:43:52 +02:00
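The validation change in c6cbfe684c can be approximated as follows. The actual regex in variables.tf is not shown in the log, so the pattern here is an assumption inferred from the error message, and tying the empty worker_size to worker_count == 0 is likewise an assumption:

```python
import re

# Assumed pattern: the original cx / ccx / cax families plus the added cpx.
SERVER_TYPE = re.compile(r"(cx|ccx|cax|cpx)\d+")


def valid_control_plane_size(value):
    """A control-plane size must always name a server type."""
    return SERVER_TYPE.fullmatch(value) is not None


def valid_worker_size(value, worker_count):
    """An empty worker size is acceptable only for a solo Sovereign."""
    if value == "":
        return worker_count == 0
    return SERVER_TYPE.fullmatch(value) is not None
```

With this shape, "cpx32" passes for the control plane and an empty worker_size passes only when no workers are requested.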
github-actions[bot]
a646afa041 deploy: update catalyst images to dc07b0d 2026-04-29 12:24:02 +00:00
hatiyildiz
dc07b0d68e merge: logo tile mirrors canonical marketplace treatment (theme-aware, Temporal visible) 2026-04-29 14:21:56 +02:00
hatiyildiz
5ba0c1c53b fix(wizard): logo tile mirrors canonical marketplace treatment (theme-aware, Temporal visible)
The universal `rgba(255,255,255,0.96)` tile from 691467b4 dropped
white-on-transparent brand marks (Temporal, LiveKit, Mimir, Tempo,
Velero, OpenBao …) into a blinding white pill — the user's "almost
nothing is visible" complaint.

Mirrors the SME marketplace's per-asset PNG approach
(https://marketplace.openova.io/apps/) with metadata-driven
backplates instead of universal chrome:

  - new `logoTone.ts` classifies every vendored component logo as
    `light` (white-glyph, needs slate-900 backplate) or `color`
    (full-colour or dark-glyph, reads on slate-100). Both tones are
    theme-independent — exactly like marketplace PNGs ship the same
    surface regardless of card theme. Empirically validated against
    every asset under public/component-logos/ on five candidate
    surfaces.
  - StepComponents.tsx — `.corp-comp-logo` tile + IconFallback now
    consume `getLogoToneStyle(entry.id)`.
  - StepReview.tsx — ComponentMiniCard 40×40 tile + LetterFallback
    same.
  - MarketplaceFamilyPage.tsx — `.mp-related-logo` / `.mp-related-icon`
    CSS rules now own geometry only; surface is per-asset inline
    style.
  - MarketplaceProductPage.tsx — `.mp-product-logo` /
    `.mp-product-icon` same pattern on the 80×80 hero tile.

Per-component verification (dark + light wizard themes):
  Temporal       — light tone → slate-900 backplate, white logo crisp
  Cilium         — color tone → slate-100, full hexagon visible
  Cert-manager   — color tone → slate-100, blue badge readable
  Grafana        — color tone → slate-100, orange G readable
  Strimzi        — color tone → slate-100, dark mark visible
  Keycloak       — color tone → slate-100, color badge readable
  FerretDB       — color tone → slate-100, wordmark + glyph visible

Gates: tsc --noEmit clean · 149/149 vitest tests pass · vite build OK.
2026-04-29 14:21:12 +02:00
hatiyildiz
cea9621072 merge: bundle OpenTofu CLI in catalyst-api image; fix catalyst-system → catalyst namespace string 2026-04-29 14:08:36 +02:00
hatiyildiz
9b6c297dd8 fix(catalyst-api): bundle OpenTofu CLI in runtime image (pinned + checksum verified)
The previous image bundled the infra/hetzner/ .tf sources but not the tofu
binary itself, so every Launch failed with:

  tofu init: exec: "tofu": executable file not found in $PATH

Add a dedicated builder stage that downloads OpenTofu v1.11.6 from the
canonical GitHub release, verifies the SHA256 against the upstream
SHA256SUMS file before extraction, and ships the binary into the runtime
image at /usr/local/bin/tofu (mode 0755 so UID 65534 can exec it). The
stage branches on $TARGETARCH (amd64 / arm64) to keep multi-arch buildx
correct; both arch checksums are pinned as build args so version bumps
are an explicit two-line change.

Add a CI smoke step in catalyst-build.yaml's build-api job that runs
`tofu version` inside the freshly-built image and asserts the output
matches EXPECTED_TOFU_VERSION; failure fails the build. Also re-run with
`--user 65534:65534` to gate exec-as-non-root at build time. The prior
infra/hetzner/ presence smoke step is preserved unchanged.

Sibling fix in ProvisionPage's FailureCard: the kubectl hint pointed at
namespace `catalyst-system`, but catalyst-api actually runs in namespace
`catalyst` (per chart/templates/api-deployment.yaml + live cluster).
Replace the namespace literal so the diagnostic command copy-pastes
correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:08:03 +02:00
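The verify-before-extract step in 9b6c297dd8 follows a common pattern: hash the downloaded archive and compare it with the entry in the published checksum file. A small sketch of that pattern in Python; the function names are hypothetical and the archive file name in the usage note is a placeholder, not taken from the Containerfile:

```python
import hashlib


def expected_digest(sums_text, filename):
    """Look up a file's digest in SHA256SUMS-style text ('<hex>  <name>' lines)."""
    for line in sums_text.splitlines():
        parts = line.split()
        # Some tools prefix the name with '*' to mark binary mode.
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0].lower()
    raise KeyError(filename)


def verify(data, sums_text, filename):
    """Return True only when the SHA-256 of data matches the listed digest."""
    actual = hashlib.sha256(data).hexdigest()
    return actual == expected_digest(sums_text, filename)
```

A build stage would call verify(archive_bytes, sums_text, "tofu_archive.zip") and abort before extraction when it returns False.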
github-actions[bot]
5e3cd1efbe deploy: update catalyst images to 80db0da 2026-04-29 11:56:13 +00:00
hatiyildiz
80db0da908 merge: contrast audit — restore theme tokens on ProvisionPage non-logo surfaces 2026-04-29 13:54:47 +02:00
hatiyildiz
6327d8db8b fix(wizard): contrast audit — restore theme tokens on non-logo surfaces
Provision page styled three surfaces with hardcoded
rgba(255,255,255,...) literals rather than the page's theme tokens.
The theme tokens (--s1, --md, --lo) already flip correctly under
.provision-shell[data-theme="light"], so any element painted with
the raw rgba was theme-locked to dark and washed out / invisible
against the light radial-gradient page background.

Three surfaces switched to tokens that already exist on the same
page and flip per-theme:

  • DAG bubble label fill (pending state) — colour
    rgba(255,255,255,0.45) → var(--lo)
    Dark: --lo = rgba(255,255,255,0.40) (≈ same)
    Light: --lo = #475569 (slate-600, readable on light bg)

  • Live-log info-line text — color rgba(255,255,255,.78)
    → var(--md)
    Dark: --md = rgba(255,255,255,0.65)
    Light: --md = #334155 (readable on light log panel)

  • Live-log meta pill + failure-card hint <code> background —
    rgba(255,255,255,.04) → var(--s1)
    Dark: --s1 = rgba(255,255,255,0.04) (unchanged)
    Light: --s1 = #fff (lifted pill on slate page bg)

The wizard StepReview surfaces (Section / Field / RegionCard /
ComponentMiniCard) and the marketplace family/product pages were
already migrated off raw rgba in 4f6dd10a; logo TILES intentionally
keep rgba(255,255,255,0.96) per the documented contract in
StepComponents.tsx LOGO_TILE_BG (vendored brand marks render in
mixed treatments — dark glyphs designed for white backdrops, white
glyphs on transparent — and a near-white pill keeps every glyph
legible regardless of theme).

Verification:
  • npx tsc --noEmit                                       ✓
  • npm run build                                          ✓
  • ./node_modules/.bin/vitest run — 149 passed (149)      ✓
  • Live wizard at /sovereign/wizard — every step's section
    surfaces and card surfaces render with proper contrast in
    BOTH dark and light themes; logo tiles still readable.
  • Live marketplace at /sovereign/marketplace/family/cortex
    and /sovereign/marketplace/product/axon — flat-section
    layout intact, logo tiles crisp.

No layout, no test selectors, no router, no componentGroups.ts,
no providerSizes.ts changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:54:03 +02:00
hatiyildiz
5520e91443 merge: review components — drop family summary, pixel-match marketplace.openova.io/review 2026-04-29 13:49:07 +02:00
hatiyildiz
4f6dd10a20 fix(wizard): review components — drop family summary, pixel-match marketplace.openova.io/review
The Components section on StepReview rendered both a family-summary
mini-card grid (PILOT M5 / SPINE M5 R1 O1 / …) AND a per-component
card grid below. The summary was a duplicate read of the same data —
each per-component card already shows its family chip, so the strip
above counted what the cards already display. Drop it.

The per-component cards themselves were tiny `auto-fill,
minmax(180px, 1fr)` chips with logo + name + tier letter + family
chip. Replace with a pixel-mirror of the canonical `.stack-card` on
https://marketplace.openova.io/review/ — same horizontal flex
layout, 40×40 logo tile, semibold name, low-key category pill, and
single-line description. Tokens map 1:1 (light theme):

  marketplace `--color-bg`            → wizard `--wiz-bg-input`
  marketplace `--color-border`        → wizard `--wiz-border`
  marketplace `--color-text-strong`   → wizard `--wiz-text-hi`
  marketplace `--color-text-dim`      → wizard `--wiz-text-md` (desc),
                                                 `--wiz-text-sub` (cat)

Card geometry verified pixel-identical to marketplace at 1440px
width: padding 10.4px, gap 10.4px, border-radius 8px, card height
66.078125px, 2-column grid with 8px gap collapsing to 1 column under
700px. Tier (M/R/O) intentionally dropped — not on the canonical
card; the Components step before review already enforces tier
semantics. The legend below the grid goes with it.

Section + Field shells switched from `--wiz-bg-xs` to `--wiz-bg-sub`
so the card surfaces lift visibly off the section background in
light mode — the previous near-white tint was the same colour as the
cards, so cards visually melted into the section ("white-on-white").

Verification:
  • npx tsc --noEmit                                       ✓
  • npm run build                                          ✓
  • ./node_modules/.bin/vitest run — 149 passed (149)      ✓
  • Live wizard at /sovereign/wizard step 7 — components section
    renders 2-col grid of stack-card-shaped components, no family
    summary, no tier legend, computed CSS matches marketplace.

POST body to /v1/deployments unchanged. componentGroups.ts,
provider/topology cards, router.tsx untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:48:28 +02:00
github-actions[bot]
e62fd5f3eb deploy: update catalyst images to 7931f79 2026-04-29 11:46:08 +00:00
hatiyildiz
7931f79ac4 merge: bundle infra/hetzner/ tofu module into catalyst-api image 2026-04-29 13:44:50 +02:00
hatiyildiz
61c6122633 fix(catalyst-api): bundle infra/hetzner/ tofu module into the image
The catalyst-api Pod is the OpenTofu runner — provisioner.New() reads
CATALYST_TOFU_MODULE_PATH (default /infra/hetzner) and stageModule()
copies the canonical .tf / .tftpl files into a per-deployment workdir
on every Launch. The previous Containerfile did not COPY the module
in, so every Launch failed:

    {"level":"ERROR","msg":"provision failed",
     "err":"stage tofu module: open /infra/hetzner: no such file or directory"}

Containerfile changes
- Build context is now the public openova repo root (Containerfile
  paths COPY from products/catalyst/bootstrap/api/ explicitly).
- New `COPY infra/hetzner/ /infra/hetzner/` brings the FULL tree
  (main.tf, variables.tf, outputs.tf, versions.tf, cloudinit-*.tftpl,
  README.md) into the runtime image. The path /infra/hetzner/ matches
  provisioner.New()'s default and the catalyst-platform Helm chart's
  CATALYST_TOFU_MODULE_PATH override.

Workflow changes (.github/workflows/catalyst-build.yaml, build-api job)
- context: openova-src/products/catalyst/bootstrap/api -> openova-src
  (the repo root is needed so infra/hetzner/ is in the build context).
- Split build into Build (load: true) + Smoke + Push, mirroring the UI
  job pattern. The smoke step runs `ls -la /infra/hetzner/` inside the
  built image and asserts main.tf, variables.tf, outputs.tf, versions.tf,
  and both cloudinit-*.tftpl files are present. Failure fails the build
  — broken images can no longer ship.

Verification (local)
- go vet ./... + go test ./... in products/catalyst/bootstrap/api: clean
- docker build -f products/catalyst/bootstrap/api/Containerfile . at the
  repo root succeeds; `docker run --rm --entrypoint sh catalyst-api:test
  -c 'ls -la /infra/hetzner/'` lists main.tf, variables.tf, outputs.tf,
  versions.tf, cloudinit-control-plane.tftpl, cloudinit-worker.tftpl.

provisioner.go business logic untouched. catalyst-platform Helm chart
api-deployment.yaml untouched (CATALYST_TOFU_MODULE_PATH already aligns
with /infra/hetzner).
2026-04-29 13:44:11 +02:00
github-actions[bot]
127398e969 deploy: update catalyst images to 36747a3 2026-04-29 11:39:01 +00:00
hatiyildiz
36747a3b26 merge: provision route invariant fix (use internal route id) 2026-04-29 13:38:00 +02:00
hatiyildiz
18d56ab8b8 fix(provision): use internal route id for useParams (basepath stripped)
The /provision/ route is registered against the router's
internal path; '/sovereign' is the basepath, stripped before matching.
The 'from: "/sovereign/provision/$deploymentId"' lookup matched no
route at runtime — TanStack Router throws 'Invariant failed' for any
useParams call against an unknown route id. A cast was hiding the type
error.

This unblocks the SPA route — /sovereign/provision/<id> now renders the
ProvisionPage without throwing.
2026-04-29 13:36:34 +02:00
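The basepath behaviour behind 18d56ab8b8 can be modelled in a few lines. This is a toy matcher in Python, not TanStack Router; the internal route id "/provision/$deploymentId" is assumed from the commit text, and the function names are hypothetical:

```python
BASEPATH = "/sovereign"
ROUTE_IDS = {"/provision/$deploymentId"}  # assumed internal route id


def strip_basepath(path, basepath=BASEPATH):
    """Drop the basepath prefix, as a router does before matching."""
    if path == basepath or path.startswith(basepath + "/"):
        return path[len(basepath):] or "/"
    return path


def match(path):
    """Match a browser path against ROUTE_IDS; return (route_id, params) or None."""
    parts = strip_basepath(path).split("/")
    for route_id in ROUTE_IDS:
        pattern = route_id.split("/")
        if len(pattern) != len(parts):
            continue
        params = {}
        for pat, part in zip(pattern, parts):
            if pat.startswith("$"):
                params[pat[1:]] = part  # dynamic segment
            elif pat != part:
                break  # static segment mismatch
        else:
            return route_id, params
    return None
```

In this model the browser URL still starts with /sovereign, but any lookup by route id has to use the internal form, which is why the prefixed id matched nothing.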
github-actions[bot]
0745945eb8 deploy: update catalyst images to 4e5c75e 2026-04-29 11:17:59 +00:00
hatiyildiz
4e5c75e05c merge: provision as SPA route /sovereign/provision/:deploymentId; fix FQDN, components count, failure UX
# Conflicts:
#	products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepReview.tsx
2026-04-29 13:16:53 +02:00
hatiyildiz
8f8d9c0d8a merge: dense multi-card review rows; per-component cards in Components 2026-04-29 13:15:41 +02:00
hatiyildiz
6a54782c7f merge: neutral high-contrast logo tile across cards, review, marketplace 2026-04-29 13:14:41 +02:00
hatiyildiz
08cd438762 fix(wizard): provision as SPA route /sovereign/provision/:deploymentId; fix FQDN, components count, failure UX
The provision page was a 1198-line static public/provision.html artefact
plus sibling provision.js and catalog.js files. The .html URL was the
visible give-away that the page wasn't first-class — it was rendered
outside the React app, did not share design tokens, did not get bundled,
and could not consume the wizard's zustand store directly. The result
was a page that displayed "omantel.omani-works · SOLO · 0 components ·
Failed" with no actionable detail when something went wrong.

This commit deletes all three static artefacts and ships a real SPA
route at `/sovereign/provision/$deploymentId` instead. Same DAG visual,
same EventSource wiring, same phase→bubble state machine — but as a
React component that:

- reads the deploymentId from URL params (deep-linkable, refresh-safe)
- reads selectedComponents + topology from useWizardStore directly
- resolves the FQDN via resolveSovereignDomain(store) — fixes the
  "omantel.omani-works" hyphen bug; the page now shows "omantel.omani.works"
- renders a real FailureCard when SSE surfaces status="failed", carrying
  the deployment's actual error message + Retry / Back-to-wizard CTAs
- handles 404 / EventSource error with a clean retry surface

Wiring:
- New /sovereign/provision/$deploymentId route in router.tsx
- StepReview's provision() callback now navigates via router.navigate
  instead of window.location.href = path('provision.html')
- BOOTSTRAP_KIT export added to catalog.generated.ts (read from
  clusters/_template/bootstrap-kit/ at build time, ordered by NN- prefix)
  so the React route can import the same source-of-truth the deleted
  catalog.js used to surface as window.CATALYST_CATALOG
- emitPublicCatalog() removed from build-catalog.mjs — no static page
  consumes it any more

Files deleted:
- public/provision.html
- public/provision.js
- public/catalog.js

Files added:
- src/pages/provision/ProvisionPage.tsx (1300+ lines: catalog read,
  expandWithDependencies, buildNodes, buildEdges, computeLayout,
  applyEvent state machine, sidebar, log panel, failure card, status
  pill)

Verified: tsc clean, 149/149 vitest tests pass.
2026-04-29 13:14:31 +02:00
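Commit 08cd438762 says BOOTSTRAP_KIT is ordered by the NN- prefix of the files under clusters/_template/bootstrap-kit/. One way to sketch that ordering in Python; apart from 01-cilium.yaml, the file names in the usage note are hypothetical, and placing unprefixed names last is an assumption:

```python
import re


def order_bootstrap_kit(filenames):
    """Sort file names by their leading numeric 'NN-' prefix."""

    def sort_key(name):
        prefix = re.match(r"(\d+)-", name)
        # Names without a numeric prefix sort after all numbered ones.
        number = int(prefix.group(1)) if prefix else float("inf")
        return (number, name)

    return sorted(filenames, key=sort_key)
```

For example, order_bootstrap_kit(["10-example.yaml", "01-cilium.yaml"]) puts 01-cilium.yaml first, and comparing the prefix as a number keeps the order stable even if a prefix were ever written without zero padding.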
hatiyildiz
9280cd4a4b fix(wizard): dense multi-card review rows; per-component cards in Components
Review page packs small fields/cards in horizontal rows instead of stacking
them top-to-bottom. The Components section now renders every selected
component as its own mini-card (logo + name + family chip + tier) so the
operator sees exactly what will be installed, not just family-level
counts. Reduced section padding and dropped redundant whitespace between
rows so the review fits a typical viewport without scrolling.

The provision()-to-/v1/deployments POST body is unchanged — visual only.
2026-04-29 13:10:41 +02:00
hatiyildiz
691467b486 fix(wizard): neutral high-contrast logo tile across cards, review, marketplace
Component-logos vendored under public/component-logos/ are upstream brand
marks rendered as-shipped — some are dark glyphs designed for white
backdrops, some are white glyphs on transparent (designed for dark
surfaces), some are full-colour. The previous tile (rgba(255,255,255,0.04)
with the icon-fallback using oklch hue rotation) made dark glyphs invisible
in dark mode and white glyphs invisible against the dim tile. Worse, the
contrast story was inconsistent across surfaces — the wizard cards, the
review page, and the marketplace family/product pages each picked their
own background.

This commit pins ONE tile contract used in every place a component logo
renders:

- background: rgba(255,255,255,0.96) (near-white pill, theme-independent)
- border-radius: 10px
- 1px outer border in --wiz-border-sub so the tile doesn't fight the card
- 6px internal padding so tight square SVGs aren't cropped
- IconFallback letter colour pinned to fixed slate (#0f172a) so the letter
  reads against the white tile in BOTH dark- and light-mode themes
  (--wiz-text-hi flips with the theme and would white-out in dark mode)
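
As a rough sketch, the tile contract above could be captured in one shared
constant. The names `LOGO_TILE_STYLE` and `ICON_FALLBACK_COLOR` are
hypothetical, and the `solid` border style is an assumption; the commit
itself applies these values through the CSS classes listed under
"Files updated".

```typescript
// Hypothetical shared constants mirroring the tile contract in this commit.
const LOGO_TILE_STYLE = {
  background: "rgba(255,255,255,0.96)", // near-white, theme-independent
  borderRadius: "10px",
  border: "1px solid var(--wiz-border-sub)", // "solid" is an assumption
  padding: "6px", // keeps tight square SVGs from being cropped
} as const;

// Fixed slate so the fallback letter stays readable on the white tile
// in both dark and light themes.
const ICON_FALLBACK_COLOR = "#0f172a";
```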

Files updated:
- StepComponents.tsx — .corp-comp-logo + IconFallback
- MarketplaceFamilyPage.tsx — .mp-related-logo + .mp-related-icon
- MarketplaceProductPage.tsx — .mp-product-logo + .mp-product-icon

Verified by toggling dark/light theme and walking the wizard +
marketplace pages — every brand mark legible regardless of glyph palette
or theme.
2026-04-29 13:09:37 +02:00
github-actions[bot]
676889d67c deploy: update catalyst images to 4149c44 2026-04-29 10:38:14 +00:00
hatiyildiz
4149c443e4 merge: 4-line card grid; 6-10 word professional descs; full-width text body 2026-04-29 12:36:38 +02:00
hatiyildiz
9af51d980e fix(wizard): 4-line card grid; 6-10 word descs; full-width text body
The wizard component cards were copying the SME marketplace's
`app-body { padding-right: 72px }` pattern, which reserves the right
quarter of every card for an absolute-positioned hover-only round Add
button. Combined with one- to three-word `desc` strings, every card
showed a name, a chip line, a single half-line of description, and a
visually empty right column — a quarter of valuable space wasted.

This change restructures the cards around a rigid 4-line grid that
spans the FULL body width:

  Line 1 — name (left, flex) + family chip + inline toggle (right)
  Line 2 — description line 1 (full width)
  Line 3 — description line 2 (full width, two-line clamp)
  Line 4 — tier chip + dependency chips + SELECTED dot (right)

Chips appear ONLY on line 1 or line 4, never on lines 2-3. The
`.corp-comp-body` no longer reserves any horizontal padding for
overlay buttons; descriptions use the entire body column.

The toggle affordance is relocated from an absolute-positioned 32×32
overlay (top-right of the card, opacity-0 until hover) to an inline
22×22 round button at the trailing edge of line 1, sharing the chip
row with the family chip. It still fades in on card hover and stays
visible when in-cart, but it occupies a single inline cell instead of
reserving a vertical column.

The bottom-right SELECTED text pill is replaced by a compact green
dot anchored to the right end of line 4. The card already conveys
selection through its green border, green-tinted background, and the
green ✓ toggle button on line 1; the loud text pill duplicated those
signals while crowding the dependency chips on cards with deps.

Every component description in `componentGroups.ts` is rewritten as a
professional sentence fragment of 6-10 words, distilled from the long-form
`COMPONENT_COPY.positioning` text in `marketplaceCopy.ts`. Same voice:
factual, technical, terse — no hype, no forbidden vocabulary.

Five before/after samples:
  flux:        "GitOps delivery engine"           → "GitOps reconciler driving every Sovereign cluster from Git"
  cilium:      "CNI & eBPF service mesh"          → "eBPF CNI and service mesh with kernel-level policy"
  cert-manager:"TLS certificate automation"       → "Automated TLS issuance and rotation for every ingress"
  grafana:     "Dashboards & alerting"            → "Curated dashboards across metrics, logs, and traces"
  langfuse:    "LLM observability & tracing"      → "Prompt, completion, and cost tracing for the AI plane"

All 63 component descriptions verified within 6-10 words; no
forbidden vocabulary ("MVP", "for now", "stub", "iterative", "demo");
no marketing fluff. CSS changes preserve the canonical 108px resting
height; tablet/mobile responsive floor unchanged. All 149 vitest
specs continue to pass; existing data-testid selectors
(`toggle-<id>`, `family-chip-<id>`, `tier-<id>`, `selected-<id>`,
`deps-<id>-<dep>`, `includes-<id>`, `component-card-<id>`) are
preserved unchanged.
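
A check like the one described above (6-10 words, no forbidden vocabulary)
might be sketched as follows. `checkDescription` is a hypothetical helper,
not the verification actually used for this commit, and the substring match
is deliberately crude.

```typescript
// Forbidden terms as listed in this commit message.
const FORBIDDEN = ["MVP", "for now", "stub", "iterative", "demo"];

// Count whitespace-separated words; hyphenated terms count as one word.
function wordCount(desc: string): number {
  return desc.trim().split(/\s+/).filter(Boolean).length;
}

// Hypothetical validator: returns a list of problems, empty when the
// description passes. Matching is a case-insensitive substring test, so
// it would also flag words that merely contain a forbidden term.
function checkDescription(desc: string): string[] {
  const problems: string[] = [];
  const n = wordCount(desc);
  if (n < 6 || n > 10) problems.push(`expected 6-10 words, got ${n}`);
  const lower = desc.toLowerCase();
  for (const term of FORBIDDEN) {
    if (lower.includes(term.toLowerCase())) {
      problems.push(`forbidden term: ${term}`);
    }
  }
  return problems;
}
```

Run against the samples above, the old flux description ("GitOps delivery
engine", 3 words) would be rejected and the new one (8 words) accepted.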

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:35:57 +02:00
hatiyildiz
570147cd8f merge: canonical SKU catalogs (Hetzner CPX32 recommended; Huawei c7n.xlarge.2; OCI E5.Flex.2.16; AWS m6i.xlarge; Azure D4s_v5) 2026-04-29 12:32:08 +02:00
hatiyildiz
183c3066f2 fix(wizard): canonical SKU catalog from each provider's pricing page (Hetzner, Huawei, OCI, AWS, Azure)
Replaces the guessed per-provider SKU catalog with values that match what
each cloud provider publishes on its canonical pricing page today
(snapshot 2026-04-29). The previously mixed-up CX (Intel), CPX (AMD),
CAX (ARM), and CCX (dedicated) labels are gone: each id, label,
vCPU/RAM/disk spec, and EUR price now comes directly from the source
pricing page.

Hetzner   (19 SKUs): full CX23/33/43/53 (Intel), CPX22/32/42/52/62 (AMD),
                     CAX11/21/31/41 (ARM), CCX13/23/33/43/53/63 (dedicated).
                     Recommended: CPX32 — 4 vCPU AMD / 8 GB / 160 GB SSD,
                     €0.0232/hr €14.49/mo (founder-stated EU starter).
                     Sources: hetzner.com/cloud/regular-performance,
                     /cost-optimized, /general-purpose.
Huawei    (11 SKUs): s7 / c7n / m7 families across 2/4/8/16 vCPU sizes.
                     Recommended: c7n.xlarge.2 (4 vCPU / 8 GB).
                     Source: huaweicloud.com/intl/en-us/product/ecs/pricing.html
                     (specs cross-checked on Cloud Mercato).
OCI       (11 SKUs): VM.Standard.E5.Flex (AMD Genoa), .E4.Flex (Milan),
                     .Standard3.Flex (Intel), .A1.Flex (Ampere ARM).
                     Recommended: VM.Standard.E5.Flex (2 OCPU / 16 GB).
                     Source: oracle.com/cloud/compute/pricing/
                     ($0.030/OCPU + $0.002/GB AMD; $0.010/OCPU ARM).
AWS       (15 SKUs): m6i / c6i / r6i (Intel Ice Lake) plus m7g (Graviton3
                     ARM) at .large/.xlarge/.2xlarge/.4xlarge.
                     Recommended: m6i.xlarge (4 vCPU / 16 GB).
                     Source: aws.amazon.com/ec2/pricing/on-demand/
                     (us-east-1 Linux on-demand, verified on Vantage).
Azure     (10 SKUs): Dsv5 / Esv5 / Dpsv5 v5 generation (Intel + Ampere ARM)
                     at 2/4/8/16 vCPU sizes.
                     Recommended: Standard_D4s_v5 (4 vCPU / 16 GB).
                     Source: azure.microsoft.com/en-us/pricing/details/
                     virtual-machines/linux/ (West Europe, verified on Vantage).

NodeSize interface gains `disk: number | string` (local SSD GB or
"EBS-only"/"Variable") and `priceMonth: number` (for Hetzner, the published
monthly cap; for hyperscalers, the hourly price × 730). USD list prices are
converted to EUR at 1 USD = 0.92 EUR (snapshot 2026-04, applied once at
table-build time via the priceUSDtoEUR helper).
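
The price arithmetic above can be sketched as below. Only the 0.92 rate,
the 730-hour month, and the name priceUSDtoEUR come from this commit; the
function bodies, the rounding precision, and `monthlyFromHourly` are
assumptions made for illustration.

```typescript
const USD_TO_EUR = 0.92; // snapshot rate stated in this commit
const HOURS_PER_MONTH = 730; // multiplier stated in this commit

// Assumed shape of the helper: convert a USD list price to EUR,
// rounded to 4 decimal places (the rounding is an assumption).
function priceUSDtoEUR(usd: number): number {
  return Math.round(usd * USD_TO_EUR * 10000) / 10000;
}

// Hypothetical helper: hyperscaler monthly price derived from the
// hourly EUR price, rounded to cents.
function monthlyFromHourly(hourlyEur: number): number {
  return Math.round(hourlyEur * HOURS_PER_MONTH * 100) / 100;
}
```

The multiplier is not applied to Hetzner: for CPX32, 0.0232 × 730 comes to
about €16.94, which is above the published €14.49 monthly cap, so the cap
is stored in priceMonth instead.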

StepProvider sublabel now renders disk + monthly cap alongside vCPU/RAM/
hourly. Stale comment references to "cx32"/"cx42" updated to "CPX32" (the
canonical Hetzner page calls it CPX32, never "CX32 — Standard").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:31:32 +02:00