openova/.github/workflows/catalyst-build.yaml
e3mrah 4d24914ae4
feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318) (#346)
* feat(wipe): deployment-level Cancel & Wipe — backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318)

Adds a first-class Phase-0 recovery surface so an operator can purge a
failed pre-handover deployment from the wizard UI without dropping to
hcloud CLI runbooks. Two entry-points, one canonical implementation.

## Backend

NEW: products/catalyst/bootstrap/api/internal/handler/wipe.go
  POST /api/v1/deployments/{id}/wipe — single-flight destructive op:
    1. tofu destroy against the per-deployment workdir (idempotent).
    2. Hetzner orphan force-purge by label-selector
       `catalyst-deployment-id=<id>` (servers, load balancers,
       networks, firewalls, ssh-keys). Belt-and-braces — catches
       resources tofu didn't track (half-failed cloud-init, manual
       experiments). Per docs/INVIOLABLE-PRINCIPLES.md #3 this direct
       API path is fallback ONLY for orphan cleanup, never new
       resource creation.
    3. PDM /v1/release for pool-subdomain Sovereigns (best-effort).
    4. Local cleanup: kubeconfig file (mode 0600), tofu workdir,
       on-disk deployment record JSON.
    5. SSE events stream throughout on the same channel as the
       original provisioning + Phase-1 watch.
    6. Marks Status="wiped"; sync.Map entry reaped after a 60s TTL.

NEW: products/catalyst/bootstrap/api/internal/hetzner/purge.go
  Hetzner Cloud API enumeration + force-delete by label selector.
  Uses a 60s timeout (vs the 10s ValidateToken default) because async
  server-delete jobs can queue. 404s treated as success (already gone).
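
The "404 counts as success" rule above can be sketched in shell. This is
an illustration only, not the code in purge.go (which is Go): the
function names `purge_status_ok` and `purge_one` and the `HCLOUD_TOKEN`
variable are hypothetical. The sketch assumes the public Hetzner Cloud
REST endpoint `https://api.hetzner.cloud/v1/<kind>/<id>` with a Bearer
token.

```shell
#!/usr/bin/env bash

# Hypothetical helper: decide whether a DELETE response means "resource is gone".
# Any 2xx means the delete was accepted; 404 means it was already gone.
purge_status_ok() {
  case "$1" in
    2[0-9][0-9]|404) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: delete one resource and apply the rule above.
# Assumes HCLOUD_TOKEN is exported; kind is e.g. "servers", id is numeric.
# --max-time 60 mirrors the 60s timeout described in the commit message.
purge_one() {
  local kind="$1" id="$2" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 60 \
    -X DELETE -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
    "https://api.hetzner.cloud/v1/${kind}/${id}")
  purge_status_ok "$code"
}
```

Treating 404 as success is what makes a repeated purge safe: a second
run over the same label selector finds nothing left to delete and still
reports a clean result.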

NEW: products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
  Provisioner.Destroy() — runs `tofu destroy -auto-approve` against
  the per-deployment workdir, then removes the workdir on success so
  re-provisioning starts fresh. Re-stages module + tfvars first so a
  partially-cleaned workdir still has what tofu needs.
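
A minimal shell sketch of the destroy-then-clean ordering described
above, assuming the workdir already contains the module and tfvars (the
re-staging step is omitted). `destroy_workdir` and `TOFU_BIN` are
hypothetical names used for illustration; the real implementation is
the Go method Provisioner.Destroy().

```shell
#!/usr/bin/env bash

# Hypothetical sketch: run `tofu destroy` in the per-deployment workdir and
# remove the workdir only when destroy succeeded.
# TOFU_BIN is an assumed override so the flow can be exercised with a stub.
destroy_workdir() {
  local workdir="$1"
  local tofu="${TOFU_BIN:-tofu}"
  if "$tofu" -chdir="$workdir" destroy -auto-approve; then
    # Success: drop the workdir so re-provisioning starts fresh.
    rm -rf -- "$workdir"
  else
    # Failure: keep state on disk so the next attempt can retry the destroy.
    echo "tofu destroy failed; keeping ${workdir}" >&2
    return 1
  fi
}
```

Keeping the workdir on failure matters because the tofu state inside it
is the only record of what still needs destroying.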

TOUCHED: products/catalyst/bootstrap/api/cmd/api/main.go
  Registers POST /api/v1/deployments/{id}/wipe.

## Frontend (aligned with existing CrudModals conventions per founder
##           directive — no ad-hoc surface)

NEW: products/catalyst/bootstrap/ui/src/components/CrudModals/WipeDeploymentModal.tsx
  Two-stage modal built on the canonical ModalShell. Pre-wipe confirm
  view requires the operator to:
    - Type the sovereign FQDN to confirm scope.
    - Re-paste their Hetzner Cloud API token (catalyst-api intentionally
      GCs the original after writeTfvars per credential hygiene).
  Post-wipe success view shows the PurgeReport (servers, load balancers,
  networks, firewalls, ssh-keys removed; tofu/PDM/local-state ✓/✗) and a
  "Start fresh deployment" CTA that navigates to /sovereign.

TOUCHED: products/catalyst/bootstrap/ui/src/components/CrudModals/index.ts
  Re-exports WipeDeploymentModal + WipeReport.

TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/AppsPage.tsx
  FailureCard now exposes a "Cancel & Wipe" red button next to
  "Retry stream" / "Back to wizard" — opens WipeDeploymentModal.

TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/InfrastructureTopology.tsx
  Cloud → Architecture canvas: the `cloud` (root) node action menu
  gains "Cancel & Wipe deployment" as a `danger:true` action,
  alongside the existing "+ Add region". Distinct from the
  per-resource DeleteCascadeConfirm on region/cluster/vCluster — this
  is deployment-scope (Phase-0 orphan purge), the others are
  Crossplane-XRC scope (day-2). The two paths coexist; operators
  choose by what state the deployment is in.

## Why two entry-points

Wizard banner (failed state on AppsPage) — recovery from a known
failure. Already a red-banner page; the button is right there.

Cloud → Architecture cloud-node action — proactive cancel from the
canvas, mirrors how the existing per-resource deletes are reachable.
Same modal, same backend.

## Constraints honoured

- Per docs/INVIOLABLE-PRINCIPLES.md #3 (Crossplane is the ONLY day-2
  IaC): the per-resource DELETE handler at infrastructure.go is
  unchanged and continues to flip XRC deletionPolicy. Wipe operates
  ONLY in Phase-0 scope where Crossplane never adopted resources.
- Per #4 (never hardcode): every endpoint lives behind API_BASE; the
  Hetzner purge enumerates by deterministic label selector built from
  var.sovereign_fqdn (the OpenTofu module's existing tagging convention).
- Per credential hygiene: the Hetzner token is re-prompted at wipe time
  rather than persisted; the modal uses an <input type="password">.

## Refs

#318 — pre-handover wipe spec (this PR closes it)
#317 — handover finalisation (sibling; this PR is the failure-path
       complement)
feedback_idempotent_iac_purge.md — operator runbook this implements
PR #313 — sealed-secrets cleanup (independent; safe to land in any order)
PR #334 — bp-external-secrets split (independent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): catalyst-build event-driven only — drop cron, push-on-main with path filter

Per docs/INVIOLABLE-PRINCIPLES.md (event-driven end to end — Flux
dependsOn, NATS JetStream, SSE, Helm hooks), GitHub Actions must follow
the same model. The previous `schedule: cron 0 3 * * *` daily build was
the only canonical deploy path, which created a 24h roll latency on
every change to the catalyst surface and incentivised "wait for cron"
stalls in operator workflows.

Replaces with:
  on:
    push:
      branches: [main]
      paths:
        - 'core/console/**'
        - 'core/admin/**'
        - 'core/marketplace/**'
        - 'core/marketplace-api/**'
        - 'products/catalyst/bootstrap/**'
        - 'products/catalyst/chart/**'
        - '.github/workflows/catalyst-build.yaml'
    workflow_dispatch:

`workflow_dispatch` retained for ad-hoc re-runs (config-only changes
that bypass the path filter, e.g. a secret rotation that doesn't touch
code). Path filter mirrors the actual surface this workflow rebuilds.

After this lands, every merge to main that touches the catalyst surface
auto-deploys. No cron lag.
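
To check locally whether a change would hit the path filter, something
like the following can help. It is an approximation of the filter list
above written as a regular expression, not GitHub's own glob matcher,
and the function name `touches_catalyst_surface` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical local check: reads changed paths on stdin (one per line) and
# exits 0 if at least one falls under the catalyst-build path filter.
touches_catalyst_surface() {
  grep -q -E '^(core/(console|admin|marketplace|marketplace-api)/|products/catalyst/(bootstrap|chart)/|\.github/workflows/catalyst-build\.yaml$)'
}
```

Example use against the last commit:

  git diff --name-only HEAD~1 HEAD | touches_catalyst_surface \
    && echo "would trigger catalyst-build"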

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:24:40 +04:00


name: Build & Deploy Catalyst
# Event-driven only. Cron is forbidden — the OpenOva architecture is
# event-driven end to end (Flux dependsOn, NATS JetStream, SSE,
# Helm post-install hooks). `push` on the relevant paths is the
# canonical trigger; `workflow_dispatch` exists for ad-hoc re-runs
# without a code change.
on:
  push:
    branches: [main]
    paths:
      - 'core/console/**'
      - 'core/admin/**'
      - 'core/marketplace/**'
      - 'core/marketplace-api/**'
      - 'products/catalyst/bootstrap/**'
      - 'products/catalyst/chart/**'
      - '.github/workflows/catalyst-build.yaml'
  workflow_dispatch:
env:
  REGISTRY: ghcr.io
  UI_IMAGE: ghcr.io/openova-io/openova/catalyst-ui
  API_IMAGE: ghcr.io/openova-io/openova/catalyst-api
jobs:
  build-ui:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      sha_short: ${{ steps.vars.outputs.sha_short }}
    steps:
      - name: Checkout openova-private
        uses: actions/checkout@v4
      - name: Checkout openova (public source)
        uses: actions/checkout@v4
        with:
          repository: openova-io/openova
          path: openova-src
      - name: Set short SHA
        id: vars
        run: echo "sha_short=$(echo $GITHUB_SHA | head -c 7)" >> "$GITHUB_OUTPUT"
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build UI image (test)
        uses: docker/build-push-action@v6
        with:
          # Build context is the repo root so the Vite prebuild script can
          # walk platform/, products/, clusters/_template/bootstrap-kit/ to
          # populate the catalog + BOOTSTRAP_KIT. The Containerfile fails
          # the build if any of those dirs is missing.
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/ui/Containerfile
          push: false
          load: true
          tags: ${{ env.UI_IMAGE }}:test
          build-args: VITE_APP_MODE=selfhosted
      - name: Smoke test UI
        run: |
          docker run -d --name smoke-ui -p 8080:8080 ${{ env.UI_IMAGE }}:test
          sleep 3
          STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/)
          if [ "$STATUS" != "200" ]; then
            echo "Smoke test failed: expected 200 from /, got $STATUS"
            docker stop smoke-ui
            exit 1
          fi
          echo "Smoke test (root) passed: HTTP $STATUS"
          # Logo path regression guard (#173): the wizard's StepComponents
          # references `${BASE}component-logos/<id>.<ext>` where BASE is the
          # Vite base and the extension is whatever the upstream brand mark
          # is published as (some are SVG, some are PNG — we use the canonical
          # upstream asset rather than auto-converting). Inside the
          # catalyst-ui pod nginx serves the file at
          # /component-logos/<id>.<ext> (Traefik strips /sovereign before
          # proxying — see nginx.conf comment). We list every logo path that
          # componentGroups.ts references, so a missing or mis-cased asset
          # fails the build, not the user.
          for path in \
            component-logos/cilium.svg \
            component-logos/flux.svg \
            component-logos/harbor.svg \
            component-logos/grafana.svg \
            component-logos/keycloak.svg \
            component-logos/openbao.svg \
            component-logos/langfuse.png \
            component-logos/vllm.png \
            component-logos/temporal.svg \
            component-logos/stalwart.svg \
            component-logos/cnpg.svg \
            component-logos/loki.png \
            component-logos/mimir.png \
            component-logos/tempo.svg \
            component-logos/ntfy.svg \
            component-logos/ferretdb.png \
            component-logos/openmeter.png \
            component-logos/coraza.png \
            component-logos/external-dns.png \
            component-logos/netbird.png \
            component-logos/strongswan.png \
            component-logos/trivy.png \
            component-logos/syft-grype.png ; do
            CODE=$(curl -s -o /dev/null -w '%{http_code}' \
              "http://localhost:8080/${path}")
            if [ "$CODE" != "200" ]; then
              echo "Logo smoke FAILED: /${path} returned $CODE"
              docker stop smoke-ui
              exit 1
            fi
            echo "Logo smoke OK: /${path} HTTP $CODE"
          done
          # Bootstrap-kit regression guard: the Provision page reads
          # BOOTSTRAP_KIT from the bundled catalog.generated.ts to render
          # the per-Blueprint bubbles. An earlier revision shipped with a
          # docker context that didn't include clusters/_template/bootstrap-kit/
          # so the prebuild script silently produced an empty array — the
          # page rendered only the 2 supernodes. Asserting the bundle
          # contains every bp-* id makes that regression impossible.
          #
          # Implementation note: we copy the main bundle (the first
          # `index-*.js` under /usr/share/nginx/html/assets) out of the
          # image once with `cat`, then grep locally. Earlier we ran
          # `grep` inside docker run -c "..." and the nested quote escaping
          # produced false negatives (bp-cilium was in the bundle but the
          # grep argument matched a literal `"bp-cilium"` whose surrounding
          # quotes were eaten by shell expansion). Local grep on the
          # extracted file removes that whole class of escaping bugs.
          BUNDLE_TMP=$(mktemp)
          docker run --rm --entrypoint sh ${{ env.UI_IMAGE }}:test \
            -c 'cat $(find /usr/share/nginx/html/assets -name "index-*.js" | head -1)' \
            > "$BUNDLE_TMP"
          BUNDLE_BYTES=$(wc -c < "$BUNDLE_TMP")
          echo "Bundle size: $BUNDLE_BYTES bytes"
          if [ "$BUNDLE_BYTES" -lt 100000 ]; then
            echo "Bootstrap-kit smoke FAILED: bundle suspiciously small ($BUNDLE_BYTES bytes)"
            docker stop smoke-ui
            exit 1
          fi
          for bp in bp-cilium bp-cert-manager bp-flux bp-crossplane bp-sealed-secrets \
            bp-spire bp-nats-jetstream bp-openbao bp-keycloak bp-gitea ; do
            if ! grep -q -F "$bp" "$BUNDLE_TMP" ; then
              echo "Bootstrap-kit smoke FAILED: ${bp} missing from bundle"
              docker stop smoke-ui
              exit 1
            fi
            echo "Bootstrap-kit smoke OK: ${bp}"
          done
          rm -f "$BUNDLE_TMP"
          docker stop smoke-ui
          echo "All smoke tests passed."
      - name: Push UI image
        uses: docker/build-push-action@v6
        with:
          # Build context is the repo root so the Vite prebuild script can
          # walk platform/, products/, clusters/_template/bootstrap-kit/ to
          # populate the catalog + BOOTSTRAP_KIT. The Containerfile fails
          # the build if any of those dirs is missing.
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/ui/Containerfile
          push: true
          tags: |
            ${{ env.UI_IMAGE }}:${{ steps.vars.outputs.sha_short }}
            ${{ env.UI_IMAGE }}:latest
          build-args: VITE_APP_MODE=selfhosted
  build-api:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      sha_short: ${{ steps.vars.outputs.sha_short }}
    steps:
      - name: Checkout openova-private
        uses: actions/checkout@v4
      - name: Checkout openova (public source)
        uses: actions/checkout@v4
        with:
          repository: openova-io/openova
          path: openova-src
      - name: Set short SHA
        id: vars
        run: echo "sha_short=$(echo $GITHUB_SHA | head -c 7)" >> "$GITHUB_OUTPUT"
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # Build context is the public openova repo root (openova-src/), not just
      # products/catalyst/bootstrap/api/, because the runtime image bundles the
      # canonical OpenTofu module from infra/hetzner/. The Containerfile's
      # COPY paths are written relative to the repo root accordingly. Without
      # this, /infra/hetzner/ is missing inside the image and every Launch
      # fails with `stage tofu module: open /infra/hetzner: no such file or
      # directory`.
      - name: Build API image (test)
        uses: docker/build-push-action@v6
        with:
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/api/Containerfile
          push: false
          load: true
          tags: ${{ env.API_IMAGE }}:test
      # Smoke test — the catalyst-api Pod is the OpenTofu runner, so the .tf
      # sources MUST be present at /infra/hetzner/ inside the image. Anything
      # less ships a broken image that fails on every Launch with `stage tofu
      # module: open /infra/hetzner: no such file or directory`. Failure of
      # this step fails the build.
      - name: Smoke test API — verify infra/hetzner/ is bundled
        run: |
          set -euo pipefail
          LISTING=$(docker run --rm --entrypoint sh ${{ env.API_IMAGE }}:test \
            -c 'ls -la /infra/hetzner/')
          echo "$LISTING"
          for f in main.tf variables.tf outputs.tf versions.tf \
            cloudinit-control-plane.tftpl cloudinit-worker.tftpl ; do
            if ! echo "$LISTING" | grep -q " ${f}\$"; then
              echo "Smoke test FAILED: /infra/hetzner/${f} missing from image"
              exit 1
            fi
            echo "Smoke test OK: /infra/hetzner/${f} present"
          done
          echo "All API smoke tests passed."
      # tofu CLI smoke test — the runtime image bundles the OpenTofu CLI
      # because internal/provisioner execs `tofu init / plan / apply` (see
      # internal/provisioner/provisioner.go runTofu()). Without the binary
      # every Launch SSE stream returns:
      #   tofu init: exec: "tofu": executable file not found in $PATH
      # We assert (a) `tofu version` succeeds inside the image and (b) the
      # output matches the EXPECTED_TOFU_VERSION pinned here, which must
      # stay in lockstep with the TOFU_VERSION ARG in the Containerfile.
      # When you bump the version in the Containerfile, bump it here too.
      - name: Smoke test API — verify OpenTofu CLI is installed
        env:
          EXPECTED_TOFU_VERSION: 1.11.6
        run: |
          set -euo pipefail
          OUT=$(docker run --rm --entrypoint sh ${{ env.API_IMAGE }}:test \
            -c 'tofu version')
          echo "$OUT"
          if ! echo "$OUT" | grep -q "^OpenTofu v${EXPECTED_TOFU_VERSION}\$"; then
            echo "Smoke test FAILED: expected 'OpenTofu v${EXPECTED_TOFU_VERSION}', got:"
            echo "$OUT"
            exit 1
          fi
          echo "Smoke test OK: OpenTofu v${EXPECTED_TOFU_VERSION} present on PATH."
          # Re-assert the binary is executable for the actual runtime UID
          # (65534, set in api-deployment.yaml securityContext.runAsUser).
          # `--user` overrides the image USER directive, simulating the K8s
          # securityContext: a missing exec bit or wrong owner here would
          # surface as a Launch failure in production, never in CI, so we
          # gate it at build time.
          docker run --rm --user 65534:65534 --entrypoint sh \
            ${{ env.API_IMAGE }}:test -c 'tofu version | head -1'
          echo "Smoke test OK: tofu executable as UID 65534."
      - name: Push API image
        uses: docker/build-push-action@v6
        with:
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/api/Containerfile
          push: true
          tags: |
            ${{ env.API_IMAGE }}:${{ steps.vars.outputs.sha_short }}
            ${{ env.API_IMAGE }}:latest
  deploy:
    needs: [build-ui, build-api]
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Update deployment manifests with new SHA tags
        env:
          SHA_SHORT: ${{ needs.build-ui.outputs.sha_short }}
        run: |
          DEPLOY_DIR="products/catalyst/chart/templates"
          sed -i "s|image: ${UI_IMAGE}:.*|image: ${UI_IMAGE}:${SHA_SHORT}|" \
            "${DEPLOY_DIR}/ui-deployment.yaml"
          sed -i "s|image: ${API_IMAGE}:.*|image: ${API_IMAGE}:${SHA_SHORT}|" \
            "${DEPLOY_DIR}/api-deployment.yaml"
          echo "Updated manifests to SHA ${SHA_SHORT}:"
          grep "image:" "${DEPLOY_DIR}/ui-deployment.yaml"
          grep "image:" "${DEPLOY_DIR}/api-deployment.yaml"
      - name: Commit and push manifest updates
        env:
          SHA_SHORT: ${{ needs.build-ui.outputs.sha_short }}
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add products/
          git diff --staged --quiet && echo "No changes to commit" && exit 0
          git commit -m "deploy: update catalyst images to ${SHA_SHORT}"
          git push