* feat(wipe): deployment-level Cancel & Wipe - backend endpoint + Cloud-Architecture + wizard banner entry-points (closes #318)

Adds a first-class Phase-0 recovery surface so an operator can purge a
failed pre-handover deployment from the wizard UI without dropping to
hcloud CLI runbooks. Two entry-points, one canonical implementation.

## Backend

NEW: products/catalyst/bootstrap/api/internal/handler/wipe.go
  POST /api/v1/deployments/{id}/wipe is a single-flight destructive op:
    1. tofu destroy against the per-deployment workdir (idempotent).
    2. Hetzner orphan force-purge by label-selector
       `catalyst-deployment-id=<id>` (servers, load balancers, networks,
       firewalls, ssh-keys). Belt-and-braces: catches resources tofu
       didn't track (half-failed cloud-init, manual experiments). Per
       docs/INVIOLABLE-PRINCIPLES.md #3 this direct API path is fallback
       ONLY for orphan cleanup, never new resource creation.
    3. PDM /v1/release for pool-subdomain Sovereigns (best-effort).
    4. Local cleanup: kubeconfig file (mode 0600), tofu workdir, on-disk
       deployment record JSON.
    5. SSE events stream throughout on the same channel as the original
       provisioning + Phase-1 watch.
    6. Marks Status="wiped"; sync.Map entry reaped after a 60s TTL.

NEW: products/catalyst/bootstrap/api/internal/hetzner/purge.go
  Hetzner Cloud API enumeration + force-delete by label selector. Uses a
  60s timeout (vs the 10s ValidateToken default) because async
  server-delete jobs can queue. 404s are treated as success (already
  gone).

NEW: products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
  Provisioner.Destroy() runs `tofu destroy -auto-approve` against the
  per-deployment workdir, then removes the workdir on success so
  re-provisioning starts fresh. Re-stages module + tfvars first so a
  partially-cleaned workdir still has what tofu needs.

TOUCHED: products/catalyst/bootstrap/api/cmd/api/main.go
  Registers POST /api/v1/deployments/{id}/wipe.

## Frontend (aligned with existing CrudModals conventions per founder directive; no ad-hoc surface)

NEW: products/catalyst/bootstrap/ui/src/components/CrudModals/WipeDeploymentModal.tsx
  Two-stage modal built on the canonical ModalShell. The pre-wipe
  confirm view requires the operator to:
    - Type the sovereign FQDN to confirm scope.
    - Re-paste their Hetzner Cloud API token (catalyst-api intentionally
      GCs the original after writeTfvars per credential hygiene).
  The post-wipe success view shows the PurgeReport (servers, lbs,
  networks, firewalls, ssh-keys removed; tofu/PDM/local-state ✓/✗) and a
  "Start fresh deployment" CTA that navigates to /sovereign.

TOUCHED: products/catalyst/bootstrap/ui/src/components/CrudModals/index.ts
  Re-exports WipeDeploymentModal + WipeReport.

TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/AppsPage.tsx
  FailureCard now exposes a red "Cancel & Wipe" button next to
  "Retry stream" / "Back to wizard"; it opens WipeDeploymentModal.

TOUCHED: products/catalyst/bootstrap/ui/src/pages/sovereign/InfrastructureTopology.tsx
  Cloud → Architecture canvas: the `cloud` (root) node action menu gains
  "Cancel & Wipe deployment" as a `danger:true` action, alongside the
  existing "+ Add region". This is distinct from the per-resource
  DeleteCascadeConfirm on region/cluster/vCluster: the wipe is
  deployment-scope (Phase-0 orphan purge), the others are Crossplane-XRC
  scope (day-2). The two paths coexist; operators choose by what state
  the deployment is in.

## Why two entry-points

Wizard banner (failed state on AppsPage): recovery from a known failure.
It is already a red-banner page; the button is right there.

Cloud → Architecture cloud-node action: proactive cancel from the
canvas, mirroring how the existing per-resource deletes are reachable.

Same modal, same backend.

## Constraints honoured

- Per docs/INVIOLABLE-PRINCIPLES.md #3 (Crossplane is the ONLY day-2
  IaC): the per-resource DELETE handler at infrastructure.go is
  unchanged and continues to flip XRC deletionPolicy. Wipe operates ONLY
  in Phase-0 scope where Crossplane never adopted resources.
- Per #4 (never hardcode): every endpoint lives behind API_BASE; the
  Hetzner purge enumerates by deterministic label selector built from
  var.sovereign_fqdn (the OpenTofu module's existing tagging
  convention).
- Per credential hygiene: the Hetzner token is re-prompted at wipe time
  rather than persisted; the modal uses an <input type="password">.

## Refs

#318: pre-handover wipe spec (this PR closes it)
#317: handover finalisation (sibling; this PR is the failure-path complement)
feedback_idempotent_iac_purge.md: operator runbook this implements
PR #313: sealed-secrets cleanup (independent; safe to land in any order)
PR #334: bp-external-secrets split (independent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): catalyst-build event-driven only - drop cron, push-on-main with path filter

Per docs/INVIOLABLE-PRINCIPLES.md (event-driven end to end: Flux
dependsOn, NATS JetStream, SSE, Helm hooks), GitHub Actions must follow
the same model. The previous `schedule: cron 0 3 * * *` daily build was
the only canonical deploy path, which created a 24h roll latency on
every change to the catalyst surface and incentivised "wait for cron"
stalls in operator workflows.

Replaces with:

    on:
      push:
        branches: [main]
        paths:
          - 'core/console/**'
          - 'core/admin/**'
          - 'core/marketplace/**'
          - 'core/marketplace-api/**'
          - 'products/catalyst/bootstrap/**'
          - 'products/catalyst/chart/**'
          - '.github/workflows/catalyst-build.yaml'
      workflow_dispatch:

`workflow_dispatch` is retained for ad-hoc re-runs (config-only changes
that bypass the path filter, e.g. a secret rotation that doesn't touch
code). The path filter mirrors the actual surface this workflow
rebuilds.

After this lands, every merge to main that touches the catalyst surface
auto-deploys. No cron lag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
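
The purge rule from the first commit ("404s treated as success") could be sketched roughly as below. This is an illustration only, not code taken from purge.go: the function name `delete_is_ok` and the exact set of accepted status codes are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): decide whether the HTTP
# status returned by a DELETE call counts as success for an idempotent purge.
delete_is_ok() {
  case "$1" in
    2??|404) return 0 ;;  # deleted now, or already gone
    *)       return 1 ;;  # anything else is treated as a real failure
  esac
}

# Example: classify a few status codes.
for code in 204 404 500; do
  if delete_is_ok "$code"; then
    echo "$code: ok"
  else
    echo "$code: failed"
  fi
done
```

Treating 404 as success is what makes a repeated wipe safe to re-run: a second pass over resources that were already removed reports success instead of an error.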
324 lines
13 KiB
YAML
name: Build & Deploy Catalyst

# Event-driven only. Cron is forbidden — the OpenOva architecture is
# event-driven end to end (Flux dependsOn, NATS JetStream, SSE,
# Helm post-install hooks). `push` on the relevant paths is the
# canonical trigger; `workflow_dispatch` exists for ad-hoc re-runs
# without a code change.
on:
  push:
    branches: [main]
    paths:
      - 'core/console/**'
      - 'core/admin/**'
      - 'core/marketplace/**'
      - 'core/marketplace-api/**'
      - 'products/catalyst/bootstrap/**'
      - 'products/catalyst/chart/**'
      - '.github/workflows/catalyst-build.yaml'
  workflow_dispatch:

env:
  REGISTRY: ghcr.io
  UI_IMAGE: ghcr.io/openova-io/openova/catalyst-ui
  API_IMAGE: ghcr.io/openova-io/openova/catalyst-api

jobs:
  build-ui:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      sha_short: ${{ steps.vars.outputs.sha_short }}
    steps:
      - name: Checkout openova-private
        uses: actions/checkout@v4

      - name: Checkout openova (public source)
        uses: actions/checkout@v4
        with:
          repository: openova-io/openova
          path: openova-src

      - name: Set short SHA
        id: vars
        run: echo "sha_short=$(echo $GITHUB_SHA | head -c 7)" >> "$GITHUB_OUTPUT"

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build UI image (test)
        uses: docker/build-push-action@v6
        with:
          # Build context is the repo root so the Vite prebuild script can
          # walk platform/, products/, clusters/_template/bootstrap-kit/ to
          # populate the catalog + BOOTSTRAP_KIT. The Containerfile fails
          # the build if any of those dirs is missing.
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/ui/Containerfile
          push: false
          load: true
          tags: ${{ env.UI_IMAGE }}:test
          build-args: VITE_APP_MODE=selfhosted

      - name: Smoke test UI
        run: |
          docker run -d --name smoke-ui -p 8080:8080 ${{ env.UI_IMAGE }}:test
          sleep 3
          STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/)
          if [ "$STATUS" != "200" ]; then
            echo "Smoke test failed: expected 200 from /, got $STATUS"
            docker stop smoke-ui
            exit 1
          fi
          echo "Smoke test (root) passed: HTTP $STATUS"

          # Logo path regression guard (#173): the wizard's StepComponents
          # references `${BASE}component-logos/<id>.<ext>` where BASE is the
          # Vite base and the extension is whatever the upstream brand mark
          # is published as (some are SVG, some are PNG — we use the canonical
          # upstream asset rather than auto-converting). Inside the catalyst-
          # ui pod nginx serves the file at /component-logos/<id>.<ext>
          # (Traefik strips /sovereign before proxying — see nginx.conf
          # comment). We list every logo path that componentGroups.ts
          # references, so a missing or mis-cased asset fails the build,
          # not the user.
          for path in \
            component-logos/cilium.svg \
            component-logos/flux.svg \
            component-logos/harbor.svg \
            component-logos/grafana.svg \
            component-logos/keycloak.svg \
            component-logos/openbao.svg \
            component-logos/langfuse.png \
            component-logos/vllm.png \
            component-logos/temporal.svg \
            component-logos/stalwart.svg \
            component-logos/cnpg.svg \
            component-logos/loki.png \
            component-logos/mimir.png \
            component-logos/tempo.svg \
            component-logos/ntfy.svg \
            component-logos/ferretdb.png \
            component-logos/openmeter.png \
            component-logos/coraza.png \
            component-logos/external-dns.png \
            component-logos/netbird.png \
            component-logos/strongswan.png \
            component-logos/trivy.png \
            component-logos/syft-grype.png ; do
            CODE=$(curl -s -o /dev/null -w '%{http_code}' \
              "http://localhost:8080/${path}")
            if [ "$CODE" != "200" ]; then
              echo "Logo smoke FAILED: /${path} returned $CODE"
              docker stop smoke-ui
              exit 1
            fi
            echo "Logo smoke OK: /${path} HTTP $CODE"
          done

          # Bootstrap-kit regression guard: the Provision page reads
          # BOOTSTRAP_KIT from the bundled catalog.generated.ts to render
          # the per-Blueprint bubbles. An earlier revision shipped with a
          # docker context that didn't include clusters/_template/bootstrap-kit/
          # so the prebuild script silently produced an empty array — the
          # page rendered only the 2 supernodes. Asserting the bundle
          # contains every bp-* id makes that regression impossible.
          #
          # Implementation note: we extract the entire bundle once via
          # `tar c -C ... --transform`, then grep locally. Earlier we ran
          # `grep` inside docker run -c "..." and the nested quote escaping
          # produced false negatives (bp-cilium was in the bundle but the
          # grep argument matched a literal `"bp-cilium"` whose surrounding
          # quotes were eaten by shell expansion). Local grep on the
          # extracted file removes that whole class of escaping bugs.
          BUNDLE_TMP=$(mktemp)
          docker run --rm --entrypoint sh ${{ env.UI_IMAGE }}:test \
            -c 'cat $(find /usr/share/nginx/html/assets -name "index-*.js" | head -1)' \
            > "$BUNDLE_TMP"
          BUNDLE_BYTES=$(wc -c < "$BUNDLE_TMP")
          echo "Bundle size: $BUNDLE_BYTES bytes"
          if [ "$BUNDLE_BYTES" -lt 100000 ]; then
            echo "Bootstrap-kit smoke FAILED: bundle suspiciously small ($BUNDLE_BYTES bytes)"
            docker stop smoke-ui
            exit 1
          fi
          for bp in bp-cilium bp-cert-manager bp-flux bp-crossplane bp-sealed-secrets \
                    bp-spire bp-nats-jetstream bp-openbao bp-keycloak bp-gitea ; do
            if ! grep -q -F "$bp" "$BUNDLE_TMP" ; then
              echo "Bootstrap-kit smoke FAILED: ${bp} missing from bundle"
              docker stop smoke-ui
              exit 1
            fi
            echo "Bootstrap-kit smoke OK: ${bp}"
          done
          rm -f "$BUNDLE_TMP"

          docker stop smoke-ui
          echo "All smoke tests passed."

      - name: Push UI image
        uses: docker/build-push-action@v6
        with:
          # Build context is the repo root so the Vite prebuild script can
          # walk platform/, products/, clusters/_template/bootstrap-kit/ to
          # populate the catalog + BOOTSTRAP_KIT. The Containerfile fails
          # the build if any of those dirs is missing.
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/ui/Containerfile
          push: true
          tags: |
            ${{ env.UI_IMAGE }}:${{ steps.vars.outputs.sha_short }}
            ${{ env.UI_IMAGE }}:latest
          build-args: VITE_APP_MODE=selfhosted

  build-api:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      sha_short: ${{ steps.vars.outputs.sha_short }}
    steps:
      - name: Checkout openova-private
        uses: actions/checkout@v4

      - name: Checkout openova (public source)
        uses: actions/checkout@v4
        with:
          repository: openova-io/openova
          path: openova-src

      - name: Set short SHA
        id: vars
        run: echo "sha_short=$(echo $GITHUB_SHA | head -c 7)" >> "$GITHUB_OUTPUT"

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Build context is the public openova repo root (openova-src/), not just
      # products/catalyst/bootstrap/api/, because the runtime image bundles the
      # canonical OpenTofu module from infra/hetzner/. The Containerfile's
      # COPY paths are written relative to the repo root accordingly. Without
      # this, /infra/hetzner/ is missing inside the image and every Launch
      # fails with `stage tofu module: open /infra/hetzner: no such file or
      # directory`.
      - name: Build API image (test)
        uses: docker/build-push-action@v6
        with:
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/api/Containerfile
          push: false
          load: true
          tags: ${{ env.API_IMAGE }}:test

      # Smoke test — the catalyst-api Pod is the OpenTofu runner, so the .tf
      # sources MUST be present at /infra/hetzner/ inside the image. Anything
      # less ships a broken image that fails on every Launch with `stage tofu
      # module: open /infra/hetzner: no such file or directory`. Failure of
      # this step fails the build.
      - name: Smoke test API — verify infra/hetzner/ is bundled
        run: |
          set -euo pipefail
          LISTING=$(docker run --rm --entrypoint sh ${{ env.API_IMAGE }}:test \
            -c 'ls -la /infra/hetzner/')
          echo "$LISTING"
          for f in main.tf variables.tf outputs.tf versions.tf \
                   cloudinit-control-plane.tftpl cloudinit-worker.tftpl ; do
            if ! echo "$LISTING" | grep -q " ${f}\$"; then
              echo "Smoke test FAILED: /infra/hetzner/${f} missing from image"
              exit 1
            fi
            echo "Smoke test OK: /infra/hetzner/${f} present"
          done
          echo "All API smoke tests passed."

      # tofu CLI smoke test — the runtime image bundles the OpenTofu CLI
      # because internal/provisioner execs `tofu init / plan / apply` (see
      # internal/provisioner/provisioner.go runTofu()). Without the binary
      # every Launch SSE stream returns:
      #   tofu init: exec: "tofu": executable file not found in $PATH
      # We assert (a) `tofu version` succeeds inside the image and (b) the
      # output matches the EXPECTED_TOFU_VERSION pinned here, which must
      # stay in lockstep with the TOFU_VERSION ARG in the Containerfile.
      # When you bump the version in the Containerfile, bump it here too.
      - name: Smoke test API — verify OpenTofu CLI is installed
        env:
          EXPECTED_TOFU_VERSION: 1.11.6
        run: |
          set -euo pipefail
          OUT=$(docker run --rm --entrypoint sh ${{ env.API_IMAGE }}:test \
            -c 'tofu version')
          echo "$OUT"
          if ! echo "$OUT" | grep -q "^OpenTofu v${EXPECTED_TOFU_VERSION}\$"; then
            echo "Smoke test FAILED: expected 'OpenTofu v${EXPECTED_TOFU_VERSION}', got:"
            echo "$OUT"
            exit 1
          fi
          echo "Smoke test OK: OpenTofu v${EXPECTED_TOFU_VERSION} present on PATH."

          # Re-assert the binary is executable for the actual runtime UID
          # (65534, set in api-deployment.yaml securityContext.runAsUser).
          # `--user` overrides the image USER directive, simulating the K8s
          # securityContext: a missing exec bit or wrong owner here would
          # surface as a Launch failure in production, never in CI, so we
          # gate it at build time.
          docker run --rm --user 65534:65534 --entrypoint sh \
            ${{ env.API_IMAGE }}:test -c 'tofu version | head -1'
          echo "Smoke test OK: tofu executable as UID 65534."

      - name: Push API image
        uses: docker/build-push-action@v6
        with:
          context: openova-src
          file: openova-src/products/catalyst/bootstrap/api/Containerfile
          push: true
          tags: |
            ${{ env.API_IMAGE }}:${{ steps.vars.outputs.sha_short }}
            ${{ env.API_IMAGE }}:latest

  deploy:
    needs: [build-ui, build-api]
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Update deployment manifests with new SHA tags
        env:
          SHA_SHORT: ${{ needs.build-ui.outputs.sha_short }}
        run: |
          DEPLOY_DIR="products/catalyst/chart/templates"

          sed -i "s|image: ${UI_IMAGE}:.*|image: ${UI_IMAGE}:${SHA_SHORT}|" \
            "${DEPLOY_DIR}/ui-deployment.yaml"

          sed -i "s|image: ${API_IMAGE}:.*|image: ${API_IMAGE}:${SHA_SHORT}|" \
            "${DEPLOY_DIR}/api-deployment.yaml"

          echo "Updated manifests to SHA ${SHA_SHORT}:"
          grep "image:" "${DEPLOY_DIR}/ui-deployment.yaml"
          grep "image:" "${DEPLOY_DIR}/api-deployment.yaml"

      - name: Commit and push manifest updates
        env:
          SHA_SHORT: ${{ needs.build-ui.outputs.sha_short }}
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add products/
          git diff --staged --quiet && echo "No changes to commit" && exit 0
          git commit -m "deploy: update catalyst images to ${SHA_SHORT}"
          git push
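
The short-SHA derivation and the `sed` image rewrite used by the `deploy` job can be tried locally with a sketch like the one below. It only mirrors the two commands from the workflow; the sample commit hash, the temporary manifest, and its contents are made up for the example, and `sed -i` is used in its GNU form (as on `ubuntu-latest`).

```shell
#!/usr/bin/env bash
set -euo pipefail

UI_IMAGE="ghcr.io/openova-io/openova/catalyst-ui"

# Hypothetical commit hash standing in for the GITHUB_SHA set by Actions.
GITHUB_SHA="0123456789abcdef0123456789abcdef01234567"
SHA_SHORT=$(echo "$GITHUB_SHA" | head -c 7)

# Throwaway manifest fragment standing in for ui-deployment.yaml.
MANIFEST=$(mktemp)
printf '      - name: ui\n        image: %s:fffffff\n' "$UI_IMAGE" > "$MANIFEST"

# Same substitution as the workflow: replace whatever tag follows the image
# name with the new short SHA. `|` is the delimiter so the slashes in the
# image path need no escaping.
sed -i "s|image: ${UI_IMAGE}:.*|image: ${UI_IMAGE}:${SHA_SHORT}|" "$MANIFEST"

RESULT=$(grep "image:" "$MANIFEST")
echo "$RESULT"
rm -f "$MANIFEST"
```

Because the pattern matches on the full image name, a line that references a different image in the same manifest would be left untouched.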