openova/products/catalyst/bootstrap/api
e3mrah c9507c8369
fix(catalyst-api): durable Phase-1 watcher across Pod restart (#830) (#833)
The Phase-1 helmwatch watcher used to lose state on every catalyst-api
Pod roll. fromRecord rewrote any "phase1-watching" status to "failed"
on the next Pod start — even though Phase 0 had already committed its
tofu state, the Sovereign cluster was healthy, the kubeconfig was on
the PVC, and the bootstrap-kit HelmReleases kept reconciling regardless
of whether catalyst-api's in-memory watcher was alive.

Caught live on otech102 (2026-05-04): a transient catalyst-api roll
mid-Phase-1 latched the deployment record to status=failed, the
auto-fire handover never triggered, and the operator was stranded on
the wizard page. The manual workaround was to patch the record back to
status=ready and mint a handover token by hand.

Fix: split the in-flight rewrite into two cases:
  - Phase-0 in-flight (pending/provisioning/tofu-applying/
    flux-bootstrapping): STILL rewritten to failed, because the tofu
    workdir on the /tmp emptyDir died with the Pod and the Hetzner
    resources are orphaned.
  - phase1-watching: preserved across restart so the post-restart
    resume path picks it up via shouldResumePhase1 + resumePhase1Watch
    (already wired). The on-disk store record stays consistent with
    the in-memory state during rehydrate.
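
The two cases above can be sketched as a small status classifier. This is
an illustrative reconstruction from the message, not the code in the
commit: the status strings and the two helper names come from the text,
while the constant names and statusAfterRestart are hypothetical, and the
real fromRecord logic may be structured differently.

```go
package main

import "fmt"

// Status values named in the commit message. The constant names are
// assumptions; the real declarations in catalyst-api may differ.
const (
	statusPending           = "pending"
	statusProvisioning      = "provisioning"
	statusTofuApplying      = "tofu-applying"
	statusFluxBootstrapping = "flux-bootstrapping"
	statusPhase1Watching    = "phase1-watching"
	statusFailed            = "failed"
)

// isPhase0InFlightStatus reports whether s is one of the four Phase-0
// in-flight statuses, whose tofu workdir does not survive a Pod restart.
func isPhase0InFlightStatus(s string) bool {
	switch s {
	case statusPending, statusProvisioning, statusTofuApplying, statusFluxBootstrapping:
		return true
	}
	return false
}

// isInFlightStatus is the broader check and still includes
// phase1-watching.
func isInFlightStatus(s string) bool {
	return isPhase0InFlightStatus(s) || s == statusPhase1Watching
}

// statusAfterRestart is a hypothetical helper: it returns the status a
// rehydrated record should carry after a Pod restart.
func statusAfterRestart(persisted string) string {
	if isPhase0InFlightStatus(persisted) {
		return statusFailed // Phase-0 work cannot be resumed
	}
	return persisted // phase1-watching and terminal statuses are preserved
}

func main() {
	for _, s := range []string{statusPending, statusTofuApplying, statusPhase1Watching} {
		fmt.Printf("%-18s -> %s (in flight: %v)\n", s, statusAfterRestart(s), isInFlightStatus(s))
	}
}
```

Keeping the narrow and the broad predicate separate lets the restart path
and the subdomain-conflict check disagree about phase1-watching on purpose.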

Helmwatch's existing resume path (jobs_backfill.go) is idempotent —
it just observes HelmRelease.status, never patches/applies, so a fresh
informer over the same kubeconfig produces the same per-component
events the previous Pod was streaming.

Also:
  - Added isPhase0InFlightStatus helper to distinguish the two
    semantics; isInFlightStatus is retained for the release-subdomain
    conflict check (it still includes phase1-watching, so a slot is not
    released mid-Phase-1).
  - Updated TestPodRestart_StuckPhase1WatchingRewrittenToFailed →
    TestPodRestart_Phase1WatchingPreservedNotRewrittenToFailed (now
    asserts the new correct behavior).
  - New test TestPodRestart_Phase1WatchingResumesWithKubeconfig proves
    the gating decision (shouldResumePhase1=true) and the preserved
    Status value.
  - New parameterized test
    TestPodRestart_Phase0InFlightStillRewrittenToFailed proves the
    Phase-0 carve-out still works for all four Phase-0 statuses.
  - Updated TestShouldResumePhase1_GatesProperly cases to reflect the
    new phase1-watching=resumable / Phase-0=non-resumable split.
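
The resumable / non-resumable split that these tests exercise can be
pictured with a table-driven check. This is only a sketch: the record type,
its fields, and the kubeconfig path are hypothetical, and the body of
shouldResumePhase1 is assumed here to require both a phase1-watching status
and a persisted kubeconfig, which the real function may not match exactly.

```go
package main

import "fmt"

// record is a hypothetical, trimmed-down deployment record.
type record struct {
	Status         string
	KubeconfigPath string // empty when no kubeconfig was persisted
}

// shouldResumePhase1 is an assumed version of the gate: resume only when
// the record was left in phase1-watching and a kubeconfig is available
// to build a fresh informer from.
func shouldResumePhase1(r record) bool {
	return r.Status == "phase1-watching" && r.KubeconfigPath != ""
}

func main() {
	// The path below is a placeholder, not the real PVC location.
	const kubeconfig = "/example/kubeconfig"

	cases := []struct {
		name string
		rec  record
		want bool
	}{
		{"phase1-watching with kubeconfig", record{"phase1-watching", kubeconfig}, true},
		{"phase1-watching without kubeconfig", record{"phase1-watching", ""}, false},
		{"tofu-applying (Phase 0)", record{"tofu-applying", kubeconfig}, false},
		{"flux-bootstrapping (Phase 0)", record{"flux-bootstrapping", kubeconfig}, false},
		{"failed", record{"failed", kubeconfig}, false},
	}

	for _, c := range cases {
		got := shouldResumePhase1(c.rec)
		fmt.Printf("%-36s resume=%v ok=%v\n", c.name, got, got == c.want)
	}
}
```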

Issue: openova-io/openova#830 (Bug 3)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:28:07 +04:00
cmd feat(sme-tenant): tenant provisioning pipeline (#804) (#824) 2026-05-04 22:55:06 +04:00
internal fix(catalyst-api): durable Phase-1 watcher across Pod restart (#830) (#833) 2026-05-04 23:28:07 +04:00
Containerfile fix(catalyst-api): bump Containerfile build stage golang 1.23 → 1.26 (matches go.mod) 2026-04-29 17:41:08 +02:00
go.mod feat(catalyst-api): /auth/handover endpoint for seamless single-identity flow (Closes #606) (#612) 2026-05-02 17:34:26 +04:00
go.sum feat(catalyst-api): /auth/handover endpoint for seamless single-identity flow (Closes #606) (#612) 2026-05-02 17:34:26 +04:00