The Phase-1 helmwatch watcher used to lose state on every catalyst-api
Pod roll. fromRecord rewrote any "phase1-watching" status to "failed"
on the next Pod start — even though Phase 0 had already committed its
tofu state, the Sovereign cluster was healthy, the kubeconfig was on
the PVC, and the bootstrap-kit HelmReleases kept reconciling regardless
of whether catalyst-api's in-memory watcher was alive.
Caught live on otech102 (2026-05-04): a transient catalyst-api roll
mid-Phase-1 latched the deployment record to status=failed, the
auto-fire handover never triggered, and the operator was stranded on
the wizard page. The manual workaround was to patch the record back to
status=ready and mint the handover token by hand.
Fix: split the in-flight rewrite into two cases:
- Phase-0 in-flight (pending, provisioning, tofu-applying,
flux-bootstrapping): STILL rewritten to failed, because the tofu
workdir on the /tmp emptyDir died with the Pod and the Hetzner
resources are orphaned.
- phase1-watching: preserved across restart so the post-restart
resume path picks it up via shouldResumePhase1 + resumePhase1Watch
(already wired). The on-disk store record stays consistent with
the in-memory state during rehydrate.
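
The two cases above can be sketched roughly as follows. This is an
illustration only, not the code in this change: rehydrateStatus is a
hypothetical stand-in for the status handling inside fromRecord, and
the constant names are assumed. Only the status strings and the
isPhase0InFlightStatus name come from this message.

```go
package main

import "fmt"

// Status strings as named in this commit message. The constant names
// themselves are assumptions made for this sketch.
const (
	statusPending           = "pending"
	statusProvisioning      = "provisioning"
	statusTofuApplying      = "tofu-applying"
	statusFluxBootstrapping = "flux-bootstrapping"
	statusPhase1Watching    = "phase1-watching"
	statusFailed            = "failed"
)

// isPhase0InFlightStatus reports whether s is one of the four Phase-0
// statuses whose work cannot survive a Pod restart.
func isPhase0InFlightStatus(s string) bool {
	switch s {
	case statusPending, statusProvisioning, statusTofuApplying, statusFluxBootstrapping:
		return true
	}
	return false
}

// rehydrateStatus (hypothetical) returns the status a record should
// carry after a Pod restart: Phase-0 in-flight becomes failed, every
// other status, including phase1-watching, is returned unchanged.
func rehydrateStatus(persisted string) string {
	if isPhase0InFlightStatus(persisted) {
		return statusFailed
	}
	return persisted
}

func main() {
	for _, s := range []string{statusTofuApplying, statusPhase1Watching, "ready"} {
		fmt.Printf("%s -> %s\n", s, rehydrateStatus(s))
	}
}
```

Keeping the Phase-0 check in its own helper means the rewrite rule
lists the non-resumable statuses explicitly instead of excluding
phase1-watching from a broader set.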
Helmwatch's existing resume path (jobs_backfill.go) is idempotent —
it just observes HelmRelease.status, never patches/applies, so a fresh
informer over the same kubeconfig produces the same per-component
events the previous Pod was streaming.
Also:
- Added isPhase0InFlightStatus helper to distinguish the two
semantics; isInFlightStatus is retained for the release-subdomain
conflict check (it still includes phase1-watching, so a slot is not
released mid-Phase-1).
- Updated TestPodRestart_StuckPhase1WatchingRewrittenToFailed →
TestPodRestart_Phase1WatchingPreservedNotRewrittenToFailed (now
asserts the new correct behavior).
- New test TestPodRestart_Phase1WatchingResumesWithKubeconfig proves
the gating decision (shouldResumePhase1=true) and the preserved
Status value.
- New parameterized test
TestPodRestart_Phase0InFlightStillRewrittenToFailed proves the Phase-0
carve-out still works for all four Phase-0 statuses.
- Updated TestShouldResumePhase1_GatesProperly cases to reflect the
new phase1-watching=resumable / Phase-0=non-resumable split.
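
For orientation, the resumable / non-resumable split exercised by the
tests above might look like the sketch below. The record type, its
KubeconfigPath field, and the shouldResumePhase1 signature are
assumptions made for illustration; the real gate may check more than
status and kubeconfig presence.

```go
package main

import "fmt"

// record is a hypothetical, trimmed-down deployment record.
type record struct {
	Status         string
	KubeconfigPath string // assumed field; empty when no kubeconfig was persisted
}

// isPhase0InFlightStatus covers the four non-resumable Phase-0 statuses.
func isPhase0InFlightStatus(s string) bool {
	switch s {
	case "pending", "provisioning", "tofu-applying", "flux-bootstrapping":
		return true
	}
	return false
}

// isInFlightStatus is the broader set used by the release-subdomain
// conflict check: Phase-0 in-flight plus phase1-watching.
func isInFlightStatus(s string) bool {
	return isPhase0InFlightStatus(s) || s == "phase1-watching"
}

// shouldResumePhase1 (assumed signature) gates the post-restart resume:
// only a phase1-watching record with a kubeconfig on disk qualifies.
func shouldResumePhase1(r record) bool {
	return r.Status == "phase1-watching" && r.KubeconfigPath != ""
}

func main() {
	cases := []record{
		{Status: "phase1-watching", KubeconfigPath: "/data/kubeconfig"},
		{Status: "phase1-watching"},
		{Status: "tofu-applying", KubeconfigPath: "/data/kubeconfig"},
	}
	for _, r := range cases {
		fmt.Printf("status=%s kubeconfig=%t resume=%t inFlight=%t\n",
			r.Status, r.KubeconfigPath != "", shouldResumePhase1(r), isInFlightStatus(r.Status))
	}
}
```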
Issue: openova-io/openova#830 (Bug 3)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>