fix(cutover 0.1.19): Step-01 gitea-mirror DNS readiness probe + backoffLimit=3 (#968) (#969)

## Root cause (live on otech115 2026-05-05 14:15)

After PR #959 (0.1.18) unblocked the auto-trigger to actually call
/internal/cutover/trigger, the cutover engine fired Step-01 within ~8s
of bp-self-sovereign-cutover Helm-install completing. The gitea Pod
had only just reached Ready state — cluster-DNS endpoint publication
for the headless service `gitea-http` was still in flight. One wget
returned `bad address gitea-http.gitea.svc.cluster.local` and exited
non-zero. Catalyst-api's cutover engine stamped Jobs with backoffLimit=0
(cutover.go:584), so a single DNS miss was terminal and aborted all 8
cutover steps. otech115 finished provisioning with cutoverComplete=false
and tethered to upstream github.com/ghcr.io.

## Fix (dual-layer)

**Layer A — catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3.
A single transient miss is now recoverable (4 attempts within each step's
activeDeadlineSeconds) without burning operator attention. Hard failures
still surface within budget.

**Layer B — chart Step-01 (01-gitea-mirror-job.yaml)**: explicit
nslookup readiness probe at the top of the bash script, before any
wget call. 30 attempts × 5s = 150s budget; alpine/git ships nslookup
in /usr/bin (verified live on otech115). Layer B is faster than Layer A
(in-script DNS retry vs Pod recreate); Layer A is the safety net for
any other transient pre-cluster-stable race we haven't yet enumerated.
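The probe resolves only the bare hostname, so Step-01 first strips scheme, port, and path from GITEA_INTERNAL_URL. The same sed|cut pipeline the chart uses, wrapped here in a hypothetical `extract_host` helper for illustration, can be exercised standalone:

```shell
# Hypothetical wrapper around the exact pipeline Step-01 runs to pull
# the bare hostname out of GITEA_INTERNAL_URL before the nslookup loop.
extract_host() {
  printf '%s' "$1" | sed -E 's|^https?://||' | cut -d: -f1 | cut -d/ -f1
}

# Both print gitea-http.gitea.svc.cluster.local — scheme, port, and
# path are all stripped before the name reaches nslookup.
for url in \
  "http://gitea-http.gitea.svc.cluster.local:3000" \
  "https://gitea-http.gitea.svc.cluster.local/api/v1"; do
  echo "$(extract_host "$url")"
done
```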

## Acceptance gate

Test case 15 added to platform/self-sovereign-cutover/chart/tests/
cutover-contract.sh — guards against future regressions that drop
either the gitea_host extraction or the nslookup loop.
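A self-contained sketch of how Case 15's two grep gates behave; the rendered chart output is stubbed with a here-doc here, whereas the real test greps the actual `helm template` render at `$TMP/render.yaml`:

```shell
# Stand-in render containing the two markers Case 15 guards:
# the gitea_host= extraction and the nslookup readiness call.
TMP="$(mktemp -d)"
cat > "$TMP/render.yaml" <<'EOF'
gitea_host="$(printf '%s' "${GITEA_INTERNAL_URL}" | sed -E 's|^https?://||')"
if nslookup "${gitea_host}" >/dev/null 2>&1; then echo ready; fi
EOF

# Single-quoted patterns keep ${gitea_host} literal, as in the contract test.
if grep -q 'nslookup "${gitea_host}"' "$TMP/render.yaml" \
   && grep -q 'gitea_host=' "$TMP/render.yaml"; then
  echo "PASS (Step-01 has DNS readiness probe)"
else
  echo "FAIL: DNS readiness probe dropped (#968)" >&2
fi
```

Dropping either line from the stub flips the gate to FAIL, which is exactly the regression the real test is meant to catch.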

## Live verification

Will fire on the next provision (otech116). Expected:
- Step-01 logs `[gitea-mirror] DNS ready for gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Committed by GitHub on behalf of e3mrah, 2026-05-05 18:25:15 +04:00
parent 39732ff41b · commit 3db19b76b1 · GPG Key ID: B5690EEEBB952194 (key not in local database)
5 changed files with 82 additions and 3 deletions


@@ -174,7 +174,21 @@ spec:
       # Also drops the pre-flight cutoverComplete=true short-
       # circuit since /internal/cutover/trigger is itself
       # idempotent.
-      version: 0.1.18
+      # 0.1.19: Step-01 gitea-mirror DNS race + backoffLimit=0 (#968).
+      # 0.1.18 unblocked the auto-trigger so the cutover engine fired
+      # correctly on otech115 (2026-05-05) — but Step-01 then failed
+      # within 8s with `wget: bad address gitea-http.gitea.svc.cluster.
+      # local`. The gitea Pod had reached Ready ~2-3s prior; cluster-
+      # DNS endpoint propagation was still in flight. catalyst-api
+      # stamped the Job with `backoffLimit=0` (cutover.go:584), so
+      # one DNS miss was terminal and the cutover engine aborted all
+      # 8 steps. Fix is dual: (a) catalyst-api now stamps Jobs with
+      # `backoffLimit=3` so a single miss is recoverable; (b) Step-01
+      # bash script gains an explicit `nslookup` readiness loop (30 ×
+      # 5s) at the top, before any wget call. Both layers are needed —
+      # the in-script probe is fastest; the backoffLimit is the
+      # safety net for any other transient pre-cluster-stable race.
+      version: 0.1.19
       sourceRef:
         kind: HelmRepository
         name: bp-self-sovereign-cutover


@@ -1,6 +1,6 @@
 apiVersion: v2
 name: bp-self-sovereign-cutover
-version: 0.1.18
+version: 0.1.19
 description: |
   Catalyst Self-Sovereignty Cutover Blueprint. Installs DORMANT — this
   chart ships eight step ConfigMaps (PodSpec ConfigMaps, one per step),


@@ -76,6 +76,40 @@ data:
     echo "[gitea-mirror] target=${redacted_url}"
     echo "[gitea-mirror] mirror_interval=${MIRROR_INTERVAL}"
+    # #968 — DNS-readiness probe for gitea-http.
+    #
+    # The cutover auto-trigger fires within seconds of bp-self-
+    # sovereign-cutover Helm-install completing. On a fresh
+    # Sovereign the gitea Pod can still be moving from Running
+    # to Ready, in which case the headless service `gitea-http`
+    # has no DNS record published yet. Without this probe the
+    # very first wget call returns `bad address` and the Job
+    # exits non-zero. catalyst-api's cutover engine treats that
+    # as a hard failure (per cutover.go #968 backoffLimit was
+    # raised to 3, but local resolve here is cheaper and faster
+    # than burning Pod-restart budget). On otech115 2026-05-05
+    # this race fired Step-01 at +8s after gitea reached Ready
+    # and DNS hadn't propagated; one nslookup wait of ~10s would
+    # have been sufficient. Loop budget = 30 × 5s = 150s, well
+    # under the step's activeDeadlineSeconds.
+    gitea_host="$(printf '%s' "${GITEA_INTERNAL_URL}" | sed -E 's|^https?://||' | cut -d: -f1 | cut -d/ -f1)"
+    if [ -n "${gitea_host}" ]; then
+      echo "[gitea-mirror] waiting for DNS resolution of ${gitea_host}"
+      dns_ready="false"
+      for i in $(seq 1 30); do
+        if nslookup "${gitea_host}" >/dev/null 2>&1; then
+          echo "[gitea-mirror] DNS ready for ${gitea_host} (attempt ${i})"
+          dns_ready="true"
+          break
+        fi
+        sleep 5
+      done
+      if [ "${dns_ready}" != "true" ]; then
+        echo "[gitea-mirror] FATAL: ${gitea_host} did not resolve within 150s" >&2
+        exit 1
+      fi
+    fi
     # Build BusyBox-wget-compatible Basic auth header. printf -n
     # avoids the trailing newline that would otherwise corrupt
     # the base64 encoding (and thus the credential).


@@ -245,4 +245,26 @@ if grep -E "grep.*cutoverComplete.*/tmp/status\.json" "$TMP/render.yaml" >/dev/n
 fi
 echo " PASS (no stale cutoverComplete pre-read)"
+echo "[cutover-contract] Case 15: Step-01 gitea-mirror has DNS-readiness probe (#968)"
+# 0.1.18 Step-01 fired wget against gitea-http.gitea.svc.cluster.local
+# the moment the auto-trigger fired, racing the gitea Pod's endpoint
+# publication. One DNS miss returned `wget: bad address` and (combined
+# with catalyst-api's backoffLimit=0) terminated the Job permanently
+# — which the cutover engine surfaced as a hard cutover failure (caught
+# live on otech115 2026-05-05).
+#
+# 0.1.19 Step-01 prefixes its wget calls with an `nslookup` readiness
+# loop (30 x 5s) so the Job tolerates the ~10s endpoint-publish lag
+# without burning Pod-restart budget. This gate guards against future
+# regressions that drop the loop.
+if ! grep -q 'nslookup "${gitea_host}"' "$TMP/render.yaml"; then
+  echo "FAIL: Step-01 gitea-mirror missing nslookup readiness probe (#968)" >&2
+  exit 1
+fi
+if ! grep -q 'gitea_host=' "$TMP/render.yaml"; then
+  echo "FAIL: Step-01 gitea-mirror missing gitea_host= variable extraction (#968)" >&2
+  exit 1
+fi
+echo " PASS (Step-01 has DNS readiness probe)"
 echo "[cutover-contract] All gates green."


@@ -581,7 +581,16 @@ func cutoverJobName(stepName string, runEpoch int64) string {
 // hook-style Helm Jobs the bootstrap-kit uses elsewhere.
 func createCutoverJob(ctx context.Context, deps *cutoverDeps, step cutoverStep, runEpoch int64) (*batchv1.Job, error) {
 	name := cutoverJobName(step.stepName, runEpoch)
-	backoffLimit := int32(0) // No retries — fail fast, surface to the operator.
+	// #968 — backoffLimit raised from 0 to 3 to absorb the gitea-mirror
+	// step's known race against gitea-http endpoint publication. The
+	// step Pod can land in scheduling within seconds of the gitea Pod
+	// reaching Ready, before cluster-DNS endpoint propagation. One DNS
+	// miss used to be terminal because the Job had no retry budget;
+	// the cutover engine then aborted all 8 steps. With backoffLimit=3
+	// + the per-step DNS readiness probe (chart-side), a single miss
+	// is recoverable and steps still surface real failures (4× attempts
+	// over the activeDeadlineSeconds window).
+	backoffLimit := int32(3)
 	ttl := int32(24 * 60 * 60) // 24h GC so the Job evidence stays around for audit.
 	activeDeadline := int64(cutoverStepTimeout().Seconds())
 	job := &batchv1.Job{