## Root cause (live on otech115 2026-05-05 14:15)

After PR #959 (0.1.18) unblocked the auto-trigger to actually call /internal/cutover/trigger, the cutover engine fired Step-01 within ~8s of bp-self-sovereign-cutover Helm-install completing. The gitea Pod had only just reached Ready state; cluster-DNS endpoint publication for the headless service `gitea-http` was still in flight. One wget returned `bad address gitea-http.gitea.svc.cluster.local` and exited non-zero.

Catalyst-api's cutover engine stamped Jobs with backoffLimit=0 (cutover.go:584), so a single DNS miss was terminal and aborted all 8 cutover steps. otech115 finished provisioning with cutoverComplete=false and still tethered to upstream github.com/ghcr.io.

## Fix (dual-layer)

**Layer A, catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3. A single transient miss is recoverable (4 attempts over each step's activeDeadlineSeconds) without burning operator-attention. Hard failures still surface within budget.

**Layer B, chart Step-01 (01-gitea-mirror-job.yaml)**: explicit nslookup readiness probe at the top of the bash script, before any wget call. 30 attempts × 5s = 150s budget; alpine/git ships nslookup in /usr/bin (verified live on otech115).

Layer B is faster than Layer A (in-script DNS retry vs Pod recreate); Layer A is the safety net for any other transient pre-cluster-stable race we haven't yet enumerated.

## Acceptance gate

Test case 15 added to platform/self-sovereign-cutover/chart/tests/cutover-contract.sh. It guards against future regressions that drop either the gitea_host extraction or the nslookup loop.

## Live verification

Will fire on the next provision (otech116). Expected:

- Step-01 logs `[gitea-mirror] DNS ready for gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 39732ff41b
commit 3db19b76b1

@@ -174,7 +174,21 @@ spec:
 # Also drops the pre-flight cutoverComplete=true short-
 # circuit since /internal/cutover/trigger is itself
 # idempotent.
-version: 0.1.18
+# 0.1.19: Step-01 gitea-mirror DNS race + backoffLimit=0 (#968).
+# 0.1.18 unblocked the auto-trigger so the cutover engine fired
+# correctly on otech115 (2026-05-05) — but Step-01 then failed
+# within 8s with `wget: bad address gitea-http.gitea.svc.cluster.
+# local`. The gitea Pod had reached Ready ~2-3s prior; cluster-
+# DNS endpoint propagation was still in flight. catalyst-api
+# stamped the Job with `backoffLimit=0` (cutover.go:584), so
+# one DNS miss was terminal and the cutover engine aborted all
+# 8 steps. Fix is dual: (a) catalyst-api now stamps Jobs with
+# `backoffLimit=3` so a single miss is recoverable; (b) Step-01
+# bash script gains an explicit `nslookup` readiness loop (30 ×
+# 5s) at the top, before any wget call. Both layers are needed —
+# the in-script probe is fastest; the backoffLimit is the
+# safety net for any other transient pre-cluster-stable race.
+version: 0.1.19
 sourceRef:
   kind: HelmRepository
   name: bp-self-sovereign-cutover

@@ -1,6 +1,6 @@
 apiVersion: v2
 name: bp-self-sovereign-cutover
-version: 0.1.18
+version: 0.1.19
 description: |
   Catalyst Self-Sovereignty Cutover Blueprint. Installs DORMANT — this
   chart ships eight step ConfigMaps (PodSpec ConfigMaps, one per step),

@@ -76,6 +76,40 @@ data:
 echo "[gitea-mirror] target=${redacted_url}"
 echo "[gitea-mirror] mirror_interval=${MIRROR_INTERVAL}"
 
+# #968 — DNS-readiness probe for gitea-http.
+#
+# The cutover auto-trigger fires within seconds of bp-self-
+# sovereign-cutover Helm-install completing. On a fresh
+# Sovereign the gitea Pod can still be moving from Running
+# to Ready, in which case the headless service `gitea-http`
+# has no DNS record published yet. Without this probe the
+# very first wget call returns `bad address` and the Job
+# exits non-zero. catalyst-api's cutover engine treats that
+# as a hard failure (per cutover.go #968 backoffLimit was
+# raised to 3, but local resolve here is cheaper and faster
+# than burning Pod-restart budget). On otech115 2026-05-05
+# this race fired Step-01 at +8s after gitea reached Ready
+# and DNS hadn't propagated; one nslookup wait of ~10s would
+# have been sufficient. Loop budget = 30 × 5s = 150s, well
+# under the step's activeDeadlineSeconds.
+gitea_host="$(printf '%s' "${GITEA_INTERNAL_URL}" | sed -E 's|^https?://||' | cut -d: -f1 | cut -d/ -f1)"
+if [ -n "${gitea_host}" ]; then
+  echo "[gitea-mirror] waiting for DNS resolution of ${gitea_host}"
+  dns_ready="false"
+  for i in $(seq 1 30); do
+    if nslookup "${gitea_host}" >/dev/null 2>&1; then
+      echo "[gitea-mirror] DNS ready for ${gitea_host} (attempt ${i})"
+      dns_ready="true"
+      break
+    fi
+    sleep 5
+  done
+  if [ "${dns_ready}" != "true" ]; then
+    echo "[gitea-mirror] FATAL: ${gitea_host} did not resolve within 150s" >&2
+    exit 1
+  fi
+fi
+
 # Build BusyBox-wget-compatible Basic auth header. printf -n
 # avoids the trailing newline that would otherwise corrupt
 # the base64 encoding (and thus the credential).

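The probe in this hunk does two things: it reduces `GITEA_INTERNAL_URL` to a bare hostname, then polls a resolver under a fixed attempt budget. A minimal sketch of both pieces as standalone helpers follows. The names `extract_host` and `wait_for_dns`, and the `DNS_PROBE` override, are hypothetical and exist only for illustration; the chart inlines this logic rather than defining functions.

```sh
#!/bin/sh
# Sketch only: extract_host, wait_for_dns and DNS_PROBE are hypothetical
# names, not part of the chart.

# Reduce a URL to its bare host: strip the scheme, then cut at the first
# ':' (port) and the first '/' (path). Same pipeline as the Step-01 script.
extract_host() {
  printf '%s' "$1" | sed -E 's|^https?://||' | cut -d: -f1 | cut -d/ -f1
}

# Poll until the probe command succeeds or the attempt budget is spent.
#   $1 host   $2 max attempts (default 30)   $3 seconds between attempts (default 5)
# The function body runs in a subshell, so its variables and `exit` calls
# do not leak into the caller.
wait_for_dns() (
  host="$1"
  attempts="${2:-30}"
  delay="${3:-5}"
  probe="${DNS_PROBE:-nslookup}"   # overridable so the loop can be tested offline
  i=1
  while [ "$i" -le "$attempts" ]; do
    if $probe "$host" >/dev/null 2>&1; then
      echo "DNS ready for ${host} (attempt ${i})"
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "FATAL: ${host} did not resolve within $((attempts * delay))s" >&2
  exit 1
)
```

A call such as `wait_for_dns "$(extract_host "${GITEA_INTERNAL_URL}")" 30 5 || exit 1` would mirror the 150s budget used above. One limitation worth noting: the `cut -d: -f1` stage assumes the URL carries no `user:password@` prefix and no IPv6 literal, either of which would truncate the host early.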
@@ -245,4 +245,26 @@ if grep -E "grep.*cutoverComplete.*/tmp/status\.json" "$TMP/render.yaml" >/dev/n
 fi
 echo " PASS (no stale cutoverComplete pre-read)"
 
+echo "[cutover-contract] Case 15: Step-01 gitea-mirror has DNS-readiness probe (#968)"
+# 0.1.18 Step-01 fired wget against gitea-http.gitea.svc.cluster.local
+# the moment the auto-trigger fired, racing the gitea Pod's endpoint
+# publication. One DNS miss returned `wget: bad address` and (combined
+# with catalyst-api's backoffLimit=0) terminated the Job permanently
+# — which the cutover engine surfaced as a hard cutover failure (caught
+# live on otech115 2026-05-05).
+#
+# 0.1.19 Step-01 prefixes its wget calls with an `nslookup` readiness
+# loop (30 x 5s) so the Job tolerates the ~10s endpoint-publish lag
+# without burning Pod-restart budget. This gate guards against future
+# regressions that drop the loop.
+if ! grep -q 'nslookup "${gitea_host}"' "$TMP/render.yaml"; then
+  echo "FAIL: Step-01 gitea-mirror missing nslookup readiness probe (#968)" >&2
+  exit 1
+fi
+if ! grep -q 'gitea_host=' "$TMP/render.yaml"; then
+  echo "FAIL: Step-01 gitea-mirror missing gitea_host= variable extraction (#968)" >&2
+  exit 1
+fi
+echo " PASS (Step-01 has DNS readiness probe)"
+
 echo "[cutover-contract] All gates green."

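Case 15 is a pair of literal-string checks against the rendered manifest. If more gates of this shape accumulate, one way to factor them is sketched below. `require_literal` is a hypothetical helper, not something cutover-contract.sh defines; it uses `grep -F` so that characters such as `$`, `{` and `"` in the needle are matched as fixed text rather than interpreted as a pattern.

```sh
#!/bin/sh
# Sketch only: require_literal is a hypothetical helper.
# Returns non-zero and prints a FAIL line when the file lacks the literal string.
#   $1 literal string to look for   $2 file to search   $3 failure message
require_literal() {
  needle="$1"
  file="$2"
  label="$3"
  # -F: fixed-string match; --: stop option parsing in case the needle starts with '-'
  if ! grep -qF -- "$needle" "$file"; then
    echo "FAIL: ${label}" >&2
    return 1
  fi
}
```

Under that assumption, the first gate of Case 15 could read `require_literal 'nslookup "${gitea_host}"' "$TMP/render.yaml" "Step-01 gitea-mirror missing nslookup readiness probe (#968)" || exit 1`.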
@@ -581,7 +581,16 @@ func cutoverJobName(stepName string, runEpoch int64) string {
 // hook-style Helm Jobs the bootstrap-kit uses elsewhere.
 func createCutoverJob(ctx context.Context, deps *cutoverDeps, step cutoverStep, runEpoch int64) (*batchv1.Job, error) {
 	name := cutoverJobName(step.stepName, runEpoch)
-	backoffLimit := int32(0) // No retries — fail fast, surface to the operator.
+	// #968 — backoffLimit raised from 0 to 3 to absorb the gitea-mirror
+	// step's known race against gitea-http endpoint publication. The
+	// step Pod can land in scheduling within seconds of the gitea Pod
+	// reaching Ready, before cluster-DNS endpoint propagation. One DNS
+	// miss used to be terminal because the Job had no retry budget;
+	// the cutover engine then aborted all 8 steps. With backoffLimit=3
+	// + the per-step DNS readiness probe (chart-side), a single miss
+	// is recoverable and steps still surface real failures (4× attempts
+	// over the activeDeadlineSeconds window).
+	backoffLimit := int32(3)
 	ttl := int32(24 * 60 * 60) // 24h GC so the Job evidence stays around for audit.
 	activeDeadline := int64(cutoverStepTimeout().Seconds())
 	job := &batchv1.Job{

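The "4× attempts" figure in the comment above follows from backoffLimit counting retries, not total runs. Because a Job's activeDeadlineSeconds bounds the whole Job across all retries, it can help to estimate the worst case where every attempt exhausts the in-script DNS loop. The sketch below does that arithmetic. The function names are hypothetical, and the retry delays assume the Job controller's documented exponential back-off (10s, 20s, 40s, capped at six minutes), so treat the result as an approximation that ignores scheduling and image-pull time.

```sh
#!/bin/sh
# Sketch only: function names are hypothetical; delays are approximate.

# Total Pod attempts = 1 initial run + backoffLimit retries.
total_attempts() {
  echo $(( $1 + 1 ))
}

# Worst-case seconds if every attempt burns its full in-script budget.
#   $1 backoffLimit   $2 per-attempt loop budget in seconds
worst_case_seconds() (
  backoff_limit="$1"
  per_attempt="$2"
  total=$(( (backoff_limit + 1) * per_attempt ))
  delay=10                         # assumed first back-off delay
  n=0
  while [ "$n" -lt "$backoff_limit" ]; do
    total=$(( total + delay ))
    delay=$(( delay * 2 ))         # exponential back-off between retries
    [ "$delay" -gt 360 ] && delay=360
    n=$(( n + 1 ))
  done
  echo "$total"
)
```

With the values from this commit, `total_attempts 3` gives 4 and `worst_case_seconds 3 150` gives 670. Whether 670s fits depends on the value returned by `cutoverStepTimeout()`, which this diff does not show.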