fix(self-sovereign-cutover): set HR timeout 15m + lower hook deadlines below it (#127)

Two prior provisions (#12 d22b6d4dada2aef2, #14 12e194090631a885) wedged
identically at phase1-watching: the bp-self-sovereign-cutover@0.1.25
post-install hook timed out at 5m (Helm default), Flux marked the release
Failed, retried 3x, and gave up. catalyst-api never received the kubeconfig
PUT-back because the cutover chain inside the Sovereign couldn't complete.

Root cause: the HelmRelease had no explicit install/upgrade timeout, so
Helm's 5m (300s) default applied. That is no longer than the auto-trigger
Job's WAIT_TIMEOUT_SECONDS (300s) and shorter than its activeDeadlineSeconds
(600s), so Helm gave up before the Job could complete cleanly.
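
The ordering behind this failure can be sketched in a few lines of Python.
This is an illustration only: the first_to_fire helper and the dictionary
keys are hypothetical names, and the durations are the ones quoted in this
message (Helm timeout, wait timeout, Job activeDeadlineSeconds).

```python
def first_to_fire(deadlines: dict[str, int]) -> list[str]:
    """Return the name(s) of the deadline(s) that expire first (values in seconds)."""
    earliest = min(deadlines.values())
    return sorted(name for name, secs in deadlines.items() if secs == earliest)


# Values before this commit: Helm's 5m default vs. the chart's hook deadlines.
before = {"helm_timeout": 300, "wait_timeout": 300, "job_active_deadline": 600}

# Values after this commit: 15m HR timeout, 12m wait, 14m Job deadline.
after = {"helm_timeout": 900, "wait_timeout": 720, "job_active_deadline": 840}

print(first_to_fire(before))
print(first_to_fire(after))
```

With the old values the Helm timeout ties with the wait timeout, so the Job
has no margin to exit before Helm gives up. With the new values the wait
timeout expires first, 180s ahead of the 900s HR cap.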

Fix:
- HR install/upgrade timeout: 15m (covers cold-start cluster + auto-trigger)
- values.autoWaitForAPISeconds: 300 → 720 (12m wait, exits 0 below 15m HR cap)
- values.autoTimeoutSeconds: 600 → 840 (14m Job deadline, below 15m HR cap)
- Chart bump 0.1.25 → 0.1.26

Per CLAUDE.md principle 16: canonical seam = HR timeout + chart hook deadlines.
Both must align: hook deadlines < HR timeout, otherwise Helm gives up before
the hook completes, regardless of how the Job exits.
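
The alignment rule above could be checked mechanically, for example in CI.
The following is a minimal sketch, not part of this commit: check_alignment
and duration_to_seconds are hypothetical helpers, and the duration parser
assumes only the h/m/s subset of duration strings such as "15m".

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text: str) -> int:
    """Convert a duration such as '15m' or '1h30m' to whole seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the simple pattern did not fully consume (e.g. '15' or '15ms').
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def check_alignment(hr_timeout: str, wait_seconds: int, job_deadline_seconds: int) -> list[str]:
    """Return a list of violations of: wait < Job deadline < HR timeout."""
    cap = duration_to_seconds(hr_timeout)
    problems = []
    if wait_seconds >= job_deadline_seconds:
        problems.append(
            f"wait ({wait_seconds}s) must be below Job deadline ({job_deadline_seconds}s)"
        )
    if job_deadline_seconds >= cap:
        problems.append(
            f"Job deadline ({job_deadline_seconds}s) must be below HR timeout ({cap}s)"
        )
    return problems


# Values introduced by this commit: no violations expected.
print(check_alignment("15m", 720, 840))

# Values before this commit, with Helm's 5m default as the effective cap.
print(check_alignment("5m", 300, 600))
```

The values from this commit pass with an empty list, while the previous
values report one violation because the 600s Job deadline exceeds the 300s
effective timeout.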

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-11 00:13:07 +02:00
parent 86231d1d2f
commit 40612f19ea
3 changed files with 11 additions and 4 deletions

@@ -241,10 +241,12 @@ spec:
         namespace: flux-system
   install:
     disableWait: true
+    timeout: 15m
     remediation:
       retries: 3
   upgrade:
     disableWait: true
+    timeout: 15m
     remediation:
       retries: 3
   # Per-Sovereign overrides — the chart's values.yaml carries

@@ -1,6 +1,6 @@
 apiVersion: v2
 name: bp-self-sovereign-cutover
-version: 0.1.25
+version: 0.1.26
 description: |
   Catalyst Self-Sovereignty Cutover Blueprint. Installs DORMANT — this
   chart ships eight step ConfigMaps (PodSpec ConfigMaps, one per step),

@@ -331,14 +331,19 @@ trigger:
   catalystAPIURL: "http://catalyst-api.catalyst-system.svc.cluster.local:8080"
   # How long the auto-trigger Job will wait for catalyst-api to be
   # reachable before giving up (and exiting 0 so the operator can fire
-  # manually). 5 minutes is enough for a Sovereign mid-cold-start.
-  autoWaitForAPISeconds: 300
+  # manually). Must finish below the HelmRelease install/upgrade
+  # timeout (15m for bp-self-sovereign-cutover) AND the activeDeadline
+  # below so the Job exits cleanly even when catalyst-api never comes
+  # up — 12 minutes leaves a healthy 3m buffer below the 15m HR cap.
+  autoWaitForAPISeconds: 720
   # Overall cap on the auto-trigger Job runtime. activeDeadlineSeconds
   # on the Job spec — anything longer means catalyst-api is sick and
   # the operator should investigate. The Job exiting at this deadline
   # is non-fatal for the chart install (the cutover engine already
   # runs detached inside catalyst-api once /start returns 200).
-  autoTimeoutSeconds: 600
+  # Must stay below the HelmRelease install/upgrade timeout (15m =
+  # 900s) so the Job ends and the hook unblocks before Helm gives up.
+  autoTimeoutSeconds: 840
   # TTL on the completed Job — kept for audit so operators can read
   # the trigger Pod logs if something looks wrong.
   autoJobTTLSeconds: 86400