fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185, prov #38/#39/#41 recurrence)

Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):

  bp-catalyst-platform HR install.timeout=15m
    → Helm pre-install hook: qa-finalizer-strip Job (weight -99)
      → Pod requests 50m CPU + 64Mi memory (tiny)
        → BUT no tolerations → scheduler restricted to worker
          → worker cpx32 (8vCPU/16GB) at 99% CPU requests
            (7980m of 8000m allocated) after bootstrap-kit fan-out
            → FailedScheduling: "0/2 nodes are available: 1
              Insufficient cpu, 1 node(s) had untolerated taint
              {node-role.kubernetes.io/control-plane: true}"
            → autoscaler triggers a scale-up (node count 2→3) → "1 in
              backoff after failed scale-up" → still Pending → 15m timeout
              → InstallFailed → Flux uninstall+rollback → installFailures: 3
              → Flux gives up entirely

Live evidence quoted from chroot kubeconfig on prov #41:
  - bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
    message="Running 'install' action with timeout of 15m0s"`
  - HR `Released=False, reason=InstallFailed, message="Helm install
    failed for release catalyst-system/catalyst-platform with chart
    bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
    * timed out waiting for the condition"`
  - Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
    `Warning  FailedScheduling 108s default-scheduler 0/2 nodes are
    available: 1 Insufficient cpu, 1 node(s) had untolerated taint
    {node-role.kubernetes.io/control-plane: true}`
  - Worker `Allocated cpu 7980m (99%) of 8000m capacity`
  - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)

Fix: add tolerations for the control-plane and master NoSchedule taints +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
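
For reference, the pod-spec addition as it lands in the rendered Job
(the full hunk is in the template diff below):

  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  priorityClassName: system-cluster-critical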

Why prior fixes didn't suffice:
  - Fix #114 introduced this hook to break a finalizer-deadlock loop
    on prov #9. Correct fix for that wedge; never anticipated worker
    saturation as a scheduling failure mode for the hook itself.
  - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
    qa-cnpg-status-seed hooks (weight 0/post-install) to regular
    release resources to break a circular DAG dep. Different hook
    surface.
  - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install
    hook (weight +10) wait budget for cold-start autoscaler. That
    hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
    never starts, the +10 hook never runs.

Recurring class: same family as Fix #114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:

  - Switch the hook image from bitnamilegacy/kubectl:1.29.3 (the
    Docker-Hub redirect for deprecated Bitnami images, 2025-08 cutover
    documented at platform/self-sovereign-cutover/chart/values.yaml:252)
    to harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4, the
    canonical alpine-based kubectl image already used by the sibling
    hook catalyst-gitea-token-mint (Fix #163). Per the MIRROR-EVERYTHING
    + ARCHITECT-FIRST rules; the before/after image line is quoted below.
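
    For reference, the image line before and after the switch (quoted
    from the template diff below):

      # before: Docker-Hub redirect for the deprecated Bitnami images
      image: harbor.openova.io/proxy-dockerhub/bitnamilegacy/kubectl:1.29.3
      # after: canonical alpine-based kubectl, tag matching the chart family
      image: harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4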

Coordinator follow-up tickets:
  - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
    (qa-cnpgpair-status-seed) still reference
    bitnamilegacy/kubectl:1.29.3 — same Bitnami-deprecation class. Out
    of scope for this Fix (not part of the recurrence cluster); flagged
    for a sweep.
  - The worker cpx32 may be undersized for the bootstrap-kit fan-out
    on omantel.biz — separate sizing ticket, not blocking.

Changes:
  - products/catalyst/chart/templates/qa-fixtures/pre-install-
    finalizer-strip.yaml: add tolerations + priorityClassName;
    switch image to alpine/k8s:1.31.4. Inline doc comments explain
    the 4-layer trace and the Fix #114/#138/#184 history.
  - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
    changelog entry capturing root cause + budget arithmetic.
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
    bump HR pin 1.4.140 → 1.4.141.

Verification:
  - helm template renders cleanly (exit 0, ~6700 lines).
  - kubectl apply --dry-run=client validates the rendered Job
    manifest (job.batch/qa-finalizer-strip created (dry run)).
  - Rendered Job contains tolerations[control-plane Exists NoSchedule],
    priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-12 02:36:01 +02:00
parent 569d780b86
commit 746d5df152
3 changed files with 141 additions and 3 deletions

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml

@@ -458,7 +458,22 @@ spec:
# before gitea was reachable → bp-catalyst-platform installFailed
# and HR loop-rolled forever. Budget arithmetic: hook 840s + 60s
# slack ≤ HR install.timeout 900s (15m).
version: 1.4.140
# 1.4.141 (qa-loop Fix #185, prov #38/#39/#41 recurrence,
# 2026-05-12): qa-finalizer-strip pre-install hook (helm.sh/hook-
# weight -99) now tolerates the control-plane NoSchedule taint
# and runs with priorityClassName: system-cluster-critical so it
# is ALWAYS schedulable regardless of worker-node CPU saturation.
# Root cause on prov #41: after bootstrap-kit fan-out the worker
# (cpx32, 8vCPU/16GB) sat at 99% CPU requests; the autoscaler
# had backed off scale-up of a second worker; the Job's 50m CPU
# request couldn't be satisfied; Helm pre-install timed out at
# 15m; Flux remediated 3× and gave up. Same recurring failure on
# prov #38, #39, #41 — all on chart pin 1.4.140 which (correctly)
# had no scheduling concession for the -99 hook. Image switched
# from bitnamilegacy/kubectl:1.29.3 → alpine/k8s:1.31.4 in same
# commit (rule-17 MIRROR-EVERYTHING hygiene; bitnamilegacy is
# the Docker-Hub redirect for deprecated Bitnami 2025-08 cutover).
version: 1.4.141
sourceRef:
kind: HelmRepository
name: bp-catalyst-platform

products/catalyst/chart/Chart.yaml

@@ -1058,8 +1058,59 @@ name: bp-catalyst-platform
# Fix #154 (HR-timeout audit). Those bumped the HelmRelease
# install.timeout. This bumps the chart-INTERNAL wait loop budget
# inside the pre-install hook Job, which is a different seam.
version: 1.4.140
version: 1.4.141
appVersion: 1.4.94
# 1.4.141 (qa-loop Fix #185, prov #38/#39/#41 recurrence — pre-install
# hook unschedulable on saturated worker):
#
# Symptom (prov #41, omantel.biz, 2026-05-12 00:28 UTC):
# bp-catalyst-platform HR stuck Reconciling → InstallFailed →
# "failed pre-install: timed out waiting for the condition" after 15m.
# Flux uninstall remediation runs, then re-installs, looping until
# `installFailures: 3`, after which Flux gives up entirely.
#
# Root cause:
# The qa-finalizer-strip pre-install Job (helm.sh/hook-weight -99,
# introduced by Fix #114 to break a finalizer-deadlock loop) has no
# tolerations. On a fresh Sovereign with workerCount=0 + autoscaler
# (Fix #157), the FIRST autoscaled worker is sized just large enough
# for the rest of the bootstrap-kit Pods; by the time
# bp-catalyst-platform HR triggers pre-install, the worker is at
# 99% CPU requests (7980m of 8000m allocated) and the autoscaler
# has backed off scale-up of a second worker. Pod sits Pending
# forever ("FailedScheduling: 0/2 nodes are available: 1
# Insufficient cpu, 1 node(s) had untolerated taint
# {node-role.kubernetes.io/control-plane: true}"). Helm pre-install
# times out, Flux remediates 3×, gives up.
#
# Fix: add tolerations for control-plane NoSchedule + master taints +
# priorityClassName: system-cluster-critical to the qa-finalizer-strip
# Job. The hook is a defense-in-depth cleanup that runs in seconds; it
# MUST be schedulable somewhere on the cluster regardless of worker
# saturation. Control-plane node on prov #41 sits at 7% CPU / 9%
# memory — 7365m CPU free vs. the hook's 50m request.
#
# Why prior fixes didn't suffice:
# - Fix #114 introduced this hook; never anticipated worker
# saturation at install time.
# - Fix #138 (1.4.138) addressed CIRCULAR-DEP post-install seeders,
# a different hook surface.
# - Fix #184 (1.4.140) raised the gitea-token-mint pre-install hook
# (weight +10) wait budget. That hook runs AFTER qa-finalizer-strip
# (-99 < +10); if the -99 hook never starts, the +10 hook never
# runs either.
#
# Coupled chart hygiene (rule 17, MIRROR-EVERYTHING + ARCHITECT-FIRST):
# - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
# redirect for deprecated Bitnami images, 2025-08 cutover) to
# harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4 — the
# canonical alpine-based kubectl image already used by sibling
# hook catalyst-gitea-token-mint (Fix #163).
#
# Recurring class: same family as Fix #114 (hook scheduling failure
# wedges entire HR install), Fix #138 (circular-dep hooks), Fix #184
# (cold-start budget). This addresses the SCHEDULING surface of the
# weight -99 hook itself.
# 1.4.129 (qa-loop iter-16 Fix #65): ship the missing
# `openova-catalog` Flux v1 HelmRepository in flux-system. The
# application-controller has always defaulted its rendered HelmRelease

products/catalyst/chart/templates/qa-fixtures/pre-install-finalizer-strip.yaml

@@ -144,6 +144,70 @@ spec:
spec:
serviceAccountName: qa-finalizer-strip
restartPolicy: OnFailure
# ── Fix #185 (prov #38/#39/#41 recurrence, 2026-05-12) ────────
# Tolerate the control-plane NoSchedule taint so this pre-install
# hook can ALWAYS land somewhere on the cluster regardless of
# worker-node CPU saturation. On a fresh Sovereign with
# workerCount=0 + autoscaler-hcloud (Fix #157), the FIRST worker
# is sized just large enough for the bootstrap-kit Pods themselves;
# bp-catalyst-platform's umbrella install lands several Deployments
# + StatefulSets that push that worker to 99% CPU requests before
# the umbrella's own pre-install hooks complete. Without this
# toleration the pre-install Job sits Pending forever (autoscaler
# backoff after failed scale-up of a second worker), Helm pre-
# install times out at 15m, Flux remediates 3× then gives up.
# prov #38/#39/#41 each hit this in succession on chart pin 1.4.140.
#
# 4-layer trace (prov #41, omantel.biz, 2026-05-12 00:28 UTC):
# bp-catalyst-platform HR install.timeout=15m
# → Helm pre-install hook: qa-finalizer-strip Job (weight -99)
# → Pod requests 50m CPU + 64Mi memory (tiny)
# → BUT no tolerations → scheduler restricted to worker
# → worker cpx32 (8vCPU/16GB) at 99% CPU requests
# (7980m of 8000m allocated) after bootstrap-kit fan-out
# → FailedScheduling: Insufficient cpu (event quoted:
# "0/2 nodes are available: 1 Insufficient cpu, 1
# node(s) had untolerated taint
# {node-role.kubernetes.io/control-plane: true}")
# → autoscaler triggers a scale-up (node count 2→3) but
# goes into "1 in backoff after failed scale-up"
# → still Pending → 15m timeout → InstallFailed
# → Flux uninstall+rollback → installFailures: 3 → Flux gives up
#
# Fix: the qa-finalizer-strip hook is a defense-in-depth cleanup
# that completes in seconds on a clean cluster. It legitimately
# belongs ANYWHERE there's free capacity, including the control-
# plane node (which on prov #41 sits at 7% CPU / 9% memory —
# 7365m CPU free vs. 50m request). Tolerating the CP taint +
# system-cluster-critical priority is the canonical pattern for
# short-lived chart-bootstrap Jobs that MUST not be capacity-
# blocked.
#
# Why prior fixes didn't suffice:
# - Fix #114 introduced this hook to break a finalizer-deadlock
# loop on prov #9. Correct scope, never anticipated worker
# saturation as a scheduling failure mode.
# - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-
# seed + qa-cnpg-status-seed hooks (weight 0/post-install) to
# regular release resources to break a circular DAG dep.
# Different hook surface from this one.
# - Fix #184 (chart 1.4.140) raised the gitea-token-mint hook
# (weight +10) wait budget for cold-start autoscaler. That
# hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99
# hook never starts, the +10 hook never runs either.
#
# priorityClassName: system-cluster-critical is a stock built-in
# k8s PriorityClass (value 2000000000) — high enough to preempt
# user workloads if the scheduler exhausts other options. Safe
# on every Sovereign.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
priorityClassName: system-cluster-critical
securityContext:
runAsNonRoot: true
runAsUser: 65534
@@ -153,7 +217,15 @@ spec:
type: RuntimeDefault
containers:
- name: strip
image: harbor.openova.io/proxy-dockerhub/bitnamilegacy/kubectl:1.29.3
# Fix #185: switch bitnamilegacy/kubectl → alpine/k8s per
# CLAUDE.md ARCHITECT-FIRST + MIRROR-EVERYTHING rules.
# bitnamilegacy is a Docker-Hub redirect for the deprecated
# Bitnami images (2025-08 cutover documented in platform/
# self-sovereign-cutover/chart/values.yaml:252). alpine/k8s
# is the canonical alpine-based kubectl image already used
# by the sibling pre-install hook catalyst-gitea-token-mint
# (Fix #163). Tag 1.31.4 matches the rest of the chart family.
image: harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4
imagePullPolicy: IfNotPresent
securityContext:
allowPrivilegeEscalation: false