fix(bp-newapi): pre-install gate on external-secrets webhook readiness (Fix #138) (#1347)

prov #20: bp-newapi 1.4.2 HR FAILED with the chart's
templates/external-secret.yaml apply rejected by the apiserver:

  Internal error occurred: failed calling webhook
  "validate.externalsecret.external-secrets.io": ...
  no endpoints available for service "external-secrets-webhook"

bp-external-secrets reaches HR Ready=True the moment its Deployments
report Ready, but Pod Ready != webhook EndpointSlice reachable: the
apiserver-side EndpointSlice for the webhook Service has not been
observed by the validating admission controller's lookup yet. Flux
dependsOn satisfies the dependency graph but does NOT close this race.

Same root-cause class as Fix #137 (bp-external-secrets-stores) but a
DIFFERENT chart and DIFFERENT validation endpoint (ExternalSecret vs
ClusterSecretStore).

Canonical seam (Inviolable Principle #16): the chart that CONSUMES the
webhook owns the readiness gate. NOT the upstream external-secrets
chart (Fix #137 territory) and NOT a Flux HR-level dependsOn (which
checks the wrong layer).

Adds platform/newapi/chart/templates/000-external-secrets-webhook-
readiness-job.yaml — a pre-install/pre-upgrade Helm hook that polls
the webhook (default
external-secrets-webhook.external-secrets-system.svc:443/validate-
external-secrets-io-v1beta1-externalsecret) until it returns a
structured HTTP response (200/400/405/415/422). 60s wall budget, 2s
interval, no RBAC required (curl-only Pod, HTTPS to ClusterIP).
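
The acceptance rule above can be sketched in plain POSIX shell. This is an illustration under stated assumptions, not the chart's actual script: `classify_probe_code` and `max_attempts` are hypothetical helper names, and the attempt bound assumes every probe returns immediately.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: map an HTTP status (or curl's 000 when no
# response arrives) to "ready" or "retry".
classify_probe_code() {
  case "$1" in
    200|400|405|415|422) echo "ready" ;;  # structured response: validator is serving
    *)                   echo "retry" ;;  # 503, 000, or any unexpected status
  esac
}

# Hypothetical helper: upper bound on attempts for a wall budget and
# interval, assuming each probe itself takes no time.
max_attempts() {
  echo $(( $1 / $2 ))
}

classify_probe_code 400   # prints "ready"
classify_probe_code 503   # prints "retry"
max_attempts 60 2         # prints "30"
```

A slow probe (curl is capped at 5s per request in the Job) lowers the real attempt count below that bound.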

Templated end-to-end via .Values.externalSecretsWebhookGate.* per
Inviolable Principle #4 — operator may override service, namespace,
port, path, timeout, interval, or disable the gate entirely from a
per-Sovereign overlay.
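
As a rough sketch of how the overridable knobs combine into the probe target, the shell below mirrors the default-then-override behaviour. The `GATE_*` environment variable names and the `eso` namespace are hypothetical stand-ins for `.Values.externalSecretsWebhookGate.*` overrides; the chart itself resolves these in the Helm template, not in shell.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: build the probe URL, falling back to the
# documented defaults when an override is not set.
build_webhook_url() {
  svc="${GATE_SERVICE:-external-secrets-webhook}"
  ns="${GATE_NAMESPACE:-external-secrets-system}"
  port="${GATE_PORT:-443}"
  path="${GATE_PATH:-/validate-external-secrets-io-v1beta1-externalsecret}"
  printf 'https://%s.%s.svc:%s%s\n' "$svc" "$ns" "$port" "$path"
}

# Defaults only.
build_webhook_url

# Overlay-style override, scoped to a subshell so it does not leak.
( GATE_NAMESPACE=eso; GATE_PORT=9443; build_webhook_url )
```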

Capability-gated on the external-secrets.io/v1beta1 CRD AND on the
existing catalystIntegration.externalSecret.enabled chain, so a
Sovereign that disables catalyst-integration pays no probe overhead.

Chart 1.4.2 -> 1.4.4 (1.4.3 was a deploy-only image-tag bump).
HR template clusters/_template/bootstrap-kit/80-newapi.yaml repinned.

## Claimed TCs
Infra-only fix; no UI behaviour change. Unblocks bp-newapi reaching
HR Ready=True on every fresh provision, which is a hard prerequisite
for:
- ADR-0003 §3.2 Catalyst signup hook (alice -> per-user NewAPI key)
- alice signup gate 5 (LLM) end-to-end
- Any TC that exercises /v1/* customer API or admin.<sovereign-fqdn>

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-11 05:03:08 +04:00 committed by GitHub
parent 47f568923a
commit 54e65aa4b1
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
4 changed files with 308 additions and 2 deletions


@@ -92,7 +92,17 @@ spec:
# `.github/workflows/build-bp-newapi.yaml` workflow). Pre-1.4.2
# the NewAPI Pod ImagePullBackOff'd 403 on every fresh Sovereign,
# blocking alice signup gate 5 (LLM).
version: 1.4.2
# 1.4.4 (qa-loop bounded-cycle audit prov #20 Fix #138, 2026-05-11):
# add pre-install/pre-upgrade hook that polls the external-secrets
# validating-admission webhook until it returns a structured HTTP
# response — closes the race between bp-external-secrets reaching
# HR Ready=True and the apiserver-side EndpointSlice for the
# webhook Service being observable. Pre-1.4.4 the chart's
# ExternalSecret apply was rejected with `no endpoints available
# for service "external-secrets-webhook"` on every fresh provision,
# blocking the chart from reaching Ready and the Catalyst signup
# hook (ADR-0003 §3.2) from finding the admin-token Secret.
version: 1.4.4
sourceRef:
kind: HelmRepository
name: bp-newapi


@@ -1,5 +1,41 @@
apiVersion: v2
name: bp-newapi
# 1.4.4 (qa-loop bounded-cycle audit prov #20 Fix #138, 2026-05-11): add
# pre-install/pre-upgrade hook (templates/000-external-secrets-webhook-
# readiness-job.yaml) that polls the external-secrets validating-
# admission webhook (`external-secrets-webhook.external-secrets-system.
# svc:443/validate-external-secrets-io-v1beta1-externalsecret`) until
# it returns a structured HTTP response (200/400/405/415/422). Closes
# the race between bp-external-secrets reaching HR Ready=True (Pods
# Ready) and the apiserver-side EndpointSlice for the webhook Service
# being observable by the validating admission controller.
#
# ROOT CAUSE (prov #20):
# bp-newapi 1.4.2 HR FAILED with the chart's templates/external-
# secret.yaml apply rejected by the apiserver:
# Internal error occurred: failed calling webhook
# "validate.externalsecret.external-secrets.io": ...
# no endpoints available for service "external-secrets-webhook"
# bp-external-secrets satisfies Flux dependsOn the moment its
# Deployments report Ready, but Pod Ready ≠ webhook EndpointSlice
# reachable. The chart immediately tried to apply ExternalSecret and
# the webhook returned 503/connect-error.
#
# CANONICAL SEAM (Inviolable Principle #16):
# The chart that CONSUMES the webhook owns the readiness gate, NOT
# the upstream external-secrets chart (that is Fix #137 territory)
# and NOT a Flux HR-level dependsOn (which checks the wrong layer).
# Fix #138 mirrors Fix #137's pattern but probes a different
# validation endpoint (ExternalSecret vs ClusterSecretStore).
#
# TEMPLATABILITY (Inviolable Principle #4):
# Every knob (webhook service name, namespace, port, path, timeout,
# interval, gate-enabled flag) is operator-tunable from per-
# Sovereign overlays via .Values.externalSecretsWebhookGate.*.
#
# 1.4.3: deploy-only bump (a9861f94 — no semantic change beyond
# carrying the v0.13.2 image-tag fix to the bootstrap-kit).
#
# 1.4.2 (qa-loop bounded-cycle audit prov #7 Gap F, 2026-05-10): point
# `.Values.newapi.image.tag` at a tag that ACTUALLY EXISTS in GHCR. Pre-
# 1.4.2 the chart referenced `ghcr.io/openova-io/openova/newapi-mirror:
@@ -86,7 +122,7 @@ name: bp-newapi
# Issue #915 (epic SME tenant integration DoD: alice → OpenClaw →
# NewAPI → Qwen3.6@BankDhofar end-to-end).
# 1.2.0: Traefik Middleware gated behind ingress.middleware.enabled.
-version: 1.4.3
+version: 1.4.4
appVersion: "0.13.2"
description: |
Catalyst Blueprint scratch chart for NewAPI — multi-tenant LLM


@@ -0,0 +1,218 @@
{{- /*
External-Secrets webhook readiness pre-install/pre-upgrade gate
(qa-loop bounded-cycle audit prov #20 Fix #138).
ROOT CAUSE
----------
prov #20: bp-newapi 1.4.2 HR FAILED with the chart's ExternalSecret being
rejected by the apiserver:
Internal error occurred: failed calling webhook
"validate.externalsecret.external-secrets.io": failed to call webhook:
Post "https://external-secrets-webhook.external-secrets-system.svc:443/
validate-external-secrets-io-v1beta1-externalsecret?timeout=5s":
no endpoints available for service "external-secrets-webhook"
bp-external-secrets reaches HR Ready=True the moment its Deployments have
spec.replicas == status.readyReplicas, but Pod Ready ≠ webhook endpoint
reachable: the apiserver-side EndpointSlice for the webhook Service has
not been observed by the validating admission controller's lookup yet.
Flux dependsOn satisfies the dependency graph but does NOT close this
race — the chart immediately tries to apply ExternalSecret and the
webhook returns 503 / connect-error.
This is the SAME root-cause class as bp-external-secrets-stores (Fix
#137), but a DIFFERENT chart (bp-newapi) and DIFFERENT consumer
(ExternalSecret resource on slot 80, not ClusterSecretStore on slot 15a).
CANONICAL SEAM
--------------
A LOCAL pre-install/pre-upgrade Helm hook on bp-newapi — NOT a fix to
the upstream external-secrets chart (owned by Fix #137 territory and
the upstream chain), and NOT a Flux HelmRelease-level dependsOn (which
checks the wrong layer). The chart that consumes the webhook is the
chart that must gate on its readiness, per docs/INVIOLABLE-PRINCIPLES.md
#3 (event-driven, not timeout band-aid; the consumer owns the gate).
The Job follows the same pattern as
platform/k8s-ws-proxy/chart/templates/hmac-bootstrap-job.yaml (the
canonical Catalyst seam for in-chart pre-install hooks):
curlimages/curl:8.10.1 image, a dedicated ServiceAccount with NO RBAC
bindings (the probe is HTTPS to a Service ClusterIP, not a k8s API call),
hook-weight ordering SA(-20) → Job(-10), and hook-delete-policy
before-hook-creation,hook-succeeded on the Job.
WEBHOOK PROBE LOGIC
-------------------
Polls https://<webhook-svc>:443/<validate-path> via curl --insecure
(the webhook serves a self-signed cert from cert-manager; we don't need
TLS validation here — only L7 reachability). Acceptance criteria:
- HTTP 200/400/405/415/422 → webhook IS serving (any structured response
from the admission controller, even a 400 "wrong content type", proves
the validator code is alive AND the EndpointSlice is observable)
- HTTP 503 or connect-error (curl reports 000) → webhook NOT YET serving → retry
- any other status → logged as unexpected, treated as not ready → retry
60s wall budget with 2s probe interval (30 attempts max). On exhaustion
the probe Pod exits 1; the Job re-runs it up to backoffLimit (3) times
and then FAILS, which surfaces as Helm install failure → HelmRelease
remediation retries kick in (3 retries by default) → on the next
attempt the EndpointSlice has typically converged.
The webhook URL/namespace are fully templatable from values.yaml
(.Values.externalSecretsWebhookGate.*) per docs/INVIOLABLE-PRINCIPLES.md
#4 (never hardcode); operator overlays may flip
.Values.externalSecretsWebhookGate.enabled=false on Sovereigns where
external-secrets is provisioned via a different distribution.
RBAC
----
NONE required: the probe target is a Service ClusterIP at a known DNS
name + port; the curl Pod does not call the k8s API. The Job runs under
the dedicated `newapi-eswebhook-gate` ServiceAccount rendered below
(hook-weight -20), which has no Role/RoleBinding attached (curl-only
Pod, no kubectl).
CAPABILITY GATE
---------------
Skip the entire Job if the chart isn't going to render an ExternalSecret
in the first place. The condition mirrors templates/external-secret.yaml's
gate (catalystIntegration.enabled AND externalSecret.enabled AND CRD
present); without it, a Sovereign that disables catalystIntegration would
pay the 60s probe budget for nothing.
*/ -}}
{{- $ci := .Values.catalystIntegration | default dict -}}
{{- $es := $ci.externalSecret | default dict -}}
{{- $gate := .Values.externalSecretsWebhookGate | default dict -}}
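{{- /* `default true $gate.enabled` would turn an explicit false back into true (sprig treats false as empty), so test for the key instead. */ -}}
{{- $gateEnabled := true -}}
{{- if hasKey $gate "enabled" -}}{{- $gateEnabled = $gate.enabled -}}{{- end -}}
{{- if and $gateEnabled $ci.enabled $es.enabled (.Capabilities.APIVersions.Has "external-secrets.io/v1beta1") -}}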
{{- $ns := .Release.Namespace }}
{{- $svc := $gate.service | default "external-secrets-webhook" -}}
{{- $svcNs := $gate.namespace | default "external-secrets-system" -}}
{{- $port := $gate.port | default 443 -}}
{{- $path := $gate.path | default "/validate-external-secrets-io-v1beta1-externalsecret" -}}
{{- $timeoutSeconds := $gate.timeoutSeconds | default 60 -}}
{{- $intervalSeconds := $gate.intervalSeconds | default 2 -}}
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: newapi-eswebhook-gate
namespace: {{ $ns }}
labels:
catalyst.openova.io/blueprint: bp-newapi
catalyst.openova.io/component: external-secrets-webhook-gate
annotations:
"helm.sh/hook": "pre-install,pre-upgrade"
"helm.sh/hook-weight": "-20"
"helm.sh/hook-delete-policy": "before-hook-creation"
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 2 }}
{{- end }}
---
apiVersion: batch/v1
kind: Job
metadata:
name: newapi-eswebhook-gate
namespace: {{ $ns }}
labels:
catalyst.openova.io/blueprint: bp-newapi
catalyst.openova.io/component: external-secrets-webhook-gate
annotations:
"helm.sh/hook": "pre-install,pre-upgrade"
"helm.sh/hook-weight": "-10"
"helm.sh/hook-delete-policy": "before-hook-creation,hook-succeeded"
spec:
backoffLimit: 3
ttlSecondsAfterFinished: 300
template:
metadata:
labels:
catalyst.openova.io/blueprint: bp-newapi
catalyst.openova.io/component: external-secrets-webhook-gate
spec:
serviceAccountName: newapi-eswebhook-gate
restartPolicy: Never
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
securityContext:
runAsNonRoot: true
runAsUser: 65534
runAsGroup: 65534
seccompProfile:
type: RuntimeDefault
containers:
- name: probe
image: curlimages/curl:8.10.1
imagePullPolicy: IfNotPresent
env:
- name: WEBHOOK_URL
value: {{ printf "https://%s.%s.svc:%v%s" $svc $svcNs $port $path | quote }}
- name: TIMEOUT_SECONDS
value: {{ $timeoutSeconds | quote }}
- name: INTERVAL_SECONDS
value: {{ $intervalSeconds | quote }}
command:
- /bin/sh
- -c
- |
set -eu
echo "[eswebhook-gate] target: ${WEBHOOK_URL}"
echo "[eswebhook-gate] budget: ${TIMEOUT_SECONDS}s, interval: ${INTERVAL_SECONDS}s"
deadline=$(( $(date +%s) + ${TIMEOUT_SECONDS} ))
attempt=0
while [ "$(date +%s)" -lt "${deadline}" ]; do
attempt=$(( attempt + 1 ))
# POST a minimal AdmissionReview-shaped probe. The webhook
# rejects malformed bodies with HTTP 400 — that is PROOF the
# validator code is alive AND the EndpointSlice is reachable.
# 503 / connect-error means no endpoints yet (Service has 0
# endpoint addresses) — keep retrying.
code=$(curl -ksS -o /tmp/probe.body -w '%{http_code}' \
--max-time 5 \
-H 'Content-Type: application/json' \
-X POST \
--data '{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","request":{"uid":"probe","kind":{"group":"external-secrets.io","version":"v1beta1","kind":"ExternalSecret"}}}' \
"${WEBHOOK_URL}" 2>/tmp/probe.err || true)
# curl's -w already prints 000 when no HTTP response arrives, so a
# fallback `echo "000"` would yield "000000"; default only if empty.
code="${code:-000}"
case "${code}" in
200|400|405|415|422)
echo "[eswebhook-gate] webhook reachable (HTTP ${code}) after ${attempt} attempt(s) — proceeding"
exit 0
;;
503|000)
# 503 = no endpoints; 000 = curl connect-error (DNS
# resolves but no endpoint, or TLS handshake aborted
# because Pod isn't actually ready yet).
echo "[eswebhook-gate] attempt ${attempt}: HTTP ${code} (not ready) — sleeping ${INTERVAL_SECONDS}s"
sleep "${INTERVAL_SECONDS}"
;;
*)
echo "[eswebhook-gate] attempt ${attempt}: HTTP ${code} (unexpected, treating as not-ready)"
cat /tmp/probe.body 2>/dev/null || true
sleep "${INTERVAL_SECONDS}"
;;
esac
done
echo "[eswebhook-gate] FATAL: webhook ${WEBHOOK_URL} did not become reachable within ${TIMEOUT_SECONDS}s (last code: ${code:-unknown})" >&2
cat /tmp/probe.body 2>/dev/null || true
cat /tmp/probe.err 2>/dev/null || true
exit 1
resources:
requests:
cpu: 10m
memory: 32Mi
limits:
cpu: 100m
memory: 64Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir:
medium: Memory
sizeLimit: 8Mi
{{- end }}


@@ -470,6 +470,48 @@ catalystIntegration:
# `property` is the JSON field inside the OpenBao secret holding
# the bearer token used by the Catalyst signup hook (ADR-0003 §3.2).
property: "ADMIN_API_TOKEN"
# ─── External-Secrets webhook readiness gate (Fix #138, prov #20) ────────
# Pre-install/pre-upgrade hook that polls the external-secrets validating-
# admission webhook until it serves a structured response (HTTP 200/400/
# 405/415/422). Closes the race between bp-external-secrets reaching
# HR Ready=True (Pods Ready) and the apiserver-side EndpointSlice for
# the webhook Service being observable by the admission controller.
#
# Without this gate the chart's templates/external-secret.yaml apply
# attempt fails with:
# Internal error occurred: failed calling webhook
# "validate.externalsecret.external-secrets.io": ... no endpoints
# available for service "external-secrets-webhook"
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 every knob is operator-tunable
# via per-Sovereign overlay (e.g. swap to a different webhook namespace
# if external-secrets is provisioned via a non-Catalyst distribution).
# Per #3 the gate is event-driven (probe-then-proceed), NOT a blanket
# `helm install --timeout`.
externalSecretsWebhookGate:
# Default ON. Set false ONLY on Sovereigns where external-secrets is
# provisioned outside the Catalyst bootstrap-kit and the operator has
# confirmed the webhook is already serving by the time bp-newapi
# installs (rare; recommended to leave true).
enabled: true
# Webhook Service name (default = upstream external-secrets chart's
# `<release>-webhook` Service when releaseName is `external-secrets`).
service: "external-secrets-webhook"
# Namespace the webhook Service lives in. Catalyst default per
# clusters/_template/bootstrap-kit/15-external-secrets.yaml.
namespace: "external-secrets-system"
# HTTPS port the webhook serves on (upstream chart default is 443).
port: 443
# Validation path for ExternalSecret resources. Sourced from upstream
# chart's templates/validatingwebhook.yaml — different from the
# ClusterSecretStore path (used by bp-external-secrets-stores in Fix
# #137); each consumer probes its own validation endpoint.
path: "/validate-external-secrets-io-v1beta1-externalsecret"
# Wall-clock budget. 60s comfortably exceeds the typical 5-15s
# EndpointSlice propagation window observed across fresh provisions.
timeoutSeconds: 60
# Probe interval inside the wall-clock budget.
intervalSeconds: 2
# ─── Service ─────────────────────────────────────────────────────────────
service:
type: ClusterIP