prov #20: bp-newapi 1.4.2 HR FAILED with the chart's templates/external-secret.yaml apply rejected by the apiserver:

    Internal error occurred: failed calling webhook
    "validate.externalsecret.external-secrets.io": ...
    no endpoints available for service "external-secrets-webhook"

bp-external-secrets reaches HR Ready=True the moment its Deployments report Ready, but Pod Ready != webhook EndpointSlice reachable: the apiserver-side EndpointSlice for the webhook Service has not been observed by the validating admission controller's lookup yet. Flux dependsOn satisfies the dependency graph but does NOT close this race.

Same root-cause class as Fix #137 (bp-external-secrets-stores) but a DIFFERENT chart and DIFFERENT validation endpoint (ExternalSecret vs ClusterSecretStore).

Canonical seam (Inviolable Principle #16): the chart that CONSUMES the webhook owns the readiness gate. NOT the upstream external-secrets chart (Fix #137 territory) and NOT a Flux HR-level dependsOn (which checks the wrong layer).

Adds platform/newapi/chart/templates/000-external-secrets-webhook-readiness-job.yaml — a pre-install/pre-upgrade Helm hook that polls the webhook (default external-secrets-webhook.external-secrets-system.svc:443/validate-external-secrets-io-v1beta1-externalsecret) until it returns a structured HTTP response (200/400/405/415/422). 60s wall budget, 2s interval, no RBAC required (curl-only Pod, HTTPS to ClusterIP).

Templated end-to-end via .Values.externalSecretsWebhookGate.* per Inviolable Principle #4 — operator may override service, namespace, port, path, timeout, interval, or disable the gate entirely from a per-Sovereign overlay.

Capability-gated on the external-secrets.io/v1beta1 CRD AND on the existing catalystIntegration.externalSecret.enabled chain, so a Sovereign that disables catalyst-integration pays no probe overhead.

Chart 1.4.2 -> 1.4.4 (1.4.3 was a deploy-only image-tag bump). HR template clusters/_template/bootstrap-kit/80-newapi.yaml repinned.

## Claimed TCs

Infra-only fix; no UI behaviour change. Unblocks bp-newapi reaching HR Ready=True on every fresh provision, which is a hard prerequisite for:

- ADR-0003 §3.2 Catalyst signup hook (alice -> per-user NewAPI key)
- alice signup gate 5 (LLM) end-to-end
- Any TC that exercises /v1/* customer API or admin.<sovereign-fqdn>

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 47f568923a
commit 54e65aa4b1
@@ -92,7 +92,17 @@ spec:
       # `.github/workflows/build-bp-newapi.yaml` workflow). Pre-1.4.2
       # the NewAPI Pod ImagePullBackOff'd 403 on every fresh Sovereign,
       # blocking alice signup gate 5 (LLM).
-      version: 1.4.2
+      # 1.4.4 (qa-loop bounded-cycle audit prov #20 Fix #138, 2026-05-11):
+      # add pre-install/pre-upgrade hook that polls the external-secrets
+      # validating-admission webhook until it returns a structured HTTP
+      # response — closes the race between bp-external-secrets reaching
+      # HR Ready=True and the apiserver-side EndpointSlice for the
+      # webhook Service being observable. Pre-1.4.4 the chart's
+      # ExternalSecret apply was rejected with `no endpoints available
+      # for service "external-secrets-webhook"` on every fresh provision,
+      # blocking the chart from reaching Ready and the Catalyst signup
+      # hook (ADR-0003 §3.2) from finding the admin-token Secret.
+      version: 1.4.4
       sourceRef:
         kind: HelmRepository
         name: bp-newapi

@@ -1,5 +1,41 @@
 apiVersion: v2
 name: bp-newapi
+# 1.4.4 (qa-loop bounded-cycle audit prov #20 Fix #138, 2026-05-11): add
+# pre-install/pre-upgrade hook (templates/000-external-secrets-webhook-
+# readiness-job.yaml) that polls the external-secrets validating-
+# admission webhook (`external-secrets-webhook.external-secrets-system.
+# svc:443/validate-external-secrets-io-v1beta1-externalsecret`) until
+# it returns a structured HTTP response (200/400/405/415/422). Closes
+# the race between bp-external-secrets reaching HR Ready=True (Pods
+# Ready) and the apiserver-side EndpointSlice for the webhook Service
+# being observable by the validating admission controller.
+#
+# ROOT CAUSE (prov #20):
+# bp-newapi 1.4.2 HR FAILED with the chart's templates/external-
+# secret.yaml apply rejected by the apiserver:
+# Internal error occurred: failed calling webhook
+# "validate.externalsecret.external-secrets.io": ...
+# no endpoints available for service "external-secrets-webhook"
+# bp-external-secrets satisfies Flux dependsOn the moment its
+# Deployments report Ready, but Pod Ready ≠ webhook EndpointSlice
+# reachable. The chart immediately tried to apply ExternalSecret and
+# the webhook returned 503/connect-error.
+#
+# CANONICAL SEAM (Inviolable Principle #16):
+# The chart that CONSUMES the webhook owns the readiness gate — NOT
+# the upstream external-secrets chart (owned by Fix #137 territory)
+# and NOT a Flux HR-level dependsOn (which checks the wrong layer).
+# Fix #138 mirrors Fix #137's pattern but probes a different
+# validation endpoint (ExternalSecret vs ClusterSecretStore).
+#
+# TEMPLATABILITY (Inviolable Principle #4):
+# Every knob (webhook service name, namespace, port, path, timeout,
+# interval, gate-enabled flag) is operator-tunable from per-
+# Sovereign overlays via .Values.externalSecretsWebhookGate.*.
+#
+# 1.4.3: deploy-only bump (a9861f94 — no semantic change beyond
+# carrying the v0.13.2 image-tag fix to the bootstrap-kit).
+#
 # 1.4.2 (qa-loop bounded-cycle audit prov #7 Gap F, 2026-05-10): point
 # `.Values.newapi.image.tag` at a tag that ACTUALLY EXISTS in GHCR. Pre-
 # 1.4.2 the chart referenced `ghcr.io/openova-io/openova/newapi-mirror:
@@ -86,7 +122,7 @@ name: bp-newapi
 # Issue #915 (epic SME tenant integration DoD: alice → OpenClaw →
 # NewAPI → Qwen3.6@BankDhofar end-to-end).
 # 1.2.0: Traefik Middleware gated behind ingress.middleware.enabled.
-version: 1.4.3
+version: 1.4.4
 appVersion: "0.13.2"
 description: |
   Catalyst Blueprint scratch chart for NewAPI — multi-tenant LLM
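
The TEMPLATABILITY note in the Chart.yaml changelog says every gate knob can be set from a per-Sovereign overlay. As a rough illustration only, the sketch below writes such an overlay to a temporary file. The key names come from `.Values.externalSecretsWebhookGate.*` in this commit; the namespace `eso-system`, the relaxed timings, and the use of a temp file are assumptions made for the example, not values taken from any real Sovereign.

```sh
#!/bin/sh
set -eu

# Hypothetical overlay file; a real overlay would live wherever the
# Sovereign's values overrides are kept.
overlay="$(mktemp)"

cat > "${overlay}" <<'EOF'
externalSecretsWebhookGate:
  enabled: true
  # Assumed namespace for an external-secrets install outside the bootstrap-kit.
  namespace: "eso-system"
  # Assumed, more generous budget for a slow cluster.
  timeoutSeconds: 120
  intervalSeconds: 5
EOF

cat "${overlay}"
```

Keys left out of the overlay (`service`, `port`, `path`) keep the chart defaults, because the hook template falls back to its `default` values for any unset knob.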

@@ -0,0 +1,218 @@
+{{- /*
+External-Secrets webhook readiness pre-install/pre-upgrade gate
+(qa-loop bounded-cycle audit prov #20 Fix #138).
+
+ROOT CAUSE
+----------
+prov #20: bp-newapi 1.4.2 HR FAILED with the chart's ExternalSecret being
+rejected by the apiserver:
+    Internal error occurred: failed calling webhook
+    "validate.externalsecret.external-secrets.io": failed to call webhook:
+    Post "https://external-secrets-webhook.external-secrets-system.svc:443/
+    validate-external-secrets-io-v1beta1-externalsecret?timeout=5s":
+    no endpoints available for service "external-secrets-webhook"
+
+bp-external-secrets reaches HR Ready=True the moment its Deployments have
+spec.replicas == status.readyReplicas, but Pod Ready ≠ webhook endpoint
+reachable: the apiserver-side EndpointSlice for the webhook Service has
+not been observed by the validating admission controller's lookup yet.
+Flux dependsOn satisfies the dependency graph but does NOT close this
+race — the chart immediately tries to apply ExternalSecret and the
+webhook returns 503 / connect-error.
+
+This is the SAME root-cause class as bp-external-secrets-stores (Fix
+#137), but a DIFFERENT chart (bp-newapi) and DIFFERENT consumer
+(ExternalSecret resource on slot 80, not ClusterSecretStore on slot 15a).
+
+CANONICAL SEAM
+--------------
+A LOCAL pre-install/pre-upgrade Helm hook on bp-newapi — NOT a fix to
+the upstream external-secrets chart (owned by Fix #137 territory and
+the upstream chain), and NOT a Flux HelmRelease-level dependsOn (which
+checks the wrong layer). The chart that consumes the webhook is the
+chart that must gate on its readiness, per docs/INVIOLABLE-PRINCIPLES.md
+#3 (event-driven, not timeout band-aid; the consumer owns the gate).
+
+The Job is identical-pattern to platform/k8s-ws-proxy/chart/templates/
+hmac-bootstrap-job.yaml (the canonical Catalyst seam for in-chart pre-
+install hooks): curlimages/curl:8.10.1, in-cluster ServiceAccount with
+NO RBAC needed (probe is HTTPS to a Service ClusterIP, not k8s API),
+hook-weight ordering SA(-20) → Job(-10), hook-delete-policy
+before-hook-creation,hook-succeeded.
+
+WEBHOOK PROBE LOGIC
+-------------------
+Polls https://<webhook-svc>:443/<validate-path> via curl --insecure
+(the webhook serves a self-signed cert from cert-manager; we don't need
+TLS validation here — only L7 reachability). Acceptance criteria:
+- HTTP 200/400/405 → webhook IS serving (any structured response from
+  the admission controller, even a 400 "wrong content type", proves
+  the validator code is alive AND the EndpointSlice is observable)
+- HTTP 503 or connect-error → webhook NOT YET serving → retry
+
+60s wall budget with 2s probe interval (30 attempts max). On exhaust
+the Job FAILS, which surfaces as Helm install failure → HelmRelease
+remediation retries kick in (3 retries by default) → on the next
+attempt the EndpointSlice has typically converged.
+
+The webhook URL/namespace are fully templatable from values.yaml
+(.Values.externalSecretsWebhookGate.*) per docs/INVIOLABLE-PRINCIPLES.md
+#4 (never hardcode); operator overlays may flip
+.Values.externalSecretsWebhookGate.enabled=false on Sovereigns where
+external-secrets is provisioned via a different distribution.
+
+RBAC
+----
+NONE required: the probe target is a Service ClusterIP at a known DNS
+name + port; the curl Pod does not call the k8s API. No ServiceAccount
+override needed; the Job uses the namespace's `default` SA with no
+explicit Role/RoleBinding (curl-only Pod, no kubectl).
+
+CAPABILITY GATE
+---------------
+Skip the entire Job if the chart isn't going to render an ExternalSecret
+in the first place. The condition mirrors templates/external-secret.yaml's
+gate (catalystIntegration.enabled AND externalSecret.enabled AND CRD
+present); without it, a Sovereign that disables catalystIntegration would
+pay the 60s probe budget for nothing.
+*/ -}}
+{{- $ci := .Values.catalystIntegration | default dict -}}
+{{- $es := $ci.externalSecret | default dict -}}
+{{- $gate := .Values.externalSecretsWebhookGate | default dict -}}
+{{- if and (default true $gate.enabled) $ci.enabled $es.enabled (.Capabilities.APIVersions.Has "external-secrets.io/v1beta1") -}}
+{{- $ns := .Release.Namespace }}
+{{- $svc := $gate.service | default "external-secrets-webhook" -}}
+{{- $svcNs := $gate.namespace | default "external-secrets-system" -}}
+{{- $port := $gate.port | default 443 -}}
+{{- $path := $gate.path | default "/validate-external-secrets-io-v1beta1-externalsecret" -}}
+{{- $timeoutSeconds := $gate.timeoutSeconds | default 60 -}}
+{{- $intervalSeconds := $gate.intervalSeconds | default 2 -}}
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: newapi-eswebhook-gate
+  namespace: {{ $ns }}
+  labels:
+    catalyst.openova.io/blueprint: bp-newapi
+    catalyst.openova.io/component: external-secrets-webhook-gate
+  annotations:
+    "helm.sh/hook": "pre-install,pre-upgrade"
+    "helm.sh/hook-weight": "-20"
+    "helm.sh/hook-delete-policy": "before-hook-creation"
+{{- with .Values.imagePullSecrets }}
+imagePullSecrets:
+  {{- toYaml . | nindent 2 }}
+{{- end }}
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: newapi-eswebhook-gate
+  namespace: {{ $ns }}
+  labels:
+    catalyst.openova.io/blueprint: bp-newapi
+    catalyst.openova.io/component: external-secrets-webhook-gate
+  annotations:
+    "helm.sh/hook": "pre-install,pre-upgrade"
+    "helm.sh/hook-weight": "-10"
+    "helm.sh/hook-delete-policy": "before-hook-creation,hook-succeeded"
+spec:
+  backoffLimit: 3
+  ttlSecondsAfterFinished: 300
+  template:
+    metadata:
+      labels:
+        catalyst.openova.io/blueprint: bp-newapi
+        catalyst.openova.io/component: external-secrets-webhook-gate
+    spec:
+      serviceAccountName: newapi-eswebhook-gate
+      restartPolicy: Never
+      {{- with .Values.imagePullSecrets }}
+      imagePullSecrets:
+        {{- toYaml . | nindent 8 }}
+      {{- end }}
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 65534
+        runAsGroup: 65534
+        seccompProfile:
+          type: RuntimeDefault
+      containers:
+        - name: probe
+          image: curlimages/curl:8.10.1
+          imagePullPolicy: IfNotPresent
+          env:
+            - name: WEBHOOK_URL
+              value: {{ printf "https://%s.%s.svc:%v%s" $svc $svcNs $port $path | quote }}
+            - name: TIMEOUT_SECONDS
+              value: {{ $timeoutSeconds | quote }}
+            - name: INTERVAL_SECONDS
+              value: {{ $intervalSeconds | quote }}
+          command:
+            - /bin/sh
+            - -c
+            - |
+              set -eu
+              echo "[eswebhook-gate] target: ${WEBHOOK_URL}"
+              echo "[eswebhook-gate] budget: ${TIMEOUT_SECONDS}s, interval: ${INTERVAL_SECONDS}s"
+
+              deadline=$(( $(date +%s) + ${TIMEOUT_SECONDS} ))
+              attempt=0
+              while [ "$(date +%s)" -lt "${deadline}" ]; do
+                attempt=$(( attempt + 1 ))
+                # POST a minimal AdmissionReview-shaped probe. The webhook
+                # rejects malformed bodies with HTTP 400 — that is PROOF the
+                # validator code is alive AND the EndpointSlice is reachable.
+                # 503 / connect-error means no endpoints yet (Service has 0
+                # endpoint addresses) — keep retrying.
+                code=$(curl -ksS -o /tmp/probe.body -w '%{http_code}' \
+                  --max-time 5 \
+                  -H 'Content-Type: application/json' \
+                  -X POST \
+                  --data '{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","request":{"uid":"probe","kind":{"group":"external-secrets.io","version":"v1beta1","kind":"ExternalSecret"}}}' \
+                  "${WEBHOOK_URL}" 2>/tmp/probe.err || echo "000")
+                case "${code}" in
+                  200|400|405|415|422)
+                    echo "[eswebhook-gate] webhook reachable (HTTP ${code}) after ${attempt} attempt(s) — proceeding"
+                    exit 0
+                    ;;
+                  503|000)
+                    # 503 = no endpoints; 000 = curl connect-error (DNS
+                    # resolves but no endpoint, or TLS handshake aborted
+                    # because Pod isn't actually ready yet).
+                    echo "[eswebhook-gate] attempt ${attempt}: HTTP ${code} (not ready) — sleeping ${INTERVAL_SECONDS}s"
+                    sleep "${INTERVAL_SECONDS}"
+                    ;;
+                  *)
+                    echo "[eswebhook-gate] attempt ${attempt}: HTTP ${code} (unexpected, treating as not-ready)"
+                    cat /tmp/probe.body 2>/dev/null || true
+                    sleep "${INTERVAL_SECONDS}"
+                    ;;
+                esac
+              done
+              echo "[eswebhook-gate] FATAL: webhook ${WEBHOOK_URL} did not become reachable within ${TIMEOUT_SECONDS}s (last code: ${code:-unknown})" >&2
+              cat /tmp/probe.body 2>/dev/null || true
+              cat /tmp/probe.err 2>/dev/null || true
+              exit 1
+          resources:
+            requests:
+              cpu: 10m
+              memory: 32Mi
+            limits:
+              cpu: 100m
+              memory: 64Mi
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop: ["ALL"]
+          volumeMounts:
+            - name: tmp
+              mountPath: /tmp
+      volumes:
+        - name: tmp
+          emptyDir:
+            medium: Memory
+            sizeLimit: 8Mi
+{{- end }}
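
The accept/retry decision in the probe loop can be exercised on its own, without a cluster. The following is a minimal sketch that lifts the `case` statement from the hook into a standalone function; the function name `classify_probe_code` and the sample list of codes are hypothetical and exist only for this illustration.

```sh
#!/bin/sh
set -eu

# Hypothetical helper mirroring the hook's case statement: any structured
# HTTP answer from the webhook counts as "ready", everything else as "retry".
classify_probe_code() {
  case "$1" in
    200|400|405|415|422) echo "ready" ;;
    # The hook separates 503|000 from the catch-all only for logging;
    # both branches sleep and retry, so they collapse into one here.
    *) echo "retry" ;;
  esac
}

# Walk a few representative codes through the classifier.
for code in 200 400 422 503 000 000000 404; do
  printf '%s -> %s\n' "${code}" "$(classify_probe_code "${code}")"
done
```

The `000000` sample is deliberate. When curl fails to connect it still prints `000` for `%{http_code}`, and the hook's `|| echo "000"` fallback may then append a second `000` to the captured value. That string would land in the catch-all branch rather than `503|000`, which changes the log line but not the retry behaviour.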
@@ -470,6 +470,48 @@ catalystIntegration:
     # `property` is the JSON field inside the OpenBao secret holding
     # the bearer token used by the Catalyst signup hook (ADR-0003 §3.2).
     property: "ADMIN_API_TOKEN"
+# ─── External-Secrets webhook readiness gate (Fix #138, prov #20) ────────
+# Pre-install/pre-upgrade hook that polls the external-secrets validating-
+# admission webhook until it serves a structured response (HTTP 200/400/
+# 405/415/422). Closes the race between bp-external-secrets reaching
+# HR Ready=True (Pods Ready) and the apiserver-side EndpointSlice for
+# the webhook Service being observable by the admission controller.
+#
+# Without this gate the chart's templates/external-secret.yaml apply
+# attempt fails with:
+# Internal error occurred: failed calling webhook
+# "validate.externalsecret.external-secrets.io": ... no endpoints
+# available for service "external-secrets-webhook"
+#
+# Per docs/INVIOLABLE-PRINCIPLES.md #4 every knob is operator-tunable
+# via per-Sovereign overlay (e.g. swap to a different webhook namespace
+# if external-secrets is provisioned via a non-Catalyst distribution).
+# Per #3 the gate is event-driven (probe-then-proceed), NOT a blanket
+# `helm install --timeout`.
+externalSecretsWebhookGate:
+  # Default ON. Set false ONLY on Sovereigns where external-secrets is
+  # provisioned outside the Catalyst bootstrap-kit and the operator has
+  # confirmed the webhook is already serving by the time bp-newapi
+  # installs (rare; recommended to leave true).
+  enabled: true
+  # Webhook Service name (default = upstream external-secrets chart's
+  # `<release>-webhook` Service when releaseName is `external-secrets`).
+  service: "external-secrets-webhook"
+  # Namespace the webhook Service lives in. Catalyst default per
+  # clusters/_template/bootstrap-kit/15-external-secrets.yaml.
+  namespace: "external-secrets-system"
+  # HTTPS port the webhook serves on (upstream chart default is 443).
+  port: 443
+  # Validation path for ExternalSecret resources. Sourced from upstream
+  # chart's templates/validatingwebhook.yaml — different from the
+  # ClusterSecretStore path (used by bp-external-secrets-stores in Fix
+  # #137); each consumer probes its own validation endpoint.
+  path: "/validate-external-secrets-io-v1beta1-externalsecret"
+  # Wall-clock budget. 60s comfortably exceeds the typical 5–15s
+  # EndpointSlice propagation window observed across fresh provisions.
+  timeoutSeconds: 60
+  # Probe interval inside the wall-clock budget.
+  intervalSeconds: 2
 # ─── Service ─────────────────────────────────────────────────────────────
 service:
   type: ClusterIP
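
The "30 attempts max" figure quoted for the defaults follows from `timeoutSeconds / intervalSeconds`, but it holds only when each curl call returns almost instantly. The sketch below works the arithmetic for both ends of the range. It assumes the loop shape used by the hook (check the deadline, probe, sleep) and a fixed cost per iteration; `max_attempts` is a hypothetical helper, not part of the chart.

```sh
#!/bin/sh
set -eu

# Hypothetical helper: number of probe attempts that can START inside a
# wall-clock budget when each loop iteration takes a fixed number of seconds.
# Attempts start at t = 0, p, 2p, ... while t < budget, i.e. ceil(budget / p).
max_attempts() {
  budget_seconds=$1
  seconds_per_attempt=$2
  echo $(( (budget_seconds + seconds_per_attempt - 1) / seconds_per_attempt ))
}

# Best case: curl answers immediately, so an iteration costs only the 2s sleep.
echo "best case:  $(max_attempts 60 2) attempts"

# Worst case: every curl runs into --max-time 5 and then sleeps 2s (7s total).
echo "worst case: $(max_attempts 60 7) attempts"
```

With the defaults this gives 30 attempts in the best case and 9 in the worst. An operator who raises `intervalSeconds` in an overlay may want to raise `timeoutSeconds` in proportion to keep a comparable number of probes.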