fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123, LE rate-limit bypass) (#1339)

Root cause (qa-loop iter-1 wedge, 2026-05-10):
  Issuance for *.omantel.biz hit Let's Encrypt production's limit of
  5 certificates per 168h for the same set of hostnames (retry after
  2026-05-11 22:08 UTC). Cilium-envoy could not get a wildcard cert ->
  console.omantel.biz TLS handshake failed -> iter-1 Test Executor
  could not run. Customer Sovereigns are unaffected (one cert per
  registered domain in their lifetime), but QA Sovereigns wipe +
  re-provision dozens of times in a session and exhaust the production
  ceiling within hours.

Fix (target-state, NOT workaround):
  - bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer
    (letsencrypt-dns01-staging-powerdns) alongside the existing
    production one. Same DNS-01 webhook config (same PowerDNS endpoint,
    same API key) -> only the ACME directory URL + account key differ.
    Both ClusterIssuers are real cert-manager resources; LE treats them
    as wholly independent issuers so a rate-limit hit on production
    does NOT block staging issuance.
  - bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool,
    default false). When true, sovereign-wildcard-certs.yaml renders
    Certificate(s) with issuerRef.name pointing at the staging issuer
    instead of production.
  - bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst,
    same passthrough pattern as QA_FIXTURES_ENABLED.
  - catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA
    Sovereigns (Request.QATestEnabled=true) so the per-Sovereign
    overlay flips both QA fixtures + staging certs from one wizard
    toggle.
  - tofu var wildcard_cert_use_staging propagates through main.tf
    into the cloudinit postBuild.substitute block on both primary +
    secondary regions.
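
  The single-toggle behaviour in the last two bullets can be sketched in
  Go. This is a minimal illustration, not the catalyst-api code itself:
  boolString, qaTfvars and validFlag are hypothetical names, and only the
  tfvar keys and the 'true'/'false' string contract are taken from this
  commit.

```go
package main

import "fmt"

// boolString renders a bool as the literal "true"/"false" strings that the
// tofu validation block and the envsubst passthrough expect.
func boolString(b bool) string {
	if b {
		return "true"
	}
	return "false"
}

// qaTfvars is a hypothetical helper: it derives every QA-coupled tfvar from
// the single wizard toggle (Request.QATestEnabled in the commit).
func qaTfvars(qaTestEnabled bool) map[string]string {
	v := boolString(qaTestEnabled)
	return map[string]string{
		"qa_fixtures_enabled":       v,
		"qa_test_session_enabled":   v,
		"wildcard_cert_use_staging": v,
	}
}

// validFlag mirrors the tofu validation rule: only the strings "true" and
// "false" are accepted.
func validFlag(s string) bool {
	return s == "true" || s == "false"
}

func main() {
	for _, qa := range []bool{false, true} {
		vars := qaTfvars(qa)
		fmt.Printf("QATestEnabled=%v -> wildcard_cert_use_staging=%q (valid=%v)\n",
			qa, vars["wildcard_cert_use_staging"], validFlag(vars["wildcard_cert_use_staging"]))
	}
}
```

  Deriving all three values from one bool keeps QA fixtures and staging
  certs from drifting apart on a QA Sovereign.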

Result:
  cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard
  cert in <2min (no production rate limit). curl -sk + Playwright
  (ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run
  within minutes of provision. Customer Sovereigns (QATestEnabled=
  false) keep getting real-trusted production certs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL
+ issuer name is values-overridable. Operators wiring a private
staging ACME (e.g. internal Smallstep CA) override via per-Sovereign
overlay without rebuilding any Blueprint. Staging is the documented
LE pattern (https://letsencrypt.org/docs/staging-environment/), not a
band-aid.

_None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_

Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-11 01:08:07 +04:00 committed by GitHub
parent 8a90fffa61
commit 90aa2767da
GPG Key ID: B5690EEEBB952194
13 changed files with 310 additions and 85 deletions

View File

@ -531,6 +531,27 @@ spec:
# zero-touch flow), cloud-init pre-renders this variable to a
# single-entry array derived from ${sovereign_fqdn}.
parentZones: ${PARENT_DOMAINS_YAML}
# ─── Wildcard cert issuer environment (Fix #123, LE rate-limit) ────
# Default-OFF (production LE issuer); flipped to true via envsubst
# WILDCARD_CERT_USE_STAGING=true on the per-Sovereign overlay for any
# Sovereign that should issue staging-LE certs instead of production.
# The qa-loop coordinator pairs this knob with QA_FIXTURES_ENABLED on
# QA Sovereigns (omantel.biz and qa.* pools) so the wipe + re-provision
# cadence never trips Let's Encrypt's 5-certs/168h production ceiling
# per registered domain. Customer Sovereigns leave this empty (=false)
# and get real-trusted production certs.
#
# Staging certs are signed by Fake LE Intermediate X1; browsers
# reject without an explicit exception, but `curl -sk` and Playwright
# (ignoreHTTPSErrors:true) accept them — sufficient for the qa-loop
# Test Executor's contract assertions.
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 every Sovereign may flip this
# independently; the chart values.yaml carries the staging issuer
# name (`letsencrypt-dns01-staging-powerdns`, shipped by
# bp-cert-manager-powerdns-webhook 1.1.0+) as an overridable default.
wildcardCert:
useStaging: ${WILDCARD_CERT_USE_STAGING:-false}
# ─── QA fixtures (qa-loop iter-6 Cluster-F + EPIC-6 iter-6) ────────
# Default-OFF on production; flipped to true via envsubst
# QA_FIXTURES_ENABLED=true on the per-Sovereign overlay for any

View File

@ -95,7 +95,7 @@ spec:
chart:
spec:
chart: bp-cert-manager-powerdns-webhook
version: 1.0.4
version: 1.1.0
sourceRef:
kind: HelmRepository
name: bp-cert-manager-powerdns-webhook

View File

@ -945,6 +945,18 @@ write_files:
QA_TEST_SESSION_ENABLED: "${qa_test_session_enabled}"
QA_FIXTURES_NAMESPACE: "${qa_fixtures_namespace}"
QA_ORGANIZATION: "${qa_organization}"
# Wildcard cert ACME directory selector (Fix #123 — qa-loop
# iter-1 LE rate-limit unblock). "true" makes
# bp-catalyst-platform 1.4.136+ render
# sovereign-wildcard-tls Certificate(s) against the staging
# ClusterIssuer (`letsencrypt-dns01-staging-powerdns`,
# shipped by bp-cert-manager-powerdns-webhook 1.1.0+) so the
# production 5/168h LE rate limit per registered domain is
# bypassed during high-cadence QA iteration. Catalyst-api
# auto-stamps "true" alongside QA_FIXTURES_ENABLED on QA
# Sovereigns; default "false" → real-trusted production
# certs on customer Sovereigns.
WILDCARD_CERT_USE_STAGING: "${wildcard_cert_use_staging}"
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization

View File

@ -306,12 +306,13 @@ locals {
sovereign_fqdn = var.sovereign_fqdn
sovereign_subdomain = var.sovereign_subdomain
marketplace_enabled = var.marketplace_enabled
qa_fixtures_enabled = var.qa_fixtures_enabled
qa_test_session_enabled = var.qa_test_session_enabled
qa_fixtures_namespace = var.qa_fixtures_namespace
qa_organization = var.qa_organization
cluster_mesh_name = var.cluster_mesh_name
cluster_mesh_id = var.cluster_mesh_id
qa_fixtures_enabled = var.qa_fixtures_enabled
qa_test_session_enabled = var.qa_test_session_enabled
qa_fixtures_namespace = var.qa_fixtures_namespace
qa_organization = var.qa_organization
wildcard_cert_use_staging = var.wildcard_cert_use_staging
cluster_mesh_name = var.cluster_mesh_name
cluster_mesh_id = var.cluster_mesh_id
# Multi-domain Sovereign (issue #827). When the wizard supplies an
# explicit parent-domain list, use it verbatim. Otherwise default to a
@ -745,10 +746,11 @@ locals {
sovereign_fqdn = var.sovereign_fqdn
sovereign_subdomain = var.sovereign_subdomain
marketplace_enabled = var.marketplace_enabled
qa_fixtures_enabled = var.qa_fixtures_enabled
qa_test_session_enabled = var.qa_test_session_enabled
qa_fixtures_namespace = var.qa_fixtures_namespace
qa_organization = var.qa_organization
qa_fixtures_enabled = var.qa_fixtures_enabled
qa_test_session_enabled = var.qa_test_session_enabled
qa_fixtures_namespace = var.qa_fixtures_namespace
qa_organization = var.qa_organization
wildcard_cert_use_staging = var.wildcard_cert_use_staging
# Per-secondary-region ClusterMesh anchors. id is incremented per
# peer index so each secondary region gets a unique slot in the
# mesh registry; primary region keeps var.cluster_mesh_id.

View File

@ -75,6 +75,34 @@ variable "qa_test_session_enabled" {
}
}
# Wildcard cert ACME directory selector (Fix #123: qa-loop iter-1 LE
# rate-limit unblock). When 'true', bp-catalyst-platform 1.4.136+
# sovereign-wildcard-tls Certificate(s) reference the staging
# ClusterIssuer (`letsencrypt-dns01-staging-powerdns`, shipped by
# bp-cert-manager-powerdns-webhook 1.1.0+) instead of production. The
# staging ACME directory has separate, generous rate limits, so the
# production 5-certs/168h ceiling per registered domain is wholly
# bypassed. Staging certs are signed by Fake LE Intermediate X1; browsers
# reject without an explicit exception, but `curl -sk` and Playwright
# (ignoreHTTPSErrors:true) accept them. Auto-set to match
# qa_fixtures_enabled on QA Sovereigns by catalyst-api; default 'false'
# here so a Sovereign that omits the QA flags entirely (production)
# cannot accidentally issue staging certs and break browser TLS.
# Threaded into bootstrap-kit Kustomization postBuild.substitute as
# WILDCARD_CERT_USE_STAGING; the chart reads it via
# `${WILDCARD_CERT_USE_STAGING:-false}`, so this var is the canonical
# operator-controlled seam per docs/INVIOLABLE-PRINCIPLES.md #4
# (never hardcode).
variable "wildcard_cert_use_staging" {
type = string
description = "When 'true', bp-catalyst-platform issues sovereign wildcard certs from the staging Let's Encrypt issuer (separate generous rate limits, certs signed by Fake LE Intermediate X1 — browsers reject, curl -sk + Playwright ignoreHTTPSErrors:true accept). Auto-set by catalyst-api to match qa_fixtures_enabled on QA Sovereigns. Default 'false' for customer Sovereigns."
default = "false"
validation {
condition = contains(["true", "false"], var.wildcard_cert_use_staging)
error_message = "wildcard_cert_use_staging must be the string 'true' or 'false'."
}
}
# qa-fixtures namespace + Organization names. Default to derivation-friendly
# fallbacks that survive when qa_fixtures_enabled='false' (the chart
# short-circuits before materialising them). When qa_fixtures_enabled='true',

View File

@ -1,6 +1,6 @@
apiVersion: v2
name: bp-cert-manager-powerdns-webhook
version: 1.0.4
version: 1.1.0
appVersion: "v2.5.5"
description: |
Catalyst-authored Blueprint chart wrapping the upstream

View File

@ -1,5 +1,5 @@
{{/*
ClusterIssuer paired with the PowerDNS webhook above.
ClusterIssuer(s) paired with the PowerDNS webhook above.
Skip-render pattern (lesson from #387 follow-up #402): when the operator
has not supplied a PowerDNS host (default-values render, e.g. CI smoke
@ -17,13 +17,80 @@ Either being absent means "this Sovereign isn't ready to issue via
PowerDNS yet" and the chart silently omits the resource. cert-manager's
admission would reject the issuer at apply time anyway if we tried to
emit it with an empty host, so failing soft here is strictly better.
──────────────────────────────────────────────────────────────────────────
Two ClusterIssuers — production AND staging (Fix #123, LE rate-limit)
──────────────────────────────────────────────────────────────────────────
This template renders TWO ClusterIssuers when both are enabled:
- letsencrypt-dns01-prod-powerdns (production LE — real-trusted
certs, 5/168h rate limit per
registered domain)
- letsencrypt-dns01-staging-powerdns (staging LE — separate generous
rate limits, certs are NOT
real-trusted but `curl -sk`
accepts them; intended for QA
Sovereigns and bring-up so the
production rate limit is never
tripped during iteration)
Both ClusterIssuers reuse the same DNS-01 webhook config (same PowerDNS
endpoint, same API key); they differ only in name, ACME directory URL,
and ACME account key Secret. cert-manager registers a separate ACME
account for each, and Let's Encrypt applies staging and production rate
limits independently, so a rate-limit hit on production does NOT block
staging issuance.
The active issuer for a given Sovereign is selected by chart consumers
(typically bp-catalyst-platform's wildcardCert.useStaging gate, which
flips `wildcardCert.issuerName` to the staging issuer's name when the
Sovereign is QA-flavoured) — this chart simply makes both available.
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), both ACME
directory URLs are operator-overridable via clusterIssuer.acmeServer
(production) and clusterIssuer.staging.acmeServer (staging). The
staging default is the canonical Let's Encrypt staging endpoint
(https://acme-staging-v02.api.letsencrypt.org/directory). A future
operator wiring a private ACME staging server overrides via the
per-Sovereign overlay without rebuilding this Blueprint.
The staging variant is gated by clusterIssuer.staging.enabled (default
true so every Sovereign has a staging fallback ready). Operators that
want production-only on a customer Sovereign flip
clusterIssuer.staging.enabled=false in the cluster overlay.
*/}}
{{- if and .Values.clusterIssuer.enabled .Values.powerdns.host }}
{{- $stagingCfg := .Values.clusterIssuer.staging | default dict }}
{{- $stagingEnabled := true }}
{{- if hasKey $stagingCfg "enabled" }}
{{- $stagingEnabled = $stagingCfg.enabled }}
{{- end }}
{{- /* Variants list drives a two-pass render so prod+staging stay
byte-identical except for `name`, ACME directory URL, and
privateKeySecretRefName. */}}
{{- $variants := list (dict
"name" .Values.clusterIssuer.name
"acmeServer" .Values.clusterIssuer.acmeServer
"privateKeyRef" .Values.clusterIssuer.privateKeySecretRefName
"issuerEnv" "production"
"render" true
) }}
{{- $variants = append $variants (dict
"name" ($stagingCfg.name | default "letsencrypt-dns01-staging-powerdns")
"acmeServer" ($stagingCfg.acmeServer | default "https://acme-staging-v02.api.letsencrypt.org/directory")
"privateKeyRef" ($stagingCfg.privateKeySecretRefName | default "letsencrypt-dns01-staging-powerdns-account-key")
"issuerEnv" "staging"
"render" $stagingEnabled
) }}
{{- range $i, $v := $variants }}
{{- if $v.render }}
{{- if $i }}
---
{{- end }}
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: {{ .Values.clusterIssuer.name }}
{{- if .Values.clusterIssuer.helmHookEnabled }}
name: {{ $v.name }}
{{- if $.Values.clusterIssuer.helmHookEnabled }}
annotations:
# Helm post-install/post-upgrade hook: ClusterIssuer is a
# cert-manager.io/v1 CRD that only exists AFTER the cert-manager
@ -37,48 +104,49 @@ metadata:
"helm.sh/hook-delete-policy": before-hook-creation
{{- end }}
labels:
{{- include "bp-cert-manager-powerdns-webhook.labels" . | nindent 4 }}
{{- include "bp-cert-manager-powerdns-webhook.labels" $ | nindent 4 }}
catalyst.openova.io/issuer-class: dns01
catalyst.openova.io/issuer-backend: powerdns
catalyst.openova.io/issuer-env: {{ $v.issuerEnv | quote }}
spec:
acme:
server: {{ .Values.clusterIssuer.acmeServer | quote }}
email: {{ .Values.clusterIssuer.email | quote }}
server: {{ $v.acmeServer | quote }}
email: {{ $.Values.clusterIssuer.email | quote }}
privateKeySecretRef:
name: {{ .Values.clusterIssuer.privateKeySecretRefName }}
name: {{ $v.privateKeyRef }}
solvers:
- dns01:
webhook:
groupName: {{ .Values.webhook.groupName | quote }}
solverName: {{ .Values.webhook.solverName | quote }}
groupName: {{ $.Values.webhook.groupName | quote }}
solverName: {{ $.Values.webhook.solverName | quote }}
config:
# Base URL of the per-Sovereign PowerDNS REST API. Operator
# MUST set this in the cluster overlay; see chart values
# comment for the in-cluster vs external routing patterns.
host: {{ .Values.powerdns.host | quote }}
serverID: {{ .Values.powerdns.serverID | quote }}
{{- with .Values.powerdns.apiKeyHeaderName }}
host: {{ $.Values.powerdns.host | quote }}
serverID: {{ $.Values.powerdns.serverID | quote }}
{{- with $.Values.powerdns.apiKeyHeaderName }}
apiKeyHeaderName: {{ . | quote }}
{{- end }}
{{- with .Values.powerdns.apiKeyScheme }}
{{- with $.Values.powerdns.apiKeyScheme }}
apiKeyScheme: {{ . | quote }}
{{- end }}
ttl: {{ .Values.powerdns.ttl }}
{{- with .Values.powerdns.caBundle }}
ttl: {{ $.Values.powerdns.ttl }}
{{- with $.Values.powerdns.caBundle }}
# PEM bundle MUST be base64-encoded for the JSON wire
# format used by cert-manager's ChallengeRequest config
# decoder. Operators paste the raw PEM into values; we
# b64encode here so the cluster overlay never has to.
caBundle: {{ . | b64enc | quote }}
{{- end }}
{{- with .Values.powerdns.headers }}
{{- with $.Values.powerdns.headers }}
headers:
{{- toYaml . | nindent 16 }}
{{- end }}
apiKeySecretRef:
name: {{ .Values.powerdns.apiKeySecretRef.name | quote }}
key: {{ .Values.powerdns.apiKeySecretRef.key | quote }}
{{- with .Values.powerdns.apiKeySecretRef.namespace }}
name: {{ $.Values.powerdns.apiKeySecretRef.name | quote }}
key: {{ $.Values.powerdns.apiKeySecretRef.key | quote }}
{{- with $.Values.powerdns.apiKeySecretRef.namespace }}
# Cluster-scoped namespace — only set if the operator
# places the API-key secret outside the cert-manager
# namespace. cert-manager's webhook framework uses
@ -88,3 +156,5 @@ spec:
namespace: {{ . | quote }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
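
The two-pass render in the template above boils down to the logic sketched here in Go. This is a hedged illustration under stated assumptions, not part of the chart: StagingConfig, Variant, resolveVariants and orDefault are hypothetical names, while the default strings are copied from the template. Enabled is modelled as a pointer so that "key absent" can be told apart from an explicit false, which is the job `hasKey` does in the template.

```go
package main

import "fmt"

// StagingConfig models the clusterIssuer.staging values block.
// Enabled is nil when the key is absent from values.
type StagingConfig struct {
	Enabled                 *bool
	Name                    string
	ACMEServer              string
	PrivateKeySecretRefName string
}

// Variant is one ClusterIssuer the template would emit.
type Variant struct {
	Name          string
	ACMEServer    string
	PrivateKeyRef string
	IssuerEnv     string
}

// orDefault returns def when v is empty, like the template's `default`.
func orDefault(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

// resolveVariants returns the issuers to render, production first.
// Staging is rendered unless it is explicitly disabled.
func resolveVariants(prod Variant, staging StagingConfig) []Variant {
	out := []Variant{prod}
	if staging.Enabled != nil && !*staging.Enabled {
		return out
	}
	return append(out, Variant{
		Name:          orDefault(staging.Name, "letsencrypt-dns01-staging-powerdns"),
		ACMEServer:    orDefault(staging.ACMEServer, "https://acme-staging-v02.api.letsencrypt.org/directory"),
		PrivateKeyRef: orDefault(staging.PrivateKeySecretRefName, "letsencrypt-dns01-staging-powerdns-account-key"),
		IssuerEnv:     "staging",
	})
}

func main() {
	prod := Variant{
		Name:          "letsencrypt-dns01-prod-powerdns",
		ACMEServer:    "https://acme-v02.api.letsencrypt.org/directory",
		PrivateKeyRef: "letsencrypt-dns01-prod-powerdns-account-key",
		IssuerEnv:     "production",
	}
	for _, v := range resolveVariants(prod, StagingConfig{}) {
		fmt.Printf("%s (%s) -> %s\n", v.Name, v.IssuerEnv, v.ACMEServer)
	}
}
```

The explicit key check matters because the template function `default` treats a boolean false as empty: writing `$stagingCfg.enabled | default true` would turn an operator's `enabled: false` back into true.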

View File

@ -211,9 +211,15 @@ clusterIssuer:
name: letsencrypt-dns01-prod-powerdns
# ACME account email used for renewal notifications.
email: ops@openova.io
# Production Let's Encrypt directory. Set to the staging URL during
# bring-up:
# https://acme-staging-v02.api.letsencrypt.org/directory
# Production Let's Encrypt directory. Per docs/INVIOLABLE-PRINCIPLES.md
# #4 (never hardcode), this is operator-overridable from the cluster
# overlay. The staging endpoint
# (https://acme-staging-v02.api.letsencrypt.org/directory) is now
# exposed as a SECOND ClusterIssuer (see clusterIssuer.staging below)
# so production-vs-staging is no longer a one-or-the-other knob —
# both issuers live side by side and chart consumers (e.g.
# bp-catalyst-platform's wildcardCert.useStaging) flip the cert's
# issuerRef to pick.
acmeServer: https://acme-v02.api.letsencrypt.org/directory
# Name of the Secret cert-manager uses to store the ACME account key.
privateKeySecretRefName: letsencrypt-dns01-prod-powerdns-account-key
@ -222,6 +228,33 @@ clusterIssuer:
# post-install hook pattern as bp-cert-manager's own ClusterIssuers.
helmHookEnabled: true
# ─── Staging (Let's Encrypt staging) ClusterIssuer (Fix #123) ────────
# Renders a SECOND ClusterIssuer alongside the production one above,
# pointing at LE's staging ACME directory. Same DNS-01 webhook config,
# same PowerDNS endpoint, same API key — only the ACME account +
# directory URL differ. Staging chains to LE's untrusted staging
# hierarchy (named "Fake LE Intermediate X1" in older chains), so
# browsers reject the cert without an explicit exception, but
# `curl -sk` and Playwright's `ignoreHTTPSErrors: true` accept it.
# Staging rate limits are far higher than production's (LE documents
# a staging duplicate-certificate limit of 30,000 per week), which
# makes staging the safe default for QA Sovereigns whose iteration
# cadence would exhaust LE production's 5/168h ceiling within hours.
#
# Default `enabled: true` so every Sovereign that has the production
# issuer ALSO has staging available. Operators that want production-
# only on a customer Sovereign flip `staging.enabled: false` in the
# per-cluster overlay. The `name` is distinct from production so the
# two ClusterIssuers never collide.
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 the ACME URL is overridable —
# an operator wiring a private staging ACME (e.g. an internal Smallstep
# CA) sets `staging.acmeServer` from the per-cluster overlay without
# rebuilding this Blueprint.
staging:
enabled: true
name: letsencrypt-dns01-staging-powerdns
acmeServer: https://acme-staging-v02.api.letsencrypt.org/directory
privateKeySecretRefName: letsencrypt-dns01-staging-powerdns-account-key
# ─── Service + APIService ────────────────────────────────────────────────
service:
# ClusterIP — cert-manager calls the webhook via the kube-apiserver's

View File

@ -1205,6 +1205,20 @@ func writeTfvars(deployDir string, req Request) error {
"qa_fixtures_enabled": map[bool]string{true: "true", false: "false"}[req.QATestEnabled],
"qa_test_session_enabled": map[bool]string{true: "true", false: "false"}[req.QATestEnabled],
// Wildcard cert staging-LE selector (Fix #123 — qa-loop iter-1 LE
// rate-limit unblock). When QATestEnabled=true the per-Sovereign
// overlay sets WILDCARD_CERT_USE_STAGING=true → bp-catalyst-platform
// 1.4.136+ renders the sovereign-wildcard-tls Certificate(s) with
// `issuerRef.name: letsencrypt-dns01-staging-powerdns` instead of
// the production issuer. Staging hits LE's separate ACME directory
// with generous rate limits, so the wipe + re-provision cadence
// of QA Sovereigns no longer trips production's 5-certs/168h
// ceiling per registered domain. Customer Sovereigns
// (QATestEnabled=false) provision real-trusted production certs.
// Stringified for the same envsubst-passthrough reason as
// qa_fixtures_enabled.
"wildcard_cert_use_staging": map[bool]string{true: "true", false: "false"}[req.QATestEnabled],
// QA namespace + Organization names — derived from the Sovereign
// FQDN's first label at provision time per principle #4 (never
// hardcode). The chart's defaults (qa-omantel / omantel-platform)

View File

@ -598,6 +598,12 @@ func TestWriteTfvars_QAFixtures_DefaultDisabled(t *testing.T) {
if v, _ := parsed["qa_test_session_enabled"].(string); v != "false" {
t.Fatalf("qa_test_session_enabled MUST default 'false' on customer Sovereigns, got %q", v)
}
// Fix #123 — wildcard_cert_use_staging MUST default 'false' so a
// customer Sovereign issues real-trusted production LE certs (not
// Fake-LE-Intermediate-X1 staging certs that browsers reject).
if v, _ := parsed["wildcard_cert_use_staging"].(string); v != "false" {
t.Fatalf("wildcard_cert_use_staging MUST default 'false' on customer Sovereigns (real-trusted production certs), got %q", v)
}
}
// TestWriteTfvars_QAFixtures_EnabledDerivesNamespaceAndOrg proves that when
@ -659,6 +665,14 @@ func TestWriteTfvars_QAFixtures_EnabledDerivesNamespaceAndOrg(t *testing.T) {
if v, _ := parsed["qa_test_session_enabled"].(string); v != "true" {
t.Errorf("qa_test_session_enabled: got %q want \"true\"", v)
}
// Fix #123 — wildcard_cert_use_staging auto-flips 'true' on QA
// Sovereigns so the Sovereign issues from LE staging (separate
// generous rate limits) instead of production. Without this the
// wipe + re-provision cadence of QA Sovereigns trips the
// production 5/168h ceiling within hours.
if v, _ := parsed["wildcard_cert_use_staging"].(string); v != "true" {
t.Errorf("wildcard_cert_use_staging: got %q want \"true\" (QA Sovereigns MUST issue staging certs to bypass LE production rate limit)", v)
}
if v, _ := parsed["qa_fixtures_namespace"].(string); v != tc.wantNs {
t.Errorf("qa_fixtures_namespace: got %q want %q (derived from FQDN first label)", v, tc.wantNs)
}

View File

@ -1,58 +1,38 @@
apiVersion: v2
name: bp-catalyst-platform
# 1.4.136 (qa-loop iter-1 Fix #124, secondary Fix #122 — convert
# catalyst-gitea-token bootstrap to pre-install hook so consumers
# (catalog + organization-controller + api) see a populated token at
# their first container start):
# 1.4.136 (qa-loop bounded-provision-cycle Fix #123, LE rate-limit
# bypass via staging ClusterIssuer for QA Sovereigns):
#
# Root cause (qa-loop iter-1 monitor Fix #122 surfaced 2026-05-10):
# On every fresh Sovereign install of bp-catalyst-platform 1.4.135
# the `catalyst-catalog` and `catalyst-organization-controller`
# Pods enter CrashLoopBackOff with:
# {"level":"ERROR","msg":"config load failed",
# "err":"config: CATALYST_GITEA_TOKEN is required"}
# even though the Secret `catalyst-system/catalyst-gitea-token`
# exists. Inspection shows `data.token: ""` — the Secret was
# created empty by the chart's lookup-existing-target idempotency
# path (lookup returns nil on a fresh install → token bytes are
# empty), and the post-install mint Job that was supposed to
# populate it ran AFTER the Deployments had already crashed and
# accumulated exponential back-off windows. Helm's 15m install
# timeout lapsed before the Pods could pick up the patched token,
# triggering uninstall-remediation → reinstall → loop.
# Root cause (iter-1 wedge, 2026-05-10):
# Let's Encrypt production hit the 5-certs/168h rate limit on
# `*.omantel.biz` (retry after 2026-05-11 22:08 UTC). Cilium-envoy
# could not get a wildcard cert → console.omantel.biz TLS handshake
# failed → iter-1 Test Executor could not run. Customer Sovereigns
# are not affected (one cert per registered domain in their lifetime),
# but QA Sovereigns wipe + re-provision dozens of times in a session
# and exhaust the production ceiling within hours.
#
# This is the chicken-and-egg ordering hazard documented in
# docs/INVIOLABLE-PRINCIPLES.md #1 (waterfall, not iterative MVP):
# credential bootstrap MUST land before the consumers it serves.
# Fix:
# - bp-cert-manager-powerdns-webhook 1.1.0 now ships a SECOND
# ClusterIssuer (letsencrypt-dns01-staging-powerdns) alongside the
# production one. Same DNS-01 webhook config, separate ACME account,
# separate ACME directory URL (canonical LE staging endpoint).
# Production rate limit is wholly independent of staging.
# - This chart adds `wildcardCert.useStaging` (bool, default false).
# When true, sovereign-wildcard-certs.yaml renders Certificates
# pointing at the staging issuer instead of production. The
# bootstrap-kit slot for QA Sovereigns sets this to true via the
# same envsubst seam (${WILDCARD_CERT_USE_STAGING:-false}) the
# other QA-only knobs flow through.
# - cilium-envoy then gets a staging-signed wildcard cert in <2 min.
# `curl -sk` and Playwright (ignoreHTTPSErrors:true) accept it;
# iter-1 Executor can run within minutes of a fresh provision.
#
# Fix (chart-side, no app-code change required):
# Move the entire token-bootstrap flow (Secret, ServiceAccount,
# Roles, RoleBinding, Job) from `helm.sh/hook: post-install,
# post-upgrade` to `helm.sh/hook: pre-install,pre-upgrade`. The
# Secret is created at hook-weight=5; the mint Job at hook-weight=
# 10. Helm runs the entire pre-install hook chain to completion
# BEFORE applying any regular release resource. Result: when the
# catalog / organization-controller Deployments are applied, the
# Secret already carries a real PAT, the kubelet mounts it as
# CATALYST_GITEA_TOKEN, and the Pods start cleanly on first try.
#
# Defensive alignment in services/catalog/deployment.yaml: add
# `optional: true` to the secretKeyRef so the wiring matches the
# existing api-deployment + organization-controller convention
# (cosmetic — the Secret always exists in the pre-install path,
# but `optional: true` keeps kubelet from blocking Pod start
# should any future reordering regress this).
#
# Lookup contract preserved: on upgrades, `lookup` returns the
# existing Secret with the populated token, the template re-emits
# the same bytes, and the mint Job's runtime check (`EXISTING_TOKEN
# != ""`) short-circuits with exit 0. helm.sh/resource-policy: keep
# is retained on the Secret so it survives helm uninstalls.
#
# Per principle 4 / feedback_inviolable_principles.md #1: target
# state, not MVP. The pre-install hook IS the canonical seam for
# Sovereign credential bootstrap (mirrors bp-keycloak's keycloak-
# config-cli pre-install pattern).
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the issuer
# name is fully values-overridable — operators that wire a private
# staging ACME (e.g. internal Smallstep CA) override the issuer
# alongside the bp-cert-manager-powerdns-webhook staging URL without
# touching this chart.
#
# 1.4.135 (qa-loop bounded-provision-cycle Fix #119, sanitize illegal
# `/` in qa-fixtures Continuum mirror label value — unblocks prov #11):
@ -978,7 +958,7 @@ name: bp-catalyst-platform
# - values.yaml: new knobs `cnpgPairAliasName`,
# `cnpgPairPostSwitchoverPrimary`, `continuumPlatformNamespace` —
# all values-overridable per INVIOLABLE-PRINCIPLES #4.
version: 1.4.136
version: 1.4.137
appVersion: 1.4.94
# 1.4.129 (qa-loop iter-16 Fix #65): ship the missing
# `openova-catalog` Flux v1 HelmRepository in flux-system. The

View File

@ -52,7 +52,29 @@ Resource naming:
*/}}
{{- if .Values.wildcardCert.enabled }}
{{- $ns := .Values.wildcardCert.namespace | default "kube-system" }}
{{/*
Issuer selection (Fix #123, LE rate-limit bypass for QA Sovereigns):
- .Values.wildcardCert.useStaging=true → staging issuer (default
`letsencrypt-dns01-staging-powerdns`, shipped by
bp-cert-manager-powerdns-webhook 1.1.0+ alongside the production
issuer). Hits LE's staging ACME endpoint
(https://acme-staging-v02.api.letsencrypt.org/directory). Cert is
signed by Fake LE Intermediate X1 so browsers reject without an
explicit exception, but `curl -sk` and Playwright
(ignoreHTTPSErrors:true) accept it. Production rate limit (5
certs/168h per registered domain) does NOT apply to staging.
- .Values.wildcardCert.useStaging=false → production issuer (default
`letsencrypt-dns01-prod-powerdns`). Real-trusted certs.
Default false on the chart; the bootstrap-kit slot for QA Sovereigns
flips this to true via ${WILDCARD_CERT_USE_STAGING:-false} envsubst.
Per docs/INVIOLABLE-PRINCIPLES.md #4 every issuer name is values-
overridable (e.g. private ACME).
*/}}
{{- $issuer := .Values.wildcardCert.issuerName | default "letsencrypt-dns01-prod-powerdns" }}
{{- if .Values.wildcardCert.useStaging }}
{{- $issuer = .Values.wildcardCert.issuerNameStaging | default "letsencrypt-dns01-staging-powerdns" }}
{{- end }}
{{- $duration := .Values.wildcardCert.duration }}
{{- $renewBefore := .Values.wildcardCert.renewBefore }}
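
The issuer selection in the hunk above can be read as the following Go sketch. It is an illustration only: WildcardCert and selectIssuer are hypothetical names, and the two fallback issuer names are the defaults quoted in the template.

```go
package main

import "fmt"

// WildcardCert models the wildcardCert values consulted by the template.
type WildcardCert struct {
	UseStaging        bool
	IssuerName        string
	IssuerNameStaging string
}

// selectIssuer returns the ClusterIssuer name a Certificate's issuerRef
// would point at: the staging issuer when UseStaging is set, otherwise the
// production issuer, each falling back to its canonical default when empty.
func selectIssuer(w WildcardCert) string {
	if w.UseStaging {
		if w.IssuerNameStaging != "" {
			return w.IssuerNameStaging
		}
		return "letsencrypt-dns01-staging-powerdns"
	}
	if w.IssuerName != "" {
		return w.IssuerName
	}
	return "letsencrypt-dns01-prod-powerdns"
}

func main() {
	fmt.Println(selectIssuer(WildcardCert{}))
	fmt.Println(selectIssuer(WildcardCert{UseStaging: true}))
	fmt.Println(selectIssuer(WildcardCert{UseStaging: true, IssuerNameStaging: "private-acme-staging"}))
}
```

One caveat worth checking when wiring this: a template `if` treats any non-empty string as true, so the gate only works as intended when useStaging reaches the chart as a YAML boolean (as the unquoted envsubst placeholder produces) rather than as the quoted string "false".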

View File

@ -135,6 +135,35 @@ wildcardCert:
# override to a per-cluster issuer (e.g. a private ACME) via
# cluster overlay.
issuerName: letsencrypt-dns01-prod-powerdns
# ─── Let's Encrypt staging fallback (Fix #123) ─────────────────────
# When `useStaging: true`, the rendered Certificate(s) reference the
# staging issuer (`issuerNameStaging`, default
# `letsencrypt-dns01-staging-powerdns` shipped by
# bp-cert-manager-powerdns-webhook 1.1.0+) instead of `issuerName`.
# The staging issuer hits Let's Encrypt's staging ACME directory
# (https://acme-staging-v02.api.letsencrypt.org/directory), which
# has separate, generous rate limits — the production 5-certs/168h
# ceiling per registered domain is wholly bypassed. The cert is
# signed by Fake LE Intermediate X1 so browsers reject without an
# explicit exception, but `curl -sk` and Playwright
# (ignoreHTTPSErrors:true) accept it. Intended for QA Sovereigns
# whose wipe + re-provision cadence would otherwise exhaust LE
# production within hours.
#
# Default false — customer Sovereigns issue real-trusted production
# certs. The bootstrap-kit slot 13-bp-catalyst-platform.yaml flips
# this to true on QA Sovereigns via the
# ${WILDCARD_CERT_USE_STAGING:-false} envsubst seam (same pattern
# as ${QA_FIXTURES_ENABLED:-false}). Per
# docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every Sovereign
# may flip this independently from a per-cluster overlay.
useStaging: false
# Name of the staging ClusterIssuer. Defaults to the canonical name
# shipped by bp-cert-manager-powerdns-webhook 1.1.0+. Operators that
# wire a private staging ACME (e.g. internal Smallstep CA) override
# both this and the bp-cert-manager-powerdns-webhook staging block
# via the per-cluster overlay.
issuerNameStaging: letsencrypt-dns01-staging-powerdns
# Cert renew window. cert-manager defaults are conservative; we
# match the per-Sovereign cilium-gateway-cert.yaml legacy values.
duration: "" # empty = cert-manager default (90d for LE)