fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123, LE rate-limit bypass) (#1339)
Root cause (qa-loop iter-1 wedge, 2026-05-10):
Let's Encrypt production hit the 5-certs/168h rate limit on
*.omantel.biz (retry after 2026-05-11 22:08 UTC). Cilium-envoy
could not get a wildcard cert -> console.omantel.biz TLS handshake
failed -> iter-1 Test Executor could not run. Customer Sovereigns
are unaffected (one cert per registered domain in their lifetime),
but QA Sovereigns wipe + re-provision dozens of times in a session
and exhaust the production ceiling within hours.
Fix (target-state, NOT workaround):
- bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer
(letsencrypt-dns01-staging-powerdns) alongside the existing
production one. Same DNS-01 webhook config (same PowerDNS endpoint,
same API key) -> only the ACME directory URL + account key differ.
Both ClusterIssuers are real cert-manager resources; LE treats them
as wholly independent issuers so a rate-limit hit on production
does NOT block staging issuance.
- bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool,
default false). When true, sovereign-wildcard-certs.yaml renders
Certificate(s) with issuerRef.name pointing at the staging issuer
instead of production.
- bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst,
same passthrough pattern as QA_FIXTURES_ENABLED.
- catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA
Sovereigns (Request.QATestEnabled=true) so the per-Sovereign
overlay flips both QA fixtures + staging certs from one wizard
toggle.
- tofu var wildcard_cert_use_staging propagates through main.tf
into the cloudinit postBuild.substitute block on both primary +
secondary regions.
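The flag's path from the wizard toggle to `issuerRef.name` can be sketched as below. `tfvarBool` and `selectIssuer` are hypothetical helper names that paraphrase the catalyst-api stringification and the chart's selection gate in this diff; the real gate is a Helm template, not Go.

```go
package main

import "fmt"

// tfvarBool stringifies a bool the same way the tfvars writer in this diff
// does, so the value survives the envsubst passthrough as "true"/"false".
func tfvarBool(b bool) string {
	return map[bool]string{true: "true", false: "false"}[b]
}

// selectIssuer paraphrases the chart gate: start from the production issuer
// name, then switch to the staging name when useStaging is set. Empty
// overrides fall back to the default names shipped in the chart values.
func selectIssuer(useStaging bool, issuerName, issuerNameStaging string) string {
	issuer := issuerName
	if issuer == "" {
		issuer = "letsencrypt-dns01-prod-powerdns"
	}
	if useStaging {
		issuer = issuerNameStaging
		if issuer == "" {
			issuer = "letsencrypt-dns01-staging-powerdns"
		}
	}
	return issuer
}

func main() {
	for _, qaTestEnabled := range []bool{false, true} {
		flag := tfvarBool(qaTestEnabled)
		issuer := selectIssuer(flag == "true", "", "")
		fmt.Printf("QATestEnabled=%t wildcard_cert_use_staging=%q issuerRef.name=%s\n",
			qaTestEnabled, flag, issuer)
	}
}
```

Keeping the flag a string end to end matches the `contains(["true", "false"], ...)` validation on the tofu variable.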
Result:
cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard
cert in <2min (no production rate limit). curl -sk + Playwright
(ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run
within minutes of provision. Customer Sovereigns (QATestEnabled=
false) keep getting real-trusted production certs.
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL
+ issuer name is values-overridable. Operators wiring a private
staging ACME (e.g. internal Smallstep CA) override via per-Sovereign
overlay without rebuilding any Blueprint. Staging is the documented
LE pattern (https://letsencrypt.org/docs/staging-environment/), not a
band-aid.
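A minimal Go sketch of the override behaviour, assuming the same fall-back-to-default semantics the template applies to the `clusterIssuer.staging` block. `StagingConfig`, `orDefault`, `resolveStaging`, and the internal CA URL are hypothetical illustrations; only the three default strings come from this diff.

```go
package main

import "fmt"

// StagingConfig is a hypothetical stand-in for the clusterIssuer.staging
// values block; empty fields mean "not overridden by the overlay".
type StagingConfig struct {
	Name                    string
	ACMEServer              string
	PrivateKeySecretRefName string
}

// orDefault returns v unless it is empty, in which case def is used.
func orDefault(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

// resolveStaging applies the chart defaults to any field the overlay left unset.
func resolveStaging(overlay StagingConfig) StagingConfig {
	return StagingConfig{
		Name:                    orDefault(overlay.Name, "letsencrypt-dns01-staging-powerdns"),
		ACMEServer:              orDefault(overlay.ACMEServer, "https://acme-staging-v02.api.letsencrypt.org/directory"),
		PrivateKeySecretRefName: orDefault(overlay.PrivateKeySecretRefName, "letsencrypt-dns01-staging-powerdns-account-key"),
	}
}

func main() {
	// No overlay: every field resolves to the Let's Encrypt staging defaults.
	fmt.Printf("%+v\n", resolveStaging(StagingConfig{}))

	// Hypothetical private ACME endpoint supplied by a per-Sovereign overlay.
	private := StagingConfig{ACMEServer: "https://ca.internal.example/acme/directory"}
	fmt.Printf("%+v\n", resolveStaging(private))
}
```

Only the overridden field changes, so an overlay can repoint the ACME directory while keeping the issuer name that `wildcardCert.issuerNameStaging` expects.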
_None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_
Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 8a90fffa61
commit 90aa2767da
@@ -531,6 +531,27 @@ spec:
 # zero-touch flow), cloud-init pre-renders this variable to a
 # single-entry array derived from ${sovereign_fqdn}.
 parentZones: ${PARENT_DOMAINS_YAML}
+# ─── Wildcard cert issuer environment (Fix #123, LE rate-limit) ────
+# Default-OFF (production LE issuer); flipped to true via envsubst
+# WILDCARD_CERT_USE_STAGING=true on the per-Sovereign overlay for any
+# Sovereign that should issue staging-LE certs instead of production.
+# The qa-loop coordinator pairs this knob with QA_FIXTURES_ENABLED on
+# QA Sovereigns (omantel.biz and qa.* pools) so the wipe + re-provision
+# cadence never trips Let's Encrypt's 5-certs/168h production ceiling
+# per registered domain. Customer Sovereigns leave this empty (=false)
+# and get real-trusted production certs.
+#
+# Staging certs are signed by Fake LE Intermediate X1; browsers
+# reject without an explicit exception, but `curl -sk` and Playwright
+# (ignoreHTTPSErrors:true) accept them — sufficient for the qa-loop
+# Test Executor's contract assertions.
+#
+# Per docs/INVIOLABLE-PRINCIPLES.md #4 every Sovereign may flip this
+# independently; the chart values.yaml carries the staging issuer
+# name (`letsencrypt-dns01-staging-powerdns`, shipped by
+# bp-cert-manager-powerdns-webhook 1.1.0+) as an overridable default.
+wildcardCert:
+useStaging: ${WILDCARD_CERT_USE_STAGING:-false}
 # ─── QA fixtures (qa-loop iter-6 Cluster-F + EPIC-6 iter-6) ────────
 # Default-OFF on production; flipped to true via envsubst
 # QA_FIXTURES_ENABLED=true on the per-Sovereign overlay for any
@@ -95,7 +95,7 @@ spec:
 chart:
 spec:
 chart: bp-cert-manager-powerdns-webhook
-version: 1.0.4
+version: 1.1.0
 sourceRef:
 kind: HelmRepository
 name: bp-cert-manager-powerdns-webhook
@@ -945,6 +945,18 @@ write_files:
 QA_TEST_SESSION_ENABLED: "${qa_test_session_enabled}"
 QA_FIXTURES_NAMESPACE: "${qa_fixtures_namespace}"
 QA_ORGANIZATION: "${qa_organization}"
+# Wildcard cert ACME directory selector (Fix #123 — qa-loop
+# iter-1 LE rate-limit unblock). "true" makes
+# bp-catalyst-platform 1.4.136+ render
+# sovereign-wildcard-tls Certificate(s) against the staging
+# ClusterIssuer (`letsencrypt-dns01-staging-powerdns`,
+# shipped by bp-cert-manager-powerdns-webhook 1.1.0+) so the
+# production 5/168h LE rate limit per registered domain is
+# bypassed during high-cadence QA iteration. Catalyst-api
+# auto-stamps "true" alongside QA_FIXTURES_ENABLED on QA
+# Sovereigns; default "false" → real-trusted production
+# certs on customer Sovereigns.
+WILDCARD_CERT_USE_STAGING: "${wildcard_cert_use_staging}"
 ---
 apiVersion: kustomize.toolkit.fluxcd.io/v1
 kind: Kustomization
@@ -310,6 +310,7 @@ locals {
 qa_test_session_enabled = var.qa_test_session_enabled
 qa_fixtures_namespace = var.qa_fixtures_namespace
 qa_organization = var.qa_organization
+wildcard_cert_use_staging = var.wildcard_cert_use_staging
 cluster_mesh_name = var.cluster_mesh_name
 cluster_mesh_id = var.cluster_mesh_id
 
@@ -749,6 +750,7 @@ locals {
 qa_test_session_enabled = var.qa_test_session_enabled
 qa_fixtures_namespace = var.qa_fixtures_namespace
 qa_organization = var.qa_organization
+wildcard_cert_use_staging = var.wildcard_cert_use_staging
 # Per-secondary-region ClusterMesh anchors. id is incremented per
 # peer index so each secondary region gets a unique slot in the
 # mesh registry; primary region keeps var.cluster_mesh_id.
@@ -75,6 +75,34 @@ variable "qa_test_session_enabled" {
 }
 }
 
+# Wildcard cert ACME directory selector (Fix #123 — qa-loop iter-1 LE
+# rate-limit unblock). When 'true', bp-catalyst-platform 1.4.136+
+# sovereign-wildcard-tls Certificate(s) reference the staging
+# ClusterIssuer (`letsencrypt-dns01-staging-powerdns`, shipped by
+# bp-cert-manager-powerdns-webhook 1.1.0+) instead of production. The
+# staging ACME directory has separate, generous rate limits — the
+# production 5-certs/168h ceiling per registered domain is wholly
+# bypassed. Staging certs are signed by Fake LE Intermediate X1; browsers
+# reject without an explicit exception, but `curl -sk` and Playwright
+# (ignoreHTTPSErrors:true) accept them. Auto-set to match
+# qa_fixtures_enabled on QA Sovereigns by catalyst-api; default 'false'
+# here so a Sovereign that omits the QA flags entirely (production)
+# cannot accidentally issue staging certs and break browser TLS.
+# Threaded into bootstrap-kit Kustomization postBuild.substitute as
+# WILDCARD_CERT_USE_STAGING — the chart reads via
+# `${WILDCARD_CERT_USE_STAGING:-false}` so this var is the canonical
+# operator-controlled seam per docs/INVIOLABLE-PRINCIPLES.md #4
+# (never hardcode).
+variable "wildcard_cert_use_staging" {
+type = string
+description = "When 'true', bp-catalyst-platform issues sovereign wildcard certs from the staging Let's Encrypt issuer (separate generous rate limits, certs signed by Fake LE Intermediate X1 — browsers reject, curl -sk + Playwright ignoreHTTPSErrors:true accept). Auto-set by catalyst-api to match qa_fixtures_enabled on QA Sovereigns. Default 'false' for customer Sovereigns."
+default = "false"
+validation {
+condition = contains(["true", "false"], var.wildcard_cert_use_staging)
+error_message = "wildcard_cert_use_staging must be the string 'true' or 'false'."
+}
+}
+
 # qa-fixtures namespace + Organization names. Default to derivation-friendly
 # fallbacks that survive when qa_fixtures_enabled='false' (the chart
 # short-circuits before materialising them). When qa_fixtures_enabled='true',
@@ -1,6 +1,6 @@
 apiVersion: v2
 name: bp-cert-manager-powerdns-webhook
-version: 1.0.4
+version: 1.1.0
 appVersion: "v2.5.5"
 description: |
 Catalyst-authored Blueprint chart wrapping the upstream
@@ -1,5 +1,5 @@
 {{/*
-ClusterIssuer paired with the PowerDNS webhook above.
+ClusterIssuer(s) paired with the PowerDNS webhook above.
 
 Skip-render pattern (lesson from #387 follow-up #402): when the operator
 has not supplied a PowerDNS host (default-values render, e.g. CI smoke
@@ -17,13 +17,80 @@ Either being absent means "this Sovereign isn't ready to issue via
 PowerDNS yet" and the chart silently omits the resource. cert-manager's
 admission would reject the issuer at apply time anyway if we tried to
 emit it with an empty host, so failing soft here is strictly better.
+
+──────────────────────────────────────────────────────────────────────────
+Two ClusterIssuers — production AND staging (Fix #123, LE rate-limit)
+──────────────────────────────────────────────────────────────────────────
+This template renders TWO ClusterIssuers when both are enabled:
+
+- letsencrypt-dns01-prod-powerdns (production LE — real-trusted
+certs, 5/168h rate limit per
+registered domain)
+- letsencrypt-dns01-staging-powerdns (staging LE — separate generous
+rate limits, certs are NOT
+real-trusted but `curl -sk`
+accepts them; intended for QA
+Sovereigns and bring-up so the
+production rate limit is never
+tripped during iteration)
+
+Both ClusterIssuers reuse the same DNS-01 webhook config (same PowerDNS
+endpoint, same API key) — the ONLY difference is the ACME directory URL
+each one points at. cert-manager treats them as wholly independent
+issuers (separate ACME account keys, separate registration), so a rate-
+limit hit on production does NOT block staging issuance.
+
+The active issuer for a given Sovereign is selected by chart consumers
+(typically bp-catalyst-platform's wildcardCert.useStaging gate, which
+flips `wildcardCert.issuerName` to the staging issuer's name when the
+Sovereign is QA-flavoured) — this chart simply makes both available.
+
+Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), both ACME
+directory URLs are operator-overridable via clusterIssuer.acmeServer
+(production) and clusterIssuer.staging.acmeServer (staging). The
+staging default is the canonical Let's Encrypt staging endpoint
+(https://acme-staging-v02.api.letsencrypt.org/directory). A future
+operator wiring a private ACME staging server overrides via the
+per-Sovereign overlay without rebuilding this Blueprint.
+
+The staging variant is gated by clusterIssuer.staging.enabled (default
+true so every Sovereign has a staging fallback ready). Operators that
+want production-only on a customer Sovereign flip
+clusterIssuer.staging.enabled=false in the cluster overlay.
 */}}
 {{- if and .Values.clusterIssuer.enabled .Values.powerdns.host }}
+{{- $stagingCfg := .Values.clusterIssuer.staging | default dict }}
+{{- $stagingEnabled := true }}
+{{- if hasKey $stagingCfg "enabled" }}
+{{- $stagingEnabled = $stagingCfg.enabled }}
+{{- end }}
+{{- /* Variants list drives a two-pass render so prod+staging stay
+byte-identical except for `name`, ACME directory URL, and
+privateKeySecretRefName. */}}
+{{- $variants := list (dict
+"name" .Values.clusterIssuer.name
+"acmeServer" .Values.clusterIssuer.acmeServer
+"privateKeyRef" .Values.clusterIssuer.privateKeySecretRefName
+"issuerEnv" "production"
+"render" true
+) }}
+{{- $variants = append $variants (dict
+"name" ($stagingCfg.name | default "letsencrypt-dns01-staging-powerdns")
+"acmeServer" ($stagingCfg.acmeServer | default "https://acme-staging-v02.api.letsencrypt.org/directory")
+"privateKeyRef" ($stagingCfg.privateKeySecretRefName | default "letsencrypt-dns01-staging-powerdns-account-key")
+"issuerEnv" "staging"
+"render" $stagingEnabled
+) }}
+{{- range $i, $v := $variants }}
+{{- if $v.render }}
+{{- if $i }}
+---
+{{- end }}
 apiVersion: cert-manager.io/v1
 kind: ClusterIssuer
 metadata:
-name: {{ .Values.clusterIssuer.name }}
-{{- if .Values.clusterIssuer.helmHookEnabled }}
+name: {{ $v.name }}
+{{- if $.Values.clusterIssuer.helmHookEnabled }}
 annotations:
 # Helm post-install/post-upgrade hook: ClusterIssuer is a
 # cert-manager.io/v1 CRD that only exists AFTER the cert-manager
@@ -37,48 +104,49 @@ metadata:
 "helm.sh/hook-delete-policy": before-hook-creation
 {{- end }}
 labels:
-{{- include "bp-cert-manager-powerdns-webhook.labels" . | nindent 4 }}
+{{- include "bp-cert-manager-powerdns-webhook.labels" $ | nindent 4 }}
 catalyst.openova.io/issuer-class: dns01
 catalyst.openova.io/issuer-backend: powerdns
+catalyst.openova.io/issuer-env: {{ $v.issuerEnv | quote }}
 spec:
 acme:
-server: {{ .Values.clusterIssuer.acmeServer | quote }}
-email: {{ .Values.clusterIssuer.email | quote }}
+server: {{ $v.acmeServer | quote }}
+email: {{ $.Values.clusterIssuer.email | quote }}
 privateKeySecretRef:
-name: {{ .Values.clusterIssuer.privateKeySecretRefName }}
+name: {{ $v.privateKeyRef }}
 solvers:
 - dns01:
 webhook:
-groupName: {{ .Values.webhook.groupName | quote }}
-solverName: {{ .Values.webhook.solverName | quote }}
+groupName: {{ $.Values.webhook.groupName | quote }}
+solverName: {{ $.Values.webhook.solverName | quote }}
 config:
 # Base URL of the per-Sovereign PowerDNS REST API. Operator
 # MUST set this in the cluster overlay; see chart values
 # comment for the in-cluster vs external routing patterns.
-host: {{ .Values.powerdns.host | quote }}
-serverID: {{ .Values.powerdns.serverID | quote }}
-{{- with .Values.powerdns.apiKeyHeaderName }}
+host: {{ $.Values.powerdns.host | quote }}
+serverID: {{ $.Values.powerdns.serverID | quote }}
+{{- with $.Values.powerdns.apiKeyHeaderName }}
 apiKeyHeaderName: {{ . | quote }}
 {{- end }}
-{{- with .Values.powerdns.apiKeyScheme }}
+{{- with $.Values.powerdns.apiKeyScheme }}
 apiKeyScheme: {{ . | quote }}
 {{- end }}
-ttl: {{ .Values.powerdns.ttl }}
-{{- with .Values.powerdns.caBundle }}
+ttl: {{ $.Values.powerdns.ttl }}
+{{- with $.Values.powerdns.caBundle }}
 # PEM bundle MUST be base64-encoded for the JSON wire
 # format used by cert-manager's ChallengeRequest config
 # decoder. Operators paste the raw PEM into values; we
 # b64encode here so the cluster overlay never has to.
 caBundle: {{ . | b64enc | quote }}
 {{- end }}
-{{- with .Values.powerdns.headers }}
+{{- with $.Values.powerdns.headers }}
 headers:
 {{- toYaml . | nindent 16 }}
 {{- end }}
 apiKeySecretRef:
-name: {{ .Values.powerdns.apiKeySecretRef.name | quote }}
-key: {{ .Values.powerdns.apiKeySecretRef.key | quote }}
-{{- with .Values.powerdns.apiKeySecretRef.namespace }}
+name: {{ $.Values.powerdns.apiKeySecretRef.name | quote }}
+key: {{ $.Values.powerdns.apiKeySecretRef.key | quote }}
+{{- with $.Values.powerdns.apiKeySecretRef.namespace }}
 # Cluster-scoped namespace — only set if the operator
 # places the API-key secret outside the cert-manager
 # namespace. cert-manager's webhook framework uses
@@ -88,3 +156,5 @@ spec:
 namespace: {{ . | quote }}
 {{- end }}
 {{- end }}
+{{- end }}
+{{- end }}
@@ -211,9 +211,15 @@ clusterIssuer:
 name: letsencrypt-dns01-prod-powerdns
 # ACME account email used for renewal notifications.
 email: ops@openova.io
-# Production Let's Encrypt directory. Set to the staging URL during
-# bring-up:
-# https://acme-staging-v02.api.letsencrypt.org/directory
+# Production Let's Encrypt directory. Per docs/INVIOLABLE-PRINCIPLES.md
+# #4 (never hardcode), this is operator-overridable from the cluster
+# overlay. The staging endpoint
+# (https://acme-staging-v02.api.letsencrypt.org/directory) is now
+# exposed as a SECOND ClusterIssuer (see clusterIssuer.staging below)
+# so production-vs-staging is no longer a one-or-the-other knob —
+# both issuers live side by side and chart consumers (e.g.
+# bp-catalyst-platform's wildcardCert.useStaging) flip the cert's
+# issuerRef to pick.
 acmeServer: https://acme-v02.api.letsencrypt.org/directory
 # Name of the Secret cert-manager uses to store the ACME account key.
 privateKeySecretRefName: letsencrypt-dns01-prod-powerdns-account-key
@@ -222,6 +228,33 @@ clusterIssuer:
 # post-install hook pattern as bp-cert-manager's own ClusterIssuers.
 helmHookEnabled: true
 
+# ─── Staging (Let's Encrypt staging) ClusterIssuer (Fix #123) ────────
+# Renders a SECOND ClusterIssuer alongside the production one above,
+# pointing at LE's staging ACME directory. Same DNS-01 webhook config,
+# same PowerDNS endpoint, same API key — only the ACME account +
+# directory URL differ. Staging issues from a separate trust root
+# (Fake LE Intermediate X1) so browsers reject without an explicit
+# exception, but `curl -sk` and Playwright's `ignoreHTTPSErrors: true`
+# accept the cert. Generous rate limits (~30k certs/3h) make staging
+# the safe default for QA Sovereigns whose iteration cadence would
+# exhaust LE production's 5/168h ceiling on the first wipe-cycle hour.
+#
+# Default `enabled: true` so every Sovereign that has the production
+# issuer ALSO has staging available. Operators that want production-
+# only on a customer Sovereign flip `staging.enabled: false` in the
+# per-cluster overlay. The `name` is distinct from production so the
+# two ClusterIssuers never collide.
+#
+# Per docs/INVIOLABLE-PRINCIPLES.md #4 the ACME URL is overridable —
+# an operator wiring a private staging ACME (e.g. an internal Smallstep
+# CA) sets `staging.acmeServer` from the per-cluster overlay without
+# rebuilding this Blueprint.
+staging:
+enabled: true
+name: letsencrypt-dns01-staging-powerdns
+acmeServer: https://acme-staging-v02.api.letsencrypt.org/directory
+privateKeySecretRefName: letsencrypt-dns01-staging-powerdns-account-key
+
 # ─── Service + APIService ────────────────────────────────────────────────
 service:
 # ClusterIP — cert-manager calls the webhook via the kube-apiserver's
@@ -1205,6 +1205,20 @@ func writeTfvars(deployDir string, req Request) error {
 "qa_fixtures_enabled": map[bool]string{true: "true", false: "false"}[req.QATestEnabled],
 "qa_test_session_enabled": map[bool]string{true: "true", false: "false"}[req.QATestEnabled],
 
+// Wildcard cert staging-LE selector (Fix #123 — qa-loop iter-1 LE
+// rate-limit unblock). When QATestEnabled=true the per-Sovereign
+// overlay sets WILDCARD_CERT_USE_STAGING=true → bp-catalyst-platform
+// 1.4.136+ renders the sovereign-wildcard-tls Certificate(s) with
+// `issuerRef.name: letsencrypt-dns01-staging-powerdns` instead of
+// the production issuer. Staging hits LE's separate ACME directory
+// with generous rate limits, so the wipe + re-provision cadence
+// of QA Sovereigns no longer trips production's 5-certs/168h
+// ceiling per registered domain. Customer Sovereigns
+// (QATestEnabled=false) provision real-trusted production certs.
+// Stringified for the same envsubst-passthrough reason as
+// qa_fixtures_enabled.
+"wildcard_cert_use_staging": map[bool]string{true: "true", false: "false"}[req.QATestEnabled],
+
 // QA namespace + Organization names — derived from the Sovereign
 // FQDN's first label at provision time per principle #4 (never
 // hardcode). The chart's defaults (qa-omantel / omantel-platform)
@@ -598,6 +598,12 @@ func TestWriteTfvars_QAFixtures_DefaultDisabled(t *testing.T) {
 if v, _ := parsed["qa_test_session_enabled"].(string); v != "false" {
 t.Fatalf("qa_test_session_enabled MUST default 'false' on customer Sovereigns, got %q", v)
 }
+// Fix #123 — wildcard_cert_use_staging MUST default 'false' so a
+// customer Sovereign issues real-trusted production LE certs (not
+// Fake-LE-Intermediate-X1 staging certs that browsers reject).
+if v, _ := parsed["wildcard_cert_use_staging"].(string); v != "false" {
+t.Fatalf("wildcard_cert_use_staging MUST default 'false' on customer Sovereigns (real-trusted production certs), got %q", v)
+}
 }
 
 // TestWriteTfvars_QAFixtures_EnabledDerivesNamespaceAndOrg proves that when
@@ -659,6 +665,14 @@ func TestWriteTfvars_QAFixtures_EnabledDerivesNamespaceAndOrg(t *testing.T) {
 if v, _ := parsed["qa_test_session_enabled"].(string); v != "true" {
 t.Errorf("qa_test_session_enabled: got %q want \"true\"", v)
 }
+// Fix #123 — wildcard_cert_use_staging auto-flips 'true' on QA
+// Sovereigns so the Sovereign issues from LE staging (separate
+// generous rate limits) instead of production. Without this the
+// wipe + re-provision cadence of QA Sovereigns trips the
+// production 5/168h ceiling within hours.
+if v, _ := parsed["wildcard_cert_use_staging"].(string); v != "true" {
+t.Errorf("wildcard_cert_use_staging: got %q want \"true\" (QA Sovereigns MUST issue staging certs to bypass LE production rate limit)", v)
+}
 if v, _ := parsed["qa_fixtures_namespace"].(string); v != tc.wantNs {
 t.Errorf("qa_fixtures_namespace: got %q want %q (derived from FQDN first label)", v, tc.wantNs)
 }
@@ -1,58 +1,38 @@
 apiVersion: v2
 name: bp-catalyst-platform
-# 1.4.136 (qa-loop iter-1 Fix #124, secondary Fix #122 — convert
-# catalyst-gitea-token bootstrap to pre-install hook so consumers
-# (catalog + organization-controller + api) see a populated token at
-# their first container start):
+# 1.4.136 (qa-loop bounded-provision-cycle Fix #123, LE rate-limit
+# bypass via staging ClusterIssuer for QA Sovereigns):
 #
-# Root cause (qa-loop iter-1 monitor Fix #122 surfaced 2026-05-10):
-# On every fresh Sovereign install of bp-catalyst-platform 1.4.135
-# the `catalyst-catalog` and `catalyst-organization-controller`
-# Pods enter CrashLoopBackOff with:
-# {"level":"ERROR","msg":"config load failed",
-# "err":"config: CATALYST_GITEA_TOKEN is required"}
-# even though the Secret `catalyst-system/catalyst-gitea-token`
-# exists. Inspection shows `data.token: ""` — the Secret was
-# created empty by the chart's lookup-existing-target idempotency
-# path (lookup returns nil on a fresh install → token bytes are
-# empty), and the post-install mint Job that was supposed to
-# populate it ran AFTER the Deployments had already crashed and
-# accumulated exponential back-off windows. Helm's 15m install
-# timeout lapsed before the Pods could pick up the patched token,
-# triggering uninstall-remediation → reinstall → loop.
+# Root cause (iter-1 wedge, 2026-05-10):
+# Let's Encrypt production hit the 5-certs/168h rate limit on
+# `*.omantel.biz` (retry after 2026-05-11 22:08 UTC). Cilium-envoy
+# could not get a wildcard cert → console.omantel.biz TLS handshake
+# failed → iter-1 Test Executor could not run. Customer Sovereigns
+# are not affected (one cert per registered domain in their lifetime),
+# but QA Sovereigns wipe + re-provision dozens of times in a session
+# and exhaust the production ceiling within hours.
 #
-# This is the chicken-and-egg ordering hazard documented in
-# docs/INVIOLABLE-PRINCIPLES.md #1 (waterfall, not iterative MVP):
-# credential bootstrap MUST land before the consumers it serves.
+# Fix:
+# - bp-cert-manager-powerdns-webhook 1.1.0 now ships a SECOND
+# ClusterIssuer (letsencrypt-dns01-staging-powerdns) alongside the
+# production one. Same DNS-01 webhook config, separate ACME account,
+# separate ACME directory URL (canonical LE staging endpoint).
+# Production rate limit is wholly independent of staging.
+# - This chart adds `wildcardCert.useStaging` (bool, default false).
+# When true, sovereign-wildcard-certs.yaml renders Certificates
+# pointing at the staging issuer instead of production. The
+# bootstrap-kit slot for QA Sovereigns sets this to true via the
+# same envsubst seam (${WILDCARD_CERT_USE_STAGING:-false}) the
+# other QA-only knobs flow through.
+# - cilium-envoy then gets a staging-signed wildcard cert in <2 min.
+# `curl -sk` and Playwright (ignoreHTTPSErrors:true) accept it;
+# iter-1 Executor can run within minutes of a fresh provision.
 #
-# Fix (chart-side, no app-code change required):
-# Move the entire token-bootstrap flow (Secret, ServiceAccount,
-# Roles, RoleBinding, Job) from `helm.sh/hook: post-install,
-# post-upgrade` to `helm.sh/hook: pre-install,pre-upgrade`. The
-# Secret is created at hook-weight=5; the mint Job at hook-weight=
-# 10. Helm runs the entire pre-install hook chain to completion
-# BEFORE applying any regular release resource. Result: when the
-# catalog / organization-controller Deployments are applied, the
-# Secret already carries a real PAT, the kubelet mounts it as
-# CATALYST_GITEA_TOKEN, and the Pods start cleanly on first try.
-#
-# Defensive alignment in services/catalog/deployment.yaml: add
-# `optional: true` to the secretKeyRef so the wiring matches the
-# existing api-deployment + organization-controller convention
-# (cosmetic — the Secret always exists in the pre-install path,
-# but `optional: true` keeps kubelet from blocking Pod start
-# should any future reordering regress this).
-#
-# Lookup contract preserved: on upgrades, `lookup` returns the
-# existing Secret with the populated token, the template re-emits
-# the same bytes, and the mint Job's runtime check (`EXISTING_TOKEN
-# != ""`) short-circuits with exit 0. helm.sh/resource-policy: keep
-# is retained on the Secret so it survives helm uninstalls.
-#
-# Per principle 4 / feedback_inviolable_principles.md #1: target
-# state, not MVP. The pre-install hook IS the canonical seam for
-# Sovereign credential bootstrap (mirrors bp-keycloak's keycloak-
-# config-cli pre-install pattern).
+# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the issuer
+# name is fully values-overridable — operators that wire a private
+# staging ACME (e.g. internal Smallstep CA) override the issuer
+# alongside the bp-cert-manager-powerdns-webhook staging URL without
+# touching this chart.
 #
 # 1.4.135 (qa-loop bounded-provision-cycle Fix #119, sanitize illegal
 # `/` in qa-fixtures Continuum mirror label value — unblocks prov #11):
@@ -978,7 +958,7 @@ name: bp-catalyst-platform
 # - values.yaml: new knobs `cnpgPairAliasName`,
 # `cnpgPairPostSwitchoverPrimary`, `continuumPlatformNamespace` —
 # all values-overridable per INVIOLABLE-PRINCIPLES #4.
-version: 1.4.136
+version: 1.4.137
 appVersion: 1.4.94
 # 1.4.129 (qa-loop iter-16 Fix #65): ship the missing
 # `openova-catalog` Flux v1 HelmRepository in flux-system. The
@@ -52,7 +52,29 @@ Resource naming:
 */}}
 {{- if .Values.wildcardCert.enabled }}
 {{- $ns := .Values.wildcardCert.namespace | default "kube-system" }}
+{{/*
+Issuer selection (Fix #123, LE rate-limit bypass for QA Sovereigns):
+- .Values.wildcardCert.useStaging=true → staging issuer (default
+`letsencrypt-dns01-staging-powerdns`, shipped by
+bp-cert-manager-powerdns-webhook 1.1.0+ alongside the production
+issuer). Hits LE's staging ACME endpoint
+(https://acme-staging-v02.api.letsencrypt.org/directory). Cert is
+signed by Fake LE Intermediate X1 so browsers reject without an
+explicit exception, but `curl -sk` and Playwright
+(ignoreHTTPSErrors:true) accept it. Production rate limit (5
+certs/168h per registered domain) does NOT apply to staging.
+- .Values.wildcardCert.useStaging=false → production issuer (default
+`letsencrypt-dns01-prod-powerdns`). Real-trusted certs.
+
+Default false on the chart; the bootstrap-kit slot for QA Sovereigns
+flips this to true via ${WILDCARD_CERT_USE_STAGING:-false} envsubst.
+Per docs/INVIOLABLE-PRINCIPLES.md #4 every issuer name is values-
+overridable (e.g. private ACME).
+*/}}
 {{- $issuer := .Values.wildcardCert.issuerName | default "letsencrypt-dns01-prod-powerdns" }}
+{{- if .Values.wildcardCert.useStaging }}
+{{- $issuer = .Values.wildcardCert.issuerNameStaging | default "letsencrypt-dns01-staging-powerdns" }}
+{{- end }}
 {{- $duration := .Values.wildcardCert.duration }}
 {{- $renewBefore := .Values.wildcardCert.renewBefore }}
 
@@ -135,6 +135,35 @@ wildcardCert:
 # override to a per-cluster issuer (e.g. a private ACME) via
 # cluster overlay.
 issuerName: letsencrypt-dns01-prod-powerdns
+# ─── Let's Encrypt staging fallback (Fix #123) ─────────────────────
+# When `useStaging: true`, the rendered Certificate(s) reference the
+# staging issuer (`issuerNameStaging`, default
+# `letsencrypt-dns01-staging-powerdns` shipped by
+# bp-cert-manager-powerdns-webhook 1.1.0+) instead of `issuerName`.
+# The staging issuer hits Let's Encrypt's staging ACME directory
+# (https://acme-staging-v02.api.letsencrypt.org/directory), which
+# has separate, generous rate limits — the production 5-certs/168h
+# ceiling per registered domain is wholly bypassed. The cert is
+# signed by Fake LE Intermediate X1 so browsers reject without an
+# explicit exception, but `curl -sk` and Playwright
+# (ignoreHTTPSErrors:true) accept it. Intended for QA Sovereigns
+# whose wipe + re-provision cadence would otherwise exhaust LE
+# production within hours.
+#
+# Default false — customer Sovereigns issue real-trusted production
+# certs. The bootstrap-kit slot 13-bp-catalyst-platform.yaml flips
+# this to true on QA Sovereigns via the
+# ${WILDCARD_CERT_USE_STAGING:-false} envsubst seam (same pattern
+# as ${QA_FIXTURES_ENABLED:-false}). Per
+# docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every Sovereign
+# may flip this independently from a per-cluster overlay.
+useStaging: false
+# Name of the staging ClusterIssuer. Defaults to the canonical name
+# shipped by bp-cert-manager-powerdns-webhook 1.1.0+. Operators that
+# wire a private staging ACME (e.g. internal Smallstep CA) override
+# both this and the bp-cert-manager-powerdns-webhook staging block
+# via the per-cluster overlay.
+issuerNameStaging: letsencrypt-dns01-staging-powerdns
 # Cert renew window. cert-manager defaults are conservative; we
 # match the per-Sovereign cilium-gateway-cert.yaml legacy values.
 duration: "" # empty = cert-manager default (90d for LE)