openova/platform/cert-manager-dynadot-webhook
e3mrah 2e9cfd4a57
fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23's first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name sits after /v2/ and before the image path, not as a URL prefix in front of /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
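The corrected shape looks roughly like the following registries.yaml fragment, per k3s's documented `rewrite` syntax. The proxy project names come from this PR series; the robot-account username and the auth block layout are illustrative assumptions:

```yaml
# /etc/rancher/k3s/registries.yaml (sketch)
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"   # bare Harbor host, no project in the path
    rewrite:
      # containerd's final URL becomes
      # https://harbor.openova.io/v2/proxy-dockerhub/<image>/manifests/<tag>
      "^(.*)$": "proxy-dockerhub/$1"
  registry.k8s.io:
    endpoint:
      - "https://harbor.openova.io"
    rewrite:
      "^(.*)$": "proxy-k8s/$1"
configs:
  harbor.openova.io:
    auth:
      username: "robot$proxy-pull"          # illustrative robot-account name
      password: "${harbor_robot_token}"     # templated from tofu.auto.tfvars.json
```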

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
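In values form, the fix amounts to the following shape (one component shown; the exact value paths follow the cowboysysop subchart's per-component `image` convention and may differ from the real bp-vpa values):

```yaml
# bp-vpa values sketch
vertical-pod-autoscaler:
  recommender:
    image:
      registry: registry.k8s.io              # subchart default, left in place
      repository: autoscaling/vpa-recommender   # was registry.k8s.io/autoscaling/..., now un-prefixed
      tag: 1.5.0
# Rendered image: registry.k8s.io/autoscaling/vpa-recommender:1.5.0
```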

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.
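In values form, the two changes above look roughly like this (the image repository path is an assumption based on the ghcr.io/openova-io/openova/* naming mentioned above; `webhook.imagePullSecrets` is the key named in this commit):

```yaml
# bp-cert-manager-dynadot-webhook chart values sketch
webhook:
  image:
    repository: ghcr.io/openova-io/openova/cert-manager-dynadot-webhook  # illustrative path
    tag: "942be6f"            # pinned short SHA, never :latest
  imagePullSecrets:
    - name: ghcr-pull         # Reflector-mirrored Secret in the cert-manager namespace
```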

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:52:42 +04:00
blueprint.yaml feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291) 2026-04-30 19:37:47 +04:00
README.md feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291) 2026-04-30 19:37:47 +04:00

bp-cert-manager-dynadot-webhook

Catalyst Blueprint for the cert-manager DNS-01 external webhook for Dynadot. Closes openova#159.

What it is

A Go binary that satisfies cert-manager's external webhook contract (the webhook.acme.cert-manager.io/v1alpha1 aggregated API's Present / CleanUp calls on a ChallengeRequest) and writes ACME challenge TXT records to a Dynadot-managed pool domain via the api3.json endpoint.

The binary lives at core/cmd/cert-manager-dynadot-webhook/. The HTTP transport, command builders, and zone-safety contract live in core/pkg/dynadot-client/ and are shared with the other Catalyst services that talk to Dynadot (pool-domain-manager, catalyst-dns).
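The contract the binary satisfies is cert-manager's webhook Solver interface, reproduced here as a self-contained sketch (the real ChallengeRequest and Solver types live in cert-manager's pkg/acme/webhook packages and carry more fields and an Initialize method; the dynadotSolver body is illustrative):

```go
package main

import "fmt"

// ChallengeRequest sketch: the fields a DNS-01 solver actually needs.
type ChallengeRequest struct {
	ResolvedFQDN string // e.g. "_acme-challenge.omantel.omani.works."
	Key          string // TXT record value for the DNS-01 challenge
}

// Solver sketch of cert-manager's external-webhook contract.
type Solver interface {
	Name() string
	Present(ch *ChallengeRequest) error // write the TXT record
	CleanUp(ch *ChallengeRequest) error // remove it after validation
}

// dynadotSolver stands in for the real implementation, which calls the
// shared core/pkg/dynadot-client for both operations.
type dynadotSolver struct{}

func (s *dynadotSolver) Name() string { return "dynadot" } // illustrative solver name
func (s *dynadotSolver) Present(ch *ChallengeRequest) error {
	fmt.Printf("add TXT %s -> %s\n", ch.ResolvedFQDN, ch.Key)
	return nil
}
func (s *dynadotSolver) CleanUp(ch *ChallengeRequest) error {
	fmt.Printf("remove TXT %s\n", ch.ResolvedFQDN)
	return nil
}

func main() {
	var s Solver = &dynadotSolver{}
	_ = s.Present(&ChallengeRequest{ResolvedFQDN: "_acme-challenge.example.com.", Key: "token"})
	_ = s.CleanUp(&ChallengeRequest{ResolvedFQDN: "_acme-challenge.example.com."})
}
```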

Why this exists separately from external-dns-dynadot-webhook

cert-manager's webhook contract and external-dns's webhook contract are DIFFERENT protocols. external-dns expects a sidecar that implements records.list / records.add / records.delete over an HTTP RPC schema; cert-manager expects an aggregated Kubernetes apiserver that responds to ChallengeRequest CRs. The two binaries cannot share code at the transport layer. They DO share the underlying Dynadot HTTP client at core/pkg/dynadot-client/.

What this chart deploys

  • Deployment: runs the webhook binary as a non-root pod in the chart's release namespace.
  • Service: ClusterIP fronting the Deployment on port 443.
  • APIService: registers v1alpha1.acme.dynadot.openova.io so the kube-apiserver routes ChallengeRequest calls to the Service.
  • Issuer (selfsigned): bootstraps the CA chain that issues the webhook's serving cert.
  • Issuer (CA): signs the leaf serving cert from the CA Secret.
  • Certificate (CA): root CA cert used by the APIService's cert-manager.io/inject-ca-from annotation.
  • Certificate (serving): leaf cert mounted into the Deployment at /tls.
  • ServiceAccount: identity for the Deployment.
  • ClusterRoleBinding (auth-delegator): lets the aggregated apiserver delegate auth back to kube-apiserver.
  • RoleBinding (auth-reader): allows reading the extension-apiserver-authentication ConfigMap in kube-system.
  • Role + RoleBinding (dynadot secret): grants the SA read access to the Dynadot credentials Secret in the configured namespace.

Pairing with bp-cert-manager

bp-cert-manager's letsencrypt-dns01-prod ClusterIssuer points at this webhook via solvers[].dns01.webhook.groupName + solverName. The two charts MUST be deployed on the same Sovereign and bp-cert-manager-dynadot-webhook MUST be Ready before any wildcard Certificate is requested.

The bp-cert-manager chart now ships with dns01.enabled: true by default (changed in this PR — was false while the webhook was being built). The interim letsencrypt-http01-prod issuer remains templated as the rollback path; flip certManager.issuers.dns01.enabled=false in the umbrella values to disable wildcard issuance and continue with per-host certs.
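The pairing looks roughly like this ClusterIssuer fragment. The groupName matches the APIService group registered by this chart; the solverName and the config keys are illustrative assumptions, not confirmed values from the paired template:

```yaml
# Fragment of the letsencrypt-dns01-prod ClusterIssuer (sketch)
spec:
  acme:
    solvers:
      - dns01:
          webhook:
            groupName: acme.dynadot.openova.io   # from the APIService above
            solverName: dynadot                  # illustrative
            config:
              secretName: dynadot-api-credentials  # illustrative
```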

Credentials

The webhook reads three values from a Kubernetes Secret in its release namespace:

  • DYNADOT_API_KEY: read from Secret key api-key
  • DYNADOT_API_SECRET: read from Secret key api-secret
  • DYNADOT_MANAGED_DOMAINS: read from Secret key domains (legacy fallback: domain)

The canonical secret (dynadot-api-credentials in openova-system) is shared with pool-domain-manager and catalyst-dns. Because Pod secretKeyRef cannot cross namespaces, the cluster overlay MUST replicate the secret into the webhook's release namespace via ExternalSecret (preferred) or reflector annotations. See clusters/_template/dynadot-credentials-replication.yaml.

Domain allowlist

DYNADOT_MANAGED_DOMAINS is a comma- or whitespace-separated allowlist of pool domains the webhook is permitted to mutate. ChallengeRequests for domains NOT under any allowlisted apex are rejected before any Dynadot API call is made. This is the same defence pattern pool-domain-manager and catalyst-dns use; it prevents a misconfigured ClusterIssuer from causing the webhook to write to a third-party domain.
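The check can be sketched as a suffix match against the allowlisted apexes. This is a hedged illustration of the defence pattern, not the real implementation (which may differ in details such as trailing-dot or wildcard handling):

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether fqdn sits at or under one of the allowlisted
// pool-domain apexes. Comparison is case-insensitive and tolerates a
// trailing dot on the FQDN.
func allowed(fqdn string, apexes []string) bool {
	fqdn = strings.TrimSuffix(strings.ToLower(fqdn), ".")
	for _, apex := range apexes {
		apex = strings.ToLower(strings.TrimSpace(apex))
		if apex == "" {
			continue
		}
		if fqdn == apex || strings.HasSuffix(fqdn, "."+apex) {
			return true
		}
	}
	return false
}

func main() {
	// DYNADOT_MANAGED_DOMAINS is comma- or whitespace-separated.
	apexes := strings.FieldsFunc("omani.works, openova.dev",
		func(r rune) bool { return r == ',' || r == ' ' || r == '\t' || r == '\n' })
	fmt.Println(allowed("_acme-challenge.omantel.omani.works.", apexes)) // true: under an apex
	fmt.Println(allowed("victim-domain.com", apexes))                    // false: rejected before any API call
}
```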

Zone safety

The shared core/pkg/dynadot-client/ enforces the safety contract documented in memory/feedback_dynadot_dns.md: every mutation either uses the append path (add_dns_to_current_setting=yes) or performs a read-modify-write via domain_info → set_dns2. The destructive zone-wipe variant of set_dns2 is unexported. The webhook's Present path uses AddRecord (append); CleanUp uses RemoveSubRecord (read-modify-write that match-deletes a single record).

Smoke test

Once both charts are reconciled on a Sovereign:

# Verify the webhook is running and the APIService is healthy
kubectl get -n cert-manager deploy/release-name-bp-cert-manager-dynadot-webhook
kubectl get apiservices.apiregistration.k8s.io v1alpha1.acme.dynadot.openova.io

# Issue a wildcard cert against the Sovereign apex
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-omantel-omani-works
  namespace: cilium-gateway
spec:
  secretName: wildcard-omantel-omani-works-tls
  issuerRef:
    name: letsencrypt-dns01-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.omantel.omani.works"
EOF

# Watch the Order + Challenge progress
kubectl get certificate,order,challenge -A -w

See also

  • core/cmd/cert-manager-dynadot-webhook/ — binary source
  • core/pkg/dynadot-client/ — shared Dynadot HTTP client
  • platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml — paired ClusterIssuer
  • openova#159 — closing issue
  • cert-manager DNS-01 webhook docs