openova/platform/cert-manager-dynadot-webhook
e3mrah 2e9cfd4a57
fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23's first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name sits after /v2/ and before the image path, not as a URL prefix in front of /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
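The corrected shape looks roughly like the following registries.yaml fragment, per k3s's documented `rewrite` syntax. The proxy project names come from this PR series; the robot-account username and the auth block layout are illustrative assumptions:

```yaml
# /etc/rancher/k3s/registries.yaml (sketch)
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"   # bare Harbor host, no project in the path
    rewrite:
      # containerd's final URL becomes
      # https://harbor.openova.io/v2/proxy-dockerhub/<image>/manifests/<tag>
      "^(.*)$": "proxy-dockerhub/$1"
  registry.k8s.io:
    endpoint:
      - "https://harbor.openova.io"
    rewrite:
      "^(.*)$": "proxy-k8s/$1"
configs:
  harbor.openova.io:
    auth:
      username: "robot$proxy-pull"          # illustrative robot-account name
      password: "${harbor_robot_token}"     # templated from tofu.auto.tfvars.json
```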

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
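In values form, the fix amounts to the following shape (one component shown; the exact value paths follow the cowboysysop subchart's per-component `image` convention and may differ from the real bp-vpa values):

```yaml
# bp-vpa values sketch
vertical-pod-autoscaler:
  recommender:
    image:
      registry: registry.k8s.io              # subchart default, left in place
      repository: autoscaling/vpa-recommender   # was registry.k8s.io/autoscaling/..., now un-prefixed
      tag: 1.5.0
# Rendered image: registry.k8s.io/autoscaling/vpa-recommender:1.5.0
```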

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.
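In values form, the two changes above look roughly like this (the image repository path is an assumption based on the ghcr.io/openova-io/openova/* naming mentioned above; `webhook.imagePullSecrets` is the key named in this commit):

```yaml
# bp-cert-manager-dynadot-webhook chart values sketch
webhook:
  image:
    repository: ghcr.io/openova-io/openova/cert-manager-dynadot-webhook  # illustrative path
    tag: "942be6f"            # pinned short SHA, never :latest
  imagePullSecrets:
    - name: ghcr-pull         # Reflector-mirrored Secret in the cert-manager namespace
```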

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:52:42 +04:00
blueprint.yaml feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291) 2026-04-30 19:37:47 +04:00
README.md feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291) 2026-04-30 19:37:47 +04:00

bp-cert-manager-dynadot-webhook

Catalyst Blueprint for the cert-manager DNS-01 external webhook for Dynadot. Closes openova#159.

What it is

A Go binary that satisfies cert-manager's external webhook contract (the webhook.acme.cert-manager.io/v1alpha1 aggregated API's Present / CleanUp calls on a ChallengeRequest) and writes ACME challenge TXT records to a Dynadot-managed pool domain via the api3.json endpoint.

The binary lives at core/cmd/cert-manager-dynadot-webhook/. The HTTP transport, command builders, and zone-safety contract live in core/pkg/dynadot-client/ and are shared with the other Catalyst services that talk to Dynadot (pool-domain-manager, catalyst-dns).
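The contract the binary satisfies is cert-manager's webhook Solver interface, reproduced here as a self-contained sketch (the real ChallengeRequest and Solver types live in cert-manager's pkg/acme/webhook packages and carry more fields and an Initialize method; the dynadotSolver body is illustrative):

```go
package main

import "fmt"

// ChallengeRequest sketch: the fields a DNS-01 solver actually needs.
type ChallengeRequest struct {
	ResolvedFQDN string // e.g. "_acme-challenge.omantel.omani.works."
	Key          string // TXT record value for the DNS-01 challenge
}

// Solver sketch of cert-manager's external-webhook contract.
type Solver interface {
	Name() string
	Present(ch *ChallengeRequest) error // write the TXT record
	CleanUp(ch *ChallengeRequest) error // remove it after validation
}

// dynadotSolver stands in for the real implementation, which calls the
// shared core/pkg/dynadot-client for both operations.
type dynadotSolver struct{}

func (s *dynadotSolver) Name() string { return "dynadot" } // illustrative solver name
func (s *dynadotSolver) Present(ch *ChallengeRequest) error {
	fmt.Printf("add TXT %s -> %s\n", ch.ResolvedFQDN, ch.Key)
	return nil
}
func (s *dynadotSolver) CleanUp(ch *ChallengeRequest) error {
	fmt.Printf("remove TXT %s\n", ch.ResolvedFQDN)
	return nil
}

func main() {
	var s Solver = &dynadotSolver{}
	_ = s.Present(&ChallengeRequest{ResolvedFQDN: "_acme-challenge.example.com.", Key: "token"})
	_ = s.CleanUp(&ChallengeRequest{ResolvedFQDN: "_acme-challenge.example.com."})
}
```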

Why this exists separately from external-dns-dynadot-webhook

cert-manager's webhook contract and external-dns's webhook contract are DIFFERENT protocols. external-dns expects a sidecar that implements records.list / records.add / records.delete over an HTTP RPC schema; cert-manager expects an aggregated Kubernetes apiserver that responds to ChallengeRequest CRs. The two binaries cannot share code at the transport layer. They DO share the underlying Dynadot HTTP client at core/pkg/dynadot-client/.

What this chart deploys

  • Deployment: runs the webhook binary as a non-root pod in the chart's release namespace.
  • Service: ClusterIP fronting the Deployment on port 443.
  • APIService: registers v1alpha1.acme.dynadot.openova.io so the kube-apiserver routes ChallengeRequest calls to the Service.
  • Issuer (selfsigned): bootstraps the CA chain that issues the webhook's serving cert.
  • Issuer (CA): signs the leaf serving cert from the CA Secret.
  • Certificate (CA): root CA cert used by the APIService's cert-manager.io/inject-ca-from annotation.
  • Certificate (serving): leaf cert mounted into the Deployment at /tls.
  • ServiceAccount: identity for the Deployment.
  • ClusterRoleBinding (auth-delegator): lets the aggregated apiserver delegate auth back to kube-apiserver.
  • RoleBinding (auth-reader): allows reading the extension-apiserver-authentication ConfigMap in kube-system.
  • Role + RoleBinding (dynadot secret): grants the SA read access to the Dynadot credentials Secret in the configured namespace.

Pairing with bp-cert-manager

bp-cert-manager's letsencrypt-dns01-prod ClusterIssuer points at this webhook via solvers[].dns01.webhook.groupName + solverName. The two charts MUST be deployed on the same Sovereign and bp-cert-manager-dynadot-webhook MUST be Ready before any wildcard Certificate is requested.

The bp-cert-manager chart now ships with dns01.enabled: true by default (changed in this PR — was false while the webhook was being built). The interim letsencrypt-http01-prod issuer remains templated as the rollback path; flip certManager.issuers.dns01.enabled=false in the umbrella values to disable wildcard issuance and continue with per-host certs.
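The pairing looks roughly like this ClusterIssuer fragment. The groupName matches the APIService group registered by this chart; the solverName and the config keys are illustrative assumptions, not confirmed values from the paired template:

```yaml
# Fragment of the letsencrypt-dns01-prod ClusterIssuer (sketch)
spec:
  acme:
    solvers:
      - dns01:
          webhook:
            groupName: acme.dynadot.openova.io   # from the APIService above
            solverName: dynadot                  # illustrative
            config:
              secretName: dynadot-api-credentials  # illustrative
```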

Credentials

The webhook reads three values from a Kubernetes Secret in its release namespace:

  • DYNADOT_API_KEY: read from Secret key api-key
  • DYNADOT_API_SECRET: read from Secret key api-secret
  • DYNADOT_MANAGED_DOMAINS: read from Secret key domains (legacy fallback: domain)

The canonical secret (dynadot-api-credentials in openova-system) is shared with pool-domain-manager and catalyst-dns. Because Pod secretKeyRef cannot cross namespaces, the cluster overlay MUST replicate the secret into the webhook's release namespace via ExternalSecret (preferred) or reflector annotations. See clusters/_template/dynadot-credentials-replication.yaml.

Domain allowlist

DYNADOT_MANAGED_DOMAINS is a comma- or whitespace-separated allowlist of pool domains the webhook is permitted to mutate. ChallengeRequests for domains NOT under any allowlisted apex are rejected before any Dynadot API call is made. This is the same defence pattern pool-domain-manager and catalyst-dns use; it prevents a misconfigured ClusterIssuer from causing the webhook to write to a third-party domain.
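The check can be sketched as a suffix match against the allowlisted apexes. This is a hedged illustration of the defence pattern, not the real implementation (which may differ in details such as trailing-dot or wildcard handling):

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether fqdn sits at or under one of the allowlisted
// pool-domain apexes. Comparison is case-insensitive and tolerates a
// trailing dot on the FQDN.
func allowed(fqdn string, apexes []string) bool {
	fqdn = strings.TrimSuffix(strings.ToLower(fqdn), ".")
	for _, apex := range apexes {
		apex = strings.ToLower(strings.TrimSpace(apex))
		if apex == "" {
			continue
		}
		if fqdn == apex || strings.HasSuffix(fqdn, "."+apex) {
			return true
		}
	}
	return false
}

func main() {
	// DYNADOT_MANAGED_DOMAINS is comma- or whitespace-separated.
	apexes := strings.FieldsFunc("omani.works, openova.dev",
		func(r rune) bool { return r == ',' || r == ' ' || r == '\t' || r == '\n' })
	fmt.Println(allowed("_acme-challenge.omantel.omani.works.", apexes)) // true: under an apex
	fmt.Println(allowed("victim-domain.com", apexes))                    // false: rejected before any API call
}
```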

Zone safety

The shared core/pkg/dynadot-client/ enforces the safety contract documented in memory/feedback_dynadot_dns.md: every mutation either uses the append path (add_dns_to_current_setting=yes) or performs a read-modify-write via domain_info → set_dns2. The destructive zone-wipe variant of set_dns2 is unexported. The webhook's Present path uses AddRecord (append); CleanUp uses RemoveSubRecord (read-modify-write that match-deletes a single record).

Smoke test

Once both charts are reconciled on a Sovereign:

# Verify the webhook is running and the APIService is healthy
kubectl get -n cert-manager deploy/release-name-bp-cert-manager-dynadot-webhook
kubectl get apiservices.apiregistration.k8s.io v1alpha1.acme.dynadot.openova.io

# Issue a wildcard cert against the Sovereign apex
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-omantel-omani-works
  namespace: cilium-gateway
spec:
  secretName: wildcard-omantel-omani-works-tls
  issuerRef:
    name: letsencrypt-dns01-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.omantel.omani.works"
EOF

# Watch the Order + Challenge progress
kubectl get certificate,order,challenge -A -w

See also

  • core/cmd/cert-manager-dynadot-webhook/ — binary source
  • core/pkg/dynadot-client/ — shared Dynadot HTTP client
  • platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml — paired ClusterIssuer
  • openova#159 — closing issue
  • cert-manager DNS-01 webhook docs