* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

  PR #546 (Closes #542) introduced a dependency cycle:

  - `hcloud_server.control_plane.user_data` → `local.control_plane_cloud_init`
  - `local.control_plane_cloud_init` → `hcloud_server.control_plane[0].ipv4_address`

  `tofu plan` failed with:

  ```
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
  ```

  Caught live during the otech23 first end-to-end provisioning attempt.

  Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:

  ```
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
  ```

  Same observable behavior as #546 (kubeconfig `server:` rewritten to the CP public IP, not the LB IP — preserves the wizard-Jobs-page-not-stuck-PENDING fix), with no graph cycle.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

  The OpenTofu cloud-init template references `${handover_jwt_public_key}` (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:

  - the main.tf `templatefile()` call did not pass the key → "vars map does not contain key handover_jwt_public_key" on `tofu plan`
  - `provisioner.writeTfvars` never set the var → empty even when wired

  Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. `tofu plan` failed with:

  ```
  Error: Invalid function argument

    on main.tf line 170, in locals:
   170:   control_plane_cloud_init = replace(templatefile(...

  Invalid value for "vars" parameter: vars map does not contain key
  "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
  ```

  Fix:

  - main.tf `templatefile()` now passes `handover_jwt_public_key = var.handover_jwt_public_key`
  - `provisioner.Request` gains a `HandoverJWTPublicKey` field (`json:"-"`, server-stamped, never accepted from client JSON)
  - `handler.CreateDeployment` stamps it from `h.handoverSigner.PublicJWK()` when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - `writeTfvars` emits the value into tofu.auto.tfvars.json

  The variables.tf default `""` preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (the handover flow is simply not wired on that Sovereign — degraded gracefully, not a hard failure).

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

  The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 `{"error":"unauthenticated"}` before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.

  Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauthenticated. catalyst-api never received the kubeconfig, the Phase 1 helmwatch never started, and the wizard's Jobs page stayed in PENDING forever.

  Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie.

  Caught live during the otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

  PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared `var.harbor_robot_token` in infra/hetzner/variables.tf with a default of `""`. The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:

  ```
  Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...
  ```

  cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there is nothing to watch. Caught live during otech24.

  Wiring (mirrors the GHCRPullToken pattern):

  1. `Provisioner.HarborRobotToken` — read from the CATALYST_HARBOR_ROBOT_TOKEN env at `New()`.
  2. Stamped onto every Request in `Provision()` and `Destroy()` before `writeTfvars`.
  3. `Request.HarborRobotToken` — server-stamped (`json:"-"`); never accepted from the wizard payload.
  4. `writeTfvars` emits `"harbor_robot_token"` into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, with optional=true so degraded paths still come up.

  The variables.tf default `""` preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

  PR #638 added Validate() rejection for a missing harbor_robot_token, but the handler only stamped `req.HarborRobotToken` from `p.HarborRobotToken` inside `Provision()` — and Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned

  ```
  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)
  ```

  even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt.

  Fix: apply the same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. The provisioner-level stamp in Provision() stays as defense-in-depth.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

  PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub, hoping containerd would build URLs like

  ```
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6
  ```

  But Harbor proxy-cache projects expose their API at

  ```
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  ```

  (the project name sits between `/v2/` and the image path, not as a path prefix ahead of `/v2/`). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25.

  Fix: switch to the k3s registries.yaml `rewrite` syntax. The endpoint is the bare Harbor host; a per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed.

  Verified manually:

  ```
  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json
  ```

  This unblocks every Sovereign image pull through the central Harbor.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

  The cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...`, so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — a doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26.

  Fix: drop the redundant prefix. The subchart's default `.image.registry` keeps it pointing at registry.k8s.io, which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via the registries.yaml rewrite (#640). Bumps the bp-vpa chart version to 1.0.1 and the bootstrap-kit reference to match.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

  CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule.

  CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

  - Replace the forbidden `:latest` tag with the current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4.
  - Add a default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in the cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27.
  - Bumps chart version 1.1.1 -> 1.1.2 and the bootstrap-kit reference.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
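The registries.yaml rewrite fix above can be sketched as follows. This is a minimal illustration of the k3s `rewrite` shape, not the shipped cloudinit-control-plane.tftpl template; the proxy-project names come from the fixes above, while the robot-account name is hypothetical:

```yaml
# Sketch only. The endpoint is the bare Harbor host; the per-mirror rewrite
# prefixes the image path with the Harbor project so containerd's final URL
# is /v2/<project>/<repo>/..., the shape Harbor proxy-cache projects expect.
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"
    rewrite:
      "^(.*)$": "proxy-dockerhub/$1"
  registry.k8s.io:
    endpoint:
      - "https://harbor.openova.io"
    rewrite:
      "^(.*)$": "proxy-k8s/$1"
configs:
  "harbor.openova.io":
    auth:
      username: "robot$sovereign-pull"    # hypothetical robot account name
      password: "<harbor_robot_token>"    # wired in from tofu.auto.tfvars.json
```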
# bp-cert-manager-dynadot-webhook
Catalyst Blueprint for the cert-manager DNS-01 external webhook for Dynadot. Closes openova#159.
## What it is
A Go binary that satisfies cert-manager's external webhook contract
(webhook.acme.cert-manager.io/v1alpha1 — Present / CleanUp on a
ChallengeRequest) and writes ACME challenge TXT records to a
Dynadot-managed pool domain via the api3.json endpoint.
The binary lives at core/cmd/cert-manager-dynadot-webhook/. The
HTTP transport, command builders, and zone-safety contract live in
core/pkg/dynadot-client/ and are shared with the other Catalyst
services that talk to Dynadot (pool-domain-manager, catalyst-dns).
## Why this exists separately from external-dns-dynadot-webhook
cert-manager's webhook contract and external-dns's webhook contract are
DIFFERENT protocols. external-dns expects a sidecar that implements
records.list / records.add / records.delete over an HTTP RPC schema;
cert-manager expects an aggregated Kubernetes apiserver that responds to
ChallengeRequest CRs. The two binaries cannot share code at the
transport layer. They DO share the underlying Dynadot HTTP client at
core/pkg/dynadot-client/.
## What this chart deploys
| Resource | Purpose |
|---|---|
| Deployment | Runs the webhook binary as a non-root pod in the chart's release namespace. |
| Service | ClusterIP fronting the Deployment on port 443. |
| APIService | Registers v1alpha1.acme.dynadot.openova.io so the kube-apiserver routes ChallengeRequest calls to the Service. |
| Issuer (selfsigned) | Bootstraps the CA chain that issues the webhook's serving cert. |
| Issuer (CA) | Signs the leaf serving cert from the CA Secret. |
| Certificate (CA) | Root CA cert used by the APIService's cert-manager.io/inject-ca-from annotation. |
| Certificate (serving) | Leaf cert mounted into the Deployment at /tls. |
| ServiceAccount | Identity for the Deployment. |
| ClusterRoleBinding (auth-delegator) | Lets the aggregated apiserver delegate auth back to kube-apiserver. |
| RoleBinding (auth-reader) | Reads extension-apiserver-authentication ConfigMap from kube-system. |
| Role + RoleBinding (dynadot secret) | Grants the SA read access to the Dynadot credentials Secret in the configured namespace. |
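The central piece of the table above is the APIService registration. A sketch of what it looks like follows; the APIService name and group match the table, but the Service name, namespace, and the `inject-ca-from` target are illustrative, not the chart's exact rendered output:

```yaml
# Sketch only — resource names other than the APIService name/group are
# assumptions. cert-manager's CA injector fills spec.caBundle from the
# Certificate referenced by the inject-ca-from annotation.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.acme.dynadot.openova.io
  annotations:
    cert-manager.io/inject-ca-from: cert-manager/bp-cert-manager-dynadot-webhook-ca
spec:
  group: acme.dynadot.openova.io
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    name: bp-cert-manager-dynadot-webhook   # the chart's ClusterIP Service
    namespace: cert-manager
    port: 443
```

With this in place, the kube-apiserver proxies ChallengeRequest calls for the `acme.dynadot.openova.io` group to the webhook's Service over the injected CA chain.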
## Pairing with bp-cert-manager
bp-cert-manager's letsencrypt-dns01-prod ClusterIssuer points at this
webhook via solvers[].dns01.webhook.groupName + solverName. The two
charts MUST be deployed on the same Sovereign and bp-cert-manager-dynadot-webhook
MUST be Ready before any wildcard Certificate is requested.
The bp-cert-manager chart now ships with dns01.enabled: true by
default (changed in this PR — was false while the webhook was being
built). The interim letsencrypt-http01-prod issuer remains templated
as the rollback path; flip certManager.issuers.dns01.enabled=false in
the umbrella values to disable wildcard issuance and continue with
per-host certs.
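The pairing above can be sketched as a ClusterIssuer solver stanza. The issuer name and groupName follow from this README; `solverName` is an assumption and should be checked against the solver name the webhook binary actually registers:

```yaml
# Sketch of the dns01 webhook solver wiring; solverName is hypothetical.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns01-prod-account-key
    solvers:
      - dns01:
          webhook:
            groupName: acme.dynadot.openova.io
            solverName: dynadot   # assumption — verify against the binary
```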
## Credentials

The webhook reads three values from a Kubernetes Secret in its release namespace:

| Env var | Default secret key |
|---|---|
| `DYNADOT_API_KEY` | `api-key` |
| `DYNADOT_API_SECRET` | `api-secret` |
| `DYNADOT_MANAGED_DOMAINS` | `domains` (legacy fallback: `domain`) |
The canonical secret (dynadot-api-credentials in openova-system) is
shared with pool-domain-manager and catalyst-dns. Because Pod
secretKeyRef cannot cross namespaces, the cluster overlay MUST
replicate the secret into the webhook's release namespace via
ExternalSecret (preferred) or reflector annotations. See
clusters/_template/dynadot-credentials-replication.yaml.
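For the reflector route (the ExternalSecret route is preferred per the text above), a sketch of the annotated canonical Secret might look like this. The annotation keys are emberstack Reflector's; the namespace list and key values are illustrative:

```yaml
# Sketch only — replicate dynadot-api-credentials into the webhook's
# release namespace (assumed here to be cert-manager) via Reflector.
apiVersion: v1
kind: Secret
metadata:
  name: dynadot-api-credentials
  namespace: openova-system
  annotations:
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: "cert-manager"
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
    reflector.v1.k8s.emberstack.com/reflection-auto-namespaces: "cert-manager"
type: Opaque
stringData:
  api-key: "<dynadot api key>"
  api-secret: "<dynadot api secret>"
  domains: "omani.works"
```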
## Domain allowlist
DYNADOT_MANAGED_DOMAINS is a comma- or whitespace-separated allowlist
of pool domains the webhook is permitted to mutate. ChallengeRequests
for domains NOT under any allowlisted apex are rejected before any
Dynadot API call is made. This is the same defence pattern
pool-domain-manager and catalyst-dns use; it prevents a misconfigured
ClusterIssuer from causing the webhook to write to a third-party domain.
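The allowlist check described above can be sketched in Go. This is an illustration of the logic, not the webhook's actual internals; function names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// parseManagedDomains splits a DYNADOT_MANAGED_DOMAINS-style value on commas
// and whitespace, normalizing each apex to lowercase without a trailing dot.
// Hypothetical helper, not the real binary's parser.
func parseManagedDomains(raw string) []string {
	fields := strings.FieldsFunc(raw, func(r rune) bool {
		return r == ',' || r == ' ' || r == '\t' || r == '\n'
	})
	out := make([]string, 0, len(fields))
	for _, f := range fields {
		if f != "" {
			out = append(out, strings.ToLower(strings.TrimSuffix(f, ".")))
		}
	}
	return out
}

// allowed reports whether fqdn equals, or sits under, one of the allowlisted
// apexes. The "."+apex suffix check prevents "omani.works.evil.com"-style
// spoofing: only true subdomains of an apex pass.
func allowed(fqdn string, apexes []string) bool {
	name := strings.ToLower(strings.TrimSuffix(fqdn, "."))
	for _, apex := range apexes {
		if name == apex || strings.HasSuffix(name, "."+apex) {
			return true
		}
	}
	return false
}

func main() {
	apexes := parseManagedDomains("omani.works, openova.cloud")
	fmt.Println(allowed("_acme-challenge.omantel.omani.works.", apexes)) // true
	fmt.Println(allowed("_acme-challenge.victim.example.com.", apexes)) // false
}
```

A ChallengeRequest failing this check would be rejected before any Dynadot API call, which is the defence the section describes.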
## Zone safety
The shared core/pkg/dynadot-client/ enforces the safety contract
documented in memory/feedback_dynadot_dns.md: every mutation either
uses the append path (add_dns_to_current_setting=yes) or performs a
read-modify-write via domain_info → set_dns2. The destructive
zone-wipe variant of set_dns2 is unexported. The webhook's Present
path uses AddRecord (append); CleanUp uses RemoveSubRecord
(read-modify-write that match-deletes a single record).
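The read-modify-write CleanUp path can be sketched as follows. This is an illustration of the match-delete step only, with a toy record type; it is not the real core/pkg/dynadot-client/ API, and the names are hypothetical:

```go
package main

import "fmt"

// Record is a toy stand-in for a Dynadot DNS record.
type Record struct {
	Subdomain string // e.g. "_acme-challenge"
	Type      string // e.g. "txt"
	Value     string
}

// removeMatching returns the zone with exactly one (sub, typ, val) record
// dropped. Every other record is preserved, so writing the result back via
// the non-destructive set_dns2 path never wipes the zone.
func removeMatching(zone []Record, sub, typ, val string) []Record {
	out := make([]Record, 0, len(zone))
	removed := false
	for _, r := range zone {
		if !removed && r.Subdomain == sub && r.Type == typ && r.Value == val {
			removed = true // match-delete a single record
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	// Simulated CleanUp: the client would first re-read the zone
	// (domain_info), drop the challenge record, then POST the remainder.
	zone := []Record{
		{"www", "cname", "lb.example.net"},
		{"_acme-challenge", "txt", "token-abc"},
	}
	after := removeMatching(zone, "_acme-challenge", "txt", "token-abc")
	fmt.Println(len(after)) // 1
}
```

Keeping the destructive zone-wipe variant of set_dns2 unexported means callers can only reach the zone through append or this kind of filtered rewrite.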
## Smoke test

Once both charts are reconciled on a Sovereign:

```shell
# Verify the webhook is running and the APIService is healthy
kubectl get -n cert-manager deploy/release-name-bp-cert-manager-dynadot-webhook
kubectl get apiservices.apiregistration.k8s.io v1alpha1.acme.dynadot.openova.io

# Issue a wildcard cert against the Sovereign apex
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-omantel-omani-works
  namespace: cilium-gateway
spec:
  secretName: wildcard-omantel-omani-works-tls
  issuerRef:
    name: letsencrypt-dns01-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.omantel.omani.works"
EOF

# Watch the Order + Challenge progress
kubectl get certificate,order,challenge -A -w
```
## See also

- `core/cmd/cert-manager-dynadot-webhook/` — binary source
- `core/pkg/dynadot-client/` — shared Dynadot HTTP client
- `platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml` — paired ClusterIssuer
- openova#159 — closing issue
- cert-manager DNS-01 webhook docs