openova/docs/MULTI-REGION-DNS.md
hatiyildiz 04559e5c37 docs(reconcile-pass-1): align docs with ground truth at dd578d1c
2026-04-29 09:40:10 +02:00


Multi-Region DNS — health-checked failover with PowerDNS lua-records

Status: Authoritative. Updated: 2026-04-29 (Reconcile Pass 1).

This document is the canonical reference for how Catalyst routes traffic across regions. Geographic redundancy in OpenOva is realized at the authoritative DNS layer, not at the K8s controller layer. PowerDNS lua-records (ifurlup, ifportup, pickclosest, pickrandom, pickwhashed) provide everything Catalyst needs:

  • Geo-aware response selection — answer the closest healthy backend for the resolver's source IP / ECS subnet.
  • Health-checked failover — drop a backend from the response set when a TCP/HTTP probe fails, restore it when the probe recovers.
  • Latency-aware routing — combine ifurlup (health) with pickclosest (geo) for active-active steering.
  • Same operational layer Catalyst already runs — PowerDNS is bp-powerdns, deployed by the bootstrap kit on every Sovereign's mgt cluster. No separate operator, no extra CRDs, no extra reconciliation loop.

This subsumes the role previously assigned to k8gb. The k8gb component has been removed from componentGroups.ts, the umbrella chart, and the wizard; lua-records cover every failover scenario k8gb covered without the dedicated GSLB controller.


1. Why PowerDNS lua-records (and why not k8gb)

| Concern | k8gb (removed) | PowerDNS lua-records (current) |
|---|---|---|
| Authoritative DNS | CoreDNS plugin, separate zone | PowerDNS authoritative: same zones used for external-dns, ACME, etc. |
| Operator footprint | k8gb controller + CRDs (Gslb, GslbHttpRoute) + per-cluster CoreDNS pod set | None: declarative LUA records in the existing PowerDNS zone |
| Health-check primitive | k8gb-managed liveness probes | PowerDNS ifurlup / ifportup (HTTP / TCP probes from PowerDNS pods) |
| Geo selection | EdgeDNS witness + custom logic | pickclosest (geo by source IP), pickrandom (RR), pickwhashed (sticky weighted) |
| DNSSEC | Layered on top, separate signer | Native: PowerDNS signs the lua-record's computed answer with the zone's KSK/ZSK |
| Operational surface | k8gb pods + CoreDNS pods + custom CRDs | Existing PowerDNS deployment + dnsdist rate-limit shield |
| Cluster-coordination | Required (gslb endpoints sync between clusters) | Not required: authoritative DNS is the source of truth |

The architectural cost difference is large enough that the deletion is the right move per INVIOLABLE-PRINCIPLES.md #2 ("never compromise from quality — pick the unified primitive, not the dual-shape design") and #4 ("never hardcode — health probes, weights, geo policy are configuration in the lua-record body, not code in a controller").


2. Failover patterns (the lua-record cookbook)

Every Catalyst Sovereign zone is hosted on PowerDNS. The records below sit alongside ordinary A/AAAA/CNAME records that external-dns writes via the PowerDNS REST API. Lua-record syntax follows the upstream PowerDNS documentation.

Note on examples. Backend IPv4 addresses (5.161.42.18, 95.217.189.42) and the FQDN primary.example.com below are placeholders — they illustrate the lua-record shape only. The canonical 6-record set per Sovereign zone is written by pool-domain-manager (PDM, core/pool-domain-manager/) on /v1/commit; lua-records (geo / health-check policy) are written by the catalyst-dns controller (Catalyst control-plane sidecar) from each Application's Placement spec — see docs/PLATFORM-POWERDNS.md §"In-cluster consumers".

2.1 Active-active across two regions, health-checked

foo.acme.com.  IN  LUA  A "ifurlup('https://primary.example.com/healthz', {'5.161.42.18', '95.217.189.42'}, {selector='all'})"
  • Each PowerDNS pod HTTP-probes https://primary.example.com/healthz against every listed backend IP, every 5s by default (the lua-health-checks-interval setting; configurable per record via the interval option).
  • selector='all' returns every healthy backend; the client then picks one (typical behaviour: rotate, retry on failure).
  • When a backend fails the probe for the configured number of consecutive checks (the minimumFailures option in recent PowerDNS releases; confirm the default for the deployed version), it is removed from the answer set. Resolvers pick up the change once their cached answer's TTL expires.
  • When the probe recovers, the backend is restored automatically.
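
The drop-and-restore behaviour above can be modelled in a few lines. This is a minimal sketch, not PowerDNS's implementation: ProbeState, answer_all, and the min_failures parameter are hypothetical names, and the "answer with everything when nothing is up" fallback is an assumption standing in for whatever backup selection the deployed PowerDNS applies.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeState:
    """Tracks consecutive probe failures per backend (illustrative model only)."""
    min_failures: int = 1  # assumption: failures in a row before a backend is dropped
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, backend: str, ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self.failures[backend] = 0 if ok else self.failures.get(backend, 0) + 1

    def is_up(self, backend: str) -> bool:
        return self.failures.get(backend, 0) < self.min_failures


def answer_all(backends: list[str], state: ProbeState) -> list[str]:
    """Model of selector='all': every backend currently considered up."""
    healthy = [b for b in backends if state.is_up(b)]
    # Assumption: if nothing is up, answer with everything rather than nothing.
    return healthy or list(backends)
```

Feeding three failed probes for one backend into a ProbeState(min_failures=3) removes it from answer_all; a single successful probe puts it back.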

2.2 Geo-aware active-active (pickclosest)

api.acme.com.  IN  LUA  A "pickclosest({'5.161.42.18', '95.217.189.42'})"
  • PowerDNS uses ECS (EDNS Client Subnet) when present, falling back to the resolver's source IP.
  • The closer regional LB by GeoIP wins.
  • Combine with ifurlup for health-aware closeness:
api.acme.com.  IN  LUA  A "ifurlup('https://primary.example.com/healthz', {{'5.161.42.18', '95.217.189.42'}}, {selector='pickclosest'})"

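As a rough illustration of what "closest healthy backend" means in 2.2, the sketch below picks the backend with the smallest great-circle distance to the client. It is a model only: the coordinates attached to the placeholder IPs are invented for the example, and a real deployment resolves both the client and backend locations from a GeoIP database inside PowerDNS.

```python
import math

# Hypothetical (lat, lon) for the placeholder backends; assumed for illustration.
BACKEND_LOCATIONS = {
    "5.161.42.18": (39.04, -77.49),   # assumed: US east coast
    "95.217.189.42": (60.17, 24.94),  # assumed: Helsinki
}


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_closest(client_location: tuple[float, float],
                 healthy_backends: list[str]) -> str:
    """Return the healthy backend nearest to the client location."""
    return min(healthy_backends,
               key=lambda ip: haversine_km(client_location, BACKEND_LOCATIONS[ip]))
```

Passing only the backends that currently pass the health probe gives the health-aware variant: an unhealthy but nearby backend never enters the comparison.
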
2.3 Active-passive (primary → DR)

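api.acme.com.  IN  LUA  A "ifurlup('https://primary.example.com/healthz', {{'5.161.42.18'}, {'95.217.189.42'}})"
  • ifurlup accepts a list of address sets in priority order and answers from the first set that has a healthy member. The upstream selector option (all, random, hashed, pickclosest) only chooses within that set, so ordered failover is expressed through the nesting rather than through a selector.
  • When 5.161.42.18 (primary) is healthy → answer is 5.161.42.18.
  • When primary fails the probe → answer flips to 95.217.189.42 (DR) within one TTL window.
  • When primary recovers → answer flips back to primary on the next probe success.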
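
The primary-then-DR behaviour can be sketched as a walk over backends grouped in priority order. The function name and the is_up callback are hypothetical, and the fallback to the primary set when everything is down is an assumption made for the example.

```python
from typing import Callable


def pick_first_available(address_sets: list[list[str]],
                         is_up: Callable[[str], bool]) -> list[str]:
    """Return the healthy members of the first set that has any."""
    for addresses in address_sets:
        healthy = [ip for ip in addresses if is_up(ip)]
        if healthy:
            return healthy
    # Assumption: with every backend down, keep answering with the primary set.
    return list(address_sets[0])
```

With sets [['5.161.42.18'], ['95.217.189.42']], the answer is the primary while it is up, the DR address while the primary is down, and the primary again as soon as it recovers.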

2.4 TCP-only / non-HTTP services (ifportup)

For services that don't expose an HTTP /healthz (e.g. SMTP, IMAP, custom TCP):

mail.acme.com.  IN  LUA  A "ifportup(587, {'5.161.42.18', '95.217.189.42'})"
  • PowerDNS attempts a TCP connect to port 587 on each backend.
  • Connect-fail → drop from the response set; connect-success → include.
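
A TCP connect probe of this kind is easy to reproduce when debugging from a workstation. The sketch below is an assumed equivalent written with the Python standard library, not PowerDNS code; port_is_up and healthy_backends are hypothetical helper names and the 2-second timeout is an assumption.

```python
import socket


def port_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def healthy_backends(port: int, backends: list[str], timeout: float = 2.0) -> list[str]:
    """Keep only the backends that accept a TCP connection on the port."""
    return [ip for ip in backends if port_is_up(ip, port, timeout)]
```

For the record above, healthy_backends(587, ['5.161.42.18', '95.217.189.42']) would return the subset that a connect-based probe considers up from wherever the script runs.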

2.5 Weighted sticky split (pickwhashed)

For canary releases or traffic-shifting:

api.acme.com.  IN  LUA  A "pickwhashed({{80, '5.161.42.18'}, {20, '95.217.189.42'}})"
  • 80% of distinct client IPs are pinned to 5.161.42.18, 20% to 95.217.189.42 (consistent hash on source IP — the same client gets the same answer until the weight changes).
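
The pinning property follows from hashing the client address onto a weighted range. The sketch below shows the idea with SHA-256 as a stand-in hash; PowerDNS uses its own hash function, so the specific client-to-backend assignments here will not match production.

```python
import hashlib


def pick_weighted_hashed(client_ip: str, weighted: list[tuple[int, str]]) -> str:
    """Map a client IP to a backend, deterministically and in proportion to weights."""
    total = sum(weight for weight, _ in weighted)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(client_ip.encode()).digest()
    # Project the hash onto [0, total) and find the bucket it falls into.
    point = int.from_bytes(digest[:8], "big") % total
    for weight, backend in weighted:
        if point < weight:
            return backend
        point -= weight
    raise ValueError("weights must be non-negative")
```

The same client IP always lands in the same bucket until the weights change, which is what makes the split usable for canaries: a given user does not bounce between old and new versions on every lookup.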

3. Catalyst integration points

3.1 Where lua-records are written

Lua-records are part of each Sovereign's PowerDNS zone, alongside the canonical 6-record set (PLATFORM-POWERDNS.md §"Per-Sovereign zone model"). The 6-record set is written once at provisioning by pool-domain-manager (PDM /v1/commit); ongoing A/AAAA/CNAME records are written by external-dns; LUA records are written by the catalyst-dns controller (sidecar to the Catalyst control plane on the mgt cluster):

PDM         ──► PowerDNS REST API ──► canonical 6-record set (one-shot at provision)
external-dns ──► PowerDNS REST API ──► A/AAAA/CNAME records (per-region LB IPs)
catalyst-dns ──► PowerDNS REST API ──► LUA records (geo / health-check policy)

This separation matters: external-dns knows about a single K8s Service or Ingress; it has no concept of multi-region health policy. The catalyst-dns controller reads the Application's Placement field from the per-Org Gitea repo, sees placement: active-active (or active-hotstandby, etc.), and synthesizes the corresponding lua-record body.
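
For orientation, a LUA record written through the PowerDNS REST API is an ordinary RRset of type LUA whose content is the answer type followed by the quoted Lua body. The sketch below only builds the JSON payload; it assumes catalyst-dns sends it as a PATCH to the zone endpoint, and the function name and the 60-second TTL are illustrative assumptions rather than the controller's actual code.

```python
import json


def lua_rrset_patch(fqdn: str, lua_body: str, rtype: str = "A", ttl: int = 60) -> dict:
    """Build a zone PATCH payload that replaces the LUA RRset for fqdn."""
    name = fqdn if fqdn.endswith(".") else fqdn + "."
    return {
        "rrsets": [{
            "name": name,
            "type": "LUA",
            "ttl": ttl,  # assumed value; short TTLs keep failover fast
            "changetype": "REPLACE",
            "records": [{"content": f'{rtype} "{lua_body}"', "disabled": False}],
        }]
    }


if __name__ == "__main__":
    body = "pickclosest({'5.161.42.18', '95.217.189.42'})"
    print(json.dumps(lua_rrset_patch("api.acme.com", body), indent=2))
```

Using REPLACE makes the write idempotent: re-emitting the same Placement produces the same RRset instead of accumulating records.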

3.2 Application Placement → lua-record selector mapping

| Application Placement | lua-record idiom |
|---|---|
| single-region | Plain A record(s); no lua-record needed |
| active-active | ifurlup(..., {selector='all'}) (or selector='pickclosest' for geo-affinity) |
| active-hotstandby | ifurlup(..., {{primary}, {dr}}): ordered address sets, primary first, DR second |
| active-passive-warm | ifurlup(..., {{primary}, {dr}}) + longer TTL (manual operator promotion is the contract; the LUA only flips when the probe fails enough times) |
| weighted-canary | pickwhashed({{w1, ip1}, {w2, ip2}}): adjust weights via Catalyst console (re-emits the lua-record body with new weights) |
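
The mapping can be read as a small rendering function from Placement to record body. The sketch below is a hypothetical illustration, not the catalyst-dns source: the function name and arguments are invented, and it renders ordered failover with ifurlup's nested address sets, the form the upstream PowerDNS documentation describes for primary/backup groups.

```python
from typing import Optional, Sequence


def lua_record_body(placement: str, probe_url: str, backends: Sequence[str],
                    weights: Optional[Sequence[int]] = None) -> Optional[str]:
    """Render the lua-record body for a Placement; None means plain A records."""
    quoted = ", ".join(f"'{ip}'" for ip in backends)
    if placement == "single-region":
        return None
    if placement == "active-active":
        return f"ifurlup('{probe_url}', {{{quoted}}}, {{selector='all'}})"
    if placement in ("active-hotstandby", "active-passive-warm"):
        # One address set per backend, in priority order (primary first).
        sets = ", ".join(f"{{'{ip}'}}" for ip in backends)
        return f"ifurlup('{probe_url}', {{{sets}}})"
    if placement == "weighted-canary":
        if weights is None or len(weights) != len(backends):
            raise ValueError("weighted-canary needs one weight per backend")
        pairs = ", ".join(f"{{{w}, '{ip}'}}" for w, ip in zip(weights, backends))
        return f"pickwhashed({{{pairs}}})"
    raise ValueError(f"unknown placement: {placement}")
```

Keeping the rendering pure (inputs in, string out) means a weight change in the console is just a second call with new weights, followed by a rewrite of the record.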

3.3 Probe target

Every Catalyst Application Blueprint MUST expose /healthz on its public endpoint. The catalyst-dns controller defaults to https://<app-fqdn>/healthz as the probe target, configurable per-Application via spec.healthCheck.path in the Blueprint instance.
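
The default-plus-override rule can be sketched as follows. The helper name is hypothetical; only the https://<app-fqdn>/healthz default and the spec.healthCheck.path override come from the text above.

```python
from urllib.parse import urlunsplit


def probe_url(app_fqdn: str, health_path: str = "/healthz") -> str:
    """Build the HTTPS probe target for an Application."""
    if not health_path.startswith("/"):
        health_path = "/" + health_path  # tolerate 'ready' as well as '/ready'
    return urlunsplit(("https", app_fqdn, health_path, "", ""))
```

For example, probe_url('api.acme.com') yields the default target, while probe_url('api.acme.com', '/ready') reflects an Application that overrides the path.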

DNS pods are inside the Sovereign — they probe outbound to the regional LB IPs over the public internet (or via the Cilium Cluster Mesh + WireGuard back-channel for cross-region private probes). The probe direction is intentional: DNS pods are the source of truth on whether a regional LB is reachable from the same place the public internet would reach it.

3.4 Split-brain protection (failover-controller)

Lua-records are necessary but not sufficient for split-brain protection during a network partition. The failover-controller layers a lease-based witness on top:

  • During healthy operation, each regional cluster renews a lease in a cloud witness (Cloudflare KV or similar — out of band from the Sovereign's own infra).
  • The PowerDNS lua-record probes are the primary failover signal (sub-minute response).
  • The lease becomes the tie-breaker for stateful promotion (OpenBao DR, CNPG primary promotion) — only the cluster holding a valid lease is allowed to take over write authority.
  • See SRE.md §2.4 for the witness protocol; this doc covers only the DNS-routing half.
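
The tie-breaker role of the lease can be illustrated with an in-memory model. This is a deliberately simplified sketch: it assumes a single write-authority lease, a 30-second TTL, and hypothetical Witness/Lease classes, whereas the real protocol (SRE.md §2.4) runs against an external witness store.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class Witness:
    """In-memory stand-in for the out-of-band lease store."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.lease: Optional[Lease] = None

    def renew(self, cluster: str) -> bool:
        """Acquire or extend the lease; fails while another holder's lease is valid."""
        now = self.clock()
        free = self.lease is None or self.lease.expires_at <= now
        if free or self.lease.holder == cluster:
            self.lease = Lease(cluster, now + self.ttl)
            return True
        return False

    def may_promote(self, cluster: str) -> bool:
        """Only the holder of a still-valid lease may take write authority."""
        lease = self.lease
        return lease is not None and lease.holder == cluster and lease.expires_at > self.clock()
```

A partitioned cluster that cannot reach the witness stops renewing, its lease lapses, and only then can the other side acquire the lease and promote; DNS may already have steered traffic away well before that point.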

4. When to add a second Sovereign region (the HA upgrade path)

A single-region Sovereign is the SME default (PLATFORM-TECH-STACK.md §9.2). For corporate / regulated tier (and for any Sovereign that signs an SLA strict enough that single-region downtime would breach it), the upgrade path is:

  1. Sovereign provisioned in Region A (e.g. hz-fsn-rtz-prod) — single LB IP, plain A records.
  2. Operator decides to add Region B via the Catalyst admin UI: Admin → Infrastructure → Add Region (see SOVEREIGN-PROVISIONING.md §8).
  3. Crossplane provisions Region B's clusters (rtz + dmz) with the same building blocks as Region A.
  4. Region B's PowerDNS replicas join the Sovereign's authoritative NS set via SOA NOTIFY + AXFR (PowerDNS-native zone replication; no external sync layer needed).
  5. catalyst-dns rewrites every Application's lua-record from single-region → active-active (or whichever Placement the Application opts into). Old plain A records are replaced with ifurlup(...) lua-records pointing at both regional LBs.
  6. The cloud witness (failover-controller) starts arbitrating leases across the two clusters.

The cluster name never changes during this upgrade — Region A's cluster is still hz-fsn-rtz-prod, Region B is now hz-hel-rtz-prod, and neither is "primary" or "DR". This is the explicit design from NAMING-CONVENTION.md §1.3 — failover is a routing event, not a renaming event.

4.1 Triggers for adding a second region

| Trigger | Recommendation |
|---|---|
| SLA target ≥ 99.95% uptime | Mandatory second region: single-region cannot meet this |
| Compliance requirement (DORA, NIS2, GDPR data residency split) | Mandatory: typically one region per data-residency boundary |
| Application's Placement set to active-active / active-hotstandby / active-passive-warm | Mandatory: these placements require ≥ 2 regions to honour |
| Latency-sensitive global traffic (regional users far from Region A) | Strongly recommended: pickclosest lua-records cut median RTT |
| Cost-sensitive single-tenant Sovereign on a low-tier SLA | Defer: pay for it when a workload demands it |
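
The trigger table reduces to a short decision function. The sketch below is one possible encoding, with hypothetical parameter names; it is not part of the Catalyst admin UI.

```python
MULTI_REGION_PLACEMENTS = {"active-active", "active-hotstandby", "active-passive-warm"}


def second_region_recommendation(sla_target_pct: float,
                                 placements: list[str],
                                 residency_split: bool = False,
                                 latency_sensitive: bool = False) -> str:
    """Return 'mandatory', 'strongly recommended', or 'defer'."""
    needs_multi_region = bool(MULTI_REGION_PLACEMENTS & set(placements))
    if sla_target_pct >= 99.95 or residency_split or needs_multi_region:
        return "mandatory"
    if latency_sensitive:
        return "strongly recommended"
    return "defer"
```

The mandatory conditions are checked first, so a latency-sensitive workload with a 99.95% SLA is reported as mandatory, not merely recommended.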

5. Operational checks

5.1 Verify a lua-record is healthy

dig +short api.acme.com @ns1.openova.io
# Expected: an A record from the healthy regional LB set.
dig +short api.acme.com @ns1.openova.io \
  +subnet=80.81.82.0/24
# Expected: with an EU client subnet, pickclosest returns the EU regional LB.
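
The same checks can be scripted for repeated runs, for example from a smoke test. The sketch below wraps the dig invocations shown above; it assumes dig is installed on the machine running it, and the function names are hypothetical.

```python
import subprocess
from typing import Optional


def dig_command(name: str, server: str, client_subnet: Optional[str] = None) -> list[str]:
    """Build the dig argv for a short-form query, optionally with an ECS subnet."""
    cmd = ["dig", "+short", name, f"@{server}"]
    if client_subnet:
        cmd.append(f"+subnet={client_subnet}")
    return cmd


def parse_short_answer(output: str) -> list[str]:
    """Split `dig +short` output into one answer per non-empty line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def resolve(name: str, server: str, client_subnet: Optional[str] = None) -> list[str]:
    """Run dig and return the answers; raises if dig exits non-zero or times out."""
    result = subprocess.run(dig_command(name, server, client_subnet),
                            capture_output=True, text=True, check=True, timeout=10)
    return parse_short_answer(result.stdout)
```

A check such as `'5.161.42.18' not in resolve('api.acme.com', 'ns1.openova.io')` then expresses "the failed backend has left the answer set" as a single assertion.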

5.2 Force a probe-failure simulation (chaos-engineering)

The Litmus chaos suite includes a scenario that black-holes a regional LB's probe target. After ~1 TTL window:

dig +short api.acme.com @ns1.openova.io
# Expected: the affected backend IP is absent from the response.

When the probe target is restored, the IP returns automatically — no operator action.

5.3 Read PowerDNS probe state

kubectl exec -n openova-system deploy/powerdns -- pdns_control bind-list-record api.acme.com

PowerDNS exposes the current probe status (last probe timestamp, last result, current selection set) — useful when investigating "why is the answer set what it is?" during an incident.




Part of OpenOva Catalyst. Read Inviolable Principles before any changes.