Failover Controller
Multi-region failover orchestration. Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3.6). Pairs with PowerDNS lua-records, which route traffic at the authoritative DNS layer (see docs/MULTI-REGION-DNS.md). The failover-controller adds lease-based split-brain protection via a cloud witness, so the lua-record probe layer cannot silently mis-failover during a network partition. See docs/SRE.md §2.4 for the witness pattern and docs/SECURITY.md §5.2 for OpenBao DR promotion semantics.
Status: Accepted | Updated: 2026-04-27
Overview
The Failover Controller orchestrates cross-region failover with split-brain protection, ensuring safe promotion of DR regions during outages.
Architecture
flowchart TB
subgraph Region1["Region 1"]
FC1[Failover Controller]
Health1[Health Checks]
end
subgraph Region2["Region 2"]
FC2[Failover Controller]
Health2[Health Checks]
end
subgraph External["External Witnesses"]
CF[Cloudflare Workers]
Witnesses[DNS Witnesses]
end
FC1 <-->|"Lease"| CF
FC2 <-->|"Lease"| CF
FC1 --> Health1
FC2 --> Health2
FC1 --> Witnesses
FC2 --> Witnesses
Split-Brain Protection
External DNS Witnesses
The Failover Controller queries external DNS witnesses before promotion:
| Resolver | Provider |
|---|---|
| 8.8.8.8 | Google |
| 1.1.1.1 | Cloudflare |
| 9.9.9.9 | Quad9 |
Quorum: 2/3 must agree the other region is unreachable before promotion.
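The quorum rule can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the controller's actual code: `hasPromotionQuorum` and the result shape (`resolver`, `answered`, `peerReachable`) are hypothetical names.

```javascript
// Hypothetical sketch: decide whether enough witnesses agree that the
// peer region is unreachable. Function name and result shape are assumptions.
function hasPromotionQuorum(witnessResults, quorum) {
  // Count only witnesses that answered and reported the peer as unreachable.
  const unreachableVotes = witnessResults.filter(
    (r) => r.answered && !r.peerReachable
  ).length;
  return unreachableVotes >= quorum;
}

const results = [
  { resolver: '8.8.8.8', answered: true, peerReachable: false },
  { resolver: '1.1.1.1', answered: true, peerReachable: false },
  { resolver: '9.9.9.9', answered: true, peerReachable: true },
];
console.log(hasPromotionQuorum(results, 2)); // true
```

In this sketch a witness that did not answer casts no vote. The intent is that a region which has itself lost connectivity, and therefore cannot reach any witness, never reaches quorum and never promotes itself.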
Cloudflare Workers Lease
A Cloudflare Worker provides distributed lease management:
sequenceDiagram
participant R1 as Region 1
participant CF as Cloudflare Worker
participant R2 as Region 2
R1->>CF: Acquire lease
CF->>R1: Lease granted (TTL: 30s)
R1->>CF: Heartbeat (every 10s)
Note over R1: Region 1 fails
R2->>CF: Check lease
CF->>R2: Lease expired
R2->>CF: Acquire lease
CF->>R2: Lease granted
R2->>R2: Promote to primary
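The timing in the sequence above can be modelled in memory with a fake clock. This is a sketch only: the 30s TTL and 10s heartbeat come from the diagram, while `LeaseStore`, `acquire`, and the treatment of a heartbeat as a re-acquire by the current holder are assumptions made for illustration.

```javascript
// Hypothetical in-memory model of the lease exchange.
// TTL (30s) is taken from the diagram; class and method names are assumptions.
const LEASE_TTL_MS = 30_000;

class LeaseStore {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.lease = null; // { holder, expiresAt }
  }

  // Grant the lease if it is free, expired, or already held by the caller
  // (a heartbeat is modelled as a re-acquire by the current holder).
  acquire(region) {
    const t = this.now();
    const free = !this.lease || t >= this.lease.expiresAt;
    if (free || this.lease.holder === region) {
      this.lease = { holder: region, expiresAt: t + LEASE_TTL_MS };
      return { granted: true };
    }
    return { granted: false, holder: this.lease.holder };
  }
}

// Walk through the sequence with a fake clock (milliseconds).
let clock = 0;
const store = new LeaseStore(() => clock);
store.acquire('region1');                 // granted at t=0
clock = 10_000;
store.acquire('region1');                 // heartbeat, lease now expires at t=40s
clock = 20_000;
const denied = store.acquire('region2');  // region1 still holds the lease
clock = 40_000;                           // region1 stopped heartbeating
const granted = store.acquire('region2'); // lease expired, region2 takes over
console.log(denied.granted, granted.granted); // false true
```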
Failover Flow
flowchart TB
Start[Detect Primary Failure]
Check[Check External Witnesses]
Quorum{Quorum<br/>Reached?}
Lease[Acquire Lease]
Promote[Promote DR]
Update[Update DNS]
End[Failover Complete]
Start --> Check
Check --> Quorum
Quorum -->|"Yes"| Lease
Quorum -->|"No"| Start
Lease --> Promote
Promote --> Update
Update --> End
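The ordering of the flow above can be expressed as one function with injected steps. This is a hedged sketch: `runFailover` and the step names are hypothetical, and the `lease-denied` outcome is an assumption, since the diagram does not show what happens when the lease is refused.

```javascript
// Hypothetical orchestration of the failover flow. Every step is injected,
// so the function only encodes the ordering and the quorum gate.
async function runFailover(steps) {
  const { checkWitnesses, acquireLease, promoteDr, updateDns } = steps;
  if (!(await checkWitnesses())) {
    return 'no-quorum'; // caller returns to failure detection
  }
  if (!(await acquireLease())) {
    return 'lease-denied'; // assumption: another region holds the lease
  }
  await promoteDr();
  await updateDns();
  return 'failover-complete';
}

// Example run with stubbed steps that record the call order.
const calls = [];
const outcome = runFailover({
  checkWitnesses: async () => { calls.push('check'); return true; },
  acquireLease: async () => { calls.push('lease'); return true; },
  promoteDr: async () => { calls.push('promote'); },
  updateDns: async () => { calls.push('dns'); },
});
outcome.then((result) => console.log(result, calls.join(' > ')));
// prints: failover-complete check > lease > promote > dns
```

Promotion and the DNS update only run after both gates pass, so a lost quorum or a refused lease leaves the DR region untouched.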
Configuration
Failover Controller
apiVersion: apps/v1
kind: Deployment
metadata:
  name: failover-controller
  namespace: platform-services
spec:
  replicas: 1
  # apps/v1 requires a selector matching the pod template labels;
  # the label key and value here are illustrative.
  selector:
    matchLabels:
      app: failover-controller
  template:
    metadata:
      labels:
        app: failover-controller
    spec:
      containers:
        - name: controller
          image: openova/failover-controller:v1.0.0
          env:
            - name: REGION
              value: region1
            - name: PEER_REGION
              value: region2
            - name: CLOUDFLARE_WORKER_URL
              valueFrom:
                secretKeyRef:
                  name: failover-config
                  key: worker-url
            - name: DNS_WITNESSES
              value: "8.8.8.8,1.1.1.1,9.9.9.9"
            - name: QUORUM
              value: "2"
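As an illustration of how the environment variables above might be consumed, the following sketch parses and validates them. `loadConfig` and the returned field names are hypothetical; the controller's real parsing logic is not shown in this document.

```javascript
// Hypothetical parser for the env vars set in the Deployment.
function loadConfig(env) {
  // Split the comma-separated witness list and drop empty entries.
  const witnesses = (env.DNS_WITNESSES || '')
    .split(',')
    .map((s) => s.trim())
    .filter(Boolean);
  const quorum = Number.parseInt(env.QUORUM, 10);
  // A quorum larger than the witness count could never be reached.
  if (!Number.isInteger(quorum) || quorum < 1 || quorum > witnesses.length) {
    throw new Error(`QUORUM must be between 1 and ${witnesses.length}`);
  }
  return { region: env.REGION, peerRegion: env.PEER_REGION, witnesses, quorum };
}

const config = loadConfig({
  REGION: 'region1',
  PEER_REGION: 'region2',
  DNS_WITNESSES: '8.8.8.8,1.1.1.1,9.9.9.9',
  QUORUM: '2',
});
console.log(config.witnesses.length, config.quorum); // 3 2
```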
Cloudflare Worker
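// Simplified lease management
const LEASE_TTL_MS = 30_000; // 30s lease TTL, as in the sequence diagram

// A lease is expired once its stored expiry timestamp has passed.
function isExpired(lease) {
  return Date.now() >= lease.expiresAt;
}

export default {
  async fetch(request, env) {
    const { pathname } = new URL(request.url);
    if (pathname === '/acquire') {
      const region = request.headers.get('X-Region');
      if (!region) {
        return new Response('Missing X-Region header', { status: 400 });
      }
      const current = await env.LEASE.get('primary', { type: 'json' });
      if (!current || isExpired(current)) {
        const lease = { holder: region, expiresAt: Date.now() + LEASE_TTL_MS };
        // Workers KV rejects expirationTtl below 60s, so the 30s lease is
        // enforced through expiresAt and KV expiry only cleans up stale keys.
        await env.LEASE.put('primary', JSON.stringify(lease), { expirationTtl: 60 });
        return new Response(JSON.stringify({ granted: true }));
      }
      return new Response(JSON.stringify({ granted: false, holder: current.holder }));
    }
    // Any other path is not part of the lease API.
    return new Response('Not found', { status: 404 });
  }
};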
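A controller-side call to the `/acquire` endpoint might look like the sketch below. `acquireLease`, the injectable `fetchImpl` parameter, and the `https://lease.example.invalid` URL are hypothetical; only the path and the `X-Region` header come from the Worker code above.

```javascript
// Hypothetical client for the /acquire endpoint.
// fetchImpl is injectable so the call can be exercised without a network.
async function acquireLease(workerUrl, region, fetchImpl = globalThis.fetch) {
  const url = new URL('/acquire', workerUrl);
  const res = await fetchImpl(url.toString(), {
    headers: { 'X-Region': region },
  });
  return res.json(); // { granted, holder? }
}

// Example with a stubbed fetch that records the request and denies the lease.
const seen = [];
const fakeFetch = async (url, init) => {
  seen.push({ url, region: init.headers['X-Region'] });
  return { json: async () => ({ granted: false, holder: 'region1' }) };
};
const pending = acquireLease('https://lease.example.invalid', 'region2', fakeFetch);
pending.then((lease) => console.log(lease.granted, lease.holder)); // false region1
```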
Monitoring
| Metric | Description |
|---|---|
| failover_lease_status | Current lease holder |
| failover_witness_reachable | DNS witness reachability |
| failover_last_failover_time | Last failover timestamp |
Part of OpenOva