openova/platform/failover-controller
hatiyildiz f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00

Failover Controller

Multi-region failover orchestration. Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3.6). Pairs with PowerDNS lua-records, which route traffic at the authoritative DNS layer (see docs/MULTI-REGION-DNS.md). The failover-controller adds lease-based split-brain protection through a cloud witness, so the lua-record probe layer cannot silently mis-failover during a network partition. See docs/SRE.md §2.4 for the witness pattern and docs/SECURITY.md §5.2 for OpenBao DR promotion semantics.

Status: Accepted | Updated: 2026-04-27


Overview

The Failover Controller orchestrates cross-region failover with split-brain protection, ensuring safe promotion of DR regions during outages.


Architecture

flowchart TB
    subgraph Region1["Region 1"]
        FC1[Failover Controller]
        Health1[Health Checks]
    end

    subgraph Region2["Region 2"]
        FC2[Failover Controller]
        Health2[Health Checks]
    end

    subgraph External["External Witnesses"]
        CF[Cloudflare Workers]
        Witnesses[DNS Witnesses]
    end

    FC1 <-->|"Lease"| CF
    FC2 <-->|"Lease"| CF
    FC1 --> Health1
    FC2 --> Health2
    FC1 --> Witnesses
    FC2 --> Witnesses

Split-Brain Protection

External DNS Witnesses

The Failover Controller queries three external DNS witnesses before promotion:

| Resolver | Provider |
|----------|----------|
| 8.8.8.8 | Google |
| 1.1.1.1 | Cloudflare |
| 9.9.9.9 | Quad9 |

Quorum: at least 2 of the 3 witnesses must agree that the other region is unreachable before promotion proceeds.
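
The quorum rule can be sketched as a small pure function. This is a minimal illustration, not the controller's actual code: the `quorumReached` name and the shape of the `votes` map (resolver address mapped to `true` when that witness reports the peer region as unreachable) are assumptions made for this example.

```javascript
// Hypothetical helper: counts witness votes and compares the count with the quorum.
// votes maps each resolver IP to true when that witness sees the peer region as unreachable.
function quorumReached(votes, quorum) {
  const unreachable = Object.values(votes).filter((v) => v === true).length;
  return unreachable >= quorum;
}

const votes = { '8.8.8.8': true, '1.1.1.1': true, '9.9.9.9': false };
console.log(quorumReached(votes, 2)); // true: 2 of 3 witnesses agree
```

A single dissenting witness does not block promotion, but a single agreeing witness is never enough to trigger it.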

Cloudflare Workers Lease

A Cloudflare Worker provides distributed lease management:

sequenceDiagram
    participant R1 as Region 1
    participant CF as Cloudflare Worker
    participant R2 as Region 2

    R1->>CF: Acquire lease
    CF->>R1: Lease granted (TTL: 30s)
    R1->>CF: Heartbeat (every 10s)

    Note over R1: Region 1 fails

    R2->>CF: Check lease
    CF->>R2: Lease expired
    R2->>CF: Acquire lease
    CF->>R2: Lease granted
    R2->>R2: Promote to primary
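
The timing in this exchange can be modelled with an in-memory lease table. The sketch below is only an illustration of the semantics, assuming that a heartbeat is a repeated acquire by the current holder; `createLeaseTable` and its return shape are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow.

```javascript
// In-memory model of the lease exchange; all names here are illustrative.
const LEASE_TTL_MS = 30_000; // "TTL: 30s" in the diagram

function createLeaseTable() {
  let lease = null; // { holder, expiresAt } or null when no lease exists

  return {
    // Grants the lease when it is free, expired, or already held by the caller.
    // The last case models the heartbeat, which extends the expiry.
    acquire(region, now) {
      if (lease === null || now >= lease.expiresAt || lease.holder === region) {
        lease = { holder: region, expiresAt: now + LEASE_TTL_MS };
        return { granted: true };
      }
      return { granted: false, holder: lease.holder };
    },
  };
}

const table = createLeaseTable();
console.log(table.acquire('region1', 0));      // { granted: true }
console.log(table.acquire('region1', 10_000)); // heartbeat: { granted: true }
console.log(table.acquire('region2', 20_000)); // { granted: false, holder: 'region1' }
console.log(table.acquire('region2', 40_000)); // no heartbeat since 10s, lease expired: { granted: true }
```

With a 30s TTL and a 10s heartbeat, the holder can miss two consecutive heartbeats before the peer is able to take over.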

Failover Flow

flowchart TB
    Start[Detect Primary Failure]
    Check[Check External Witnesses]
    Quorum{Quorum<br/>Reached?}
    Lease[Acquire Lease]
    Promote[Promote DR]
    Update[Update DNS]
    End[Failover Complete]

    Start --> Check
    Check --> Quorum
    Quorum -->|"Yes"| Lease
    Quorum -->|"No"| Start
    Lease --> Promote
    Promote --> Update
    Update --> End
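
The flow can also be read as a small state machine. The following sketch encodes each node of the flowchart as a transition, assuming a hypothetical `quorumReached` flag as the only input; the step names and the `walkFailover` function are invented for this illustration and are not taken from the controller.

```javascript
// Each step of the flowchart maps to the step that follows it.
// Only the quorum decision branches; the context flag is a hypothetical input.
const TRANSITIONS = {
  detect: () => 'check',
  check: () => 'quorum',
  quorum: (ctx) => (ctx.quorumReached ? 'lease' : 'detect'),
  lease: () => 'promote',
  promote: () => 'updateDns',
  updateDns: () => 'complete',
};

// Walks the flow once from failure detection and returns the steps visited.
// Stops at 'complete', or when a failed quorum sends the flow back to 'detect'.
function walkFailover(ctx) {
  const visited = ['detect'];
  let step = 'detect';
  while (step !== 'complete') {
    step = TRANSITIONS[step](ctx);
    visited.push(step);
    if (step === 'detect') break; // back to the start: retry on the next detection cycle
  }
  return visited;
}

console.log(walkFailover({ quorumReached: true }).join(' -> '));
// detect -> check -> quorum -> lease -> promote -> updateDns -> complete
console.log(walkFailover({ quorumReached: false }).join(' -> '));
// detect -> check -> quorum -> detect
```

DNS is updated only after both the quorum and the lease steps have passed, so a region that cannot see the witnesses never promotes itself.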

Configuration

Failover Controller

apiVersion: apps/v1
kind: Deployment
metadata:
  name: failover-controller
  namespace: platform-services
spec:
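  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: failover-controller
  template:
    metadata:
      labels:
        app: failover-controller
    spec: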
      containers:
        - name: controller
          image: openova/failover-controller:v1.0.0
          env:
            - name: REGION
              value: region1
            - name: PEER_REGION
              value: region2
            - name: CLOUDFLARE_WORKER_URL
              valueFrom:
                secretKeyRef:
                  name: failover-config
                  key: worker-url
            - name: DNS_WITNESSES
              value: "8.8.8.8,1.1.1.1,9.9.9.9"
            - name: QUORUM
              value: "2"
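
How the controller reads these variables is not documented here. As a rough sketch, a loader could split `DNS_WITNESSES` on commas and check that `QUORUM` does not exceed the number of witnesses. The `parseConfig` function, its return shape, and the placeholder worker URL are assumptions made for this example.

```javascript
// Hypothetical config loader for the environment variables in the manifest above.
function parseConfig(env) {
  const witnesses = (env.DNS_WITNESSES || '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const quorum = Number.parseInt(env.QUORUM, 10);

  if (!env.REGION || !env.PEER_REGION) {
    throw new Error('REGION and PEER_REGION must be set');
  }
  // A quorum larger than the witness list could never be reached.
  if (!Number.isInteger(quorum) || quorum < 1 || quorum > witnesses.length) {
    throw new Error(`QUORUM must be between 1 and ${witnesses.length}`);
  }
  return {
    region: env.REGION,
    peerRegion: env.PEER_REGION,
    workerUrl: env.CLOUDFLARE_WORKER_URL,
    witnesses,
    quorum,
  };
}

const config = parseConfig({
  REGION: 'region1',
  PEER_REGION: 'region2',
  CLOUDFLARE_WORKER_URL: 'https://lease.example.invalid', // placeholder, not a real endpoint
  DNS_WITNESSES: '8.8.8.8,1.1.1.1,9.9.9.9',
  QUORUM: '2',
});
console.log(config.witnesses.length, config.quorum); // 3 2
```

Failing at startup on an unreachable quorum is preferable to discovering during an outage that promotion can never proceed.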

Cloudflare Worker

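// Simplified lease management. LEASE is a Workers KV namespace binding.
const LEASE_TTL_MS = 30_000; // lease lifetime, matching the 30s TTL in the sequence diagram

// A stored lease has expired once its expiresAt timestamp has passed.
function isExpired(lease) {
  return Date.now() >= lease.expiresAt;
}

export default {
  async fetch(request, env) {
    const { pathname } = new URL(request.url);

    if (pathname === '/acquire') {
      const region = request.headers.get('X-Region');
      if (!region) {
        return new Response('Missing X-Region header', { status: 400 });
      }
      const current = await env.LEASE.get('primary', { type: 'json' });

      // Grant when no lease exists, the lease has expired, or the holder is renewing (heartbeat).
      if (!current || isExpired(current) || current.region === region) {
        const lease = { region, expiresAt: Date.now() + LEASE_TTL_MS };
        // KV rejects expirationTtl values below 60 seconds, so the 30s lease is enforced via expiresAt.
        await env.LEASE.put('primary', JSON.stringify(lease), { expirationTtl: 60 });
        return new Response(JSON.stringify({ granted: true }));
      }

      return new Response(JSON.stringify({ granted: false, holder: current.region }));
    }

    // Any other path is not part of the lease API.
    return new Response('Not found', { status: 404 });
  }
};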
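
On the controller side, a request to this endpoint might look like the sketch below. It assumes only what the Worker above shows: an `/acquire` path, an `X-Region` header, and a JSON body with a `granted` field. The `requestLease` function and the injectable `fetchImpl` parameter are hypothetical and exist so the call can be exercised without a network.

```javascript
// Hypothetical client for the /acquire endpoint shown above.
// fetchImpl is injectable so the function can be exercised without a network.
async function requestLease(workerUrl, region, fetchImpl = fetch) {
  const response = await fetchImpl(new URL('/acquire', workerUrl), {
    headers: { 'X-Region': region },
  });
  if (!response.ok) {
    // Treat any HTTP error as "not granted": failing closed avoids an unsafe promotion.
    return { granted: false };
  }
  return response.json();
}

// Usage with a stub in place of the real Worker.
const stubFetch = async () => ({ ok: true, json: async () => ({ granted: true }) });
requestLease('https://lease.example.invalid', 'region2', stubFetch)
  .then((result) => console.log(result.granted)); // true
```

Returning `{ granted: false }` on errors means a region that cannot reach the witness stays passive, which is the safe default under a network partition.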

Monitoring

| Metric | Description |
|--------|-------------|
| failover_lease_status | Current lease holder |
| failover_witness_reachable | DNS witness reachability |
| failover_last_failover_time | Last failover timestamp |
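
For orientation, these metrics could be exposed in the Prometheus text exposition format roughly as follows. The metric names and descriptions come from the table; the label names (`region`, `witness`), the gauge type, and the 0/1 value encoding are assumptions for this sketch and may differ from what the controller actually emits.

```javascript
// Renders the three metrics in the Prometheus text exposition format.
// Label names and value encodings are assumptions for illustration.
function renderMetrics({ region, holdsLease, witnesses, lastFailoverUnix }) {
  const lines = [
    '# HELP failover_lease_status Current lease holder',
    '# TYPE failover_lease_status gauge',
    `failover_lease_status{region="${region}"} ${holdsLease ? 1 : 0}`,
    '# HELP failover_witness_reachable DNS witness reachability',
    '# TYPE failover_witness_reachable gauge',
  ];
  for (const [witness, reachable] of Object.entries(witnesses)) {
    lines.push(`failover_witness_reachable{witness="${witness}"} ${reachable ? 1 : 0}`);
  }
  lines.push(
    '# HELP failover_last_failover_time Last failover timestamp',
    '# TYPE failover_last_failover_time gauge',
    `failover_last_failover_time ${lastFailoverUnix}`,
  );
  return lines.join('\n') + '\n';
}

console.log(renderMetrics({
  region: 'region1',
  holdsLease: true,
  witnesses: { '8.8.8.8': true, '1.1.1.1': true, '9.9.9.9': false },
  lastFailoverUnix: 0, // 0 here stands for "no failover recorded yet"
}));
```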

Part of OpenOva