SRE Handbook
Status: Authoritative target playbook. Updated: 2026-04-27.
Implementation: Most automation described (alert webhooks, Runbook CRDs, failover-controller actions) is design-stage. See IMPLEMENTATION-STATUS.md. Existing Sovereign deployments may rely on simpler manual procedures until the automation lands.
Site Reliability Engineering practices for Catalyst Sovereigns. Defer to GLOSSARY.md for terminology, ARCHITECTURE.md for the model, SECURITY.md for credentials and identity.
1. Overview
This handbook covers running a Sovereign in production: multi-region topology, progressive delivery, auto-remediation, secret rotation, GDPR automation, air-gap considerations, SLOs, GPU operations, and incident response. Audience: sovereign-admin and SRE personas across SME-style and corporate-style Sovereigns.
2. Multi-region strategy
2.1 Architecture
Multi-region is strongly recommended for production-tier Sovereigns. Two or more independent host clusters across regions provide geographic redundancy with automatic failover.
Clusters are named by building block (functional security zone), not by failover role — there is no "primary" or "DR" designation. Both clusters run the same building blocks symmetrically; PowerDNS lua-records (ifurlup, pickclosest) handle traffic distribution authoritatively at the DNS layer. After a failover event, the surviving cluster serves all traffic — its name does not change. See MULTI-REGION-DNS.md for the lua-record patterns.
See NAMING-CONVENTION.md §1.3.
flowchart TB
subgraph RegionA["Region A (e.g. hz-fsn-rtz-prod)"]
K8s1[Restricted Trust Zone Cluster]
Stack1[Per-Org vclusters + workloads]
end
subgraph RegionB["Region B (e.g. hz-hel-rtz-prod)"]
K8s2[Restricted Trust Zone Cluster]
Stack2[Per-Org vclusters + workloads]
end
subgraph GSLB["Global Load Balancing"]
pdns["PowerDNS Authoritative<br>+ lua-records (ifurlup / pickclosest)"]
Witness["Cloud witness<br>(lease for split-brain protection)"]
end
K8s1 <-->|"WireGuard"| K8s2
K8s1 --> pdns
K8s2 --> pdns
pdns -.-> Witness
2.2 Key principles
- Each cluster survives independently during network partition — no shared control plane.
- No stretched clusters (avoids split-brain). This applies to OpenBao, JetStream, etcd, and any other quorum-based component.
- Both clusters are peers — neither is designated primary or DR.
- Async data replication (eventual consistency).
- PowerDNS as authoritative DNS for the GSLB zone — ifurlup/ifportup lua-records remove unhealthy endpoints from the response set automatically. See MULTI-REGION-DNS.md and the sketch after this list.
- Cloud witness (lease) for split-brain protection — see §2.4.
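A minimal sketch of the health-checked record, assuming a hypothetical GSLB hostname (app.example-sovereign.eu), placeholder region IPs and a /healthz probe URL; MULTI-REGION-DNS.md remains the canonical reference for the full pattern set:

```sh
# Hypothetical example: zone, hostname, health URL and IPs are placeholders.
# ifurlup() answers only with addresses whose health URL responds; pickclosest
# prefers the address nearest the resolver while both regions are healthy.
pdnsutil add-record example-sovereign.eu app LUA \
  "A \"ifurlup('https://app.example-sovereign.eu/healthz', {{'203.0.113.10'}, {'198.51.100.20'}}, {selector='pickclosest', backupSelector='random'})\""

# Verify which backend(s) the authoritative server currently answers with:
dig +short app.example-sovereign.eu @ns1.example-sovereign.eu
```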
2.3 Cross-region networking
| Option | Use case |
|---|---|
| WireGuard mesh | Different providers, secure overlay |
| Native peering | Same provider (lower latency, e.g. Hetzner vSwitch, OCI FastConnect) |
2.4 Split-brain protection
Failover Controller uses a cloud witness for lease-based authority:
| Component | Role |
|---|---|
| Cloud witness | Holds the lease — typically Cloudflare KV (cheap, globally distributed) or another out-of-Sovereign storage |
| Failover Controller | Per-cluster controller managing readiness and gating endpoints |
Witness pattern:
- Active region holds a lease (renews every 10s, TTL 30s).
- Standby regions cannot become active while a valid lease exists.
- Network partition: both regions reach the witness → active keeps renewing → no split-brain.
- Witness unreachable: failover controller falls back to multi-resolver DNS quorum (8.8.8.8 + 1.1.1.1 + 9.9.9.9, 2-of-3).
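A minimal sketch of that quorum check in shell, assuming a hypothetical GSLB hostname; the failover controller performs the equivalent test programmatically:

```sh
# Hypothetical fallback check; the hostname is a placeholder.
# The region only treats itself as internet-reachable if at least 2 of the
# 3 public resolvers return an answer for the GSLB name.
ok=0
for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
  dig +short +time=2 +tries=1 @"$resolver" app.example-sovereign.eu A | grep -q . && ok=$((ok+1))
done
if [ "$ok" -ge 2 ]; then
  echo "quorum reached ($ok/3): eligible to hold or take the active role"
else
  echo "quorum lost ($ok/3): do not promote"
fi
```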
Three layers controlled by the failover controller:
| Layer | Mechanism |
|---|---|
| External traffic (Gateway API → PowerDNS lua-record backend pool) | HTTPRoute readiness toggling — failure flips the lua probe target |
| Internal traffic (Cilium Cluster Mesh) | Service endpoint manipulation |
| Stateful services (CNPG, FerretDB, Strimzi) | Database promotion signaling |
Modes: automatic | semi-automatic | manual (regulated tier).
2.5 Data replication patterns
These apply to stateful components — Application Blueprints (data services) and per-host-cluster infrastructure with state (SeaweedFS, Harbor). The Catalyst control plane's own state-bearing components use different patterns: see SECURITY.md for OpenBao and ARCHITECTURE.md for NATS JetStream.
| Component | Layer | Replication method | RPO |
|---|---|---|---|
| CNPG (PostgreSQL) | Application Blueprint | WAL streaming to async standby | Near-zero |
| FerretDB | Application Blueprint | Via CNPG WAL streaming | Near-zero |
| Strimzi/Kafka | Application Blueprint | MirrorMaker2 | Seconds |
| Valkey | Application Blueprint | REPLICAOF | Seconds |
| ClickHouse | Application Blueprint | ReplicatedMergeTree | Seconds |
| OpenSearch | Application Blueprint | Cross-cluster replication | Seconds |
| Milvus | Application Blueprint | Collection sync | Minutes |
| Neo4j | Application Blueprint | Causal cluster replication | Seconds |
| SeaweedFS | Per-host-cluster infra | Cross-region async bucket replication + cold-tier passthrough to shared cloud archive | Minutes |
| Harbor | Per-host-cluster infra | Registry replication | Minutes |
| Gitea | Catalyst control plane | Intra-cluster HA replicas + CNPG primary-replica (NOT cross-region mirror — see platform/gitea/README.md §"Multi-Region Strategy"). DR for Gitea is via mgt-cluster recovery, not bidirectional sync. | Seconds (intra-cluster only) |
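As an illustration of the CNPG WAL-streaming row, a hedged sketch of a Region B standby cluster replicating from Region A (cluster names, namespace, host and secret names are hypothetical; the real Application Blueprints define their own):

```yaml
# Hypothetical CNPG standby cluster for Region B; all names are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-regionb
  namespace: app-prod
spec:
  instances: 2
  bootstrap:
    pg_basebackup:
      source: app-db-regiona        # initial copy taken from the Region A primary
  replica:
    enabled: true                   # keep following Region A via WAL streaming
    source: app-db-regiona          # must match an externalClusters entry
  externalClusters:
    - name: app-db-regiona
      connectionParameters:
        host: app-db-regiona-rw.app-prod.svc   # reachable over the WireGuard mesh
        user: streaming_replica
        sslmode: verify-full
      sslKey:
        name: app-db-regiona-replication
        key: tls.key
      sslCert:
        name: app-db-regiona-replication
        key: tls.crt
      sslRootCert:
        name: app-db-regiona-ca
        key: ca.crt
```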
3. Progressive delivery
3.1 Canary deployments
Flagger is the planned canary controller (currently a "components to watch" addition, see TECHNOLOGY-FORECAST-2027-2030.md):
- Flux-native integration
- Automatic rollback on metric degradation (latency, error rate)
- No ArgoCD dependency
When added, Flagger is intended to run as per-host-cluster infrastructure on each rtz cluster, with per-Application canary configuration in the Application Blueprint. Status: design — not yet a deployed Blueprint.
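When it does land, per-Application canary configuration would look roughly like the sketch below (fields follow the upstream Flagger Canary CRD; the target Deployment, namespace and thresholds are hypothetical, and nothing here is deployed today):

```yaml
# Hypothetical canary policy; Flagger is design-stage.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app-api
  namespace: app-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-api
  service:
    port: 8080
  analysis:
    interval: 1m          # metric check cadence
    threshold: 5          # failed checks before automatic rollback
    maxWeight: 50         # max % of traffic shifted to the canary
    stepWeight: 10        # traffic increment per successful interval
    metrics:
      - name: request-success-rate   # Flagger built-in metric
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration       # Flagger built-in metric (ms)
        thresholdRange:
          max: 500
        interval: 1m
```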
3.2 Feature flags
Flipt is the planned feature-flag service (also "components to watch"):
- Self-hosted, zero-cost
- Simple SDK integration (Go, TypeScript, Python)
- Gradual rollout control
Status: design — not yet a deployed Blueprint.
4. Auto-remediation
4.1 Architecture
Gitea Actions triggered by Alertmanager webhooks for automated incident response.
flowchart LR
Alert[Alert Fires] --> AM[Alertmanager]
AM --> GA[Gitea Actions]
GA --> Remediate[Auto-Remediate]
Remediate --> Verify[Verify Fix]
Verify -->|Success| Resolve[Resolve Alert]
Verify -->|Failure| Log[Log for Analysis]
4.2 Alert-to-action mapping
Catalyst control plane
| Alert | Auto-action | Verification |
|---|---|---|
| HighMemoryUsage | Scale up deployment | Check memory |
| PodCrashLoopBackOff | Restart pod | Check pod status |
| HighErrorRate | Trigger rollback | Check error rate |
| DatabaseConnectionExhausted | Restart PgBouncer | Check connections |
| CertificateExpiringSoon | Trigger renewal | Check expiry |
| HighLatency | Scale service | Check latency |
| GslbEndpointDown | Check PowerDNS lua-record probe state (pdnsutil get-meta) | dig the GSLB host and verify the unhealthy backend IP is absent from the response |
| OpenBaoSealed | Auto-unseal via SPIRE-attested unseal keys | Check unseal status |
| JetstreamLagHigh | Add JetStream consumer replica | Check consumer lag |
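As a concrete illustration of one row (PodCrashLoopBackOff → restart → verify), a hedged sketch of a Gitea Actions remediation workflow. The repository layout, runner label, kubeconfig mount and input names are hypothetical, and the dispatch payload depends on how the Alertmanager webhook is bridged to the Actions API:

```yaml
# Hypothetical .gitea/workflows/remediate-crashloop.yaml; illustrative only.
name: remediate-pod-crashloop
on:
  workflow_dispatch:
    inputs:
      namespace:
        required: true
      deployment:
        required: true
jobs:
  restart-and-verify:
    runs-on: ubuntu-latest          # act_runner label, adjust to the Sovereign's runners
    env:
      KUBECONFIG: /run/secrets/remediation-kubeconfig   # hypothetical scoped credential
    steps:
      - name: Restart the crashlooping deployment
        run: |
          kubectl -n "${{ inputs.namespace }}" rollout restart \
            deployment/"${{ inputs.deployment }}"
      - name: Verify the rollout (the verification step from the table)
        run: |
          kubectl -n "${{ inputs.namespace }}" rollout status \
            deployment/"${{ inputs.deployment }}" --timeout=300s
```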
AI Hub (when bp-cortex installed)
| Alert | Auto-action | Verification |
|---|---|---|
| VLLMHighLatency | Scale vLLM replicas | Check inference latency |
| VLLMOOMKilled | Reduce batch size | Check memory |
| GPUUtilizationLow | Scale down GPU pods | Check utilization |
| GPUMemoryExhausted | Evict low-priority jobs | Check GPU memory |
| MilvusQuerySlow | Rebuild index | Check query latency |
| EmbeddingQueueBacklog | Scale BGE replicas | Check queue depth |
| RAGRetrievalEmpty | Alert + log | Check retrieval quality |
| LLMGatewayQuotaExhausted | Notify user | Check quota |
Open Banking (when bp-fingate installed)
| Alert | Auto-action | Verification |
|---|---|---|
| KeycloakHighLatency | Scale Keycloak | Check auth latency |
| QuotaServiceDown | Failover to backup | Check quota service |
| BillingWebhookFailed | Retry with backoff | Check webhook status |
| TPPCertExpiring | Alert sovereign-admin team | Check certificate |
4.3 Budget control
| Threshold | Action |
|---|---|
| 80% of budget | Warning log |
| 100% of budget | Block scale-up |
5. Secret rotation
Detailed in SECURITY.md §7. Summary:
| Class | Default frequency | Method |
|---|---|---|
| Workload identity (SPIRE SVID) | 5 minutes | Auto |
| Database credentials (dynamic) | 1 hour | OpenBao DB engine + sidecar |
| API tokens, OAuth client secrets | 90 days | OpenBao + ESO + Reloader |
| Signing keys | 365 days | Manual approval (security-officer) |
| TLS certificates | Managed by cert-manager (typically 90-day certificates, auto-renewed) | Auto (Let's Encrypt or corporate CA) |
| User passwords (Keycloak) | User-managed + MFA | Realm policy enforces min age |
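For the 90-day class, the OpenBao → ESO → Reloader chain can be pictured with a hedged sketch like the following (SecretStore name, OpenBao KV path and target Secret are hypothetical; SECURITY.md §7 is authoritative):

```yaml
# Hypothetical ExternalSecret: ESO re-reads OpenBao on the refresh interval
# and rewrites the Kubernetes Secret when the upstream value rotates.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-api-oauth-client
  namespace: app-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao-app-prod          # hypothetical SecretStore backed by OpenBao
    kind: SecretStore
  target:
    name: app-api-oauth-client
  data:
    - secretKey: client_secret
      remoteRef:
        key: kv/app-prod/oauth      # hypothetical OpenBao KV path
        property: client_secret
```

The consuming Deployment would carry the reloader.stakater.com/auto: "true" annotation so Reloader rolls its pods whenever the Secret object changes.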
6. GDPR automation
| Process | Schedule | Notes |
|---|---|---|
| Data subject requests | Daily 02:00 | Application Blueprints expose DSAR endpoints |
| Data retention | Weekly Sunday 03:00 | Per-Application policy |
| Audit log cleanup | Monthly | Retains per regulatory requirement |
| Vector embedding purge | On data deletion request | If using bp-cortex |
| Chat history cleanup | Per retention policy | If using bp-relay |
GDPR automation is a Catalyst-level controller (gdpr-controller) that orchestrates Application-specific deletion via the App's exposed compliance API.
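A hedged sketch of how the weekly retention schedule could be materialised, assuming the gdpr-controller renders a CronJob that calls the Application's compliance API (image, endpoint and namespace are illustrative; both the controller and the API are design-stage):

```yaml
# Hypothetical retention job; nothing below exists yet.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gdpr-retention-app
  namespace: catalyst-system
spec:
  schedule: "0 3 * * 0"             # weekly, Sunday 03:00, matching the table above
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: retention
              image: registry.example/catalyst/gdpr-runner:latest   # hypothetical image
              command:
                - curl
                - -fsS
                - -X
                - POST
                - http://app-api.app-prod.svc/compliance/retention/apply   # hypothetical endpoint
```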
7. Air-gap compliance
For regulated industries requiring air-gapped deployments:
flowchart LR
subgraph Connected["Connected Zone"]
Pull[Pull Images / OCI Blueprints]
end
subgraph DMZ["DMZ Transfer Zone"]
Scan[Trivy + cosign verify]
Stage[Staging Area]
end
subgraph AirGap["Air-Gapped Sovereign"]
Harbor[Harbor Registry]
Git["Gitea<br>(Blueprint mirror)"]
Flux[Flux per vcluster]
K8s[Per-host clusters]
end
Pull --> Scan
Scan --> Stage
Stage -->|Physical / Diode| Harbor
Stage -->|Physical / Diode| Git
7.1 Prerequisites
All Catalyst control-plane components support air-gap:
- Harbor — local registry with replication
- SeaweedFS — local S3 + cold-tier encapsulation (cold backend can be air-gap-internal if no cloud archive is reachable)
- Flux — reconciles from local Git
- Velero — backups to local SeaweedFS
- Grafana stack — self-contained observability
- OpenBao + Keycloak — fully self-hosted; no external dependencies
7.2 AI Hub air-gap considerations (when bp-cortex installed)
| Component | Air-gap requirement |
|---|---|
| vLLM | Pre-download model weights to SeaweedFS |
| BGE-M3 | Pre-download embedding models |
| Milvus | No external dependencies |
| Neo4j | No external dependencies |
| NeMo Guardrails | No external dependencies |
| LangFuse | No external dependencies |
7.3 Content transfer
| Content type | Air-gap destination |
|---|---|
| Container images | Harbor |
| Helm charts | Harbor (OCI) |
| Blueprint OCI manifests | Harbor → blueprint-controller registers |
| Git repositories | Self-hosted Gitea |
| OS packages | Local mirror |
| LLM model weights | SeaweedFS |
| Embedding models | SeaweedFS |
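A hedged sketch of the connected-zone and air-gapped-side steps for a single container image, assuming skopeo handles the copies and using placeholder image, key and registry names:

```sh
# Connected zone: scan and verify before anything crosses the gap.
trivy image --severity HIGH,CRITICAL --exit-code 1 ghcr.io/example/app:1.4.2
cosign verify --key /keys/vendor.pub ghcr.io/example/app:1.4.2
skopeo copy docker://ghcr.io/example/app:1.4.2 oci-archive:/staging/app-1.4.2.tar

# Transfer /staging across by physical media or data diode.

# Air-gapped side: load the archive into Harbor (registry host is a placeholder).
skopeo copy oci-archive:/staging/app-1.4.2.tar \
  docker://harbor.internal.example/library/app:1.4.2
```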
8. Catalyst observability
8.1 Self-monitoring
The Catalyst control plane runs its own Grafana stack (Alloy + Loki + Mimir + Tempo + Grafana) in the catalyst-grafana namespace on the management cluster. Every Catalyst component emits OpenTelemetry traces, metrics, and logs.
8.2 Per-Sovereign dashboards
| Dashboard | Purpose |
|---|---|
| Sovereign Health | All Catalyst components, control-plane SLOs |
| Per-Org Footprint | Resource usage by Organization |
| Per-Environment | Application states, error budgets |
| Promotion Activity | EnvironmentPolicy-driven promotions, soak times, approver latency |
| Secret Rotation | All credentials, age, next rotation |
| Audit | Commits, RBAC events, SecretPolicy actions |
8.3 Per-Organization dashboards
Each Organization sees their own slice:
| Dashboard | Purpose |
|---|---|
| Apps Overview | All Applications across Environments |
| Budget | Cost projection, billing |
| Compliance Posture | Per-control evidence |
| Incidents | Open issues, MTTR |
9. SLOs
9.1 Catalyst control plane
| SLI | Target | Alert threshold |
|---|---|---|
| Console availability | 99.9% | <99.5% for 5m |
| API p95 latency | <500ms | >1s for 5m |
| Catalog query p95 | <200ms | >500ms for 5m |
| projector SSE delivery | <2s end-to-end | >5s for 5m |
| Gitea webhook → reconcile | <30s | >2m for 5m |
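A hedged sketch of the first row expressed as an alerting rule in Prometheus rule format (loadable by the Mimir ruler); the metric name, labels and namespace are hypothetical stand-ins for whatever the console actually exports:

```yaml
# Hypothetical availability alert for the Console SLO row above.
groups:
  - name: catalyst-console-slo
    rules:
      - alert: ConsoleAvailabilityBelowSLO
        expr: |
          sum(rate(http_requests_total{namespace="catalyst-system",service="console",code!~"5.."}[5m]))
            /
          sum(rate(http_requests_total{namespace="catalyst-system",service="console"}[5m]))
          < 0.995
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Console availability below 99.5% for 5 minutes"
```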
9.2 AI Hub (bp-cortex)
| SLI | Target | Alert threshold |
|---|---|---|
| LLM Inference Latency (p95) | <5s | >10s for 5m |
| LLM Token Throughput | >50 tok/s | <20 tok/s for 5m |
| Embedding Latency (p95) | <100ms | >500ms for 5m |
| RAG Retrieval Latency (p95) | <500ms | >2s for 5m |
| GPU Utilization | >60% | <30% for 15m |
| Vector Search Latency (p95) | <50ms | >200ms for 5m |
9.3 Open Banking (bp-fingate)
| SLI | Target | Alert threshold |
|---|---|---|
| Auth Latency (p95) | <200ms | >500ms for 5m |
| API Availability | 99.95% | <99.5% for 5m |
| Consent Flow Success | >99% | <95% for 5m |
9.4 Data & Integration (bp-fabric)
| SLI | Target | Alert threshold |
|---|---|---|
| Kafka Produce Latency (p95) | <50ms | >200ms for 5m |
| Flink Checkpoint Duration | <30s | >60s for 5m |
| Temporal Workflow Latency (p95) | <1s | >5s for 5m |
| CDC Lag (Debezium) | <10s | >60s for 5m |
| ClickHouse Query Latency (p95) | <500ms | >2s for 5m |
9.5 Communication (bp-relay)
| SLI | Target | Alert threshold |
|---|---|---|
| Email Delivery Rate | >99.5% | <98% for 15m |
| LiveKit Call Setup (p95) | <2s | >5s for 5m |
| Matrix Message Delivery (p95) | <500ms | >2s for 5m |
| TURN Relay Success Rate | >99% | <95% for 5m |
10. GPU operations
10.1 GPU node management
nodeSelector:
node.kubernetes.io/gpu: "true"
nvidia.com/gpu.product: "NVIDIA-A10"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
10.2 GPU monitoring metrics
| Metric | Query | Purpose |
|---|---|---|
| GPU utilization | DCGM_FI_DEV_GPU_UTIL | Compute usage |
| GPU memory used | DCGM_FI_DEV_FB_USED | Memory pressure |
| GPU temperature | DCGM_FI_DEV_GPU_TEMP | Thermal throttling |
| GPU power | DCGM_FI_DEV_POWER_USAGE | Power consumption |
| SM clock | DCGM_FI_DEV_SM_CLOCK | Clock throttling |
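These metrics feed the GPU SLOs in §9.2 and the GPUUtilizationLow auto-action in §4.2. A quick spot-check that they are being exported (service name and namespace are placeholders for wherever the NVIDIA monitoring stack is installed):

```sh
# Forward the DCGM exporter port locally and confirm the key series exist.
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP)'
```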
10.3 vLLM operations
# Within an Org's vcluster, scoped to bp-cortex Application namespace
kubectl exec -n cortex deploy/vllm -- curl localhost:8000/health
kubectl exec -n cortex deploy/vllm -- curl localhost:8000/v1/models
10.4 KServe operations
kubectl get inferenceservices -n cortex
kubectl get inferenceservice <name> -o jsonpath='{.status.conditions}'
kubectl patch inferenceservice <name> -p '{"spec":{"predictor":{"minReplicas":2}}}'
11. Vector database operations
11.1 Milvus health checks
kubectl exec -n cortex milvus-proxy-0 -- curl localhost:9091/healthz
curl -X GET "http://milvus.cortex.svc:19530/v1/vector/collections/<collection>/stats"
curl -X POST "http://milvus.cortex.svc:19530/v1/vector/collections/<collection>/compact"
11.2 Maintenance
| Task | Schedule | Command |
|---|---|---|
| Index rebuild | Weekly | collection.create_index() |
| Compaction | Daily | collection.compact() |
| Backup | Daily | Velero snapshot |
| Stats refresh | Hourly | collection.get_stats() |
12. Alertmanager configuration
receivers:
- name: gitea-actions
webhook_configs:
- url: https://gitea.<location-code>.<sovereign-domain>/api/v1/repos/<org>/platform/actions/dispatches
http_config:
authorization:
type: Bearer
credentials_file: /etc/alertmanager/gitea-token
send_resolved: true
- name: ai-oncall
webhook_configs:
- url: https://gitea.<location-code>.<sovereign-domain>/api/v1/repos/<org>/cortex/actions/dispatches
http_config:
authorization:
type: Bearer
credentials_file: /etc/alertmanager/gitea-token
route:
receiver: gitea-actions
group_by: ['alertname', 'namespace']
group_wait: 30s
routes:
- match:
severity: critical
receiver: gitea-actions
group_wait: 10s
- match:
namespace: cortex
receiver: ai-oncall
group_by: ['alertname', 'model']
13. Incident response
13.1 Severity levels
| Level | Definition | Response time |
|---|---|---|
| P1 | Catalyst control plane down (Sovereign-impacting) | 15 minutes |
| P2 | Major Application or feature broken | 1 hour |
| P3 | Minor issue | 4 hours |
| P4 | Low priority | Next business day |
13.2 Catalyst-specific incidents
| Incident | Severity | Runbook |
|---|---|---|
| Console unreachable | P1 | Check Cilium Gateway, console pods, projector pods |
| Gitea unreachable | P1 | Check Gitea pods, CNPG primary, NetworkPolicy |
| Environment-controller stuck | P1 | Check controller logs, Crossplane provider auth |
| OpenBao sealed | P1 | Auto-unseal via SPIRE — verify SPIRE server health |
| JetStream consumer lag | P2 | Add consumer replica, check disk pressure |
| projector lag | P2 | Check JetStream consumer status, projector replicas |
| Per-Org vcluster down | P2 | Check vcluster pod, host cluster capacity |
| Flux reconciliation stalled | P2 | Check source-controller logs, Git connectivity |
| New Sovereign provisioning failed | P3 | Check OpenTofu state, cloud provider quotas |
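For the first row, a minimal diagnostic sketch that walks the path from the edge inwards (namespace and label selectors are illustrative; adjust to the actual deployment):

```sh
# Console unreachable (P1): check the Gateway, then the pods behind it.
kubectl -n catalyst-system get gateway,httproute            # Cilium Gateway API objects present and programmed?
kubectl -n catalyst-system get pods -l app=console -o wide  # console pods running and ready?
kubectl -n catalyst-system get pods -l app=projector        # projector (SSE) pods healthy?
kubectl -n catalyst-system logs deploy/console --tail=100   # recent errors from the console itself
```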
13.3 AI Hub incidents (bp-cortex)
| Incident | Severity | Runbook |
|---|---|---|
| vLLM not responding | P1 | Restart vLLM, check GPU |
| GPU OOM | P2 | Reduce batch size, scale |
| Milvus query timeout | P2 | Check index, rebuild |
| Embedding service down | P2 | Failover, restart BGE |
| RAG returning empty | P3 | Check retrieval config |
14. Runbooks
Sovereign-wide runbooks live in system/runbooks (Sovereign-admin scope). Org-specific runbooks may live in <org>/runbooks Gitea repo (one per Org if used). Both are version-controlled and indexed by Catalyst's runbook-controller, which surfaces them in incident response panels.
A typical runbook:
apiVersion: catalyst.openova.io/v1alpha1
kind: Runbook
metadata:
name: openbao-sealed
namespace: catalyst-system
spec:
triggers:
- alertName: OpenBaoSealed
preconditions:
- check: spire.server.healthy
- check: openbao.unseal.shamir.shares.available
steps:
- run: openbao operator unseal --auto-shamir
- verify: openbao.status.sealed == false
rollback: page-oncall
Cross-reference ARCHITECTURE.md, SECURITY.md, PERSONAS-AND-JOURNEYS.md.