
# SRE Handbook

Site Reliability Engineering practices for OpenOva platform operations.

Status: Accepted | Updated: 2026-02-26


## Overview

This document covers SRE practices, multi-region strategy, progressive delivery, auto-remediation, and operational tooling for OpenOva deployments, including the AI Hub, Open Banking, Data & Integration, and Communication blueprints.


## Multi-Region Strategy

### Architecture

Multi-region deployment is strongly recommended: two independent clusters across regions provide geographic redundancy with automatic failover. Clusters are named by building block (functional security zone), not by failover role; there is no "primary" or "DR" designation. Both clusters run the same building blocks symmetrically, and k8gb provides GSLB-based traffic distribution. After a failover event the surviving cluster serves all traffic, and its name does not change.

See NAMING-CONVENTION.md for the canonical cluster naming standard.

```mermaid
flowchart TB
    subgraph RegionA["Region A  (e.g. hz-fsn-rtz-prod)"]
        K8s1[Restricted Trust Zone Cluster]
        Stack1[Full Workload Stack]
    end

    subgraph RegionB["Region B  (e.g. hz-hel-rtz-prod)"]
        K8s2[Restricted Trust Zone Cluster]
        Stack2[Full Workload Stack]
    end

    subgraph GSLB["Global Load Balancing"]
        k8gb[k8gb Authoritative DNS]
        Witnesses[External DNS Witnesses]
    end

    K8s1 <-->|"WireGuard"| K8s2
    K8s1 --> k8gb
    K8s2 --> k8gb
```

### Key Principles

- Each cluster survives independently during network partition; no shared control plane
- No stretched clusters (avoids split-brain)
- Both clusters are peers; neither is designated primary or DR
- Async data replication (eventual consistency)
- k8gb acts as authoritative DNS for the GSLB zone and removes unhealthy endpoints automatically (see the sketch after this list)
- External DNS witnesses for split-brain protection
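
For illustration, a k8gb `Gslb` resource for a workload exposed in both regions might look like the following minimal sketch; the hostname, namespace, and service name are hypothetical, and field names should be checked against the deployed k8gb version.

```yaml
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: app-gslb
  namespace: rtz-apps                  # hypothetical namespace
spec:
  ingress:
    rules:
      - host: app.gslb.example.com     # hypothetical GSLB-zone hostname
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: app            # hypothetical service
                  port:
                    number: 80
  strategy:
    type: roundRobin                   # both regions are peers, so no primaryGeoTag
    dnsTtlSeconds: 30
```

The `roundRobin` strategy matches the peer model above; a `failover` strategy with `primaryGeoTag` would reintroduce exactly the primary/DR role the naming convention avoids.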

### Cross-Region Networking

| Option | Use Case |
|--------|----------|
| WireGuard mesh | Different providers, secure overlay |
| Native peering | Same provider (lower latency) |

### Data Replication

| Service | Replication Method | RPO |
|---------|--------------------|-----|
| CNPG (Postgres) | WAL streaming to async standby | Near-zero |
| FerretDB | Via CNPG WAL streaming (PostgreSQL backend) | Near-zero |
| Strimzi/Kafka | MirrorMaker 2 | Seconds |
| Valkey | REPLICAOF command | Seconds |
| MinIO | Bucket replication | Minutes |
| Harbor | Registry replication | Minutes |
| OpenBao | ESO PushSecrets to both | Seconds |
| Gitea | Bidirectional mirror + CNPG | Seconds |
| Milvus | Collection sync | Minutes |
| Neo4j | Causal cluster replication | Seconds |
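
As a sketch of the Postgres row, a CNPG replica cluster in Region B streams WAL from Region A roughly as follows; the cluster, namespace, and host names are hypothetical, and the exact bootstrap, auth, and TLS settings should follow the CNPG replica-cluster documentation.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db                         # hypothetical name (Region B side)
  namespace: databases                 # hypothetical namespace
spec:
  instances: 3
  bootstrap:
    pg_basebackup:
      source: region-a                 # initial copy from the Region A cluster
  replica:
    enabled: true                      # keep following Region A via WAL streaming
    source: region-a
  externalClusters:
    - name: region-a
      connectionParameters:            # auth/TLS settings omitted for brevity
        host: app-db-rw.region-a.example.internal   # hypothetical cross-region endpoint
        user: streaming_replica
        dbname: app
```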

### Split-Brain Protection

The Failover Controller queries external DNS witnesses:

| Resolver | Provider |
|----------|----------|
| 8.8.8.8 | Google |
| 1.1.1.1 | Cloudflare |
| 9.9.9.9 | Quad9 |

Quorum: 2 of 3 witnesses must agree that the other region is unreachable before promotion.
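
A minimal shell sketch of that quorum check, assuming a hypothetical per-region health record published through the GSLB zone:

```bash
#!/usr/bin/env bash
# Witness quorum check: promote only if 2 of 3 independent resolvers
# agree the other region's health record no longer resolves.
PEER_HEALTH="health.hz-hel-rtz-prod.gslb.example.com"   # hypothetical record
WITNESSES=(8.8.8.8 1.1.1.1 9.9.9.9)

unreachable=0
for w in "${WITNESSES[@]}"; do
  # Short timeout per witness; an empty answer counts as unreachable
  if [[ -z "$(dig @"$w" +short +time=2 +tries=1 "$PEER_HEALTH")" ]]; then
    unreachable=$((unreachable + 1))
  fi
done

if (( unreachable >= 2 )); then
  echo "quorum reached: peer region unreachable, promotion is safe"
else
  echo "no quorum: peer region still visible to witnesses, do not promote"
fi
```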


## Progressive Delivery

### Canary Deployments

Flagger provides automatic canary analysis with rollback (a minimal `Canary` sketch follows this list):

- Flux-native integration
- Automatic rollback on metric degradation
- No ArgoCD dependency
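
In this sketch the namespace and workload names are hypothetical, and the built-in `request-success-rate` / `request-duration` checks assume Flagger's standard metric providers are configured.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
  namespace: rtz-apps          # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                  # hypothetical deployment
  service:
    port: 80
  analysis:
    interval: 1m               # how often canary metrics are evaluated
    threshold: 5               # failed checks before automatic rollback
    maxWeight: 50              # stop shifting traffic at 50%
    stepWeight: 10             # shift 10% per successful interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99              # roll back if success rate drops below 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500             # roll back if latency exceeds 500ms
        interval: 1m
```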

### Feature Flags

Flipt for zero-cost feature flagging:

- Self-hosted deployment
- Simple SDK integration
- Gradual rollout control

## Auto-Remediation

### Architecture

Gitea Actions workflows are triggered by Alertmanager webhooks for automated incident response.

```mermaid
flowchart LR
    Alert[Alert Fires] --> AM[Alertmanager]
    AM --> GA[Gitea Actions]
    GA --> Remediate[Auto-Remediate]
    Remediate --> Verify[Verify Fix]
    Verify -->|Success| Resolve[Resolve Alert]
    Verify -->|Failure| Log[Log for Analysis]
```
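
Gitea Actions uses GitHub-Actions-compatible workflow syntax, so a remediation entry point might look like the sketch below. The workflow name, inputs, and target resources are hypothetical, and mapping Alertmanager's webhook payload onto dispatch inputs is assumed to be done by a small adapter in between.

```yaml
# .gitea/workflows/remediate.yaml: illustrative sketch only
name: auto-remediate
on:
  workflow_dispatch:
    inputs:
      alertname:
        description: Alertmanager alert name
        required: true
      namespace:
        description: Namespace the alert fired in
        required: true
      target:
        description: Affected Deployment (hypothetical input)
        required: true

jobs:
  remediate:
    runs-on: ubuntu-latest     # runner label depends on your Gitea runner setup
    steps:
      - name: Restart crash-looping workload
        if: ${{ inputs.alertname == 'PodCrashLoopBackOff' }}
        run: kubectl -n "${{ inputs.namespace }}" rollout restart "deployment/${{ inputs.target }}"

      - name: Verify the fix
        run: kubectl -n "${{ inputs.namespace }}" rollout status "deployment/${{ inputs.target }}" --timeout=120s
```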

### Alert-to-Action Mapping

#### Platform Alerts

| Alert | Auto-Action | Verification |
|-------|-------------|--------------|
| HighMemoryUsage | Scale up deployment | Check memory |
| PodCrashLoopBackOff | Restart pod | Check pod status |
| HighErrorRate | Trigger rollback | Check error rate |
| DatabaseConnectionExhausted | Restart PgBouncer | Check connections |
| CertificateExpiringSoon | Trigger renewal | Check expiry |
| HighLatency | Scale service | Check latency |
| GslbEndpointDown | Check k8gb status | Verify DNS |
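
Most of these actions reduce to a handful of kubectl primitives; for example (namespace and workload names hypothetical):

```bash
# HighMemoryUsage -> scale up the deployment
kubectl -n rtz-apps scale deployment/app --replicas=4

# PodCrashLoopBackOff -> restart the workload
kubectl -n rtz-apps rollout restart deployment/app

# HighErrorRate -> roll back to the previous revision
kubectl -n rtz-apps rollout undo deployment/app

# Verification step: confirm the rollout converged
kubectl -n rtz-apps rollout status deployment/app --timeout=120s
```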

#### AI Hub Alerts

| Alert | Auto-Action | Verification |
|-------|-------------|--------------|
| VLLMHighLatency | Scale vLLM replicas | Check inference latency |
| VLLMOOMKilled | Reduce batch size | Check memory |
| GPUUtilizationLow | Scale down GPU pods | Check utilization |
| GPUMemoryExhausted | Evict low-priority jobs | Check GPU memory |
| MilvusQuerySlow | Rebuild index | Check query latency |
| EmbeddingQueueBacklog | Scale BGE replicas | Check queue depth |
| RAGRetrievalEmpty | Alert + log for analysis | Check retrieval quality |
| LLMGatewayQuotaExhausted | Notify user | Check quota |

#### Open Banking Alerts

| Alert | Auto-Action | Verification |
|-------|-------------|--------------|
| KeycloakHighLatency | Scale Keycloak | Check auth latency |
| QuotaServiceDown | Failover to backup | Check quota service |
| BillingWebhookFailed | Retry with backoff | Check webhook status |
| TPPCertExpiring | Alert ops team | Check certificate |

### Budget Control

| Threshold | Action |
|-----------|--------|
| 80% of budget | Warning log |
| 100% of budget | Block scale-up |

### Secret Rotation

| Secret Type | Frequency | Method |
|-------------|-----------|--------|
| Database credentials | Monthly | CronJob + ESO |
| JWT signing keys | 30 days | CronJob |
| TLS certificates | Auto | cert-manager |
| Gitea tokens | 90 days | CronJob + ESO |
| LLM API keys | 90 days | CronJob + ESO |
| Keycloak client secrets | 90 days | CronJob + ESO |
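
For the ESO half of "CronJob + ESO", a rotated credential can be fanned out to both regions with a `PushSecret`; a sketch, assuming hypothetical store and secret names (check the field names against the deployed external-secrets version):

```yaml
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: db-credentials-push
  namespace: databases                 # hypothetical namespace
spec:
  refreshInterval: 1h
  secretStoreRefs:                     # push to the OpenBao store in each region
    - name: openbao-region-a           # hypothetical SecretStore names
      kind: SecretStore
    - name: openbao-region-b
      kind: SecretStore
  selector:
    secret:
      name: app-db-credentials         # the Secret the rotation CronJob rewrites
  data:
    - match:
        secretKey: password
        remoteRef:
          remoteKey: databases/app-db  # hypothetical OpenBao path
```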

### GDPR Automation

| Process | Schedule |
|---------|----------|
| Data subject requests | Daily 2 AM |
| Data retention | Weekly Sunday 3 AM |
| Audit log cleanup | Monthly |
| Vector embedding purge | On data deletion request |
| Chat history cleanup | Per retention policy |
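
The scheduled entries map directly onto Kubernetes CronJobs; a sketch for the daily data-subject-request run (image, namespace, and arguments are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gdpr-dsr-processor             # hypothetical name
  namespace: compliance                # hypothetical namespace
spec:
  schedule: "0 2 * * *"                # daily at 02:00
  concurrencyPolicy: Forbid            # never run overlapping DSR jobs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dsr
              image: harbor.example.internal/compliance/dsr-processor:latest  # hypothetical
              args: ["--mode=data-subject-requests"]
```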

## Air-Gap Compliance

For regulated industries requiring air-gapped deployments:

### Architecture

```mermaid
flowchart LR
    subgraph Connected["Connected Zone"]
        Pull[Pull Images/Charts]
    end

    subgraph DMZ["DMZ Transfer Zone"]
        Scan[Security Scan]
        Stage[Staging Area]
    end

    subgraph AirGap["Air-Gapped Zone"]
        Harbor[Harbor Registry]
        Git[Gitea]
        Flux[Flux CD]
        K8s[Kubernetes]
    end

    Pull --> Scan
    Scan --> Stage
    Stage -->|"Physical/Diode"| Harbor
    Stage -->|"Physical/Diode"| Git
```

### Prerequisites

All mandatory components support air-gap:

- Harbor - local registry with replication
- MinIO - local object storage
- Flux - reconciles from local Git
- Velero - backups to local MinIO
- Grafana Stack - self-contained observability

### AI Hub Air-Gap Considerations

| Component | Air-Gap Requirement |
|-----------|---------------------|
| vLLM | Pre-download model weights to MinIO |
| BGE-M3 | Pre-download embedding models |
| Milvus | No external dependencies |
| Neo4j | No external dependencies |
| NeMo Guardrails | No external dependencies |
| LangFuse | No external dependencies |
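
Pre-downloading happens in the connected zone and the artifacts cross the diode with everything else; a sketch using `huggingface-cli` and the MinIO client (the alias, bucket, and credentials are hypothetical):

```bash
# Connected zone: fetch the embedding model weights locally
huggingface-cli download BAAI/bge-m3 --local-dir ./models/bge-m3

# Stage them in MinIO so the transfer pipeline picks them up
mc alias set staging https://minio.staging.example.internal ACCESS_KEY SECRET_KEY
mc cp --recursive ./models/bge-m3/ staging/models/bge-m3/
```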

### Content Transfer

| Content Type | Air-Gap Destination |
|--------------|---------------------|
| Container images | Harbor |
| Helm charts | Harbor ChartMuseum |
| Git repositories | Self-hosted Gitea |
| OS packages | Local mirror |
| LLM model weights | MinIO |
| Embedding models | MinIO |
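
Images and charts can be staged and re-published with standard tooling; a sketch using `skopeo` and `helm` (all registry hosts, image names, and versions are hypothetical):

```bash
# Connected zone: pull an image into a directory that can cross the diode
skopeo copy docker://registry.example.com/platform/app:1.2.3 \
  dir:./staging/images/app

# Air-gapped zone: load the scanned artifact into Harbor
skopeo copy dir:./staging/images/app \
  docker://harbor.airgap.example.internal/platform/app:1.2.3

# Helm charts travel the same way as OCI artifacts
helm pull oci://registry.example.com/charts/app --version 1.2.3 -d ./staging/charts
helm push ./staging/charts/app-1.2.3.tgz oci://harbor.airgap.example.internal/charts
```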

## Platform Engineering Tools

### Tool Selection

| Tool | Purpose | Status |
|------|---------|--------|
| Crossplane | Cloud resource provisioning (day-2) | Mandatory |
| Catalyst IDP | Internal Developer Platform | Via OpenOva Catalyst |
| Flux | GitOps delivery engine | Mandatory |
| OpenTofu | Bootstrap IaC only | Mandatory |

### Architecture

```mermaid
flowchart TB
    subgraph IDP["Internal Developer Platform"]
        CAT[Catalyst IDP]
    end

    subgraph GitOps
        Git[Git Repository]
        Flux[Flux CD]
    end

    subgraph IaC["Infrastructure as Code"]
        TF[OpenTofu Bootstrap]
        CP[Crossplane Day 2+]
    end

    CAT -->|"Templates"| Git
    Git -->|"Reconcile"| Flux
    Flux -->|"Apply"| CP
    CP --> Cloud
    TF -.->|"Initial bootstrap"| Cloud
```

### Crossplane Providers

| Provider | Support |
|----------|---------|
| Hetzner Cloud | hcloud provider |
| Huawei Cloud | huaweicloud provider |
| Oracle Cloud | oci provider |
| AWS | aws provider |
| GCP | gcp provider |
| Azure | azure provider |
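
Each provider is installed as a Crossplane package; a sketch (the package reference below is a placeholder, not the actual location of an hcloud provider):

```yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-hcloud
spec:
  package: registry.example.com/crossplane/provider-hcloud:v0.1.0  # hypothetical package ref
```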

## Monitoring SLOs

### Platform SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Availability | 99.9% | <99.5% for 5m |
| Latency (p95) | <500ms | >1s for 5m |
| Error Rate | <0.1% | >1% for 5m |
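
As an example of wiring a threshold into alerting, the error-rate row might become the following PrometheusRule sketch; the metric names assume standard HTTP server instrumentation and should be adapted to the actual exporters in use.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-slo
  namespace: monitoring                # hypothetical namespace
spec:
  groups:
    - name: platform-slo
      rules:
        - alert: HighErrorRate
          # 5xx share of all requests over 5 minutes (>1% threshold)
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
```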

### AI Hub SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| LLM Inference Latency (p95) | <5s | >10s for 5m |
| LLM Token Throughput | >50 tok/s | <20 tok/s for 5m |
| Embedding Latency (p95) | <100ms | >500ms for 5m |
| RAG Retrieval Latency (p95) | <500ms | >2s for 5m |
| GPU Utilization | >60% | <30% for 15m |
| Vector Search Latency (p95) | <50ms | >200ms for 5m |

### Open Banking SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Auth Latency (p95) | <200ms | >500ms for 5m |
| API Availability | 99.95% | <99.5% for 5m |
| Consent Flow Success | >99% | <95% for 5m |

### Data & Integration (Fabric) SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Kafka Produce Latency (p95) | <50ms | >200ms for 5m |
| Flink Checkpoint Duration | <30s | >60s for 5m |
| Temporal Workflow Latency (p95) | <1s | >5s for 5m |
| CDC Lag (Debezium) | <10s | >60s for 5m |
| ClickHouse Query Latency (p95) | <500ms | >2s for 5m |

### Communication (Relay) SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Email Delivery Rate | >99.5% | <98% for 15m |
| LiveKit Call Setup (p95) | <2s | >5s for 5m |
| Matrix Message Delivery (p95) | <500ms | >2s for 5m |
| TURN Relay Success Rate | >99% | <95% for 5m |

## GPU Operations

### GPU Node Management

```yaml
# GPU node pool scheduling: these fields go in the workload's pod spec
nodeSelector:
  node.kubernetes.io/gpu: "true"
  nvidia.com/gpu.product: "NVIDIA-A10"

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

### GPU Monitoring Metrics

| Metric | Query | Purpose |
|--------|-------|---------|
| GPU Utilization | `DCGM_FI_DEV_GPU_UTIL` | Compute usage |
| GPU Memory Used | `DCGM_FI_DEV_FB_USED` | Memory pressure |
| GPU Temperature | `DCGM_FI_DEV_GPU_TEMP` | Thermal throttling |
| GPU Power | `DCGM_FI_DEV_POWER_USAGE` | Power consumption |
| SM Clock | `DCGM_FI_DEV_SM_CLOCK` | Clock throttling |

### vLLM Operations

```bash
# Check vLLM health
curl http://vllm.ai-hub.svc:8000/health

# Check loaded models
curl http://vllm.ai-hub.svc:8000/v1/models

# Monitor generation metrics
curl http://vllm.ai-hub.svc:8000/metrics | grep vllm_
```

### KServe Operations

```bash
# List InferenceServices
kubectl get inferenceservices -n ai-hub

# Check model readiness
kubectl get inferenceservice <name> -n ai-hub -o jsonpath='{.status.conditions}'

# Scale model replicas (custom resources need a merge patch)
kubectl patch inferenceservice <name> -n ai-hub --type=merge \
  -p '{"spec":{"predictor":{"minReplicas":2}}}'
```

## Vector Database Operations

### Milvus Health Checks

```bash
# Check cluster status
kubectl exec -it milvus-proxy-0 -n ai-hub -- curl localhost:9091/healthz

# Check collection stats
curl -X GET "http://milvus.ai-hub.svc:19530/v1/vector/collections/<collection>/stats"

# Compact collection
curl -X POST "http://milvus.ai-hub.svc:19530/v1/vector/collections/<collection>/compact"
```

### Milvus Maintenance

| Task | Schedule | Command |
|------|----------|---------|
| Index rebuild | Weekly | `collection.create_index()` |
| Compaction | Daily | `collection.compact()` |
| Backup | Daily | Velero snapshot |
| Stats refresh | Hourly | `collection.get_stats()` |

## Alertmanager Configuration

```yaml
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: https://gitea.<domain>/api/v1/repos/<org>/platform/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token
        send_resolved: true

  - name: ai-hub-oncall
    webhook_configs:
      - url: https://gitea.<domain>/api/v1/repos/<org>/ai-hub/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token

route:
  receiver: gitea-actions
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: gitea-actions
      group_wait: 10s
    - match:
        namespace: ai-hub
      receiver: ai-hub-oncall
      group_by: ['alertname', 'model']
```

## Grafana Dashboards

### Platform Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |

### AI Hub Dashboards

| Dashboard | Purpose |
|-----------|---------|
| AI Hub Overview | Request rates, model usage |
| GPU Metrics | Utilization, memory, temperature |
| LLM Inference | Latency, throughput, queue depth |
| RAG Analytics | Retrieval quality, citations |
| Vector Search | Query latency, index stats |
| User Analytics | Usage by agent, user |

### Open Banking Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Open Banking Overview | API calls, consent flows |
| Keycloak Auth | Authentication metrics |
| Billing | Usage metering, revenue |

## Incident Response

### Severity Levels

| Level | Definition | Response Time |
|-------|------------|---------------|
| P1 | Platform down | 15 minutes |
| P2 | Major feature broken | 1 hour |
| P3 | Minor issue | 4 hours |
| P4 | Low priority | Next business day |

### AI Hub Specific Incidents

| Incident | Severity | Runbook |
|----------|----------|---------|
| vLLM not responding | P1 | Restart vLLM, check GPU |
| GPU OOM | P2 | Reduce batch size, scale |
| Milvus query timeout | P2 | Check index, rebuild |
| Embedding service down | P2 | Failover, restart BGE |
| RAG returning empty | P3 | Check retrieval config |

Part of OpenOva