
# SRE Handbook

Site Reliability Engineering practices for OpenOva platform operations.

Status: Accepted | Updated: 2026-02-26


## Overview

This document covers SRE practices, multi-region strategy, progressive delivery, auto-remediation, and operational tooling for OpenOva deployments, including the AI Hub, Open Banking, Data & Integration, and Communication blueprints.


## Multi-Region Strategy

### Architecture

Multi-region deployment is strongly recommended: two independent clusters across regions provide geographic redundancy with automatic failover. Clusters are named by building block (functional security zone), not by failover role; there is no "primary" or "DR" designation. Both clusters run the same building blocks symmetrically, and k8gb provides GSLB-based traffic distribution. After a failover event the surviving cluster serves all traffic, and its name does not change.

See NAMING-CONVENTION.md for the canonical cluster naming standard.

```mermaid
flowchart TB
    subgraph RegionA["Region A  (e.g. hz-fsn-rtz-prod)"]
        K8s1[Restricted Trust Zone Cluster]
        Stack1[Full Workload Stack]
    end

    subgraph RegionB["Region B  (e.g. hz-hel-rtz-prod)"]
        K8s2[Restricted Trust Zone Cluster]
        Stack2[Full Workload Stack]
    end

    subgraph GSLB["Global Load Balancing"]
        k8gb[k8gb Authoritative DNS]
        Witnesses[External DNS Witnesses]
    end

    K8s1 <-->|"WireGuard"| K8s2
    K8s1 --> k8gb
    K8s2 --> k8gb
```

### Key Principles

- Each cluster survives independently during network partition; no shared control plane
- No stretched clusters (avoids split-brain)
- Both clusters are peers; neither is designated primary or DR
- Async data replication (eventual consistency)
- k8gb acts as authoritative DNS for the GSLB zone and removes unhealthy endpoints automatically (see the sketch after this list)
- External DNS witnesses for split-brain protection
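
For illustration, a k8gb `Gslb` resource for a workload exposed in both regions might look like the following minimal sketch; the hostname, namespace, and service name are hypothetical, and field names should be checked against the deployed k8gb version.

```yaml
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: app-gslb
  namespace: rtz-apps                  # hypothetical namespace
spec:
  ingress:
    rules:
      - host: app.gslb.example.com     # hypothetical GSLB-zone hostname
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: app            # hypothetical service
                  port:
                    number: 80
  strategy:
    type: roundRobin                   # both regions are peers, so no primaryGeoTag
    dnsTtlSeconds: 30
```

The `roundRobin` strategy matches the peer model above; a `failover` strategy with `primaryGeoTag` would reintroduce exactly the primary/DR role the naming convention avoids.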

### Cross-Region Networking

| Option | Use Case |
|--------|----------|
| WireGuard mesh | Different providers, secure overlay |
| Native peering | Same provider (lower latency) |

### Data Replication

| Service | Replication Method | RPO |
|---------|--------------------|-----|
| CNPG (Postgres) | WAL streaming to async standby | Near-zero |
| FerretDB | Via CNPG WAL streaming (PostgreSQL backend) | Near-zero |
| Strimzi/Kafka | MirrorMaker 2 | Seconds |
| Valkey | REPLICAOF command | Seconds |
| MinIO | Bucket replication | Minutes |
| Harbor | Registry replication | Minutes |
| OpenBao | ESO PushSecrets to both | Seconds |
| Gitea | Bidirectional mirror + CNPG | Seconds |
| Milvus | Collection sync | Minutes |
| Neo4j | Causal cluster replication | Seconds |
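
As a sketch of the Postgres row, a CNPG replica cluster in Region B streams WAL from Region A roughly as follows; the cluster, namespace, and host names are hypothetical, and the exact bootstrap, auth, and TLS settings should follow the CNPG replica-cluster documentation.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db                         # hypothetical name (Region B side)
  namespace: databases                 # hypothetical namespace
spec:
  instances: 3
  bootstrap:
    pg_basebackup:
      source: region-a                 # initial copy from the Region A cluster
  replica:
    enabled: true                      # keep following Region A via WAL streaming
    source: region-a
  externalClusters:
    - name: region-a
      connectionParameters:            # auth/TLS settings omitted for brevity
        host: app-db-rw.region-a.example.internal   # hypothetical cross-region endpoint
        user: streaming_replica
        dbname: app
```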

### Split-Brain Protection

The Failover Controller queries external DNS witnesses:

| Resolver | Provider |
|----------|----------|
| 8.8.8.8 | Google |
| 1.1.1.1 | Cloudflare |
| 9.9.9.9 | Quad9 |

Quorum: 2 of 3 witnesses must agree that the other region is unreachable before promotion.
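
A minimal shell sketch of that quorum check, assuming a hypothetical per-region health record published through the GSLB zone:

```bash
#!/usr/bin/env bash
# Witness quorum check: promote only if 2 of 3 independent resolvers
# agree the other region's health record no longer resolves.
PEER_HEALTH="health.hz-hel-rtz-prod.gslb.example.com"   # hypothetical record
WITNESSES=(8.8.8.8 1.1.1.1 9.9.9.9)

unreachable=0
for w in "${WITNESSES[@]}"; do
  # Short timeout per witness; an empty answer counts as unreachable
  if [[ -z "$(dig @"$w" +short +time=2 +tries=1 "$PEER_HEALTH")" ]]; then
    unreachable=$((unreachable + 1))
  fi
done

if (( unreachable >= 2 )); then
  echo "quorum reached: peer region unreachable, promotion is safe"
else
  echo "no quorum: peer region still visible to witnesses, do not promote"
fi
```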


## Progressive Delivery

### Canary Deployments

Flagger provides automatic canary analysis with rollback (a minimal `Canary` sketch follows this list):

- Flux-native integration
- Automatic rollback on metric degradation
- No ArgoCD dependency
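
In this sketch the namespace and workload names are hypothetical, and the built-in `request-success-rate` / `request-duration` checks assume Flagger's standard metric providers are configured.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
  namespace: rtz-apps          # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                  # hypothetical deployment
  service:
    port: 80
  analysis:
    interval: 1m               # how often canary metrics are evaluated
    threshold: 5               # failed checks before automatic rollback
    maxWeight: 50              # stop shifting traffic at 50%
    stepWeight: 10             # shift 10% per successful interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99              # roll back if success rate drops below 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500             # roll back if latency exceeds 500ms
        interval: 1m
```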

### Feature Flags

Flipt for zero-cost feature flagging:

- Self-hosted deployment
- Simple SDK integration
- Gradual rollout control

## Auto-Remediation

### Architecture

Gitea Actions workflows are triggered by Alertmanager webhooks for automated incident response.

```mermaid
flowchart LR
    Alert[Alert Fires] --> AM[Alertmanager]
    AM --> GA[Gitea Actions]
    GA --> Remediate[Auto-Remediate]
    Remediate --> Verify[Verify Fix]
    Verify -->|Success| Resolve[Resolve Alert]
    Verify -->|Failure| Log[Log for Analysis]
```
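
Gitea Actions uses GitHub-Actions-compatible workflow syntax, so a remediation entry point might look like the sketch below. The workflow name, inputs, and target resources are hypothetical, and mapping Alertmanager's webhook payload onto dispatch inputs is assumed to be done by a small adapter in between.

```yaml
# .gitea/workflows/remediate.yaml: illustrative sketch only
name: auto-remediate
on:
  workflow_dispatch:
    inputs:
      alertname:
        description: Alertmanager alert name
        required: true
      namespace:
        description: Namespace the alert fired in
        required: true
      target:
        description: Affected Deployment (hypothetical input)
        required: true

jobs:
  remediate:
    runs-on: ubuntu-latest     # runner label depends on your Gitea runner setup
    steps:
      - name: Restart crash-looping workload
        if: ${{ inputs.alertname == 'PodCrashLoopBackOff' }}
        run: kubectl -n "${{ inputs.namespace }}" rollout restart "deployment/${{ inputs.target }}"

      - name: Verify the fix
        run: kubectl -n "${{ inputs.namespace }}" rollout status "deployment/${{ inputs.target }}" --timeout=120s
```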

### Alert-to-Action Mapping

#### Platform Alerts

| Alert | Auto-Action | Verification |
|-------|-------------|--------------|
| HighMemoryUsage | Scale up deployment | Check memory |
| PodCrashLoopBackOff | Restart pod | Check pod status |
| HighErrorRate | Trigger rollback | Check error rate |
| DatabaseConnectionExhausted | Restart PgBouncer | Check connections |
| CertificateExpiringSoon | Trigger renewal | Check expiry |
| HighLatency | Scale service | Check latency |
| GslbEndpointDown | Check k8gb status | Verify DNS |
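
Most of these actions reduce to a handful of kubectl primitives; for example (namespace and workload names hypothetical):

```bash
# HighMemoryUsage -> scale up the deployment
kubectl -n rtz-apps scale deployment/app --replicas=4

# PodCrashLoopBackOff -> restart the workload
kubectl -n rtz-apps rollout restart deployment/app

# HighErrorRate -> roll back to the previous revision
kubectl -n rtz-apps rollout undo deployment/app

# Verification step: confirm the rollout converged
kubectl -n rtz-apps rollout status deployment/app --timeout=120s
```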

#### AI Hub Alerts

| Alert | Auto-Action | Verification |
|-------|-------------|--------------|
| VLLMHighLatency | Scale vLLM replicas | Check inference latency |
| VLLMOOMKilled | Reduce batch size | Check memory |
| GPUUtilizationLow | Scale down GPU pods | Check utilization |
| GPUMemoryExhausted | Evict low-priority jobs | Check GPU memory |
| MilvusQuerySlow | Rebuild index | Check query latency |
| EmbeddingQueueBacklog | Scale BGE replicas | Check queue depth |
| RAGRetrievalEmpty | Alert + log for analysis | Check retrieval quality |
| LLMGatewayQuotaExhausted | Notify user | Check quota |

#### Open Banking Alerts

| Alert | Auto-Action | Verification |
|-------|-------------|--------------|
| KeycloakHighLatency | Scale Keycloak | Check auth latency |
| QuotaServiceDown | Failover to backup | Check quota service |
| BillingWebhookFailed | Retry with backoff | Check webhook status |
| TPPCertExpiring | Alert ops team | Check certificate |

### Budget Control

| Threshold | Action |
|-----------|--------|
| 80% of budget | Warning log |
| 100% of budget | Block scale-up |

### Secret Rotation

| Secret Type | Frequency | Method |
|-------------|-----------|--------|
| Database credentials | Monthly | CronJob + ESO |
| JWT signing keys | 30 days | CronJob |
| TLS certificates | Auto | cert-manager |
| Gitea tokens | 90 days | CronJob + ESO |
| LLM API keys | 90 days | CronJob + ESO |
| Keycloak client secrets | 90 days | CronJob + ESO |
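
For the ESO half of "CronJob + ESO", a rotated credential can be fanned out to both regions with a `PushSecret`; a sketch, assuming hypothetical store and secret names (check the field names against the deployed external-secrets version):

```yaml
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: db-credentials-push
  namespace: databases                 # hypothetical namespace
spec:
  refreshInterval: 1h
  secretStoreRefs:                     # push to the OpenBao store in each region
    - name: openbao-region-a           # hypothetical SecretStore names
      kind: SecretStore
    - name: openbao-region-b
      kind: SecretStore
  selector:
    secret:
      name: app-db-credentials         # the Secret the rotation CronJob rewrites
  data:
    - match:
        secretKey: password
        remoteRef:
          remoteKey: databases/app-db  # hypothetical OpenBao path
```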

### GDPR Automation

| Process | Schedule |
|---------|----------|
| Data subject requests | Daily 2 AM |
| Data retention | Weekly Sunday 3 AM |
| Audit log cleanup | Monthly |
| Vector embedding purge | On data deletion request |
| Chat history cleanup | Per retention policy |
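
The scheduled entries map directly onto Kubernetes CronJobs; a sketch for the daily data-subject-request run (image, namespace, and arguments are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gdpr-dsr-processor             # hypothetical name
  namespace: compliance                # hypothetical namespace
spec:
  schedule: "0 2 * * *"                # daily at 02:00
  concurrencyPolicy: Forbid            # never run overlapping DSR jobs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dsr
              image: harbor.example.internal/compliance/dsr-processor:latest  # hypothetical
              args: ["--mode=data-subject-requests"]
```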

## Air-Gap Compliance

For regulated industries requiring air-gapped deployments:

### Architecture

```mermaid
flowchart LR
    subgraph Connected["Connected Zone"]
        Pull[Pull Images/Charts]
    end

    subgraph DMZ["DMZ Transfer Zone"]
        Scan[Security Scan]
        Stage[Staging Area]
    end

    subgraph AirGap["Air-Gapped Zone"]
        Harbor[Harbor Registry]
        Git[Gitea]
        Flux[Flux CD]
        K8s[Kubernetes]
    end

    Pull --> Scan
    Scan --> Stage
    Stage -->|"Physical/Diode"| Harbor
    Stage -->|"Physical/Diode"| Git
```

### Prerequisites

All mandatory components support air-gap:

- Harbor - local registry with replication
- MinIO - local object storage
- Flux - reconciles from local Git
- Velero - backups to local MinIO
- Grafana Stack - self-contained observability

### AI Hub Air-Gap Considerations

| Component | Air-Gap Requirement |
|-----------|---------------------|
| vLLM | Pre-download model weights to MinIO |
| BGE-M3 | Pre-download embedding models |
| Milvus | No external dependencies |
| Neo4j | No external dependencies |
| NeMo Guardrails | No external dependencies |
| LangFuse | No external dependencies |
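
Pre-downloading happens in the connected zone and the artifacts cross the diode with everything else; a sketch using `huggingface-cli` and the MinIO client (the alias, bucket, and credentials are hypothetical):

```bash
# Connected zone: fetch the embedding model weights locally
huggingface-cli download BAAI/bge-m3 --local-dir ./models/bge-m3

# Stage them in MinIO so the transfer pipeline picks them up
mc alias set staging https://minio.staging.example.internal ACCESS_KEY SECRET_KEY
mc cp --recursive ./models/bge-m3/ staging/models/bge-m3/
```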

### Content Transfer

| Content Type | Air-Gap Destination |
|--------------|---------------------|
| Container images | Harbor |
| Helm charts | Harbor ChartMuseum |
| Git repositories | Self-hosted Gitea |
| OS packages | Local mirror |
| LLM model weights | MinIO |
| Embedding models | MinIO |
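
Images and charts can be staged and re-published with standard tooling; a sketch using `skopeo` and `helm` (all registry hosts, image names, and versions are hypothetical):

```bash
# Connected zone: pull an image into a directory that can cross the diode
skopeo copy docker://registry.example.com/platform/app:1.2.3 \
  dir:./staging/images/app

# Air-gapped zone: load the scanned artifact into Harbor
skopeo copy dir:./staging/images/app \
  docker://harbor.airgap.example.internal/platform/app:1.2.3

# Helm charts travel the same way as OCI artifacts
helm pull oci://registry.example.com/charts/app --version 1.2.3 -d ./staging/charts
helm push ./staging/charts/app-1.2.3.tgz oci://harbor.airgap.example.internal/charts
```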

## Platform Engineering Tools

### Tool Selection

| Tool | Purpose | Status |
|------|---------|--------|
| Crossplane | Cloud resource provisioning (day-2) | Mandatory |
| Catalyst IDP | Internal Developer Platform | Via OpenOva Catalyst |
| Flux | GitOps delivery engine | Mandatory |
| OpenTofu | Bootstrap IaC only | Mandatory |

### Architecture

```mermaid
flowchart TB
    subgraph IDP["Internal Developer Platform"]
        CAT[Catalyst IDP]
    end

    subgraph GitOps
        Git[Git Repository]
        Flux[Flux CD]
    end

    subgraph IaC["Infrastructure as Code"]
        TF[OpenTofu Bootstrap]
        CP[Crossplane Day 2+]
    end

    CAT -->|"Templates"| Git
    Git -->|"Reconcile"| Flux
    Flux -->|"Apply"| CP
    CP --> Cloud
    TF -.->|"Initial bootstrap"| Cloud
```

### Crossplane Providers

| Provider | Support |
|----------|---------|
| Hetzner Cloud | hcloud provider |
| Huawei Cloud | huaweicloud provider |
| Oracle Cloud | oci provider |
| AWS | aws provider |
| GCP | gcp provider |
| Azure | azure provider |
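
Each provider is installed as a Crossplane package; a sketch (the package reference below is a placeholder, not the actual location of an hcloud provider):

```yaml
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-hcloud
spec:
  package: registry.example.com/crossplane/provider-hcloud:v0.1.0  # hypothetical package ref
```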

## Monitoring SLOs

### Platform SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Availability | 99.9% | <99.5% for 5m |
| Latency (p95) | <500ms | >1s for 5m |
| Error Rate | <0.1% | >1% for 5m |
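
As an example of wiring a threshold into alerting, the error-rate row might become the following PrometheusRule sketch; the metric names assume standard HTTP server instrumentation and should be adapted to the actual exporters in use.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-slo
  namespace: monitoring                # hypothetical namespace
spec:
  groups:
    - name: platform-slo
      rules:
        - alert: HighErrorRate
          # 5xx share of all requests over 5 minutes (>1% threshold)
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
```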

### AI Hub SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| LLM Inference Latency (p95) | <5s | >10s for 5m |
| LLM Token Throughput | >50 tok/s | <20 tok/s for 5m |
| Embedding Latency (p95) | <100ms | >500ms for 5m |
| RAG Retrieval Latency (p95) | <500ms | >2s for 5m |
| GPU Utilization | >60% | <30% for 15m |
| Vector Search Latency (p95) | <50ms | >200ms for 5m |

### Open Banking SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Auth Latency (p95) | <200ms | >500ms for 5m |
| API Availability | 99.95% | <99.5% for 5m |
| Consent Flow Success | >99% | <95% for 5m |

### Data & Integration (Fabric) SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Kafka Produce Latency (p95) | <50ms | >200ms for 5m |
| Flink Checkpoint Duration | <30s | >60s for 5m |
| Temporal Workflow Latency (p95) | <1s | >5s for 5m |
| CDC Lag (Debezium) | <10s | >60s for 5m |
| ClickHouse Query Latency (p95) | <500ms | >2s for 5m |

### Communication (Relay) SLOs

| SLI | Target | Alert Threshold |
|-----|--------|-----------------|
| Email Delivery Rate | >99.5% | <98% for 15m |
| LiveKit Call Setup (p95) | <2s | >5s for 5m |
| Matrix Message Delivery (p95) | <500ms | >2s for 5m |
| TURN Relay Success Rate | >99% | <95% for 5m |

## GPU Operations

### GPU Node Management

```yaml
# GPU node pool scheduling: these fields go in the workload's pod spec
nodeSelector:
  node.kubernetes.io/gpu: "true"
  nvidia.com/gpu.product: "NVIDIA-A10"

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

### GPU Monitoring Metrics

| Metric | Query | Purpose |
|--------|-------|---------|
| GPU Utilization | `DCGM_FI_DEV_GPU_UTIL` | Compute usage |
| GPU Memory Used | `DCGM_FI_DEV_FB_USED` | Memory pressure |
| GPU Temperature | `DCGM_FI_DEV_GPU_TEMP` | Thermal throttling |
| GPU Power | `DCGM_FI_DEV_POWER_USAGE` | Power consumption |
| SM Clock | `DCGM_FI_DEV_SM_CLOCK` | Clock throttling |

### vLLM Operations

```bash
# Check vLLM health
curl http://vllm.ai-hub.svc:8000/health

# Check loaded models
curl http://vllm.ai-hub.svc:8000/v1/models

# Monitor generation metrics
curl http://vllm.ai-hub.svc:8000/metrics | grep vllm_
```

### KServe Operations

```bash
# List InferenceServices
kubectl get inferenceservices -n ai-hub

# Check model readiness
kubectl get inferenceservice <name> -n ai-hub -o jsonpath='{.status.conditions}'

# Scale model replicas (custom resources need a merge patch)
kubectl patch inferenceservice <name> -n ai-hub --type=merge \
  -p '{"spec":{"predictor":{"minReplicas":2}}}'
```

## Vector Database Operations

### Milvus Health Checks

```bash
# Check cluster status
kubectl exec -it milvus-proxy-0 -n ai-hub -- curl localhost:9091/healthz

# Check collection stats
curl -X GET "http://milvus.ai-hub.svc:19530/v1/vector/collections/<collection>/stats"

# Compact collection
curl -X POST "http://milvus.ai-hub.svc:19530/v1/vector/collections/<collection>/compact"
```

### Milvus Maintenance

| Task | Schedule | Command |
|------|----------|---------|
| Index rebuild | Weekly | `collection.create_index()` |
| Compaction | Daily | `collection.compact()` |
| Backup | Daily | Velero snapshot |
| Stats refresh | Hourly | `collection.get_stats()` |

## Alertmanager Configuration

```yaml
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: https://gitea.<domain>/api/v1/repos/<org>/platform/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token
        send_resolved: true

  - name: ai-hub-oncall
    webhook_configs:
      - url: https://gitea.<domain>/api/v1/repos/<org>/ai-hub/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token

route:
  receiver: gitea-actions
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: gitea-actions
      group_wait: 10s
    - match:
        namespace: ai-hub
      receiver: ai-hub-oncall
      group_by: ['alertname', 'model']
```

## Grafana Dashboards

### Platform Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |

### AI Hub Dashboards

| Dashboard | Purpose |
|-----------|---------|
| AI Hub Overview | Request rates, model usage |
| GPU Metrics | Utilization, memory, temperature |
| LLM Inference | Latency, throughput, queue depth |
| RAG Analytics | Retrieval quality, citations |
| Vector Search | Query latency, index stats |
| User Analytics | Usage by agent, user |

### Open Banking Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Open Banking Overview | API calls, consent flows |
| Keycloak Auth | Authentication metrics |
| Billing | Usage metering, revenue |

## Incident Response

### Severity Levels

| Level | Definition | Response Time |
|-------|------------|---------------|
| P1 | Platform down | 15 minutes |
| P2 | Major feature broken | 1 hour |
| P3 | Minor issue | 4 hours |
| P4 | Low priority | Next business day |

### AI Hub Specific Incidents

| Incident | Severity | Runbook |
|----------|----------|---------|
| vLLM not responding | P1 | Restart vLLM, check GPU |
| GPU OOM | P2 | Reduce batch size, scale |
| Milvus query timeout | P2 | Check index, rebuild |
| Embedding service down | P2 | Failover, restart BGE |
| RAG returning empty | P3 | Check retrieval config |

Part of OpenOva