
SRE Handbook

Status: Authoritative target playbook. Updated: 2026-04-27. Implementation: Most automation described (alert webhooks, Runbook CRDs, failover-controller actions) is design-stage. See IMPLEMENTATION-STATUS.md. Existing Sovereign deployments may rely on simpler manual procedures until the automation lands.

Site Reliability Engineering practices for Catalyst Sovereigns. Defer to GLOSSARY.md for terminology, ARCHITECTURE.md for the model, SECURITY.md for credentials and identity.


1. Overview

This handbook covers running a Sovereign in production: multi-region topology, progressive delivery, auto-remediation, secret rotation, GDPR automation, air-gap considerations, SLOs, GPU operations, and incident response. Audience: sovereign-admin and SRE personas across SME-style and corporate-style Sovereigns.


2. Multi-region strategy

2.1 Architecture

Multi-region is strongly recommended for production-tier Sovereigns. Two or more independent host clusters across regions provide geographic redundancy with automatic failover.

Clusters are named by building block (functional security zone), not by failover role — there is no "primary" or "DR" designation. Both clusters run the same building blocks symmetrically; k8gb-provided GSLB handles traffic distribution. After a failover event, the surviving cluster serves all traffic — its name does not change.

See NAMING-CONVENTION.md §1.3.

```mermaid
flowchart TB
    subgraph RegionA["Region A (e.g. hz-fsn-rtz-prod)"]
        K8s1[Restricted Trust Zone Cluster]
        Stack1[Per-Org vclusters + workloads]
    end

    subgraph RegionB["Region B (e.g. hz-hel-rtz-prod)"]
        K8s2[Restricted Trust Zone Cluster]
        Stack2[Per-Org vclusters + workloads]
    end

    subgraph GSLB["Global Load Balancing"]
        k8gb[k8gb Authoritative DNS]
        Witness["Cloud witness<br>(lease for split-brain protection)"]
    end

    K8s1 <-->|"WireGuard"| K8s2
    K8s1 --> k8gb
    K8s2 --> k8gb
    k8gb -.-> Witness
```

2.2 Key principles

  • Each cluster survives independently during network partition — no shared control plane.
  • No stretched clusters (avoids split-brain). This applies to OpenBao, JetStream, etcd, and any other quorum-based component.
  • Both clusters are peers — neither is designated primary or DR.
  • Async data replication (eventual consistency).
  • k8gb as authoritative DNS for GSLB zone — removes unhealthy endpoints automatically.
  • Cloud witness (lease) for split-brain protection — see §2.4.

2.3 Cross-region networking

| Option | Use case |
|---|---|
| WireGuard mesh | Different providers, secure overlay |
| Native peering | Same provider (lower latency, e.g. Hetzner vSwitch, OCI FastConnect) |

2.4 Split-brain protection

Failover Controller uses a cloud witness for lease-based authority:

| Component | Role |
|---|---|
| Cloud witness | Holds the lease — typically Cloudflare KV (cheap, globally distributed) or another out-of-Sovereign storage |
| Failover Controller | Per-cluster controller managing readiness and gating endpoints |

Witness pattern:

  • Active region holds a lease (renews every 10s, TTL 30s).
  • Standby regions cannot become active while a valid lease exists.
  • Network partition: both regions reach the witness → active keeps renewing → no split-brain.
  • Witness unreachable: failover controller falls back to multi-resolver DNS quorum (8.8.8.8 + 1.1.1.1 + 9.9.9.9, 2-of-3).
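
A minimal sketch of the lease policy as declarative config. The failover controller is design-stage, so the FailoverPolicy CRD below is hypothetical (the API group is borrowed from the Runbook CRD in §14); every field name is illustrative:

```yaml
# Hypothetical FailoverPolicy — the failover controller is design-stage,
# so this CRD shape is illustrative, not a shipped API.
apiVersion: catalyst.openova.io/v1alpha1
kind: FailoverPolicy
metadata:
  name: region-lease
  namespace: catalyst-system
spec:
  mode: automatic                  # automatic | semi-automatic | manual
  witness:
    backend: cloudflare-kv         # out-of-Sovereign lease storage
    renewIntervalSeconds: 10       # active region renews every 10s
    leaseTTLSeconds: 30            # standby may claim after TTL expiry
  fallback:
    dnsQuorum:
      resolvers: ["8.8.8.8", "1.1.1.1", "9.9.9.9"]
      required: 2                  # 2-of-3 must confirm reachability
```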

Three layers controlled by the failover controller:

| Layer | Mechanism |
|---|---|
| External traffic (Gateway API → k8gb) | HTTPRoute readiness toggling |
| Internal traffic (Cilium Cluster Mesh) | Service endpoint manipulation |
| Stateful services (CNPG, FerretDB, Strimzi) | Database promotion signaling |

Modes: automatic | semi-automatic | manual (regulated tier).
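
For the external-traffic layer, endpoint gating surfaces as a k8gb Gslb resource. The CRD (k8gb.absa.oss/v1beta1) is real; the hostname and service below are illustrative, and the classic Ingress-embedding shape is shown even though Catalyst fronts traffic with Gateway API, so the actual wiring may differ:

```yaml
# k8gb Gslb gating the console endpoint — real CRD (k8gb.absa.oss/v1beta1);
# host and service names are illustrative.
apiVersion: k8gb.absa.oss/v1beta1
kind: Gslb
metadata:
  name: console
  namespace: catalyst-system
spec:
  ingress:
    rules:
      - host: console.example-sovereign.io   # resolves to healthy regions only
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: console
                  port:
                    number: 443
  strategy:
    type: roundRobin               # peers, no designated primary
    dnsTtlSeconds: 30
```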

2.5 Data replication patterns

These apply to stateful components — Application Blueprints (data services) and per-host-cluster infrastructure with state (SeaweedFS, Harbor). The Catalyst control plane's own state-bearing components use different patterns: see SECURITY.md for OpenBao and ARCHITECTURE.md for NATS JetStream.

| Component | Layer | Replication method | RPO |
|---|---|---|---|
| CNPG (PostgreSQL) | Application Blueprint | WAL streaming to async standby | Near-zero |
| FerretDB | Application Blueprint | Via CNPG WAL streaming | Near-zero |
| Strimzi/Kafka | Application Blueprint | MirrorMaker2 | Seconds |
| Valkey | Application Blueprint | REPLICAOF | Seconds |
| ClickHouse | Application Blueprint | ReplicatedMergeTree | Seconds |
| OpenSearch | Application Blueprint | Cross-cluster replication | Seconds |
| Milvus | Application Blueprint | Collection sync | Minutes |
| Neo4j | Application Blueprint | Causal cluster replication | Seconds |
| SeaweedFS | Per-host-cluster infra | Cross-region async bucket replication + cold-tier passthrough to shared cloud archive | Minutes |
| Harbor | Per-host-cluster infra | Registry replication | Minutes |
| Gitea | Catalyst control plane | Intra-cluster HA replicas + CNPG primary-replica (NOT cross-region mirror — see platform/gitea/README.md §"Multi-Region Strategy"). DR for Gitea is via mgt-cluster recovery, not bidirectional sync. | Seconds (intra-cluster only) |
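
For the CNPG row, cross-region WAL streaming is expressed as a CNPG replica cluster in Region B pointing at Region A. The API (postgresql.cnpg.io/v1) is real; cluster, host, and secret names are illustrative:

```yaml
# CNPG replica cluster in Region B streaming WAL from Region A.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-region-b
spec:
  instances: 3
  replica:
    enabled: true                  # continuous standby of the source
    source: app-db-region-a
  bootstrap:
    pg_basebackup:
      source: app-db-region-a
  externalClusters:
    - name: app-db-region-a
      connectionParameters:
        host: app-db-rw.region-a.internal   # reachable over the WireGuard mesh
        user: streaming_replica
        sslmode: verify-full
      password:
        name: app-db-replica-credentials
        key: password
```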

3. Progressive delivery

3.1 Canary deployments

Flagger is the planned canary controller (currently a "components to watch" addition, see TECHNOLOGY-FORECAST-2027-2030.md):

  • Flux-native integration
  • Automatic rollback on metric degradation (latency, error rate)
  • No ArgoCD dependency

When added, Flagger is intended to run as per-host-cluster infrastructure on each rtz cluster, with per-Application canary configuration in the Application Blueprint. Status: design — not yet a deployed Blueprint.
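
A sketch of what the intended per-Application canary config could look like, using Flagger's real CRD (flagger.app/v1beta1); the target, port, and thresholds are illustrative:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5                   # roll back after 5 failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate # Flagger built-in metric
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration     # Flagger built-in metric (ms)
        thresholdRange:
          max: 500
        interval: 1m
```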

3.2 Feature flags

Flipt is the planned feature-flag service (also "components to watch"):

  • Self-hosted, zero-cost
  • Simple SDK integration (Go, TypeScript, Python)
  • Gradual rollout control

Status: design — not yet a deployed Blueprint.


4. Auto-remediation

4.1 Architecture

Gitea Actions workflows, triggered by Alertmanager webhooks, handle automated incident response.

```mermaid
flowchart LR
    Alert[Alert Fires] --> AM[Alertmanager]
    AM --> GA[Gitea Actions]
    GA --> Remediate[Auto-Remediate]
    Remediate --> Verify[Verify Fix]
    Verify -->|Success| Resolve[Resolve Alert]
    Verify -->|Failure| Log[Log for Analysis]
```
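
On the receiving end, a remediation workflow might look like the sketch below. Gitea Actions uses GitHub-Actions-compatible syntax; the workflow path, input names, runner label, and kubectl steps are all illustrative, and the runner is assumed to hold a scoped kubeconfig:

```yaml
# .gitea/workflows/remediate-crashloop.yaml — illustrative sketch only.
name: remediate-crashloop
on:
  workflow_dispatch:
    inputs:
      namespace:
        required: true
      app:
        required: true
jobs:
  remediate:
    runs-on: sovereign-runner
    steps:
      - name: Delete crash-looping pods so the controller reschedules them
        run: |
          kubectl delete pods -n "${{ inputs.namespace }}" \
            -l app.kubernetes.io/name="${{ inputs.app }}"
      - name: Verify replacements become Ready
        run: |
          kubectl wait pods -n "${{ inputs.namespace }}" \
            -l app.kubernetes.io/name="${{ inputs.app }}" \
            --for=condition=Ready --timeout=120s
```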

4.2 Alert-to-action mapping

Catalyst control plane

| Alert | Auto-action | Verification |
|---|---|---|
| HighMemoryUsage | Scale up deployment | Check memory |
| PodCrashLoopBackOff | Restart pod | Check pod status |
| HighErrorRate | Trigger rollback | Check error rate |
| DatabaseConnectionExhausted | Restart PgBouncer | Check connections |
| CertificateExpiringSoon | Trigger renewal | Check expiry |
| HighLatency | Scale service | Check latency |
| GslbEndpointDown | Check k8gb status | Verify DNS |
| OpenBaoSealed | Auto-unseal via SPIRE-attested unseal keys | Check unseal status |
| JetstreamLagHigh | Add JetStream consumer replica | Check consumer lag |
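
As an example of the wiring, the HighMemoryUsage row could be backed by a PrometheusRule (real CRD, monitoring.coreos.com/v1) whose labels route it to the gitea-actions receiver in §12; the expression, threshold, and label names are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: catalyst-remediation
  namespace: catalyst-grafana
spec:
  groups:
    - name: catalyst.remediation
      rules:
        - alert: HighMemoryUsage
          expr: |
            sum by (pod) (container_memory_working_set_bytes{namespace="catalyst-system"})
              / sum by (pod) (kube_pod_container_resource_limits{resource="memory",namespace="catalyst-system"})
              > 0.9
          for: 5m
          labels:
            severity: warning
            remediation: scale-up    # routes to the gitea-actions receiver (§12)
          annotations:
            summary: "{{ $labels.pod }} above 90% of its memory limit"
```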

AI Hub (when bp-cortex installed)

| Alert | Auto-action | Verification |
|---|---|---|
| VLLMHighLatency | Scale vLLM replicas | Check inference latency |
| VLLMOOMKilled | Reduce batch size | Check memory |
| GPUUtilizationLow | Scale down GPU pods | Check utilization |
| GPUMemoryExhausted | Evict low-priority jobs | Check GPU memory |
| MilvusQuerySlow | Rebuild index | Check query latency |
| EmbeddingQueueBacklog | Scale BGE replicas | Check queue depth |
| RAGRetrievalEmpty | Alert + log | Check retrieval quality |
| LLMGatewayQuotaExhausted | Notify user | Check quota |

Open Banking (when bp-fingate installed)

| Alert | Auto-action | Verification |
|---|---|---|
| KeycloakHighLatency | Scale Keycloak | Check auth latency |
| QuotaServiceDown | Failover to backup | Check quota service |
| BillingWebhookFailed | Retry with backoff | Check webhook status |
| TPPCertExpiring | Alert sovereign-admin team | Check certificate |

4.3 Budget control

| Threshold | Action |
|---|---|
| 80% of budget | Warning log |
| 100% of budget | Block scale-up |

5. Secret rotation

Detailed in SECURITY.md §7. Summary:

| Class | Default frequency | Method |
|---|---|---|
| Workload identity (SPIRE SVID) | 5 minutes | Auto |
| Database credentials (dynamic) | 1 hour | OpenBao DB engine + sidecar |
| API tokens, OAuth client secrets | 90 days | OpenBao + ESO + Reloader |
| Signing keys | 365 days | Manual approval (security-officer) |
| TLS certificates | cert-manager | Auto (Let's Encrypt or corporate CA) |
| User passwords (Keycloak) | User-managed + MFA | Realm policy enforces min age |
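
For the 90-day class, the OpenBao + ESO + Reloader chain looks roughly like the ExternalSecret below (external-secrets.io/v1beta1 is the real ESO API; the store name and OpenBao key path are illustrative). The consuming Deployment opts in with a secret.reloader.stakater.com/reload annotation naming the target Secret, so pods restart when ESO writes the rotated value:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-api-token
  namespace: my-app
spec:
  refreshInterval: 1h              # ESO re-reads OpenBao every hour
  secretStoreRef:
    name: openbao
    kind: ClusterSecretStore
  target:
    name: app-api-token            # the Kubernetes Secret ESO maintains
  data:
    - secretKey: token
      remoteRef:
        key: apps/my-app/api-token
```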

6. GDPR automation

| Process | Schedule | Notes |
|---|---|---|
| Data subject requests | Daily 02:00 | Application Blueprints expose DSAR endpoints |
| Data retention | Weekly Sunday 03:00 | Per-Application policy |
| Audit log cleanup | Monthly | Retains per regulatory requirement |
| Vector embedding purge | On data deletion request | If using bp-cortex |
| Chat history cleanup | Per retention policy | If using bp-relay |

GDPR automation is a Catalyst-level controller (gdpr-controller) that orchestrates Application-specific deletion via the App's exposed compliance API.
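
What the controller consumes is also design-stage; a hypothetical request object might look like this (the CRD shape and every field name are illustrative, not a shipped API):

```yaml
# Hypothetical DataSubjectRequest — gdpr-controller is design-stage.
apiVersion: catalyst.openova.io/v1alpha1
kind: DataSubjectRequest
metadata:
  name: dsar-erasure-example
  namespace: catalyst-system
spec:
  subject: keycloak-user-id        # resolved via Keycloak
  request: erasure                 # access | erasure | portability
  targets:
    - application: my-app          # Apps expose the compliance API
      environment: prod
  window: daily-0200               # batched into the 02:00 run
```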


7. Air-gap compliance

For regulated industries requiring air-gapped deployments:

```mermaid
flowchart LR
    subgraph Connected["Connected Zone"]
        Pull[Pull Images / OCI Blueprints]
    end

    subgraph DMZ["DMZ Transfer Zone"]
        Scan[Trivy + cosign verify]
        Stage[Staging Area]
    end

    subgraph AirGap["Air-Gapped Sovereign"]
        Harbor[Harbor Registry]
        Git["Gitea<br>(Blueprint mirror)"]
        Flux[Flux per vcluster]
        K8s[Per-host clusters]
    end

    Pull --> Scan
    Scan --> Stage
    Stage -->|Physical / Diode| Harbor
    Stage -->|Physical / Diode| Git
```

7.1 Prerequisites

All Catalyst control-plane components support air-gap:

  • Harbor — local registry with replication
  • SeaweedFS — local S3 + cold-tier encapsulation (cold backend can be air-gap-internal if no cloud archive is reachable)
  • Flux — reconciles from local Git
  • Velero — backups to local SeaweedFS
  • Grafana stack — self-contained observability
  • OpenBao + Keycloak — fully self-hosted; no external dependencies
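
Concretely, Flux's sources point at the local mirror rather than any external host. The API (source.toolkit.fluxcd.io/v1) is real; the URL and secret name are illustrative:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitea.internal.example/org/platform.git   # no external egress
  ref:
    branch: main
  secretRef:
    name: gitea-credentials
```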

7.2 AI Hub air-gap considerations (when bp-cortex installed)

| Component | Air-gap requirement |
|---|---|
| vLLM | Pre-download model weights to SeaweedFS |
| BGE-M3 | Pre-download embedding models |
| Milvus | No external dependencies |
| Neo4j | No external dependencies |
| NeMo Guardrails | No external dependencies |
| LangFuse | No external dependencies |

7.3 Content transfer

| Content type | Air-gap destination |
|---|---|
| Container images | Harbor |
| Helm charts | Harbor (OCI) |
| Blueprint OCI manifests | Harbor → blueprint-controller registers |
| Git repositories | Self-hosted Gitea |
| OS packages | Local mirror |
| LLM model weights | SeaweedFS |
| Embedding models | SeaweedFS |

8. Catalyst observability

8.1 Self-monitoring

The Catalyst control plane runs its own Grafana stack (Alloy + Loki + Mimir + Tempo + Grafana) in the catalyst-grafana namespace on the management cluster. Every Catalyst component emits OpenTelemetry traces, metrics, and logs.

8.2 Per-Sovereign dashboards

| Dashboard | Purpose |
|---|---|
| Sovereign Health | All Catalyst components, control-plane SLOs |
| Per-Org Footprint | Resource usage by Organization |
| Per-Environment | Application states, error budgets |
| Promotion Activity | EnvironmentPolicy-driven promotions, soak times, approver latency |
| Secret Rotation | All credentials, age, next rotation |
| Audit | Commits, RBAC events, SecretPolicy actions |

8.3 Per-Organization dashboards

Each Organization sees its own slice:

| Dashboard | Purpose |
|---|---|
| Apps Overview | All Applications across Environments |
| Budget | Cost projection, billing |
| Compliance Posture | Per-control evidence |
| Incidents | Open issues, MTTR |

9. SLOs

9.1 Catalyst control plane

| SLI | Target | Alert threshold |
|---|---|---|
| Console availability | 99.9% | <99.5% for 5m |
| API p95 latency | <500ms | >1s for 5m |
| Catalog query p95 | <200ms | >500ms for 5m |
| projector SSE delivery | <2s end-to-end | >5s for 5m |
| Gitea webhook → reconcile | <30s | >2m for 5m |
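
The console-availability row translates into a burn alert along these lines (real PrometheusRule CRD; the http_requests_total metric and job label are assumptions about the console's instrumentation):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: console-slo
  namespace: catalyst-grafana
spec:
  groups:
    - name: console.slo
      rules:
        - alert: ConsoleAvailabilityBelowSLO
          expr: |
            sum(rate(http_requests_total{job="console",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="console"}[5m])) > 0.005
          for: 5m
          labels:
            severity: critical     # maps to the <99.5% for 5m threshold
```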

9.2 AI Hub (bp-cortex)

| SLI | Target | Alert threshold |
|---|---|---|
| LLM Inference Latency (p95) | <5s | >10s for 5m |
| LLM Token Throughput | >50 tok/s | <20 tok/s for 5m |
| Embedding Latency (p95) | <100ms | >500ms for 5m |
| RAG Retrieval Latency (p95) | <500ms | >2s for 5m |
| GPU Utilization | >60% | <30% for 15m |
| Vector Search Latency (p95) | <50ms | >200ms for 5m |

9.3 Open Banking (bp-fingate)

| SLI | Target | Alert threshold |
|---|---|---|
| Auth Latency (p95) | <200ms | >500ms for 5m |
| API Availability | 99.95% | <99.5% for 5m |
| Consent Flow Success | >99% | <95% for 5m |

9.4 Data & Integration (bp-fabric)

| SLI | Target | Alert threshold |
|---|---|---|
| Kafka Produce Latency (p95) | <50ms | >200ms for 5m |
| Flink Checkpoint Duration | <30s | >60s for 5m |
| Temporal Workflow Latency (p95) | <1s | >5s for 5m |
| CDC Lag (Debezium) | <10s | >60s for 5m |
| ClickHouse Query Latency (p95) | <500ms | >2s for 5m |

9.5 Communication (bp-relay)

| SLI | Target | Alert threshold |
|---|---|---|
| Email Delivery Rate | >99.5% | <98% for 15m |
| LiveKit Call Setup (p95) | <2s | >5s for 5m |
| Matrix Message Delivery (p95) | <500ms | >2s for 5m |
| TURN Relay Success Rate | >99% | <95% for 5m |

10. GPU operations

10.1 GPU node management

```yaml
nodeSelector:
  node.kubernetes.io/gpu: "true"
  nvidia.com/gpu.product: "NVIDIA-A10"

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

10.2 GPU monitoring metrics

| Metric | Query | Purpose |
|---|---|---|
| GPU utilization | DCGM_FI_DEV_GPU_UTIL | Compute usage |
| GPU memory used | DCGM_FI_DEV_FB_USED | Memory pressure |
| GPU temperature | DCGM_FI_DEV_GPU_TEMP | Thermal throttling |
| GPU power | DCGM_FI_DEV_POWER_USAGE | Power consumption |
| SM clock | DCGM_FI_DEV_SM_CLOCK | Clock throttling |

10.3 vLLM operations

```bash
# Within an Org's vcluster, scoped to bp-cortex Application namespace
kubectl exec -n cortex deploy/vllm -- curl localhost:8000/health
kubectl exec -n cortex deploy/vllm -- curl localhost:8000/v1/models
```

10.4 KServe operations

```bash
kubectl get inferenceservices -n cortex
kubectl get inferenceservice <name> -n cortex -o jsonpath='{.status.conditions}'
kubectl patch inferenceservice <name> -n cortex --type=merge \
  -p '{"spec":{"predictor":{"minReplicas":2}}}'
```

11. Vector database operations

11.1 Milvus health checks

```bash
kubectl exec -n cortex milvus-proxy-0 -- curl localhost:9091/healthz
curl -X GET "http://milvus.cortex.svc:19530/v1/vector/collections/<collection>/stats"
curl -X POST "http://milvus.cortex.svc:19530/v1/vector/collections/<collection>/compact"
```

11.2 Maintenance

| Task | Schedule | Command |
|---|---|---|
| Index rebuild | Weekly | collection.create_index() |
| Compaction | Daily | collection.compact() |
| Backup | Daily | Velero snapshot |
| Stats refresh | Hourly | collection.get_stats() |
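
The daily backup row maps onto a Velero Schedule (real CRD, velero.io/v1); the pre-02:00 timing and TTL below are illustrative choices:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: milvus-daily
  namespace: velero
spec:
  schedule: "0 1 * * *"            # daily, before the 02:00 GDPR jobs (§6)
  template:
    includedNamespaces:
      - cortex
    snapshotVolumes: true
    ttl: 720h                      # keep 30 days
```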

12. Alertmanager configuration

```yaml
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: https://gitea.<location-code>.<sovereign-domain>/api/v1/repos/<org>/platform/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token
        send_resolved: true

  - name: ai-oncall
    webhook_configs:
      - url: https://gitea.<location-code>.<sovereign-domain>/api/v1/repos/<org>/cortex/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token

route:
  receiver: gitea-actions
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: gitea-actions
      group_wait: 10s
    - match:
        namespace: cortex
      receiver: ai-oncall
      group_by: ['alertname', 'model']
```

13. Incident response

13.1 Severity levels

| Level | Definition | Response time |
|---|---|---|
| P1 | Catalyst control plane down (Sovereign-impacting) | 15 minutes |
| P2 | Major Application or feature broken | 1 hour |
| P3 | Minor issue | 4 hours |
| P4 | Low priority | Next business day |

13.2 Catalyst-specific incidents

| Incident | Severity | Runbook |
|---|---|---|
| Console unreachable | P1 | Check Cilium Gateway, console pods, projector pods |
| Gitea unreachable | P1 | Check Gitea pods, CNPG primary, NetworkPolicy |
| Environment-controller stuck | P1 | Check controller logs, Crossplane provider auth |
| OpenBao sealed | P1 | Auto-unseal via SPIRE — verify SPIRE server health |
| JetStream consumer lag | P2 | Add consumer replica, check disk pressure |
| projector lag | P2 | Check JetStream consumer status, projector replicas |
| Per-Org vcluster down | P2 | Check vcluster pod, host cluster capacity |
| Flux reconciliation stalled | P2 | Check source-controller logs, Git connectivity |
| New Sovereign provisioning failed | P3 | Check OpenTofu state, cloud provider quotas |

13.3 AI Hub incidents (bp-cortex)

| Incident | Severity | Runbook |
|---|---|---|
| vLLM not responding | P1 | Restart vLLM, check GPU |
| GPU OOM | P2 | Reduce batch size, scale |
| Milvus query timeout | P2 | Check index, rebuild |
| Embedding service down | P2 | Failover, restart BGE |
| RAG returning empty | P3 | Check retrieval config |

14. Runbooks

Sovereign-wide runbooks live in system/runbooks (Sovereign-admin scope). Org-specific runbooks may live in an <org>/runbooks Gitea repo (one per Org, if used). Both are version-controlled and indexed by Catalyst's runbook-controller, which surfaces them in incident-response panels.

A typical runbook:

```yaml
apiVersion: catalyst.openova.io/v1alpha1
kind: Runbook
metadata:
  name: openbao-sealed
  namespace: catalyst-system
spec:
  triggers:
    - alertName: OpenBaoSealed
  preconditions:
    - check: spire.server.healthy
    - check: openbao.unseal.shamir.shares.available
  steps:
    - run: openbao operator unseal --auto-shamir
    - verify: openbao.status.sealed == false
  rollback: page-oncall
```

Cross-reference ARCHITECTURE.md, SECURITY.md, PERSONAS-AND-JOURNEYS.md.