§2.5 (Data replication patterns) line 106 had: Gitea | Bidirectional
mirror + CNPG primary-replica. Direct architectural contradiction with
platform/gitea/README.md "Multi-Region Strategy" which EXPLICITLY rejects
bidirectional mirror (write-conflict semantics, EnvironmentPolicy
enforcement). The canonical pattern is intra-cluster HA replicas +
CNPG primary-replica on the mgt cluster only — DR for Gitea is via
mgt-cluster recovery, not cross-region sync.
Same drift category as Pass 7 (component READMEs active-active) and
Pass 26 (BUSINESS-STRATEGY active-active OpenBao). The "active-active
for everything stateful" mental model survived in this row.
Fixed to canonical wording with inline pointer to gitea README.
SRE.md §1-§14 deep re-scan otherwise clean. §7.1 framing nit ("All
Catalyst control-plane components" lists per-host-cluster infra too)
flagged but not fixed — reads as "Catalyst-managed" in context.
§14 Runbooks <org>/runbooks path-placeholder unambiguous in context;
optional tightening flagged.
platform/keda/README.md: clean. mimir.monitoring.svc reference is the
per-host-cluster Mimir collector (consistent with dual-categorization
Pass 38 documented), not contradicting SRE §8.1's catalyst-grafana
namespace which is per-Sovereign self-monitoring.
538 lines
18 KiB
Markdown
# SRE Handbook

**Status:** Authoritative target playbook. **Updated:** 2026-04-27.
**Implementation:** Most automation described (alert webhooks, Runbook CRDs, failover-controller actions) is design-stage. See [`IMPLEMENTATION-STATUS.md`](IMPLEMENTATION-STATUS.md). Existing Sovereign deployments may rely on simpler manual procedures until the automation lands.

Site Reliability Engineering practices for Catalyst Sovereigns. Defer to [`GLOSSARY.md`](GLOSSARY.md) for terminology, [`ARCHITECTURE.md`](ARCHITECTURE.md) for the model, [`SECURITY.md`](SECURITY.md) for credentials and identity.

---

## 1. Overview

This handbook covers running a Sovereign in production: multi-region topology, progressive delivery, auto-remediation, secret rotation, GDPR automation, air-gap considerations, SLOs, GPU operations, and incident response. Audience: `sovereign-admin` and SRE personas across SME-style and corporate-style Sovereigns.

---

## 2. Multi-region strategy

### 2.1 Architecture

Multi-region is **strongly recommended** for production-tier Sovereigns. Two or more independent host clusters across regions provide geographic redundancy with automatic failover.

Clusters are named by **building block** (functional security zone), not by failover role — there is **no "primary" or "DR" designation**. Both clusters run the same building blocks symmetrically; k8gb and GSLB handle traffic distribution. After a failover event, the surviving cluster serves all traffic — its name does not change.

See [`NAMING-CONVENTION.md`](NAMING-CONVENTION.md) §1.3.

```mermaid
flowchart TB
    subgraph RegionA["Region A (e.g. hz-fsn-rtz-prod)"]
        K8s1[Restricted Trust Zone Cluster]
        Stack1[Per-Org vclusters + workloads]
    end

    subgraph RegionB["Region B (e.g. hz-hel-rtz-prod)"]
        K8s2[Restricted Trust Zone Cluster]
        Stack2[Per-Org vclusters + workloads]
    end

    subgraph GSLB["Global Load Balancing"]
        k8gb[k8gb Authoritative DNS]
        Witness["Cloud witness<br>(lease for split-brain protection)"]
    end

    K8s1 <-->|"WireGuard"| K8s2
    K8s1 --> k8gb
    K8s2 --> k8gb
    k8gb -.-> Witness
```

### 2.2 Key principles

- Each cluster survives independently during network partition — no shared control plane.
- **No stretched clusters** (avoids split-brain). This applies to OpenBao, JetStream, etcd, and any other quorum-based component.
- Both clusters are peers — neither is designated primary or DR.
- Async data replication (eventual consistency).
- k8gb as authoritative DNS for GSLB zone — removes unhealthy endpoints automatically.
- Cloud witness (lease) for split-brain protection — see §2.4.

### 2.3 Cross-region networking

| Option | Use case |
|---|---|
| WireGuard mesh | Different providers, secure overlay |
| Native peering | Same provider (lower latency, e.g. Hetzner vSwitch, OCI FastConnect) |

### 2.4 Split-brain protection

The Failover Controller uses a **cloud witness** for lease-based authority:

| Component | Role |
|---|---|
| Cloud witness | Holds the lease — typically Cloudflare KV (cheap, globally distributed) or another out-of-Sovereign storage |
| Failover Controller | Per-cluster controller managing readiness and gating endpoints |

**Witness pattern:**
- The active region holds a lease (renews every 10s, TTL 30s).
- Standby regions cannot become active while a valid lease exists.
- Network partition between regions: if both can still reach the witness, the active region keeps renewing, so no split-brain occurs.
- Witness unreachable: the failover controller falls back to a multi-resolver DNS quorum (8.8.8.8 + 1.1.1.1 + 9.9.9.9, 2-of-3).
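
The witness pattern above can be sketched as a small lease state machine. This is a minimal illustration, not the Failover Controller's implementation: the `Witness` class is a hypothetical in-memory stand-in for the out-of-Sovereign lease store, and `dns_quorum` assumes each resolver yields a simple boolean verdict, which the handbook does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

RENEW_INTERVAL_S = 10  # renewal cadence from the witness pattern
LEASE_TTL_S = 30       # lease lifetime from the witness pattern


@dataclass
class Lease:
    holder: str        # region currently holding authority
    expires_at: float  # absolute expiry time in seconds


class Witness:
    """Hypothetical in-memory stand-in for the out-of-Sovereign lease store."""

    def __init__(self) -> None:
        self.lease: Optional[Lease] = None

    def try_acquire(self, region: str, now: float) -> bool:
        """Grant or renew the lease unless another region holds a valid one."""
        lease = self.lease
        if lease is not None and lease.holder != region and lease.expires_at > now:
            return False  # a valid lease exists elsewhere: stay standby
        self.lease = Lease(holder=region, expires_at=now + LEASE_TTL_S)
        return True


def dns_quorum(resolver_votes: Sequence[bool], needed: int = 2) -> bool:
    """Fallback when the witness is unreachable: 2-of-3 resolvers must agree."""
    return sum(1 for vote in resolver_votes if vote) >= needed


witness = Witness()
witness.try_acquire("region-a", now=0)                 # region-a becomes active
witness.try_acquire("region-a", now=RENEW_INTERVAL_S)  # renewal moves expiry to t=40
standby_blocked = not witness.try_acquire("region-b", now=39)  # lease still valid
standby_promoted = witness.try_acquire("region-b", now=40)     # lease lapsed
```

The key property is that authority changes hands only after the TTL elapses without a renewal, so a standby region that merely loses sight of its peer cannot promote itself.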

**Three layers controlled** by the failover controller:

| Layer | Mechanism |
|---|---|
| External traffic (Gateway API → k8gb) | HTTPRoute readiness toggling |
| Internal traffic (Cilium Cluster Mesh) | Service endpoint manipulation |
| Stateful services (CNPG, FerretDB, Strimzi) | Database promotion signaling |

**Modes:** automatic | semi-automatic | manual (regulated tier).

### 2.5 Data replication patterns

These apply to stateful components — Application Blueprints (data services) **and** per-host-cluster infrastructure with state (MinIO, Harbor). The Catalyst control plane's own state-bearing components use different patterns: see [`SECURITY.md`](SECURITY.md) for OpenBao and [`ARCHITECTURE.md`](ARCHITECTURE.md) for NATS JetStream.

| Component | Layer | Replication method | RPO |
|---|---|---|---|
| CNPG (PostgreSQL) | Application Blueprint | WAL streaming to async standby | Near-zero |
| FerretDB | Application Blueprint | Via CNPG WAL streaming | Near-zero |
| Strimzi/Kafka | Application Blueprint | MirrorMaker2 | Seconds |
| Valkey | Application Blueprint | REPLICAOF | Seconds |
| ClickHouse | Application Blueprint | ReplicatedMergeTree | Seconds |
| OpenSearch | Application Blueprint | Cross-cluster replication | Seconds |
| Milvus | Application Blueprint | Collection sync | Minutes |
| Neo4j | Application Blueprint | Causal cluster replication | Seconds |
| MinIO | Per-host-cluster infra | Bucket replication | Minutes |
| Harbor | Per-host-cluster infra | Registry replication | Minutes |
| Gitea | Catalyst control plane | Intra-cluster HA replicas + CNPG primary-replica (NOT cross-region mirror — see [platform/gitea/README.md](../platform/gitea/README.md) §"Multi-Region Strategy"). DR for Gitea is via mgt-cluster recovery, not bidirectional sync. | Seconds (intra-cluster only) |

---

## 3. Progressive delivery

### 3.1 Canary deployments

[Flagger](https://flagger.app) is the planned canary controller (currently a "components to watch" addition, see [`TECHNOLOGY-FORECAST-2027-2030.md`](TECHNOLOGY-FORECAST-2027-2030.md)):

- Flux-native integration
- Automatic rollback on metric degradation (latency, error rate)
- No ArgoCD dependency

When added, Flagger is intended to live as per-host-cluster infrastructure on each `rtz` cluster, with per-Application canary configuration kept in the Application Blueprint. **Status:** design — not yet a deployed Blueprint.

### 3.2 Feature flags

[Flipt](https://flipt.io) is the planned feature-flag service (also "components to watch"):

- Self-hosted, zero-cost
- Simple SDK integration (Go, TypeScript, Python)
- Gradual rollout control

**Status:** design — not yet a deployed Blueprint.

---

## 4. Auto-remediation

### 4.1 Architecture

Gitea Actions workflows, triggered by Alertmanager webhooks, provide automated incident response.

```mermaid
flowchart LR
    Alert[Alert Fires] --> AM[Alertmanager]
    AM --> GA[Gitea Actions]
    GA --> Remediate[Auto-Remediate]
    Remediate --> Verify[Verify Fix]
    Verify -->|Success| Resolve[Resolve Alert]
    Verify -->|Failure| Log[Log for Analysis]
```

### 4.2 Alert-to-action mapping

#### Catalyst control plane

| Alert | Auto-action | Verification |
|---|---|---|
| HighMemoryUsage | Scale up deployment | Check memory |
| PodCrashLoopBackOff | Restart pod | Check pod status |
| HighErrorRate | Trigger rollback | Check error rate |
| DatabaseConnectionExhausted | Restart PgBouncer | Check connections |
| CertificateExpiringSoon | Trigger renewal | Check expiry |
| HighLatency | Scale service | Check latency |
| GslbEndpointDown | Check k8gb status | Verify DNS |
| OpenBaoSealed | Auto-unseal via SPIRE-attested unseal keys | Check unseal status |
| JetstreamLagHigh | Add JetStream consumer replica | Check consumer lag |

#### AI Hub (when bp-cortex installed)

| Alert | Auto-action | Verification |
|---|---|---|
| VLLMHighLatency | Scale vLLM replicas | Check inference latency |
| VLLMOOMKilled | Reduce batch size | Check memory |
| GPUUtilizationLow | Scale down GPU pods | Check utilization |
| GPUMemoryExhausted | Evict low-priority jobs | Check GPU memory |
| MilvusQuerySlow | Rebuild index | Check query latency |
| EmbeddingQueueBacklog | Scale BGE replicas | Check queue depth |
| RAGRetrievalEmpty | Alert + log | Check retrieval quality |
| LLMGatewayQuotaExhausted | Notify user | Check quota |

#### Open Banking (when bp-fingate installed)

| Alert | Auto-action | Verification |
|---|---|---|
| KeycloakHighLatency | Scale Keycloak | Check auth latency |
| QuotaServiceDown | Failover to backup | Check quota service |
| BillingWebhookFailed | Retry with backoff | Check webhook status |
| TPPCertExpiring | Alert sovereign-admin team | Check certificate |
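
One way to picture these tables is as a lookup from alert name to an action and verification pair, followed by the verify-or-log branch from §4.1. The sketch below is illustrative only: `REMEDIATIONS` holds a subset of the rows above, and `handle`, `run_action`, and `verify` are hypothetical names, not part of any Catalyst component.

```python
from typing import Callable, Dict, NamedTuple


class Remediation(NamedTuple):
    action: str        # auto-action column
    verification: str  # verification column


# Subset of the mapping tables, keyed by alert name.
REMEDIATIONS: Dict[str, Remediation] = {
    "HighMemoryUsage": Remediation("Scale up deployment", "Check memory"),
    "HighErrorRate": Remediation("Trigger rollback", "Check error rate"),
    "VLLMOOMKilled": Remediation("Reduce batch size", "Check memory"),
    "BillingWebhookFailed": Remediation("Retry with backoff", "Check webhook status"),
}


def handle(
    alert_name: str,
    run_action: Callable[[str], None],
    verify: Callable[[str], bool],
) -> str:
    """Run the mapped action, then resolve or log depending on verification."""
    plan = REMEDIATIONS.get(alert_name)
    if plan is None:
        return "unmapped"  # no automation defined: leave it to a human
    run_action(plan.action)
    return "resolved" if verify(plan.verification) else "logged"


executed = []
outcome = handle("HighErrorRate", executed.append, lambda check: True)
```

Keeping the mapping as data rather than branching code means a new alert only needs a new row, which mirrors how the tables themselves are maintained.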

### 4.3 Budget control

| Threshold | Action |
|---|---|
| 80% of budget | Warning log |
| 100% of budget | Block scale-up |
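
The two thresholds reduce to a single comparison on the spend-to-budget ratio. This sketch assumes the thresholds are inclusive and that spend and budget share one unit; the handbook states neither, and `budget_action` with its return labels is a hypothetical name.

```python
WARN_RATIO = 0.80   # 80% of budget: warning log
BLOCK_RATIO = 1.00  # 100% of budget: block scale-up


def budget_action(spend: float, budget: float) -> str:
    """Return the budget-control decision for the current spend."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    if ratio >= BLOCK_RATIO:
        return "block-scale-up"
    if ratio >= WARN_RATIO:
        return "warn"
    return "allow"
```

An auto-remediation that wants to scale up would call `budget_action` first and skip the scale-up when the result is `"block-scale-up"`.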

---

## 5. Secret rotation

Detailed in [`SECURITY.md`](SECURITY.md) §7. Summary:

| Class | Default frequency | Method |
|---|---|---|
| Workload identity (SPIRE SVID) | 5 minutes | Auto |
| Database credentials (dynamic) | 1 hour | OpenBao DB engine + sidecar |
| API tokens, OAuth client secrets | 90 days | OpenBao + ESO + Reloader |
| Signing keys | 365 days | Manual approval (security-officer) |
| TLS certificates | cert-manager | Auto (Let's Encrypt or corporate CA) |
| User passwords (Keycloak) | User-managed + MFA | Realm policy enforces min age |
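
For the classes with a fixed default frequency, deciding whether a credential is due comes down to comparing its age with the interval. A minimal sketch, assuming hypothetical class keys (`spire-svid`, `db-dynamic`, and so on) and leaving out TLS certificates and user passwords, whose cadence is governed elsewhere:

```python
from datetime import datetime, timedelta, timezone

# Default frequencies from the table; the dictionary keys are illustrative.
ROTATION_INTERVALS = {
    "spire-svid": timedelta(minutes=5),
    "db-dynamic": timedelta(hours=1),
    "api-token": timedelta(days=90),
    "signing-key": timedelta(days=365),
}


def rotation_due(secret_class: str, issued_at: datetime, now: datetime) -> bool:
    """True once the credential's age reaches its class interval."""
    return now - issued_at >= ROTATION_INTERVALS[secret_class]


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
token_due = rotation_due("api-token", issued, issued + timedelta(days=90))
key_due = rotation_due("signing-key", issued, issued + timedelta(days=90))
```

A check like this would feed a dashboard such as the Secret Rotation view in §8.2 rather than trigger rotation itself, since signing keys still require manual approval.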

---

## 6. GDPR automation

| Process | Schedule | Notes |
|---|---|---|
| Data subject requests | Daily 02:00 | Application Blueprints expose DSAR endpoints |
| Data retention | Weekly Sunday 03:00 | Per-Application policy |
| Audit log cleanup | Monthly | Retains per regulatory requirement |
| Vector embedding purge | On data deletion request | If using bp-cortex |
| Chat history cleanup | Per retention policy | If using bp-relay |

GDPR automation is a Catalyst-level controller (`gdpr-controller`) that orchestrates Application-specific deletion via the App's exposed compliance API.

---

## 7. Air-gap compliance

For regulated industries requiring air-gapped deployments:

```mermaid
flowchart LR
    subgraph Connected["Connected Zone"]
        Pull[Pull Images / OCI Blueprints]
    end

    subgraph DMZ["DMZ Transfer Zone"]
        Scan[Trivy + cosign verify]
        Stage[Staging Area]
    end

    subgraph AirGap["Air-Gapped Sovereign"]
        Harbor[Harbor Registry]
        Git["Gitea<br>(Blueprint mirror)"]
        Flux[Flux per vcluster]
        K8s[Per-host clusters]
    end

    Pull --> Scan
    Scan --> Stage
    Stage -->|Physical / Diode| Harbor
    Stage -->|Physical / Diode| Git
```

### 7.1 Prerequisites

All Catalyst control-plane components support air-gap:

- Harbor — local registry with replication
- MinIO — local object storage
- Flux — reconciles from local Git
- Velero — backups to local MinIO
- Grafana stack — self-contained observability
- OpenBao + Keycloak — fully self-hosted; no external dependencies

### 7.2 AI Hub air-gap considerations (when bp-cortex installed)

| Component | Air-gap requirement |
|---|---|
| vLLM | Pre-download model weights to MinIO |
| BGE-M3 | Pre-download embedding models |
| Milvus | No external dependencies |
| Neo4j | No external dependencies |
| NeMo Guardrails | No external dependencies |
| LangFuse | No external dependencies |

### 7.3 Content transfer

| Content type | Air-gap destination |
|---|---|
| Container images | Harbor |
| Helm charts | Harbor (OCI) |
| Blueprint OCI manifests | Harbor → blueprint-controller registers |
| Git repositories | Self-hosted Gitea |
| OS packages | Local mirror |
| LLM model weights | MinIO |
| Embedding models | MinIO |

---

## 8. Catalyst observability

### 8.1 Self-monitoring

The Catalyst control plane runs its own Grafana stack (Alloy + Loki + Mimir + Tempo + Grafana) in the `catalyst-grafana` namespace on the management cluster. Every Catalyst component emits OpenTelemetry traces, metrics, and logs.

### 8.2 Per-Sovereign dashboards

| Dashboard | Purpose |
|---|---|
| Sovereign Health | All Catalyst components, control-plane SLOs |
| Per-Org Footprint | Resource usage by Organization |
| Per-Environment | Application states, error budgets |
| Promotion Activity | EnvironmentPolicy-driven promotions, soak times, approver latency |
| Secret Rotation | All credentials, age, next rotation |
| Audit | Commits, RBAC events, SecretPolicy actions |

### 8.3 Per-Organization dashboards

Each Organization sees their own slice:

| Dashboard | Purpose |
|---|---|
| Apps Overview | All Applications across Environments |
| Budget | Cost projection, billing |
| Compliance Posture | Per-control evidence |
| Incidents | Open issues, MTTR |

---

## 9. SLOs

### 9.1 Catalyst control plane

| SLI | Target | Alert threshold |
|---|---|---|
| Console availability | 99.9% | <99.5% for 5m |
| API p95 latency | <500ms | >1s for 5m |
| Catalog query p95 | <200ms | >500ms for 5m |
| projector SSE delivery | <2s end-to-end | >5s for 5m |
| Gitea webhook → reconcile | <30s | >2m for 5m |
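
An availability target translates directly into an error budget, which is what the Per-Environment dashboard in §8.2 tracks. The arithmetic below is a sketch under an assumed 30-day rolling window; the handbook does not state the SLO window, and both function names are hypothetical.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over the window."""
    return (1.0 - target) * window_days * 24 * 60


def budget_consumed(downtime_minutes: float, target: float, window_days: int = 30) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return downtime_minutes / error_budget_minutes(target, window_days)


console_budget = error_budget_minutes(0.999)   # Console availability, 99.9%
fingate_budget = error_budget_minutes(0.9995)  # bp-fingate API availability, 99.95%
```

Under that assumption, the 99.9% console target allows roughly 43 minutes of downtime per window and the 99.95% target roughly half of that, so a single 20-minute outage consumes nearly the whole stricter budget.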

### 9.2 AI Hub (bp-cortex)

| SLI | Target | Alert threshold |
|---|---|---|
| LLM Inference Latency (p95) | <5s | >10s for 5m |
| LLM Token Throughput | >50 tok/s | <20 tok/s for 5m |
| Embedding Latency (p95) | <100ms | >500ms for 5m |
| RAG Retrieval Latency (p95) | <500ms | >2s for 5m |
| GPU Utilization | >60% | <30% for 15m |
| Vector Search Latency (p95) | <50ms | >200ms for 5m |

### 9.3 Open Banking (bp-fingate)

| SLI | Target | Alert threshold |
|---|---|---|
| Auth Latency (p95) | <200ms | >500ms for 5m |
| API Availability | 99.95% | <99.5% for 5m |
| Consent Flow Success | >99% | <95% for 5m |

### 9.4 Data & Integration (bp-fabric)

| SLI | Target | Alert threshold |
|---|---|---|
| Kafka Produce Latency (p95) | <50ms | >200ms for 5m |
| Flink Checkpoint Duration | <30s | >60s for 5m |
| Temporal Workflow Latency (p95) | <1s | >5s for 5m |
| CDC Lag (Debezium) | <10s | >60s for 5m |
| ClickHouse Query Latency (p95) | <500ms | >2s for 5m |

### 9.5 Communication (bp-relay)

| SLI | Target | Alert threshold |
|---|---|---|
| Email Delivery Rate | >99.5% | <98% for 15m |
| LiveKit Call Setup (p95) | <2s | >5s for 5m |
| Matrix Message Delivery (p95) | <500ms | >2s for 5m |
| TURN Relay Success Rate | >99% | <95% for 5m |

---

## 10. GPU operations

### 10.1 GPU node management

```yaml
nodeSelector:
  node.kubernetes.io/gpu: "true"
  nvidia.com/gpu.product: "NVIDIA-A10"

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

### 10.2 GPU monitoring metrics

| Metric | Query | Purpose |
|---|---|---|
| GPU utilization | `DCGM_FI_DEV_GPU_UTIL` | Compute usage |
| GPU memory used | `DCGM_FI_DEV_FB_USED` | Memory pressure |
| GPU temperature | `DCGM_FI_DEV_GPU_TEMP` | Thermal throttling |
| GPU power | `DCGM_FI_DEV_POWER_USAGE` | Power consumption |
| SM clock | `DCGM_FI_DEV_SM_CLOCK` | Clock throttling |

### 10.3 vLLM operations

```bash
# Within an Org's vcluster, scoped to bp-cortex Application namespace
kubectl exec -n cortex deploy/vllm -- curl localhost:8000/health
kubectl exec -n cortex deploy/vllm -- curl localhost:8000/v1/models
```

### 10.4 KServe operations

```bash
# List InferenceServices in the bp-cortex namespace
kubectl get inferenceservices -n cortex
# Inspect the readiness conditions of one service
kubectl get inferenceservice <name> -n cortex -o jsonpath='{.status.conditions}'
# Custom resources do not accept the default strategic merge patch, so use a merge patch
kubectl patch inferenceservice <name> -n cortex --type merge -p '{"spec":{"predictor":{"minReplicas":2}}}'
```

---

## 11. Vector database operations

### 11.1 Milvus health checks

```bash
kubectl exec -n cortex milvus-proxy-0 -- curl localhost:9091/healthz
curl -X GET "http://milvus.cortex.svc:19530/v1/vector/collections/<collection>/stats"
curl -X POST "http://milvus.cortex.svc:19530/v1/vector/collections/<collection>/compact"
```

### 11.2 Maintenance

| Task | Schedule | Command |
|---|---|---|
| Index rebuild | Weekly | `collection.create_index()` |
| Compaction | Daily | `collection.compact()` |
| Backup | Daily | Velero snapshot |
| Stats refresh | Hourly | `collection.get_stats()` |

---

## 12. Alertmanager configuration

```yaml
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: https://gitea.<location-code>.<sovereign-domain>/api/v1/repos/<org>/platform/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token
        send_resolved: true

  - name: ai-oncall
    webhook_configs:
      - url: https://gitea.<location-code>.<sovereign-domain>/api/v1/repos/<org>/cortex/actions/dispatches
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/gitea-token

route:
  receiver: gitea-actions
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      receiver: gitea-actions
      group_wait: 10s
    - match:
        namespace: cortex
      receiver: ai-oncall
      group_by: ['alertname', 'model']
```
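
On the receiving side, a workflow has to pick the firing alerts out of the webhook body. The sketch below relies on the standard Alertmanager webhook payload (a top-level `alerts` list whose entries carry `status` and `labels`); whether the Gitea endpoint above forwards that body unchanged is an assumption, and `firing_alerts` is a hypothetical helper.

```python
import json
from typing import List, Tuple


def firing_alerts(payload: str) -> List[Tuple[str, str]]:
    """Return (alertname, namespace) for each alert still firing in the payload."""
    body = json.loads(payload)
    return [
        (alert["labels"].get("alertname", ""), alert["labels"].get("namespace", ""))
        for alert in body.get("alerts", [])
        if alert.get("status") == "firing"
    ]


# Trimmed example payload; real payloads carry more fields (annotations, timestamps).
sample = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "OpenBaoSealed", "namespace": "catalyst-system"}},
        {"status": "resolved",
         "labels": {"alertname": "HighLatency", "namespace": "cortex"}},
    ],
})
active = firing_alerts(sample)
```

Because the `gitea-actions` receiver sets `send_resolved: true`, the same handler also sees resolved alerts, which is why the filter on `status` matters.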

---

## 13. Incident response

### 13.1 Severity levels

| Level | Definition | Response time |
|---|---|---|
| P1 | Catalyst control plane down (Sovereign-impacting) | 15 minutes |
| P2 | Major Application or feature broken | 1 hour |
| P3 | Minor issue | 4 hours |
| P4 | Low priority | Next business day |

### 13.2 Catalyst-specific incidents

| Incident | Severity | Runbook |
|---|---|---|
| Console unreachable | P1 | Check Cilium Gateway, console pods, projector pods |
| Gitea unreachable | P1 | Check Gitea pods, CNPG primary, NetworkPolicy |
| Environment-controller stuck | P1 | Check controller logs, Crossplane provider auth |
| OpenBao sealed | P1 | Auto-unseal via SPIRE; verify SPIRE server health |
| JetStream consumer lag | P2 | Add consumer replica, check disk pressure |
| projector lag | P2 | Check JetStream consumer status, projector replicas |
| Per-Org vcluster down | P2 | Check vcluster pod, host cluster capacity |
| Flux reconciliation stalled | P2 | Check source-controller logs, Git connectivity |
| New Sovereign provisioning failed | P3 | Check OpenTofu state, cloud provider quotas |

### 13.3 AI Hub incidents (bp-cortex)

| Incident | Severity | Runbook |
|---|---|---|
| vLLM not responding | P1 | Restart vLLM, check GPU |
| GPU OOM | P2 | Reduce batch size, scale |
| Milvus query timeout | P2 | Check index, rebuild |
| Embedding service down | P2 | Failover, restart BGE |
| RAG returning empty | P3 | Check retrieval config |

---

## 14. Runbooks

Runbooks live in the customer's `<org>/runbooks` Gitea repo, version-controlled alongside the rest of their config. Catalyst's `runbook-controller` indexes them and surfaces them in incident response panels.

A typical runbook:

```yaml
apiVersion: catalyst.openova.io/v1alpha1
kind: Runbook
metadata:
  name: openbao-sealed
  namespace: catalyst-system
spec:
  triggers:
    - alertName: OpenBaoSealed
  preconditions:
    - check: spire.server.healthy
    - check: openbao.unseal.shamir.shares.available
  steps:
    - run: openbao operator unseal --auto-shamir
    - verify: openbao.status.sealed == false
  rollback: page-oncall
```

---

*Cross-reference [`ARCHITECTURE.md`](ARCHITECTURE.md), [`SECURITY.md`](SECURITY.md), [`PERSONAS-AND-JOURNEYS.md`](PERSONAS-AND-JOURNEYS.md).*