Component-level architectural correction (two changes):

1. **MinIO → SeaweedFS as the unified S3 encapsulation layer.** The old design used MinIO for in-cluster S3, with separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (`seaweedfs.storage.svc:8333`). SeaweedFS internally handles the hot tier (in-cluster NVMe), the warm tier (in-cluster bulk), and the cold tier (transparent passthrough to cloud archival storage: Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). The result is one audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore: Velero, the CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, the Harbor blob store, and Application buckets all share one S3 surface.

2. **Apache Guacamole added as Application Blueprint §4.5 Communication.** A clientless browser-based RDP/VNC/SSH/kubectl-exec gateway with Keycloak SSO and full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN + native-client distribution for auditable remote access.

Component changes:
- DELETED: platform/minio/
- CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section)
- CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings)

Doc updates: PLATFORM-TECH-STACK §1, §3.5, §4.5, §5, §7.4; TECHNOLOGY-FORECAST L11 + mandatory and a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5, §7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count.
Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero, which now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric.

UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts (minio entry replaced with seaweedfs; velero and harbor deps updated; new guacamole entry added).

VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer.

Verification: zero remaining MinIO references in canonical docs (one intentional retention in TECHNOLOGY-FORECAST L37 explaining the swap); 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole.
# Grafana Stack

LGTM observability stack (Loki, Grafana, Tempo, Mimir + Alloy collector). Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3 / observability layer in §2.3): it runs on every host cluster a Sovereign owns. Catalyst's own self-monitoring uses this stack on the management cluster; Application telemetry from per-Org vclusters also flows here unless an Org installs its own observability stack.

**Status:** Accepted | **Updated:** 2026-04-27

## Overview
The Grafana Stack provides unified observability with:
- Loki - Log aggregation
- Grafana - Visualization
- Tempo - Distributed tracing
- Mimir - Metrics storage
- Alloy - Telemetry collection
## Architecture

```mermaid
flowchart TB
    subgraph Apps["Applications"]
        App1[App 1]
        App2[App 2]
        OTel[OTel SDK]
    end
    subgraph Alloy["Grafana Alloy"]
        Collector[Telemetry Collector]
    end
    subgraph Storage["Storage Layer"]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Mimir[Mimir<br/>Metrics]
    end
    subgraph Tier["Tiered Storage"]
        Hot[Hot: Local PV]
        Warm[Warm: SeaweedFS]
        Cold[Cold: Cloudflare R2]
    end
    subgraph UI["Visualization"]
        Grafana[Grafana]
    end
    App1 --> Collector
    App2 --> Collector
    OTel --> Collector
    Collector --> Loki
    Collector --> Tempo
    Collector --> Mimir
    Loki --> Hot
    Hot --> Warm
    Warm --> Cold
    Grafana --> Loki
    Grafana --> Tempo
    Grafana --> Mimir
```
## Components
| Component | Purpose | Memory |
|---|---|---|
| Grafana Alloy | Telemetry collection (OTLP, Prometheus) | 256MB |
| Loki | Log aggregation | 512MB |
| Tempo | Distributed tracing | 256MB |
| Mimir | Metrics storage | 512MB |
| Grafana | Visualization | 256MB |
## Tiered Storage

```mermaid
flowchart LR
    subgraph Hot["Hot (7 days)"]
        Local[Local PV]
    end
    subgraph Warm["Warm (30 days)"]
        SeaweedFS[SeaweedFS]
    end
    subgraph Cold["Cold (1 year)"]
        R2[Cloudflare R2]
    end
    Local -->|"After 7d"| SeaweedFS
    SeaweedFS -->|"After 30d"| R2
```
| Tier | Duration | Storage |
|---|---|---|
| Hot | 0-7 days | Local PV |
| Warm | 7-30 days | SeaweedFS |
| Cold | 30d-1 year | Cloudflare R2 |
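The retention boundaries above can be sketched as a small routing function. This is illustrative only: in practice Loki/Tempo/Mimir ship blocks to SeaweedFS themselves and SeaweedFS lifecycle rules handle the cold tier; the tier names here are not an API of any component.

```python
def tier_for_age(age_days: float) -> str:
    """Map an object's age to the storage tier from the retention table."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    if age_days <= 7:
        return "hot"      # local PV, 0-7 days
    if age_days <= 30:
        return "warm"     # SeaweedFS, 7-30 days
    if age_days <= 365:
        return "cold"     # Cloudflare R2, 30 days-1 year
    return "expired"      # past the 1-year retention window


print(tier_for_age(3), tier_for_age(14), tier_for_age(200), tier_for_age(400))
```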
## Configuration

### Alloy Collector

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-config
  namespace: monitoring
data:
  config.alloy: |
    otelcol.receiver.otlp "default" {
      grpc { endpoint = "0.0.0.0:4317" }
      http { endpoint = "0.0.0.0:4318" }
      output {
        logs   = [otelcol.exporter.loki.default.input]
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }

    otelcol.exporter.loki "default" {
      forward_to = [loki.write.default.receiver]
    }

    loki.write "default" {
      endpoint { url = "http://loki.monitoring.svc:3100/loki/api/v1/push" }
    }

    otelcol.exporter.otlp "tempo" {
      client { endpoint = "tempo.monitoring.svc:4317" }
    }

    discovery.kubernetes "pods" {
      role = "pod"
    }

    prometheus.scrape "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [prometheus.remote_write.mimir.receiver]
    }

    prometheus.remote_write "mimir" {
      endpoint { url = "http://mimir.monitoring.svc:8080/api/v1/push" }
    }
```

The push URLs assume the default in-cluster service names and ports for Loki and Mimir; adjust to the deployed services.
### Loki with S3 Backend

```yaml
loki:
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
  storage:
    type: s3
    s3:
      endpoint: seaweedfs.storage.svc:8333
      bucketnames: loki-data
      access_key_id: ${SEAWEEDFS_ACCESS_KEY}
      secret_access_key: ${SEAWEEDFS_SECRET_KEY}
      s3ForcePathStyle: true  # path-style addressing, typically required for non-AWS S3 endpoints such as SeaweedFS
```
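Tempo and Mimir point at the same SeaweedFS endpoint. A hedged sketch of the equivalent Tempo trace-storage values follows; field names are per Tempo's S3 backend and should be verified against the chart version in use:

```yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        endpoint: seaweedfs.storage.svc:8333
        bucket: tempo-data
        access_key: ${SEAWEEDFS_ACCESS_KEY}
        secret_key: ${SEAWEEDFS_SECRET_KEY}
        insecure: true         # plaintext in-cluster traffic
        forcepathstyle: true   # non-AWS S3 endpoint
```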
## OpenTelemetry Integration
Applications send telemetry via OTLP:
```yaml
# OTel auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: <org>
spec:
  exporter:
    endpoint: http://alloy.monitoring.svc:4317
  propagators:
    - tracecontext
    - baggage
```
## Dashboards
| Dashboard | Purpose |
|---|---|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |
| AI Hub Overview | LLM inference metrics |
| GPU Metrics | Utilization, memory, temperature |
## Alerting
Alerts flow through Alertmanager to Gitea Actions:
```mermaid
flowchart LR
    Mimir[Mimir] -->|"Alert Rules"| AM[Alertmanager]
    AM -->|"Webhook"| GA[Gitea Actions]
    GA -->|"Auto-Remediation"| K8s[Kubernetes]
```
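Alertmanager posts a JSON document to the webhook receiver; the Gitea Actions side only needs to pull out the firing alerts' names and labels to select a remediation workflow. A minimal stdlib-only sketch of that parsing step (the payload shape follows Alertmanager's webhook format; `triage` and the remediation mapping are hypothetical names, not part of any component here):

```python
import json


def triage(payload: str) -> list[dict]:
    """Extract firing alerts from an Alertmanager webhook payload."""
    body = json.loads(payload)
    return [
        {"name": a["labels"].get("alertname", "unknown"),
         "severity": a["labels"].get("severity", "none")}
        for a in body.get("alerts", [])
        if a.get("status") == "firing"
    ]


example = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "LokiIngesterDown", "severity": "critical"}},
        {"status": "resolved", "labels": {"alertname": "HighLatency", "severity": "warning"}},
    ],
})
print(triage(example))  # only the firing alert survives triage
```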
Part of OpenOva