# Grafana Stack
LGTM observability stack (Loki, Grafana, Tempo, Mimir + Alloy collector). Per-host-cluster infrastructure (see [`docs/PLATFORM-TECH-STACK.md`](../../docs/PLATFORM-TECH-STACK.md) §3 / observability layer in §2.3) — runs on every host cluster a Sovereign owns. Catalyst's own self-monitoring uses this stack on the management cluster; Application telemetry from per-Org vclusters also flows here unless an Org installs its own observability stack.
**Status:** Accepted | **Updated:** 2026-04-27
---
## Overview
The Grafana Stack provides unified observability with:
- **Loki** - Log aggregation
- **Grafana** - Visualization
- **Tempo** - Distributed tracing
- **Mimir** - Metrics storage
- **Alloy** - Telemetry collection
---
## Architecture
```mermaid
flowchart TB
    subgraph Apps["Applications"]
        App1[App 1]
        App2[App 2]
        OTel[OTel SDK]
    end

    subgraph Alloy["Grafana Alloy"]
        Collector[Telemetry Collector]
    end

    subgraph Storage["Storage Layer"]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Mimir[Mimir<br/>Metrics]
    end

    subgraph Tier["Tiered Storage"]
        Hot[Hot: Local]
        Warm[Warm: SeaweedFS]
        Cold[Cold: R2]
    end

    subgraph UI["Visualization"]
        Grafana[Grafana]
    end

    App1 --> Collector
    App2 --> Collector
    OTel --> Collector
    Collector --> Loki
    Collector --> Tempo
    Collector --> Mimir
    Loki --> Hot
    Hot --> Warm
    Warm --> Cold
    Grafana --> Loki
    Grafana --> Tempo
    Grafana --> Mimir
```
---
## Components
| Component | Purpose | Memory |
|-----------|---------|--------|
| Grafana Alloy | Telemetry collection (OTLP, Prometheus) | 256MB |
| Loki | Log aggregation | 512MB |
| Tempo | Distributed tracing | 256MB |
| Mimir | Metrics storage | 512MB |
| Grafana | Visualization | 256MB |
---
## Tiered Storage
```mermaid
flowchart LR
    subgraph Hot["Hot (7 days)"]
        Local[Local PV]
    end

    subgraph Warm["Warm (30 days)"]
        SeaweedFS[SeaweedFS]
    end

    subgraph Cold["Cold (1 year)"]
        R2[Cloudflare R2]
    end

    Local -->|"After 7d"| SeaweedFS
    SeaweedFS -->|"After 30d"| R2
```
| Tier | Duration | Storage |
|------|----------|---------|
| Hot | 0-7 days | Local PV |
| Warm | 7-30 days | SeaweedFS |
| Cold | 30 days-1 year | Cloudflare R2 |
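
The one-year ceiling can be enforced once in Loki's compactor rather than in every bucket. A minimal sketch for logs, assuming the same Helm values layout as the Loki configuration below (the 8760h figure mirrors the cold-tier ceiling in the table; values are illustrative):

```yaml
loki:
  compactor:
    retention_enabled: true     # compactor deletes chunks past the retention period
    delete_request_store: s3    # tracks pending deletes in the same object store
  limits_config:
    retention_period: 8760h     # 1 year, matching the cold-tier ceiling above
```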
---
## Configuration
### Alloy Collector
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-config
  namespace: monitoring
data:
  config.alloy: |
    otelcol.receiver.otlp "default" {
      grpc { endpoint = "0.0.0.0:4317" }
      http { endpoint = "0.0.0.0:4318" }
      // Route received signals to the exporters below.
      output {
        logs   = [otelcol.exporter.loki.default.input]
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }

    // Convert OTLP logs to Loki entries and push them.
    otelcol.exporter.loki "default" {
      forward_to = [loki.write.default.receiver]
    }

    loki.write "default" {
      // Assumed in-cluster Loki push endpoint.
      endpoint { url = "http://loki.monitoring.svc:3100/loki/api/v1/push" }
    }

    otelcol.exporter.otlp "tempo" {
      client {
        endpoint = "tempo.monitoring.svc:4317"
        tls { insecure = true }  // in-cluster plaintext gRPC
      }
    }

    // Discover pod scrape targets, then remote-write metrics to Mimir.
    discovery.kubernetes "pods" {
      role = "pod"
    }

    prometheus.scrape "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [prometheus.remote_write.mimir.receiver]
    }

    prometheus.remote_write "mimir" {
      // Assumed in-cluster Mimir push endpoint.
      endpoint { url = "http://mimir.monitoring.svc:9009/api/v1/push" }
    }
```
### Loki with S3 Backend
```yaml
loki:
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13

  storage:
    type: s3
    s3:
      endpoint: seaweedfs.storage.svc:8333
      bucketnames: loki-data
      access_key_id: ${SEAWEEDFS_ACCESS_KEY}
      secret_access_key: ${SEAWEEDFS_SECRET_KEY}
      s3forcepathstyle: true  # path-style addressing, typically required for in-cluster S3 endpoints
```
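
### Mimir with the Same S3 Backend

Tempo and Mimir target the same endpoint, keeping SeaweedFS the single S3 surface. A minimal sketch for Mimir, assuming the mimir-distributed chart's `structuredConfig` passthrough (the `mimir-blocks` bucket name is illustrative):

```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: seaweedfs.storage.svc:8333
          access_key_id: ${SEAWEEDFS_ACCESS_KEY}
          secret_access_key: ${SEAWEEDFS_SECRET_KEY}
    blocks_storage:
      s3:
        bucket_name: mimir-blocks  # illustrative bucket name
```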
---
## OpenTelemetry Integration
Applications send telemetry via OTLP:
```yaml
# OTel auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: <org>
spec:
  exporter:
    endpoint: http://alloy.monitoring.svc:4317
  propagators:
    - tracecontext
    - baggage
```
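
Workloads opt in through the OpenTelemetry Operator's inject annotation on the pod template. A minimal sketch for a Java service (Deployment excerpt; the language suffix varies per runtime):

```yaml
spec:
  template:
    metadata:
      annotations:
        # References the Instrumentation named "default" in the same namespace
        instrumentation.opentelemetry.io/inject-java: "true"
```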
---
## Dashboards
| Dashboard | Purpose |
|-----------|---------|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |
| AI Hub Overview | LLM inference metrics |
| GPU Metrics | Utilization, memory, temperature |
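
Dashboards can ship as code through Grafana's file provisioning. A minimal sketch, assuming dashboard JSON is mounted into the container at the path below (provider name and path are illustrative):

```yaml
# Grafana dashboard provisioning (provisioning/dashboards/*.yaml)
apiVersion: 1
providers:
  - name: catalyst-dashboards   # illustrative provider name
    folder: Platform
    type: file
    options:
      path: /var/lib/grafana/dashboards  # illustrative mount path
```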
---
## Alerting
Alerts flow through Alertmanager to Gitea Actions:
```mermaid
flowchart LR
    Mimir[Mimir] -->|"Alert Rules"| AM[Alertmanager]
    AM -->|"Webhook"| GA[Gitea Actions]
    GA -->|"Auto-Remediation"| K8s[Kubernetes]
```
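
On the Alertmanager side this is a plain webhook receiver. A minimal sketch (the Gitea endpoint URL is illustrative; the actual hook path depends on how the remediation workflow is exposed):

```yaml
# alertmanager.yml excerpt
route:
  receiver: gitea-actions
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: http://gitea.gitea.svc:3000/api/actions/hooks/alerts  # illustrative endpoint
        send_resolved: true
```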
---
*Part of [OpenOva](https://openova.io)*