# Grafana Stack
LGTM observability stack (Loki, Grafana, Tempo, Mimir + Alloy collector). Per-host-cluster infrastructure (see [`docs/PLATFORM-TECH-STACK.md`](../../docs/PLATFORM-TECH-STACK.md) §3 / observability layer in §2.3) — runs on every host cluster a Sovereign owns. Catalyst's own self-monitoring uses this stack on the management cluster; Application telemetry from per-Org vclusters also flows here unless an Org installs its own observability stack.

**Status:** Accepted | **Updated:** 2026-04-27

---
## Overview
The Grafana Stack provides unified observability with:
- **Loki** - Log aggregation
- **Grafana** - Visualization
- **Tempo** - Distributed tracing
- **Mimir** - Metrics storage
- **Alloy** - Telemetry collection
---
## Architecture
```mermaid
flowchart TB
    subgraph Apps["Applications"]
        App1[App 1]
        App2[App 2]
        OTel[OTel SDK]
    end
    subgraph Alloy["Grafana Alloy"]
        Collector[Telemetry Collector]
    end
    subgraph Storage["Storage Layer"]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Mimir[Mimir<br/>Metrics]
    end
    subgraph Tier["Tiered Storage"]
        Hot[Hot: Local PV]
        Warm[Warm: SeaweedFS]
        Cold[Cold: Cloudflare R2]
    end
    subgraph UI["Visualization"]
        Grafana[Grafana]
    end
    App1 --> Collector
    App2 --> Collector
    OTel --> Collector
    Collector --> Loki
    Collector --> Tempo
    Collector --> Mimir
    Loki --> Hot
    Tempo --> Hot
    Mimir --> Hot
    Hot --> Warm
    Warm --> Cold
    Grafana --> Loki
    Grafana --> Tempo
    Grafana --> Mimir
```
---
## Components
| Component | Purpose | Memory |
|-----------|---------|--------|
| Grafana Alloy | Telemetry collection (OTLP, Prometheus) | 256MB |
| Loki | Log aggregation | 512MB |
| Tempo | Distributed tracing | 256MB |
| Mimir | Metrics storage | 512MB |
| Grafana | Visualization | 256MB |
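
As one illustration, the memory budgets above could be expressed as container resource limits. This is a hypothetical Helm values fragment, not the shipped configuration — the `alloy` key and the CPU request are assumptions:

```yaml
# Hypothetical values fragment for the Alloy deployment;
# the memory limit mirrors the table above.
alloy:
  resources:
    requests:
      cpu: 100m        # assumed; not specified by the table
      memory: 256Mi
    limits:
      memory: 256Mi
```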
---
## Tiered Storage
```mermaid
flowchart LR
    subgraph Hot["Hot (7 days)"]
        Local[Local PV]
    end
    subgraph Warm["Warm (30 days)"]
        SeaweedFS[SeaweedFS]
    end
    subgraph Cold["Cold (1 year)"]
        R2[Cloudflare R2]
    end
    Local -->|"After 7d"| SeaweedFS
    SeaweedFS -->|"After 30d"| R2
```
| Tier | Duration | Storage |
|------|----------|---------|
| Hot | 0-7 days | Local PV |
| Warm | 7-30 days | SeaweedFS |
| Cold | 30d-1 year | Cloudflare R2 |
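
The hot-tier window maps to Loki's own retention settings; the warm-to-cold transition is handled below Loki, by lifecycle rules at the SeaweedFS encapsulation boundary. A minimal sketch of the Loki side only (assuming compactor-enforced retention; the exact chart keys may differ by Loki version):

```yaml
# Hypothetical retention fragment: Loki itself only knows about the
# overall retention window; tier transitions (SeaweedFS → R2) are
# lifecycle policy at the storage layer, not Loki configuration.
loki:
  compactor:
    retention_enabled: true
  limits_config:
    retention_period: 8760h   # 1 year total, per the table above
```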
---
## Configuration
### Alloy Collector
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: alloy-config
namespace: monitoring
data:
config.alloy: |
otelcol.receiver.otlp "default" {
grpc { endpoint = "0.0.0.0:4317" }
http { endpoint = "0.0.0.0:4318" }
}
otelcol.exporter.loki "default" {
forward_to = [loki.write.default.receiver]
}
otelcol.exporter.otlp "tempo" {
client { endpoint = "tempo.monitoring.svc:4317" }
}
prometheus.scrape "pods" {
targets = discovery.kubernetes.pods.targets
forward_to = [prometheus.remote_write.mimir.receiver]
}
```
### Loki with S3 Backend
```yaml
loki:
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
  storage:
    type: s3
    s3:
      endpoint: seaweedfs.storage.svc:8333
      bucketnames: loki-data
      access_key_id: ${SEAWEEDFS_ACCESS_KEY}
      secret_access_key: ${SEAWEEDFS_SECRET_KEY}
      s3ForcePathStyle: true   # SeaweedFS serves path-style S3
```
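
Tempo and Mimir point at the same SeaweedFS endpoint, keeping the single S3 surface. A hedged sketch for Tempo (the bucket name `tempo-data` and the `insecure` flag for in-cluster plaintext are assumptions):

```yaml
# Hypothetical Tempo storage fragment against the shared SeaweedFS endpoint.
tempo:
  storage:
    trace:
      backend: s3
      s3:
        endpoint: seaweedfs.storage.svc:8333
        bucket: tempo-data                    # assumed bucket name
        access_key: ${SEAWEEDFS_ACCESS_KEY}
        secret_key: ${SEAWEEDFS_SECRET_KEY}
        insecure: true                        # plaintext in-cluster traffic
```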
---
## OpenTelemetry Integration
Applications send telemetry via OTLP:
```yaml
# OTel auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: <org>
spec:
  exporter:
    endpoint: http://alloy.monitoring.svc:4317
  propagators:
    - tracecontext
    - baggage
```
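
With the `Instrumentation` resource in place, workloads opt in per pod via OpenTelemetry Operator annotations. An illustrative fragment for a Java workload (the language suffix varies by runtime):

```yaml
# Deployment pod-template fragment: "default" names the Instrumentation
# resource in the same namespace as the workload.
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
```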
---
## Dashboards
| Dashboard | Purpose |
|-----------|---------|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |
| AI Hub Overview | LLM inference metrics |
| GPU Metrics | Utilization, memory, temperature |
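
Dashboards like these are commonly provisioned as labeled ConfigMaps discovered by Grafana's sidecar; the `grafana_dashboard` label key below assumes that common sidecar convention rather than anything specified here:

```yaml
# Hypothetical dashboard-provisioning ConfigMap (sidecar convention).
apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # discovery label assumed by the sidecar
data:
  platform-overview.json: |
    { "title": "Platform Overview", "panels": [] }
```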
---
## Alerting
Alerts flow through Alertmanager to Gitea Actions:
```mermaid
flowchart LR
    Mimir[Mimir] -->|"Alert Rules"| AM[Alertmanager]
    AM -->|"Webhook"| GA[Gitea Actions]
    GA -->|"Auto-Remediation"| K8s[Kubernetes]
```
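
The webhook hop could be wired as a plain Alertmanager receiver; a minimal sketch, with the Gitea Actions endpoint left as a placeholder since how the trigger is exposed is not specified here:

```yaml
# Hypothetical Alertmanager fragment; only shows the webhook receiver shape.
route:
  receiver: gitea-actions
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: "<gitea-actions-webhook-endpoint>"   # placeholder, not a real path
```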
---
*Part of [OpenOva](https://openova.io)*