Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)
Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.
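A per-Sovereign overlay then only needs to pin the registry; a minimal sketch of such a values overlay (file name is illustrative, and the <sovereign-fqdn> placeholder is left unresolved):

# values-sovereign.yaml (illustrative)
global:
  imageRegistry: harbor.<sovereign-fqdn>   # routes every image pull through the Sovereign's Harbor proxy_cache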
Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Grafana Stack
LGTM observability stack (Loki, Grafana, Tempo, Mimir, plus the Alloy collector). This is per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3 and the observability layer in §2.3): it runs on every host cluster a Sovereign owns. Catalyst's own self-monitoring uses this stack on the management cluster, and application telemetry from per-Org vclusters also flows here unless an Org installs its own observability stack.
Status: Accepted | Updated: 2026-04-27
Overview
The Grafana Stack provides unified observability with:
- Loki - Log aggregation
- Grafana - Visualization
- Tempo - Distributed tracing
- Mimir - Metrics storage
- Alloy - Telemetry collection
Architecture
flowchart TB
    subgraph Apps["Applications"]
        App1[App 1]
        App2[App 2]
        OTel[OTel SDK]
    end
    subgraph Alloy["Grafana Alloy"]
        Collector[Telemetry Collector]
    end
    subgraph Storage["Storage Layer"]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Mimir[Mimir<br/>Metrics]
    end
    subgraph Tier["Tiered Storage"]
        Hot[Hot: Local PV]
        Warm[Warm: SeaweedFS]
        Cold[Cold: R2]
    end
    subgraph UI["Visualization"]
        Grafana[Grafana]
    end
    App1 --> Collector
    App2 --> Collector
    OTel --> Collector
    Collector --> Loki
    Collector --> Tempo
    Collector --> Mimir
    Loki --> Hot
    Hot --> Warm
    Warm --> Cold
    Grafana --> Loki
    Grafana --> Tempo
    Grafana --> Mimir
Components
| Component | Purpose | Memory |
|---|---|---|
| Grafana Alloy | Telemetry collection (OTLP, Prometheus) | 256MB |
| Loki | Log aggregation | 512MB |
| Tempo | Distributed tracing | 256MB |
| Mimir | Metrics storage | 512MB |
| Grafana | Visualization | 256MB |
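In Kubernetes terms these figures map to container memory requests. A sketch of how they might be expressed in chart values, assuming per-component resources blocks are exposed (the key paths are illustrative, not the actual bp-grafana values layout):

alloy:
  resources:
    requests:
      memory: 256Mi   # Tempo and Grafana analogous at 256Mi
loki:
  resources:
    requests:
      memory: 512Mi   # Mimir analogous at 512Mi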
Tiered Storage
flowchart LR
    subgraph Hot["Hot (7 days)"]
        Local[Local PV]
    end
    subgraph Warm["Warm (30 days)"]
        SeaweedFS[SeaweedFS]
    end
    subgraph Cold["Cold (1 year)"]
        R2[Cloudflare R2]
    end
    Local -->|"After 7d"| SeaweedFS
    SeaweedFS -->|"After 30d"| R2
| Tier | Duration | Storage |
|---|---|---|
| Hot | 0-7 days | Local PV |
| Warm | 7-30 days | SeaweedFS |
| Cold | 30 days to 1 year | Cloudflare R2 |
Configuration
Alloy Collector
apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-config
  namespace: monitoring
data:
  config.alloy: |
    // OTLP receiver for application telemetry (gRPC 4317, HTTP 4318),
    // routed to the Loki and Tempo exporters below.
    otelcol.receiver.otlp "default" {
      grpc { endpoint = "0.0.0.0:4317" }
      http { endpoint = "0.0.0.0:4318" }
      output {
        logs   = [otelcol.exporter.loki.default.input]
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }
    // Logs go to Loki, traces to Tempo, scraped metrics to Mimir.
    otelcol.exporter.loki "default" {
      forward_to = [loki.write.default.receiver]
    }
    otelcol.exporter.otlp "tempo" {
      client { endpoint = "tempo.monitoring.svc:4317" }
    }
    prometheus.scrape "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [prometheus.remote_write.mimir.receiver]
    }
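The exporters above reference loki.write, prometheus.remote_write, and discovery.kubernetes components that the excerpt omits. A minimal sketch of those missing pieces; the in-cluster service endpoints and ports are assumptions:

    // Appended to config.alloy above; endpoints are assumed, adjust to the real Loki/Mimir Services.
    discovery.kubernetes "pods" {
      role = "pod"
    }
    loki.write "default" {
      endpoint { url = "http://loki.monitoring.svc:3100/loki/api/v1/push" }
    }
    prometheus.remote_write "mimir" {
      endpoint { url = "http://mimir.monitoring.svc:8080/api/v1/push" }
    }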
Loki with S3 Backend
loki:
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
  storage:
    type: s3
    s3:
      endpoint: seaweedfs.storage.svc:8333
      bucketnames: loki-data
      access_key_id: ${SEAWEEDFS_ACCESS_KEY}
      secret_access_key: ${SEAWEEDFS_SECRET_KEY}
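The ${SEAWEEDFS_ACCESS_KEY} and ${SEAWEEDFS_SECRET_KEY} placeholders are expanded from environment variables at runtime, which requires Loki to run with -config.expand-env=true. A sketch of a Secret that could back those variables; the Secret name and how the chart injects it as env vars are assumptions:

apiVersion: v1
kind: Secret
metadata:
  name: loki-seaweedfs-credentials   # illustrative name
  namespace: monitoring
type: Opaque
stringData:
  SEAWEEDFS_ACCESS_KEY: <access-key>
  SEAWEEDFS_SECRET_KEY: <secret-key>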
OpenTelemetry Integration
Applications send telemetry via OTLP:
# OTel auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: <org>
spec:
  exporter:
    endpoint: http://alloy.monitoring.svc:4317
  propagators:
    - tracecontext
    - baggage
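Workloads opt in to auto-instrumentation with a pod annotation naming the language agent to inject; a sketch for a Java workload (the Deployment itself is illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api               # illustrative workload
  namespace: <org>
spec:
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"   # or inject-python, inject-nodejs, ...
    spec:
      containers:
        - name: app
          image: example-api:latest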
Dashboards
| Dashboard | Purpose |
|---|---|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |
| AI Hub Overview | LLM inference metrics |
| GPU Metrics | Utilization, memory, temperature |
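Dashboards like these are typically provisioned declaratively. One common pattern, assuming the Grafana dashboard sidecar is enabled in the chart, is one labelled ConfigMap per dashboard (label key and JSON content are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-platform-overview
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana sidecar, if enabled
data:
  platform-overview.json: |
    { "title": "Platform Overview", "panels": [] }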
Alerting
Alerts flow through Alertmanager to Gitea Actions:
flowchart LR
    Mimir[Mimir] -->|"Alert Rules"| AM[Alertmanager]
    AM -->|"Webhook"| GA[Gitea Actions]
    GA -->|"Auto-Remediation"| K8s[Kubernetes]
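A sketch of the Alertmanager side of this flow; the receiver name and the Gitea Actions webhook URL are assumptions, not actual platform endpoints:

route:
  receiver: gitea-actions
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: http://gitea.git.svc:3000/api/actions/hooks/alerts   # hypothetical endpoint
        send_resolved: true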
Part of OpenOva