openova/docs/SECURITY.md
hatiyildiz 7cafa3c894 docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay
2026-04-28 10:23:46 +02:00

Catalyst Security Model

Status: Authoritative target architecture. Updated: 2026-04-27. Implementation: Per-component status tracked in IMPLEMENTATION-STATUS.md. OpenBao, ESO, SPIRE, Keycloak component READMEs exist; Catalyst's integration glue is design-stage.

Identity, secrets, rotation, and multi-region credential semantics for Catalyst Sovereigns. Defer to GLOSSARY.md for terminology.


1. Identity: two systems, two purposes

| Subject | System | Token | Lifetime | What it auths |
|---|---|---|---|---|
| Workloads (every Pod, every controller) | SPIFFE/SPIRE | SVID (X.509 mTLS cert) | 5 minutes, auto-rotated | Pod ↔ Pod; Pod ↔ OpenBao; Pod ↔ NATS; Pod ↔ Catalyst APIs |
| Users (every human) | Keycloak | OIDC JWT | 15 min access / 30 day refresh | UI auth, REST/GraphQL API, Gitea, console SSE |

Two systems, never conflated. Workload identity is bound to a Kubernetes ServiceAccount. User identity is bound to a Keycloak realm subject. The two meet only at boundaries where a service acts on behalf of a user (and even then, the workload presents both: its own SVID for transport mTLS, and the user's JWT in the request body).


2. SPIFFE/SPIRE — workload identity

┌──────────────────────────────────────────────────────────────────────┐
│ Each Sovereign runs a SPIRE server (in catalyst-spire namespace)     │
│  - one HA SPIRE server per host cluster                              │
│  - upstream-bundle to a root SPIRE server in the management cluster  │
│  - issues SVIDs to a SPIRE agent on every node                       │
└──────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────────┐
│ SPIRE agent on each node                                             │
│  - exposes Workload API (Unix socket) to Pods on that node           │
│  - mints SVIDs scoped by SPIFFE ID:                                  │
│      spiffe://<sovereign>/ns/<namespace>/sa/<service-account>        │
│  - rotates every 5 minutes; Pods refresh in-memory                   │
└──────────────────────────────────────────────────────────────────────┘

SPIFFE ID examples in Catalyst:

spiffe://omantel/ns/catalyst-projector/sa/projector
spiffe://omantel/ns/catalyst-gitea/sa/gitea
spiffe://omantel/ns/muscatpharmacy/sa/wordpress     ← Application workload
spiffe://omantel/ns/catalyst-openbao/sa/openbao     ← OpenBao itself
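
The ID layout above is regular enough to build and validate mechanically. The following is a minimal Python sketch, not Catalyst or SPIRE code: `WorkloadId` and `parse_spiffe_id` are hypothetical names, and the sketch assumes the trust domain is the Sovereign name, as in the examples.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class WorkloadId:
    """Hypothetical helper type; field names are illustrative."""
    sovereign: str
    namespace: str
    service_account: str

    def uri(self) -> str:
        return (
            f"spiffe://{self.sovereign}"
            f"/ns/{self.namespace}/sa/{self.service_account}"
        )


def parse_spiffe_id(uri: str) -> WorkloadId:
    """Parse spiffe://<sovereign>/ns/<namespace>/sa/<service-account>."""
    parts = urlsplit(uri)
    # The leading "/" of the path yields an empty first segment.
    segments = parts.path.split("/")
    well_formed = (
        parts.scheme == "spiffe"
        and parts.netloc != ""
        and parts.query == ""
        and parts.fragment == ""
        and len(segments) == 5
        and segments[0] == ""
        and segments[1] == "ns"
        and segments[2] != ""
        and segments[3] == "sa"
        and segments[4] != ""
    )
    if not well_formed:
        raise ValueError(f"not a workload SPIFFE ID in the expected layout: {uri!r}")
    return WorkloadId(parts.netloc, segments[2], segments[4])


if __name__ == "__main__":
    wid = parse_spiffe_id("spiffe://omantel/ns/catalyst-gitea/sa/gitea")
    print(wid.namespace, wid.service_account)
```

Rejecting anything outside the `ns/<namespace>/sa/<service-account>` layout keeps a policy check from matching an ID it was never meant to cover.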

OpenBao and JetStream both authenticate clients by their SVID. The Catalyst REST API authenticates workloads by their SVID and users by their JWT.

Why SPIFFE over static service-account tokens:

  • Static tokens are long-lived, so a leaked token stays usable until someone revokes it. SVIDs auto-rotate every 5 minutes, which bounds the exposure window.
  • SPIFFE IDs are portable across clusters (cross-region service-to-service auth works without cross-cluster ServiceAccount sync).
  • mTLS by default — every connection is authenticated and encrypted.

3. Secrets: OpenBao + ESO

Static secrets (API tokens, passwords, signing keys, OAuth client secrets) live in OpenBao. They reach Pods via External Secrets Operator (ESO).

       OpenBao (Raft cluster, region-local)
              │
              │  ┌──────────────────────────────────────────────┐
              │  │  ExternalSecret CR in Git, in the Application │
              │  │  Gitea repo. References path in OpenBao.     │
              │  └──────────────────────────────────────────────┘
              │                          │
              │                          ▼
              │  ┌──────────────────────────────────────────────┐
              │  │  ESO (in vcluster) reads ExternalSecret CR   │
              │  │  Authenticates to OpenBao via SVID           │
              │  └──────────────────────────────────────────────┘
              │                          │
              │                          ▼
              │  ┌──────────────────────────────────────────────┐
              │  │  K8s Secret (rendered, versioned)             │
              │  │  Reloader watches hash → rolling deploy      │
              │  └──────────────────────────────────────────────┘
              │                          │
              ▼                          ▼
   (audit log + telemetry)         Pod mounts the secret

What's in Git (always):

  • ExternalSecret CR pointing at an OpenBao path
  • SecretStore CR pointing at the OpenBao endpoint
  • SecretPolicy CR (rotation rules)
  • Public keys, root CA certs (CRDs)

What's NEVER in Git:

  • Secret values (passwords, tokens, private keys, etc.)
  • OpenBao root tokens
  • Static API credentials

4. Dynamic credentials

For databases, S3, and other systems supporting short-lived credentials, OpenBao mints them on demand:

Pod                   catalyst-secret-sidecar          OpenBao (DB engine)
 │                          │                                  │
 │ "give me Postgres"      │ authenticates via SVID            │
 │─────────────────────────►│                                   │
 │                          │ mints Postgres user             │
 │                          │ TTL=1h                          │
 │                          │──────────────────────────────────►│
 │                          │ returns user/password           │
 │◄─────────────────────────│◄──────────────────────────────────│
 │
 │ connects to Postgres, opens connection pool
 │
 │ at T+50min: sidecar pre-emptively requests new creds
 │              app drains old pool, swaps to new creds
 │              no downtime
 │
 │ at T+1h: OpenBao revokes the old user

The sidecar is automatic for any Pod whose Blueprint declares dynamicSecrets: true. Apps that prefer in-process can use the Catalyst SDK directly. Apps that can't do either get a rolling restart at the TTL boundary (acceptable for low-tier workloads).
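
The renewal timing in the diagram (request new credentials at T+50min for a 1h TTL) can be sketched as follows. This is an illustrative Python model, not the catalyst-secret-sidecar implementation: `Lease`, `renew_at`, `next_action`, and the 10-minute lead time are assumptions chosen to match the numbers above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Lease:
    """A short-lived credential as a sidecar might track it (illustrative)."""
    username: str
    issued_at: datetime
    ttl: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.ttl


def renew_at(lease: Lease, lead: timedelta = timedelta(minutes=10)) -> datetime:
    """Time at which replacement credentials should be requested."""
    if lead >= lease.ttl:
        raise ValueError("lead time must be shorter than the TTL")
    return lease.expires_at - lead


def next_action(lease: Lease, now: datetime) -> str:
    """Return 'use', 'renew', or 'expired' for the lease at time `now`."""
    if now >= lease.expires_at:
        return "expired"
    if now >= renew_at(lease):
        return "renew"
    return "use"


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    lease = Lease("app-user", issued, timedelta(hours=1))
    print(renew_at(lease).isoformat())
    print(next_action(lease, issued + timedelta(minutes=55)))
```

Renewing before expiry leaves the application a window in which both credential sets are valid, which is what lets it drain the old connection pool without downtime.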

Database engines supported: PostgreSQL (CNPG), FerretDB (MongoDB-compatible), ClickHouse, Valkey, SeaweedFS/S3.


5. Multi-region OpenBao — INDEPENDENT, NOT STRETCHED

Critical: each region runs its own Raft cluster. There is no cross-region Raft quorum. Region failures are independent failure domains.

   Region A (Muscat)              Region B (Salalah)              Region C (Frankfurt DR)
   ┌──────────────────┐           ┌──────────────────┐            ┌──────────────────┐
   │ OpenBao cluster  │           │ OpenBao cluster  │            │ OpenBao cluster  │
   │ 3 Raft nodes     │           │ 3 Raft nodes     │            │ 3 Raft nodes     │
   │ INDEPENDENT      │           │ INDEPENDENT      │            │ INDEPENDENT      │
   │ Raft quorum      │           │ Raft quorum      │            │ Raft quorum      │
   └──────┬───────────┘           └──────────────────┘            └──────────────────┘
          │                                ▲                                ▲
          │ async log shipping             │ async log shipping             │
          │ (Performance Replication)      │                                │
          └────────────────────────────────┴────────────────────────────────┘
                  one-way: primary → secondaries; no cross-region quorum

5.1 Fault domain semantics

  • Each region has its own self-contained 3-node Raft cluster. Quorum is intra-region only (need 2-of-3 in the same region).
  • A total Region A failure does NOT require any other region to do anything. Region B and C continue serving reads from their local replicated data.
  • Network partition between regions: each region keeps operating independently. Standby regions are read-only by design, so write requests that originate there wait until the primary is reachable again.
  • DR promotion is explicit: it is either approved by a sovereign-admin or triggered by the failover-controller under strict criteria, never automatically on every blip.

5.2 Read/write semantics

  • Writes (rotations, new secrets) → primary OpenBao only.
  • Reads → local OpenBao replica (sub-10ms latency in same continent).
  • Replication lag <1s typical. Apps in B and C read post-rotation values without any cross-region call.
  • Region failure → DR replica promoted by the failover-controller. New writes are blocked briefly during promotion (~30s). After promotion, the DR region accepts writes.
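
The routing rules above reduce to a small decision function. The sketch below is a Python illustration under stated assumptions: `Topology`, `target_region`, and `promote` are hypothetical names, and the region identifiers are taken from the diagram in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Topology:
    """Which region is primary and which regions run an OpenBao cluster."""
    primary: str
    regions: FrozenSet[str]


def target_region(op: str, caller_region: str, topo: Topology) -> str:
    """Reads stay local; writes always go to the primary."""
    if caller_region not in topo.regions:
        raise ValueError(f"unknown region: {caller_region!r}")
    if op == "read":
        return caller_region
    if op == "write":
        return topo.primary
    raise ValueError(f"unknown operation: {op!r}")


def promote(topo: Topology, new_primary: str) -> Topology:
    """Explicit DR promotion: the failed primary leaves the topology."""
    if new_primary not in topo.regions or new_primary == topo.primary:
        raise ValueError(f"cannot promote {new_primary!r}")
    return Topology(primary=new_primary, regions=topo.regions - {topo.primary})


if __name__ == "__main__":
    topo = Topology("muscat", frozenset({"muscat", "salalah", "frankfurt"}))
    print(target_region("read", "salalah", topo))
    print(target_region("write", "salalah", topo))
    after = promote(topo, "frankfurt")
    print(target_region("write", "salalah", after))
```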

5.3 Why NOT a stretched cluster

A stretched Raft cluster (5 nodes across 3 regions, single quorum) seems superficially appealing but is fragile:

  • A single region's network blip can cause loss of quorum if 3 of 5 nodes are in the affected region.
  • Cross-region latency degrades all writes (every write needs cross-region majority ack).
  • An entire region failure can leave the cluster without quorum.

We deliberately reject this pattern. Each region is its own failure domain.
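
The quorum arithmetic behind this argument is short enough to check directly. A minimal Python sketch follows; the function names and the 3/1/1 node placement are illustrative assumptions, chosen to match the "3 of 5 nodes in the affected region" case above.

```python
from typing import Dict


def quorum(nodes: int) -> int:
    """Smallest majority of a Raft cluster with `nodes` voters."""
    return nodes // 2 + 1


def survives_region_loss(layout: Dict[str, int], lost_region: str) -> bool:
    """True if the voters outside `lost_region` still form a majority."""
    total = sum(layout.values())
    remaining = total - layout[lost_region]
    return remaining >= quorum(total)


if __name__ == "__main__":
    # Stretched cluster: 5 voters, one quorum spanning three regions.
    stretched = {"A": 3, "B": 1, "C": 1}
    print(quorum(5))                             # 3
    print(survives_region_loss(stretched, "A"))  # False
    print(survives_region_loss(stretched, "B"))  # True
    # Independent cluster: quorum is intra-region, 2 of 3.
    print(quorum(3))                             # 2
```

With independent per-region clusters the question never arises: losing Region A removes no voters from the clusters in B or C.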


6. Keycloak topology

Set at Sovereign provisioning time:

# In Sovereign CRD spec
keycloakTopology: per-organization      # SME-style: each Org gets its own
# OR
keycloakTopology: shared-sovereign      # Corporate: one Keycloak for the Sovereign

6.1 SME-style (per-organization)

Sovereign: omantel
└── Each Organization gets a minimal Keycloak (1 replica, embedded H2/sqlite,
    ~150 MB RAM, no HA)
    │
    ├── Organization muscatpharmacy
    │     Keycloak realm: muscatpharmacy
    │     Federations: Omantel-Mobile-OTP, Google, Apple
    ├── Organization acme-shop
    │     Keycloak realm: acme-shop
    └── …

Why per-Org for SME: blast radius. A Keycloak outage at Muscat-pharmacy cannot affect Lulu-Hypermarket. It is also operationally cheap: a minimal Keycloak fits in under 200 MB of RAM. SME-tier customers don't need HA; a 10 s Keycloak restart during a deploy is tolerable.

Larger SMEs can opt into HA via a tier upgrade — same data model, just more replicas + Postgres backend instead of embedded H2.

6.2 Corporate (shared-sovereign)

Sovereign: bankdhofar
└── ONE Keycloak (HA, 3 replicas, Postgres backend)
    Federates to Bank Dhofar's corporate Azure AD
    │
    ├── Realm: catalyst-admin (sovereign-admin team)
    ├── Realm: core-banking (Org)
    ├── Realm: digital-channels (Org)
    ├── Realm: analytics (Org)
    └── Realm: corporate-it (Org)

Why shared for corporate: the bank's security perimeter is the entire Sovereign. Every Organization within is a business unit of the same legal entity. Federation to Azure AD is the single auth choke-point anyway. Per-Org Keycloak would mean N times the Azure AD federation config — operational overhead with no security benefit.
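
The difference between the two topologies comes down to which Keycloak instance serves an Organization; the realm is named after the Organization in both cases. A Python sketch of that lookup follows. The `keycloak-<name>` instance naming is a hypothetical convention used only for illustration and is not defined by Catalyst.

```python
from typing import NamedTuple


class KeycloakTarget(NamedTuple):
    instance: str  # hypothetical instance name, for illustration only
    realm: str


def keycloak_target(topology: str, sovereign: str, organization: str) -> KeycloakTarget:
    """Resolve the Keycloak instance and realm for an Organization."""
    if topology == "per-organization":
        # Each Organization gets its own minimal Keycloak.
        return KeycloakTarget(f"keycloak-{organization}", organization)
    if topology == "shared-sovereign":
        # One HA Keycloak for the Sovereign, one realm per Organization.
        return KeycloakTarget(f"keycloak-{sovereign}", organization)
    raise ValueError(f"unknown keycloakTopology: {topology!r}")


if __name__ == "__main__":
    print(keycloak_target("per-organization", "omantel", "muscatpharmacy"))
    print(keycloak_target("shared-sovereign", "bankdhofar", "core-banking"))
```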

6.3 App-level SSO

Every Application Blueprint can declare SSO support:

# in bp-wordpress configSchema
sso:
  enabled: true   # auto-creates a Keycloak client in the Org's realm
                  # injects credentials via OpenBao + ExternalSecret

End users get one-click SSO across all Apps in their Organization without ever seeing OAuth config.


7. Rotation policy

Every credential class has a SecretPolicy that drives automatic rotation.

apiVersion: catalyst.openova.io/v1alpha1
kind: SecretPolicy
metadata:
  name: stricter-rotation
  namespace: catalyst-system
spec:
  appliesTo:
    organizationLabels:
      tier: regulated
  rules:
    - kind: database-credentials
      maxTTL: 1h
      autoRotate: true
    - kind: api-token
      maxTTL: 90d
      autoRotate: true
      rotateBefore: 7d
    - kind: oauth-client-secret
      maxTTL: 90d
      autoRotate: true
    - kind: signing-key
      maxTTL: 365d
      autoRotate: false               # requires explicit approval
      requireApproval: [security-officer]
    - kind: tls-cert
      maxTTL: cert-manager-managed

| Class | Default | Notes |
|---|---|---|
| Workload identity (SPIRE SVID) | 5 min, auto | Not configurable. |
| Dynamic DB creds | 1 h, auto | Per-Blueprint TTL configurable. |
| API tokens, OAuth client secrets | 90 d, auto | rotateBefore: 7d gives apps a refresh window. |
| Signing keys, root CAs | 365 d, manual approval | Auto-rotation possible but disabled by default for high-impact keys. |
| TLS certs | cert-manager controlled | ACME/Let's Encrypt, ~60 d, automatic. |
| User passwords (Keycloak) | User-managed + MFA | Min age policy enforced by realm. |

A security-officer sees a RotationDashboard view: every credential class, age, next rotation, force-rotate button (RBAC-gated).
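
The "next rotation" value such a view shows follows from `maxTTL` and `rotateBefore`. A minimal Python sketch, assuming durations use only the `m`, `h`, and `d` suffixes seen in the policy above; `parse_duration` and `rotation_due` are hypothetical helpers, not the SecretPolicy controller's API, and non-duration values such as `cert-manager-managed` are out of scope.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse durations such as '1h', '90d', or '7d'."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def rotation_due(
    issued_at: datetime, max_ttl: str, rotate_before: Optional[str] = None
) -> datetime:
    """When rotation should start: maxTTL minus the rotateBefore window."""
    ttl = parse_duration(max_ttl)
    lead = parse_duration(rotate_before) if rotate_before else timedelta(0)
    if lead >= ttl:
        raise ValueError("rotateBefore must be shorter than maxTTL")
    return issued_at + ttl - lead


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
    # api-token rule: maxTTL 90d, rotateBefore 7d, so rotation starts on day 83.
    print(rotation_due(issued, "90d", "7d").date())
```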


8. The path of a secret value (no leakage)

1. Generated:   Crossplane composition or OpenBao auto-generator creates value.
                Never printed. Never echoed. Written directly to OpenBao via API.

2. Referenced:  ExternalSecret CR in Git names the OpenBao path. No value in Git.

3. Materialized: ESO reads OpenBao path (auth via SVID), renders K8s Secret.
                The K8s Secret is base64-encoded; never logged.

4. Consumed:    Pod mounts as env or file. Reloader watches hash; rolls deploy
                on change. Application sees plaintext only via mount or env.

5. Rotated:     SecretPolicy controller invokes rotation API on OpenBao.
                New value generated, replication propagates, ESO re-reads,
                Reloader rolls. Old value retained for grace window (24h),
                then revoked.

6. Audited:     Every step logged to Catalyst audit log. No plaintext.
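
The grace window in step 5 means two versions of a secret are briefly valid at once. The Python sketch below models that overlap; the integer version numbers and the `accepted_versions` helper are illustrative assumptions, not part of OpenBao or Catalyst.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

GRACE_WINDOW = timedelta(hours=24)  # from step 5 above


def accepted_versions(
    current: int, rotated_at: Optional[datetime], now: datetime
) -> List[int]:
    """Versions still valid at `now`, newest first."""
    in_grace = (
        rotated_at is not None
        and current > 1
        and now < rotated_at + GRACE_WINDOW
    )
    return [current, current - 1] if in_grace else [current]


if __name__ == "__main__":
    rotated = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(accepted_versions(2, rotated, rotated + timedelta(hours=23)))  # [2, 1]
    print(accepted_versions(2, rotated, rotated + timedelta(hours=24)))  # [2]
```

The overlap gives replication, ESO, and Reloader time to propagate the new value before the old one is revoked.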

What never happens:

  • Plaintext secrets in Git.
  • Plaintext secrets in shell command output.
  • Plaintext secrets in issues, PRs, comments, or chat.
  • Plaintext secrets in commit messages, branch names, tag names.

If a secret is ever leaked via terminal output (a misconfigured kubectl describe, a debug log), the leak is treated as a P1 incident: rotate immediately, audit history, communicate.


9. Compliance posture

| Standard | Catalyst posture |
|---|---|
| SOC 2 Type 2 | Audit logging in JetStream + OpenSearch SIEM cold storage. SecretPolicy enforces rotation. EnvironmentPolicy enforces approvals. |
| PSD2 / FAPI | Fingate Blueprint composes Keycloak (FAPI authorization), eIDAS cert verification, ext_authz. |
| DORA | Resilience testing via Litmus chaos Blueprint. Multi-region by default for regulated tier. |
| NIS2 | Falco runtime detection + OpenSearch SIEM + Kyverno policy + supply-chain (cosign + Syft+Grype). |
| GDPR | Per-region data residency via Placement spec. Right-to-be-forgotten flow defined per Application Blueprint. |
| ISO 27001 | Mappings published per control; evidence surfaced via Catalyst console audit views and SIEM exports. |

Every Sovereign exports its audit log to a customer-specified SIEM. Default: OpenSearch in the Sovereign itself; customers may push to external Splunk, Datadog SIEM, etc.


10. Threat model summary

| Threat | Mitigation |
|---|---|
| Stolen ServiceAccount token | SVID is 5-min TTL; revoked by SPIRE on rotation. |
| Stolen K8s Secret | Encrypted at rest in etcd. Pulled only via ESO with SVID. |
| Compromised Pod | NetworkPolicy (Cilium) + L7 policies limit blast radius. Falco detects anomalous syscalls. |
| Malicious commit to Environment Gitea | EnvironmentPolicy requires PR approvals. Kyverno admission control denies non-policy-compliant manifests. |
| Compromised Blueprint upstream | All Blueprints are signed with cosign. Kyverno verify-signatures policy denies unsigned/wrong-issuer artifacts. |
| Cross-Org leakage | vcluster isolation. JetStream Account isolation. Keycloak realm isolation (per-Org or shared). |
| Compromised sovereign-admin account | MFA required at Keycloak. JIT elevation for production-impacting actions. Full audit trail to SIEM. |
| Compromised OpenBao node | 2-of-3 Raft quorum required for writes. Audit log captures every read. Rotate root token + re-shard quarterly. |
| Region-wide failure | Independent OpenBao Raft per region. k8gb removes affected endpoints. Apps with active-active keep serving from healthy region. |
| Supply-chain attack on a build | SLSA-3 build provenance, cosign signing, Syft+Grype SBOM scanned in CI and at runtime by Trivy. |

See ARCHITECTURE.md for the broader platform context.