openova/docs/SECURITY.md
hatiyildiz f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00

18 KiB

Catalyst Security Model

Status: Authoritative target architecture. Updated: 2026-04-27. Implementation: Per-component status tracked in IMPLEMENTATION-STATUS.md. OpenBao, ESO, SPIRE, Keycloak component READMEs exist; Catalyst's integration glue is design-stage.

Identity, secrets, rotation, and multi-region credential semantics for Catalyst Sovereigns. Defer to GLOSSARY.md for terminology.


1. Identity: two systems, two purposes

Subject System Token Lifetime What it auths
Workloads (every Pod, every controller) SPIFFE/SPIRE SVID (X.509 mTLS cert) 5 minutes, auto-rotated Pod ↔ Pod; Pod ↔ OpenBao; Pod ↔ NATS; Pod ↔ Catalyst APIs
Users (every human) Keycloak OIDC JWT 15 min access / 30 day refresh UI auth, REST/GraphQL API, Gitea, console SSE

Two systems, never conflated. Workload identity is bound to a Kubernetes ServiceAccount. User identity is bound to a Keycloak realm subject. The two meet only at boundaries where a service acts on behalf of a user (and even then, the workload presents both: its own SVID for transport mTLS, and the user's JWT in the request body).


2. SPIFFE/SPIRE — workload identity

┌──────────────────────────────────────────────────────────────────────┐
│ Each Sovereign runs a SPIRE server (in catalyst-spire namespace)     │
│  - one HA SPIRE server per host cluster                              │
│  - upstream-bundle to a root SPIRE server in the management cluster  │
│  - issues SVIDs to a SPIRE agent on every node                       │
└──────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────────┐
│ SPIRE agent on each node                                             │
│  - exposes Workload API (Unix socket) to Pods on that node           │
│  - mints SVIDs scoped by SPIFFE ID:                                  │
│      spiffe://<sovereign>/ns/<namespace>/sa/<service-account>        │
│  - rotates every 5 minutes; Pods refresh in-memory                   │
└──────────────────────────────────────────────────────────────────────┘

SPIFFE ID examples in Catalyst:

spiffe://omantel/ns/catalyst-projector/sa/projector
spiffe://omantel/ns/catalyst-gitea/sa/gitea
spiffe://omantel/ns/muscatpharmacy/sa/wordpress     ← Application workload
spiffe://omantel/ns/catalyst-openbao/sa/openbao     ← OpenBao itself

OpenBao authenticates clients by their SVID. JetStream authenticates clients by their SVID. The Catalyst REST API authenticates workloads by their SVID and users by their JWT.

Why SPIFFE over static service-account tokens:

  • Static tokens leak. SVIDs auto-rotate at 5-minute boundaries.
  • SPIFFE IDs are portable across clusters (cross-region service-to-service auth works without cross-cluster ServiceAccount sync).
  • mTLS by default — every connection is authenticated and encrypted.

3. Secrets: OpenBao + ESO

Static secrets (API tokens, passwords, signing keys, OAuth client secrets) live in OpenBao. They reach Pods via External Secrets Operator (ESO).

       OpenBao (Raft cluster, region-local)
              │
              │  ┌──────────────────────────────────────────────┐
              │  │  ExternalSecret CR in Git, in the Application │
              │  │  Gitea repo. References path in OpenBao.     │
              │  └──────────────────────────────────────────────┘
              │                          │
              │                          ▼
              │  ┌──────────────────────────────────────────────┐
              │  │  ESO (in vcluster) reads ExternalSecret CR   │
              │  │  Authenticates to OpenBao via SVID           │
              │  └──────────────────────────────────────────────┘
              │                          │
              │                          ▼
              │  ┌──────────────────────────────────────────────┐
              │  │  K8s Secret (rendered, versioned)             │
              │  │  Reloader watches hash → rolling deploy      │
              │  └──────────────────────────────────────────────┘
              │                          │
              ▼                          ▼
   (audit log + telemetry)         Pod mounts the secret

What's in Git (always):

  • ExternalSecret CR pointing at an OpenBao path
  • SecretStore CR pointing at the OpenBao endpoint
  • SecretPolicy CR (rotation rules)
  • Public keys, root CA certs (CRDs)

What's NEVER in Git:

  • Secret values (passwords, tokens, private keys, etc.)
  • OpenBao root tokens
  • Static API credentials

4. Dynamic credentials

For databases, S3, and other systems supporting short-lived credentials, OpenBao mints them on demand:

Pod                   catalyst-secret-sidecar          OpenBao (DB engine)
 │                          │                                  │
 │ "give me Postgres"      │ authenticates via SVID            │
 │─────────────────────────►│                                   │
 │                          │ mints Postgres user             │
 │                          │ TTL=1h                          │
 │                          │──────────────────────────────────►│
 │                          │ returns user/password           │
 │◄─────────────────────────│◄──────────────────────────────────│
 │
 │ connects to Postgres, opens connection pool
 │
 │ at T+50min: sidecar pre-emptively requests new creds
 │              app drains old pool, swaps to new creds
 │              no downtime
 │
 │ at T+1h: OpenBao revokes the old user

The sidecar is automatic for any Pod whose Blueprint declares dynamicSecrets: true. Apps that prefer in-process can use the Catalyst SDK directly. Apps that can't do either get a rolling restart at the TTL boundary (acceptable for low-tier workloads).

Database engines supported: PostgreSQL (CNPG), FerretDB, MongoDB-compatible, ClickHouse, Valkey, SeaweedFS/S3.


5. Multi-region OpenBao — INDEPENDENT, NOT STRETCHED

Critical: each region runs its own Raft cluster. There is no cross-region Raft quorum. Region failures are independent failure domains.

   Region A (Muscat)              Region B (Salalah)              Region C (Frankfurt DR)
   ┌──────────────────┐           ┌──────────────────┐            ┌──────────────────┐
   │ OpenBao cluster  │           │ OpenBao cluster  │            │ OpenBao cluster  │
   │ 3 Raft nodes     │           │ 3 Raft nodes     │            │ 3 Raft nodes     │
   │ INDEPENDENT      │           │ INDEPENDENT      │            │ INDEPENDENT      │
   │ Raft quorum      │           │ Raft quorum      │            │ Raft quorum      │
   └──────┬───────────┘           └──────────────────┘            └──────────────────┘
          │                                ▲                                ▲
          │ async log shipping             │ async log shipping             │
          │ (Performance Replication)      │                                │
          └────────────────────────────────┴────────────────────────────────┘
                  one-way: primary → secondaries; no cross-region quorum

5.1 Fault domain semantics

  • Each region has its own self-contained 3-node Raft cluster. Quorum is intra-region only (need 2-of-3 in the same region).
  • A total Region A failure does NOT require any other region to do anything. Region B and C continue serving reads from their local replicated data.
  • Network partition between regions: each region keeps operating independently. Writes pause on standby regions (since they're read-only by design).
  • DR promotion is explicit. Either sovereign-admin-approved or automated by failover-controller with strict criteria. Not automatic on every blip.

5.2 Read/write semantics

  • Writes (rotations, new secrets) → primary OpenBao only.
  • Reads → local OpenBao replica (sub-10ms latency in same continent).
  • Replication lag <1s typical. Apps in B and C read post-rotation values without any cross-region call.
  • Region failure → DR replica promoted by the failover-controller. New writes are blocked briefly during promotion (~30s). After promotion, the DR region accepts writes.

5.3 Why NOT a stretched cluster

A stretched Raft cluster (5 nodes across 3 regions, single quorum) seems superficially appealing but is fragile:

  • A single region's network blip can cause loss of quorum if 3 of 5 nodes are in the affected region.
  • Cross-region latency degrades all writes (every write needs cross-region majority ack).
  • An entire region failure can leave the cluster without quorum.

We deliberately reject this pattern. Each region is its own failure domain.


6. Keycloak topology

Set at Sovereign provisioning time:

# In Sovereign CRD spec
keycloakTopology: per-organization      # SME-style: each Org gets its own
# OR
keycloakTopology: shared-sovereign      # Corporate: one Keycloak for the Sovereign

6.1 SME-style (per-organization)

Sovereign: omantel
└── Each Organization gets a minimal Keycloak (1 replica, embedded H2/sqlite,
    ~150 MB RAM, no HA)
    │
    ├── Organization muscatpharmacy
    │     Keycloak realm: muscatpharmacy
    │     Federations: Omantel-Mobile-OTP, Google, Apple
    ├── Organization acme-shop
    │     Keycloak realm: acme-shop
    └── …

Why per-Org for SME: blast radius. Muscat-pharmacy's Keycloak outage cannot affect Lulu-Hypermarket. Operationally cheap — minimal Keycloak fits in <200MB. SME tier customers don't need HA; if their Keycloak restarts in 10s during a deploy, that's tolerable.

Larger SMEs can opt into HA via a tier upgrade — same data model, just more replicas + Postgres backend instead of embedded H2.

6.2 Corporate (shared-sovereign)

Sovereign: bankdhofar
└── ONE Keycloak (HA, 3 replicas, Postgres backend)
    Federates to Bank Dhofar's corporate Azure AD
    │
    ├── Realm: catalyst-admin (sovereign-admin team)
    ├── Realm: core-banking (Org)
    ├── Realm: digital-channels (Org)
    ├── Realm: analytics (Org)
    └── Realm: corporate-it (Org)

Why shared for corporate: the bank's security perimeter is the entire Sovereign. Every Organization within is a business unit of the same legal entity. Federation to Azure AD is the single auth choke-point anyway. Per-Org Keycloak would mean N times the Azure AD federation config — operational overhead with no security benefit.

6.3 App-level SSO

Every Application Blueprint can declare SSO support:

# in bp-wordpress configSchema
sso:
  enabled: true   # auto-creates a Keycloak client in the Org's realm
                  # injects credentials via OpenBao + ExternalSecret

End users get one-click SSO across all Apps in their Organization without ever seeing OAuth config.


7. Rotation policy

Every credential class has a SecretPolicy that drives automatic rotation.

apiVersion: catalyst.openova.io/v1alpha1
kind: SecretPolicy
metadata:
  name: stricter-rotation
  namespace: catalyst-system
spec:
  appliesTo:
    organizationLabels:
      tier: regulated
  rules:
    - kind: database-credentials
      maxTTL: 1h
      autoRotate: true
    - kind: api-token
      maxTTL: 90d
      autoRotate: true
      rotateBefore: 7d
    - kind: oauth-client-secret
      maxTTL: 90d
      autoRotate: true
    - kind: signing-key
      maxTTL: 365d
      autoRotate: false               # requires explicit approval
      requireApproval: [security-officer]
    - kind: tls-cert
      maxTTL: cert-manager-managed
Class Default Notes
Workload identity (SPIRE SVID) 5 min, auto Not configurable.
Dynamic DB creds 1 h, auto Per-Blueprint TTL configurable.
API tokens, OAuth client secrets 90 d, auto rotateBefore: 7d gives apps a refresh window.
Signing keys, root CAs 365 d, manual approval Auto-rotation possible but disabled by default for high-impact keys.
TLS certs cert-manager controlled Acme/Let's Encrypt, ~60 d, automatic.
User passwords (Keycloak) User-managed + MFA Min age policy enforced by realm.

A security-officer sees a RotationDashboard view: every credential class, age, next rotation, force-rotate button (RBAC-gated).


8. The path of a secret value (no leakage)

1. Generated:   Crossplane composition or OpenBao auto-generator creates value.
                Never printed. Never echoed. Written directly to OpenBao via API.

2. Referenced:  ExternalSecret CR in Git names the OpenBao path. No value in Git.

3. Materialized: ESO reads OpenBao path (auth via SVID), renders K8s Secret.
                The K8s Secret is base64-encoded; never logged.

4. Consumed:    Pod mounts as env or file. Reloader watches hash; rolls deploy
                on change. Application sees plaintext only via mount or env.

5. Rotated:     SecretPolicy controller invokes rotation API on OpenBao.
                New value generated, replication propagates, ESO re-reads,
                Reloader rolls. Old value retained for grace window (24h),
                then revoked.

6. Audited:     Every step logged to Catalyst audit log. No plaintext.

What never happens:

  • Plaintext secrets in Git.
  • Plaintext secrets in shell command output.
  • Plaintext secrets in issues, PRs, comments, or chat.
  • Plaintext secrets in commit messages, branch names, tag names.

If a secret is ever leaked via terminal output (a misconfigured kubectl describe, a debug log), the leak is treated as a P1 incident: rotate immediately, audit history, communicate.


9. Compliance posture

Standard Catalyst posture
SOC 2 Type 2 Audit logging in JetStream + OpenSearch SIEM cold storage. SecretPolicy enforces rotation. EnvironmentPolicy enforces approvals.
PSD2 / FAPI Fingate Blueprint composes Keycloak (FAPI authorization), eIDAS cert verification, ext_authz.
DORA Resilience testing via Litmus chaos Blueprint. Multi-region by default for regulated tier.
NIS2 Falco runtime detection + OpenSearch SIEM + Kyverno policy + supply-chain (cosign + Syft+Grype).
GDPR Per-region data residency via Placement spec. Right-to-be-forgotten flow defined per Application Blueprint.
ISO 27001 Mappings published per control; evidence surfaced via Catalyst console audit views and SIEM exports.

Every Sovereign exports its audit log to a customer-specified SIEM. Default: OpenSearch in the Sovereign itself; customers may push to external Splunk, Datadog SIEM, etc.


10. Threat model summary

Threat Mitigation
Stolen ServiceAccount token SVID is 5-min TTL; revoked by SPIRE on rotation.
Stolen K8s Secret Encrypted at rest in etcd. Pulled only via ESO with SVID.
Compromised Pod NetworkPolicy (Cilium) + L7 policies limit blast radius. Falco detects anomalous syscalls.
Malicious commit to Environment Gitea EnvironmentPolicy requires PR approvals. Kyverno admission control denies non-policy-compliant manifests.
Compromised Blueprint upstream All Blueprints are cosigned. Kyverno verify-signatures policy denies unsigned/wrong-issuer artifacts.
Cross-Org leakage vcluster isolation. JetStream Account isolation. Keycloak realm isolation (per-Org or shared).
Compromised sovereign-admin account MFA required at Keycloak. JIT elevation for production-impacting actions. Full audit trail to SIEM.
Compromised OpenBao node 2-of-3 Raft quorum required for writes. Audit log captures every read. Rotate root token + re-shard quarterly.
Region-wide failure Independent OpenBao Raft per region. PowerDNS lua-records (ifurlup) drop the affected regional endpoint from authoritative responses within the health-check window. Apps with active-active keep serving from healthy region.
Supply-chain attack on a build SLSA-3 build provenance, cosign signing, Syft+Grype SBOM scanned in CI and at runtime by Trivy.

See ARCHITECTURE.md for the broader platform context.