openova/docs/SECURITY.md
hatiyildiz f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00

349 lines
18 KiB
Markdown

# Catalyst Security Model
**Status:** Authoritative target architecture. **Updated:** 2026-04-27.
**Implementation:** Per-component status tracked in [`IMPLEMENTATION-STATUS.md`](IMPLEMENTATION-STATUS.md). OpenBao, ESO, SPIRE, Keycloak component READMEs exist; Catalyst's integration glue is design-stage.
Identity, secrets, rotation, and multi-region credential semantics for Catalyst Sovereigns. Defer to [`GLOSSARY.md`](GLOSSARY.md) for terminology.
---
## 1. Identity: two systems, two purposes
| Subject | System | Token | Lifetime | What it auths |
|---|---|---|---|---|
| **Workloads** (every Pod, every controller) | SPIFFE/SPIRE | SVID (X.509 mTLS cert) | 5 minutes, auto-rotated | Pod ↔ Pod; Pod ↔ OpenBao; Pod ↔ NATS; Pod ↔ Catalyst APIs |
| **Users** (every human) | Keycloak | OIDC JWT | 15 min access / 30 day refresh | UI auth, REST/GraphQL API, Gitea, console SSE |
Two systems, never conflated. Workload identity is bound to a Kubernetes ServiceAccount. User identity is bound to a Keycloak realm subject. The two meet only at boundaries where a service acts on behalf of a user (and even then, the workload presents both: its own SVID for transport mTLS, and the user's JWT in the request body).
---
## 2. SPIFFE/SPIRE — workload identity
```
┌──────────────────────────────────────────────────────────────────────┐
│ Each Sovereign runs a SPIRE server (in catalyst-spire namespace) │
│ - one HA SPIRE server per host cluster │
│ - upstream-bundle to a root SPIRE server in the management cluster │
│ - issues SVIDs to a SPIRE agent on every node │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ SPIRE agent on each node │
│ - exposes Workload API (Unix socket) to Pods on that node │
│ - mints SVIDs scoped by SPIFFE ID: │
│ spiffe://<sovereign>/ns/<namespace>/sa/<service-account> │
│ - rotates every 5 minutes; Pods refresh in-memory │
└──────────────────────────────────────────────────────────────────────┘
```
**SPIFFE ID examples** in Catalyst:
```
spiffe://omantel/ns/catalyst-projector/sa/projector
spiffe://omantel/ns/catalyst-gitea/sa/gitea
spiffe://omantel/ns/muscatpharmacy/sa/wordpress ← Application workload
spiffe://omantel/ns/catalyst-openbao/sa/openbao ← OpenBao itself
```
OpenBao authenticates clients by their SVID. JetStream authenticates clients by their SVID. The Catalyst REST API authenticates workloads by their SVID and users by their JWT.
**Why SPIFFE over static service-account tokens:**
- Static tokens leak. SVIDs auto-rotate at 5-minute boundaries.
- SPIFFE IDs are portable across clusters (cross-region service-to-service auth works without cross-cluster ServiceAccount sync).
- mTLS by default — every connection is authenticated and encrypted.
---
## 3. Secrets: OpenBao + ESO
Static secrets (API tokens, passwords, signing keys, OAuth client secrets) live in OpenBao. They reach Pods via External Secrets Operator (ESO).
```
OpenBao (Raft cluster, region-local)
│ ┌──────────────────────────────────────────────┐
│ │ ExternalSecret CR in Git, in the Application │
│ │ Gitea repo. References path in OpenBao. │
│ └──────────────────────────────────────────────┘
│ │
│ ▼
│ ┌──────────────────────────────────────────────┐
│ │ ESO (in vcluster) reads ExternalSecret CR │
│ │ Authenticates to OpenBao via SVID │
│ └──────────────────────────────────────────────┘
│ │
│ ▼
│ ┌──────────────────────────────────────────────┐
│ │ K8s Secret (rendered, versioned) │
│ │ Reloader watches hash → rolling deploy │
│ └──────────────────────────────────────────────┘
│ │
▼ ▼
(audit log + telemetry) Pod mounts the secret
```
**What's in Git** (always):
- `ExternalSecret` CR pointing at an OpenBao path
- `SecretStore` CR pointing at the OpenBao endpoint
- `SecretPolicy` CR (rotation rules)
- Public keys, root CA certs (CRDs)
**What's NEVER in Git:**
- Secret values (passwords, tokens, private keys, etc.)
- OpenBao root tokens
- Static API credentials
---
## 4. Dynamic credentials
For databases, S3, and other systems supporting short-lived credentials, OpenBao mints them on demand:
```
Pod catalyst-secret-sidecar OpenBao (DB engine)
│ │ │
│ "give me Postgres" │ authenticates via SVID │
│─────────────────────────►│ │
│ │ mints Postgres user │
│ │ TTL=1h │
│ │──────────────────────────────────►│
│ │ returns user/password │
│◄─────────────────────────│◄──────────────────────────────────│
│ connects to Postgres, opens connection pool
│ at T+50min: sidecar pre-emptively requests new creds
│ app drains old pool, swaps to new creds
│ no downtime
│ at T+1h: OpenBao revokes the old user
```
The sidecar is automatic for any Pod whose Blueprint declares `dynamicSecrets: true`. Apps that prefer in-process can use the Catalyst SDK directly. Apps that can't do either get a rolling restart at the TTL boundary (acceptable for low-tier workloads).
**Database engines supported:** PostgreSQL (CNPG), FerretDB, MongoDB-compatible, ClickHouse, Valkey, SeaweedFS/S3.
---
## 5. Multi-region OpenBao — INDEPENDENT, NOT STRETCHED
Critical: each region runs its **own** Raft cluster. There is no cross-region Raft quorum. Region failures are independent failure domains.
```
Region A (Muscat) Region B (Salalah) Region C (Frankfurt DR)
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ OpenBao cluster │ │ OpenBao cluster │ │ OpenBao cluster │
│ 3 Raft nodes │ │ 3 Raft nodes │ │ 3 Raft nodes │
│ INDEPENDENT │ │ INDEPENDENT │ │ INDEPENDENT │
│ Raft quorum │ │ Raft quorum │ │ Raft quorum │
└──────┬───────────┘ └──────────────────┘ └──────────────────┘
│ ▲ ▲
│ async log shipping │ async log shipping │
│ (Performance Replication) │ │
└────────────────────────────────┴────────────────────────────────┘
one-way: primary → secondaries; no cross-region quorum
```
### 5.1 Fault domain semantics
- **Each region has its own self-contained 3-node Raft cluster.** Quorum is **intra-region only** (need 2-of-3 in the same region).
- **A total Region A failure does NOT require any other region to do anything.** Region B and C continue serving reads from their local replicated data.
- **Network partition between regions:** each region keeps operating independently. Writes pause on standby regions (since they're read-only by design).
- **DR promotion is explicit.** Either `sovereign-admin`-approved or automated by failover-controller with strict criteria. Not automatic on every blip.
### 5.2 Read/write semantics
- **Writes** (rotations, new secrets) → primary OpenBao only.
- **Reads** → local OpenBao replica (sub-10ms latency in same continent).
- **Replication lag** <1s typical. Apps in B and C read post-rotation values without any cross-region call.
- **Region failure** DR replica promoted by the failover-controller. New writes are blocked briefly during promotion (~30s). After promotion, the DR region accepts writes.
### 5.3 Why NOT a stretched cluster
A stretched Raft cluster (5 nodes across 3 regions, single quorum) seems superficially appealing but is fragile:
- A single region's network blip can cause loss of quorum if 3 of 5 nodes are in the affected region.
- Cross-region latency degrades all writes (every write needs cross-region majority ack).
- An entire region failure can leave the cluster without quorum.
We deliberately reject this pattern. Each region is its own failure domain.
---
## 6. Keycloak topology
Set at Sovereign provisioning time:
```yaml
# In Sovereign CRD spec
keycloakTopology: per-organization # SME-style: each Org gets its own
# OR
keycloakTopology: shared-sovereign # Corporate: one Keycloak for the Sovereign
```
### 6.1 SME-style (`per-organization`)
```
Sovereign: omantel
└── Each Organization gets a minimal Keycloak (1 replica, embedded H2/sqlite,
~150 MB RAM, no HA)
├── Organization muscatpharmacy
│ Keycloak realm: muscatpharmacy
│ Federations: Omantel-Mobile-OTP, Google, Apple
├── Organization acme-shop
│ Keycloak realm: acme-shop
└── …
```
**Why per-Org for SME**: blast radius. Muscat-pharmacy's Keycloak outage cannot affect Lulu-Hypermarket. Operationally cheap minimal Keycloak fits in <200MB. SME tier customers don't need HA; if their Keycloak restarts in 10s during a deploy, that's tolerable.
**Larger SMEs** can opt into HA via a tier upgrade same data model, just more replicas + Postgres backend instead of embedded H2.
### 6.2 Corporate (`shared-sovereign`)
```
Sovereign: bankdhofar
└── ONE Keycloak (HA, 3 replicas, Postgres backend)
Federates to Bank Dhofar's corporate Azure AD
├── Realm: catalyst-admin (sovereign-admin team)
├── Realm: core-banking (Org)
├── Realm: digital-channels (Org)
├── Realm: analytics (Org)
└── Realm: corporate-it (Org)
```
**Why shared for corporate**: the bank's security perimeter is the entire Sovereign. Every Organization within is a business unit of the same legal entity. Federation to Azure AD is the single auth choke-point anyway. Per-Org Keycloak would mean N times the Azure AD federation config operational overhead with no security benefit.
### 6.3 App-level SSO
Every Application Blueprint can declare SSO support:
```yaml
# in bp-wordpress configSchema
sso:
enabled: true # auto-creates a Keycloak client in the Org's realm
# injects credentials via OpenBao + ExternalSecret
```
End users get one-click SSO across all Apps in their Organization without ever seeing OAuth config.
---
## 7. Rotation policy
Every credential class has a SecretPolicy that drives automatic rotation.
```yaml
apiVersion: catalyst.openova.io/v1alpha1
kind: SecretPolicy
metadata:
name: stricter-rotation
namespace: catalyst-system
spec:
appliesTo:
organizationLabels:
tier: regulated
rules:
- kind: database-credentials
maxTTL: 1h
autoRotate: true
- kind: api-token
maxTTL: 90d
autoRotate: true
rotateBefore: 7d
- kind: oauth-client-secret
maxTTL: 90d
autoRotate: true
- kind: signing-key
maxTTL: 365d
autoRotate: false # requires explicit approval
requireApproval: [security-officer]
- kind: tls-cert
maxTTL: cert-manager-managed
```
| Class | Default | Notes |
|---|---|---|
| Workload identity (SPIRE SVID) | 5 min, auto | Not configurable. |
| Dynamic DB creds | 1 h, auto | Per-Blueprint TTL configurable. |
| API tokens, OAuth client secrets | 90 d, auto | rotateBefore: 7d gives apps a refresh window. |
| Signing keys, root CAs | 365 d, manual approval | Auto-rotation possible but disabled by default for high-impact keys. |
| TLS certs | cert-manager controlled | Acme/Let's Encrypt, ~60 d, automatic. |
| User passwords (Keycloak) | User-managed + MFA | Min age policy enforced by realm. |
A `security-officer` sees a **RotationDashboard** view: every credential class, age, next rotation, force-rotate button (RBAC-gated).
---
## 8. The path of a secret value (no leakage)
```
1. Generated: Crossplane composition or OpenBao auto-generator creates value.
Never printed. Never echoed. Written directly to OpenBao via API.
2. Referenced: ExternalSecret CR in Git names the OpenBao path. No value in Git.
3. Materialized: ESO reads OpenBao path (auth via SVID), renders K8s Secret.
The K8s Secret is base64-encoded; never logged.
4. Consumed: Pod mounts as env or file. Reloader watches hash; rolls deploy
on change. Application sees plaintext only via mount or env.
5. Rotated: SecretPolicy controller invokes rotation API on OpenBao.
New value generated, replication propagates, ESO re-reads,
Reloader rolls. Old value retained for grace window (24h),
then revoked.
6. Audited: Every step logged to Catalyst audit log. No plaintext.
```
**What never happens:**
- Plaintext secrets in Git.
- Plaintext secrets in shell command output.
- Plaintext secrets in issues, PRs, comments, or chat.
- Plaintext secrets in commit messages, branch names, tag names.
If a secret is ever leaked via terminal output (a misconfigured `kubectl describe`, a debug log), the leak is treated as a P1 incident: rotate immediately, audit history, communicate.
---
## 9. Compliance posture
| Standard | Catalyst posture |
|---|---|
| **SOC 2 Type 2** | Audit logging in JetStream + OpenSearch SIEM cold storage. SecretPolicy enforces rotation. EnvironmentPolicy enforces approvals. |
| **PSD2 / FAPI** | Fingate Blueprint composes Keycloak (FAPI authorization), eIDAS cert verification, ext_authz. |
| **DORA** | Resilience testing via Litmus chaos Blueprint. Multi-region by default for regulated tier. |
| **NIS2** | Falco runtime detection + OpenSearch SIEM + Kyverno policy + supply-chain (cosign + Syft+Grype). |
| **GDPR** | Per-region data residency via Placement spec. Right-to-be-forgotten flow defined per Application Blueprint. |
| **ISO 27001** | Mappings published per control; evidence surfaced via Catalyst console audit views and SIEM exports. |
Every Sovereign exports its audit log to a customer-specified SIEM. Default: OpenSearch in the Sovereign itself; customers may push to external Splunk, Datadog SIEM, etc.
---
## 10. Threat model summary
| Threat | Mitigation |
|---|---|
| Stolen ServiceAccount token | SVID is 5-min TTL; revoked by SPIRE on rotation. |
| Stolen K8s Secret | Encrypted at rest in etcd. Pulled only via ESO with SVID. |
| Compromised Pod | NetworkPolicy (Cilium) + L7 policies limit blast radius. Falco detects anomalous syscalls. |
| Malicious commit to Environment Gitea | EnvironmentPolicy requires PR approvals. Kyverno admission control denies non-policy-compliant manifests. |
| Compromised Blueprint upstream | All Blueprints are cosigned. Kyverno verify-signatures policy denies unsigned/wrong-issuer artifacts. |
| Cross-Org leakage | vcluster isolation. JetStream Account isolation. Keycloak realm isolation (per-Org or shared). |
| Compromised sovereign-admin account | MFA required at Keycloak. JIT elevation for production-impacting actions. Full audit trail to SIEM. |
| Compromised OpenBao node | 2-of-3 Raft quorum required for writes. Audit log captures every read. Rotate root token + re-shard quarterly. |
| Region-wide failure | Independent OpenBao Raft per region. PowerDNS lua-records (`ifurlup`) drop the affected regional endpoint from authoritative responses within the health-check window. Apps with `active-active` keep serving from healthy region. |
| Supply-chain attack on a build | SLSA-3 build provenance, cosign signing, Syft+Grype SBOM scanned in CI and at runtime by Trivy. |
---
*See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the broader platform context.*