# Catalyst Security Model

**Status:** Authoritative target architecture. **Updated:** 2026-04-27.
**Implementation:** Per-component status tracked in [`IMPLEMENTATION-STATUS.md`](IMPLEMENTATION-STATUS.md). OpenBao, ESO, SPIRE, Keycloak component READMEs exist; Catalyst's integration glue is design-stage.

Identity, secrets, rotation, and multi-region credential semantics for Catalyst Sovereigns. Defer to [`GLOSSARY.md`](GLOSSARY.md) for terminology.

---

## 1. Identity: two systems, two purposes

| Subject | System | Token | Lifetime | What it authenticates |
|---|---|---|---|---|
| **Workloads** (every Pod, every controller) | SPIFFE/SPIRE | SVID (X.509 mTLS cert) | 5 minutes, auto-rotated | Pod ↔ Pod; Pod ↔ OpenBao; Pod ↔ NATS; Pod ↔ Catalyst APIs |
| **Users** (every human) | Keycloak | OIDC JWT | 15 min access / 30 day refresh | UI auth, REST/GraphQL API, Gitea, console SSE |

Two systems, never conflated. Workload identity is bound to a Kubernetes ServiceAccount. User identity is bound to a Keycloak realm subject. The two meet only at boundaries where a service acts on behalf of a user (and even then, the workload presents both: its own SVID for transport mTLS, and the user's JWT in the request body).

---

## 2. SPIFFE/SPIRE — workload identity

```
┌──────────────────────────────────────────────────────────────────────┐
│ Each Sovereign runs a SPIRE server (in catalyst-spire namespace)     │
│   - one HA SPIRE server per host cluster                             │
│   - upstream-bundle to a root SPIRE server in the management cluster │
│   - issues SVIDs to a SPIRE agent on every node                      │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│ SPIRE agent on each node                                             │
│   - exposes Workload API (Unix socket) to Pods on that node          │
│   - mints SVIDs scoped by SPIFFE ID:                                 │
│       spiffe://<sovereign>/ns/<namespace>/sa/<service-account>       │
│   - rotates every 5 minutes; Pods refresh in-memory                  │
└──────────────────────────────────────────────────────────────────────┘
```

**SPIFFE ID examples** in Catalyst:

```
spiffe://omantel/ns/catalyst-projector/sa/projector
spiffe://omantel/ns/catalyst-gitea/sa/gitea
spiffe://omantel/ns/muscatpharmacy/sa/wordpress     ← Application workload
spiffe://omantel/ns/catalyst-openbao/sa/openbao     ← OpenBao itself
```
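
As an illustration of the ID scheme above, the following sketch parses and rebuilds IDs of the form `spiffe://<sovereign>/ns/<namespace>/sa/<service-account>`. The `WorkloadIdentity` class and both function names are hypothetical and not part of SPIRE or Catalyst; a real workload would obtain and validate its SVID through the SPIRE Workload API rather than by string handling.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class WorkloadIdentity:
    """Hypothetical container for the three parts of a Catalyst SPIFFE ID."""
    sovereign: str        # SPIFFE trust domain, e.g. "omantel"
    namespace: str        # Kubernetes namespace
    service_account: str  # Kubernetes ServiceAccount


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Parse spiffe://<sovereign>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    # A well-formed path splits into ["", "ns", <namespace>, "sa", <service-account>].
    parts = parsed.path.split("/")
    if (len(parts) != 5 or parts[0] != "" or parts[1] != "ns"
            or parts[3] != "sa" or not parts[2] or not parts[4]):
        raise ValueError(f"unexpected SPIFFE ID path: {parsed.path!r}")
    return WorkloadIdentity(parsed.netloc, parts[2], parts[4])


def format_spiffe_id(identity: WorkloadIdentity) -> str:
    """Rebuild the SPIFFE ID string from its parts."""
    return (f"spiffe://{identity.sovereign}"
            f"/ns/{identity.namespace}/sa/{identity.service_account}")
```

For example, `parse_spiffe_id("spiffe://omantel/ns/muscatpharmacy/sa/wordpress")` yields sovereign `omantel`, namespace `muscatpharmacy`, and service account `wordpress`.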

OpenBao authenticates clients by their SVID. JetStream authenticates clients by their SVID. The Catalyst REST API authenticates workloads by their SVID and users by their JWT.

**Why SPIFFE over static service-account tokens:**
- Static tokens leak. SVIDs auto-rotate at 5-minute boundaries.
- SPIFFE IDs are portable across clusters (cross-region service-to-service auth works without cross-cluster ServiceAccount sync).
- mTLS by default — every connection is authenticated and encrypted.

---

## 3. Secrets: OpenBao + ESO

Static secrets (API tokens, passwords, signing keys, OAuth client secrets) live in OpenBao. They reach Pods via External Secrets Operator (ESO).

```
OpenBao (Raft cluster, region-local)
       │
       │       ┌──────────────────────────────────────────────┐
       │       │ ExternalSecret CR in Git, in the Application │
       │       │ Gitea repo. References path in OpenBao.      │
       │       └──────────────────────────────────────────────┘
       │                              │
       │                              ▼
       │       ┌──────────────────────────────────────────────┐
       │       │ ESO (in vcluster) reads ExternalSecret CR    │
       │       │ Authenticates to OpenBao via SVID            │
       │       └──────────────────────────────────────────────┘
       │                              │
       │                              ▼
       │       ┌──────────────────────────────────────────────┐
       │       │ K8s Secret (rendered, versioned)             │
       │       │ Reloader watches hash → rolling deploy       │
       │       └──────────────────────────────────────────────┘
       │                              │
       ▼                              ▼
(audit log + telemetry)     Pod mounts the secret
```

**What's in Git** (always):

- `ExternalSecret` CR pointing at an OpenBao path
- `SecretStore` CR pointing at the OpenBao endpoint
- `SecretPolicy` CR (rotation rules)
- Public keys, root CA certs (CRDs)

**What's NEVER in Git:**

- Secret values (passwords, tokens, private keys, etc.)
- OpenBao root tokens
- Static API credentials

---

## 4. Dynamic credentials

For databases, S3, and other systems supporting short-lived credentials, OpenBao mints them on demand:

```
Pod             catalyst-secret-sidecar               OpenBao (DB engine)
│                          │                                   │
│  "give me Postgres"      │  authenticates via SVID           │
│─────────────────────────►│                                   │
│                          │  mints Postgres user              │
│                          │  TTL=1h                           │
│                          │──────────────────────────────────►│
│                          │  returns user/password            │
│◄─────────────────────────│◄──────────────────────────────────│
│
│  connects to Postgres, opens connection pool
│
│  at T+50min: sidecar pre-emptively requests new creds
│              app drains old pool, swaps to new creds
│              no downtime
│
│  at T+1h: OpenBao revokes the old user
```
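
The timing in the diagram (refresh at T+50min for a credential with TTL=1h) can be sketched as a small scheduling helper. This is a minimal sketch, not the sidecar's actual logic: `refresh_schedule` is a hypothetical name, and the 10-minute lead is an assumption read off the diagram rather than a documented setting.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def refresh_schedule(issued_at: datetime, ttl: timedelta,
                     lead: timedelta) -> Tuple[datetime, datetime]:
    """Return (refresh_at, expires_at) for a short-lived credential.

    `lead` is how long before expiry the replacement is requested, so the
    application has time to drain the old pool and swap to the new one.
    """
    if lead < timedelta(0) or lead >= ttl:
        raise ValueError("lead must be non-negative and shorter than the TTL")
    expires_at = issued_at + ttl
    return expires_at - lead, expires_at


# Values from the diagram: TTL of 1 hour, refresh 10 minutes before expiry.
issued = datetime(2026, 4, 27, 12, 0, tzinfo=timezone.utc)
refresh_at, expires_at = refresh_schedule(
    issued, ttl=timedelta(hours=1), lead=timedelta(minutes=10))
print(refresh_at.isoformat())  # 2026-04-27T12:50:00+00:00
print(expires_at.isoformat())  # 2026-04-27T13:00:00+00:00
```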

The sidecar is automatic for any Pod whose Blueprint declares `dynamicSecrets: true`. Apps that prefer in-process can use the Catalyst SDK directly. Apps that can't do either get a rolling restart at the TTL boundary (acceptable for low-tier workloads).

**Database engines supported:** PostgreSQL (CNPG), FerretDB, MongoDB-compatible, ClickHouse, Valkey, SeaweedFS/S3.

---

## 5. Multi-region OpenBao — INDEPENDENT, NOT STRETCHED

Critical: each region runs its **own** Raft cluster. There is no cross-region Raft quorum. Region failures are independent failure domains.

```
 Region A (Muscat)             Region B (Salalah)         Region C (Frankfurt DR)
┌──────────────────┐          ┌──────────────────┐          ┌──────────────────┐
│ OpenBao cluster  │          │ OpenBao cluster  │          │ OpenBao cluster  │
│ 3 Raft nodes     │          │ 3 Raft nodes     │          │ 3 Raft nodes     │
│ INDEPENDENT      │          │ INDEPENDENT      │          │ INDEPENDENT      │
│ Raft quorum      │          │ Raft quorum      │          │ Raft quorum      │
└──────┬───────────┘          └──────────────────┘          └──────────────────┘
       │                             ▲                             ▲
       │  async log shipping         │  async log shipping         │
       │  (Performance Replication)  │                             │
       └─────────────────────────────┴─────────────────────────────┘
            one-way: primary → secondaries; no cross-region quorum
```

### 5.1 Fault domain semantics

- **Each region has its own self-contained 3-node Raft cluster.** Quorum is **intra-region only** (need 2-of-3 in the same region).
- **A total Region A failure does NOT require any other region to do anything.** Region B and C continue serving reads from their local replicated data.
- **Network partition between regions:** each region keeps operating independently. Writes pause on standby regions (since they're read-only by design).
- **DR promotion is explicit.** Either `sovereign-admin`-approved or automated by failover-controller with strict criteria. Not automatic on every blip.

### 5.2 Read/write semantics

- **Writes** (rotations, new secrets) → primary OpenBao only.
- **Reads** → local OpenBao replica (sub-10ms latency in same continent).
- **Replication lag** <1s typical. Apps in B and C read post-rotation values without any cross-region call.
- **Region failure** → DR replica promoted by the failover-controller. New writes are blocked briefly during promotion (~30s). After promotion, the DR region accepts writes.

### 5.3 Why NOT a stretched cluster

A stretched Raft cluster (5 nodes across 3 regions, single quorum) seems superficially appealing but is fragile:

- A single region's network blip can cause loss of quorum if 3 of 5 nodes are in the affected region.
- Cross-region latency degrades all writes (every write needs cross-region majority ack).
- An entire region failure can leave the cluster without quorum.

We deliberately reject this pattern. Each region is its own failure domain.
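
The quorum arithmetic behind this decision can be sketched in a few lines. Raft needs a strict majority of voting nodes; the function names and the 3/1/1 node layout below are illustrative assumptions, not Catalyst code.

```python
from typing import Mapping


def quorum(nodes: int) -> int:
    """Votes needed for a Raft majority among `nodes` voters."""
    return nodes // 2 + 1


def survives_region_loss(nodes_per_region: Mapping[str, int],
                         lost_region: str) -> bool:
    """True if ONE stretched Raft cluster keeps quorum after losing a region."""
    total = sum(nodes_per_region.values())
    remaining = total - nodes_per_region[lost_region]
    return remaining >= quorum(total)


# Independent 3-node cluster per region: quorum is 2-of-3 and only local
# nodes vote, so another region going down never changes the count.
print(quorum(3))  # 2

# Stretched 5-node cluster with 3 nodes in region A (assumed layout).
stretched = {"A": 3, "B": 1, "C": 1}
print(survives_region_loss(stretched, "A"))  # False: 2 nodes left, 3 needed
print(survives_region_loss(stretched, "B"))  # True: 4 nodes left, 3 needed
```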

---

## 6. Keycloak topology

Set at Sovereign provisioning time:

```yaml
# In Sovereign CRD spec
keycloakTopology: per-organization    # SME-style: each Org gets its own
# OR
keycloakTopology: shared-sovereign    # Corporate: one Keycloak for the Sovereign
```

### 6.1 SME-style (`per-organization`)

```
Sovereign: omantel
  └── Each Organization gets a minimal Keycloak (1 replica, embedded H2/sqlite,
      ~150 MB RAM, no HA)
        │
        ├── Organization muscatpharmacy
        │     Keycloak realm: muscatpharmacy
        │     Federations: Omantel-Mobile-OTP, Google, Apple
        ├── Organization acme-shop
        │     Keycloak realm: acme-shop
        └── …
```

**Why per-Org for SME**: blast radius. Muscat-pharmacy's Keycloak outage cannot affect Lulu-Hypermarket. Operationally cheap — minimal Keycloak fits in <200MB. SME tier customers don't need HA; if their Keycloak restarts in 10s during a deploy, that's tolerable.

**Larger SMEs** can opt into HA via a tier upgrade — same data model, just more replicas + Postgres backend instead of embedded H2.

### 6.2 Corporate (`shared-sovereign`)

```
Sovereign: bankdhofar
  └── ONE Keycloak (HA, 3 replicas, Postgres backend)
      Federates to Bank Dhofar's corporate Azure AD
        │
        ├── Realm: catalyst-admin       (sovereign-admin team)
        ├── Realm: core-banking         (Org)
        ├── Realm: digital-channels     (Org)
        ├── Realm: analytics            (Org)
        └── Realm: corporate-it         (Org)
```

**Why shared for corporate**: the bank's security perimeter is the entire Sovereign. Every Organization within is a business unit of the same legal entity. Federation to Azure AD is the single auth choke-point anyway. Per-Org Keycloak would mean N times the Azure AD federation config — operational overhead with no security benefit.

### 6.3 App-level SSO

Every Application Blueprint can declare SSO support:

```yaml
# in bp-wordpress configSchema
sso:
  enabled: true     # auto-creates a Keycloak client in the Org's realm
                    # injects credentials via OpenBao + ExternalSecret
```

End users get one-click SSO across all Apps in their Organization without ever seeing OAuth config.

---

## 7. Rotation policy

Every credential class has a SecretPolicy that drives automatic rotation.

```yaml
apiVersion: catalyst.openova.io/v1alpha1
kind: SecretPolicy
metadata:
  name: stricter-rotation
  namespace: catalyst-system
spec:
  appliesTo:
    organizationLabels:
      tier: regulated
  rules:
    - kind: database-credentials
      maxTTL: 1h
      autoRotate: true
    - kind: api-token
      maxTTL: 90d
      autoRotate: true
      rotateBefore: 7d
    - kind: oauth-client-secret
      maxTTL: 90d
      autoRotate: true
    - kind: signing-key
      maxTTL: 365d
      autoRotate: false           # requires explicit approval
      requireApproval: [security-officer]
    - kind: tls-cert
      maxTTL: cert-manager-managed
```
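
To make the interplay of `maxTTL` and `rotateBefore` concrete, the sketch below computes when rotation would start for a rule like `api-token`. It assumes `rotateBefore` means "start rotating this long before the credential reaches `maxTTL`", which is an interpretation of the policy rather than documented controller behaviour; `parse_duration` and `next_rotation` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Only the suffixes that appear in the policy above, plus minutes.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse durations written like the policy values: 1h, 90d, 7d."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        # Non-numeric values such as "cert-manager-managed" land here.
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def next_rotation(issued_at: datetime, max_ttl: str,
                  rotate_before: Optional[str] = None) -> datetime:
    """Return when rotation should begin for a credential issued at `issued_at`."""
    expires_at = issued_at + parse_duration(max_ttl)
    if rotate_before is None:
        return expires_at
    return expires_at - parse_duration(rotate_before)


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
# api-token rule: maxTTL 90d, rotateBefore 7d -> rotation starts on day 83.
print(next_rotation(issued, "90d", "7d").date())  # 2026-03-25
# database-credentials rule: maxTTL 1h, no rotateBefore.
print(next_rotation(issued, "1h").isoformat())    # 2026-01-01T01:00:00+00:00
```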

| Class | Default | Notes |
|---|---|---|
| Workload identity (SPIRE SVID) | 5 min, auto | Not configurable. |
| Dynamic DB creds | 1 h, auto | Per-Blueprint TTL configurable. |
| API tokens, OAuth client secrets | 90 d, auto | `rotateBefore: 7d` gives apps a refresh window. |
| Signing keys, root CAs | 365 d, manual approval | Auto-rotation possible but disabled by default for high-impact keys. |
| TLS certs | cert-manager controlled | ACME/Let's Encrypt, ~60 d, automatic. |
| User passwords (Keycloak) | User-managed + MFA | Min age policy enforced by realm. |

A `security-officer` sees a **RotationDashboard** view: every credential class, age, next rotation, force-rotate button (RBAC-gated).

---

## 8. The path of a secret value (no leakage)

```
1. Generated:    Crossplane composition or OpenBao auto-generator creates value.
                 Never printed. Never echoed. Written directly to OpenBao via API.

2. Referenced:   ExternalSecret CR in Git names the OpenBao path. No value in Git.

3. Materialized: ESO reads OpenBao path (auth via SVID), renders K8s Secret.
                 The K8s Secret is base64-encoded; never logged.

4. Consumed:     Pod mounts as env or file. Reloader watches hash; rolls deploy
                 on change. Application sees plaintext only via mount or env.

5. Rotated:      SecretPolicy controller invokes rotation API on OpenBao.
                 New value generated, replication propagates, ESO re-reads,
                 Reloader rolls. Old value retained for grace window (24h),
                 then revoked.

6. Audited:      Every step logged to Catalyst audit log. No plaintext.
```

**What never happens:**
- Plaintext secrets in Git.
- Plaintext secrets in shell command output.
- Plaintext secrets in issues, PRs, comments, or chat.
- Plaintext secrets in commit messages, branch names, tag names.

If a secret is ever leaked via terminal output (a misconfigured `kubectl describe`, a debug log), the leak is treated as a P1 incident: rotate immediately, audit history, communicate.

---

## 9. Compliance posture

| Standard | Catalyst posture |
|---|---|
| **SOC 2 Type 2** | Audit logging in JetStream + OpenSearch SIEM cold storage. SecretPolicy enforces rotation. EnvironmentPolicy enforces approvals. |
| **PSD2 / FAPI** | Fingate Blueprint composes Keycloak (FAPI authorization), eIDAS cert verification, ext_authz. |
| **DORA** | Resilience testing via Litmus chaos Blueprint. Multi-region by default for regulated tier. |
| **NIS2** | Falco runtime detection + OpenSearch SIEM + Kyverno policy + supply-chain (cosign + Syft+Grype). |
| **GDPR** | Per-region data residency via Placement spec. Right-to-be-forgotten flow defined per Application Blueprint. |
| **ISO 27001** | Mappings published per control; evidence surfaced via Catalyst console audit views and SIEM exports. |

Every Sovereign exports its audit log to a customer-specified SIEM. Default: OpenSearch in the Sovereign itself; customers may push to external Splunk, Datadog SIEM, etc.

---

## 10. Threat model summary

| Threat | Mitigation |
|---|---|
| Stolen ServiceAccount token | Workloads authenticate with SVIDs, not static ServiceAccount tokens. An SVID has a 5-minute TTL, so a stolen one expires at the next rotation. |
| Stolen K8s Secret | Encrypted at rest in etcd. Pulled only via ESO with SVID. |
| Compromised Pod | NetworkPolicy (Cilium) + L7 policies limit blast radius. Falco detects anomalous syscalls. |
| Malicious commit to Environment Gitea | EnvironmentPolicy requires PR approvals. Kyverno admission control denies non-policy-compliant manifests. |
| Compromised Blueprint upstream | All Blueprints are cosigned. Kyverno verify-signatures policy denies unsigned/wrong-issuer artifacts. |
| Cross-Org leakage | vcluster isolation. JetStream Account isolation. Keycloak realm isolation (per-Org or shared). |
| Compromised sovereign-admin account | MFA required at Keycloak. JIT elevation for production-impacting actions. Full audit trail to SIEM. |
| Compromised OpenBao node | 2-of-3 Raft quorum required for writes. Audit log captures every read. Rotate root token + re-shard quarterly. |
| Region-wide failure | Independent OpenBao Raft per region. PowerDNS lua-records (`ifurlup`) drop the affected regional endpoint from authoritative responses within the health-check window. Apps with `active-active` keep serving from healthy region. |
| Supply-chain attack on a build | SLSA-3 build provenance, cosign signing, Syft+Grype SBOM scanned in CI and at runtime by Trivy. |

---

*See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the broader platform context.*