docs(iter-3-5): purge operator-as-entity, fix Workspace-controller capital, JetStream KV references

ARCHITECTURE (iter 3):
- Removed catalystctl from the §4 write-side diagram (it's read-only;
  presenting it as a write input contradicted §7.4).
- "Both tabs read the same Valkey snapshot" → "JetStream KV snapshot"
  in §5 (Valkey is no longer in the control plane).
- §7.4: catalystctl reframed as something that "may exist as a small
  read-only debug CLI" rather than implying it ships today.
- §11 dependency list: added bp-catalyst-provisioning; removed
  bp-catalyst-crossplane (Crossplane is per-host-cluster infra, not a
  Catalyst control-plane component); added clarifying note.
- §12 CRD list: added SecretPolicy + Runbook (were already in
  IMPLEMENTATION-STATUS but missing from the principles table).
- §2 SME-style description: "SaaS Operator team (Omantel staff)" →
  "SaaS provider's cloud team" (Operator banned as entity).

NAMING-CONVENTION (iter 4):
- §5.1 heading "operator domain" → "Sovereign domain".
- §7 multi-region diagram: replaced piecemeal Catalyst component list
  with a deferral to PLATFORM-TECH-STACK §2; added SPIRE server;
  fixed "per-Org workspaces" → "per-Environment Gitea repos"; added
  per-host-cluster infrastructure callout.

SECURITY (iter 6 — partial; fold into this commit):
- "operator-approved" → "sovereign-admin-approved" for DR promotion.
- Realm name "catalyst-operator" → "catalyst-admin" (entity-noun
  scrubbed from the realm naming itself).

SOVEREIGN-PROVISIONING (iter 7 — partial):
- "single operator's laptop" → "single person's laptop" (avoid
  "operator" as entity).
- "the next operator" → "the next Sovereign provisioning request,
  regardless of who initiates it".
- "catalyst-operator realm" → "catalyst-admin realm" (×2).
- Capital-W "Workspace-controller" residuals (3) →
  "Environment-controller" (replace_all is case-sensitive; the
  previous iter caught lowercase only).

PERSONAS (iter 5):
- P3 "within a Sovereign Operator team" → "within a Sovereign's
  operations team".
- Two capital-W "Workspace-controller" residuals fixed.

SRE (iter 11 — partial):
- §13.2 "Workspace-controller stuck" runbook entry →
  "Environment-controller stuck".

Banned-term sweep result post-fix: no
`Operator team|role|account|user|admin` anywhere; no capital-W
Workspace as Catalyst scope; no Valkey-as-control-plane refs.
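A sweep like the one above could be scripted roughly as follows. This is a minimal sketch, not the tooling actually used for this commit: the rule names, the exact regexes, and the `sweep` helper are all assumptions made for illustration. The Workspace-controller rule is deliberately case-insensitive, since a case-sensitive replace is what let the capital-W residuals slip through earlier.

```python
import re

# Hypothetical rules approximating the banned-term sweep described above.
BANNED = {
    "operator-as-entity": re.compile(r"\bOperator (team|role|account|user|admin)\b"),
    # Case-insensitive so both "Workspace-controller" and "workspace-controller" are caught.
    "workspace-controller": re.compile(r"\bworkspace-controller\b", re.IGNORECASE),
    "valkey-as-control-plane": re.compile(r"\bValkey snapshot\b"),
}


def sweep(text):
    """Return (line_number, rule_name, line) for every banned-term hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in BANNED.items():
            if pattern.search(line):
                hits.append((lineno, name, line.strip()))
    return hits
```

An empty result from `sweep` over each changed file would correspond to the clean sweep reported above.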

Refs #37
hatiyildiz 2026-04-27 21:09:31 +02:00
parent 27325edb32
commit 80b91709e1
6 changed files with 48 additions and 39 deletions


@ -24,7 +24,7 @@ The model serves two distinct customer shapes through the **same code**:
│ Many small Organizations, mostly single-Environment │
│ Each Org gets its own minimal Keycloak (no HA) │
│ Self-service marketplace, next-next-next install │
│ Sovereign-admins are the SaaS Operator team (Omantel staff)
│ Sovereign-admins are the SaaS provider's cloud team
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
@ -96,9 +96,13 @@ Everything else is identical in code.
## 4. Write side: Git → Flux → Kubernetes (+ Crossplane)
```
Console UI catalystctl (read-only) REST/GraphQL API
│ │ │
▼ ▼ ▼
Console UI REST/GraphQL API
│ │
│ (Git push from any of these │
│ bypasses provisioning and goes │
│ straight to the Gitea repo; │
│ webhook + projector still fire) │
▼ ▼
┌──────────────────────────────────────────────────────────┐
│ provisioning service │
│ - validates configSchema against Blueprint │
@ -180,7 +184,7 @@ Everything else is identical in code.
**One spine (JetStream), one read model (JetStream KV), one consumer (projector), one stream (SSE).**
The console **never talks to k8s API or Git directly.** This is the architectural lock that prevents the "App says installed in one tab, failed in another tab" class of bug. Both tabs read the same Valkey snapshot served by the same projector replica.
The console **never talks to k8s API or Git directly.** This is the architectural lock that prevents the "App says installed in one tab, failed in another tab" class of bug. Both tabs read the same JetStream KV snapshot served by the same projector replica.
JetStream replaces the older Redpanda + Valkey pairing in the control plane: NATS is Apache 2.0 (no BSL risk), has native KV (fewer moving parts), and native multi-tenant Accounts (cleaner per-Org isolation). Application-layer event needs (e.g. TalentMesh's voice pipeline) remain free to choose Redpanda, Kafka, NATS, or anything else — that's an Application-level decision, not a control-plane one.
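The "one consumer, one read model" rule can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: `KVSnapshot` stands in for a JetStream KV bucket, and the event shape (`app`, `status`) is hypothetical. It does not use the NATS client API.

```python
from dataclasses import dataclass, field


@dataclass
class KVSnapshot:
    """In-memory stand-in for a JetStream KV bucket (hypothetical)."""
    revision: int = 0
    state: dict = field(default_factory=dict)


class Projector:
    """Single consumer that folds ordered events into the one read model."""

    def __init__(self):
        self.kv = KVSnapshot()

    def apply(self, event):
        # Each event names an Application and its new status.
        self.kv.state[event["app"]] = event["status"]
        self.kv.revision += 1

    def read(self, app):
        # Every console tab reads through here, never from k8s or Git.
        return self.kv.revision, self.kv.state.get(app)
```

Because both tabs call the same `read` against the same snapshot, they cannot disagree about whether an Application is installed or failed.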
@ -255,7 +259,7 @@ The API exposes the same operations the console performs. It is **not** an IaC a
### 7.4 What's deliberately NOT a surface
- `kubectl` — useful for debugging inside one's own vcluster; never a configuration mechanism.
- A standalone CLI for production changes — Catalyst exposes a small read-only CLI for support purposes; not for installs/promotions.
- A standalone CLI for production changes — Catalyst may expose a small read-only debug CLI in the future; not authoritative for installs/promotions.
- Terraform / Pulumi — Crossplane covers non-K8s; it is platform plumbing, not user-facing.
---
@ -387,20 +391,22 @@ bp-catalyst-platform ← umbrella
├── depends: bp-catalyst-console
├── depends: bp-catalyst-marketplace
├── depends: bp-catalyst-admin
├── depends: bp-catalyst-projector
├── depends: bp-catalyst-catalog-svc
├── depends: bp-catalyst-projector
├── depends: bp-catalyst-provisioning
├── depends: bp-catalyst-environment-controller
├── depends: bp-catalyst-blueprint-controller
├── depends: bp-catalyst-billing
├── depends: bp-catalyst-gitea
├── depends: bp-catalyst-nats-jetstream
├── depends: bp-catalyst-openbao
├── depends: bp-catalyst-keycloak
├── depends: bp-catalyst-spire
├── depends: bp-catalyst-crossplane
└── depends: bp-catalyst-observability
├── depends: bp-catalyst-gitea ← per-Sovereign Git server
├── depends: bp-catalyst-nats-jetstream ← event spine + KV
├── depends: bp-catalyst-openbao ← secret backend
├── depends: bp-catalyst-keycloak ← user identity
├── depends: bp-catalyst-spire ← workload identity
└── depends: bp-catalyst-observability ← OTel + Grafana stack
```
(Cilium, Flux, Crossplane, Cert-manager, Kyverno, Harbor, External-Secrets, Reloader, Falco, Sigstore, Syft+Grype are **per-host-cluster infrastructure**, not Catalyst control-plane components — see [`PLATFORM-TECH-STACK.md`](PLATFORM-TECH-STACK.md) §1. They get installed once per host cluster, before Catalyst itself.)
Installing `bp-catalyst-platform` once gives you a working Sovereign. Same Blueprint installed on Hetzner = the openova Sovereign. Same Blueprint installed on AWS for a bank = that bank's Sovereign. Same Blueprint installed on Hetzner for a telco = the omantel Sovereign. **One artifact. Zero divergence.**
OpenOva's own customer Applications (Cortex, Fingate, Fabric, Relay, Specter, Axon) are similarly composite Blueprints that run **on top of** Catalyst — they are Applications inside the `openova-public` Environment of the openova Sovereign.
@ -414,7 +420,7 @@ OpenOva's own customer Applications (Cortex, Fingate, Fabric, Relay, Specter, Ax
| **CQRS** | Write side: Git → Flux → K8s. Read side: catalog-svc + projector. |
| **GitOps as truth** | Every state change is a commit. Rollback = `git revert`. Audit = `git log`. |
| **Event sourcing** | NATS JetStream is the durable event log. Projector replays for recovery. |
| **CRD-driven control plane** | Sovereign, Organization, Environment, Blueprint, Application, EnvironmentPolicy — all CRDs. Controllers reconcile. |
| **CRD-driven control plane** | Sovereign, Organization, Environment, Application, Blueprint, EnvironmentPolicy, SecretPolicy, Runbook — all CRDs. Controllers reconcile. |
| **Multi-tenancy at OS layer** | vcluster per Organization per host cluster — isolated K8s API + control plane per Org. |
| **Crossplane for non-K8s** | All cloud-side resources via Compositions. Users never see Crossplane. |
| **OCI artifacts for software** | Blueprints are signed OCI manifests, cosigned, SBOMed. |


@ -262,7 +262,7 @@ Inside an Organization's vcluster, each Application gets its own namespace.
Two patterns coexist depending on whether the DNS is for **Catalyst control-plane** services or for **Application** endpoints inside an Organization.
#### Catalyst control-plane DNS (operator domain)
#### Catalyst control-plane DNS (Sovereign domain)
```
{component}.{location-code}.{sovereign-domain}
@ -394,16 +394,19 @@ hz-fsn-dmz-prod hz-hel-dmz-prod
Management (one per Sovereign, single region recommended)
────────────────────────────────────────────────────────────
hz-nbg-mgt-prod
Catalyst control plane (console, projector, marketplace, admin,
catalog-svc, blueprint-controller,
environment-controller)
Gitea (Blueprint mirror + per-Org workspaces)
NATS JetStream (event spine, per-Org accounts)
OpenBao (secrets — one cluster here; sibling clusters in workload regions
sync via async perf replication; see SECURITY.md)
Keycloak (per-Org realms in SME-style; per-Sovereign realm in corporate)
Flux (GitOps for Catalyst itself)
Crossplane (manages workload clusters and cloud resources)
All Catalyst control-plane components — see PLATFORM-TECH-STACK §2.
Highlights:
Gitea (Blueprint catalog mirror + per-Org private Blueprint
repos + per-Environment Gitea repos)
NATS JetStream (event spine + KV; per-Org Accounts)
OpenBao (secrets — primary Raft cluster here; sibling replicas
in each workload region with async perf replication.
Each region's Raft is independent. See SECURITY §5.)
Keycloak (per-Org realms in SME-style; per-Sovereign realm in
corporate-style)
SPIRE server (workload identity)
Plus per-host-cluster infrastructure (Cilium, Flux, Crossplane,
cert-manager, Kyverno, Harbor, etc.) — see PLATFORM-TECH-STACK §1.
```
When FSN becomes unavailable, `hz-hel-rtz-prod` serves all traffic for Applications with `placement: active-active` or `active-hotstandby`. The cluster name does not change. k8gb removes the FSN endpoint from DNS. Recovery is a routing event, not a renaming event.
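The failover behaviour described above can be sketched as a pure routing decision. This is an illustration only, not k8gb's API: the `dns_endpoints` helper and the `single-region` fallback are assumptions, while the cluster names and the two placement values come from the text.

```python
PRIMARY = "hz-fsn-rtz-prod"
SECONDARY = "hz-hel-rtz-prod"


def dns_endpoints(placement, healthy):
    """Return the cluster endpoints that stay in DNS for an Application."""
    if placement == "active-active":
        candidates = [PRIMARY, SECONDARY]
    elif placement == "active-hotstandby":
        # The standby only serves when the primary is unavailable.
        candidates = [PRIMARY] if PRIMARY in healthy else [SECONDARY]
    else:
        # Hypothetical single-region placement: no failover target.
        candidates = [PRIMARY]
    # Cluster names never change; unhealthy endpoints are simply dropped.
    return [c for c in candidates if c in healthy]
```

With FSN down, both `active-active` and `active-hotstandby` resolve to `hz-hel-rtz-prod` alone, which matches the "routing event, not a renaming event" framing.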


@ -13,7 +13,7 @@ How different people use Catalyst. Defer to [`GLOSSARY.md`](GLOSSARY.md) for ter
|---|---|---|---|
| **P1** | **OpenOva Engineer** | github.com/openova-io | Catalyst codebase, Blueprint repos |
| **P2** | **`sovereign-admin`** | Catalyst admin UI + Sovereign Gitea | Browser UI, Git, kubectl (debug) |
| **P3** | **Support Agent** (within a Sovereign Operator team) | Catalyst admin UI in support mode | Browser UI |
| **P3** | **Support Agent** (within a Sovereign's operations team) | Catalyst admin UI in support mode | Browser UI |
| **P4** | **`org-admin`** | Org-scoped Catalyst console | Browser UI, occasional Git |
| **P5** | **SME End User** (e.g. Ahmed, pharmacy owner on Omantel) | Marketplace + the App they installed | Browser only |
| **P6** | **SME Power User** (e.g. Ahmed's tech-savvy nephew) | Console with Developer mode toggled on | Browser, occasionally Git |
@ -79,7 +79,7 @@ Day 1 — 14:00
bill verification (federated identity). Account created.
4. Catalyst auto-creates: Organization "muscat-pharmacy", Environment
"muscat-pharmacy-prod", vcluster "muscatpharmacy" on hz-fsn-rtz-prod.
Workspace-controller spins up the vcluster in ~60 seconds.
Environment-controller spins up the vcluster in ~60 seconds.
5. Bundle install wizard: 3 simple steps —
Step 1: subdomain (muscatpharmacy.shop.omantel.com)
Step 2: business details (form generated from Blueprint configSchema)
@ -139,7 +139,7 @@ Day 1 — 14:08 — Ahmed is selling.
15:00 New Environment needed for a fraud lab. From the console:
"New Environment in analytics" → fills name "fraud-lab-dev" →
picks "small" topology (1 region, single bb=rtz). Workspace-controller
picks "small" topology (1 region, single bb=rtz). Environment-controller
creates the vcluster, bootstraps Flux, creates Gitea repo. Ready in
60s. Layla now has a new sandbox.


@ -153,7 +153,7 @@ Critical: each region runs its **own** Raft cluster. There is no cross-region Ra
- **Each region has its own self-contained 3-node Raft cluster.** Quorum is **intra-region only** (need 2-of-3 in the same region).
- **A total Region A failure does NOT require any other region to do anything.** Region B and C continue serving reads from their local replicated data.
- **Network partition between regions:** each region keeps operating independently. Writes pause on standby regions (since they're read-only by design).
- **DR promotion is explicit.** Either operator-approved or automated by failover-controller with strict criteria. Not automatic on every blip.
- **DR promotion is explicit.** Either `sovereign-admin`-approved or automated by failover-controller with strict criteria. Not automatic on every blip.
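The quorum and promotion rules above can be sketched as two small predicates. This is a minimal sketch under stated assumptions: the function names are hypothetical, and `min_outage_seconds` is an invented stand-in for the failover-controller's "strict criteria", which the text does not specify.

```python
def has_quorum(healthy_nodes, cluster_size=3):
    """Intra-region Raft quorum: a strict majority of the local cluster."""
    return healthy_nodes >= cluster_size // 2 + 1


def may_promote(standby_healthy_nodes, primary_down_seconds,
                admin_approved, min_outage_seconds=300):
    """Decide whether a standby region may be promoted.

    min_outage_seconds is a hypothetical threshold so that a short
    blip never triggers promotion automatically.
    """
    if not has_quorum(standby_healthy_nodes):
        return False  # the standby must have its own local quorum first
    return admin_approved or primary_down_seconds >= min_outage_seconds
```

Note that neither function looks at nodes in another region: quorum is evaluated per region, so a total failure elsewhere does not affect the local result.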
### 5.2 Read/write semantics
@ -211,7 +211,7 @@ Sovereign: bankdhofar
└── ONE Keycloak (HA, 3 replicas, Postgres backend)
Federates to Bank Dhofar's corporate Azure AD
├── Realm: catalyst-operator (sovereign-admin team)
├── Realm: catalyst-admin (sovereign-admin team)
├── Realm: core-banking (Org)
├── Realm: digital-channels (Org)
├── Realm: analytics (Org)


@ -29,9 +29,9 @@ How to provision a new **Sovereign** — a self-sufficient deployed instance of
The bootstrap is performed by `catalyst-provisioner.openova.io`, an always-on provisioning service operated by OpenOva. It is **not** part of any Sovereign at runtime — once a Sovereign is up, it is fully self-sufficient.
Why a permanent provisioner instead of "boot from your laptop":
- OpenTofu state must be durably stored — keeping it on a single operator's laptop is fragile and a security risk.
- OpenTofu state must be durably stored — keeping it on a single person's laptop is fragile and a security risk.
- Provider credentials are scoped, vault-stored, and never leave the provisioner.
- New Sovereigns can be created without a manual installer dance — the same machinery serves the next operator.
- New Sovereigns can be created without a manual installer dance — the same machinery serves the next Sovereign provisioning request, regardless of who initiates it.
A self-host route exists for organizations that want zero OpenOva involvement: `catalyst-provisioner` is itself a Blueprint (`bp-catalyst-provisioner`) and can be deployed in a customer's own infrastructure. From there it bootstraps further Sovereigns. This is the air-gap path.
@ -66,7 +66,7 @@ catalyst-provisioner Target cloud (e.g. Hetzner)
records (via Crossplane) console.<sovereign>.<domain> A
admin.<sovereign>.<domain> A
4. Keycloak realm provisioning ─────────► catalyst-operator realm
4. Keycloak realm provisioning ─────────► catalyst-admin realm
(initial sovereign-admin user)
5. Smoke tests ─────────► Console reachable with TLS
@ -111,16 +111,16 @@ Day-1 actions
1. Configure cert-manager issuers (Let's Encrypt / corporate CA).
2. Configure backup destination (cloud object storage for Velero).
3. Configure Harbor with image-scanning policies.
4. (Optional) Federate Keycloak's catalyst-operator realm to corporate IdP.
4. (Optional) Federate Keycloak's catalyst-admin realm to corporate IdP.
5. (Optional) Configure observability exports (SIEM, Datadog, etc.).
6. Onboard the first Organization:
Catalyst console → Admin → Organizations → New
Provide: name, contact, plan.
Workspace-controller does NOT create vclusters yet.
Environment-controller does NOT create vclusters yet.
They are created when the first Environment is provisioned.
7. Create the first Environment in that Organization:
Console → switch to Org context → Environments → New
Workspace-controller spins up a vcluster on the chosen host cluster.
Environment-controller spins up a vcluster on the chosen host cluster.
Bootstraps Flux inside, creates Gitea repo, wires webhook.
Ready in ~60 seconds.
```
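The lazy behaviour in steps 6 and 7 (no vcluster at Organization creation, one at first Environment) can be sketched as follows. This is an illustrative model, not the controller's actual code: the class shape, method names, and the choice to key vclusters by Environment name are assumptions.

```python
class EnvironmentController:
    """Toy model of when vclusters get created (hypothetical)."""

    def __init__(self):
        self.organizations = set()
        self.vclusters = {}  # Environment name -> host cluster

    def create_organization(self, name):
        # Onboarding an Organization records it but creates no vcluster.
        self.organizations.add(name)

    def create_environment(self, org, env, host_cluster):
        if org not in self.organizations:
            raise KeyError(f"unknown Organization: {org}")
        # The vcluster appears only now, on the chosen host cluster.
        self.vclusters[env] = host_cluster
        return ["create vcluster", "bootstrap Flux",
                "create Gitea repo", "wire webhook"]
```

The returned list mirrors the order of actions named in step 7; a real controller would reconcile these as resources rather than return strings.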
@ -232,7 +232,7 @@ Rare but supported. Example: a Bank Dhofar Organization started life on the open
- Keycloak realm export (users, federated identities)
- OpenBao export (sealed secrets only)
3. On bankdhofar Sovereign: Admin → Organization → Import
Workspace-controller recreates Environments → vclusters.
Environment-controller recreates Environments → vclusters.
Flux pulls manifests, reconciles.
Apps come up.
4. Final cutover: DNS swap.


@ -485,7 +485,7 @@ route:
|---|---|---|
| Console unreachable | P1 | Check Cilium Gateway, console pods, projector pods |
| Gitea unreachable | P1 | Check Gitea pods, CNPG primary, NetworkPolicy |
| Workspace-controller stuck | P1 | Check controller logs, Crossplane provider auth |
| Environment-controller stuck | P1 | Check controller logs, Crossplane provider auth |
| OpenBao sealed | P1 | Auto-unseal SPIRE — verify SPIRE server health |
| JetStream consumer lag | P2 | Add consumer replica, check disk pressure |
| projector lag | P2 | Check JetStream consumer status, projector replicas |