
OpenOva Project Memory

Last Updated: 2026-02-26
Purpose: Persistent context for Claude Code sessions about OpenOva platform strategy and architecture


0. Final Building Blocks Table (2026-01-17)

Mandatory Components (Always Installed)

| Category | Component | Purpose |
| --- | --- | --- |
| IaC | OpenTofu | Bootstrap provisioning (MPL 2.0) |
| IaC | Crossplane | Day-2 cloud resources |
| CNI | Cilium | eBPF networking + Hubble |
| Mesh | Cilium Service Mesh | mTLS, L7 policies (replaces Istio) |
| WAF | Coraza | OWASP CRS with Envoy Gateway |
| Supply Chain | Sigstore/Cosign | Container image signing |
| Supply Chain | Syft + Grype | SBOM + vulnerability matching |
| Operations | Reloader | Auto-restart on config changes |
| GitOps | Flux | GitOps delivery (ArgoCD future option) |
| Git | Gitea | Internal Git server (bidirectional mirror) |
| TLS | cert-manager | Certificate automation |
| Secrets | External Secrets (ESO) | Secrets operator |
| Secrets | OpenBao | Secrets backend (MPL 2.0, drop-in Vault replacement) |
| Policy | Kyverno | Auto-generate PDBs, NetworkPolicies |
| Scaling | VPA | Vertical Pod Autoscaler |
| Scaling | KEDA | Event-driven + scale-to-zero |
| Observability | Grafana Stack | Alloy + Loki + Mimir + Tempo + Grafana |
| Observability | OpenTelemetry | Auto-instrumentation (independent of mesh) |
| Registry | Harbor | Container registry + scanning |
| Storage | MinIO | Fast S3 (tiered to archival) |
| Backup | Velero | Backup to archival S3 |
| DNS | ExternalDNS | Sync to DNS provider |
| GSLB | k8gb | Authoritative DNS + cross-region GSLB |
| Failover | Failover Controller | Generic failover orchestration |

User Choice Options

| Category | Options | Notes |
| --- | --- | --- |
| Cloud Provider | Hetzner (now), Huawei/OCI (coming) | Provider unlocks related services |
| Regions | 1 or 2 | 2 recommended for DR, 1 allowed |
| LoadBalancer | Cloud LB (~5-10/mo), k8gb DNS-based (free), Cilium L2 (free, single subnet) | Cloud LB recommended |
| DNS Provider | Cloudflare (always), Hetzner DNS, Route53/Cloud DNS/Azure DNS (if using that cloud) | Cloudflare recommended |
| Secrets Backend | OpenBao self-hosted, cloud secret managers | Self-hosted OpenBao recommended |
| Archival S3 | Cloudflare R2, AWS S3, GCP GCS, Azure Blob, OCI Object Storage, Huawei OBS | For backup + MinIO tiering |

A La Carte Data Services

| Component | Purpose | DR Strategy |
| --- | --- | --- |
| CNPG | PostgreSQL operator | WAL streaming (async primary-replica) |
| FerretDB | MongoDB wire protocol on PostgreSQL | Via CNPG WAL streaming |
| Strimzi | Apache Kafka streaming | MirrorMaker2 |
| Valkey | Redis-compatible cache (BSD-3 OSS) | REPLICAOF |
| ClickHouse | OLAP analytics | ReplicatedMergeTree |

A La Carte Communication

| Component | Purpose |
| --- | --- |
| Stalwart | Email server (JMAP/IMAP/SMTP) |
| STUNner | K8s-native TURN/STUN (WebRTC) |
| LiveKit | Video/audio (WebRTC SFU) |
| Matrix/Synapse | Team chat (federation) |

1. Critical Architecture Decisions (2026-01-17)

Service Mesh: Cilium (NOT Istio)

Decision: Cilium Service Mesh replaces Istio entirely.

Rationale:

  • OpenTelemetry auto-instrumentation is independent of service mesh (via init container injection)
  • SQL query visibility comes from OTel Java/Python/Node agents, NOT Envoy sidecars
  • Cilium provides mTLS via eBPF with lower resource overhead
  • Single CNI+Mesh solution reduces operational complexity

Cilium Service Mesh Features:

| Feature | How |
| --- | --- |
| mTLS | Cilium identity-based encryption |
| L7 Policies | Envoy proxy (CiliumEnvoyConfig) |
| Traffic Management | CiliumNetworkPolicy + HTTPRoute |
| Observability | Hubble + OTel (independent) |

Git Provider: Gitea Only

Decision: Gitea is the sole internal Git provider. GitHub/GitLab options removed.

Architecture:

  • Gitea deployed in each region
  • Bidirectional mirroring between Gitea instances
  • CNPG for metadata storage (async primary-replica, NOT multi-master)
  • Each Gitea connects to LOCAL CNPG only
  • Cross-region writes via primary region
  • Gitea Actions for CI/CD and approval workflows

DNS Architecture: k8gb Authoritative

Decision: k8gb acts as authoritative DNS server (NOT just a record manager).

Architecture:

  • k8gb CoreDNS serves as authoritative DNS for GSLB zone
  • Domain registrar NS records point to k8gb CoreDNS LoadBalancer IPs
  • k8gb CoreDNS is SEPARATE from Kubernetes internal CoreDNS
  • No Cloudflare hybrid option - k8gb handles entire GSLB zone

Split-Brain Protection: Cloud Witness (Cloudflare)

Decision: Use Cloudflare Workers + KV as cloud witness for lease-based failover authority.

Why Cloud Witness (not external DNS resolvers):

  • External DNS resolvers can only verify if a region is reachable, not who should be active
  • Lease-based approach provides true single-source-of-truth
  • Prevents k8gb's DNS-based failover from causing split-brain during partitions

Mechanism:

  • Active region holds lease in Cloudflare KV (renews every 10s, TTL 30s)
  • Standby region cannot become active while lease is held
  • Failover Controller gates all readiness based on lease ownership
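
As a concrete illustration, the active region's side of this protocol is little more than a renewal loop. The witness URL, endpoint path, and payload below are assumptions for illustration; the real contract lives in the Failover Controller ADR.

```go
// Sketch of the active region's lease renewal against the Cloudflare
// Worker witness. Endpoint and payload shape are assumed, not the
// ADR's actual contract.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

const (
	witnessURL    = "https://witness.example.workers.dev/lease/renew" // assumed endpoint
	renewInterval = 10 * time.Second                                  // renew every 10s
	leaseTTL      = 30 * time.Second                                  // witness-side TTL
)

func renewLoop(region string) {
	ticker := time.NewTicker(renewInterval)
	defer ticker.Stop()
	for range ticker.C {
		body, _ := json.Marshal(map[string]string{"holder": region, "ttl": leaseTTL.String()})
		resp, err := http.Post(witnessURL, "application/json", bytes.NewReader(body))
		if err != nil {
			// Witness unreachable: keep current state, do NOT self-promote.
			log.Printf("witness unreachable: %v", err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusConflict {
			// Another region holds the lease: this cluster must stay standby.
			log.Println("lease held elsewhere; remaining standby")
		}
	}
}

func main() { renewLoop("region-a") }
```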

Failover Controller: Comprehensive Failover Orchestration

Decision: Build a Failover Controller that controls ALL failover (not just databases).

Scope (Three Layers):

  1. External traffic (Gateway API → k8gb): Controls HTTPRoute readiness
  2. Internal traffic (Cilium Cluster Mesh): Controls Service endpoints
  3. Stateful services (CNPG, MongoDB): Signals database promotion

Key Insight: k8gb alone cannot prevent split-brain during network partitions. The Failover Controller gates k8gb's view by controlling whether endpoints are visible.

Architecture:

  • Cloudflare Worker + KV as witness (lease-based authority)
  • Per-cluster Failover Controller with state machine (ACTIVE/STANDBY/FAILING_OVER)
  • Actuators for Gateway, Service, and Database resources

Modes: automatic | semi-automatic | manual (for regulated environments)
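
Because every transition hinges on one input (lease ownership), the controller's core can stay small. A sketch with hypothetical type and method names, not the shipped API:

```go
// Minimal sketch of the Failover Controller state machine; actuator
// interfaces and names are illustrative.
package failover

type State string

const (
	Active      State = "ACTIVE"
	Standby     State = "STANDBY"
	FailingOver State = "FAILING_OVER"
)

// Actuator flips one of the three controlled layers.
type Actuator interface {
	Enable() error  // expose HTTPRoutes/endpoints, promote database
	Disable() error // hide HTTPRoutes/endpoints, demote database
}

type Controller struct {
	state     State
	actuators []Actuator // gateway, service-endpoint, database
}

// Reconcile is driven by the witness: holdsLease is true only while
// this cluster's lease in Cloudflare KV is current.
func (c *Controller) Reconcile(holdsLease bool) {
	switch {
	case holdsLease && c.state != Active:
		c.state = FailingOver
		for _, a := range c.actuators {
			if err := a.Enable(); err != nil {
				return // retry on next reconcile; stay FAILING_OVER
			}
		}
		c.state = Active
	case !holdsLease && c.state != Standby:
		for _, a := range c.actuators {
			a.Disable() // fail closed: never serve without the lease
		}
		c.state = Standby
	}
}
```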

DDoS Protection: Cloud Provider Native

Decision: Rely on cloud provider native DDoS protection.

| Provider | Protection | Visibility |
| --- | --- | --- |
| Hetzner | Automatic, always-on | Low (black box) |
| OCI | Always-on, free | Medium |
| Huawei | Anti-DDoS Basic (free) | Low-Medium |

No Cloudflare proxy required - cloud providers handle volumetric attacks at the edge.

WAF (L7): Coraza handles application-layer protection separately.

Multi-Region Strategy

  • Recommended 2 regions (BCP/DR) but 1 region allowed
  • Independent clusters per region (NOT stretched clusters)
  • Each cluster survives independently during network partition
  • Async data replication between regions (eventual consistency)

Cloud Providers

  • Primary: Hetzner Cloud (first supported)
  • Coming Soon: Huawei Cloud, Oracle Cloud (OCI)
  • Dropped: Contabo (no Crossplane support), AWS/GCP/Azure (future consideration)

LoadBalancer Strategy

  • Option 1: Cloud provider LoadBalancers (Hetzner LB, OCI LB, etc.) - recommended
  • Option 2: k8gb DNS-based LB (Gateway API hostNetwork + k8gb health routing) - free
  • Option 3: Cilium L2 Mode (ARP-based, same subnet only) - free
  • BGP is NOT available on target cloud providers (only bare-metal/dedicated)

Secrets Management

  • SOPS eliminated completely - not even for bootstrap
  • Interactive bootstrap: Wizard generates credentials, operator saves them
  • Architecture: Independent OpenBao per cluster + ESO PushSecrets for cross-cluster sync
  • Flow: K8s Secret → ESO PushSecret → Both OpenBao instances simultaneously
  • ESO Generators: Auto-create complex passwords/keys (no manual generation)
  • All secrets managed via K8s CRDs (no manual OpenBao updates)
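
For illustration, this is roughly the PushSecret that implements the flow above: one K8s Secret pushed to two regional OpenBao stores, rendered here from a Go map. Store names, key paths, and the v1alpha1 API version are assumptions to verify against the installed ESO CRDs.

```go
// Emits a PushSecret manifest targeting both regional OpenBao
// SecretStores; pushing the same Secret to both backends is the entire
// cross-cluster sync mechanism (no OpenBao-level replication).
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

func main() {
	pushSecret := map[string]any{
		"apiVersion": "external-secrets.io/v1alpha1", // verify against installed CRD version
		"kind":       "PushSecret",
		"metadata":   map[string]any{"name": "gitea-admin", "namespace": "gitea"},
		"spec": map[string]any{
			"refreshInterval": "10s",
			"secretStoreRefs": []any{
				map[string]any{"name": "openbao-region-a", "kind": "SecretStore"},
				map[string]any{"name": "openbao-region-b", "kind": "SecretStore"},
			},
			"selector": map[string]any{"secret": map[string]any{"name": "gitea-admin"}},
			"data": []any{
				map[string]any{"match": map[string]any{
					"secretKey": "password",
					"remoteRef": map[string]any{"remoteKey": "gitea/admin"},
				}},
			},
		},
	}
	out, err := yaml.Marshal(pushSecret)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```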

Storage Architecture

  • MinIO: Fast S3 (in-cluster) with tiered storage
  • Archival S3: External cloud storage (R2, S3, GCS, Blob, OBS)
  • MinIO tiers to Archival S3 for cold data
  • Velero backs up to Archival S3 (not MinIO)
  • Harbor backs up to Archival S3

Cross-Region Networking

  • WireGuard mesh for cross-region connectivity
  • OR native cloud peering if same provider (Hetzner vSwitch, OCI FastConnect)
  • Required for: OpenBao sync, k8gb coordination, data replication, Gitea mirroring

Data Replication Patterns (All Community Edition)

| Service | Replication Method |
| --- | --- |
| CNPG (Postgres) | WAL streaming to standby cluster (async primary-replica) |
| Gitea | Bidirectional mirror + CNPG for metadata |
| MongoDB (superseded by FerretDB, below) | CDC via Debezium → Kafka → Sink Connector |
| Strimzi (Kafka) | MirrorMaker2 (native) |
| Valkey | REPLICAOF command (async) |
| MinIO | Bucket replication |
| Harbor | Registry replication |

FerretDB (Replaces MongoDB)

  • MongoDB replaced by FerretDB (Apache 2.0, MongoDB wire protocol on PostgreSQL)
  • FerretDB uses CNPG as backend - replication via standard PostgreSQL WAL streaming
  • No Debezium/Kafka CDC required for replication (uses CNPG native replication)
  • Full ACID transactions via PostgreSQL
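
Since the wire protocol is unchanged, applications keep their stock MongoDB driver and only the connection URI changes. A minimal Go sketch (service name and credentials are illustrative):

```go
// Connects to FerretDB with the unmodified official MongoDB driver.
// Documents land in CNPG PostgreSQL, so DR is plain WAL streaming
// rather than a Mongo replica set.
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://app:secret@ferretdb.data.svc:27017/appdb")) // illustrative URI
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	if err := client.Ping(ctx, nil); err != nil {
		log.Fatal(err)
	}
	log.Println("connected via MongoDB wire protocol")
}
```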

2. OpenOva Positioning & Value Proposition

Core Identity

OpenOva.io is NOT another Kubernetes platform or IDP. It is:

  • Enterprise-grade support provider for open-source K8s ecosystems
  • Transformation journey partner for organizations adopting cloud-native
  • Converged blueprint ecosystem with operational guarantees

Value Proposition

"We provide enterprise-grade, end-to-end support for curated open-source ecosystems on Kubernetes. We don't just deploy technologies - we optimize, harden, upgrade, and stand behind them."

Differentiator

  • Operational excellence (Day-2 safety, upgrades, SLAs) - not tooling
  • Confidence as a service - we own the pager, not the customer
  • Productized blueprints - intellectual property is in the converged, optimized configurations

Target Market

  • Banks, telcos, petroleum (regulated industries)
  • Organizations scared of OSS complexity but wanting to avoid vendor lock-in
  • Teams burned by past platform attempts

3. Architecture Model

Blueprint vs Instance Model

  • Public blueprints (openova-io): Templates with <tenant> placeholders - the "class"
  • Private instances (acme-private): Generated repos with choices made - the "instance"
  • Bootstrap wizard: Generates instance repos from blueprints

Three-Layer Architecture

+-------------------------------------------------------+
| OPENOVA BOOTSTRAP WIZARD (Managed UI)                 |
| - Hosted on OpenOva's infrastructure                  |
| - Collects credentials, runs OpenTofu                 |
| - Export option for self-hosted bootstrap             |
| - Permanent sessions with SSO (Google/Azure)          |
| - Exits the picture after bootstrap completes         |
+-------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------+
| CUSTOMER'S ENVIRONMENT (Post-Bootstrap)               |
| - Catalyst IDP (entry door for lifecycle)             |
| - Flux (GitOps delivery)                              |
| - Gitea (internal Git with bidirectional mirror)      |
| - Crossplane (selective - lifecycle abstraction)      |
| - Operators (CNPG, etc.)                              |
+-------------------------------------------------------+
                            |
                            v
+-------------------------------------------------------+
| OPENOVA BLUEPRINTS (Our IP - stays in picture)        |
| - Certified configurations                            |
| - Upgrade-safe versions                               |
| - Best practices (PDBs, VPAs, policies)               |
| - Published via Git, consumed by customer's Flux      |
+-------------------------------------------------------+

Key Architectural Decisions

  1. Bootstrap wizard is SEPARATE - independent repo/application, hosted on OpenOva
  2. Bootstrap wizard EXITS after provisioning - must be safe to delete after day 1
  3. First cluster inherits bootstrapping capability via Crossplane/CAPI for expansion
  4. Catalyst IDP becomes the entry point for customer's lifecycle management
  5. OpenOva stays in the picture via blueprints - not runtime components
  6. OpenTofu is the unified bootstrap mechanism (SaaS or Self-Hosted)

4. Bootstrap Modes

Mode 1: Managed Bootstrap ("OpenOva Cloud Bootstrap")

  • Customer uses OpenOva wizard (hosted UI)
  • OpenOva's OpenTofu provisions customer's cloud infrastructure
  • After bootstrap, customer's Crossplane takes over
  • Customer provides cloud credentials to OpenOva (temporarily)
  • Redirect to Catalyst IDP after completion

Mode 2: Self-Hosted Bootstrap ("OpenOva Bring-Your-Own Bootstrap")

  • Customer exports OpenTofu manifests from wizard
  • Customer runs OpenTofu locally with their own credentials
  • Credentials never leave customer environment
  • Same end result: Catalyst IDP + platform stack ready

Unified Approach

Both modes use the same OpenTofu manifests - the only difference is WHERE `tofu apply` runs.

Bootstrap Sequence

OpenTofu → K8s Cluster → Flux (bootstrap)
         → Gitea (internal Git)
         → Crossplane + Operators
         → Catalyst IDP + Grafana Stack
         → Platform ready

5. Git Repository: Gitea

Fixed decision: Gitea is the sole Git provider.

Gitea Architecture

  • Deployed in each region (active-active for reads)
  • Bidirectional mirroring between instances
  • CNPG for PostgreSQL metadata (async primary-replica)
  • Each Gitea connects to LOCAL CNPG only
  • Gitea Actions for CI/CD pipelines
  • CODEOWNERS for security approval workflows

Why Gitea (not GitHub/GitLab)

| Reason | Benefit |
| --- | --- |
| Self-hosted | Full control, no external dependency |
| Lightweight | Lower resource footprint than GitLab |
| GitOps-focused | Designed for Flux integration |
| Bidirectional mirror | Active-active reads across regions |
| Gitea Actions | GitHub Actions-compatible CI/CD |

6. IDP vs Crossplane Decision

With Catalyst IDP in place:

  • IDP handles: Catalog UX, form generation, YAML templating, PR creation
  • Crossplane needed only when:
    • Multi-backend portability expected (CNPG today → managed DB tomorrow)
    • Complex compositions (one request → many resources)
    • Non-K8s resources in same catalog
    • Lifecycle coupling required

Encapsulation Strategy: LIGHT

  • Thin claims (5-10 fields max): tier, ha, backup, deletionProtection, networkProfile
  • Everything else stays internal (operator defaults)
  • Two-lane model: Standard (90%) + Advanced escape hatch (10%)
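
Expressed as a Go type, the thin claim fits in one glance. Field and type names here are hypothetical, not the shipped CRD schema:

```go
// Illustrative shape of a "thin claim": the entire user-facing
// contract is a handful of fields; everything else is an operator
// default behind the blueprint.
package v1alpha1

type PostgresClaimSpec struct {
	Tier               string `json:"tier"`               // e.g. tier-1, tier-2
	HA                 bool   `json:"ha"`                 // multi-replica via CNPG
	Backup             bool   `json:"backup"`             // scheduled backups to archival S3
	DeletionProtection bool   `json:"deletionProtection"` // block accidental teardown
	NetworkProfile     string `json:"networkProfile"`     // which network policy bundle applies

	// Advanced escape hatch (the 10% lane): raw overrides, reviewed by PR.
	Overrides map[string]string `json:"overrides,omitempty"`
}
```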

7. End-User Journeys

Journey 1: Initial Bootstrap (Infra SPOC)

OpenOva Wizard UI → Select cloud/options → Generate OpenTofu
                  → Run OpenTofu (managed or self-hosted)
                  → Cluster + Platform ready
                  → Redirect to Catalyst IDP URL

Journey 2: Day-2 Operations (App Teams via Catalyst IDP)

Catalyst IDP → Select blueprint (e.g., "Tier-1 Postgres")
         → Fill minimal form (tier, ha, backup)
         → PR generated to Gitea
         → Flux applies → Operator reconciles
         → Resource ready, secret injected

Journey 3: Platform Extension (Infra SPOC via Catalyst IDP)

Catalyst IDP → Platform Admin section
         → "Add Cluster" or "Enable Capability Pack"
         → PR generated to Gitea
         → Flux + Crossplane/CAPI provision

Journey 4: Blueprint Updates (OpenOva → Customer)

OpenOva publishes new blueprint version
         → Customer's Catalyst IDP shows notification
         → Customer reviews changelog
         → Customer clicks "Upgrade" (generates PR to Gitea)
         → Flux applies

8. Support Model

Fully Supported

  • Entire mandatory stack
  • Selected a la carte components
  • Blueprint configurations only

Best Effort

  • Customer customizations beyond blueprints
  • Edge cases not in support matrix

Unsupported

  • Versions outside support matrix
  • Non-blueprint configurations
  • DIY operator installations

9. Decided Questions (2026-01-17)

| Question | Decision |
| --- | --- |
| Service mesh | Cilium Service Mesh (NOT Istio) |
| Git provider | Gitea only (GitHub/GitLab removed) |
| Cloud provider | Hetzner first, then Huawei/OCI. Contabo dropped. |
| Multi-region | Recommended 2 regions but 1 region allowed (independent clusters) |
| LoadBalancer | Cloud LB (default), k8gb DNS-based (free), Cilium L2 (single subnet) |
| DNS architecture | k8gb as authoritative DNS server for GSLB zone |
| Split-brain protection | Cloudflare Workers + KV (lease-based witness) |
| Failover orchestration | Failover Controller (controls external, internal, stateful) |
| DDoS protection | Cloud provider native (no Cloudflare proxy) |
| Secrets backend | Self-hosted OpenBao per cluster + ESO PushSecrets |
| SOPS | Eliminated completely |
| Harbor | Mandatory from day 1 |
| VPA | Mandatory |
| Crossplane | Mandatory for post-bootstrap cloud ops |
| MongoDB replication | CDC via Debezium + Strimzi (Kafka) |
| Redis-compatible cache | Valkey (BSD-3, Linux Foundation) |
| MinIO | Fast S3 with tiering (NOT backup target) |
| Archival S3 | R2/S3/GCS/Blob for backup + tiering |
| GitOps | Flux (ArgoCD as future option) |
| CI/CD | Gitea Actions |
| Observability | OTel auto-instrumentation (independent of mesh) + Grafana Stack |

10. Open Decisions / Questions

  1. Exact naming for bootstrap modes - "Managed" vs "Self-Hosted"?
  2. First flagship blueprint - PostgreSQL or Service Mesh?
  3. Wizard tech stack - what to build it with?
  4. Failover Controller implementation - research existing OSS or build new?
  5. Conflict resolution strategy - for eventual consistency scenarios

15. RESOLVED - k8gb and Failover Architecture (2026-01-18)

15.1 k8gb Architecture Deep Dive

Status: RESOLVED

Key Finding from Source Code Analysis:

k8gb clusters operate independently with DNS-based discovery only:

| Aspect | k8gb Behavior |
| --- | --- |
| Local health check | Direct service health check (Ingress/Gateway endpoints) |
| Cross-cluster "health" | DNS query to localtargets-* record |
| Communication | DNS only - no direct health checks between clusters |

Critical Limitation: k8gb cannot distinguish between:

  • "Region is down" (failover needed)
  • "Network partition" (failover NOT wanted)

Both produce the same symptom: DNS query fails or times out.

Cluster B queries: localtargets-app.example.com from Cluster A
├── Gets IPs → "Cluster A is healthy"
└── No IPs / timeout → "Cluster A is unavailable" (but WHY?)
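
The standby's entire body of evidence is that one lookup, which is why the cause cannot be inferred. A Go sketch (hostname illustrative):

```go
// What "cross-cluster health" amounts to for k8gb: a plain DNS lookup.
// An empty answer and a timeout are indistinguishable from "region down".
package main

import (
	"fmt"
	"net"
)

func main() {
	ips, err := net.LookupHost("localtargets-app.example.com")
	if err != nil || len(ips) == 0 {
		fmt.Println("cluster A unavailable - but down or partitioned?")
		return
	}
	fmt.Println("cluster A healthy:", ips)
}
```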

Scenarios Analyzed:

| Scenario | k8gb Behavior | Problem? |
| --- | --- | --- |
| Region truly down | Removes region from DNS | Correct |
| Network partition | Also removes region from DNS | Incorrect failover |
| Both healthy | Returns both regions | Correct |

Conclusion: k8gb is suitable for stateless services where brief dual-routing during partition is acceptable. For stateful services and strict active-passive, a Failover Controller with cloud witness is required.

15.2 Failover Controller Design

Status: RESOLVED

Architecture Decision: Cloudflare Workers + KV as cloud witness

| Component | Role |
| --- | --- |
| Cloudflare Worker | Lease management API |
| Cloudflare KV | Lease storage with TTL |
| Failover Controller | Per-cluster controller that manages readiness |

Three Layers Controlled:

  1. External (Gateway API → k8gb): HTTPRoute readiness
  2. Internal (Cilium Cluster Mesh): Service endpoint manipulation
  3. Stateful (CNPG, MongoDB): Database promotion signaling

Witness Pattern:

  • Active region holds lease (renews every 10s, TTL 30s)
  • Standby region queries lease status
  • If lease expires → standby acquires lease → becomes active
  • Network partition: both regions reach witness → active keeps renewing → no split-brain

Documentation: See failover-controller/docs/ADR-FAILOVER-CONTROLLER.md

15.3 k8gb Scope Clarification

Status: RESOLVED

k8gb is for EXTERNAL services only:

  • Routes traffic via DNS based on endpoint availability
  • Does NOT coordinate internal services
  • Does NOT handle database failover

Internal services use Cilium Cluster Mesh:

  • Cross-region service discovery
  • Failover Controller manipulates endpoints

ExternalDNS Role:

  • Creates NS records delegating GSLB zone to k8gb
  • Manages non-GSLB records in parent zone
  • One-time setup for delegation, ongoing for other records

15.4 Gateway API Clarification

Status: RESOLVED

  • Entry point: Kubernetes Gateway API backed by Cilium/Envoy
  • Traefik (K3s default): Disabled in OpenOva deployments
  • Kong: Not included (Cilium Gateway sufficient for routing)
  • API Management: Future consideration if needed

15.5 Redis-Compatible Caching

Status: RESOLVED

  • Valkey selected (Linux Foundation, BSD-3)
  • Dragonfly dropped (BSL license)
  • Redis OSS dropped (license concerns)

15.6 Harbor S3 Backend

Status: RESOLVED

  • MinIO as S3 backend documented
  • Tiered archiving to external S3 documented

15.7 SRE Repo

Status: FUTURE DISCUSSION

  • VPA policies
  • Topology spread
  • PVC resizing
  • KEDA configurations

11. New A La Carte Components (2026-01-18)

Identity

| Component | Purpose | Use Cases |
| --- | --- | --- |
| Keycloak | OIDC/OAuth/FAPI Authorization Server | Any app needing auth, SSO, FAPI compliance |

Monetization

| Component | Purpose | Use Cases |
| --- | --- | --- |
| OpenMeter | Usage metering | API monetization, usage tracking |

AI Safety & Observability

| Component | Purpose | Use Cases |
| --- | --- | --- |
| NeMo Guardrails | AI safety firewall | Prompt injection detection, PII filtering, topic control |
| LangFuse | LLM observability | LLM call tracing, cost tracking, evaluation |

Chaos Engineering

| Component | Purpose | Use Cases |
| --- | --- | --- |
| Litmus Chaos | Chaos engineering | Resilience testing, compliance proof |

These are standalone a la carte components that can be used independently or bundled into products.


12. Open Banking Meta Blueprint (2026-01-18)

Overview

Meta blueprint that bundles a la carte components with custom services for PSD2/FAPI fintech sandboxes.

Architecture Concept

Meta Blueprint = A La Carte Components + Custom Services

Open Banking Product (Fingate)
├── Keycloak (a la carte)      ─► FAPI Authorization
├── OpenMeter (a la carte)     ─► Usage metering
└── Custom Services            ─► Open Banking specific
    ├── ext-authz
    ├── accounts-api
    ├── payments-api
    ├── consents-api
    ├── tpp-management
    └── sandbox-data

Key Architectural Decision

Envoy at the heart - NOT Kong/Tyk. Leverages existing Cilium/Envoy investment with specialized services.

Architecture Flow

TPP Request (eIDAS cert)
    |
    v
Cilium Ingress (Envoy)
    |
    +--> ext_authz Service
    |       |
    |       +--> Validate eIDAS cert
    |       +--> Check TPP registry
    |       +--> Verify consent
    |       +--> Check/decrement quota (Valkey)
    |
    v
Backend Services (Accounts/Payments/Consents)
    |
    v
Access Logs --> Kafka --> OpenMeter --> Lago
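
At the center of this flow, Envoy calls an ext_authz gRPC Check for every TPP request and the service answers allow or deny. The sketch below uses Envoy's real auth v3 proto types; the three validation helpers are hypothetical stand-ins for the eIDAS, consent, and quota lookups shown above.

```go
// Minimal Envoy ext_authz server: one Check per request, deny unless
// all three gates pass.
package main

import (
	"context"
	"log"
	"net"

	authv3 "github.com/envoyproxy/go-control-plane/envoy/service/auth/v3"
	"google.golang.org/genproto/googleapis/rpc/status"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

type authz struct {
	authv3.UnimplementedAuthorizationServer
}

func (a *authz) Check(ctx context.Context, req *authv3.CheckRequest) (*authv3.CheckResponse, error) {
	h := req.GetAttributes().GetRequest().GetHttp().GetHeaders()
	// Hypothetical helpers standing in for the TPP registry, consent
	// store, and Valkey quota lookups.
	allowed := validateEIDAS(h["x-forwarded-client-cert"]) &&
		consentValid(h["x-consent-id"]) &&
		quotaOK(h["x-tpp-id"])
	code := codes.OK
	if !allowed {
		code = codes.PermissionDenied
	}
	return &authv3.CheckResponse{Status: &status.Status{Code: int32(code)}}, nil
}

func validateEIDAS(cert string) bool { return cert != "" } // placeholder
func consentValid(id string) bool    { return id != "" }   // placeholder
func quotaOK(tpp string) bool        { return tpp != "" }  // placeholder

func main() {
	lis, err := net.Listen("tcp", ":9191")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	authv3.RegisterAuthorizationServer(s, &authz{})
	log.Fatal(s.Serve(lis))
}
```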

Monetization Models

| Model | Flow |
| --- | --- |
| Prepaid | Buy credits → Valkey balance → Atomic decrement → Block at zero |
| Post-paid | Use APIs → Meter usage → Invoice at period end |
| Subscription + Overage | Monthly base + per-call overage |
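
Of the three models, only prepaid needs hard atomicity on the request path. A Lua script run in Valkey makes check-and-decrement a single step, so concurrent calls cannot double-spend. Key naming is illustrative; the go-redis client is used since Valkey is protocol-compatible.

```go
// Prepaid quota gate: atomic check-and-decrement in Valkey.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// KEYS[1] = credit balance key, ARGV[1] = cost of this call.
// Returns the remaining balance, or -1 to signal "block at zero".
var spend = redis.NewScript(`
local bal = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
if bal < cost then return -1 end
return redis.call('DECRBY', KEYS[1], cost)
`)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "valkey.data.svc:6379"}) // illustrative address

	left, err := spend.Run(ctx, rdb, []string{"credits:tpp-42"}, 1).Int64()
	if err != nil {
		panic(err)
	}
	if left < 0 {
		fmt.Println("HTTP 429: prepaid balance exhausted")
		return
	}
	fmt.Println("call allowed, credits left:", left)
}
```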

Why Not Kong/Tyk

  • Already have Cilium/Envoy for service mesh
  • Open Banking logic doesn't fit plugin architecture
  • Unified observability with existing Grafana stack
  • Custom services give full control over PSD2 compliance

Open Banking Standards

| Standard | Status |
| --- | --- |
| UK Open Banking 3.1 | Primary |
| Berlin Group NextGenPSD2 | Planned |
| STET (France) | Planned |

Documentation

  • ADR: handbook/docs/adrs/ADR-OPEN-BANKING-BLUEPRINT.md
  • Spec: handbook/docs/specs/SPEC-OPEN-BANKING-ARCHITECTURE.md
  • Blueprint: handbook/docs/blueprints/BLUEPRINT-OPEN-BANKING.md

13. Repository Structure

openova-io/                    # Public blueprints org
├── bootstrap/                 # Bootstrap wizard
├── terraform/                 # IaC modules
├── flux/                      # GitOps configs
├── handbook/                  # Documentation
├── <component>/               # Individual component blueprints
│   ├── cilium/                # CNI + Service Mesh
│   ├── gitea/                 # Git server
│   ├── failover-controller/   # Failover orchestration
│   ├── grafana/
│   ├── harbor/
│   ├── openbao/               # Secrets backend (MPL 2.0)
│   ├── k8gb/
│   ├── external-dns/
│   ├── keycloak/              # FAPI AuthZ (Open Banking)
│   ├── openmeter/             # Usage metering (Open Banking)
│   ├── lago/                  # Billing (Open Banking)
│   ├── open-banking/          # Open Banking services
│   └── ...

acme-private/                  # Example private instance
├── terraform/                 # Configured for acme
├── flux/                      # Configured for acme
├── <component>/               # Configured for acme

14. Key Quotes & Principles

"Crossplane doesn't kill OpenTofu. It kills OpenTofu-as-a-control-plane."

"The catalog is a contract, not a UI."

"You are selling confidence, not Kubernetes. Insurance, not innovation."

"If the bootstrap platform stays in the picture after day 1, it's doing too much."

"IDP is the front desk. Your thin layer is the contract, the rules, and the insurance behind the desk."

"Wrap CNPG/Strimzi only if you are intentionally offering 'databases' and 'streams' as platform products."

"Public blueprints are the class, private instances are the objects."

"OTel is completely independent of service mesh - that's why Cilium is a no-brainer."


15. Technical ADRs Referenced

  • ADR-MULTI-REGION-STRATEGY: Independent clusters, recommended not enforced
  • ADR-PLATFORM-ENGINEERING-TOOLS: Crossplane, Catalyst IDP, Flux (mandatory)
  • ADR-IMAGE-REGISTRY: Harbor mandatory
  • ADR-SECURITY-SCANNING: Trivy CI/CD + Harbor + Runtime
  • ADR-CILIUM-SERVICE-MESH: Cilium replaces Istio
  • ADR-GITEA: Gitea as sole Git provider
  • ADR-FAILOVER-CONTROLLER: Generic failover orchestration
  • ADR-K8GB-GSLB: k8gb as authoritative DNS
  • ADR-AIRGAP-COMPLIANCE: Air-gap capable architecture

16. Competitive Landscape

Not Competing With

  • Red Hat OpenShift (distro)
  • Cloud providers (AWS/GCP/Azure)
  • Pure tooling vendors

Competing For

  • Regulated enterprises wanting OSS with support
  • Organizations burned by OpenShift cost/complexity
  • Teams needing "someone to call at 3am"

Adjacent Players

  • Upbound (Crossplane ecosystem)
  • Humanitec (Platform orchestrator)
  • Loft/vCluster (Multi-tenancy)


17. Monorepo Consolidation (2026-02-08)

Decision

Consolidated 45+ separate GitHub repos into a single monorepo: openova-io/openova

Structure

openova/
├── core/                    # Bootstrap + Lifecycle Manager (Go application)
├── platform/                # All 52 component blueprints (FLAT structure)
├── products/                # Bundled vertical solutions
│   ├── cortex/              # OpenOva Cortex - Enterprise AI Hub
│   ├── fingate/             # OpenOva Fingate - Open Banking (+ 6 services)
│   ├── fabric/              # OpenOva Fabric - Data & Integration
│   ├── relay/               # OpenOva Relay - Communication
│   └── axon/                # OpenOva Axon - SaaS LLM Gateway
└── docs/                    # Platform documentation

Key Decisions

| Decision | Rationale |
| --- | --- |
| Flat platform/ structure | No hierarchical subfolders (networking/, security/, etc.) |
| Documentation shows groupings | README displays logical categories while folders stay flat |
| Products bundle platform components | Reference components from platform/, no duplication |
| Core is a single Go app | Bootstrap + Lifecycle Manager with a mode switch |
| 52 platform components | All flat under platform/ |
| 5 products | cortex, fingate (+ 6 services), fabric, relay, axon |

Component Count (58 total)

  • Platform components: 52 (flat under platform/)
  • Fingate custom services: 6 (accounts-api, consents-api, ext-authz, payments-api, sandbox-data, tpp-management)

Documentation Groupings (in READMEs)

Mandatory (Core Platform):

| Category | Components |
| --- | --- |
| Infrastructure | opentofu, crossplane |
| GitOps | flux, gitea |
| Networking | cilium, external-dns, k8gb |
| Security | cert-manager, external-secrets, openbao, trivy, falco, sigstore, syft-grype, coraza |
| Policy | kyverno |
| Observability | grafana, opensearch |
| Scaling | vpa, keda |
| Operations | reloader |
| Storage | minio, velero |
| Registry | harbor |
| Failover | failover-controller |

A La Carte (Optional):

| Category | Components |
| --- | --- |
| Data | cnpg, ferretdb, valkey, strimzi, clickhouse |
| CDC | debezium |
| Workflow | temporal, flink |
| Analytics | iceberg |
| Identity | keycloak |
| Monetization | openmeter |
| Communication | stalwart, stunner, livekit, matrix |
| AI/ML | knative, kserve, vllm, milvus, neo4j, librechat, bge, llm-gateway, anthropic-adapter |
| AI Safety | nemo-guardrails, langfuse |
| Chaos | litmus |

Sync to Customer Gitea

GitHub (monorepo)                    Customer Gitea (multi-repo)
─────────────────                    ──────────────────────────
openova/core/              ──sync──> openova-core/
openova/platform/cilium/   ──sync──> openova-cilium/
openova/platform/flux/     ──sync──> openova-flux/

18. Core Application Architecture (2026-02-08)

Two Deployment Modes

| Mode | Location | Purpose | IaC Tool |
| --- | --- | --- | --- |
| Bootstrap | Outside cluster | Initial provisioning | OpenTofu |
| Lifecycle Manager | Inside cluster | Day-2 operations | Crossplane |
Zero External Dependencies

| Mode | State Storage | Rationale |
| --- | --- | --- |
| Bootstrap | SQLite (embedded) | No CNPG needed for ephemeral wizard |
| Manager | Kubernetes CRDs | Native K8s, no external DB |

Bootstrap Exits After Provisioning

The bootstrap wizard is designed to be safely deletable after initial provisioning. Crossplane owns all cloud resources going forward.

No Overlap with Catalyst IDP

| Concern | Catalyst IDP | Lifecycle Manager |
| --- | --- | --- |
| Audience | Developers (internal) | Platform operators |
| Focus | Service catalog, scaffolding | Platform health, upgrades |
| Scope | Application-level | Infrastructure-level |
| UI | Rich portal | Minimal admin dashboard |

19. OpenOva Cortex - AI Hub (2026-02-08)

Overview

Enterprise AI platform with LLM serving, RAG, and intelligent agents.

Components (11 from platform/)

knative, kserve, vllm, milvus, neo4j, librechat, bge, llm-gateway, anthropic-adapter, nemo-guardrails, langfuse

Architecture

User Interfaces (LibreChat, Claude Code)
         ↓
AI Safety (NeMo Guardrails)
         ↓
Gateway Layer (LLM Gateway, Anthropic Adapter)
         ↓
Model Serving (KServe, vLLM)
         ↓
Knowledge Layer (Milvus vectors, Neo4j graph)
         ↓
Embeddings (BGE-M3, BGE-Reranker)

LangFuse (traces all LLM calls)

Resource Requirements

| Component | CPU | Memory | GPU |
| --- | --- | --- | --- |
| vLLM | 4 | 32Gi | 2x A10 |
| BGE-M3 | 2 | 4Gi | 1x A10 |
| BGE-Reranker | 1 | 2Gi | 1x A10 |
| Milvus (3 replicas) | 2 | 8Gi | - |
| Total | ~15 | ~55Gi | 4x A10 |

20. Deleted Repositories (2026-02-08)

All old separate repos deleted after monorepo consolidation:

openova-anthropic-adapter, openova-backstage, openova-bge, openova-cert-manager, openova-cilium, openova-cnpg, openova-crossplane, openova-external-dns, openova-external-secrets, openova-failover-controller, openova-flux, openova-gitea, openova-grafana, openova-harbor, openova-k8gb, openova-keda, openova-keycloak, openova-knative, openova-kserve, openova-kyverno, openova-lago, openova-langserve, openova-librechat, openova-llm-gateway, openova-milvus, openova-minio, openova-mongodb, openova-n8n, openova-neo4j, openova-openmeter, openova-redpanda, openova-searxng, openova-stalwart, openova-stunner, openova-terraform, openova-trivy, openova-valkey, openova-vault, openova-velero, openova-vllm, openova-vpa, openova-open-banking, openova-core, openova-handbook


21. AI-Age Component Rationalization (2026-02-12) — EXECUTED 2026-02-26

Context

Evaluated all platform components through the lens of "in the age of AI/vibe coding (95% AI-written code), is this technology still essential or is it pre-AI era tech made redundant?"

Key Insight

AI doesn't just change HOW code is written — it changes WHAT infrastructure you need:

  • Infrastructure primitives (networking, security, storage, databases) → Still essential. AI can't replace packets, bytes, or certificates.
  • Complexity absorber frameworks (integration, workflow, application runtime) → Declining. These existed because writing integration/orchestration code was hard. AI writes that code now.
  • Developer UI tools (portals, dashboards, search) → Declining. AI assistants replace catalog browsing and dashboard building.
  • AI-native infrastructure (inference, vectors, embeddings) → MORE needed.

Tier 1: Essential (85-95) — Keep

cert-manager (95), cilium (95), external-secrets (95), vllm (95), openbao (93), flux (92), minio (92), velero (92), harbor (90), falco (90), trivy (90), cnpg (90), external-dns (90), grafana (88), kyverno (88), kserve (88), milvus (88), llm-gateway (87), anthropic-adapter (85), valkey (85), keycloak (85)

Tier 2: Needed (75-84) — Keep

gitea (83), opentofu (82), bge (82), failover-controller (82), k8gb (80), keda (80), vpa (78), crossplane (78), knative (75), librechat (75)

Tier 3: Moderate (60-74) — Keep but Monitor

langserve (73), strimzi (72), mongodb (72), debezium (70), stalwart (70), stunner (68), neo4j (65), flink (60)

Tier 4: Questionable (50-59) — Review Necessity

lago (58), openmeter (55), clickhouse (55), opensearch (50), iceberg (50)

Tier 5: Declining/Redundant (12-45) — Candidates for Removal

backstage (45), superset (40), searxng (40), trino (38), temporal (35), airflow (33), dapr (30), rabbitmq (25), camel (20), vitess (15), activemq (12)

Recommendation

Drop ~15 components (Tier 5) → reduce from 55 to ~40 components. Replace with AI-generated custom code that's simpler to operate.

Temporal Decision

Status: DECIDED — Make Temporal optional, not default for Fuse.

Start without Temporal. Use Kafka for event-driven patterns and Dapr Workflow (if kept) for simpler orchestration. Only add Temporal when a customer needs complex sagas with compensation across 4+ services, long-running workflows (days/weeks), or workflow-level visibility at scale.

Impact on Products

| Product | Impact |
| --- | --- |
| Fuse | Most affected. Camel, Dapr, Temporal, RabbitMQ, ActiveMQ all in the decline tier. Rethink as Kafka + AI-generated integrations. |
| Titan | Airflow, Superset, Trino, Iceberg all questionable/declining. Rethink as Flink + AI-generated pipelines + direct DB queries. |
| Cortex | Mostly AI-native components. Healthy. SearXNG is the only declining item. |
| Fingate | Mostly essential components (Keycloak). Lago/OpenMeter questionable but serve specific billing needs. |

22. Missing Components — AI-Age Gaps (2026-02-12)

Identified Gaps

Evaluated what's MISSING from the stack that would score high in AI-age relevance.

90+ (Essential — Add Now)

| Component | Score | Why |
| --- | --- | --- |
| Sigstore/Cosign | 92 | Container image signing. AI writes Dockerfiles; provenance must be verified. Bank regulatory requirement. |
| Syft + Grype | 90 | SBOM generation. AI pulls random dependencies. EU CRA + bank regulators demand it. |
| NeMo Guardrails | 90 | AI safety firewall. Prompt injection, PII filtering, hallucination detection for LLM outputs. Non-negotiable for banks. |
| LangFuse | 90 | LLM observability. Traces every LLM call: cost, latency, tokens, eval scores. Grafana doesn't cover this. |
| OpenCost | 90 | FinOps for K8s. AI/GPU workloads are expensive. Cost visibility per namespace/team is essential. |

80+ (Strongly Needed)

| Component | Score | Why |
| --- | --- | --- |
| Flagger | 82 | Progressive delivery (canary, blue-green). AI-generated code needs gradual rollouts. Integrates with Flux. |
| Ray | 80 | Distributed AI compute. Training, batch processing, distributed fine-tuning. |
| MLflow | 80 | AI model registry + experiment tracking. Audit trail for model lifecycle. |
| Promptfoo | 80 | LLM evaluation/testing. Unit tests for prompts. CI/CD for AI. |
| Reloader | 80 | Auto-restart pods on ConfigMap/Secret changes. Tiny operator, huge operational value. |

70+ (Valuable)

| Component | Score | Why |
| --- | --- | --- |
| Litmus Chaos | 72 | Chaos engineering. Banks need proof of resilience. |
| Headscale | 72 | Self-hosted WireGuard mesh. Zero-trust networking between clusters. |
| Goldilocks | 70 | VPA recommendation dashboard. Right-sizing visibility. Feeds into FinOps. |
| Dagger | 70 | CI/CD pipelines as code. AI generates pipelines better in Go/Python than YAML. |
| Robusta | 70 | K8s troubleshooting automation. AI-powered alert enrichment. |

60+ (Nice to Have)

| Component | Score | Why |
| --- | --- | --- |
| Testkube | 65 | K8s-native test orchestration. Quality gates for AI-generated code. |
| Descheduler | 62 | Pod rebalancing after scaling events. |
| Kubeshark | 60 | API traffic viewer. Debug AI-generated microservice interactions. |
| Label Studio | 60 | Data labeling for ML. Human-in-the-loop for Cortex. |

50+ (Situational)

| Component | Score | Why |
| --- | --- | --- |
| Karpenter | 55 | Node autoscaling. Cloud provider support dependent (Hetzner unclear). |
| Argo Events | 52 | Event-driven automation. KEDA + custom code may suffice. |
| Kured | 50 | Node reboot daemon after kernel updates. |

Biggest Gap Identified

AI operational tooling is completely missing. The stack has AI inference (vLLM, KServe, Milvus) but zero AI safety, AI observability, AI testing, or AI model governance. That's like having databases without monitoring.


23. Product Family Update (Locked 2026-02-26)

Final Product Family (9 products)

| Product | Name | Description |
| --- | --- | --- |
| Core | OpenOva | 52-component turnkey K8s ecosystem |
| Bootstrap + Lifecycle + IDP | OpenOva Catalyst | Bootstrap wizard, Day-2 manager, IDP, Workflow Explorer |
| AI Hub | OpenOva Cortex | LLM serving, RAG, AI safety, LLM observability |
| SaaS LLM Gateway | OpenOva Axon | Hosted inference gateway (renamed from Synapse) |
| Open Banking | OpenOva Fingate | PSD2/FAPI fintech sandbox |
| AIOps SOC/NOC | OpenOva Specter | AI-powered SOAR, self-healing |
| Data & Integration | OpenOva Fabric | Event-driven integration + data lakehouse (merged Titan + Fuse) |
| Communication | OpenOva Relay | Email, video, chat, WebRTC (NEW) |
| Migration | OpenOva Exodus | Migration from proprietary to OSS |

Components Changed (2026-02-26)

Removed (13): backstage, mongodb, activemq, vitess, airflow, camel, dapr, superset, searxng, langserve, trino, lago, rabbitmq

Added (10): sigstore, syft-grype, nemo-guardrails, langfuse, reloader, matrix, ferretdb, litmus, livekit, coraza

Net count: 55 - 13 + 10 = 52

Key Renames

  • Synapse → Axon (SaaS LLM Gateway)
  • Titan + Fuse → Fabric (Data & Integration merged)
  • Backstage → Catalyst IDP (integrated into Catalyst product)
  • MongoDB → FerretDB (MongoDB wire protocol on PostgreSQL)

24. SIEM/SOAR Architecture (2026-02-26)

Falco (eBPF) → Falcosidekick → Kafka → OpenSearch (hot SIEM)
                                      → ClickHouse (cold storage)
                                      → Specter (SOAR)

Pipeline

  1. Falco detects runtime threats via eBPF
  2. Trivy provides vulnerability scan results
  3. Kyverno reports policy violations
  4. Events flow through Kafka to OpenSearch for hot SIEM analytics
  5. Aged data moves to ClickHouse for cold storage and compliance
  6. Specter provides SOAR: automated correlation, enrichment, and remediation
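
A sketch of the ingest step (4) above: consume Falco events from Kafka and hand them to the hot tier. Topic and group names are assumptions, and indexToOpenSearch is a hypothetical helper wrapping the OpenSearch bulk API.

```go
// Fan-out consumer: Kafka → hot SIEM. The real pipeline uses
// Falcosidekick's Kafka output upstream of this point.
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka.data.svc:9092"}, // illustrative address
		GroupID: "siem-ingest",                   // assumed consumer group
		Topic:   "falco-events",                  // assumed topic
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Aged indices are later tiered to ClickHouse for cold storage.
		if err := indexToOpenSearch(msg.Value); err != nil {
			log.Printf("index failed, relying on Kafka retention: %v", err)
		}
	}
}

func indexToOpenSearch(doc []byte) error { return nil } // placeholder
```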

25. Communication Architecture - Relay (2026-02-26)

Keycloak (SSO) → Stalwart (Email)
               → LiveKit (Video/Audio) ← STUNner (TURN/STUN)
               → Matrix/Synapse (Chat)

Components

| Component | Purpose |
| --- | --- |
| Stalwart | Email (JMAP/IMAP/SMTP) |
| LiveKit | Video/audio (WebRTC SFU) |
| STUNner | K8s-native TURN/STUN for NAT traversal |
| Matrix/Synapse | Team chat with federation |

This document serves as persistent context for Claude Code sessions. Update as decisions are made.