docs: add project memory for Claude Code sessions
Persistent context document covering:
- Final building blocks table
- Critical architecture decisions (Cilium, Gitea, k8gb, etc.)
- OpenOva positioning and value proposition
- Bootstrap modes and architecture model
- Monorepo consolidation (2026-02-08)
- Core application architecture
- AI Hub meta-platform
- All key decisions and principles

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent 0e6c347771
commit 92c97fc6f0
926
.claude/project-memory.md
Normal file
@@ -0,0 +1,926 @@
# OpenOva Project Memory

> Last Updated: 2026-02-08
> Purpose: Persistent context for Claude Code sessions about OpenOva platform strategy and architecture

---

## 0. Final Building Blocks Table (2026-01-17)

### Mandatory Components (Always Installed)

| Category | Component | Purpose |
|----------|-----------|---------|
| IaC | Terraform | Bootstrap provisioning |
| IaC | Crossplane | Day-2 cloud resources |
| CNI | Cilium | eBPF networking + Hubble |
| Mesh | Cilium Service Mesh | mTLS, L7 policies (replaces Istio) |
| WAF | Coraza | OWASP CRS with Envoy Gateway |
| GitOps | Flux | GitOps delivery (ArgoCD future option) |
| Git | Gitea | Internal Git server (bidirectional mirror) |
| TLS | cert-manager | Certificate automation |
| Secrets | External Secrets (ESO) | Secrets operator |
| Secrets | Vault | Backend (self-hosted or SaaS) |
| Policy | Kyverno | Auto-generate PDBs, NetworkPolicies |
| Scaling | VPA | Vertical Pod Autoscaler |
| Scaling | KEDA | Event-driven + scale-to-zero |
| Observability | Grafana Stack | Alloy + Loki + Mimir + Tempo + Grafana |
| Observability | OpenTelemetry | Auto-instrumentation (independent of mesh) |
| Registry | Harbor | Container registry + scanning |
| Storage | MinIO | Fast S3 (tiered to archival) |
| Backup | Velero | Backup to archival S3 |
| DNS | ExternalDNS | Sync to DNS provider |
| GSLB | k8gb | Authoritative DNS + cross-region GSLB |
| Failover | Failover Controller | Generic failover orchestration |
| IDP | Backstage | Developer portal |

### User Choice Options

| Category | Options | Notes |
|----------|---------|-------|
| Cloud Provider | Hetzner (now), Huawei/OCI (coming) | Provider unlocks related services |
| Regions | 1 or 2 | 2 recommended for DR, 1 allowed |
| LoadBalancer | Cloud LB (~5-10/mo), k8gb DNS-based (free), Cilium L2 (free, single subnet) | Cloud LB recommended |
| DNS Provider | Cloudflare (always), Hetzner DNS, Route53/Cloud DNS/Azure DNS (if using that cloud) | Cloudflare recommended |
| Secrets Backend | Vault self-hosted, HCP Vault, Infisical, cloud secret managers | Self-hosted Vault recommended |
| Archival S3 | Cloudflare R2, AWS S3, GCP GCS, Azure Blob, OCI Object Storage, Huawei OBS | For backup + MinIO tiering |

### A La Carte Data Services

| Component | Purpose | DR Strategy |
|-----------|---------|-------------|
| CNPG | PostgreSQL operator | WAL streaming (async primary-replica) |
| MongoDB | Document database | CDC via Debezium → Redpanda |
| Redpanda | Kafka-compatible streaming | MirrorMaker2 |
| Valkey | Redis-compatible cache (BSD-3 OSS) | REPLICAOF |

### A La Carte Communication

| Component | Purpose |
|-----------|---------|
| Stalwart | Email server (JMAP/IMAP/SMTP) |
| STUNner | WebRTC gateway |

---

## 1. Critical Architecture Decisions (2026-01-17)

### Service Mesh: Cilium (NOT Istio)

**Decision**: Cilium Service Mesh replaces Istio entirely.

**Rationale**:
- OpenTelemetry auto-instrumentation is independent of service mesh (via init container injection)
- SQL query visibility comes from OTel Java/Python/Node agents, NOT Envoy sidecars
- Cilium provides mTLS via eBPF with lower resource overhead
- Single CNI+Mesh solution reduces operational complexity

**Cilium Service Mesh Features**:
| Feature | How |
|---------|-----|
| mTLS | Cilium identity-based encryption |
| L7 Policies | Envoy proxy (CiliumEnvoyConfig) |
| Traffic Management | CiliumNetworkPolicy + HTTPRoute |
| Observability | Hubble + OTel (independent) |

### Git Provider: Gitea Only

**Decision**: Gitea is the sole internal Git provider. GitHub/GitLab options removed.

**Architecture**:
- Gitea deployed in each region
- Bidirectional mirroring between Gitea instances
- CNPG for metadata storage (async primary-replica, NOT multi-master)
- Each Gitea connects to LOCAL CNPG only
- Cross-region writes via primary region
- Gitea Actions for CI/CD and approval workflows

### DNS Architecture: k8gb Authoritative

**Decision**: k8gb acts as authoritative DNS server (NOT just a record manager).

**Architecture**:
- k8gb CoreDNS serves as authoritative DNS for GSLB zone
- Domain registrar NS records point to k8gb CoreDNS LoadBalancer IPs
- k8gb CoreDNS is SEPARATE from Kubernetes internal CoreDNS
- No Cloudflare hybrid option - k8gb handles entire GSLB zone

### Split-Brain Protection: Cloud Witness (Cloudflare)

**Decision**: Use Cloudflare Workers + KV as cloud witness for lease-based failover authority.

**Why Cloud Witness (not external DNS resolvers)**:
- External DNS resolvers can only verify if a region is reachable, not who should be active
- Lease-based approach provides true single-source-of-truth
- Prevents k8gb's DNS-based failover from causing split-brain during partitions

**Mechanism**:
- Active region holds lease in Cloudflare KV (renews every 10s, TTL 30s)
- Standby region cannot become active while lease is held
- Failover Controller gates all readiness based on lease ownership
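
The lease mechanism above can be sketched in a few lines. This is a minimal in-memory model, not the Worker implementation: the `Witness` class, its method names, and the injectable clock are hypothetical, and it does not model Cloudflare's actual API, TTL limits, or consistency behaviour. Only the 10s renewal interval and 30s TTL come from the notes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

RENEW_INTERVAL_S = 10  # from the notes: active region renews every 10s
LEASE_TTL_S = 30       # from the notes: lease TTL is 30s


@dataclass
class Lease:
    holder: str
    expires_at: float


class Witness:
    """In-memory stand-in for the Worker + KV witness (hypothetical API)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lease: Optional[Lease] = None

    def try_acquire(self, region: str) -> bool:
        """Grant or renew the lease; refuse while another region holds it."""
        now = self._clock()
        lease = self._lease
        if lease is not None and lease.holder != region and lease.expires_at > now:
            return False
        self._lease = Lease(holder=region, expires_at=now + LEASE_TTL_S)
        return True

    def holder(self) -> Optional[str]:
        """Return the current holder, or None once the lease has expired."""
        lease = self._lease
        if lease is None or lease.expires_at <= self._clock():
            return None
        return lease.holder
```

With a TTL three times the renewal interval, the active region can miss two renewals before the standby is allowed to take over; a real witness would also need the acquire step to be atomic on the server side.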

### Failover Controller: Comprehensive Failover Orchestration

**Decision**: Build a Failover Controller that controls ALL failover (not just databases).

**Scope (Three Layers)**:
1. **External traffic** (Gateway API → k8gb): Controls HTTPRoute readiness
2. **Internal traffic** (Cilium Cluster Mesh): Controls Service endpoints
3. **Stateful services** (CNPG, MongoDB): Signals database promotion

**Key Insight**: k8gb alone cannot prevent split-brain during network partitions. The Failover Controller gates k8gb's view by controlling whether endpoints are visible.

**Architecture**:
- Cloudflare Worker + KV as witness (lease-based authority)
- Per-cluster Failover Controller with state machine (ACTIVE/STANDBY/FAILING_OVER)
- Actuators for Gateway, Service, and Database resources

**Modes**: automatic | semi-automatic | manual (for regulated environments)
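
One possible reading of the per-cluster state machine is sketched below. The three states and three modes are from the notes; the transition rules, the `next_state` and `endpoints_visible` names, and the choice to treat semi-automatic and manual alike (both wait for operator approval) are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "ACTIVE"
    STANDBY = "STANDBY"
    FAILING_OVER = "FAILING_OVER"


class Mode(Enum):
    AUTOMATIC = "automatic"
    SEMI_AUTOMATIC = "semi-automatic"
    MANUAL = "manual"


def next_state(state: State, holds_lease: bool, lease_free: bool,
               mode: Mode, operator_approved: bool = False) -> State:
    """Return the next controller state for one reconcile tick."""
    if state is State.ACTIVE:
        # Losing the lease means this region must stop serving.
        return State.ACTIVE if holds_lease else State.STANDBY
    if state is State.STANDBY:
        if not lease_free:
            return State.STANDBY
        # Assumption: only automatic mode proceeds without a human decision.
        allowed = mode is Mode.AUTOMATIC or operator_approved
        return State.FAILING_OVER if allowed else State.STANDBY
    # FAILING_OVER: promotion completes only once the lease is actually held.
    return State.ACTIVE if holds_lease else State.STANDBY


def endpoints_visible(state: State) -> bool:
    """Readiness gate: only the ACTIVE region exposes endpoints to k8gb."""
    return state is State.ACTIVE
```

Keeping the transition function pure (inputs in, state out) leaves the actuators for Gateway, Service, and Database resources as separate side-effecting steps driven by the resulting state.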

### DDoS Protection: Cloud Provider Native

**Decision**: Rely on cloud provider native DDoS protection.

| Provider | Protection | Visibility |
|----------|------------|------------|
| Hetzner | Automatic, always-on | Low (black box) |
| OCI | Always-on, free | Medium |
| Huawei | Anti-DDoS Basic (free) | Low-Medium |

**No Cloudflare proxy required** - cloud providers handle volumetric attacks at edge.

**WAF (L7)**: Coraza handles application-layer protection separately.

### Multi-Region Strategy

- **Recommended 2 regions** (BCP/DR) but **1 region allowed**
- **Independent clusters** per region (NOT stretched clusters)
- Each cluster survives independently during network partition
- Async data replication between regions (eventual consistency)

### Cloud Providers

- **Primary**: Hetzner Cloud (first supported)
- **Coming Soon**: Huawei Cloud, Oracle Cloud (OCI)
- **Dropped**: Contabo (no Crossplane support), AWS/GCP/Azure (future consideration)

### LoadBalancer Strategy

- **Option 1**: Cloud provider LoadBalancers (Hetzner LB, OCI LB, etc.) - recommended
- **Option 2**: k8gb DNS-based LB (Gateway API hostNetwork + k8gb health routing) - free
- **Option 3**: Cilium L2 Mode (ARP-based, same subnet only) - free
- BGP is NOT available on target cloud providers (only bare-metal/dedicated)

### Secrets Management

- **SOPS eliminated completely** - not even for bootstrap
- **Interactive bootstrap**: Wizard generates credentials, operator saves them
- **Architecture**: Independent Vault per cluster + ESO PushSecrets for cross-cluster sync
- **Flow**: K8s Secret → ESO PushSecret → Both Vaults simultaneously
- **ESO Generators**: Auto-create complex passwords/keys (no manual generation)
- All secrets managed via K8s CRDs (no manual Vault updates)

### Storage Architecture

- **MinIO**: Fast S3 (in-cluster) with tiered storage
- **Archival S3**: External cloud storage (R2, S3, GCS, Blob, OBS)
- **MinIO tiers to Archival S3** for cold data
- **Velero backs up to Archival S3** (not MinIO)
- **Harbor backs up to Archival S3**

### Cross-Region Networking

- **WireGuard mesh** for cross-region connectivity
- OR **native cloud peering** if same provider (Hetzner vSwitch, OCI FastConnect)
- Required for: Vault sync, k8gb coordination, data replication, Gitea mirroring

### Data Replication Patterns (All Community Edition)

| Service | Replication Method |
|---------|-------------------|
| CNPG (Postgres) | WAL streaming to standby cluster (async primary-replica) |
| Gitea | Bidirectional mirror + CNPG for metadata |
| MongoDB | CDC via Debezium → Redpanda → Sink Connector |
| Redpanda | MirrorMaker2 (native) |
| Valkey | REPLICAOF command (async) |
| MinIO | Bucket replication |
| Harbor | Registry replication |

### MongoDB Replication (IMPORTANT)

- MongoDB Community Edition does NOT have native cross-cluster replication
- **Only option**: CDC via Debezium + Redpanda
- Truly independent clusters (not stretched replica set)
- Downsides: eventual consistency, conflict resolution needed, Debezium complexity

---

## 2. OpenOva Positioning & Value Proposition

### Core Identity

OpenOva.io is **NOT** another Kubernetes platform or IDP. It is:
- **Enterprise-grade support provider for open-source K8s ecosystems**
- **Transformation journey partner** for organizations adopting cloud-native
- **Converged blueprint ecosystem** with operational guarantees

### Value Proposition

"We provide enterprise-grade, end-to-end support for curated open-source ecosystems on Kubernetes. We don't just deploy technologies - we optimize, harden, upgrade, and stand behind them."

### Differentiator

- **Operational excellence** (Day-2 safety, upgrades, SLAs) - not tooling
- **Confidence as a service** - we own the pager, not the customer
- **Productized blueprints** - intellectual property is in the converged, optimized configurations

### Target Market

- Banks, telcos, petroleum (regulated industries)
- Organizations scared of OSS complexity but wanting to avoid vendor lock-in
- Teams burned by past platform attempts

---

## 3. Architecture Model

### Blueprint vs Instance Model

- **Public blueprints** (openova-io): Templates with `<tenant>` placeholders - the "class"
- **Private instances** (acme-private): Generated repos with choices made - the "instance"
- **Bootstrap wizard**: Generates instance repos from blueprints
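
As a rough illustration of the class/instance idea, the sketch below turns blueprint files containing `<tenant>`-style placeholders into instance files. The `render_blueprint` function, the placeholder pattern, and the example path are hypothetical; the notes do not say how the wizard performs substitution.

```python
import re
from typing import Dict

# Assumption: placeholders are lowercase names in angle brackets, e.g. <tenant>.
PLACEHOLDER = re.compile(r"<([a-z][a-z0-9-]*)>")


def render_blueprint(files: Dict[str, str], choices: Dict[str, str]) -> Dict[str, str]:
    """Return instance files with every <name> placeholder substituted."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in choices:
            # Fail loudly rather than ship an instance with unresolved placeholders.
            raise KeyError(f"no value chosen for placeholder <{name}>")
        return choices[name]

    return {path: PLACEHOLDER.sub(substitute, text) for path, text in files.items()}


blueprint = {"flux/kustomization.yaml": "namespace: <tenant>-platform\n"}
instance = render_blueprint(blueprint, {"tenant": "acme"})
```

The blueprint dictionary is left untouched, which mirrors the intent that public blueprints stay generic while each private instance carries its own choices.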

### Three-Layer Architecture

```
+-------------------------------------------------------+
| OPENOVA BOOTSTRAP WIZARD (Managed UI)                 |
| - Hosted on OpenOva's infrastructure                  |
| - Collects credentials, runs Terraform                |
| - Export option for self-hosted bootstrap             |
| - Permanent sessions with SSO (Google/Azure)          |
| - Exits the picture after bootstrap complete          |
+-------------------------------------------------------+
                           |
                           v
+-------------------------------------------------------+
| CUSTOMER'S ENVIRONMENT (Post-Bootstrap)               |
| - Backstage (IDP - entry door for lifecycle)          |
| - Flux (GitOps delivery)                              |
| - Gitea (internal Git with bidirectional mirror)      |
| - Crossplane (selective - lifecycle abstraction)      |
| - Operators (CNPG, etc.)                              |
+-------------------------------------------------------+
                           |
                           v
+-------------------------------------------------------+
| OPENOVA BLUEPRINTS (Our IP - stays in picture)        |
| - Certified configurations                            |
| - Upgrade-safe versions                               |
| - Best practices (PDBs, VPAs, policies)               |
| - Published via Git, consumed by customer's Flux      |
+-------------------------------------------------------+
```

### Key Architectural Decisions

1. **Bootstrap wizard is SEPARATE** - independent repo/application, hosted on OpenOva
2. **Bootstrap wizard EXITS after provisioning** - must be safe to delete after day 1
3. **First cluster inherits bootstrapping capability** via Crossplane/CAPI for expansion
4. **Backstage becomes the entry point** for customer's lifecycle management
5. **OpenOva stays in picture via blueprints** - not runtime components
6. **Terraform is the unified bootstrap mechanism** (SaaS or Self-Hosted)

---

## 4. Bootstrap Modes

### Mode 1: Managed Bootstrap ("OpenOva Cloud Bootstrap")

- Customer uses OpenOva wizard (hosted UI)
- OpenOva's Terraform provisions customer's cloud infrastructure
- After bootstrap, customer's Crossplane takes over
- Customer provides cloud credentials to OpenOva (temporarily)
- Redirect to Backstage after completion

### Mode 2: Self-Hosted Bootstrap ("OpenOva Bring-Your-Own Bootstrap")

- Customer exports Terraform manifests from wizard
- Customer runs Terraform locally with their own credentials
- Credentials never leave customer environment
- Same end result: Backstage + platform stack ready

### Unified Approach

Both modes use the same Terraform manifests; the only difference is WHERE `terraform apply` runs.

### Bootstrap Sequence

```
Terraform → K8s Cluster → Flux (bootstrap)
                        → Gitea (internal Git)
                        → Crossplane + Operators
                        → Backstage + Grafana Stack
                        → Platform ready
```

---

## 5. Git Repository: Gitea

**Fixed decision**: Gitea is the sole Git provider.

### Gitea Architecture

- Deployed in each region (active-active for reads)
- Bidirectional mirroring between instances
- CNPG for PostgreSQL metadata (async primary-replica)
- Each Gitea connects to LOCAL CNPG only
- Gitea Actions for CI/CD pipelines
- CODEOWNERS for security approval workflows

### Why Gitea (not GitHub/GitLab)

| Reason | Benefit |
|--------|---------|
| Self-hosted | Full control, no external dependency |
| Lightweight | Lower resource footprint than GitLab |
| GitOps-focused | Designed for Flux integration |
| Bidirectional mirror | Active-active reads across regions |
| Gitea Actions | GitHub Actions compatible CI/CD |

---

## 6. IDP vs Crossplane Decision

### With IDP (Backstage) in place:

- **IDP handles**: Catalog UX, form generation, YAML templating, PR creation
- **Crossplane needed only when**:
  - Multi-backend portability expected (CNPG today → managed DB tomorrow)
  - Complex compositions (one request → many resources)
  - Non-K8s resources in same catalog
  - Lifecycle coupling required

### Encapsulation Strategy: LIGHT

- Thin claims (5-10 fields max): tier, ha, backup, deletionProtection, networkProfile
- Everything else stays internal (operator defaults)
- Two-lane model: Standard (90%) + Advanced escape hatch (10%)
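
A thin claim along these lines could be modelled as a small validated record. The five field names are taken from the list above; the `PostgresClaim` class, the allowed tier and network-profile values, and the defaults are hypothetical and only illustrate how little surface the standard lane is meant to expose.

```python
from dataclasses import dataclass
from typing import Dict, Union

ALLOWED_TIERS = ("tier-1", "tier-2", "tier-3")        # hypothetical values
ALLOWED_NETWORK_PROFILES = ("internal", "public")      # hypothetical values


@dataclass(frozen=True)
class PostgresClaim:
    """The only knobs an app team sets; everything else is an operator default."""

    tier: str
    ha: bool = True
    backup: bool = True
    deletion_protection: bool = True
    network_profile: str = "internal"

    def __post_init__(self) -> None:
        if self.tier not in ALLOWED_TIERS:
            raise ValueError(f"unknown tier: {self.tier!r}")
        if self.network_profile not in ALLOWED_NETWORK_PROFILES:
            raise ValueError(f"unknown network profile: {self.network_profile!r}")

    def to_spec(self) -> Dict[str, Union[str, bool]]:
        """Render the claim using the field names from the notes."""
        return {
            "tier": self.tier,
            "ha": self.ha,
            "backup": self.backup,
            "deletionProtection": self.deletion_protection,
            "networkProfile": self.network_profile,
        }


claim = PostgresClaim(tier="tier-1")
```

Rejecting unknown values at construction time keeps the "contract" explicit: anything outside these fields would go through the advanced escape hatch instead.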

---

## 7. End-User Journeys

### Journey 1: Initial Bootstrap (Infra SPOC)

```
OpenOva Wizard UI → Select cloud/options → Generate Terraform
                  → Run Terraform (managed or self-hosted)
                  → Cluster + Platform ready
                  → Redirect to Backstage URL
```

### Journey 2: Day-2 Operations (App Teams via Backstage)

```
Backstage → Select blueprint (e.g., "Tier-1 Postgres")
          → Fill minimal form (tier, ha, backup)
          → PR generated to Gitea
          → Flux applies → Operator reconciles
          → Resource ready, secret injected
```

### Journey 3: Platform Extension (Infra SPOC via Backstage)

```
Backstage → Platform Admin section
          → "Add Cluster" or "Enable Capability Pack"
          → PR generated to Gitea
          → Flux + Crossplane/CAPI provision
```

### Journey 4: Blueprint Updates (OpenOva → Customer)

```
OpenOva publishes new blueprint version
  → Customer's Backstage shows notification
  → Customer reviews changelog
  → Customer clicks "Upgrade" (generates PR to Gitea)
  → Flux applies
```

---

## 8. Support Model

### Fully Supported

- Entire mandatory stack
- Selected a la carte components
- Blueprint configurations only

### Best Effort

- Customer customizations beyond blueprints
- Edge cases not in support matrix

### Unsupported

- Versions outside support matrix
- Non-blueprint configurations
- DIY operator installations

---

## 9. Decided Questions (2026-01-17)

| Question | Decision |
|----------|----------|
| Service mesh | Cilium Service Mesh (NOT Istio) |
| Git provider | Gitea only (GitHub/GitLab removed) |
| Cloud provider | Hetzner first, then Huawei/OCI. Contabo dropped. |
| Multi-region | Recommended 2 regions but 1 region allowed (independent clusters) |
| LoadBalancer | Cloud LB (default), k8gb DNS-based (free), Cilium L2 (single subnet) |
| DNS architecture | k8gb as authoritative DNS server for GSLB zone |
| Split-brain protection | Cloudflare Workers + KV (lease-based witness) |
| Failover orchestration | Failover Controller (controls external, internal, stateful) |
| DDoS protection | Cloud provider native (no Cloudflare proxy) |
| Secrets backend | Self-hosted Vault per cluster + ESO PushSecrets (or SaaS options) |
| SOPS | Eliminated completely |
| Harbor | Mandatory from day 1 |
| VPA | Mandatory |
| Crossplane | Mandatory for post-bootstrap cloud ops |
| MongoDB replication | CDC via Debezium + Redpanda |
| Redis-compatible cache | Valkey (BSD-3, Linux Foundation) |
| MinIO | Fast S3 with tiering (NOT backup target) |
| Archival S3 | R2/S3/GCS/Blob for backup + tiering |
| GitOps | Flux (ArgoCD as future option) |
| CI/CD | Gitea Actions |
| Observability | OTel auto-instrumentation (independent of mesh) + Grafana Stack |

## 10. Open Decisions / Questions

1. **Exact naming for bootstrap modes** - "Managed" vs "Self-Hosted"?
2. **First flagship blueprint** - PostgreSQL or Service Mesh?
3. **Wizard tech stack** - what to build it with?
4. **Failover Controller implementation** - research existing OSS or build new?
5. **Conflict resolution strategy** - for eventual consistency scenarios

---

## 15. RESOLVED - k8gb and Failover Architecture (2026-01-18)

### 15.1 k8gb Architecture Deep Dive

**Status:** RESOLVED

**Key Finding from Source Code Analysis:**

k8gb clusters operate **independently** with **DNS-based discovery only**:

| Aspect | k8gb Behavior |
|--------|---------------|
| Local health check | Direct service health check (Ingress/Gateway endpoints) |
| Cross-cluster "health" | DNS query to `localtargets-*` record |
| Communication | **DNS only** - no direct health checks between clusters |

**Critical Limitation:** k8gb cannot distinguish between:
- "Region is down" (failover needed)
- "Network partition" (failover NOT wanted)

Both produce the same symptom: DNS query fails or times out.

```
Cluster B queries: localtargets-app.example.com from Cluster A
    ├── Gets IPs → "Cluster A is healthy"
    └── No IPs / timeout → "Cluster A is unavailable" (but WHY?)
```

**Scenarios Analyzed:**

| Scenario | k8gb Behavior | Problem? |
|----------|---------------|----------|
| Region truly down | Removes region from DNS | Correct |
| Network partition | Also removes region from DNS | **Incorrect failover** |
| Both healthy | Returns both regions | Correct |

**Conclusion:** k8gb is suitable for **stateless services** where brief dual-routing during partition is acceptable. For **stateful services** and strict active-passive, a Failover Controller with cloud witness is required.
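
The ambiguity described above, and how a witness removes it, can be shown with two small functions. Both are illustrative only: `k8gb_view` and `should_fail_over` are hypothetical names, and the DNS answer and lease holder are passed in as plain values rather than queried from real systems.

```python
from typing import Optional, Sequence


def k8gb_view(dns_answer: Optional[Sequence[str]]) -> str:
    """What a peer can conclude from a localtargets-* query alone."""
    # None models a timeout, an empty sequence models "no IPs returned".
    return "healthy" if dns_answer else "unavailable"


def should_fail_over(dns_answer: Optional[Sequence[str]],
                     lease_holder: Optional[str],
                     peer_region: str) -> bool:
    """Combine the DNS observation with the witness's lease state."""
    if dns_answer:
        return False  # peer is answering: nothing to do
    if lease_holder == peer_region:
        return False  # peer still renews its lease: likely a partition
    return True       # no answer and no live lease: peer is really down


# Region down and network partition look identical through DNS alone.
region_down = k8gb_view(None)
partition = k8gb_view(None)
```

In this sketch the two scenarios only diverge once the lease holder is consulted, which is the extra signal the cloud witness is meant to provide.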

### 15.2 Failover Controller Design

**Status:** RESOLVED

**Architecture Decision:** Cloudflare Workers + KV as cloud witness

| Component | Role |
|-----------|------|
| Cloudflare Worker | Lease management API |
| Cloudflare KV | Lease storage with TTL |
| Failover Controller | Per-cluster controller that manages readiness |

**Three Layers Controlled:**

1. **External** (Gateway API → k8gb): HTTPRoute readiness
2. **Internal** (Cilium Cluster Mesh): Service endpoint manipulation
3. **Stateful** (CNPG, MongoDB): Database promotion signaling

**Witness Pattern:**
- Active region holds lease (renews every 10s, TTL 30s)
- Standby region queries lease status
- If lease expires → standby acquires lease → becomes active
- Network partition between regions: both can still reach the witness → active keeps renewing → standby cannot acquire the lease → no split-brain

**Documentation:** See `failover-controller/docs/ADR-FAILOVER-CONTROLLER.md`

### 15.3 k8gb Scope Clarification

**Status:** RESOLVED

**k8gb is for EXTERNAL services only:**
- Routes traffic via DNS based on endpoint availability
- Does NOT coordinate internal services
- Does NOT handle database failover

**Internal services use Cilium Cluster Mesh:**
- Cross-region service discovery
- Failover Controller manipulates endpoints

**ExternalDNS Role:**
- Creates NS records delegating GSLB zone to k8gb
- Manages non-GSLB records in parent zone
- One-time setup for delegation, ongoing for other records

### 15.4 Gateway API Clarification

**Status:** RESOLVED

- Entry point: Kubernetes Gateway API backed by Cilium/Envoy
- Traefik (K3s default): Disabled in OpenOva deployments
- Kong: Not included (Cilium Gateway sufficient for routing)
- API Management: Future consideration if needed

### 15.5 Redis-Compatible Caching

**Status:** RESOLVED

- **Valkey** selected (Linux Foundation, BSD-3)
- Dragonfly dropped (BSL license)
- Redis OSS dropped (license concerns)

### 15.6 Harbor S3 Backend

**Status:** RESOLVED

- MinIO as S3 backend documented
- Tiered archiving to external S3 documented

### 15.7 SRE Repo

**Status:** FUTURE DISCUSSION

- VPA policies
- Topology spread
- PVC resizing
- KEDA configurations

---

## 11. New A La Carte Components (2026-01-18)

### Identity

| Component | Purpose | Use Cases |
|-----------|---------|-----------|
| Keycloak | OIDC/OAuth/FAPI Authorization Server | Any app needing auth, SSO, FAPI compliance |

### Monetization

| Component | Purpose | Use Cases |
|-----------|---------|-----------|
| OpenMeter | Usage metering | API monetization, usage tracking |
| Lago | Billing and invoicing | Subscription billing, usage-based pricing |

These are standalone a la carte components that can be used independently or bundled into meta blueprints.

---

## 12. Open Banking Meta Blueprint (2026-01-18)

### Overview

Meta blueprint that bundles a la carte components with custom services for PSD2/FAPI fintech sandboxes.

### Architecture Concept

**Meta Blueprint = A La Carte Components + Custom Services**

```
Open Banking Meta Blueprint
├── Keycloak (a la carte)  ─► FAPI Authorization
├── OpenMeter (a la carte) ─► Usage metering
├── Lago (a la carte)      ─► Billing
└── Custom Services        ─► Open Banking specific
    ├── ext-authz
    ├── accounts-api
    ├── payments-api
    ├── consents-api
    ├── tpp-management
    └── sandbox-data
```

### Key Architectural Decision

**Envoy at the heart** - NOT Kong/Tyk. Leverages existing Cilium/Envoy investment with specialized services.

### Architecture Flow

```
TPP Request (eIDAS cert)
        |
        v
Cilium Ingress (Envoy)
        |
        +--> ext_authz Service
        |         |
        |         +--> Validate eIDAS cert
        |         +--> Check TPP registry
        |         +--> Verify consent
        |         +--> Check/decrement quota (Valkey)
        |
        v
Backend Services (Accounts/Payments/Consents)
        |
        v
Access Logs --> Redpanda --> OpenMeter --> Lago
```
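
The ext_authz step in the flow above is essentially an ordered chain of checks that stops at the first denial. The sketch below shows that shape only: `TppRequest`, `authorize`, `build_checks`, and the in-memory registry and consent lookups are hypothetical stand-ins, and certificate validation is reduced to a boolean.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class TppRequest:
    cert_valid: bool            # outcome of eIDAS certificate validation
    tpp_id: str
    consent_id: Optional[str]


# A check returns a denial reason, or None when the request passes.
Check = Callable[[TppRequest], Optional[str]]


def authorize(request: TppRequest, checks: Iterable[Check]) -> Tuple[bool, Optional[str]]:
    """Run checks in order and stop at the first denial."""
    for check in checks:
        reason = check(request)
        if reason is not None:
            return False, reason
    return True, None


def build_checks(registry: Set[str], consents: Dict[str, str],
                 has_quota: Callable[[str], bool]) -> List[Check]:
    """Assemble the four checks in the order shown in the flow diagram."""
    return [
        lambda r: None if r.cert_valid else "invalid eIDAS certificate",
        lambda r: None if r.tpp_id in registry else "unknown TPP",
        lambda r: None if consents.get(r.consent_id) == r.tpp_id else "no valid consent",
        lambda r: None if has_quota(r.tpp_id) else "quota exhausted",
    ]


checks = build_checks({"tpp-1"}, {"c-1": "tpp-1"}, lambda tpp_id: True)
decision = authorize(TppRequest(True, "tpp-1", "c-1"), checks)
```

Ordering the cheap checks first means the quota store is only touched for requests that are otherwise valid.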

### Monetization Models

| Model | Flow |
|-------|------|
| Prepaid | Buy credits → Valkey balance → Atomic decrement → Block at zero |
| Post-paid | Use APIs → Meter usage → Invoice at period end |
| Subscription + Overage | Monthly base + per-call overage |
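
The prepaid row hinges on the check and the decrement happening as one atomic step. A minimal in-process sketch follows; the `CreditStore` class is hypothetical, and the `threading.Lock` merely stands in for whatever atomic server-side operation the real Valkey-backed implementation would use.

```python
import threading
from typing import Dict


class CreditStore:
    """In-memory stand-in for per-TPP prepaid balances (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._balances: Dict[str, int] = {}

    def top_up(self, tpp_id: str, credits: int) -> None:
        """Add purchased credits to a TPP's balance."""
        with self._lock:
            self._balances[tpp_id] = self._balances.get(tpp_id, 0) + credits

    def try_consume(self, tpp_id: str, cost: int = 1) -> bool:
        """Check and decrement in one critical section; block at zero."""
        with self._lock:
            balance = self._balances.get(tpp_id, 0)
            if balance < cost:
                return False
            self._balances[tpp_id] = balance - cost
            return True

    def balance(self, tpp_id: str) -> int:
        with self._lock:
            return self._balances.get(tpp_id, 0)


store = CreditStore()
store.top_up("tpp-1", 2)
```

Doing the comparison and the write under the same lock is what prevents two concurrent calls from both spending the last credit.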

### Why Not Kong/Tyk

- Already have Cilium/Envoy for service mesh
- Open Banking logic doesn't fit plugin architecture
- Unified observability with existing Grafana stack
- Custom services give full control over PSD2 compliance

### Open Banking Standards

| Standard | Status |
|----------|--------|
| UK Open Banking 3.1 | Primary |
| Berlin Group NextGenPSD2 | Planned |
| STET (France) | Planned |

### Documentation

- ADR: `handbook/docs/adrs/ADR-OPEN-BANKING-BLUEPRINT.md`
- Spec: `handbook/docs/specs/SPEC-OPEN-BANKING-ARCHITECTURE.md`
- Blueprint: `handbook/docs/blueprints/BLUEPRINT-OPEN-BANKING.md`

---

## 13. Repository Structure

```
openova-io/                      # Public blueprints org
├── bootstrap/                   # Bootstrap wizard
├── terraform/                   # IaC modules
├── flux/                        # GitOps configs
├── handbook/                    # Documentation
├── <component>/                 # Individual component blueprints
│   ├── cilium/                  # CNI + Service Mesh
│   ├── gitea/                   # Git server
│   ├── failover-controller/     # Failover orchestration
│   ├── grafana/
│   ├── harbor/
│   ├── vault/
│   ├── k8gb/
│   ├── external-dns/
│   ├── keycloak/                # FAPI AuthZ (Open Banking)
│   ├── openmeter/               # Usage metering (Open Banking)
│   ├── lago/                    # Billing (Open Banking)
│   ├── open-banking/            # Open Banking services
│   └── ...

acme-private/                    # Example private instance
├── terraform/                   # Configured for acme
├── flux/                        # Configured for acme
├── <component>/                 # Configured for acme
```

---

## 14. Key Quotes & Principles

> "Crossplane doesn't kill Terraform. It kills Terraform-as-a-control-plane."

> "The catalog is a contract, not a UI."

> "You are selling confidence, not Kubernetes. Insurance, not innovation."

> "If the bootstrap platform stays in the picture after day 1, it's doing too much."

> "IDP is the front desk. Your thin layer is the contract, the rules, and the insurance behind the desk."

> "Wrap CNPG/Strimzi only if you are intentionally offering 'databases' and 'streams' as platform products."

> "Public blueprints are the class, private instances are the objects."

> "OTel is completely independent of service mesh - that's why Cilium is a no-brainer."

---

## 15. Technical ADRs Referenced

- ADR-MULTI-REGION-STRATEGY: Independent clusters, recommended not enforced
- ADR-PLATFORM-ENGINEERING-TOOLS: Crossplane, Backstage, Flux (mandatory)
- ADR-IMAGE-REGISTRY: Harbor mandatory
- ADR-SECURITY-SCANNING: Trivy CI/CD + Harbor + Runtime
- ADR-CILIUM-SERVICE-MESH: Cilium replaces Istio
- ADR-GITEA: Gitea as sole Git provider
- ADR-FAILOVER-CONTROLLER: Generic failover orchestration
- ADR-K8GB-GSLB: k8gb as authoritative DNS
- ADR-AIRGAP-COMPLIANCE: Air-gap capable architecture

---

## 16. Competitive Landscape

### Not Competing With

- Red Hat OpenShift (distro)
- Cloud providers (AWS/GCP/Azure)
- Pure tooling vendors

### Competing For

- Regulated enterprises wanting OSS with support
- Organizations burned by OpenShift cost/complexity
- Teams needing "someone to call at 3am"

### Adjacent Players

- Upbound (Crossplane ecosystem)
- Humanitec (Platform orchestrator)
- Loft/vCluster (Multi-tenancy)

---
|
||||
|
||||
## 17. Monorepo Consolidation (2026-02-08)

### Decision

Consolidated 45+ separate GitHub repos into a single monorepo: `openova-io/openova`

### Structure

```
openova/
├── core/                 # Bootstrap + Lifecycle Manager (Go application)
├── platform/             # All 41 component blueprints (FLAT structure)
├── meta-platforms/       # Bundled vertical solutions (README only)
│   ├── ai-hub/           # Enterprise AI platform
│   └── open-banking/     # PSD2/FAPI fintech sandbox (+ 6 services)
└── docs/                 # Platform documentation
```

### Key Decisions

| Decision | Rationale |
|----------|-----------|
| Flat platform/ structure | No hierarchical subfolders (networking/, security/, etc.) |
| Documentation shows groupings | README displays logical categories while folders stay flat |
| Meta-platforms are README-only | Reference components from platform/, no duplication |
| Core is single Go app | Bootstrap + Lifecycle Manager with mode switch |
| 41 platform components | All flat under platform/ |
| 6 open-banking custom services | Under meta-platforms/open-banking/ |

### Component Count (47 total)

- **Platform components**: 41 (flat under platform/)
- **Open Banking services**: 6 (accounts-api, consents-api, ext-authz, payments-api, sandbox-data, tpp-management)

### Documentation Groupings (in READMEs)

**Mandatory (Core Platform):**

| Category | Components |
|----------|------------|
| Infrastructure | terraform, crossplane |
| GitOps & IDP | flux, gitea, backstage |
| Networking | cilium, external-dns, k8gb, stunner |
| Security | cert-manager, external-secrets, vault, trivy |
| Policy | kyverno |
| Observability | grafana |
| Scaling | vpa, keda |
| Storage | minio, velero |
| Registry | harbor |
| Failover | failover-controller |

**A La Carte (Optional):**

| Category | Components |
|----------|------------|
| Data | cnpg, mongodb, valkey, redpanda |
| Identity | keycloak |
| Communication | stalwart |
| Monetization | openmeter, lago |
| AI/ML | knative, kserve, vllm, milvus, neo4j, langserve, librechat, n8n, searxng, bge, llm-gateway, anthropic-adapter |

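The counts in this section can be cross-checked against the two grouping tables. The snippet below is a minimal sketch, not part of the repository: it copies the component names from the Mandatory and A La Carte tables and the six Open Banking services listed under Component Count, then confirms that they add up to 41 platform components and 47 in total. The variable and function names are illustrative only.

```python
# Component names copied from the Mandatory and A La Carte grouping tables.
MANDATORY = {
    "Infrastructure": ["terraform", "crossplane"],
    "GitOps & IDP": ["flux", "gitea", "backstage"],
    "Networking": ["cilium", "external-dns", "k8gb", "stunner"],
    "Security": ["cert-manager", "external-secrets", "vault", "trivy"],
    "Policy": ["kyverno"],
    "Observability": ["grafana"],
    "Scaling": ["vpa", "keda"],
    "Storage": ["minio", "velero"],
    "Registry": ["harbor"],
    "Failover": ["failover-controller"],
}
A_LA_CARTE = {
    "Data": ["cnpg", "mongodb", "valkey", "redpanda"],
    "Identity": ["keycloak"],
    "Communication": ["stalwart"],
    "Monetization": ["openmeter", "lago"],
    "AI/ML": [
        "knative", "kserve", "vllm", "milvus", "neo4j", "langserve",
        "librechat", "n8n", "searxng", "bge", "llm-gateway", "anthropic-adapter",
    ],
}
# Custom services listed under "Component Count".
OPEN_BANKING_SERVICES = [
    "accounts-api", "consents-api", "ext-authz",
    "payments-api", "sandbox-data", "tpp-management",
]


def flatten(groups):
    """Return every component name across all categories, in table order."""
    return [name for names in groups.values() for name in names]


mandatory = flatten(MANDATORY)
a_la_carte = flatten(A_LA_CARTE)
platform = mandatory + a_la_carte
total = len(platform) + len(OPEN_BANKING_SERVICES)

# No component should appear in more than one category.
assert len(platform) == len(set(platform))

print(len(mandatory), len(a_la_carte), len(platform), total)  # 21 20 41 47
```

A check like this could run in CI against the actual directory listing of `platform/`, so that the README groupings and the flat folder layout cannot drift apart; that wiring is a suggestion, not something the project is known to do.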
### Sync to Customer Gitea

```
GitHub (monorepo)                     Customer Gitea (multi-repo)
─────────────────                     ──────────────────────────
openova/core/               ──sync──> openova-core/
openova/platform/cilium/    ──sync──> openova-cilium/
openova/platform/flux/      ──sync──> openova-flux/
```

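The naming rule implied by the three mappings above can be sketched as a small function. This is an illustration under assumptions, not the actual sync tooling: `target_repo` is a hypothetical name, and the rule (drop the `platform/` segment, prefix with `openova-`) is inferred only from the examples shown. Paths outside `core/` and `platform/` are rejected rather than guessed, because this document does not state how they map.

```python
from pathlib import PurePosixPath


def target_repo(monorepo_path: str) -> str:
    """Map a monorepo directory to a customer Gitea repo name.

    Hypothetical helper: the rule is inferred from the three examples
    in the sync diagram and is not confirmed by the real sync tooling.
    """
    # PurePosixPath drops the trailing slash, so "openova/core/" and
    # "openova/core" are treated the same way.
    parts = PurePosixPath(monorepo_path).parts
    if parts[:2] == ("openova", "core"):
        return "openova-core"
    if parts[:2] == ("openova", "platform") and len(parts) >= 3:
        # Files nested inside a component map to that component's repo.
        return f"openova-{parts[2]}"
    raise ValueError(f"no sync rule known for {monorepo_path!r}")


for path in ("openova/core/", "openova/platform/cilium/", "openova/platform/flux/"):
    print(f"{path} -> {target_repo(path)}/")
```

Raising on unknown paths keeps the sketch honest: a new top-level directory then needs an explicit rule instead of silently producing a wrong repo name.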

---

## 18. Core Application Architecture (2026-02-08)

### Two Deployment Modes

| Mode | Location | Purpose | IaC Tool |
|------|----------|---------|----------|
| **Bootstrap** | Outside cluster | Initial provisioning | Terraform |
| **Lifecycle Manager** | Inside cluster | Day-2 operations | Crossplane |

||||
### Zero External Dependencies
|
||||
|
||||
| Mode | State Storage | Rationale |
|
||||
|------|---------------|-----------|
|
||||
| Bootstrap | SQLite (embedded) | No CNPG needed for ephemeral wizard |
|
||||
| Manager | Kubernetes CRDs | Native K8s, no external DB |
|
||||
|
||||
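The two tables above amount to a per-mode profile selected by a single switch. The sketch below illustrates that idea only. The real core is a Go application, so this Python version is not its implementation, and the flag values `"bootstrap"` and `"manager"` as well as all class and function names are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    # Flag values are hypothetical; the real switch may be named differently.
    BOOTSTRAP = "bootstrap"
    MANAGER = "manager"


@dataclass(frozen=True)
class ModeProfile:
    location: str
    purpose: str
    iac_tool: str
    state_storage: str


# Values taken from the "Two Deployment Modes" and
# "Zero External Dependencies" tables.
PROFILES = {
    Mode.BOOTSTRAP: ModeProfile(
        location="outside cluster",
        purpose="initial provisioning",
        iac_tool="Terraform",
        state_storage="SQLite (embedded)",
    ),
    Mode.MANAGER: ModeProfile(
        location="inside cluster",
        purpose="day-2 operations",
        iac_tool="Crossplane",
        state_storage="Kubernetes CRDs",
    ),
}


def profile_for(mode_flag: str) -> ModeProfile:
    """Resolve a mode flag to its profile; unknown flags raise ValueError."""
    return PROFILES[Mode(mode_flag)]


print(profile_for("bootstrap").state_storage)
print(profile_for("manager").iac_tool)
```

Keeping both profiles in one table makes the "single app with mode switch" decision visible: neither mode needs an external database, and the IaC tool changes only with the mode.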
### Bootstrap Exits After Provisioning

The bootstrap wizard is designed to be **safely deletable** after initial provisioning. Crossplane owns all cloud resources going forward.

### No Overlap with Backstage

| Concern | Backstage | Lifecycle Manager |
|---------|-----------|-------------------|
| **Audience** | Developers (internal) | Platform operators |
| **Focus** | Service catalog, scaffolding | Platform health, upgrades |
| **Scope** | Application-level | Infrastructure-level |
| **UI** | Rich portal | Minimal admin dashboard |

---

## 19. AI Hub Meta-Platform (2026-02-08)

### Overview

Enterprise AI platform with LLM serving, RAG, and intelligent agents.

### Components (12 from platform/)

knative, kserve, vllm, milvus, neo4j, langserve, librechat, n8n, searxng, bge, llm-gateway, anthropic-adapter

### Architecture

```
User Interfaces (LibreChat, Claude Code, n8n)
        ↓
Gateway Layer (LLM Gateway, Anthropic Adapter)
        ↓
RAG Service (LangServe)
        ↓
Model Serving (KServe, vLLM)
        ↓
Knowledge Layer (Milvus vectors, Neo4j graph)
        ↓
Embeddings (BGE-M3, BGE-Reranker)
```

### Resource Requirements

| Component | CPU | Memory | GPU |
|-----------|-----|--------|-----|
| vLLM | 4 | 32Gi | 2x A10 |
| BGE-M3 | 2 | 4Gi | 1x A10 |
| BGE-Reranker | 1 | 2Gi | 1x A10 |
| Milvus (3 replicas) | 2 | 8Gi | - |
| **Total** | ~15 | ~55Gi | 4x A10 |

---

## 20. Deleted Repositories (2026-02-08)

All old separate repos deleted after monorepo consolidation:

openova-anthropic-adapter, openova-backstage, openova-bge, openova-cert-manager, openova-cilium, openova-cnpg, openova-crossplane, openova-external-dns, openova-external-secrets, openova-failover-controller, openova-flux, openova-gitea, openova-grafana, openova-harbor, openova-k8gb, openova-keda, openova-keycloak, openova-knative, openova-kserve, openova-kyverno, openova-lago, openova-langserve, openova-librechat, openova-llm-gateway, openova-milvus, openova-minio, openova-mongodb, openova-n8n, openova-neo4j, openova-openmeter, openova-redpanda, openova-searxng, openova-stalwart, openova-stunner, openova-terraform, openova-trivy, openova-valkey, openova-vault, openova-velero, openova-vllm, openova-vpa, openova-open-banking, openova-core, openova-handbook

---

*This document serves as persistent context for Claude Code sessions. Update as decisions are made.*