# KServe

Kubernetes-native model serving. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6). Used by bp-cortex to serve LLMs via vLLM, embedding models via BGE, and any custom inference workload.

Status: Accepted | Updated: 2026-04-27

## Overview

KServe provides standardized model serving on Kubernetes with support for multiple ML frameworks, autoscaling, and inference graphs.
```mermaid
flowchart TB
    subgraph KServe["KServe"]
        Controller[KServe Controller]
        Predictor[Predictor]
        Transformer[Transformer]
        Explainer[Explainer]
    end
    subgraph Runtimes["Serving Runtimes"]
        vLLM[vLLM]
        TorchServe[TorchServe]
        Triton[Triton]
        SKLearn[SKLearn]
    end
    subgraph Knative["Knative Serving"]
        Autoscale[Autoscaling]
        Revisions[Revisions]
    end
    Controller --> Predictor
    Controller --> Transformer
    Controller --> Explainer
    Predictor --> Runtimes
    Runtimes --> Knative
```
## Why KServe?
| Feature | Benefit |
|---|---|
| Multi-framework | TensorFlow, PyTorch, ONNX, vLLM, etc. |
| Autoscaling | Scale-to-zero via Knative |
| InferenceService | Standardized deployment pattern |
| Inference Graph | Multi-model pipelines |
| Model explainability | Integrated explainers |
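Scale-to-zero is configured per component through replica bounds on the InferenceService. A minimal sketch of the relevant predictor fields (the bounds shown are illustrative; verify field behavior against the deployed KServe version):

```yaml
# Allow the predictor to scale to zero when idle; Knative cold-starts
# a replica when the next request arrives.
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 4
```

Note that scale-to-zero trades idle cost for cold-start latency, which can be significant for large LLM weights.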
## Components
| Component | Purpose |
|---|---|
| InferenceService | Model deployment abstraction |
| ServingRuntime | Framework-specific runtime |
| InferenceGraph | Multi-model orchestration |
| ClusterStorageContainer | Model storage configuration |
## Serving Runtimes
| Runtime | Use Case |
|---|---|
| vLLM | LLM inference (recommended) |
| TorchServe | PyTorch models |
| Triton | Multi-framework, high performance |
| SKLearn | Scikit-learn models |
| XGBoost | Gradient boosting models |
| ONNX | ONNX format models |
## Configuration

### InferenceService Example
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen-32b
      resources:
        requests:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "2"
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "2"
```
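New model revisions can be rolled out gradually using KServe's canary rollout support on the predictor; a sketch of the relevant fragment (the 10% split is illustrative, and the `canaryTrafficPercent` behavior should be verified against the deployed KServe version):

```yaml
# Route 10% of traffic to the latest revision of the predictor while
# the previous revision continues to serve the remaining 90%.
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen-32b
```

Promoting the canary is done by removing the field (or setting it to 100) once the new revision is healthy.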
### ServingRuntime for vLLM
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
spec:
  supportedModelFormats:
  - name: vllm
    autoSelect: true
  containers:
  - name: kserve-container
    image: vllm/vllm-openai:latest
    args:
    - --model=$(MODEL_ID)
    - --tensor-parallel-size=2
    - --max-model-len=32768
    resources:
      requests:
        nvidia.com/gpu: "2"
```
## Inference Graph

Multi-model pipeline for complex inference:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: rag-pipeline
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
      - name: embedder
        serviceName: bge-embedder
      - name: retriever
        serviceName: vector-search
        data: $response    # feed the embedder output forward
      - name: llm
        serviceName: qwen-llm
        data: $response    # feed the retriever output forward
```
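Besides `Sequence`, InferenceGraph supports other router types such as `Splitter` for weighted traffic distribution. A hedged sketch of an A/B split between two model services (the service names and weights are illustrative; verify router types against the deployed KServe version):

```yaml
# Split traffic 80/20 between a stable model and a candidate model.
spec:
  nodes:
    root:
      routerType: Splitter
      steps:
      - serviceName: qwen-llm
        weight: 80
      - serviceName: qwen-llm-candidate
        weight: 20
```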
## GPU Scheduling
```yaml
# Node selector and toleration for GPU nodes
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A10
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
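The toleration above only has an effect if GPU nodes carry the corresponding taint, which is typically applied by the NVIDIA GPU Operator or at node provisioning time rather than authored by hand. For reference, the matching fragment of the node spec looks like:

```yaml
# Taint on the GPU node that the toleration above matches; keeps
# non-GPU workloads off expensive accelerator nodes.
spec:
  taints:
  - key: nvidia.com/gpu
    effect: NoSchedule
```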
## Model Storage

### PVC-based Storage
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-hub
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: oci-bv
```
### S3-based Storage (SeaweedFS)
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: seaweedfs-storage
spec:
  supportedUriFormats:
  - prefix: s3://
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: seaweedfs-credentials
          key: accesskey
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: seaweedfs-credentials
          key: secretkey
    - name: S3_ENDPOINT
      value: http://seaweedfs.storage.svc:8333
```
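With this storage container in place, an InferenceService can pull model artifacts straight from a bucket on the SeaweedFS endpoint instead of a PVC. A sketch of the relevant predictor fragment (the bucket and path are illustrative):

```yaml
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      # Resolved by the storage-initializer against S3_ENDPOINT above.
      storageUri: s3://models/qwen-32b
```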
## Monitoring
| Metric | Prometheus metric |
|---|---|
| Inference latency | kserve_inference_duration_seconds |
| Request count | kserve_inference_count |
| GPU utilization | DCGM_FI_DEV_GPU_UTIL |
| Model load time | kserve_model_load_duration_seconds |
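These metrics can drive alerting as well as dashboards. A hedged sketch of a latency alert, assuming the Prometheus Operator CRDs are installed and the duration metric is exported as a histogram (the `_bucket` suffix, threshold, and windows are assumptions to validate against the actual scrape output):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kserve-latency
  namespace: ai-hub
spec:
  groups:
  - name: kserve
    rules:
    - alert: HighInferenceLatency
      expr: |
        histogram_quantile(0.95,
          rate(kserve_inference_duration_seconds_bucket[5m])) > 2
      for: 10m
      labels:
        severity: warning
```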
## Consequences

Positive:
- Standardized model deployment
- Multi-framework support
- Autoscaling via Knative
- Inference graphs for pipelines
- GPU scheduling support
Negative:
- Complexity for simple deployments
- Requires Knative
- Learning curve for KServe concepts
Part of OpenOva