hatiyildiz 7cafa3c894 docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay
Component-level architectural correction (two changes):

1. MinIO → SeaweedFS as unified S3 encapsulation layer

The old design used MinIO for in-cluster S3 plus separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (seaweedfs.storage.svc:8333). SeaweedFS internally handles hot tier (in-cluster NVMe), warm tier (in-cluster bulk), and cold tier (transparent passthrough to cloud archival storage — Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). One audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore — Velero, CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, Harbor blob store, Application buckets all share one S3 surface.

2. Apache Guacamole added as Application Blueprint §4.5 Communication

Clientless browser-based RDP/VNC/SSH/kubectl-exec gateway. Keycloak SSO, full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN+native-client distribution for auditable remote access.

Component changes:
- DELETED: platform/minio/
- CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section)
- CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings)

Doc updates: PLATFORM-TECH-STACK §1+§3.5+§4.5+§5+§7.4; TECHNOLOGY-FORECAST L11+mandatory+a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5+§7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count.

Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero — now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric.

UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts — minio entry replaced with seaweedfs; velero+harbor deps updated; new guacamole entry added.

VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer.

Verification: no remaining MinIO references in canonical docs apart from one intentional mention retained in TECHNOLOGY-FORECAST L37 explaining the swap; 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole.

KServe

Kubernetes-native model serving. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6). Used by bp-cortex to serve LLMs via vLLM, embedding models via BGE, and any custom inference workload.

Status: Accepted | Updated: 2026-04-27


Overview

KServe provides standardized model serving on Kubernetes with support for multiple ML frameworks, autoscaling, and inference graphs.

flowchart TB
    subgraph KServe["KServe"]
        Controller[KServe Controller]
        Predictor[Predictor]
        Transformer[Transformer]
        Explainer[Explainer]
    end

    subgraph Runtimes["Serving Runtimes"]
        vLLM[vLLM]
        TorchServe[TorchServe]
        Triton[Triton]
        SKLearn[SKLearn]
    end

    subgraph Knative["Knative Serving"]
        Autoscale[Autoscaling]
        Revisions[Revisions]
    end

    Controller --> Predictor
    Controller --> Transformer
    Controller --> Explainer
    Predictor --> Runtimes
    Runtimes --> Knative

Why KServe?

| Feature | Benefit |
|---|---|
| Multi-framework | TensorFlow, PyTorch, ONNX, vLLM, etc. |
| Autoscaling | Scale-to-zero via Knative |
| InferenceService | Standardized deployment pattern |
| Inference Graph | Multi-model pipelines |
| Model explainability | Integrated explainers |
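Scale-to-zero is inherited from Knative: the predictor's minReplicas/maxReplicas (and optionally scaleMetric/scaleTarget) are passed through to the Knative autoscaler, so idle models release GPU and memory. A minimal sketch, assuming an illustrative embedding service, model format, and storage path:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bge-embedder          # illustrative name
  namespace: ai-hub
spec:
  predictor:
    minReplicas: 0            # let Knative scale the predictor to zero when idle
    maxReplicas: 4
    scaleMetric: concurrency  # scale on in-flight requests
    scaleTarget: 10           # target concurrent requests per replica
    model:
      modelFormat:
        name: huggingface     # illustrative format; any supported runtime works
      storageUri: pvc://model-cache/models/bge-m3   # illustrative path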

Components

| Component | Purpose |
|---|---|
| InferenceService | Model deployment abstraction |
| ServingRuntime | Framework-specific runtime |
| InferenceGraph | Multi-model orchestration |
| ClusterStorageContainer | Model storage configuration |

Serving Runtimes

| Runtime | Use Case |
|---|---|
| vLLM | LLM inference (recommended) |
| TorchServe | PyTorch models |
| Triton | Multi-framework, high performance |
| SKLearn | Scikit-learn models |
| XGBoost | Gradient boosting models |
| ONNX | ONNX format models |

Configuration

InferenceService Example

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen-32b
      resources:
        requests:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "2"
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "2"

ServingRuntime for vLLM

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
spec:
  supportedModelFormats:
    - name: vllm
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest
      args:
        - --model=/mnt/models   # storage initializer places the model from storageUri here
        - --tensor-parallel-size=2
        - --max-model-len=32768
      resources:
        requests:
          nvidia.com/gpu: "2"

Inference Graph

Multi-model pipeline for complex inference:

apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: rag-pipeline
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        # Each step references an existing InferenceService; the Sequence
        # router invokes them in order.
        - name: embedder
          serviceName: bge-embedder
          data: $request              # embed the incoming query
        - name: retriever
          serviceName: vector-search
          data: $response             # search with the embedding from the previous step
        - name: llm
          serviceName: qwen-llm
          data: $response             # generate from the retrieved context

GPU Scheduling

# Node selector for GPU nodes
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A10
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Model Storage

PVC-based Storage

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-hub
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: oci-bv

S3-based Storage (SeaweedFS)

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: seaweedfs-storage
spec:
  supportedUriFormats:
    - prefix: s3://
  container:
    name: storage-initializer
    image: kserve/storage-initializer:latest
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: seaweedfs-credentials
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: seaweedfs-credentials
            key: secretkey
      - name: S3_ENDPOINT
        value: http://seaweedfs.storage.svc:8333
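
With this storage container registered, an InferenceService can pull its model straight from SeaweedFS by using an s3:// storageUri: the storage initializer matches the prefix, reads the credentials above, and downloads the model before the predictor starts. A sketch with an illustrative bucket and path:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-llm-s3           # illustrative name
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: s3://models/qwen-32b   # illustrative bucket/key in SeaweedFS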

Monitoring

| Metric | Query |
|---|---|
| Inference latency | kserve_inference_duration_seconds |
| Request count | kserve_inference_count |
| GPU utilization | DCGM_FI_DEV_GPU_UTIL |
| Model load time | kserve_model_load_duration_seconds |
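
These metrics can back alert rules through the Prometheus Operator. A sketch of a PrometheusRule, assuming the latency metric is exported as a histogram and carries a service_name label; the threshold and labels are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kserve-inference-alerts
  namespace: ai-hub
spec:
  groups:
    - name: kserve
      rules:
        - alert: InferenceLatencyHigh
          # p99 inference latency above 2s for 10 minutes; threshold is illustrative
          expr: |
            histogram_quantile(0.99,
              sum(rate(kserve_inference_duration_seconds_bucket[5m])) by (le, service_name)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High p99 inference latency for {{ $labels.service_name }}"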

Consequences

Positive:

  • Standardized model deployment
  • Multi-framework support
  • Autoscaling via Knative
  • Inference graphs for pipelines
  • GPU scheduling support

Negative:

  • Complexity for simple deployments
  • Requires Knative
  • Learning curve for KServe concepts

Part of OpenOva