Catalyst-authored umbrella charts for the W2.5.D AI-inference stack. None of the three upstream projects publishes a Helm chart, so each chart hand-wires the upstream container as Deployment + Service + ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the sigstore/common library subchart declared to satisfy the hollow-chart gate (issue #181).

**bp-vllm (slot 39)** — wraps `vllm/vllm-openai:v0.6.4`. GPU-aware (`nvidia.com/gpu` when `vllm.gpu.enabled=true`; CPU fallback for dev). Default model `meta-llama/Llama-3.1-8B-Instruct`, port 8000, OpenAI-compatible `/v1/chat/completions`. All engine knobs (`maxModelLen`, `gpuMemoryUtilization`, `dtype`, `quantization`, `tensorParallelSize`, prefix caching) are overlay-tunable. Closes #266.

**bp-bge (slot 42)** — wraps `ghcr.io/huggingface/text-embeddings-inference:cpu-1.5`. Default model `BAAI/bge-small-en-v1.5` plus a `BAAI/bge-reranker-base` sidecar in the same Pod. Two-port Service (8080 embed, 8081 rerank) annotated for bp-llm-gateway discovery. CPU-friendly defaults; an overlay swaps in `BAAI/bge-m3` on GPU Sovereigns. Closes #269.

**bp-nemo-guardrails (slot 43)** — wraps the upstream NVIDIA/NeMo-Guardrails Dockerfile (`nemoguardrails server`, FastAPI, port 8000). LLM endpoint, model, and engine are all overlay-tunable; the Colang flow bundle mounts via `configMap.externalName` for production rails. The ConfigMap stub renders a default rail for smoke testing. Closes #270.
All three charts:

- Default observability toggles to `false` per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no `:latest`) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (`runAsUser: 1000`, drop ALL capabilities)
- `prometheus.io` scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and egress to HuggingFace / bp-vllm / bp-bge as appropriate

`helm template` (default values) per chart:

- bp-vllm: ConfigMap, Deployment, Service, ServiceAccount
- bp-bge: ConfigMap, Deployment, Service, ServiceAccount
- bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

`helm template` (`--set serviceMonitor.enabled=true,networkPolicy.enabled=true,hpa.enabled=true`): all three render ConfigMap + Deployment + Service + ServiceAccount + ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler.

`helm lint`: 0 chart(s) failed for all three (single INFO on the missing icon — icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
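To make the "overlay-tunable" claim concrete, here is a sketch of an operator overlay for bp-vllm. The key names follow the knobs listed above, but the exact paths are assumptions — the chart's own `values.yaml` is authoritative:

```yaml
# Illustrative overlay for bp-vllm on a GPU Sovereign.
# Key paths are a sketch based on the knobs named in the description;
# consult the chart's values.yaml for the real schema.
vllm:
  gpu:
    enabled: true          # request nvidia.com/gpu instead of CPU fallback
  model: meta-llama/Llama-3.1-8B-Instruct
  maxModelLen: 32768
  gpuMemoryUtilization: 0.9
  dtype: auto
  quantization: awq
  tensorParallelSize: 2
serviceMonitor:
  enabled: true            # observability defaults to false per §11.2
networkPolicy:
  enabled: true
hpa:
  enabled: true
```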
# vLLM

High-performance LLM inference engine with PagedAttention. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6). Default LLM serving runtime in bp-cortex (the composite AI Hub Blueprint).

Status: Accepted | Updated: 2026-04-27

## Overview

vLLM provides high-throughput LLM serving with efficient memory management via PagedAttention. It is the recommended runtime for LLM inference in OpenOva.
```mermaid
flowchart LR
    subgraph vLLM["vLLM Engine"]
        PagedAttn[PagedAttention]
        Scheduler[Continuous Batching]
        KVCache[KV Cache Management]
    end
    subgraph API["OpenAI-Compatible API"]
        Chat["/v1/chat/completions"]
        Completions["/v1/completions"]
        Models["/v1/models"]
    end
    Request[Request] --> API
    API --> vLLM
    vLLM --> GPU[GPU]
```
## Why vLLM?
| Feature | Benefit |
|---|---|
| PagedAttention | Up to 24x higher throughput than HuggingFace Transformers |
| Continuous batching | Efficient request handling |
| OpenAI-compatible API | Drop-in replacement |
| Tensor parallelism | Multi-GPU support |
| Quantization | AWQ, GPTQ, INT8 support |
## Supported Models
| Model Family | Examples |
|---|---|
| Qwen | Qwen2.5, Qwen3 (recommended) |
| Llama | Llama 3.1, Llama 3.2 |
| Mistral | Mistral, Mixtral |
| DeepSeek | DeepSeek-R1, DeepSeek-V3 |
| Others | Phi, Gemma, Yi, etc. |
## Configuration

### Deployment via KServe

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-32b
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen3-32b-awq
      resources:
        requests:
          nvidia.com/gpu: "2"
        limits:
          nvidia.com/gpu: "2"
```
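The `runtime: vllm-runtime` referenced above must already exist in the cluster. A minimal ClusterServingRuntime sketch is shown below — the image tag and args are illustrative, not the platform's canonical definition:

```yaml
# Illustrative ClusterServingRuntime backing the `runtime: vllm-runtime`
# reference above; adjust the image tag and args for your cluster.
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime
spec:
  supportedModelFormats:
    - name: vllm
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.6.4
      args:
        - --model=/mnt/models   # KServe mounts storageUri here by convention
        - --port=8000
      ports:
        - containerPort: 8000
          protocol: TCP
```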
### Standalone Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: ai-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm            # must match the selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4  # pin tags; never :latest
          args:
            - --model=/models/qwen3-32b-awq
            - --tensor-parallel-size=2
            - --max-model-len=32768
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "2"
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
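The Deployment assumes a `model-cache` claim already exists in the namespace. A minimal PVC sketch — the size and storage class are placeholders, not recommendations:

```yaml
# Illustrative PersistentVolumeClaim for the model cache referenced above;
# storage size and class depend on your cluster and model set.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-hub
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```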
### Key Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `--model` | Model path or HuggingFace ID | `/models/qwen3-32b` |
| `--tensor-parallel-size` | Number of GPUs | `2` |
| `--max-model-len` | Maximum context length | `32768` |
| `--gpu-memory-utilization` | GPU memory fraction | `0.9` |
| `--quantization` | Quantization method | `awq`, `gptq` |
| `--enable-prefix-caching` | Cache common prefixes | - |
## API Usage

### Chat Completions

```bash
curl http://vllm.ai-hub.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```
### With Thinking Mode (Qwen3)

When calling the HTTP API directly, `chat_template_kwargs` belongs at the top level of the request body; `extra_body` is an OpenAI Python SDK convention, not an HTTP field:

```bash
curl http://vllm.ai-hub.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {"role": "user", "content": "Solve this step by step: ..."}
    ],
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
## Multi-GPU Configuration

### Tensor Parallelism (Single Node)

```yaml
args:
  - --tensor-parallel-size=4   # Split model across 4 GPUs
```

### Pipeline Parallelism (Multi-Node)

```yaml
args:
  - --pipeline-parallel-size=2  # Split across 2 nodes
  - --tensor-parallel-size=4    # 4 GPUs per node (8 GPUs total)
```
## Quantization
| Method | Memory Reduction | Quality |
|---|---|---|
| AWQ | ~4x | Excellent |
| GPTQ | ~4x | Good |
| INT8 | ~2x | Very Good |
| FP8 | ~2x | Excellent |
```yaml
args:
  - --quantization=awq
  - --dtype=half
```
## Monitoring

| Metric | Query |
|---|---|
| Request latency | `vllm:e2e_request_latency_seconds` |
| Tokens/second | `rate(vllm:generation_tokens_total[5m])` |
| KV cache usage | `vllm:gpu_cache_usage_perc` |
| Queue length | `vllm:num_requests_waiting` |
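The queue-length metric above lends itself to a simple backlog alert. The PrometheusRule below is a sketch — the threshold, duration, and labels are illustrative, and it requires the Prometheus Operator CRDs:

```yaml
# Illustrative PrometheusRule: warn when requests queue behind the engine.
# Threshold and duration are examples, not tuned recommendations.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-queue-depth
  namespace: ai-hub
spec:
  groups:
    - name: vllm
      rules:
        - alert: VLLMRequestsQueueing
          expr: vllm:num_requests_waiting > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: vLLM has a sustained request backlog
```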
## Consequences

Positive:

- Industry-leading performance
- OpenAI-compatible API
- Excellent quantization support
- Multi-GPU scaling
- Active development

Negative:

- GPU required
- Memory-intensive for large models
- Some models not yet supported

---

Part of OpenOva