Catalyst-authored umbrella charts for the W2.5.D AI-inference stack. None of the three upstream projects publishes a Helm chart, so each chart hand-wires the upstream container as Deployment + Service + ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the sigstore/common library subchart declared to satisfy the hollow-chart gate (issue #181).

**bp-vllm (slot 39)** — wraps `vllm/vllm-openai:v0.6.4`. GPU-aware (`nvidia.com/gpu` when `vllm.gpu.enabled=true`; CPU fallback for dev). Default model `meta-llama/Llama-3.1-8B-Instruct`, port 8000, OpenAI-compatible `/v1/chat/completions`. All engine knobs (`maxModelLen`, `gpuMemoryUtilization`, `dtype`, `quantization`, `tensorParallelSize`, prefix caching) are overlay-tunable. Closes #266.

**bp-bge (slot 42)** — wraps `ghcr.io/huggingface/text-embeddings-inference:cpu-1.5`. Default model `BAAI/bge-small-en-v1.5` plus a `BAAI/bge-reranker-base` sidecar in the same Pod. Two-port Service (8080 embed, 8081 rerank) annotated for bp-llm-gateway discovery. CPU-friendly defaults; an overlay swaps in `BAAI/bge-m3` on GPU Sovereigns. Closes #269.

**bp-nemo-guardrails (slot 43)** — wraps the upstream NVIDIA/NeMo-Guardrails Dockerfile (`nemoguardrails server`, FastAPI, port 8000). LLM endpoint, model, and engine are all overlay-tunable; the Colang flow bundle mounts via `configMap.externalName` for production rails. The ConfigMap stub renders a default rail for smoke testing. Closes #270.
All three charts:

- Default observability toggles to `false` per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no `:latest`) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (`runAsUser: 1000`, drop ALL capabilities)
- `prometheus.io` scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and egress to HuggingFace / bp-vllm / bp-bge as appropriate

`helm template` (default values) per chart:

- bp-vllm: ConfigMap, Deployment, Service, ServiceAccount
- bp-bge: ConfigMap, Deployment, Service, ServiceAccount
- bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

`helm template` (`--set serviceMonitor.enabled=true,networkPolicy.enabled=true,hpa.enabled=true`): all three render ConfigMap + Deployment + Service + ServiceAccount + ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler.

`helm lint`: 0 chart(s) failed for all three (single INFO on the missing icon — icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
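To make the "overlay-tunable" claim concrete, here is a sketch of an operator overlay for bp-vllm. The key names follow the knobs listed above, but the exact paths are assumptions — the chart's own `values.yaml` is authoritative:

```yaml
# Illustrative overlay for bp-vllm on a GPU Sovereign.
# Key paths are a sketch based on the knobs named in the description;
# consult the chart's values.yaml for the real schema.
vllm:
  gpu:
    enabled: true          # request nvidia.com/gpu instead of CPU fallback
  model: meta-llama/Llama-3.1-8B-Instruct
  maxModelLen: 32768
  gpuMemoryUtilization: 0.9
  dtype: auto
  quantization: awq
  tensorParallelSize: 2
serviceMonitor:
  enabled: true            # observability defaults to false per §11.2
networkPolicy:
  enabled: true
hpa:
  enabled: true
```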
# vLLM

High-performance LLM inference engine with PagedAttention. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6). Default LLM serving runtime in bp-cortex (the composite AI Hub Blueprint).

Status: Accepted | Updated: 2026-04-27

## Overview

vLLM provides high-throughput LLM serving with efficient memory management via PagedAttention. It is the recommended runtime for LLM inference in OpenOva.
```mermaid
flowchart LR
    subgraph vLLM["vLLM Engine"]
        PagedAttn[PagedAttention]
        Scheduler[Continuous Batching]
        KVCache[KV Cache Management]
    end
    subgraph API["OpenAI-Compatible API"]
        Chat["/v1/chat/completions"]
        Completions["/v1/completions"]
        Models["/v1/models"]
    end
    Request[Request] --> API
    API --> vLLM
    vLLM --> GPU[GPU]
```
## Why vLLM?
| Feature | Benefit |
|---|---|
| PagedAttention | Up to 24x higher throughput than HuggingFace Transformers |
| Continuous batching | Efficient request handling |
| OpenAI-compatible API | Drop-in replacement |
| Tensor parallelism | Multi-GPU support |
| Quantization | AWQ, GPTQ, INT8 support |
## Supported Models
| Model Family | Examples |
|---|---|
| Qwen | Qwen2.5, Qwen3 (recommended) |
| Llama | Llama 3.1, Llama 3.2 |
| Mistral | Mistral, Mixtral |
| DeepSeek | DeepSeek-R1, DeepSeek-V3 |
| Others | Phi, Gemma, Yi, etc. |
## Configuration

### Deployment via KServe

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-32b
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen3-32b-awq
      resources:
        requests:
          nvidia.com/gpu: "2"
        limits:
          nvidia.com/gpu: "2"
```
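The `runtime: vllm-runtime` referenced above must already exist in the cluster. A minimal ClusterServingRuntime sketch is shown below — the image tag and args are illustrative, not the platform's canonical definition:

```yaml
# Illustrative ClusterServingRuntime backing the `runtime: vllm-runtime`
# reference above; adjust the image tag and args for your cluster.
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime
spec:
  supportedModelFormats:
    - name: vllm
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.6.4
      args:
        - --model=/mnt/models   # KServe mounts storageUri here by convention
        - --port=8000
      ports:
        - containerPort: 8000
          protocol: TCP
```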
### Standalone Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: ai-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm            # must match the selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4  # pin tags; never :latest
          args:
            - --model=/models/qwen3-32b-awq
            - --tensor-parallel-size=2
            - --max-model-len=32768
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "2"
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
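The Deployment assumes a `model-cache` claim already exists in the namespace. A minimal PVC sketch — the size and storage class are placeholders, not recommendations:

```yaml
# Illustrative PersistentVolumeClaim for the model cache referenced above;
# storage size and class depend on your cluster and model set.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-hub
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```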
### Key Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `--model` | Model path or HuggingFace ID | `/models/qwen3-32b` |
| `--tensor-parallel-size` | Number of GPUs | `2` |
| `--max-model-len` | Maximum context length | `32768` |
| `--gpu-memory-utilization` | GPU memory fraction | `0.9` |
| `--quantization` | Quantization method | `awq`, `gptq` |
| `--enable-prefix-caching` | Cache common prefixes | - |
## API Usage

### Chat Completions

```bash
curl http://vllm.ai-hub.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```
### With Thinking Mode (Qwen3)

When calling the HTTP API directly, `chat_template_kwargs` belongs at the top level of the request body; `extra_body` is an OpenAI Python SDK convention, not an HTTP field:

```bash
curl http://vllm.ai-hub.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {"role": "user", "content": "Solve this step by step: ..."}
    ],
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
## Multi-GPU Configuration

### Tensor Parallelism (Single Node)

```yaml
args:
  - --tensor-parallel-size=4   # Split model across 4 GPUs
```

### Pipeline Parallelism (Multi-Node)

```yaml
args:
  - --pipeline-parallel-size=2  # Split across 2 nodes
  - --tensor-parallel-size=4    # 4 GPUs per node (8 GPUs total)
```
## Quantization
| Method | Memory Reduction | Quality |
|---|---|---|
| AWQ | ~4x | Excellent |
| GPTQ | ~4x | Good |
| INT8 | ~2x | Very Good |
| FP8 | ~2x | Excellent |
```yaml
args:
  - --quantization=awq
  - --dtype=half
```
## Monitoring

| Metric | Query |
|---|---|
| Request latency | `vllm:e2e_request_latency_seconds` |
| Tokens/second | `rate(vllm:generation_tokens_total[5m])` |
| KV cache usage | `vllm:gpu_cache_usage_perc` |
| Queue length | `vllm:num_requests_waiting` |
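The queue-length metric above lends itself to a simple backlog alert. The PrometheusRule below is a sketch — the threshold, duration, and labels are illustrative, and it requires the Prometheus Operator CRDs:

```yaml
# Illustrative PrometheusRule: warn when requests queue behind the engine.
# Threshold and duration are examples, not tuned recommendations.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-queue-depth
  namespace: ai-hub
spec:
  groups:
    - name: vllm
      rules:
        - alert: VLLMRequestsQueueing
          expr: vllm:num_requests_waiting > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: vLLM has a sustained request backlog
```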
## Consequences

Positive:

- Industry-leading performance
- OpenAI-compatible API
- Excellent quantization support
- Multi-GPU scaling
- Active development

Negative:

- GPU required
- Memory-intensive for large models
- Some models not yet supported

---

Part of OpenOva