Catalyst-authored umbrella charts for the W2.5.D AI-inference stack. None of the three upstream projects publish a Helm chart, so each chart hand-wires the upstream container as Deployment + Service + ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the sigstore/common library subchart declared to satisfy the hollow-chart gate (issue #181).

bp-vllm (slot 39) — wraps `vllm/vllm-openai:v0.6.4`. GPU-aware (`nvidia.com/gpu` when `vllm.gpu.enabled=true`; CPU fallback for dev). Default model `meta-llama/Llama-3.1-8B-Instruct`, port 8000, OpenAI-compatible `/v1/chat/completions`. All engine knobs (maxModelLen, gpuMemoryUtilization, dtype, quantization, tensorParallelSize, prefix caching) are overlay-tunable (overlay sketch below). Closes #266.

bp-bge (slot 42) — wraps `ghcr.io/huggingface/text-embeddings-inference:cpu-1.5`. Default model `BAAI/bge-small-en-v1.5` plus a `BAAI/bge-reranker-base` sidecar in the same Pod. Two-port Service (8080 embed, 8081 rerank) annotated for bp-llm-gateway discovery. CPU-friendly defaults; an overlay swaps in `BAAI/bge-m3` on GPU Sovereigns (sketch below). Closes #269.

bp-nemo-guardrails (slot 43) — wraps the upstream NVIDIA/NeMo-Guardrails Dockerfile (`nemoguardrails server`, FastAPI, port 8000). The LLM endpoint, model, and engine are all overlay-tunable; the Colang flow bundle mounts via `configMap.externalName` for production rails, and the ConfigMap stub renders a default rail for smoke testing. Closes #270.

All three charts:

- Default observability toggles to false per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no `:latest`) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (runAsUser 1000, drop ALL capabilities; sketch below)
- `prometheus.io` scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and egress to HuggingFace / bp-vllm / bp-bge as appropriate

helm template (default values) per chart:

- bp-vllm: ConfigMap, Deployment, Service, ServiceAccount
- bp-bge: ConfigMap, Deployment, Service, ServiceAccount
- bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

helm template with `--set serviceMonitor.enabled=true,networkPolicy.enabled=true,hpa.enabled=true`: all three render ConfigMap + Deployment + Service + ServiceAccount + ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler (the equivalent values overlay is sketched below).

helm lint: 0 chart(s) failed for all three (single INFO on missing icon — icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
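For reference, a GPU overlay for bp-vllm might look like the sketch below. Only `vllm.gpu.enabled` is spelled out above; the remaining key names mirror vLLM's CLI flags and are assumptions to be checked against the chart's values.yaml.

```yaml
# Hypothetical values overlay for bp-vllm on a GPU Sovereign.
# Keys other than vllm.gpu.enabled are assumed, not confirmed against the chart.
vllm:
  gpu:
    enabled: true             # schedule against nvidia.com/gpu instead of the CPU fallback
    count: 1
  maxModelLen: 16384          # forwarded as --max-model-len
  gpuMemoryUtilization: 0.85  # forwarded as --gpu-memory-utilization
  dtype: bfloat16             # assumed key name
  quantization: awq           # assumed key name
  tensorParallelSize: 1       # assumed key name
  enablePrefixCaching: true   # assumed key name for prefix caching
```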
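Similarly, a hypothetical bp-bge overlay for a GPU Sovereign that swaps in `BAAI/bge-m3`; the key names here are illustrative assumptions rather than values taken from the chart.

```yaml
# Hypothetical bp-bge overlay for a GPU Sovereign. Key names (model,
# reranker.model, resources) are assumptions - check the chart's values.yaml.
model: "BAAI/bge-m3"               # replaces the CPU-friendly default bge-small-en-v1.5
reranker:
  model: "BAAI/bge-reranker-base"  # sidecar stays in the same Pod
resources:
  limits:
    nvidia.com/gpu: 1              # request a GPU for the embedding server
```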
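The non-root securityContext bullet corresponds roughly to the snippet below; `runAsNonRoot` and `allowPrivilegeEscalation` are assumed companions to the documented `runAsUser: 1000` and dropped capabilities, not a verbatim copy of the templates.

```yaml
# Sketch of the shared non-root container securityContext (assumed layout).
securityContext:
  runAsNonRoot: true            # assumed companion setting
  runAsUser: 1000
  allowPrivilegeEscalation: false  # assumed companion setting
  capabilities:
    drop:
      - ALL
```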
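The full-render check can also be expressed as a values overlay instead of `--set` flags, using the same three toggles:

```yaml
# Values-file equivalent of the --set flags used in the full render check.
serviceMonitor:
  enabled: true
networkPolicy:
  enabled: true
hpa:
  enabled: true
```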
bp-vllm Blueprint manifest (YAML, 67 lines, 1.9 KiB):
apiVersion: catalyst.openova.io/v1alpha1
kind: Blueprint
metadata:
  name: bp-vllm
  labels:
    catalyst.openova.io/category: ai-runtime
    catalyst.openova.io/section: pts-4-6-llm-serving
spec:
  version: 1.0.0
  card:
    title: vLLM
    summary: High-throughput LLM inference engine with PagedAttention. OpenAI-compatible API. GPU-accelerated when nvidia.com/gpu is available; CPU fallback for non-GPU dev Sovereigns.
    icon: vllm.svg
    category: ai-runtime
    tags: [llm, inference, openai-compatible, gpu, ai]
    documentation: https://docs.vllm.ai/
    license: Apache-2.0
    visibility: listed
    owner:
      team: ai-platform
      contact: ai-platform@openova.io
  configSchema:
    type: object
    properties:
      model:
        type: string
        default: "meta-llama/Llama-3.1-8B-Instruct"
        description: HuggingFace model ID or in-cluster path served by vLLM.
      replicas:
        type: integer
        default: 1
        minimum: 1
        maximum: 16
      gpu:
        type: object
        properties:
          enabled:
            type: boolean
            default: false
            description: Set true on a GPU-equipped Sovereign. When false, vLLM runs on CPU (dev only — not for production traffic).
          count:
            type: integer
            default: 1
            description: Number of `nvidia.com/gpu` units to request when gpu.enabled=true.
      maxModelLen:
        type: integer
        default: 8192
        description: Maximum context length passed to vLLM via --max-model-len.
      gpuMemoryUtilization:
        type: number
        default: 0.9
        description: Fraction of GPU memory vLLM may use (--gpu-memory-utilization).
  placementSchema:
    modes: [single-region, active-active]
    default: single-region
  manifests:
    chart: ./chart
  depends:
    - blueprint: bp-kserve
      version: ^1.0
      alias: kserve
  upgrades:
    from: ["0.x"]
  observability:
    metrics: prometheus
    logs: stdout
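For illustration, a config document that this configSchema would accept is sketched below. Where Catalyst materialises it (a Blueprint instance spec, an overlay file) is not shown in the manifest, so only the shape and the schema bounds are illustrated.

```yaml
# Hypothetical operator config accepted by bp-vllm's configSchema above.
model: "meta-llama/Llama-3.1-8B-Instruct"  # HuggingFace model ID or in-cluster path
replicas: 2                  # schema bounds: 1..16
gpu:
  enabled: true              # request nvidia.com/gpu on a GPU-equipped Sovereign
  count: 1
maxModelLen: 16384           # forwarded as --max-model-len
gpuMemoryUtilization: 0.9    # fraction of GPU memory vLLM may use
```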