openova/platform/vllm/blueprint.yaml
e3mrah c3c9c0cf27
feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283)
Catalyst-authored umbrella charts for the W2.5.D AI-inference stack.
None of the three upstream projects publish a Helm chart, so each
chart hand-wires the upstream container as Deployment + Service +
ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the
sigstore/common library subchart declared to satisfy the
hollow-chart gate (issue #181).
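
Shape of the dependency stanza each chart carries (a sketch; the
repository URL and version range below are placeholders, not the
pinned values):

  # charts/bp-vllm/Chart.yaml (excerpt, illustrative)
  apiVersion: v2
  name: bp-vllm
  version: 1.0.0
  dependencies:
    - name: common                                    # sigstore/common library subchart
      repository: https://charts.example.io/sigstore  # placeholder URL
      version: 1.x.x                                  # placeholder range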

bp-vllm (slot 39) — wraps vllm/vllm-openai:v0.6.4. GPU-aware
(nvidia.com/gpu when vllm.gpu.enabled=true; CPU fallback for dev).
Default model meta-llama/Llama-3.1-8B-Instruct, port 8000,
OpenAI-compatible /v1/chat/completions. All engine knobs
(maxModelLen, gpuMemoryUtilization, dtype, quantization,
tensorParallelSize, prefix-caching) overlay-tunable. Closes #266.
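
Representative engine overlay (key paths assumed from the knob names
above; exact names may differ in the committed values contract):

  # values overlay sketch: GPU on, engine retuned; all values illustrative
  vllm:
    gpu:
      enabled: true
      count: 2
    maxModelLen: 16384
    gpuMemoryUtilization: 0.85
    dtype: bfloat16
    quantization: awq
    tensorParallelSize: 2
    enablePrefixCaching: true   # assumed name for the prefix-caching knob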

bp-bge (slot 42) — wraps ghcr.io/huggingface/text-embeddings-inference:cpu-1.5.
Default model BAAI/bge-small-en-v1.5 + BAAI/bge-reranker-base
sidecar in same Pod. Two-port Service (8080 embed, 8081 rerank)
annotated for bp-llm-gateway discovery. CPU-friendly defaults;
overlay swaps in BAAI/bge-m3 on GPU Sovereigns. Closes #269.
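
Rendered Service shape (sketch; the discovery annotation key/value is
illustrative, whatever bp-llm-gateway actually matches on):

  apiVersion: v1
  kind: Service
  metadata:
    name: bp-bge
    annotations:
      openova.io/llm-gateway: "embed,rerank"   # illustrative key
  spec:
    selector:
      app.kubernetes.io/name: bp-bge
    ports:
      - name: embed
        port: 8080
      - name: rerank
        port: 8081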

bp-nemo-guardrails (slot 43) — wraps the upstream NVIDIA/NeMo-Guardrails
Dockerfile (nemoguardrails server, FastAPI, port 8000). LLM endpoint
+ model + engine all overlay-tunable; Colang flow bundle mounts via
configMap.externalName for production rails. ConfigMap stub renders
a default rail for smoke testing. Closes #270.
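
Production overlay sketch (configMap.externalName is the real knob per
the paragraph above; the llm.* key names are assumptions):

  # overlay sketch: mount a pre-created Colang bundle, point rails at bp-vllm
  configMap:
    externalName: rails-prod-bundle   # ConfigMap holding Colang flows + config.yml
  llm:                                # assumed key names
    endpoint: http://bp-vllm:8000/v1
    engine: openai
    model: meta-llama/Llama-3.1-8B-Instruct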

All three charts:
- Default observability toggles to false per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no :latest) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (runAsUser 1000, drop ALL capabilities);
  see the pod-template sketch after this list
- prometheus.io scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and
  egress to HuggingFace / bp-vllm / bp-bge as appropriate
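
Pod-template excerpt the securityContext and scrape-annotation bullets
describe (sketch; port shown is bp-vllm's 8000, per-chart otherwise):

  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8000"    # per-chart metrics port
      prometheus.io/path: /metrics
  spec:
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
    containers:
      - name: vllm
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]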

helm template (default values) per chart:
  bp-vllm:            ConfigMap, Deployment, Service, ServiceAccount
  bp-bge:             ConfigMap, Deployment, Service, ServiceAccount
  bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

helm template (--set serviceMonitor.enabled=true,networkPolicy.enabled=true,hpa.enabled=true):
  All three render ConfigMap + Deployment + Service + ServiceAccount +
  ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler.

helm lint: 0 chart(s) failed for all three (single INFO on missing icon —
icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:37:07 +04:00

apiVersion: catalyst.openova.io/v1alpha1
kind: Blueprint
metadata:
  name: bp-vllm
  labels:
    catalyst.openova.io/category: ai-runtime
    catalyst.openova.io/section: pts-4-6-llm-serving
spec:
  version: 1.0.0
  card:
    title: vLLM
    summary: High-throughput LLM inference engine with PagedAttention. OpenAI-compatible API. GPU-accelerated when nvidia.com/gpu is available; CPU fallback for non-GPU dev Sovereigns.
    icon: vllm.svg
    category: ai-runtime
    tags: [llm, inference, openai-compatible, gpu, ai]
    documentation: https://docs.vllm.ai/
    license: Apache-2.0
    visibility: listed
  owner:
    team: ai-platform
    contact: ai-platform@openova.io
  configSchema:
    type: object
    properties:
      model:
        type: string
        default: "meta-llama/Llama-3.1-8B-Instruct"
        description: HuggingFace model ID or in-cluster path served by vLLM.
      replicas:
        type: integer
        default: 1
        minimum: 1
        maximum: 16
      gpu:
        type: object
        properties:
          enabled:
            type: boolean
            default: false
            description: Set true on a GPU-equipped Sovereign. When false, vLLM runs on CPU (dev only — not for production traffic).
          count:
            type: integer
            default: 1
            description: Number of `nvidia.com/gpu` units to request when gpu.enabled=true.
      maxModelLen:
        type: integer
        default: 8192
        description: Maximum context length passed to vLLM via --max-model-len.
      gpuMemoryUtilization:
        type: number
        default: 0.9
        description: Fraction of GPU memory vLLM may use (--gpu-memory-utilization).
  placementSchema:
    modes: [single-region, active-active]
    default: single-region
  manifests:
    chart: ./chart
  depends:
    - blueprint: bp-kserve
      version: ^1.0
      alias: kserve
  upgrades:
    from: ["0.x"]
  observability:
    metrics: prometheus
    logs: stdout