feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283)
Catalyst-authored umbrella charts for the W2.5.D AI-inference stack.
None of the three upstream projects publish a Helm chart, so each
chart hand-wires the upstream container as Deployment + Service +
ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the
sigstore/common library subchart declared to satisfy the
hollow-chart gate (issue #181).

bp-vllm (slot 39) — wraps vllm/vllm-openai:v0.6.4. GPU-aware
(nvidia.com/gpu when vllm.gpu.enabled=true; CPU fallback for dev).
Default model meta-llama/Llama-3.1-8B-Instruct, port 8000,
OpenAI-compatible /v1/chat/completions. All engine knobs
(maxModelLen, gpuMemoryUtilization, dtype, quantization,
tensorParallelSize, prefix-caching) overlay-tunable. Closes #266.

bp-bge (slot 42) — wraps ghcr.io/huggingface/text-embeddings-inference:cpu-1.5.
Default model BAAI/bge-small-en-v1.5 + BAAI/bge-reranker-base
sidecar in same Pod. Two-port Service (8080 embed, 8081 rerank)
annotated for bp-llm-gateway discovery. CPU-friendly defaults;
overlay swaps in BAAI/bge-m3 on GPU Sovereigns. Closes #269.

bp-nemo-guardrails (slot 43) — wraps the upstream NVIDIA/NeMo-Guardrails
Dockerfile (nemoguardrails server, FastAPI, port 8000). LLM endpoint
+ model + engine all overlay-tunable; Colang flow bundle mounts via
configMap.externalName for production rails. ConfigMap stub renders
a default rail for smoke testing. Closes #270.

All three charts:
- Default observability toggles to false per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no :latest) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (runAsUser 1000, drop ALL capabilities)
- prometheus.io scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and
  egress to HuggingFace / bp-vllm / bp-bge as appropriate

helm template (default values) per chart:
  bp-vllm:            ConfigMap, Deployment, Service, ServiceAccount
  bp-bge:             ConfigMap, Deployment, Service, ServiceAccount
  bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

helm template (--set serviceMonitor.enabled=true,networkPolicy.enabled=true,hpa.enabled=true):
  All three render ConfigMap + Deployment + Service + ServiceAccount +
  ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler.
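
The equivalent overlay values for those toggles (sketch; key paths per
the --set flags above):

  serviceMonitor:
    enabled: true
  networkPolicy:
    enabled: true
  hpa:
    enabled: true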

helm lint: 0 chart(s) failed for all three (single INFO on missing icon —
icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BGE

BAAI General Embedding models for text embeddings and reranking. An Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.6), used by bp-cortex for embedding generation (Milvus vector search) and for reranking (improving retrieval precision).

Status: Accepted | Updated: 2026-04-27


Overview

BGE provides state-of-the-art text embeddings and reranking for RAG systems, supporting multilingual (including Arabic) and hybrid sparse+dense retrieval.

flowchart LR
    subgraph BGE["BGE Services"]
        M3[BGE-M3<br/>Embeddings]
        Reranker[BGE-Reranker<br/>Cross-Encoder]
    end

    Text[Text] --> M3
    M3 --> Dense[Dense Vector<br/>1024-dim]
    M3 --> Sparse[Sparse Vector]

    Candidates[Candidates] --> Reranker
    Query[Query] --> Reranker
    Reranker --> Ranked[Ranked Results]

Models

| Model              | Purpose                 | Dimensions            |
|--------------------|-------------------------|-----------------------|
| BGE-M3             | Multilingual embeddings | 1024 (dense) + sparse |
| BGE-Reranker-v2-M3 | Cross-encoder reranking | n/a                   |

Why BGE?

| Feature          | Benefit                  |
|------------------|--------------------------|
| Multilingual     | Arabic + English support |
| Hybrid retrieval | Dense + sparse vectors   |
| High accuracy    | MTEB benchmark leader    |
| Reranking        | Precision improvement    |
| Self-hosted      | Data sovereignty         |

Deployment

BGE-M3 Embeddings

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bge-m3
  namespace: ai-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bge-m3
  template:
    metadata:
      labels:
        app: bge-m3
    spec:
      containers:
        - name: bge-m3
          # pin the tag; :latest violates INVIOLABLE-PRINCIPLES.md #4
          image: harbor.<location-code>.<sovereign-domain>/ai-hub/bge-m3:<pinned-tag>
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_ID
              value: "BAAI/bge-m3"
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              nvidia.com/gpu: "1"  # extended resources must be requested via limits

BGE Reranker

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bge-reranker
  namespace: ai-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bge-reranker
  template:
    metadata:
      labels:
        app: bge-reranker
    spec:
      containers:
        - name: bge-reranker
          # pin the tag; :latest violates INVIOLABLE-PRINCIPLES.md #4
          image: harbor.<location-code>.<sovereign-domain>/ai-hub/bge-reranker:<pinned-tag>
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_ID
              value: "BAAI/bge-reranker-v2-m3"
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              nvidia.com/gpu: "1"  # extended resources must be requested via limits
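
The API examples below reach the Pods through cluster-local Services. A
minimal sketch matching the app labels above (the bge-reranker Service
is analogous, also on port 8080):

apiVersion: v1
kind: Service
metadata:
  name: bge-m3
  namespace: ai-hub
spec:
  selector:
    app: bge-m3
  ports:
    - port: 8080
      targetPort: 8080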

API Endpoints

Embeddings

curl -X POST http://bge-m3.ai-hub.svc:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Hello world", "مرحبا بالعالم"],
    "return_sparse": true
  }'

Response:

{
  "dense": [[0.123, 0.456, ...], [0.789, 0.012, ...]],
  "sparse": [
    [{"token_id": 1234, "weight": 0.5}, ...],
    [{"token_id": 5678, "weight": 0.3}, ...]
  ]
}
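
As a quick sanity check, a minimal Python sketch that calls the endpoint
above and compares the two dense vectors (it assumes the response schema
shown; with normalize: true — see Performance Tuning — cosine similarity
reduces to a dot product):

import requests
import numpy as np

resp = requests.post(
    "http://bge-m3.ai-hub.svc:8080/embed",
    json={"texts": ["Hello world", "مرحبا بالعالم"]},
)
dense = np.array(resp.json()["dense"])

# Vectors are L2-normalized by default, so the dot product
# is the cosine similarity.
print(float(dense[0] @ dense[1]))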

Reranking

curl -X POST http://bge-reranker.ai-hub.svc:8080/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is AML?",
    "documents": [
      "Anti-Money Laundering guidelines...",
      "Machine Learning tutorial...",
      "AML compliance requirements..."
    ]
  }'

Response:

{
  "scores": [0.95, 0.12, 0.89],
  "ranked_indices": [0, 2, 1]
}
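
Consuming that response from Python is then a one-line reorder (sketch,
assuming the schema above):

import requests

docs = [
    "Anti-Money Laundering guidelines...",
    "Machine Learning tutorial...",
    "AML compliance requirements...",
]
resp = requests.post(
    "http://bge-reranker.ai-hub.svc:8080/rerank",
    json={"query": "What is AML?", "documents": docs},
).json()

# ranked_indices lists document positions in descending relevance order
top_docs = [docs[i] for i in resp["ranked_indices"]]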

LangChain Integration

from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key="unused",  # field is required by the client; a self-hosted endpoint ignores it
    api_url="http://bge-m3.ai-hub.svc:8080/embed",
    model_name="BAAI/bge-m3",
)

# Generate embeddings
vectors = embeddings.embed_documents(["Hello", "World"])

Hybrid Retrieval

# Dense + sparse hybrid search in Milvus
import requests
from pymilvus import AnnSearchRequest, Collection, WeightedRanker

query = "What is AML?"
collection = Collection("documents")  # existing collection; name is illustrative

# Get query embeddings from the BGE-M3 service
response = requests.post(
    "http://bge-m3.ai-hub.svc:8080/embed",
    json={"texts": [query], "return_sparse": True},
)
dense_vec = response.json()["dense"][0]
# Milvus expects sparse vectors as {index: weight} mappings
sparse_vec = {e["token_id"]: e["weight"] for e in response.json()["sparse"][0]}

# Parallel ANN requests (the first argument is a list of query vectors)
dense_req = AnnSearchRequest([dense_vec], "dense_vector", {"metric_type": "COSINE"}, limit=20)
sparse_req = AnnSearchRequest([sparse_vec], "sparse_vector", {"metric_type": "IP"}, limit=20)

# Fuse both result lists with a weighted score
results = collection.hybrid_search(
    [dense_req, sparse_req],
    rerank=WeightedRanker(0.7, 0.3),
    limit=10,
)
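
WeightedRanker(0.7, 0.3) biases the fusion toward the dense (semantic) channel while letting the sparse (lexical) channel surface exact-term matches such as acronyms; treat the split as a starting point to tune per corpus.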

Performance Tuning

| Parameter      | Description           | Default |
|----------------|-----------------------|---------|
| max_batch_size | Max texts per request | 32      |
| max_length     | Max token length      | 8192    |
| normalize      | L2-normalize vectors  | true    |
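
A client-side batching sketch to stay within the max_batch_size default
above (helper name and error handling are illustrative):

import requests

EMBED_URL = "http://bge-m3.ai-hub.svc:8080/embed"

def embed_in_batches(texts, batch_size=32):
    """Embed texts in chunks of at most max_batch_size per request."""
    dense = []
    for i in range(0, len(texts), batch_size):
        resp = requests.post(EMBED_URL, json={"texts": texts[i : i + batch_size]})
        resp.raise_for_status()
        dense.extend(resp.json()["dense"])
    return dense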

Monitoring

| Metric         | Query                        |
|----------------|------------------------------|
| Embed latency  | bge_embed_duration_seconds   |
| Rerank latency | bge_rerank_duration_seconds  |
| Batch size     | bge_batch_size               |
| GPU memory     | nvidia_gpu_memory_used_bytes |

Consequences

Positive:

  • State-of-the-art embeddings
  • Multilingual support (Arabic)
  • Hybrid dense + sparse
  • Cross-encoder reranking
  • Self-hosted

Negative:

  • GPU required for performance
  • Memory-intensive
  • Batch size limits

Part of OpenOva