openova/platform/temporal
e3mrah 87d9a4afa7
feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271) (#288)
W2.5.E batch — three Application-tier Blueprints completing the LLM
serving / workflow stack:

- bp-temporal/1.0.0 — wraps temporal/temporal 1.2.0 (the new chart
  rewrite that removed cassandra:/mysql:/postgresql:/elasticsearch:/
  prometheus:/grafana: top-level keys in favour of
  server.config.persistence.datastores). Postgres-only via CNPG-backed
  visibility store (skip Cassandra). Web UI ON. Keycloak OIDC
  integration via --auth-claim-mapper renders auth.yaml ConfigMap
  (operator wires via additionalVolumes once bp-keycloak is
  reconciled, default OFF). dependsOn: bp-cnpg + bp-cert-manager.
  Closes #271.
  Kinds: Cluster (CNPG) + ConfigMap + Deployment + Job + Pod +
  Service.

- bp-llm-gateway/1.0.0 — wraps berriai/litellm-helm 0.1.572 from OCI.
  Subscription-aware proxy for Claude Code: routes to Anthropic (via
  operator OAuth/Max subscription — NEVER an ANTHROPIC_API_KEY,
  per memory/feedback_no_api_key.md), Bedrock, Vertex,
  OpenAI-compatible (via bp-anthropic-adapter), and self-hosted
  vLLM. CNPG-backed audit log (every prompt + response persisted
  for compliance). Bundled bitnami postgresql + redis subcharts
  DISABLED (db.useExisting=true points at the CNPG cluster).
  Keycloak SSO via auth.yaml ConfigMap (default OFF).
  ExternalSecret-backed environmentSecrets brings tokens / IAM
  creds in without inlining plaintext. dependsOn: bp-cnpg +
  bp-keycloak + bp-external-secrets. Closes #267.
  Kinds: Cluster (CNPG audit) + ConfigMap + Deployment + Job +
  Pod + Secret + Service + ServiceAccount.

- bp-anthropic-adapter/1.0.0 — Catalyst-authored scratch chart for
  the OpenAI ↔ Anthropic translation Go service. SHA-pinned image
  ghcr.io/openova-io/openova/anthropic-adapter:<sha> (Inviolable
  Principle #4a — GitHub Actions is the only build path; empty
  default tag fails the render with a clear error instead of
  silently shipping :latest). OAuth/Max subscription token mounted
  from K8s Secret materialized by ESO from bp-openbao —
  ANTHROPIC_OAUTH_TOKEN env var, NEVER an ANTHROPIC_API_KEY.
  Includes OpenAI → Anthropic model-mapping ConfigMap (gpt-4 →
  claude-3-5-sonnet, gpt-4o-mini → claude-3-5-haiku, etc.).
  sigstore/common library subchart included to satisfy the
  hollow-chart gate (matches bp-vllm pattern from #283).
  dependsOn: bp-external-secrets. Closes #268.
  Kinds: ConfigMap + Deployment + Service + ServiceAccount.

CRITICAL — bp-llm-gateway and bp-anthropic-adapter both consume the
operator's Claude OAuth/Max subscription. Per memory/
feedback_no_api_key.md and the user's standing instruction, neither
chart accepts or generates an ANTHROPIC_API_KEY. Tokens flow
exclusively through ExternalSecret-managed K8s Secrets that ESO
materializes from bp-openbao at install time.

Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every
observability toggle defaults `false` (ServiceMonitor / metrics
sidecar / PodMonitor) and is operator-tunable via per-cluster
overlay once bp-kube-prometheus-stack reconciles. Each chart ships
tests/observability-toggle.sh covering default-off, opt-in (with
--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and
explicit-off cases. bp-anthropic-adapter additionally tests the
never-:latest gate via Case 4 (empty image tag must fail render).

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every
upstream version, namespace, server URL, role, secret name, model
default, and toggle is exposed under values.yaml. Cluster overlays
in clusters/<sovereign>/ may override without rebuilding the
Blueprint OCI artifact.

Per docs/BLUEPRINT-AUTHORING.md §11.1 (umbrella shape — hard
contract): bp-temporal and bp-llm-gateway declare their upstream
charts under Chart.yaml dependencies: so helm dependency build
bundles the upstream payload into the OCI artifact. bp-anthropic-
adapter is a scratch chart (no upstream Helm chart exists) and
includes sigstore/common as the obligatory hollow-chart-gate
dependency, matching the bp-vllm precedent from W2.5.D (#283).

Closes #267
Closes #268
Closes #271

helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 19:37:19 +04:00

Temporal

Durable workflow orchestration with saga + compensation. Application Blueprint (see docs/PLATFORM-TECH-STACK.md §4.3 — Workflow & processing). Used by bp-fabric (composite Data & Integration Blueprint) for long-running, compensable workflows that span multiple Application services.

Status: Accepted | Updated: 2026-04-27


Overview

Temporal is a durable execution platform that makes it simple to build reliable, long-running workflows and microservice orchestrations. Unlike traditional message queues or job schedulers, Temporal provides durable execution: workflow code survives process crashes, node failures, and even entire cluster restarts without losing state. Developers write workflows as ordinary code in their language of choice, and Temporal handles retries, timeouts, and state persistence transparently.

Within OpenOva, Temporal serves as the workflow orchestration engine for the Fabric data and integration product. It handles saga patterns for distributed transactions, long-running business processes, scheduled jobs, and any operation that needs reliable execution across multiple services. Temporal replaces fragile combinations of message queues, cron jobs, and custom state machines with a single, battle-tested platform.

Temporal's architecture separates the server (which manages workflow state) from workers (which execute workflow and activity code). Workers are stateless and can be scaled independently. The server persists all workflow state to a database, ensuring that workflows survive any infrastructure failure. SDKs are available for Go, Java, Python, and TypeScript, making Temporal accessible to polyglot teams.


Architecture

flowchart TB
    subgraph Clients["Workflow Clients"]
        API[API Services]
        Scheduler[Scheduled Jobs]
        Events[Event Handlers]
    end

    subgraph Temporal["Temporal Server"]
        Frontend[Frontend Service]
        History[History Service]
        Matching[Matching Service]
        Worker_svc[Internal Worker]
    end

    subgraph Persistence["Persistence"]
        PG[PostgreSQL / CNPG]
        ES[Elasticsearch / OpenSearch]
    end

    subgraph Workers["Application Workers"]
        W1[Order Worker]
        W2[Payment Worker]
        W3[Notification Worker]
    end

    Clients --> Frontend
    Frontend --> History
    Frontend --> Matching
    History --> PG
    Worker_svc --> ES
    Matching --> W1
    Matching --> W2
    Matching --> W3

Saga Pattern

sequenceDiagram
    participant C as Client
    participant T as Temporal
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    participant N as Notification Service

    C->>T: Start OrderSaga
    T->>O: CreateOrder
    O-->>T: Order Created
    T->>P: ProcessPayment
    P-->>T: Payment OK
    T->>I: ReserveInventory
    I-->>T: Inventory Reserved
    T->>N: SendConfirmation
    N-->>T: Notification Sent
    T-->>C: Saga Complete

    Note over T,I: If ReserveInventory fails:
    T->>P: CompensatePayment (refund)
    T->>O: CancelOrder

Key Features

| Feature | Description |
|---------|-------------|
| Durable Execution | Workflows survive process/node/cluster failures |
| Saga Orchestration | Coordinate distributed transactions with compensation |
| Retry Policies | Configurable retry with exponential backoff per activity |
| Timeouts | Start-to-close, schedule-to-start, and heartbeat timeouts |
| Cron Workflows | Replace cron jobs with reliable scheduled workflows |
| Versioning | Deploy new workflow logic without breaking running instances |
| Signals & Queries | Send data to and read state from running workflows |
| Child Workflows | Compose complex workflows from smaller building blocks |
| Visibility | Search and filter workflows by custom attributes |

Configuration

Helm Values

temporal:
  server:
    replicas: 3
    config:
      persistence:
        default:
          driver: sql
          sql:
            driver: postgres12
            host: temporal-postgres.databases.svc
            port: 5432
            database: temporal
            user: temporal
            password: ${PG_PASSWORD}  # From ESO
        visibility:
          driver: sql
          sql:
            driver: postgres12
            host: temporal-postgres.databases.svc
            port: 5432
            database: temporal_visibility
            user: temporal
            password: ${PG_PASSWORD}  # From ESO

    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2
        memory: 4Gi

  admintools:
    enabled: true

  web:
    enabled: true
    ingress:
      enabled: true
      hosts:
        - temporal.<env>.<sovereign-domain>

  prometheus:
    enabled: true
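The `${PG_PASSWORD}` placeholder above is materialized by External Secrets Operator rather than stored in values. A sketch of the corresponding ExternalSecret, assuming bp-openbao backs a ClusterSecretStore (store, namespace, and key names here are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: temporal-postgres-credentials
  namespace: temporal
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao-backend        # ClusterSecretStore provided by bp-openbao
    kind: ClusterSecretStore
  target:
    name: temporal-postgres-credentials
  data:
    - secretKey: PG_PASSWORD
      remoteRef:
        key: platform/temporal/postgres
        property: password
```

ESO writes the resulting Kubernetes Secret at install time, so no plaintext credential ever appears in the chart or its overlays.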

Namespace Setup

# Create a Temporal namespace for workload isolation
# (--namespace is a global tctl flag, so it precedes the subcommand)
tctl --namespace orders namespace register \
  --retention 30d \
  --description "Order processing workflows"

Workflow Examples

Go SDK - Order Saga

package workflows

import (
    "time"
    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

func OrderSagaWorkflow(ctx workflow.Context, order Order) (OrderResult, error) {
    retryPolicy := &temporal.RetryPolicy{
        InitialInterval:    time.Second,
        BackoffCoefficient: 2.0,
        MaximumInterval:    time.Minute,
        MaximumAttempts:    5,
    }
    activityOpts := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy:         retryPolicy,
    }
    ctx = workflow.WithActivityOptions(ctx, activityOpts)

    // Step 1: Create Order
    var orderID string
    err := workflow.ExecuteActivity(ctx, CreateOrder, order).Get(ctx, &orderID)
    if err != nil {
        return OrderResult{}, err
    }

    // Step 2: Process Payment (with compensation)
    var paymentID string
    err = workflow.ExecuteActivity(ctx, ProcessPayment, orderID, order.Amount).Get(ctx, &paymentID)
    if err != nil {
        // Compensate: cancel the order
        _ = workflow.ExecuteActivity(ctx, CancelOrder, orderID).Get(ctx, nil)
        return OrderResult{}, err
    }

    // Step 3: Reserve Inventory (with compensation)
    err = workflow.ExecuteActivity(ctx, ReserveInventory, orderID, order.Items).Get(ctx, nil)
    if err != nil {
        // Compensate: refund payment, cancel order
        _ = workflow.ExecuteActivity(ctx, RefundPayment, paymentID).Get(ctx, nil)
        _ = workflow.ExecuteActivity(ctx, CancelOrder, orderID).Get(ctx, nil)
        return OrderResult{}, err
    }

    // Step 4: Send Confirmation
    _ = workflow.ExecuteActivity(ctx, SendConfirmation, orderID).Get(ctx, nil)

    return OrderResult{OrderID: orderID, PaymentID: paymentID, Status: "completed"}, nil
}

Python SDK - Data Pipeline

from temporalio import workflow, activity
from datetime import timedelta

@activity.defn
async def extract_data(source: str) -> dict:
    # Extract data from source system
    ...

@activity.defn
async def transform_data(raw_data: dict) -> dict:
    # Apply business transformations
    ...

@activity.defn
async def load_data(transformed: dict) -> str:
    # Write to destination
    ...

@workflow.defn
class DataPipelineWorkflow:
    @workflow.run
    async def run(self, source: str) -> str:
        raw = await workflow.execute_activity(
            extract_data, source,
            start_to_close_timeout=timedelta(minutes=10),
        )
        transformed = await workflow.execute_activity(
            transform_data, raw,
            start_to_close_timeout=timedelta(minutes=30),
        )
        result = await workflow.execute_activity(
            load_data, transformed,
            start_to_close_timeout=timedelta(minutes=10),
        )
        return result

Worker Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-workflow-worker
  namespace: fabric
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-workflow-worker
  template:
    metadata:
      labels:
        app: order-workflow-worker
    spec:
      containers:
        - name: worker
          # Pin a version tag; never ship :latest
          image: harbor.<location-code>.<sovereign-domain>/fabric/order-worker:<version>
          env:
            - name: TEMPORAL_HOST
              value: temporal-frontend.temporal.svc:7233
            - name: TEMPORAL_NAMESPACE
              value: orders
            - name: TEMPORAL_TASK_QUEUE
              value: order-processing
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1
              memory: 1Gi
Monitoring

| Metric | Description |
|--------|-------------|
| temporal_workflow_started_total | Workflows started |
| temporal_workflow_completed_total | Workflows completed successfully |
| temporal_workflow_failed_total | Workflows failed |
| temporal_workflow_task_queue_depth | Pending tasks per queue |
| temporal_activity_execution_latency | Activity execution duration |
| temporal_workflow_endtoend_latency | Total workflow duration |
| temporal_schedule_missed_catchup_window | Missed cron schedule executions |
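Assuming these metrics are scraped (prometheus.enabled above, with bp-kube-prometheus-stack providing the CRDs), a failure-rate alert could be expressed as a PrometheusRule; the rule name and thresholds here are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: temporal-workflows
  namespace: temporal
spec:
  groups:
    - name: temporal.workflows
      rules:
        - alert: TemporalWorkflowFailures
          # Any sustained workflow failures over 10 minutes
          expr: sum(rate(temporal_workflow_failed_total[5m])) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Temporal workflows are failing
```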

Consequences

Positive:

  • Durable execution eliminates custom retry/state-machine code across all services
  • Saga pattern support simplifies distributed transaction management
  • Multi-language SDKs (Go, Java, Python, TypeScript) suit polyglot teams
  • Workflow versioning enables safe deployments without breaking running instances
  • Built-in visibility and search make debugging production workflows practical
  • Cron workflows replace fragile crontab/CronJob setups with reliable scheduling

Negative:

  • Requires PostgreSQL for persistence, adding a database dependency
  • Temporal server itself needs careful sizing and operational attention
  • Workflow determinism constraints require developer discipline (no random, no system clock)
  • Learning curve for understanding event sourcing and replay semantics
  • Debugging replayed workflows requires familiarity with Temporal's execution model
  • Large workflow histories can impact performance without proper archival configuration
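The determinism constraint exists because Temporal reconstructs workflow state by replaying the workflow function against recorded event history; any nondeterministic call (random values, wall-clock reads) makes the replay diverge. A toy illustration of the mechanism, not Temporal's actual implementation: activity results are recorded on first execution and served back from history on replay, so the function must take identical branches both times.

```go
package main

import "fmt"

// history records activity results in order; replay serves them back.
type history struct {
	recorded []int
	pos      int
	replay   bool
}

// executeActivity runs fn on first execution and returns the
// recorded result during replay -- fn is never re-run.
func (h *history) executeActivity(fn func() int) int {
	if h.replay {
		v := h.recorded[h.pos]
		h.pos++
		return v
	}
	v := fn()
	h.recorded = append(h.recorded, v)
	return v
}

// workflow must be deterministic: given the same activity results,
// it must make the same decisions on every replay.
func workflow(h *history) string {
	amount := h.executeActivity(func() int { return 120 })
	if amount > 100 {
		return "manual-review"
	}
	return "auto-approve"
}

func main() {
	h := &history{}
	first := workflow(h) // live run: the activity executes
	h.replay, h.pos = true, 0
	second := workflow(h) // replay: the result comes from history
	fmt.Println(first, second)
	// → manual-review manual-review
}
```

If `workflow` branched on `time.Now()` or `rand.Int()` instead of a recorded activity result, the replayed decision could differ from the recorded one, which is exactly the nondeterminism error Temporal reports in production.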

Part of OpenOva Fabric - Data & Integration