openova/platform/clickhouse
talent-mesh 10245dff98 feat: ecosystem expansion to 55 components with license compliance
- Replace BSL-licensed components with open-source alternatives:
  Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0),
  Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0)
- Add 14 new platform components: activemq, camel, clickhouse, dapr,
  debezium, falco, flink, iceberg, opensearch, rabbitmq, superset,
  temporal, trino, vitess
- Rename meta-platforms/ to products/ with new product names:
  Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse),
  Fuse (Microservices Integration)
- Update all documentation, READMEs, and cross-references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 18:15:11 +00:00
..
README.md feat: ecosystem expansion to 55 components with license compliance 2026-02-11 18:15:11 +00:00

ClickHouse

Column-oriented OLAP database for real-time analytics.

Status: Accepted | Updated: 2026-02-09


Overview

ClickHouse is an open-source column-oriented database management system designed for online analytical processing (OLAP). Licensed under the Apache License 2.0, ClickHouse can process analytical queries over billions of rows per second on commodity hardware, making it one of the fastest analytical databases available. It is widely used for real-time analytics, time-series data, log analytics, and business intelligence workloads.

In the OpenOva platform, ClickHouse is offered as an a la carte component for customers who need high-performance analytical capabilities without the cost of managed cloud data warehouses like Snowflake or BigQuery. It integrates naturally with the platform's observability stack for long-term metric retention and with Debezium/Kafka (via Strimzi) for streaming analytics pipelines. The ClickHouse Operator provides Kubernetes-native lifecycle management.

ClickHouse stores data in a columnar format with aggressive compression, enabling queries to scan only the columns needed for a given query. Combined with vectorized query execution, this architecture delivers orders-of-magnitude performance improvements over row-oriented databases for analytical workloads.


Architecture

Single Region

flowchart TB
    subgraph ClickHouse["ClickHouse Cluster"]
        subgraph Shard1["Shard 1"]
            CH1[Replica 1]
            CH2[Replica 2]
            CH1 <-->|"Replicate"| CH2
        end
        subgraph Shard2["Shard 2"]
            CH3[Replica 1]
            CH4[Replica 2]
            CH3 <-->|"Replicate"| CH4
        end
    end

    subgraph Sources["Data Sources"]
        Debezium[Debezium CDC]
        Kafka[Strimzi/Kafka]
        Apps[Applications]
    end

    subgraph Consumers["Query Clients"]
        Grafana[Grafana Dashboards]
        BI[BI Tools]
        API[Analytics API]
    end

    Debezium --> Kafka
    Kafka -->|"Kafka Engine"| CH1
    Kafka -->|"Kafka Engine"| CH3
    Apps -->|"HTTP/Native"| CH1
    CH1 --> Grafana
    CH3 --> BI
    CH1 --> API

Multi-Region

flowchart TB
    subgraph Region1["Region 1"]
        CH1[ClickHouse Cluster]
        ZK1[ClickHouse Keeper]
    end

    subgraph Region2["Region 2"]
        CH2[ClickHouse Cluster]
        ZK2[ClickHouse Keeper]
    end

    subgraph Streaming["Event Streaming"]
        Kafka[Strimzi/Kafka]
    end

    Kafka -->|"Kafka Engine"| CH1
    Kafka -->|"Kafka Engine"| CH2
    ZK1 <-->|"Raft"| ZK2

Why ClickHouse?

Factor ClickHouse PostgreSQL (CNPG) Snowflake / BigQuery
Query type OLAP (analytical) OLTP (transactional) OLAP (analytical)
Query speed Billions of rows/sec Millions of rows/sec Fast but variable
Storage format Columnar Row-oriented Columnar
Real-time ingestion Native support Possible but slow Micro-batch
Cost Self-hosted, Apache 2.0 Self-hosted, Apache 2.0 Pay-per-query (expensive)
Kubernetes-native ClickHouse Operator CNPG Operator Managed only
Time-series Optimized Possible (TimescaleDB) Possible
Compression 10-40x 2-4x 10-40x

Decision: Use ClickHouse for analytical workloads, time-series data, and log analytics. Use CNPG (PostgreSQL) for transactional workloads. ClickHouse replaces expensive managed OLAP services for self-hosted deployments.


Key Features

Feature Description
Columnar Storage Stores and compresses data by column for fast analytical scans
MergeTree Engine LSM-tree-inspired storage engine with automatic data compaction
Kafka Engine Native streaming ingestion from Kafka (via Strimzi) topics
Materialized Views Incrementally updated aggregations on insert
Distributed Queries Scatter-gather queries across shards
ClickHouse Keeper Built-in ZooKeeper-compatible coordination (replaces ZooKeeper)
SQL Compatibility ANSI SQL with extensions for analytics (window functions, arrays, JSON)
Tiered Storage Hot/warm/cold storage policies with S3/MinIO cold tier
Projections Pre-sorted data views for faster queries on secondary sort orders
TTL Automatic data expiration and archival policies

Configuration

ClickHouse Cluster (ClickHouse Operator)

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: clickhouse
  namespace: databases
spec:
  defaults:
    templates:
      dataVolumeClaimTemplate: data-volume
      podTemplate: clickhouse-pod
  configuration:
    zookeeper:
      nodes:
        - host: clickhouse-keeper
          port: 2181
    clusters:
      - name: analytics
        layout:
          shardsCount: 2
          replicasCount: 2
        templates:
          podTemplate: clickhouse-pod
    settings:
      max_concurrent_queries: 200
      max_memory_usage: 10000000000
      max_server_memory_usage_to_ram_ratio: 0.8
    profiles:
      default/max_execution_time: 60
      default/max_rows_to_read: 1000000000
    users:
      default/password_sha256_hex: <sha256-hash>
      default/networks/ip:
        - "10.0.0.0/8"
      readonly/password_sha256_hex: <sha256-hash>
      readonly/profile: readonly
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:24.3
              resources:
                requests:
                  cpu: 1
                  memory: 4Gi
                limits:
                  cpu: 4
                  memory: 16Gi
    volumeClaimTemplates:
      - name: data-volume
        spec:
          storageClassName: <storage-class>
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 500Gi

Kafka Engine (Streaming Ingestion from Strimzi/Kafka)

-- Source table reading from Kafka (via Strimzi)
CREATE TABLE events_queue (
    event_id UUID,
    event_type String,
    payload String,
    created_at DateTime64(3)
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka-kafka-bootstrap.databases.svc:9092',
    kafka_topic_list = 'events.analytics',
    kafka_group_name = 'clickhouse-analytics',
    kafka_format = 'JSONEachRow';

-- Target MergeTree table
CREATE TABLE events (
    event_id UUID,
    event_type LowCardinality(String),
    payload String,
    created_at DateTime64(3)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, created_at)
TTL created_at + INTERVAL 90 DAY;

-- Materialized view connecting the two
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT * FROM events_queue;

Tiered Storage (MinIO Cold Tier)

<storage_configuration>
    <disks>
        <default>
            <keep_free_space_bytes>1073741824</keep_free_space_bytes>
        </default>
        <s3_cold>
            <type>s3</type>
            <endpoint>http://minio.storage.svc:9000/clickhouse-cold/</endpoint>
            <access_key_id>minioadmin</access_key_id>
            <secret_access_key>minioadmin</secret_access_key>
        </s3_cold>
    </disks>
    <policies>
        <tiered>
            <volumes>
                <hot>
                    <disk>default</disk>
                </hot>
                <cold>
                    <disk>s3_cold</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
        </tiered>
    </policies>
</storage_configuration>

Monitoring

Metric Description
ClickHouseProfileEvents_Query Total queries executed
ClickHouseProfileEvents_InsertedRows Rows inserted
ClickHouseMetrics_MemoryTracking Current memory usage
ClickHouseAsyncMetrics_ReplicasMaxQueueSize Replication queue depth
ClickHouseProfileEvents_MergeTreeDataWriterRows MergeTree write throughput
ClickHouseMetrics_QueryThread Active query threads

Consequences

Positive:

  • Orders-of-magnitude faster than row-oriented databases for analytical queries
  • Native Kafka (via Strimzi) integration enables real-time streaming analytics
  • Columnar compression reduces storage costs by 10-40x compared to row stores
  • Replaces expensive managed OLAP services (Snowflake, BigQuery) for self-hosted deployments
  • Tiered storage to MinIO provides cost-effective long-term data retention

Negative:

  • Not suitable for OLTP workloads (use CNPG for transactional queries)
  • UPDATE and DELETE operations are expensive (merge-on-read semantics)
  • Requires careful schema design (sort keys, partitioning) for optimal performance
  • ClickHouse Keeper or ZooKeeper adds operational overhead for replicated setups
  • Complex JOIN queries across large datasets may require denormalization

Part of OpenOva