Commit Graph

87 Commits

Author SHA1 Message Date
e3mrah
b8d7a8b9cf
fix(bp-seaweedfs): disable global.enableSecurity to avoid fromToml on helm-controller v1.1.0 (#339)
Upstream seaweedfs/seaweedfs templates/shared/security-configmap.yaml
uses Helm template fromToml; helm-controller v1.1.0's bundled helm SDK
(v3.x older than 3.13) doesn't define fromToml so the install fails:
  parse error at security-configmap.yaml:21: function fromToml not defined
Setting global.seaweedfs.enableSecurity: false skips the entire template.
Internal SeaweedFS API is cluster-IP only on Sovereign-1; chart-level
security is acceptable to defer until helm-controller is bumped.
Bumped 1.0.0 → 1.0.1.
Unblocks the chain: bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor,
bp-grafana all dependsOn bp-seaweedfs.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 23:42:43 +04:00
e3mrah
9554be4a5e
fix(bp-external-secrets): gate ClusterSecretStore on CRD presence + drop delete-policy (#337)
The chart's post-install hook was failing on otech.omani.works:
  failed post-install: unable to build kubernetes object for deleting hook
  bp-external-secrets/templates/clustersecretstore-vault-region1.yaml:
  resource mapping not found for kind ClusterSecretStore in version
  external-secrets.io/v1beta1
Two corrections:
1. Capabilities-gate the entire template — don't render unless the
   ClusterSecretStore CRD is registered (it ships via the upstream
   ESO subchart but isn't live on first install)
2. Remove 'before-hook-creation' delete-policy (was the actual trigger
   for the 'deleting hook' failure path)
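
A minimal sketch of the Capabilities gate from item 1, not the chart's
actual template. It assumes the hook annotation itself is kept and only
the delete-policy is dropped; every `.Values.clusterSecretStore.*` key
and the Kubernetes auth stanza are hypothetical placeholders:

```yaml
# Render only once the ClusterSecretStore CRD is registered with the API server.
{{- if .Capabilities.APIVersions.Has "external-secrets.io/v1beta1/ClusterSecretStore" }}
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-region1
  annotations:
    # No "helm.sh/hook-delete-policy: before-hook-creation" here (item 2).
    "helm.sh/hook": post-install,post-upgrade
spec:
  provider:
    vault:
      server: {{ .Values.clusterSecretStore.server | quote }}  # hypothetical key
      path: {{ .Values.clusterSecretStore.path | quote }}      # hypothetical key
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: {{ .Values.clusterSecretStore.role | quote }}  # hypothetical key
{{- end }}
```

With a gate of this shape the first install skips the resource, and a
later upgrade or reconcile renders it once the CRD is live.
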
Bumped 1.0.0 → 1.0.1.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 23:31:24 +04:00
e3mrah
5502d9aa48
feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291)
Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer
in bp-cert-manager by shipping the missing piece — a Go binary that
satisfies cert-manager's external webhook contract
(`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json.

Architecture
============

* `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with
  pool-domain-manager and catalyst-dns). Encapsulates the api3.json
  transport, command builders, response decoding, and the safe
  read-modify-write semantics required to never accidentally wipe a
  zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2`
  variant is unexported.
* `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook
  binary. Implements `Solver.Present` via the client's append-only
  `AddRecord` path and `Solver.CleanUp` via the read-modify-write
  `RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`)
  rejects challenges for unmanaged apexes BEFORE any Dynadot call.
* `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm
  wrapper. Templates Deployment + Service + APIService + serving
  Certificate (CA chain via cert-manager Issuer self-signing) +
  RBAC + ServiceAccount. Mirrors the standard cert-manager external-
  webhook deployment shape.
* `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the
  paired ClusterIssuer activates. The interim http01 issuer remains
  templated as the rollback path.
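
For orientation, a DNS-01 ClusterIssuer that delegates to an external
webhook solver generally has the shape sketched below. The issuer name
comes from this message and the group name is inferred from the
APIService in the smoke command; `solverName`, the e-mail address and
the account-key Secret name are assumptions, not the chart's values:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                        # placeholder
    privateKeySecretRef:
      name: letsencrypt-dns01-prod-account-key    # hypothetical Secret name
    solvers:
      - dns01:
          webhook:
            groupName: acme.dynadot.openova.io    # inferred from the APIService name
            solverName: dynadot                   # hypothetical solver name
```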

Test results
============

  core/pkg/dynadot-client          — 7 tests PASS  (race-clean)
  core/cmd/cert-manager-dynadot-... — 9 tests PASS  (race-clean)

Test coverage includes a Present/CleanUp round-trip against an
httptest fixture that models Dynadot's zone state, an explicit
unmanaged-domain rejection, a regression preserving a pre-existing
CNAME across the DNS-01 round-trip (the zone-wipe defence), and a
typed-error propagation test that surfaces `ErrInvalidToken` to
cert-manager so the controller will retry.

Helm template smoke render
==========================

`helm template` against the new chart with default values yields 12
resources / 424 lines (APIService, Certificate, ClusterRoleBinding,
Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The
modified bp-cert-manager chart still renders both ClusterIssuers
(`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default
values; flipping `certManager.issuers.dns01.enabled=false` is the
clean rollback.

Smoke command (post-deploy)
===========================

  kubectl get apiservices.apiregistration.k8s.io \
    v1alpha1.acme.dynadot.openova.io
  # Issue a *.<sovereign>.<pool> wildcard cert and watch the
  # Order/Challenge progress through cert-manager.

CI
==

`.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the
pool-domain-manager-build pattern (cosign keyless signing, SBOM
attestation, GHCR push at `ghcr.io/openova-io/openova/cert-manager-
dynadot-webhook:<sha>`). Triggered by changes to either the binary or
the shared dynadot-client package.

Closes #159

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:37:47 +04:00
e3mrah
c09109a61a
feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265) (#290)
Edge + serverless + model-serving batch (W2.5.C) — three upstream-
subchart umbrella Blueprints completing the bootstrap-kit slots for
WebRTC media relay (bp-relay → bp-stunner) and the AI/ML serving stack
(bp-cortex → bp-kserve → bp-knative).

Each chart follows the canonical umbrella pattern from
docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream
chart under `dependencies:` so `helm dependency build` bundles the
upstream payload into the OCI artifact, and Catalyst-curated overlay
values + templates sit alongside in chart/values.yaml + chart/templates/.

Per-chart highlights:
- bp-stunner/1.0.0 — wraps stunner/stunner-gateway-operator 1.1.0.
  Ships a Cilium-native GatewayClass (Capabilities-gated on
  gateway.networking.k8s.io/v1) so bp-relay (LiveKit / SFU) can claim
  Gateway CRs without an operator-ordering dance. Default UDP TURN port
  range 30000-32767 matches the range opened at the Sovereign edge
  firewall (Crossplane bp-firewall composition).
- bp-knative/1.0.0 — wraps knative-operator v1.21.1. Ships a
  KnativeServing CR pre-configured for **istio-less mode**
  (ingress.istio.enabled=false, ingress.contour.enabled=false,
  ingress.kourier.enabled=false; config.network.ingress-class=cilium).
  Sovereign FQDN sourced from values, no hardcoded fallback per
  inviolable principle #4 — render fails loudly if cluster overlay
  doesn't set knativeOverlay.knativeServing.sovereignFqdn.
- bp-kserve/1.0.0 — wraps kserve/kserve v0.16.0 (latest version
  published on the official OCI registry as of 2026-04-30). Default
  deploymentMode=RawDeployment (no Knative hop on the hot path) but
  bp-knative is still installed (declared as a hard dep) so per-IS
  annotation `serving.kserve.io/deploymentMode: Serverless` opts in to
  scale-to-zero per tenant. Cilium native Gateway-API ingress
  (enableGatewayApi=true, className=cilium, disableIstioVirtualHost=
  true).
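
The "render fails loudly" behaviour described for bp-knative can be
sketched with Helm's `required` function. This is an illustration under
assumptions, not the chart's template: the resource name and namespace
are guesses, and the ingress class is written literally for brevity:

```yaml
# Abort the render with a clear message when the overlay leaves the FQDN unset.
{{- $fqdn := required "knativeOverlay.knativeServing.sovereignFqdn must be set by the cluster overlay" .Values.knativeOverlay.knativeServing.sovereignFqdn }}
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving        # assumption
  namespace: knative-serving   # assumption
spec:
  config:
    network:
      ingress-class: cilium
    domain:
      # The Sovereign FQDN becomes the default serving domain.
      {{ $fqdn | quote }}: ""
```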

Observability discipline (issue #182): every observability toggle
(ServiceMonitor, HPA, GatewayClass) defaults false and is operator-
tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles.
Each chart ships tests/observability-toggle.sh covering default-off,
opt-in (with `--api-versions monitoring.coreos.com/v1` to simulate
Prometheus Operator CRDs), and explicit-off cases.

Per-chart kind summary (helm template default render):

  bp-stunner: ClusterRole, ClusterRoleBinding, ConfigMap, Dataplane,
              Deployment, Role, RoleBinding, Service, ServiceAccount.
              (+ GatewayClass when --api-versions
              gateway.networking.k8s.io/v1 is passed.)

  bp-knative: ClusterRole, ClusterRoleBinding, ConfigMap,
              CustomResourceDefinition, Deployment, KnativeServing,
              Role, RoleBinding, Secret, Service, ServiceAccount.

  bp-kserve:  Certificate, ClusterRole, ClusterRoleBinding,
              ClusterServingRuntime, ClusterStorageContainer,
              ConfigMap, Deployment, Gateway, Issuer,
              MutatingWebhookConfiguration, Role, RoleBinding,
              Service, ServiceAccount, ValidatingWebhookConfiguration.

`helm lint` clean for all three (single INFO on missing icon — icons
land with marketplace card work).

`bash tests/observability-toggle.sh` green for all three (3 cases each:
default-off, opt-in, explicit-off).

Closes #263 #264 #265

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:37:38 +04:00
e3mrah
782d8015c5
feat(charts): bp-openmeter (CH-less) + bp-livekit + bp-matrix wrapper charts (closes #272 #273 #274) (#289)
W2.5.F — three Catalyst Blueprint umbrella charts at platform/{openmeter,
livekit,matrix}/, each declaring its upstream chart under Chart.yaml
`dependencies:` so `helm dependency build` bundles the upstream payload
into the published OCI artifact (per docs/BLUEPRINT-AUTHORING.md §11.1
— hollow charts forbidden, CI-enforced by issue #181).

Per-chart kind summary
======================

bp-openmeter (closes #272)
  default `helm template` kinds: ConfigMap, Deployment, Service, ServiceAccount
  upstream chart: openmeter 1.0.0-beta.213 (oci://ghcr.io/openmeterio/helm-charts)

  ClickHouse-less profile per docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §6.4.
  The upstream chart's bundled clickhouse / kafka / postgresql / redis /
  svix subcharts are all DISABLED — Catalyst supplies CNPG (postgres),
  JetStream (event bus), and Valkey (redis-compat) at the platform tier.
  Chart-level toggle `catalystBlueprint.backend.kind` (default `cnpg`,
  alt `clickhouse`) records the active profile so observability/audit
  pipelines can report it. The OpenMeter binary's
  `aggregation.clickhouse.address` is left blank — per-Sovereign overlay
  supplies it once a host cluster adds bp-clickhouse and the operator
  re-rolls with `backend.kind: clickhouse`. Catalyst overlay templates
  (NetworkPolicy / ServiceMonitor / HPA) all default OFF per
  docs/BLUEPRINT-AUTHORING.md §11.2.

bp-livekit (closes #273)
  default `helm template` kinds: ConfigMap, Deployment, Service, ServiceAccount
  upstream chart: livekit-server 1.9.0 (https://helm.livekit.io)

  WebRTC SFU. Powers the Huawei iFlytek voice demo. Catalyst defaults
  pair LiveKit with bp-stunner (the upstream chart's bundled co-located
  TURN server is OFF; per-Sovereign overlay points the LiveKit TURN
  config at the stunner UDP-gateway Service). RTC UDP port range is
  50000-60000 (matches the Hetzner firewall rule the per-Sovereign
  overlay opens). Catalyst overlay templates (NetworkPolicy /
  ServiceMonitor / HPA) all default OFF; the chart's NetworkPolicy
  template documents that LiveKit's hostNetwork mode means pod-level
  policies do NOT cover the SFU port range — the firewall rule is the
  load-bearing control. blueprint.yaml `depends:` declares bp-stunner +
  bp-cert-manager + bp-valkey.

bp-matrix (closes #274)
  default `helm template` kinds: ConfigMap, Deployment, Ingress, Job,
  PersistentVolumeClaim, Pod, Role, RoleBinding, Secret, Service,
  ServiceAccount
  upstream chart: matrix-synapse 3.12.25 (https://ananace.gitlab.io/charts)

  Synapse (the Matrix server implementation, NOT the retired OpenOva
  product noun). Federation OFF by default (Catalyst per-Sovereign
  tenancy default — operator overlays flip it on per-Organization).
  Postgres backend via bp-cnpg externalPostgresql; OIDC SSO via
  bp-keycloak; bundled bitnami postgresql + redis subcharts both
  disabled. Catalyst overlay NetworkPolicy gates the federation port
  (8448) on `federation.enabled` — verified by Case 5 of the
  observability-toggle test. Catalyst-overlay ServiceMonitor (upstream
  chart has none) + HPA both default OFF.
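
One way to gate the federation port on `federation.enabled`, as a
sketch only. The pod selector labels and resource name are
hypothetical; 8008 is Synapse's usual client listener and 8448 is the
federation port named above:

```yaml
{{- if .Values.networkPolicy.enabled }}
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: {{ .Release.Name }}-synapse            # hypothetical name
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: matrix-synapse   # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    # Client API is always reachable.
    - ports:
        - protocol: TCP
          port: 8008
    {{- if .Values.federation.enabled }}
    # Federation listener opens only when the operator opts in.
    - ports:
        - protocol: TCP
          port: 8448
    {{- end }}
{{- end }}
```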

Lint
====
All three charts pass `helm lint` clean (only the noisy "icon is
recommended" INFO message).

Observability tests
===================
Each chart's `tests/observability-toggle.sh` enforces the Catalyst
contract from docs/BLUEPRINT-AUTHORING.md §11.2:
  Case 1: default render produces zero monitoring.coreos.com/v1
          resources (no ServiceMonitor / PrometheusRule).
  Case 2: opt-in (--set serviceMonitor.enabled=true --api-versions
          monitoring.coreos.com/v1) renders a ServiceMonitor.
  Case 3: explicit-off render is clean.
  Case 4 (per chart):
    - openmeter: ClickHouse-less profile asserts no
      clickhouse.altinity.com / Kafka subchart resources leak into the
      default render.
    - livekit:   asserts upstream livekit-server.serviceMonitor.create
      defaults false.
    - matrix:    asserts default render carries an empty
      federation_domain_whitelist (the per-Sovereign tenancy default).
  Case 5 (matrix only): `--set federation.enabled=true --set
          networkPolicy.enabled=true` opens port 8448 in the Catalyst NetworkPolicy.

All gates green for all three charts.

Closes #272 #273 #274

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 19:37:28 +04:00
e3mrah
87d9a4afa7
feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271) (#288)
W2.5.E batch — three Application-tier Blueprints completing the LLM
serving / workflow stack:

- bp-temporal/1.0.0 — wraps temporal/temporal 1.2.0 (the new chart
  rewrite that removed cassandra:/mysql:/postgresql:/elasticsearch:/
  prometheus:/grafana: top-level keys in favour of
  server.config.persistence.datastores). Postgres-only via CNPG-backed
  visibility store (skip Cassandra). Web UI ON. Keycloak OIDC
  integration via --auth-claim-mapper renders auth.yaml ConfigMap
  (operator wires via additionalVolumes once bp-keycloak is
  reconciled, default OFF). dependsOn: bp-cnpg + bp-cert-manager.
  Closes #271.
  Kinds: Cluster (CNPG) + ConfigMap + Deployment + Job + Pod +
  Service.

- bp-llm-gateway/1.0.0 — wraps berriai/litellm-helm 0.1.572 from OCI.
  Subscription-aware proxy for Claude Code: routes to Anthropic (via
  operator OAuth/Max subscription — NEVER an ANTHROPIC_API_KEY,
  per memory/feedback_no_api_key.md), Bedrock, Vertex,
  OpenAI-compatible (via bp-anthropic-adapter), and self-hosted
  vLLM. CNPG-backed audit log (every prompt + response persisted
  for compliance). Bundled bitnami postgresql + redis subcharts
  DISABLED (db.useExisting=true points at the CNPG cluster).
  Keycloak SSO via auth.yaml ConfigMap (default OFF).
  ExternalSecret-backed environmentSecrets brings tokens / IAM
  creds in without inlining plaintext. dependsOn: bp-cnpg +
  bp-keycloak + bp-external-secrets. Closes #267.
  Kinds: Cluster (CNPG audit) + ConfigMap + Deployment + Job +
  Pod + Secret + Service + ServiceAccount.

- bp-anthropic-adapter/1.0.0 — Catalyst-authored scratch chart for
  the OpenAI ↔ Anthropic translation Go service. SHA-pinned image
  ghcr.io/openova-io/openova/anthropic-adapter:<sha> (Inviolable
  Principle #4a — GitHub Actions is the only build path; empty
  default tag fails the render with a clear error instead of
  silently shipping :latest). OAuth/Max subscription token mounted
  from K8s Secret materialized by ESO from bp-openbao —
  ANTHROPIC_OAUTH_TOKEN env var, NEVER an ANTHROPIC_API_KEY.
  Includes OpenAI → Anthropic model-mapping ConfigMap (gpt-4 →
  claude-3-5-sonnet, gpt-4o-mini → claude-3-5-haiku, etc.).
  sigstore/common library subchart included to satisfy the
  hollow-chart gate (matches bp-vllm pattern from #283).
  dependsOn: bp-external-secrets. Closes #268.
  Kinds: ConfigMap + Deployment + Service + ServiceAccount.

CRITICAL — bp-llm-gateway and bp-anthropic-adapter both consume the
operator's Claude OAuth/Max subscription. Per memory/
feedback_no_api_key.md and the user's standing instruction, neither
chart accepts or generates an ANTHROPIC_API_KEY. Tokens flow
exclusively through ExternalSecret-managed K8s Secrets that ESO
materializes from bp-openbao at install time.

Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every
observability toggle defaults `false` (ServiceMonitor / metrics
sidecar / PodMonitor) and is operator-tunable via per-cluster
overlay once bp-kube-prometheus-stack reconciles. Each chart ships
tests/observability-toggle.sh covering default-off, opt-in (with
--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and
explicit-off cases. bp-anthropic-adapter additionally tests the
never-:latest gate via Case 4 (empty image tag must fail render).

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every
upstream version, namespace, server URL, role, secret name, model
default, and toggle is exposed under values.yaml. Cluster overlays
in clusters/<sovereign>/ may override without rebuilding the
Blueprint OCI artifact.

Per docs/BLUEPRINT-AUTHORING.md §11.1 (umbrella shape — hard
contract): bp-temporal and bp-llm-gateway declare their upstream
charts under Chart.yaml dependencies: so helm dependency build
bundles the upstream payload into the OCI artifact. bp-anthropic-
adapter is a scratch chart (no upstream Helm chart exists) and
includes sigstore/common as the obligatory hollow-chart-gate
dependency, matching the bp-vllm precedent from W2.5.D (#283).

Closes #267
Closes #268
Closes #271

helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 19:37:19 +04:00
e3mrah
a6bf07b0ce
feat(charts): bp-librechat wrapper chart (closes #275) (#287)
W2.5.G — Catalyst-authored scratch chart for LibreChat (slot 48 of the
omantel-1 bootstrap-kit). LibreChat upstream does not publish a Helm
chart, so this chart hand-wires the official ghcr.io/danny-avila/librechat
container as Deployment + Service + Ingress + ConfigMap + ServiceAccount
+ NetworkPolicy + ServiceMonitor + HPA, with the sigstore/common
library subchart declared to satisfy the hollow-chart gate (issue #181).

Per docs/BLUEPRINT-AUTHORING.md §11.2: every observability toggle
(serviceMonitor, hpa) defaults false; opt-in via per-cluster overlay
once kube-prometheus-stack reconciles. The ServiceMonitor template is
double-gated by .Values.serviceMonitor.enabled AND
Capabilities.APIVersions.Has "monitoring.coreos.com/v1" so flipping the
toggle on a too-early Sovereign cannot break the bp-librechat reconcile.
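
The double gate described above can be sketched as follows. The
metadata name, selector label, port name and metrics path are
assumptions for illustration; only the two conditions come from this
message:

```yaml
# Both conditions must hold: the operator opted in AND the Prometheus
# Operator API group is registered on the cluster.
{{- if and .Values.serviceMonitor.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Release.Name }}-librechat       # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: librechat     # hypothetical label
  endpoints:
    - port: http                            # hypothetical port name
      path: /metrics                        # hypothetical path
{{- end }}
```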

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every endpoint
URL, model name, secret reference, namespace selector, and image tag is
operator-tunable via values.yaml. The Sovereign FQDN, Keycloak issuer,
llm-gateway URL, embeddings URL, and TLS ClusterIssuer are all
operator-supplied at install time. The image tag is pinned to v0.7.5
(no :latest).

Connectors:
- Chat completions: bp-llm-gateway (OpenAI-compatible /v1/chat/completions)
  exposed as a "custom" endpoint named "Catalyst LLM"
- Embeddings (RAG): bp-bge — provider=bge maps to EMBEDDINGS_PROVIDER=openai
  + RAG_OPENAI_BASEURL=<bge.svc> at template-render time
- SSO: bp-keycloak (OpenID Connect) — issuer/clientId from values,
  client secret + session secret from ExternalSecret
- Conversation store: FerretDB on bp-cnpg (MongoDB wire protocol over
  Postgres) — operator-supplied connection URI

Hosted at chat-app.<sovereign-fqdn>; the chart `fail`s render if
ingress.host is empty (no platform-wide default).
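
A guard of roughly this shape produces that failure. The error text is
illustrative, not the chart's wording:

```yaml
# Placed at the top of the Ingress template: refuse to render without a host.
{{- if not .Values.ingress.host }}
{{- fail "ingress.host must be set (for example chat-app.<sovereign-fqdn>); there is no platform-wide default" }}
{{- end }}
```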

helm template (default values, --set ingress.host=...):
  ConfigMap, Deployment, Ingress, NetworkPolicy, Service, ServiceAccount

helm template (--set hpa.enabled=true,serviceMonitor.enabled=true
              --api-versions monitoring.coreos.com/v1):
  ConfigMap, Deployment, HorizontalPodAutoscaler, Ingress, NetworkPolicy,
  Service, ServiceAccount, ServiceMonitor

helm lint: 1 chart(s) linted, 0 chart(s) failed (single INFO on
missing icon — icons land with the marketplace card work).

tests/observability-toggle.sh: PASS on default-off, opt-in
(--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and
explicit-off cases.

Path isolation: only platform/librechat/ — no HR slot files,
blueprint-release.yaml, or other charts touched. The HR slot files
(clusters/.../48-librechat.yaml) and blueprint-release.yaml will land
in a separate slot-wiring PR per the W2.K4 expansion plan.

Closes #275

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:56:59 +04:00
e3mrah
9dc8506dd9
feat(charts): bp-external-secrets + bp-cnpg + bp-valkey wrapper charts (#285)
Storage-substrate batch (W2.5.A) — closes #254 by shipping the three
upstream-subchart umbrella Blueprints that the Flux HRs at
clusters/_template/bootstrap-kit/{15-external-secrets,16-cnpg,17-valkey}
.yaml (merged via PR #262) target.

Each chart follows the canonical umbrella pattern documented in
docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream
chart under `dependencies:` so `helm dependency build` bundles the
upstream payload into the OCI artifact, and Catalyst-curated overlay
values + templates sit alongside in chart/values.yaml + chart/templates/.
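
As an illustration of the umbrella shape, a Chart.yaml along these
lines declares the upstream chart as a dependency. Names and versions
are taken from the bp-cnpg entry in this message; the repository URL is
an assumption and may differ from what the chart actually pins:

```yaml
apiVersion: v2
name: bp-cnpg
version: 1.0.0
type: application
dependencies:
  # `helm dependency build` pulls this into charts/ so the published
  # OCI artifact carries the upstream payload.
  - name: cloudnative-pg
    version: 0.28.0
    repository: https://cloudnative-pg.github.io/charts  # assumption
```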

Per-chart highlights:
- bp-external-secrets/1.0.0 — wraps external-secrets/external-secrets
  0.10.7. Ships a default `vault-region1` ClusterSecretStore (via Helm
  post-install/post-upgrade hook to defer the CR application until the
  upstream chart's CRDs are registered) wired to the in-cluster
  bp-openbao service. clusterSecretStore.enabled toggle lets cluster
  overlays opt out and author their own multi-region CRs.
- bp-cnpg/1.0.0 — wraps cnpg/cloudnative-pg 0.28.0. Operator-only
  surface (Cluster CRs are per-Application). CRDs ship in-chart so
  bp-powerdns / bp-keycloak / bp-gitea / bp-langfuse / bp-grafana /
  bp-temporal / bp-matrix / bp-llm-gateway / bp-bge / bp-nemo-guardrails
  / bp-openmeter / pool-domain-manager can `dependsOn: bp-cnpg` via
  Flux — closing #254 (bp-powerdns CreateContainerConfigError on
  pdns-pg-app secret).
- bp-valkey/1.0.0 — wraps bitnami/valkey 5.5.1. BSD-3 Redis-compatible
  cache, replication architecture, password auth ON, NetworkPolicy ON,
  replicas 0 by default for solo Sovereigns (cluster overlays bump for
  HA). Application-tier cache only — Catalyst control plane uses NATS
  JetStream KV (per ARCHITECTURE.md §5).

Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every observability
toggle defaults `false` (ServiceMonitor / PodMonitor / PrometheusRule /
metrics sidecar) and is operator-tunable via per-cluster overlay once
bp-kube-prometheus-stack reconciles. Each chart ships
tests/observability-toggle.sh covering default-off, opt-in (--api-versions
monitoring.coreos.com/v1 to simulate the CRDs), and explicit-off cases.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every upstream
version, namespace, server URL, role, and password toggle is exposed
under values.yaml. Cluster overlays in clusters/<sovereign>/ may
override without rebuilding the Blueprint OCI artifact.

helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only)
helm template default render kinds:
  bp-external-secrets: ClusterRole, ClusterRoleBinding, ClusterSecretStore, CustomResourceDefinition, Deployment, Role, RoleBinding, Secret, Service, ServiceAccount, ValidatingWebhookConfiguration
  bp-cnpg:             ClusterRole, ClusterRoleBinding, ConfigMap, CustomResourceDefinition, Deployment, MutatingWebhookConfiguration, Service, ServiceAccount, ValidatingWebhookConfiguration
  bp-valkey:           ConfigMap, NetworkPolicy, PodDisruptionBudget, Secret, Service, ServiceAccount, StatefulSet

Closes #254

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 18:39:29 +04:00
e3mrah
ba2ff05292
feat(charts): bp-seaweedfs + bp-harbor + bp-vpa wrapper charts (#284)
W2.5.B — first authoring of the three Catalyst Blueprint wrapper charts
that fill bootstrap-kit slots 18 (seaweedfs), 19 (harbor) and 29 (vpa).
Each wraps an upstream chart as a Helm subchart and ships Catalyst-
curated overlay templates (NetworkPolicy + ServiceMonitor) gated behind
opt-in toggles, per docs/BLUEPRINT-AUTHORING.md §11 and
docs/INVIOLABLE-PRINCIPLES.md.

bp-seaweedfs (slot 18 — storage foundation)
  - Wraps seaweedfs/seaweedfs 4.22.0; Chart name `bp-seaweedfs`.
  - Catalyst defaults: 1 master + 3 volume + 1 filer + 2 s3 replicas.
  - S3 API on 8333 — single S3 surface every consumer talks to per
    docs/PLATFORM-TECH-STACK.md §3.5 (no per-app MinIO).
  - Overlay templates: NetworkPolicy (cross-namespace S3 reachability,
    cold-tier egress allowlist), ServiceMonitor (Capabilities-gated,
    DEFAULT FALSE per §11.2).
  - Default helm template kinds: ClusterRole, ClusterRoleBinding,
    ConfigMap, Deployment, Secret, Service, ServiceAccount, StatefulSet.

bp-harbor (slot 19 — per-Sovereign OCI registry)
  - Wraps goharbor/harbor 1.18.3 (appVersion 2.14.3); Chart name
    `bp-harbor`.
  - Catalyst defaults: blob backend = SeaweedFS S3 (regionendpoint
    seaweedfs-s3.seaweedfs.svc:8333), metadata DB = bp-cnpg external
    Postgres, ingress class `cilium`, expose.tls.enabled true (cert-
    manager-issued Secret).
  - Overlay templates: NetworkPolicy (CNPG/SeaweedFS/Keycloak egress),
    ServiceMonitor (Capabilities-gated, DEFAULT FALSE).
  - Trivy + SSO + pull-mirror are operator-flag opt-ins in the
    per-Sovereign overlay (default false; trivy/keycloak/cnpg deps land
    on later slots).
  - Default helm template kinds: ConfigMap, Deployment, Ingress,
    PersistentVolumeClaim, Secret, Service, StatefulSet.

bp-vpa (slot 29 — vertical autoscaling)
  - Wraps cowboysysop/vertical-pod-autoscaler 11.1.1 (appVersion
    1.5.0); Chart name `bp-vpa`.
  - Catalyst defaults: 1 replica each of recommender + updater +
    admission-controller. Default mode `Off` (recommend only).
  - Admission webhook self-signs via init Job (cluster-internal); per-
    Sovereign overlay MAY swap to cert-manager.
  - Overlay templates: NetworkPolicy (apiserver + metrics-server
    egress, admission webhook ingress).
  - Upstream metrics.serviceMonitor / metrics.prometheusRule defaulted
    false per §11.2.
  - Default helm template kinds: ClusterRole, ClusterRoleBinding,
    ConfigMap, Deployment, Job, Pod, Secret, Service, ServiceAccount.

Lint + observability-toggle results
  helm lint: 1 chart(s) linted, 0 chart(s) failed (each)
  tests/observability-toggle.sh: PASS on all three (default render has
  zero monitoring.coreos.com/v1 references; opt-in render produces a
  ServiceMonitor; explicit-off render is clean).

Path isolation: only platform/seaweedfs/, platform/harbor/, and
platform/vpa/ — no HR slot files or other charts touched.

Refs: bootstrap-kit slots 18, 19, 29 reconcile against
ghcr.io/openova-io/bp-seaweedfs:1.0.0, bp-harbor:1.0.0, bp-vpa:1.0.0
which this commit produces on next blueprint-release CI run.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 18:37:50 +04:00
e3mrah
c3c9c0cf27
feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283)
Catalyst-authored umbrella charts for the W2.5.D AI-inference stack.
None of the three upstream projects publish a Helm chart, so each
chart hand-wires the upstream container as Deployment + Service +
ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the
sigstore/common library subchart declared to satisfy the
hollow-chart gate (issue #181).

bp-vllm (slot 39) — wraps vllm/vllm-openai:v0.6.4. GPU-aware
(nvidia.com/gpu when vllm.gpu.enabled=true; CPU fallback for dev).
Default model meta-llama/Llama-3.1-8B-Instruct, port 8000,
OpenAI-compatible /v1/chat/completions. All engine knobs
(maxModelLen, gpuMemoryUtilization, dtype, quantization,
tensorParallelSize, prefix-caching) overlay-tunable. Closes #266.

bp-bge (slot 42) — wraps ghcr.io/huggingface/text-embeddings-inference:cpu-1.5.
Default model BAAI/bge-small-en-v1.5 + BAAI/bge-reranker-base
sidecar in same Pod. Two-port Service (8080 embed, 8081 rerank)
annotated for bp-llm-gateway discovery. CPU-friendly defaults;
overlay swaps in BAAI/bge-m3 on GPU Sovereigns. Closes #269.

bp-nemo-guardrails (slot 43) — wraps the upstream NVIDIA/NeMo-Guardrails
Dockerfile (nemoguardrails server, FastAPI, port 8000). LLM endpoint
+ model + engine all overlay-tunable; Colang flow bundle mounts via
configMap.externalName for production rails. ConfigMap stub renders
a default rail for smoke testing. Closes #270.

All three charts:
- Default observability toggles to false per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no :latest) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (runAsUser 1000, drop ALL capabilities)
- prometheus.io scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and
  egress to HuggingFace / bp-vllm / bp-bge as appropriate

helm template (default values) per chart:
  bp-vllm:            ConfigMap, Deployment, Service, ServiceAccount
  bp-bge:             ConfigMap, Deployment, Service, ServiceAccount
  bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

helm template (--set serviceMonitor.enabled=true,networkPolicy.enabled=true,hpa.enabled=true):
  All three render ConfigMap + Deployment + Service + ServiceAccount +
  ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler.

helm lint: 0 chart(s) failed for all three (single INFO on missing icon —
icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:37:07 +04:00
e3mrah
0cfd0defa9
fix(bp-langfuse): drop apostrophe from description to clear GHCR 500 (resolves #215) (#278)
Root cause: Helm's `helm push` collapses the chart `description` field
into a single-line OCI manifest annotation
`org.opencontainers.image.description`. The GHCR manifest-PUT validator
returns a deterministic 500 Internal Server Error when that annotation
is long AND contains an ASCII apostrophe. bp-langfuse 1.0.0 was the
only chart in the observability batch (PR #214) carrying both
characteristics, so it was the only one that failed to publish.

Fix: reword the affected sentence from "Langfuse's persistent state" to
"the Langfuse persistent state" — drops the apostrophe, preserves the
meaning, and crucially preserves every byte of the actual chart payload
(values, templates, all 350 entries of the upstream langfuse-1.5.28
subchart with its 4-level-deep Bitnami vendoring). No runtime
behavioural change; helm template renders the exact same 6 resources
across 490 lines.

The narrowing was done by progressively reducing the Chart.yaml from
the failing version to a passing version while pushing to a scratch
GHCR namespace, with the bp-langfuse repo deleted between attempts
(verified via `DELETE /orgs/openova-io/packages/container/bp-langfuse`
and re-querying). The trigger is reproducible: long description +
apostrophe → 500; long description without apostrophe → push succeeds;
short description with apostrophe → push succeeds.

Added a multi-line WARNING comment immediately above `description:`
documenting the trigger so future authors do not reintroduce a
possessive form. Issue #215 captures the full reproduction.

Closes #215

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 17:31:51 +04:00
e3mrah
ec3821f7e1
fix(bp-*): event-driven HR install -- drop blanket timeout, use disableWait (#250)
Helm install completes when manifests apply, not when pods reach Ready.
Flux dependsOn checks Ready=True on each HR independently, so
spec.install.disableWait + spec.upgrade.disableWait is the correct
shape for slow-Ready workloads. Blanket spec.timeout: Nm watchdogs from
PR #221 were a band-aid that caused cascading HR failures and blocked
downstream HRs (bp-nats-jetstream, bp-openbao depended on bp-spire).
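
A sketch of the resulting HelmRelease shape for a downstream HR,
assuming the Flux `helm.toolkit.fluxcd.io/v2` API. The namespace,
interval and chart source are assumptions; only `dependsOn`, the two
`disableWait` fields and the absence of `spec.timeout` reflect this
change:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-openbao
  namespace: flux-system          # assumption
spec:
  interval: 10m                   # assumption
  dependsOn:
    - name: bp-spire              # gated on the upstream HR's Ready=True
  chartRef:                       # assumption: chart delivered via an OCIRepository
    kind: OCIRepository
    name: bp-openbao
  install:
    disableWait: true             # install completes once manifests apply
  upgrade:
    disableWait: true
  # spec.timeout is intentionally omitted.
```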

Founder direction (verbatim): "always event driven robust jobs"

Per-HR audit (drop spec.timeout: 15m, add disableWait, with reason):

- bp-cilium:        envoyconfig CRD self-wait — agent crash-loops until
                    its own CRDs land
- bp-cert-manager:  webhook readiness depends on cainjector mutating
                    Secret — multi-minute on cold start
- bp-flux:          adopts cloud-init Flux objects; the helm-controller
                    reconciling THIS HR is itself a chart target — Ready
                    deadlock without disableWait
- bp-sealed-secrets: single-replica controller + CRD — install completes
                    on manifest apply
- bp-spire:         spire-controller-manager waits for CRD informer cache
                    sync — multi-minute legitimate path; chart fix below
- bp-nats-jetstream: JetStream raft quorum formation across N replicas
- bp-openbao:       3-node Raft sealed-by-default; Ready=True only after
                    the operator runs the `bao operator init` + unseal flow
- bp-keycloak:      DB schema migration + 100+ Liquibase changesets on
                    first install
- bp-gitea:         PostgreSQL DB init + admin user + Blueprint catalog
                    mirror seeding
- bp-external-dns:  pod readiness depends on PowerDNS API + pdns-pg CNPG
                    cascade
- bp-catalyst-platform: ~10 services, inter-service NATS/OTel readiness
                    is not Helm's concern

Intentionally NOT touched (other parallel agents own these):
- bp-crossplane (Agent A): chart split for intra-chart CRD-ordering
- bp-powerdns   (Agent D): post-install hook for intra-chart Job-ordering

bp-spire chart fix (1.1.3 -> 1.1.4):

Root cause investigation on otech.omani.works (live):
  spire-controller-manager has restarted 37 times with:
    "failed to wait for clusterstaticentry caches to sync: timed out
     waiting for cache to be synced for Kind *v1alpha1.ClusterStaticEntry"

`kubectl get crd | grep spire` returns nothing — the spire.spiffe.io
v1alpha1 CRDs (ClusterSPIFFEID / ClusterStaticEntry /
ClusterFederatedTrustDomain) are NOT registered. The upstream `spire`
chart does not install its own CRDs; the spiffe maintainers ship them
via the SEPARATE `spire-crds` chart, expected to be installed first.

Fix: platform/spire/chart/Chart.yaml now declares spire-crds 0.5.0 as
the FIRST dependency. Helm aggregates the objects of all subcharts and
applies them sorted by kind rather than by dependency position, so with
spire-crds bundled the CRDs are applied before the spire subchart's
controller-manager Deployment starts. blueprint.yaml +
both 06-spire.yaml cluster references bumped to 1.1.4.

Live error this fixes (otech.omani.works, persistent ~5h):
  Helm upgrade failed for release spire-system/spire with chart
  bp-spire@1.1.3: context deadline exceeded
  + downstream cascade: bp-nats-jetstream / bp-openbao stuck at
    "dependency 'flux-system/bp-spire' is not ready"

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:19 +04:00
e3mrah
726af6df81
fix(bp-powerdns): self-generate api-credentials Secret + disable upstream zone-bootstrap Job (#248)
Root cause investigation on otech.omani.works (kubectl, sanitized):

  $ kubectl get pods -n powerdns
  create-zone-if-not-exist-sh-tjtr4   0/1  CreateContainerConfigError  4h
  powerdns-57d7d49f99-{9hrb4,lxlgt,nkmht}  0/1  CreateContainerConfigError  4h
  dnsdist-594dbfc5f-wznsw                  1/1  Running  4h

  $ kubectl get secrets -n powerdns
  powerdns                Opaque  1  4h
  powerdns-api-tls-8kxpx  Opaque  1  4h     (NO `powerdns-api-credentials`, NO `pdns-pg-app`)

  $ kubectl describe pod ... powerdns-57d7d49f99-9hrb4
  Environment:
    PDNS_API_KEY:  <set to the key 'api-key' in secret 'powerdns-api-credentials'>  Optional: false
    PDNS_DB_HOST:  <set to the key 'host' in secret 'pdns-pg-app'>                  Optional: false
    State: Waiting   Reason: CreateContainerConfigError

The handover's chicken-egg-with-secret theory was directionally right but
the cause was more fundamental:

  1. Wrapper chart's api-credentials-secret.yaml (1.1.2) was a no-op
     unless the operator set the `apiKey` value out-of-band. Its comment
     said the deployment would "fail to start until the named Secret
     exists" as "the explicit signal we want". On a Sovereign that
     bootstraps from bp-* OCI artifacts, no operator is standing by, so
     the Secret is never created and pods sit in
     CreateContainerConfigError forever.

  2. The upstream chart's `create-zone-if-not-exist-sh` Job is rendered
     whenever both `zoneName` and `api.key` are set. Because upstream
     defaults `zoneName: "example.de."`, it ALWAYS rendered and ALWAYS
     failed (same missing Secret). Catalyst doesn't want this Job at all
     because zones are loaded later by pool-domain-manager (PDM).

  3. The chart's CNPG Cluster template is gated behind
     Capabilities.APIVersions.Has "postgresql.cnpg.io/v1" — on a fresh
     Sovereign without bp-cnpg yet (bp-cnpg is on the roadmap, not in
     bootstrap-kit), no Cluster is rendered and `pdns-pg-app` Secret
     never materialises. With Helm `--wait`, install times out
     ("context deadline exceeded") even though the manifests applied
     cleanly.

Fix:

  * api-credentials-secret.yaml: self-generate via Helm `lookup` +
    `randAlphaNum 32`. First install creates fresh randoms; every
    subsequent reconcile reads back the existing values from the
    Secret so the API key never rotates on upgrade. Operator can
    still pin specific values via .Values.powerdns.apiKey /
    .Values.powerdns.webserverPassword, or skip Secret creation
    entirely via .Values.powerdns.useExistingApiSecret. Same pattern
    as bitnami/postgresql, bitnami/keycloak.

  * values.yaml: set `powerdns.zoneName: ""` so upstream chart's
    `{{- if and .Values.powerdns.zoneName .Values.powerdns.api.key }}`
    gate skips the create-zone Job entirely. Catalyst's PDM creates
    zones via the REST API after the cluster comes up; we don't want
    a placeholder `example.de.` zone in production.

  * HelmRelease (both _template and otech.omani.works overlays):
    `install.disableWait: true` + `upgrade.disableWait: true` so the
    HelmRelease reports Ready as soon as manifests apply cleanly,
    rather than gating on powerdns Deployment readiness which depends
    on bp-cnpg landing first to synthesise `pdns-pg-app`. Runtime
    convergence is observed via kubectl, not gated on Helm.

Live error this addresses:
  Helm upgrade failed for release powerdns/powerdns with chart
  bp-powerdns@1.1.2: context deadline exceeded

Verified locally with `helm template`:
  - powerdns-api-credentials Secret renders with random api-key + webserver-password
  - create-zone-if-not-exist-sh Job no longer rendered
  - Deployment env continues to reference powerdns-api-credentials correctly

Bumped 1.1.2 -> 1.1.3 (chart, blueprint, both bootstrap-kit overlays).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:12 +04:00
e3mrah
2d1799d738
fix(bp-crossplane): split XRDs+Compositions into bp-crossplane-claims (#247)
Resolves install ordering on fresh clusters where the apiserver rejects
CompositeResourceDefinition CRs because the apiextensions.crossplane.io
CRDs registered by the crossplane subchart aren't live yet at apply time.

- bp-crossplane bumped 1.1.2 -> 1.1.3 (controller-only payload)
- NEW bp-crossplane-claims@1.0.0 carries XRDs + Compositions
- Flux HelmRelease for crossplane-claims uses dependsOn: [bp-crossplane]
- composition-validate.sh + fixtures relocate to the new chart
- blueprint-release CI: opt-out annotation
  catalyst.openova.io/no-upstream=true permits zero-deps charts that
  legitimately ship only Catalyst-authored CRs (the original hollow-chart
  rule remains in force for every other umbrella chart)

Live error this fixes (from otech.omani.works):
  no matches for kind "CompositeResourceDefinition" in version
  "apiextensions.crossplane.io/v1" -- ensure CRDs are installed first

Pattern: intra-chart CRD-ordering breaks -> split charts + Flux dependsOn.
Apply universally to similar cases going forward.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:05 +04:00
e3mrah
f658757962
fix(bp-crossplane): resolve CHART_DIR to absolute path in composition-validate.sh (#237)
CI invokes the script as `bash <script> "platform/crossplane/chart"` from
the repo root. The script then `cd`s into that relative path, which works,
but every later `"$CHART_DIR/<sub>"` reference (notably FIXTURE_DIR for
Case 6) inherits the now-stale relative prefix and resolves under the
wrong cwd. Fix: resolve CHART_DIR via `(cd ... && pwd)` to an absolute
path BEFORE the chdir.
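
The fix can be sketched as follows. This is a minimal illustration, not the script's actual code: `resolve_dir` is a hypothetical helper name, and the `tests/fixtures` layout is taken from the FAIL message in the repro.

```shell
#!/usr/bin/env bash
# Sketch: resolve a possibly-relative directory argument to an absolute
# path BEFORE any later `cd`, so derived paths stay valid from any cwd.

# resolve_dir DIR: print the absolute path of DIR (hypothetical helper).
# The subshell keeps the caller's working directory unchanged, and the
# empty CDPATH stops `cd` from printing or searching other locations.
resolve_dir() {
  (CDPATH= cd -- "$1" && pwd)
}

# Intended use inside a script that receives a relative chart path:
#   CHART_DIR="$(resolve_dir "$1")" || exit 1
#   FIXTURE_DIR="$CHART_DIR/tests/fixtures"
#   cd "$CHART_DIR"   # FIXTURE_DIR still resolves after this chdir
```

Resolving once at the top means every later `"$CHART_DIR/<sub>"` reference is independent of the working directory at the time it is used.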

Local repro before fix:

  $ bash platform/crossplane/chart/tests/composition-validate.sh \
        platform/crossplane/chart
  ...
  Case 6: every fixture XRC kind is matched by an XRD
  FAIL: fixtures dir platform/crossplane/chart/tests/fixtures missing

Local result after fix:

  $ bash platform/crossplane/chart/tests/composition-validate.sh \
        platform/crossplane/chart
  ...
  Case 6: every fixture XRC kind is matched by an XRD
    PASS
  All bp-crossplane Day-2 CRUD Composition gates green.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:36:07 +02:00
e3mrah
8592d20919
feat(bp-crossplane): 6 XRDs + Compositions for Day-2 CRUD (RegionClaim/ClusterClaim/NodePoolClaim/LoadBalancerClaim/PeeringClaim/NodeActionClaim) (#236)
Adds the 6 CompositeResourceDefinitions and matching Compositions that
back the catalyst-api Day-2 CRUD endpoints. catalyst-api writes XRCs of
these kinds; Crossplane materialises them into provider-hcloud (and a
small number of provider-kubernetes) managed resources. Per
docs/INVIOLABLE-PRINCIPLES.md #3, every cloud-side op flows through
provider-hcloud — never bespoke hcloud-go calls or shell-outs to the
hcloud CLI.

XRDs (canonical group: compose.openova.io/v1alpha1):

  - RegionClaim       → composes the Phase-0 quartet via provider-hcloud:
                        Network + NetworkSubnet + Firewall + Server (cp1)
                        + LoadBalancer + LoadBalancerNetwork +
                        LoadBalancerService×2 + LoadBalancerTarget. Mirrors
                        infra/hetzner/main.tf 1:1 so deletion of a
                        RegionClaim cascades the whole slice.
  - ClusterClaim      → composes a provider-kubernetes Object that
                        materialises a cluster-identity ConfigMap. The
                        catalyst-environment-controller reads the CM to
                        template per-server cloud-init.
  - NodePoolClaim     → composes up to 100 provider-hcloud Server
                        resources. UPDATE flow: patching replicas n→m
                        flips the per-index Required-policy gate so
                        Crossplane creates/deletes Server CRs.
  - LoadBalancerClaim → composes provider-hcloud LoadBalancer +
                        LoadBalancerNetwork + up to 50
                        LoadBalancerService entries (per listener) + up
                        to 50 LoadBalancerTarget entries. UPDATE: patch
                        listeners[]/targets[] → composite controller
                        adds/removes services/targets.
  - PeeringClaim      → composes 1 or 2 provider-hcloud Route resources
                        (bidirectional flag toggles the second one
                        through a Required-policy gate).
  - NodeActionClaim   → composes a provider-kubernetes Object that
                        creates a batch/v1 Job running kubectl
                        cordon/drain (k8s-side op, not a cloud op, per
                        the task spec). action=replace additionally
                        composes a provider-hcloud Server for the
                        replacement node.

UPDATE/DELETE summary:

  - UPDATE: every mutable schema field is patched onto the underlying
    managed resource; Crossplane's composite controller drives the diff
    and provider-hcloud reconciles to the new state.
  - DELETE: every composed resource has deletionPolicy: Delete, so a
    cascade delete of the composite tears down the whole resource graph
    in dependency-safe order (Crossplane retries until deps unblock).

New tests:
  - tests/composition-validate.sh — 7 gates: helm renders cleanly,
    exactly 6 XRDs, ≥ 6 Compositions, all 6 expected claim kinds
    present, every rendered doc is valid YAML, every fixture references
    a real XRD, and (when KUBECONFIG + Crossplane CRDs available)
    server-side dry-run for every fixture.
  - tests/fixtures/<kind>-sample.yaml — one XRC fixture per kind.

Version bump:
  - platform/crossplane/chart/Chart.yaml             1.1.1 → 1.1.2
  - platform/crossplane/blueprint.yaml               1.1.1 → 1.1.2
  - clusters/_template/bootstrap-kit/04-crossplane.yaml         → 1.1.2
  - clusters/otech.omani.works/bootstrap-kit/04-crossplane.yaml → 1.1.2

Hard rules respected:
  - provider-hcloud only for cloud ops (never hcloud-go, never CLI).
  - provider-kubernetes Object for k8s-side ops (never raw kubectl).
  - No bespoke kubectl manifests for cloud resources.
  - Frontend + catalyst-api Go code untouched (sibling-owned).
  - Target state, no MVP framing — all 6 Compositions ship.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:33:38 +02:00
e3mrah
c747fe2265
fix(bp-gitea): override postgresql to bitnamilegacy (Bitnami evacuated docker.io tags) (#231)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 08:27:49 +02:00
e3mrah
da87fb38c4
fix(bp-spire): disable ALL default-enabled clusterSPIFFEIDs (default+oidc+test-keys) (#230)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 08:13:41 +02:00
e3mrah
719c3bac35
fix(bp-spire): disable default ClusterSPIFFEID — CRD not observable in time on fresh install (#228)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 07:51:03 +02:00
e3mrah
1689ffcd1a
fix(bp-coraza,bp-syft-grype): add common library subchart to satisfy hollow-chart gate (#220)
Both charts are scratch (no upstream Helm chart published — Coraza
project + anchore/syft+grype CLIs ship containers only). The
blueprint-release.yaml hollow-chart gate (issue #181) rejects charts
with zero declared dependencies. Adding sigstore/common as a tiny
library subchart satisfies the gate; common is a library type so it
contributes zero runtime resources to either chart's rendered output.

The Catalyst-side templates (Deployment+Service for bp-coraza,
CronJob+PVC for bp-syft-grype) remain entirely in templates/ — the
library dep is purely a CI-gate mechanism, NOT a functional dependency.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 06:15:28 +02:00
e3mrah
3a57e287e5
feat(platform): security umbrellas (falco/kyverno/trivy/sigstore/syft-grype/reloader/coraza/litmus) (#216)
* feat(bp-falco): umbrella chart for security layer

Catalyst Blueprint umbrella chart for falco — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-kyverno): umbrella chart for security layer

Catalyst Blueprint umbrella chart for kyverno — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-trivy): umbrella chart for security layer

Catalyst Blueprint umbrella chart for trivy — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-sigstore): umbrella chart for security layer

Catalyst Blueprint umbrella chart for sigstore — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-syft-grype): umbrella chart for security layer

Catalyst Blueprint umbrella chart for syft-grype — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-reloader): umbrella chart for security layer

Catalyst Blueprint umbrella chart for reloader — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-coraza): umbrella chart for security layer

Catalyst Blueprint umbrella chart for coraza — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-litmus): umbrella chart for security layer

Catalyst Blueprint umbrella chart for litmus — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 06:07:38 +02:00
e3mrah
75128781b3
feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214)
* feat(bp-grafana): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana — visualization layer of the
LGTM observability stack (Loki/Grafana/Tempo/Mimir).

Pinned to grafana/grafana 10.5.15 (appVersion 12.3.1) — current stable
on 2026-04-29. Solo-Sovereign defaults: 1 replica, 10Gi PVC,
ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-loki): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Loki — log aggregation backend
of the LGTM stack. SingleBinary mode by default (solo-Sovereign min);
SimpleScalable/Distributed are values toggles.

Pinned to grafana/loki 7.0.0 (appVersion 3.6.7) on 2026-04-29.
Filesystem storage default; SeaweedFS S3 wiring is per-Sovereign overlay
when scaling out. All observability toggles default false per
BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-mimir): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Mimir — metrics storage tier of
the LGTM stack.

Pinned to grafana/mimir-distributed 6.0.6 (appVersion 3.0.4) on
2026-04-29. Solo-Sovereign defaults: every component scaled to 1
replica, zoneAwareReplication disabled, Kafka ingest-storage disabled.
Bundled MinIO kept enabled as a stop-gap so the chart renders;
SeaweedFS S3 wiring is per-Sovereign overlay. All metaMonitoring
toggles default false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-tempo): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Tempo — distributed tracing
backend of the LGTM stack. Single-binary mode by default
(solo-Sovereign min); microservice mode (tempo-distributed) is a chart
swap toggle.

Pinned to grafana/tempo 1.24.4 (appVersion 2.9.0) on 2026-04-29. Local
PVC storage default; SeaweedFS S3 wiring is per-Sovereign overlay.
Metrics generator disabled by default (depends on bp-mimir).
ServiceMonitor default false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-alloy): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Alloy — unified telemetry
collector for the LGTM stack (logs, metrics, traces; OTLP-native).

Pinned to grafana/alloy 1.8.0 (appVersion v1.16.0) on 2026-04-29.
DaemonSet controller default (one Alloy per node) so node + container
telemetry work out of the box. Empty Alloy config by default;
per-Sovereign overlays populate forwarders to bp-loki/bp-mimir/bp-tempo
once those reconcile. ServiceMonitor + ingress + CRDs default false per
BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-opentelemetry): umbrella chart for observability stack

Catalyst Blueprint umbrella for the OpenTelemetry Collector — vendor-
neutral telemetry collector. Sibling to bp-alloy; per-Sovereign overlays
choose one.

Pinned to open-telemetry/opentelemetry-collector 0.152.0 (appVersion
0.150.1) on 2026-04-29. Uses the contrib distribution
(otel/opentelemetry-collector-contrib:0.150.1) so Loki/Mimir/Tempo
exporters are bundled. Deployment mode default (1 replica); DaemonSet
+ StatefulSet are values toggles. All presets default false; ingress
+ ServiceMonitor + PodMonitor + PrometheusRule + NetworkPolicy default
false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-langfuse): umbrella chart for observability stack

Catalyst Blueprint umbrella for Langfuse — LLM observability platform.
Complements bp-grafana (infrastructure metrics) with AI-specific
telemetry (traces, evaluations, prompts, cost attribution).

Pinned to langfuse/langfuse 1.5.28 (appVersion 3.171.0) on 2026-04-29.

Catalyst convention: ALL bundled Bitnami subcharts are disabled —
PostgreSQL via cnpg.io/Cluster (bp-cnpg), Redis via bp-valkey,
ClickHouse via bp-clickhouse, S3 via bp-seaweedfs. Per-Sovereign
overlays wire external endpoints + Secret references. Telemetry to
Langfuse Inc. defaulted false; signUpDisabled defaulted true.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-velero): umbrella chart for observability stack

Catalyst Blueprint umbrella for Velero — Kubernetes-native backup and
disaster recovery. Per platform/velero/README.md, ALL Velero output
goes to SeaweedFS (Catalyst's unified S3 encapsulation), which
transitions to a cloud archival backend on the cold tier.

Pinned to vmware-tanzu/velero 12.0.1 (appVersion 1.18.0) on 2026-04-29.
Bundled velero-plugin-for-aws:v1.14.0 init container so SeaweedFS S3 is
reachable. backupsEnabled/snapshotsEnabled defaulted false at this
layer (placeholders for backupStorageLocation); per-Sovereign overlays
flip on after wiring SeaweedFS endpoint + credentials. ServiceMonitor +
PodMonitor + PrometheusRule default false per BLUEPRINT-AUTHORING.md
§11.2.

Part of issue #204 observability-stack umbrellas batch.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 22:11:04 +02:00
e3mrah
fa0e3a494b
fix(bp-keycloak): pin to current Bitnami tag (closes #191) (#198)
* fix(bp-keycloak): pin to current Bitnami Keycloak tag (closes #191)

Bitnami consolidated their tag scheme around 2025-09 (see
https://github.com/bitnami/charts/issues/30852). The chart was pinned to
upstream bitnami/keycloak Helm chart 24.7.1, whose default image tag
`bitnami/keycloak:26.2.4-debian-12-r0` now returns 404 in the Docker Hub
registry — installs hit ImagePullBackOff (verified on omantel).

Changes:
- Upstream Bitnami chart: 24.7.1 -> 25.2.0 (latest, appVersion 26.3.3)
- Override image.registry/image.repository for every Bitnami image used
  by the chart (keycloak app, keycloak-config-cli, postgresql,
  postgres-exporter, os-shell) to point at `bitnamilegacy/*`, where the
  historic debian-12 tags are preserved
- Replace deprecated `proxy: edge` with `proxyHeaders: "xforwarded"`
  (chart 25.x renamed the field; Catalyst fronts Keycloak with Cilium
  Gateway which sets X-Forwarded-* headers)
- bp-keycloak chart version: 1.1.1 -> 1.1.2

Verification (registry HEAD via Bearer token):
  bitnami/keycloak:26.2.4-debian-12-r0          -> 404 (broken pin)
  bitnami/keycloak:26.3.3-debian-12-r0          -> 404 (registry move)
  bitnamilegacy/keycloak:26.3.3-debian-12-r0    -> 200
  bitnamilegacy/keycloak-config-cli:6.4.0-...   -> 200
  bitnamilegacy/postgresql:17.6.0-debian-12-r0  -> 200
  bitnamilegacy/postgres-exporter:0.17.1-...    -> 200
  bitnamilegacy/os-shell:12-debian-12-r50       -> 200

`helm template platform/keycloak/chart` renders cleanly; rendered images
all resolve to bitnamilegacy/* tags listed above.
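
For reference, a registry HEAD probe of this kind can be sketched as below. The function names are hypothetical and this is not necessarily the command that produced the list above. It assumes Docker Hub's anonymous token endpoint and Registry v2 manifest endpoint, `curl` on the PATH, and a compact JSON token response that `sed` can parse.

```shell
#!/usr/bin/env bash
# Sketch of a Docker Hub tag-existence probe. Function names are hypothetical.

# manifest_url REPO TAG: print the Registry v2 manifest URL on Docker Hub.
manifest_url() {
  printf 'https://registry-1.docker.io/v2/%s/manifests/%s\n' "$1" "$2"
}

# manifest_status REPO TAG: print the HTTP status of a HEAD request for the
# tag's manifest, using an anonymous pull token. Requires network access.
manifest_status() {
  local repo="$1" tag="$2" token
  # Assumption: the token response is compact JSON containing "token":"...".
  token="$(curl -fsSL \
    "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${repo}:pull" \
    | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')"
  curl -sS -o /dev/null -I -w '%{http_code}\n' \
    -H "Authorization: Bearer ${token}" \
    -H 'Accept: application/vnd.docker.distribution.manifest.list.v2+json' \
    -H 'Accept: application/vnd.oci.image.index.v1+json' \
    -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
    "$(manifest_url "$repo" "$tag")"
}
```

Calling `manifest_status bitnamilegacy/keycloak 26.3.3-debian-12-r0` would print the status code for one of the tags listed in the verification block; results depend on the registry state at the time of the call.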

Long-term follow-up (not blocking): bitnamilegacy is explicitly marked
"no longer updated, may be removed in the future" — Catalyst should
either build its own Keycloak image or migrate to the Bitnami Secure
Image (BSI/Photon) catalog when chart support catches up. Tracked in
the bp-keycloak description block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-keycloak): bump blueprint.yaml version to match Chart.yaml 1.1.2

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:10:17 +02:00
e3mrah
bcd2e7980a
fix: hide CRD-emitting resources behind Capabilities gates (closes #190) (#200)
* fix(bp-external-dns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Wrap the Catalyst overlay's ServiceMonitor and ExternalSecret templates
in `.Capabilities.APIVersions.Has` checks so a cold install on a fresh
Sovereign — where bp-kube-prometheus-stack and bp-external-secrets have
not yet reconciled — no longer fails with `no matches for kind X in
version Y`. The values toggles (`externalDns.serviceMonitor.enabled`,
`externalDns.externalSecret.enabled`) remain — Capabilities is defense
in depth so an operator flipping the toggle on a Sovereign that hasn't
reached Phase 2 doesn't break the bp-external-dns reconcile.

Verified locally: `helm template` with toggles off renders 0 of these
resources; with toggles ON and `--api-versions monitoring.coreos.com/v1
--api-versions external-secrets.io/v1beta1` both render exactly once.
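
One way to script this check is to count rendered documents of a given kind. The sketch below is illustrative only: `count_kind` is a hypothetical helper, the chart path in the usage comment is a placeholder, and it assumes `kind:` sits at column 0 for top-level manifests in `helm template` output.

```shell
#!/usr/bin/env bash
# count_kind KIND: read rendered YAML on stdin and print how many lines
# are exactly "kind: KIND" (nested, indented kind fields do not match).
count_kind() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c -x "kind: $1" || true
}

# Usage (chart path is a placeholder):
#   helm template <chart-dir> | count_kind ServiceMonitor
#   helm template <chart-dir> \
#     --api-versions monitoring.coreos.com/v1 \
#     --api-versions external-secrets.io/v1beta1 \
#     --set externalDns.serviceMonitor.enabled=true \
#     --set externalDns.externalSecret.enabled=true | count_kind ServiceMonitor
```

Comparing the count with toggles off against the count with toggles on and the API versions declared gives a scriptable form of the "0 versus exactly once" result.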

Bump version 1.1.0 → 1.1.2 to align with the Phase-1 architectural-fix
wave from issue #190.

* fix(bp-powerdns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Three Catalyst overlay templates emit resources whose CRDs ship in OTHER
charts and were unconditionally rendered, causing a cold install of
bp-powerdns to fail with `no matches for kind X` on a Sovereign that
hasn't yet reconciled the upstream chart:

  - cnpg-cluster.yaml          → postgresql.cnpg.io/v1 Cluster
                                 (CRD ships in bp-cnpg)
  - api-ingress.yaml           → traefik.io/v1alpha1 Middleware
                                 (CRD ships with the Traefik controller;
                                  k3s ships it by default but a Sovereign
                                  overlay MAY disable Traefik in favour
                                  of cilium-only ingress)
  - crossplane-floatingip.yaml → compose.openova.io/v1alpha1 HetznerFloatingIP
                                 (CRD ships when the Catalyst Crossplane
                                  composition family lands — see GAP
                                  DISCLOSURE in that template)

Each is wrapped in `.Capabilities.APIVersions.Has "<group>/<version>"`.
The Traefik router-middleware annotation on the Ingress is similarly
gated so the auth posture cleanly moves to the Sovereign's chosen
ingress controller when Traefik is absent.

Verified locally: `helm template` with default values renders 0 of
these resources; with `--api-versions postgresql.cnpg.io/v1
--api-versions traefik.io/v1alpha1 --api-versions compose.openova.io/v1alpha1`
plus `--set crossplane.floatingIP.enabled=true`, all three render
exactly once. Existing tests/observability-toggle.sh still passes.

Bump version 1.1.1 → 1.1.2.

* fix(bp-powerdns): bump blueprint.yaml to match Chart.yaml 1.1.2 after Capabilities gate work

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 20:10:14 +02:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in
products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts that fail
HARD when each user-flagged defect class returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00
hatiyildiz
b0c1c07271
fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.
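
A minimal sketch of the load-bearing check (Case 3 above), assuming the subchart's appVersion has already been read into a variable. The function names are illustrative and are not taken from tests/version-pin-replay.sh:

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the contents of version-pin-replay.sh.

# Print the version from a flux2 install.yaml release URL, without the leading "v".
flux_version_from_url() {
  printf '%s\n' "$1" |
    sed -n 's#.*/releases/download/v\([0-9][0-9.]*\)/install\.yaml.*#\1#p'
}

# Succeed only when the cloud-init URL pin equals the subchart appVersion.
# $1 = install.yaml URL, $2 = appVersion (with or without a leading "v").
pins_match() {
  [ "$(flux_version_from_url "$1")" = "${2#v}" ]
}
```

With the values from this commit, `pins_match` succeeds for the v2.4.0 URL against appVersion 2.4.0 and fails against the previous 2.3.0, which is the mismatch that triggered the rollback.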

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
4265884d58 feat(bp-external-dns): umbrella chart + add to bootstrap-kit Kustomization
Convert platform/external-dns/chart/ from a metadata-only wrapper to a
proper Helm umbrella that pulls kubernetes-sigs/external-dns 1.15.2
(appVersion 0.15.1, k8s 1.31-validated) as a Helm subchart, mirroring
the bp-cilium / bp-cert-manager / bp-powerdns shape. Native PowerDNS
provider speaks the bp-powerdns REST API directly via the
EXTERNAL_DNS_PDNS_API_KEY env var sourced from the
powerdns-api-credentials Secret bp-powerdns renders.

Catalyst overlay templates added (default-off where applicable per the
observability-toggle rule for the bp-* family):
  - templates/networkpolicy.yaml      (default ON; egress to powerdns +
                                       cluster DNS + apiserver only)
  - templates/servicemonitor.yaml     (default OFF)
  - templates/externalsecret.yaml     (default OFF; Phase-2 OpenBao path)
  - templates/_helpers.tpl

Bootstrap-kit Kustomization gets a new 12-external-dns.yaml HelmRelease
referencing bp-external-dns:1.1.0 with dependsOn bp-cert-manager +
bp-powerdns, and the legacy 11-bp-catalyst-platform.yaml is renumbered
13- so the install ordering reads in canonical Phase-0 sequence. Mirrored
to clusters/omantel.omani.works/bootstrap-kit/ with the SOVEREIGN_FQDN
substitution applied.

bp-catalyst-platform Chart.yaml drops bp-external-dns from its
dependency block — install ordering for ExternalDNS is now owned by Flux
dependsOn at the Kustomization layer rather than this umbrella's Helm
dependency graph. Bumped 1.1.0 → 1.1.1 to reflect the dep removal, and
the bootstrap-kit HelmRelease references in both clusters bumped in
lockstep.

Wrapper chart version bumped 1.0.0 → 1.1.1 (umbrella shape).

Local gates pass:
  - helm dependency build (pulls external-dns-1.15.2.tgz)
  - helm lint (0 failures)
  - helm template smoke render (245 lines, 6 kinds rendered)
  - helm package + tar-tzf verifies external-dns subchart inside the
    packaged tgz (subchart-guard simulation passes)
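
The last gate can be sketched as a filter over a `tar -tzf` listing. This assumes the packaged archive lists the vendored subchart as `<chart>/charts/<name>/Chart.yaml`; the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: reads a `tar -tzf` listing on stdin and succeeds when the
# named subchart's Chart.yaml appears under <chart>/charts/<name>/.
has_vendored_subchart() {
  grep -Eq "^[^/]+/charts/$1"'/Chart\.yaml$'
}
```

A typical call would pipe the listing of the packaged tgz into `has_vendored_subchart external-dns` and fail the publish when it returns non-zero.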

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:29:27 +02:00
e3mrah
31d5911221
Merge pull request #185 from openova-io/fix/bp-charts-observability-toggles-default-false
fix(bp-*): observability toggles default false (v1.1.1)
2026-04-29 21:26:48 +04:00
hatiyildiz
1ddd569789 fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
  .enabled → false. tests/observability-toggle.sh extended with Case 4
  (default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added in bdeb0f54, plus the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.
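
The default-off assertion can be sketched as a count over rendered manifests. This is an illustration under the assumption that the rendered output arrives on stdin; the helper name is hypothetical and the real tests/observability-toggle.sh may differ:

```shell
#!/usr/bin/env bash
# Sketch only: count lines that reference monitoring.coreos.com in rendered
# manifests read from stdin. grep -c exits 1 on zero matches, so `|| true`
# keeps the helper usable under `set -e`.
count_monitoring_refs() {
  grep -c 'monitoring\.coreos\.com' || true
}
```

Piping `helm template .` into the helper should print 0 for the default and explicit-off renders, and a non-zero count once a toggle is switched on with `--set`.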

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
02b5b6c4c8 fix(bootstrap-kit): override cilium + cert-manager values to disable observability toggles
Live verified on omantel: bp-cilium and bp-cert-manager v1.1.0 fail Helm
install with 'no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1'. Manual kubectl-patch of the live HelmRelease
worked but Flux's 15-min reconcile rolls back the patch because the
HelmRelease CR is owned by the kustomize-controller from git.

Override the values inline in the HelmRelease manifests so the patch is
durable across Flux reconciles. The in-flight observability-toggle agent
will apply the same pattern to all 12 charts in the next chart bump (v1.1.1).
This is the manifest-level workaround that unblocks the running omantel
cluster TODAY without waiting for v1.1.1 publish.

Mirrors the patches into both clusters/_template/bootstrap-kit/ AND
clusters/omantel.omani.works/bootstrap-kit/ so future Sovereigns inherit.
2026-04-29 19:17:08 +02:00
hatiyildiz
b1638f51ea fix(bp-* tests): skip helm dep build when charts/ already vendored
Earlier rerun failure on the CI workflow (bp-cert-manager 25120060270):

  Error: no repository definition for https://charts.jetstack.io.
  Please add the missing repos via 'helm repo add'

Root cause: blueprint-release.yaml's earlier `helm dependency build`
step (line 181) successfully resolves the upstream chart and populates
chart/charts/ — but it does NOT `helm repo add` the upstream repo
first. Helm 3.20's `helm dep build` succeeds on the first call by
falling back to direct-URL fetch from Chart.yaml `dependencies[].repository`.
A SECOND `helm dep build` (run by the test script) hits a different
code path that requires the repo to be in the helm repo cache.

Fix: tests/observability-toggle.sh now skips `helm dep build` when
chart/charts/ is already populated (which is always the case in CI
since the workflow's own `helm dependency build` step ran first). Local
dev runs from a fresh checkout still resolve subcharts.
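
A minimal sketch of the guard, assuming the chart directory is passed as the first argument; the helper name is illustrative:

```shell
#!/usr/bin/env bash
# Sketch only: succeed when <chart_dir>/charts/ is missing or empty, meaning
# the subcharts still have to be resolved.
needs_dep_build() {
  [ -z "$(ls -A "$1/charts" 2>/dev/null)" ]
}

# Possible use in a test script:
#   if needs_dep_build "$CHART_DIR"; then helm dependency build "$CHART_DIR"; fi
```

In CI the workflow's own `helm dependency build` has already populated charts/, so the guard returns non-zero and the second build is skipped.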

Refs #182

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:12:21 +02:00
hatiyildiz
d34facc040 fix(bp-*): observability toggles default false — break circular CRD dependency
bp-cilium@1.1.0 install fails on every fresh Sovereign with:

  no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  — ensure CRDs are installed first

Cascades to all 10 other bp-* HelmReleases ("dep is not ready") since
bp-cilium is the root of the bootstrap dep graph. Verified live on
omantel.omani.works 2026-04-29 (issue #182).

Root cause: platform/cilium/chart/values.yaml and
platform/cert-manager/chart/values.yaml hardcoded
`serviceMonitor.enabled: true`. The monitoring.coreos.com/v1 CRDs ship
with kube-prometheus-stack — an Application-tier Blueprint that itself
depends on the bootstrap-kit. Hardcoding `true` creates a circular CRD
ordering: bp-cilium wants the CRD bp-kube-prometheus-stack provides, but
bp-kube-prometheus-stack cannot install before bp-cilium.

The `trustCRDsExist=true` mitigation only suppresses Helm's render-time
gate; the apiserver still rejects the resource at install-time.

Violates INVIOLABLE-PRINCIPLES.md #4 (never hardcode): observability
toggles MUST be operator-tunable, not chart-level constants assuming an
observability tier exists.

This commit:

A. Defaults every observability toggle false in the affected wrappers:
   - platform/cilium/chart/values.yaml:
     cilium.prometheus.enabled: false
     cilium.prometheus.serviceMonitor.enabled: false
     (trustCRDsExist removed — no longer relevant)
   - platform/cert-manager/chart/values.yaml:
     cert-manager.prometheus.enabled: false
     cert-manager.prometheus.servicemonitor.enabled: false
   - platform/crossplane/chart/values.yaml:
     crossplane.metrics.enabled: false
     (uniformity rule — does not break install but holds the invariant)

B. Bumps affected wrapper charts 1.1.0 → 1.1.1:
   - bp-cilium, bp-cert-manager, bp-crossplane (leaves)
   - bp-catalyst-platform (umbrella; deps repinned to 1.1.1 for the 3)

C. Updates clusters/_template/bootstrap-kit/* and
   clusters/omantel.omani.works/bootstrap-kit/* HelmRelease versions to
   1.1.1 so the live Sovereign picks up the fix on Flux reconcile.

D. Adds platform/<name>/chart/tests/observability-toggle.sh under each
   affected chart. Each script asserts:
     - default render produces zero monitoring.coreos.com refs
     - opt-in render with --set <toggle>=true succeeds and produces a
       ServiceMonitor (proves the toggle is wired)
     - explicit-off render succeeds and produces zero refs
   Wired into .github/workflows/blueprint-release.yaml via a new
   "Run chart integration tests" step that executes every chart/tests/
   *.sh on every publish — a regression that re-introduces a hardcoded
   `true` fails the publish job before the OCI artifact is pushed.

E. Documents the rule in docs/BLUEPRINT-AUTHORING.md §11.2
   "Observability toggles must default false". References Principle #4
   and provides the canonical pattern (default off in wrapper values,
   opt-in via per-cluster overlay at clusters/<sovereign>/...).

Per-chart audit table (which toggle was hardcoded → new default):

| Chart                | Toggle                                         | Was  | Now   |
|----------------------|------------------------------------------------|------|-------|
| bp-cilium            | cilium.prometheus.enabled                      | true | false |
| bp-cilium            | cilium.prometheus.serviceMonitor.enabled       | true | false |
| bp-cert-manager      | cert-manager.prometheus.enabled                | true | false |
| bp-cert-manager      | cert-manager.prometheus.servicemonitor.enabled | true | false |
| bp-crossplane        | crossplane.metrics.enabled                     | true | false |
| bp-flux              | (no observability hardcodes)                   | n/a  | n/a   |
| bp-sealed-secrets    | (no observability hardcodes)                   | n/a  | n/a   |
| bp-spire             | (no observability hardcodes)                   | n/a  | n/a   |
| bp-nats-jetstream    | (no observability hardcodes)                   | n/a  | n/a   |
| bp-openbao           | (no observability hardcodes)                   | n/a  | n/a   |
| bp-keycloak          | (no observability hardcodes)                   | n/a  | n/a   |
| bp-gitea             | (no observability hardcodes)                   | n/a  | n/a   |
| bp-powerdns          | (no observability hardcodes)                   | n/a  | n/a   |
| bp-catalyst-platform | (umbrella, no values overlay)                  | n/a  | n/a   |

Local gates green:
  helm dep build      ✓ all 3 affected charts
  helm lint           ✓ all 3
  helm template       ✓ all 3 — 0 monitoring.coreos.com refs in default
  tests/observability-toggle.sh  ✓ all 9 sub-cases pass

Closes the install path for bp-cilium 1.1.1 on a fresh Sovereign;
unblocks the full bp-* dep graph.

Refs: https://github.com/openova-io/openova/issues/182

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:08:09 +02:00
hatiyildiz
43aff20254 feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/
bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
67fdecb770 merge: remove k8gb (#171)
2026-04-29 08:51:21 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
f4679e2748 fix(powerdns): enable gpgsql-dnssec for DNSSEC API (1.0.6)
Without `gpgsql-dnssec=yes` the gpgsql backend driver does not expose
the DNSSEC API surface — `PUT /zones/<zone>` with `dnssec:true` returns
422 "no DNSSEC-capable backends are loaded". This blocks pool-domain-
manager from enabling DNSSEC on every Sovereign child zone (mandatory
per docs/PLATFORM-POWERDNS.md).

Fix lands in additionalConfig so the directive is rendered alongside
`default-soa-edit-signed=INCEPTION-EPOCH` and `direct-dnskey=yes`. No
schema migration needed — the gpgsql 5.0.3 schema already includes the
cryptokeys table; the missing piece was just the backend feature flag.

Bumps Chart.yaml to 1.0.6. Verified: after this lands the PUT call
returns 204 and POST /cryptokeys mints a usable KSK.

Discovered while bringing up openova#168 (PDM per-Sovereign zones).
2026-04-29 08:42:18 +02:00
hatiyildiz
fa84cac438 fix(powerdns): plain ALTER TABLE in postInitSQL (avoid $$ escape battle, 1.0.5)
The DO block in 1.0.4 rendered with $$ collapsed to $ by the time it
reached CNPG's postInitApplicationSQL — "syntax error at or near $".
Both Helm template processing and the YAML scalar block were chewing on
the dollar signs.

Replaced with explicit ALTER TABLE statements (one per gpgsql table) +
GRANT — same end state, no PL/pgSQL quoting required. Verified at
runtime on contabo-mkt: the powerdns Pod went from CrashLoopBackOff to
Running 1/1 immediately after the ALTER statements were run by hand.
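
As an illustration of why the plain statements survive templating, the sketch below generates them with no dollar-quoting at all. The helper name is hypothetical, and the table list passed in is an assumption (only `domains` and `cryptokeys` are named in these commits):

```shell
#!/usr/bin/env bash
# Sketch only: print one ALTER TABLE ... OWNER TO statement per table,
# followed by the schema-level GRANT. No $$ quoting appears in the output.
# $1 = owner role, remaining arguments = table names.
emit_owner_sql() {
  owner="$1"
  shift
  for table in "$@"; do
    printf 'ALTER TABLE %s OWNER TO %s;\n' "$table" "$owner"
  done
  printf 'GRANT ALL PRIVILEGES ON SCHEMA public TO %s;\n' "$owner"
}
```

Calling `emit_owner_sql pdns domains cryptokeys` prints two ALTER TABLE lines and one GRANT line, the kind of text that can be pasted into postInitApplicationSQL without escaping.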

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:17:28 +02:00
hatiyildiz
214a3e1ada fix(powerdns): grant table ownership to pdns user in CNPG bootstrap (1.0.4)
Verified at runtime on Contabo-mkt: postInitApplicationSQL runs as the
postgres superuser, not the application owner, so the schema tables
created by the bootstrap block were owned by postgres. PowerDNS connects
as 'pdns' and got 'permission denied for table domains' on the first
SELECT against the zone cache.

Added a DO block at the end of the schema bootstrap that walks every
table in the public schema and ALTERs OWNER TO {{ .Values.postgres.cluster.owner }}
plus GRANT ALL PRIVILEGES ON SCHEMA public, the same shape PDM uses. The
contabo-mkt cluster verified the fix at runtime: the powerdns Pod went
from CrashLoopBackOff to 1/1 Ready immediately after the same DDL was
run by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:14:12 +02:00
hatiyildiz
db20e9d42b fix(powerdns): dnsdist backend resolution + drop DnstapLogAction (1.0.3)
dnsdist 1.9.14 runtime errors:
  1. newServer{address='powerdns:5353'} → "Unable to convert presentation
     address" — dnsdist's address parser expects IP[:port], not a DNS
     name. Kubernetes auto-injects POWERDNS_SERVICE_HOST as an env var
     into every pod in the same namespace as the powerdns Service; using
     that gives us the ClusterIP at config-load time without needing an
     init container or runtime DNS resolution.
  2. DnstapLogAction(name, bool, fn) signature changed in 1.9 — the
     2nd parameter now expects a shared_ptr to a RemoteLoggerInterface,
     not a boolean. Rather than wire up a remote dnstap server (which
     adds a moving part for marginal observability gain), drop the line.
     Catalyst observability is the dnsdist /metrics endpoint surfaced
     to Prometheus + the k8s container log.
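
The address constraint in item 1 can be sketched as a small pre-flight check. This is an illustration in shell, not the chart's actual dnsdist configuration; the helper name is hypothetical and the check is deliberately loose (digits and dots only, not a full IPv4 validator):

```shell
#!/usr/bin/env bash
# Sketch only: compose IP:port for a dnsdist backend, rejecting any host that
# contains characters other than digits and dots (for example a DNS name).
# $1 = host, $2 = port (defaults to 5353, the port used in this chart).
backend_address() {
  host="$1"
  port="${2:-5353}"
  case "$host" in
    '' | *[!0-9.]*)
      echo "not an IP address: $host" >&2
      return 1
      ;;
  esac
  printf '%s:%s\n' "$host" "$port"
}
```

Passing `"$POWERDNS_SERVICE_HOST"` yields the ClusterIP form dnsdist accepts, while passing the Service name `powerdns` is rejected before it reaches the config.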

Bumped chart to 1.0.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:12:27 +02:00
hatiyildiz
20c0543806 fix(powerdns): correct dnsdist image tag + drop readOnlyRootFilesystem (1.0.2)
Two runtime issues caught during first contabo-mkt rollout:

1. dnsdist image tag was "1.9" (default) — that tag doesn't exist in
   docker.io/powerdns/dnsdist-19. The 1.9.x line publishes 1.9.0 .. 1.9.14
   (no rolling "1.9" alias). Pinned to 1.9.14 (current latest).

2. PowerDNS pod crash-looped on Errno 30 (Read-only file system:
   /etc/powerdns/pdns.d/0-api.conf.conf). The upstream pdns_server-startup
   script writes rendered config files to /etc/powerdns/pdns.d/ at
   container start, and the upstream template doesn't expose an emptyDir
   we could redirect that path to. Set readOnlyRootFilesystem=false with
   a verbose comment explaining why; the rest of the security context
   (runAsNonRoot, runAsUser=953, drop ALL caps) stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:06:39 +02:00
hatiyildiz
19d926bfeb fix(powerdns): avoid recursive include in dnsdist checksum, bump to 1.0.1
Helm flagged dnsdist.yaml's checksum/config annotation as a recursive
template self-reference (the file included itself). Replaced with a
hash of the rendered .Values.dnsdist.config (post-tpl), which is the
substantive content the annotation is supposed to track anyway.

Bumped Chart.yaml to 1.0.1 so the OCIRepository semver "1.x" picks
up the fix automatically on next reconcile. Blueprint API version stays
at 1.0.0 (Blueprint contract is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:02:53 +02:00
hatiyildiz
0190c60520 feat(powerdns): bp-powerdns wrapper chart + per-Sovereign zone model (#167)
Introduces the bp-powerdns Catalyst Blueprint wrapper as the authoritative
DNS service for every Sovereign zone. Replaces k8gb in componentGroups.ts —
PowerDNS Lua records cover geo + health-checked failover natively, removing
the dedicated GSLB controller.

Wrapper chart (platform/powerdns/chart/):
  - Chart.yaml — bp-powerdns 1.0.0, depends on pschichtel/powerdns 0.10.0
    upstream (verified Artifact Hub publisher, tracks docker.io/powerdns/
    pdns-auth-50 at appVersion 5.0.3 — surveyed Artifact Hub, no official
    PowerDNS chart exists)
  - values.yaml — 3 replicas, gpgsql backend, DNSSEC ECDSAP256SHA256,
    lua-records ON, dnsdist 100 qps default per source IP, REST API at
    pdns.openova.io/api behind Traefik basicAuth
  - blueprint.yaml — Catalyst metadata, visibility=unlisted (mandatory
    infra), section pts-3-2-gitops-and-iac
  - templates/cnpg-cluster.yaml — separate `pdns-pg` Postgres (1 instance,
    5Gi, postgres-16) with PowerDNS auth-5.0.3 schema applied via
    postInitApplicationSQL
  - templates/dnsdist.yaml — companion Deployment + ConfigMap with
    rate-limiting policy (MaxQPSIPRule per source IP)
  - templates/api-ingress.yaml — Traefik Ingress + basicAuth Middleware
  - templates/anycast-endpoint.yaml — placeholder Service of type
    LoadBalancer (Phase-0 stand-in for the anycast Floating IP target state)
  - templates/crossplane-floatingip.yaml — DISCLOSED GAP: target-state
    XHetznerFloatingIP composite, disabled by default until the
    Crossplane composition is authored (the existing compositions cover
    Server/Network/Firewall/LoadBalancer/PoolAllocation only). The
    placeholder anycast Service is the operational stand-in.

Per docs/INVIOLABLE-PRINCIPLES.md:
  - #4 (never hardcode): every value flows from values.yaml or a
    referenced K8s Secret. Image tags come from upstream chart appVersion,
    never duplicated.
  - #8 (disclose every divergence): the XHetznerFloatingIP gap is
    documented in the template + in docs/PLATFORM-POWERDNS.md ("Anycast
    deferral" section).

componentGroups.ts: powerdns added to SPINE group as mandatory (depends on
cnpg). external-dns now lists powerdns as a dependency. k8gb removed.

docs/PLATFORM-POWERDNS.md: per-Sovereign zone model, DNSSEC posture, REST
API contract, lua-records GSLB pattern, dnsdist policy, anycast deferral
runbook, first-deploy procedure for Contabo-mkt.

Closes #167 (Phase 1 of public-repo work; Phase 4 cluster manifest lands
in openova-private feat/powerdns-deploy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 07:49:51 +02:00
hatiyildiz
31b03ce02a ci(pdm)+platform(crossplane): build workflow + XDynadotPoolAllocation composition (Phase 3+4 of #163)
CI workflow (.github/workflows/pool-domain-manager-build.yaml) mirrors
the marketplace-api / catalyst-api shape:

  - Triggers on push to core/pool-domain-manager/** + workflow_dispatch
  - Runs unit tests (reserved + dynadot — the integration suite needs a
    real Postgres which the workflow does not provide; full integration
    runs in test-bootstrap-api.yaml against an ephemeral CNPG)
  - Builds and pushes ghcr.io/openova-io/openova/pool-domain-manager:<sha>
  - Cosign-signs the image via Sigstore keyless OIDC (id-token: write)
  - Emits an SBOM attestation tied to the image digest
  - Manifest deployment is intentionally NOT in this workflow — PDM
    manifests live in the openova-private repo per the issue body, so
    the Flux Kustomization there picks up the new SHA via a follow-up
    private-repo commit (Phase 6 of #163)

Crossplane composition (platform/crossplane/compositions/xrd-pool-
allocation.yaml + composition-pool-allocation.yaml) wraps PDM as a
declarative Crossplane Resource:

  apiVersion: compose.openova.io/v1alpha1
  kind: XDynadotPoolAllocation
  spec:
    parameters:
      poolDomain:    omani.works
      subdomain:     omantel
      sovereignFQDN: omantel.omani.works
      loadBalancerIP: 1.2.3.4
      createdBy:     crossplane

The Composition uses provider-http (crossplane-contrib/provider-http) to
render the XR into a Reserve → Commit sequence of HTTP calls against
PDM's in-cluster service URL. Per docs/INVIOLABLE-PRINCIPLES.md #3 we use
provider-http rather than bespoke Go to keep the day-2 lifecycle
declarative. Operators who want to pre-allocate a name (e.g. reserve
'omantel.omani.works' for a Sovereign that hasn't been provisioned yet)
commit YAML to Git and Flux+Crossplane converge.

Refs: #163
2026-04-29 06:46:11 +02:00
hatiyildiz
8886eff708 Merge branch 'feat/group-g-dns-finish-v3'
Group G DNS finish (v3): #110 (Dynadot multi-domain table-driven tests),
#112 (catalyst-dns httptest-mocked Dynadot coverage), #113 (cert-manager
LE DNS-01 + HTTP-01 ClusterIssuer templates with operator runbook for
the cert-manager-dynadot-webhook gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:45:35 +02:00
hatiyildiz
97e942e0bc feat(cert-manager): #113 — Let's Encrypt DNS-01 + HTTP-01 ClusterIssuers
Adds platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml
with two ClusterIssuers, both Catalyst-curated, rendered conditionally
from values.yaml:

- letsencrypt-dns01-prod (TARGET STATE, default disabled) — ACME DNS-01
  via the cert-manager webhook solver, pointing at a future
  `cert-manager-dynadot-webhook` Catalyst binary that will implement the
  webhook.acme.cert-manager.io/v1alpha1 contract against the existing
  internal/dynadot/ package. Shipping the issuer template ahead of the
  webhook so cluster overlays only need a values flip + secret ref —
  no template edits — once the webhook lands.

- letsencrypt-http01-prod (INTERIM, default enabled) — ACME HTTP-01
  via the cilium ingress class. Issues certs for the explicit hostnames
  (console, gitea, harbor, admin, api) but NOT for wildcards; the
  canonical *.<sub>.<domain> record needs DNS-01.
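
The split between the two issuers can be sketched as a lookup on the hostname. The helper is hypothetical; it only encodes the rule stated above that wildcard names need the DNS-01 issuer:

```shell
#!/usr/bin/env bash
# Sketch only: pick the ClusterIssuer name for a certificate hostname.
# Wildcard names need DNS-01; explicit hostnames can use HTTP-01.
issuer_for_host() {
  case "$1" in
    '*.'*) echo "letsencrypt-dns01-prod" ;;
    *) echo "letsencrypt-http01-prod" ;;
  esac
}
```

For example, `console.omantel.omani.works` maps to the interim HTTP-01 issuer, while `*.omantel.omani.works` maps to the DNS-01 issuer that stays disabled until the webhook lands.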

Header comment explains the gap: the Catalyst external-dns webhook
(products/catalyst/bootstrap/api/cmd/external-dns-dynadot-webhook/)
implements a DIFFERENT RPC contract (records.list/add/delete) than what
cert-manager DNS-01 expects (Present/CleanUp on ChallengeRequest CRD),
so it cannot be reused; a dedicated cmd/cert-manager-dynadot-webhook/
must be built. Operator runbook for cutover is in the file header.

values.yaml gains a `certManager.issuers.{email,acmeServer,dns01,http01}`
section so all knobs are runtime-configurable per
docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode); cluster overlays in
clusters/<sovereign>/ can flip dns01.enabled via the bp-catalyst-platform
umbrella's values without rebuilding the Blueprint OCI artifact.

blueprint.yaml gains a spec.outputs section advertising:
- issuerName: letsencrypt-http01-prod (default)
- wildcardIssuerName: letsencrypt-dns01-prod (target state)
- issuerKind: ClusterIssuer

so dependent Blueprints (cilium-gateway, harbor, gitea) can consume the
issuer name without hardcoding it.

Closes #113.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:44:56 +02:00
hatiyildiz
c07e0ad1ee feat(external-dns): #109 — author bp-external-dns leaf chart for OCI publish
The bp-catalyst-platform umbrella (issue #104) declares a dependency on
bp-external-dns:1.0.0 — but the chart didn't exist; only README + Dynadot
multi-domain policy lived under platform/external-dns/. Without this leaf
the umbrella's `helm dependency build` fails (verified in run 25068433765).

This commit authors the minimal target-state leaf:
- Chart.yaml: name=bp-external-dns, version=1.0.0
- values.yaml: catalystBlueprint.upstream metadata (external-dns 1.15.0
  from kubernetes-sigs/external-dns Helm repo) + Catalyst-curated values
  overlay (sources, txtOwnerId, ServiceMonitor, RBAC, resources)

Per BLUEPRINT-AUTHORING.md §3, leaf charts are pure values-overlay wrappers:
no templates dir, just Chart.yaml + values.yaml with the catalystBlueprint
metadata block read by the bootstrap-kit installer at helm-install time.

Per-Sovereign provider/zone/credential overrides are overlaid by the
Crossplane Composition that materializes the HelmRelease — keeping this
chart provider-agnostic (no hardcoded Cloudflare/Dynadot/Hetzner choice
per INVIOLABLE-PRINCIPLES.md §4).

After this lands, blueprint-release.yaml will publish
ghcr.io/openova-io/bp-external-dns:1.0.0 and the next umbrella push will
resolve all 11 leaf deps successfully.
2026-04-28 19:42:23 +02:00
hatiyildiz
f0fe3006ba feat(external-dns): #109 — Catalyst-curated dynadot-multi-domain policy
Adds platform/external-dns/policies/dynadot-multi-domain.yaml — the
canonical external-dns + dynadot webhook deployment that ships in every
Sovereign on an OpenOva pool domain.

Why a webhook: external-dns has no upstream Dynadot provider; the
canonical pattern is the webhook RPC contract, with a sidecar that
implements the provider in our preferred language. We reuse the same
internal/dynadot/ package the catalyst-api uses, so the never-wipe rule,
record encoding, and managed-domain allowlist are identical on both
write paths (per docs/INVIOLABLE-PRINCIPLES.md #2 — no duplicate
implementations of the same concern).

Multi-domain:
- One --domain-filter per zone in the external-dns args; adding a third
  pool domain (e.g. acme.io) is a one-line edit here PLUS a one-key edit
  on dynadot-api-credentials' `domains` field. No webhook rebuild.
- Webhook reads DYNADOT_MANAGED_DOMAINS from the same secret with
  optional=true, preserving backward compatibility with the legacy
  single-`domain` secret shape (pre-#108).

TXT registry:
- --txt-owner-id=$(SOVEREIGN_FQDN), --txt-prefix=_externaldns.<sub>.
- Cluster overlays substitute SOVEREIGN_FQDN via the bp-catalyst-platform
  umbrella so two clusters sharing a parent zone (alpha.omani.works,
  beta.omani.works) cannot collide.
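
A rough sketch of how the multi-domain arguments could be assembled, in Python for illustration only (the policy itself is YAML). The flags `--domain-filter`, `--txt-owner-id` and `--txt-prefix` are the ones named above. Everything else is an assumption: the helper names, a comma-separated format for DYNADOT_MANAGED_DOMAINS, the legacy variable name `DYNADOT_DOMAIN`, and the example zone `example.org`.

```python
def managed_domains(env: dict) -> list:
    """Return the managed zones, preferring the multi-domain variable.

    Assumptions: DYNADOT_MANAGED_DOMAINS is comma-separated, and the legacy
    single-domain secret is exposed as DYNADOT_DOMAIN.
    """
    raw = env.get("DYNADOT_MANAGED_DOMAINS", "")
    domains = [d.strip().lower() for d in raw.split(",") if d.strip()]
    if not domains and env.get("DYNADOT_DOMAIN"):
        domains = [env["DYNADOT_DOMAIN"].strip().lower()]
    return domains


def external_dns_args(domains: list, sovereign_fqdn: str, txt_prefix: str) -> list:
    """Build external-dns arguments: one --domain-filter per zone plus TXT registry flags."""
    args = [f"--domain-filter={d}" for d in domains]
    args.append(f"--txt-owner-id={sovereign_fqdn}")
    args.append(f"--txt-prefix={txt_prefix}")
    return args


env = {"DYNADOT_MANAGED_DOMAINS": "omani.works, example.org"}
for arg in external_dns_args(managed_domains(env), "alpha.omani.works", "_externaldns.alpha."):
    print(arg)
```

Because the owner ID is the Sovereign FQDN, a second cluster such as beta.omani.works would produce a different `--txt-owner-id` and would not claim the first cluster's records.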

Closes #109.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:45:53 +02:00
hatiyildiz
046e5ebc18 feat(day2-iac): Crossplane Compositions + per-Sovereign Flux cluster tree + catalyst-dns binary
Group F deliverables — completes the day-2 IaC layer that takes over after OpenTofu's Phase 0 hand-off (per docs/SOVEREIGN-PROVISIONING.md §4).

Four artifacts:

1. platform/crossplane/compositions/ — XRDs + Compositions for canonical Hetzner resources
   under the canonical compose.openova.io/v1alpha1 group (per BLUEPRINT-AUTHORING.md §8):
   - XHetznerNetwork + composition-network.yaml — wraps hcloud_network + subnet
   - XHetznerFirewall + composition-firewall.yaml
   - XHetznerServer + composition-server.yaml
   - XHetznerLoadBalancer + composition-loadbalancer.yaml (lb11, 80→31080, 443→31443)
   - README documenting the canonical pattern

2. clusters/_template/ — the canonical per-Sovereign Flux Kustomization tree.
   Copied to clusters/<sovereign-fqdn>/ at provisioning time; cloud-init's
   GitRepository points at the result.
   - kustomization.yaml (root: flux-system + infrastructure + bootstrap-kit)
   - flux-system/ (placeholder for Flux self-config customization)
   - infrastructure/ (provider-hcloud + ProviderConfig referencing hcloud-credentials secret OpenTofu writes)
   - bootstrap-kit/ — 11 HelmRelease manifests in dependency order:
     01-cilium → 02-cert-manager → 03-flux → 04-crossplane → 05-sealed-secrets
     → 06-spire → 07-nats-jetstream → 08-openbao → 09-keycloak → 10-gitea → 11-bp-catalyst-platform
     Each pulls from oci://ghcr.io/openova-io/bp-<name>:1.0.0 — the wrapper charts published by blueprint-release CI.
     dependsOn declarations enforce the canonical install order at runtime.

3. clusters/omantel.omani.works/ — the first concrete Sovereign instance.
   Mirror of _template with SOVEREIGN_FQDN_PLACEHOLDER substituted to omantel.omani.works.
   This is what the wizard's first omantel.omani.works run will actually reconcile.

4. products/catalyst/bootstrap/api/cmd/catalyst-dns/main.go — small Go binary the
   OpenTofu module's null_resource.dns_pool invokes via local-exec at Phase-0 apply time.
   Reads DYNADOT_API_KEY/SECRET/DOMAIN/SUBDOMAIN/LB_IP env vars; calls existing dynadot.Client.AddSovereignRecords. Containerfile already builds + ships it at /usr/local/bin/catalyst-dns.

Architectural compliance (Lesson #24 closed):
- No bespoke Go cloud-API calls (Crossplane Compositions are the canonical day-2 IaC)
- No exec.Command("helm", ...) (Flux HelmReleases are the canonical install unit)
- No kubectl apply from outside (cloud-init kubectl-applies one Flux GitRepository, then Flux owns everything)

After this commit, the path is end-to-end: wizard → catalyst-api → tofu apply (with infra/hetzner/) → cloud-init installs k3s + Flux + applies GitRepository pointing at clusters/omantel.omani.works/ → Flux reconciles bootstrap-kit (11 HelmReleases in dependency order) → Crossplane adopts day-2 management.
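
The install order that the dependsOn declarations enforce can be sketched as a topological sort. This is an illustration in Python, not how Flux is implemented. It assumes each release depends only on its immediate predecessor in the numbered list, which the message does not state, and it uses the short names from the manifest file names.

```python
from graphlib import TopologicalSorter

# Release names in the order listed for clusters/_template/bootstrap-kit/.
BOOTSTRAP_KIT = [
    "cilium", "cert-manager", "flux", "crossplane", "sealed-secrets",
    "spire", "nats-jetstream", "openbao", "keycloak", "gitea",
    "bp-catalyst-platform",
]

# Assumption: a strictly linear chain, each release depending on the previous one.
depends_on = {
    name: ({BOOTSTRAP_KIT[i - 1]} if i > 0 else set())
    for i, name in enumerate(BOOTSTRAP_KIT)
}


def install_order(graph: dict) -> list:
    """Return an order in which every release follows all of its dependencies."""
    # static_order() raises graphlib.CycleError if the declarations form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(" -> ".join(install_order(depends_on)))
```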
2026-04-28 14:09:29 +02:00
hatiyildiz
62d9c7d936 fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.

Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.

This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)

Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.
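
A minimal sketch of the metadata-driven install described above, in Python for illustration (the real installer is the Go code in bootstrap.go). The `Upstream` dataclass and the `helm_commands` helper are hypothetical names. The field names mirror `catalystBlueprint.upstream.{chart,version,repo}`, and the Cilium values are the ones quoted in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    """Mirror of the catalystBlueprint.upstream metadata block in values.yaml."""
    chart: str
    version: str
    repo: str


def helm_commands(release: str, repo_alias: str, upstream: Upstream) -> list:
    """Build the two helm invocations; the overlay values are expected on stdin."""
    return [
        ["helm", "repo", "add", repo_alias, upstream.repo, "--force-update"],
        ["helm", "install", release, f"{repo_alias}/{upstream.chart}",
         "--version", upstream.version, "--values", "-"],
    ]


cilium = Upstream(chart="cilium", version="1.16.5", repo="https://helm.cilium.io")
for cmd in helm_commands("cilium", "cilium", cilium):
    print(" ".join(cmd))
```

Keeping the upstream reference as data is what would let the remaining install functions share one code path instead of hardcoding a chart per function.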

After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
2026-04-28 12:57:29 +02:00
hatiyildiz
441ebaebb8 fix(charts): pin upstream chart versions/names to ones that exist in their repos
The first Blueprint Release CI run (commit 8c0f766) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories:

- platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0.
- platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally).
- platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1.
- platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed.

Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability.

Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions.

After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.
2026-04-28 12:55:21 +02:00