Commit Graph

7 Commits

Author SHA1 Message Date
e3mrah
83ec889f06
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)

Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.

Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:21:53 +04:00
e3mrah
05cb39c042
fix(bp-flux): catalyst-cluster-reconciler ClusterRoleBinding overlay (closes #338) (#393)
PROBLEM
-------
On Sovereign-1 (otech.omani.works, 2026-04-30) every HelmRelease that
transitioned through pending-install/pending-upgrade got stuck because
the helm-controller SA could not UPDATE its own helm-storage Secrets
(sh.helm.release.v1.<name>.<n>) in flux-system. Symptom:

  secrets "sh.helm.release.v1.catalyst-platform.v1" is forbidden:
  User "system:serviceaccount:flux-system:helm-controller" cannot
  update resource "secrets" in API group "" in the namespace "flux-system"

Runtime workaround on otech (added 2026-04-30): manual ClusterRoleBinding
flux-system-helm-controller-admin → cluster-admin → flux-system/helm-controller.
Tracked as the permanent fix in #338.

FIX
---
Add platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml — a
Catalyst-managed ClusterRoleBinding (catalyst-cluster-reconciler) that
binds cluster-admin to helm-controller AND kustomize-controller in
.Values.catalyst.fluxNamespace (default flux-system). Independent from
the upstream subchart's cluster-reconciler binding (different name, no
ownership conflict), so if the upstream binding ever drifts again the
overlay still holds the cluster correct.

WHY cluster-admin (not narrower)
--------------------------------
helm-controller installs arbitrary user-supplied Helm charts which can
ship any K8s resource (CRDs, ClusterRoles, MutatingWebhookConfigurations,
etc.). There is no narrower role that satisfies the full install path.
The Flux project's own bootstrap install.yaml binds cluster-admin for
the same reason (upstream default multitenancy.privileged=true).
Multi-tenancy lockdown is a Sovereign Day-2 hardening choice tracked
separately.

NEVER-HARDCODE COMPLIANCE
-------------------------
Per docs/INVIOLABLE-PRINCIPLES.md #4, the namespace is operator-overridable
via .Values.catalyst.fluxNamespace. Default is flux-system because that's
the canonical Catalyst install namespace (matches cloud-init's flux2
install.yaml + clusters/_template/bootstrap-kit/03-flux.yaml).

VERSION
-------
- bp-flux 1.1.2 → 1.1.3 (Chart.yaml + blueprint.yaml + 3 bootstrap-kit refs).
- The flux2 subchart pin (2.14.1) is unchanged — version-pin replay test
  remains green (cloud-init v2.4.0 == subchart appVersion 2.4.0).

VERIFICATION
------------
- platform/flux/chart/tests/version-pin-replay.sh — all 6 cases PASS.
- platform/flux/chart/tests/observability-toggle.sh — all 3 cases PASS.
- helm template renders the new ClusterRoleBinding with correct subjects
  (flux-system by default; verified --set catalyst.fluxNamespace=custom
  override path).
- scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles.

FILES
-----
- platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml (new)
- platform/flux/chart/Chart.yaml (1.1.2 → 1.1.3)
- platform/flux/chart/values.yaml (catalyst.fluxNamespace default)
- platform/flux/blueprint.yaml (1.1.2 → 1.1.3)
- clusters/{_template,otech.omani.works,omantel.omani.works}/bootstrap-kit/03-flux.yaml (chart version)
- docs/lessons-learned/helm-controller-rbac.md (permanent-fix note)
- docs/omantel-handover-wbs.md (#338 status row)

Refs: #43 #369 #338
Lesson: docs/lessons-learned/helm-controller-rbac.md

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-01 15:56:45 +04:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
1ddd569789 fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
  .enabled → false. tests/observability-toggle.sh extended with Case 4
  (default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added bdeb0f54 + the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
43aff20254 feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/
bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
62d9c7d936 fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.

Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.

This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)

Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.

After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
2026-04-28 12:57:29 +02:00
hatiyildiz
8c0f76640c feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI
Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit 07b4bcf) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule.

11 charts created with Chart.yaml + values.yaml + blueprint.yaml each:

Network + GitOps:
- platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API
- platform/flux/chart — wraps flux 2.4.0
- platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest

Security:
- platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor
- platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only)
- platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation)

Catalyst control-plane services:
- platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV)
- platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5)
- platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode)
- platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream)

New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55):
- platform/spire/README.md — workload identity Catalyst control plane component
- platform/nats-jetstream/README.md — control-plane event spine
- platform/sealed-secrets/README.md — transient bootstrap-only

Each blueprint.yaml declares:
- catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3)
- visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card)
- manifests.chart: ./chart pointer
- depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends)

.github/workflows/blueprint-release.yaml:
- New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder)
- Triggers on push to main touching platform/*/chart/** or products/*/chart/**
- detect job: emits matrix of changed Blueprint folders via git diff
- build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation
- Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance

Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live.

After this commit lands, the bootstrap-kit installer in commit 07b4bcf has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR.

Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.
2026-04-28 12:51:06 +02:00