fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)

Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalling
a different version on top of it.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hatiyildiz 2026-04-29 19:38:17 +02:00
parent 6f5e14192c
commit b0c1c07271
10 changed files with 409 additions and 19 deletions

@ -3,7 +3,29 @@
# Wrapper chart: platform/flux/chart/
# Catalyst-curated values: platform/flux/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
#
# DOUBLE-INSTALL SAFETY (omantel.omani.works incident, 2026-04-29)
# ----------------------------------------------------------------
# Cloud-init pre-installs Flux core via
# curl https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
# so that this very HelmRelease can be reconciled. helm-controller then
# runs `helm install` for bp-flux on top of the already-running Flux. If
# the chart's subchart `flux2` version disagrees with the cloud-init
# install (different upstream Flux release), CRD `storedVersions`
# mismatches → Helm install fails → rollback → rollback DELETES the
# running Flux controllers → cluster has no GitOps engine and is
# unrecoverable in-place.
#
# Mitigations applied here:
# 1. bp-flux:1.1.2 pins the `flux2` subchart at 2.14.1 (= appVersion
# 2.4.0) which matches cloud-init's v2.4.0 install.yaml.
# 2. spec.upgrade.preserveValues: true — never silently overwrite
# operator overlays on upgrade.
# 3. spec.install.disableTakeOwnership: false (explicit) — helm-
# controller adopts the cloud-init-installed objects rather than
# re-creating, so install is non-destructive when objects already
# exist with matching apiVersion/kind/name.
# See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
---
apiVersion: v1
kind: Namespace
@ -38,14 +60,26 @@ spec:
chart:
spec:
chart: bp-flux
version: 1.1.1
version: 1.1.2
sourceRef:
kind: HelmRepository
name: bp-flux
namespace: flux-system
install:
# Adopt cloud-init-installed Flux objects rather than fail on
# ownership conflict (the objects exist before the HelmRelease ever
# reconciles). Without this, the very first reconcile would error
# with "object already exists" on every Flux controller Deployment.
disableTakeOwnership: false
remediation:
retries: 3
upgrade:
# Keep operator-supplied values (e.g. resource overrides applied via
# helm-controller out-of-band, or dry-run patches during incident
# response) on chart upgrades. Without this, every upgrade would
# reset the chart to default values, masking operator state.
preserveValues: true
# Match install behaviour — adopt rather than fail on conflict.
disableTakeOwnership: false
remediation:
retries: 3

@ -43,7 +43,7 @@ spec:
chart:
spec:
chart: bp-catalyst-platform
version: 1.1.1
version: 1.1.2
sourceRef:
kind: HelmRepository
name: bp-catalyst-platform

@ -3,7 +3,29 @@
# Wrapper chart: platform/flux/chart/
# Catalyst-curated values: platform/flux/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
#
# DOUBLE-INSTALL SAFETY (omantel.omani.works incident, 2026-04-29)
# ----------------------------------------------------------------
# Cloud-init pre-installs Flux core via
# curl https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
# so that this very HelmRelease can be reconciled. helm-controller then
# runs `helm install` for bp-flux on top of the already-running Flux. If
# the chart's subchart `flux2` version disagrees with the cloud-init
# install (different upstream Flux release), CRD `storedVersions`
# mismatches → Helm install fails → rollback → rollback DELETES the
# running Flux controllers → cluster has no GitOps engine and is
# unrecoverable in-place.
#
# Mitigations applied here:
# 1. bp-flux:1.1.2 pins the `flux2` subchart at 2.14.1 (= appVersion
# 2.4.0) which matches cloud-init's v2.4.0 install.yaml.
# 2. spec.upgrade.preserveValues: true — never silently overwrite
# operator overlays on upgrade.
# 3. spec.install.disableTakeOwnership: false (explicit) — helm-
# controller adopts the cloud-init-installed objects rather than
# re-creating, so install is non-destructive when objects already
# exist with matching apiVersion/kind/name.
# See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
---
apiVersion: v1
kind: Namespace
@ -38,14 +60,26 @@ spec:
chart:
spec:
chart: bp-flux
version: 1.1.1
version: 1.1.2
sourceRef:
kind: HelmRepository
name: bp-flux
namespace: flux-system
install:
# Adopt cloud-init-installed Flux objects rather than fail on
# ownership conflict (the objects exist before the HelmRelease ever
# reconciles). Without this, the very first reconcile would error
# with "object already exists" on every Flux controller Deployment.
disableTakeOwnership: false
remediation:
retries: 3
upgrade:
# Keep operator-supplied values (e.g. resource overrides applied via
# helm-controller out-of-band, or dry-run patches during incident
# response) on chart upgrades. Without this, every upgrade would
# reset the chart to default values, masking operator state.
preserveValues: true
# Match install behaviour — adopt rather than fail on conflict.
disableTakeOwnership: false
remediation:
retries: 3

@ -43,7 +43,7 @@ spec:
chart:
spec:
chart: bp-catalyst-platform
version: 1.1.1
version: 1.1.2
sourceRef:
kind: HelmRepository
name: bp-catalyst-platform

@ -128,6 +128,75 @@ The catalyst-api retains the OpenTofu state per-Sovereign in `/tmp/catalyst/tofu
---
### bp-flux double-install — version-pin invariant
**Live incident:** omantel.omani.works, 2026-04-29. The FIRST reconcile of `bp-flux` deleted the Flux controllers. The cluster lost its GitOps engine and cannot be repaired in place; the only recovery is a full reprovision.
#### What happened
1. Cloud-init runs early in the bootstrap sequence and installs Flux core via:
```
curl -fsSL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml \
| kubectl apply -f -
```
This is intentional — Flux must exist BEFORE the `flux-system/GitRepository` + `Kustomization` that pulls `clusters/<sovereign-fqdn>/bootstrap-kit/` can be reconciled.
2. Cloud-init then applies the GitRepository + Kustomization. Flux begins reconciling `clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml`, which is a `HelmRelease` for `bp-flux`.
3. helm-controller runs `helm install` for `bp-flux` against the running cluster. The chart's umbrella declares `dependencies: [{ name: flux2, version: <X> }]` — the upstream community chart that ships its own copies of Flux's CRDs and controller Deployments.
4. If the chart's subchart version ships a DIFFERENT upstream Flux release than cloud-init installed, Helm tries to update the existing Flux CRDs to a new schema. The apiserver rejects the update with:
```
status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions
```
because the version stored in the existing CRDs (from cloud-init's install) isn't in the new chart's `spec.versions`.
5. Helm rolls back the failed install. The rollback **deletes** the existing Flux controller Deployments (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller).
6. The cluster has no Flux. Every subsequent HelmRelease in the bootstrap kit halts. The cluster is unrecoverable in-place — the only fix is `tofu destroy` + reprovision.
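The admission rule behind step 4 can be checked before any chart touches the cluster. The following is a minimal sketch, not part of the shipped tooling: `stored_versions_covered` is a hypothetical helper name, and it assumes both version lists have already been collected as space-separated strings.

```shell
# Hypothetical helper (not in the repository).
# Succeeds only when every version recorded in the live CRD's
# status.storedVersions is still declared by the candidate CRD.
#   $1: space-separated status.storedVersions of the live CRD
#   $2: space-separated spec.versions[*].name of the CRD about to be applied
stored_versions_covered() {
  for v in $1; do
    case " $2 " in
      *" $v "*) ;;  # this stored version is still declared
      *)
        echo "stored version $v missing from spec.versions" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

On a live cluster the first argument could come from `kubectl get crd helmreleases.helm.toolkit.fluxcd.io -o jsonpath='{.status.storedVersions[*]}'` (the CRD name is only an example); the second would be the version names declared by the CRD the chart is about to apply. A stored `v1` checked against a candidate that only declares `v1beta1` makes the function return non-zero.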
#### The invariant
**Cloud-init's `flux2 v<X.Y.Z>/install.yaml` URL pin and the bp-flux umbrella chart's `flux2` subchart `appVersion` MUST be the same upstream Flux version.** They cannot drift.
The fluxcd-community chart's `appVersion` field is the upstream Flux release tag the chart ships. Mapping:
| cloud-init URL | community chart (`flux2` dep) | upstream `appVersion` |
|---|---|---|
| `v2.4.0` | `2.14.1` | `2.4.0` (current) |
| `v2.3.0` | `2.13.0` | `2.3.0` |
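The mapping can be re-derived rather than trusted from the table. A minimal sketch, assuming the helper names `url_flux_version`, `chart_app_version`, and `pins_match` (all hypothetical, not part of the repository):

```shell
# Extract X.Y.Z from a flux2 release install.yaml URL.
# Prints nothing when the URL carries no explicit v-tag.
url_flux_version() {
  printf '%s\n' "$1" |
    sed -nE 's#.*/releases/download/v([0-9]+\.[0-9]+\.[0-9]+)/install\.yaml.*#\1#p'
}

# Read a Chart.yaml on stdin and print its top-level appVersion, unquoted.
chart_app_version() {
  awk '/^appVersion:/ {gsub(/"/, "", $2); print $2; exit}'
}

# Succeed only when both versions are non-empty and identical.
pins_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

With Helm available, `helm show chart flux2 --repo https://fluxcd-community.github.io/helm-charts --version 2.14.1 | chart_app_version` should print the `appVersion` for that row, which `pins_match` can then compare against `url_flux_version` applied to the cloud-init URL.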
#### Where the invariant is enforced
- `infra/hetzner/cloudinit-control-plane.tftpl` — pins the install.yaml URL (currently `v2.4.0`).
- `platform/flux/chart/Chart.yaml` — pins the subchart (currently `flux2: 2.14.1`).
- `platform/flux/chart/values.yaml` `catalystBlueprint.upstream.version` mirrors the dep pin (provenance metadata).
- `platform/flux/chart/tests/version-pin-replay.sh` — CI gate; replays the catastrophic precondition and FAILS the build if the two pins ever drift.
- `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml` — the HelmRelease declares `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than re-creating them and rolling back on conflict.
#### How to bump Flux version safely
When an upgrade to a newer Flux release is desired, the bump must land in **one PR** and touch all four pin sites at once:
1. Pick the target upstream Flux version (e.g. `v2.5.1`).
2. Find the matching community chart version from `https://fluxcd-community.github.io/helm-charts/index.yaml` — match on `appVersion: 2.5.1`.
3. Update `infra/hetzner/cloudinit-control-plane.tftpl` install.yaml URL → `v2.5.1`.
4. Update `platform/flux/chart/Chart.yaml` `flux2` dep → the matching community chart version.
5. Update `platform/flux/chart/values.yaml` `catalystBlueprint.upstream.version` to match.
6. Bump `platform/flux/chart/Chart.yaml` `version:` (semver patch).
7. Update `clusters/_template/bootstrap-kit/03-flux.yaml` and every `clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml` to the new bp-flux version.
8. Run `bash platform/flux/chart/tests/version-pin-replay.sh` locally — must pass.
9. PR; `blueprint-release.yaml` rebuilds bp-flux; subchart-guard CI must be green.
The `version-pin-replay.sh` test is the gate. CI rejects any PR that bumps one pin without the other.
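Step 7 is the easiest one to miss, because every Sovereign carries its own copy of the manifest. A minimal sketch of a pre-PR check, assuming a hypothetical helper `stale_flux_releases` and the `chart: bp-flux` / `version:` layout used in `03-flux.yaml`:

```shell
# Hypothetical helper (not in the repository).
# Lists every bootstrap-kit/03-flux.yaml whose bp-flux chart version differs
# from the target, one "<path> <found-version>" pair per line.
#   $1: target bp-flux version, e.g. 1.1.2
#   $2: path to the clusters/ directory
stale_flux_releases() {
  find "$2" -type f -path '*/bootstrap-kit/03-flux.yaml' | sort |
    while IFS= read -r f; do
      # Take the first `version:` that follows `chart: bp-flux`.
      v=$(awk '$1 == "chart:" && $2 == "bp-flux" {hit = 1; next}
               hit && $1 == "version:" {gsub(/"/, "", $2); print $2; exit}' "$f")
      [ "$v" = "$1" ] || printf '%s %s\n' "$f" "${v:-<none>}"
    done
}
```

Running `stale_flux_releases 1.1.2 clusters` from the repository root prints one line per manifest still on another version, and nothing when all of them agree.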
#### Existing Sovereigns
Sovereigns provisioned before this fix (any cluster running `bp-flux:1.1.1` or earlier with the `flux2: 2.13.0` subchart against a `v2.4.0` cloud-init install) are at risk on next bp-flux reconcile and may already be broken. The recovery procedure is full reprovision (`tofu destroy` → `tofu apply` with the corrected manifests). There is no in-place recovery for a cluster whose Flux controllers have been deleted by a Helm rollback.
The omantel.omani.works cluster used to live-verify the failure mode is currently in this state and is being held for reprovision against `bp-flux:1.1.2`.
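Whether a Sovereign has already been hit can be judged from its Deployments. A minimal sketch, assuming a hypothetical helper `missing_flux_controllers` fed with Deployment names on stdin; the list of six names is taken from the incident description above:

```shell
# Hypothetical helper (not in the repository).
# Reads Deployment names on stdin (one per line) and prints every Flux
# controller from the incident list that is absent. Prints nothing when
# all six are present.
missing_flux_controllers() {
  present=" $(tr '\n' ' ') "
  for c in helm-controller source-controller kustomize-controller \
           image-automation-controller image-reflector-controller \
           notification-controller; do
    case "$present" in
      *" $c "*) ;;               # controller Deployment still exists
      *) printf '%s\n' "$c" ;;   # controller Deployment is gone
    esac
  done
}
```

For example, `kubectl -n flux-system get deployments -o name | sed 's#.*/##' | missing_flux_controllers` prints the controllers that are gone; empty output means all six are still present. A cluster that never ran the image controllers would report them as missing too, so read the output against what was originally installed.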
---
### Phase 1 watch shows 0 HelmReleases
**Symptom.** The wizard's progress page reaches `flux-bootstrap` successfully, then the Sovereign Admin banner shows the warning:

@ -337,8 +337,30 @@ runcmd:
# Install Flux core. Cilium is now the cluster's CNI, so Flux pods will
# actually start. Flux then reconciles clusters/${sovereign_fqdn}/ which
# adopts the Helm release above as bp-cilium and continues with
# bp-cert-manager, bp-flux (host-level Flux, distinct from this Flux
# which is the CONTROL-PLANE Flux), bp-crossplane, etc.
# bp-cert-manager, bp-flux (which ADOPTS this Flux install rather than
# reinstalls — see version-pin invariant below), bp-crossplane, etc.
#
# CRITICAL VERSION-PIN INVARIANT — DO NOT CHANGE IN ISOLATION
# -----------------------------------------------------------
# The version pinned in the URL below MUST match the upstream Flux
# release that `platform/flux/chart/Chart.yaml`'s `flux2` subchart
# bundles, otherwise bp-flux's HelmRelease runs `helm install` on top
# of THIS Flux installation with a different upstream version, the
# CRD `status.storedVersions` mismatches, Helm install fails, rollback
# fires, and rollback DELETES the running Flux controllers — leaving
# the cluster with no GitOps engine, unrecoverable in-place.
#
# Live verified on omantel.omani.works on 2026-04-29 — every Sovereign
# provisioned without this pin in sync was destroyed minutes after
# bp-flux's first reconcile. See docs/RUNBOOK-PROVISIONING.md
# §"bp-flux double-install".
#
# Mapping (cloud-init install.yaml -> chart subchart -> appVersion):
# v2.4.0 -> flux2 2.14.1 -> appVersion 2.4.0 <- CURRENT
# v2.3.0 -> flux2 2.13.0 -> appVersion 2.3.0
#
# CI gate `platform/flux/chart/tests/version-pin-replay.sh` rejects
# divergence between this URL's version and the chart's subchart pin.
- 'curl -fsSL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml | kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f -'
- 'kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml -n flux-system wait --for=condition=Available --timeout=300s deployment --all'

@ -1,6 +1,6 @@
apiVersion: v2
name: bp-flux
version: 1.1.1
version: 1.1.2
description: |
Catalyst-curated Blueprint umbrella chart for Flux. Depends on the
upstream `flux2` chart (fluxcd-community) as a Helm subchart so
@ -16,11 +16,30 @@ maintainers:
email: catalyst@openova.io
# Upstream chart pulled in as a Helm subchart so `helm dependency build`
# bundles it into the OCI artifact. Pinned to fluxcd/flux2 2.13.0 (matches
# platform/flux/blueprint.yaml + values.yaml `catalystBlueprint.upstream
# .version`). Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) the
# version is operator-bumpable via PR + Blueprint release.
# bundles it into the OCI artifact.
#
# CRITICAL VERSION-PIN INVARIANT — DO NOT BUMP IN ISOLATION
# ----------------------------------------------------------
# This subchart version MUST match the upstream Flux release that
# `infra/hetzner/cloudinit-control-plane.tftpl` installs at cluster
# bootstrap time, otherwise helm-controller's reconcile of bp-flux runs
# `helm install` on top of an already-installed Flux of a DIFFERENT
# version. The CRD `status.storedVersions` mismatch fails the install,
# Helm rolls back, and the rollback DELETES the running Flux controllers
# — leaving the cluster with no GitOps engine, unrecoverable in-place.
# Live verified on omantel.omani.works on 2026-04-29 (see
# docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install").
#
# Mapping (community chart -> flux2 appVersion -> cloud-init install.yaml):
# flux2 2.14.1 -> appVersion 2.4.0 -> cloud-init v2.4.0 <- CURRENT
# flux2 2.13.0 -> appVersion 2.3.0 -> cloud-init v2.3.0
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) the version is
# operator-bumpable via PR + Blueprint release, but BOTH this dep version
# AND cloudinit-control-plane.tftpl's flux2 install.yaml URL must move
# together in the same PR. CI gate `tests/version-pin-replay.sh` rejects
# divergence.
dependencies:
- name: flux2
version: "2.13.0"
version: "2.14.1"
repository: "https://fluxcd-community.github.io/helm-charts"

@ -0,0 +1,199 @@
#!/usr/bin/env bash
# bp-flux version-pin replay test — catastrophic-failure regression guard.
#
# Live incident replay (omantel.omani.works, 2026-04-29):
# - Cloud-init pre-installed Flux core via
# https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
# - bp-flux:1.1.1 declared `flux2` subchart 2.13.0 (= upstream
# appVersion 2.3.0). MISMATCH against cloud-init's v2.4.0.
# - helm-controller ran `helm install` for bp-flux on top of the
# running v2.4.0 Flux. CRD `status.storedVersions` carried "v1"
# from the v2.4.0 install; the chart's v2.3.0 CRDs only declare
# "v1beta1". apiserver rejected the chart's CRD update with:
# status.storedVersions[0]: Invalid value: "v1": must appear in
# spec.versions
# - Helm rolled back the install — and the rollback DELETED the
# running Flux controller Deployments (helm-controller,
# source-controller, kustomize-controller, image-automation,
# image-reflector, notification-controller).
# - Cluster lost its GitOps engine. No further HelmRelease could
# progress. Catastrophic, in-place unrecoverable.
#
# This test replays the precondition for the catastrophic failure
# (version disagreement between cloud-init's flux2 install URL and the
# chart's `flux2` subchart pin) and FAILS LOUDLY if the disagreement is
# ever reintroduced.
#
# Usage: bash tests/version-pin-replay.sh [CHART_DIR]
set -euo pipefail
CHART_DIR="${1:-$(cd "$(dirname "$0")/.." && pwd)}"
# REPO_ROOT can be overridden via env (used by Case 6's self-test which
# runs against a /tmp fake chart but still needs to validate against the
# real repo's cloud-init template).
REPO_ROOT="${REPO_ROOT:-$(cd "$CHART_DIR/../../.." && pwd)}"
CLOUDINIT_TPL="$REPO_ROOT/infra/hetzner/cloudinit-control-plane.tftpl"
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
echo "[version-pin-replay] CHART_DIR=$CHART_DIR"
echo "[version-pin-replay] REPO_ROOT=$REPO_ROOT"
# ── Case 1 — Chart.yaml's flux2 subchart pin is set ──────────────────
echo "[version-pin-replay] Case 1: Chart.yaml declares the flux2 subchart with an explicit version"
chart_dep_version=$(awk '
/^dependencies:/ {in_deps=1; next}
in_deps && /name: *flux2/ {found_name=1; next}
in_deps && found_name && /version:/ {gsub(/"/, "", $2); print $2; exit}
' "$CHART_DIR/Chart.yaml")
if [ -z "$chart_dep_version" ]; then
echo "FAIL: Chart.yaml does not declare a flux2 subchart with \`version:\`. Replay precondition met (catastrophic regression)." >&2
exit 1
fi
echo " chart subchart pin: flux2 $chart_dep_version"
# ── Case 2 — cloud-init's install.yaml URL contains an explicit version tag ──
echo "[version-pin-replay] Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag"
if [ ! -f "$CLOUDINIT_TPL" ]; then
echo "FAIL: cloud-init template missing at $CLOUDINIT_TPL — cannot validate version pin." >&2
exit 1
fi
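# `|| true` keeps `set -o pipefail` from aborting on a no-match grep before
# the FAIL message below can be printed.
cloudinit_url=$(grep -oE 'https://github\.com/fluxcd/flux2/releases/download/v[0-9]+\.[0-9]+\.[0-9]+/install\.yaml' "$CLOUDINIT_TPL" | head -1 || true)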
if [ -z "$cloudinit_url" ]; then
echo "FAIL: cloud-init template at $CLOUDINIT_TPL does not pin a flux2 install.yaml URL with explicit v-tag (e.g. v2.4.0)." >&2
exit 1
fi
cloudinit_version=$(echo "$cloudinit_url" | sed -E 's|.*/v([0-9]+\.[0-9]+\.[0-9]+)/install.yaml|\1|')
echo " cloud-init flux2 install.yaml pin: v$cloudinit_version"
# ── Case 3 — chart subchart appVersion equals cloud-init install.yaml version ──
# The fluxcd-community `flux2` chart's `appVersion` field is the upstream
# Flux release tag (e.g. 2.4.0). It MUST match cloud-init's URL pin.
echo "[version-pin-replay] Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version"
subchart_tgz="$CHART_DIR/charts/flux2-${chart_dep_version}.tgz"
subchart_dir="$CHART_DIR/charts/flux2"
if [ ! -f "$subchart_tgz" ] && [ ! -d "$subchart_dir" ]; then
echo " charts/ empty — running 'helm dependency build' to fetch flux2 ${chart_dep_version}"
( cd "$CHART_DIR" && helm dependency build >"$TMP/dep-build.log" 2>&1 ) || {
echo "FAIL: helm dependency build failed:" >&2
cat "$TMP/dep-build.log" >&2
exit 1
}
fi
if [ -f "$subchart_tgz" ]; then
app_version=$(tar -xzOf "$subchart_tgz" flux2/Chart.yaml | awk '/^appVersion:/ {gsub(/"/, "", $2); print $2; exit}')
elif [ -d "$subchart_dir" ]; then
app_version=$(awk '/^appVersion:/ {gsub(/"/, "", $2); print $2; exit}' "$subchart_dir/Chart.yaml")
else
echo "FAIL: helm dependency build did not produce flux2 subchart at $subchart_tgz nor $subchart_dir" >&2
exit 1
fi
echo " subchart flux2 ${chart_dep_version}.appVersion = ${app_version}"
if [ "$app_version" != "$cloudinit_version" ]; then
cat >&2 <<EOF
FAIL: VERSION-PIN MISMATCH (catastrophic regression).
cloud-init's install.yaml URL pins upstream Flux: v${cloudinit_version}
bp-flux Chart.yaml's flux2 subchart pin (${chart_dep_version}) carries
appVersion: ${app_version}
These MUST match — bp-flux's HelmRelease will run \`helm install\` on
top of the cloud-init-installed Flux. A version mismatch makes the
CRD storedVersions update fail, Helm rolls back, and the rollback
DELETES the running Flux controllers.
Live verified on omantel.omani.works (2026-04-29). Either:
(a) bump $CLOUDINIT_TPL to install v${app_version}, or
(b) bump $CHART_DIR/Chart.yaml's flux2 subchart to a version whose
appVersion equals ${cloudinit_version}.
EOF
exit 1
fi
echo " PASS: cloud-init v${cloudinit_version} == subchart appVersion ${app_version}"
# ── Case 4 — values.yaml catalystBlueprint metadata mirrors Chart.yaml dep ──
echo "[version-pin-replay] Case 4: values.yaml catalystBlueprint.upstream.version mirrors the Chart.yaml dep pin"
values_meta_version=$(awk '
/catalystBlueprint:/ {in_meta=1; next}
in_meta && /upstream:/ {
line=$0
sub(/.*version:[[:space:]]*"?/, "", line)
sub(/".*/, "", line)
sub(/[,}].*/, "", line)
gsub(/[[:space:]]/, "", line)
print line
exit
}
' "$CHART_DIR/values.yaml")
if [ -z "$values_meta_version" ]; then
echo "FAIL: values.yaml does not declare catalystBlueprint.upstream.version (provenance metadata missing)." >&2
exit 1
fi
if [ "$values_meta_version" != "$chart_dep_version" ]; then
echo "FAIL: values.yaml catalystBlueprint.upstream.version (${values_meta_version}) != Chart.yaml flux2 subchart version (${chart_dep_version}). Provenance metadata is out of sync." >&2
exit 1
fi
echo " PASS: values.yaml metadata = Chart.yaml dep = ${chart_dep_version}"
# ── Case 5 — `helm template` renders cleanly with default values ─────
echo "[version-pin-replay] Case 5: helm template renders cleanly and contains the version-aligned Flux controller payload"
helm template smoke-flux "$CHART_DIR" > "$TMP/render.yaml" 2> "$TMP/render.err" || {
echo "FAIL: helm template render failed:" >&2
cat "$TMP/render.err" >&2
exit 1
}
for ctl in source-controller kustomize-controller helm-controller notification-controller; do
if ! grep -q "name: ${ctl}$" "$TMP/render.yaml"; then
echo "FAIL: rendered chart missing Flux controller Deployment: ${ctl}" >&2
exit 1
fi
done
echo " PASS: rendered chart contains all four core Flux controllers"
# ── Case 6 — rollback-destruction precondition replay ────────────────
# Simulate the disagreement that caused the omantel destruction by
# planting a fake `Chart.yaml` with a mismatched flux2 dep, run this
# very test in dry-mode against it, and assert it FAILS. This is the
# regression-guard's regression-guard: prove the test itself rejects
# the catastrophic precondition.
echo "[version-pin-replay] Case 6: replay test rejects a fake mismatched Chart.yaml (self-test of the gate)"
fake_chart="$TMP/fake-chart"
mkdir -p "$fake_chart/charts"
cp "$CHART_DIR/values.yaml" "$fake_chart/values.yaml"
cat > "$fake_chart/Chart.yaml" <<YAML
apiVersion: v2
name: bp-flux
version: 9.9.9
type: application
dependencies:
- name: flux2
version: "2.13.0"
repository: "https://fluxcd-community.github.io/helm-charts"
YAML
# Re-use the already-fetched 2.13.0 subchart if present in the working
# tree; otherwise download it via helm dependency build.
if [ -f "$REPO_ROOT/.test-cache/flux2-2.13.0.tgz" ]; then
cp "$REPO_ROOT/.test-cache/flux2-2.13.0.tgz" "$fake_chart/charts/flux2-2.13.0.tgz"
else
( cd "$fake_chart" && helm dependency build >"$TMP/fake-dep-build.log" 2>&1 ) || {
echo " (skip Case 6: could not fetch flux2 2.13.0 for the self-test)" >&2
echo "[version-pin-replay] All upstream gates green; self-test skipped (offline)."
exit 0
}
fi
# Run THIS test against the fake chart and assert non-zero exit.
# Pass REPO_ROOT through so Case 2 (cloud-init lookup) still resolves.
if REPO_ROOT="$REPO_ROOT" bash "$0" "$fake_chart" >"$TMP/fake.out" 2>&1; then
echo "FAIL: self-test did NOT reject the mismatched fake chart — the gate is broken." >&2
cat "$TMP/fake.out" >&2
exit 1
fi
if ! grep -q "VERSION-PIN MISMATCH" "$TMP/fake.out"; then
echo "FAIL: self-test rejected the fake chart but not for the expected reason. Output:" >&2
cat "$TMP/fake.out" >&2
exit 1
fi
echo " PASS: self-test correctly rejected the catastrophic fake (mismatch detected)"
echo "[version-pin-replay] All bp-flux version-pin gates green."

@ -9,7 +9,13 @@
# the values namespace).
catalystBlueprint:
upstream: { chart: flux2, version: "2.13.0", repo: "https://fluxcd-community.github.io/helm-charts" }
# Pinned to flux2 2.14.1 (= upstream Flux appVersion 2.4.0). MUST match
# `infra/hetzner/cloudinit-control-plane.tftpl`'s install.yaml URL
# (currently v2.4.0). See Chart.yaml comment block "CRITICAL VERSION-PIN
# INVARIANT" for the full incident replay (omantel.omani.works,
# 2026-04-29 — Flux controllers deleted by Helm rollback after a
# double-install version-mismatch).
upstream: { chart: flux2, version: "2.14.1", repo: "https://fluxcd-community.github.io/helm-charts" }
# ─── Upstream chart values (subchart key: flux2) ──────────────────────────
# Generated by docs/PROVISIONING-PLAN.md tickets [F] chart Pass 105+.

@ -1,7 +1,7 @@
apiVersion: v2
name: bp-catalyst-platform
version: 1.1.1
appVersion: 1.1.1
version: 1.1.2
appVersion: 1.1.2
description: |
Catalyst Platform — the unified Catalyst control plane umbrella chart for Catalyst-Zero.
Composes the catalyst-{ui,api}, console, admin, marketplace UI modules and the marketplace-api backend.
@ -23,7 +23,9 @@ description: |
install ordering is owned by Flux dependsOn (bp-cert-manager,
bp-powerdns) rather than this umbrella's Helm dependency graph.
Bumped to 1.1.1 in lockstep with bp-external-dns 1.1.0 to reflect the
dependency removal.
dependency removal. Bumped to 1.1.2 to pull in bp-flux:1.1.2 — the
catastrophic-double-install fix (omantel.omani.works incident,
2026-04-29). See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
type: application
dependencies:
@ -33,8 +35,13 @@ dependencies:
- name: bp-cert-manager
version: 1.1.1
repository: oci://ghcr.io/openova-io
# bp-flux 1.1.2 — version-pinned to upstream Flux v2.4.0 (matches
# cloud-init's install.yaml URL). Earlier 1.1.1 shipped flux2 2.13.0
# (= upstream v2.3.0), which destroyed cluster Flux on first reconcile
# via Helm rollback (omantel.omani.works incident, 2026-04-29). See
# docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
- name: bp-flux
version: 1.1.1
version: 1.1.2
repository: oci://ghcr.io/openova-io
- name: bp-crossplane
version: 1.1.1