fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)

Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via `https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`. helm-controller's reconcile of bp-flux ran `helm install` on top of the running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver admission with `status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`; Helm rolled back; the rollback DELETED every running Flux controller Deployment (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller). The cluster lost its GitOps engine: no further HelmRelease could progress, and the only recovery was a full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident memo: version-align cloud-init's flux2 install with the bp-flux umbrella chart's `flux2` subchart, so that a single upstream Flux release is installed and helm-controller adopts it on first reconcile rather than reinstalling on top of it with a different version.

Changes:

* `infra/hetzner/cloudinit-control-plane.tftpl`: kept the install.yaml URL pinned at v2.4.0 (deliberate; this is the source of truth) and added the CRITICAL VERSION-PIN INVARIANT comment block documenting the failure mode.
* `platform/flux/chart/Chart.yaml`: bumped the `flux2` subchart dep from 2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion 2.4.0, matching cloud-init exactly. Bumped chart version 1.1.1 -> 1.1.2.
* `platform/flux/chart/values.yaml`: the `catalystBlueprint.upstream.version` mirror of the dep pin moved from 2.13.0 to 2.14.1.
* `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml`: bumped the bp-flux HelmRelease to 1.1.2 and added explicit `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than rolling back on ownership conflict.
* `products/catalyst/chart/Chart.yaml`: bumped the bp-catalyst-platform umbrella 1.1.1 -> 1.1.2, with the bp-flux dep bumped to 1.1.2.
* `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`: bumped the HelmRelease to 1.1.2.
* `platform/flux/chart/tests/version-pin-replay.sh`: NEW. Six-case catastrophic-failure replay test:
  * Case 1: Chart.yaml declares the flux2 subchart with an explicit version.
  * Case 2: cloud-init pins the flux2 install.yaml to an explicit v-tag.
  * Case 3: the chart's flux2 subchart appVersion equals cloud-init's pinned upstream version (the load-bearing invariant).
  * Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
  * Case 5: helm template renders cleanly and contains the four core Flux controllers.
  * Case 6: the replay test rejects a planted mismatched fake Chart.yaml (the gate's own self-test; proves the gate works).

  All six cases are green locally; the new test joins the existing observability-toggle test in tests/.
* `docs/RUNBOOK-PROVISIONING.md`: new section "bp-flux double-install — version-pin invariant" documenting the failure mode, the four pin-sites, the safe bump procedure, and the existing-Sovereign recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible once the rollback has fired. Reprovision is required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4 (never hardcode): the version pins remain operator-bumpable via PR, but BOTH cloud-init's URL AND the chart's subchart MUST move together in the same PR; the CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 6f5e14192c
commit b0c1c07271

--- a/clusters/_template/bootstrap-kit/03-flux.yaml
+++ b/clusters/_template/bootstrap-kit/03-flux.yaml
@@ -3,7 +3,29 @@
 # Wrapper chart: platform/flux/chart/
 # Catalyst-curated values: platform/flux/chart/values.yaml
 # Reconciled by: Flux on the new Sovereign's k3s control plane.
-
+#
+# DOUBLE-INSTALL SAFETY (omantel.omani.works incident, 2026-04-29)
+# ----------------------------------------------------------------
+# Cloud-init pre-installs Flux core via
+#   curl https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
+# so that this very HelmRelease can be reconciled. helm-controller then
+# runs `helm install` for bp-flux on top of the already-running Flux. If
+# the chart's subchart `flux2` version disagrees with the cloud-init
+# install (different upstream Flux release), CRD `storedVersions`
+# mismatches → Helm install fails → rollback → rollback DELETES the
+# running Flux controllers → cluster has no GitOps engine and is
+# unrecoverable in-place.
+#
+# Mitigations applied here:
+# 1. bp-flux:1.1.2 pins the `flux2` subchart at 2.14.1 (= appVersion
+#    2.4.0) which matches cloud-init's v2.4.0 install.yaml.
+# 2. spec.upgrade.preserveValues: true — never silently overwrite
+#    operator overlays on upgrade.
+# 3. spec.install.disableTakeOwnership: false (explicit) — helm-
+#    controller adopts the cloud-init-installed objects rather than
+#    re-creating, so install is non-destructive when objects already
+#    exist with matching apiVersion/kind/name.
+# See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
 ---
 apiVersion: v1
 kind: Namespace
@@ -38,14 +60,26 @@ spec:
   chart:
     spec:
       chart: bp-flux
-      version: 1.1.1
+      version: 1.1.2
       sourceRef:
         kind: HelmRepository
         name: bp-flux
         namespace: flux-system
   install:
+    # Adopt cloud-init-installed Flux objects rather than fail on
+    # ownership conflict (the objects exist before the HelmRelease ever
+    # reconciles). Without this, the very first reconcile would error
+    # with "object already exists" on every Flux controller Deployment.
+    disableTakeOwnership: false
     remediation:
       retries: 3
   upgrade:
+    # Keep operator-supplied values (e.g. resource overrides applied via
+    # helm-controller out-of-band, or dry-run patches during incident
+    # response) on chart upgrades. Without this, every upgrade would
+    # reset the chart to default values, masking operator state.
+    preserveValues: true
+    # Match install behaviour — adopt rather than fail on conflict.
+    disableTakeOwnership: false
     remediation:
       retries: 3

--- a/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
+++ b/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
@@ -43,7 +43,7 @@ spec:
   chart:
     spec:
       chart: bp-catalyst-platform
-      version: 1.1.1
+      version: 1.1.2
       sourceRef:
         kind: HelmRepository
         name: bp-catalyst-platform

--- a/clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml
+++ b/clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml
@@ -3,7 +3,29 @@
 # Wrapper chart: platform/flux/chart/
 # Catalyst-curated values: platform/flux/chart/values.yaml
 # Reconciled by: Flux on the new Sovereign's k3s control plane.
-
+#
+# DOUBLE-INSTALL SAFETY (omantel.omani.works incident, 2026-04-29)
+# ----------------------------------------------------------------
+# Cloud-init pre-installs Flux core via
+#   curl https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
+# so that this very HelmRelease can be reconciled. helm-controller then
+# runs `helm install` for bp-flux on top of the already-running Flux. If
+# the chart's subchart `flux2` version disagrees with the cloud-init
+# install (different upstream Flux release), CRD `storedVersions`
+# mismatches → Helm install fails → rollback → rollback DELETES the
+# running Flux controllers → cluster has no GitOps engine and is
+# unrecoverable in-place.
+#
+# Mitigations applied here:
+# 1. bp-flux:1.1.2 pins the `flux2` subchart at 2.14.1 (= appVersion
+#    2.4.0) which matches cloud-init's v2.4.0 install.yaml.
+# 2. spec.upgrade.preserveValues: true — never silently overwrite
+#    operator overlays on upgrade.
+# 3. spec.install.disableTakeOwnership: false (explicit) — helm-
+#    controller adopts the cloud-init-installed objects rather than
+#    re-creating, so install is non-destructive when objects already
+#    exist with matching apiVersion/kind/name.
+# See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
 ---
 apiVersion: v1
 kind: Namespace
@@ -38,14 +60,26 @@ spec:
   chart:
     spec:
       chart: bp-flux
-      version: 1.1.1
+      version: 1.1.2
       sourceRef:
         kind: HelmRepository
         name: bp-flux
         namespace: flux-system
   install:
+    # Adopt cloud-init-installed Flux objects rather than fail on
+    # ownership conflict (the objects exist before the HelmRelease ever
+    # reconciles). Without this, the very first reconcile would error
+    # with "object already exists" on every Flux controller Deployment.
+    disableTakeOwnership: false
     remediation:
       retries: 3
   upgrade:
+    # Keep operator-supplied values (e.g. resource overrides applied via
+    # helm-controller out-of-band, or dry-run patches during incident
+    # response) on chart upgrades. Without this, every upgrade would
+    # reset the chart to default values, masking operator state.
+    preserveValues: true
+    # Match install behaviour — adopt rather than fail on conflict.
+    disableTakeOwnership: false
     remediation:
       retries: 3

--- a/clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml
+++ b/clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml
@@ -43,7 +43,7 @@ spec:
   chart:
     spec:
       chart: bp-catalyst-platform
-      version: 1.1.1
+      version: 1.1.2
       sourceRef:
         kind: HelmRepository
         name: bp-catalyst-platform

--- a/docs/RUNBOOK-PROVISIONING.md
+++ b/docs/RUNBOOK-PROVISIONING.md
@@ -128,6 +128,75 @@ The catalyst-api retains the OpenTofu state per-Sovereign in `/tmp/catalyst/tofu
 
 ---
 
+### bp-flux double-install — version-pin invariant
+
+**Live incident:** omantel.omani.works, 2026-04-29 — Flux controllers deleted by the FIRST reconcile of `bp-flux`. Cluster lost its GitOps engine in-place; the only recovery is a full reprovision.
+
+#### What happened
+
+1. Cloud-init runs early in the bootstrap sequence and installs Flux core via:
+
+   ```
+   curl -fsSL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml \
+     | kubectl apply -f -
+   ```
+
+   This is intentional — Flux must exist BEFORE the `flux-system/GitRepository` + `Kustomization` that pulls `clusters/<sovereign-fqdn>/bootstrap-kit/` can be reconciled.
+2. Cloud-init then applies the GitRepository + Kustomization. Flux begins reconciling `clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml`, which is a `HelmRelease` for `bp-flux`.
+3. helm-controller runs `helm install` for `bp-flux` against the running cluster. The chart's umbrella declares `dependencies: [{ name: flux2, version: <X> }]` — the upstream community chart that ships its own copies of Flux's CRDs and controller Deployments.
+4. If the chart's subchart version ships a DIFFERENT upstream Flux release than cloud-init installed, Helm tries to update the existing Flux CRDs to a new schema. The apiserver rejects the update with:
+
+   ```
+   status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions
+   ```
+
+   because the version stored in the existing CRDs (from cloud-init's install) isn't in the new chart's `spec.versions`.
+5. Helm rolls back the failed install. The rollback **deletes** the existing Flux controller Deployments (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller).
+6. The cluster has no Flux. Every subsequent HelmRelease in the bootstrap kit halts. The cluster is unrecoverable in-place — the only fix is `tofu destroy` + reprovision.
+
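The admission failure in step 4 comes down to a set-membership rule: every entry in a live CRD's `status.storedVersions` must still be declared in the incoming CRD's `spec.versions`. A minimal sketch of that rule follows; the function name `stored_versions_ok` and the space-separated argument format are assumptions made for illustration and are not part of this commit.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): mirrors the apiserver rule
# that every status.storedVersions entry must appear in spec.versions.
#   $1: space-separated storedVersions of the live CRD
#   $2: space-separated spec.versions names of the incoming CRD
# Returns 0 when the update would pass this rule, 1 otherwise.
stored_versions_ok() {
  for stored in $1; do
    found=no
    for declared in $2; do
      if [ "$stored" = "$declared" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      echo "storedVersions entry \"$stored\" must appear in spec.versions" >&2
      return 1
    fi
  done
  return 0
}
```

On a live cluster the first argument could be read with `kubectl get crd <name> -o jsonpath='{.status.storedVersions[*]}'`. With the incident's values, `stored_versions_ok "v1" "v1beta1"` fails, which is the precondition for the rollback in step 5.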
+#### The invariant
+
+**Cloud-init's `flux2 v<X.Y.Z>/install.yaml` URL pin and the bp-flux umbrella chart's `flux2` subchart `appVersion` MUST be the same upstream Flux version.** They cannot drift.
+
+The fluxcd-community chart's `appVersion` field is the upstream Flux release tag the chart ships. Mapping:
+
+| cloud-init URL | community chart (`flux2` dep) | upstream `appVersion` |
+|---|---|---|
+| `v2.4.0` | `2.14.1` | `2.4.0` (current) |
+| `v2.3.0` | `2.13.0` | `2.3.0` |
+
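The invariant can be read as a string comparison between the tag embedded in the install.yaml URL and the subchart's `appVersion`. The sketch below illustrates that comparison only; `flux_url_version` and `pins_aligned` are hypothetical names, and the real gate is `platform/flux/chart/tests/version-pin-replay.sh`.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the commit).

# flux_url_version URL
# Prints X.Y.Z when URL is a flux2 install.yaml URL pinned to an explicit
# vX.Y.Z tag; prints nothing for any other URL (e.g. a "latest" URL).
flux_url_version() {
  printf '%s\n' "$1" |
    sed -nE 's|^https://github\.com/fluxcd/flux2/releases/download/v([0-9]+\.[0-9]+\.[0-9]+)/install\.yaml$|\1|p'
}

# pins_aligned URL APP_VERSION
# Succeeds only when the URL's tag equals APP_VERSION (a leading "v" on
# APP_VERSION is tolerated).
pins_aligned() {
  url_version=$(flux_url_version "$1")
  [ -n "$url_version" ] && [ "$url_version" = "${2#v}" ]
}
```

For the rows in the mapping table, the `v2.4.0` URL is aligned with `appVersion` `2.4.0` and not with `2.3.0`, which is the combination bp-flux:1.1.1 shipped.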
+#### Where the invariant is enforced
+
+- `infra/hetzner/cloudinit-control-plane.tftpl` — pins the install.yaml URL (currently `v2.4.0`).
+- `platform/flux/chart/Chart.yaml` — pins the subchart (currently `flux2: 2.14.1`).
+- `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream.version` mirrors the dep pin (provenance metadata).
+- `platform/flux/chart/tests/version-pin-replay.sh` — CI gate; replays the catastrophic precondition and FAILS the build if the two pins ever drift.
+- `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml` — the HelmRelease declares `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than re-creating them and rolling back on conflict.
+
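When reviewing a bump it can help to print the three textual pins next to each other. The following is a rough sketch that assumes the file layout listed above and the same line shapes the replay test parses; `show_flux_pins` is a hypothetical name, and the output is informational only, not a replacement for the CI gate.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit).
# show_flux_pins REPO_ROOT: print the cloud-init tag, the Chart.yaml flux2
# dependency version and the values.yaml provenance mirror.
show_flux_pins() {
  root=$1
  # Tag from the first pinned install.yaml URL in the cloud-init template.
  cloudinit=$(grep -oE 'fluxcd/flux2/releases/download/v[0-9]+\.[0-9]+\.[0-9]+' \
    "$root/infra/hetzner/cloudinit-control-plane.tftpl" | head -n 1 | sed 's|.*/v||')
  # First "version:" that follows "name: flux2" under "dependencies:".
  chart_dep=$(awk '
    /^dependencies:/ {deps = 1; next}
    deps && /name: *flux2/ {named = 1; next}
    deps && named && /version:/ {gsub(/"/, "", $2); print $2; exit}
  ' "$root/platform/flux/chart/Chart.yaml")
  # version field of the inline "upstream: { ... }" map.
  values_meta=$(sed -nE 's/.*upstream:.*version: *"?([0-9.]+)"?.*/\1/p' \
    "$root/platform/flux/chart/values.yaml" | head -n 1)
  printf 'cloud-init install.yaml: v%s\n' "$cloudinit"
  printf 'Chart.yaml flux2 dep:    %s\n' "$chart_dep"
  printf 'values.yaml mirror:      %s\n' "$values_meta"
}
```

Run from a checkout as `show_flux_pins .`. Note that the first line is an upstream Flux tag while the other two are community chart versions, so the first is expected to differ from the others; the mapping between them is what the replay test checks via the subchart's `appVersion`.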
+#### How to bump Flux version safely
+
+When an upgrade to a newer Flux release is desired, the bump must land in **one PR** and touch all four pin sites at once:
+
+1. Pick the target upstream Flux version (e.g. `v2.5.1`).
+2. Find the matching community chart version from `https://fluxcd-community.github.io/helm-charts/index.yaml` — match on `appVersion: 2.5.1`.
+3. Update `infra/hetzner/cloudinit-control-plane.tftpl` install.yaml URL → `v2.5.1`.
+4. Update `platform/flux/chart/Chart.yaml` `flux2` dep → the matching community chart version.
+5. Update `platform/flux/chart/values.yaml` `catalystBlueprint.upstream.version` to match.
+6. Bump `platform/flux/chart/Chart.yaml` `version:` (semver patch).
+7. Update `clusters/_template/bootstrap-kit/03-flux.yaml` and every `clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml` to the new bp-flux version.
+8. Run `bash platform/flux/chart/tests/version-pin-replay.sh` locally — must pass.
+9. PR; `blueprint-release.yaml` rebuilds bp-flux; subchart-guard CI must be green.
+
+The `version-pin-replay.sh` test is the gate. CI rejects any PR that bumps one pin without the other.
+
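Step 2 of the procedure (matching a community chart version on `appVersion`) can be approximated with a small filter over the repository index. This sketch assumes the layout Helm normally emits for `index.yaml`: chart names as two-space-indented keys under `entries:`, and alphabetically ordered keys inside each release so that `appVersion` precedes `version`. `chart_version_for_app` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit).
# chart_version_for_app APP_VERSION < index.yaml
# Prints every flux2 chart version whose appVersion equals APP_VERSION
# (a leading "v" on the argument is tolerated).
chart_version_for_app() {
  awk -v want="${1#v}" '
    # A chart-name key under "entries:"; track whether it is flux2.
    /^  [A-Za-z0-9]/ { in_flux2 = ($1 == "flux2:"); next }
    !in_flux2        { next }
    # "  - " starts a new release entry: forget the previous appVersion.
    /^  - /          { app = "" }
    {
      line = $0
      # Keep only the top-level keys of a release entry; nested keys
      # (e.g. dependency versions) stay indented and are ignored below.
      sub(/^  (- |  )/, "", line)
      if (line ~ /^appVersion:/) {
        sub(/^appVersion: */, "", line); gsub(/"/, "", line); sub(/^v/, "", line)
        app = line
      } else if (line ~ /^version:/) {
        sub(/^version: */, "", line); gsub(/"/, "", line)
        if (app == want) print line
      }
    }
  '
}
```

A possible invocation is `curl -fsSL https://fluxcd-community.github.io/helm-charts/index.yaml | chart_version_for_app v2.5.1`. More than one chart version may carry the same `appVersion`, in which case every match is printed and the operator picks one for step 4.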
+#### Existing Sovereigns
+
+Sovereigns provisioned before this fix (any cluster running `bp-flux:1.1.1` or earlier with the `flux2: 2.13.0` subchart against a `v2.4.0` cloud-init install) are at risk on next bp-flux reconcile and may already be broken. The recovery procedure is full reprovision (`tofu destroy` → `tofu apply` with the corrected manifests). There is no in-place recovery for a cluster whose Flux controllers have been deleted by a Helm rollback.
+
+The omantel.omani.works cluster used to live-verify the failure mode is currently in this state and is being held for reprovision against `bp-flux:1.1.2`.
+
+---
+
 ### Phase 1 watch shows 0 HelmReleases
 
 **Symptom.** The wizard's progress page reaches `flux-bootstrap` successfully, then the Sovereign Admin banner shows the warning:

--- a/infra/hetzner/cloudinit-control-plane.tftpl
+++ b/infra/hetzner/cloudinit-control-plane.tftpl
@@ -337,8 +337,30 @@ runcmd:
   # Install Flux core. Cilium is now the cluster's CNI, so Flux pods will
   # actually start. Flux then reconciles clusters/${sovereign_fqdn}/ which
   # adopts the Helm release above as bp-cilium and continues with
-  # bp-cert-manager, bp-flux (host-level Flux, distinct from this Flux
-  # which is the CONTROL-PLANE Flux), bp-crossplane, etc.
+  # bp-cert-manager, bp-flux (which ADOPTS this Flux install rather than
+  # reinstalls — see version-pin invariant below), bp-crossplane, etc.
+  #
+  # CRITICAL VERSION-PIN INVARIANT — DO NOT CHANGE IN ISOLATION
+  # -----------------------------------------------------------
+  # The version pinned in the URL below MUST match the upstream Flux
+  # release that `platform/flux/chart/Chart.yaml`'s `flux2` subchart
+  # bundles, otherwise bp-flux's HelmRelease runs `helm install` on top
+  # of THIS Flux installation with a different upstream version, the
+  # CRD `status.storedVersions` mismatches, Helm install fails, rollback
+  # fires, and rollback DELETES the running Flux controllers — leaving
+  # the cluster with no GitOps engine, unrecoverable in-place.
+  #
+  # Live verified on omantel.omani.works on 2026-04-29 — every Sovereign
+  # provisioned without this pin in sync was destroyed minutes after
+  # bp-flux's first reconcile. See docs/RUNBOOK-PROVISIONING.md
+  # §"bp-flux double-install".
+  #
+  # Mapping (cloud-init install.yaml -> chart subchart -> appVersion):
+  #   v2.4.0 -> flux2 2.14.1 -> appVersion 2.4.0 <- CURRENT
+  #   v2.3.0 -> flux2 2.13.0 -> appVersion 2.3.0
+  #
+  # CI gate `platform/flux/chart/tests/version-pin-replay.sh` rejects
+  # divergence between this URL's version and the chart's subchart pin.
   - 'curl -fsSL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml | kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f -'
   - 'kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml -n flux-system wait --for=condition=Available --timeout=300s deployment --all'
 

--- a/platform/flux/chart/Chart.yaml
+++ b/platform/flux/chart/Chart.yaml
@@ -1,6 +1,6 @@
 apiVersion: v2
 name: bp-flux
-version: 1.1.1
+version: 1.1.2
 description: |
   Catalyst-curated Blueprint umbrella chart for Flux. Depends on the
   upstream `flux2` chart (fluxcd-community) as a Helm subchart so
@@ -16,11 +16,30 @@ maintainers:
     email: catalyst@openova.io
 
 # Upstream chart pulled in as a Helm subchart so `helm dependency build`
-# bundles it into the OCI artifact. Pinned to fluxcd/flux2 2.13.0 (matches
-# platform/flux/blueprint.yaml + values.yaml `catalystBlueprint.upstream
-# .version`). Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) the
-# version is operator-bumpable via PR + Blueprint release.
+# bundles it into the OCI artifact.
+#
+# CRITICAL VERSION-PIN INVARIANT — DO NOT BUMP IN ISOLATION
+# ----------------------------------------------------------
+# This subchart version MUST match the upstream Flux release that
+# `infra/hetzner/cloudinit-control-plane.tftpl` installs at cluster
+# bootstrap time, otherwise helm-controller's reconcile of bp-flux runs
+# `helm install` on top of an already-installed Flux of a DIFFERENT
+# version. The CRD `status.storedVersions` mismatch fails the install,
+# Helm rolls back, and the rollback DELETES the running Flux controllers
+# — leaving the cluster with no GitOps engine, unrecoverable in-place.
+# Live verified on omantel.omani.works on 2026-04-29 (see
+# docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install").
+#
+# Mapping (community chart -> flux2 appVersion -> cloud-init install.yaml):
+#   flux2 2.14.1 -> appVersion 2.4.0 -> cloud-init v2.4.0 <- CURRENT
+#   flux2 2.13.0 -> appVersion 2.3.0 -> cloud-init v2.3.0
+#
+# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) the version is
+# operator-bumpable via PR + Blueprint release, but BOTH this dep version
+# AND cloudinit-control-plane.tftpl's flux2 install.yaml URL must move
+# together in the same PR. CI gate `tests/version-pin-replay.sh` rejects
+# divergence.
 dependencies:
   - name: flux2
-    version: "2.13.0"
+    version: "2.14.1"
     repository: "https://fluxcd-community.github.io/helm-charts"

new file mode 100755
--- /dev/null
+++ b/platform/flux/chart/tests/version-pin-replay.sh
@@ -0,0 +1,199 @@
+#!/usr/bin/env bash
+# bp-flux version-pin replay test — catastrophic-failure regression guard.
+#
+# Live incident replay (omantel.omani.works, 2026-04-29):
+# - Cloud-init pre-installed Flux core via
+#   https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
+# - bp-flux:1.1.1 declared `flux2` subchart 2.13.0 (= upstream
+#   appVersion 2.3.0). MISMATCH against cloud-init's v2.4.0.
+# - helm-controller ran `helm install` for bp-flux on top of the
+#   running v2.4.0 Flux. CRD `status.storedVersions` carried "v1"
+#   from the v2.4.0 install; the chart's v2.3.0 CRDs only declare
+#   "v1beta1". apiserver rejected the chart's CRD update with:
+#     status.storedVersions[0]: Invalid value: "v1": must appear in
+#     spec.versions
+# - Helm rolled back the install — and the rollback DELETED the
+#   running Flux controller Deployments (helm-controller,
+#   source-controller, kustomize-controller, image-automation,
+#   image-reflector, notification-controller).
+# - Cluster lost its GitOps engine. No further HelmRelease could
+#   progress. Catastrophic, in-place unrecoverable.
+#
+# This test replays the precondition for the catastrophic failure
+# (version disagreement between cloud-init's flux2 install URL and the
+# chart's `flux2` subchart pin) and FAILS LOUDLY if the disagreement is
+# ever reintroduced.
+#
+# Usage: bash tests/version-pin-replay.sh [CHART_DIR]
+
+set -euo pipefail
+
+CHART_DIR="${1:-$(cd "$(dirname "$0")/.." && pwd)}"
+# REPO_ROOT can be overridden via env (used by Case 6's self-test which
+# runs against a /tmp fake chart but still needs to validate against the
+# real repo's cloud-init template).
+REPO_ROOT="${REPO_ROOT:-$(cd "$CHART_DIR/../../.." && pwd)}"
+CLOUDINIT_TPL="$REPO_ROOT/infra/hetzner/cloudinit-control-plane.tftpl"
+TMP="$(mktemp -d)"
+trap 'rm -rf "$TMP"' EXIT
+
+echo "[version-pin-replay] CHART_DIR=$CHART_DIR"
+echo "[version-pin-replay] REPO_ROOT=$REPO_ROOT"
+
+# ── Case 1 — Chart.yaml's flux2 subchart pin is set ──────────────────
+echo "[version-pin-replay] Case 1: Chart.yaml declares the flux2 subchart with an explicit version"
+chart_dep_version=$(awk '
+  /^dependencies:/ {in_deps=1; next}
+  in_deps && /name: *flux2/ {found_name=1; next}
+  in_deps && found_name && /version:/ {gsub(/"/, "", $2); print $2; exit}
+' "$CHART_DIR/Chart.yaml")
+if [ -z "$chart_dep_version" ]; then
+  # Backticks are escaped so the shell prints them literally instead of
+  # trying to run `version:` as a command substitution.
+  echo "FAIL: Chart.yaml does not declare a flux2 subchart with \`version:\`. Replay precondition met (catastrophic regression)." >&2
+  exit 1
+fi
+echo " chart subchart pin: flux2 $chart_dep_version"
+
+# ── Case 2 — cloud-init's install.yaml URL contains an explicit version tag ──
+echo "[version-pin-replay] Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag"
+if [ ! -f "$CLOUDINIT_TPL" ]; then
+  echo "FAIL: cloud-init template missing at $CLOUDINIT_TPL — cannot validate version pin." >&2
+  exit 1
+fi
+cloudinit_url=$(grep -oE 'https://github.com/fluxcd/flux2/releases/download/v[0-9]+\.[0-9]+\.[0-9]+/install.yaml' "$CLOUDINIT_TPL" | head -1)
+if [ -z "$cloudinit_url" ]; then
+  echo "FAIL: cloud-init template at $CLOUDINIT_TPL does not pin a flux2 install.yaml URL with explicit v-tag (e.g. v2.4.0)." >&2
+  exit 1
+fi
+cloudinit_version=$(echo "$cloudinit_url" | sed -E 's|.*/v([0-9]+\.[0-9]+\.[0-9]+)/install.yaml|\1|')
+echo " cloud-init flux2 install.yaml pin: v$cloudinit_version"
+
+# ── Case 3 — chart subchart appVersion equals cloud-init install.yaml version ──
+# The fluxcd-community `flux2` chart's `appVersion` field is the upstream
+# Flux release tag (e.g. 2.4.0). It MUST match cloud-init's URL pin.
+echo "[version-pin-replay] Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version"
+subchart_tgz="$CHART_DIR/charts/flux2-${chart_dep_version}.tgz"
+subchart_dir="$CHART_DIR/charts/flux2"
+if [ ! -f "$subchart_tgz" ] && [ ! -d "$subchart_dir" ]; then
+  echo " charts/ empty — running 'helm dependency build' to fetch flux2 ${chart_dep_version}"
+  ( cd "$CHART_DIR" && helm dependency build >"$TMP/dep-build.log" 2>&1 ) || {
+    echo "FAIL: helm dependency build failed:" >&2
+    cat "$TMP/dep-build.log" >&2
+    exit 1
+  }
+fi
+if [ -f "$subchart_tgz" ]; then
+  app_version=$(tar -xzOf "$subchart_tgz" flux2/Chart.yaml | awk '/^appVersion:/ {gsub(/"/, "", $2); print $2; exit}')
+elif [ -d "$subchart_dir" ]; then
+  app_version=$(awk '/^appVersion:/ {gsub(/"/, "", $2); print $2; exit}' "$subchart_dir/Chart.yaml")
+else
+  echo "FAIL: helm dependency build did not produce flux2 subchart at $subchart_tgz nor $subchart_dir" >&2
+  exit 1
+fi
+echo " subchart flux2 ${chart_dep_version}.appVersion = ${app_version}"
+
+if [ "$app_version" != "$cloudinit_version" ]; then
+  cat >&2 <<EOF
+FAIL: VERSION-PIN MISMATCH (catastrophic regression).
+cloud-init's install.yaml URL pins upstream Flux: v${cloudinit_version}
+bp-flux Chart.yaml's flux2 subchart pin (${chart_dep_version}) carries
+appVersion: ${app_version}
+
+These MUST match — bp-flux's HelmRelease will run \`helm install\` on
+top of the cloud-init-installed Flux. A version mismatch makes the
+CRD storedVersions update fail, Helm rolls back, and the rollback
+DELETES the running Flux controllers.
+
+Live verified on omantel.omani.works (2026-04-29). Either:
+(a) bump $CLOUDINIT_TPL to install v${app_version}, or
+(b) bump $CHART_DIR/Chart.yaml's flux2 subchart to a version whose
+appVersion equals v${cloudinit_version}.
+EOF
+  exit 1
+fi
+echo " PASS: cloud-init v${cloudinit_version} == subchart appVersion ${app_version}"
+
+# ── Case 4 — values.yaml catalystBlueprint metadata mirrors Chart.yaml dep ──
+echo "[version-pin-replay] Case 4: values.yaml catalystBlueprint.upstream.version mirrors the Chart.yaml dep pin"
+values_meta_version=$(awk '
+  /catalystBlueprint:/ {in_meta=1; next}
+  in_meta && /upstream:/ {
+    line=$0
+    sub(/.*version:[[:space:]]*"?/, "", line)
+    sub(/".*/, "", line)
+    sub(/[,}].*/, "", line)
+    gsub(/[[:space:]]/, "", line)
+    print line
+    exit
+  }
+' "$CHART_DIR/values.yaml")
+if [ -z "$values_meta_version" ]; then
+  echo "FAIL: values.yaml does not declare catalystBlueprint.upstream.version (provenance metadata missing)." >&2
+  exit 1
+fi
+if [ "$values_meta_version" != "$chart_dep_version" ]; then
+  echo "FAIL: values.yaml catalystBlueprint.upstream.version (${values_meta_version}) != Chart.yaml flux2 subchart version (${chart_dep_version}). Provenance metadata is out of sync." >&2
+  exit 1
+fi
+echo " PASS: values.yaml metadata = Chart.yaml dep = ${chart_dep_version}"
+
+# ── Case 5 — `helm template` renders cleanly with default values ─────
+echo "[version-pin-replay] Case 5: helm template renders cleanly and contains the version-aligned Flux controller payload"
+helm template smoke-flux "$CHART_DIR" > "$TMP/render.yaml" 2> "$TMP/render.err" || {
+  echo "FAIL: helm template render failed:" >&2
+  cat "$TMP/render.err" >&2
+  exit 1
+}
+for ctl in source-controller kustomize-controller helm-controller notification-controller; do
+  if ! grep -q "name: ${ctl}$" "$TMP/render.yaml"; then
+    echo "FAIL: rendered chart missing Flux controller Deployment: ${ctl}" >&2
+    exit 1
+  fi
+done
+echo " PASS: rendered chart contains all four core Flux controllers"
+
+# ── Case 6 — rollback-destruction precondition replay ────────────────
+# Simulate the disagreement that caused the omantel destruction by
+# planting a fake `Chart.yaml` with a mismatched flux2 dep, run this
+# very test in dry-mode against it, and assert it FAILS. This is the
+# regression-guard's regression-guard: prove the test itself rejects
+# the catastrophic precondition.
+echo "[version-pin-replay] Case 6: replay test rejects a fake mismatched Chart.yaml (self-test of the gate)"
+fake_chart="$TMP/fake-chart"
+mkdir -p "$fake_chart/charts"
+cp "$CHART_DIR/values.yaml" "$fake_chart/values.yaml"
+cat > "$fake_chart/Chart.yaml" <<YAML
+apiVersion: v2
+name: bp-flux
+version: 9.9.9
+type: application
+dependencies:
+  - name: flux2
+    version: "2.13.0"
+    repository: "https://fluxcd-community.github.io/helm-charts"
+YAML
+# Re-use the already-fetched 2.13.0 subchart if present in the working
+# tree; otherwise download it via helm dependency build.
+if [ -f "$REPO_ROOT/.test-cache/flux2-2.13.0.tgz" ]; then
+  cp "$REPO_ROOT/.test-cache/flux2-2.13.0.tgz" "$fake_chart/charts/flux2-2.13.0.tgz"
+else
+  ( cd "$fake_chart" && helm dependency build >"$TMP/fake-dep-build.log" 2>&1 ) || {
+    echo " (skip Case 6: could not fetch flux2 2.13.0 for the self-test)" >&2
+    echo "[version-pin-replay] All upstream gates green; self-test skipped (offline)."
+    exit 0
+  }
+fi
+# Run THIS test against the fake chart and assert non-zero exit.
+# Pass REPO_ROOT through so Case 2 (cloud-init lookup) still resolves.
+if REPO_ROOT="$REPO_ROOT" bash "$0" "$fake_chart" >"$TMP/fake.out" 2>&1; then
+  echo "FAIL: self-test did NOT reject the mismatched fake chart — the gate is broken." >&2
+  cat "$TMP/fake.out" >&2
+  exit 1
+fi
+if ! grep -q "VERSION-PIN MISMATCH" "$TMP/fake.out"; then
+  echo "FAIL: self-test rejected the fake chart but not for the expected reason. Output:" >&2
+  cat "$TMP/fake.out" >&2
+  exit 1
+fi
+echo " PASS: self-test correctly rejected the catastrophic fake (mismatch detected)"
+
+echo "[version-pin-replay] All bp-flux version-pin gates green."

--- a/platform/flux/chart/values.yaml
+++ b/platform/flux/chart/values.yaml
@@ -9,7 +9,13 @@
 # the values namespace).
 
 catalystBlueprint:
-  upstream: { chart: flux2, version: "2.13.0", repo: "https://fluxcd-community.github.io/helm-charts" }
+  # Pinned to flux2 2.14.1 (= upstream Flux appVersion 2.4.0). MUST match
+  # `infra/hetzner/cloudinit-control-plane.tftpl`'s install.yaml URL
+  # (currently v2.4.0). See Chart.yaml comment block "CRITICAL VERSION-PIN
+  # INVARIANT" for the full incident replay (omantel.omani.works,
+  # 2026-04-29 — Flux controllers deleted by Helm rollback after a
+  # double-install version-mismatch).
+  upstream: { chart: flux2, version: "2.14.1", repo: "https://fluxcd-community.github.io/helm-charts" }
 
 # ─── Upstream chart values (subchart key: flux2) ──────────────────────────
 # Generated by docs/PROVISIONING-PLAN.md tickets [F] chart Pass 105+.

--- a/products/catalyst/chart/Chart.yaml
+++ b/products/catalyst/chart/Chart.yaml
@@ -1,7 +1,7 @@
 apiVersion: v2
 name: bp-catalyst-platform
-version: 1.1.1
-appVersion: 1.1.1
+version: 1.1.2
+appVersion: 1.1.2
 description: |
   Catalyst Platform — the unified Catalyst control plane umbrella chart for Catalyst-Zero.
   Composes the catalyst-{ui,api}, console, admin, marketplace UI modules and the marketplace-api backend.
@@ -23,7 +23,9 @@ description: |
   install ordering is owned by Flux dependsOn (bp-cert-manager,
   bp-powerdns) rather than this umbrella's Helm dependency graph.
   Bumped to 1.1.1 in lockstep with bp-external-dns 1.1.0 to reflect the
-  dependency removal.
+  dependency removal. Bumped to 1.1.2 to pull in bp-flux:1.1.2 — the
+  catastrophic-double-install fix (omantel.omani.works incident,
+  2026-04-29). See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
 type: application
 
 dependencies:
@@ -33,8 +35,13 @@ dependencies:
   - name: bp-cert-manager
     version: 1.1.1
     repository: oci://ghcr.io/openova-io
+  # bp-flux 1.1.2 — version-pinned to upstream Flux v2.4.0 (matches
+  # cloud-init's install.yaml URL). Earlier 1.1.1 shipped flux2 2.13.0
+  # (= upstream v2.3.0), which destroyed cluster Flux on first reconcile
+  # via Helm rollback (omantel.omani.works incident, 2026-04-29). See
+  # docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
   - name: bp-flux
-    version: 1.1.1
+    version: 1.1.2
     repository: oci://ghcr.io/openova-io
   - name: bp-crossplane
     version: 1.1.1