Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
When helm-controller reconciled bp-flux, it ran `helm install` on top of
the running v2.4.0 Flux. The chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`, so Helm rolled back, and the rollback DELETED
every running Flux controller Deployment (helm-controller,
source-controller, kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine: no further HelmRelease could progress, and the only
recovery was a full `tofu destroy` + reprovision.
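
The admission rule behind that error can be illustrated with a small shell sketch: the apiserver rejects a CRD update when any version recorded in `status.storedVersions` is missing from the new `spec.versions`. The function name and the version lists passed to it are illustrative assumptions, not values taken from the incident.

```shell
#!/bin/sh
# Sketch of the CRD admission rule: every entry in status.storedVersions
# must still be listed in spec.versions of the incoming CRD.
# Both arguments are space-separated version lists (illustrative inputs).
stored_versions_ok() {
  stored="$1"   # versions already persisted in etcd, e.g. "v1"
  served="$2"   # versions declared by the CRD being applied
  for v in $stored; do
    case " $served " in
      *" $v "*) ;;   # stored version is still declared: acceptable
      *)
        echo "storedVersions: \"$v\" must appear in spec.versions" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

For example, `stored_versions_ok "v1" "v1beta2"` returns non-zero, mirroring the rejected update. On a live cluster the two lists could be read with `kubectl get crd <name> -o jsonpath='{.status.storedVersions[*]}'` and `-o jsonpath='{.spec.versions[*].name}'`.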

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart, so that a single upstream Flux release is
installed and helm-controller adopts it on first reconcile instead of
installing a different version on top of it.

Changes:
* `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
URL pinned at v2.4.0 (deliberate; this is the source of truth) and
added the CRITICAL VERSION-PIN INVARIANT comment block documenting
the failure mode.
* `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
2.4.0, matching cloud-init exactly. Bumped chart version
1.1.1 -> 1.1.2.
* `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
.version` mirror of the dep pin moved from 2.13.0 to 2.14.1.
* `clusters/_template/bootstrap-kit/03-flux.yaml` and
`clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
bp-flux HelmRelease to 1.1.2 + added explicit
`install.disableTakeOwnership: false`,
`upgrade.disableTakeOwnership: false`, and
`upgrade.preserveValues: true` so helm-controller adopts the
cloud-init-installed Flux objects rather than rolling back on
ownership conflict.
* `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.
* `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
`clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
— bumped HelmRelease to 1.1.2.
* `platform/flux/chart/tests/version-pin-replay.sh` (NEW). Six-case
  catastrophic-failure replay test:
    Case 1: Chart.yaml declares the flux2 subchart with explicit version.
    Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
    Case 3: chart's flux2 subchart appVersion equals cloud-init's
            pinned upstream version (the load-bearing invariant).
    Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
    Case 5: helm template renders cleanly + contains the four core
            Flux controllers.
    Case 6: replay test rejects a planted mismatched fake Chart.yaml
            (the gate's own self-test, which proves the gate works).
  All six cases green locally; the new test joins the existing
  observability-toggle test in tests/.
* `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
— version-pin invariant" documenting the failure mode, the four
pin-sites, the safe bump procedure, and the existing-Sovereign
recovery path (full reprovision).
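
Case 3, the load-bearing invariant, could be checked along these lines. This is a minimal sketch and not the contents of `version-pin-replay.sh`: the function names are hypothetical, and the subchart appVersion is passed in as an argument rather than resolved from the chart.

```shell
#!/bin/sh
# Print the first vX.Y.Z tag found in a flux2 install.yaml release URL
# inside the given file (e.g. the cloud-init template).
pinned_flux_tag() {
  sed -n 's#.*fluxcd/flux2/releases/download/\(v[0-9][0-9.]*\)/install\.yaml.*#\1#p' "$1" | head -n 1
}

# Compare the cloud-init pin with the subchart's appVersion.
# $1: file containing the install.yaml URL; $2: appVersion such as 2.4.0
check_version_pin() {
  cloudinit_file="$1"
  app_version="$2"
  tag="$(pinned_flux_tag "$cloudinit_file")"
  if [ -z "$tag" ]; then
    echo "no pinned flux2 v-tag found in $cloudinit_file" >&2
    return 2
  fi
  # Strip an optional leading "v" on both sides before comparing.
  if [ "${tag#v}" != "${app_version#v}" ]; then
    echo "MISMATCH: cloud-init pins $tag, subchart appVersion is $app_version" >&2
    return 1
  fi
  echo "OK: cloud-init and subchart both target Flux ${tag#v}"
}
```

With the pins described above, `check_version_pin infra/hetzner/cloudinit-control-plane.tftpl 2.4.0` would succeed, while passing `2.3.0` reproduces the mismatch that 1.1.1 shipped with.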

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovisioning against 1.1.2 is required.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) and
#4 (never hardcode), the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR. The CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>