Live verified on omantel.omani.works (2026-04-29).

bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0
(= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core
at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with
`status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`;
Helm rolled back; the rollback DELETED every running Flux controller
Deployment (helm-controller, source-controller, kustomize-controller,
image-automation-controller, image-reflector-controller,
notification-controller). The cluster lost its GitOps engine: no
further HelmRelease could progress, and the only recovery was full
`tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux
umbrella chart's `flux2` subchart so a single upstream Flux release is
installed and helm-controller adopts it on first reconcile rather than
reinstalls on top with a different version.

Changes:

* `infra/hetzner/cloudinit-control-plane.tftpl`: kept the install.yaml
  URL pinned at v2.4.0 (deliberate; this is the source of truth) and
  added the CRITICAL VERSION-PIN INVARIANT comment block documenting
  the failure mode.
* `platform/flux/chart/Chart.yaml`: bumped `flux2` subchart dep from
  2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
  2.4.0, matching cloud-init exactly. Bumped chart version
  1.1.1 -> 1.1.2.
* `platform/flux/chart/values.yaml`: `catalystBlueprint.upstream.version`
  mirror of the dep pin moved from 2.13.0 to 2.14.1.
* `clusters/_template/bootstrap-kit/03-flux.yaml` and
  `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml`: bumped
  bp-flux HelmRelease to 1.1.2 + added explicit
  `install.disableTakeOwnership: false`,
  `upgrade.disableTakeOwnership: false`, and
  `upgrade.preserveValues: true` so helm-controller adopts the
  cloud-init-installed Flux objects rather than rolling back on
  ownership conflict.
* `products/catalyst/chart/Chart.yaml`: bumped bp-catalyst-platform
  umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.
* `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
  `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`:
  bumped HelmRelease to 1.1.2.
* `platform/flux/chart/tests/version-pin-replay.sh`: NEW. Six-case
  catastrophic-failure replay test:
    Case 1: Chart.yaml declares the flux2 subchart with explicit
            version.
    Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
    Case 3: chart's flux2 subchart appVersion equals cloud-init's
            pinned upstream version (the load-bearing invariant).
    Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
    Case 5: helm template renders cleanly + contains the four core
            Flux controllers.
    Case 6: replay test rejects a planted mismatched fake Chart.yaml
            (the gate's own self-test; proves the gate works).
  All six cases green locally; the new test joins the existing
  observability-toggle test in tests/.
* `docs/RUNBOOK-PROVISIONING.md`: new section "bp-flux double-install —
  version-pin invariant" documenting the failure mode, the four
  pin-sites, the safe bump procedure, and the existing-Sovereign
  recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4
(never hardcode): the version pins remain operator-bumpable via PR, but
BOTH cloud-init's URL AND the chart's subchart MUST move together in
the same PR; CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
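The load-bearing invariant (Case 3 above) can be sketched roughly as follows. This is not the contents of `version-pin-replay.sh`, which are not reproduced in this commit message; the function names and the assumption that the subchart's `Chart.yaml` is available as a plain file are illustrative only.

```shell
# Hypothetical helpers, not taken from the repository.

# Extract the Flux version pinned in a cloud-init template, e.g. "2.4.0"
# from ".../fluxcd/flux2/releases/download/v2.4.0/install.yaml".
cloudinit_flux_version() {
  sed -n 's#.*fluxcd/flux2/releases/download/v\([0-9][0-9.]*\)/install\.yaml.*#\1#p' "$1" | head -n 1
}

# Read the top-level appVersion from a Chart.yaml, with quotes stripped.
chart_app_version() {
  sed -n 's/^appVersion:[[:space:]]*//p' "$1" | head -n 1 | tr -d "\"'"
}

# Succeed only when both pins name the same upstream Flux release.
# $1: cloud-init template, $2: Chart.yaml of the flux2 subchart.
check_flux_pin() {
  local want got
  want="$(cloudinit_flux_version "$1")"
  got="$(chart_app_version "$2")"
  if [ -z "$want" ] || [ -z "$got" ]; then
    echo "FAIL: could not read a pin (cloud-init='$want', chart='$got')" >&2
    return 1
  fi
  if [ "$want" != "$got" ]; then
    echo "FAIL: cloud-init installs Flux $want but the flux2 subchart ships $got" >&2
    return 1
  fi
  echo "OK: both pins resolve to Flux $want"
}

# Example invocation (the .tgz path is an assumption about where
# `helm dependency build` places the subchart):
#   helm show chart platform/flux/chart/charts/flux2-2.14.1.tgz > /tmp/flux2-Chart.yaml
#   check_flux_pin infra/hetzner/cloudinit-control-plane.tftpl /tmp/flux2-Chart.yaml
```

Comparing the subchart's appVersion rather than its chart version matters here because the community chart numbers its releases independently of upstream Flux (2.14.1 versus 2.4.0 in this commit).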
86 lines
3.1 KiB
YAML
# bp-flux: Catalyst bootstrap-kit Blueprint. Host-level Flux. Per-vcluster Flux is bootstrapped later by environment-controller.
#
# Wrapper chart: platform/flux/chart/
# Catalyst-curated values: platform/flux/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
#
# DOUBLE-INSTALL SAFETY (omantel.omani.works incident, 2026-04-29)
# ----------------------------------------------------------------
# Cloud-init pre-installs Flux core via
#   curl https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
# so that this very HelmRelease can be reconciled. helm-controller then
# runs `helm install` for bp-flux on top of the already-running Flux. If
# the chart's subchart `flux2` version disagrees with the cloud-init
# install (different upstream Flux release), CRD `storedVersions`
# mismatches → Helm install fails → rollback → rollback DELETES the
# running Flux controllers → cluster has no GitOps engine and is
# unrecoverable in-place.
#
# Mitigations applied here:
#   1. bp-flux:1.1.2 pins the `flux2` subchart at 2.14.1 (= appVersion
#      2.4.0) which matches cloud-init's v2.4.0 install.yaml.
#   2. spec.upgrade.preserveValues: true. Never silently overwrite
#      operator overlays on upgrade.
#   3. spec.install.disableTakeOwnership: false (explicit):
#      helm-controller adopts the cloud-init-installed objects rather
#      than re-creating, so install is non-destructive when objects
#      already exist with matching apiVersion/kind/name.
# See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
---
apiVersion: v1
kind: Namespace
metadata:
  name: flux-system
  labels:
    catalyst.openova.io/sovereign: SOVEREIGN_FQDN_PLACEHOLDER
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-flux
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-flux
  namespace: flux-system
spec:
  interval: 15m
  releaseName: flux
  targetNamespace: flux-system
  dependsOn:
    - name: bp-cert-manager
  chart:
    spec:
      chart: bp-flux
      version: 1.1.2
      sourceRef:
        kind: HelmRepository
        name: bp-flux
        namespace: flux-system
  install:
    # Adopt cloud-init-installed Flux objects rather than fail on
    # ownership conflict (the objects exist before the HelmRelease ever
    # reconciles). Without this, the very first reconcile would error
    # with "object already exists" on every Flux controller Deployment.
    disableTakeOwnership: false
    remediation:
      retries: 3
  upgrade:
    # Keep operator-supplied values (e.g. resource overrides applied via
    # helm-controller out-of-band, or dry-run patches during incident
    # response) on chart upgrades. Without this, every upgrade would
    # reset the chart to default values, masking operator state.
    preserveValues: true
    # Match install behaviour: adopt rather than fail on conflict.
    disableTakeOwnership: false
    remediation:
      retries: 3
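The `storedVersions` failure described in the header comment of this manifest comes from an apiserver rule: every entry in a live CRD's `status.storedVersions` must still be listed in `spec.versions` of the CRD being applied. A minimal sketch of that comparison, useful as a pre-flight check before a chart bump, might look like this. The function name is hypothetical, and the caller is assumed to supply both version lists.

```shell
# Hypothetical helper, not part of the repository.
# Print every stored version that the incoming CRD no longer declares.
#   $1: space-separated status.storedVersions of the live CRD
#   $2: space-separated spec.versions[*].name of the CRD about to be applied
# Empty output means the update would pass this particular admission rule.
missing_stored_versions() {
  local stored declared v
  stored="$1"
  declared=" $2 "          # pad so whole-word matching works below
  for v in $stored; do
    case "$declared" in
      *" $v "*) ;;                  # still declared, nothing to report
      *) printf '%s\n' "$v" ;;      # would be rejected by the apiserver
    esac
  done
}

# Example against a running cluster ($crd is a placeholder CRD name):
#   stored="$(kubectl get crd "$crd" -o jsonpath='{.status.storedVersions[*]}')"
#   missing_stored_versions "$stored" "v1beta1 v1beta2"
```

In the incident above the live CRD had `v1` stored while the older chart's CRD did not declare it, which is the case where this function would print `v1`.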