openova/clusters/_template/bootstrap-kit/03-flux.yaml
hatiyildiz b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.
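
The admission error quoted above comes from a general Kubernetes rule: every entry in a CRD's `status.storedVersions` must still be declared in `spec.versions`. A minimal sketch of that rule follows; the function name and the sample CRD dictionaries are hypothetical illustrations, not data captured from the incident.

```python
from __future__ import annotations


def stored_versions_violations(crd: dict) -> list[str]:
    """Return storedVersions entries that are not declared in spec.versions."""
    declared = {v["name"] for v in crd.get("spec", {}).get("versions", [])}
    stored = crd.get("status", {}).get("storedVersions", [])
    return [v for v in stored if v not in declared]


# Hypothetical state: the running cluster already stores objects at v1.
live_status = {"storedVersions": ["v1"]}

# An older manifest that does not declare v1 would be rejected ...
older_spec = {"versions": [{"name": "v1beta2"}]}
# ... while a manifest that still declares v1 passes the check.
aligned_spec = {"versions": [{"name": "v1beta2"}, {"name": "v1"}]}

print(stored_versions_violations({"spec": older_spec, "status": live_status}))    # ['v1']
print(stored_versions_violations({"spec": aligned_spec, "status": live_status}))  # []
```

This is why applying CRDs from an older Flux release over a newer one fails, while applying CRDs from the same release is a no-op.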

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than
reinstalling a different version on top of it.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).
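
Case 4 in the list above (values.yaml mirrors the Chart.yaml dep pin) could be checked roughly as follows. This is a sketch only: the real test is a shell script whose contents are not shown here, the helper names are hypothetical, and the sample file layouts are assumptions (standard Helm `dependencies:` syntax and a nested `catalystBlueprint.upstream.version` key).

```python
from __future__ import annotations


def _value(line: str) -> str:
    """Extract the scalar after the first colon, dropping optional quotes."""
    return line.split(":", 1)[1].strip().strip("\"'")


def flux2_dep_version(chart_yaml: str) -> str | None:
    """Version of the `flux2` dependency; assumes `name:` opens the list item."""
    in_flux2 = False
    for raw in chart_yaml.splitlines():
        line = raw.strip()
        if line.startswith("- "):
            in_flux2 = line[2:].strip() == "name: flux2"
        elif in_flux2 and line.startswith("version:"):
            return _value(line)
    return None


def upstream_mirror_version(values_yaml: str) -> str | None:
    """First `version:` key found after an `upstream:` key."""
    seen_upstream = False
    for raw in values_yaml.splitlines():
        line = raw.strip()
        if line == "upstream:":
            seen_upstream = True
        elif seen_upstream and line.startswith("version:"):
            return _value(line)
    return None


# Assumed file contents, for illustration only.
chart = "name: bp-flux\nversion: 1.1.2\ndependencies:\n  - name: flux2\n    version: 2.14.1\n"
values = "catalystBlueprint:\n  upstream:\n    version: 2.14.1\n"

print(flux2_dep_version(chart) == upstream_mirror_version(values))  # True
```

A line-based scan keeps the sketch free of third-party YAML libraries; a real gate would more likely parse the files properly.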

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.
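
The load-bearing invariant (cloud-init's pinned tag equals the subchart's appVersion) can be sketched as below. The helper names and the sample cloud-init line are hypothetical, and the subchart appVersion is passed in as a plain string on the assumption that the caller has already read it from the vendored `flux2` chart.

```python
import re

# Matches the release URL form used by cloud-init and captures the version.
INSTALL_URL_RE = re.compile(
    r"https://github\.com/fluxcd/flux2/releases/download/v(\d+\.\d+\.\d+)/install\.yaml"
)


def cloudinit_flux_version(cloudinit_text: str) -> str:
    """Return the Flux version pinned in the cloud-init install.yaml URL."""
    match = INSTALL_URL_RE.search(cloudinit_text)
    if match is None:
        raise ValueError("no pinned flux2 install.yaml URL found")
    return match.group(1)


def pins_aligned(cloudinit_text: str, subchart_app_version: str) -> bool:
    """True when cloud-init and the flux2 subchart target the same Flux release."""
    return cloudinit_flux_version(cloudinit_text) == subchart_app_version.lstrip("v")


# Hypothetical cloud-init fragment; only the URL is taken from the commit.
cloudinit = (
    "curl -sL https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml"
    " | kubectl apply -f -"
)

print(pins_aligned(cloudinit, "2.4.0"))  # True  (flux2 chart 2.14.1)
print(pins_aligned(cloudinit, "2.3.0"))  # False (flux2 chart 2.13.0)
```

Raising on a missing URL rather than returning False means an unpinned or reworded install command fails the gate loudly instead of passing as "not mismatched".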

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00


# bp-flux — Catalyst bootstrap-kit Blueprint. Host-level Flux. Per-vcluster Flux is bootstrapped later by environment-controller.
#
# Wrapper chart: platform/flux/chart/
# Catalyst-curated values: platform/flux/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
#
# DOUBLE-INSTALL SAFETY (omantel.omani.works incident, 2026-04-29)
# ----------------------------------------------------------------
# Cloud-init pre-installs Flux core via
#   curl https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml
# so that this very HelmRelease can be reconciled. helm-controller then
# runs `helm install` for bp-flux on top of the already-running Flux. If
# the chart's subchart `flux2` version disagrees with the cloud-init
# install (different upstream Flux release), CRD `storedVersions`
# mismatches → Helm install fails → rollback → rollback DELETES the
# running Flux controllers → cluster has no GitOps engine and is
# unrecoverable in-place.
#
# Mitigations applied here:
#   1. bp-flux:1.1.2 pins the `flux2` subchart at 2.14.1 (= appVersion
#      2.4.0), which matches cloud-init's v2.4.0 install.yaml.
#   2. spec.upgrade.preserveValues: true, so operator overlays are never
#      silently overwritten on upgrade.
#   3. spec.install.disableTakeOwnership: false (explicit), so
#      helm-controller adopts the cloud-init-installed objects rather
#      than re-creating them; install is non-destructive when objects
#      already exist with matching apiVersion/kind/name.
# See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
---
apiVersion: v1
kind: Namespace
metadata:
  name: flux-system
  labels:
    catalyst.openova.io/sovereign: SOVEREIGN_FQDN_PLACEHOLDER
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-flux
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-flux
  namespace: flux-system
spec:
  interval: 15m
  releaseName: flux
  targetNamespace: flux-system
  dependsOn:
    - name: bp-cert-manager
  chart:
    spec:
      chart: bp-flux
      version: 1.1.2
      sourceRef:
        kind: HelmRepository
        name: bp-flux
        namespace: flux-system
  install:
    # Adopt cloud-init-installed Flux objects rather than fail on
    # ownership conflict (the objects exist before the HelmRelease ever
    # reconciles). Without this, the very first reconcile would error
    # with "object already exists" on every Flux controller Deployment.
    disableTakeOwnership: false
    remediation:
      retries: 3
  upgrade:
    # Keep operator-supplied values (e.g. resource overrides applied via
    # helm-controller out-of-band, or dry-run patches during incident
    # response) on chart upgrades. Without this, every upgrade would
    # reset the chart to default values, masking operator state.
    preserveValues: true
    # Match install behaviour: adopt rather than fail on conflict.
    disableTakeOwnership: false
    remediation:
      retries: 3