openova/docs/CHART-AUTHORING.md
hatiyildiz 015e7ab18b fix(catalyst-chart): annotate api-deployment for Flux strategy-flip recovery
DIVERGES from the literal "$patch: replace" prescription on the issue
because that directive cannot survive any apply path that actually
runs in production (verified end-to-end in
tests/integration/strategy-flip.sh):

  - Flux's kustomize-controller submits via Server-Side Apply. SSA
    rejects `.spec.strategy.$patch` with "field not declared in
    schema" — fluxcd/pkg/ssa Manager.Apply does not preprocess SMP
    directives.
  - kubectl strict-decoding rejects `$patch` on every CREATE path
    (`kubectl create`, `kubectl apply` to an empty namespace, every
    `--server-side` flavor) with "unknown field spec.strategy.$patch"
    — adding it to a chart base resource BREAKS fresh installs of
    every new Sovereign.

The durable fix is the documented Flux annotation
`kustomize.toolkit.fluxcd.io/force: enabled` on the Deployment.
When kustomize-controller's SSA dry-run fails Invalid (the contabo-
mkt failure mode: `spec.strategy.rollingUpdate: Forbidden` on the
post-merge object that retained `rollingUpdate.maxSurge=25%` /
`maxUnavailable=25%` from the prior `kubectl-client-side-apply`
field manager), the controller falls back to delete-and-recreate
THIS resource. The recreated Deployment carries no residual
`rollingUpdate.*` fields, so the regression cannot recur. The
annotation is IaC, scoped to the Deployment, applies on every
reconcile.

Verified gates:
  - `kubectl apply --dry-run=server -f .../api-deployment.yaml`
    over a Deployment in the bad pre-state (RollingUpdate +
    maxSurge=25% / maxUnavailable=25%) → exit 0,
    "deployment.apps/catalyst-api configured (server dry run)".
  - Same manifest applied to an empty namespace via SSA + CSA →
    both succeed (the fresh-install gate that catches `$patch:`-
    shaped regressions).
  - SSA path correctly REPRODUCES the regression mode (asserted
    in step 3 of the integration test) → proves the recovery layer
    is necessary.
  - Flux force-recovery equivalent (delete + apply) succeeds →
    proves the recovery path itself works.

Files:
  - products/catalyst/chart/templates/api-deployment.yaml: add
    `kustomize.toolkit.fluxcd.io/force: enabled` annotation +
    inline reference comment explaining failure mode and rejecting
    inline `$patch: replace` as a future regression vector.
  - docs/CHART-AUTHORING.md (new): authoritative chart-authoring
    doc, with §"Strategy flips on existing Deployments" anchoring
    the failure mode + canonical fix + table of related fields
    (selector, clusterIP, accessModes, etc.) that share the
    pattern. References docs/INVIOLABLE-PRINCIPLES.md #3 (Flux is
    the only GitOps reconciler) and #4 (never hardcode runtime
    knobs in operator runbooks).
  - tests/integration/strategy-flip.yaml (new): bad-state fixture
    + assertion ConfigMap. Reproduces the exact 25%/25% pre-state
    that triggered contabo-mkt.
  - tests/integration/strategy-flip.sh (new): 6-step runner —
    bad-state stage, CSA gate, SSA failure-mode reproduction,
    structural annotation check, recovery-path proof, fresh-
    install gate. Exits non-zero on any regression.
  - .github/workflows/test-strategy-flip.yaml (new): CI wiring on
    kind v1.30.6 (matches contabo-mkt k3s decoding behavior),
    triggered by edits to the chart manifest, the test, the doc,
    or the workflow itself.

Sweep of the rest of the Catalyst chart templates: the only
`strategy.type: Recreate` Deployment in the chart is catalyst-api.
catalyst-ui, marketplace-api, and all 11 sme-services Deployments
declare default RollingUpdate and live as RollingUpdate on contabo-
mkt — no latent flips. Services use ClusterIP with default IP
allocation; the api-deployments PVC is RWO and never re-shaped by
the chart. No additional resources needed hardening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:04:07 +02:00

Chart Authoring Notes

Status: Authoritative. Audience: Anyone editing a products/<name>/chart/templates/*.yaml or platform/<name>/chart/templates/*.yaml resource that ships to a Flux-reconciled cluster.

This document captures sharp edges in the chart-authoring workflow that have already cost the project a real outage. Each section names a specific failure mode, a specific reproducer, and the canonical fix — in the same shape as docs/INVIOLABLE-PRINCIPLES.md. Read it before declaring "done" on any chart that mutates a long-lived resource.


Strategy flips on existing Deployments

What goes wrong

A chart manifest declares Deployment.spec.strategy.type: Recreate. The cluster already runs a Deployment of the same name that was created earlier with the default RollingUpdate strategy (so spec.strategy.rollingUpdate.maxSurge=25% and maxUnavailable=25% exist on the live object). Flux's kustomize-controller submits the new manifest via Server-Side Apply with the kustomize-controller field manager. The API server merges, then validates. Validation rejects with:

Deployment.apps "<name>" is invalid:
  spec.strategy.rollingUpdate: Forbidden:
    may not be specified when strategy `type` is 'Recreate'

The Flux Kustomization parks at Ready=False on every reconcile until an operator intervenes.

Why Server-Side Apply does this

SSA's contract is "set the fields you declare." It does NOT remove fields owned by other field managers. The pre-existing Deployment was created via kubectl apply (CSA), so the kubectl-client-side-apply field manager owns .spec.strategy.rollingUpdate.maxSurge and .spec.strategy.rollingUpdate.maxUnavailable. When kustomize-controller flips .spec.strategy.type to Recreate, those rolling-update fields stay on the object. The post-merge state has both type: Recreate AND rollingUpdate.* keys. The API validator forbids that combination. SSA cannot fix this on its own.
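The merge described above can be reproduced in a few lines. This is a toy model only, not the API server's implementation: objects are flattened to dotted paths, and `ssa_apply` and `validate_strategy` are hypothetical helpers that cover just the one rule quoted earlier. Real SSA also prunes fields the same manager stopped declaring, which this sketch leaves out.

```python
from copy import deepcopy


def ssa_apply(live, owners, declared, manager):
    """Toy Server-Side Apply: set declared fields, leave everything else alone."""
    merged = deepcopy(live)
    new_owners = dict(owners)
    for path, value in declared.items():
        merged[path] = value
        new_owners[path] = manager  # the applier takes ownership of what it declares
    return merged, new_owners


def validate_strategy(obj):
    """Return the validation error quoted above, or None if the object is legal."""
    has_rolling = any(p.startswith("spec.strategy.rollingUpdate.") for p in obj)
    if obj.get("spec.strategy.type") == "Recreate" and has_rolling:
        return "spec.strategy.rollingUpdate: Forbidden"
    return None


# Live object as described above: created with the default strategy via CSA.
live = {
    "spec.strategy.type": "RollingUpdate",
    "spec.strategy.rollingUpdate.maxSurge": "25%",
    "spec.strategy.rollingUpdate.maxUnavailable": "25%",
}
owners = {path: "kubectl-client-side-apply" for path in live}

# The chart only declares the strategy type.
merged, merged_owners = ssa_apply(
    live, owners, {"spec.strategy.type": "Recreate"}, "kustomize-controller"
)
error = validate_strategy(merged)
print(error)
```

The rolling-update keys survive the merge because the applying manager never declared them, so the merged object fails validation even though the manifest is valid by itself.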

Why $patch: replace is NOT the answer

$patch: replace is a Strategic Merge Patch runtime directive. It does NOT belong in a chart's base resource. Reasons:

  1. API strict-decoding rejects it on CREATE. kubectl create, kubectl apply to an empty namespace, and kubectl apply --server-side all return:
    strict decoding error: unknown field "spec.strategy.$patch"
    
    This BREAKS fresh installs — including every new Sovereign bootstrap.
  2. Flux SSA rejects it. The kustomize-controller SSA path returns field not declared in schema on .spec.strategy.$patch.
  3. It is a runtime directive, not a chart field. $patch: replace is processed at SMP merge time by SMP-aware mergers. kustomize build does NOT consume the directive when it appears in a base resource — it passes it through as if it were a normal YAML key. The downstream API call then fails as above.

The correct place for $patch: replace is inside a Kustomize patches: entry, where the kustomize binary processes it at build time and emits a clean output that contains no $patch key. That does not fix the strategy-flip problem either, because the build-time output is identical to declaring strategy.type: Recreate directly, so it produces the same SSA failure.
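A chart-side lint can catch a stray directive before it reaches the API server. The sketch below assumes the manifest has already been parsed into Python dicts and lists (for example by a YAML loader, not shown). `find_smp_directives` is a hypothetical helper name; it flags any key beginning with `$`, which is how SMP directives such as `$patch` are spelled.

```python
def find_smp_directives(node, path=""):
    """Return the dotted path of every key that starts with '$' (e.g. $patch)."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if isinstance(key, str) and key.startswith("$"):
                hits.append(child)
            hits.extend(find_smp_directives(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_smp_directives(item, f"{path}[{index}]"))
    return hits


# A base resource that carries the directive inline (the rejected shape).
bad_manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "catalyst-api"},
    "spec": {"strategy": {"type": "Recreate", "$patch": "replace"}},
}
# The same resource without the directive.
good_manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "catalyst-api"},
    "spec": {"strategy": {"type": "Recreate"}},
}

bad_hits = find_smp_directives(bad_manifest)
good_hits = find_smp_directives(good_manifest)
print(bad_hits, good_hits)
```

A non-empty result is a reason to fail the build: the reported path is the same field the strict decoder would reject on a fresh install.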

The canonical fix

Annotate the Deployment with the Flux force annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalyst-api
  annotations:
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  replicas: 1
  strategy:
    type: Recreate
  # ...

When kustomize-controller's SSA dry-run fails with an Invalid response on this resource, the controller falls back to deleting and recreating the SINGLE annotated resource (not the whole Kustomization). The recreated Deployment has no residual rollingUpdate.* fields, so the regression cannot recur on the rebuilt object. The annotation lives in Git, is version-controlled, and applies on every reconcile.
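The fallback can be pictured as the control flow below. This is a simplified model written for this document, not Flux's source code: `FakeCluster`, `InvalidError`, and `reconcile` are hypothetical names, and the fake cluster enforces only the strategy rule discussed here.

```python
FORCE_ANNOTATION = "kustomize.toolkit.fluxcd.io/force"


class InvalidError(Exception):
    """Stand-in for an API server 'Invalid' response."""


class FakeCluster:
    """Holds one live Deployment (or None) and enforces the strategy rule."""

    def __init__(self, live):
        self.live = live

    def apply(self, manifest):
        stale = (
            self.live is not None
            and "rollingUpdate" in self.live["spec"]["strategy"]
        )
        if manifest["spec"]["strategy"]["type"] == "Recreate" and stale:
            raise InvalidError("spec.strategy.rollingUpdate: Forbidden")
        self.live = manifest

    def delete(self):
        self.live = None


def reconcile(cluster, manifest):
    """Apply; on Invalid, recreate only if the resource opted in via annotation."""
    try:
        cluster.apply(manifest)
        return "applied"
    except InvalidError:
        annotations = (manifest.get("metadata") or {}).get("annotations") or {}
        if annotations.get(FORCE_ANNOTATION) != "enabled":
            raise  # no opt-in: the reconcile stays failed
        cluster.delete()
        cluster.apply(manifest)
        return "recreated"


old = {"spec": {"strategy": {"type": "RollingUpdate",
                             "rollingUpdate": {"maxSurge": "25%"}}}}
new = {
    "metadata": {"name": "catalyst-api",
                 "annotations": {FORCE_ANNOTATION: "enabled"}},
    "spec": {"strategy": {"type": "Recreate"}},
}

cluster = FakeCluster(old)
outcome = reconcile(cluster, new)
print(outcome)
```

Without the annotation the same call re-raises the Invalid error, which corresponds to the Kustomization parking at Ready=False.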

This is not a "kubectl delete bandaid." Per INVIOLABLE-PRINCIPLES.md #3 (Follow the documented architecture, exactly — Flux is the ONLY GitOps reconciler) and #4 (Never hardcode — runtime configuration in Git, not in shell history): the remediation is declarative, scoped to the resource, and removed only by editing the chart.

When you may use this annotation

The Flux force annotation triggers delete + recreate on apply failure. Use it only on resources that:

  • Already declare strategy.type: Recreate (so delete-and-recreate is the steady-state update path anyway), OR
  • Carry no client traffic (a brief unavailability is acceptable), OR
  • Are explicitly designed to lose in-process state on every roll.

Do NOT add the annotation to a resource whose default update mode is RollingUpdate and whose pods serve live traffic. You would be sacrificing availability for an outcome that better resource authoring (selectors, immutable-field migrations) could deliver.

Required test coverage

Every chart that flips Deployment.spec.strategy.type MUST be covered by a test fixture in tests/integration/strategy-flip.yaml (or by an equivalent fixture kept next to the test for a similar regression). The test must:

  1. Stage a Deployment with the OLD strategy at the same name.
  2. Apply the NEW chart manifest.
  3. Assert the apply succeeds via the documented apply path.
  4. Assert the chart manifest carries the Flux force annotation.
  5. Assert the chart manifest is also valid for fresh install (no inline $patch: replace or other strict-decoding-violating directives).

The current implementation lives at tests/integration/strategy-flip.sh and the CI workflow at .github/workflows/test-strategy-flip.yaml. Wire any new strategy flip into both.
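Step 4 of the list above is a purely structural check and can be sketched without a cluster. The function below is an illustration, not the logic in strategy-flip.sh. It assumes manifests are already parsed into dicts, treats every Deployment that declares Recreate as a candidate, and uses a made-up `example-worker` resource for the failing case.

```python
FORCE_ANNOTATION = "kustomize.toolkit.fluxcd.io/force"


def recreate_deployments_missing_force(manifests):
    """Names of Deployments that declare Recreate without the force annotation."""
    missing = []
    for doc in manifests:
        if doc.get("kind") != "Deployment":
            continue
        strategy = (doc.get("spec") or {}).get("strategy") or {}
        if strategy.get("type") != "Recreate":
            continue
        metadata = doc.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        if annotations.get(FORCE_ANNOTATION) != "enabled":
            missing.append(metadata.get("name", "<unnamed>"))
    return missing


manifests = [
    {   # annotated as this document requires
        "kind": "Deployment",
        "metadata": {"name": "catalyst-api",
                     "annotations": {FORCE_ANNOTATION: "enabled"}},
        "spec": {"strategy": {"type": "Recreate"}},
    },
    {   # hypothetical resource that forgot the annotation
        "kind": "Deployment",
        "metadata": {"name": "example-worker"},
        "spec": {"strategy": {"type": "Recreate"}},
    },
    {   # default RollingUpdate: out of scope for this check
        "kind": "Deployment",
        "metadata": {"name": "catalyst-ui"},
        "spec": {},
    },
]

missing = recreate_deployments_missing_force(manifests)
print(missing)
```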

Reference incident

  • Date: 2026-04-29
  • Cluster: contabo-mkt
  • Resource: catalyst/catalyst-api
  • Symptom: Kustomization stuck Ready=False for hours; user unblocked manually with kubectl delete deploy catalyst-api -n catalyst. Flux re-created the Deployment from scratch on the next reconcile; the rollingUpdate.* fields were no longer present and the Kustomization went Ready=True.
  • Root cause: chart's api-deployment.yaml declared strategy.type: Recreate; the live object had been created with default RollingUpdate; SSA preserved the rollingUpdate fields under the prior field manager.
  • Durable fix: add kustomize.toolkit.fluxcd.io/force: enabled annotation to the chart manifest at products/catalyst/chart/templates/api-deployment.yaml.

Generalizing the lesson

Other chart fields that can collide on apply

The strategy flip is one instance of a broader class: fields whose old value and new value cannot legally coexist, where the old value is owned by a non-Flux field manager. The same fix applies to most of them: annotate the resource with kustomize.toolkit.fluxcd.io/force: enabled and let Flux recover via delete-and-recreate when SSA dry-run fails. The exceptions are data-bearing resources, where recreation destroys state; the Notes column flags them.

| Resource kind | Field that triggers an Invalid merge | Notes |
|---|---|---|
| Deployment | spec.strategy.type Recreate ↔ RollingUpdate | This document. |
| Deployment | spec.selector.matchLabels change | Selector is immutable post-create. Must recreate. |
| Service | spec.clusterIP (None ↔ value) | Immutable. Must recreate. |
| Service | spec.type ClusterIP ↔ NodePort ↔ LoadBalancer | Some transitions invalid; recreate is safe path. |
| PersistentVolumeClaim | spec.accessModes change after binding | Immutable post-bind. Recreate would lose data. DO NOT add force annotation; instead provision a new PVC under a new name and migrate. |
| StatefulSet | spec.serviceName, spec.selector | Immutable. Must recreate (which loses pod identity). Plan migrations carefully. |
| Job | spec.template.* after create | Immutable. Recreation is the only path. |

For PVCs and StatefulSets specifically: NEVER add the Flux force annotation as a default. Data loss is the failure mode. The right move is a paired migration: provision the new resource under a new name, copy data, swap references, retire the old.
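The PVC and StatefulSet rule lends itself to an automated guard. A minimal sketch, assuming parsed manifests as input: `DATA_BEARING_KINDS` and `unsafe_force_annotations` are hypothetical names, and `example-db` is an invented resource used only to show a violation.

```python
FORCE_ANNOTATION = "kustomize.toolkit.fluxcd.io/force"

# Kinds for which delete-and-recreate means data loss, per the rule above.
DATA_BEARING_KINDS = frozenset({"PersistentVolumeClaim", "StatefulSet"})


def unsafe_force_annotations(manifests):
    """Return 'Kind/name' for data-bearing resources that carry the annotation."""
    offenders = []
    for doc in manifests:
        if doc.get("kind") not in DATA_BEARING_KINDS:
            continue
        metadata = doc.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        if annotations.get(FORCE_ANNOTATION) == "enabled":
            offenders.append(f"{doc['kind']}/{metadata.get('name', '<unnamed>')}")
    return offenders


manifests = [
    {"kind": "Deployment",
     "metadata": {"name": "catalyst-api",
                  "annotations": {FORCE_ANNOTATION: "enabled"}}},
    {"kind": "StatefulSet",  # hypothetical offender
     "metadata": {"name": "example-db",
                  "annotations": {FORCE_ANNOTATION: "enabled"}}},
    {"kind": "PersistentVolumeClaim",
     "metadata": {"name": "example-data"}},
]

offenders = unsafe_force_annotations(manifests)
print(offenders)
```

The guard is deliberately strict. A chart that truly needs the annotation on one of these kinds would need an explicit, reviewed exception, not a silent pass.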

Authoring discipline

Before declaring "done" on any chart that touches a long-lived resource:

  1. Run the chart's manifest through kubectl apply --dry-run=server against an EMPTY namespace. Must succeed (no $patch: in the spec, no fields the strict decoder rejects).
  2. If the resource type appears in the table above, ALSO run kubectl apply --dry-run=server against a namespace where a PRIOR shape of the resource already exists. Must succeed under the user's documented apply path; if it fails, add the Flux force annotation AND the integration test.
  3. Verify the chart's kustomization.yaml references all template files (catches the "I added a template but forgot to wire it" regression).
  4. If the resource carries client traffic, document the recreate blast radius in the chart's leading comment — operators reading the chart need to know an apply may interrupt service.
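Step 3 of the checklist above can be approximated with a set difference. The sketch assumes kustomization.yaml sits in the templates directory and lists its resources by bare filename, and that the caller has already read that list (YAML parsing is not shown). `unreferenced_templates` and the `ui-deployment.yaml` filename are illustrative only.

```python
import tempfile
from pathlib import Path


def unreferenced_templates(template_dir, resources):
    """Template files on disk that the kustomization's resources list omits."""
    on_disk = {
        path.name
        for path in Path(template_dir).glob("*.yaml")
        if path.name != "kustomization.yaml"
    }
    return sorted(on_disk - set(resources))


# Build a throwaway directory that mimics a chart's templates folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("kustomization.yaml", "api-deployment.yaml", "ui-deployment.yaml"):
        (Path(tmp) / name).write_text("# placeholder\n")

    # Pretend the kustomization only wires one of the two templates.
    forgotten = unreferenced_templates(tmp, ["api-deployment.yaml"])

print(forgotten)
```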

Cross-references

  • docs/INVIOLABLE-PRINCIPLES.md #3 — Follow the documented architecture, exactly. Flux is the ONLY GitOps reconciler; remediations live in IaC, not in shell history.
  • docs/INVIOLABLE-PRINCIPLES.md #4 — Never hardcode. Runtime knobs live in Git as declarative resources, not as operator runbook steps.
  • Flux docs: https://fluxcd.io/flux/components/kustomize/kustomizations/#force — official documentation of the kustomize.toolkit.fluxcd.io/force: enabled annotation.
  • tests/integration/strategy-flip.sh — the runner that defends the Catalyst chart against this regression.
  • tests/integration/strategy-flip.yaml — the bad-state fixture and assertion contract.
  • .github/workflows/test-strategy-flip.yaml — CI wiring.