openova/docs/CHART-AUTHORING.md
hatiyildiz 015e7ab18b fix(catalyst-chart): annotate api-deployment for Flux strategy-flip recovery
DIVERGES from the literal "$patch: replace" prescription on the issue
because that directive cannot survive any apply path that actually
runs in production (verified end-to-end in
tests/integration/strategy-flip.sh):

  - Flux's kustomize-controller submits via Server-Side Apply. SSA
    rejects `.spec.strategy.$patch` with "field not declared in
    schema" — fluxcd/pkg/ssa Manager.Apply does not preprocess SMP
    directives.
  - kubectl strict-decoding rejects `$patch` on every CREATE path
    (`kubectl create`, `kubectl apply` to an empty namespace, every
    `--server-side` flavor) with "unknown field spec.strategy.$patch"
    — adding it to a chart base resource BREAKS fresh installs of
    every new Sovereign.

The durable fix is the documented Flux annotation
`kustomize.toolkit.fluxcd.io/force: enabled` on the Deployment.
When kustomize-controller's SSA dry-run fails Invalid (the contabo-
mkt failure mode: `spec.strategy.rollingUpdate: Forbidden` on the
post-merge object that retained `rollingUpdate.maxSurge=25%` /
`maxUnavailable=25%` from the prior `kubectl-client-side-apply`
field manager), the controller falls back to delete-and-recreate
THIS resource. The recreated Deployment carries no residual
`rollingUpdate.*` fields, so the regression cannot recur. The
annotation is IaC, scoped to the Deployment, applies on every
reconcile.

Verified gates:
  - `kubectl apply --dry-run=server -f .../api-deployment.yaml`
    over a Deployment in the bad pre-state (RollingUpdate +
    maxSurge=25% / maxUnavailable=25%) → exit 0,
    "deployment.apps/catalyst-api configured (server dry run)".
  - Same manifest applied to an empty namespace via SSA + CSA →
    both succeed (the fresh-install gate that catches `$patch:`-
    shaped regressions).
  - SSA path correctly REPRODUCES the regression mode (asserted
    in step 3 of the integration test) → proves the recovery layer
    is necessary.
  - Flux force-recovery equivalent (delete + apply) succeeds →
    proves the recovery path itself works.

Files:
  - products/catalyst/chart/templates/api-deployment.yaml: add
    `kustomize.toolkit.fluxcd.io/force: enabled` annotation +
    inline reference comment explaining failure mode and rejecting
    inline `$patch: replace` as a future regression vector.
  - docs/CHART-AUTHORING.md (new): authoritative chart-authoring
    doc, with §"Strategy flips on existing Deployments" anchoring
    the failure mode + canonical fix + table of related fields
    (selector, clusterIP, accessModes, etc.) that share the
    pattern. References docs/INVIOLABLE-PRINCIPLES.md #3 (Flux is
    the only GitOps reconciler) and #4 (never hardcode runtime
    knobs in operator runbooks).
  - tests/integration/strategy-flip.yaml (new): bad-state fixture
    + assertion ConfigMap. Reproduces the exact 25%/25% pre-state
    that triggered contabo-mkt.
  - tests/integration/strategy-flip.sh (new): 6-step runner —
    bad-state stage, CSA gate, SSA failure-mode reproduction,
    structural annotation check, recovery-path proof, fresh-
    install gate. Exits non-zero on any regression.
  - .github/workflows/test-strategy-flip.yaml (new): CI wiring on
    kind v1.30.6 (matches contabo-mkt k3s decoding behavior),
    triggered by edits to the chart manifest, the test, the doc,
    or the workflow itself.

Sweep of the rest of the Catalyst chart templates: the only
`strategy.type: Recreate` Deployment in the chart is catalyst-api.
catalyst-ui, marketplace-api, and all 11 sme-services Deployments
declare default RollingUpdate and live as RollingUpdate on contabo-
mkt — no latent flips. Services use ClusterIP with default IP
allocation; the api-deployments PVC is RWO and never re-shaped by
the chart. No additional resources needed hardening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:04:07 +02:00

230 lines
9.7 KiB
Markdown

# Chart Authoring Notes
**Status:** Authoritative.
**Audience:** Anyone editing a `products/<name>/chart/templates/*.yaml` or
`platform/<name>/chart/templates/*.yaml` resource that ships to a Flux-
reconciled cluster.
This document captures sharp edges in the chart-authoring workflow that
have already cost the project a real outage. Each section names a
specific failure mode, a specific reproducer, and the canonical fix —
in the same shape as `docs/INVIOLABLE-PRINCIPLES.md`. Read it before
declaring "done" on any chart that mutates a long-lived resource.
---
## Strategy flips on existing Deployments
### What goes wrong
A chart manifest declares `Deployment.spec.strategy.type: Recreate`.
The cluster already runs a Deployment of the same name that was
created earlier with the default `RollingUpdate` strategy (so
`spec.strategy.rollingUpdate.maxSurge=25%` and `maxUnavailable=25%`
exist on the live object). Flux's kustomize-controller submits the
new manifest via Server-Side Apply with the `kustomize-controller`
field manager. The API server merges, then validates. Validation
rejects with:
```
Deployment.apps "<name>" is invalid:
spec.strategy.rollingUpdate: Forbidden:
may not be specified when strategy `type` is 'Recreate'
```
The Flux Kustomization parks at `Ready=False` on every reconcile
until an operator intervenes.
### Why Server-Side Apply does this
SSA's contract is "set the fields you declare." It does NOT remove
fields owned by other field managers. The pre-existing Deployment was
created via `kubectl apply` (CSA), so the
`kubectl-client-side-apply` field manager owns
`.spec.strategy.rollingUpdate.maxSurge` and
`.spec.strategy.rollingUpdate.maxUnavailable`. When kustomize-
controller flips `.spec.strategy.type` to `Recreate`, those rolling-
update fields stay on the object. The post-merge state has both
`type: Recreate` AND `rollingUpdate.*` keys. The API validator forbids
that combination. SSA cannot fix this on its own.
### Why `$patch: replace` is NOT the answer
`$patch: replace` is a Strategic Merge Patch runtime directive. It
does NOT belong in a chart's base resource. Reasons:
1. **API strict-decoding rejects it on CREATE.** `kubectl create`,
`kubectl apply` to an empty namespace, and `kubectl apply
--server-side` all return:
```
strict decoding error: unknown field "spec.strategy.$patch"
```
This BREAKS fresh installs — including every new Sovereign
bootstrap.
2. **Flux SSA rejects it.** The `kustomize-controller` SSA path
returns `field not declared in schema` on
`.spec.strategy.$patch`.
3. **It is a runtime directive, not a chart field.** `$patch:
replace` is processed at SMP merge time by SMP-aware mergers.
`kustomize build` does NOT consume the directive when it appears
in a base resource — it passes it through as if it were a normal
YAML key. The downstream API call then fails as above.
The correct place for `$patch: replace` is inside a Kustomize
`patches:` entry, where the kustomize binary processes it at build
time and emits a clean output that contains no `$patch` key. That is
not what fixes the strategy-flip problem either, because the build-
time output is identical to declaring `strategy.type: Recreate`
directly — it produces the same SSA failure.
### The canonical fix
Annotate the Deployment with the Flux force annotation:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: catalyst-api
annotations:
kustomize.toolkit.fluxcd.io/force: enabled
spec:
replicas: 1
strategy:
type: Recreate
# ...
```
When kustomize-controller's SSA dry-run fails with an Invalid response
on this resource, the controller falls back to delete-and-recreate the
SINGLE annotated resource (not the whole Kustomization). The
recreated Deployment has no residual `rollingUpdate.*` fields — the
regression cannot recur on the rebuilt object. The annotation lives
in Git, version-controlled, applies on every reconcile.
This is **not** a "kubectl delete bandaid." Per
[INVIOLABLE-PRINCIPLES.md](INVIOLABLE-PRINCIPLES.md) #3 (Follow the
documented architecture, exactly — Flux is the ONLY GitOps reconciler)
and #4 (Never hardcode — runtime configuration in Git, not in shell
history): the remediation is declarative, scoped to the resource, and
removed only by editing the chart.
### When you may use this annotation
The Flux force annotation triggers delete + recreate on apply
failure. Use it only on resources that:
- Already declare `strategy.type: Recreate` (so delete-and-recreate is
the steady-state update path anyway), OR
- Carry no client traffic (a brief unavailability is acceptable), OR
- Are explicitly designed to lose in-process state on every roll.
Do NOT add the annotation to a resource whose default update mode is
`RollingUpdate` and whose pods serve live traffic — you would be
trading off availability against an outcome that better resource
authoring (selectors, immutable-field migrations) could deliver.
### Required test coverage
Every chart that flips `Deployment.spec.strategy.type` MUST be covered
by a test fixture in `tests/integration/strategy-flip.yaml` (or its
equivalent next to a similar regression). The test must:
1. Stage a Deployment with the OLD strategy at the same name.
2. Apply the NEW chart manifest.
3. Assert the apply succeeds via the documented apply path.
4. Assert the chart manifest carries the Flux force annotation.
5. Assert the chart manifest is also valid for fresh install (no
inline `$patch: replace` or other strict-decoding-violating
directives).
The current implementation lives at
[`tests/integration/strategy-flip.sh`](../tests/integration/strategy-flip.sh)
and the CI workflow at
`.github/workflows/test-strategy-flip.yaml`. Wire any new strategy-
flip into both.
### Reference incident
- **Date:** 2026-04-29
- **Cluster:** contabo-mkt
- **Resource:** `catalyst/catalyst-api`
- **Symptom:** Kustomization stuck Ready=False for hours; user
unblocked manually with `kubectl delete deploy catalyst-api -n
catalyst`. Flux re-created the Deployment from scratch on the next
reconcile; the `rollingUpdate.*` fields were no longer present and
the Kustomization went Ready=True.
- **Root cause:** chart's `api-deployment.yaml` declared
`strategy.type: Recreate`; the live object had been created with
default RollingUpdate; SSA preserved the rollingUpdate fields under
the prior field manager.
- **Durable fix:** add `kustomize.toolkit.fluxcd.io/force: enabled`
annotation to the chart manifest at
`products/catalyst/chart/templates/api-deployment.yaml`.
---
## Generalizing the lesson
### Other chart fields that can collide on apply
The strategy-flip is one instance of a broader class: fields whose
**old value** and **new value** cannot legally coexist, where the old
value is owned by a non-Flux field manager. The same fix applies to
each of them — annotate the resource with
`kustomize.toolkit.fluxcd.io/force: enabled` and let Flux recover via
delete-and-recreate when SSA dry-run fails.
| Resource kind | Field that triggers an Invalid merge | Notes |
|---|---|---|
| `Deployment` | `spec.strategy.type` Recreate ↔ RollingUpdate | This document. |
| `Deployment` | `spec.selector.matchLabels` change | Selector is immutable post-create. Must recreate. |
| `Service` | `spec.clusterIP` (None ↔ value) | Immutable. Must recreate. |
| `Service` | `spec.type` ClusterIP ↔ NodePort ↔ LoadBalancer | Some transitions invalid; recreate is safe path. |
| `PersistentVolumeClaim` | `spec.accessModes` change after binding | Immutable post-bind. Recreate would lose data — DO NOT add force annotation; instead provision a new PVC under a new name and migrate. |
| `StatefulSet` | `spec.serviceName`, `spec.selector` | Immutable. Must recreate (which loses pod identity). Plan migrations carefully. |
| `Job` | `spec.template.*` after create | Immutable. Recreation is the only path. |
For PVCs and StatefulSets specifically: NEVER add the Flux force
annotation as a default. Data loss is the failure mode. The right
move is a paired migration: provision the new resource under a new
name, copy data, swap references, retire the old.
### Authoring discipline
Before declaring "done" on any chart that touches a long-lived
resource:
1. Run the chart's manifest through `kubectl apply --dry-run=server`
against an EMPTY namespace. Must succeed (no `$patch:` in the
spec, no fields the strict decoder rejects).
2. If the resource type appears in the table above, ALSO run
`kubectl apply --dry-run=server` against a namespace where a
PRIOR shape of the resource already exists. Must succeed under the
user's documented apply path; if it fails, add the Flux force
annotation AND the integration test.
3. Verify the chart's `kustomization.yaml` references all template
files (catches the "I added a template but forgot to wire it"
regression).
4. If the resource carries client traffic, document the recreate
blast radius in the chart's leading comment — operators reading
the chart need to know an apply may interrupt service.
### Cross-references
- [`docs/INVIOLABLE-PRINCIPLES.md`](INVIOLABLE-PRINCIPLES.md) #3
Follow the documented architecture, exactly. Flux is the ONLY
GitOps reconciler; remediations live in IaC, not in shell history.
- [`docs/INVIOLABLE-PRINCIPLES.md`](INVIOLABLE-PRINCIPLES.md) #4
Never hardcode. Runtime knobs live in Git as declarative resources,
not as operator runbook steps.
- Flux docs:
https://fluxcd.io/flux/components/kustomize/kustomizations/#force
— official documentation of the
`kustomize.toolkit.fluxcd.io/force: enabled` annotation.
- `tests/integration/strategy-flip.sh` — the runner that defends the
Catalyst chart against this regression.
- `tests/integration/strategy-flip.yaml` — the bad-state fixture and
assertion contract.
- `.github/workflows/test-strategy-flip.yaml` — CI wiring.