e3mrah 05cb39c042
fix(bp-flux): catalyst-cluster-reconciler ClusterRoleBinding overlay (closes #338) (#393)
PROBLEM
-------
On Sovereign-1 (otech.omani.works, 2026-04-30) every HelmRelease that
transitioned through pending-install/pending-upgrade got stuck because
the helm-controller SA could not UPDATE its own helm-storage Secrets
(sh.helm.release.v1.<name>.<n>) in flux-system. Symptom:

  secrets "sh.helm.release.v1.catalyst-platform.v1" is forbidden:
  User "system:serviceaccount:flux-system:helm-controller" cannot
  update resource "secrets" in API group "" in the namespace "flux-system"

Runtime workaround on otech (added 2026-04-30): a manual ClusterRoleBinding,
flux-system-helm-controller-admin, that binds cluster-admin to the
flux-system/helm-controller ServiceAccount. The permanent fix was tracked
in #338, which this commit closes.
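
For reference, the manual workaround corresponds roughly to the manifest
below. This is a sketch reconstructed from the names given above; the
object actually created on otech was applied by hand and may differ in
labels or annotations.

```yaml
# Sketch of the runtime workaround binding (names taken from the text above).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flux-system-helm-controller-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  # Only helm-controller was bound by the workaround.
  - kind: ServiceAccount
    name: helm-controller
    namespace: flux-system
```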

FIX
---
Add platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml — a
Catalyst-managed ClusterRoleBinding (catalyst-cluster-reconciler) that
binds cluster-admin to helm-controller AND kustomize-controller in
.Values.catalyst.fluxNamespace (default flux-system). The overlay is
independent of the upstream subchart's cluster-reconciler binding
(different name, no ownership conflict), so if the upstream binding ever
drifts again the overlay still keeps the required permissions in place.
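
A minimal sketch of what such a template could look like, assuming the
namespace default is supplied by values.yaml rather than inline. The
real catalyst-cluster-reconciler-rbac.yaml may carry extra labels,
annotations, or guards that are not shown here.

```yaml
# Sketch only: approximates templates/catalyst-cluster-reconciler-rbac.yaml.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  # Distinct from the upstream subchart's binding name, so the two
  # objects never compete for ownership.
  name: catalyst-cluster-reconciler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: helm-controller
    namespace: {{ .Values.catalyst.fluxNamespace | quote }}
  - kind: ServiceAccount
    name: kustomize-controller
    namespace: {{ .Values.catalyst.fluxNamespace | quote }}
```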

WHY cluster-admin (not narrower)
--------------------------------
helm-controller installs arbitrary user-supplied Helm charts which can
ship any K8s resource (CRDs, ClusterRoles, MutatingWebhookConfigurations,
etc.). There is no narrower role that satisfies the full install path.
The Flux project's own bootstrap install.yaml binds cluster-admin for
the same reason (upstream default multitenancy.privileged=true).
Multi-tenancy lockdown is a Sovereign Day-2 hardening choice tracked
separately.

NEVER-HARDCODE COMPLIANCE
-------------------------
Per docs/INVIOLABLE-PRINCIPLES.md #4, the namespace is operator-overridable
via .Values.catalyst.fluxNamespace. Default is flux-system because that's
the canonical Catalyst install namespace (matches cloud-init's flux2
install.yaml + clusters/_template/bootstrap-kit/03-flux.yaml).
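
The corresponding values.yaml entry would look roughly like this. The
key path comes from the text above; the comments are illustrative.

```yaml
# Sketch of the values.yaml default for the Flux install namespace.
catalyst:
  # Namespace holding helm-controller and kustomize-controller.
  # Operators can override it, e.g. --set catalyst.fluxNamespace=custom
  fluxNamespace: flux-system
```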

VERSION
-------
- bp-flux 1.1.2 → 1.1.3 (Chart.yaml + blueprint.yaml + 3 bootstrap-kit refs).
- The flux2 subchart pin (2.14.1) is unchanged — version-pin replay test
  remains green (cloud-init v2.4.0 == subchart appVersion 2.4.0).
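
As a rough illustration, the relevant Chart.yaml fields after the bump
might read as follows. The chart name and versions are taken from the
notes above; the repository URL is an assumption (the fluxcd-community
charts repo) and should be checked against the real Chart.yaml.

```yaml
# Sketch of platform/flux/chart/Chart.yaml after the version bump.
apiVersion: v2
name: bp-flux
version: 1.1.3          # bumped from 1.1.2
dependencies:
  - name: flux2
    version: 2.14.1     # pin unchanged by this commit
    # Assumed repository URL, not confirmed by the source.
    repository: https://fluxcd-community.github.io/helm-charts
```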

VERIFICATION
------------
- platform/flux/chart/tests/version-pin-replay.sh — all 6 cases PASS.
- platform/flux/chart/tests/observability-toggle.sh — all 3 cases PASS.
- helm template renders the new ClusterRoleBinding with correct subjects
  (flux-system by default; the --set catalyst.fluxNamespace=custom
  override path was also verified).
- scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles.

FILES
-----
- platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml (new)
- platform/flux/chart/Chart.yaml (1.1.2 → 1.1.3)
- platform/flux/chart/values.yaml (catalyst.fluxNamespace default)
- platform/flux/blueprint.yaml (1.1.2 → 1.1.3)
- clusters/{_template,otech.omani.works,omantel.omani.works}/bootstrap-kit/03-flux.yaml (chart version)
- docs/lessons-learned/helm-controller-rbac.md (permanent-fix note)
- docs/omantel-handover-wbs.md (#338 status row)

Refs: #43 #369 #338
Lesson: docs/lessons-learned/helm-controller-rbac.md

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-01 15:56:45 +04:00

Lessons Learned

Operational knowledge discovered during platform development: platform and infrastructure behaviors that exist regardless of our code, non-obvious configuration or behavior found during debugging, and patterns that would bite the next contributor.

Organized by domain.

| Domain | What's in it |
| --- | --- |
| helm-controller-rbac.md | Flux helm-controller v1.1.0 RBAC + template-parse quirks |
| helm-controller-logs.md | Flux v2.4 helm-controller stdout uses nested-object JSON for HelmRelease, not flat strings |
| chi-router-quirks.md | go-chi does not decode %3A (and other path-safe specials) before route matching |
| helm-hooks-and-crd-ordering.md | before-hook-creation deadlocks on first install when the CRD comes from the same chart's upstream subchart; the architectural fix is chart-split + Flux dependsOn |
| catalyst-bootstrap-api.md | tofu destroy works against the on-disk workdir without re-prompting credentials; destructive endpoints split tofu vs Hetzner-direct paths cleanly |
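
As a quick illustration of the chart-split + Flux dependsOn pattern mentioned for helm-hooks-and-crd-ordering.md, a consumer HelmRelease can declare that it must wait for a separate CRD-providing release. This is a minimal sketch with hypothetical names (`example-crds`, `example-consumer`, `example-repo`); the linked lesson describes the actual charts involved.

```yaml
# Sketch only: all release, chart, and repository names are hypothetical.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-consumer
  namespace: flux-system
spec:
  interval: 10m
  # Flux holds this release back until example-crds is Ready, so the CRDs
  # already exist by the time this chart's hooks run.
  dependsOn:
    - name: example-crds
  chart:
    spec:
      chart: example-consumer
      version: "1.0.0"
      sourceRef:
        kind: HelmRepository
        name: example-repo
```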