From 1734979d741f7931a1c977944cb3be6ec217830f Mon Sep 17 00:00:00 2001
From: e3mrah <81884938+emrahbaysal@users.noreply.github.com>
Date: Sun, 3 May 2026 10:32:38 +0400
Subject: [PATCH] fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained
`dependencies: [...]` arrays that deviated from the real Flux install
graph in clusters/_template/bootstrap-kit/*.yaml. Examples (otech34
surfaced this):

  componentGroups.ts                  Flux HelmRelease.dependsOn
  ----------------------              ---------------------------
  keycloak: [cnpg]                    keycloak: [cert-manager, gateway-api]
  openbao: []                         openbao: [spire, gateway-api, cnpg]
  harbor: [cnpg, seaweedfs, valkey]   harbor: [cnpg, cert-manager, gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are coming
from. The single source of truth for the dependencies is flux!!!"
— 2026-05-03

This commit:

1. Adds scripts/generate-blueprint-deps.sh that parses every
   bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
   keyed by bare component id (bp- prefix stripped on both source and
   target side).
2. Commits the generated JSON.
3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts thin
   TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
4. Patches componentGroups.ts so every RAW_COMPONENT's `dependencies`
   field is OVERRIDDEN at module load with the Flux-canonical list
   (the inline `dependencies: [...]` literals are now ignored — Flux
   is canonical).

Follow-ups (not in this PR):
- CI drift check that re-runs the script and diffs the JSON.
- Strip the inline `dependencies: [...]` arrays entirely once the
  drift check is green.
- Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:

  keycloak: ['cert-manager', 'openbao']   ← FALSE; Flux says no openbao

The reason the founder kept seeing the spurious arrow on the Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready —
helmwatch maps both back to HelmStatePending. The bridge then flips the
row to status='pending' even though an active Execution is streaming
exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.

Co-authored-by: hatiyildiz

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily').
Flux's HelmRelease for external-secrets is in DependencyNotReady,
helmwatch maps that to StatePending, bridge writes Status=pending — no
signal that the upstream FAILED rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.

Co-authored-by: hatiyildiz

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4× on
'bao operator init' with:

  failed to create fsnotify watcher: too many open files

Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far too
low for a 35-component bootstrap-kit (k3s kubelet + Flux
helm-controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an inotify
slot gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints don't
tickle the same wall:

  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches = 1048576
  fs.inotify.max_queued_events = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.

Co-authored-by: hatiyildiz

---------

Co-authored-by: hatiyildiz
---
 infra/hetzner/cloudinit-control-plane.tftpl | 25 +++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/infra/hetzner/cloudinit-control-plane.tftpl b/infra/hetzner/cloudinit-control-plane.tftpl
index 500b4ef6..f84e8b23 100644
--- a/infra/hetzner/cloudinit-control-plane.tftpl
+++ b/infra/hetzner/cloudinit-control-plane.tftpl
@@ -51,6 +51,26 @@ write_files:
         "gitopsBranch": "${gitops_branch}"
       }
 
+  # ── Kernel inotify limits — k3s + Flux + CNPG + bao + Helm exhaust Ubuntu defaults ──
+  # Default Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128
+  # and fs.inotify.max_user_watches=524288 — but every Helm controller,
+  # CNPG operator, k3s kubelet, file-watching admin tool grabs an
+  # instance slot. On a 35-component bootstrap-kit the slots run out
+  # mid-install and the next process to ask gets:
+  #     failed to create fsnotify watcher: too many open files
+  # Diagnosed live during otech35 — bp-openbao's `bao operator init`
+  # crash-looped 4× with that exact error, which Flux escalated to
+  # InstallFailed/RetriesExceeded — masking the real OS-level root cause.
+  #
+  # Bump well above k8s/k3s production guidance so future blueprint
+  # additions don't tickle the same wall.
+  - path: /etc/sysctl.d/99-catalyst-inotify.conf
+    permissions: '0644'
+    content: |
+      fs.inotify.max_user_instances = 8192
+      fs.inotify.max_user_watches = 1048576
+      fs.inotify.max_queued_events = 16384
+
   # ── OS hardening: SSH daemon ──────────────────────────────────────────
   # Drop-in overrides /etc/ssh/sshd_config defaults. Per Catalyst's threat
   # model the operator's only valid path in is the Hetzner-project SSH key
@@ -728,6 +748,11 @@ runcmd:
   - update-alternatives --set iptables /usr/sbin/iptables-legacy || true
   - update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy || true
 
+  # Apply inotify-limit bumps written by write_files. sysctl --system
+  # picks up /etc/sysctl.d/*.conf so future blueprints + bao init never
+  # hit "too many open files" again.
+  - sysctl --system
+
   # Activate hardened sshd config (cloud-init may have written authorized_keys
   # already from Hetzner ssh_keys[]; we never touch that file).
   - systemctl reload ssh || systemctl reload sshd || true
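
Reviewer note: a minimal sketch, assuming a POSIX shell on the provisioned node, of how the new limits could be checked once cloud-init has finished. The function names `limit_ok` and `check_inotify_limits` are hypothetical and not part of this patch. The three thresholds are copied from 99-catalyst-inotify.conf in the hunk above, and the tunables are read from the standard /proc/sys/fs/inotify directory.

```shell
#!/bin/sh
# Hypothetical post-boot check; NOT part of the patch.
# Minimums mirror /etc/sysctl.d/99-catalyst-inotify.conf.

# limit_ok CURRENT MINIMUM
# Succeeds when CURRENT is an integer greater than or equal to MINIMUM.
limit_ok() {
    [ "$1" -ge "$2" ] 2>/dev/null
}

# check_inotify_limits [DIR]
# Compares each inotify tunable found under DIR (default:
# /proc/sys/fs/inotify) with its expected minimum, prints one line per
# tunable, and returns non-zero if any value is too low or unreadable.
check_inotify_limits() {
    dir="${1:-/proc/sys/fs/inotify}"
    rc=0
    for pair in \
        max_user_instances:8192 \
        max_user_watches:1048576 \
        max_queued_events:16384
    do
        name="${pair%%:*}"   # text before the colon
        want="${pair##*:}"   # text after the colon
        have="$(cat "$dir/$name" 2>/dev/null)"
        if limit_ok "$have" "$want"; then
            echo "ok   fs.inotify.$name=$have"
        else
            echo "LOW  fs.inotify.$name=$have (want >= $want)"
            rc=1
        fi
    done
    return "$rc"
}
```

On a node, one might run `check_inotify_limits || echo "inotify limits below target" >&2` after first boot. The optional directory argument exists only so the check can be exercised against a scratch directory instead of the live /proc tree.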