fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)

* fix(bp-harbor): use grep -oE for password extraction (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts                Flux HelmRelease.dependsOn
  ------------------------------    ---------------------------------------
  keycloak: [cnpg]                  keycloak: [cert-manager, gateway-api]
  openbao:  []                      openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs,       harbor:   [cnpg, cert-manager,
             valkey]                           gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts,
     a thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list; the inline `dependencies: [...]` literals
     are now ignored.

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
  keycloak: ['cert-manager', 'openbao']  ← FALSE; Flux says no openbao
This was the reason the founder kept seeing the spurious arrow on the
Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. The sweep is idempotent on read and, because it
operates on the live snapshot, reverses automatically if openbao
recovers.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
  failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The 128
instances are exhausted within minutes; the next process to request
an inotify instance gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches   = 1048576
  fs.inotify.max_queued_events  = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
e3mrah 2026-05-03 10:32:38 +04:00 committed by GitHub
parent 7b4d4616b6
commit 1734979d74


@@ -51,6 +51,26 @@ write_files:
"gitopsBranch": "${gitops_branch}"
}
# ── Kernel inotify limits — k3s + Flux + CNPG + bao + Helm exhaust Ubuntu defaults ──
# Default Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128
# and fs.inotify.max_user_watches=524288 — but every Helm controller,
# CNPG operator, k3s kubelet, file-watching admin tool grabs an
# instance slot. On a 35-component bootstrap-kit the slots run out
# mid-install and the next process to ask gets:
# failed to create fsnotify watcher: too many open files
# Diagnosed live during otech35 — bp-openbao's `bao operator init`
# crash-looped 4× with that exact error, which Flux escalated to
# InstallFailed/RetriesExceeded — masking the real OS-level root cause.
#
# Bump well above k8s/k3s production guidance so future blueprint
# additions don't tickle the same wall.
- path: /etc/sysctl.d/99-catalyst-inotify.conf
  permissions: '0644'
  content: |
    fs.inotify.max_user_instances = 8192
    fs.inotify.max_user_watches   = 1048576
    fs.inotify.max_queued_events  = 16384
# ── OS hardening: SSH daemon ──────────────────────────────────────────
# Drop-in overrides /etc/ssh/sshd_config defaults. Per Catalyst's threat
# model the operator's only valid path in is the Hetzner-project SSH key
@@ -728,6 +748,11 @@ runcmd:
- update-alternatives --set iptables /usr/sbin/iptables-legacy || true
- update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy || true
# Apply inotify-limit bumps written by write_files. sysctl --system
# picks up /etc/sysctl.d/*.conf so future blueprints + bao init never
# hit "too many open files" again.
- sysctl --system
# Activate hardened sshd config (cloud-init may have written authorized_keys
# already from Hetzner ssh_keys[]; we never touch that file).
- systemctl reload ssh || systemctl reload sshd || true