fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)
* fix(bp-harbor): grep -oE for password (multi-line tolerant) (chart 1.2.13)
* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)
The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):
  componentGroups.ts              Flux HelmRelease.dependsOn
  ------------------------------  --------------------------------------
  keycloak: [cnpg]                keycloak: [cert-manager, gateway-api]
  openbao:  []                    openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs,     harbor:   [cnpg, cert-manager,
             valkey]                         gateway-api]
Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03
This commit:
1. Adds scripts/generate-blueprint-deps.sh that parses every
bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
keyed by bare component id (bp- prefix stripped on both source
and target side).
2. Commits the generated JSON.
3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
4. Patches componentGroups.ts so every RAW_COMPONENT's
`dependencies` field is OVERRIDDEN at module load with the
Flux-canonical list (the inline `dependencies: [...]` literals
are now ignored — Flux is canonical).
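The bare-id keying in step 1 can be sketched in Go (illustrative only: the real extraction is scripts/generate-blueprint-deps.sh, and stripBP/buildDeps are hypothetical names):

```go
package main

import (
	"fmt"
	"strings"
)

// stripBP converts a Flux HelmRelease name like "bp-openbao" to the
// bare wizard component id "openbao". Names without the prefix pass
// through unchanged.
func stripBP(name string) string {
	return strings.TrimPrefix(name, "bp-")
}

// buildDeps keys every component by its bare id, stripping the bp-
// prefix on both the source and the target side, mirroring what
// blueprint-deps.generated.json is described as containing.
func buildDeps(releases map[string][]string) map[string][]string {
	out := make(map[string][]string, len(releases))
	for name, deps := range releases {
		bare := make([]string, 0, len(deps))
		for _, d := range deps {
			bare = append(bare, stripBP(d))
		}
		out[stripBP(name)] = bare
	}
	return out
}

func main() {
	// Sample dependsOn data taken from the table above; the real
	// script parses clusters/_template/bootstrap-kit/*.yaml instead.
	releases := map[string][]string{
		"bp-openbao": {"bp-spire", "bp-gateway-api", "bp-cnpg"},
	}
	fmt.Println(buildDeps(releases)["openbao"]) // [spire gateway-api cnpg]
}
```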
Follow-ups (not in this PR):
- CI drift check that re-runs the script and diffs the JSON.
- Strip the inline `dependencies: [...]` arrays entirely once the
drift check is green.
- Wire the FlowPage edge-rendering to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT
PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
keycloak: ['cert-manager', 'openbao'] ← FALSE; Flux says no openbao
This was why the founder kept seeing the spurious arrow on the Flow page.
Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): don't regress status to pending after exec started
helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).
Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.
Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
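A minimal Go sketch of the guard (guardStatus and the status constants are illustrative names, not the actual helmwatch_bridge.go API):

```go
package main

import "fmt"

type Status string

const (
	StatusPending Status = "pending"
	StatusRunning Status = "running"
)

// guardStatus applies the monotonic rule described above: once an
// Execution has been allocated for a component (activeExecID set),
// a Helm event that maps back to pending must not regress the row;
// it is treated as still running instead.
func guardStatus(activeExecID string, next Status) Status {
	if activeExecID != "" && next == StatusPending {
		return StatusRunning
	}
	return next
}

func main() {
	// bp-openbao oscillates to DependencyNotReady -> pending while an
	// Execution is already streaming; the guard keeps the row running.
	fmt.Println(guardStatus("exec-123", StatusPending)) // running
	fmt.Println(guardStatus("", StatusPending))         // pending
}
```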
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): cascade Failed status through dependsOn (fail-fast)
Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.
Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. The sweep is idempotent on read and, because it
operates on the live snapshot, reverses automatically if openbao
recovers.
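The sweep can be sketched roughly like this (propagateFailed is a hypothetical name; the real logic lives in deriveTreeView):

```go
package main

import "fmt"

// propagateFailed marks any job failed when one of its dependsOn
// targets is failed. Up to 8 sweeps cover the deepest chain, and the
// loop stops early once a sweep changes nothing. It mutates only the
// snapshot passed in, so a recovered upstream on the next read
// produces a clean result again.
func propagateFailed(status map[string]string, dependsOn map[string][]string) {
	for sweep := 0; sweep < 8; sweep++ {
		changed := false
		for job, deps := range dependsOn {
			if status[job] == "failed" {
				continue
			}
			for _, d := range deps {
				if status[d] == "failed" {
					status[job] = "failed"
					changed = true
					break
				}
			}
		}
		if !changed {
			return
		}
	}
}

func main() {
	// The otech34 case: openbao failed, external-secrets stuck pending.
	status := map[string]string{
		"install-openbao":                 "failed",
		"install-external-secrets":        "pending",
		"install-external-secrets-stores": "pending",
	}
	dependsOn := map[string][]string{
		"install-external-secrets":        {"install-openbao"},
		"install-external-secrets-stores": {"install-external-secrets"},
	}
	propagateFailed(status, dependsOn)
	fmt.Println(status["install-external-secrets"]) // failed
}
```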
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'
Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.
Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.
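One way to confirm the diagnosis on a live node is to count inotify instances across /proc (a diagnostic sketch, not part of this commit; on Linux each inotify handle appears as an fd symlink to anon_inode:inotify, and full coverage needs root):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isInotifyFD reports whether an fd symlink target is an inotify
// instance. On Linux, /proc/<pid>/fd/<n> links to "anon_inode:inotify"
// for each inotify_init handle a process holds.
func isInotifyFD(target string) bool {
	return strings.HasPrefix(target, "anon_inode:inotify")
}

// countInotify walks /proc/*/fd and counts inotify instances held by
// processes we are allowed to inspect, a rough number to compare
// against fs.inotify.max_user_instances.
func countInotify() int {
	count := 0
	fds, _ := filepath.Glob("/proc/[0-9]*/fd/*")
	for _, fd := range fds {
		if target, err := os.Readlink(fd); err == nil && isInotifyFD(target) {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("inotify instances in use:", countInotify())
}
```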
Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576
fs.inotify.max_queued_events = 16384
Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
This commit is contained in:
parent 7b4d4616b6
commit 1734979d74
@@ -51,6 +51,26 @@ write_files:
         "gitopsBranch": "${gitops_branch}"
       }
 
+  # ── Kernel inotify limits — k3s + Flux + CNPG + bao + Helm exhaust Ubuntu defaults ──
+  # Default Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128
+  # and fs.inotify.max_user_watches=524288 — but every Helm controller,
+  # CNPG operator, k3s kubelet, file-watching admin tool grabs an
+  # instance slot. On a 35-component bootstrap-kit the slots run out
+  # mid-install and the next process to ask gets:
+  #   failed to create fsnotify watcher: too many open files
+  # Diagnosed live during otech35 — bp-openbao's `bao operator init`
+  # crash-looped 4× with that exact error, which Flux escalated to
+  # InstallFailed/RetriesExceeded — masking the real OS-level root cause.
+  #
+  # Bump well above k8s/k3s production guidance so future blueprint
+  # additions don't tickle the same wall.
+  - path: /etc/sysctl.d/99-catalyst-inotify.conf
+    permissions: '0644'
+    content: |
+      fs.inotify.max_user_instances = 8192
+      fs.inotify.max_user_watches = 1048576
+      fs.inotify.max_queued_events = 16384
+
   # ── OS hardening: SSH daemon ──────────────────────────────────────────
   # Drop-in overrides /etc/ssh/sshd_config defaults. Per Catalyst's threat
   # model the operator's only valid path in is the Hetzner-project SSH key
@@ -728,6 +748,11 @@ runcmd:
   - update-alternatives --set iptables /usr/sbin/iptables-legacy || true
   - update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy || true
 
+  # Apply inotify-limit bumps written by write_files. sysctl --system
+  # picks up /etc/sysctl.d/*.conf so future blueprints + bao init never
+  # hit "too many open files" again.
+  - sysctl --system
+
   # Activate hardened sshd config (cloud-init may have written authorized_keys
   # already from Hetzner ssh_keys[]; we never touch that file).
   - systemctl reload ssh || systemctl reload sshd || true