openova/clusters/_template/bootstrap-kit/21-alloy.yaml
e3mrah 0dbdf3b327
fix(bp-trivy): node-collector tolerates control-plane taint (closes #769) (#772)
PR #755 added the `node-role.kubernetes.io/control-plane=true:NoSchedule`
taint to the CP node when worker_count > 0. Two bootstrap-kit charts have
pods that MUST land on the CP but lacked the matching toleration (sketched
below):

bp-trivy
  • node-collector: Pod pinned to each node via nodeSelector
    `kubernetes.io/hostname=<node>`. The CP-bound collector reads
    /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler and
    /var/lib/kube-controller-manager via hostPath; apart from
    /var/lib/kubelet, these paths exist only on the CP. Without the
    toleration the collector sat Pending forever on otech93 (live
    evidence in #769).
  • scanJobTolerations: the per-workload scan jobs the operator spawns
    may target pods that run on the CP (kube-system kube-proxy in
    non-Cilium mode, etc.). The toleration is added here too, so reports
    are produced for those workloads as well.

bp-alloy
  • DaemonSet: one pod MUST land on every node, including the CP, so
    CP-local kubelet logs + node metrics flow into the LGTM stack.
    Without the toleration Alloy ran on 3/4 nodes (Ready=N-1) on otech93
    and CP telemetry was silently lost.

Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP
is untainted in solo mode per PR #755's conditional.
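
For reference, a minimal toleration matching that taint (sketch only; the
charts may equally use `operator: Equal` + `value: "true"`, which matches
the same taint):

    tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule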

Versions bumped:
  • bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins)
  • bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins)

Out of scope (audited, no change needed):
  • bp-cilium — upstream defaults already tolerate everything (verified
    on otech93: cilium DaemonSet at 4/4 nodes).
  • bp-falco — values.yaml already declares NoSchedule + NoExecute
    Exists tolerations (4/4 on otech93).
  • cnpg/harbor — no kubelet-cert-renew Jobs in current charts.

Verified:
  • `helm template` on both charts renders the expected toleration
    (alloy: in the Pod spec; trivy: in the trivy-operator-config
    ConfigMap, which the operator consumes at scan-job spawn time);
    example spot-checks below.
  • `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift).
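
Example spot-checks (sketch; the trivy wrapper-chart path is assumed by
analogy with the documented platform/alloy/chart/ and may differ):

    helm template alloy platform/alloy/chart/ \
      | grep -B1 -A3 'node-role.kubernetes.io/control-plane'
    helm template trivy platform/trivy/chart/ \
      | grep -A2 'scanJob.tolerations'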

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:38:29 +02:00

# bp-alloy — Catalyst Blueprint #21 (W2.K2 Observability batch).
# Grafana Alloy — the unified telemetry collector for the LGTM stack.
# Runs as a DaemonSet on every node; tails container logs, scrapes
# Prometheus metrics, and forwards traces. Co-resident with bp-opentelemetry
# (slot 20) — Alloy handles host-level collection (kubelet, journald,
# node_exporter) while OTel handles app-level OTLP.
#
# Wrapper chart: platform/alloy/chart/
# Reconciled by: Flux on the new Sovereign's k3s control plane, AFTER
# bp-opentelemetry is Ready (Alloy's default config
# forwards OTLP to the Collector's gRPC endpoint).
#
# dependsOn:
# - bp-opentelemetry (slot 20) — Alloy forwards OTLP to the Collector.
# Without the Collector Service in place, Alloy retries forever on a
# non-existent upstream.
#
# disableWait: Alloy is a DaemonSet, so Helm `--wait` would block on
# every node's Alloy Pod becoming Ready. On larger Sovereigns this can
# legitimately take >5min during a cold-start image pull. The HelmRelease
# therefore reports Ready once manifests apply; runtime convergence is
# observed via kubectl (example below).
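#
# Example post-bootstrap check (assumes the chart names the DaemonSet
# after the release, i.e. `alloy`):
#   kubectl -n alloy rollout status daemonset/alloy --timeout=10m
#   kubectl -n alloy get pods -o wide   # expect one pod per node, CP included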
---
apiVersion: v1
kind: Namespace
metadata:
  name: alloy
  labels:
    catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-alloy
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-alloy
  namespace: flux-system
spec:
  interval: 15m
  timeout: 15m
  releaseName: alloy
  targetNamespace: alloy
  dependsOn:
    - name: bp-opentelemetry
  chart:
    spec:
      chart: bp-alloy
      version: 1.0.1
      sourceRef:
        kind: HelmRepository
        name: bp-alloy
        namespace: flux-system
  install:
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    disableWait: true
    remediation:
      retries: 3