PR #755 added the `node-role.kubernetes.io/control-plane=true:NoSchedule` taint to the CP node when worker_count > 0. Two bootstrap-kit charts have pods that MUST land on the CP and lacked the matching toleration (the toleration itself is sketched below):

bp-trivy
• node-collector: Pod pinned to each node via nodeSelector `kubernetes.io/hostname=<node>`. The CP-bound collector reads /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler and /var/lib/kube-controller-manager via hostPath — these only exist on the CP. Without the toleration the collector sat Pending forever on otech93 (live evidence in #769).
• scanJobTolerations: per-workload scan jobs the operator spawns may target pods of CP-only system DaemonSets (kube-system kube-proxy in non-Cilium mode, etc.). Adding the toleration here ensures reports are produced for those workloads too.

bp-alloy
• DaemonSet: one pod MUST land on every node including the CP, so CP-local kubelet logs + node metrics flow into the LGTM stack. Without the toleration Alloy ran on 3/4 nodes (Ready=N-1) on otech93 and CP telemetry was silently lost.

Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP is untainted in solo mode per PR #755's conditional.

Versions bumped:
• bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins)
• bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins)

Out of scope (audited, no change needed):
• bp-cilium — upstream defaults already tolerate everything (verified on otech93: cilium DaemonSet at 4/4 nodes).
• bp-falco — values.yaml already declares NoSchedule + NoExecute Exists tolerations (4/4 on otech93).
• cnpg/harbor — no kubelet-cert-renew Jobs in the current charts.

Verified:
• `helm template` on both charts renders the expected toleration (alloy: pod spec; trivy: trivy-operator-config ConfigMap consumed by the operator at scan-job spawn time; a sketch of the rendered entry follows below).
• `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
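For reference, the toleration both charts now carry. This is a minimal sketch: the taint key/value/effect come straight from PR #755, but the values.yaml paths each wrapper chart uses to expose it are not part of this description and are assumptions; an `operator: Exists` toleration on the same key would match the taint equally well.

```yaml
# Matches the taint added by PR #755:
#   node-role.kubernetes.io/control-plane=true:NoSchedule
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Equal
    value: "true"
    effect: NoSchedule
```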
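For bp-trivy the toleration is not rendered onto a pod spec; it lands in the operator's ConfigMap and is applied by trivy-operator whenever it spawns a scan-job or node-collector pod. Roughly the following, as a sketch only: the `scanJob.tolerations` key name follows upstream trivy-operator conventions and should be confirmed against the actual `helm template` output.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-operator-config
data:
  scanJob.tolerations: '[{"key":"node-role.kubernetes.io/control-plane","operator":"Equal","value":"true","effect":"NoSchedule"}]'
```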
# bp-alloy — Catalyst Blueprint #21 (W2.K2 Observability batch).
#
# Grafana Alloy — the unified telemetry collector for the LGTM stack.
# Runs as a DaemonSet on every node; tails container logs, scrapes
# Prometheus metrics, and forwards traces. Co-resident with bp-opentelemetry
# (slot 20) — Alloy handles host-level collection (kubelet, journald,
# node_exporter) while OTel handles app-level OTLP.
#
# Wrapper chart: platform/alloy/chart/
# Reconciled by: Flux on the new Sovereign's k3s control plane, AFTER
#                bp-opentelemetry is Ready (Alloy's default config
#                forwards OTLP to the Collector's gRPC endpoint).
#
# dependsOn:
#   - bp-opentelemetry (slot 20) — Alloy forwards OTLP to the Collector.
#     Without the Collector Service in place, Alloy retries forever on a
#     non-existent upstream.
#
# disableWait: Alloy is a DaemonSet — Helm `--wait` would block on
#   every node's Alloy Pod becoming Ready. On larger Sovereigns this can
#   legitimately take >5min during a cold-start image pull; the HelmRelease
#   reports Ready when manifests apply, runtime convergence observed via
#   kubectl.
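#
#   A minimal post-reconcile check (sketch only; the DaemonSet name `alloy`
#   in namespace `alloy` is assumed from releaseName/targetNamespace below,
#   confirm with `kubectl -n alloy get daemonset`):
#     kubectl -n alloy rollout status daemonset/alloy
#     kubectl -n alloy get pods -o wide   # expect one Alloy pod per node, incl. the CP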
---
apiVersion: v1
kind: Namespace
metadata:
  name: alloy
  labels:
    catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-alloy
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-alloy
  namespace: flux-system
spec:
  interval: 15m
  timeout: 15m
  releaseName: alloy
  targetNamespace: alloy
  dependsOn:
    - name: bp-opentelemetry
  chart:
    spec:
      chart: bp-alloy
      version: 1.0.1
      sourceRef:
        kind: HelmRepository
        name: bp-alloy
        namespace: flux-system
  install:
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    disableWait: true
    remediation:
      retries: 3