fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15): the
cilium-envoy-tls-restart Job was stuck Running for 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants only `get` + `patch`, but `kubectl rollout status`
(which the Job runs after `rollout restart`) does not just GET the
object: internally it uses client-go's informer-based watcher, which
must LIST and then WATCH the resource. Without those verbs the informer
never syncs and `rollout status` hangs until activeDeadlineSeconds
(900s). The Job stalled on the cilium-operator rollout check, so it
never restarted cilium-envoy and console.<fqdn> never served.
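
The failing request can probably be reproduced outside the Job. A
minimal sketch, assuming the caller is allowed to impersonate the
ServiceAccount and that `rollout status` selects its target with a
`metadata.name` field selector (which the object name in the error
above suggests); `repro_list` is a hypothetical helper name, not part
of the chart:

```shell
# Issue the same kind of name-scoped LIST that the rollout-status
# informer performs, impersonating the Job's ServiceAccount.
SA="system:serviceaccount:kube-system:cilium-envoy-tls-restart"

# repro_list KIND NAME  (hypothetical helper)
# Succeeds only if the SA may `list` KIND restricted to NAME.
repro_list() {
  kubectl get "$1" -n kube-system \
    --field-selector "metadata.name=$2" \
    --as="$SA" -o name
}

# Example: repro_list deployments.apps cilium-operator
# With only get+patch granted, this is expected to fail as Forbidden.
```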

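Fix: add `list` + `watch` to both rules (cilium-envoy DaemonSet +
cilium-operator Deployment). Both rules stay scoped by resourceNames,
so the SA still can't enumerate or watch other workloads: a name-scoped
list/watch is only authorized when the request selects that object via
a metadata.name field selector.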
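
One way to check the resulting permissions is `kubectl auth can-i`
with impersonation. This is a sketch under assumptions: the caller
needs impersonation rights, and `check_rollout_rbac` is a hypothetical
helper name. The ServiceAccount, namespace, and workload names are
taken from the error and rules in this commit.

```shell
# Confirm the Job's ServiceAccount holds every verb that
# `rollout restart` + `rollout status` need on both workloads.
SA="system:serviceaccount:kube-system:cilium-envoy-tls-restart"
NS="kube-system"

# check_rollout_rbac (hypothetical helper)
# Prints one line per verb/target pair; returns non-zero if any is denied.
check_rollout_rbac() {
  rc=0
  for target in deployments.apps/cilium-operator daemonsets.apps/cilium-envoy; do
    for verb in get patch list watch; do
      # `kubectl auth can-i VERB TYPE/NAME` prints "yes" or "no".
      answer=$(kubectl auth can-i "$verb" "$target" -n "$NS" --as="$SA" 2>/dev/null)
      echo "$verb $target: ${answer:-error}"
      [ "$answer" = "yes" ] || rc=1
    done
  done
  return "$rc"
}
```

Calling `check_rollout_rbac` against a cluster with the updated Role
should report all eight verb/target pairs as allowed; before this
change the `list` and `watch` rows would be expected to be denied.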
hatiyildiz 2026-05-15 19:02:28 +02:00
parent f16d238ca9
commit afe522b0cb


@@ -96,7 +96,7 @@ rules:
   - apiGroups: ["apps"]
     resources: ["daemonsets"]
     resourceNames: ["cilium-envoy"]
-    verbs: ["get", "patch"]
+    verbs: ["get", "patch", "list", "watch"]
     # ALSO patch the cilium-operator Deployment. Reason: on a fresh
     # Sovereign, cilium-operator's first CEC reconciliation produces a
     # CiliumEnvoyConfig WITHOUT the hostNetwork bind `additionalAddresses.
@@ -113,10 +113,16 @@ rules:
   - apiGroups: ["apps"]
     resources: ["deployments"]
     resourceNames: ["cilium-operator"]
-    verbs: ["get", "patch"]
-    # Read DaemonSet rollout status so the Job can wait for the new pods
-    # to come up before exiting. `kubectl rollout status` issues GET on
-    # the DaemonSet resource (no extra verbs needed).
+    verbs: ["get", "patch", "list", "watch"]
+    # Read rollout status so the Job can wait for new pods to come up
+    # before exiting. `kubectl rollout status` does NOT just GET — it
+    # uses client-go informerwatcher to LIST+WATCH the
+    # Deployment/DaemonSet resource. Without list+watch verbs the
+    # informer fails with "forbidden: cannot list resource ..." and the
+    # Job stalls at the rollout-status check until activeDeadlineSeconds.
+    # Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):
+    # tls-restart Job stuck Running 10m+ on the cilium-operator rollout
+    # check, never restarted cilium-envoy, console.<fqdn> never served.
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding