fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15): the
cilium-envoy-tls-restart Job was stuck Running for 10m+ with:

    W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants only `get` + `patch`, but `kubectl rollout status`
(which the Job runs after `rollout restart`) does not just GET the
resource: internally it uses client-go's informer-based watcher to
LIST+WATCH it. Without those verbs the informer fails and
`rollout status` hangs until activeDeadlineSeconds (900s). As a result
the Job never restarts cilium-envoy and console.<fqdn> never serves.

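The two steps described above can be sketched as a small shell helper. This is an illustration, not the Job's actual script: the function name `restart_and_wait` and the 120s default timeout are assumptions. Passing `--timeout` to `kubectl rollout status` bounds the wait, so a missing permission shows up as a quick failure instead of a hang until the Job deadline.

```shell
# Hypothetical helper mirroring the restart-then-wait sequence.
# Usage: restart_and_wait <namespace> <kind/name> [timeout]
restart_and_wait() {
  ns=$1
  target=$2
  timeout=${3:-120s}   # assumed default, not taken from the Job

  # Step 1: trigger the restart. This needs get + patch on the resource.
  kubectl rollout restart "$target" --namespace "$ns" || return 1

  # Step 2: wait for the rollout. This is the call that also needs
  # list + watch, because it watches the resource rather than polling GET.
  kubectl rollout status "$target" --namespace "$ns" --timeout="$timeout"
}
```

For example, `restart_and_wait kube-system deployment/cilium-operator` followed by `restart_and_wait kube-system daemonset/cilium-envoy` would cover both workloads named in this commit.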
Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceName, so the SA still
can't enumerate or watch other workloads.
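One way to confirm the fix on a live cluster is to ask the API server whether the ServiceAccount holds each verb on the named resource. The sketch below is hedged: the helper name `check_rollout_rbac` is hypothetical, and it assumes the caller is allowed to impersonate the ServiceAccount via `--as`. It relies only on `kubectl auth can-i VERB TYPE/NAME`, which exits non-zero when the answer is "no".

```shell
# Hypothetical helper: report which of the verbs needed by
# `rollout restart` + `rollout status` a ServiceAccount holds on one
# named workload.
# Usage: check_rollout_rbac <namespace> <serviceaccount> <resource/name>
check_rollout_rbac() {
  ns=$1
  sa=$2
  target=$3
  missing=0
  for verb in get patch list watch; do
    # Impersonate the ServiceAccount; exit status 0 means "allowed".
    if kubectl auth can-i "$verb" "$target" \
        --namespace "$ns" \
        --as "system:serviceaccount:${ns}:${sa}" >/dev/null 2>&1; then
      echo "ok      $verb $target"
    else
      echo "MISSING $verb $target"
      missing=1
    fi
  done
  # Non-zero when at least one verb is not granted.
  return "$missing"
}
```

Running `check_rollout_rbac kube-system cilium-envoy-tls-restart deployments.apps/cilium-operator` before this change would be expected to flag `list` and `watch` as missing. Because the rules are scoped by resourceName, the same check against a different Deployment name should still report every verb as missing after the change.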

parent f16d238ca9
commit afe522b0cb

@@ -96,7 +96,7 @@ rules:
   - apiGroups: ["apps"]
     resources: ["daemonsets"]
     resourceNames: ["cilium-envoy"]
-    verbs: ["get", "patch"]
+    verbs: ["get", "patch", "list", "watch"]
     # ALSO patch the cilium-operator Deployment. Reason: on a fresh
     # Sovereign, cilium-operator's first CEC reconciliation produces a
     # CiliumEnvoyConfig WITHOUT the hostNetwork bind `additionalAddresses.
@@ -113,10 +113,16 @@ rules:
   - apiGroups: ["apps"]
     resources: ["deployments"]
     resourceNames: ["cilium-operator"]
-    verbs: ["get", "patch"]
-    # Read DaemonSet rollout status so the Job can wait for the new pods
-    # to come up before exiting. `kubectl rollout status` issues GET on
-    # the DaemonSet resource (no extra verbs needed).
+    verbs: ["get", "patch", "list", "watch"]
+    # Read rollout status so the Job can wait for new pods to come up
+    # before exiting. `kubectl rollout status` does NOT just GET — it
+    # uses client-go informerwatcher to LIST+WATCH the
+    # Deployment/DaemonSet resource. Without list+watch verbs the
+    # informer fails with "forbidden: cannot list resource ..." and the
+    # Job stalls at the rollout-status check until activeDeadlineSeconds.
+    # Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):
+    # tls-restart Job stuck Running 10m+ on the cilium-operator rollout
+    # check, never restarted cilium-envoy, console.<fqdn> never served.
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding