fix(helmwatch): skip TLS verify for Sovereign k3s self-signed CAs
mothership catalyst-api's helmwatch.Bridge feeds Sovereign HR/Job events to openova-flow-server so the canvas at /sovereign/provision/&lt;id&gt;/jobs renders Phase-1+ progress. The bridge parses the per-Sovereign kubeconfig stored at handover into a rest.Config and runs client-go reflectors against it.

Sovereign clusters use self-signed k3s CAs. client-go's reflector defaults to the system CA bundle, which doesn't trust those CAs. On omani.homes prov 6cbdd2c046d26848 (2026-05-15 09:12) every watch against the 3 Sovereign apiservers failed:

    E0515 09:12:22 reflector.go:227] "Failed to watch" err="failed to list /v1, Resource=pods: Get \"https://49.12.210.78:6443/api/v1/pods?limit=500&resourceVersion=0\": tls: failed to verify certificate: x509: certificate signed by unknown authority"

(same x509 error on 178.105.134.94:6443 and 204.168.212.113:6443). Result: 0 HR events and 0 Job events flow to mothership, and the canvas shows only the 5 Phase-0 tofu jobs as Pending despite the Sovereign being healthy and console.&lt;fqdn&gt; returning 200 with a publicly-trusted LE PROD cert. The operator is blind to Phase-1+ on every multi-region prov to date.

Fix: in restConfigFromKubeconfig (the canonical seam every helmwatch client constructor and reachability probe shares), set TLSClientConfig.Insecure = true and clear CAData/CAFile. The trade-off is sound:

- mothership ONLY uses these configs to list/watch RBAC-scoped resources for status reporting (Bridge feeds the canvas; not a write path)
- the kubeconfig bearer token / client-cert is still verified server-side
- the Sovereign API server is firewall-scoped behind the Hetzner LB

Per-cluster CA propagation through the handover kubeconfig is the target-state fix (CA rotation, multi-region CA bundle); filed as a follow-up. This change unblocks Phase-1+ canvas visibility now.
No feature flag — adding one to "should I trust this self-signed cert" would just be deferred avoidance, and the answer is always "yes, mothership trusts the Sovereign kubeconfig it stored at handover". Tests: existing internal/jobs + internal/helmwatch suites pass (go build ./..., go vet ./..., go test ./internal/jobs/... ./internal/helmwatch/... all green).
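The TLS mutation described above is small but order-sensitive: client-go rejects a config that sets Insecure while CA material is still present, so the CA fields must be cleared alongside the flag. A minimal self-contained sketch of that logic follows; it uses stand-in structs for client-go's rest.Config TLS fields (the real function mutates k8s.io/client-go/rest.Config, and the stand-in names mirror its fields only for illustration):

```go
package main

import "fmt"

// TLSClientConfig is a stand-in for the TLS fields on
// k8s.io/client-go/rest.Config used by restConfigFromKubeconfig.
type TLSClientConfig struct {
	Insecure bool
	CAData   []byte
	CAFile   string
}

// Config is a stand-in for rest.Config.
type Config struct {
	Host            string
	TLSClientConfig TLSClientConfig
}

// skipTLSVerify applies the fix: force Insecure and clear any CA
// material the kubeconfig carried. Clearing CAData/CAFile matters
// because client-go refuses a config that combines Insecure with
// root-CA settings when it builds the transport.
func skipTLSVerify(cfg *Config) {
	cfg.TLSClientConfig.Insecure = true
	cfg.TLSClientConfig.CAData = nil
	cfg.TLSClientConfig.CAFile = ""
}

func main() {
	cfg := &Config{
		Host: "https://49.12.210.78:6443",
		TLSClientConfig: TLSClientConfig{
			CAData: []byte("-----BEGIN CERTIFICATE-----"),
			CAFile: "/etc/rancher/k3s/ca.crt",
		},
	}
	skipTLSVerify(cfg)
	fmt.Println(cfg.TLSClientConfig.Insecure,
		len(cfg.TLSClientConfig.CAData),
		cfg.TLSClientConfig.CAFile == "")
	// prints "true 0 true"
}
```

Because every helmwatch client constructor and the reachability probe share restConfigFromKubeconfig, applying the mutation at that one seam covers all consumers without touching call sites.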
This commit is contained in:
parent
4f41b11c0a
commit
51ca8006b3
@@ -40,6 +40,20 @@ func restConfigFromKubeconfig(kubeconfigYAML string) (*rest.Config, error) {
 	// Stamp the per-request timeout. clientcmd never sets one by
 	// default, so a hung handshake stays hung forever (issue #923).
 	cfg.Timeout = DefaultRESTConfigTimeout

+	// Sovereign clusters use self-signed k3s CAs; we skip verify
+	// because mothership only watches via authenticated kubeconfig
+	// (bearer token / client-cert still verified server-side) and
+	// the API server is firewall-scoped behind Hetzner LB. Without
+	// this every list/watch fails x509 "signed by unknown authority"
+	// and the canvas at /sovereign/provision/<id>/jobs shows only
+	// Phase-0 tofu jobs — Phase-1+ HR events never flow to mothership.
+	// Caught on omani.homes 2026-05-15 09:12 (49.12.210.78,
+	// 178.105.134.94, 204.168.212.113); has been broken on every
+	// multi-region prov to date. Follow-up: ship per-cluster CA in
+	// the handover kubeconfig and drop this skip.
+	cfg.TLSClientConfig.Insecure = true //nolint:gosec // self-signed k3s CA; auth via kubeconfig bearer/client-cert + firewall-scoped LB. See comment above.
+	cfg.TLSClientConfig.CAData = nil
+	cfg.TLSClientConfig.CAFile = ""
 	return cfg, nil
 }
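For reference, the follow-up target state carries each Sovereign cluster's own CA in the handover kubeconfig, at which point the Insecure skip can be dropped and client-go verifies the apiserver cert normally. A hedged sketch of what that cluster entry would look like (the cluster name and CA value are placeholders, not the actual handover format):

    apiVersion: v1
    kind: Config
    clusters:
    - name: sovereign                      # placeholder name
      cluster:
        server: https://49.12.210.78:6443
        # base64-encoded k3s cluster CA; once present, client-go
        # trusts the self-signed apiserver cert via this bundle
        certificate-authority-data: <base64-k3s-CA>

CA rotation and a multi-region CA bundle (one entry per Sovereign apiserver) are the open questions that follow-up has to answer.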