fix(helmwatch): skip TLS verify for Sovereign k3s self-signed CAs

mothership catalyst-api's helmwatch.Bridge feeds Sovereign HR/Job
events to openova-flow-server so the canvas at
/sovereign/provision/<id>/jobs renders Phase-1+ progress. The bridge
parses the per-Sovereign kubeconfig stored at handover into a
rest.Config and runs client-go reflectors against it.

Sovereign clusters use self-signed k3s CAs. client-go's reflector
defaults to the system CA bundle, which doesn't trust those CAs. On
omani.homes prov 6cbdd2c046d26848 (2026-05-15 09:12) every watch
against the 3 Sovereign apiservers failed:

  E0515 09:12:22 reflector.go:227] "Failed to watch" err="failed
  to list /v1, Resource=pods: Get \"https://49.12.210.78:6443/api/v1/
  pods?limit=500&resourceVersion=0\": tls: failed to verify
  certificate: x509: certificate signed by unknown authority"

(same x509 error on 178.105.134.94:6443 and 204.168.212.113:6443).
Result: 0 HR events + 0 Job events flow to mothership, canvas shows
only the 5 Phase-0 tofu jobs as Pending despite the Sovereign being
healthy and console.<fqdn> returning 200 with a publicly-trusted LE
PROD cert. Operator is blind to Phase-1+ on every multi-region prov
to date.

Fix: in restConfigFromKubeconfig (the canonical seam shared by every
helmwatch client constructor and reachability probe), set
TLSClientConfig.Insecure = true and clear CAData/CAFile. The
trade-off is sound:

  - mothership ONLY uses these configs to list/watch RBAC-scoped
    resources for status reporting (Bridge feeds the canvas; not a
    write path)
  - kubeconfig bearer token / client-cert is still verified
    server-side
  - Sovereign API server is firewall-scoped behind Hetzner LB

Per-cluster CA propagation through the handover kubeconfig is the
target-state fix (CA rotation, multi-region CA bundle); filed as
follow-up. This change unblocks Phase-1+ canvas visibility now.
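
For reference, the target state needs no new mechanism: kubeconfig
already has a standard field for this, and once handover embeds it,
clientcmd populates rest.Config's CAData automatically so the skip
can be dropped. Hypothetical fragment (cluster name and the base64
value are illustrative placeholders):

```yaml
apiVersion: v1
kind: Config
clusters:
- name: sovereign
  cluster:
    server: https://49.12.210.78:6443
    # per-cluster k3s CA captured at handover, base64-encoded
    # (placeholder value, not real data)
    certificate-authority-data: LS0tLS1CRUdJTi...
```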

No feature flag — adding one to "should I trust this self-signed
cert" would just be deferred avoidance, and the answer is always
"yes, mothership trusts the Sovereign kubeconfig it stored at
handover".

Tests: existing internal/jobs + internal/helmwatch suites pass
(go build ./..., go vet ./..., go test ./internal/jobs/...
./internal/helmwatch/... all green).
hatiyildiz 2026-05-15 11:17:43 +02:00
parent 4f41b11c0a
commit 51ca8006b3


@@ -40,6 +40,20 @@ func restConfigFromKubeconfig(kubeconfigYAML string) (*rest.Config, error) {
 	// Stamp the per-request timeout. clientcmd never sets one by
 	// default, so a hung handshake stays hung forever (issue #923).
 	cfg.Timeout = DefaultRESTConfigTimeout
+	// Sovereign clusters use self-signed k3s CAs; we skip verify
+	// because mothership only watches via authenticated kubeconfig
+	// (bearer token / client-cert still verified server-side) and
+	// the API server is firewall-scoped behind Hetzner LB. Without
+	// this every list/watch fails x509 "signed by unknown authority"
+	// and the canvas at /sovereign/provision/<id>/jobs shows only
+	// Phase-0 tofu jobs — Phase-1+ HR events never flow to mothership.
+	// Caught on omani.homes 2026-05-15 09:12 (49.12.210.78,
+	// 178.105.134.94, 204.168.212.113); has been broken on every
+	// multi-region prov to date. Follow-up: ship per-cluster CA in
+	// the handover kubeconfig and drop this skip.
+	cfg.TLSClientConfig.Insecure = true //nolint:gosec // self-signed k3s CA; auth via kubeconfig bearer/client-cert + firewall-scoped LB. See comment above.
+	cfg.TLSClientConfig.CAData = nil
+	cfg.TLSClientConfig.CAFile = ""
 	return cfg, nil
 }