fix(cloud-init): tolerate Crossplane Provider apply failure + retry in background (#745)

Live observation on otech88 (DID b2c528023b50ec45, 2026-05-04
11:40:42Z): the new Sovereign's flux-system reaches Ready (GitRepository
artifact stored, all 6 Flux deployments Available) but no Kustomization
CRs appear — kustomize-controller has nothing to reconcile and
hr=True=0/0 forever.

The cloud-init runcmd applies in this order:
  1. cloud-credentials-secret.yaml
  2. crossplane-provider-hcloud.yaml — `pkg.crossplane.io/v1 Provider`
     CRD doesn't exist yet (bp-crossplane is installed by Flux below),
     so this apply errors with "no matches for kind Provider in version
     pkg.crossplane.io/v1"
  3. flux-bootstrap.yaml — should apply 1× GitRepository + 4×
     Kustomization
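
The ordering problem in step 2 could be confirmed from any host with
cluster access by checking whether the Provider CRD is registered. A
minimal sketch, assuming the CRD carries the usual name
`providers.pkg.crossplane.io`; `provider_crd_present` and the `KUBECTL`
override are hypothetical and not part of this patch:

```shell
# Hypothetical helper, not part of this patch.
# KUBECTL may be overridden (e.g. with a stub); it defaults to kubectl.
provider_crd_present() {
  # Succeeds only if the API server knows the Provider CRD.
  "${KUBECTL:-kubectl}" --kubeconfig=/etc/rancher/k3s/k3s.yaml \
    get crd providers.pkg.crossplane.io >/dev/null 2>&1
}
```

A non-zero return before bp-crossplane has reconciled would match the
"no matches for kind Provider" error described above.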

Empirically, only the GitRepository lands; the four Kustomization
documents in the same multi-doc YAML are never created. The exact
failure mechanism can only be confirmed on-host: cloud-init runcmd
output is at /var/log/cloud-init-output.log on the Sovereign, which is
out of reach under the "no SSH" rule. The symptom is nonetheless
consistent across otech87 and otech88 reprovisions on the new
cost-optimised SKUs.

This patch is a belt-and-braces hardening:

1. Tolerate the Crossplane Provider apply's failure (`|| true`) so
   its non-zero exit cannot propagate into cloud-init's exit status
   or into whichever downstream step is failing.

2. Add a background retry for the Crossplane Provider CR. Retries
   `kubectl apply` every 30s for up to 30m; once the Provider CRD
   appears (i.e. bp-crossplane reconciled by Flux), the apply succeeds
   and the loop exits. Detached via `&` so cloud-init runcmd
   completes without waiting for Crossplane to be Ready.
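
The retry in item 2 follows a generic "retry until deadline" pattern.
A readable sketch of that pattern, not the patch's actual one-liner:
`retry_until` is a hypothetical helper, and it uses `date +%s` rather
than bash's `SECONDS` so that it also runs under plain `sh`:

```shell
# Hypothetical helper illustrating the retry pattern; not the patch's code.
# Usage: retry_until <timeout_seconds> <interval_seconds> <command> [args...]
retry_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  # Wall-clock deadline in epoch seconds.
  deadline=$(( $(date +%s) + timeout_s ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # Stop as soon as the command succeeds.
    "$@" >/dev/null 2>&1 && return 0
    sleep "$interval_s"
  done
  # Deadline passed without a successful run.
  return 1
}
```

Under these assumptions, the patch's behaviour would correspond to
`retry_until 1800 30 kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f /var/lib/catalyst/crossplane-provider-hcloud.yaml &`,
with the trailing `&` detaching it from the runcmd sequence.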

The intent is to remove any chance the Provider apply blocks Flux
bootstrap. If Kustomizations still don't appear after this fix, the
root cause is elsewhere and a follow-up patch will land.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-04 15:50:55 +04:00 committed by GitHub
parent 9ee3b2e911
commit 468c3badf8

@@ -1274,7 +1274,12 @@ runcmd:
# → Crossplane handover seam. Tofu provisions Phase 0 exactly once;
# everything else flows through XRC writes against this Provider.
- 'kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f /var/lib/catalyst/cloud-credentials-secret.yaml'
- 'kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f /var/lib/catalyst/crossplane-provider-hcloud.yaml'
# IMPORTANT: the Provider CRD (`pkg.crossplane.io/v1`) is registered by
# bp-crossplane via Flux AFTER flux-bootstrap.yaml is applied below.
# Apply with `|| true` so the failure here doesn't propagate into
# cloud-init's exit status. The Provider CR is re-applied below in a
# background retry once Crossplane core CRDs land.
- 'kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f /var/lib/catalyst/crossplane-provider-hcloud.yaml || true'
# Apply the Flux bootstrap GitRepository + Kustomization. From here, Flux
# owns the cluster: pulls clusters/_template/ (with $${SOVEREIGN_FQDN}
@@ -1282,6 +1287,13 @@ runcmd:
# via bp-cilium, cert-manager via bp-cert-manager, etc., then bp-catalyst-platform.
- 'kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f /var/lib/catalyst/flux-bootstrap.yaml'
# Background retry of the Crossplane Provider CR apply. Polls every
# 30s up to 30m for the Crossplane Provider CRD to be registered
# (i.e. bp-crossplane reconciled by Flux), then `kubectl apply`
# succeeds and the loop exits. Detached via `&` so cloud-init runcmd
# completes without waiting for Crossplane to be Ready.
- 'nohup bash -c "deadline=\$((SECONDS+1800)); while [ \$SECONDS -lt \$deadline ]; do kubectl --kubeconfig=/etc/rancher/k3s/k3s.yaml apply -f /var/lib/catalyst/crossplane-provider-hcloud.yaml >/dev/null 2>&1 && break; sleep 30; done" >/dev/null 2>&1 &'
# Marker for the catalyst-api provisioner to detect cloud-init is done.
- mkdir -p /var/lib/catalyst
- touch /var/lib/catalyst/cloud-init-complete