Before this fix, the cloud-init template passed --disable=local-storage
to the k3s installer, with the design intent that Crossplane would
install hcloud-csi as a day-2 operation and register a StorageClass
after bp-crossplane reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
stalls with its PVCs Pending on a StorageClass that would only exist
after bp-crossplane finished installing, yet those HelmReleases ARE in
the bootstrap-kit Kustomization that must converge before the day-2
path runs. Verified live on omantel.omani.works:
data-keycloak-postgresql-0 and spire-data-spire-server-0 were both
stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, and `kubectl get sc` returned
nothing.
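For context, the pre-fix template rendered the installer invocation in
roughly this shape (a sketch; everything around --disable=local-storage
is illustrative, only that flag is the point):

  # pre-fix shape: built-in local-path-provisioner never deployed
  curl -sfL https://get.k3s.io | \
    INSTALL_K3S_EXEC="server --disable=local-storage" sh -s -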
This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
built-in local-path-provisioner and registers the `local-path`
StorageClass on first boot.
2. Adds a runcmd block AFTER the /healthz wait and BEFORE the Flux
bootstrap apply (see the first sketch after this list) that:
a. waits for the local-path-provisioner pod Ready
b. patches the local-path SC with is-default-class=true
c. fails loudly if the SC is missing post-wait (safety gate so a
broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh: phase 1 is a render
assertion (a regression gate against re-introducing
--disable=local-storage, plus positive assertions that the
wait/patch/verify steps are present, plus an ordering check that the
patch precedes the Flux apply; see the second sketch after this list);
phase 2 is a kind-cluster proof that a fresh cluster has a default
StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing": symptom,
root cause, and the live-cluster recovery path (apply
local-path-storage.yaml and patch the default class) for
already-provisioned Sovereigns that hit this without reprovisioning;
the exact commands appear under the live verification below.
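A minimal sketch of the wait/patch/verify sequence in item 2 as it runs
inside runcmd (label selector, timeout, and the failure wording are
illustrative; the patch annotation is the standard Kubernetes
default-class marker):

  # (a) wait for the built-in provisioner to be Ready
  kubectl -n kube-system wait pod -l app=local-path-provisioner \
    --for=condition=Ready --timeout=300s
  # (b) mark local-path as the cluster default
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  # (c) safety gate: abort before Flux if the class is still missing
  kubectl get storageclass local-path >/dev/null 2>&1 || {
    echo "FATAL: local-path StorageClass missing after wait" >&2
    exit 1
  }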
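And a sketch of the phase 1 render assertions in item 3 ($RENDERED is a
placeholder for the rendered cloud-init; the grep markers are
assumptions about the template text):

  # regression gate: the flag must never come back
  if grep -q -- '--disable=local-storage' "$RENDERED"; then
    echo "FAIL: --disable=local-storage re-introduced" >&2; exit 1
  fi
  # positive assertions: wait/patch/verify steps present
  grep -q 'local-path-provisioner' "$RENDERED"
  grep -q 'is-default-class' "$RENDERED"
  # ordering: the default-class patch must precede the Flux apply
  patch_at=$(grep -n 'is-default-class' "$RENDERED" | head -1 | cut -d: -f1)
  flux_at=$(grep -n 'flux' "$RENDERED" | head -1 | cut -d: -f1)
  [ "$patch_at" -lt "$flux_at" ] || {
    echo "FAIL: StorageClass patch ordered after Flux apply" >&2; exit 1
  }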
Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape: the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.
Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):
Before:
  NAMESPACE      NAME                         STATUS    AGE
  keycloak       data-keycloak-postgresql-0   Pending   10m
  spire-system   spire-data-spire-server-0    Pending   10m
No StorageClass.
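Recovery commands (a sketch of the runbook path; the manifest ref is an
assumption, pin to a released upstream version in practice):

  # deploy the upstream provisioner on the live cluster
  kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
  # make it the default class so the Pending PVCs bind
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'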
After applying the recovery commands above:
  NAME                   PROVISIONER             ...   AGE
  local-path (default)   rancher.io/local-path   ...   34s

  NAMESPACE      NAME                         STATUS   STORAGECLASS
  keycloak       data-keycloak-postgresql-0   Bound    local-path
  spire-system   spire-data-spire-server-0    Bound    local-path
Gates:
- tofu validate: Success! The configuration is valid.
- tests/integration/storageclass.sh: PASS (phase 1 render assertions;
phase 2: a fresh kind cluster's default StorageClass binds the test
PVC).
- Regression sanity: re-injecting --disable=local-storage causes
phase 1 to FAIL with the documented error message (verified).
Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between the /healthz wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>