Before this fix, the cloud-init template passed --disable=local-storage
to the k3s installer, with the design intent that Crossplane would
install hcloud-csi as a day-2 operation and register a StorageClass
after bp-crossplane reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
stalls with its PVCs Pending on a StorageClass that would only exist
after bp-crossplane finished installing, yet those HelmReleases ARE in
the bootstrap-kit Kustomization that must converge before the day-2
path runs. Verified live on omantel.omani.works:
data-keycloak-postgresql-0 and spire-data-spire-server-0 were both
stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, and `kubectl get sc` returned
nothing.
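For context, the pre-fix template rendered the installer invocation in
roughly this shape (a sketch; everything around --disable=local-storage
is illustrative, only that flag is the point):

  # pre-fix shape: built-in local-path-provisioner never deployed
  curl -sfL https://get.k3s.io | \
    INSTALL_K3S_EXEC="server --disable=local-storage" sh -s -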
This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
built-in local-path-provisioner and registers the `local-path`
StorageClass on first boot.
2. Adds a runcmd block AFTER the /healthz wait and BEFORE the Flux
bootstrap apply (see the first sketch after this list) that:
a. waits for the local-path-provisioner pod Ready
b. patches the local-path SC with is-default-class=true
c. fails loudly if the SC is missing post-wait (safety gate so a
broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh: phase 1 is a render
assertion (a regression gate against re-introducing
--disable=local-storage, plus positive assertions that the
wait/patch/verify steps are present, plus an ordering check that the
patch precedes the Flux apply; see the second sketch after this list);
phase 2 is a kind-cluster proof that a fresh cluster has a default
StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing": symptom,
root cause, and the live-cluster recovery path (apply
local-path-storage.yaml and patch the default class) for
already-provisioned Sovereigns that hit this without reprovisioning;
the exact commands appear under the live verification below.
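A minimal sketch of the wait/patch/verify sequence in item 2 as it runs
inside runcmd (label selector, timeout, and the failure wording are
illustrative; the patch annotation is the standard Kubernetes
default-class marker):

  # (a) wait for the built-in provisioner to be Ready
  kubectl -n kube-system wait pod -l app=local-path-provisioner \
    --for=condition=Ready --timeout=300s
  # (b) mark local-path as the cluster default
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  # (c) safety gate: abort before Flux if the class is still missing
  kubectl get storageclass local-path >/dev/null 2>&1 || {
    echo "FATAL: local-path StorageClass missing after wait" >&2
    exit 1
  }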
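And a sketch of the phase 1 render assertions in item 3 ($RENDERED is a
placeholder for the rendered cloud-init; the grep markers are
assumptions about the template text):

  # regression gate: the flag must never come back
  if grep -q -- '--disable=local-storage' "$RENDERED"; then
    echo "FAIL: --disable=local-storage re-introduced" >&2; exit 1
  fi
  # positive assertions: wait/patch/verify steps present
  grep -q 'local-path-provisioner' "$RENDERED"
  grep -q 'is-default-class' "$RENDERED"
  # ordering: the default-class patch must precede the Flux apply
  patch_at=$(grep -n 'is-default-class' "$RENDERED" | head -1 | cut -d: -f1)
  flux_at=$(grep -n 'flux' "$RENDERED" | head -1 | cut -d: -f1)
  [ "$patch_at" -lt "$flux_at" ] || {
    echo "FAIL: StorageClass patch ordered after Flux apply" >&2; exit 1
  }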
Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape: the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.
Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):
Before:
  NAMESPACE      NAME                         STATUS    AGE
  keycloak       data-keycloak-postgresql-0   Pending   10m
  spire-system   spire-data-spire-server-0    Pending   10m
No StorageClass.
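Recovery commands (a sketch of the runbook path; the manifest ref is an
assumption, pin to a released upstream version in practice):

  # deploy the upstream provisioner on the live cluster
  kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
  # make it the default class so the Pending PVCs bind
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'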
After applying the recovery commands above:
  NAME                   PROVISIONER             ...   AGE
  local-path (default)   rancher.io/local-path   ...   34s

  NAMESPACE      NAME                         STATUS   STORAGECLASS
  keycloak       data-keycloak-postgresql-0   Bound    local-path
  spire-system   spire-data-spire-server-0    Bound    local-path
Gates:
- tofu validate: Success! The configuration is valid.
- tests/integration/storageclass.sh: PASS (phase 1 render assertions;
phase 2: a fresh kind cluster's default StorageClass binds the test
PVC).
- Regression sanity: re-injecting --disable=local-storage causes
phase 1 to FAIL with the documented error message (verified).
Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between the /healthz wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>