openova/tests/integration
hatiyildiz 4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Pre-fix, the cloud-init template passed --disable=local-storage to the k3s
installer with the design intent that Crossplane would install hcloud-csi
day-2 and register a StorageClass after bp-crossplane reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
blocks Pending on a StorageClass that would only exist after bp-crossplane
finished installing — but they ARE in the bootstrap-kit Kustomization
that needs to converge before the day-2 path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.
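The runcmd addition in step 2 might look like the following cloud-init fragment. This is a hedged sketch: the pod label selector, timeout, and surrounding steps are illustrative stand-ins, not the exact template contents.

```yaml
# Sketch only -- label, timeout, and neighbouring commands are illustrative.
runcmd:
  # ... /healthz wait runs earlier ...
  # (a) wait for the built-in local-path-provisioner pod to become Ready
  - kubectl -n kube-system wait pod -l app=local-path-provisioner --for=condition=Ready --timeout=300s
  # (b) mark the local-path StorageClass as the cluster default
  - kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  # (c) safety gate: fail loudly instead of falling through to Flux
  - kubectl get storageclass local-path || { echo 'FATAL: local-path StorageClass missing after wait' >&2; exit 1; }
  # ... Flux bootstrap apply runs after ...
```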

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path
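The recovery path amounts to applying the upstream local-path-storage.yaml and ensuring the StorageClass carries the default-class annotation. The resulting object looks roughly like this sketch; the field values mirror upstream local-path-provisioner defaults, not necessarily the exact manifest used on the cluster:

```yaml
# Sketch of the post-recovery StorageClass; values follow upstream defaults.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```

With WaitForFirstConsumer binding, the Pending PVCs bind as soon as their consuming pods are scheduled, which matches the Bound output above.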

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).
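The phase-1 render assertions can be sketched as a small self-contained script. Here a fake rendered template stands in for the real tofu/cloud-init render so the sketch runs standalone; the grep patterns are illustrative, not necessarily the exact ones in tests/integration/storageclass.sh.

```shell
#!/bin/sh
# Self-contained sketch of the phase-1 render assertions.
# In the real test, RENDERED comes from the rendered cloud-init template;
# here a fake one is generated so the sketch is runnable on its own.
set -eu

RENDERED=$(mktemp)
trap 'rm -f "$RENDERED"' EXIT
cat > "$RENDERED" <<'EOF'
runcmd:
  - curl -sf https://127.0.0.1:6443/healthz
  - kubectl -n kube-system wait pod -l app=local-path-provisioner --for=condition=Ready --timeout=300s
  - kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  - kubectl apply -f flux-bootstrap/
EOF

# Regression gate: the k3s installer flag must not come back.
if grep -q -- '--disable=local-storage' "$RENDERED"; then
  echo 'FAIL: --disable=local-storage re-introduced' >&2
  exit 1
fi

# Positive assertions: wait and patch steps are present.
grep -q 'local-path-provisioner' "$RENDERED"
grep -q 'is-default-class' "$RENDERED"

# Ordering: the default-class patch must precede the Flux bootstrap apply.
patch_line=$(grep -n 'is-default-class' "$RENDERED" | cut -d: -f1 | head -1)
flux_line=$(grep -n 'flux' "$RENDERED" | cut -d: -f1 | head -1)
[ "$patch_line" -lt "$flux_line" ]

echo PASS
```

Re-injecting --disable=local-storage into the rendered template trips the first gate, which is the regression sanity check listed above.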

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00