Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is not specified" on every Sovereign (otech112 evidence). HelmRelease reports Ready=True (Helm install succeeded) but the Pod CrashLoopBackOffs invisibly behind the False-positive condition. Closes #916 — wizard let operators dispatch unbuildable topologies (otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not encode regional orderability. Hetzner rejected the worker creation 41s into `tofu apply` after Phase-0 had already created the CP + network + LB + firewall. Chart fix (issue #921): - Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the umbrella chart (base64-encoded per upstream contract). - Render `hetzner-node-config` Secret unconditionally with both keys so the upstream Deployment's secretKeyRef references resolve cleanly during `helm template` AND in the live cluster regardless of overlay state. - Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto the upstream chart's deployment. - Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps it under `flux-system/cloud-credentials.hcloud-cloud-init`; the bootstrap-kit overlay lifts that key via Flux `valuesFrom` into `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus receive the IDENTICAL bootstrap as the Phase-0 worker fleet. - Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0. - Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's blueprint-release "Run chart integration tests" step. Wizard fix (issue #916): - Add `availableRegions?: string[]` to NodeSize interface; encode cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere new) per Hetzner /v1/server_types vs POST /v1/servers gap. - Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers. - StepProvider filters SKU dropdowns by selected region; auto-swaps current SKU to recommended default when region change drops it out of orderability. - Mirror the matrix Go-side in sku_availability.go; gate `provisioner.Request.Validate()` with same predicate so a stale wizard build OR direct API caller bypassing the UI cannot dispatch otech109's failure mode. - Two-sided enforcement covers both r.Regions[] (multi-region) and the legacy singular path. Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API side. Chart smoke renders + helm template gates the env wiring at publish time. Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
149 lines
6.2 KiB
YAML
149 lines
6.2 KiB
YAML
# bp-cluster-autoscaler-hcloud — Catalyst bootstrap-kit Blueprint #50
|
||
# (Tier 5 — Scaling/Resilience). Slot 40 was already forward-declared
|
||
# for bp-llm-gateway in scripts/expected-bootstrap-deps.yaml; this
|
||
# blueprint lands at slot 50 — after the W2.K4 cohort + slot 49
|
||
# (bp-cert-manager-powerdns-webhook) — to preserve the existing
|
||
# numbering plan.
|
||
#
|
||
# Adds and removes Hetzner Cloud worker nodes on demand in response to
|
||
# `FailedScheduling` events on the Sovereign's k3s cluster. Bounded by
|
||
# the `min`/`max` node-group config the operator picked at launch.
|
||
#
|
||
# Live evidence motivating this blueprint (issue #767):
|
||
# otech92 — 2× cpx32 workers couldn't fit external-secrets-webhook
|
||
# because the bootstrap-kit's RAM aggregate (~14 GB across 35
|
||
# HelmReleases) exceeded the 2× 8 GB pool the operator chose. With
|
||
# cluster-autoscaler the Sovereign would have grown the pool to a
|
||
# third worker automatically.
|
||
#
|
||
# Wrapper chart: platform/cluster-autoscaler-hcloud/chart/ — umbrella
|
||
# over upstream kubernetes/autoscaler cluster-autoscaler chart 9.46.6
|
||
# (appVersion 1.32.0). Catalyst-curated values flow under the
|
||
# `cluster-autoscaler:` key + a vendor-agnostic
|
||
# `clusterAutoscalerHcloud.*` block that ships the namespace-local
|
||
# Hetzner-API-token Secret (`hcloud-token`).
|
||
#
|
||
# Reconciled by: Flux on the new Sovereign's k3s control plane.
|
||
#
|
||
# Hetzner-token wiring (mirrors the velero/harbor object-storage pattern
|
||
# in 19-harbor.yaml + 34-velero.yaml):
|
||
# - cloud-init writes `flux-system/cloud-credentials` Secret with the
|
||
# `hcloud-token` key (see infra/hetzner/cloudinit-control-plane.tftpl
|
||
# §"cloud-credentials-secret"). That Secret is the canonical Hetzner-
|
||
# API-token holder for every Day-2 mutation seam (Crossplane provider-
|
||
# hcloud, this autoscaler, future hcloud Floating-IP claims).
|
||
# - This HelmRelease lifts the `hcloud-token` value into the umbrella
|
||
# chart's `clusterAutoscalerHcloud.hcloudToken` value via Flux
|
||
# `valuesFrom`. The umbrella chart then synthesises a namespace-local
|
||
# `cluster-autoscaler/hcloud-token` Secret (templates/hetzner-token-
|
||
# secret.yaml) the upstream chart's `extraEnvSecrets.HCLOUD_TOKEN`
|
||
# wiring binds as the deployment's HCLOUD_TOKEN env var.
|
||
#
|
||
# dependsOn: (none) — cluster-autoscaler is independent of every other
|
||
# bootstrap-kit blueprint at install time. The cloud-credentials Secret
|
||
# is provisioned by cloud-init BEFORE Flux installs anything.
|
||
|
||
---
|
||
apiVersion: v1
|
||
kind: Namespace
|
||
metadata:
|
||
name: cluster-autoscaler
|
||
labels:
|
||
catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
|
||
---
|
||
apiVersion: source.toolkit.fluxcd.io/v1beta2
|
||
kind: HelmRepository
|
||
metadata:
|
||
name: bp-cluster-autoscaler-hcloud
|
||
namespace: flux-system
|
||
spec:
|
||
type: oci
|
||
interval: 15m
|
||
url: oci://ghcr.io/openova-io
|
||
secretRef:
|
||
name: ghcr-pull
|
||
---
|
||
apiVersion: helm.toolkit.fluxcd.io/v2
|
||
kind: HelmRelease
|
||
metadata:
|
||
name: bp-cluster-autoscaler-hcloud
|
||
namespace: flux-system
|
||
spec:
|
||
interval: 15m
|
||
releaseName: cluster-autoscaler
|
||
targetNamespace: cluster-autoscaler
|
||
chart:
|
||
spec:
|
||
chart: bp-cluster-autoscaler-hcloud
|
||
version: 1.0.0
|
||
sourceRef:
|
||
kind: HelmRepository
|
||
name: bp-cluster-autoscaler-hcloud
|
||
namespace: flux-system
|
||
# Event-driven install: cluster-autoscaler is a single Deployment +
|
||
# ServiceAccount + RBAC. Helm install completes when manifests apply;
|
||
# the binary's Hetzner-API connectivity check is a runtime concern,
|
||
# not a Helm-wait concern. disableWait keeps Flux's Ready signal
|
||
# aligned with manifest apply.
|
||
install:
|
||
disableWait: true
|
||
remediation:
|
||
retries: 3
|
||
upgrade:
|
||
disableWait: true
|
||
remediation:
|
||
retries: 3
|
||
# ── Hetzner-token + node-bootstrap wiring (issue #921) ─────────────
|
||
# Pulls keys from the canonical `flux-system/cloud-credentials`
|
||
# Secret cloud-init writes at Phase 0
|
||
# (infra/hetzner/cloudinit-control-plane.tftpl §"cloud-credentials-
|
||
# secret"):
|
||
# - hcloud-token → API token (mandatory)
|
||
# - hcloud-cloud-init → base64(cloud-init.yaml) — the autoscaler-
|
||
# spawned worker's bootstrap, identical to the
|
||
# Phase-0 worker user_data. Required by
|
||
# cluster-autoscaler 1.32.x's Hetzner provider
|
||
# (HCLOUD_CLOUD_INIT env var) — without it the
|
||
# autoscaler Pod exits at startup with FATAL
|
||
# "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT
|
||
# is not specified".
|
||
# Flux dereferences `valuesFrom` at HelmRelease apply time, so the
|
||
# plaintext payloads never appear in this committed manifest.
|
||
#
|
||
# The chart's templates/hetzner-node-config-secret.yaml renders these
|
||
# values into a namespace-local `cluster-autoscaler/hetzner-node-config`
|
||
# Secret which the upstream chart's `extraEnvSecrets.HCLOUD_CLOUD_INIT`
|
||
# binding lifts onto the deployment's env.
|
||
valuesFrom:
|
||
- kind: Secret
|
||
name: cloud-credentials
|
||
valuesKey: hcloud-token
|
||
targetPath: clusterAutoscalerHcloud.hcloudToken
|
||
- kind: Secret
|
||
name: cloud-credentials
|
||
valuesKey: hcloud-cloud-init
|
||
targetPath: clusterAutoscalerHcloud.cloudInit
|
||
# When older Sovereigns provisioned BEFORE issue #921 lack the
|
||
# hcloud-cloud-init key, Flux skips this entry rather than failing
|
||
# the entire HelmRelease — the chart's empty-string default keeps
|
||
# the upstream Deployment shape valid (the autoscaler will still
|
||
# FATAL at startup, surfacing the missing-cloud-init in Pod logs;
|
||
# operators rotate by re-running cloud-init or by patching
|
||
# cloud-credentials directly).
|
||
optional: true
|
||
# Per-Sovereign baseline values. clusters/<sovereign>/bootstrap-kit/
|
||
# 40-cluster-autoscaler.yaml MAY override `autoscalingGroups` to set
|
||
# the actual instanceType + region + min/max + name the Tofu module
|
||
# provisioned at Phase 0. The defaults below match the canonical
|
||
# otechN topology (cpx32 in fsn1, min 2 / max 10) so a vanilla
|
||
# Sovereign that forgets to patch this still gets a sensible
|
||
# autoscaler.
|
||
values:
|
||
cluster-autoscaler:
|
||
autoscalingGroups:
|
||
- name: workers
|
||
instanceType: cpx32
|
||
region: fsn1
|
||
minSize: 2
|
||
maxSize: 10
|