openova/clusters/_template/bootstrap-kit/50-cluster-autoscaler.yaml
e3mrah d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the False-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere
  new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00

149 lines
6.2 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# bp-cluster-autoscaler-hcloud — Catalyst bootstrap-kit Blueprint #50
# (Tier 5 — Scaling/Resilience). Slot 40 was already forward-declared
# for bp-llm-gateway in scripts/expected-bootstrap-deps.yaml; this
# blueprint lands at slot 50 — after the W2.K4 cohort + slot 49
# (bp-cert-manager-powerdns-webhook) — to preserve the existing
# numbering plan.
#
# Adds and removes Hetzner Cloud worker nodes on demand in response to
# `FailedScheduling` events on the Sovereign's k3s cluster. Bounded by
# the `min`/`max` node-group config the operator picked at launch.
#
# Live evidence motivating this blueprint (issue #767):
# otech92 — 2× cpx32 workers couldn't fit external-secrets-webhook
# because the bootstrap-kit's RAM aggregate (~14 GB across 35
# HelmReleases) exceeded the 2× 8 GB pool the operator chose. With
# cluster-autoscaler the Sovereign would have grown the pool to a
# third worker automatically.
#
# Wrapper chart: platform/cluster-autoscaler-hcloud/chart/ — umbrella
# over upstream kubernetes/autoscaler cluster-autoscaler chart 9.46.6
# (appVersion 1.32.0). Catalyst-curated values flow under the
# `cluster-autoscaler:` key + a vendor-agnostic
# `clusterAutoscalerHcloud.*` block that ships the namespace-local
# Hetzner-API-token Secret (`hcloud-token`).
#
# Reconciled by: Flux on the new Sovereign's k3s control plane.
#
# Hetzner-token wiring (mirrors the velero/harbor object-storage pattern
# in 19-harbor.yaml + 34-velero.yaml):
# - cloud-init writes `flux-system/cloud-credentials` Secret with the
# `hcloud-token` key (see infra/hetzner/cloudinit-control-plane.tftpl
# §"cloud-credentials-secret"). That Secret is the canonical Hetzner-
# API-token holder for every Day-2 mutation seam (Crossplane provider-
# hcloud, this autoscaler, future hcloud Floating-IP claims).
# - This HelmRelease lifts the `hcloud-token` value into the umbrella
# chart's `clusterAutoscalerHcloud.hcloudToken` value via Flux
# `valuesFrom`. The umbrella chart then synthesises a namespace-local
# `cluster-autoscaler/hcloud-token` Secret (templates/hetzner-token-
# secret.yaml) the upstream chart's `extraEnvSecrets.HCLOUD_TOKEN`
# wiring binds as the deployment's HCLOUD_TOKEN env var.
#
# dependsOn: (none) — cluster-autoscaler is independent of every other
# bootstrap-kit blueprint at install time. The cloud-credentials Secret
# is provisioned by cloud-init BEFORE Flux installs anything.
---
apiVersion: v1
kind: Namespace
metadata:
name: cluster-autoscaler
labels:
catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
name: bp-cluster-autoscaler-hcloud
namespace: flux-system
spec:
type: oci
interval: 15m
url: oci://ghcr.io/openova-io
secretRef:
name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: bp-cluster-autoscaler-hcloud
namespace: flux-system
spec:
interval: 15m
releaseName: cluster-autoscaler
targetNamespace: cluster-autoscaler
chart:
spec:
chart: bp-cluster-autoscaler-hcloud
version: 1.0.0
sourceRef:
kind: HelmRepository
name: bp-cluster-autoscaler-hcloud
namespace: flux-system
# Event-driven install: cluster-autoscaler is a single Deployment +
# ServiceAccount + RBAC. Helm install completes when manifests apply;
# the binary's Hetzner-API connectivity check is a runtime concern,
# not a Helm-wait concern. disableWait keeps Flux's Ready signal
# aligned with manifest apply.
install:
disableWait: true
remediation:
retries: 3
upgrade:
disableWait: true
remediation:
retries: 3
# ── Hetzner-token + node-bootstrap wiring (issue #921) ─────────────
# Pulls keys from the canonical `flux-system/cloud-credentials`
# Secret cloud-init writes at Phase 0
# (infra/hetzner/cloudinit-control-plane.tftpl §"cloud-credentials-
# secret"):
# - hcloud-token → API token (mandatory)
# - hcloud-cloud-init → base64(cloud-init.yaml) — the autoscaler-
# spawned worker's bootstrap, identical to the
# Phase-0 worker user_data. Required by
# cluster-autoscaler 1.32.x's Hetzner provider
# (HCLOUD_CLOUD_INIT env var) — without it the
# autoscaler Pod exits at startup with FATAL
# "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT
# is not specified".
# Flux dereferences `valuesFrom` at HelmRelease apply time, so the
# plaintext payloads never appear in this committed manifest.
#
# The chart's templates/hetzner-node-config-secret.yaml renders these
# values into a namespace-local `cluster-autoscaler/hetzner-node-config`
# Secret which the upstream chart's `extraEnvSecrets.HCLOUD_CLOUD_INIT`
# binding lifts onto the deployment's env.
valuesFrom:
- kind: Secret
name: cloud-credentials
valuesKey: hcloud-token
targetPath: clusterAutoscalerHcloud.hcloudToken
- kind: Secret
name: cloud-credentials
valuesKey: hcloud-cloud-init
targetPath: clusterAutoscalerHcloud.cloudInit
# When older Sovereigns provisioned BEFORE issue #921 lack the
# hcloud-cloud-init key, Flux skips this entry rather than failing
# the entire HelmRelease — the chart's empty-string default keeps
# the upstream Deployment shape valid (the autoscaler will still
# FATAL at startup, surfacing the missing-cloud-init in Pod logs;
# operators rotate by re-running cloud-init or by patching
# cloud-credentials directly).
optional: true
# Per-Sovereign baseline values. clusters/<sovereign>/bootstrap-kit/
# 40-cluster-autoscaler.yaml MAY override `autoscalingGroups` to set
# the actual instanceType + region + min/max + name the Tofu module
# provisioned at Phase 0. The defaults below match the canonical
# otechN topology (cpx32 in fsn1, min 2 / max 10) so a vanilla
# Sovereign that forgets to patch this still gets a sensible
# autoscaler.
values:
cluster-autoscaler:
autoscalingGroups:
- name: workers
instanceType: cpx32
region: fsn1
minSize: 2
maxSize: 10