openova/infra
e3mrah 05065b66d6
fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs

Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04.
GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in
fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in
those DCs with:

  {"error":{"code":"invalid_input",
            "message":"unsupported location for server type"}}

Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate
DELETE. cpx22 + cpx32 were also probed as a sanity check and returned
ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises
prices for every (SKU, location) pair regardless of orderability.
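
The probe loop above can be sketched as follows. This is a minimal illustration, not the script actually used: the verdict labels other than ORDERED, the function names, and the injected `create`/`delete` callables are all hypothetical. In a real run `create` would wrap POST /v1/servers and `delete` would wrap DELETE /v1/servers/{id}; here they are left injectable so nothing is ordered by accident.

```python
import json
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# Verdict labels. Only "ORDERED" appears in the commit message; the other
# two names are assumptions made for this sketch.
ORDERED = "ORDERED"
UNSUPPORTED = "UNSUPPORTED_LOCATION"
OTHER = "OTHER_ERROR"


def classify(status: int, body: str) -> str:
    """Map one POST /v1/servers response (status, raw body) to a verdict."""
    try:
        payload = json.loads(body)
    except ValueError:
        return OTHER
    if not isinstance(payload, dict):
        return OTHER
    if 200 <= status < 300 and "server" in payload:
        return ORDERED
    err = payload.get("error")
    if (isinstance(err, dict)
            and err.get("code") == "invalid_input"
            and "unsupported location" in str(err.get("message", ""))):
        return UNSUPPORTED
    return OTHER


def probe_matrix(
    skus: Iterable[str],
    locations: Iterable[str],
    create: Callable[[str, str], Tuple[int, str]],
    delete: Callable[[int], None],
) -> Dict[Tuple[str, str], str]:
    """Try every (SKU, location) pair; delete any server that was created."""
    results = {}
    for sku, loc in product(list(skus), list(locations)):
        status, body = create(sku, loc)
        verdict = classify(status, body)
        if verdict == ORDERED:
            # Immediate DELETE so the probe does not leave a billed server.
            delete(json.loads(body)["server"]["id"])
        results[(sku, loc)] = verdict
    return results
```

Keeping the classification separate from the HTTP calls means the error-body match can be checked against the exact JSON quoted above without touching the live API.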

Conclusion: do NOT bump the SKUs back down. cpx22 + cpx32 (PR #744) remain the floor.
README + variables.tf docstrings now carry the durable reproducer so future
engineers don't re-attempt cpx21/cpx31.

#753 — kubectl retry / LKG observer reliability

/tmp/autopilot.sh updated (script lives outside the repo, on the VPS):
  • Every kubectl call carries --request-timeout=8s so a hung TLS handshake
    surfaces as a fast empty result rather than a 30s+ stall.
  • Last-known-good (LKG) state held across transient flakes: hr/cert/nodes
    no longer flip to "0/0 nodes=0" on a single failed poll.
  • A real failure is reported only after 3 consecutive transient errors;
    below that threshold the observer prints "hr=<LKG> (transient N/3)".
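
The three rules above can be sketched as follows. This is an illustration in Python, not the contents of /tmp/autopilot.sh, which is a shell script that lives outside the repo. The class and function names, the "hr=FAILED" output, the extra 15s process timeout, and the choice to report a failure when no good value has been seen yet are all assumptions; only the --request-timeout=8s flag, the threshold of 3, and the "hr=<LKG> (transient N/3)" format come from the commit message.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


def poll(args: List[str]) -> Optional[str]:
    """Run one kubectl query; return stdout, or None on any flake."""
    try:
        proc = subprocess.run(
            ["kubectl", *args, "--request-timeout=8s"],
            capture_output=True, text=True,
            timeout=15,  # assumed outer guard in case the process itself hangs
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    out = proc.stdout.strip()
    return out if proc.returncode == 0 and out else None


@dataclass
class LkgTracker:
    """Hold the last-known-good value across transient empty polls."""
    threshold: int = 3
    last_good: Optional[str] = None
    transients: int = 0

    def observe(self, value: Optional[str]) -> str:
        if value:
            # Good poll: remember it and reset the transient counter.
            self.last_good = value
            self.transients = 0
            return f"hr={value}"
        self.transients += 1
        if self.transients >= self.threshold or self.last_good is None:
            # Hypothetical failure output; the real script's text is not shown.
            return "hr=FAILED"
        return f"hr={self.last_good} (transient {self.transients}/{self.threshold})"
```

A single empty poll then yields "hr=<LKG> (transient 1/3)" instead of flipping the display to zero, and the counter resets as soon as one poll succeeds.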

UI side: the wizard's StatusPill / ApplicationPage are driven by SSE from
catalyst-api (useDeploymentEvents.ts), not by direct kubectl polling, so no UI
change is needed. catalyst-api itself uses client-go (helmwatch / phase1_watch)
rather than shelling out to kubectl, so its observer is not subject to the
same shell-out flake.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:11:44 +04:00
hetzner fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756) 2026-05-04 17:11:44 +04:00