fix(api): cloud-init kubeconfig postback must live outside RequireSession (#637)

* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during the first otech23 end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.
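
For illustration only, the boot-time lookup and rewrite can be sketched in
Go (the actual fix lives in the cloud-init template and uses curl). The
function names, the fake metadata server, the sample IPs and the
`server: https://127.0.0.1:6443` input line are assumptions, not taken from
the repository:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// serverLine matches a kubeconfig "server: https://host[:port]" line.
var serverLine = regexp.MustCompile(`(?m)^([ \t]*server:[ \t]*https://)[^:/\s]+(:\d+)?[ \t]*$`)

// fetchPublicIPv4 asks a metadata endpoint for the node's public IPv4.
func fetchPublicIPv4(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("metadata service returned %s", resp.Status)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, 64))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

// rewriteServer points every server: entry at publicIPv4, keeping the port.
func rewriteServer(kubeconfig, publicIPv4 string) string {
	return serverLine.ReplaceAllString(kubeconfig, "${1}"+publicIPv4+"${2}")
}

// demo runs both steps against a local fake of the metadata service,
// so nothing here touches 169.254.169.254.
func demo() (string, error) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "203.0.113.10") // documentation-range IP, hypothetical
	}))
	defer fake.Close()

	ip, err := fetchPublicIPv4(fake.Client(), fake.URL+"/hetzner/v1/metadata/public-ipv4")
	if err != nil {
		return "", err
	}
	return rewriteServer("    server: https://127.0.0.1:6443\n", ip), nil
}

func main() {
	out, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the address is only known on the node at boot, nothing in the tofu
graph has to reference hcloud_server.control_plane from its own user_data.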

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wired it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → it would have stayed
    empty even once main.tf passed it

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json
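
A minimal sketch of the server-stamped pattern, assuming a trimmed-down
Request type. Only the HandoverJWTPublicKey field, its json:"-" tag and the
handover_jwt_public_key variable name come from this change; the Name
field and the helpers decodeAndStamp and tfvars are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a stand-in for provisioner.Request, reduced to two fields.
type Request struct {
	Name string `json:"name"` // hypothetical client-supplied field

	// json:"-" hides the field from encoding/json entirely, so a client
	// cannot set it through the request body.
	HandoverJWTPublicKey string `json:"-"`
}

// decodeAndStamp parses the client body, then sets the key server-side.
// An empty serverKey models the "no signer configured" path.
func decodeAndStamp(body []byte, serverKey string) (Request, error) {
	var req Request
	if err := json.Unmarshal(body, &req); err != nil {
		return Request{}, err
	}
	req.HandoverJWTPublicKey = serverKey
	return req, nil
}

// tfvars renders the one variable this sketch cares about as JSON.
func tfvars(req Request) (string, error) {
	out, err := json.Marshal(map[string]string{
		"handover_jwt_public_key": req.HandoverJWTPublicKey,
	})
	return string(out), err
}

func main() {
	body := []byte(`{"name":"otech24","HandoverJWTPublicKey":"from-client"}`)
	req, err := decodeAndStamp(body, "server-jwk")
	if err != nil {
		panic(err)
	}
	vars, err := tfvars(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(vars)
}
```

The client-supplied "HandoverJWTPublicKey" value is dropped during decoding,
and only the value stamped by the server reaches the tfvars output.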

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401
unauthenticated. catalyst-api never received the kubeconfig, so Phase 1
helmwatch never started and the wizard's Jobs page stayed in PENDING
forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.
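
The routing split can be sketched with the standard library alone. This is
an illustration under stated assumptions, not the catalyst-api code: the
real service uses chi and its own RequireSession, while requireSession,
putKubeconfig, demoRouter, the fixed "demo" deployment id and the
"postback-token" value below are hypothetical stand-ins:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const kubeconfigPath = "/api/v1/deployments/demo/kubeconfig"

// requireSession rejects requests that carry no session cookie.
// Signature validation of the cookie is omitted in this sketch.
func requireSession(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("catalyst_session"); err != nil {
			http.Error(w, `{"error":"unauthenticated"}`, http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// putKubeconfig accepts a postback only if SHA-256(bearer) equals storedHash.
func putKubeconfig(storedHash [32]byte) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		auth := r.Header.Get("Authorization")
		sum := sha256.Sum256([]byte(strings.TrimPrefix(auth, "Bearer ")))
		if !strings.HasPrefix(auth, "Bearer ") ||
			subtle.ConstantTimeCompare(sum[:], storedHash[:]) != 1 {
			http.Error(w, `{"error":"unauthenticated"}`, http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
}

// demoRouter wires PUT outside the session gate and GET behind it.
func demoRouter() http.Handler {
	put := putKubeconfig(sha256.Sum256([]byte("postback-token")))
	get := requireSession(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "kubeconfig")
	}))
	mux := http.NewServeMux()
	mux.HandleFunc(kubeconfigPath, func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodPut:
			put.ServeHTTP(w, r) // bearer-hash check is the only gate
		case http.MethodGet:
			get.ServeHTTP(w, r) // operator download needs the cookie
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	})
	return mux
}

// status sends one in-memory request and returns the response code.
func status(h http.Handler, method, bearer string, withCookie bool) int {
	req := httptest.NewRequest(method, kubeconfigPath, nil)
	if bearer != "" {
		req.Header.Set("Authorization", "Bearer "+bearer)
	}
	if withCookie {
		req.AddCookie(&http.Cookie{Name: "catalyst_session", Value: "x"})
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := demoRouter()
	fmt.Println("PUT with valid bearer, no cookie:", status(h, http.MethodPut, "postback-token", false))
	fmt.Println("GET without cookie:", status(h, http.MethodGet, "", false))
}
```

Wrapping the PUT handler in requireSession reproduces the failure described
above: the cookie check would reject cloud-init before the bearer hash is
ever compared.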

Caught live during the first otech23 end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
e3mrah 2026-05-02 22:42:45 +04:00 committed by GitHub
parent 12233290d1
commit 9402970da2
GPG Key ID: B5690EEEBB952194


@@ -194,6 +194,16 @@ func main() {
 })
 r.Delete("/api/v1/auth/session", h.HandleAuthLogout)
+// Unauthenticated cloud-init postback (issue #183, Option D + #634).
+// The new Sovereign's control plane PUTs its rewritten kubeconfig
+// here with `Authorization: Bearer <postback-token>`. PutKubeconfig
+// has its own SHA-256-hash-vs-stored-hash compare — it MUST live
+// outside the session-cookie middleware because cloud-init has no
+// browser cookies. Putting this inside the RequireSession group
+// rejected every postback with 401 {"error":"unauthenticated"} and
+// stuck Phase-1 in PENDING forever (caught live on otech23).
+r.Put("/api/v1/deployments/{id}/kubeconfig", h.PutKubeconfig)
 // Auth-gated wizard endpoints — RequireSession validates the
 // HMAC-signed catalyst_session cookie on every request. When
 // cfg is nil (Sovereign clusters, CI without CATALYST_KC_ADDR)
@@ -242,13 +252,8 @@ func main() {
 // catalyst-api Pod cold-starts mid-Phase-1 and has to reattach
 // to a deployment whose kubeconfig is on the PVC.
 rg.Get("/api/v1/deployments/{id}/kubeconfig", h.GetKubeconfig)
-// PUT — cloud-init postback (issue #183, Option D). The new
-// Sovereign's control plane PUTs its rewritten kubeconfig here
-// with an Authorization: Bearer header. The handler verifies
-// SHA-256 of the bearer against the persisted hash, writes the
-// kubeconfig file to the PVC at mode 0600, and triggers the
-// Phase-1 helmwatch goroutine.
-rg.Put("/api/v1/deployments/{id}/kubeconfig", h.PutKubeconfig)
+// (PUT /kubeconfig is registered ABOVE the session group — see
+// the cloud-init postback comment near r.Delete /auth/session.)
 // Registrar proxy — wizard's BYO Flow B (#169). /validate is called
 // pre-submit so a typo'd token surfaces at the prompt; /set-ns is
 // called from CreateDeployment when domainMode == byo-api.