fix(catalyst-api): wire harbor_robot_token end-to-end (REQUIRED, no docker.io fallback) (#638)

* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23's first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN. The mount is NOT optional:
     provisioner.Validate() rejects deployments with an empty token.

variables.tf keeps its default "" so the module stays plannable when run
standalone, but catalyst-api's Validate() now rejects an empty token, so
the architecture rule is enforced end-to-end: every image on every
Sovereign goes through harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Committed by e3mrah on 2026-05-02 23:07:59 +04:00 (merged via GitHub)
parent 3190d5d0a3, commit a9b9a32aa3
No known key found for this signature in database (GPG Key ID: B5690EEEBB952194)
2 changed files with 75 additions and 3 deletions


@@ -147,6 +147,18 @@ type Request struct {
// 0o600.
GHCRPullToken string `json:"-"`
// HarborRobotToken — central Harbor proxy-cache robot account secret
// (issue #557). Stamped server-side from Provisioner.HarborRobotToken
// (env CATALYST_HARBOR_ROBOT_TOKEN). Interpolated by
// cloudinit-control-plane.tftpl into /etc/rancher/k3s/registries.yaml
// so containerd authenticates against harbor.openova.io's proxy
// projects (proxy-dockerhub, proxy-gcr, proxy-quay, proxy-k8s,
// proxy-ghcr) on every image pull. Without this, containerd falls
// through to the upstream registry on a fresh Hetzner IP — Docker Hub
// returns rate-limit HTML and pods stick at Init:0/6 (caught live
// during otech24). json:"-" — never accepted from the wizard payload.
HarborRobotToken string `json:"-"`
// DeploymentID — catalyst-api's per-deployment identifier (16-char
// hex). Stamped onto the Request by the handler before tfvars are
// emitted so the OpenTofu cloud-init template can render the URL
@@ -313,6 +325,20 @@ func (r *Request) Validate() error {
if r.SovereignDomainMode == "pool" && strings.TrimSpace(r.GHCRPullToken) == "" {
return errors.New("GHCR pull token is required for managed-pool deployments (CATALYST_GHCR_PULL_TOKEN missing on catalyst-api — see docs/SECRET-ROTATION.md)")
}
// Harbor robot token (issue #557) — REQUIRED, no exceptions. The
// architecture mandate is that every Sovereign image pull goes
// through harbor.openova.io's proxy projects (proxy-dockerhub,
// proxy-gcr, proxy-quay, proxy-k8s, proxy-ghcr). An empty token
// means containerd will fail authentication against Harbor and
// fall through to upstream registries — Docker Hub then
// rate-limits a fresh Hetzner IP and pods stick at Init:0/6
// forever (caught live during otech24). Fail fast at /api/v1/
// deployments POST so a misconfigured catalyst-api Pod surfaces
// the missing CATALYST_HARBOR_ROBOT_TOKEN env immediately
// instead of after 5 min of tofu apply.
if strings.TrimSpace(r.HarborRobotToken) == "" {
return errors.New("Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing on catalyst-api — every Sovereign image pull MUST go through harbor.openova.io; falling through to docker.io is not allowed)")
}
// Hetzner Object Storage (issue #371) — Phase 0b. All four fields are
// required for any Hetzner-backed Sovereign: the bucket exists at
@@ -490,6 +516,20 @@ type Provisioner struct {
// error so the operator notices the misconfiguration before
// `tofu apply` runs.
GHCRPullToken string
// HarborRobotToken is the central Harbor proxy-cache robot account
// secret (`robot$openova-bot` on harbor.openova.io). Mounted from
// the Reflector-mirrored `harbor-robot-token` K8s Secret in the
// catalyst namespace as env CATALYST_HARBOR_ROBOT_TOKEN.
// cloudinit-control-plane.tftpl interpolates it into the new
// Sovereign's /etc/rancher/k3s/registries.yaml so containerd
// authenticates against harbor.openova.io's docker.io / gcr / quay /
// k8s / ghcr proxy projects on every image pull (issue #557).
// Empty falls through to anonymous Harbor pulls; if the proxy is
// configured for public access this still works, but rate-limited
// upstream (Docker Hub) pulls will fail when the proxy can't
// authenticate either. Stamped onto every Request before tfvars.
HarborRobotToken string
}
// New returns a Provisioner with paths read from environment.
@@ -506,6 +546,7 @@ func New() *Provisioner {
ModulePath: env("CATALYST_TOFU_MODULE_PATH", "/infra/hetzner"),
WorkDir: env("CATALYST_TOFU_WORKDIR", "/var/lib/catalyst/tofu"),
GHCRPullToken: os.Getenv("CATALYST_GHCR_PULL_TOKEN"),
HarborRobotToken: os.Getenv("CATALYST_HARBOR_ROBOT_TOKEN"),
}
}
@@ -521,6 +562,9 @@ func (p *Provisioner) Provision(ctx context.Context, req Request, events chan<-
if strings.TrimSpace(req.GHCRPullToken) == "" {
req.GHCRPullToken = p.GHCRPullToken
}
if strings.TrimSpace(req.HarborRobotToken) == "" {
req.HarborRobotToken = p.HarborRobotToken
}
if err := req.Validate(); err != nil {
return nil, err
@@ -598,6 +642,9 @@ func (p *Provisioner) Destroy(ctx context.Context, req Request, events chan<- Ev
if strings.TrimSpace(req.GHCRPullToken) == "" {
req.GHCRPullToken = p.GHCRPullToken
}
if strings.TrimSpace(req.HarborRobotToken) == "" {
req.HarborRobotToken = p.HarborRobotToken
}
emit := func(phase, level, msg string) {
select {
@@ -797,6 +844,13 @@ func writeTfvars(deployDir string, req Request) error {
// rotates yearly and is stored in 1Password — never in git.
"ghcr_pull_token": req.GHCRPullToken,
// Harbor proxy-cache robot token (issue #557). Stamped server-
// side. cloudinit-control-plane.tftpl writes it into
// /etc/rancher/k3s/registries.yaml so containerd authenticates
// against harbor.openova.io's proxy projects. Empty falls
// through to anonymous Harbor pulls.
"harbor_robot_token": req.HarborRobotToken,
// Cloud-init kubeconfig postback (issue #183, Option D). The
// catalyst-api stamps deployment_id + kubeconfig_bearer_token
// onto the Request before writeTfvars is called: deployment_id


@@ -287,6 +287,24 @@ spec:
name: catalyst-ghcr-pull-token
key: token
optional: true
# CATALYST_HARBOR_ROBOT_TOKEN — central Harbor proxy-cache
# robot account secret (issue #557). Reflector mirrors the
# `harbor-robot-token` Secret from openova-harbor namespace
# into catalyst namespace; the value is interpolated into
# the new Sovereign's /etc/rancher/k3s/registries.yaml at
# cloud-init time so containerd authenticates against
# harbor.openova.io's proxy projects (proxy-dockerhub etc).
#
# NOT optional — provisioner.Validate() rejects deployments
# with an empty token. The architecture mandate is that every
# Sovereign image pull goes through harbor.openova.io; falling
# through to docker.io is forbidden (rate-limit makes a fresh
# Hetzner IP unbootable within minutes).
- name: CATALYST_HARBOR_ROBOT_TOKEN
valueFrom:
secretKeyRef:
name: harbor-robot-token
key: token
# ── /auth/handover Keycloak service-account (issue #606) ──────────
# CATALYST_KC_ADDR — Keycloak base URL. Defaults to in-cluster
# service FQDN in code; override here for non-standard Sovereign