fix(catalyst-api): wire harbor_robot_token end-to-end (REQUIRED, no docker.io fallback) (#638)

* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

  PR #546 (Closes #542) introduced a dependency cycle:

    hcloud_server.control_plane.user_data → local.control_plane_cloud_init
    local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

  `tofu plan` failed with:

    Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

  Caught live during otech23 first-end-to-end provisioning attempt.

  Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:

    curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

  Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

  The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:

  - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

  Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with:

    Error: Invalid function argument

      on main.tf line 170, in locals:
      170: control_plane_cloud_init = replace(templatefile(...

    Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.

  Fix:

  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

  variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure).

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

  The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.

  Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever.

  Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie.

  Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

  PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:

    Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...

  cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch.

  Caught live during otech24.

  Wiring (mirrors the GHCRPullToken pattern):

  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
  2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up.

  variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.

  Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
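
The bearer-hash gate mentioned in the kubeconfig postback fix could look roughly like the sketch below. It is not taken from PutKubeconfig: the helper names `hashToken` and `bearerMatches`, the hex encoding of the stored hash, and the `Bearer ` header scheme are all assumptions made for illustration. The only thing the commit message states is that the server verifies a SHA-256-hashed bearer token.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashToken returns the hex-encoded SHA-256 digest of a bearer token.
// Hex encoding is an assumption; the real storage format is not shown in the commit.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// bearerMatches (hypothetical helper) reports whether the Authorization header
// carries a token whose SHA-256 digest equals the stored digest.
func bearerMatches(authHeader, storedHash string) bool {
	const scheme = "Bearer "
	if !strings.HasPrefix(authHeader, scheme) {
		return false
	}
	token := strings.TrimPrefix(authHeader, scheme)
	if token == "" {
		return false
	}
	// Constant-time comparison avoids leaking how many leading bytes matched.
	return subtle.ConstantTimeCompare([]byte(hashToken(token)), []byte(storedHash)) == 1
}

func main() {
	stored := hashToken("example-postback-token") // persisted when the deployment is created
	fmt.Println(bearerMatches("Bearer example-postback-token", stored))
	fmt.Println(bearerMatches("Bearer something-else", stored))
}
```

Because only the digest is stored server-side, a route guarded this way needs no session cookie, which is why it can sit outside the RequireSession group.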
parent 3190d5d0a3
commit a9b9a32aa3
@@ -147,6 +147,18 @@ type Request struct {
 	// 0o600.
 	GHCRPullToken string `json:"-"`
 
+	// HarborRobotToken — central Harbor proxy-cache robot account secret
+	// (issue #557). Stamped server-side from Provisioner.HarborRobotToken
+	// (env CATALYST_HARBOR_ROBOT_TOKEN). Interpolated by
+	// cloudinit-control-plane.tftpl into /etc/rancher/k3s/registries.yaml
+	// so containerd authenticates against harbor.openova.io's proxy
+	// projects (proxy-dockerhub, proxy-gcr, proxy-quay, proxy-k8s,
+	// proxy-ghcr) on every image pull. Without this, containerd falls
+	// through to the upstream registry on a fresh Hetzner IP — Docker Hub
+	// returns rate-limit HTML and pods stick at Init:0/6 (caught live
+	// during otech24). json:"-" — never accepted from the wizard payload.
+	HarborRobotToken string `json:"-"`
+
 	// DeploymentID — catalyst-api's per-deployment identifier (16-char
 	// hex). Stamped onto the Request by the handler before tfvars are
 	// emitted so the OpenTofu cloud-init template can render the URL
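
As a minimal sketch of why the `json:"-"` tag keeps the token server-stamped, the program below decodes a client payload that tries to set the field. `wizardRequest` is a hypothetical, trimmed-down stand-in for provisioner.Request, and the `sovereignDomainMode` JSON key is an assumption; only the `json:"-"` tag on HarborRobotToken comes from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wizardRequest is a hypothetical stand-in for provisioner.Request.
type wizardRequest struct {
	SovereignDomainMode string `json:"sovereignDomainMode"` // assumed key name
	HarborRobotToken    string `json:"-"`                   // never read from or written to JSON
}

// decodeWizardPayload parses client JSON. encoding/json skips fields tagged
// json:"-", so no key in the payload can populate HarborRobotToken.
func decodeWizardPayload(payload []byte) (wizardRequest, error) {
	var req wizardRequest
	err := json.Unmarshal(payload, &req)
	return req, err
}

func main() {
	payload := []byte(`{"sovereignDomainMode":"pool","HarborRobotToken":"client-supplied"}`)
	req, err := decodeWizardPayload(payload)
	if err != nil {
		panic(err)
	}
	// The token stays empty until server code assigns it explicitly.
	fmt.Printf("mode=%q token=%q\n", req.SovereignDomainMode, req.HarborRobotToken)
}
```

The same tag also keeps the token out of any JSON the API marshals back to the client.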
@@ -313,6 +325,20 @@ func (r *Request) Validate() error {
 	if r.SovereignDomainMode == "pool" && strings.TrimSpace(r.GHCRPullToken) == "" {
 		return errors.New("GHCR pull token is required for managed-pool deployments (CATALYST_GHCR_PULL_TOKEN missing on catalyst-api — see docs/SECRET-ROTATION.md)")
 	}
+	// Harbor robot token (issue #557) — REQUIRED, no exceptions. The
+	// architecture mandate is that every Sovereign image pull goes
+	// through harbor.openova.io's proxy projects (proxy-dockerhub,
+	// proxy-gcr, proxy-quay, proxy-k8s, proxy-ghcr). An empty token
+	// means containerd will fail authentication against Harbor and
+	// fall through to upstream registries — Docker Hub then
+	// rate-limits a fresh Hetzner IP and pods stick at Init:0/6
+	// forever (caught live during otech24). Fail fast at /api/v1/
+	// deployments POST so a misconfigured catalyst-api Pod surfaces
+	// the missing CATALYST_HARBOR_ROBOT_TOKEN env immediately
+	// instead of after 5 min of tofu apply.
+	if strings.TrimSpace(r.HarborRobotToken) == "" {
+		return errors.New("Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing on catalyst-api — every Sovereign image pull MUST go through harbor.openova.io; falling through to docker.io is not allowed)")
+	}
 
 	// Hetzner Object Storage (issue #371) — Phase 0b. All four fields are
 	// required for any Hetzner-backed Sovereign: the bucket exists at
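
The fail-fast check above can be exercised in isolation. This is a reduced sketch, not the real Validate(): the `request` type, `validate` method, and shortened error text are hypothetical. It shows that `strings.TrimSpace` makes a whitespace-only value (for example a Secret that contains only a trailing newline) count as missing.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// request is a hypothetical stand-in carrying only the field under test.
type request struct {
	HarborRobotToken string
}

var errMissingHarborToken = errors.New("harbor robot token is required")

// validate mirrors the shape of the check in the diff: trim, then compare to "".
func (r *request) validate() error {
	if strings.TrimSpace(r.HarborRobotToken) == "" {
		return errMissingHarborToken
	}
	return nil
}

func main() {
	for _, tok := range []string{"", " \n", "example-robot-secret"} {
		r := request{HarborRobotToken: tok}
		fmt.Printf("%q -> %v\n", tok, r.validate())
	}
}
```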
@@ -490,6 +516,20 @@ type Provisioner struct {
 	// error so the operator notices the misconfiguration before
 	// `tofu apply` runs.
 	GHCRPullToken string
+
+	// HarborRobotToken is the central Harbor proxy-cache robot account
+	// secret (`robot$openova-bot` on harbor.openova.io). Mounted from
+	// the Reflector-mirrored `harbor-robot-token` K8s Secret in the
+	// catalyst namespace as env CATALYST_HARBOR_ROBOT_TOKEN.
+	// cloudinit-control-plane.tftpl interpolates it into the new
+	// Sovereign's /etc/rancher/k3s/registries.yaml so containerd
+	// authenticates against harbor.openova.io's docker.io / gcr / quay /
+	// k8s / ghcr proxy projects on every image pull (issue #557).
+	// Empty falls through to anonymous Harbor pulls; if the proxy is
+	// configured for public access this still works, but rate-limited
+	// upstream (Docker Hub) pulls will fail when the proxy can't
+	// authenticate either. Stamped onto every Request before tfvars.
+	HarborRobotToken string
 }
 
 // New returns a Provisioner with paths read from environment.
@@ -503,9 +543,10 @@ type Provisioner struct {
 // with a clear pointer to docs/SECRET-ROTATION.md.
 func New() *Provisioner {
 	return &Provisioner{
-		ModulePath:    env("CATALYST_TOFU_MODULE_PATH", "/infra/hetzner"),
-		WorkDir:       env("CATALYST_TOFU_WORKDIR", "/var/lib/catalyst/tofu"),
-		GHCRPullToken: os.Getenv("CATALYST_GHCR_PULL_TOKEN"),
+		ModulePath:       env("CATALYST_TOFU_MODULE_PATH", "/infra/hetzner"),
+		WorkDir:          env("CATALYST_TOFU_WORKDIR", "/var/lib/catalyst/tofu"),
+		GHCRPullToken:    os.Getenv("CATALYST_GHCR_PULL_TOKEN"),
+		HarborRobotToken: os.Getenv("CATALYST_HARBOR_ROBOT_TOKEN"),
 	}
 }
 
@@ -521,6 +562,9 @@ func (p *Provisioner) Provision(ctx context.Context, req Request, events chan<-
 	if strings.TrimSpace(req.GHCRPullToken) == "" {
 		req.GHCRPullToken = p.GHCRPullToken
 	}
+	if strings.TrimSpace(req.HarborRobotToken) == "" {
+		req.HarborRobotToken = p.HarborRobotToken
+	}
 
 	if err := req.Validate(); err != nil {
 		return nil, err
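
The ordering in this hunk matters: the token is stamped from the Provisioner first, and only then is the Request validated. A minimal sketch of that stamp-then-validate sequence follows. The `provisioner`, `request`, and `prepare` names are hypothetical, and the real Provision() does considerably more than this.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type provisioner struct{ HarborRobotToken string }

type request struct{ HarborRobotToken string }

func (r request) validate() error {
	if strings.TrimSpace(r.HarborRobotToken) == "" {
		return errors.New("harbor robot token is required")
	}
	return nil
}

// prepare (hypothetical) fills the token from the provisioner when the request
// has none, then validates. req is passed by value, as in the diff, so the
// caller's copy is left untouched.
func (p *provisioner) prepare(req request) (request, error) {
	if strings.TrimSpace(req.HarborRobotToken) == "" {
		req.HarborRobotToken = p.HarborRobotToken
	}
	if err := req.validate(); err != nil {
		return request{}, err
	}
	return req, nil
}

func main() {
	configured := &provisioner{HarborRobotToken: "from-env"}
	unconfigured := &provisioner{}

	got, err := configured.prepare(request{})
	fmt.Println(got.HarborRobotToken, err)

	_, err = unconfigured.prepare(request{})
	fmt.Println(err)
}
```

With this order, a catalyst-api Pod started without CATALYST_HARBOR_ROBOT_TOKEN fails every request at validation rather than later, during `tofu apply`.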
@@ -598,6 +642,9 @@ func (p *Provisioner) Destroy(ctx context.Context, req Request, events chan<- Ev
 	if strings.TrimSpace(req.GHCRPullToken) == "" {
 		req.GHCRPullToken = p.GHCRPullToken
 	}
+	if strings.TrimSpace(req.HarborRobotToken) == "" {
+		req.HarborRobotToken = p.HarborRobotToken
+	}
 
 	emit := func(phase, level, msg string) {
 		select {
@@ -797,6 +844,13 @@ func writeTfvars(deployDir string, req Request) error {
 		// rotates yearly and is stored in 1Password — never in git.
 		"ghcr_pull_token": req.GHCRPullToken,
 
+		// Harbor proxy-cache robot token (issue #557). Stamped server-
+		// side. cloudinit-control-plane.tftpl writes it into
+		// /etc/rancher/k3s/registries.yaml so containerd authenticates
+		// against harbor.openova.io's proxy projects. Empty falls
+		// through to anonymous Harbor pulls.
+		"harbor_robot_token": req.HarborRobotToken,
+
 		// Cloud-init kubeconfig postback (issue #183, Option D). The
 		// catalyst-api stamps deployment_id + kubeconfig_bearer_token
 		// onto the Request before writeTfvars is called: deployment_id
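
For orientation, emitting a variable into tofu.auto.tfvars.json might look like the sketch below. It is not the real writeTfvars: `writeTfvarsSketch` and `roundTrip` are hypothetical helpers, and the 0o600 file mode is an assumption (the diff only shows the map key and the file name appears in the commit message).

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeTfvarsSketch (hypothetical) serialises vars as JSON into
// <deployDir>/tofu.auto.tfvars.json, a file OpenTofu loads automatically.
func writeTfvarsSketch(deployDir string, vars map[string]any) (string, error) {
	data, err := json.MarshalIndent(vars, "", "  ")
	if err != nil {
		return "", err
	}
	path := filepath.Join(deployDir, "tofu.auto.tfvars.json")
	// 0o600 is assumed here because the file holds secrets.
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return "", err
	}
	return path, nil
}

// roundTrip writes vars to a temporary directory and reads them back.
func roundTrip(vars map[string]any) (map[string]any, error) {
	dir, err := os.MkdirTemp("", "tfvars-sketch-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	path, err := writeTfvarsSketch(dir, vars)
	if err != nil {
		return nil, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var out map[string]any
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	got, err := roundTrip(map[string]any{"harbor_robot_token": "example-token"})
	if err != nil {
		panic(err)
	}
	fmt.Println(got["harbor_robot_token"])
}
```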
@@ -287,6 +287,24 @@ spec:
                   name: catalyst-ghcr-pull-token
                   key: token
                   optional: true
+            # CATALYST_HARBOR_ROBOT_TOKEN — central Harbor proxy-cache
+            # robot account secret (issue #557). Reflector mirrors the
+            # `harbor-robot-token` Secret from openova-harbor namespace
+            # into catalyst namespace; the value is interpolated into
+            # the new Sovereign's /etc/rancher/k3s/registries.yaml at
+            # cloud-init time so containerd authenticates against
+            # harbor.openova.io's proxy projects (proxy-dockerhub etc).
+            #
+            # NOT optional — provisioner.Validate() rejects deployments
+            # with an empty token. The architecture mandate is that every
+            # Sovereign image pull goes through harbor.openova.io; falling
+            # through to docker.io is forbidden (rate-limit makes a fresh
+            # Hetzner IP unbootable within minutes).
+            - name: CATALYST_HARBOR_ROBOT_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: harbor-robot-token
+                  key: token
             # ── /auth/handover Keycloak service-account (issue #606) ──────────
             # CATALYST_KC_ADDR — Keycloak base URL. Defaults to in-cluster
             # service FQDN in code; override here for non-standard Sovereign