Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)
Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.
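
Example per-Sovereign overlay (values excerpt; FQDN illustrative):

    global:
      imageRegistry: harbor.omantel.omani.works
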
Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalyst-api
  labels:
    app.kubernetes.io/name: catalyst-api
    app.kubernetes.io/component: api
  annotations:
    # `kustomize.toolkit.fluxcd.io/force: enabled` is the durable
    # remediation for the `RollingUpdate -> Recreate` strategy-flip
    # collision documented in docs/CHART-AUTHORING.md §"Strategy flips
    # on existing Deployments".
    #
    # Failure mode this addresses
    # ---------------------------
    # On 2026-04-29 the `catalyst` Flux Kustomization on contabo-mkt
    # got stuck at Ready=False with:
    #
    #   Deployment.apps "catalyst-api" is invalid:
    #   spec.strategy.rollingUpdate: Forbidden:
    #   may not be specified when strategy `type` is 'Recreate'
    #
    # Root cause: the live Deployment had previously been created with
    # the default `RollingUpdate` strategy (so `rollingUpdate.maxSurge=25%`
    # and `maxUnavailable=25%` were present on the live object, owned
    # by the `kubectl-client-side-apply` field manager). Flux's
    # kustomize-controller submits this manifest via Server-Side Apply
    # with field manager `kustomize-controller`. SSA's contract is
    # "set the fields you declare" — it does NOT remove fields owned
    # by other managers. Result: the post-merge object had `type: Recreate`
    # AND the residual `rollingUpdate.*` block, which the API server's
    # validator rejects as invalid (Recreate forbids any rollingUpdate
    # keys). SSA is REQUIRED to reject the merge. No SSA-only chart
    # change can fix this.
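    # (Diagnostic sketch, not from the incident log: the stale
    # ownership is visible with
    #   kubectl get deployment catalyst-api --show-managed-fields -o yaml
    # by looking for the rollingUpdate keys under the
    # kubectl-client-side-apply manager's fieldsV1 entry.)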
    #
    # Why `$patch: replace` does NOT solve this
    # -----------------------------------------
    # The Strategic Merge Patch directive `$patch: replace` would tell
    # an SMP-aware merger to REPLACE the strategy block instead of
    # merging into it. But:
    # - SSA rejects `$patch` outright with "field not declared in
    #   schema" (it's not in apps/v1 Deployment).
    # - kubectl strict-decoding rejects `$patch` on CREATE under any
    #   mode with "unknown field spec.strategy.$patch" — so adding
    #   it to the chart manifest BREAKS fresh installs.
    # `$patch: replace` is a runtime SMP directive, never a chart-spec
    # value. It belongs in a Kustomize `patches:` entry (where the
    # kustomize binary consumes it at build time and emits a clean
    # output) — never inline in a base resource.
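    # (Illustrative `patches:` entry, a sketch of where the directive
    # legitimately lives; not part of this chart:
    #
    #   # kustomization.yaml
    #   patches:
    #     - target:
    #         kind: Deployment
    #         name: catalyst-api
    #       patch: |
    #         apiVersion: apps/v1
    #         kind: Deployment
    #         metadata:
    #           name: catalyst-api
    #         spec:
    #           strategy:
    #             $patch: replace
    #             type: Recreate
    # )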
    #
    # Why the Flux force annotation IS the right fix
    # ----------------------------------------------
    # When kustomize-controller's SSA submission fails dry-run with an
    # Invalid response, this annotation directs the controller to
    # recover by deleting and recreating THIS resource specifically
    # (not the whole Kustomization). The recreated Deployment has no
    # residual `rollingUpdate.*` fields — the regression cannot
    # recur on the rebuilt object.
    #
    # That is NOT a "kubectl delete bandaid": the annotation is part
    # of the IaC manifest, version-controlled, applied declaratively
    # via Flux on every reconciliation, scoped to this single
    # Deployment, and removed only by editing the chart. Per
    # docs/INVIOLABLE-PRINCIPLES.md #3 (Follow the documented
    # architecture, exactly — Flux is the ONLY GitOps reconciler) and
    # #4 (Never hardcode — runtime configuration in Git, not in shell
    # history): the remediation lives in source control.
    #
    # Why this Deployment in particular tolerates a recreate: the
    # spec declares `strategy.type: Recreate`, so the steady-state
    # update path is delete-and-recreate anyway. Flux falling back to
    # delete-and-recreate on a strategy-flip is a no-op relative to a
    # normal pod-spec change. The deployments PVC is ReadWriteOnce;
    # the recreate flow detaches it from the old Pod before mounting
    # it on the new one, which is exactly the contract `Recreate`
    # enforces. State persistence is maintained because the PVC
    # itself is NOT recreated by this annotation — only the
    # Deployment resource is.
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  replicas: 1
  # Recreate strategy is required because the deployments PVC is RWO
  # (single-attach). A rolling update would try to schedule a second
  # Pod that mounts the same PVC, which Kubernetes rejects as a
  # MultiAttachError. RWX with a multi-writer-aware filesystem
  # (NFS, CephFS) is the path to HA, but Catalyst-Zero today is
  # single-replica by design — the wizard is interactive and PDM owns
  # cross-tenant isolation, so a single API server is sufficient.
  #
  # The strategy-flip regression that bit contabo-mkt on 2026-04-29
  # (apply over a pre-existing RollingUpdate Deployment fails with
  # `spec.strategy.rollingUpdate: Forbidden`) is recovered by the
  # `kustomize.toolkit.fluxcd.io/force: enabled` annotation above —
  # see that annotation's comment for the full failure-mode analysis
  # and the docs/CHART-AUTHORING.md §"Strategy flips on existing
  # Deployments" entry. Do NOT add an inline `$patch: replace` here:
  # it BREAKS fresh installs (kubectl strict-decoding rejects
  # `spec.strategy.$patch` on create), and Flux's SSA path strips it
  # anyway. The integration test at tests/integration/strategy-flip.yaml
  # asserts both that the recovery path works and that the regression
  # mode is still detected.
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: catalyst-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: catalyst-api
    spec:
      imagePullSecrets:
        - name: ghcr-pull
      # fsGroup applies to the volumes mounted into the Pod so the
      # non-root container UID (65534) can write to the deployments
      # PVC. Without this, Hetzner Cloud Volumes default to root:root
      # and the catalyst-api process gets EACCES on every store.Save —
      # surfacing as the "deployment store unavailable" warning at
      # startup and silent persistence failures at runtime.
      #
      # fsGroupChangePolicy: OnRootMismatch limits the chown traversal
      # to first start (where the volume is freshly provisioned with
      # the wrong UID). Subsequent restarts skip the recursive chown
      # if the root dir already matches, keeping Pod start times
      # bounded as the deployments directory grows.
      securityContext:
        fsGroup: 65534
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: catalyst-api
          image: "{{ if .Values.global.imageRegistry }}{{ .Values.global.imageRegistry }}{{ else }}{{ .Values.images.registry }}{{ end }}/{{ .Values.images.organization }}/catalyst-api:{{ .Values.images.catalystApi.tag }}"
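          # (Resolution sketch, hypothetical values: with
          #   global.imageRegistry: harbor.omantel.omani.works
          #   images.organization: openova-io
          #   images.catalystApi.tag: v1.2.3
          # this renders
          #   harbor.omantel.omani.works/openova-io/catalyst-api:v1.2.3
          # and with global.imageRegistry empty or unset it falls back
          # to images.registry, e.g. ghcr.io.)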
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              protocol: TCP
          env:
            - name: PORT
              value: "8080"
            - name: CORS_ORIGIN
              value: "https://catalyst.openova.io"
            - name: DYNADOT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: dynadot-api-credentials
                  key: api-key
                  # optional=true: Sovereign clusters don't hold Dynadot
                  # credentials — their tenant DNS is served by the
                  # Sovereign's own PowerDNS instance, not the parent
                  # account. Catalyst-Zero (contabo-mkt) supplies the
                  # real secret; Sovereigns use an empty stub or omit it
                  # entirely. Without optional=true the pod refuses to
                  # start when the secret is absent (issue #547).
                  optional: true
            - name: DYNADOT_API_SECRET
              valueFrom:
                secretKeyRef:
                  name: dynadot-api-credentials
                  key: api-secret
                  optional: true
            # DYNADOT_MANAGED_DOMAINS — comma-separated list of pool domains
            # the same Dynadot account manages. Per docs/INVIOLABLE-PRINCIPLES.md
            # #4, this is runtime configuration so adding a third pool domain
            # (e.g. acme.io) does NOT require a code change — only a secret
            # update. The Dynadot API is account-scoped (one api-key/api-secret
            # pair covers every domain owned by the account); this list scopes
            # which domains the catalyst-api is *allowed* to write records for,
            # defending against misconfiguration that would let a wizard-
            # supplied poolDomain trigger writes against an unrelated domain.
            - name: DYNADOT_MANAGED_DOMAINS
              valueFrom:
                secretKeyRef:
                  name: dynadot-api-credentials
                  key: domains
                  # optional=true so deployments using the legacy single-value
                  # `domain` key (pre-#108) keep working until the secret is
                  # migrated; the dynadot package falls through to DYNADOT_DOMAIN
                  # then to its built-in defaults if neither key is present.
                  optional: true
            - name: DYNADOT_DOMAIN
              valueFrom:
                secretKeyRef:
                  name: dynadot-api-credentials
                  key: domain
                  optional: true
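            # (Illustrative out-of-band creation of the secret above;
            # key names are from this manifest, values hypothetical:
            #   kubectl create secret generic dynadot-api-credentials \
            #     --from-literal=api-key=XXXX \
            #     --from-literal=api-secret=XXXX \
            #     --from-literal=domains=openova.io,omani.works )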
            # CATALYST_TOFU_WORKDIR — provisioner runs `tofu init/plan/apply`
            # inside this directory. Default in code (/var/lib/catalyst/tofu)
            # is unwritable for UID 65534 because the only emptyDir mounts on
            # this Pod are /tmp and /home/nonroot. We pin to /tmp/catalyst so
            # the writable emptyDir backs the per-Sovereign workdir tree.
            - name: CATALYST_TOFU_WORKDIR
              value: /tmp/catalyst/tofu
            # CATALYST_DEPLOYMENTS_DIR — flat-file store for deployment
            # records (one JSON file per deployment id). Backed by the
            # PVC mount below so deployments persist across Pod
            # restarts. Each record is the full Deployment state with
            # credentials redacted; see internal/store/store.go.
            - name: CATALYST_DEPLOYMENTS_DIR
              value: /var/lib/catalyst/deployments
            # CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS (issue #547) — current
            # bootstrap-kit cardinality is 38 HRs (clusters/_template/
            # bootstrap-kit/01-cilium → 49-bp-cert-manager-powerdns-webhook).
            # Watch must observe at least this many HRs before the
            # terminate-on-all-done check fires. The helmwatch default is
            # 11 (the legacy 11-component count); without this override,
            # the informers' alphabetical sync order means the first 12
            # HRs reach Ready=True before the rest enter the cache, the
            # watch exits Ready early, and the wizard's jobs page locks
            # at 12/38 install rows. Per INVIOLABLE-PRINCIPLES #4 (no
            # hardcoded values), keeping the actual count as a chart
            # value lets future bootstrap-kit additions ship without
            # touching helmwatch source.
            - name: CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS
              value: "38"
            # CATALYST_KUBECONFIGS_DIR — sibling directory on the same
            # PVC for the plaintext kubeconfigs the new Sovereign POSTs
            # back via the bearer-token endpoint (issue #183, Option D).
            # One <id>.yaml per deployment, mode 0600. The store JSON
            # record carries only the file path + a SHA-256 hash of
            # the bearer; the plaintext kubeconfig is NEVER serialized
            # into the JSON.
            - name: CATALYST_KUBECONFIGS_DIR
              value: /var/lib/catalyst/kubeconfigs
            # CATALYST_API_PUBLIC_URL — the public origin the new
            # Sovereign's cloud-init PUTs its kubeconfig back to. The
            # OpenTofu module templates this into the Sovereign's
            # user_data so the Sovereign knows where to call. Per
            # docs/INVIOLABLE-PRINCIPLES.md #4 this is runtime
            # configuration; air-gapped franchises override it
            # without code change.
            - name: CATALYST_API_PUBLIC_URL
              value: https://console.openova.io/sovereign
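            # (Postback sketch, cloud-init side. The endpoint path is
            # the one named for the k8scache handler below; the id,
            # token variable, and header shape are assumptions:
            #   curl -X PUT \
            #     -H "Authorization: Bearer $BOOTSTRAP_TOKEN" \
            #     --data-binary @"$KUBECONFIG_PATH" \
            #     https://console.openova.io/sovereign/api/v1/deployments/$ID/kubeconfig )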
            # CATALYST_K8SCACHE_KUBECONFIGS_DIR — issue #321. Directory
            # the k8scache.Factory reads kubeconfigs from at startup.
            # The data-plane SharedInformerFactory opens one informer
            # per kubeconfig file; the cloud-init postback handler
            # (PUT /api/v1/deployments/{id}/kubeconfig) writes here on
            # Phase-1 attach so a fresh Sovereign id is automatically
            # picked up at the next catalyst-api restart. The same PVC
            # (catalyst-api-deployments) backs the existing
            # deployments store; the data-plane reads the kubeconfigs/
            # subdirectory directly.
            - name: CATALYST_K8SCACHE_KUBECONFIGS_DIR
              value: /var/lib/catalyst/kubeconfigs
            # CATALYST_K8SCACHE_SNAPSHOT_DIR — issue #321 cold-start
            # mitigation. Backed by a separate 5Gi PVC
            # (catalyst-api-cache) so its size is independent of the
            # deployments store. See api-cache-pvc.yaml for the sizing
            # rationale + the cold-start latency contract.
            - name: CATALYST_K8SCACHE_SNAPSHOT_DIR
              value: /var/cache/sov-cache
            # CATALYST_K8SCACHE_KINDS_CONFIGMAP — optional ConfigMap
            # extending the built-in kinds registry. Per docs/
            # INVIOLABLE-PRINCIPLES.md #4 a new watched GVR (e.g.
            # HelmRelease, Kustomization) is a runtime configuration
            # change, not a code change. Empty disables ConfigMap
            # loading; the built-in DefaultKinds registry is used.
            - name: CATALYST_K8SCACHE_KINDS_CONFIGMAP
              value: catalyst-k8scache-kinds
            - name: CATALYST_K8SCACHE_KINDS_CONFIGMAP_NAMESPACE
              value: catalyst
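            # (Illustrative ConfigMap shape; the data schema is an
            # assumption, not verified against the k8scache loader:
            #   apiVersion: v1
            #   kind: ConfigMap
            #   metadata:
            #     name: catalyst-k8scache-kinds
            #     namespace: catalyst
            #   data:
            #     kinds: |
            #       - group: helm.toolkit.fluxcd.io
            #         version: v2
            #         kind: HelmRelease
            # )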
            # CATALYST_GHCR_PULL_TOKEN — long-lived GHCR pull token that
            # the provisioner stamps onto every Request and the OpenTofu
            # cloud-init template writes into the new Sovereign's
            # flux-system/ghcr-pull Secret so Flux source-controller
            # can pull private bp-* OCI artifacts from
            # ghcr.io/openova-io/. Without this, Phase 1 stalls at
            # bp-cilium with "secrets ghcr-pull not found" — verified
            # live on omantel.omani.works pre-fix.
            #
            # optional: true — when the Secret or key is missing the
            # Pod still starts (with the env var unset). The
            # provisioner's Validate() rejects deployments that need
            # the token (Phase 1 bootstrap-kit pulls private bp-*
            # charts) with a clear pointer to docs/SECRET-ROTATION.md,
            # so a misconfigured catalyst-api fails fast on the
            # /api/v1/deployments POST instead of failing silently
            # mid-apply. /healthz, /api/v1/credentials/validate, and
            # the BYO registrar proxy keep working — they don't read
            # the token at all.
            #
            # Rotation: yearly, see docs/SECRET-ROTATION.md. The Secret
            # is created out-of-band by an operator (never via Flux,
            # never committed to git) — the chart references it but
            # does not template it.
            - name: CATALYST_GHCR_PULL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: catalyst-ghcr-pull-token
                  key: token
                  optional: true
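            # (Out-of-band creation sketch, operator-run and never
            # committed; namespace and token value hypothetical:
            #   kubectl create secret generic catalyst-ghcr-pull-token \
            #     -n catalyst --from-literal=token=ghp_XXXX )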
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              # tofu provider plugins (hcloud ~80MB, dynadot ~30MB) + state +
              # plan files easily exceed the prior 64Mi cap. 1Gi gives headroom
              # for parallel provider init and sustained `apply` work.
              cpu: 1000m
              memory: 1Gi
          # Liveness vs readiness — the split is REQUIRED, not cosmetic
          # (issue #530). /healthz is liveness: it returns 200 whenever
          # the catalyst-api process is up and the HTTP server is
          # serving. /readyz is readiness: it returns 200 only when the
          # primary Sovereign's Pod + Deployment informers are synced
          # (or no Sovereigns are registered yet).
          #
          # The previous wiring pointed BOTH probes at /healthz AND
          # /healthz performed the strict informer-sync check. The
          # crashloop chain that followed:
          #
          #   1. Operator POSTs a fresh deployment.
          #   2. catalyst-api registers the Sovereign in k8scache and
          #      starts looking for a kubeconfig file on the PVC.
          #   3. The kubeconfig will NOT arrive until the new Sovereign's
          #      cloud-init runs (~60-120s) and PUTs it back. Until
          #      then, informers cannot start and the sync check
          #      reports false.
          #   4. /healthz returns 503. kubelet kills the Pod on the
          #      next liveness probe (~33s).
          #   5. The restarted Pod restores deployments from the PVC,
          #      re-registers the Sovereign, and re-enters the same
          #      no-kubeconfig state. The loop repeats.
          #   6. The Service has zero ready endpoints throughout. nginx
          #      returns 502 to cloud-init's kubeconfig PUT. The PUT
          #      never reaches catalyst-api. Provisioning stalls forever.
          #
          # The fix: liveness must be process-level (am I up?), NOT
          # workload-level (do I have a kubeconfig?). The strict
          # informer-sync check stays — moved to /readyz — so a Pod
          # whose primary Sovereign is mid-sync briefly drops out of
          # the Service rotation but is NOT restarted. The kubeconfig
          # PUT endpoint reaches catalyst-api the moment cloud-init
          # calls it, breaking the deadlock.
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            # readOnlyRootFilesystem deliberately false: the bootstrap installer
            # writes kubeconfig temp files (mode 0600) under /tmp and helm
            # downloads chart caches under $HOME. Per Catalyst security policy
            # these writes are scoped via the emptyDirs below, never to the
            # image's actual root FS.
            readOnlyRootFilesystem: false
            runAsNonRoot: true
            runAsUser: 65534
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: home
              mountPath: /home/nonroot
            # Catalyst PVC — mounted at /var/lib/catalyst so two
            # subdirectories live on the same single-attach volume:
            #
            #   deployments/<id>.json — flat-file deployment store.
            #   Rehydrating from this directory on every catalyst-api
            #   restart closes the user-reported regression where a
            #   deployment id created at 12:57 vanished after 6 image
            #   rolls. The store walks every *.json on startup;
            #   in-flight rows are rewritten to `failed` with operator
            #   instructions for purging orphaned Hetzner resources.
            #
            #   kubeconfigs/<id>.yaml — plaintext kubeconfig POSTed
            #   back from cloud-init via the bearer-token endpoint
            #   (issue #183, Option D). Mode 0600 per file. The
            #   path is persisted in the deployment record so a
            #   Pod restart mid-Phase-1 reattaches the helmwatch
            #   goroutine.
            #
            # One PVC, one mount — keeps the failure modes (PVC
            # unbind, fs full) bounded to one volume, and lets the
            # Go process create both subdirectories on startup
            # without a second volume claim or init container.
            - name: catalyst
              mountPath: /var/lib/catalyst
            # k8scache disk-snapshot mount (issue #321). Separate PVC
            # so cache size is independent of deployment-record
            # storage. The k8scache loop writes one JSON per
            # (cluster, kind) here, mode 0600. Pruned by the loop
            # itself when a snapshot ages past 1h.
            - name: sov-cache
              mountPath: /var/cache/sov-cache
      volumes:
        - name: tmp
          emptyDir:
            # 2Gi to hold the per-deployment OpenTofu workdir tree under
            # /tmp/catalyst/tofu/<sovereign-fqdn>/ (provider plugins + state
            # + plan binary). Each Sovereign run gets its own subdirectory.
            sizeLimit: 2Gi
        - name: home
          emptyDir:
            sizeLimit: 256Mi
        # Persistent catalyst-api state — mounted at /var/lib/catalyst
        # so deployments/ and kubeconfigs/ share one volume. The PVC
        # must already exist in the same namespace under the name
        # catalyst-api-deployments; see api-deployments-pvc.yaml in
        # this chart. Single-attach (RWO) is fine because the
        # Deployment is single-replica with the Recreate strategy
        # declared above; a future HA rework would need RWX or a
        # different persistence layer.
        - name: catalyst
          persistentVolumeClaim:
            claimName: catalyst-api-deployments
        # k8scache disk-snapshot PVC (issue #321). 5Gi RWO; see
        # api-cache-pvc.yaml for the sizing + cold-start contract.
        - name: sov-cache
          persistentVolumeClaim:
            claimName: catalyst-api-cache