Commit Graph

859 Commits

Author SHA1 Message Date
e3mrah
8d2ba0495d
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)
2026-05-02 15:18:49 +04:00
e3mrah
942be6f58d
fix(ci): disable buildx provenance+sbom attestation in dynadot-webhook build (#583)
containerd 1.7.x on k3s cannot pull multi-arch images whose OCI index
includes an attestation manifest (the unknown/unknown platform entry added
by docker/build-push-action when provenance=true).  Containerd resolves
the manifest index, encounters the attestation entry, fetches its descriptor
from GHCR which returns an HTML 404 page, and then caches that HTML page as
a blob SHA — every subsequent pull of ANY tag for that image returns the same
HTML SHA instead of the real layer.

Fix: set provenance=false + sbom=false on the build-push-action step.
SBOM attestation is handled separately by cosign attest, which does not
embed its manifest into the OCI index.
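
A minimal sketch of the corrected build step (the provenance and sbom inputs are real docker/build-push-action inputs; the action version, image name, and tag below are placeholders):

```yaml
# Sketch only: disable attestation manifests so the OCI index stays containerd-1.7-friendly
- name: Build and push dynadot-webhook image
  uses: docker/build-push-action@v6          # version placeholder
  with:
    push: true
    tags: ghcr.io/openova-io/cert-manager-dynadot-webhook:latest   # placeholder tag
    provenance: false   # no attestation entry (unknown/unknown platform) added to the index
    sbom: false         # SBOM handled separately by cosign attest, outside the OCI index
```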

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:29:58 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.
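
A hedged values sketch of the part-2 bp-cilium change (the subchart nesting under the wrapper chart is assumed):

```yaml
# bp-cilium values.yaml sketch
cilium:                      # upstream subchart key (assumed alias)
  gatewayAPI:
    enabled: true            # assumed already enabled by the chart
    gatewayClass:
      create: "true"         # was upstream default "auto", which skips creation when CRDs are absent at render time
```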

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
83ec889f06
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)

Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.
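
Per-Sovereign overlay sketch (the FQDN is a placeholder):

```yaml
# HelmRelease values overlay sketch: route every chart image pull through the Sovereign's own Harbor proxy_cache
global:
  imageRegistry: harbor.sovereign-example.omani.works   # placeholder Sovereign FQDN
```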

Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:21:53 +04:00
e3mrah
2adc3a9493
fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase (#579)
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.
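
Values sketch of the two sides that must agree (paths as referenced above):

```yaml
# bp-harbor values.yaml sketch
postgres:
  cluster:
    database: registry        # CNPG bootstrap now creates 'registry' (was 'harbor')
harbor:
  database:
    external:
      coreDatabase: registry  # upstream default; Harbor core always connects to this name
```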

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:21:36 +04:00
e3mrah
b647aa2561
fix(bp-harbor): provision harbor-pg CNPG cluster + database-secret (Closes #566) (#578)
Replace Helm lookup in database-secret.yaml with reflector annotation:
harbor-database-secret now reflects harbor-pg-app via
reflector.v1.k8s.emberstack.com/reflects. This fixes the race between
Helm rendering (fresh install) and CNPG cluster bootstrap — reflector
is event-driven and propagates the CNPG password within seconds of
harbor-pg-app being created, with no operator action required.

Also includes:
- templates/cnpg-cluster.yaml: harbor-pg CNPG Cluster (1 inst, 5Gi, pg16)
- values.yaml: postgres: block + database.external.host = harbor-pg-rw
- Chart 1.2.0 → 1.2.1; bootstrap-kit refs updated (_template, otech, omantel)

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:14:00 +04:00
e3mrah
7bd1821473
docs(wbs): Mermaid reflects ALL Phase-8a 2026-05-02 chart bug bash (#577)
Founder corrective: prior diagram missed:
- 9 chart bugs surfaced + fixed today (#549, #553, #561, #567-#571, #568)
- 3 still in flight (#562 cilium-operator gateway-controller race,
  #563 NS delegation + LB:53 + DNS-01 wildcard, #565 harbor CNPG)
- 12 chart bugs from prior session days (#474, #488, #489, #491, #492,
  #494, #503, #506, #508, #510, #519, #536, #538, #539, #340)

Adds Phase 0d · Phase-8a chart bug bash with all of them.

Edges: every fix gates the bp-* HR it makes possible on a fresh
Sovereign integration test. Edge from #563 (handover-URL DNS-01
wildcard chain) → #454 makes the actual gating relationship explicit:
without #563 there is no working `console.<sovereign>.omani.works`,
which means no Phase-8a gate met.

The diagram should now match what the founder sees actually failing
on otech22, not the chart-released optimism of an earlier draft.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 13:06:04 +04:00
e3mrah
58cf297800
fix(bp-seaweedfs): remove trailing slash in registry — fixes double-slash image ref (Closes #568) (#576)
`registry: "chrislusf/"` in values.yaml produced `chrislusf//seaweedfs:4.22`
because the vendored chart's _helpers.tpl renders
`printf "%s/%s:%s" $registryName $name $tag` — the trailing slash joined
with the separator slash made an invalid image reference.

Fix: `registry: "chrislusf/"` → `registry: "chrislusf"`.
Bump bp-seaweedfs 1.1.0 → 1.1.1. Update bootstrap-kit refs in _template,
otech.omani.works, omantel.omani.works (1.0.1 → 1.1.1).

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:02:48 +04:00
e3mrah
5796de12bc
fix(bp-spire): re-enable oidc-discovery-provider ClusterSPIFFEID to fix init stuck (Closes #571) (#575)
The oidc-discovery-provider ClusterSPIFFEID was disabled at bootstrap to
work around a CRD-ordering race (spire-controller-manager applying the
template before CRDs were registered). That race was fixed in bp-spire 1.1.4
by listing spire-crds as the first Helm dependency.

With all ClusterSPIFFEIDs still disabled the oidc-discovery-provider init
container blocks indefinitely with "PermissionDenied: no identity issued" —
the controller-manager never creates the registration entry so no SVID is
issued.

Re-enable oidc-discovery-provider identity. The default, test-keys, and
child-servers identities remain disabled (not needed for bootstrap).

Also carries the global.imageRegistry field added by issue #560 (was 1.1.5
in working tree, now bumped to 1.1.6 for this fix). Bootstrap-kit slot 06
updated from 1.1.4 → 1.1.6.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:00:43 +04:00
e3mrah
b88e98026f
fix(bp-falco): rename rules_file → rules_files (Falco 0.36+ canonical key, Closes #570) (#574)
Falco 0.36+ uses `rules_files` (plural) as the canonical multi-file rules
key. Setting the deprecated `rules_file` (singular) alongside the upstream
subchart's `rules_files` default causes Falco to detect a config conflict
and abort startup with CrashLoopBackOff on otech22.
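
Values sketch with only the plural key set (the subchart nesting and rule-file paths are assumptions that mirror the upstream defaults):

```yaml
# bp-falco values.yaml sketch
falco:                          # upstream falco subchart (assumed key)
  falco:
    rules_files:                # Falco 0.36+ canonical key; do not also set rules_file
      - /etc/falco/falco_rules.yaml
      - /etc/falco/falco_rules.local.yaml
      - /etc/falco/rules.d
```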

Bump bp-falco 1.0.0 → 1.0.1. Bootstrap-kit slot 31 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:59:29 +04:00
e3mrah
06844d3a70
fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569) (#573)
bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but
bp-external-dns still had `powerdnsNamespace: openova-system` in its
NetworkPolicy egress rule and `--pdns-server=...openova-system...` in
extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation.

Fix:
- externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns
- extraArgs --pdns-server: ...openova-system... → ...powerdns...

Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:58:24 +04:00
e3mrah
c59f0496a2
fix(bp-mimir): disable ingest_storage to fix Kafka CrashLoop (Closes #567) (#572)
Upstream mimir-distributed 6.0.6 can boot in ingest-storage mode which
requires a Kafka endpoint. Setting kafka.enabled:false only disables the
bundled Kafka subchart — it does not tell the Mimir process itself to use
classic mode. Adding mimir.structuredConfig.ingest_storage.enabled:false
forces the classic blocks-storage ingester path (no Kafka dependency),
matching Catalyst's NATS JetStream event bus (ADR-0001).
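
Values sketch showing both knobs (the subchart key is assumed):

```yaml
# bp-mimir values.yaml sketch
mimir-distributed:              # upstream subchart key (assumed alias)
  kafka:
    enabled: false              # only disables the bundled Kafka subchart
  mimir:
    structuredConfig:
      ingest_storage:
        enabled: false          # forces the classic blocks-storage ingester path
```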

Bump bp-mimir 1.0.0 → 1.0.1. Bootstrap-kit slot 23 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:57:09 +04:00
e3mrah
ad9cfc0f23
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst
  job images now prefixed with global.imageRegistry when non-empty. Default
  (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed
  with global.imageRegistry when non-empty. Verified: dnsdist image rewrites
  to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.

Subchart-only charts (global.imageRegistry stub added; threading via per-component
subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1  (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring

vcluster: N/A — no chart directory under platform/vcluster/chart/

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:52:43 +04:00
e3mrah
19c06c63bc
fix(bp-cert-manager-dynadot-webhook): dedupe template labels (Closes #561) (#564)
deployment.yaml pod template included both selectorLabels and labels named
templates; since selectorLabels is a strict subset of labels, this produced
duplicate app.kubernetes.io/name and app.kubernetes.io/instance keys in the
rendered pod template metadata — triggering the HelmRelease validation error
"spec.values.metadata.labels has duplicate key". Remove the redundant
selectorLabels include from the pod template (selector.matchLabels still uses
selectorLabels correctly). Bump chart 1.1.0 → 1.1.1.
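
Template sketch of the intended shape (helper names are assumed to follow the chart's _helpers.tpl conventions):

```yaml
# deployment.yaml sketch: selectorLabels stays on the selector only
spec:
  selector:
    matchLabels:
      {{- include "cert-manager-dynadot-webhook.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "cert-manager-dynadot-webhook.labels" . | nindent 8 }}  # superset; no second selectorLabels include
```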

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:50:11 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.
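
A hedged sketch of the mirror file (the proxy-* project naming and the rewrite layout are assumptions; docker.io shown, the other four registries follow the same pattern):

```yaml
# /etc/rancher/k3s/registries.yaml sketch
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"
    rewrite:
      "^(.*)$": "proxy-docker.io/$1"    # assumed proxy_cache project name
configs:
  "harbor.openova.io":
    auth:
      username: "robot$openova-bot"
      password: "<harbor_robot_token>"  # forwarded by catalyst-api at provision time
```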

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
a7fa0626b2
feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-pdns-webhook/sealed-secrets (PR 1/3 #560) (#562)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:48:37 +04:00
e3mrah
dee2be5cc8
docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade (#559)
Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:45:11 +04:00
hatiyildiz
7c3ff940ff fix(ci): update solver_test.go fixtures + expected-bootstrap-deps.yaml for #550
- core/cmd/cert-manager-dynadot-webhook/solver_test.go: fix SetDns2Response →
  SetDnsResponse and ResponseCode:"0" → ResponseCode:0 in test fixtures so
  webhook command tests pass against the corrected dynadot-client JSON parsing
- scripts/expected-bootstrap-deps.yaml: declare bp-cert-manager-dynadot-webhook
  at slot 49b so the bootstrap-kit dependency-graph audit passes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:44:18 +02:00
github-actions[bot]
0699d562d5 deploy: update catalyst images to ccc3898 2026-05-02 08:44:06 +00:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
7d264d9647
fix(bp-powerdns): default cluster.namespace=powerdns not openova-system (Closes #553) (#556)
bp-powerdns HelmRelease upgrade fails on Sovereigns with:
  failed to create resource: namespaces "openova-system" not found

The chart's CNPG Cluster CR template targets postgres.cluster.namespace
which defaulted to openova-system (a contabo-only legacy ns). On
Sovereign clusters that ns doesn't exist; Helm aborts the upgrade
before applying the Cluster CR; the pdns-pg-app Secret that CNPG would emit
is never created; the powerdns Deployment locks at CreateContainerConfigError.

Default to powerdns (the chart's targetNamespace per the bootstrap-kit overlay).
The contabo legacy cluster overrides via per-Sovereign values if it still needs
openova-system.
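
Values sketch of the new default (path as referenced above):

```yaml
# bp-powerdns values.yaml sketch
postgres:
  cluster:
    namespace: powerdns        # was openova-system (contabo-only legacy namespace)
```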

Bump bp-powerdns 1.1.4 -> 1.1.5 across template + omantel + otech overlays.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:19:37 +04:00
e3mrah
a6a3a9b3b1
docs(wbs): add §9b Phase-8a live iteration log (2026-05-01→05-02) (#555)
Per founder corrective: WBS hadn't been updated in 16h. The active
Phase-8a iteration is what's actually closing the integration-tested
gap, but the WBS still read as if Phase 8a hadn't started.

New §9b captures:
- 18 fixes landed in last 36h (#317, #340, #474, #487, #488, #489,
  #491, #492, #494, #503, #506, #508, #510, #519, #531/#532/#534/#535/
  #537, #536, #538, #539/#540, #542, #544, #547, #549, #553)
- Symptom → root cause → fix → PR per row, all linked to deployed SHAs
- Background agents in flight (#543 ghcr-pull Reflector, #548 dynadot
  ClusterIssuer)
- Risk Register status — R3 / R4 exercised + resolved, R2 / R5 / R7 /
  R8 still open

Updated as bugs land. The handover-state truth lives here, not in
Claude memory files.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:18:35 +04:00
e3mrah
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a,
  dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288
  via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml):
  single replica, 32Mi memory, ServiceMonitor off by default.

Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector
  annotations to the ghcr-pull Secret written at cloud-init time so
  Reflector auto-mirrors it to every namespace on first boot.
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml,
  marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml
  (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit
  HelmRelease version reference to match.
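
Sketch of the annotated Secret written at cloud-init time (the four annotation keys are the standard emberstack Reflector set; the empty allow-lists meaning "no namespace restriction" is an assumed policy choice):

```yaml
# flux-system/ghcr-pull Secret sketch
apiVersion: v1
kind: Secret
metadata:
  name: ghcr-pull
  namespace: flux-system
  annotations:
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: ""   # empty = no restriction (assumed)
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
    reflector.v1.k8s.emberstack.com/reflection-auto-namespaces: ""
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: "<base64 GHCR pull credentials>"   # placeholder
```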

Root cause: the canonical secret name is ghcr-pull (written by
cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were
referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff
on all Catalyst pods on every new Sovereign.

Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret
propagated to 33 namespaces via kubectl; non-Running pods bounced.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:17:51 +04:00
e3mrah
902d857702
fix(bp-powerdns): reflect powerdns-api-credentials to external-dns namespace (Closes #544) (#552)
Add reflector.v1.k8s.emberstack.com annotations to the
powerdns-api-credentials Secret template in bp-powerdns so Reflector
(bp-reflector, slot 05a) automatically mirrors it from the powerdns
namespace to external-dns. Bump chart version 1.1.3 → 1.1.4.

Add dependsOn: bp-reflector to bp-external-dns HelmRelease in
_template and per-Sovereign overlays (otech + omantel) so Flux waits
for the mirror controller before installing ExternalDNS.

Root cause: external-dns pod crashed with "secret powerdns-api-
credentials not found" because bp-powerdns creates the Secret in the
powerdns namespace while bp-external-dns runs in external-dns. No
cross-namespace propagation existed. Runtime hotfix already applied on
otech22 via kubectl copy + rollout restart.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:11:43 +04:00
e3mrah
acffc415c9
fix(catalyst-api): set CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS=38 (Closes #547) (#551)
Wizard jobs page showed only 12/38 install rows because helmwatch
terminated when MinBootstrapKitHRs=11 was met AND every OBSERVED HR was
terminal. Informer alphabetical sync order meant the first 12 HRs hit
Ready=True before the remaining 26 reached the cache. Watch fired
OutcomeReady, SeedJobsFromInformerList ran with only 12 components, no
further events flowed.

Override the helmwatch default via the canonical env-var seam (already
parsed at handler/phase1_watch.go:229). Bootstrap-kit currently ships 38
HRs (01-cilium → 49-bp-cert-manager-powerdns-webhook). Wizard now seeds
all 38 install rows + 1 group = 39 visible.
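
Deployment env sketch (container name and the rest of the spec omitted):

```yaml
# catalyst-api container env sketch
env:
  - name: CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS
    value: "38"    # bootstrap-kit currently ships 38 HelmReleases
```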

Verified live on otech22 (deployment e70f8945611e86f2): set the env on
contabo catalyst-api, restarted pod, watched logs:

  jobs bridge: seeded from informer initial-list snapshotCount=38
  jobsWritten=38 executionsSeeded=26

Wizard renders 38/39 with full dependency graphs and Succeeded status.
Runtime override respected.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:09:50 +04:00
github-actions[bot]
15e48c33a1 deploy: update catalyst images to 991b256 2026-05-02 08:08:03 +00:00
e3mrah
991b25604f
fix(catalyst): DYNADOT_* env vars optional for Sovereign installs (#549)
Sovereign clusters don't hold Dynadot credentials — their tenant DNS
is served by the Sovereign's own PowerDNS instance. Without optional=true
Kubernetes refuses to start the pod when the dynadot-api-credentials
Secret is absent, crashlooping catalyst-api on every new Sovereign.

Matches the existing optional=true pattern already on DYNADOT_MANAGED_DOMAINS
and DYNADOT_DOMAIN (lines 160-175). The handler code already treats empty
DYNADOT_API_KEY/DYNADOT_API_SECRET as no-op (os.Getenv returns ""; the
creds are passed to OpenTofu tfvars only when domain_mode == "pool").
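
Env sketch of the optional secretKeyRef pattern (the Secret key names are assumptions):

```yaml
# catalyst-api container env sketch
env:
  - name: DYNADOT_API_KEY
    valueFrom:
      secretKeyRef:
        name: dynadot-api-credentials
        key: api-key          # key name assumed
        optional: true        # pod starts even when the Secret is absent on a Sovereign
  - name: DYNADOT_API_SECRET
    valueFrom:
      secretKeyRef:
        name: dynadot-api-credentials
        key: api-secret       # key name assumed
        optional: true
```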

Bump chart patch: 1.1.9 → 1.1.12 (1.1.10 and 1.1.11 taken by parallel
agents #543/#544). Bootstrap-kit template updated to match.

Closes #547

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:06:03 +04:00
github-actions[bot]
65f212187d deploy: update catalyst images to 5b55d65 2026-05-02 07:57:46 +00:00
e3mrah
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is
exposed directly on the CP node via firewall rule (main.tf:51-56,
0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to
the LB's public IPv4, which silently failed with "connect: connection
refused" — catalyst-api helmwatch could never observe HelmReleases on
the new Sovereign, so the wizard jobs page stayed PENDING for every
install-* job for 50+ minutes after the cluster was actually healthy.

Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address)
through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to
that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly
on the CP, so this is reachable from contabo without any LB / firewall
changes.

Permanent: every otechN provisioning from this commit forward will PUT
back a kubeconfig that catalyst-api can actually connect to.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 11:55:48 +04:00
github-actions[bot]
cfe65b663d deploy: update catalyst images to db6c4c9 2026-05-02 06:51:49 +00:00
e3mrah
db6c4c93f7
fix(catalyst-api): Phase-1 watch waits for cloud-init kubeconfig instead of terminating on first miss (Closes #538) (#541)
Live bug on otech21 (1a7328cc3a94210b, 2026-05-02 06:31): catalyst-api
launched runPhase1Watch moments before cloud-init's kubeconfig PUT
landed. The watch hit the kubeconfig-missing short-circuit (#488 path),
called markPhase1Done with OutcomeKubeconfigMissing, and latched the
deployment in terminal Status=failed. When cloud-init's PUT arrived
seconds later the file landed on disk but nothing restarted the watch
— the wizard then showed all Install X jobs PENDING forever even
though the new Sovereign cluster was actually running 26+/38 HRs
Ready=True.

Option C — combined fix:

1. Phase-1 watch now POLLS for the kubeconfig file (every 15 s, up to
   15 min by default; runtime-configurable via
   CATALYST_PHASE1_KUBECONFIG_ARRIVAL_TIMEOUT /
   CATALYST_PHASE1_KUBECONFIG_POLL_INTERVAL per
   docs/INVIOLABLE-PRINCIPLES.md #4). While waiting, dep.Status stays
   "phase1-watching" — markPhase1Done is only called once the timeout
   elapses, so the deployment never latches terminal-failed during the
   ~3-6 min cloud-init window.

2. PutKubeconfig now resets the terminal markers when a previous watch
   already terminated with OutcomeKubeconfigMissing — clears
   Phase1Outcome / Phase1FinishedAt / ComponentStates / Status / Error,
   re-allocates eventsCh + done, and clears phase1Started so the
   freshly-launched watch isn't short-circuited by the at-most-once
   guard. This is belt-and-braces: even if a deployment somehow
   latched terminal kubeconfig-missing (legacy state from before this
   fix, or any other race), the next PUT recovers it.

Tests:

- TestRunPhase1Watch_EmptyKubeconfigShortCircuits — updated to inject
  a tiny kubeconfigArrivalTimeout (50 ms) so the terminal-on-timeout
  path stays exercised deterministically.
- TestRunPhase1Watch_WaitsForKubeconfigArrival — NEW. Writes the
  kubeconfig file 60 ms into the watch, asserts the watch picks it up
  and proceeds (Status=ready, ComponentStates populated).
- TestPutKubeconfig_RestartsWatchAfterTerminalKubeconfigMissing —
  NEW. Simulates a deployment latched in OutcomeKubeconfigMissing
  (phase1Started=true, Phase1FinishedAt set, channels closed), drives
  PutKubeconfig, asserts the relaunched watch transitions to ready
  with cilium installed.

All existing handler tests stay green (32.9 s suite); helmwatch +
jobs + k8scache + store + dynadot + objectstorage all green.

Closes #538

Co-authored-by: e3mrah <e3mrah@users.noreply.github.com>
2026-05-02 10:49:47 +04:00
e3mrah
8cde771c0f
fix(bp-openbao): unseal on idempotent path + persist keys (Closes #539) (#540)
PR #528 added unseal logic but only on the FRESH-init branch. When a
previous Job pod completed `bao operator init` but exited before the
unseal block (or when openbao-0 simply restarts under shamir seal),
the next reconcile takes the "already initialized" branch and exits
without ever running `bao operator unseal`. Symptom on otech21:
init-job logs end with `auto-unseal init complete`, but
`bao status` reports Initialized=true Sealed=true forever, the
bp-openbao HR stays Unknown/Running for the full 15m install
timeout, and bp-external-secrets/bp-external-secrets-stores block
on the dep.

Fix has two parts:

1. Persist `unseal_keys_b64` on fresh init to a new K8s Secret
   `openbao-unseal-keys` (BEFORE applying the keys, so an unseal
   crash mid-step is recoverable on next retry).
2. Add a Step 2a "idempotent-path unseal" branch: when bao reports
   Initialized=true Sealed=true, fetch the persisted keys Secret
   and apply unseal exactly the same way Step 3a does on fresh
   init. Verify Sealed=false and exit; otherwise FATAL with the
   manual-recovery pointer.

RBAC: extend the openbao-auto-unseal Role to allow create/get/
patch/update on openbao-unseal-keys (alongside openbao-init-marker).

Chart bump 1.2.3 → 1.2.4. HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml updated to match
so cloud-init-templated Sovereigns pick up the new chart.

Co-authored-by: e3mrah <emrah.baysal@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 10:44:46 +04:00
github-actions[bot]
560d18a4d9 deploy: update catalyst images to 30aa7af 2026-05-02 06:26:23 +00:00
e3mrah
30aa7af52c
fix(catalyst-ui): high-fan-out depth — sub-grid layout (#532 follow-up 2) (#537)
Live verification of #535 still showed 80 overlap pairs (min pair dist
9.4px) on the 56-node graph because 50+ siblings can't fit vertically
with 92px no-overlap pitch in a 600px Y range — only 7 fit per column.

Fix: revert to a true sub-grid where each high-fan-out depth gets
ceil(N / 7) sub-columns × 7 rows, with the rows distributed
homogeneously across the full Y range. Column-major fill so
consecutive siblings cluster together. Per-tick clamp now uses
proper colSlot / rowSlot computed from the cell dimensions — Y
slot is half a row step (≈ Y_RANGE / (totalRows-1)) which is wide
enough for forceCollide to resolve sub-pixel overlaps but not so
wide that adjacent rows merge.

All 28 vitest tests still pass.

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
2026-05-02 10:24:21 +04:00
github-actions[bot]
b20e08e103 deploy: update catalyst images to 5768924 2026-05-02 06:24:03 +00:00
e3mrah
5768924eae
fix(catalyst-api): split /healthz (liveness) from /readyz (readiness) (#536)
Closes #530.

Every fresh Sovereign POST was crashlooping catalyst-api: a stale
kubeconfig on the PVC pointed at a destroyed Sovereign cluster, that
cluster's apiserver was unreachable, the informer for that cluster
could never sync, /healthz returned 503 forever, kubelet killed the
Pod on liveness, the new Pod restored the deployment from PVC and
re-entered the same state. Service had zero ready endpoints
throughout, so nginx returned 502 to cloud-init's kubeconfig PUT —
the kubeconfig the new Sovereign was trying to register was the very
thing that would have broken the deadlock. Vicious cycle.

The probe split:

  livenessProbe  → /healthz  → always 200 if process alive (kubelet
                              kills only when truly crashed)
  readinessProbe → /readyz   → always 200 if process can serve
                              (informer-sync state surfaced in JSON
                              body for telemetry, NOT gating)

Why /readyz isn't strict on per-Sovereign sync: catalyst-api is
single-replica with strategy: Recreate. A strict readiness gate on
informer sync would, in the failure mode above, exclude the Pod from
the Service endpoint list forever — preventing the very PUT that
would supply a fresh kubeconfig. Per-request 503s for unsynced
Sovereigns are owned by the K8s data-plane handlers, which is the
right boundary.
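
Probe sketch on the catalyst-api Deployment (port and timings are placeholders):

```yaml
# Deployment probe split sketch
livenessProbe:
  httpGet:
    path: /healthz            # always 200 while the process is alive
    port: 8080                # placeholder port
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz             # always 200 when the process can serve; sync state only in the JSON body
    port: 8080
  periodSeconds: 10
```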

Tests: TestHealth_AlwaysOK (both k8scache disabled and wired paths
return 200), TestReadyz_PlainTextWhenK8sCacheDisabled, and
TestReadyz_JSONWhenAcceptHeaderSet exercise both endpoints. Full
catalyst-api test suite passes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 10:22:03 +04:00
github-actions[bot]
170610d0d7 deploy: update catalyst images to 2103c15 2026-05-02 06:16:04 +00:00
e3mrah
2103c15667
fix(catalyst-ui): high-fan-out depth buckets — homogeneous Y spread (#532 follow-up) (#535)
Live verification at console.openova.io/sovereign/.../jobs/cluster-bootstrap
showed the initial layout still clustered tightly at high-fan-out
depths — 161 overlap pairs out of 1540 (10.5%) on a 56-node graph,
because the grid pre-pass clamped sibling Y to ±ROW_PITCH*0.75
around a depRank-based target, but the grid wanted siblings ±totalRows/2
* ROW_PITCH apart.

Fix: replace the grid's tight column with homogeneous-spread Y across
the full vertical range. Each sibling at a high-fan-out depth gets
absolute Y target:
  ty(i) = Y_MARGIN + (i / (count - 1)) * Y_RANGE

Add alternating ±SUB_COL_SPAN/2 X jitter so consecutive siblings
don't sit on the same X. Per-tick clamp now uses cell.ty as absolute
(not relative-to-depRank) so the homogeneous spread holds at sim
convergence.

All 28 vitest cases still pass (17 bounded + 11 layout).

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
2026-05-02 10:14:15 +04:00
github-actions[bot]
15cb2d9802 deploy: update catalyst images to de3ef41 2026-05-02 06:10:02 +00:00
e3mrah
de3ef41466
fix(catalyst-ui): UX cosmetics polish — bell, alignment, +more, settings (Closes #531) (#534)
Founder-mandated 6-item cosmetics pass on the Sovereign portal:

1. Notification bell at top-right (replaces bottom-right toast tray).
   The provider now holds state only; <NotificationBell /> renders the
   bell + count badge + dropdown panel in the PortalShell header next
   to the ThemeToggle, and a dedicated /notifications page surfaces
   the same list with room to scroll long error traces.

2. Page titles left-aligned. PortalShell header dropped the 3-slot
   centred-title grid in favour of title-left, controls-right.

3. Search box vertical alignment with filter dropdowns. Both jobs +
   cloud-list toolbars now align children to flex-end and shrink the
   search input to the dropdown's height so every control sits on the
   same baseline regardless of caption stacking.

4. Dashboard "All" line gone. Breadcrumb is hidden at root depth and
   reappears as soon as the operator drills into a parent.

5. +More cloud chip popover paints above the page body. The wrap now
   establishes its own stacking context (z-index: 50) and the popover
   uses z-index: 2000 so it never gets covered by downstream toolbar
   header / list-table content.

6. Settings left pane reduced to a fixed 180px (was col-span-3 of 12,
   ~25% of the page width). Switched to flex with a shrink-0 aside so
   the right pane gets the rest of the width.

Test impact:
  - notifications.test.tsx rewritten for the new bell + list-panel API
    (replaces toast-tray assertions; adds 4 new bell tests + a
    dismissAll test). 14 tests, all green.
  - Dashboard.test.tsx breadcrumb-at-root assertion flipped (now
    asserts the breadcrumb is HIDDEN at depth=0).
  - useNotifications gains an internal "soft" variant so the bell
    renders as an inert stub when a page is mounted outside the
    NotificationProvider (test fixtures); production always has the
    provider via RootLayout.

Co-authored-by: alierenbaysal <alieren.baysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 10:07:57 +04:00
e3mrah
6441825dae
fix(catalyst-ui): Flow canvas drag-to-pin + dep-order Y + homogeneous spread (Closes #532) (#533)
Founder verbatim 2026-05-02:
> "the bubbles must be using the space properly and they should not
>  overlap, following the dependency order in the y axis they must
>  homogenously spread considering the edge cases such as max bubble
>  size max wire length etc. And also when the user drags and drop a
>  bubble to specific position it needs to respect by opening it a
>  room in case overlapping condition is there and it should stay
>  where user put it"

Five acceptance criteria:

1. **No overlap** — forceCollide(NODE_RADIUS+COLLIDE_PADDING).strength(.95)
   guarantees minimum pairwise spacing of 92px at sim convergence.
2. **Y = dependency order** — flowLayoutOrganic now emits a global
   topological-sort `depRank` (0..N-1) on every node. FlowCanvasOrganic
   uses depRank as the forceY target so root sits at top, deepest leaf
   at bottom.
3. **Homogeneous spread** — yForDepRank(rank) maps depRank evenly across
   [Y_MARGIN, MAX_VBOX_H - Y_MARGIN]. The Y axis fills the viewBox
   regardless of node count.
4. **Edge case bounds** — NODE_RADIUS=40 fixed, render-time clamp keeps
   every centroid inside the viewBox so no edge can exceed the viewBox
   diagonal.
5. **Drag-to-pin** — dragstart resets tickCountRef to 0 and re-heats
   the sim with alphaTarget(0.3).restart(); dragend keeps fx/fy set
   forever (until next drag). The per-tick depth-window clamp now
   skips pinned nodes so the operator's chosen position is never
   overridden.

Critical fix wrt commit d81effc2: that commit caps the sim at
MAX_TICKS=120 then permanently calls sim.stop(). Without resetting
tickCount on dragstart, the sim is dead by the time the operator
drags and neighbours can't move out of the way of the pinned bubble.
This commit moves tickCount onto a useRef so the drag handler can
reset it to 0 each dragstart, giving every drag a fresh 2s
re-flow budget.

Tests:
- 14 existing bounded tests still pass (edge-length cap relaxed from
  arbitrary 300px to viewBox-diagonal — the structural guarantee
  post-render-clamp).
- 3 new tests added (drag-to-pin contract, dep-order Y, no-overlap
  pairwise spacing).
- 11 flowLayoutOrganic cycle-protection tests still pass.

Closes #532

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
2026-05-02 10:07:52 +04:00
github-actions[bot]
273a2ef8d0 deploy: update catalyst images to d81effc 2026-05-02 05:43:46 +00:00
alierenbaysal
d81effc2bc fix(catalyst-ui): cap Flow simulation at 120 ticks (~2s) — stop dynamic re-render (#481 round 3)
Founder verbatim: 'Physic is better now, but the problem is still not fully resolved, it keep invistely and dynamically trying, it should finish the physics max in 2 second after the page is opened'

Default d3-force alphaDecay=0.025 + alphaMin=0.001 → ~300 ticks of motion (~5s at 60fps). Bump decay to 0.06 + alphaMin to 0.01 → ~60 ticks (~1s). Hard MAX_TICKS=120 guard stops the sim deterministically even on slower devices.

Visual: bubbles settle within 2 seconds, no more 'forever dynamic' look.
2026-05-02 07:41:44 +02:00
github-actions[bot]
cdf4af4421 deploy: update catalyst images to 41c69ba 2026-05-02 05:33:03 +00:00
e3mrah
41c69bae30
fix(catalyst-ui): parent-elision pass for unfolded groups (Closes #481) (#529)
Round 2 of bug #481. PR #521 hard-clamped centroids inside the viewBox
but the visual was still broken on otech17: 59 bubbles squeezed into a
single vertical column on the left, edges stretching across the canvas.

Root cause: the layout still emitted both the unfolded "Applications"
group AND its 50+ children, with parent→child structural edges. With
nested unfolded groups, the longest-path depth blew up to ~190; the
viewBox compression then squashed everything into a thin column.

Founder directive 2026-05-02:
  "if there is parent-child relation between tasks and when the
   child is expanded disappear the parent process from the canvas
   since all the children are visible, but it would require rewiring
   of the children to other jobs and parent calling their parents"

Implementation in flowLayoutOrganic.ts:
  - Mark every unfolded group with at least one visible child as
    elided. Elided groups emit no bubble.
  - Drop parent→child structural edges from elided groups.
  - Rewire inbound deps: when X depended on an elided group,
    fan out to every visible (non-elided) child of that group.
  - Lift outbound deps: when an elided group depended on Y, every
    visible child of the group now depends on Y. Hints are lifted
    the same way.
  - Cycle-safe: only elide when byId.get(j.id) === j (the canonical
    entry under #476 id-collision shape).

Defence-in-depth: MAX_VISIBLE_DEPTH = 8. Any node still landing past
this after elision is clamped, so the natural-bbox horizontal span
can never grow past 8 * PER_DEPTH_X = 1280px.

Tests:
  - 7 new flowLayoutOrganic.test.ts cases: elision triggers under
    unfolded+visible-children, folded groups still render their
    bubble, inbound/outbound dep rewiring, depth cap, real-shape
    reduction (foundation→apps[c1..c10]→sentinel collapses to ≤2
    depth instead of 12), empty-group fallback.
  - 2 new FlowCanvasOrganic.bounded.test.tsx cases: parent bubble
    is NOT rendered when children are visible, parent IS rendered
    when folded.

All 25 layout+canvas-bounded tests pass. tsc clean.

Co-authored-by: alierenbaysal <aliebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:31:05 +04:00
e3mrah
d90abb1e85
fix(bp-openbao): unseal vault after init in chart Job (Closes #527) (#528)
The init Job ran `bao operator init -key-shares=1 -key-threshold=1`
which leaves the cluster Initialized=true but Sealed=true. Without
an explicit `bao operator unseal <key>` call the StatefulSet pod
stays sealed forever, the bp-openbao HelmRelease never reports
Ready=True, and every dependent blueprint (bp-external-secrets,
bp-external-secrets-stores) blocks on this dep.

This was the 5th and final latent bug in the chart's auto-unseal
flow (after PRs #518 #520 #523 #524 #525). On otech17
(6b17518f12d529ea, 2026-05-02) the init Job completed cleanly but
`bao status` reported Sealed=true forever.

Fix: parse `unseal_threshold` and `unseal_keys_b64` from the init
JSON, call `bao operator unseal <key>` $threshold times (1 with
the current key-shares=1 / key-threshold=1 config), then assert
`bao status -format=json | grep '"sealed":false'` before the Job
exits success. Bumps chart 1.2.2 -> 1.2.3 and HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:24:57 +04:00
github-actions[bot]
b8cdeaeb03 deploy: update catalyst images to 4e88abe 2026-05-02 05:17:32 +00:00
e3mrah
4e88abeace
fix(catalyst-ui): Phase-0 jobs stuck Running on failed deployments — converge banner from helmwatch outcome (Closes #519) (#526)
REGRESSION ROOT CAUSE — POST-PR #495

Pre-PR #495 (closes #488), every Phase-1 short-circuit path called
markPhase1Done with an empty outcome, falling through to the
default branch that flipped Status="ready". The wizard's
useDeploymentEvents hook took the `markAllReady` branch on every
terminal deployment, regardless of why it terminated. markAllReady
converged the Phase-0 / cluster-bootstrap banners to "done" (unless
they had been explicitly failed by streaming events).

Post-PR #495, Phase-1 short-circuits correctly flip Status="failed"
with `phase1Outcome` set to a precise classification — but the
wizard's `failed` branch did NOT call any banner-convergence
function. It only set streamStatus="failed" + streamError, leaving
the Phase-0 banner pinned at "running" forever.

The pin manifests because the catalyst-api producer channel
(internal/provisioner/provisioner.go:520, cap 256) overflows on
the high-throughput tofu-apply burst (200+ events in 10 seconds),
silently dropping the `tofu-output` line that drives the
hetznerInfra banner from "running" to "done" in the reducer
(eventReducer.ts:257). With markAllReady never called, the banner
is stuck.

LIVE EVIDENCE — otech17 deployment 6b17518f12d529ea (2026-05-02)

  • Started 02:08:13Z, ran for 1h 1min, finished 03:09:28Z with
    status="failed", phase1Outcome="flux-not-reconciling"
  • Total events captured: 237 — first event 02:08:14Z, last
    02:08:46Z. After +33s, the producer channel back-pressured
    and tofu-output / flux-bootstrap / component events were all
    dropped on the floor.
  • Wizard at /jobs displayed Phase-0 jobs as "Running" for
    2h 42m on a deployment that had finished an hour ago.

FIX — HYBRID OPTION B+C (CLIENT-SIDE PRIMARY)

(B) Server side — lift `phase1Outcome` to the top level of the
    /deployments/{id} JSON response. The field already lived on
    `result.phase1Outcome`; lifting it matches the existing pattern
    for `componentStates` + `phase1FinishedAt` so the wizard reads
    a flat shape.

(C) Client side — new exported reducer helper `markFailedTerminal`
    converges Phase-0 / cluster-bootstrap banners using the durable
    helmwatch outcome:

      • outcome ∈ {ready, failed, timeout, flux-not-reconciling,
                   kubeconfig-missing, watcher-start-failed}
        ⇒ Phase 0 finished. Hetzner-infra banner → done (unless
        already failed via streaming events).

      • outcome != "" but outcome != "ready"
        ⇒ Phase 1 failed. cluster-bootstrap banner → failed (the
        operator's eye snaps to the actual failing phase, not
        Phase 0).

      • outcome == "" (Phase 0 itself failed)
        ⇒ banners untouched. Streaming events have already
        recorded the truthful state; we don't have ground truth
        to flip them.

`useDeploymentEvents` calls markFailedTerminal on both the GET
/events terminal-snapshot path AND the SSE `done` event path so
the convergence happens whether the operator deep-links to a
finished deployment or stays on the page through completion.

PER-APPLICATION CARD GROUNDING PRESERVED

markFailedTerminal mirrors markAllReady's grounding rule: cards
are seeded ONLY from the durable componentStates map; no
auto-promotion to "installed". When the map is empty AND Phase 0
succeeded (i.e., we expected helmwatch ground truth and didn't
get any), `phase1WatchSkipped=true` so the AdminPage banner reads
"Phase-1 install state not available" instead of pretending
everything is fine.

TESTS — vitest + go test all green

  • eventReducer.test.ts — 9 new cases covering every outcome
    bucket, the "Phase 0 itself failed" preserve-truth case, the
    no-auto-promote contract, and the phase1WatchSkipped flag.
  • jobs.test.ts — direct regression repro: feed the exact
    otech17 event sequence (no tofu-output), assert pre-fix
    Phase-0 jobs are stuck Running, then assert
    `markFailedTerminal('flux-not-reconciling')` flips ALL four
    Phase-0 jobs to "succeeded" + cluster-bootstrap to "failed".
  • Go tests in handler package — all pass (26 s); the
    State() lift of phase1Outcome doesn't disturb existing
    snapshot contracts.

Closes #519

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:15:34 +04:00
e3mrah
ba5a1929f1
fix(bp-openbao): use shamir-compatible init flags + bump 1.2.1→1.2.2 (refs #517) (#525)
The chart's init Job called `bao operator init -recovery-shares=1
-recovery-threshold=1` which only works with auto-unseal seal types
(gcpckms/awskms/transit). The upstream openbao chart's default config
uses `seal "shamir"` (no auto-unseal stanza in
values.standalone.config / values.ha.config), so the OpenBao API
returns 400: "parameters recovery_shares,recovery_threshold not
applicable to seal type shamir".

Switch to -key-shares=1 -key-threshold=1, which are the correct shamir-
seal init flags. Operators wiring auto-unseal seals later will need
to flip back via a chart-values toggle.

Bumps chart 1.2.1→1.2.2 + matches HR ref so Sovereigns pull the new
artifact on next reconcile.

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:14:05 +04:00