Commit Graph

19 Commits

Author SHA1 Message Date
e3mrah
20b3c5258a
feat(bp-newapi): chart maturation + first-otech deploy + Qwen vLLM channel (#799) (#812)
* feat(bp-newapi): chart maturation — ExternalSecret + first-otech vLLM channel + skip-render gates (#799)

Maturation work for the SME-3 turnkey-experience epic (#795). Aligns
the bp-newapi scratch chart with ADR-0003 (RBAC ↔ NewAPI user-create
hook contract) and gets it past the blueprint-release CI smoke render
that has blocked publication since PR #396 (run 25213444992 failed at
default-values render of v1.0.0).

Changes
-------
- templates/external-secret.yaml (NEW). Renders the
  `catalyst-newapi-admin-token` ExternalSecret consumed by unified-rbac
  (ADR-0003 §3.2 + §6) for issuing per-user keys against
  `http://newapi.newapi.svc/api/v1/admin/users`. Sourced from OpenBao
  via the `vault-region1` ClusterSecretStore (canonical default shipped
  by bp-external-secrets-stores). Capabilities-gated on
  `external-secrets.io/v1beta1` so cold installs without ESO don't
  fail-render. Operator supplies the per-Sovereign OpenBao path via
  `catalystIntegration.externalSecret.remoteRef.key`; canonical
  convention is `sovereign/<sovereign-fqdn>/newapi/admin-token` with
  property `ADMIN_API_TOKEN`. Per Inviolable Principle #4 every knob
  is operator-overridable in the cluster overlay.
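
A minimal sketch of what the capabilities-gated template could look like (values paths follow the `catalystIntegration.externalSecret` block this message describes; the target Secret name and exact structure are assumptions, not copied from the chart):

```yaml
{{- if .Capabilities.APIVersions.Has "external-secrets.io/v1beta1" }}
{{- if .Values.catalystIntegration.externalSecret.enabled }}
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: catalyst-newapi-admin-token
spec:
  refreshInterval: {{ .Values.catalystIntegration.externalSecret.refreshInterval }}
  secretStoreRef:
    kind: {{ .Values.catalystIntegration.externalSecret.secretStoreRef.kind }}
    name: {{ .Values.catalystIntegration.externalSecret.secretStoreRef.name }}
  target:
    name: catalyst-newapi-admin-token
  data:
    - secretKey: ADMIN_API_TOKEN
      remoteRef:
        # key: "" default means a misconfigured overlay fails loudly here
        key: {{ required "catalystIntegration.externalSecret.remoteRef.key must be set" .Values.catalystIntegration.externalSecret.remoteRef.key }}
        property: {{ .Values.catalystIntegration.externalSecret.remoteRef.property }}
{{- end }}
{{- end }}
```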

- values.yaml. Adds `catalystIntegration.externalSecret.{enabled,
  refreshInterval, secretStoreRef.{kind,name}, remoteRef.{key,property}}`
  block (default enabled=true, key="" so a misconfigured overlay fails
  loudly at render rather than silently skipping). Adds
  `defaultChannels.vllm` block — first-otech shorthand that composes a
  vLLM-typed channel into the rendered channels list when enabled.
  Default endpoint is empty per Inviolable Principle #4; the
  `clusters/<sovereign>/bootstrap-kit/80-newapi.yaml` overlay supplies
  the per-Sovereign URL (canonical first-otech reference =
  `https://llm-api.omtd.bankdhofar.com` model `qwen3-coder`, the same
  upstream Axon uses on the OpenOva marketing deployment).
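
A sketch of the new values.yaml surface as described (the `refreshInterval: 1h`, `model` key, and `vllm.enabled` default shown are illustrative assumptions):

```yaml
catalystIntegration:
  externalSecret:
    enabled: true
    refreshInterval: 1h          # assumed default
    secretStoreRef:
      kind: ClusterSecretStore
      name: vault-region1
    remoteRef:
      key: ""                    # operator-supplied, e.g. sovereign/<fqdn>/newapi/admin-token
      property: ADMIN_API_TOKEN

defaultChannels:
  vllm:
    enabled: false               # per-Sovereign overlay flips this on
    endpoint: ""                 # operator-supplied per Inviolable Principle #4
    model: qwen3-coder           # illustrative
```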

- templates/_helpers.tpl. New `bp-newapi.effectiveChannels` helper
  composes `.Values.channels` with `defaultChannels.vllm` (when
  enabled). The `assertChannelAttestation` helper now operates on the
  effective list so attestation gates apply to defaultChannels
  composition too. `defaultChannels.vllm.enabled=true` with empty
  endpoint fails-fast at render with a guided error message.
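
The composition helper might be sketched like this (the channel dict shape and exact error wording are assumptions):

```yaml
{{/* Compose .Values.channels with the optional defaultChannels.vllm entry */}}
{{- define "bp-newapi.effectiveChannels" -}}
{{- $channels := .Values.channels | default list -}}
{{- if .Values.defaultChannels.vllm.enabled -}}
  {{- if not .Values.defaultChannels.vllm.endpoint -}}
    {{- fail "defaultChannels.vllm.enabled=true but endpoint is empty; set defaultChannels.vllm.endpoint in the per-Sovereign overlay" -}}
  {{- end -}}
  {{- $vllm := dict "type" "vllm" "endpoint" .Values.defaultChannels.vllm.endpoint -}}
  {{- $channels = append $channels $vllm -}}
{{- end -}}
{{- $channels | toYaml -}}
{{- end -}}
```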

- templates/configmap.yaml. Channels rendering switches to the
  effectiveChannels helper. OIDC block now skip-renders gracefully when
  `auth.adminUI.keycloak.issuer` is unset (smoke-render path) instead
  of `required`-failing; the per-Sovereign overlay sets the issuer.

- templates/deployment.yaml. Skip-render gate on Deployment when
  `database.existingSecret`, `credentials.existingSecret`, or (when
  Keycloak mode is selected) the OIDC client secret is missing. Removes
  the four `required` calls that were failing CI smoke render. Service,
  ServiceAccount, ConfigMap, NetworkPolicy still render so the smoke
  test gets a non-empty output proving structural soundness; the actual
  Deployment defers until the per-Sovereign overlay wires the secrets.
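
The skip-render gate could look roughly like this (the OIDC value paths are hypothetical; only the gate pattern is the point):

```yaml
{{- $dbOk := .Values.database.existingSecret }}
{{- $credsOk := .Values.credentials.existingSecret }}
{{- /* hypothetical paths: auth.adminUI.mode / keycloak.clientSecret */}}
{{- $oidcOk := or (ne .Values.auth.adminUI.mode "keycloak") .Values.auth.adminUI.keycloak.clientSecret }}
{{- if and $dbOk $credsOk $oidcOk }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "bp-newapi.fullname" . }}
# ... pod template elided ...
{{- end }}
```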

- templates/ingress.yaml. Same skip-render pattern: when either
  `ingress.host` or `ingress.adminHost` is empty, the entire ingress
  block is silently skipped. Matches the bp-keycloak / bp-openbao /
  bp-external-dns HTTPRoute templates.

- Chart.yaml. version 1.0.0 → 1.1.0 (minor bump — additive features;
  no breaking changes to existing operator overrides).

Verification
------------
`helm template` smoke render on default values now succeeds with 4
resources (NetworkPolicy / ServiceAccount / ConfigMap / Service); 168
lines, well above the CI 5-line minimum. With a full per-Sovereign
overlay (hosts + secrets + Keycloak issuer + ESO Capabilities + Traefik
Capabilities + defaultChannels.vllm.endpoint), 8 resources render
including Deployment, both Ingresses, the Traefik allowlist Middleware,
and the ExternalSecret. The composed qwen channel writes through to
`channels.yaml` with the expected endpoint + models + attestation.

Refs
----
ADR-0003 §3.2 + §6 — admin-token contract
Issue #795 (epic) — locked decisions
Issue #796 — hook contract spec (sequential blocker, merged)
Inviolable Principles #1, #3, #4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(bootstrap-kit): slot 80 — bp-newapi default install (#799)

Adds the canonical install slot for bp-newapi to every fresh Sovereign's
bootstrap-kit. Sequenced after the W2.K1 dependency wave so NewAPI's
ExternalSecret + Postgres DSN dependencies resolve on first reconcile.

The HelmRelease declares `dependsOn: [bp-openbao, bp-keycloak, bp-cnpg]`:
- bp-openbao(08): admin-token ExternalSecret backend
- bp-keycloak(09): OIDC issuer for ops-staff admin UI at admin.<fqdn>
- bp-cnpg(16): Postgres backing for users/credits/channels/audit
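
In Flux terms the slot likely looks like this (chart source name and interval are assumptions; the dependsOn set is taken from this message):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-newapi
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      chart: bp-newapi
      version: "1.1.0"
      sourceRef:
        kind: HelmRepository
        name: openova-blueprints   # assumed source name
  dependsOn:
    - name: bp-openbao
    - name: bp-keycloak
    - name: bp-cnpg
```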

Per-Sovereign overlays inherit the slot's defaults and override:
- ingress.host                                        api.${SOVEREIGN_FQDN}
- ingress.adminHost                                   admin.${SOVEREIGN_FQDN}
- auth.adminUI.keycloak.issuer
- database.existingSecret                             (Crossplane-claimed)
- credentials.existingSecret
- catalystIntegration.externalSecret.remoteRef.key    sovereign/${FQDN}/newapi/admin-token
- defaultChannels.vllm.enabled                        true (first-otech)
- defaultChannels.vllm.endpoint                       (operator-supplied)

The `_template/` slot keeps `defaultChannels.vllm.enabled: false` so a
fresh Sovereign does not silently wire customers to a third-party
endpoint; the canonical first-otech reference (Qwen3 Coder via
`https://llm-api.omtd.bankdhofar.com`, same relay Axon uses on the
OpenOva marketing deployment) is documented in-line for operators
adopting the same upstream.

Refs: #795 (epic), ADR-0003

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-deps): register bp-newapi slot 80 in expected DAG (#799)

Fixes the dependency-graph-audit drift detection caught at PR #812 CI:
the audit script enumerates HelmReleases in clusters/_template/bootstrap-kit/
and compares to scripts/expected-bootstrap-deps.yaml; an HR present on
disk but absent from the expected DAG is treated as drift.

Adds the canonical entry for bp-newapi at slot 80 with the same
depends_on set declared on the HelmRelease itself
([bp-openbao, bp-keycloak, bp-cnpg]).
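
Assuming the expected-DAG file uses the `slot`/`depends_on` keys referenced elsewhere in this log, the entry would be roughly:

```yaml
- slot: 80
  name: bp-newapi
  depends_on:
    - bp-openbao
    - bp-keycloak
    - bp-cnpg
```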

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-newapi): align blueprint.yaml spec.version with Chart.yaml (#799)

The TestBootstrapKit_BlueprintCardsHaveRequiredFields static-validation
gate asserts Chart.yaml version == blueprint.yaml spec.version. The
chart was bumped to 1.1.0 in c63ecd8c; bumping the blueprint metadata
to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:17:25 +04:00
e3mrah
33dc98782b
feat(bp-self-sovereign-cutover): chart + bootstrap-kit slot 06a (#791) (#808)
New platform Blueprint at `platform/self-sovereign-cutover/chart/`. Ships
DORMANT — eight step PodSpec ConfigMaps, the registry-pivot DaemonSet, the
mutable cutover-status ConfigMap, plus ServiceAccount/RBAC. The catalyst-api
cutover endpoint (#792, merged at 03828641) reads each step ConfigMap by
label selector and stamps real Jobs only on operator-driven trigger.

Step inventory:
  01 gitea-mirror             — git push --mirror upstream → local Gitea
  02 harbor-projects          — create 7 proxy-cache projects
  03 harbor-prewarm           — HEAD-pull bootstrap-kit images through cache
  04 registry-pivot           — DaemonSet rewrites registries.yaml on every node
  05 flux-gitrepository-patch — pivot GitRepository.url → local Gitea
  06 helmrepository-patches   — pivot 38 OCI URLs → local Harbor
  07 catalyst-api-env-patch   — kubectl set env CATALYST_GITOPS_REPO_URL
  08 egress-block-test        — CiliumNetworkPolicy + 10-min sovereignty proof

Plus self-sovereign-cutover-status ConfigMap with the consumer-contract keys
(cutoverComplete, currentStep, step.<name>.result, etc.) shipped at install
with helm.sh/resource-policy: keep so chart uninstall doesn't lose state.
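
A sketch of the retained status ConfigMap (key set taken from this message; initial values are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: self-sovereign-cutover-status
  annotations:
    helm.sh/resource-policy: keep   # survive chart uninstall, state is not lost
data:
  cutoverComplete: "false"
  currentStep: ""
```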

Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml` installs the chart
into the `catalyst` namespace (matches catalyst-api's default discovery
namespace), depends on bp-gitea + bp-harbor, uses disableWait: true.

RBAC splits `create` verbs into their own Rule WITHOUT resourceNames per
feedback_rbac_create_no_resourcenames.md — the bp-openbao loop anchor.

chart/tests/cutover-contract.sh enforces:
  - 8 step ConfigMaps render
  - required labels (part-of/component/cutover-order/cutover-mode)
  - required data keys (stepName + podSpec for job-mode)
  - step 04 mode=daemonset-wait
  - status ConfigMap retained on uninstall
  - RBAC create/resourceNames split

helm template smoke render: 1180 lines, 19 resources (1 Namespace + 1 SA +
11 ConfigMaps + 1 DaemonSet + 1 ClusterRole + 1 ClusterRoleBinding).
helm lint: clean.
scripts/check-bootstrap-deps.sh: PASSED (slot 6a registered, depends_on
[bp-gitea, bp-harbor]).

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:55:19 +04:00
e3mrah
53bc4357ca
feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767)

Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers
couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB):

1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate"
   Section with: bootstrap-kit baseline (sum of mandatory-tier component
   footprints), selected components delta, control-plane overhead, and a
   "Recommended N x <SKU>" line that turns amber when the operator's chosen
   worker count is below the rollup. Backed by per-component RAM/CPU floors
   in components/wizard/steps/componentFootprints.ts (covered by 12 unit
   tests including the otech92 reproduction).

2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at
   bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart
   9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired
   from the canonical flux-system/cloud-credentials.hcloud-token Secret
   cloud-init writes (mirrors the velero/harbor object-storage pattern).
   Pinned to the control-plane node so the autoscaler never schedules onto
   a worker it could itself terminate. 10-minute scale-down idle as the
   cost-saving default.
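
Against the upstream kubernetes/autoscaler chart, the wiring described above might look like this (pool names, sizes, and the `extraEnvSecrets` spelling are assumptions to verify against chart 9.46.6):

```yaml
cloudProvider: hetzner
autoscalingGroups:
  - name: worker-pool            # illustrative pool name
    minSize: 2
    maxSize: 6
extraArgs:
  scale-down-unneeded-time: 10m  # the 10-minute idle default
# pin to the control plane so the autoscaler never runs on a worker
# it could itself terminate
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
extraEnvSecrets:
  HCLOUD_TOKEN:
    name: cloud-credentials      # assumed to be mirrored into the release namespace
    key: hcloud-token
```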

Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA /
KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over
KEDA for cluster scaling, and the bounds + safety story.

Per the issue's MVP scope, this PR ships the blueprint + StepReview
estimate WITHOUT the wizard StepProvider min/max pair refactor or the
tofu node-pool template restructuring. Those are tracked as a follow-up
issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps

Slot 40 was already forward-declared for bp-llm-gateway in
scripts/expected-bootstrap-deps.yaml — the dependency-graph-audit CI
check fired on PR #776 because the file existed without a matching entry
in the expected DAG, AND collided with a reserved slot. Move to slot 50
(after the W2.K4 cohort + slot 49 bp-cert-manager-powerdns-webhook) and
add the matching entry to expected-bootstrap-deps.yaml so the audit passes.

`scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:49:44 +04:00
e3mrah
2b60e944e2
fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook (#681)
* fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook

Caught live on otech43-46: cert-manager DNS-01 challenges for
*.otechN.omani.works failed because the Sovereign-side webhook wrote
challenge TXT records to the Sovereign's local PowerDNS. omani.works is
delegated from Dynadot to ns1/2/3.openova.io which run on contabo's
central PowerDNS — the Sovereign's local PowerDNS is INVISIBLE on the
public DNS chain until pool-domain-manager seals the per-Sovereign NS
delegation. Let's Encrypt resolvers walk the public chain, query
contabo, get NXDOMAIN, the cert never issues. Manual workaround was
seeding challenge TXT directly in contabo PowerDNS.

This PR automates the right write path:

- bp-cert-manager-powerdns-webhook chart bumped to 1.0.4. Default
  powerdns.host flips from "" (skip-render) to https://pdns.openova.io
  (contabo's public PowerDNS API ingress, authoritative for omani.works).
- ClusterIssuer letsencrypt-dns01-prod-powerdns now usable with no
  per-cluster powerdns.host override for the omani.works pool.
  apiKeySecretRef.namespace clarified — upstream ignores it; the Secret
  must live in cert-manager namespace (= ChallengeRequest.ResourceNamespace
  for ClusterIssuers).
- bootstrap-kit slot 49 updated: drops bp-powerdns dependsOn (webhook
  calls out-of-cluster contabo, not local PowerDNS), bumps chart version,
  removes inline powerdns.host override (defaults are correct).
- bootstrap-kit slot 49b (bp-cert-manager-dynadot-webhook) DELETED
  entirely — Dynadot is NOT the API-level authority for omani.works
  subdomains, the dynadot webhook silently fails the same way the
  Sovereign-local powerdns one did.
- clusters/_template/sovereign-tls/cilium-gateway-cert.yaml flips
  issuerRef from letsencrypt-dns01-prod (was dynadot-backed) to
  letsencrypt-dns01-prod-powerdns (the new contabo-backed issuer).
- bp-cert-manager chart: certManager.issuers.dns01.enabled defaults to
  false (deprecated dynadot path). letsencrypt-http01-prod retained for
  per-host certs. Cluster overlays MAY flip dns01.enabled=true for
  non-omani.works pools where Dynadot IS the API-level authority.
- scripts/expected-bootstrap-deps.yaml: drops slot 49b, drops bp-powerdns
  edge from slot 49.
- Documentation (README + blueprint.yaml + Chart.yaml description)
  rewritten to reflect contabo retarget and lifecycle reasoning.
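
A sketch of the retargeted ClusterIssuer (the webhook `groupName` and `solverName` shown are assumptions for the zachomedia pdns webhook and must match its deployment):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01-prod-powerdns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns01-prod-powerdns
    solvers:
      - dns01:
          webhook:
            groupName: acme.zacharyseguin.ca   # assumed webhook group
            solverName: pdns                   # assumed solver name
            config:
              host: https://pdns.openova.io
              apiKeySecretRef:
                name: powerdns-api-credentials
                key: api-key
```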

Credential plumbing (out of scope here, must be done in cloud-init):
- Every Sovereign needs a `powerdns-api-credentials` Secret in the
  `cert-manager` namespace whose `api-key` value matches contabo's
  PowerDNS API key. Same seeding pattern as `dynadot-api-credentials`
  in infra/hetzner/cloudinit-control-plane.tftpl.
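
The Secret that cloud-init seeding would need to produce, sketched:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: powerdns-api-credentials
  namespace: cert-manager   # must match ChallengeRequest.ResourceNamespace
type: Opaque
stringData:
  api-key: <contabo PowerDNS API key>
```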

Caveat — basicAuth on contabo's PowerDNS API ingress: contabo currently
fronts pdns.openova.io with Traefik basicAuth (per
clusters/contabo-mkt/apps/powerdns/helmrelease.yaml). The upstream
zachomedia/cert-manager-webhook-pdns binary supports the X-API-Key
header but not HTTP Basic Auth out of the box. To make this end-to-end
green, either contabo's basicAuth requirement must be relaxed (X-API-Key
alone provides the auth posture, and contabo's API endpoint is
restricted to operator IPs by other means), or the Sovereign's webhook
needs an Authorization header injected via the chart's powerdns.headers
map (plaintext password in the ClusterIssuer config — not ideal). This
PR ships the chart side; the basicAuth question is a follow-up on the
contabo side.

Verified locally:
- helm lint platform/cert-manager-powerdns-webhook/chart -> PASS
- helm template platform/cert-manager-powerdns-webhook/chart -> renders
- helm template ... --set clusterIssuer.enabled=true -> renders the
  ClusterIssuer with host="https://pdns.openova.io" + correct apiKey
  Secret reference.
- helm template platform/cert-manager/chart -> renders ONLY
  letsencrypt-http01-prod (the dns01 dynadot issuer correctly gated off).
- scripts/check-bootstrap-deps.sh: net-zero new drift; my branch reduces
  pre-existing errors from 3 to 2 (the dropped slot 49b removed the only
  drift my branch was responsible for).

Closes follow-up to #373. Preconditions for handover URL TLS green
on otech43-46 lineage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): repair YAML structure in expected-bootstrap-deps.yaml

Two pre-existing drifts were blocking dependency-graph-audit CI:

1. Slot 5a (bp-reflector) was missing its closing list separator,
   causing yq to merge the bp-nats-jetstream entry into the bp-reflector
   map and effectively drop bp-reflector from the expected DAG.
   Added explicit `- slot: 7` for bp-nats-jetstream and quoted "5a" so
   yq treats it as a string slot (matches the convention with "49b").

2. bp-powerdns slot 11: actual bootstrap-kit declares dependsOn
   bp-cnpg (live since otech28 — pdns-pg-app secret race) but the
   expected DAG was missing this edge.

This unblocks merging fix/cert-manager-powerdns-webhook-contabo (PR
above) — these drifts existed on main but weren't surfaced until the
last expected-deps edit forced a re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:12:48 +04:00
e3mrah
74921e30f1
fix(architecture): drop bp-spire, Cilium WireGuard is the canonical east-west mesh (#665)
Founder direction 2026-05-03: with 100% Cilium mesh enforcement +
Envoy where required, bp-spire is redundant for the minimal Sovereign
MVP.

Reasoning:
- Cilium 1.13+ has built-in mutual auth using SPIFFE, but it ships
  with its own embedded SPIRE server managed by the Cilium operator.
  External bp-spire is not needed for east-west mTLS.
- Our ESO→OpenBao auth uses the K8s ServiceAccount auth method
  (TokenReview against kube-apiserver), not JWT-SVID.
- WireGuard transparent encryption (already enabled in cilium values)
  encrypts every pod-to-pod connection at the kernel transport layer.
- Cross-Sovereign federation and per-workload-fingerprint attestation
  are not blocking handover; they can be re-introduced as an opt-in
  blueprint when needed.

Changes:
- Delete clusters/_template/bootstrap-kit/06-spire.yaml
- Remove bp-spire from kustomization.yaml + expected-bootstrap-deps.yaml
- Remove bp-spire dependsOn from 07-nats-jetstream.yaml + 08-openbao.yaml
- bp-cilium 1.1.4: add encryption.nodeEncryption=true so node-to-node
  traffic (not just pod-to-pod) is also WireGuard-encrypted; document
  in values.yaml comment that WireGuard is the canonical east-west
  mTLS layer.
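
In Cilium chart values, the described encryption posture corresponds to:

```yaml
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true   # node-to-node, not just pod-to-pod
```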

Removes 4 pods (spire-server, spire-agent, spire-spiffe-csi-driver,
spire-spiffe-oidc-discovery-provider) from every Sovereign and the
recurring CSI mount race that was getting stuck on otech43.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 13:56:36 +04:00
e3mrah
be6e610093
fix: drop bp-langfuse from minimal + bp-mimir 1.0.2 push_grpc fix (#664)
* fix: drop bp-langfuse from minimal bootstrap-kit + bp-mimir push_grpc fix

Two independent fixes packaged together:

1. **Drop bp-langfuse** from the SOLO minimal bootstrap-kit. Per
   founder direction: langfuse is LLM-specific (prompt/completion
   tracing for AI plane), not platform infrastructure, and belongs
   to a future 'AI Add-On' template. Its CreateContainerConfigError
   on every Sovereign provision (missing langfuse-secrets pre-install)
   was eating Phase-1 reconciliation budget without contributing to
   handover-ready state. Removed:
   - clusters/_template/bootstrap-kit/26-langfuse.yaml
   - kustomization.yaml entry
   - scripts/expected-bootstrap-deps.yaml slot 26 entry

2. **bp-mimir 1.0.2** — re-enable ingester.push_grpc_method_enabled.
   Upstream mimir-distributed 6.0.6 disables Push gRPC when
   ingest-storage is off, but classic-mode ingester REQUIRES it.
   The combo crashloops with 'cannot disable Push gRPC method in
   ingester, while ingest storage (-ingest-storage.enabled) is not
   enabled'. Caught live on otech43 with 17 restarts.
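
Via the mimir-distributed chart's structured config, the re-enable is roughly this (key placement is an assumption to check against 6.0.6):

```yaml
mimir:
  structuredConfig:
    ingester:
      push_grpc_method_enabled: true
```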

Both issues block Phase-1 ready=40/40 from being a clean signal.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-mimir): chart 1.0.2 push_grpc_method_enabled + finalize langfuse drop

Follow-up to the previous commit, which only captured the file deletion.
This commit applies the bp-mimir 1.0.2 chart bump, the kustomization +
expected-deps removal of langfuse, and the bootstrap-kit version bumps.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 13:50:38 +04:00
e3mrah
544dc86b5b
fix(wizard): blueprint deps sourced from Flux dependsOn (single source of truth) (#652)
* fix(bp-harbor): grep -oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:47:52 +04:00
e3mrah
8d2ba0495d
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)
2026-05-02 15:18:49 +04:00
hatiyildiz
7c3ff940ff
fix(ci): update solver_test.go fixtures + expected-bootstrap-deps.yaml for #550
- core/cmd/cert-manager-dynadot-webhook/solver_test.go: fix SetDns2Response →
  SetDnsResponse and ResponseCode:"0" → ResponseCode:0 in test fixtures so
  webhook command tests pass against the corrected dynadot-client JSON parsing
- scripts/expected-bootstrap-deps.yaml: declare bp-cert-manager-dynadot-webhook
  at slot 49b so the bootstrap-kit dependency-graph audit passes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:44:18 +02:00
e3mrah
f689766615
fix(infra): add explicit dependsOn to bp-openbao + bp-catalyst-platform (#512) (#513)
Phase-8a-preflight live deployment otech16 (9e14dcc0d2de7586, 2026-05-02):
even after bumping install/upgrade timeout to 15m (commit f47948e7), the
post-install hooks for bp-openbao and bp-catalyst-platform STILL race their
dependencies. The hooks need workload pods Ready before they can do their
work — bp-openbao 3-node Raft init waits for cnpg-postgres + Cilium L7,
and bp-catalyst-platform umbrella init waits for keycloak + cnpg.

Fix (Option C — explicit dependsOn):
- bp-openbao: add bp-cnpg (already had bp-spire, bp-gateway-api)
- bp-catalyst-platform: add bp-keycloak + bp-cnpg (already had bp-gitea, bp-gateway-api)

This makes Flux wait for those HRs Ready=True BEFORE starting the install,
so the post-install hooks run after deps are warm. Eliminates the race.

Updated scripts/expected-bootstrap-deps.yaml to match. Verified:
- bash scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles
- go test ./tests/e2e/bootstrap-kit/... -run TestBootstrapKit_DependencyOrderMatchesCanonical — PASS

Closes #512

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:00:56 +04:00
e3mrah
e1f7d22f3c
fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503) (#505)
Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream
Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them
ahead of every chart that ships HTTPRoute templates: bp-openbao,
bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor,
bp-grafana.

Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on
catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before
stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to
InstallFailed with `no matches for kind "HTTPRoute" in version
"gateway.networking.k8s.io/v1"`. Cilium 1.16's chart
`gatewayAPI.enabled=true` flag wires up the cilium gateway controller
and creates the `cilium` GatewayClass, but does NOT install the
gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no
`installCRDs`-equivalent knob for gateway-api, so the upstream CRDs must
ship via a separate Blueprint.

Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by
the founder for ALL similar future cases: intra-chart CRD-ordering
breaks → split into two charts + Flux dependsOn. Mirrors the
bp-crossplane/bp-crossplane-claims and bp-external-secrets/
bp-external-secrets-stores splits.

Files:
- platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with
  per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0
  standard-install.yaml; helm.sh/resource-policy: keep on every CRD so
  Helm uninstall does not orphan every HTTPRoute on the cluster
- platform/gateway-api/chart/scripts/regenerate.sh — developer tool
  for re-vendoring on upstream version bump (annotation-driven)
- platform/gateway-api/chart/tests/crd-render.sh — chart integration
  test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin)
- clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease
  + HelmRepository, dependsOn bp-cilium
- clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea,
  11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml —
  add `dependsOn: bp-gateway-api`
- clusters/_template/bootstrap-kit/kustomization.yaml — register
  01a-gateway-api.yaml between 01-cilium and 02-cert-manager
- scripts/expected-bootstrap-deps.yaml — declare slot 1a + add
  bp-gateway-api to depends_on of every HTTPRoute-using slot

Closes #503

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:30:50 +04:00
e3mrah
1865ac8975
fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) (#504)
* fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340)

The upstream seaweedfs/seaweedfs 4.22.0 chart now ships
templates/shared/security-configmap.yaml which calls fromToml — a Sprig
function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm
SDK older than 3.13 and PARSES every template before any
{{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's
mere presence breaks install on every Sovereign with:

  parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21):
    function "fromToml" not defined

even though enableSecurity defaults to false. Setting the gate value
does NOT skip parsing — only deleting / never-shipping the file does.

Fix shape (per ticket #340):

1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/
   (committed bytes, not auto-pulled at build time). Required because the
   upstream Helm repo overwrites 4.22.0 in place — re-pulling would
   re-introduce the broken file.
2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml.
   Every other template that references the deleted ConfigMap is gated
   under {{- if enableSecurity }} so removing it is a no-op for our
   default deployment shape (Catalyst SeaweedFS auth happens at the S3
   layer via IAM creds from External Secrets, not via the upstream
   chart's TLS/JWT machinery).
3. Drop the dependencies: block from chart/Chart.yaml; add
   annotations.catalyst.openova.io/no-upstream=true so the
   blueprint-release workflow's hollow-chart guard (issue #181) skips
   the auto-pull/round-trip checks for this chart.
4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the
   vendored bytes are tracked.
5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled).
6. Add tests/no-fromtoml.sh — chart-test that asserts the offending
   file stays deleted across future re-vendors. Runs in
   .github/workflows/blueprint-release.yaml as a publish-gating check.

Unblocks Phase-8a observability + storage chain on otech (bp-loki,
bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn
bp-seaweedfs).

Closes #340

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps

The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines
35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct
architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud
Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG
in scripts/expected-bootstrap-deps.yaml was never updated to match.

Pre-existing drift on main; surfaced by the dependency-graph-audit
check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the
audit passes on the same PR — the two changes are both about the
storage chain on Sovereigns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:20:59 +04:00
e3mrah
87ba48c44e
fix(ci): vendor-coupling guardrail path - products/catalyst/bootstrap/api/internal/objectstorage (closes #438) (#440)
The mode-gate check was looking for ${REPO_ROOT}/internal/objectstorage
but the actual Go package lives at products/catalyst/bootstrap/api/internal/objectstorage.
Update the path so hard-fail mode auto-engages on this repo.

Validation:
  bash scripts/check-vendor-coupling.sh
  -> HARD-FAIL mode banner emitted, exit 0 on clean tree
  Synthetic 'hetzner-object-storage' under platform/ -> exit 1.

Refs: PR #437 (#383) which surfaced the bug.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:21:57 +04:00
e3mrah
0fdd411e79
ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428) (#431)
Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml
that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names
(hetzner|aws|gcp|azure|oci) appearing in capability-named slots:

  1. <vendor>-object-storage          (sealed-secret / overlay-secret name)
  2. <chart>Overlay\.<vendor>\.       (chart values block keyed to vendor)
  3. <vendor>ObjectStorage            (camelCase payload field)

Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/,
internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR
refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may
discuss the rule).
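
An illustrative sketch of the three capability-slot patterns run against a synthetic tree — flags, file layout, and the exclusion handling here are assumptions for demonstration, not the real script:

```shell
#!/usr/bin/env bash
# Sketch only: scan a throwaway tree for the three vendor-coupling
# patterns named above. The real scripts/check-vendor-coupling.sh
# differs in paths, exclusions, and mode-gating.
set -u
vendors='hetzner|aws|gcp|azure|oci'
tree=$(mktemp -d)
mkdir -p "$tree/platform/demo"
cat > "$tree/platform/demo/values.yaml" <<'EOF'
secretName: hetzner-object-storage
payload:
  hetznerObjectStorage: {}
EOF
hits=$(grep -rEn \
  -e "(${vendors})-object-storage" \
  -e "Overlay\.(${vendors})\." \
  -e "(${vendors})ObjectStorage" \
  "$tree" | grep -v '\.md:' || true)
printf '%s\n' "$hits"
rm -rf "$tree"
```

Both the kebab-case secret name and the camelCase payload field trip the scan; a hypothetical `*.md` hit would be filtered by the trailing `grep -v`.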

Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425
work-in-progress); hard-fail once that directory lands. Locally on this branch
the script emits 49 warnings to stderr and exits 0 against the existing
hetzner-coupled references in platform/velero, platform/seaweedfs, and
clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those
warnings disappear and any future re-introduction fails CI.

Workflow trigger surface: push-to-main + pull_request on the scanned paths +
workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled".

Canonical seam used: scripts/ + .github/workflows/ (mirrors
scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml
shape). NOT a duplicate - no prior vendor-coupling guard existed.

Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map)
      docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:49:49 +04:00
e3mrah
92b7db622d
fix(bp-external-secrets-stores): split ClusterSecretStore into separate chart per #247 pattern (closes #331) (#426)
* fix(bp-external-secrets): split ClusterSecretStore into bp-external-secrets-stores chart (resolves CRD ordering, closes #331)

bp-external-secrets@1.0.0 deadlocked on first install on otech.omani.works:

  Helm install failed for release external-secrets-system/external-secrets
  with chart bp-external-secrets@1.0.0:
  failed post-install: unable to build kubernetes object for deleting hook
  bp-external-secrets/templates/clustersecretstore-vault-region1.yaml:
  resource mapping not found for name: "vault-region1" namespace: ""
  no matches for kind "ClusterSecretStore" in version "external-secrets.io/v1beta1"

Root cause: Helm's `helm.sh/hook-delete-policy: before-hook-creation` ran
a kubectl-style lookup of the existing ClusterSecretStore CR before the
upstream `external-secrets` subchart's CRDs finished registration. The
in-line ClusterSecretStore template (templates/clustersecretstore-vault-
region1.yaml) and the upstream subchart's CRDs co-installed in the same
release; admission ordering wasn't deterministic enough to make the
post-install hook safe.

Fix — same pattern as PR #247 (bp-crossplane@1.1.3 ↔ bp-crossplane-claims@1.0.0):
split the chart into controller + stores. Flux dependsOn orders them.

  - bp-external-secrets@1.1.0 — controller-only (just upstream subchart
    + NetworkPolicy + ServiceMonitor toggle). CRDs register here.
  - bp-external-secrets-stores@1.0.0 (NEW) — the default
    ClusterSecretStore CR; depends on bp-external-secrets being Ready.
    No Helm hooks needed: by the time this chart's HelmRelease starts,
    Flux has already verified bp-external-secrets is Ready=True and
    therefore the CRDs are registered.
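
The split's ordering contract can be sketched as a Flux HelmRelease fragment — the names and dependsOn entries come from this commit; the apiVersion and surrounding fields are abbreviated and illustrative, with the committed 15a-external-secrets-stores.yaml as the source of truth:

```yaml
# Sketch only — metadata beyond the names above is illustrative.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-external-secrets-stores
spec:
  dependsOn:
    - name: bp-external-secrets   # controller Ready=True implies CRDs registered
    - name: bp-openbao            # store target must exist before the CR points at it
```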

Files:
  NEW: platform/external-secrets-stores/blueprint.yaml             (1.0.0)
  NEW: platform/external-secrets-stores/chart/Chart.yaml           (1.0.0; no upstream subchart, annotation `catalyst.openova.io/no-upstream: "true"`)
  NEW: platform/external-secrets-stores/chart/values.yaml          (clusterSecretStore.* knobs moved from controller chart)
  MOVED: platform/external-secrets/chart/templates/clustersecretstore-vault-region1.yaml
       → platform/external-secrets-stores/chart/templates/clustersecretstore-vault-region1.yaml
       (Helm hook annotations removed — Flux dependsOn now handles ordering)
  TOUCHED: platform/external-secrets/chart/Chart.yaml              (1.0.0 → 1.1.0; description note appended)
  TOUCHED: platform/external-secrets/blueprint.yaml                (1.0.0 → 1.1.0)
  TOUCHED: platform/external-secrets/chart/values.yaml             (clusterSecretStore block removed; pointer comment added)
  NEW: clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml
       (Flux HelmRelease, dependsOn: [bp-external-secrets, bp-openbao])
  TOUCHED: clusters/_template/bootstrap-kit/15-external-secrets.yaml
       (chart version 1.0.0 → 1.1.0)
  TOUCHED: clusters/_template/bootstrap-kit/kustomization.yaml
       (slot 15a inserted after 15)

Out of scope for this PR (separate tickets):
  - blueprint-release.yaml CI fan-out: verify the path-matrix picks up
    the new platform/external-secrets-stores/ directory automatically;
    if not, add the directory to the matrix in a follow-up.
  - Per-Sovereign cluster directory edits (#257 will delete those).
  - Phase 0 minimum trim (#310 will renumber slots; this PR uses 15a as
    a non-disruptive sub-slot insertion that works with both the current
    35-slot kustomization and the eventual 15-slot canonical layout —
    when #310 renumbers, 15 + 15a become 08 + 09 in the canonical order).

Refs: #331 (this issue), #247 (pattern reference — bp-crossplane split)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): register bp-external-secrets-stores in expected-bootstrap-deps.yaml

The dependency-graph-audit CI step rejected PR #334 because the new
bp-external-secrets-stores HR was on disk at slot 15a but missing from
the expected DAG. This commit adds it with the same dependsOn shape as
clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml:
[bp-external-secrets, bp-openbao].

Refs: #331, #310 (Phase 0 minimum), PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bp-external-secrets): retire CR cases from controller test, add stores-toggle (#331)

After splitting the default ClusterSecretStore into bp-external-secrets-stores
@1.0.0, the controller chart's observability-toggle integration test still
expected the CR to render in the controller chart (Cases 4 + 5). Those
assertions now belong on the new chart.

Changes:
  - platform/external-secrets/chart/tests/observability-toggle.sh:
    Replace Cases 4+5 with a single inverted assertion — the controller
    chart MUST render ZERO ClusterSecretStore CRs (top-level kind:); only
    the upstream subchart's CRD definition (whose spec.names.kind value is
    "ClusterSecretStore" at non-zero indent) is allowed.
  - platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh:
    NEW. Mirrors the retired Cases 4+5 against the stores chart, plus a
    Case 3 that asserts clusterSecretStore.server overrides propagate.

Local smoke:
  bash platform/external-secrets/chart/tests/observability-toggle.sh         → 4/4 PASS
  bash platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh → 3/3 PASS

Refs: #331, PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): handle alphanumeric sub-slot suffixes in check-bootstrap-deps.sh

PR #334 (issue #331) added slot 15a-external-secrets-stores as a sub-slot
between numeric slots 15 and 16. The bootstrap-deps audit script's
`printf '%02d'` formatter rejected `15a` with:

  scripts/check-bootstrap-deps.sh: line 390: printf: 15a: invalid number

Fix: detect non-numeric slot tokens and pass them through verbatim. Numeric
slots still render as zero-padded `01..49` for output alignment.
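
A hypothetical helper mirroring that fix — zero-pad purely numeric slot tokens, pass alphanumeric sub-slots through verbatim (names illustrative, not the script's actual function):

```shell
# Numeric slots zero-pad for column alignment; sub-slots like "15a"
# would break printf '%02d', so they pass through untouched.
fmt_slot() {
  local slot="$1"
  if [[ "$slot" =~ ^[0-9]+$ ]]; then
    printf '%02d' "$slot"
  else
    printf '%s' "$slot"
  fi
}
fmt_slot 7    # -> 07
echo
fmt_slot 15a  # -> 15a
echo
```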

Local smoke:
  $ bash scripts/check-bootstrap-deps.sh
  ...
    [P] slot 15  bp-external-secrets        <-- bp-cert-manager bp-openbao
    [P] slot 15a bp-external-secrets-stores <-- bp-external-secrets bp-openbao
  ...
  OK: bootstrap-kit dependency graph audit PASSED

Refs: #331, PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): tick #331 chart-released

bp-external-secrets@1.1.0 (controller-only) + bp-external-secrets-stores@1.0.0
(NEW) shipped in PR #426. Helm-template acceptance + both toggle tests +
dependency-graph-audit all green. Sovereign-impact deferred to Phase 8.

Refs: #331, PR #426.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 17:33:47 +04:00
e3mrah
f7796ef807
feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) (#423)
* feat(bp-velero): Hetzner Object Storage backend wiring (closes #384)

Velero on a Hetzner Sovereign now writes its backups DIRECTLY to Hetzner
Object Storage per ADR-0001 §13 (S3-aware app architecture rule) +
docs/omantel-handover-wbs.md §3 — NOT SeaweedFS, which is reserved as a
POSIX→S3 buffer for legacy POSIX-only writers and is not in the minimal
Sovereign set.

Mirrors the Hetzner-direct backend pattern Agent #383 is wiring for
Harbor; both consume the canonical flux-system/hetzner-object-storage
Secret shipped by issue #371 (cloud-init writes 5 keys: s3-endpoint /
s3-region / s3-bucket / s3-access-key / s3-secret-key, derived from
the operator-issued Hetzner-Console keys + the per-Sovereign bucket
provisioned by OpenTofu's aminueza/minio resource).

platform/velero/chart/ (umbrella chart, bumped to 1.1.0):
  - templates/_helpers.tpl: NEW — bp-velero.fullname / bp-velero.labels
    helpers + bp-velero.hetznerCredentialsSecretName (default
    `velero-hetzner-credentials`).
  - templates/hetzner-credentials-secret.yaml: NEW — synthesises a
    velero-namespace Secret with a single `cloud` key in AWS-CLI INI
    format from .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}.
    The upstream Velero deployment mounts this at /credentials/cloud
    via existingSecret + AWS_SHARED_CREDENTIALS_FILE. Skip-render path
    when veleroOverlay.hetzner.enabled is false (default — keeps
    contabo render clean) or useExistingSecret is true (operator
    supplied Secret out-of-band).
  - values.yaml: BSL provider/region/s3Url/bucket fields populated as
    placeholders the per-Sovereign HelmRelease overrides via Flux
    valuesFrom; backupsEnabled defaults FALSE so default render emits
    no half-broken BSL; veleroOverlay.hetzner block surfaces the
    operator-overridable fields. Long-form rationale comments inline
    on each value per the chart's existing docstring style.
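
The `cloud` key's payload follows the standard AWS shared-credentials INI shape; a sketch with placeholder keys (the exact template wiring lives in hetzner-credentials-secret.yaml):

```shell
# Placeholder credentials; the real values come from
# .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}.
access_key="AK_TEST"
secret_key="SK_TEST"
cloud_ini=$(cat <<EOF
[default]
aws_access_key_id=${access_key}
aws_secret_access_key=${secret_key}
EOF
)
printf '%s\n' "$cloud_ini"
```

The upstream deployment then reads this file via AWS_SHARED_CREDENTIALS_FILE as described above.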

clusters/_template/bootstrap-kit/34-velero.yaml (+ omantel + otech):
  - dependsOn: bp-seaweedfs REMOVED — Velero is no longer a SeaweedFS
    consumer on Sovereigns (was the old SeaweedFS-tiered architecture
    that minimal-omantel retired in favour of cloud-native S3).
  - chart version bumped 1.0.0 → 1.1.0.
  - valuesFrom block added: 5 Secret-key entries pull each canonical
    s3-* key into the matching umbrella value path. Plaintext
    credentials never appear in the committed manifest; Flux
    dereferences valuesFrom at HelmRelease apply time.
  - values block adds the baseline veleroOverlay.hetzner.enabled=true
    + velero.credentials.{useSecret:true,existingSecret:velero-hetzner-
    credentials} + BSL provider/credential/s3ForcePathStyle scaffolding
    that the valuesFrom entries fill in.
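
One of the five valuesFrom entries can be sketched as follows — the Secret name and key come from the #371 contract; the targetPath is an illustrative assumption, not the committed manifest:

```yaml
valuesFrom:
  - kind: Secret
    name: hetzner-object-storage   # canonical flux-system Secret (#371)
    valuesKey: s3-bucket           # one of the five canonical s3-* keys
    # targetPath below is illustrative; the real manifest maps each key
    # into the matching umbrella value path.
    targetPath: velero.configuration.backupStorageLocation.bucket
```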

docs/omantel-handover-wbs.md:
  - §2 row 19: " chart needs S3 endpoint rework" → "🟢 chart-released
    v1.1.0 — Hetzner Object Storage backend wired to #371 secret".
  - §9 #384 row: detailed status with smoke evidence.

Smoke evidence (contabo, default values — no Hetzner credentials):
  - helm template t . → renders cleanly (no Hetzner Secret, no BSL).
  - helm template t . --set veleroOverlay.hetzner.enabled=true \
      --set ...accessKey=AK_TEST --set ...secretKey=SK_TEST \
      --set velero.backupsEnabled=true (+ BSL config) →
      Secret/velero-hetzner-credentials with `cloud` INI key emitted +
      BackupStorageLocation/default with provider=aws,
      bucket=omantel-velero, region=fsn1,
      s3Url=https://fsn1.your-objectstorage.com.
  - helm install velero-smoke . -n velero-smoke (defaults) → pod
    velero-69bb84c5-669sh Ready 1/1 in 48s. Smoke torn down clean.

Hetzner-S3 E2E deferred to Phase 8 (first omantel run) — contabo has
no Hetzner Object Storage credentials so end-to-end backup→restore
verification can't run here.

Anti-duplication rule: NO bash scripts authored, NO parallel
implementations of upstream Velero functionality. Upstream Velero +
velero-plugin-for-aws natively support any S3-compatible backend; the
work here is values + a credential-shape adapter Secret, not a fork.

Closes #384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): drop bp-seaweedfs dep from bp-velero expected DAG (#384)

Mirrors the dependsOn removal in clusters/_template/bootstrap-kit/34-
velero.yaml from the parent commit. Velero on Hetzner Sovereigns now
writes directly to Hetzner Object Storage (ADR-0001 §13 + WBS §3); no
in-cluster prerequisite Blueprint is required.

Local `bash scripts/check-bootstrap-deps.sh` now passes (0 drift,
0 cycles). The CI failure on the parent commit's PR was the audit
flagging bp-velero as having a missing edge to bp-seaweedfs because
this expected-DAG file still listed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:24:44 +04:00
e3mrah
6e0f734d62
fix(bootstrap-kit): renumber bp-cert-manager-powerdns-webhook 36→49 + register in expected DAG (#373 followup) (#412)
PR #410 landed slot 36 for bp-cert-manager-powerdns-webhook, but slot 36
was already reserved in scripts/expected-bootstrap-deps.yaml for
bp-stunner (W2.K4 forward-declaration). The bootstrap-kit dependency
audit failed on the merge SHA 04308af7 with:

  ERROR: HR 'bp-cert-manager-powerdns-webhook' (file
  clusters/_template/bootstrap-kit/36-bp-cert-manager-powerdns-webhook.yaml)
  is present on disk but NOT declared in
  scripts/expected-bootstrap-deps.yaml.

Two fixes here:

  1. Move the file to slot 49 (first free slot after W2.K4's 35-48
     forward declarations). File renamed; kustomization.yaml updated;
     in-file comment block updated to explain the slot choice.

  2. Register slot 49 in scripts/expected-bootstrap-deps.yaml as
     `wave: present` with `depends_on: [bp-cert-manager, bp-powerdns]` —
     matches the HelmRelease's actual dependsOn block.

Local audit:
  $ bash scripts/check-bootstrap-deps.sh
  Present on disk:       36
  Declared expected:     49
  Deferred (W2.K1-K4):   13
  Drift:                 0
  Cycles:                0
  OK: bootstrap-kit dependency graph audit PASSED

This is a CI-only follow-up; chart and runtime semantics from #410 are
unchanged. Sovereign-impact deferred to Phase 8 per chart-only DoD.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:46:49 +04:00
e3mrah
0289f0388d
feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259)
Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml,
the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3.

The script parses every clusters/_template/bootstrap-kit/*.yaml, extracts
metadata.name + spec.dependsOn for the HelmRelease document(s), and
mechanically verifies the actual graph against the expected DAG declared
in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's
algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch
(W2.K1-K4) on success.
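
The cycle check via Kahn's algorithm can be sketched in a few lines of bash (node names from this repo's slots; a sketch under bash ≥ 4 for associative arrays, not the audit script's actual code):

```shell
# Kahn's algorithm: repeatedly dequeue zero-in-degree nodes; if every
# node gets visited, the dependsOn graph is acyclic.
declare -A indeg deps
add_edge() {  # add_edge <dependent> <dependency>
  deps["$2"]+=" $1"
  indeg["$1"]=$(( ${indeg["$1"]:-0} + 1 ))
  : "${indeg["$2"]:=0}"
}
add_edge bp-external-secrets        bp-cert-manager
add_edge bp-external-secrets        bp-openbao
add_edge bp-external-secrets-stores bp-external-secrets

queue=()
for n in "${!indeg[@]}"; do
  [ "${indeg[$n]}" -eq 0 ] && queue+=("$n")
done
visited=0
while [ "${#queue[@]}" -gt 0 ]; do
  n="${queue[0]}"; queue=("${queue[@]:1}")
  visited=$((visited + 1))
  for m in ${deps[$n]:-}; do
    indeg[$m]=$(( ${indeg[$m]} - 1 ))
    [ "${indeg[$m]}" -eq 0 ] && queue+=("$m")
  done
done
if [ "$visited" -eq "${#indeg[@]}" ]; then
  status="no cycles"
else
  status="CYCLE detected"
fi
echo "audit: $status (visited $visited of ${#indeg[@]} nodes)"
```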

Behaviour against the in-flight expansion: HRs declared expected but not
yet on disk are reported as "deferred" (informational, not an error), so
that this script can be the static authoritative list while W2.K1-K4
PRs land their HR files in series. After all four W2 PRs merge, the
"deferred" count drops to 0 and the audit goes 100% green.

Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a
new dependency-graph-audit job that runs on every PR touching:
  - clusters/** (any HR file edit)
  - scripts/check-bootstrap-deps.sh
  - scripts/expected-bootstrap-deps.yaml
  - .github/workflows/test-bootstrap-kit.yaml

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:16:16 +04:00
hatiyildiz
9e3268f2c5 docs(ops): comprehensive operator runbook + remediation playbook + idempotent recovery script
Adds docs/RUNBOOK-OPERATIONS.md as the single operator-facing entry point for
provisioning, troubleshooting, and recovering Catalyst Sovereigns:

A. Pre-provision checklist — Hetzner project + token, Dynadot pool zones +
   credentials, GHCR pull token (cross-link SECRET-ROTATION.md), PowerDNS pool
   zones bootstrapped, PDM healthy, bp-* chart versions, subchart-guard CI green.
B. Step-by-step walkthrough with timing — Phase 0 OpenTofu (30-60s plan +
   60-120s apply), PDM /commit (~5s), cloud-init (3-5min), Phase 1
   bootstrap-kit (10-15min), cert-manager + Cilium Gateway (1-2min). Total
   15-25min for a solo Sovereign.
C. 18 known failure modes with SYMPTOM / ROOT CAUSE / DIAGNOSIS / RECOVERY,
   each pinned to the canonical fix commit (c6cbfe68, e571ec7a, 54872009,
   2022e1af, 34c8de84, dddbab4b, 43aff202, 418cead0, 64d7de97, 330211d2,
   41c7ac13) or marked fix-in-flight where applicable.
D. Idempotent recovery script (Hetzner purge with DELETE-204-but-resource-
   persists verification sweep, PDM allocation release, catalyst-api
   deployment-record cancel). Dry-run by default; --apply gates real deletes
   on a validated HETZNER_API_TOKEN.
E. Cross-links to INVIOLABLE-PRINCIPLES, SOVEREIGN-PROVISIONING,
   RUNBOOK-PROVISIONING, BLUEPRINT-AUTHORING, CHART-AUTHORING, SECRET-ROTATION,
   PLATFORM-POWERDNS, IMPLEMENTATION-STATUS — references, doesn't duplicate.
F. Mermaid phase timeline diagram at the top showing ownership boundaries
   (catalyst-provisioner -> cloud-init -> Sovereign cluster) and hand-off points.
G. Mermaid failure decision tree at the end — operators land at the right §C
   entry in 4-6 yes/no questions.

Recovery script gracefully degrades to a name-only preview when
HETZNER_API_TOKEN is unset in dry-run mode (apply mode still hard-fails on
missing/invalid token), so operators can review what WOULD happen before
exporting the token.
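
That gate can be sketched as follows — dry-run by default, destructive mode only behind --apply plus a present token. The flag and environment-variable names come from the text above; the structure is illustrative, and the real script additionally validates the token rather than just checking presence:

```shell
# Hypothetical sketch of the recovery script's mode gate.
apply=false
[ "${1:-}" = "--apply" ] && apply=true

if [ "$apply" = true ]; then
  if [ -z "${HETZNER_API_TOKEN:-}" ]; then
    echo "ERROR: --apply requires a valid HETZNER_API_TOKEN" >&2
    exit 1
  fi
  mode="apply: deletes will be executed"
elif [ -z "${HETZNER_API_TOKEN:-}" ]; then
  mode="dry-run (no token): name-only preview"
else
  mode="dry-run: full inspection, no deletes"
fi
echo "$mode"
```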

Verified dry-run output against the live omantel.omani.works Sovereign:
- Step 1 lists 8 Hetzner kinds + 8 verification-sweep targets to inspect
- Step 2 confirms PDM reports the subdomain currently RESERVED (live state)
- Step 3 correctly identifies catalyst-api deployment 6274daeb7a9873cd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:29 +02:00