7658f9d937
19 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
20b3c5258a
|
feat(bp-newapi): chart maturation + first-otech deploy + Qwen vLLM channel (#799) (#812)
* feat(bp-newapi): chart maturation - ExternalSecret + first-otech vLLM channel + skip-render gates (#799)
Maturation work for the SME-3 turnkey-experience epic (#795). Aligns the bp-newapi scratch chart with ADR-0003 (RBAC ↔ NewAPI user-create hook contract) and gets it past the blueprint-release CI smoke render that has blocked publication since PR #396 (run 25213444992 failed at default-values render of v1.0.0).
Changes
-------
- templates/external-secret.yaml (NEW). Renders the `catalyst-newapi-admin-token` ExternalSecret consumed by unified-rbac (ADR-0003 §3.2 + §6) for issuing per-user keys against `http://newapi.newapi.svc/api/v1/admin/users`. Sourced from OpenBao via the `vault-region1` ClusterSecretStore (canonical default shipped by bp-external-secrets-stores). Capabilities-gated on `external-secrets.io/v1beta1` so cold installs without ESO don't fail-render. Operator supplies the per-Sovereign OpenBao path via `catalystIntegration.externalSecret.remoteRef.key`; canonical convention is `sovereign/<sovereign-fqdn>/newapi/admin-token` with property `ADMIN_API_TOKEN`. Per Inviolable Principle #4 every knob is operator-overridable in the cluster overlay.
- values.yaml. Adds `catalystIntegration.externalSecret.{enabled, refreshInterval, secretStoreRef.{kind,name}, remoteRef.{key,property}}` block (default enabled=true, key="" so a misconfigured overlay fails loudly at render rather than silently skipping). Adds `defaultChannels.vllm` block, a first-otech shorthand that composes a vLLM-typed channel into the rendered channels list when enabled. Default endpoint is empty per Inviolable Principle #4; the `clusters/<sovereign>/bootstrap-kit/80-newapi.yaml` overlay supplies the per-Sovereign URL (canonical first-otech reference = `https://llm-api.omtd.bankdhofar.com` model `qwen3-coder`, the same upstream Axon uses on the OpenOva marketing deployment).
- templates/_helpers.tpl. New `bp-newapi.effectiveChannels` helper composes `.Values.channels` with `defaultChannels.vllm` (when enabled). The `assertChannelAttestation` helper now operates on the effective list so attestation gates apply to defaultChannels composition too. `defaultChannels.vllm.enabled=true` with empty endpoint fails-fast at render with a guided error message.
- templates/configmap.yaml. Channels rendering switches to the effectiveChannels helper. OIDC block now skip-renders gracefully when `auth.adminUI.keycloak.issuer` is unset (smoke-render path) instead of `required`-failing; the per-Sovereign overlay sets the issuer.
- templates/deployment.yaml. Skip-render gate on Deployment when `database.existingSecret`, `credentials.existingSecret`, or (when Keycloak mode is selected) the OIDC client secret is missing. Removes the four `required` calls that were failing CI smoke render. Service, ServiceAccount, ConfigMap, NetworkPolicy still render so the smoke test gets a non-empty output proving structural soundness; the actual Deployment defers until the per-Sovereign overlay wires the secrets.
- templates/ingress.yaml. Same skip-render pattern: when either `ingress.host` or `ingress.adminHost` is empty, the entire ingress block is silently skipped. Matches the bp-keycloak / bp-openbao / bp-external-dns HTTPRoute templates.
- Chart.yaml. version 1.0.0 → 1.1.0 (minor bump: additive features; no breaking changes to existing operator overrides).
Verification
------------
`helm template` smoke render on default values now succeeds with 4 resources (NetworkPolicy / ServiceAccount / ConfigMap / Service); 168 lines, well above the CI 5-line minimum. With a full per-Sovereign overlay (hosts + secrets + Keycloak issuer + ESO Capabilities + Traefik Capabilities + defaultChannels.vllm.endpoint), 8 resources render including Deployment, both Ingresses, the Traefik allowlist Middleware, and the ExternalSecret. The composed qwen channel writes through to `channels.yaml` with the expected endpoint + models + attestation.
Refs
----
ADR-0003 §3.2 + §6: admin-token contract
Issue #795 (epic): locked decisions
Issue #796: hook contract spec (sequential blocker, merged)
Inviolable Principles #1, #3, #4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(bootstrap-kit): slot 80 - bp-newapi default install (#799)
Adds the canonical install slot for bp-newapi to every fresh Sovereign's bootstrap-kit. Sequenced after the W2.K1 dependency wave so NewAPI's ExternalSecret + Postgres DSN dependencies resolve on first reconcile.
The HelmRelease declares `dependsOn: [bp-openbao, bp-keycloak, bp-cnpg]`:
- bp-openbao(08): admin-token ExternalSecret backend
- bp-keycloak(09): OIDC issuer for ops-staff admin UI at admin.<fqdn>
- bp-cnpg(16): Postgres backing for users/credits/channels/audit
Per-Sovereign overlays inherit the slot's defaults and override:
- ingress.host api.${SOVEREIGN_FQDN}
- ingress.adminHost admin.${SOVEREIGN_FQDN}
- auth.adminUI.keycloak.issuer
- database.existingSecret (Crossplane-claimed)
- credentials.existingSecret
- catalystIntegration.externalSecret.remoteRef.key sovereign/${FQDN}/newapi/admin-token
- defaultChannels.vllm.enabled true (first-otech)
- defaultChannels.vllm.endpoint (operator-supplied)
The `_template/` slot keeps `defaultChannels.vllm.enabled: false` so a fresh Sovereign does not silently wire customers to a third-party endpoint; the canonical first-otech reference (Qwen3 Coder via `https://llm-api.omtd.bankdhofar.com`, same relay Axon uses on the OpenOva marketing deployment) is documented in-line for operators adopting the same upstream.
Refs: #795 (epic), ADR-0003
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bootstrap-deps): register bp-newapi slot 80 in expected DAG (#799)
Fixes the dependency-graph-audit drift detection caught at PR #812 CI: the audit script enumerates HelmReleases in clusters/_template/bootstrap-kit/ and compares to scripts/expected-bootstrap-deps.yaml; an HR present on disk but absent from the expected DAG is treated as drift. Adds the canonical entry for bp-newapi at slot 80 with the same depends_on set declared on the HelmRelease itself ([bp-openbao, bp-keycloak, bp-cnpg]).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-newapi): align blueprint.yaml spec.version with Chart.yaml (#799)
The TestBootstrapKit_BlueprintCardsHaveRequiredFields static-validation gate asserts Chart.yaml version == blueprint.yaml spec.version. The chart was bumped to 1.1.0 in c63ecd8c; bumping the blueprint metadata to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
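The drift rule described in the commit above (a HelmRelease on disk with no entry in the expected DAG counts as drift) can be sketched in Python. This is a minimal illustration, not the actual `scripts/check-bootstrap-deps.sh` logic: the function name `find_drift` and the in-memory dict inputs are hypothetical, and the reverse check (declared but missing on disk) plus the edge comparison are assumptions drawn from other drift fixes in this history.

```python
def find_drift(on_disk, expected):
    """Compare HelmRelease dependsOn edges found on disk with the expected DAG.

    Both arguments map a HelmRelease name to its list of dependencies.
    Returns a sorted list of human-readable findings; an empty list means no drift.
    """
    findings = []
    # Rule stated in the commit message: on disk but absent from the expected DAG.
    for name in sorted(on_disk.keys() - expected.keys()):
        findings.append(f"{name}: present on disk, missing from expected DAG")
    # Assumption: the reverse direction is also treated as drift.
    for name in sorted(expected.keys() - on_disk.keys()):
        findings.append(f"{name}: declared in expected DAG, missing on disk")
    # Assumption: differing dependency sets on a shared entry are drift too.
    for name in sorted(on_disk.keys() & expected.keys()):
        if set(on_disk[name]) != set(expected[name]):
            findings.append(f"{name}: dependsOn differs from expected DAG")
    return findings


if __name__ == "__main__":
    disk = {"bp-newapi": ["bp-openbao", "bp-keycloak", "bp-cnpg"]}
    for finding in find_drift(disk, {}):
        print(finding)
```

Registering bp-newapi in the expected DAG with the same dependency set, as the fix commit does, makes the findings list empty.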
|
|
33dc98782b
|
feat(bp-self-sovereign-cutover): chart + bootstrap-kit slot 06a (#791) (#808)
New platform Blueprint at `platform/self-sovereign-cutover/chart/`. Ships
DORMANT — eight step PodSpec ConfigMaps, the registry-pivot DaemonSet, the
mutable cutover-status ConfigMap, plus ServiceAccount/RBAC. The catalyst-api
cutover endpoint (#792, merged at
|
||
|
|
53bc4357ca
|
feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767)
Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB):
1. PRE-LAUNCH ESTIMATE: wizard StepReview now surfaces a "Footprint estimate" Section with: bootstrap-kit baseline (sum of mandatory-tier component footprints), selected components delta, control-plane overhead, and a "Recommended N x <SKU>" line that turns amber when the operator's chosen worker count is below the rollup. Backed by per-component RAM/CPU floors in components/wizard/steps/componentFootprints.ts (covered by 12 unit tests including the otech92 reproduction).
2. RUNTIME AUTOSCALING: new bp-cluster-autoscaler-hcloud Blueprint added at bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart 9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired from the canonical flux-system/cloud-credentials.hcloud-token Secret cloud-init writes (mirrors the velero/harbor object-storage pattern). Pinned to the control-plane node so the autoscaler never schedules onto a worker it could itself terminate. 10-minute scale-down idle as the cost-saving default.
Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling), which explains how VPA / HPA / KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over KEDA for cluster scaling, and the bounds + safety story.
Per the issue's MVP scope, this PR ships the blueprint + StepReview estimate WITHOUT the wizard StepProvider min/max pair refactor or the tofu node-pool template restructuring. Those are tracked as a follow-up issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps
Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected-bootstrap-deps.yaml; the dependency-graph-audit CI check fired on PR #776 because the file existed without a matching entry in the expected DAG, AND collided with a reserved slot.
Move to slot 50 (after the W2.K4 cohort + slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to the expected-bootstrap-deps.yaml so the audit passes.
`scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
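The "Recommended N x <SKU>" rollup from the footprint-estimate commit above could look roughly like the following Python sketch. The real logic lives in the wizard's TypeScript (componentFootprints.ts) and is not shown in this history, so every name and number here is hypothetical: the `Footprint` type, the component figures, the 4-core value for cpx32, and the choice to count control-plane overhead against worker capacity are all assumptions. Only the 8 GB per cpx32 figure follows from the "2x cpx32 ... full 16 GB" remark.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    ram_gib: float
    cpu_cores: float

    def __add__(self, other):
        return Footprint(self.ram_gib + other.ram_gib,
                         self.cpu_cores + other.cpu_cores)


def recommended_workers(rollup, sku):
    """Smallest worker count whose combined RAM and CPU cover the rollup."""
    by_ram = math.ceil(rollup.ram_gib / sku.ram_gib)
    by_cpu = math.ceil(rollup.cpu_cores / sku.cpu_cores)
    return max(1, by_ram, by_cpu)


def review_line(rollup, sku_name, sku, chosen_workers):
    """Return the review text and a status that turns amber when under-provisioned."""
    needed = recommended_workers(rollup, sku)
    status = "amber" if chosen_workers < needed else "ok"
    return f"Recommended {needed} x {sku_name}", status


if __name__ == "__main__":
    baseline = Footprint(14, 6)       # hypothetical bootstrap-kit baseline
    selected = Footprint(4, 2)        # hypothetical selected-components delta
    control_plane = Footprint(2, 1)   # hypothetical control-plane overhead
    cpx32 = Footprint(8, 4)           # 8 GB per node; core count assumed
    print(review_line(baseline + selected + control_plane, "cpx32", cpx32, 2))
```

With these illustrative figures, two workers fall short of the rollup, so the line would be flagged amber.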
|
|
2b60e944e2
|
fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook (#681)
* fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook
Caught live on otech43-46: cert-manager DNS-01 challenges for *.otechN.omani.works failed because the Sovereign-side webhook wrote challenge TXT records to the Sovereign's local PowerDNS. omani.works is delegated from Dynadot to ns1/2/3.openova.io which run on contabo's central PowerDNS; the Sovereign's local PowerDNS is INVISIBLE on the public DNS chain until pool-domain-manager seals the per-Sovereign NS delegation. Let's Encrypt resolvers walk the public chain, query contabo, get NXDOMAIN, and the cert never issues. Manual workaround was seeding challenge TXT directly in contabo PowerDNS.
This PR automates the right write path:
- bp-cert-manager-powerdns-webhook chart bumped to 1.0.4. Default powerdns.host flips from "" (skip-render) to https://pdns.openova.io (contabo's public PowerDNS API ingress, authoritative for omani.works).
- ClusterIssuer letsencrypt-dns01-prod-powerdns now usable with no per-cluster powerdns.host override for the omani.works pool. apiKeySecretRef.namespace clarified: upstream ignores it; the Secret must live in cert-manager namespace (= ChallengeRequest.ResourceNamespace for ClusterIssuers).
- bootstrap-kit slot 49 updated: drops bp-powerdns dependsOn (webhook calls out-of-cluster contabo, not local PowerDNS), bumps chart version, removes inline powerdns.host override (defaults are correct).
- bootstrap-kit slot 49b (bp-cert-manager-dynadot-webhook) DELETED entirely: Dynadot is NOT the API-level authority for omani.works subdomains, so the dynadot webhook silently fails the same way the Sovereign-local powerdns one did.
- clusters/_template/sovereign-tls/cilium-gateway-cert.yaml flips issuerRef from letsencrypt-dns01-prod (was dynadot-backed) to letsencrypt-dns01-prod-powerdns (the new contabo-backed issuer).
- bp-cert-manager chart: certManager.issuers.dns01.enabled defaults to false (deprecated dynadot path). letsencrypt-http01-prod retained for per-host certs. Cluster overlays MAY flip dns01.enabled=true for non-omani.works pools where Dynadot IS the API-level authority.
- scripts/expected-bootstrap-deps.yaml: drops slot 49b, drops bp-powerdns edge from slot 49.
- Documentation (README + blueprint.yaml + Chart.yaml description) rewritten to reflect contabo retarget and lifecycle reasoning.
Credential plumbing (out of scope here, must be done in cloud-init):
- Every Sovereign needs a `powerdns-api-credentials` Secret in the `cert-manager` namespace whose `api-key` value matches contabo's PowerDNS API key. Same seeding pattern as `dynadot-api-credentials` in infra/hetzner/cloudinit-control-plane.tftpl.
Caveat (basicAuth on contabo's PowerDNS API ingress): contabo currently fronts pdns.openova.io with Traefik basicAuth (per clusters/contabo-mkt/apps/powerdns/helmrelease.yaml). The upstream zachomedia/cert-manager-webhook-pdns binary supports the X-API-Key header but not HTTP Basic Auth out of the box. To make this end-to-end green, contabo's basicAuth requirement must be relaxed (X-API-Key alone provides the auth posture, and contabo's API endpoint is restricted to operator IPs by other means) OR the Sovereign's webhook needs an Authorization header injected via the chart's powerdns.headers map (plaintext password in the ClusterIssuer config, not ideal). This PR ships the chart side; the basicAuth question is a follow-up on the contabo side.
Verified locally:
- helm lint platform/cert-manager-powerdns-webhook/chart -> PASS
- helm template platform/cert-manager-powerdns-webhook/chart -> renders
- helm template ... --set clusterIssuer.enabled=true -> renders the ClusterIssuer with host="https://pdns.openova.io" + correct apiKey Secret reference.
- helm template platform/cert-manager/chart -> renders ONLY letsencrypt-http01-prod (the dns01 dynadot issuer correctly gated off).
- scripts/check-bootstrap-deps.sh: net-zero new drift; my branch reduces pre-existing errors from 3 to 2 (the dropped slot 49b removed the only drift my branch was responsible for).
Closes follow-up to #373. Preconditions for handover URL TLS green on otech43-46 lineage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(scripts): repair YAML structure in expected-bootstrap-deps.yaml
Two pre-existing drifts were blocking dependency-graph-audit CI:
1. Slot 5a (bp-reflector) was missing its closing list separator, causing yq to merge the bp-nats-jetstream entry into the bp-reflector map and effectively drop bp-reflector from the expected DAG. Added explicit `- slot: 7` for bp-nats-jetstream and quoted "5a" so yq treats it as a string slot (matches the convention with "49b").
2. bp-powerdns slot 11: actual bootstrap-kit declares dependsOn bp-cnpg (live since otech28, pdns-pg-app secret race) but the expected DAG was missing this edge.
This unblocks merging fix/cert-manager-powerdns-webhook-contabo (PR above); these drifts existed on main but weren't surfaced until the last expected-deps edit forced a re-run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
74921e30f1
|
fix(architecture): drop bp-spire, Cilium WireGuard is the canonical east-west mesh (#665)
Founder direction 2026-05-03: with 100% Cilium mesh enforcement + Envoy where required, bp-spire is redundant for the minimal Sovereign MVP.
Reasoning:
- Cilium 1.13+ has built-in mutual auth using SPIFFE, but it ships with its own embedded SPIRE server managed by the Cilium operator. External bp-spire is not needed for east-west mTLS.
- Our ESO→OpenBao auth uses the K8s ServiceAccount auth method (TokenReview against kube-apiserver), not JWT-SVID.
- WireGuard transparent encryption (already enabled in cilium values) encrypts every pod-to-pod connection at the kernel transport layer.
- Cross-Sovereign federation and per-workload-fingerprint attestation are not blocking handover; they can be re-introduced as an opt-in blueprint when needed.
Changes:
- Delete clusters/_template/bootstrap-kit/06-spire.yaml
- Remove bp-spire from kustomization.yaml + expected-bootstrap-deps.yaml
- Remove bp-spire dependsOn from 07-nats-jetstream.yaml + 08-openbao.yaml
- bp-cilium 1.1.4: add encryption.nodeEncryption=true so node-to-node traffic (not just pod-to-pod) is also WireGuard-encrypted; document in values.yaml comment that WireGuard is the canonical east-west mTLS layer.
Removes 4 pods (spire-server, spire-agent, spire-spiffe-csi-driver, spire-spiffe-oidc-discovery-provider) from every Sovereign and the recurring CSI mount race that was getting stuck on otech43.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
be6e610093
|
fix: drop bp-langfuse from minimal + bp-mimir 1.0.2 push_grpc fix (#664)
* fix: drop bp-langfuse from minimal bootstrap-kit + bp-mimir push_grpc fix
Two independent fixes packaged together:
1. **Drop bp-langfuse** from the SOLO minimal bootstrap-kit. Per founder direction: langfuse is LLM-specific (prompt/completion tracing for AI plane), not platform infrastructure, and belongs to a future 'AI Add-On' template. Its CreateContainerConfigError on every Sovereign provision (missing langfuse-secrets pre-install) was eating Phase-1 reconciliation budget without contributing to handover-ready state. Removed:
- clusters/_template/bootstrap-kit/26-langfuse.yaml
- kustomization.yaml entry
- scripts/expected-bootstrap-deps.yaml slot 26 entry
2. **bp-mimir 1.0.2**: re-enable ingester.push_grpc_method_enabled. Upstream mimir-distributed 6.0.6 disables Push gRPC when ingest-storage is off, but classic-mode ingester REQUIRES it. The combo crashloops with 'cannot disable Push gRPC method in ingester, while ingest storage (-ingest-storage.enabled) is not enabled'. Caught live on otech43 with 17 restarts.
Both issues block Phase-1 ready=40/40 from being a clean signal.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-mimir): chart 1.0.2 push_grpc_method_enabled + finalize langfuse drop
Follow-up to previous commit which only captured the file deletion. This commit applies: bp-mimir 1.0.2 chart bump, kustomization + expected-deps removal of langfuse, bootstrap-kit version bumps.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
544dc86b5b
|
fix(wizard): blueprint deps sourced from Flux dependsOn (single source of truth) (#652)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)
* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)
The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):
componentGroups.ts                   Flux HelmRelease.dependsOn
-----------------------------------  -------------------------------------------
keycloak: [cnpg]                     keycloak: [cert-manager, gateway-api]
openbao:  []                         openbao:  [spire, gateway-api, cnpg]
harbor:   [cnpg, seaweedfs, valkey]  harbor:   [cnpg, cert-manager, gateway-api]
Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03
This commit:
1. Adds scripts/generate-blueprint-deps.sh that parses every
bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
keyed by bare component id (bp- prefix stripped on both source
and target side).
2. Commits the generated JSON.
3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
4. Patches componentGroups.ts so every RAW_COMPONENT's
`dependencies` field is OVERRIDDEN at module load with the
Flux-canonical list (the inline `dependencies: [...]` literals
are now ignored — Flux is canonical).
Follow-ups (not in this PR):
- CI drift check that re-runs the script and diffs the JSON.
- Strip the inline `dependencies: [...]` arrays entirely once the
drift check is green.
- Wire the FlowPage edge-rendering to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
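The override described in the wizard commit above (dependencies keyed by bare component id, with Flux as the canonical source) can be illustrated with a short Python sketch. The shipped implementation is a shell generator plus a TypeScript wrapper; the function names below are hypothetical, and treating a component with no Flux entry as having no dependencies is an assumption.

```python
def bare_id(name):
    """Strip the bp- prefix so ids match the wizard's bare component ids."""
    return name[3:] if name.startswith("bp-") else name


def build_blueprint_deps(helm_releases):
    """Map each HelmRelease to its dependsOn list, prefix stripped on both sides."""
    return {
        bare_id(release): [bare_id(dep) for dep in depends_on]
        for release, depends_on in helm_releases.items()
    }


def apply_flux_deps(components, blueprint_deps):
    """Return copies of the components with `dependencies` overridden from Flux.

    Assumption: a component without a Flux entry gets an empty list.
    """
    return [
        {**component, "dependencies": list(blueprint_deps.get(component["id"], []))}
        for component in components
    ]


if __name__ == "__main__":
    flux = {
        "bp-keycloak": ["bp-cert-manager", "bp-gateway-api"],
        "bp-openbao": ["bp-spire", "bp-gateway-api", "bp-cnpg"],
    }
    raw = [{"id": "keycloak", "dependencies": ["cnpg"]}]
    print(apply_flux_deps(raw, build_blueprint_deps(flux)))
```

The hand-maintained `["cnpg"]` literal for keycloak is ignored and replaced by the Flux-derived list, matching the comparison table in the commit message.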
|
|
8d2ba0495d
|
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584) |
||
|
|
7c3ff940ff |
fix(ci): update solver_test.go fixtures + expected-bootstrap-deps.yaml for #550
- core/cmd/cert-manager-dynadot-webhook/solver_test.go: fix SetDns2Response → SetDnsResponse and ResponseCode:"0" → ResponseCode:0 in test fixtures so webhook command tests pass against the corrected dynadot-client JSON parsing
- scripts/expected-bootstrap-deps.yaml: declare bp-cert-manager-dynadot-webhook at slot 49b so the bootstrap-kit dependency-graph audit passes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
f689766615
|
fix(infra): add explicit dependsOn to bp-openbao + bp-catalyst-platform (#512) (#513)
Phase-8a-preflight live deployment otech16 (9e14dcc0d2de7586, 2026-05-02):
even after bumping install/upgrade timeout to 15m (commit
|
||
|
|
e1f7d22f3c
|
fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503) (#505)
Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream
Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them
ahead of every chart that ships HTTPRoute templates: bp-openbao,
bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor,
bp-grafana.
Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on
catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before
stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to
InstallFailed with `no matches for kind "HTTPRoute" in version
"gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI.
enabled=true` flag wires up the cilium gateway controller and creates
the `cilium` GatewayClass, but does NOT install the
gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no
`installCRDs`-equivalent knob for gateway-api so the upstream CRDs must
ship via a separate Blueprint.
Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by
the founder for ALL similar future cases: intra-chart CRD-ordering
breaks → split into two charts + Flux dependsOn. Mirrors the
bp-crossplane/bp-crossplane-claims and bp-external-secrets/
bp-external-secrets-stores splits.
Files:
- platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with
per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0
standard-install.yaml; helm.sh/resource-policy: keep on every CRD so
Helm uninstall does not orphan every HTTPRoute on the cluster
- platform/gateway-api/chart/scripts/regenerate.sh — developer tool
for re-vendoring on upstream version bump (annotation-driven)
- platform/gateway-api/chart/tests/crd-render.sh — chart integration
test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin)
- clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease
+ HelmRepository, dependsOn bp-cilium
- clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea,
11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml —
add `dependsOn: bp-gateway-api`
- clusters/_template/bootstrap-kit/kustomization.yaml — register
01a-gateway-api.yaml between 01-cilium and 02-cert-manager
- scripts/expected-bootstrap-deps.yaml — declare slot 1a + add
bp-gateway-api to depends_on of every HTTPRoute-using slot
Closes #503
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1865ac8975
|
fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) (#504)
* fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340)
The upstream seaweedfs/seaweedfs 4.22.0 chart now ships templates/shared/security-configmap.yaml which calls fromToml, a Sprig function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm SDK older than 3.13 and PARSES every template before any {{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's mere presence breaks install on every Sovereign with:
parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21): function "fromToml" not defined
even though enableSecurity defaults to false. Setting the gate value does NOT skip parsing; only deleting / never-shipping the file does.
Fix shape (per ticket #340):
1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/ (committed bytes, not auto-pulled at build time). Required because the upstream Helm repo overwrites 4.22.0 in place: re-pulling would re-introduce the broken file.
2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml. Every other template that references the deleted ConfigMap is gated under {{- if enableSecurity }} so removing it is a no-op for our default deployment shape (Catalyst SeaweedFS auth happens at the S3 layer via IAM creds from External Secrets, not via the upstream chart's TLS/JWT machinery).
3. Drop the dependencies: block from chart/Chart.yaml; add annotations.catalyst.openova.io/no-upstream=true so the blueprint-release workflow's hollow-chart guard (issue #181) skips the auto-pull/round-trip checks for this chart.
4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the vendored bytes are tracked.
5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled).
6. Add tests/no-fromtoml.sh, a chart-test that asserts the offending file stays deleted across future re-vendors. Runs in .github/workflows/blueprint-release.yaml as a publish-gating check.
Unblocks Phase-8a observability + storage chain on otech (bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn bp-seaweedfs).
Closes #340
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps
The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines 35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct architecture per ADR-0001 §13: Harbor writes blobs directly to cloud Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG in scripts/expected-bootstrap-deps.yaml was never updated to match.
Pre-existing drift on main; surfaced by the dependency-graph-audit check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the audit passes on the same PR; the two changes are both about the storage chain on Sovereigns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
87ba48c44e
|
fix(ci): vendor-coupling guardrail path - products/catalyst/bootstrap/api/internal/objectstorage (closes #438) (#440)
The mode-gate check was looking for ${REPO_ROOT}/internal/objectstorage
but the actual Go package lives at products/catalyst/bootstrap/api/internal/objectstorage.
Update the path so hard-fail mode auto-engages on this repo.
Validation:
bash scripts/check-vendor-coupling.sh
-> HARD-FAIL mode banner emitted, exit 0 on clean tree
Synthetic 'hetzner-object-storage' under platform/ -> exit 1.
Refs: PR #437 (#383) which surfaced the bug.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
|
||
|
|
0fdd411e79
|
ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428) (#431)
Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml
that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names
(hetzner|aws|gcp|azure|oci) appearing in capability-named slots:
1. <vendor>-object-storage (sealed-secret / overlay-secret name)
2. <chart>Overlay\.<vendor>\. (chart values block keyed to vendor)
3. <vendor>ObjectStorage (camelCase payload field)
Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/,
internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR
refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may
discuss the rule).
Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425
work-in-progress); hard-fail once that directory lands. Locally on this branch
the script emits 49 warnings to stderr and exits 0 against the existing
hetzner-coupled references in platform/velero, platform/seaweedfs, and
clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those
warnings disappear and any future re-introduction fails CI.
Workflow trigger surface: push-to-main + pull_request on the scanned paths +
workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled".
Canonical seam used: scripts/ + .github/workflows/ (mirrors
scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml
shape). NOT a duplicate - no prior vendor-coupling guard existed.
Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map)
docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode)
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
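The three capability-slot patterns and the exclusions listed in the guardrail commit above can be approximated in Python as follows. This is a sketch under stated assumptions, not the contents of `scripts/check-vendor-coupling.sh`: the regexes, the `scan` and `exit_code` names, and the reuse of the same vendor list for the excluded `<provider>` directories are all hypothetical.

```python
import re

VENDORS = r"(?:hetzner|aws|gcp|azure|oci)"

# The three slot shapes named in the commit message.
PATTERNS = [
    re.compile(rf"\b{VENDORS}-object-storage\b"),   # 1. secret name
    re.compile(rf"\w+Overlay\.{VENDORS}\."),        # 2. chart values block
    re.compile(rf"\b{VENDORS}ObjectStorage\b"),     # 3. camelCase payload field
]

# Legitimately per-provider paths (assumption: same vendor list as above).
EXCLUDED_PATH = re.compile(
    rf"(?:^|/)(?:infra|internal|internal/objectstorage|core/pkg)/{VENDORS}/"
)


def scan(path, text):
    """Return (line_number, line) pairs that couple a capability slot to a vendor."""
    if path.endswith(".md") or EXCLUDED_PATH.search(path):
        return []
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "crossplane-contrib/provider-" in line:
            continue  # Crossplane Provider CR refs are allowed
        if any(pattern.search(line) for pattern in PATTERNS):
            findings.append((number, line.strip()))
    return findings


def exit_code(findings, hard_fail):
    """Warn-only mode always exits 0; hard-fail mode exits 1 on any finding."""
    return 1 if hard_fail and findings else 0


if __name__ == "__main__":
    sample = "secretName: hetzner-object-storage\n"
    print(scan("platform/velero/chart/values.yaml", sample))
```

The `hard_fail` flag stands in for the mode gate, which the commit ties to the existence of the objectstorage package directory.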
|
|
92b7db622d
|
fix(bp-external-secrets-stores): split ClusterSecretStore into separate chart per #247 pattern (closes #331) (#426)
* fix(bp-external-secrets): split ClusterSecretStore into bp-external-secrets-stores chart (resolves CRD ordering, closes #331) bp-external-secrets@1.0.0 deadlocked on first install on otech.omani.works: Helm install failed for release external-secrets-system/external-secrets with chart bp-external-secrets@1.0.0: failed post-install: unable to build kubernetes object for deleting hook bp-external-secrets/templates/clustersecretstore-vault-region1.yaml: resource mapping not found for name: "vault-region1" namespace: "" no matches for kind "ClusterSecretStore" in version "external-secrets.io/v1beta1" Root cause: Helm's `helm.sh/hook-delete-policy: before-hook-creation` ran a kubectl-style lookup of the existing ClusterSecretStore CR before the upstream `external-secrets` subchart's CRDs finished registration. The in-line ClusterSecretStore template (templates/clustersecretstore-vault- region1.yaml) and the upstream subchart's CRDs co-installed in the same release; admission ordering wasn't deterministic enough to make the post-install hook safe. Fix — same pattern as PR #247 (bp-crossplane@1.1.3 ↔ bp-crossplane-claims@1.0.0): split the chart into controller + stores. Flux dependsOn orders them. - bp-external-secrets@1.1.0 — controller-only (just upstream subchart + NetworkPolicy + ServiceMonitor toggle). CRDs register here. - bp-external-secrets-stores@1.0.0 (NEW) — the default ClusterSecretStore CR; depends on bp-external-secrets being Ready. No Helm hooks needed: by the time this chart's HelmRelease starts, Flux has already verified bp-external-secrets is Ready=True and therefore the CRDs are registered. 
Files: NEW: platform/external-secrets-stores/blueprint.yaml (1.0.0) NEW: platform/external-secrets-stores/chart/Chart.yaml (1.0.0; no upstream subchart, annotation `catalyst.openova.io/no-upstream: "true"`) NEW: platform/external-secrets-stores/chart/values.yaml (clusterSecretStore.* knobs moved from controller chart) MOVED: platform/external-secrets/chart/templates/clustersecretstore-vault-region1.yaml → platform/external-secrets-stores/chart/templates/clustersecretstore-vault-region1.yaml (Helm hook annotations removed — Flux dependsOn now handles ordering) TOUCHED: platform/external-secrets/chart/Chart.yaml (1.0.0 → 1.1.0; description note appended) TOUCHED: platform/external-secrets/blueprint.yaml (1.0.0 → 1.1.0) TOUCHED: platform/external-secrets/chart/values.yaml (clusterSecretStore block removed; pointer comment added) NEW: clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml (Flux HelmRelease, dependsOn: [bp-external-secrets, bp-openbao]) TOUCHED: clusters/_template/bootstrap-kit/15-external-secrets.yaml (chart version 1.0.0 → 1.1.0) TOUCHED: clusters/_template/bootstrap-kit/kustomization.yaml (slot 15a inserted after 15) Out of scope for this PR (separate tickets): - blueprint-release.yaml CI fan-out: verify the path-matrix picks up the new platform/external-secrets-stores/ directory automatically; if not, add the directory to the matrix in a follow-up. - Per-Sovereign cluster directory edits (#257 will delete those). - Phase 0 minimum trim (#310 will renumber slots; this PR uses 15a as a non-disruptive sub-slot insertion that works with both the current 35-slot kustomization and the eventual 15-slot canonical layout — when #310 renumbers, 15 + 15a become 08 + 09 in the canonical order). 
Refs: #331 (this issue), #247 (pattern reference — bp-crossplane split), Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): register bp-external-secrets-stores in expected-bootstrap-deps.yaml The dependency-graph-audit CI step rejected PR #334 because the new bp-external-secrets-stores HR was on disk at slot 15a but missing from the expected DAG. This commit adds it with the same dependsOn shape as clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml: [bp-external-secrets, bp-openbao]. Refs: #331, #310 (Phase 0 minimum), PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bp-external-secrets): retire CR cases from controller test, add stores-toggle (#331) After splitting the default ClusterSecretStore into bp-external-secrets-stores @1.0.0, the controller chart's observability-toggle integration test still expected the CR to render in the controller chart (Cases 4 + 5). Those assertions now belong on the new chart. Changes: - platform/external-secrets/chart/tests/observability-toggle.sh: Replace Cases 4+5 with a single inverted assertion — the controller chart MUST render ZERO ClusterSecretStore CRs (top-level kind:); only the upstream subchart's CRD definition (whose spec.names.kind value is "ClusterSecretStore" at non-zero indent) is allowed. - platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh: NEW. Mirrors the retired Cases 4+5 against the stores chart, plus a Case 3 that asserts clusterSecretStore.server overrides propagate. Local smoke: bash platform/external-secrets/chart/tests/observability-toggle.sh → 4/4 PASS bash platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh → 3/3 PASS Refs: #331, PR #334. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): handle alphanumeric sub-slot suffixes in check-bootstrap-deps.sh PR #334 (issue #331) added slot 15a-external-secrets-stores as a sub-slot between numeric slots 15 and 16. The bootstrap-deps audit script's `printf '%02d'` formatter rejected `15a` with: scripts/check-bootstrap-deps.sh: line 390: printf: 15a: invalid number Fix: detect non-numeric slot tokens and pass them through verbatim. Numeric slots still render as zero-padded `01..49` for output alignment. Local smoke: $ bash scripts/check-bootstrap-deps.sh ... [P] slot 15 bp-external-secrets <-- bp-cert-manager bp-openbao [P] slot 15a bp-external-secrets-stores <-- bp-external-secrets bp-openbao ... OK: bootstrap-kit dependency graph audit PASSED Refs: #331, PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(wbs): tick #331 chart-released bp-external-secrets@1.1.0 (controller-only) + bp-external-secrets-stores@1.0.0 (NEW) shipped in PR #426. Helm-template acceptance + both toggle tests + dependency-graph-audit all green. Sovereign-impact deferred to Phase 8. Refs: #331, PR #426. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
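The sub-slot fix described above (numeric slots zero-padded, tokens such as `15a` passed through verbatim) can be sketched as follows. This is a minimal Python illustration of the logic only, not the bash code in scripts/check-bootstrap-deps.sh; the function names `format_slot` and `slot_sort_key` are hypothetical, and the sort key is an added assumption about how a sub-slot could be ordered between 15 and 16.

```python
import re


def format_slot(token: str) -> str:
    """Zero-pad purely numeric slot tokens; return sub-slots like '15a' unchanged."""
    if re.fullmatch(r"\d+", token):
        return f"{int(token):02d}"
    return token


def slot_sort_key(token: str) -> tuple[int, str]:
    """Hypothetical ordering helper: numeric part first, then the letter suffix."""
    match = re.fullmatch(r"(\d+)([a-z]*)", token)
    if match is None:
        raise ValueError(f"unrecognised slot token: {token!r}")
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    for slot in sorted(["16", "15a", "15", "2"], key=slot_sort_key):
        print("slot", format_slot(slot))
```

Treating the numeric prefix and the suffix as separate sort components keeps `15a` between `15` and `16` without forcing it through an integer formatter, which is the failure the commit message reports.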
||
|
|
f7796ef807
|
feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) (#423)
* feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) Velero on a Hetzner Sovereign now writes its backups DIRECTLY to Hetzner Object Storage per ADR-0001 §13 (S3-aware app architecture rule) + docs/omantel-handover-wbs.md §3 — NOT SeaweedFS, which is reserved as a POSIX→S3 buffer for legacy POSIX-only writers and is not in the minimal Sovereign set. Mirrors the Hetzner-direct backend pattern Agent #383 is wiring for Harbor; both consume the canonical flux-system/hetzner-object-storage Secret shipped by issue #371 (cloud-init writes 5 keys: s3-endpoint / s3-region / s3-bucket / s3-access-key / s3-secret-key, derived from the operator-issued Hetzner-Console keys + the per-Sovereign bucket provisioned by OpenTofu's aminueza/minio resource). platform/velero/chart/ (umbrella chart, bumped to 1.1.0): - templates/_helpers.tpl: NEW — bp-velero.fullname / bp-velero.labels helpers + bp-velero.hetznerCredentialsSecretName (default `velero-hetzner-credentials`). - templates/hetzner-credentials-secret.yaml: NEW — synthesises a velero-namespace Secret with a single `cloud` key in AWS-CLI INI format from .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}. The upstream Velero deployment mounts this at /credentials/cloud via existingSecret + AWS_SHARED_CREDENTIALS_FILE. Skip-render path when veleroOverlay.hetzner.enabled is false (default — keeps contabo render clean) or useExistingSecret is true (operator supplied Secret out-of-band). - values.yaml: BSL provider/region/s3Url/bucket fields populated as placeholders the per-Sovereign HelmRelease overrides via Flux valuesFrom; backupsEnabled defaults FALSE so default render emits no half-broken BSL; veleroOverlay.hetzner block surfaces the operator-overridable fields. Long-form rationale comments inline on each value per the chart's existing docstring style. 
clusters/_template/bootstrap-kit/34-velero.yaml (+ omantel + otech):
- dependsOn: bp-seaweedfs REMOVED. Velero is no longer a SeaweedFS consumer on Sovereigns (was the old SeaweedFS-tiered architecture that minimal-omantel retired in favour of cloud-native S3).
- chart version bumped 1.0.0 → 1.1.0.
- valuesFrom block added: 5 Secret-key entries pull each canonical s3-* key into the matching umbrella value path. Plaintext credentials never appear in the committed manifest; Flux dereferences valuesFrom at HelmRelease apply time.
- values block adds the baseline veleroOverlay.hetzner.enabled=true + velero.credentials.{useSecret:true,existingSecret:velero-hetzner-credentials} + BSL provider/credential/s3ForcePathStyle scaffolding that the valuesFrom entries fill in.
docs/omantel-handover-wbs.md:
- §2 row 19: "❌ chart needs S3 endpoint rework" → "🟢 chart-released v1.1.0 — Hetzner Object Storage backend wired to #371 secret".
- §9 #384 row: detailed status with smoke evidence.
Smoke evidence (contabo, default values, no Hetzner credentials):
- helm template t . → renders cleanly (no Hetzner Secret, no BSL).
- helm template t . --set veleroOverlay.hetzner.enabled=true \
  --set ...accessKey=AK_TEST --set ...secretKey=SK_TEST \
  --set velero.backupsEnabled=true (+ BSL config)
  → Secret/velero-hetzner-credentials with `cloud` INI key emitted + BackupStorageLocation/default with provider=aws, bucket=omantel-velero, region=fsn1, s3Url=https://fsn1.your-objectstorage.com.
- helm install velero-smoke . -n velero-smoke (defaults) → pod velero-69bb84c5-669sh Ready 1/1 in 48s. Smoke torn down clean.
Hetzner-S3 E2E deferred to Phase 8 (first omantel run): contabo has no Hetzner Object Storage credentials so end-to-end backup→restore verification can't run here.
Anti-duplication rule: NO bash scripts authored, NO parallel implementations of upstream Velero functionality.
Upstream Velero + velero-plugin-for-aws natively support any S3-compatible backend; the work here is values + a credential-shape adapter Secret, not a fork.
Closes #384.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(scripts): drop bp-seaweedfs dep from bp-velero expected DAG (#384)
Mirrors the dependsOn removal in clusters/_template/bootstrap-kit/34-velero.yaml from the parent commit. Velero on Hetzner Sovereigns now writes directly to Hetzner Object Storage (ADR-0001 §13 + WBS §3); no in-cluster prerequisite Blueprint is required.
Local `bash scripts/check-bootstrap-deps.sh` now passes (0 drift, 0 cycles). The CI failure on the parent commit's PR was the audit flagging bp-velero as having a missing edge to bp-seaweedfs because this expected-DAG file still listed it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
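The "credential-shape adapter Secret" mentioned above holds a single `cloud` key in AWS-CLI INI format. The sketch below shows what that payload could look like when built from an access key and secret key. It is an illustration in Python, not the chart's Helm template; the function names are hypothetical, and the `[default]` profile name and the base64 encoding of the `data` value are assumptions based on the standard AWS shared-credentials file layout and on how Kubernetes stores Secret `data`.

```python
import base64


def render_cloud_credentials(access_key: str, secret_key: str,
                             profile: str = "default") -> str:
    """Return an AWS shared-credentials INI document for one profile."""
    return (
        f"[{profile}]\n"
        f"aws_access_key_id={access_key}\n"
        f"aws_secret_access_key={secret_key}\n"
    )


def secret_data(access_key: str, secret_key: str) -> dict[str, str]:
    """Build the `data` map of a Secret carrying the INI under the `cloud` key."""
    ini = render_cloud_credentials(access_key, secret_key)
    # Kubernetes Secret `data` values are base64-encoded strings.
    return {"cloud": base64.b64encode(ini.encode("utf-8")).decode("ascii")}


if __name__ == "__main__":
    # AK_TEST / SK_TEST are the placeholder values used in the smoke evidence.
    print(render_cloud_credentials("AK_TEST", "SK_TEST"), end="")
```

Keeping the INI rendering separate from the Secret wrapping makes it easy to check the file content on its own, since that string is what the Velero pod would read through AWS_SHARED_CREDENTIALS_FILE.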
||
|
|
6e0f734d62
|
fix(bootstrap-kit): renumber bp-cert-manager-powerdns-webhook 36→49 + register in expected DAG (#373 followup) (#412)
PR #410 landed slot 36 for bp-cert-manager-powerdns-webhook, but slot 36
was already reserved in scripts/expected-bootstrap-deps.yaml for
bp-stunner (W2.K4 forward-declaration). The bootstrap-kit dependency
audit failed on the merge SHA
|
||
|
|
0289f0388d
|
feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259)
Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml, the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3.
The script parses every clusters/_template/bootstrap-kit/*.yaml, extracts metadata.name + spec.dependsOn for the HelmRelease document(s), and mechanically verifies the actual graph against the expected DAG declared in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch (W2.K1-K4) on success.
Behaviour against the in-flight expansion: HRs declared expected but not yet on disk are reported as "deferred" (informational, not an error), so that this script can be the static authoritative list while W2.K1-K4 PRs land their HR files in series. After all four W2 PRs merge, the "deferred" count drops to 0 and the audit goes 100% green.
Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a new dependency-graph-audit job that runs on every PR touching:
- clusters/** (any HR file edit)
- scripts/check-bootstrap-deps.sh
- scripts/expected-bootstrap-deps.yaml
- .github/workflows/test-bootstrap-kit.yaml
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
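The audit described above combines two checks: classifying expected-but-absent HelmReleases as "deferred", and detecting cycles with Kahn's algorithm. A minimal Python sketch of both follows. It is not the bash script itself; the function names are hypothetical, the graphs are plain dicts rather than parsed YAML, and ignoring edges to releases that are not yet on disk is an assumption made for the illustration.

```python
from collections import deque


def find_deferred(actual: dict[str, list[str]],
                  expected: dict[str, list[str]]) -> list[str]:
    """Names declared in the expected DAG that have no HelmRelease on disk yet."""
    return sorted(set(expected) - set(actual))


def topo_order(graph: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: return an install order or raise if a cycle exists."""
    indegree = {name: 0 for name in graph}
    dependents: dict[str, list[str]] = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            if dep in graph:  # assumption: skip edges to deferred releases
                indegree[name] += 1
                dependents[dep].append(name)
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order: list[str] = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in sorted(dependents[current]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(graph):
        stuck = sorted(set(graph) - set(order))
        raise ValueError(f"dependency cycle among: {stuck}")
    return order


if __name__ == "__main__":
    actual = {
        "bp-cert-manager": [],
        "bp-openbao": [],
        "bp-external-secrets": ["bp-cert-manager", "bp-openbao"],
        "bp-external-secrets-stores": ["bp-external-secrets", "bp-openbao"],
    }
    # bp-stunner with no dependencies is illustrative only.
    expected = dict(actual, **{"bp-stunner": []})
    print("deferred:", find_deferred(actual, expected))
    print("order:", topo_order(actual))
```

Any node left unprocessed after the queue drains still has an unmet dependency, which can only happen when the remaining nodes depend on each other, so that leftover set is reported as the cycle.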
||
|
|
9e3268f2c5 |
docs(ops): comprehensive operator runbook + remediation playbook + idempotent recovery script
Adds docs/RUNBOOK-OPERATIONS.md as the single operator-facing entry point for provisioning, troubleshooting, and recovering Catalyst Sovereigns:
A. Pre-provision checklist: Hetzner project + token, Dynadot pool zones + credentials, GHCR pull token (cross-link SECRET-ROTATION.md), PowerDNS pool zones bootstrapped, PDM healthy, bp-* chart versions, subchart-guard CI green.
B. Step-by-step walkthrough with timing: Phase 0 OpenTofu (30-60s plan + 60-120s apply), PDM /commit (~5s), cloud-init (3-5min), Phase 1 bootstrap-kit (10-15min), cert-manager + Cilium Gateway (1-2min). Total 15-25min for a solo Sovereign.
C. 18 known failure modes with SYMPTOM / ROOT CAUSE / DIAGNOSIS / RECOVERY, each pinned to the canonical fix commit ( |