Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull` because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init never created the Secret, so every fresh Sovereign's source-controller logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium. The operator workaround (kubectl apply by hand) is not durable across reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never serialized into persisted deployment records. provisioner.New() reads CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the Request before tofu.auto.tfvars.json. Validate() rejects empty for domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true, default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64 locals and passes both to templatefile(). cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so the GitRepository + Kustomization land into a cluster that already has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in the catalyst namespace (key: token, optional: true so the Pod still starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token, Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds. Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN> placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var, tolerance of missing env, pool-mode validation rejection with operator-facing error, BYO acceptance, and the json:"-" serialization invariant. tests/e2e/hetzner-provisioning gains a TestCloudInit_RendersGHCRPullSecret render-only integration test that asserts the rendered cloud-init contains the Secret, applies it before flux-bootstrap, and that the dockerconfigjson round-trips the sample token through templatefile() correctly. Existing pool-mode handler tests now t.Setenv the placeholder token; the on-disk redaction test asserts the placeholder never reaches disk.

Gates:

- go vet ./... and go test -race -count=1 ./... in products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Secret Rotation
The canonical list of credentials Catalyst-Zero handles, where each one lives, and how to rotate it.
Per INVIOLABLE-PRINCIPLES.md #10 (credential
hygiene): passwords, tokens, API keys, client secrets, kubeconfig
contents, TLS private keys, and .env values are all credentials and
treated identically. No credential is committed to git, ever. The
catalyst-api Pod's runtime env is the single source of truth for every
secret it consumes; persisted deployment records redact every one of them
via internal/store.Redact.
This document is the operator runbook for rotating each of those credentials on the schedule below — and the rollback path if a rotation breaks something live.
Rotation Schedule
| Credential | Where it lives | Rotation cadence | Rollback window |
|---|---|---|---|
| GHCR pull token (catalyst-ghcr-pull-token) | K8s Secret in catalyst ns, key token | Yearly | 24h via 1Password version history |
| Hetzner Cloud API token (per Sovereign) | Wizard input → catalyst-api memory only | Per Sovereign apply | n/a (single-use, never persisted) |
| Dynadot API key + secret (dynadot-api-credentials) | K8s Secret in openova-system ns, keys api-key + api-secret | Yearly (or on personnel change) | 24h via 1Password version history |
| Sovereign Admin SSO client secret (Keycloak catalyst-admin realm) | Per-Sovereign K8s Secret in keycloak ns | Yearly | 1h (Keycloak supports two active client secrets during rollover) |
| SOPS / SealedSecrets cluster key (per Sovereign) | K8s Secret in kube-system ns | Per Sovereign, never rotated post-bootstrap | n/a (re-key requires migrating every existing SealedSecret) |
The rest of this document is the per-credential procedure.
GHCR pull token (catalyst-ghcr-pull-token)
What it is. A long-lived GitHub Personal Access Token (PAT), classic or
fine-grained, with read-only access to packages (the read:packages scope
on a classic PAT) on the openova-io
organisation. The token authenticates the GHCR pulls Flux performs on
every freshly-provisioned Sovereign — every HelmRepository CR in
clusters/<sovereign-fqdn>/bootstrap-kit/ references the
flux-system/ghcr-pull Secret, and that Secret's content comes from this
token.
Why this token has its own runbook. The bootstrap-kit pulls the bp-*
OCI artifacts from ghcr.io/openova-io/, which is a private registry
path. Without the token, the source-controller logs:
failed to get authentication secret 'flux-system/ghcr-pull':
secrets "ghcr-pull" not found
…and Phase 1 stalls at bp-cilium. The fix that landed this runbook
(fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly) makes the cloud-init template write
the Secret BEFORE kubectl apply -f flux-bootstrap.yaml, but the token
itself is never in the template — OpenTofu interpolates it at apply time
from var.ghcr_pull_token, sourced from the catalyst-api Pod's env var
CATALYST_GHCR_PULL_TOKEN.
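The payload of a kubernetes.io/dockerconfigjson Secret follows the Docker config file layout, where the auth field is the base64 encoding of username:password. The sketch below shows that encoding in plain shell. It is an illustration only, not the cloud-init template itself: the helper name build_dockerconfigjson is hypothetical, the exact fields the real template emits are an assumption, and it assumes the token contains no characters that need JSON escaping.

```shell
# Hypothetical helper (not part of the repository): prints a
# .dockerconfigjson payload for ghcr.io on stdout.
# Assumes neither argument contains characters that need JSON escaping.
build_dockerconfigjson() {
  user="$1"
  token="$2"
  # auth is base64("username:password"); tr strips the line wrapping
  # that some base64 implementations add to long output.
  auth=$(printf '%s:%s' "$user" "$token" | base64 | tr -d '\n')
  printf '{"auths":{"ghcr.io":{"username":"%s","password":"%s","auth":"%s"}}}' \
    "$user" "$token" "$auth"
}
```

Call it only with a placeholder token or with stdout redirected; the output contains the token in clear text.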
Where the token must NEVER be: git (any branch, any repo), the
bootstrap-kit YAMLs, the catalyst-api Pod logs, the Hetzner project
metadata, Slack/email/issue bodies. The provisioner stamps it onto the
Request struct in memory, writes tofu.auto.tfvars.json (mode 0600), and
that file is wiped when the per-deployment workdir is cleared. The
json:"-" tag on Request.GHCRPullToken keeps it out of the persisted
deployment records (see internal/store.Redact).
Generation
Generate a fine-grained PAT (preferred over classic PATs):
- https://github.com/settings/personal-access-tokens/new
- Resource owner: openova-io
- Repository access: Public Repositories (read-only) — this is sufficient because GHCR packages inherit the openova-io org's GHCR visibility settings; the token does not need repo-level access.
- Permissions:
  - Account → Packages → Read (the only scope this token uses)
- Expiration: 365 days (next rotation date — write it on the 1Password item).
- Generate. Copy the token to 1Password immediately (the page shows it once); never paste it into a terminal or a chat window.
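The expiration date to record on the 1Password item can be derived from the generation date instead of counted by hand. A minimal sketch, assuming GNU date (the -d option is not available in BSD/macOS date); the helper name expiry_date is hypothetical.

```shell
# Hypothetical helper: prints the date 365 days after a generation date.
# Relies on GNU date's -d option; -u avoids daylight-saving surprises.
expiry_date() {
  generated="$1"   # generation date, YYYY-MM-DD
  date -u -d "$generated +365 days" +%F
}
```

Because the token lifetime is 365 days rather than a calendar year, a token generated in a leap year expires one calendar day earlier than the anniversary.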
Storage
1Password vault: OpenOva — Production
Item title: Catalyst — GHCR pull token (catalyst-ghcr-pull-token)
Tags: catalyst, ghcr, rotation:yearly
Notes field on the 1Password item must record:
- Generation date.
- Expiration date.
- Username paired with this token at the registry: openova-bot (the literal string the cloud-init template uses; GitHub validates the token, not the username, but this string lands in audit-trail JSON).
- Operator who generated it.
Apply (the one-liner)
Replace <GHCR_PULL_TOKEN> with the token retrieved from 1Password —
never paste a real token into git, an issue, a commit message, or a
terminal session that will be transcribed.
kubectl create secret generic catalyst-ghcr-pull-token \
--namespace=catalyst \
--from-literal=token='<GHCR_PULL_TOKEN>' \
--dry-run=client -o yaml | \
kubectl apply -f -
The --dry-run=client … | kubectl apply -f - form is idempotent: a fresh
install creates the Secret; a rotation overwrites the existing one
in-place. The catalyst-api Deployment must be rolled to pick up the new
value:
kubectl -n catalyst rollout restart deployment/catalyst-api
kubectl -n catalyst rollout status deployment/catalyst-api
(secretKeyRef-mounted env vars are NOT auto-refreshed by the Pod —
only volume mounts are. The catalyst-api chart mounts the token as
env.valueFrom.secretKeyRef, so a rollout is required.)
Verify
# The Secret exists with the expected key.
kubectl -n catalyst get secret catalyst-ghcr-pull-token \
-o jsonpath='{.data.token}' | base64 -d | wc -c
# (Output: a non-zero byte count. NEVER drop the trailing `| wc -c`:
# without it the decoded token prints to your terminal.)
# The catalyst-api Pod read it cleanly at startup.
kubectl -n catalyst logs deploy/catalyst-api | grep -i 'ghcr' || \
echo "no ghcr-related warning — provisioner picked up the token"
# A fresh /api/v1/deployments POST validates without the
# 'CATALYST_GHCR_PULL_TOKEN missing' error (expected for managed-pool
# domain mode).
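Alongside the byte-count check above, the shape of the stored value can be checked without revealing it. GitHub fine-grained tokens begin with github_pat_ and classic tokens begin with ghp_, so a pasted typo or the wrong kind of token shows up as an unexpected prefix. A minimal sketch; the helper name token_kind is hypothetical.

```shell
# Hypothetical helper: reads a token on stdin and prints only its kind
# and its length in characters, never the value itself.
token_kind() {
  IFS= read -r tok || true   # tolerate input without a trailing newline
  case "$tok" in
    github_pat_*) kind="fine-grained" ;;
    ghp_*)        kind="classic" ;;
    *)            kind="unknown" ;;
  esac
  printf '%s %s\n' "$kind" "${#tok}"
}
```

It is meant to sit at the end of the same pipeline as the byte count, in place of wc -c: kubectl -n catalyst get secret catalyst-ghcr-pull-token -o jsonpath='{.data.token}' | base64 -d | token_kind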
Rollback
If the new token does not authenticate (typo, wrong scope, expired):
- Open 1Password's item version history; copy the previous token.
- Re-run the kubectl create secret … --dry-run=client | kubectl apply one-liner with the previous token.
- kubectl -n catalyst rollout restart deployment/catalyst-api.
- File a follow-up issue to investigate why the new token failed.
The previous token remains valid until its own expiration date (recorded on the 1Password item); GitHub does not invalidate replaced fine-grained tokens automatically. Revoke the broken token in the GitHub UI as a hygiene step once rollback succeeds.
Hetzner Cloud API token (per Sovereign)
Captured by the wizard's StepProvider, lives in catalyst-api memory only
for the duration of one deployment. NEVER persisted (the
Request.HetznerToken field is json:"-"; internal/store.Redact
overwrites it with <redacted> for any record that ends up on disk).
Rotation: per-Sovereign apply. Each tofu apply accepts a fresh token;
once tofu apply returns, catalyst-api drops the value out of memory
(the Pod restart on next image roll loses the in-memory copy regardless).
If a Hetzner token is suspected of leaking: revoke at https://console.hetzner.cloud/projects → Security → API tokens. The next wizard run will accept a fresh one.
Dynadot API key + secret (dynadot-api-credentials)
K8s Secret in openova-system namespace, keys: api-key, api-secret,
domain (legacy single-domain), domains (comma-separated list,
preferred).
Yearly rotation via the Dynadot account UI:
- https://www.dynadot.com → My Account → API Settings → Regenerate.
- Copy both halves to the 1Password item Dynadot — OpenOva pool domains API credentials.
- Apply:
kubectl create secret generic dynadot-api-credentials \
--namespace=openova-system \
--from-literal=api-key='<DYNADOT_API_KEY>' \
--from-literal=api-secret='<DYNADOT_API_SECRET>' \
--from-literal=domains='omani.works' \
--dry-run=client -o yaml | \
kubectl apply -f -
kubectl -n catalyst rollout restart deployment/catalyst-api
kubectl -n openova-system rollout restart deployment/pool-domain-manager
The domains value is the comma-separated allowlist of pool domains
this account manages. Adding a third pool domain (e.g. acme.io) is a
secret update, not a code change — see
INVIOLABLE-PRINCIPLES.md #4.
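To illustrate what a comma-separated allowlist means in practice, the sketch below checks one domain against a domains value. It is an assumption about the matching rule (whole entries only, surrounding spaces ignored), not a description of how pool-domain-manager parses the key; the helper name domain_allowed is hypothetical.

```shell
# Hypothetical helper: succeeds if $1 is a whole entry in the
# comma-separated list $2. Spaces around entries are ignored.
domain_allowed() {
  domain="$1"
  allowlist=$(printf '%s' "$2" | tr -d ' ')
  # Wrap both sides in commas so only complete entries match.
  case ",$allowlist," in
    *",$domain,"*) return 0 ;;
    *)             return 1 ;;
  esac
}
```

With the value 'omani.works,acme.io', domain_allowed omani.works succeeds while a partial string such as works does not.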
Cross-cutting rules
- NEVER print a credential to a terminal. All retrievals pipe to a file (> /path && chmod 600) or directly into kubectl create secret --from-literal. Session transcripts are durable.
- NEVER commit a credential. Use this runbook's kubectl create secret … | kubectl apply one-liner; the value never touches a file the working tree tracks.
- NEVER skip the rollout restart. secretKeyRef env vars are read at Pod start. A Secret update with no rollout is a silent half-rotation: existing Pods serve the old value, new Pods (after the next eviction) serve the new one. The catalyst-api is single-replica with strategy Recreate, so this is one step.
- Log only metadata, never the value. kubectl describe secret shows only the size of each key (token: N bytes), never the value; that is intentional. Reading the value via -o jsonpath and piping to a file is the sanctioned confirmation path; piping to cat/echo is not.
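The > /path && chmod 600 form in the first rule leaves a short window in which the file exists with the default permissions. One way to close that window is to tighten the umask before the file is created. A minimal sketch; the helper name write_private is hypothetical.

```shell
# Hypothetical helper: stores stdin in a file that is owner-only from
# the moment it is created.
write_private() {
  target="$1"
  # An existing file would keep its old mode, so remove it first.
  rm -f -- "$target"
  # The subshell keeps the tightened umask from leaking into the caller.
  ( umask 077 && cat > "$target" )
}
```

A retrieval then ends in | write_private /path instead of a redirect followed by chmod.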
If you accidentally expose a credential (printed it to a terminal that will be transcribed, committed it to a branch, or posted it to an issue), rotate immediately following this runbook. Do not try to "quietly fix it" by editing history; assume the leaked value is captured.