The case of the self-signed cert: how a Traefik v3.7 upgrade quietly broke TLS

I upgraded Traefik from v3.6 to v3.7 on my single-node k3s box. The chart bump applied cleanly, the pods went green, and every HTTPS site started serving CN=TRAEFIK DEFAULT CERT — Traefik’s built-in self-signed placeholder — instead of the real Let’s Encrypt wildcard. Health checks stayed green the whole time.

Here’s the wrong turn I took, the minimal reproduction that set me straight, and the one-line root cause.

The setup

The cluster runs Traefik as the only ingress, fronting everything through the Gateway API provider (Gateway + HTTPRoute, not the legacy Ingress). Wildcard TLS (*.example.com, plus a handful of other domains) is terminated at the gateway. Pulumi manages it all. The upgrade was a routine Helm chart bump (39.0.8 → 40.2.0, which ships Traefik proxy v3.6.13 → v3.7.1).

Symptom

After pulumi up:

curl -v https://example.com → ssl_verify_result != 0, issuer CN=TRAEFIK DEFAULT CERT.
Same for every other host.
Traefik pod Running, readiness probe green, no obvious crash.
Rolling back to v3.6.13 → everything served the correct cert again.

A clean v3.6-works / v3.7-breaks bisection. Classic “what changed in the new version” — so I went looking for a TLS bug.

Wrong turn #1: “it must be the wildcard listener / SNI”

My gateway had one hostname-less HTTPS listener carrying the wildcard cert. I found an open Traefik issue (#12742) about Gateway listener hostnames not affecting cert selection, and a v3.7 rewrite of the Gateway TLS pipeline. The story fit: maybe v3.7 stopped binding a cert to a hostname-less listener.

So I split the catch-all into explicit *.example.com + apex example.com listeners and reworked route attachment. Reasonable hardening — but when I retried v3.7, it still served the default cert. The hypothesis was wrong.

The minimal reproduction (where it pays off)

Instead of guessing further on production, I built a throwaway kind harness: one Gateway, a matrix of HTTPS listener shapes (single host, wildcard, shared-secret apex+wildcard, distinct secrets), self-signed certs, and an openssl s_client -servername probe per hostname that classified the served cert as real vs default. Run it against v3.6.13 and v3.7.1, diff the tables.

The result demolished my theory: on v3.7.1 with the CRDs set up correctly, every listener shape served its cert correctly. There was no wildcard/SNI bug. But in an early run I’d seen all hosts go to the default cert — and the difference between that run and the working one wasn’t the listeners at all. It was which Gateway API CRDs were installed.

Root cause

Traefik v3.7’s Kubernetes Gateway provider watches TLSRoute and BackendTLSPolicy at v1. If it can’t watch one of them, the provider’s informer cache never finishes syncing, so the GatewayClass is never marked Accepted, no listener is programmed, and every HTTPS connection falls back to the default self-signed cert. It fails closed — and silently, because the pod’s health checks don’t depend on the Gateway sync.

My cluster had the Gateway API standard channel at v1.2.1, installed long ago. TLSRoute and a v1-served BackendTLSPolicy only landed in the standard channel in Gateway API v1.5.0. So my CRDs simply didn’t have what v3.7 insists on watching. Traefik v3.6 tolerated their absence; v3.7 does not.

Confirming logs (once I knew to look):

Failed to watch *v1.TLSRoute: failed to list *v1.TLSRoute:
  the server could not find the requested resource
Failed to watch *v1.BackendTLSPolicy: ...

The fix

Bring the Gateway API CRDs up to v1.5.1 (the standard channel is enough — the experimental channel is only needed for TCPRoute):

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml

Then the v3.7 chart bump just works: every host served its real Let’s Encrypt cert again, and the Failed to watch lines vanished. The listener split I did on the wrong hunch was unnecessary (harmless, but unnecessary).

This requirement is documented — Traefik’s v3.7 migration guide says the Gateway provider moved to Gateway API v1.5.1 and the CRDs must be updated. The Helm chart 40.x also deliberately stopped bundling Gateway API CRDs (they aren’t Traefik’s to own). The two facts combine into a trap: the chart that upgrades your proxy no longer ships the CRDs the upgraded proxy now hard-requires.

What’s not documented

The failure mode. The docs say “update the CRDs”; they don’t say “if you don’t, every Gateway listener silently serves a self-signed cert while your health checks stay green.” A maintainer (issue #12321) framed the watch behavior as intentional — “never exit, run on best effort.” But the observed behavior isn’t graceful degradation: a single missing optional CRD takes down TLS for all hosts at once. That gap between stated intent and blast radius is the part worth flagging.

Takeaways

Read the migration guide’s CRD requirements before bumping. “Supports Gateway API v1.5.1, requires the CRDs to be updated” is easy to skim past; it was the whole ballgame.
Fail-closed + green health checks is a nasty combination. Probe what actually matters from outside (here: the served TLS cert via SNI), not just pod liveness.
A minimal repro is worth more than another hour of theorizing. Mine disproved my leading hypothesis and surfaced the real cause — which lived in a layer (CRD availability) I wasn’t even looking at.
When a Helm chart stops bundling CRDs, the CRD lifecycle becomes your job. Pin and upgrade them deliberately, in lockstep with the component that needs them.