Long time no see! I recently remembered this thing existed after doing a batch of upgrades on the infrastructure it’s running on. Since I hadn’t worked on this cluster in a while, I figured I’d bring most of my services up to date as quickly as I could, which meant bumping Helm charts or image tags to the latest release and hoping for the best.
After extinguishing a couple of unrelated, minor fires, mostly caused by how long I had left the cluster running on its own, I figured I was done and moved on. Then, a few days ago, I noticed that Cert-Manager (which I set up with the help of this article) was creating various acme-solver deployments that did not go away on their own, and was logging errors in the manager’s logs about receiving 404s instead of the expected 200 during ACME validation.
After investigating a bit, I noticed that the containers of my other services were the ones actually receiving the challenge requests:
0.0.0.0 - - [02/Dec/2021:14:41:28 +0000] "GET /.well-known/acme-challenge/-- HTTP/1.1" 302 1556 "-" "cert-manager/v1.6.0 (clean)"
Indeed, when Cert-Manager attempts to renew a certificate, it creates a solver and an ingress to expose it to the internet. These have to coexist with my other ingresses, which are usually set up to catch any request arriving at a specific domain, for example:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: foo-ingress
  namespace: foo
  annotations:
    kubernetes.io/ingress.class: "traefik"
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
  - host: foo.domain.tld
    http:
      paths:
      - path: /
        backend:
          serviceName: foo-service
          servicePort: 80
  tls:
  - hosts:
    - foo.domain.tld
    secretName: foo-domain-tld-tls
According to Traefik’s documentation, ingresses created by Cert-Manager should take priority over that kind of rule, as they specifically target the well-known directory containing the challenge. For some reason, that was apparently no longer the case.
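For context, here is roughly what a solver ingress looks like. This is a sketch rather than a dump from my cluster: the generated name suffix, the `<token>` placeholder, the API version and the solver port are all assumptions. The point is only that its rule is both a host and a long, specific path. As far as I understand, Traefik by default ranks routers by rule length, so a rule like this should normally win over a catch-all `/` on the same host.

```yaml
# Hypothetical solver ingress, for illustration only.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cm-acme-http-solver-xxxxx        # generated name, suffix is a placeholder
  namespace: foo
spec:
  rules:
  - host: foo.domain.tld
    http:
      paths:
      # Much more specific than the "/" path of foo-ingress
      - path: /.well-known/acme-challenge/<token>
        pathType: ImplementationSpecific
        backend:
          service:
            name: cm-acme-http-solver-xxxxx   # placeholder
            port:
              number: 8089                    # assumed solver port
```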
Fixing it was fairly simple: Cert-Manager’s ClusterIssuer objects can be configured to annotate the ingresses they create when solving challenges. So I modified my ClusterIssuer objects to define a priority, like so:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # The ACME server URL
    server: https://acme-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: bar@domain.tld
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod
    # Enable the HTTP-01 challenge provider
    solvers:
    - http01:
        ingress:
          ingressTemplate:
            metadata:
              annotations:
                kubernetes.io/ingress.class: "traefik"
                traefik.ingress.kubernetes.io/router.priority: "100"
And that’s it! The priority can be set to whatever is necessary, as long as it’s higher than that of the other ingresses.
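If you would rather make the ordering explicit on both sides, the regular ingress could carry a priority of its own. This is not something I needed to do, just a sketch: the value 10 is arbitrary, and it assumes Traefik honours the same router.priority annotation on an ordinary ingress. Only the metadata is shown; the spec stays the same as in the foo-ingress example.

```yaml
# Sketch: metadata of foo-ingress with an explicit, lower priority.
metadata:
  name: foo-ingress
  namespace: foo
  annotations:
    kubernetes.io/ingress.class: "traefik"
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Arbitrary value, chosen to be lower than the solver's "100"
    traefik.ingress.kubernetes.io/router.priority: "10"
```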