"Hands off Linkerd certificate rotation" blog post (#2053)
1 parent b1df8e1 commit 420e01b

4 files changed: +332 -1 lines changed

Lines changed: 331 additions & 0 deletions
@@ -0,0 +1,331 @@
---
date: 2025-10-20T00:00:00Z
title: Hands off Linkerd certificate rotation
keywords: [linkerd, "Cert Manager", automation]
params:
  author:
    name: Matthew McLane
    avatar: matthew-mclane.jpg
---

_This blog post was originally published on
[Matthew McLane's Medium blog](https://medium.com/@mclanem_45809/hands-off-linkerd-certificate-rotation-0e387fdeaa0a)._

I’ll start by saying that I think Linkerd is a **great tool**. We use it at work
to provide **TLS between our pods**, which frees us from having to build that
functionality directly into our containers. When it works, it’s fantastic! It’s
simple to get up and running and just does the job without a lot of extra fuss.
For the most part, it’s been a very hands-off experience, which is exactly what
we need.

Recently, though, a change to **cert-manager** caused our long-standing
certificates to unexpectedly rotate. This sent me on a journey to understand and
implement a **fully automated certificate rotation solution** for our Linkerd
service mesh, and I’d like to take you along for the ride.

## The Problem

Linkerd largely manages its own certificates, but it needs a trusted foundation:
a root anchor and an identity issuer certificate. Linkerd’s own documentation on
**[Automatically Rotating Control Plane TLS Credentials](/2/tasks/automatically-rotating-control-plane-tls-credentials/)**
explains this in detail. My goal was to build a completely automated solution
for our clusters, bypassing the need for manual `kubectl` commands. I wanted to
leverage our existing ArgoCD infrastructure to handle everything, including
regular certificate rotation and all the necessary restarts, without any manual
intervention.

## linkerd-certs helm chart

The first step in my solution was to create a simple **Helm chart** to lay down
the required certificates. Following the
[documentation](/2/tasks/automatically-rotating-control-plane-tls-credentials/),
this chart uses cert-manager to create three key resources:
`linkerd-trust-root-issuer`, `linkerd-trust-anchor`, and
`linkerd-identity-issuer`.

This Helm chart also sets up the `linkerd-identity-issuer` and the necessary
trust bundle within the Linkerd namespace. Essentially, this single chart
handles all the certificates needed to install Linkerd and enable its automatic
rotation feature.

## The rotation problem

As stated in the documentation:

> Rotating the identity issuer is basically a non-event: cert-manager can handle
> rotating the identity issuer completely on its own.
>
> ...
>
> Rotating the trust anchor is a bit different, because rotating the trust
> anchor mean that you have to restart both the Linkerd control plane and all
> the proxies while managing the trust bundle. In practice, this requires manual
> intervention, because while cert-manager can handle the hard work of actually
> rotating the trust anchor, it can’t trigger the needed restarts.

I really didn’t want to rely on anything that needed manual intervention. The
solution to this problem was fairly simple to work out, since all the heavy
lifting was provided in the
[documentation](/2/tasks/automatically-rotating-control-plane-tls-credentials/)!
I started by creating a set of shell scripts.

The first is a script to rotate the certificates:

```bash
#!/bin/bash
set -e
# Renew the trust anchor first, then the identity issuer.
echo "renewing linkerd-trust-anchor"
cmctl renew -n cert-manager linkerd-trust-anchor
echo "Waiting 120 seconds to allow for certs to update"
sleep 120
echo "---"

echo "renewing linkerd-identity-issuer"
cmctl renew -n linkerd linkerd-identity-issuer
echo "Waiting 120 seconds to allow for certs to update"
sleep 120
```

Next was a script to restart the Linkerd control plane pods. I also use this
moment to restart the linkerd-viz pods.

```bash
#!/bin/bash
set -e
echo "---"
echo "Restarting linkerd control plane"
# Restart the control plane deployments and wait for the rollout to finish.
kubectl rollout restart -n linkerd deploy --selector=linkerd.io/control-plane-ns=linkerd
kubectl rollout status -n linkerd deploy --selector=linkerd.io/control-plane-ns=linkerd

echo "Waiting 20 seconds for stabilization..."
sleep 20
echo "---"
echo "Restarting linkerd viz"
# Restart the viz extension deployments and wait for the rollout to finish.
kubectl rollout restart -n linkerd-viz deploy --selector=linkerd.io/extension=viz
kubectl rollout status -n linkerd-viz deploy --selector=linkerd.io/extension=viz
```

The next step is a script to restart the data plane, that is, all of the pods
that have had the linkerd-proxy injected. Thankfully, we use namespace
annotations to control what gets injected, so all I needed to do was query for
those namespaces. Once I have found all namespaces with
`linkerd.io/inject: enabled`, I can restart them one at a time.

```bash
#!/bin/bash
set -e
# Find every namespace annotated with linkerd.io/inject: enabled.
NAMESPACES=$(kubectl get ns -o json | jq -r '.items[] | select(.metadata.annotations."linkerd.io/inject" == "enabled") | .metadata.name')
# Check if any namespaces were found.
if [ -z "$NAMESPACES" ]; then
  echo "No namespaces found with 'linkerd.io/inject: enabled' annotation."
  exit 0
fi

echo "---"
echo "Linkerd injected namespaces:"
echo "$NAMESPACES"
echo "---"

# Loop through each namespace found.
for NAMESPACE in $NAMESPACES; do
  echo "Restarting deployments in namespace: $NAMESPACE"
  kubectl rollout restart -n "$NAMESPACE" deployment
  kubectl rollout status -n "$NAMESPACE" deployment
  echo "Waiting 10 seconds for stabilization..."
  sleep 10
  echo "---"
done
```

The last step is to remove the old trust anchor from the trust bundle.

```bash
#!/bin/bash
set -e
# Remove the old anchor from the trust bundle by copying the current
# trust anchor secret over linkerd-previous-anchor.
kubectl get secret -n cert-manager linkerd-trust-anchor -o yaml \
  | sed -e 's/linkerd-trust-anchor/linkerd-previous-anchor/' \
  | grep -Ev '^ *(resourceVersion|uid)' \
  | kubectl apply -f -
```
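To see what the `sed` and `grep` stages of that pipeline do without touching a
cluster, here is a minimal sketch that runs the same rename and field-stripping
filters (written with `grep -E`) over a made-up Secret manifest. The function
names `rewrite_anchor` and `sample_secret`, and the sample `uid` and
`resourceVersion` values, are assumptions for illustration only. The result
keeps the manifest, renames the Secret to `linkerd-previous-anchor`, and drops
the two server-managed fields so that `kubectl apply` can accept it.

```bash
#!/bin/bash
# Rename the secret and strip server-managed fields from a manifest on stdin.
rewrite_anchor() {
  sed -e 's/linkerd-trust-anchor/linkerd-previous-anchor/' \
    | grep -Ev '^ *(resourceVersion|uid)'
}

# Hypothetical, trimmed-down output of `kubectl get secret ... -o yaml`.
sample_secret() {
  cat <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: linkerd-trust-anchor
  namespace: cert-manager
  resourceVersion: "12345"
  uid: 00000000-0000-0000-0000-000000000000
type: kubernetes.io/tls
EOF
}

# Print the rewritten manifest.
sample_secret | rewrite_anchor
```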

One last script ties all of these scripts together into a single runnable shell
script.

```bash
#!/bin/bash
set -e

echo "Starting Linkerd certificate rotation process"
echo "------------------------------------------"
/scripts/rotate-certs.sh
/scripts/restart-control-plane.sh
sleep 60s
/scripts/restart-data-plane.sh
sleep 60s
/scripts/update-bundle.sh
echo "------------------------------------------"
echo "Linkerd certificate rotation process completed"
```

All that was left was to schedule all of this to run. To accomplish this, I
bundled all of these scripts into a Docker image.

```dockerfile
FROM bitnami/kubectl

USER root

# Note that the scripts listed above are in a scripts subdirectory.
RUN mkdir /scripts
WORKDIR /scripts
COPY ./scripts .

RUN apt-get update && apt-get install --no-install-recommends -y curl \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install cmctl
RUN curl -fsSL -o cmctl https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64 && \
    chmod +x cmctl && \
    mv cmctl /usr/local/bin

USER nonroot
CMD ["sh", "./rotation.sh"]
```

## CronJob

Scheduling the above container to run involves two things. First, you need a
service account that has the permissions needed to not only rotate the certs but
also restart all of the deployments. Second, you need a CronJob that runs the
container under that service account. Thankfully, all I had to do was add the
following to our linkerd-certs Helm chart mentioned earlier.

```yaml
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: rotator
  namespace: linkerd

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rotator
  namespace: linkerd
rules:
  - apiGroups: ["apps", "extensions", "cert-manager.io"]
    resources: ["deployments", "certificates", "certificates/status"]
    verbs: ["get", "patch", "list", "watch", "update"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rotator
  namespace: linkerd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rotator
subjects:
  - kind: ServiceAccount
    name: rotator
    namespace: linkerd

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rotator-clusterrole
rules:
  - apiGroups: ["cert-manager.io", ""]
    resources: ["certificates", "certificates/status", "secrets"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: ["*"]
    resources: ["namespaces", "deployments"]
    verbs: ["get", "list"]
  - apiGroups: ["*"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  # ClusterRoleBindings are cluster-scoped, so no namespace is set here.
  name: rotator-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rotator-clusterrole
subjects:
  - kind: ServiceAccount
    name: rotator
    namespace: linkerd

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: linkerd-cert-rotation
  namespace: linkerd
spec:
  concurrencyPolicy: Forbid
  schedule: {{ .Values.rotation.schedule | quote }}
  jobTemplate:
    spec:
      backoffLimit: 0
      activeDeadlineSeconds: 600
      template:
        spec:
          serviceAccountName: rotator
          restartPolicy: Never
          activeDeadlineSeconds: 3600
          containers:
            - name: linkerd-cert-rotator
              image: "{{ .Values.rotation.image }}:{{ .Values.rotation.tag }}"
              imagePullPolicy: Always
              command: ["sh", "-c"]
              args:
                - "/scripts/rotation.sh >> /proc/1/fd/1 2>&1"
```

You then just need to add `rotation.schedule`, `rotation.image`, and
`rotation.tag` to the values, depending on where you pushed your image and what
schedule you want. I set these jobs to run once a month.
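Because the Job-level `activeDeadlineSeconds: 600` applies to the whole run, it
may be worth checking it against the fixed `sleep` calls in the scripts above. A
rough sketch of that arithmetic follows. The function name
`fixed_sleep_seconds` and the example of five injected namespaces are
assumptions, and the time the rollouts themselves take is not counted.

```bash
#!/bin/bash
# Sum of the fixed sleeps in the scripts above, in seconds:
#   rotate-certs.sh:          120 + 120
#   restart-control-plane.sh: 20
#   rotation.sh:              60 + 60
#   restart-data-plane.sh:    10 per injected namespace
fixed_sleep_seconds() {
  local namespaces="$1"
  echo $(( 120 + 120 + 20 + 60 + 60 + 10 * namespaces ))
}

job_deadline=600   # activeDeadlineSeconds on the Job spec
namespaces=5       # hypothetical number of injected namespaces

sleeping=$(fixed_sleep_seconds "$namespaces")
echo "fixed sleeps: ${sleeping}s"                           # fixed sleeps: 430s
echo "left for rollouts: $(( job_deadline - sleeping ))s"   # left for rollouts: 170s
```

If the rollouts in a given cluster routinely need longer than the remainder,
raising the deadline or shortening the sleeps are the obvious knobs to turn.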

## Rotation Periods

We want our certificates to rotate every 30 days, with a significant buffer in
case our automation fails. To achieve this, I configure cert-manager to issue
certificates with a **duration of 120 days** and renew them after **60 days**.

This provides a **60-day window** to ensure both the Linkerd control plane and
all meshed pods are restarted to pick up the new certificates. If they aren’t
restarted within this window, the old certificates will expire, leading to
communication issues.
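cert-manager's `Certificate` spec expresses these periods through its `duration`
and `renewBefore` fields, written as Go-style durations in hours rather than
days. As a small worked example of the numbers above (the helper name
`days_to_hours` is an assumption, and the post does not show the actual
manifest), the 120-day lifetime and 60-day renewal age translate like this:

```bash
#!/bin/bash
# Convert a number of days into an hour-based duration string.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

duration_days=120      # certificate lifetime
renew_after_days=60    # certificate age at which cert-manager should renew

# renewBefore counts back from expiry, so it is the lifetime minus the renewal age.
renew_before_days=$(( duration_days - renew_after_days ))

echo "duration: $(days_to_hours "$duration_days")"          # duration: 2880h
echo "renewBefore: $(days_to_hours "$renew_before_days")"   # renewBefore: 1440h
```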

Using a CronJob, we enforce a certificate rotation every **30 days**. This keeps
our certificates fresh while providing a substantial buffer to handle any
automation issues before they cause problems. A great side benefit is the
ability to manually run the CronJob at any time to force an ad hoc certificate
rotation.
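One way to trigger such a manual run is `kubectl create job` with its
`--from=cronjob/...` option, which creates a Job from the CronJob's template. A
minimal sketch, assuming the CronJob name and namespace from the manifest above;
the helper names and the `-manual-<timestamp>` suffix are illustrative choices:

```bash
#!/bin/bash
# Build a unique Job name by appending "-manual-<unix timestamp>".
manual_job_name() {
  printf '%s-manual-%s\n' "$1" "$(date +%s)"
}

# Create a one-off Job from the CronJob's job template.
run_rotation_now() {
  kubectl create job "$(manual_job_name linkerd-cert-rotation)" \
    --from=cronjob/linkerd-cert-rotation -n linkerd
}
```

Calling `run_rotation_now` from a shell with access to the cluster starts a
rotation right away. The caller needs permission to create Jobs in the `linkerd`
namespace.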

## Improvements

As with any solution, there is more I could do.

1. I would like to add automated checks to my shell scripts to verify when the
   cert has been updated instead of just sleeping for a period of time.
1. I would really like to add an automated check to validate that the trust
   bundle was updated at the end.
1. I would like to create a dashboard and some monitoring alerts to notify us
   about the age of these certs.
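The first item could be sketched roughly as follows: a generic polling helper
plus a check on the Certificate's `status.revision`, which cert-manager
increments on each issuance. The function names, timeout, and interval are
assumptions, and this has not been run against the setup described in this
post. The commented lines at the end show how it might stand in for a
`sleep 120`.

```bash
#!/bin/bash
# Retry a command every <interval> seconds until it succeeds
# or <timeout> seconds have passed.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Print the current revision of a cert-manager Certificate.
cert_revision() {
  kubectl get certificate -n "$1" "$2" -o jsonpath='{.status.revision}'
}

# Succeed once the revision is readable and differs from the recorded value.
revision_changed() {
  local current
  current="$(cert_revision "$1" "$2")" || return 1
  [ -n "$current" ] && [ "$current" != "$3" ]
}

# Possible use in place of `sleep 120`:
#   before=$(cert_revision cert-manager linkerd-trust-anchor)
#   cmctl renew -n cert-manager linkerd-trust-anchor
#   wait_until 120 5 revision_changed cert-manager linkerd-trust-anchor "$before"
```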

Did I miss any?

_Enjoyed the read? [Follow Matthew on Medium](https://medium.com/@mclanem_45809)
to keep up with his latest posts._

linkerd.io/content/blog/_index.md

Lines changed: 1 addition & 1 deletion
@@ -8,6 +8,6 @@ outputs:
   - RSS # Enable RSS
 params:
   feature:
+    - /blog/2025/1020-hands-off-linkerd-certificate-rotation
     - /blog/2025/0909-linkerd-with-opentelemetry
-    - /blog/2025/0801-imagine-learning
 ---
