| 1 | +--- |
| 2 | +date: 2025-10-20T00:00:00Z |
| 3 | +title: Hands off Linkerd certificate rotation |
| 4 | +keywords: [linkerd, "Cert Manager", automation] |
| 5 | +params: |
| 6 | +  author: |
| 7 | +    name: Matthew McLane |
| 8 | +    avatar: matthew-mclane.jpg |
| 9 | +--- |
| 10 | + |
| 11 | +_This blog post was originally published on |
| 12 | +[Matthew McLane's Medium blog](https://medium.com/@mclanem_45809/hands-off-linkerd-certificate-rotation-0e387fdeaa0a)._ |
| 13 | + |
| 14 | +I’ll start by saying that I think Linkerd is a **great tool**. We use it at work |
| 15 | +to provide **TLS between our pods**, which frees us from having to build that |
| 16 | +functionality directly into our containers. When it works, it’s fantastic! It’s |
| 17 | +simple to get up and running and just does the job without a lot of extra fuss. |
| 18 | +For the most part, it’s been a very hands-off experience, which is exactly what |
| 19 | +we need. |
| 20 | + |
| 21 | +Recently, though, a change to **cert-manager** caused our long-standing |
| 22 | +certificates to unexpectedly rotate. This sent me on a journey to understand and |
| 23 | +implement a **fully automated certificate rotation solution** for our Linkerd |
| 24 | +service mesh, and I’d like to take you along for the ride. |
| 25 | + |
| 26 | +## The Problem |
| 27 | + |
| 28 | +Linkerd largely manages its own certificates, but it needs a trusted foundation: |
| 29 | +a root anchor and an identity issuer certificate. Linkerd’s own documentation on |
| 30 | +**“[Automatically Rotating Control Plane TLS Credentials](/2/tasks/automatically-rotating-control-plane-tls-credentials/)”** |
| 31 | +explains this in detail. My goal was to build a completely automated solution |
| 32 | +for our clusters, bypassing the need for manual `kubectl` commands. I wanted to |
| 33 | +leverage our existing ArgoCD infrastructure to handle everything, including |
| 34 | +regular certificate rotation and all the necessary restarts, without any manual |
| 35 | +intervention. |
| 36 | + |
| 37 | +## linkerd-certs helm chart |
| 38 | + |
| 39 | +The first step in my solution was to create a simple **Helm chart** to lay down |
| 40 | +the required certificates. Following the |
| 41 | +[documentation](/2/tasks/automatically-rotating-control-plane-tls-credentials/), |
| 42 | +this chart creates three key certificates in the namespace using cert-manager: |
| 43 | +`linkerd-trust-root-issuer`, `linkerd-trust-anchor`, and |
| 44 | +`linkerd-identity-issuer`. |
| 45 | + |
| 46 | +This Helm chart also sets up the `linkerd-identity-issuer` and the necessary |
| 47 | +trust bundle within the Linkerd namespace. Essentially, this single chart |
| 48 | +handles all the certificates needed to install Linkerd and enable its automatic |
| 49 | +rotation feature. |
| 50 | + |
| 51 | +## The rotation problem |
| 52 | + |
| 53 | +As stated in the documentation: |
| 54 | + |
| 55 | +> Rotating the identity issuer is basically a non-event: cert-manager can handle |
| 56 | +> rotating the identity issuer completely on its own. |
| 57 | +> . |
| 58 | +> . |
| 59 | +> . |
| 60 | +> Rotating the trust anchor is a bit different, because rotating the trust |
| 61 | +> anchor mean that you have to restart both the Linkerd control plane and all |
| 62 | +> the proxies while managing the trust bundle. In practice, this requires manual |
| 63 | +> intervention, because while cert-manager can handle the hard work of actually |
| 64 | +> rotating the trust anchor, it can’t trigger the needed restarts. |
| 65 | + |
| 66 | +I really didn’t want to rely on anything that required manual intervention. The solution |
| 67 | +to this problem was fairly simple to work out. All the heavy lifting was provided |
| 68 | +in the |
| 69 | +[documentation](/2/tasks/automatically-rotating-control-plane-tls-credentials/)! |
| 70 | +I started by creating a set of shell scripts. |
| 71 | + |
| 72 | +First is a script to rotate the certificates: |
| 73 | + |
| 74 | +```bash |
| 75 | +#!/bin/bash |
| 76 | +set -e |
| 77 | +echo "renewing linkerd-trust-anchor" |
| 78 | +cmctl renew -n cert-manager linkerd-trust-anchor |
| 79 | +echo "Waiting 120 seconds to allow for certs to update" |
| 80 | +sleep 120 |
| 81 | +echo "---" |
| 82 | + |
| 83 | +echo "renewing linkerd-identity-issuer" |
| 84 | +cmctl renew -n linkerd linkerd-identity-issuer |
| 85 | +echo "Waiting 120 seconds to allow for certs to update" |
| 86 | +sleep 120 |
| 87 | +``` |
| 88 | + |
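
The fixed `sleep 120` could be replaced by polling cert-manager until the new certificate has actually been issued. The following is a minimal sketch, not part of the original scripts: it assumes that a successful renewal increments the Certificate's `status.revision` field, and the function names `cert_revision` and `wait_for_renewal` are hypothetical.

```bash
#!/bin/bash
# Sketch only: wait for a cert-manager Certificate to be re-issued.
# Assumes a renewal bumps .status.revision; function names are illustrative.

# cert_revision NAMESPACE NAME -> prints the current revision (may be empty)
cert_revision() {
  kubectl get certificate "$2" -n "$1" -o jsonpath='{.status.revision}'
}

# wait_for_renewal NAMESPACE NAME OLD_REVISION [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once the revision is greater than OLD_REVISION, 1 on timeout.
wait_for_renewal() {
  local ns="$1" name="$2" old="${3:-0}" timeout="${4:-120}" interval="${5:-5}"
  local waited=0 current
  while [ "$waited" -le "$timeout" ]; do
    current=$(cert_revision "$ns" "$name")
    if [ -n "$current" ] && [ "$current" -gt "$old" ]; then
      echo "$name renewed (revision $old -> $current)"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Timed out waiting for $name to renew" >&2
  return 1
}
```

To use it, record the revision with `cert_revision` before calling `cmctl renew`, then call `wait_for_renewal cert-manager linkerd-trust-anchor "$old"` where the sleep used to be. With `set -e`, a timeout stops the script before any restarts happen.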
| 89 | +Next was a script to restart the linkerd control-plane pods. I also use this |
| 90 | +moment to restart the linkerd-viz pods. |
| 91 | + |
| 92 | +```bash |
| 93 | +#!/bin/bash |
| 94 | +set -e |
| 95 | +echo "---" |
| 96 | +echo "Restarting linkerd control plane" |
| 97 | +kubectl rollout restart -n linkerd deploy --selector=linkerd.io/control-plane-ns=linkerd |
| 98 | +kubectl rollout status -n linkerd deploy --selector=linkerd.io/control-plane-ns=linkerd |
| 99 | + |
| 100 | +echo "Waiting 20 seconds for stabilization..." |
| 101 | +sleep 20 |
| 102 | +echo "---" |
| 103 | +echo "Restarting linkerd viz" |
| 104 | +kubectl rollout restart -n linkerd-viz deploy --selector=linkerd.io/extension=viz |
| 105 | +kubectl rollout status -n linkerd-viz deploy --selector=linkerd.io/extension=viz |
| 106 | +``` |
| 107 | + |
| 108 | +The next step is a script to restart the data plane, meaning all of the pods that |
| 109 | +have had the linkerd-proxy injected. Thankfully, we use namespace annotations to |
| 110 | +control what gets injected, so all I needed to do was query for those |
| 111 | +namespaces. Once the script has found every namespace annotated with `linkerd.io/inject: enabled`, |
| 112 | +it restarts the deployments in each namespace, one namespace at a time. |
| 113 | + |
| 114 | +```bash |
| 115 | +#!/bin/bash |
| 116 | +set -e |
| 117 | +NAMESPACES=$(kubectl get ns -o json | jq -r '.items[] | select(.metadata.annotations."linkerd.io/inject" == "enabled") | .metadata.name') |
| 118 | +# Check if any namespaces were found. |
| 119 | +if [ -z "$NAMESPACES" ]; then |
| 120 | +  echo "No namespaces found with 'linkerd.io/inject: enabled' annotation." |
| 121 | +  exit 0 |
| 122 | +fi |
| 123 | + |
| 124 | +echo "---" |
| 125 | +echo "Linkerd injected namespaces:" |
| 126 | +echo "$NAMESPACES" |
| 127 | +echo "---" |
| 128 | + |
| 129 | +# Loop through each namespace found. |
| 130 | +for NAMESPACE in $NAMESPACES; do |
| 131 | +  echo "Restarting deployments in namespace: $NAMESPACE" |
| 132 | +  kubectl rollout restart -n "$NAMESPACE" deployment |
| 133 | +  kubectl rollout status -n "$NAMESPACE" deployment |
| 134 | +  echo "Waiting 10 seconds for stabilization..." |
| 135 | +  sleep 10 |
| 136 | +  echo "---" |
| 137 | +done |
| 138 | +``` |
| 139 | + |
| 140 | +The last step is to remove the old trust anchor from the trust bundle. |
| 141 | + |
| 142 | +```bash |
| 143 | +#!/bin/bash |
| 144 | +set -e |
| 145 | +# Overwrite linkerd-previous-anchor with the current anchor, which drops the old anchor from the trust bundle |
| 146 | +kubectl get secret -n cert-manager linkerd-trust-anchor -o yaml \ |
| 147 | +  | sed -e 's/linkerd-trust-anchor/linkerd-previous-anchor/' \ |
| 148 | +  | grep -E -v '^ *(resourceVersion|uid)' \ |
| 149 | +  | kubectl apply -f - |
| 150 | +``` |
| 151 | + |
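
After this step, the two secrets should hold the same certificate. A rough check, assuming both secrets live in the `cert-manager` namespace and store the certificate under the `tls.crt` key, could compare them directly. The function names `secret_cert` and `anchors_match` are hypothetical and not part of the original scripts.

```bash
#!/bin/bash
# Sketch only: confirm linkerd-previous-anchor now mirrors linkerd-trust-anchor.
# Function names are illustrative, not part of the original scripts.

# secret_cert NAMESPACE NAME -> prints the base64-encoded tls.crt of a Secret
secret_cert() {
  kubectl get secret "$2" -n "$1" -o jsonpath='{.data.tls\.crt}'
}

# anchors_match [NAMESPACE] -> returns 0 when both secrets hold the same certificate
anchors_match() {
  local ns="${1:-cert-manager}" current previous
  current=$(secret_cert "$ns" linkerd-trust-anchor)
  previous=$(secret_cert "$ns" linkerd-previous-anchor)
  if [ -n "$current" ] && [ "$current" = "$previous" ]; then
    echo "previous anchor matches current anchor"
    return 0
  fi
  echo "previous anchor differs from current anchor" >&2
  return 1
}
```

Calling `anchors_match cert-manager` as the last line of the bundle script would make the job fail, via `set -e`, whenever the copy did not take effect.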
| 152 | +One last script ties all of these scripts together into a single runnable shell |
| 153 | +script. |
| 154 | + |
| 155 | +```bash |
| 156 | +#!/bin/bash |
| 157 | +set -e |
| 158 | + |
| 159 | +echo "Starting Linkerd certificate rotation process" |
| 160 | +echo "------------------------------------------" |
| 161 | +/scripts/rotate-certs.sh |
| 162 | +/scripts/restart-control-plane.sh |
| 163 | +sleep 60s |
| 164 | +/scripts/restart-data-plane.sh |
| 165 | +sleep 60s |
| 166 | +/scripts/update-bundle.sh |
| 167 | +echo "------------------------------------------" |
| 168 | +echo "Linkerd certificate rotation process completed" |
| 169 | +``` |
| 170 | + |
| 171 | +All that was left was to schedule this all to run. To accomplish this, I bundled |
| 172 | +all of these scripts into a Docker image. |
| 173 | + |
| 174 | +```dockerfile |
| 175 | +FROM bitnami/kubectl |
| 176 | + |
| 177 | +USER root |
| 178 | + |
| 179 | +# Note that the scripts listed above are in a scripts subdirectory. |
| 180 | +RUN mkdir /scripts |
| 181 | +WORKDIR /scripts |
| 182 | +COPY ./scripts . |
| 183 | + |
| 184 | +RUN apt-get update && apt-get install --no-install-recommends -y curl \ |
| 185 | +    && apt-get clean \ |
| 186 | +    && rm -rf /var/lib/apt/lists/* |
| 187 | + |
| 188 | +# Install cmctl |
| 189 | +RUN curl -fsSL -o cmctl https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64 && \ |
| 190 | +    chmod +x cmctl && \ |
| 191 | +    mv cmctl /usr/local/bin |
| 192 | + |
| 193 | +USER nonroot |
| 194 | +CMD ["sh", "./rotation.sh"] |
| 195 | +``` |
| 196 | + |
| 197 | +## CronJob |
| 198 | + |
| 199 | +Scheduling the above container to run involves two things. First, you need a |
| 200 | +service account with the permissions needed not only to rotate the certs but |
| 201 | +also to restart all of the deployments. Second, you need the CronJob itself. Thankfully, all I had to do |
| 202 | +was add the following to our linkerd-certs Helm chart mentioned earlier. |
| 203 | + |
| 204 | +```yaml |
| 205 | +--- |
| 206 | +kind: ServiceAccount |
| 207 | +apiVersion: v1 |
| 208 | +metadata: |
| 209 | +  name: rotator |
| 210 | +  namespace: linkerd |
| 211 | + |
| 212 | +--- |
| 213 | +apiVersion: rbac.authorization.k8s.io/v1 |
| 214 | +kind: Role |
| 215 | +metadata: |
| 216 | +  name: rotator |
| 217 | +  namespace: linkerd |
| 218 | +rules: |
| 219 | +  - apiGroups: ["apps", "extensions", "cert-manager.io"] |
| 220 | +    resources: ["deployments", "certificates", "certificates/status"] |
| 221 | +    verbs: ["get", "patch", "list", "watch", "update"] |
| 222 | + |
| 223 | +--- |
| 224 | +apiVersion: rbac.authorization.k8s.io/v1 |
| 225 | +kind: RoleBinding |
| 226 | +metadata: |
| 227 | +  name: rotator |
| 228 | +  namespace: linkerd |
| 229 | +roleRef: |
| 230 | +  apiGroup: rbac.authorization.k8s.io |
| 231 | +  kind: Role |
| 232 | +  name: rotator |
| 233 | +subjects: |
| 234 | +  - kind: ServiceAccount |
| 235 | +    name: rotator |
| 236 | +    namespace: linkerd |
| 237 | + |
| 238 | +--- |
| 239 | +apiVersion: rbac.authorization.k8s.io/v1 |
| 240 | +kind: ClusterRole |
| 241 | +metadata: |
| 242 | +  name: rotator-clusterrole |
| 243 | +rules: |
| 244 | +- apiGroups: ["cert-manager.io", ""] |
| 245 | +  resources: ["certificates", "certificates/status", "secrets"] |
| 246 | +  verbs: ["get", "list", "patch", "update"] |
| 247 | +- apiGroups: ["*"] |
| 248 | +  resources: ["namespaces", "deployments"] |
| 249 | +  verbs: ["get", "list"] |
| 250 | +- apiGroups: ["*"] |
| 251 | +  resources: ["deployments"] |
| 252 | +  verbs: ["get", "list", "watch", "patch"] |
| 253 | + |
| 254 | +--- |
| 255 | +apiVersion: rbac.authorization.k8s.io/v1 |
| 256 | +kind: ClusterRoleBinding |
| 257 | +metadata: |
| 258 | +  name: rotator-clusterrolebinding |
| 259 | +  namespace: cert-manager |
| 260 | +roleRef: |
| 261 | +  apiGroup: rbac.authorization.k8s.io |
| 262 | +  kind: ClusterRole |
| 263 | +  name: rotator-clusterrole |
| 264 | +subjects: |
| 265 | +- kind: ServiceAccount |
| 266 | +  name: rotator |
| 267 | +  namespace: linkerd |
| 268 | + |
| 269 | +--- |
| 270 | +apiVersion: batch/v1 |
| 271 | +kind: CronJob |
| 272 | +metadata: |
| 273 | +  name: linkerd-cert-rotation |
| 274 | +  namespace: linkerd |
| 275 | +spec: |
| 276 | +  concurrencyPolicy: Forbid |
| 277 | +  schedule: {{ .Values.rotation.schedule | quote }} |
| 278 | +  jobTemplate: |
| 279 | +    spec: |
| 280 | +      backoffLimit: 0 |
| 281 | +      activeDeadlineSeconds: 600 |
| 282 | +      template: |
| 283 | +        spec: |
| 284 | +          serviceAccountName: rotator |
| 285 | +          restartPolicy: Never |
| 286 | +          activeDeadlineSeconds: 3600 |
| 287 | +          containers: |
| 288 | +            - name: linkerd-cert-rotator |
| 289 | +              image: {{ .Values.rotation.image }}:{{ .Values.rotation.tag }} |
| 290 | +              imagePullPolicy: Always |
| 291 | +              command: [ "sh", "-c" ] |
| 292 | +              args: |
| 293 | +                - "/scripts/rotation.sh >> /proc/1/fd/1 2>&1" |
| 294 | +``` |
| 295 | + |
| 296 | +You then just need to add `rotation.schedule`, `rotation.image`, and `rotation.tag` to |
| 297 | +the values, depending on where you pushed your image and what schedule you |
| 298 | +want. I set these jobs to run once a month. |
| 299 | + |
| 300 | +## Rotation Periods |
| 301 | + |
| 302 | +We want our certificates to rotate every 30 days, with a significant buffer in |
| 303 | +case our automation fails. To achieve this, I configure cert-manager to issue |
| 304 | +certificates with a **duration of 120 days** and renew them after **60 days**. |
| 305 | + |
| 306 | +This provides a **60-day window** to ensure both the Linkerd control plane and |
| 307 | +all meshed pods are restarted to pick up the new certificates. If they aren’t |
| 308 | +restarted within this window, the old certificates will expire, leading to |
| 309 | +communication issues. |
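
The size of that window can be checked against a live certificate. The sketch below assumes GNU `date` (as found in Debian-based images) and reads the `status.notAfter` timestamp that cert-manager records on a Certificate. The function names and the threshold argument are illustrative and not part of the original setup.

```bash
#!/bin/bash
# Sketch only: report how many whole days a Certificate has left.
# Assumes GNU date; function names are illustrative.

# days_between START_EPOCH END_EPOCH -> whole days from start to end
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# cert_days_left NAMESPACE NAME -> days until the Certificate's notAfter
cert_days_left() {
  local not_after
  not_after=$(kubectl get certificate "$2" -n "$1" -o jsonpath='{.status.notAfter}')
  days_between "$(date +%s)" "$(date -d "$not_after" +%s)"
}

# check_window DAYS_LEFT MIN_DAYS -> returns 1 when the buffer is too small
check_window() {
  if [ "$1" -lt "$2" ]; then
    echo "only $1 days left, expected at least $2" >&2
    return 1
  fi
  echo "$1 days left"
}
```

For example, `check_window "$(cert_days_left cert-manager linkerd-trust-anchor)" 60` returns non-zero once less than the 60-day buffer remains, which could feed an alert.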
| 310 | + |
| 311 | +Using a CronJob, we enforce a certificate rotation every **30 days**. This keeps |
| 312 | +our certificates fresh while providing a substantial buffer to handle any |
| 313 | +automation issues before they cause problems. A great side benefit is the |
| 314 | +ability to manually run the CronJob at any time to force an ad hoc certificate |
| 315 | +rotation. |
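
A manual run can be started with `kubectl create job --from`, which creates a Job from the CronJob's template. Here is a small sketch, assuming the CronJob name and namespace used in the manifest above. The wrapper function `run_rotation_now` and the timestamp suffix are illustrative choices.

```bash
#!/bin/bash
# Sketch only: trigger the rotation CronJob outside of its schedule.
# The function name and job-name suffix are illustrative.

# run_rotation_now [NAMESPACE] [CRONJOB] -> creates a uniquely named one-off Job
run_rotation_now() {
  local ns="${1:-linkerd}" cronjob="${2:-linkerd-cert-rotation}"
  local job_name
  job_name="${cronjob}-manual-$(date +%s)"
  kubectl create job "$job_name" --from="cronjob/${cronjob}" -n "$ns"
}
```

The person running this needs permission to create Jobs in the `linkerd` namespace. The Job's output can then be followed with `kubectl logs -n linkerd -f job/<job-name>`.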
| 316 | + |
| 317 | +## Improvements |
| 318 | + |
| 319 | +As with any solution, there is more I could do. |
| 320 | + |
| 321 | +1. I would like to add automated checks to my shell script to verify when the |
| 322 | +   cert has been updated instead of just sleeping for a period of time. |
| 323 | +1. I would really like to add an automated check to validate that the trust |
| 324 | +   bundle was updated at the end. |
| 325 | +1. I would like to create a dashboard and some monitoring alerts to notify us |
| 326 | +   about the age of these certs. |
| 327 | + |
| 328 | +Did I miss any? |
| 329 | + |
| 330 | +_Enjoyed the read? [Follow Matthew on Medium](https://medium.com/@mclanem_45809) |
| 331 | +to keep up with his latest posts._ |