linkerd-cni install-cni.sh script is prone to race conditions during CNI setup #14725

@pdefreitas

What is the issue?

During install-cni.sh execution, inotifywait is launched in the background:
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L346
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L295

The lines that follow trigger an event by modifying the file in the host CNI folder. As the comment there suggests, this only works if inotifywait is already running:
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L360

On some nodes in our clusters we've observed a rare but real race condition where this mv operation does not trigger inotifywait, because it wasn't listening yet. As a result, the CNI is never set up properly, and repair-controller keeps restarting pods on the affected nodes forever, since they fail to initialize due to the incorrect CNI setup.

How can it be reproduced?

It occurs randomly, but it can be simulated by making inotifywait start more slowly than the rest of the script, which exposes the race condition.
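
One way to force the slow start is to delay the watcher artificially. The sketch below is only an illustration: `start_delayed` is a hypothetical helper that is not part of install-cni.sh, and the inotifywait invocation in the comment is an assumption about how the watcher would be wrapped, not a copy of the script.

```shell
# Hypothetical helper: run a command in the background after a delay,
# so that whatever the caller does next happens before the command starts.
start_delayed() {
  local delay="$1"
  shift
  ( sleep "$delay"; exec "$@" ) &
}

# Assumed usage for reproducing the race (directory and events are illustrative):
#
#   start_delayed 5 inotifywait -m /host/etc/cni/net.d -e create -e moved_to
#   # ... the script's "Trigger CNI config detection" step runs here,
#   # before the watch exists, so the event is never seen.
```

With a delay of a few seconds, the log should show the "Trigger CNI config detection" line before "Watches established.", matching the race-condition example.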

Logs, error output, etc

Example of working case (install-cni):

[2025-11-17 09:05:27] Wrote linkerd CNI binaries to /host/opt/cni/bin
Setting up watches.
Watches established.
[2025-11-17 09:05:27] Trigger CNI config detection for /host/etc/cni/net.d/15-azure.conflist
[2025-11-17 09:05:27] Detected event: CREATE /host/etc/cni/net.d/15-azure.conflist
[2025-11-17 09:05:27] New/changed file [/host/etc/cni/net.d/15-azure.conflist] detected; re-installing
Setting up watches.
Watches established

Example of race condition (install-cni):

Defaulted container "install-cni" out of: install-cni, repair-controller
[2025-11-17 00:56:58] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2025-11-17 00:56:58] Trigger CNI config detection for /host/etc/cni/net.d/15-azure.conflist
Setting up watches.
Watches established.
Setting up watches.
Watches established.

output of linkerd check -o short

Linkerd output is normal.

Environment

  • Kubernetes Version: 1.33.3
  • Cluster Environment: AKS
  • Host OS: Azure Linux 3
  • Linkerd version: edge-25.8.5 and CNI 1.6.4

Possible solution

We need a way to mitigate the race condition. In the current implementation there is no guarantee that inotifywait is already listening for events before the "Trigger CNI config detection" logic runs, since it runs with -m in a background job. Adding a sleep in between could be an option, but it still would not guarantee the ordering. We might need some sort of retry mechanism, or a refactor that truly waits until we're listening for events.
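
A minimal sketch of the "truly wait" option, assuming inotifywait's stderr (where the "Watches established." line seen in the logs above is printed) is captured to a file. `wait_for_line`, the log path and the timeout are hypothetical names chosen for illustration, not existing parts of install-cni.sh.

```shell
# Hypothetical helper: poll LOG until a line matching PATTERN appears,
# giving up after roughly TIMEOUT seconds (default 10).
wait_for_line() {
  local log="$1" pattern="$2"
  local tries=$(( ${3:-10} * 10 ))
  while [ "$tries" -gt 0 ]; do
    if grep -q -- "$pattern" "$log" 2>/dev/null; then
      return 0
    fi
    sleep 0.1
    tries=$((tries - 1))
  done
  return 1
}

# Assumed usage inside the install script (paths and events are illustrative):
#
#   inotifywait -m "$cni_dir" -e create -e moved_to 2>"$watch_log" | handle_events &
#   wait_for_line "$watch_log" 'Watches established' 10 || exit 1
#   # only now trigger the CNI config detection (the mv step)
```

Exiting on timeout lets the container restart and try again, so this acts as a coarse retry. Redirecting stderr to a file would hide the "Setting up watches." lines from the container log unless they are also echoed back, which is a trade-off to weigh.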

Additional context

No response

Would you like to work on fixing this bug?

maybe
