linkerd-cni install-cni.sh script is prone to race conditions during CNI setup

### What is the issue?

During `install-cni.sh` execution, an `inotifywait` process is launched and runs in the background:
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L346
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L295

In the lines that follow, the script triggers an event by rewriting the file present in the host CNI folder. As the comment there suggests, this step requires `inotifywait` to already be watching:
https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.4/cni-plugin/deployment/scripts/install-cni.sh#L360

On some nodes of our clusters we've observed a rare but real race condition where this `mv` operation does not trigger `inotifywait` because it wasn't listening yet. As a result, the CNI is never set up properly, and repair-controller keeps restarting pods on the affected nodes forever, since they fail to initialize due to the incorrect CNI setup.

### How can it be reproduced?

It occurs randomly, but it can be simulated by making `inotifywait` start more slowly than the rest of the script. The race condition then shows up.
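
One way to force that ordering in a test environment is to put a slow wrapper in front of `inotifywait`. The sketch below is only an assumption about how that could be done: the function name `make_slow_inotifywait_shim`, the shim directory, and the delay are hypothetical, and the shim directory would have to come first in the `PATH` seen by `install-cni.sh`.

```shell
# Hypothetical repro aid, not part of install-cni.sh.
# Creates a wrapper named "inotifywait" that sleeps before starting the real
# binary, so the caller's trigger step runs before any watch exists.
#   $1 = directory to create the shim in
#   $2 = delay in seconds (default: 2)
#   $3 = path to the real binary (default: whatever "command -v" finds)
make_slow_inotifywait_shim() {
  local shim_dir="$1" delay="${2:-2}" real="${3:-}"
  if [ -z "$real" ]; then
    real=$(command -v inotifywait) || {
      echo "inotifywait not found" >&2
      return 1
    }
  fi
  mkdir -p "$shim_dir" || return 1
  # $delay and $real are expanded now; \$@ is left for the shim to expand.
  cat >"$shim_dir/inotifywait" <<EOF
#!/bin/sh
sleep $delay
exec "$real" "\$@"
EOF
  chmod +x "$shim_dir/inotifywait"
}
```

For example, running `make_slow_inotifywait_shim /tmp/shim 5` and then starting the script with `PATH=/tmp/shim:$PATH` should delay the watch by about five seconds. The function has to run before `PATH` is changed, so that `command -v` resolves the real binary and not the shim.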

### Logs, error output, etc

Example of working case (install-cni):
```
[2025-11-17 09:05:27] Wrote linkerd CNI binaries to /host/opt/cni/bin
Setting up watches.
Watches established.
[2025-11-17 09:05:27] Trigger CNI config detection for /host/etc/cni/net.d/15-azure.conflist
[2025-11-17 09:05:27] Detected event: CREATE /host/etc/cni/net.d/15-azure.conflist
[2025-11-17 09:05:27] New/changed file [/host/etc/cni/net.d/15-azure.conflist] detected; re-installing
Setting up watches.
Watches established
```
Example of race condition (install-cni):
```
Defaulted container "install-cni" out of: install-cni, repair-controller
[2025-11-17 00:56:58] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2025-11-17 00:56:58] Trigger CNI config detection for /host/etc/cni/net.d/15-azure.conflist
Setting up watches.
Watches established.
Setting up watches.
Watches established.
``` 

### output of `linkerd check -o short`

Linkerd output is normal.

### Environment

- Kubernetes Version: 1.33.3
- Cluster Environment: AKS
- Host OS: Azure Linux 3
- Linkerd version: edge-25.8.5 and CNI 1.6.4

### Possible solution

We need a way to mitigate the race condition. In the current implementation there is no guarantee that `inotifywait` is already listening for events before the "Trigger CNI config detection" logic runs, since it is running with `-m` in a background job. Adding a `sleep` in between could be an option, but it still would not guarantee the ordering. We might need some sort of retry mechanism, or a refactor that truly waits until we're listening for events.
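
As a rough sketch of the "truly wait" option: the logs above show `inotifywait` printing `Watches established.` once it is listening, so the script could capture that output and poll for the line before touching any config file. The helper names `wait_for_line` and `start_monitor_then_trigger`, the temporary log file, the event list, and the 10-second timeout are all hypothetical and not taken from `install-cni.sh`.

```shell
# Hypothetical helper: poll a file until it contains a fixed string.
#   $1 = file to poll, $2 = string to look for, $3 = timeout in seconds
# Returns 0 when the string is found, 1 after roughly $3 seconds.
wait_for_line() {
  local file="$1" pattern="$2" tries
  tries=$(( ${3:-10} * 10 ))
  while [ "$tries" -gt 0 ]; do
    if grep -qF -- "$pattern" "$file" 2>/dev/null; then
      return 0
    fi
    sleep 0.1
    tries=$(( tries - 1 ))
  done
  return 1
}

# Illustrative use: start the monitor, wait for readiness, then trigger.
#   $1 = directory to watch, $2 = config file to rewrite
start_monitor_then_trigger() {
  local watch_dir="$1" config_file="$2" ready_log tmp_file
  ready_log=$(mktemp) || return 1
  # Assumption: the readiness message goes to stderr, as with stock
  # inotify-tools, so stderr is captured in a file we can poll.
  inotifywait -m "$watch_dir" -e create,moved_to,modify 2>"$ready_log" &
  if ! wait_for_line "$ready_log" "Watches established." 10; then
    echo "inotifywait did not become ready in time" >&2
    return 1
  fi
  # Only now rewrite the file that is expected to produce an event.
  tmp_file=$(mktemp) || return 1
  cp -p "$config_file" "$tmp_file" && mv "$tmp_file" "$config_file"
}
```

Polling with a timeout keeps the script from hanging if `inotifywait` fails to start, and the non-zero return gives the caller a place to retry or abort instead of continuing silently.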

### Additional context

_No response_

### Would you like to work on fixing this bug?

maybe

linkerd-cni install-cni.sh script is prone to race conditions during CNI setup #14725
