
Conversation


@kaovilai kaovilai commented Aug 11, 2025

Fix: Call WaitGroup.Done() once only when PVB changes to final status + Update Go to 1.23.0

Summary

This PR includes two important updates:

  1. PVB WaitGroup Fix: Adapts upstream fix vmware-tanzu/velero#8940 (commit 2eb97fa) to prevent calling WaitGroup.Done() multiple times for the same PodVolumeBackup, which causes "negative WaitGroup counter" panic errors.

  2. Go Version Update: Updates Go version from 1.22.5 to 1.23.0 across all Dockerfiles and CI workflows, aligning with upstream commit 874388bd3e34bbcfe955f624b8e3579d959132fc. This update was missed in previous CVE-related PRs.

Problem

When the event handler receives multiple update events for a PodVolumeBackup that's already in a final status (Completed or Failed), it could call WaitGroup.Done() multiple times, leading to a panic.

Implementation Differences from Upstream

Upstream Version (Velero main branch)

  • Uses a pvbIndexer field (cache.Indexer) in the backupper struct
  • Tracks PVB states through the indexer infrastructure
  • Checks if PVB already exists in final status before calling Done()

Our Version (OADP 1.4)

  • Does not have the pvbIndexer infrastructure that exists upstream
  • Implements a simpler solution by checking state transitions in the UpdateFunc
  • Uses the oldObj and newObj parameters to detect when a PVB transitions to a final status
  • No additional fields needed in the backupper struct

Why Direct Cherry-pick Wasn't Possible

The upstream commit relies on the pvbIndexer infrastructure that was introduced in later versions of Velero but is not present in the OADP 1.4 branch. A direct cherry-pick would fail due to:

  1. Missing pvbIndexer field in the backupper struct
  2. Missing indexer initialization code
  3. Different architectural approach to tracking PVB states

Adapted Solution

// In the event handler UpdateFunc:
UpdateFunc: func(oldObj, newObj interface{}) {
    pvb, ok := newObj.(*velerov1api.PodVolumeBackup)
    if !ok {
        return
    }

    // ... existing checks ...

    // Check whether the previous state was already in a final status.
    statusChangedToFinal := true
    if oldPvb, ok := oldObj.(*velerov1api.PodVolumeBackup); ok {
        // If the PVB was already in a final status, WaitGroup.Done()
        // has already been called for it.
        if oldPvb.Status.Phase == velerov1api.PodVolumeBackupPhaseCompleted ||
            oldPvb.Status.Phase == velerov1api.PodVolumeBackupPhaseFailed {
            statusChangedToFinal = false
        }
    }

    b.result = append(b.result, pvb)

    // Call WaitGroup.Done() only the first time the PVB transitions to a
    // final status. This avoids the case where the handler receives multiple
    // update events for a PVB that is already final, which would panic with
    // "negative WaitGroup counter".
    if statusChangedToFinal {
        b.wg.Done()
    }
}

This achieves the same goal as upstream (preventing multiple Done() calls for the same PVB) by checking the state transition rather than maintaining additional tracking infrastructure. The approach is simpler, stays aligned with the upstream logic, and fits the OADP 1.4 codebase.

Testing

  • All existing tests pass
  • Added unit test to verify WaitGroup.Done() is called only once per PVB state transition
  • The fix prevents the panic while maintaining the same functional behavior
  • Tested with go test ./pkg/podvolume/... -v -count=1

Files Updated for Go 1.23.0

  • Dockerfile (2 occurrences)
  • hack/build-image/Dockerfile
  • .github/workflows/pr-ci-check.yml
  • .github/workflows/push.yml
  • .github/workflows/e2e-test-kind.yaml (2 occurrences)
  • .github/workflows/crds-verify-kind.yaml
  • Tiltfile
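For each of these files, the change is a one-line version bump. A representative example (the actual base-image names and tags in this repo may differ; this is illustrative only):

```diff
-FROM golang:1.22.5 AS builder
+FROM golang:1.23.0 AS builder
```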

Related Issues

…first time to avoid panic

Prevents calling WaitGroup.Done() multiple times for the same PodVolumeBackup by
tracking processed PVBs. This avoids "negative WaitGroup counter" panic errors when
the handler receives multiple update events with the PVB already in final status.

Based on upstream commit 2eb97fa
Adapted for OADP 1.4 without pvbIndexer infrastructure

Signed-off-by: Tiger Kaovilai <[email protected]>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 11, 2025
@openshift-ci

openshift-ci bot commented Aug 11, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci bot commented Aug 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 11, 2025
Updates Go version across all Dockerfiles and CI workflows to align with
upstream commit 874388b. This update was
missed in previous CVE-related PRs.

Updated files:
- Dockerfile (2 occurrences)
- hack/build-image/Dockerfile
- .github/workflows/pr-ci-check.yml
- .github/workflows/push.yml
- .github/workflows/e2e-test-kind.yaml (2 occurrences)
- .github/workflows/crds-verify-kind.yaml
- Tiltfile

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-authored-by: Claude <[email protected]>
… status changes to final for the first time

Signed-off-by: Tiger Kaovilai <[email protected]>
@kaovilai kaovilai marked this pull request as ready for review August 14, 2025 14:59
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2025
@kaovilai kaovilai changed the title Call WaitGroup.Done() once only when PVB changes to final status the first time to avoid panic OADP-6536: Call WaitGroup.Done() once only when PVB changes to final status the first time to avoid panic Aug 14, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 14, 2025
@openshift-ci-robot

openshift-ci-robot commented Aug 14, 2025

@kaovilai: This pull request references OADP-6536 which is a valid jira issue.

In response to this:

Fix: Call WaitGroup.Done() once only when PVB changes to final status + Update Go to 1.23.0


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kaovilai
Member Author

/retest

@openshift-ci

openshift-ci bot commented Aug 14, 2025

@kaovilai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/lint
Commit: bdb2247
Required: true
Rerun command: /test lint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 13, 2025
@coderabbitai

coderabbitai bot commented Nov 13, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 13, 2025