
Conversation


@kaovilai kaovilai commented Aug 11, 2025

Fix: Call WaitGroup.Done() once only when PVB changes to final status + Update Go to 1.23.0

Summary

This PR includes two important updates:

  1. PVB WaitGroup Fix: Adapts upstream fix vmware-tanzu/velero#8940 (commit 2eb97fa) to prevent calling WaitGroup.Done() multiple times for the same PodVolumeBackup, which causes "negative WaitGroup counter" panic errors.

  2. Go Version Update: Updates Go version from 1.22.5 to 1.23.0 across all Dockerfiles and CI workflows, aligning with upstream commit 874388bd3e34bbcfe955f624b8e3579d959132fc. This update was missed in previous CVE-related PRs.

Problem

When the event handler receives multiple update events for a PodVolumeBackup that's already in a final status (Completed or Failed), it could call WaitGroup.Done() multiple times, leading to a panic.

Implementation Differences from Upstream

Upstream Version (Velero main branch)

  • Uses a pvbIndexer field (cache.Indexer) in the backupper struct
  • Tracks PVB states through the indexer infrastructure
  • Checks if PVB already exists in final status before calling Done()

Our Version (OADP 1.4)

  • Does not have the pvbIndexer infrastructure that exists upstream
  • Implements a simpler solution by checking state transitions in the UpdateFunc
  • Uses the oldObj and newObj parameters to detect when a PVB transitions to a final status
  • No additional fields needed in the backupper struct

Why Direct Cherry-pick Wasn't Possible

The upstream commit relies on the pvbIndexer infrastructure that was introduced in later versions of Velero but is not present in the OADP 1.4 branch. A direct cherry-pick would fail due to:

  1. Missing pvbIndexer field in the backupper struct
  2. Missing indexer initialization code
  3. Different architectural approach to tracking PVB states

Adapted Solution

// In the event handler UpdateFunc:
UpdateFunc: func(oldObj, newObj interface{}) {
    pvb, ok := newObj.(*velerov1api.PodVolumeBackup)
    if !ok {
        return
    }

    // ... existing checks ...

    // Check whether the previous state was already in a final status.
    statusChangedToFinal := true
    if oldPvb, ok := oldObj.(*velerov1api.PodVolumeBackup); ok {
        // If the PVB was already in a final status, WaitGroup.Done()
        // has already been called for it.
        if oldPvb.Status.Phase == velerov1api.PodVolumeBackupPhaseCompleted ||
            oldPvb.Status.Phase == velerov1api.PodVolumeBackupPhaseFailed {
            statusChangedToFinal = false
        }
    }

    b.result = append(b.result, pvb)

    // Call WaitGroup.Done() only the first time the PVB transitions to a
    // final status. This avoids the case where the handler receives multiple
    // update events for a PVB that is already final, which would panic with
    // "negative WaitGroup counter".
    if statusChangedToFinal {
        b.wg.Done()
    }
}

This achieves the same goal as upstream (preventing multiple Done() calls for the same PVB) by checking the state transition rather than maintaining additional tracking infrastructure. The approach is simpler, stays aligned with the upstream logic, and fits the OADP 1.4 codebase.

Testing

  • All existing tests pass
  • Added unit test to verify WaitGroup.Done() is called only once per PVB state transition
  • The fix prevents the panic while maintaining the same functional behavior
  • Tested with go test ./pkg/podvolume/... -v -count=1

Files Updated for Go 1.23.0

  • Dockerfile (2 occurrences)
  • hack/build-image/Dockerfile
  • .github/workflows/pr-ci-check.yml
  • .github/workflows/push.yml
  • .github/workflows/e2e-test-kind.yaml (2 occurrences)
  • .github/workflows/crds-verify-kind.yaml
  • Tiltfile
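For each of these files, the change is a one-line version bump. A representative example (the actual base-image names and tags in this repo may differ; this is illustrative only):

```diff
-FROM golang:1.22.5 AS builder
+FROM golang:1.23.0 AS builder
```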

Related Issues

…first time to avoid panic

Prevents calling WaitGroup.Done() multiple times for the same PodVolumeBackup by
tracking processed PVBs. This avoids "negative WaitGroup counter" panic errors when
the handler receives multiple update events with the PVB already in final status.

Based on upstream commit 2eb97fa
Adapted for OADP 1.4 without pvbIndexer infrastructure

Signed-off-by: Tiger Kaovilai <[email protected]>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 11, 2025
@openshift-ci

openshift-ci bot commented Aug 11, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci bot commented Aug 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 11, 2025
Updates Go version across all Dockerfiles and CI workflows to align with
upstream commit 874388b. This update was
missed in previous CVE-related PRs.

Updated files:
- Dockerfile (2 occurrences)
- hack/build-image/Dockerfile
- .github/workflows/pr-ci-check.yml
- .github/workflows/push.yml
- .github/workflows/e2e-test-kind.yaml (2 occurrences)
- .github/workflows/crds-verify-kind.yaml
- Tiltfile

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-authored-by: Claude <[email protected]>
… status changes to final for the first time

Signed-off-by: Tiger Kaovilai <[email protected]>
@kaovilai kaovilai marked this pull request as ready for review August 14, 2025 14:59
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2025
@kaovilai kaovilai changed the title Call WaitGroup.Done() once only when PVB changes to final status the first time to avoid panic OADP-6536: Call WaitGroup.Done() once only when PVB changes to final status the first time to avoid panic Aug 14, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 14, 2025
@openshift-ci-robot

openshift-ci-robot commented Aug 14, 2025

@kaovilai: This pull request references OADP-6536 which is a valid jira issue.

In response to this:

Fix: Call WaitGroup.Done() once only when PVB changes to final status + Update Go to 1.23.0


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kaovilai
Member Author

/retest

@openshift-ci

openshift-ci bot commented Aug 14, 2025

@kaovilai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/lint
Commit: bdb2247
Required: true
Rerun command: /test lint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 13, 2025
@coderabbitai

coderabbitai bot commented Nov 13, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 13, 2025