@s-selwin s-selwin commented Dec 10, 2025

This PR resolves two instability issues that were causing transient failures (false negatives) in the test_add_capacity test:

Entry Criteria Race Condition: Fixes the Error from server (NotFound) failures in check_pods_in_statuses. These came from a race condition in which the check queried short-lived storageclient pods that Kubernetes garbage collection had already deleted. These transient pods are now explicitly excluded from the health check.

Entry Check Timeout: The entry criteria were failing incorrectly because the test tried to track temporary Job pods (such as storageclient-...) that had already been deleted. Excluding them eliminates the NotFound API error and the subsequent timeout.
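
For illustration, the prefix-based exclusion described above can be sketched as follows. This is a minimal, self-contained sketch and not the ocs-ci implementation: `filter_excluded_pods` and the sample pod names are hypothetical, and the only thing taken from this PR is the idea of skipping pods whose names start with `demo-pod` or `storageclient`.

```python
from typing import Iterable, List


def filter_excluded_pods(
    pod_names: Iterable[str], exclude_prefixes: Iterable[str]
) -> List[str]:
    """Return the pod names that do not start with any excluded prefix.

    Hypothetical helper: short-lived pods are dropped before any status
    query is made, so a pod that was already garbage-collected is never
    looked up and cannot produce a NotFound error.
    """
    prefixes = tuple(exclude_prefixes)  # str.startswith accepts a tuple
    return [name for name in pod_names if not name.startswith(prefixes)]


# Hypothetical pod names, used only to show the filtering behaviour.
pods = [
    "rook-ceph-osd-0-abc",
    "storageclient-12345-status-reporter-xyz",
    "demo-pod-1",
]
checked = filter_excluded_pods(pods, ["demo-pod", "storageclient"])
```

Only the pods left in `checked` would then be passed to the status check.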

@s-selwin s-selwin requested review from a team as code owners December 10, 2025 09:57

openshift-ci bot commented Dec 10, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: s-selwin

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

amr1ta commented Dec 10, 2025

Please run a PR validation job.

  expected_statuses=expected_statuses,
- exclude_pod_name_prefixes=["demo-pod"],
+ exclude_pod_name_prefixes=["demo-pod", "storageclient"],
  timeout=500,

500 sec sounds like a long timeout if the problem is a race condition. Is it possible to use a shorter timeout? And what would the sampling interval be?

@AviadP You are absolutely right that a 500-second timeout would be excessive for fixing a simple race condition. Here, though, the timeout serves a much broader purpose: making sure all pods stabilize before the test starts.

As we know, wait_for_pods_to_be_in_statuses validates the readiness of the pods before the add_capacity operation is introduced.

I found a bug (#13623) in which the test failed with the error "one or more ocs pods are not in running". When troubleshooting through the MG logs (all -o wide), I saw that all pods were in the right statuses but had taken time to stabilize, judging by how long they had been running. Please find the screenshot below.

Screenshot 2025-12-11 at 11 06 16 AM

A shorter timeout (e.g., 180s) would lead to a false failure every time initialization takes longer than 3 minutes.

The 500 seconds provide the necessary safety buffer (over 8 minutes) required for the cluster to settle, preventing the test from starting while pods are in an unstable state.

There is no change in the sampling interval (10 sec by default).
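
To make the timeout versus sampling-interval trade-off concrete, here is a rough sketch of a polling wait. It assumes the wait returns as soon as the condition holds, so the timeout is only an upper bound and a healthy cluster does not pay the full 500 seconds. `wait_until`, the fake clock, and the 240-second stabilization time are hypothetical and only illustrate the argument; this is not the ocs-ci code.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float,
    sleep: float,
    clock: Callable[[], float] = time.monotonic,
    sleeper: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `condition` every `sleep` seconds until it is true or time runs out.

    Hypothetical helper: returns True on the first successful check and
    False once another sleep would go past the deadline.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() + sleep > deadline:
            return False
        sleeper(sleep)


# Simulated clock so the example runs instantly (assumption: pods need
# 240 seconds to stabilize, i.e. longer than a 180-second timeout allows).
now = [0.0]


def fake_clock() -> float:
    return now[0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


def pods_stable() -> bool:
    return now[0] >= 240


passed_with_500 = wait_until(pods_stable, 500, 10, fake_clock, fake_sleep)
elapsed_with_500 = now[0]

now[0] = 0.0  # reset the simulated clock for the second run
passed_with_180 = wait_until(pods_stable, 180, 10, fake_clock, fake_sleep)
```

Under these assumptions, the 500-second run succeeds once the simulated pods stabilize and stops polling at that point, while the 180-second run gives up before they are ready.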

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master

Job UNSTABLE (some or all tests failed).
