Increased timeout for pod stabilize & ignoring storage client pod #13899

s-selwin · 2025-12-10T09:57:16Z

This PR resolves two instability issues that were causing transient false failures in test_add_capacity:

Entry Criteria Race Condition: Fixes the "Error from server (NotFound)" failures in check_pods_in_statuses. The check raced with Kubernetes garbage collection: it tried to query short-lived storageclient pods that had already been deleted. These transient pods are now explicitly excluded from the health check.

Entry Check Timeout: The entry criteria were failing incorrectly because the test tried to track temporary Job pods (like storageclient-...) that had already been deleted. These pods are now excluded, which eliminates the NotFound API error and the subsequent timeout.
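The exclusion described above can be sketched as a simple name-prefix filter. This is only an illustration, not the ocs-ci implementation: the helper name filter_pods_by_prefix and the sample pod names are hypothetical, and only the two prefixes ("demo-pod", "storageclient") come from this PR.

```python
def filter_pods_by_prefix(pod_names, exclude_prefixes):
    """Return the pod names that do not start with any excluded prefix."""
    # str.startswith accepts a tuple, so one call covers every prefix.
    prefixes = tuple(exclude_prefixes)
    return [name for name in pod_names if not name.startswith(prefixes)]


# Hypothetical pod names, for illustration only.
pods = [
    "rook-ceph-osd-0-abc",
    "demo-pod-1",
    "storageclient-example-job-xyz",
    "noobaa-core-0",
]

# Short-lived pods are dropped before any status query is made, so a pod
# that was garbage-collected in the meantime is never looked up.
remaining = filter_pods_by_prefix(pods, ["demo-pod", "storageclient"])
print(remaining)
```

Filtering before the status query, rather than catching NotFound afterwards, keeps the health check focused on long-lived pods whose state actually matters for the entry criteria.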

Signed-off-by: selwin.s <[email protected]>

openshift-ci · 2025-12-10T09:57:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: s-selwin

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

amr1ta · 2025-12-10T10:30:49Z

Please run a pr validation job.

AviadP · 2025-12-10T10:49:55Z

tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py

            expected_statuses=expected_statuses,
-            exclude_pod_name_prefixes=["demo-pod"],
+            exclude_pod_name_prefixes=["demo-pod", "storageclient"],
+            timeout=500,


500 sec sounds like a long timeout if the problem is a race condition. Is it possible to use a shorter timeout? And what would the sampling interval be?

@AviadP You are absolutely right that a 500-second timeout is excessive for fixing a simple race condition. Here, though, the timeout serves a much broader purpose: making sure all pods stabilize before the test starts.

As we know, wait_for_pods_to_be_in_statuses validates the readiness of the pods before the add_capacity operation is introduced.

I found a bug (#13623) where the test failed with the error "one or more ocs pods are not in running", but when troubleshooting through the MG logs (all -o wide), I saw that all pods were in the right statuses and had simply taken time to stabilize, judging by how long they had been running.

A shorter timeout (e.g., 180s) would lead to a false failure every time initialization takes longer than 3 minutes.

The 500 seconds provide the necessary safety buffer (over 8 minutes) for the cluster to settle, preventing the test from starting while pods are in an unstable state.

There is no change in the sampling interval (10 sec by default).
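To make the timeout versus sampling-interval trade-off concrete, here is a minimal polling sketch. It is not the wait_for_pods_to_be_in_statuses implementation: wait_until, FakeClock, and simulate are hypothetical names, and the 240-second settle time is an assumed value chosen only to show how a 180-second limit and a 500-second limit differ. A fake clock is used so the example runs instantly.

```python
import time


def wait_until(check, timeout=500, interval=10,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() every `interval` seconds until it returns True
    or the next sample would fall after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)


class FakeClock:
    """Simulated clock so the example does not really wait."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(timeout, settle_after, interval=10):
    """Return True if pods that settle after `settle_after` seconds
    are seen as ready within `timeout` seconds."""
    fake = FakeClock()
    return wait_until(
        lambda: fake.time() >= settle_after,
        timeout=timeout,
        interval=interval,
        clock=fake.time,
        sleep=fake.sleep,
    )


# Assumed settle time of 240 s: longer than 3 minutes, well under 500 s.
print("timeout=180:", simulate(180, 240))
print("timeout=500:", simulate(500, 240))
```

Because the loop returns as soon as the check succeeds, a larger timeout only costs extra time when the pods genuinely fail to settle; in the passing case the wait ends at the first successful sample.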

ocs-ci

PR validation on existing cluster

Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master

Job UNSTABLE (some or all tests failed).

ocs-ci

PR validation on existing cluster

Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master

Job UNSTABLE (some or all tests failed).

ocs-ci

PR validation on existing cluster

Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master

Job UNSTABLE (some or all tests failed).

Increased timeout for pod stabilize & ignoring storage client pod

976678c

Signed-off-by: selwin.s <[email protected]>

s-selwin requested review from a team as code owners December 10, 2025 09:57

pull-request-size bot added the size/XS label Dec 10, 2025

s-selwin assigned s-selwin and am-agrawa and unassigned am-agrawa Dec 10, 2025

s-selwin requested a review from am-agrawa December 10, 2025 10:05

s-selwin added the Squad/Brown label Dec 10, 2025

s-selwin linked an issue Dec 10, 2025 that may be closed by this pull request

[Bug]: test_add_capacity: check_pods_in_statuses function getting executed before pods stabilize #13623

Open

s-selwin requested a review from suchita-g December 10, 2025 10:09

s-selwin linked an issue Dec 10, 2025 that may be closed by this pull request

[Bug]: test_add_capacity[10] fails with storage client pod search #13632

Open

This was referenced Dec 10, 2025

[Bug]: test_add_capacity[10] fails with storage client pod search #13632

Open

[Bug]: test_add_capacity: check_pods_in_statuses function getting executed before pods stabilize #13623

Open

AviadP reviewed Dec 10, 2025

View reviewed changes

ocs-ci reviewed Dec 10, 2025

View reviewed changes

ocs-ci reviewed Dec 11, 2025

View reviewed changes


6 participants
