ocs_ci/ocs/resources/pod.py (1 change: 1 addition & 0 deletions)

@@ -2593,6 +2593,7 @@ def get_pod_restarts_count(namespace=None, label=None, list_of_pods=None):
             "rook-ceph-osd-prepare",
             "rook-ceph-drain-canary",
             "ceph-file-controller-detect-version",
+            "status-reporter",
         )
         if all(exclude_name not in p.name for exclude_name in exclude_names):
             pod_count = ocp_pod_obj.get_resource(p.name, "RESTARTS")
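For context, the exclusion logic this hunk extends skips any pod whose name contains one of the excluded substrings before reading its RESTARTS count. The sketch below is a minimal, self-contained rendering of that pattern, not the actual ocs-ci implementation; `restart_counts` and `get_restarts` are hypothetical stand-ins for the real helper and the OCP resource lookup.

```python
# Minimal sketch of the substring-exclusion pattern used by
# get_pod_restarts_count (illustrative; not the ocs-ci code itself).
EXCLUDE_NAMES = (
    "rook-ceph-osd-prepare",
    "rook-ceph-drain-canary",
    "ceph-file-controller-detect-version",
    "status-reporter",  # newly excluded by this change
)

def restart_counts(pod_names, get_restarts):
    """pod_names: iterable of pod names; get_restarts: name -> int (hypothetical)."""
    counts = {}
    for name in pod_names:
        # all(...) is True only when no excluded substring occurs in the name
        if all(exclude not in name for exclude in EXCLUDE_NAMES):
            counts[name] = get_restarts(name)
    return counts
```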
@@ -100,9 +100,10 @@ def test_add_capacity(
         # All OCS pods are in running state:
         # ToDo https://github.com/red-hat-storage/ocs-ci/issues/2361
         expected_statuses = [constants.STATUS_RUNNING, constants.STATUS_COMPLETED]
-        assert pod_helpers.check_pods_in_statuses(
+        assert pod_helpers.wait_for_pods_to_be_in_statuses(
             expected_statuses=expected_statuses,
-            exclude_pod_name_prefixes=["demo-pod"],
+            exclude_pod_name_prefixes=["demo-pod", "storageclient"],
+            timeout=500,
Review comment from AviadP (Contributor):
500 seconds sounds like a long timeout if the problem is a race condition. Is it possible to use a shorter timeout? And what would the sampling interval be?

Reply from the PR author (Contributor):
@AviadP You are absolutely right that a 500-second timeout would be excessive if the goal were only to fix a simple race condition. Here, however, the timeout serves a much broader purpose: ensuring that all pods stabilize before the test starts.

As we know, wait_for_pods_to_be_in_statuses validates the readiness of the pods before the add_capacity operation is introduced.

I found a bug (#13623) where the test failed with the error "one or more OCS pods are not in running state", but when troubleshooting through the must-gather (MG) logs (all -o wide), I saw that all pods were in the right statuses; judging by how long they had been running, they had simply taken time to stabilize. Please find the screenshot below.

[Screenshot 2025-12-11 at 11:06:16 AM: pod statuses from the MG logs]

A shorter timeout (e.g., 180s) would lead to a false failure every time initialization takes longer than 3 minutes.

The 500 seconds provide the necessary safety buffer (over 8 minutes) for the cluster to settle, preventing the test from starting while pods are in an unstable state.

There is no change to the sampling interval (10 seconds by default).

), "Entry criteria FAILED: one or more OCS pods are not in running state"
# Create the namespace under which this test will execute:
project = project_factory()
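To make the timeout-versus-sampling-interval trade-off discussed in the thread above concrete, here is a minimal polling sketch of the kind of wait that wait_for_pods_to_be_in_statuses performs. The function name, signature, and the get_pod_statuses callback below are illustrative assumptions, not the actual ocs-ci API.

```python
import time

def wait_for_statuses(get_pod_statuses, expected, exclude_prefixes=(),
                      timeout=500, sleep=10):
    """Illustrative sketch (not the ocs-ci API): poll every `sleep` seconds
    until every non-excluded pod reports one of the `expected` statuses,
    or give up after `timeout` seconds.

    get_pod_statuses: hypothetical callable returning {pod_name: status}.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        not_ready = {
            name: status
            for name, status in get_pod_statuses().items()
            # startswith() with an empty tuple is always False, so nothing
            # is excluded when exclude_prefixes is empty
            if not name.startswith(tuple(exclude_prefixes))
            and status not in expected
        }
        if not not_ready:
            return True  # all pods stabilized within the timeout
        time.sleep(sleep)  # 10-second default sampling interval, per the reply
    return False  # the test asserts on this, producing the entry-criteria error
```

With timeout=500 and sleep=10, the wait makes at most roughly 50 polls and returns as soon as the pods settle; the long ceiling only matters in the slow-initialization case described in the reply.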