Increased timeout for pod stabilization & ignoring storage client pod #13899
base: master
Conversation
Signed-off-by: selwin.s <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: s-selwin
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Please run a PR validation job.
      expected_statuses=expected_statuses,
-     exclude_pod_name_prefixes=["demo-pod"],
+     exclude_pod_name_prefixes=["demo-pod", "storageclient"],
      timeout=500,
500 seconds sounds like a long timeout if the problem is a race condition. Is it possible to use a shorter timeout? And what would the sampling interval be?
@AviadP You are absolutely right that a 500-second timeout would be excessive for fixing a simple race condition, but here the timeout serves a much broader purpose: making sure all pods have stabilized before the test starts.
As we know, wait_for_pods_to_be_in_statuses validates the readiness of the pods before the add_capacity operation is introduced.
I found a bug (#13623) where the test failed with the error "one or more ocs pods are not in running", but when troubleshooting through the MG (must-gather) logs (all -o wide), I saw that all pods were in the right statuses; judging by how long they had been running, they had simply taken time to stabilize. PFB.
A shorter timeout (e.g., 180s) would lead to a false failure every time initialization takes longer than 3 minutes.
The 500 seconds provides the necessary safety buffer (over 8 minutes) for the cluster to settle, preventing the test from starting while pods are still in an unstable state.
There is no change to the sampling interval (10 seconds by default).
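To make the timeout/interval trade-off concrete, here is a minimal generic polling sketch (not the ocs-ci implementation; `wait_until` and the lambda check are placeholders). It shows why a 500 s timeout with the default 10 s sampling interval still returns as soon as the pods are healthy:

```python
import time


def wait_until(check, timeout=500, sleep=10):
    """Poll check() every `sleep` seconds until it returns True or
    `timeout` seconds have elapsed; return False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(sleep)
    return False


# With timeout=500 and the default 10 s interval this allows roughly
# 50 sampling attempts (over 8 minutes) for slow pods to settle,
# whereas a 180 s timeout would allow only ~18 attempts.
if __name__ == "__main__":
    # Hypothetical check; in ocs-ci this role is played by
    # wait_for_pods_to_be_in_statuses with its own internal sampling.
    print(wait_until(lambda: True, timeout=30, sleep=5))
```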
ocs-ci left a comment
PR validation on existing cluster
Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master
Job UNSTABLE (some or all tests failed).
ocs-ci left a comment
PR validation on existing cluster
Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master
Job UNSTABLE (some or all tests failed).
ocs-ci left a comment
PR validation on existing cluster
Cluster Name: sels-odf-20
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/z_cluster/cluster_expansion/test_add_capacity_entry_exit_criteria.py::TestAddCapacity::test_add_capacity
Additional Test Params:
OCP VERSION: 4.20
OCS VERSION: 4.20
tested against branch: master
Job UNSTABLE (some or all tests failed).
This PR resolves two key instability issues that were causing transient failures (false negatives) in test_add_capacity:
Entry Criteria Race Condition: Fixes the Error from server (NotFound) failures occurring in check_pods_in_statuses. These were caused by a race condition in which the check tried to query short-lived storageclient pods that had already been deleted by Kubernetes garbage collection. We now explicitly exclude these transient pods from the health check.
Entry Check Timeout: The entry criteria were failing incorrectly because the test was trying to track deleted, temporary Job pods (such as storageclient-...). These are now excluded, eliminating the NotFound API error and the subsequent timeout.
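To illustrate the exclusion approach, here is a sketch under assumed data shapes (not the actual check_pods_in_statuses internals; `filter_excluded_pods` and the sample pod names are hypothetical): transient pods are filtered out by name prefix before their statuses are ever queried.

```python
def filter_excluded_pods(pod_names, exclude_prefixes=("demo-pod", "storageclient")):
    """Drop pods whose names start with any excluded prefix, so short-lived
    Job pods (e.g. storageclient-*) that may already have been garbage
    collected are never queried for status."""
    return [
        name
        for name in pod_names
        if not any(name.startswith(prefix) for prefix in exclude_prefixes)
    ]


# Example: the transient storageclient Job pod is skipped, so a later
# status query cannot hit "Error from server (NotFound)" for it.
pods = ["rook-ceph-osd-0-abc", "storageclient-claim-xyz", "noobaa-core-0"]
print(filter_excluded_pods(pods))
# -> ['rook-ceph-osd-0-abc', 'noobaa-core-0']
```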