This repository was archived by the owner on May 3, 2024. It is now read-only.

Conversation

@nkommuri

Problem Statement

  • Cortx-33875: During cluster restart, the confd container hits an assert and restarts.

Design

  • During startup, confd tries to establish a connection with every other service in the cluster. While establishing a session with a remote service, m0_rpc_item_send() observed the M0_NC_FAILED state for that service in the conf cache and called m0_rpc_item_failed(), which eventually called m0_rpc_session_establish_reply_received().
    m0_rpc_session_establish_reply_received() expects the session state to be M0_RPC_SESSION_ESTABLISHING, but the session is still in M0_RPC_SESSION_INITIALISED, hence the assert. The session is set to M0_RPC_SESSION_ESTABLISHING only after the m0_rpc__fop_post() call, and the assert was hit before that call completed.

When m0_rpc_session_establish_reply_received() runs, the session can therefore still be in the M0_RPC_SESSION_INITIALISED state if the conf cache reports M0_NC_FAILED for the remote service. Motr should handle this situation instead of asserting.
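
To illustrate the intended handling, here is a minimal, self-contained sketch in C. It is not the Motr code: the enum, the struct, and the function `establish_reply_received()` are hypothetical stand-ins for the real session types and for m0_rpc_session_establish_reply_received(). The sketch only shows the idea of logging an unexpected state and failing the session rather than asserting.

```c
#include <stdio.h>

/* Hypothetical, simplified session states; not the Motr definitions. */
enum session_state {
	SESSION_INITIALISED,
	SESSION_ESTABLISHING,
	SESSION_IDLE,
	SESSION_FAILED,
};

/* Hypothetical, simplified session object. */
struct session {
	enum session_state s_state;
	int                s_rc;
};

/*
 * Stand-in for the establish-reply handler. rc is the result of the
 * establish request: 0 on success, negative on failure.
 *
 * The handler may run while the session is still INITIALISED (the request
 * was failed before the post completed), so an unexpected state is logged
 * as an error instead of being asserted on.
 */
static void establish_reply_received(struct session *s, int rc)
{
	if (s->s_state != SESSION_ESTABLISHING)
		fprintf(stderr, "unexpected session state %d, rc=%d\n",
			(int)s->s_state, rc);

	if (rc != 0) {
		/* Early or late failure: the session ends up FAILED. */
		s->s_state = SESSION_FAILED;
		s->s_rc    = rc;
	} else {
		s->s_state = SESSION_IDLE;
	}
}
```

In this model, a handler call with a non-zero rc on a session that is still INITIALISED produces one error log line and leaves the session in the FAILED state, which mirrors the behaviour described above.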

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface changes (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • JIRA state/status is updated and JIRA is updated with PR link
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

Naga Kishore Kommuri added 2 commits August 10, 2022 03:25
Issue: During session establish reply, we expect session state to
be M0_RPC_SESSION_ESTABLISHING and assert otherwise. But, during
m0_rpc_post(), if we decide to cancel the session based on confc obj
status, then session state will still be in M0_RPC_SESSION_INITIALISED.
Converted assert into debug log msg. Session state will be moved to
M0_RPC_SESSION_FAILED, in such case.

Signed-off-by: Naga Kishore Kommuri <[email protected]>
as it would have been already called reply received function.

Signed-off-by: Naga Kishore Kommuri <[email protected]>
@cla-bot

cla-bot bot commented Aug 16, 2022

Thanks for your contribution!
The CLA bot has flagged your contribution as not having a Contributor License Agreement
in place. Note that this is not needed in the overwhelming majority of instances and this warning will usually be ignored.
The code reviewers will make a determination and may ask you to sign a CLA or may choose to ignore this warning.
More information about this can be found here.

state doesn't match with the expected state.

Signed-off-by: Naga Kishore Kommuri <[email protected]>

@rkothiya
Contributor

retest this please

@rkothiya
Contributor

Jenkins CI Result : Motr#1594

Motr Test Summary

❌ Failed: 2

04motr-single-node/49motr-rpc-cancel
01motr-single-node/00userspace-tests

🏁 Skipped: 32

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/04initscripts
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed: 41

01motr-single-node/43m0crate
01motr-single-node/05confgen
01motr-single-node/06hagen
01motr-single-node/52motr-singlenode-sanity
01motr-single-node/01net
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/02rpcping
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/72spiel-sns-motr-repair-quiesce
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/71spiel-sns-motr-repair
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/48motr-raid0-io
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total: 75

CppCheck Summary

   Cppcheck: No new warnings found 👍


@rkothiya rkothiya changed the title from "33875" to "Cortx-33875: Confd containers restart during cluster restart" Aug 17, 2022
@rkothiya rkothiya merged commit b65638c into Seagate:main Aug 17, 2022
kiwionly2 pushed a commit to kiwionly2/cortx-motr that referenced this pull request Aug 30, 2022
…#2078)

Problem: 
Confd containers restart during cluster restart

Solution : 
During session establish reply, we expect session state to
be M0_RPC_SESSION_ESTABLISHING and assert otherwise. But, during
m0_rpc_post(), if we decide to cancel the session based on confc obj
status, then session state will still be in M0_RPC_SESSION_INITIALISED.
Converted assert into debug log msg. Session state will be moved to
M0_RPC_SESSION_FAILED, in such case.

No need to call session_failed() if m0_rpc__fop_post() returns failure,
as the reply-received function will already have been called.

Converted the DEBUG msg to ERROR; it is logged only if the session's
state doesn't match the expected state.

Signed-off-by: Naga Kishore Kommuri <[email protected]>
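
The point about not calling session_failed() a second time can be sketched as follows. This is a simplified model under stated assumptions, not the Motr implementation: `est_post()`, `est_reply_received()`, `est_mark_failed()`, and `est_establish()` are hypothetical stand-ins for m0_rpc__fop_post(), the reply-received function, session_failed(), and the establish path, and the `peer_failed` flag stands in for the M0_NC_FAILED state found in the conf cache.

```c
#include <stdbool.h>

/* Hypothetical, simplified states; not the Motr definitions. */
enum est_state { EST_INITIALISED, EST_ESTABLISHING, EST_FAILED };

struct est_session {
	enum est_state state;
	int            rc;
	int            failed_calls; /* how often the failure transition ran */
};

/* Stand-in for session_failed(): records the failure transition. */
static void est_mark_failed(struct est_session *s, int rc)
{
	s->state = EST_FAILED;
	s->rc    = rc;
	s->failed_calls++;
}

/* Stand-in for the reply-received function. */
static void est_reply_received(struct est_session *s, int rc)
{
	if (rc != 0)
		est_mark_failed(s, rc);
}

/*
 * Stand-in for the post call. If the peer is already known to have
 * failed, the request is failed immediately, which runs the reply
 * handler before this function returns its error.
 */
static int est_post(struct est_session *s, bool peer_failed)
{
	if (peer_failed) {
		est_reply_received(s, -1);
		return -1;
	}
	return 0;
}

/* Stand-in for the establish path. */
static int est_establish(struct est_session *s, bool peer_failed)
{
	int rc = est_post(s, peer_failed);

	if (rc == 0)
		s->state = EST_ESTABLISHING;
	/*
	 * On failure, est_mark_failed() is deliberately not called here:
	 * the reply handler has already moved the session to EST_FAILED.
	 */
	return rc;
}
```

With this structure the failure transition runs exactly once on the early-failure path, while a successful post moves the session to the establishing state as before.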
@nkommuri nkommuri deleted the 33875 branch September 14, 2022 08:43
8 participants