CORTX-29871: process state machine update multiple times #2145

mssawant · 2022-08-01T20:45:30Z

Process events are broadcast to all nodes in the cluster, but not every
node needs to update the entire process configuration tree in KV. Only
the process's local hax or the RC must do that.

Solution:

- Avoid updating process configuration tree in KV, but notify local motr
  processes about any remote process state updates.
- Set HA and Confd process states to M0_CONF_HA_PROCESS_RECOVERED
  on receiving M0_CONF_HA_PROCESS_STARTED.

Signed-off-by: Mandar Sawant [email protected]
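
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual Hare code: the `Hax` and `ProcessEvent` classes, the `proc_type` labels, and the in-memory `kv` and `notified` containers are all hypothetical stand-ins. Only the two state names come from the description, and treating the "HA" process as hax is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Tuple


class ProcState(Enum):
    # Only the two states named in the PR description are modelled here.
    M0_CONF_HA_PROCESS_STARTED = auto()
    M0_CONF_HA_PROCESS_RECOVERED = auto()


@dataclass
class ProcessEvent:
    fid: str          # process fid, e.g. "0x7200000000000001:0x1"
    node: str         # node the process runs on
    proc_type: str    # hypothetical labels: "hax", "confd", "ioservice"
    state: ProcState


@dataclass
class Hax:
    node: str
    is_rc: bool = False
    # Hypothetical stand-in for the process configuration tree in KV.
    kv: Dict[str, ProcState] = field(default_factory=dict)
    # Hypothetical stand-in for notifications sent to local motr processes.
    notified: List[Tuple[str, ProcState]] = field(default_factory=list)

    def handle(self, ev: ProcessEvent) -> None:
        state = ev.state
        # HA (assumed to be hax) and confd go straight to RECOVERED on STARTED.
        if (ev.proc_type in ("hax", "confd")
                and state is ProcState.M0_CONF_HA_PROCESS_STARTED):
            state = ProcState.M0_CONF_HA_PROCESS_RECOVERED
        # Only the process's local hax or the RC writes to KV.
        if ev.node == self.node or self.is_rc:
            self.kv[ev.fid] = state
        # Every hax still tells its local motr processes about the update.
        self.notified.append((ev.fid, state))


local, remote = Hax(node="n1"), Hax(node="n2")
ev = ProcessEvent("0x7200000000000001:0x1", "n1", "confd",
                  ProcState.M0_CONF_HA_PROCESS_STARTED)
local.handle(ev)
remote.handle(ev)
print(local.kv, remote.kv, remote.notified)
```

In this sketch the remote hax skips the KV write, which is the redundant update the PR aims to avoid, while still forwarding the state change locally.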

mssawant · 2022-08-01T22:51:43Z

retest this please

mssawant · 2022-08-02T04:53:11Z

retest this please

vaibhavparatwar · 2022-08-02T04:54:30Z

@mssawant what is the JIRA Id to link with this PR?

mssawant · 2022-08-04T03:22:26Z

retest this please

stale · 2022-08-16T01:54:17Z

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 6 days. Tagging @mssawant for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

supriyachavan4398 · 2022-08-16T07:09:18Z

With Mandar's branch, I tried to bootstrap in an LR environment:

[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]# hctl bootstrap --mkfs ~/1n_new.yaml
2022-08-16 01:06:16: Generating cluster configuration... OK
2022-08-16 01:06:17: Starting Consul server on this node......... OK
2022-08-16 01:06:24: Importing configuration into the KV store... OK
2022-08-16 01:06:24: Starting Consul on other nodes...Consul ready on all nodes
2022-08-16 01:06:25: Updating Consul configuration from the KV store... OK
2022-08-16 01:06:26: Waiting for the RC Leader to get elected....... OK
2022-08-16 01:06:31: Starting Motr (phase1, mkfs)... OK
2022-08-16 01:06:38: Starting Motr (phase1, m0d)... OK
2022-08-16 01:06:40: Starting Motr (phase2, mkfs)...Job for motr-mkfs@0x7200000000000001:0x2.service failed because the control process exited with error code. See "systemctl status motr-mkfs@0x7200000000000001:0x2.service" and "journalctl -xe" for details.
[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]# hctl status
Bytecount:
    critical : 0
    damaged : 0
    degraded : 0
    healthy : 0
Data pool:
    # fid name
    0x6f00000000000001:0x0 'the pool'
Profile:
    # fid name: pool(s)
    0x7000000000000001:0x0 'default': 'the pool' None None
Services:
    ssc-vm-g2-rhev4-1947.colo.seagate.com  (RC)
    [started]  hax                 0x7200000000000001:0x0          inet:tcp:192.168.62.178@22001
    [started]  confd               0x7200000000000001:0x1          inet:tcp:192.168.62.178@21002
    [offline]  ioservice           0x7200000000000001:0x2          inet:tcp:192.168.62.178@21003
    [unknown]  m0_client_other     0x7200000000000001:0x3          inet:tcp:192.168.62.178@22501
    [unknown]  m0_client_other     0x7200000000000001:0x4          inet:tcp:192.168.62.178@22502

However, it failed to start the ioservice:

[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]# systemctl status motr-mkfs@0x7200000000000001:0x2.service -l
● motr-mkfs@0x7200000000000001:0x2.service - Motr mkfs helper for 0x7200000000000001:0x2 service
   Loaded: loaded (/root/test/cortx-motr/scripts/install/usr/lib/systemd/system/motr-mkfs@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-08-16 01:06:42 MDT; 3min 59s ago
  Process: 19970 ExecStart=/usr/libexec/cortx-motr/motr-mkfs %i (code=exited, status=42)
 Main PID: 19970 (code=exited, status=42)

Aug 16 01:06:40 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: motr transport : libfab
Aug 16 01:06:40 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: + exec /root/test/cortx-motr/utils/mkfs/m0mkfs -e libfab:inet:tcp:192.168.62.178@21003 -A linuxstob:addb-stobs -f '<0x7200000000000001:0x2>' -T ad -S stobs -D db -m 524288 -q 16 -C 307200 -E 32 -J 64 -t 0 -X 209715200 -P 314572800 -O 524288000 -w 8 -H inet:tcp:192.168.62.178@22001 -U -L /dev/sdc -V 26843545600 -F -u f0e603b0-1d31-11ed-9ce8-566fcce40af4 -z 8589934592 -r 134217728
Aug 16 01:06:41 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: motr[19974]:  e100  ERROR  [setup.c:1490:cs_be_init]  <! rc=-22
Aug 16 01:06:41 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: motr[19974]:  e100  ERROR  [setup.c:1611:cs_storage_setup]  <! rc=-22 cs_be_init
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: propagating error to parent shell
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: motr-mkfs@0x7200000000000001:0x2.service: main process exited, code=exited, status=42/n/a
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: got sub-shell error, terminating..
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: Failed to start Motr mkfs helper for 0x7200000000000001:0x2 service.
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: Unit motr-mkfs@0x7200000000000001:0x2.service entered failed state.
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: motr-mkfs@0x7200000000000001:0x2.service failed.
[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]#
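
For reading the log above: the `rc=-22` reported by `cs_be_init` looks like a negated errno value. That is an assumption based on the usual C convention and is not confirmed anywhere in this thread. Under that assumption, a small helper (the name `describe_rc` is hypothetical) can decode it:

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a negative errno-style return code to 'NAME: message'."""
    code = -rc
    # errno.errorcode maps numeric codes to symbolic names such as 'EINVAL'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


print(describe_rc(-22))
```

This only names the error code. It does not say which argument or setting `cs_be_init` rejected.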

cc. @mssawant, @vaibhavparatwar

nkommuri · 2022-08-16T13:47:45Z

Created a custom build [ https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7556/ ] with this PR and motr PR Seagate/cortx-motr#2078

Tested this build on my 9-node cluster. After cluster shutdown and restart, all device states came online immediately, and I did not see the device state fluctuation issue.

cla-bot bot added the cla-signed label Aug 1, 2022

auto-assign bot assigned d-nayak, Shreya-18 and vaibhavparatwar Aug 1, 2022

mssawant requested review from SeagateChaDeepak, Shreya-18 and d-nayak and removed request for SeagateChaDeepak August 1, 2022 20:46

mssawant force-pushed the ha-confd-recovered branch from e02f191 to ec89fb6 Compare August 3, 2022 16:25

Shreya-18 approved these changes Aug 5, 2022

hessio added the Status: Waiting to be Reviewed label Aug 8, 2022

stale bot added the needs-attention label Aug 16, 2022

stale bot removed the needs-attention label Aug 16, 2022

vaibhavparatwar self-requested a review August 16, 2022 14:05

vaibhavparatwar approved these changes Aug 16, 2022

mssawant force-pushed the ha-confd-recovered branch from ec89fb6 to a18ff59 Compare August 16, 2022 14:12

mssawant changed the title ~~ha: process state machine update multiple times~~ CORTX-29871: process state machine update multiple times Aug 16, 2022

d-nayak approved these changes Aug 16, 2022

mssawant merged commit 2073c28 into Seagate:main Aug 16, 2022
