This repository was archived by the owner on Feb 8, 2024. It is now read-only.

@mssawant mssawant commented Aug 1, 2022

Process events are broadcast to all the nodes in the cluster.
Not every node needs to update the entire process configuration
tree in KV; only the process's local hax or the RC must do that.

Solution:

  • Avoid updating process configuration tree in KV, but notify local motr
    processes about any remote process state updates.
  • Set HA and Confd process states to M0_CONF_HA_PROCESS_RECOVERED
    on receiving M0_CONF_HA_PROCESS_STARTED.

Signed-off-by: Mandar Sawant [email protected]
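
The behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual hax code: `ProcessEvent`, `effective_state`, `handle_process_event`, the `kv` dictionary, its key layout, and the `notify_local_motr` callback are all hypothetical names. Only the two `M0_CONF_HA_PROCESS_*` state names come from the description. The sketch also assumes that "HA and Confd" refers to processes of type hax and confd, and that the remapped state is what gets written to KV.

```python
from dataclasses import dataclass
from typing import Callable, Dict

STARTED = "M0_CONF_HA_PROCESS_STARTED"
RECOVERED = "M0_CONF_HA_PROCESS_RECOVERED"

# Assumption: the process types the description calls "HA and Confd".
RECOVER_ON_START = {"hax", "confd"}


@dataclass(frozen=True)
class ProcessEvent:
    fid: str        # process fid, e.g. "0x7200000000000001:0x1"
    node: str       # node the process runs on
    proc_type: str  # e.g. "hax", "confd", "ioservice"
    state: str      # reported process state


def effective_state(event: ProcessEvent) -> str:
    """Map STARTED to RECOVERED for hax and confd; pass other states through."""
    if event.state == STARTED and event.proc_type in RECOVER_ON_START:
        return RECOVERED
    return event.state


def handle_process_event(
    event: ProcessEvent,
    local_node: str,
    is_rc: bool,
    kv: Dict[str, str],
    notify_local_motr: Callable[[str, str], None],
) -> bool:
    """Handle one broadcast event; return True if this node updated KV."""
    state = effective_state(event)
    # Only the process's local hax or the RC touches the KV tree.
    updates_kv = event.node == local_node or is_rc
    if updates_kv:
        kv[f"processes/{event.fid}"] = state  # hypothetical key layout
    # Every node still tells its local motr processes about the change.
    notify_local_motr(event.fid, state)
    return updates_kv
```

With this shape, a non-RC node that receives an event for a remote process leaves `kv` untouched and only forwards the state to its local motr processes.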

@mssawant
Author

mssawant commented Aug 1, 2022

retest this please

@mssawant
Author

mssawant commented Aug 2, 2022

retest this please

@vaibhavparatwar
Contributor

@mssawant what is the JIRA Id to link with this PR?

@mssawant mssawant force-pushed the ha-confd-recovered branch from e02f191 to ec89fb6 August 3, 2022 16:25
@mssawant
Author

mssawant commented Aug 4, 2022

retest this please

@hessio hessio added the "Status: Waiting to be Reviewed" label Aug 8, 2022
@stale

stale bot commented Aug 16, 2022

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 6 days. Tagging @mssawant for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

@supriyachavan4398
Contributor

supriyachavan4398 commented Aug 16, 2022

With Mandar's branch, I tried to bootstrap in an LR environment:

[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]# hctl bootstrap --mkfs ~/1n_new.yaml
2022-08-16 01:06:16: Generating cluster configuration... OK
2022-08-16 01:06:17: Starting Consul server on this node......... OK
2022-08-16 01:06:24: Importing configuration into the KV store... OK
2022-08-16 01:06:24: Starting Consul on other nodes...Consul ready on all nodes
2022-08-16 01:06:25: Updating Consul configuration from the KV store... OK
2022-08-16 01:06:26: Waiting for the RC Leader to get elected....... OK
2022-08-16 01:06:31: Starting Motr (phase1, mkfs)... OK
2022-08-16 01:06:38: Starting Motr (phase1, m0d)... OK
2022-08-16 01:06:40: Starting Motr (phase2, mkfs)...Job for motr-mkfs@0x7200000000000001:0x2.service failed because the control process exited with error code. See "systemctl status motr-mkfs@0x7200000000000001:0x2.service" and "journalctl -xe" for details.
[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]# hctl status
Bytecount:
    critical : 0
    damaged : 0
    degraded : 0
    healthy : 0
Data pool:
    # fid name
    0x6f00000000000001:0x0 'the pool'
Profile:
    # fid name: pool(s)
    0x7000000000000001:0x0 'default': 'the pool' None None
Services:
    ssc-vm-g2-rhev4-1947.colo.seagate.com  (RC)
    [started]  hax                 0x7200000000000001:0x0          inet:tcp:192.168.62.178@22001
    [started]  confd               0x7200000000000001:0x1          inet:tcp:192.168.62.178@21002
    [offline]  ioservice           0x7200000000000001:0x2          inet:tcp:192.168.62.178@21003
    [unknown]  m0_client_other     0x7200000000000001:0x3          inet:tcp:192.168.62.178@22501
    [unknown]  m0_client_other     0x7200000000000001:0x4          inet:tcp:192.168.62.178@22502

But the ioservice failed to start:

[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]# systemctl status motr-mkfs@0x7200000000000001:0x2.service -l
● motr-mkfs@0x7200000000000001:0x2.service - Motr mkfs helper for 0x7200000000000001:0x2 service
   Loaded: loaded (/root/test/cortx-motr/scripts/install/usr/lib/systemd/system/motr-mkfs@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-08-16 01:06:42 MDT; 3min 59s ago
  Process: 19970 ExecStart=/usr/libexec/cortx-motr/motr-mkfs %i (code=exited, status=42)
 Main PID: 19970 (code=exited, status=42)

Aug 16 01:06:40 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: motr transport : libfab
Aug 16 01:06:40 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: + exec /root/test/cortx-motr/utils/mkfs/m0mkfs -e libfab:inet:tcp:192.168.62.178@21003 -A linuxstob:addb-stobs -f '<0x7200000000000001:0x2>' -T ad -S stobs -D db -m 524288 -q 16 -C 307200 -E 32 -J 64 -t 0 -X 209715200 -P 314572800 -O 524288000 -w 8 -H inet:tcp:192.168.62.178@22001 -U -L /dev/sdc -V 26843545600 -F -u f0e603b0-1d31-11ed-9ce8-566fcce40af4 -z 8589934592 -r 134217728
Aug 16 01:06:41 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: motr[19974]:  e100  ERROR  [setup.c:1490:cs_be_init]  <! rc=-22
Aug 16 01:06:41 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: motr[19974]:  e100  ERROR  [setup.c:1611:cs_storage_setup]  <! rc=-22 cs_be_init
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: propagating error to parent shell
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: motr-mkfs@0x7200000000000001:0x2.service: main process exited, code=exited, status=42/n/a
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com motr-mkfs[19970]: got sub-shell error, terminating..
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: Failed to start Motr mkfs helper for 0x7200000000000001:0x2 service.
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: Unit motr-mkfs@0x7200000000000001:0x2.service entered failed state.
Aug 16 01:06:42 ssc-vm-g2-rhev4-1947.colo.seagate.com systemd[1]: motr-mkfs@0x7200000000000001:0x2.service failed.
[root@ssc-vm-g2-rhev4-1947 cortx-hare-1]#
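
The `rc=-22` in the `cs_be_init` lines looks like a negated errno value, which is the usual C convention. On Linux, errno 22 is `EINVAL` (invalid argument). A small sketch for decoding such codes is below. `decode_rc` is a hypothetical helper, not part of Motr or Hare, and it only names the error; it does not tell which argument was rejected.

```python
import errno
import os


def decode_rc(rc: int):
    """Return (symbolic name, message) for a negative-errno style return code."""
    code = abs(rc)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = decode_rc(-22)
print(f"rc=-22 -> {name}: {message}")
```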

cc. @mssawant, @vaibhavparatwar

@nkommuri

Created a custom build [ https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7556/ ] with this PR and motr PR Seagate/cortx-motr#2078

Tested this build on my 9-node cluster. After cluster shutdown and restart, all devices came online immediately. I didn't see the device state fluctuation issue.

@stale stale bot removed the needs-attention label Aug 16, 2022
@vaibhavparatwar vaibhavparatwar self-requested a review August 16, 2022 14:05
@mssawant mssawant force-pushed the ha-confd-recovered branch from ec89fb6 to a18ff59 August 16, 2022 14:12
@mssawant mssawant changed the title from "ha: process state machine update multiple times" to "CORTX-29871: process state machine update multiple times" Aug 16, 2022
@mssawant mssawant merged commit 2073c28 into Seagate:main Aug 16, 2022