Describe the problem
While testing CockroachDB, the Raft layer panicked; the error log is as follows:
tocommit(61) is out of range [lastIndex(59)]. Was the raft log corrupted, truncated, or lost?
After careful investigation, I found that Raft messages from another cluster had been sent to the localhost:28253 process: messages can be misdirected between different clusters.
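For context, this panic is raised by the commit-index invariant in etcd's Raft log (which CockroachDB vendors). Below is a minimal, self-contained sketch of that invariant, modeled on etcd/raft's raftLog.commitTo; the type and field names here are simplified stand-ins, not the vendored source:

```go
package raftsketch

import "log"

// raftLog is a pared-down stand-in for etcd/raft's log type (the real
// type also carries stable storage, unstable entries, etc.).
type raftLog struct {
	committed uint64 // highest index known to be committed
	last      uint64 // index of the last entry present in this log
}

func (l *raftLog) lastIndex() uint64 { return l.last }

// commitTo mirrors the invariant behind the panic above: a node may only
// advance its commit index up to the last entry it actually holds. A
// leader from a different cluster can claim a commit index (61) beyond
// this node's lastIndex (59), which trips the panic.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit { // never move the commit index backwards
		if l.lastIndex() < tocommit {
			log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```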
To Reproduce
The steps to reproduce are as follows:
1.
Start the three nodes of cluster1 (node1, node2, node3):
node1 -> localhost:28251
node2 -> localhost:28252
node3 -> localhost:28253
Create a table on cluster1's node1 and keep inserting data.
2.
Kill cluster1's node3 (the localhost:28253 process) and clear node3's data.
Keep inserting data on cluster1's node1.
3.
Start the three nodes of cluster2 (node1, node2, node3):
node1 -> localhost:28254
node2 -> localhost:28255
node3 -> localhost:28253
At this point, the address of node3 (localhost:28253) as seen by cluster1 has not changed, so cluster1 still believes it is cluster1's node3 (it is actually cluster2's node3 now).
Cluster1's nodes send gRPC heartbeats to node3. When node3 receives a heartbeat from cluster1, it detects the clusterID mismatch and returns an error, so cluster1 considers the connection to node3 (localhost:28253) unhealthy and cluster1's nodes will not send Raft logs to node3.
However, for a brief window right after node3 starts, node3 has not yet obtained its clusterID:
https://github.com/cockroachdb/cockroach/blob/master/pkg/rpc/heartbeat.go#L96
During that window, a successful PingResponse is returned to cluster1. Once cluster1 receives this response, cluster1's Raft log is sent to node3,
causing node3's Raft process to panic. A minimal sketch of the relevant check follows below.
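To illustrate the window, here is a minimal sketch, assuming a simplified version of the Ping handler around pkg/rpc/heartbeat.go#L96; the type shapes and names are illustrative stand-ins, not the real rpc package API. The point it shows: the clusterID comparison only runs once the local clusterID has been resolved, so a ping arriving before initialization succeeds even though the clusters differ.

```go
package rpcsketch

import "fmt"

// UUID stands in for the real uuid.UUID; Nil means "not yet initialized".
type UUID [16]byte

var Nil UUID

// PingRequest and PingResponse are pared-down stand-ins for the rpc protos.
type PingRequest struct {
	ClusterID *UUID // sender's cluster ID, if it has one
}
type PingResponse struct{}

type HeartbeatService struct {
	clusterID UUID // local cluster ID; stays Nil until the node resolves it
}

// Ping sketches the check around pkg/rpc/heartbeat.go#L96: cluster IDs
// are compared only when both sides have one. Right after startup,
// hs.clusterID is still Nil, so the mismatch check is skipped, the ping
// from the other cluster succeeds, and the remote nodes then treat this
// node as a healthy peer and stream Raft messages to it.
func (hs *HeartbeatService) Ping(req *PingRequest) (*PingResponse, error) {
	if req.ClusterID != nil && *req.ClusterID != Nil && hs.clusterID != Nil {
		if *req.ClusterID != hs.clusterID {
			return nil, fmt.Errorf("client cluster ID %x doesn't match server cluster ID %x",
				*req.ClusterID, hs.clusterID)
		}
	}
	// Window: hs.clusterID == Nil => the check above is skipped entirely.
	return &PingResponse{}, nil
}
```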