rpc: clusterID check can fail to prevent two clusters from interacting #37907

@xhh1989

Describe the problem

While testing CockroachDB, a node panicked in Raft with the following error:
tocommit(61) is out of range [lastIndex(59)]. Was the raft log corrupted, truncated, or lost?
After careful investigation, I found that Raft messages from another cluster were being sent to the process at localhost:28253. Messages are being delivered across cluster boundaries.
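
To illustrate why a foreign Raft message produces this panic, here is a minimal sketch of the commit-index check behind the error message. The `raftLog` type and the `tryCommit` helper are simplified, hypothetical stand-ins written for this issue, not the actual Raft library code; the sketch only assumes that a follower refuses to mark an index as committed when its own log does not reach that far.

```go
package main

import "fmt"

// raftLog is a simplified, hypothetical model of a follower's log state.
type raftLog struct {
	committed uint64 // highest index known to be committed
	lastIndex uint64 // index of the last entry actually present in the log
}

// commitTo advances the commit index. It panics if asked to commit an
// index beyond the end of the local log, since the follower cannot have
// committed entries it never received.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			panic(fmt.Sprintf(
				"tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex))
		}
		l.committed = tocommit
	}
}

// tryCommit wraps commitTo and converts the panic into an error so the
// demonstration can continue running.
func tryCommit(l *raftLog, tocommit uint64) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	l.commitTo(tocommit)
	return nil
}

func main() {
	// The local log ends at index 59, but a leader from another cluster
	// announces commit index 61.
	l := &raftLog{committed: 0, lastIndex: 59}
	fmt.Println(tryCommit(l, 61))
}
```

A leader from a different cluster tracks its own log, so the commit index it advertises has no relation to the entries the receiving node holds, which is how the check above can be tripped.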

To Reproduce

The steps to reproduce are as follows:
    1. Start the three nodes of cluster1 (node1, node2, node3):
       node1 -> localhost:28251
       node2 -> localhost:28252
       node3 -> localhost:28253
       Create a table on cluster1's node1 and keep inserting data.
    2. Kill cluster1's node3 process (localhost:28253) and clear node3's data.
       Keep inserting data on cluster1's node1.
    3. Start the three nodes of cluster2 (node1, node2, node3):
       node1 -> localhost:28254
       node2 -> localhost:28255
       node3 -> localhost:28253

At this point, the address of node3 (localhost:28253) as seen by cluster1 has not changed, so cluster1 still treats it as its own node3, although that process now belongs to cluster2. cluster1's nodes send gRPC heartbeats to node3. Normally, node3 detects the clusterID mismatch when it receives such a heartbeat and returns an error, so cluster1's nodes consider the connection to localhost:28253 unhealthy and do not replicate Raft logs to node3.

However, for a brief moment right after node3 starts, it has not yet obtained its clusterID:
https://github.com/cockroachdb/cockroach/blob/master/pkg/rpc/heartbeat.go#L96
During this window, node3 returns a successful PingResponse to cluster1. After receiving it, cluster1's nodes send their Raft log to node3, which causes node3's Raft process to panic.

Labels

C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.