Conversation

@knz (Contributor) commented Jan 23, 2019

Backport 1/1 commits from #34155.

/cc @cockroachdb/release


Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, *while the other node
is still alive* (for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code assumed that if a node starts
with an address previously attributed to another node, that other
node must be dead, and thus (incorrectly) *erased* that node's entry,
thereby removing it from the cluster.
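
To make the idea concrete, here is a minimal Go sketch of the fixed behavior. This is not the actual CockroachDB gossip code: the `registry` type, its fields, and its methods are invented for illustration. The point is that when a node registers at an address last owned by a different node, the previous owner's entry is kept (only its address is detached) rather than erased; a separate liveness timeout decides later whether that node is really dead.

```go
package main

import "fmt"

type nodeID int

// registry is a toy stand-in for the cluster's view of node addresses.
type registry struct {
	addrToNode map[string]nodeID // last known owner of each address
	nodeToAddr map[nodeID]string // "" means the node lost its address to another node
}

func newRegistry() *registry {
	return &registry{
		addrToNode: map[string]nodeID{},
		nodeToAddr: map[nodeID]string{},
	}
}

// register records that node id is now reachable at addr.
func (r *registry) register(id nodeID, addr string) {
	if prev, ok := r.addrToNode[addr]; ok && prev != id {
		// Old (buggy) behavior would have been: delete(r.nodeToAddr, prev),
		// i.e. treating prev as dead and removing it from the cluster view.
		// New behavior: keep prev's entry and only detach the address; a
		// separate liveness timeout decides whether prev is actually dead.
		r.nodeToAddr[prev] = ""
	}
	r.addrToNode[addr] = id
	r.nodeToAddr[id] = addr
}

func main() {
	r := newRegistry()
	r.register(4, "10.0.0.4:26257") // n4 owns the address
	r.register(3, "10.0.0.4:26257") // n3 restarts reusing n4's address
	fmt.Println(r.nodeToAddr)       // n4 is still present, just without an address
}
```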

This scenario can be reproduced like this:

- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" messages on the other nodes, and n3 would not restart
properly. With the patch, there is a period of time (up to
`server.time_until_store_dead`) during which Raft is confused by not
finding n4 at n3's address, but the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.
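
As a hedged illustration of that window: `server.time_until_store_dead` is the cluster setting that controls how long a non-responsive store is kept around before being marked dead. An operator who wants to shorten the period of log spam after such an address takeover could lower it, for example from a small Go program. The connection string and the chosen value below are assumptions for illustration, not part of this patch.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL wire-protocol driver used to talk to CockroachDB
)

func main() {
	// Example connection string; adjust host, port, and TLS mode for your deployment.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Shorten the time before a non-responsive store is considered dead.
	if _, err := db.Exec(`SET CLUSTER SETTING server.time_until_store_dead = '2m0s'`); err != nil {
		log.Fatal(err)
	}

	// Read the setting back to confirm the new value.
	var val string
	if err := db.QueryRow(`SHOW CLUSTER SETTING server.time_until_store_dead`).Scan(&val); err != nil {
		log.Fatal(err)
	}
	fmt.Println("server.time_until_store_dead =", val)
}
```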

Release note (bug fix): CockroachDB now allows restarting a node at
an address previously allocated for another node.

@knz requested review from a team, petermattis and tbg on January 23, 2019 19:00
@cockroach-teamcity (Member) commented

This change is Reviewable

@tbg (Member) commented Jan 23, 2019

Let's let this bake for one or two nightly roachtest runs to see if that uncovers anything we didn't think of.

LGTM after that.

@petermattis (Collaborator) commented

@tbg Have you seen any roachtest failures that could be attributed to this? I haven't, but you're paying closer attention.

@tbg (Member) commented Jan 28, 2019

I have not. Let's merge.

@knz merged commit d829308 into cockroachdb:release-2.1 on Jan 28, 2019
@knz deleted the backport2.1-34155 branch on January 28, 2019 21:39