Conversation

@knz (Contributor) commented Jan 23, 2019

Backport 1/1 commits from #34155.

/cc @cockroachdb/release


Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, *while the other node
is still alive* (for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code assumed that if a node starts
with an address previously attributed to another node, that other
node must be dead, and thus (incorrectly) *erased* that node's entry,
thereby removing it from the cluster.
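
To make the idea concrete, here is a minimal Go sketch of the fixed behavior. This is not the actual CockroachDB gossip code: the `registry` type, its fields, and its methods are invented for illustration. The point is that when a node registers at an address last owned by a different node, the previous owner's entry is kept (only its address is detached) rather than erased; a separate liveness timeout decides later whether that node is really dead.

```go
package main

import "fmt"

type nodeID int

// registry is a toy stand-in for the cluster's view of node addresses.
type registry struct {
	addrToNode map[string]nodeID // last known owner of each address
	nodeToAddr map[nodeID]string // "" means the node lost its address to another node
}

func newRegistry() *registry {
	return &registry{
		addrToNode: map[string]nodeID{},
		nodeToAddr: map[nodeID]string{},
	}
}

// register records that node id is now reachable at addr.
func (r *registry) register(id nodeID, addr string) {
	if prev, ok := r.addrToNode[addr]; ok && prev != id {
		// Old (buggy) behavior would have been: delete(r.nodeToAddr, prev),
		// i.e. treating prev as dead and removing it from the cluster view.
		// New behavior: keep prev's entry and only detach the address; a
		// separate liveness timeout decides whether prev is actually dead.
		r.nodeToAddr[prev] = ""
	}
	r.addrToNode[addr] = id
	r.nodeToAddr[id] = addr
}

func main() {
	r := newRegistry()
	r.register(4, "10.0.0.4:26257") // n4 owns the address
	r.register(3, "10.0.0.4:26257") // n3 restarts reusing n4's address
	fmt.Println(r.nodeToAddr)       // n4 is still present, just without an address
}
```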

This scenario can be reproduced like this:

- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" messages on the other nodes, and n3 would not restart
properly. With the patch, there is a period of time (up to
`server.time_until_store_dead`) during which Raft is confused by not
finding n4 at n3's address, but the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.
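
As a hedged illustration of that window: `server.time_until_store_dead` is the cluster setting that controls how long a non-responsive store is kept around before being marked dead. An operator who wants to shorten the period of log spam after such an address takeover could lower it, for example from a small Go program. The connection string and the chosen value below are assumptions for illustration, not part of this patch.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL wire-protocol driver used to talk to CockroachDB
)

func main() {
	// Example connection string; adjust host, port, and TLS mode for your deployment.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Shorten the time before a non-responsive store is considered dead.
	if _, err := db.Exec(`SET CLUSTER SETTING server.time_until_store_dead = '2m0s'`); err != nil {
		log.Fatal(err)
	}

	// Read the setting back to confirm the new value.
	var val string
	if err := db.QueryRow(`SHOW CLUSTER SETTING server.time_until_store_dead`).Scan(&val); err != nil {
		log.Fatal(err)
	}
	fmt.Println("server.time_until_store_dead =", val)
}
```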

Release note (bug fix): CockroachDB now allows restarting a node at
an address previously allocated for another node.

@knz requested review from a team, petermattis and tbg on January 23, 2019 19:00
@cockroach-teamcity (Member) commented

This change is Reviewable

@tbg (Member) commented Jan 23, 2019

Let's let this bake for one or two nightly roachtest runs to see if that uncovers anything we didn't think of.

LGTM after that.

@petermattis (Collaborator) commented

@tbg Have you seen any roachtest failures that could be attributed to this? I haven't, but you're paying closer attention.

@tbg (Member) commented Jan 28, 2019

I have not. Let's merge.

@knz merged commit d829308 into cockroachdb:release-2.1 on Jan 28, 2019
@knz deleted the backport2.1-34155 branch on January 28, 2019 21:39