server,rpc: validate node IDs in RPC heartbeats #34197
Conversation
A review of this code was made in #34155 (review).
@petermattis I've seen your review on […]
Yes and what's the problem with that? My understanding is that if the conn object gets closed on the non-zero path, it's going to be marked as unusable (the grpc code marks it as closed) and so further zero-nodeid rpc activity (gossip) will cause a re-dial. Is my understanding wrong?
@petermattis you're going to love this. This is why enforcing […]. We have no way to distinguish the services (and have a different value for […]). This would require a major change to […]. Hence my proposal to revert the condition to […]
Force-pushed from 5499d8d to 176b3c1
Before you do that, I would special-case nodeID […]
Done, with a testing knob instead.
tbg left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
pkg/rpc/heartbeat.go, line 85 at r1 (raw file):
// populate separate node IDs for each heartbeat service.
// The returned callback should be called to cancel the effect.
func TestingAllowNamedRPCToAnonymousServer() func() {
This is a dated way of tweaking behavior for tests. Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)? This is a bit more work, but now it's unclear that we'll really undo this change when an mtc-based test fails. I really want to avoid hacks like this, sorry to make you jump through another hoop. So concretely my suggestion is:
- add a TestingNoValidateNodeIDs into Context
- in NewServerWithInterceptor, carry the bool over into HeartbeatService
- use the bool in Ping
Perhaps there's a more direct way, but we never hold on to the heartbeat service (we just stick it into a grpc server) and doing so wouldn't add clarity (then it would seem that one could replace the heartbeat service, but that wouldn't work).
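For illustration, a minimal sketch of the plumbing suggested above, assuming hypothetical type and field names rather than the actual cockroachdb rpc package:

```go
package rpcsketch

import "fmt"

// Context stands in for rpc.Context; TestingNoValidateNodeIDs is the
// hypothetical knob proposed in the review comment above.
type Context struct {
	TestingNoValidateNodeIDs bool
	nodeID                   int32
}

// HeartbeatService stands in for rpc.HeartbeatService.
type HeartbeatService struct {
	disableNodeIDValidation bool
	nodeID                  int32
}

// newHeartbeatService mirrors the idea of NewServerWithInterceptor
// carrying the bool over from the Context into the HeartbeatService.
func newHeartbeatService(ctx *Context) *HeartbeatService {
	return &HeartbeatService{
		disableNodeIDValidation: ctx.TestingNoValidateNodeIDs,
		nodeID:                  ctx.nodeID,
	}
}

// checkPing shows how Ping would consult the bool: skip validation for
// anonymous (zero) client-requested node IDs or when the knob is set.
func (hs *HeartbeatService) checkPing(requestedNodeID int32) error {
	if hs.disableNodeIDValidation || requestedNodeID == 0 {
		return nil
	}
	if requestedNodeID != hs.nodeID {
		return fmt.Errorf("client requested node ID %d doesn't match server node ID %d",
			requestedNodeID, hs.nodeID)
	}
	return nil
}
```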
tbg left a comment
Reviewed 18 of 18 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @petermattis)
I'm still a bit anxious about this PR (I just realized that 2 RPC connections means we'll be exchanging clock synchronization twice per node), and I won't have time to properly think about it until tomorrow. I appreciate the work you're doing here @knz, but I'd like to request that we don't try to rush this in.
Done, RFAL (I had no choice but to do that anyway, otherwise the race detector didn't like me during stress runs)
I think that since I'm still doing the thing to share the *Connection object between gossip and "named" (specific node ID) dials, there should be just 1 RPC connection between two nodes.
petermattis left a comment
> I think that since I'm still doing the thing to share the *Connection object between gossip and "named" (specific node ID) dials, there should be just 1 RPC connection between two nodes.
I think gossip will frequently be the first connection to a remote node. That will prohibit sharing a connection.
> Yes and what's the problem with that?
> My understanding is that if the conn object gets closed on the non-zero path, it's going to be marked as unusable (the grpc code marks it as closed) and so further zero-nodeid rpc activity (gossip) will cause a re-dial. Is my understanding wrong?
I'm not sure what problems that will cause. I don't think this scenario is adequately tested to give us any confidence that the right thing occurs. Feel free to point to tests that I'm missing. My anxiety here is relaxing a previous invariant: closing a connection removes the connection from the connection map.
I'm wondering what the aim of this PR should be. Should we be trying to make it impossible to send an RPC to a remote node with the wrong node ID? That's noble (and perhaps the right thing to do), but also risky in the short term. Another approach would be to focus on reducing the log spam (that's the near term impetus for a change). For example, nodeDialer could poison the local gossip address cache for a node if it discovers the remote node has an unexpected ID. I think that could be done while leaving the current connection behavior in place.
I apologize for the strictness of this review. I'm happy to take this change over if you'd like.
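For reference, a rough sketch of the "poison the gossip address cache" alternative mentioned above; this was not implemented in this PR, and the names here are hypothetical rather than the actual nodedialer code:

```go
package rpcsketch

import "sync"

// addrCache is a hypothetical stand-in for the dialer's cached view of
// the gossiped node ID -> address mapping.
type addrCache struct {
	mu    sync.Mutex
	addrs map[int32]string
}

// poison drops the cached address for a node once a dial discovers
// that whoever answers at that address reports an unexpected node ID,
// so the next attempt waits for fresh gossip instead of retrying the
// stale address (and spamming the logs).
func (c *addrCache) poison(nodeID int32, staleAddr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.addrs[nodeID] == staleAddr {
		delete(c.addrs, nodeID)
	}
}
```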
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
The evidence (based on my limited testing) is that we do not have a disciplined approach to prevent mis-routed RPCs from being incorrectly served and breaking invariants. See the commit message at the top of this PR. On its own that's an important goal.
Can you clarify why?
We can certainly also do this (the current PR, as it stands, does not do this yet). I'm happy to add this.
(rebased PR - will check what CI fallout we get)
Force-pushed from 3e2e015 to 1ef60f1
Could you update the commit/PR message to indicate what the behavior you're introducing is? I gather that outbound connections to NodeID zero don't validate the peer. From reading the code, I think the situation is as follows:

[…]

Please state that (in better words) and give a bit of motivation. It'd also be good to learn who doesn't necessarily supply a NodeID. I think we should consider removing the nodeID-less API completely and force callers to explicitly pass a nodeID (even if it's zero) to make sure we don't regress in silly ways.

The major problem (but it's not a new problem) here seems to be that these unvalidated connections have no reason to ever go away. This is just fallout of the fact that we don't have a mechanism to gracefully tear down connections at all. We don't want to just brutally close the old connection because that gives ugly errors. And yet we know that consumers of a connection can hold on to it for essentially forever (for example Raft transport). Fixing this seems out of scope here and will require all callers to buy into the fact that they'll need to switch over at some point. It'll be a while until that's worth it.

I also looked into @petermattis' concern about clock offsets. We will have multiple connections open but the clock offset measurements key only on the address (i.e. we won't double count the measurements, though we'll measure and verify more frequently). That seems like it should be fine, but should also be mentioned in the commit, or even in the code, where appropriate.

I still have to review the code in detail, but it looked pretty solid. @petermattis had a good point that putting a connection in multiple slots in the map (zero and nonzero) might cause new behavior that needs to be tested appropriately.
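To illustrate the clock-offset point above: a toy model of measurements keyed by remote address only, so a second connection to the same address refreshes the same entry rather than adding a duplicate. The names are illustrative, not the actual CockroachDB clock-monitoring code:

```go
package rpcsketch

// offsetsByAddr mimics keying clock-offset measurements by remote
// address only: a validated and an unvalidated connection to the same
// node update the same entry, so the extra connection does not make
// the offset count twice; it only refreshes it more often.
type offsetsByAddr map[string]float64 // measured offset in seconds

func (o offsetsByAddr) record(addr string, measuredOffset float64) {
	o[addr] = measuredOffset // latest measurement per address wins
}
```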
Done.
Done (Gossip + CLI client commands)
Agreed! I added a 2nd commit to do exactly that. PTAL!
tbg left a comment
Before you merge this, can you set up a local three node cluster (or just run the tpcc headroom roachtest) and watch the logs for annoying errors, especially early as the cluster is brought up? I don't expect there to be much (that isn't already there before this PR) but it's worth a look.
Thanks for pulling through!
Reviewed 2 of 8 files at r2, 16 of 16 files at r3, 17 of 17 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/cli/start.go, line 1137 at r4 (raw file):
// We use GRPCGossipDial() here because it does not matter
// to which node we're talking to.
conn, err := rpcContext.GRPCGossipDial(addr).Connect(ctx)
how about GRPCUnvalidatedDial?
pkg/rpc/heartbeat.go, line 60 at r3 (raw file):
// currently used by the multiTestContext which does not suitably
// populate separate node IDs for each heartbeat service.
// The returned callback should be called to cancel the effect.
Remove this line.
pkg/server/server.go, line 1493 at r1 (raw file):
// Now that we have a node ID, ensure that incoming RPC connections
// are validated against this node ID.
s.rpcContext.NodeID.Set(ctx, s.NodeID())
Why did you (have to) lower this into *Node?
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.
(See cockroachdb#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node
  because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections. This
behaves as follows:

- node ID zero given, no connection cached: creates a new connection
  that doesn't validate the node ID. This is suitable for the initial
  GRPC handshake during gossip, before node IDs are known. It is also
  suitable for the CLI commands which do not care about which node
  they are talking to (and they do not know the node ID yet -- only
  the RPC address).
- nonzero node ID given, but connection cached with node ID zero:
  opens a new connection and leaves the old connection in place (so
  dialing to node ID zero later still gives the unvalidated conn
  back). This is suitable when setting up e.g. Raft clients after the
  peer node IDs are determined. At this point we want to introduce
  node ID validation. The old connection remains in place because the
  gossip code does not react well to having its connection closed from
  "under it".
- zero given, cached with nonzero: will use the cached connection.
  This is suitable when gossip needs to verify e.g. the health of some
  remote node known only by its address. In this case it's OK to have
  it use the connection that is already established.

This flexibility suggests that it is possible for client components to
"opt out" of node ID validation by specifying a zero value, in other
places than strictly necessary for gossip and CLI commands. In fact,
the situation is even more uncomfortable: it requires extra work to
set up the node ID, and naive test code will be opting out of
validation implicitly, without clear feedback. This mis-design is
addressed by a subsequent commit.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.
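To make the three caching cases above concrete, here is a small illustrative sketch of a connection cache keyed by (address, node ID); the names are hypothetical and this is not the actual rpc.Context implementation:

```go
package rpcsketch

// connKey mimics caching connections under the pair (address, node ID),
// where a zero node ID marks an unvalidated connection (gossip, CLI).
type connKey struct {
	addr   string
	nodeID int32
}

type conn struct{ validatesNodeID bool }

type connCache map[connKey]*conn

// dial walks through the three cases described in the commit message.
func (m connCache) dial(addr string, nodeID int32) *conn {
	// Exact hit (including zero -> zero): reuse the cached connection.
	if c, ok := m[connKey{addr, nodeID}]; ok {
		return c
	}
	if nodeID == 0 {
		// Zero given, nonzero cached: reuse whatever connection already
		// exists for this address rather than opening another one.
		for k, c := range m {
			if k.addr == addr {
				return c
			}
		}
	}
	// Nonzero given with only a zero-keyed (or no) connection cached:
	// open a new, validated connection and leave any unvalidated one in
	// place so gossip keeps working undisturbed.
	c := &conn{validatesNodeID: nodeID != 0}
	m[connKey{addr, nodeID}] = c
	return c
}
```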
The previous patch introduced node ID verification for GRPC
connections but preserved the `GRPCDial()` API, alongside the ability
to use node ID 0 with `GRPCDialNode()`, to signal that node ID
verification should be disabled.

Further examination revealed that this flexibility is 1) hard to
reason about and 2) unneeded. So instead of keeping this option and
then investing time into producing tests for all the combinations of
verification protocols, this patch "cuts the Gordian knot" by removing
this flexibility altogether.

In summary:

- `GRPCDial()` is removed.
- `GRPCDialNode()` will call log.Fatal() if provided a 0 node ID.
- `GRPCUnvalidatedDial()` is introduced, with a clarification about
  its contract. I have audited the code to validate that this is
  indeed only used by gossip, and the CLI client commands that really
  don't care about the node ID.

Release note: None
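A minimal sketch of the resulting contract, with hypothetical names (the real entry points are methods on the rpc context and return connection objects rather than strings):

```go
package rpcsketch

import "log"

// dialNode corresponds to the GRPCDialNode() contract: a zero node ID
// is a programming error and fails loudly instead of silently skipping
// validation.
func dialNode(addr string, nodeID int32) string {
	if nodeID == 0 {
		log.Fatalf("invalid (zero) node ID when dialing %s; use the unvalidated dial instead", addr)
	}
	return addr // placeholder for the actual dial
}

// unvalidatedDial corresponds to GRPCUnvalidatedDial(): the explicit,
// easy-to-audit opt-out reserved for gossip and CLI client commands.
func unvalidatedDial(addr string) string {
	return addr // placeholder for the actual dial
}
```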
knz left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/cli/start.go, line 1137 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
how about GRPCUnvalidatedDial?
I'll trust your judgement.
pkg/rpc/heartbeat.go, line 85 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a dated way of tweaking behavior for tests. Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)? This is a bit more work, but now it's unclear that we'll really undo this change when an mtc-based test fails. I really want to avoid hacks like this, sorry to make you jump through another hoop. So concretely my suggestion is:
- add a TestingNoValidateNodeIDs into Context
- in NewServerWithInterceptor, carry the bool over into HeartbeatService
- use the bool in Ping
Perhaps there's a more direct way, but we never hold on to the heartbeat service (we just stick it into a grpc server) and doing so wouldn't add clarity (then it would seem that one could replace the heartbeat service, but that wouldn't work).
Done.
pkg/rpc/heartbeat.go, line 60 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Remove this line.
Done.
pkg/server/server.go, line 1493 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why did you (have to) lower this into *Node?
I think you're misreading the code and commenting on a rebase diff. I haven't changed anything here.
knz left a comment
> Before you merge this, can you set up a local three node cluster [...] and watch the logs for annoying errors, especially early as the cluster is brought up?
I have used my testing script from earlier and I didn't find anything suspicious.
I did find what I was looking for however:
I190418 13:02:08.329989 333 rpc/nodedialer/nodedialer.go:143 [n1] unable to connect to n4: failed to connect to n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unknown desc = client requested node ID 4 doesn't match server node ID 3
(the new check)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
tbg left a comment
Reviewed 19 of 19 files at r5, 17 of 17 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
Thank you

bors r=tbg
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

Fixes #34158.

Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer. (See
#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node
  because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail `stressrace` after a few
dozen iterations. With this patch, `stressrace` succeeds thousands of
iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Build succeeded
Back in PR cockroachdb#34197 we mistakenly removed the .Offset field
sent by each RPC heartbeat, through which the remote node monitors the
current/client node's offset.

This looks bad, but is actually somewhat innocuous. This is because we
update the offset map and do the check in two situations:

- when a server node gets pinged by a remote node, using the
  remote-provided .Offset;
- when receiving a ping response as client, using an estimate of the
  roundtrip latency between PingRequest-PingResponse.

The bug above only invalidates the first check. The second check still
occurs. Since all nodes are both client to another node, and server
for the same other node, we still get a check both ways on all pairs
of nodes.

Nonetheless, the half-way broken check reduces robustness overall. So
it's good to fix it.

Release note: None
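For reference, a toy version of the second (client-side) check described above, with hypothetical names: the client derives an offset estimate from the ping round trip, while the first (server-side) check simply records the .Offset reported by the client in its ping.

```go
package rpcsketch

import "time"

// estimateOffsetFromPing sketches the client-side check that remained
// intact: the client estimates the server clock's offset from its own
// clock using the ping round trip, assuming the response was produced
// roughly halfway through it.
func estimateOffsetFromPing(sendTime, recvTime, serverTime time.Time) time.Duration {
	rtt := recvTime.Sub(sendTime)
	expectedServerNow := sendTime.Add(rtt / 2)
	return serverTime.Sub(expectedServerNow)
}
```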
84031: rpc: fix the maxoffset check on the incoming path r=erikgrinaker a=knz

Fixes #84017.
Fixes #84027.

### Bug fix 1

Back in PR #34197 we mistakenly removed the .Offset field sent by each
RPC heartbeat, through which the remote node monitors the
current/client node's offset.

This looks bad, but is actually somewhat innocuous. This is because we
update the offset map and do the check in two situations:

- when a server node gets pinged by a remote node, using the
  remote-provided .Offset;
- when receiving a ping response as client, using an estimate of the
  roundtrip latency between PingRequest-PingResponse.

The bug above only invalidates the first check. The second check still
occurs. Since all nodes are both client to another node, and server
for the same other node, we still get a check both ways on all pairs
of nodes.

Nonetheless, the half-way broken check reduces robustness overall. So
it's good to fix it.

### Bug fix 2

Release note (bug fix): CLI `cockroach` commands connecting to a
remote server will not produce spurious "latency jump" warnings any
more. This bug had been introduced in CockroachDB v21.2.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Fixes #34158.
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.
(See #34155 for a scenario.)
This happened to be "safe" in many cases where it matters because:
- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node
  because the store ID in the request would not match.
However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)
This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.