server,rpc: validate node IDs in RPC heartbeats #34197
Conversation
A review of this code was made in #34155 (review).
@petermattis I've seen your review on […]
Yes and what's the problem with that? My understanding is that if the conn object gets closed on the non-zero path, it's going to be marked as unusable (the grpc code marks it as closed) and so further zero-nodeid rpc activity (gossip) will cause a re-dial. Is my understanding wrong?
@petermattis you're going to love this. This is why enforcing […]. We have no way to distinguish the services (and have a different value for […]). This would require a major change to […]. Hence my proposal to revert the condition to […]
Force-pushed from 5499d8d to 176b3c1
Before you do that, I would special-case nodeID […]
Done, with a testing knob instead.
tbg left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
pkg/rpc/heartbeat.go, line 85 at r1 (raw file):
// populate separate node IDs for each heartbeat service.
// The returned callback should be called to cancel the effect.
func TestingAllowNamedRPCToAnonymousServer() func() {
This is a dated way of tweaking behavior for tests. Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)? This is a bit more work, but now it's unclear that we'll really undo this change when an mtc-based test fails. I really want to avoid hacks like this, sorry to make you jump through another hoop. So concretely my suggestion is:
- add a TestingNoValidateNodeIDs into Context
- in NewServerWithInterceptor, carry the bool over into HeartbeatService
- use the bool in Ping
Perhaps there's a more direct way, but we never hold on to the heartbeat service (we just stick it into a grpc server) and doing so wouldn't add clarity (then it would seem that one could replace the heartbeat service, but that wouldn't work).
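For illustration, a minimal sketch of the plumbing suggested above, assuming hypothetical type and field names rather than the actual cockroachdb rpc package:

```go
package rpcsketch

import "fmt"

// Context stands in for rpc.Context; TestingNoValidateNodeIDs is the
// hypothetical knob proposed in the review comment above.
type Context struct {
	TestingNoValidateNodeIDs bool
	nodeID                   int32
}

// HeartbeatService stands in for rpc.HeartbeatService.
type HeartbeatService struct {
	disableNodeIDValidation bool
	nodeID                  int32
}

// newHeartbeatService mirrors the idea of NewServerWithInterceptor
// carrying the bool over from the Context into the HeartbeatService.
func newHeartbeatService(ctx *Context) *HeartbeatService {
	return &HeartbeatService{
		disableNodeIDValidation: ctx.TestingNoValidateNodeIDs,
		nodeID:                  ctx.nodeID,
	}
}

// checkPing shows how Ping would consult the bool: skip validation for
// anonymous (zero) client-requested node IDs or when the knob is set.
func (hs *HeartbeatService) checkPing(requestedNodeID int32) error {
	if hs.disableNodeIDValidation || requestedNodeID == 0 {
		return nil
	}
	if requestedNodeID != hs.nodeID {
		return fmt.Errorf("client requested node ID %d doesn't match server node ID %d",
			requestedNodeID, hs.nodeID)
	}
	return nil
}
```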
tbg left a comment
Reviewed 18 of 18 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @petermattis)
I'm still a bit anxious about this PR (I just realized that 2 RPC connections means we'll be exchanging clock synchronization twice per node), and I won't have time to properly think about it until tomorrow. I appreciate the work you're doing here @knz, but I'd like to request that we don't try to rush this in.
Done, RFAL (I had no choice but to do that anyway, otherwise the race detector didn't like me during stress runs)
I think that since I'm still doing the thing to share the *Connection object between gossip and "named" (specific node ID) dials, there should be just 1 RPC connection between two nodes.
petermattis left a comment
> I think that since I'm still doing the thing to share the *Connection object between gossip and "named" (specific node ID) dials, there should be just 1 RPC connection between two nodes.
I think gossip will frequently be the first connection to a remote node. That will prohibit sharing a connection.
> Yes and what's the problem with that?
> My understanding is that if the conn object gets closed on the non-zero path, it's going to be marked as unusable (the grpc code marks it as closed) and so further zero-nodeid rpc activity (gossip) will cause a re-dial. Is my understanding wrong?
I'm not sure what problems that will cause. I don't think this scenario is adequately tested to give us any confidence that the right thing occurs. Feel free to point to tests that I'm missing. My anxiety here is relaxing a previous invariant: closing a connection removes the connection from the connection map.
I'm wondering what the aim of this PR should be. Should we be trying to make it impossible to send an RPC to a remote node with the wrong node ID? That's noble (and perhaps the right thing to do), but also risky in the short term. Another approach would be to focus on reducing the log spam (that's the near term impetus for a change). For example, nodeDialer could poison the local gossip address cache for a node if it discovers the remote node has an unexpected ID. I think that could be done while leaving the current connection behavior in place.
I apologize for the strictness of this review. I'm happy to take this change over if you'd like.
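For reference, a rough sketch of the "poison the gossip address cache" alternative mentioned above; this was not implemented in this PR, and the names here are hypothetical rather than the actual nodedialer code:

```go
package rpcsketch

import "sync"

// addrCache is a hypothetical stand-in for the dialer's cached view of
// the gossiped node ID -> address mapping.
type addrCache struct {
	mu    sync.Mutex
	addrs map[int32]string
}

// poison drops the cached address for a node once a dial discovers
// that whoever answers at that address reports an unexpected node ID,
// so the next attempt waits for fresh gossip instead of retrying the
// stale address (and spamming the logs).
func (c *addrCache) poison(nodeID int32, staleAddr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.addrs[nodeID] == staleAddr {
		delete(c.addrs, nodeID)
	}
}
```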
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
The evidence (based on my limited testing) is that we do not have a disciplined approach to prevent mis-routed RPCs from being incorrectly served and breaking invariants. See the commit message at the top of this PR. On its own that's an important goal.
Can you clarify why?
We can certainly also do this (the current PR, as it stands, does not do this yet). I'm happy to add this.
(rebased PR - will check what CI fallout we get)
Force-pushed from 3e2e015 to 1ef60f1
Could you update the commit/PR message to indicate what the behavior you're introducing is? I gather that outbound connections to NodeID zero don't validate the peer. From reading the code, I think the situation is as follows:

[…]

Please state that (in better words) and give a bit of motivation. It'd also be good to learn who doesn't necessarily supply a NodeID. I think we should consider removing the nodeID-less API completely and force callers to explicitly pass a nodeID (even if it's zero) to make sure we don't regress in silly ways.

The major problem (but it's not a new problem) here seems to be that these unvalidated connections have no reason to ever go away. This is just fallout of the fact that we don't have a mechanism to gracefully tear down connections at all. We don't want to just brutally close the old connection because that gives ugly errors. And yet we know that consumers of a connection can hold on to it for essentially forever (for example Raft transport). Fixing this seems out of scope here and will require all callers to buy into the fact that they'll need to switch over at some point. It'll be a while until that's worth it.

I also looked into @petermattis' concern about clock offsets. We will have multiple connections open but the clock offset measurements key only on the address (i.e. we won't double count the measurements, though we'll measure and verify more frequently). That seems like it should be fine, but should also be mentioned in the commit, or even in the code, where appropriate.

I still have to review the code in detail, but it looked pretty solid. @petermattis had a good point that putting a connection in multiple slots in the map (zero and nonzero) might cause new behavior that needs to be tested appropriately.
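To illustrate the clock-offset point above: a toy model of measurements keyed by remote address only, so a second connection to the same address refreshes the same entry rather than adding a duplicate. The names are illustrative, not the actual CockroachDB clock-monitoring code:

```go
package rpcsketch

// offsetsByAddr mimics keying clock-offset measurements by remote
// address only: a validated and an unvalidated connection to the same
// node update the same entry, so the extra connection does not make
// the offset count twice; it only refreshes it more often.
type offsetsByAddr map[string]float64 // measured offset in seconds

func (o offsetsByAddr) record(addr string, measuredOffset float64) {
	o[addr] = measuredOffset // latest measurement per address wins
}
```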
Done.
Done (Gossip + CLI client commands)
Agreed! I added a 2nd commit to do exactly that. PTAL!
tbg left a comment
Before you merge this, can you set up a local three node cluster (or just run the tpcc headroom roachtest) and watch the logs for annoying errors, especially early as the cluster is brought up? I don't expect there to be much (that isn't already there before this PR) but it's worth a look.
Thanks for pulling through!
Reviewed 2 of 8 files at r2, 16 of 16 files at r3, 17 of 17 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/cli/start.go, line 1137 at r4 (raw file):
// We use GRPCGossipDial() here because it does not matter
// to which node we're talking to.
conn, err := rpcContext.GRPCGossipDial(addr).Connect(ctx)
how about GRPCUnvalidatedDial?
pkg/rpc/heartbeat.go, line 60 at r3 (raw file):
// currently used by the multiTestContext which does not suitably
// populate separate node IDs for each heartbeat service.
// The returned callback should be called to cancel the effect.
Remove this line.
pkg/server/server.go, line 1493 at r1 (raw file):
// Now that we have a node ID, ensure that incoming RPC connections
// are validated against this node ID.
s.rpcContext.NodeID.Set(ctx, s.NodeID())
Why did you (have to) lower this into *Node?
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.
(See cockroachdb#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node
  because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections. This
behaves as follows:

- node ID zero given, no connection cached: creates a new connection
  that doesn't validate the node ID. This is suitable for the initial
  GRPC handshake during gossip, before node IDs are known. It is also
  suitable for the CLI commands which do not care about which node
  they are talking to (and they do not know the node ID yet -- only
  the RPC address).
- nonzero node ID given, but connection cached with node ID zero:
  opens a new connection and leaves the old connection in place (so
  dialing to node ID zero later still gives the unvalidated conn
  back). This is suitable when setting up e.g. Raft clients after the
  peer node IDs are determined. At this point we want to introduce
  node ID validation. The old connection remains in place because the
  gossip code does not react well to having its connection closed from
  "under it".
- zero given, cached with nonzero: will use the cached connection.
  This is suitable when gossip needs to verify e.g. the health of some
  remote node known only by its address. In this case it's OK to have
  it use the connection that is already established.

This flexibility suggests that it is possible for client components to
"opt out" of node ID validation by specifying a zero value, in other
places than strictly necessary for gossip and CLI commands. In fact,
the situation is even more uncomfortable: it requires extra work to
set up the node ID, and naive test code will be opting out of
validation implicitly, without clear feedback. This mis-design is
addressed by a subsequent commit.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.
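To make the three caching cases above concrete, here is a small illustrative sketch of a connection cache keyed by (address, node ID); the names are hypothetical and this is not the actual rpc.Context implementation:

```go
package rpcsketch

// connKey mimics caching connections under the pair (address, node ID),
// where a zero node ID marks an unvalidated connection (gossip, CLI).
type connKey struct {
	addr   string
	nodeID int32
}

type conn struct{ validatesNodeID bool }

type connCache map[connKey]*conn

// dial walks through the three cases described in the commit message.
func (m connCache) dial(addr string, nodeID int32) *conn {
	// Exact hit (including zero -> zero): reuse the cached connection.
	if c, ok := m[connKey{addr, nodeID}]; ok {
		return c
	}
	if nodeID == 0 {
		// Zero given, nonzero cached: reuse whatever connection already
		// exists for this address rather than opening another one.
		for k, c := range m {
			if k.addr == addr {
				return c
			}
		}
	}
	// Nonzero given with only a zero-keyed (or no) connection cached:
	// open a new, validated connection and leave any unvalidated one in
	// place so gossip keeps working undisturbed.
	c := &conn{validatesNodeID: nodeID != 0}
	m[connKey{addr, nodeID}] = c
	return c
}
```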
The previous patch introduced node ID verification for GRPC
connections but preserved the `GRPCDial()` API, alongside the ability
to use node ID 0 with `GRPCDialNode()`, to signal that node ID
verification should be disabled.

Further examination revealed that this flexibility is 1) hard to
reason about and 2) unneeded. So instead of keeping this option and
then investing time into producing tests for all the combinations of
verification protocols, this patch "cuts the Gordian knot" by removing
this flexibility altogether.

In summary:

- `GRPCDial()` is removed.
- `GRPCDialNode()` will call log.Fatal() if provided a 0 node ID.
- `GRPCUnvalidatedDial()` is introduced, with a clarification about
  its contract. I have audited the code to validate that this is
  indeed only used by gossip, and the CLI client commands that really
  don't care about the node ID.

Release note: None
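A minimal sketch of the resulting contract, with hypothetical names (the real entry points are methods on the rpc context and return connection objects rather than strings):

```go
package rpcsketch

import "log"

// dialNode corresponds to the GRPCDialNode() contract: a zero node ID
// is a programming error and fails loudly instead of silently skipping
// validation.
func dialNode(addr string, nodeID int32) string {
	if nodeID == 0 {
		log.Fatalf("invalid (zero) node ID when dialing %s; use the unvalidated dial instead", addr)
	}
	return addr // placeholder for the actual dial
}

// unvalidatedDial corresponds to GRPCUnvalidatedDial(): the explicit,
// easy-to-audit opt-out reserved for gossip and CLI client commands.
func unvalidatedDial(addr string) string {
	return addr // placeholder for the actual dial
}
```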
knz left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/cli/start.go, line 1137 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
how about GRPCUnvalidatedDial?
I'll trust your judgement.
pkg/rpc/heartbeat.go, line 85 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a dated way of tweaking behavior for tests. Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)? This is a bit more work, but now it's unclear that we'll really undo this change when an mtc-based test fails. I really want to avoid hacks like this, sorry to make you jump through another hoop. So concretely my suggestion is:
- add a TestingNoValidateNodeIDs into Context
- in NewServerWithInterceptor, carry the bool over into HeartbeatService
- use the bool in Ping
Perhaps there's a more direct way, but we never hold on to the heartbeat service (we just stick it into a grpc server) and doing so wouldn't add clarity (then it would seem that one could replace the heartbeat service, but that wouldn't work).
Done.
pkg/rpc/heartbeat.go, line 60 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Remove this line.
Done.
pkg/server/server.go, line 1493 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why did you (have to) lower this into *Node?
I think you're misreading the code and commenting on a rebase diff. I haven't changed anything here.
knz left a comment
> Before you merge this, can you set up a local three node cluster [...] and watch the logs for annoying errors, especially early as the cluster is brought up?
I have used my testing script from earlier and I didn't find anything suspicious.
I did find what I was looking for however:
I190418 13:02:08.329989 333 rpc/nodedialer/nodedialer.go:143 [n1] unable to connect to n4: failed to connect to n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unknown desc = client requested node ID 4 doesn't match server node ID 3
(the new check)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
tbg left a comment
Reviewed 19 of 19 files at r5, 17 of 17 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
Thank you

bors r=tbg
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

Fixes #34158.

Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer. (See
#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node
  because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail `stressrace` after a few
dozen iterations. With this patch, `stressrace` succeeds thousands of
iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Build succeeded
Back in PR cockroachdb#34197 we mistakenly removed the .Offset field
sent by each RPC heartbeat, through which the remote node monitors the
current/client node's offset.

This looks bad, but is actually somewhat innocuous. This is because we
update the offset map and do the check in two situations:

- when a server node gets pinged by a remote node, using the
  remote-provided .Offset;
- when receiving a ping response as client, using an estimate of the
  roundtrip latency between PingRequest-PingResponse.

The bug above only invalidates the first check. The second check still
occurs. Since all nodes are both client to another node, and server
for the same other node, we still get a check both ways on all pairs
of nodes.

Nonetheless, the half-way broken check reduces robustness overall. So
it's good to fix it.

Release note: None
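For reference, a toy version of the second (client-side) check described above, with hypothetical names: the client derives an offset estimate from the ping round trip, while the first (server-side) check simply records the .Offset reported by the client in its ping.

```go
package rpcsketch

import "time"

// estimateOffsetFromPing sketches the client-side check that remained
// intact: the client estimates the server clock's offset from its own
// clock using the ping round trip, assuming the response was produced
// roughly halfway through it.
func estimateOffsetFromPing(sendTime, recvTime, serverTime time.Time) time.Duration {
	rtt := recvTime.Sub(sendTime)
	expectedServerNow := sendTime.Add(rtt / 2)
	return serverTime.Sub(expectedServerNow)
}
```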
84031: rpc: fix the maxoffset check on the incoming path r=erikgrinaker a=knz

Fixes #84017.
Fixes #84027.

### Bug fix 1

Back in PR #34197 we mistakenly removed the .Offset field sent by each
RPC heartbeat, through which the remote node monitors the
current/client node's offset.

This looks bad, but is actually somewhat innocuous. This is because we
update the offset map and do the check in two situations:

- when a server node gets pinged by a remote node, using the
  remote-provided .Offset;
- when receiving a ping response as client, using an estimate of the
  roundtrip latency between PingRequest-PingResponse.

The bug above only invalidates the first check. The second check still
occurs. Since all nodes are both client to another node, and server
for the same other node, we still get a check both ways on all pairs
of nodes.

Nonetheless, the half-way broken check reduces robustness overall. So
it's good to fix it.

### Bug fix 2

Release note (bug fix): CLI `cockroach` commands connecting to a
remote server will not produce spurious "latency jump" warnings any
more. This bug had been introduced in CockroachDB v21.2.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Fixes #34158.
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.
(See #34155 for a scenario.)
This happened to be "safe" in many cases where it matters because:
- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node
  because the store ID in the request would not match.
However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)
This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.