server: add NetworkConnectivity endpoint to get network connection statuses between nodes #95429
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
pkg/gossip/gossip.proto
Outdated
  ClientStatus client = 2 [(gogoproto.nullable) = false];
  ServerStatus server = 3 [(gogoproto.nullable) = false];
- Connectivity connectivity = 4 [(gogoproto.nullable) = false];
+ Connectivity connectivity = 4 [(gogoproto.nullable) = false, deprecated = true];
Not relevant to the current work. Relates to the following PR: #89613
pkg/server/status.go
Outdated
// iterateNodes iterates nodeFn over all non-removed nodes concurrently.
// It then calls nodeResponse for every valid result of nodeFn, and
// nodeError on every error result.
func (s *statusServer) callNodes(
callNodes is identical to the dialNodes function; the only difference is that the nodes to dial are provided as an argument instead of being fetched from s.serverIterator.getAllNodes(ctx) (as commented out below).
It allows dialing specific nodes (nodes known to the gossip client, in our case).
tbg
left a comment
The basic ingredients look good.
Would you mind sending the changes for the rpc package (OnDisconnect()) out separately including a unit test? We want to make sure that if a connection goes bad, we no longer claim to have a healthy latency for it. On that PR we should also manually verify this fact by actually looking at the matrix under a partition. We might be able to backport this, so it's useful to separate it out.
The basic approach for the network latency endpoint seems fine, but a lot of plumbing is getting introduced that I'm unsure of since I don't have the background. Why are we introducing new protos? What is the current network latency matrix using? Is that going to be deprecated? Why isn't it being deprecated here? Etc.
The network connectivity code (that I've just glanced over, not reviewed) could be architected for better testability. Ideally you have a method that receives only the nodeID->addr mapping and a latencies map and constructs the result, and that is what gets unit tested. All of the server glue can come separately (in separate commits/PRs), and these will be "dumb" PRs that don't do anything too interesting.
It looks like you have some trouble using dialNodes in the context you're in (leading to callNodes being introduced). May I suggest a separate change (a separate PR is fine and easier to handle, but I'll also take a separate commit) where you adjust dialNodes to work for your use case as well.
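A minimal sketch of the testable core being asked for here, with all server glue stripped away: a nodeID->addr mapping and a latencies map in, a result out. The function name, parameter shapes, and the plain-map result are illustrative assumptions, not types from the PR.

func buildConnectivity(
	nodeAddrs map[roachpb.NodeID]string, // nodeID -> remote address
	latencies map[string]time.Duration, // measured latency keyed by remote address
) map[roachpb.NodeID]time.Duration {
	// Pure function: no dialing, no gossip, no server state, so it can be
	// unit tested with literal maps.
	out := make(map[roachpb.NodeID]time.Duration, len(nodeAddrs))
	for nodeID, addr := range nodeAddrs {
		if lat, ok := latencies[addr]; ok {
			out[nodeID] = lat
		}
	}
	return out
}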
pkg/gossip/gossip.go
Outdated
func (g *Gossip) GetBootstrapAddresses() map[util.UnresolvedAddr]roachpb.NodeID {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.bootstrapAddrs
Data race? You're leaking a map out from under the mutex.
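A minimal sketch of the usual fix, assuming the accessor is kept at all (see the next comment): copy the map while still holding the mutex instead of handing out the internal reference.

func (g *Gossip) GetBootstrapAddresses() map[util.UnresolvedAddr]roachpb.NodeID {
	g.mu.RLock()
	defer g.mu.RUnlock()
	// Copy under the lock: returning g.bootstrapAddrs directly would let
	// callers read the map while gossip mutates it, a data race.
	addrs := make(map[util.UnresolvedAddr]roachpb.NodeID, len(g.bootstrapAddrs))
	for addr, nodeID := range g.bootstrapAddrs {
		addrs[addr] = nodeID
	}
	return addrs
}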
See other commits, you may not want to add this method in the first place.
export type SetTraceRecordingTypeResponseMessage =
  protos.cockroach.server.serverpb.SetTraceRecordingTypeResponse;

export type NetworkConnectivityRequest =
Could you put all of the "front end" changes into a separate commit or PR?
pkg/gossip/gossip.go
Outdated
func (g *Gossip) GetKnownNodeIDs() ([]roachpb.NodeID, error) {
	g.mu.RLock()
	infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
	g.mu.RUnlock()
This needs to be deferred. The infos you're collecting hold on to *info which must not be accessed outside of the mutex.
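A minimal sketch of the shape this asks for, holding the lock (via defer) for as long as the infos are read. It assumes getMatchedInfos returns entries keyed by their gossip key; DecodeNodeDescKey is an illustrative helper for extracting the node ID from that key, not necessarily the gossip package's actual API.

func (g *Gossip) GetKnownNodeIDs() ([]roachpb.NodeID, error) {
	g.mu.RLock()
	defer g.mu.RUnlock() // infos reference *info and must stay under the mutex
	infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
	if err != nil {
		return nil, err
	}
	// Extract plain node IDs while still holding the lock; only the IDs
	// escape, never the *info values.
	nodeIDs := make([]roachpb.NodeID, 0, len(infos))
	for key := range infos {
		nodeID, err := DecodeNodeDescKey(key, KeyNodeDescPrefix) // assumed helper
		if err != nil {
			return nil, err
		}
		nodeIDs = append(nodeIDs, nodeID)
	}
	return nodeIDs, nil
}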
pkg/server/status.go
Outdated
if local {
	nodeLatencies := map[roachpb.NodeID]int64{}
	// addressToNodeIdMap contains all known addresses
	addressToNodeIdMap := s.gossip.GetBootstrapAddresses()
I'm not sure we want to rely on the bootstrapAddrs field here. The semantics for that are a bit shaky. Even though bootstrapAddrs seems like it very aggressively tracks remote addresses, I think we should try to build something that's more easily understood and that we can reason about better.
Each node gossips its NodeDescriptor here:
cockroach/pkg/gossip/gossip.go, lines 403 to 405 in 2114678:

if err := g.AddInfoProto(MakeNodeIDKey(desc.NodeID), desc, NodeDescriptorTTL); err != nil {
	return errors.Wrapf(err, "n%d: couldn't gossip descriptor", desc.NodeID)
}
The TTL for that is 2h. So we can reasonably assume that a node will know all of its recent peers whenever network trouble begins. (And we cannot avoid this not always being true in contrived scenarios anyway: a node might restart while being partitioned, losing all in-memory state; yes, we could add persistence, but I don't think we would, as the persisted nodes could be stale after a restart!)
You added an accessor for this already: GetAllNodeIDs(). Now instead of listing just the IDs, it should also return the descriptors. Then you can construct a map from address to nodeID (don't forget about the LocalityAddrs, which I just learned is a thing), and that can replace the bootstrapAddrs-based lookup.
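A hedged sketch of that map construction from gossiped descriptors; the function name is illustrative, and it assumes the descriptor-returning accessor described above exists.

func addressToNodeIDMap(descs []roachpb.NodeDescriptor) map[util.UnresolvedAddr]roachpb.NodeID {
	m := make(map[util.UnresolvedAddr]roachpb.NodeID, len(descs))
	for _, desc := range descs {
		m[desc.Address] = desc.NodeID
		// Include the locality-scoped addresses mentioned above: a peer may
		// be dialed via a LocalityAddress rather than the primary address.
		for _, la := range desc.LocalityAddress {
			m[la.Address] = desc.NodeID
		}
	}
	return m
}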
Santamaura
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @koorosh, and @nvanbenschoten)
pkg/server/serverpb/status.proto line 1898 at r1 (raw file):
}

message NetworkConnectivityResponse {
I think in general the structure of the endpoint has almost everything we need, but one question I have is how will the UI be aware of the edge case we talked about where a node has not attempted to connect with another node? To me it seems like there is currently no way to differentiate between not being able to connect to another node because of issues and not being able to connect because the connection has not been attempted.
Santamaura
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @koorosh, and @nvanbenschoten)
pkg/server/serverpb/status.proto line 1898 at r1 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
I think in general the structure of the endpoint has almost everything we need, but one question I have is how will the UI be aware of the edge case we talked about where a node has not attempted to connect with another node? To me it seems like there is currently no way to differentiate between not being able to connect to another node because of issues and not being able to connect because the connection has not been attempted.
Actually, is there a reason we need to return 2 maps? Could we return only 1 map and instead just add on the liveness state, e.g. something like:
Code snippet:
message NetworkConnectivityResponse {
  message NodeConnectivity {
    map<int32, int64> latencies = 1 [
      (gogoproto.nullable) = false,
      (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"
    ];
    kv.kvserver.liveness.livenesspb.Liveness liveness = 2;
  };
  map<int32, NodeConnectivity> connectivity_by_node_id = 1 [
    (gogoproto.nullable) = false,
    (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"
  ];
}
koorosh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @nvanbenschoten, and @Santamaura)
pkg/server/serverpb/status.proto line 1898 at r1 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
Actually, is there a reason we need to return 2 maps? Could we return only 1 map and instead just add on the liveness state, e.g. something like:
That makes sense, it can all be grouped together. Do we still need NodeConnectivity info here?
I think so, we want it to hold both the liveness state and the latency map, so when the payload is received on the FE I'm imagining something like …
I wanted to check what the status of this was. I think this is generally a good change, and should be able to be finished and merged now.
8f9ebf3 to 8f64d89 (Compare)
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
8f64d89 to 38a8050 (Compare)
a9f966c to 9f637ad (Compare)
cc @andrewbaptist, I've updated the PR in a way to collect both latencies and connection states based on whether … It requires follow-up work to ensure that …
9f637ad to 5cafa7b (Compare)
andrewbaptist
left a comment
I really like the general idea here, and want to do something similar internally for determining which nodes are live (other than via the liveness range). A few cleanups and this could be good!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @koorosh, and @nvanbenschoten)
pkg/gossip/gossip.go line 1536 at r3 (raw file):
g.mu.RLock()
defer g.mu.RUnlock()
infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
There is already gossip.IterateInfos. Are you doing something different here so you can't use that?
pkg/server/status.go line 1891 at r3 (raw file):
}

// NetworkConnectivity returns information about connection statuses between nodes.
If this is going to initiate connections and not just report current statuses, this comment should change. I was also hoping to use something like this internally for a change to liveness I'm looking at. Maybe this could be separated into the needed parts (putting more of this in the rpc package) and only deal with "organizing" the data here.
pkg/server/status.go line 1904 at r3 (raw file):
if len(req.NodeID) > 0 {
	sourceNodeID, local, err := s.parseNodeID(req.NodeID)
You shouldn't need to parse here. Shouldn't the NodeId field be using a casttype to NodeID instead?
pkg/server/status.go line 1977 at r3 (raw file):
}

if err := s.iterateNodes(ctx, "network connectivity", dialFn, nodeFn, responseFn, errorFn); err != nil {
It looks like this gets a list of all nodes and then attempts to dial all the ones it is not currently connected to. Today we mostly have a "full mesh" network where every node dials every other node, but this would "force" it to continue. I think it would be better to separate nodes into three classes:
- Currently connected
- Not connected (but not necessarily unable to connect).
- Last connection attempt failed.
I'm not sure how much harder that logic could be, but it seems incorrect to dial a node for reporting reasons just to find out if you can dial it. Is the intention that this would be called with every node in the system?
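A hedged sketch of that three-way split, using only state the rpc context already tracks rather than dialing. It assumes (per the commit message later in this thread) that rpcCtx.ConnHealth returns nil for a healthy peer, ErrNoConnection when no connection exists, and some other error when the last attempt failed.

func classifyConn(rpcCtx *rpc.Context, addr string, nodeID roachpb.NodeID) string {
	// ConnHealth only inspects existing peer state; it does not dial.
	err := rpcCtx.ConnHealth(addr, nodeID, rpc.DefaultClass)
	switch {
	case err == nil:
		return "currently connected"
	case errors.Is(err, rpc.ErrNoConnection):
		return "not connected (but not necessarily unable to connect)"
	default:
		return "last connection attempt failed"
	}
}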
pkg/server/serverpb/status.proto line 1977 at r3 (raw file):
message NetworkConnectivityRequest {
  string node_id = 1 [
    (gogoproto.customname) = "NodeID"
You should add a casttype = "NodeID". I see you did that for the response, but is there a specific reason it doesn't work on the Request side?
5cafa7b to 0d8407f (Compare)
This change makes `ErrQuiescing` exported to reuse it later in the `server` package (in scope of PR cockroachdb#95429) to rely on it as an indicator of initiated node shutdown.

Release note: None
koorosh
left a comment
Ack. Will do.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @Santamaura, and @tbg)
pkg/server/status.go line 2038 at r7 (raw file):
I tried to distinguish two possible cases:
- connection closed as intended: planned shutdown, decommissioning, etc. (so it's `_CLOSED` status);
- connection closed unexpectedly, which should bring the attention of users (set as `_PARTITIONED` status, as all remaining errors).
Maybe it would be clearer to rename _PARTITIONED to _ERROR?
expose the error in the response (and show it in the UI tooltip)
Correct, I'm going to show errors in tooltip for all "unhealthy" statuses.
pkg/server/status.go line 2040 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
`ErrNotHeartbeated` means we are connecting for the first time. So I would call this `INITIALIZING` or something like that.
Agree. What about `_ESTABLISHING` (to make zero assumptions that the connection is successful, just in the process of being established)?
pkg/server/status.go line 2042 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This one would go away. I don't think we are in the right place here to reason about what a connection error "means"; the code should avoid trying to do that (it's not even something we can do in general). The connection is either absent, initializing, errored, or established. Exposing the error is more important than trying to classify it.
Agree in general and, as mentioned above, this case can be marked with the _ERROR status without specific reasoning.
It is the case where we couldn't identify the exact reason why the connection closed (as is done in the first case).
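Taken together, the mapping being converged on in this thread might look like the sketch below. ESTABLISHED is an assumed name for the healthy case, and the plain strings stand in for the proto enum constants, which were still being named at this point.

func connStatus(err error) string {
	switch {
	case err == nil:
		return "ESTABLISHED" // assumed name for the healthy case
	case errors.Is(err, rpc.ErrNotHeartbeated):
		return "ESTABLISHING" // first connection attempt still in flight
	case errors.Is(err, rpc.ErrQuiescing):
		return "CLOSED" // intentional shutdown/decommission
	default:
		return "ERROR" // anything else: expose the error rather than classify it
	}
}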
pkg/rpc/context_test.go line 2102 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
please make a separate commit for the export of errQuiescing.
Done.
c566631 to 8b0a9ac (Compare)
tbg
left a comment
Great! Nothing substantial left, only nits.
This patch introduces a new endpoint which is aimed to provide information about network connection statuses between nodes rather than node statuses (that is done with the node liveness API).

The `NetworkConnectivity` endpoint relies on the gossip client to get all known nodes in the cluster and then checks `rpcCtx.ConnHealth` for every peer. It can tell us the following:
- the connection is live if no error is returned;
- `ErrNoConnection` is returned if the node is attempting to connect to the target node;
- other errors indicate that the network connection is unhealthy.

This functionality should not care about whether a node is decommissioned, shut down, or whatever. Later on, it'll be used in conjunction with node liveness statuses to distinguish more specific reasons why a network connection is broken. For instance:
- if the node liveness status is DEAD and the connection has failed, then it should not be considered a network issue;
- if the node liveness status is LIVE/DECOMMISSIONING and the network connection has failed, then it is network partitioning.

Release note: None
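The "for instance" cases above reduce to a small decision; a hedged sketch, assuming liveness comes in as livenesspb.NodeLivenessStatus and a nil connErr means a healthy connection:

func isLikelyPartition(status livenesspb.NodeLivenessStatus, connErr error) bool {
	if connErr == nil {
		return false // connection is healthy, nothing to explain
	}
	switch status {
	case livenesspb.NodeLivenessStatus_DEAD, livenesspb.NodeLivenessStatus_DECOMMISSIONED:
		// The peer is known to be down; a failed connection is expected
		// and is not a network issue.
		return false
	case livenesspb.NodeLivenessStatus_LIVE, livenesspb.NodeLivenessStatus_DECOMMISSIONING:
		// The peer should be reachable but isn't: likely network partitioning.
		return true
	default:
		return false // UNAVAILABLE/UNKNOWN: not enough signal either way
	}
}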
8b0a9ac to af1aec2 (Compare)
@tbg, TFTR and for making it possible to get this done!
@andrewbaptist, I didn't address your suggestion regarding not using a fan-out of requests in this PR. Currently there's no easy way to avoid it.
bors r+ |
👎 Rejected by code reviews |
bors retry |
Build failed (retrying...): |
Build succeeded: |
103837: ui: Network latency page improvements r=koorosh a=koorosh

- use connectivity endpoint as a source of connection statuses and latency;
- removed popup with disconnected peers; instead it will be shown on the latency matrix;

Resolves: #96101

Release note (ui change): add "Show dead nodes" option to show/hide dead/decommissioned nodes from the latency matrix.
Release note (ui change): "No connections" popup menu is removed. Now failed connections are displayed on the latency matrix.

- [x] test coverage (storybook added for visual validation)

This PR depends on the following PRs that should be merged before this one:
- #95429
- #99191

#### example 1. One node cannot establish a connection to another one. [screenshot]
#### example 2. Node 5 is stopped and is considered `unavailable` before being set as `dead`. [screenshot]
#### example 3. Node 3 is dead. No connection from/to the node. Show error message. [screenshot]
#### example 4. Decommissioned node. [screenshot]

106082: sql: add REPLICATION roleoption for replication mode r=rafiss a=otan

These commits add the REPLICATION roleoption (as per PG), and then use it to authenticate whether a user can use the replication protocol.

Informs #105130

----

# sql: add REPLICATION roleoption

Matches PostgreSQL implementation of the REPLICATION roleoption.

Release note (sql change): Added the REPLICATION role option for a user, which allows a user to use the streaming replication protocol.

# sql: only allow REPLICATION users to login with replication mode

In PG, the REPLICATION roleoption is required to use streaming replication mode. Enforce the same constraint.

Release note: None

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
This patch introduces a new endpoint which is aimed to provide information
about network connection statuses between nodes rather than node statuses
(that is done with the node liveness API).

The `NetworkConnectivity` endpoint relies on the gossip client to get all known
nodes in the cluster and then checks `rpcCtx.ConnHealth` for every peer.
It can tell us the following:
- the connection is live if no error is returned;
- `ErrNoConnection` is returned if nodes don't have a connection;
- other errors indicate that the network connection is unhealthy.

This functionality should not care about whether a node is decommissioned,
shut down, or whatever.
Later on, it'll be used in conjunction with node liveness statuses to
distinguish more specific reasons why a network connection is broken.
For instance:
- if the node liveness status is DEAD and the connection has failed, then it
  should not be considered a network issue;
- if the node liveness status is LIVE/DECOMMISSIONING and the network
  connection has failed, then it is network partitioning.

Release note: None

Part of #58033
Part of #96101

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21710