server: add NetworkConnectivity endpoint to get network connection statuses between nodes #95429
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
pkg/gossip/gossip.proto
Outdated
  ClientStatus client = 2 [(gogoproto.nullable) = false];
  ServerStatus server = 3 [(gogoproto.nullable) = false];
- Connectivity connectivity = 4 [(gogoproto.nullable) = false];
+ Connectivity connectivity = 4 [(gogoproto.nullable) = false, deprecated = true];
Not relevant to the current work. Relates to the following PR: #89613
pkg/server/status.go
Outdated
// iterateNodes iterates nodeFn over all non-removed nodes concurrently.
// It then calls nodeResponse for every valid result of nodeFn, and
// nodeError on every error result.
func (s *statusServer) callNodes(
callNodes is identical to the dialNodes function; the only difference is that the nodes to dial are provided as an argument instead of being fetched from s.serverIterator.getAllNodes(ctx) (as commented out below).
It allows dialing specific nodes (nodes known to the gossip client, in our case).
tbg
left a comment
The basic ingredients look good.
Would you mind sending the changes for the rpc package (OnDisconnect()) out separately including a unit test? We want to make sure that if a connection goes bad, we no longer claim to have a healthy latency for it. On that PR we should also manually verify this fact by actually looking at the matrix under a partition. We might be able to backport this, so it's useful to separate it out.
The basic approach for the network latency endpoint seems fine, but a lot of plumbing is getting introduced that I'm unsure of since I don't have the background. Why are we introducing new protos? What is the current network latency matrix using? Is that going to be deprecated? Why isn't it being deprecated here? Etc.
The network connectivity code (that I've just glanced over, not reviewed) could be architected for better testability. Ideally you have a method that receives only the nodeID->addr mapping and a latencies map and constructs the result, and that is what gets unit tested. All of the server glue can come separately (in separate commits/PRs), and these will be "dumb" PRs that don't do anything too interesting.
It looks like you have some trouble using dialNodes in the context you're in (leading to callNodes being introduced). May I suggest a separate change (a separate PR is fine and easier to handle, but I'll also take a separate commit) where you adjust dialNodes to work for your use case as well.
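A minimal sketch of the testable core being asked for here, with all server glue stripped away: a nodeID->addr mapping and a latencies map in, a result out. The function name, parameter shapes, and the plain-map result are illustrative assumptions, not types from the PR.

func buildConnectivity(
	nodeAddrs map[roachpb.NodeID]string, // nodeID -> remote address
	latencies map[string]time.Duration, // measured latency keyed by remote address
) map[roachpb.NodeID]time.Duration {
	// Pure function: no dialing, no gossip, no server state, so it can be
	// unit tested with literal maps.
	out := make(map[roachpb.NodeID]time.Duration, len(nodeAddrs))
	for nodeID, addr := range nodeAddrs {
		if lat, ok := latencies[addr]; ok {
			out[nodeID] = lat
		}
	}
	return out
}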
pkg/gossip/gossip.go
Outdated
func (g *Gossip) GetBootstrapAddresses() map[util.UnresolvedAddr]roachpb.NodeID {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.bootstrapAddrs
Data race? You're leaking a map out from under the mutex.
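A minimal sketch of the usual fix, assuming the accessor is kept at all (see the next comment): copy the map while still holding the mutex instead of handing out the internal reference.

func (g *Gossip) GetBootstrapAddresses() map[util.UnresolvedAddr]roachpb.NodeID {
	g.mu.RLock()
	defer g.mu.RUnlock()
	// Copy under the lock: returning g.bootstrapAddrs directly would let
	// callers read the map while gossip mutates it, a data race.
	addrs := make(map[util.UnresolvedAddr]roachpb.NodeID, len(g.bootstrapAddrs))
	for addr, nodeID := range g.bootstrapAddrs {
		addrs[addr] = nodeID
	}
	return addrs
}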
See other commits, you may not want to add this method in the first place.
export type SetTraceRecordingTypeResponseMessage =
  protos.cockroach.server.serverpb.SetTraceRecordingTypeResponse;

export type NetworkConnectivityRequest =
Could you put all of the "front end" changes into a separate commit or PR?
pkg/gossip/gossip.go
Outdated
func (g *Gossip) GetKnownNodeIDs() ([]roachpb.NodeID, error) {
	g.mu.RLock()
	infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
	g.mu.RUnlock()
This needs to be deferred. The infos you're collecting hold on to *info which must not be accessed outside of the mutex.
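A minimal sketch of the shape this asks for, holding the lock (via defer) for as long as the infos are read. It assumes getMatchedInfos returns entries keyed by their gossip key; DecodeNodeDescKey is an illustrative helper for extracting the node ID from that key, not necessarily the gossip package's actual API.

func (g *Gossip) GetKnownNodeIDs() ([]roachpb.NodeID, error) {
	g.mu.RLock()
	defer g.mu.RUnlock() // infos reference *info and must stay under the mutex
	infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
	if err != nil {
		return nil, err
	}
	// Extract plain node IDs while still holding the lock; only the IDs
	// escape, never the *info values.
	nodeIDs := make([]roachpb.NodeID, 0, len(infos))
	for key := range infos {
		nodeID, err := DecodeNodeDescKey(key, KeyNodeDescPrefix) // assumed helper
		if err != nil {
			return nil, err
		}
		nodeIDs = append(nodeIDs, nodeID)
	}
	return nodeIDs, nil
}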
pkg/server/status.go
Outdated
if local {
	nodeLatencies := map[roachpb.NodeID]int64{}
	// addressToNodeIdMap contains all known addresses
	addressToNodeIdMap := s.gossip.GetBootstrapAddresses()
I'm not sure we want to rely on the bootstrapAddrs field here. The semantics for that are a bit shaky. Even though bootstrapAddrs seems like it very aggressively tracks remote addresses, I think we should try to build something that's more easily understood and that we can reason about better.
Each node gossips its NodeDescriptor here:
cockroach/pkg/gossip/gossip.go, lines 403 to 405 in 2114678:

if err := g.AddInfoProto(MakeNodeIDKey(desc.NodeID), desc, NodeDescriptorTTL); err != nil {
	return errors.Wrapf(err, "n%d: couldn't gossip descriptor", desc.NodeID)
}
The TTL for that is 2h. So we can reasonably assume that a node will know all of its recent peers whenever network trouble begins. (And we cannot avoid this not always being true in contrived scenarios anyway: a node might restart while being partitioned, losing all in-memory state; yes, we could add persistence, but I don't think we would, as the persisted nodes could be stale after a restart!)
You added an accessor for this already: GetAllNodeIDs(). Now instead of listing just the IDs, it should also return the descriptors. Then you can construct a map from address to nodeID (don't forget about the LocalityAddrs, which I just learned is a thing), and that can replace the bootstrapAddrs-based lookup.
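A hedged sketch of that map construction from gossiped descriptors; the function name is illustrative, and it assumes the descriptor-returning accessor described above exists.

func addressToNodeIDMap(descs []roachpb.NodeDescriptor) map[util.UnresolvedAddr]roachpb.NodeID {
	m := make(map[util.UnresolvedAddr]roachpb.NodeID, len(descs))
	for _, desc := range descs {
		m[desc.Address] = desc.NodeID
		// Include the locality-scoped addresses mentioned above: a peer may
		// be dialed via a LocalityAddress rather than the primary address.
		for _, la := range desc.LocalityAddress {
			m[la.Address] = desc.NodeID
		}
	}
	return m
}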
Santamaura
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @koorosh, and @nvanbenschoten)
pkg/server/serverpb/status.proto line 1898 at r1 (raw file):
}

message NetworkConnectivityResponse {
I think in general the structure of the endpoint has almost everything we need, but one question I have is how will the UI be aware of the edge case we talked about where a node has not attempted to connect with another node? To me it seems like there is currently no way to differentiate between not being able to connect to another node because of issues and not being able to connect because the connection has not been attempted.
Santamaura
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @koorosh, and @nvanbenschoten)
pkg/server/serverpb/status.proto line 1898 at r1 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
I think in general the structure of the endpoint has almost everything we need, but one question I have is how will the UI be aware of the edge case we talked about where a node has not attempted to connect with another node? To me it seems like there is currently no way to differentiate between not being able to connect to another node because of issues and not being able to connect because the connection has not been attempted.
Actually, is there a reason we need to return 2 maps? Could we return only 1 map and instead just add on the liveness state, e.g. something like:
Code snippet:
message NetworkConnectivityResponse {
  message NodeConnectivity {
    map<int32, int64> latencies = 1 [
      (gogoproto.nullable) = false,
      (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"
    ];
    kv.kvserver.liveness.livenesspb.Liveness liveness = 2;
  };
  map<int32, NodeConnectivity> connectivity_by_node_id = 1 [
    (gogoproto.nullable) = false,
    (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"
  ];
}
koorosh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @erikgrinaker, @nvanbenschoten, and @Santamaura)
pkg/server/serverpb/status.proto line 1898 at r1 (raw file):
Previously, Santamaura (Alex Santamaura) wrote…
Actually, is there a reason we need to return 2 maps? Could we return only 1 map and instead just add on the liveness state, e.g. something like:
That makes sense, it can all be grouped together. Do we still need NodeConnectivity info here?
I think so, we want it to hold both the liveness state and the latency map, so when the payload is received on the FE I'm imagining something like …
I wanted to check what the status of this was. I think this is generally a good change, and should be able to be finished and merged now.
8f9ebf3 to 8f64d89 (Compare)
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
8f64d89 to 38a8050 (Compare)
a9f966c to 9f637ad (Compare)
cc @andrewbaptist, I've updated the PR in a way to collect both latencies and connection states based on whether … It requires follow-up work to ensure that …
9f637ad to 5cafa7b (Compare)
andrewbaptist
left a comment
I really like the general idea here, and want to do something similar internally for determining which nodes are live (other than via the liveness range). A few cleanups and this could be good!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @koorosh, and @nvanbenschoten)
pkg/gossip/gossip.go line 1536 at r3 (raw file):
g.mu.RLock()
defer g.mu.RUnlock()
infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
There is already gossip.IterateInfos. Are you doing something different here so you can't use that?
pkg/server/status.go line 1891 at r3 (raw file):
}

// NetworkConnectivity returns information about connection statuses between nodes.
If this is going to initiate connections and not just report current statuses, this comment should change. I was also hoping to use something like this internally for a change to liveness I'm looking at. Maybe this could be separated into the needed parts (putting more of this in the rpc package) and only deal with "organizing" the data here.
pkg/server/status.go line 1904 at r3 (raw file):
if len(req.NodeID) > 0 {
	sourceNodeID, local, err := s.parseNodeID(req.NodeID)
You shouldn't need to parse here. Shouldn't the NodeId field be using a casttype to NodeID instead?
pkg/server/status.go line 1977 at r3 (raw file):
}

if err := s.iterateNodes(ctx, "network connectivity", dialFn, nodeFn, responseFn, errorFn); err != nil {
It looks like this gets a list of all nodes and then attempts to dial all the ones it is not currently connected to. Today we mostly have a "full mesh" network where every node dials every other node, but this would "force" it to continue. I think it would be better to separate nodes into three classes:
- Currently connected
- Not connected (but not necessarily unable to connect).
- Last connection attempt failed.
I'm not sure how much harder that logic could be, but it seems incorrect to dial a node for reporting reasons just to find out if you can dial it. Is the intention that this would be called with every node in the system?
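A hedged sketch of that three-way split, using only state the rpc context already tracks rather than dialing. It assumes (per the commit message later in this thread) that rpcCtx.ConnHealth returns nil for a healthy peer, ErrNoConnection when no connection exists, and some other error when the last attempt failed.

func classifyConn(rpcCtx *rpc.Context, addr string, nodeID roachpb.NodeID) string {
	// ConnHealth only inspects existing peer state; it does not dial.
	err := rpcCtx.ConnHealth(addr, nodeID, rpc.DefaultClass)
	switch {
	case err == nil:
		return "currently connected"
	case errors.Is(err, rpc.ErrNoConnection):
		return "not connected (but not necessarily unable to connect)"
	default:
		return "last connection attempt failed"
	}
}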
pkg/server/serverpb/status.proto line 1977 at r3 (raw file):
message NetworkConnectivityRequest {
  string node_id = 1 [
    (gogoproto.customname) = "NodeID"
You should add a casttype = "NodeID". I see you did that for the response, but is there a specific reason it doesn't work on the Request side?
5cafa7b to 0d8407f (Compare)
This change makes `ErrQuiescing` exported to reuse it later in the `server` package (in scope of PR cockroachdb#95429) to rely on it as an indicator of initiated node shutdown.

Release note: None
koorosh
left a comment
Ack. Will do.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @Santamaura, and @tbg)
pkg/server/status.go line 2038 at r7 (raw file):
I tried to distinguish two possible cases:
- connection closed as intended: planned shutdown, decommissioning, etc. (so it's `_CLOSED` status);
- connection closed unexpectedly, which should bring the attention of users (set as `_PARTITIONED` status, as all remaining errors).
Maybe it would be clearer to rename _PARTITIONED to _ERROR?
expose the error in the response (and show it in the UI tooltip)
Correct, I'm going to show errors in tooltip for all "unhealthy" statuses.
pkg/server/status.go line 2040 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
`ErrNotHeartbeated` means we are connecting for the first time. So I would call this `INITIALIZING` or something like that.
Agree. What about `_ESTABLISHING` (to make zero assumptions that the connection is successful, just in the process of being established)?
pkg/server/status.go line 2042 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This one would go away. I don't think we are in the right place here to reason about what a connection error "means"; the code should avoid trying to do that (it's not even something we can do in general). The connection is either absent, initializing, errored, or established. Exposing the error is more important than trying to classify it.
Agree in general and, as mentioned above, this case can be marked with the _ERROR status without specific reasoning.
It is the case where we couldn't identify the exact reason why the connection closed (as is done in the first case).
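Taken together, the mapping being converged on in this thread might look like the sketch below. ESTABLISHED is an assumed name for the healthy case, and the plain strings stand in for the proto enum constants, which were still being named at this point.

func connStatus(err error) string {
	switch {
	case err == nil:
		return "ESTABLISHED" // assumed name for the healthy case
	case errors.Is(err, rpc.ErrNotHeartbeated):
		return "ESTABLISHING" // first connection attempt still in flight
	case errors.Is(err, rpc.ErrQuiescing):
		return "CLOSED" // intentional shutdown/decommission
	default:
		return "ERROR" // anything else: expose the error rather than classify it
	}
}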
pkg/rpc/context_test.go line 2102 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
please make a separate commit for the export of errQuiescing.
Done.
c566631 to 8b0a9ac (Compare)
tbg
left a comment
Great! Nothing substantial left, only nits.
This patch introduces a new endpoint which is aimed to provide information about network connection statuses between nodes rather than node statuses (that is done with the node liveness API).

The `NetworkConnectivity` endpoint relies on the gossip client to get all known nodes in the cluster and then checks `rpcCtx.ConnHealth` for every peer. It can tell us the following:
- the connection is live if no error is returned;
- `ErrNoConnection` is returned if the node is attempting to connect to the target node;
- other errors indicate that the network connection is unhealthy.

This functionality should not care about whether a node is decommissioned, shut down, or whatever. Later on, it'll be used in conjunction with node liveness statuses to distinguish more specific reasons why a network connection is broken. For instance:
- if the node liveness status is DEAD and the connection has failed, then it should not be considered a network issue;
- if the node liveness status is LIVE/DECOMMISSIONING and the network connection has failed, then it is network partitioning.

Release note: None
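The "for instance" cases above reduce to a small decision; a hedged sketch, assuming liveness comes in as livenesspb.NodeLivenessStatus and a nil connErr means a healthy connection:

func isLikelyPartition(status livenesspb.NodeLivenessStatus, connErr error) bool {
	if connErr == nil {
		return false // connection is healthy, nothing to explain
	}
	switch status {
	case livenesspb.NodeLivenessStatus_DEAD, livenesspb.NodeLivenessStatus_DECOMMISSIONED:
		// The peer is known to be down; a failed connection is expected
		// and is not a network issue.
		return false
	case livenesspb.NodeLivenessStatus_LIVE, livenesspb.NodeLivenessStatus_DECOMMISSIONING:
		// The peer should be reachable but isn't: likely network partitioning.
		return true
	default:
		return false // UNAVAILABLE/UNKNOWN: not enough signal either way
	}
}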
8b0a9ac to af1aec2 (Compare)
@tbg, TFTR and for making it possible to get this done!
@andrewbaptist, I didn't address your suggestion regarding not using a fan-out of requests in this PR. Currently there's no easy way to avoid it.
bors r+ |
👎 Rejected by code reviews |
bors retry |
Build failed (retrying...): |
Build succeeded: |
103837: ui: Network latency page improvements r=koorosh a=koorosh

- use connectivity endpoint as a source of connection statuses and latency;
- removed popup with disconnected peers; instead it will be shown on the latency matrix;

Resolves: #96101

Release note (ui change): add "Show dead nodes" option to show/hide dead/decommissioned nodes from the latency matrix.
Release note (ui change): "No connections" popup menu is removed. Now failed connections are displayed on the latency matrix.

- [x] test coverage (storybook added for visual validation)

This PR depends on the following PRs that should be merged before this one:
- #95429
- #99191

#### example 1. One node cannot establish a connection to another one. [screenshot]
#### example 2. Node 5 is stopped and is considered `unavailable` before being set as `dead`. [screenshot]
#### example 3. Node 3 is dead. No connection from/to the node. Show error message. [screenshot]
#### example 4. Decommissioned node. [screenshot]

106082: sql: add REPLICATION roleoption for replication mode r=rafiss a=otan

These commits add the REPLICATION roleoption (as per PG), and then use it to authenticate whether a user can use the replication protocol.

Informs #105130

----

# sql: add REPLICATION roleoption

Matches PostgreSQL implementation of the REPLICATION roleoption.

Release note (sql change): Added the REPLICATION role option for a user, which allows a user to use the streaming replication protocol.

# sql: only allow REPLICATION users to login with replication mode

In PG, the REPLICATION roleoption is required to use streaming replication mode. Enforce the same constraint.

Release note: None

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
This patch introduces a new endpoint which is aimed to provide information
about network connection statuses between nodes rather than node statuses
(that is done with the node liveness API).

The `NetworkConnectivity` endpoint relies on the gossip client to get all known
nodes in the cluster and then checks `rpcCtx.ConnHealth` for every peer.
It can tell us the following:
- the connection is live if no error is returned;
- `ErrNoConnection` is returned if nodes don't have a connection;
- other errors indicate that the network connection is unhealthy.

This functionality should not care about whether a node is decommissioned,
shut down, or whatever.
Later on, it'll be used in conjunction with node liveness statuses to
distinguish more specific reasons why a network connection is broken.
For instance:
- if the node liveness status is DEAD and the connection has failed, then it
  should not be considered a network issue;
- if the node liveness status is LIVE/DECOMMISSIONING and the network
  connection has failed, then it is network partitioning.

Release note: None

Part of #58033
Part of #96101

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21710