
Conversation

@koorosh (Contributor) commented Jan 18, 2023

This patch introduces a new endpoint aimed at providing information
about the network connection status between nodes, rather than node
statuses (which the node liveness API already covers).

The NetworkConnectivity endpoint relies on the gossip client to get all known
nodes in the cluster and then checks rpcCtx.ConnHealth for every peer.
It can tell us the following:

  • the connection is live if no error is returned;
  • ErrNoConnection is returned if the nodes have no connection;
  • any other error indicates that the network connection is unhealthy.

This functionality deliberately ignores whether a node is decommissioned,
shut down, and so on.
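
A minimal sketch of the per-peer check described above, assuming a hypothetical checkPeers helper and addrOf address lookup (neither is the PR's actual code); only rpcCtx.ConnHealth and the gossip-provided node list mirror the description:

```go
package server

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/rpc"
)

// checkPeers is a hypothetical helper: for every node known to gossip, ask
// the RPC context for connection health and record the result. A nil error
// means the connection is live; rpc.ErrNoConnection means there is no
// connection; any other error means the connection is unhealthy.
func checkPeers(
	rpcCtx *rpc.Context,
	nodeIDs []roachpb.NodeID,
	addrOf func(roachpb.NodeID) string, // assumed address lookup, e.g. from gossip
) map[roachpb.NodeID]error {
	results := make(map[roachpb.NodeID]error, len(nodeIDs))
	for _, id := range nodeIDs {
		results[id] = rpcCtx.ConnHealth(addrOf(id), id, rpc.DefaultClass)
	}
	return results
}
```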

Later on, it will be used in conjunction with node liveness statuses to
distinguish more specific reasons why a network connection is broken.
For instance,

  • if the node liveness status is DEAD and the connection has failed,
    then it should not be considered a network issue;
  • if the node liveness status is LIVE/DECOMMISSIONING and the network
    connection has failed, then it is a network partition.

Release note: None

Part of #58033
Part of #96101

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21710

@koorosh koorosh requested review from erikgrinaker and tbg January 18, 2023 13:53
blathers-crl bot commented Jan 18, 2023

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl bot added the O-community (Originated from the community) label Jan 18, 2023
@cockroach-teamcity (Member) commented:
This change is Reviewable

  ClientStatus client = 2 [(gogoproto.nullable) = false];
  ServerStatus server = 3 [(gogoproto.nullable) = false];
- Connectivity connectivity = 4 [(gogoproto.nullable) = false];
+ Connectivity connectivity = 4 [(gogoproto.nullable) = false, deprecated = true];
@koorosh (Contributor, Author) commented:

Not relevant to the current work. Relates to the following PR: #89613

// callNodes iterates nodeFn over the given nodes concurrently.
// It then calls nodeResponse for every valid result of nodeFn, and
// nodeError on every error result.
func (s *statusServer) callNodes(
@koorosh (Contributor, Author) commented:

callNodes is identical to the dialNodes function; the only difference is that the nodes to dial are provided as an argument instead of coming from s.serverIterator.getAllNodes(ctx) (as commented out below).
It allows dialing specific nodes (in our case, the nodes known to the gossip client); a sketch of such a fan-out follows below.
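
A minimal sketch of what such a fan-out might look like, using plain sync primitives; the signature is a simplification for illustration, not the PR's actual one:

```go
package server

import (
	"context"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// callNodesSketch fans nodeFn out over the provided nodes concurrently and
// reports each result through responseFn or errorFn, mirroring the
// iterate/dial helpers discussed above.
func callNodesSketch(
	ctx context.Context,
	nodeIDs []roachpb.NodeID, // e.g. the nodes known to the gossip client
	nodeFn func(ctx context.Context, id roachpb.NodeID) (interface{}, error),
	responseFn func(id roachpb.NodeID, resp interface{}),
	errorFn func(id roachpb.NodeID, err error),
) {
	var wg sync.WaitGroup
	var mu sync.Mutex // serializes responseFn/errorFn calls
	for _, id := range nodeIDs {
		id := id // capture loop variable
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := nodeFn(ctx, id)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errorFn(id, err)
				return
			}
			responseFn(id, resp)
		}()
	}
	wg.Wait()
}
```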

@tbg (Member) left a comment:

The basic ingredients look good.

Would you mind sending the changes for the rpc package (OnDisconnect()) out separately including a unit test? We want to make sure that if a connection goes bad, we no longer claim to have a healthy latency for it. On that PR we should also manually verify this fact by actually looking at the matrix under a partition. We might be able to backport this, so it's useful to separate it out.

The basic approach for the network latency endpoint seems fine, but a lot of plumbing is getting introduced that I'm unsure of since I don't have the background. Why are we introducing new protos? What is the current network latency matrix using? Is that going to be deprecated? Why isn't it being deprecated here? Etc.

The network connectivity code (that I've just glanced over, not reviewed) could be architected for better testability. Ideally you have a method that receives only the nodeID->addr mapping and a latencies map and constructs the result, and that method is unit tested. All of the server glue can come separately (in separate commits/PRs), and those will be "dumb" PRs that don't do anything too interesting.

It looks like you had some trouble using dialNodes in the context you're in (leading to callNodes being introduced). May I suggest a separate PR (a separate PR is fine and easier to handle; I'll also take a separate commit) where you adjust dialNodes to work for your use case as well.

func (g *Gossip) GetBootstrapAddresses() map[util.UnresolvedAddr]roachpb.NodeID {
g.mu.RLock()
defer g.mu.RUnlock()
return g.bootstrapAddrs
}
@tbg (Member) commented:

Data race? You're leaking a map out from under the mutex.
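
One conventional fix, as a sketch: copy the map while still holding the lock so no reference to mutex-protected state escapes (this assumes bootstrapAddrs is only ever mutated under g.mu):

```go
// GetBootstrapAddresses returns a copy of the bootstrap address map so
// that callers never hold a reference into mutex-protected state.
func (g *Gossip) GetBootstrapAddresses() map[util.UnresolvedAddr]roachpb.NodeID {
	g.mu.RLock()
	defer g.mu.RUnlock()
	addrs := make(map[util.UnresolvedAddr]roachpb.NodeID, len(g.bootstrapAddrs))
	for addr, nodeID := range g.bootstrapAddrs {
		addrs[addr] = nodeID
	}
	return addrs
}
```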

@tbg (Member) commented:

See other commits, you may not want to add this method in the first place.

export type SetTraceRecordingTypeResponseMessage =
protos.cockroach.server.serverpb.SetTraceRecordingTypeResponse;

export type NetworkConnectivityRequest =
  protos.cockroach.server.serverpb.NetworkConnectivityRequest;
@tbg (Member) commented:

Could you put all of the "front end" changes into a separate commit or PR?

func (g *Gossip) GetKnownNodeIDs() ([]roachpb.NodeID, error) {
g.mu.RLock()
infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
g.mu.RUnlock()
@tbg (Member) commented:

This needs to be deferred. The infos you're collecting hold on to *info which must not be accessed outside of the mutex.
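
A sketch of the deferred variant, assuming getMatchedInfos returns a map keyed by gossip key and that a DecodeNodeDescKey helper (hypothetical) extracts the node ID; nothing *info-related escapes the lock:

```go
func (g *Gossip) GetKnownNodeIDs() ([]roachpb.NodeID, error) {
	g.mu.RLock()
	defer g.mu.RUnlock() // hold the lock for as long as the infos are touched
	infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))
	if err != nil {
		return nil, err
	}
	// Copy plain values out while still locked; *info must not escape.
	nodeIDs := make([]roachpb.NodeID, 0, len(infos))
	for key := range infos {
		nodeID, err := DecodeNodeDescKey(key) // hypothetical helper
		if err != nil {
			return nil, err
		}
		nodeIDs = append(nodeIDs, nodeID)
	}
	return nodeIDs, nil
}
```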

if local {
nodeLatencies := map[roachpb.NodeID]int64{}
// addressToNodeIdMap contains all known addresses
addressToNodeIdMap := s.gossip.GetBootstrapAddresses()
@tbg (Member) commented:

I'm not sure we want to rely on the bootstrapAddrs field here. The semantics for that are a bit shaky. Even though bootstrapAddrs seems like it very aggressively tracks remote addresses, I think we should try to build something that's more easily understood and that we can reason better about.

Each node gossips its NodeDescriptor here:

if err := g.AddInfoProto(MakeNodeIDKey(desc.NodeID), desc, NodeDescriptorTTL); err != nil {
return errors.Wrapf(err, "n%d: couldn't gossip descriptor", desc.NodeID)
}

The TTL for that is 2h, so we can reasonably assume that a node will know all of its recent peers whenever network trouble begins. (And we cannot avoid this being false in contrived scenarios anyway: a node might restart while partitioned, losing all in-memory state; yes, we could add persistence, but I don't think we would, as the persisted nodes could be stale after a restart!)

You added an accessor for this already: GetAllNodeIDs(). Now, instead of listing just the IDs, it should also return the descriptors. Then you can construct a map from address to nodeID (don't forget about the LocalityAddress entries, which I just learned are a thing), and that can be used to map observed addresses back to node IDs.
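
A sketch of that map construction, assuming an accessor that returns the gossiped roachpb.NodeDescriptors; each descriptor carries the primary Address plus the per-locality LocalityAddress entries mentioned above:

```go
// addrToNodeID builds the address -> NodeID mapping described above,
// covering the primary address and every locality-specific address.
func addrToNodeID(descs []roachpb.NodeDescriptor) map[util.UnresolvedAddr]roachpb.NodeID {
	m := make(map[util.UnresolvedAddr]roachpb.NodeID, len(descs))
	for _, d := range descs {
		m[d.Address] = d.NodeID
		for _, la := range d.LocalityAddress {
			m[la.Address] = d.NodeID
		}
	}
	return m
}
```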

@shralex shralex requested review from andrewbaptist and nvb January 24, 2023 20:24
@Santamaura (Contributor) left a comment:


pkg/server/serverpb/status.proto line 1898 at r1 (raw file):

}

message NetworkConnectivityResponse {

I think the structure of the endpoint has almost everything we need, but one question I have is: how will the UI be aware of the edge case we talked about, where a node has not attempted to connect to another node? Currently there seems to be no way to differentiate between failing to connect to another node because of issues and not being connected because no connection has been attempted.

@Santamaura (Contributor) left a comment:


pkg/server/serverpb/status.proto line 1898 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

I think the structure of the endpoint has almost everything we need, but one question I have is: how will the UI be aware of the edge case we talked about, where a node has not attempted to connect to another node? Currently there seems to be no way to differentiate between failing to connect to another node because of issues and not being connected because no connection has been attempted.

Actually, is there a reason we need to return two maps? Could we return only one map and instead just add on the liveness state, e.g. something like:

Code snippet:

message NetworkConnectivityResponse {
  message NodeConnectivity {
    map<int32, int64> latencies = 1 [
      (gogoproto.nullable) = false,
      (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"
    ];
    kv.kvserver.liveness.livenesspb.Liveness liveness = 2;
  };
  map<int32, NodeConnectivity> connectivity_by_node_id = 1 [
    (gogoproto.nullable) = false,
    (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"
  ];
}

@koorosh (Contributor, Author) left a comment:


pkg/server/serverpb/status.proto line 1898 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

Actually, is there a reason we need to return two maps? Could we return only one map and instead just add on the liveness state, e.g. something like:

That makes sense; it can all be grouped together. Do we still need the NodeConnectivity info here?

@Santamaura (Contributor) commented:

pkg/server/serverpb/status.proto line 1898 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…
That makes sense; it can all be grouped together. Do we still need the NodeConnectivity info here?

I think so. We want it to hold both the liveness state and the latency map, so when the payload is received on the FE I'm imagining something like { 1: { latencies: { 2: 100, 3: 200 }, liveness: 0 }, ...etc }.
Let me know if it doesn't make sense.

@andrewbaptist commented:

I wanted to check what the status of this was. I think this is generally a good change, and it should be possible to finish and merge it now.

@koorosh koorosh force-pushed the network-partition-obs branch from 8f9ebf3 to 8f64d89 Compare April 28, 2023 16:51
blathers-crl bot commented Apr 28, 2023

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@koorosh koorosh force-pushed the network-partition-obs branch from 8f64d89 to 38a8050 Compare April 28, 2023 16:59
@koorosh koorosh changed the title from "wip: get network latency from gossip client" to "server: get network connection statuses and latencies from gossip client" Apr 28, 2023
@koorosh koorosh marked this pull request as ready for review April 28, 2023 17:02
@koorosh koorosh requested review from a team as code owners April 28, 2023 17:02
@koorosh koorosh requested a review from a team April 28, 2023 17:02
@koorosh koorosh force-pushed the network-partition-obs branch 2 times, most recently from a9f966c to 9f637ad Compare April 28, 2023 19:10
@koorosh (Contributor, Author) commented Apr 28, 2023

> I wanted to check what the status of this was. I think this is generally a good change, and it should be possible to finish and merge it now.

cc @andrewbaptist, I've updated the PR so that it collects both latencies and connection states, based on whether ConnHealth returns an error or not.

It requires follow-up work to ensure that ConnHealth returns distinct errors that can be used to infer the connection state.

@koorosh koorosh force-pushed the network-partition-obs branch from 9f637ad to 5cafa7b Compare May 1, 2023 09:23
@andrewbaptist left a comment:

I really like the general idea here, and I want to do something similar internally for determining which nodes are live (other than via the liveness range). A few cleanups and this could be good!



pkg/gossip/gossip.go line 1536 at r3 (raw file):

	g.mu.RLock()
	defer g.mu.RUnlock()
	infos, err := g.mu.is.getMatchedInfos(MakePrefixPattern(KeyNodeDescPrefix))

There is already gossip.IterateInfos. Are you doing something different here so you can't use that?


pkg/server/status.go line 1891 at r3 (raw file):

}

// NetworkConnectivity returns information about connection statuses between nodes.

If this is going to initiate connections and not just report current statuses, this comment should change. I was also hoping to use something like this internally for a change I'm looking at for liveness. Maybe this could be separated into the needed parts (putting more of this in the rpc package), leaving only the "organizing" of the data here.


pkg/server/status.go line 1904 at r3 (raw file):

	if len(req.NodeID) > 0 {
		sourceNodeID, local, err := s.parseNodeID(req.NodeID)

You shouldn't need to parse here. Shouldn't the NodeId field be using a casttype to NodeID instead?


pkg/server/status.go line 1977 at r3 (raw file):

	}

	if err := s.iterateNodes(ctx, "network connectivity", dialFn, nodeFn, responseFn, errorFn); err != nil {

It looks like this gets a list of all nodes and then attempts to dial all the ones it is not currently connected to. Today we mostly have a "full mesh" network where every node dials every other node, but this would "force" it to continue. I think it would be better to separate nodes into three classes:

  1. Currently connected
  2. Not connected (but not necessarily unable to connect).
  3. Last connection attempt failed.

I'm not sure how much harder that logic would be, but it seems incorrect to dial a node for reporting reasons just to find out if you can dial it. Is the intention that this would be called with every node in the system?
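
A sketch of that three-way split, keying off the rpc package's sentinel errors; the peerClass type and classify helper are illustrative assumptions, not code from this PR:

```go
type peerClass int

const (
	connected     peerClass = iota // health check returned nil
	notConnected                   // no connection exists (none attempted, or none kept)
	attemptFailed                  // a connection was attempted and is unhealthy
)

// classify buckets a ConnHealth result into the three classes above.
func classify(err error) peerClass {
	switch {
	case err == nil:
		return connected
	case errors.Is(err, rpc.ErrNoConnection):
		return notConnected
	default:
		return attemptFailed
	}
}
```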


pkg/server/serverpb/status.proto line 1977 at r3 (raw file):

message NetworkConnectivityRequest {
  string node_id = 1 [
    (gogoproto.customname) = "NodeID"

You should add a casttype = "NodeID". I see you did that for the response, but is there a specific reason it doesn't work on the Request side?

@koorosh koorosh force-pushed the network-partition-obs branch from 5cafa7b to 0d8407f Compare May 3, 2023 10:19
@tbg tbg removed request for a team, erikgrinaker and nvb June 21, 2023 13:35
koorosh added a commit to koorosh/cockroach that referenced this pull request Jun 22, 2023
This change exports `ErrQuiescing` so it can be reused later
in the `server` package (in scope of PR cockroachdb#95429)
as an indicator of an initiated node shutdown.

Release note: None
@koorosh (Contributor, Author) left a comment:

Ack. Will do.



pkg/server/status.go line 2038 at r7 (raw file):
I tried to distinguish two possible cases:

  • the connection closed as intended: planned shutdown, decommissioning, etc. (the _CLOSED status);
  • the connection closed unexpectedly, which should get the user's attention (the _PARTITIONED status, assigned for all remaining errors).

Maybe it would be clearer to rename _PARTITIONED to _ERROR?

> expose the error in the response (and show it in the UI tooltip)

Correct, I'm going to show errors in a tooltip for all "unhealthy" statuses.


pkg/server/status.go line 2040 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ErrNotHeartbeated means we are connecting for the first time. So I would call this INITIALIZING or something like that.

Agree. What about _ESTABLISHING (to make zero assumptions that the connection will succeed; it's just in the process of being established)?


pkg/server/status.go line 2042 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This one would go away. I don't think we are in the right place here to reason about what a connection error "means"; the code should avoid trying to do that (it's not even something we can do in general). The connection is either absent, initializing, errored, or established. Exposing the error is more important than trying to classify it.

Agree in general; as mentioned above, this case can be marked with the _ERROR status without specific reasoning.
It is the case where we couldn't identify the exact reason why the connection closed (as is done in the first case).
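
Putting the threads above together, a sketch of the error-to-status mapping being converged on; the connStatus names echo this discussion and are assumptions about the eventual proto enum, and rpc.ErrQuiescing assumes the export from the separate commit mentioned below:

```go
type connStatus string

const (
	statusEstablished  connStatus = "ESTABLISHED"  // ConnHealth returned nil
	statusEstablishing connStatus = "ESTABLISHING" // first connection attempt in flight
	statusClosed       connStatus = "CLOSED"       // intentional shutdown/decommission
	statusError        connStatus = "ERROR"        // anything else; expose err in the UI tooltip
)

// statusFromHealth maps a ConnHealth error onto the statuses discussed above.
func statusFromHealth(err error) connStatus {
	switch {
	case err == nil:
		return statusEstablished
	case errors.Is(err, rpc.ErrNotHeartbeated):
		return statusEstablishing
	case errors.Is(err, rpc.ErrQuiescing):
		return statusClosed
	default:
		return statusError
	}
}
```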


pkg/rpc/context_test.go line 2102 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

please make a separate commit for the export of errQuiescing.

Done.

@tbg tbg self-requested a review June 22, 2023 15:59
@koorosh koorosh force-pushed the network-partition-obs branch 3 times, most recently from c566631 to 8b0a9ac Compare June 26, 2023 09:36
@tbg (Member) left a comment:

Great! Nothing substantial left, only nits.

This patch introduces a new endpoint aimed at providing information
about the network connection status between nodes, rather than node
statuses (which the node liveness API already covers).

`NetworkConnectivity` endpoint relies on the gossip client to get all known
nodes in the cluster and then checks `rpcCtx.ConnHealth` for every peer.
It can tell us the following:
- the connection is live if no error is returned;
- `ErrNoConnection` is returned if a node is still attempting to connect
to the target node;
- any other error indicates that the network connection is unhealthy.

This functionality deliberately ignores whether a node is decommissioned,
shut down, and so on.

Later on, it will be used in conjunction with node liveness statuses to
distinguish more specific reasons why a network connection is broken.
For instance,
- if the node liveness status is DEAD and the connection has failed,
then it should not be considered a network issue;
- if the node liveness status is LIVE/DECOMMISSIONING and the network
connection has failed, then it is a network partition.

Release note: None
@koorosh koorosh force-pushed the network-partition-obs branch from 8b0a9ac to af1aec2 Compare June 26, 2023 12:48
@koorosh (Contributor, Author) commented Jun 26, 2023

@tbg, TFTR and thanks for making it possible to get this done!

@koorosh koorosh requested a review from andrewbaptist June 26, 2023 13:13
@koorosh (Contributor, Author) commented Jun 26, 2023

@andrewbaptist, I didn't address your suggestion about avoiding the fan-out of requests in this PR. Currently there's no easy way to avoid it.

@koorosh (Contributor, Author) commented Jun 26, 2023

bors r+

craig bot commented Jun 26, 2023

👎 Rejected by code reviews

@andrewbaptist left a comment:

:lgtm:

@koorosh (Contributor, Author) commented Jun 26, 2023

bors retry

craig bot commented Jun 26, 2023

Build failed (retrying...):

craig bot commented Jun 26, 2023

Build succeeded:

@craig craig bot merged commit 8023be5 into cockroachdb:master Jun 26, 2023
craig bot pushed a commit that referenced this pull request Jul 18, 2023
103837: ui: Network latency page improvements r=koorosh a=koorosh

- use the connectivity endpoint as the source of connection statuses and latencies;
- removed the popup with disconnected peers; instead, failed connections are shown
on the latency matrix;

Resolves: #96101 

Release note (ui change): add a "Show dead nodes" option
to show/hide dead/decommissioned nodes on the latency matrix.

Release note (ui change): the "No connections" popup menu is removed. Failed
connections are now displayed on the latency matrix.

- [x] test coverage (storybook added for visual validation)

This PR depends on the following PRs, which should be merged before this one:
- #95429
- #99191

#### example 1. one node cannot establish connection to another one.
![Screenshot 2023-06-26 at 12 33 46](https://github.com/cockroachdb/cockroach/assets/3106437/2534e0fe-f5ba-48b5-8825-1c17a8112870)

#### example 2. Node 5 is stopped and is considered `unavailable` before being marked as `dead`
![Screenshot 2023-06-26 at 10 52 20](https://github.com/cockroachdb/cockroach/assets/3106437/bd703d4c-9812-4061-9140-4a8f8d3a5da9)

#### example 3. Node 3 is dead. No connection from/to node. Show error message.
<img width="954" alt="Screenshot 2023-06-23 at 14 07 25" src="https://github.com/cockroachdb/cockroach/assets/3106437/8078c421-aeaa-4038-adf5-e3c69ba6d863">

#### example 4. Decommissioned node.
 
![Screenshot 2023-06-26 at 18 11 23](https://github.com/cockroachdb/cockroach/assets/3106437/dc4cb22e-34b7-4cd3-a391-b8ea5fd0232d)





106082: sql: add REPLICATION roleoption for replication mode r=rafiss a=otan

These commits add the REPLICATION roleoption (as per PG), and then use it to authenticate whether a user can use the replication protocol.

Informs #105130

----

# sql: add REPLICATION roleoption

Matches PostgreSQL implementation of the REPLICATION roleoption.

Release note (sql change): Added the REPLICATION role option for a user,
which allows a user to use the streaming replication protocol.

# sql: only allow REPLICATION users to login with replication mode

In PG, the REPLICATION roleoption is required to use streaming
replication mode. Enforce the same constraint.

Release note: None



Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
