gossip: abnormal CPU usage growth with increasing node count #51838

@Drahflow

Is your feature request related to a problem? Please describe.

I am trying to estimate the maximum reachable OLTP performance for a client of mine. To my frustration, I was not able to scale a CockroachDB cluster significantly beyond 256 nodes, because CPU load rises sharply as nodes are added; according to profiling, most of that load is spent in gossip-protocol related functions. My measurements suggest that the gossip work done on each node scales quadratically with the number of nodes, which caps the maximum cluster size at about 1000 nodes.
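For anyone wanting to check the scaling claim against their own profiles, one common approach is to fit a line to log(cost) versus log(node count); the slope estimates the scaling exponent (about 1 for linear, about 2 for quadratic). The sketch below is only an illustration: the function name is hypothetical, and the input data is synthetic, generated from an exact quadratic. It is not the measurement data from this issue.

```python
import math


def scaling_exponent(node_counts, cpu_costs):
    """Return the least-squares slope of log(cost) against log(nodes)."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(c) for c in cpu_costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic placeholder data (cost = k * n**2), NOT measurements from this issue.
nodes = [32, 64, 128, 256]
quadratic_costs = [1e-6 * n ** 2 for n in nodes]
linear_costs = [1e-3 * n for n in nodes]

quadratic_fit = scaling_exponent(nodes, quadratic_costs)  # close to 2
linear_fit = scaling_exponent(nodes, linear_costs)        # close to 1
```

Replacing the synthetic lists with per-node gossip CPU time taken from profiles at several cluster sizes would give an exponent to compare against 2.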

Describe the solution you'd like

The gossip protocol should only perform linear work in the number of nodes.

Describe alternatives you've considered

The gossip protocol intervals could be made configurable, so that larger clusters become achievable by trading away DDL and node failure detection speed. However, this would only raise the maximum size by a small factor before the quadratic growth pushes failure detection times too high.
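A rough sketch of why the gain is small, under a simplified model that is an assumption and not something measured here: suppose per-node gossip CPU use is `cost_coeff * nodes**2 / interval` and must stay below a fixed `cpu_budget`. Then the largest sustainable cluster grows only with the square root of the interval. All names and parameters below are hypothetical.

```python
import math


def max_nodes(cpu_budget, cost_coeff, interval):
    """Largest node count with cost_coeff * n**2 / interval <= cpu_budget.

    Assumed model only: per-node gossip cost is quadratic in the node
    count and inversely proportional to the gossip interval.
    """
    return math.sqrt(cpu_budget * interval / cost_coeff)


def size_gain(interval_factor):
    """Factor by which max_nodes grows when the interval is stretched."""
    return math.sqrt(interval_factor)


gain_4x = size_gain(4)      # 2.0: a 4x slower gossip interval doubles the limit
gain_100x = size_gain(100)  # 10.0: a 100x slower interval gives only 10x
```

Under this model, a tenfold increase in cluster size would cost a hundredfold slowdown in DDL propagation and failure detection.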

Jira issue: CRDB-4006

Labels

A-kv-gossip
C-investigation: Further steps needed to qualify. C-label will change.
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
O-community: Originated from the community
T-kv: KV Team
