@@ -11,27 +11,27 @@ spec:
1111 - name : ClickHouseKeeperRules
1212 rules :
1313 - alert : ClickHouseKeeperDown
14- expr : up{app=~'clickhouse-keeper.*'} == 0 or zk_ruok{app=~'clickhouse-keeper.*'} == 0
14+ expr : up{app=~'clickhouse-keeper.*'} == 0
1515 labels :
1616 severity : critical
1717 annotations :
1818 identifier : " {{ $labels.pod_name }}"
19- summary : " zookeeper possible down"
19+ summary : " ClickHouse Keeper possibly down"
2020 description : |-
21- `zookeeper ` can't be scraped via prometheus.
21+ `ClickHouse Keeper ` can't be scraped via prometheus.
2222 Please check instance status
2323 ```kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} -f```
2424
2525 - alert : ClickHouseKeeperHighLatency
26- expr : zk_max_latency {app=~'clickhouse-keeper.*'} > 500
26+ expr : ClickHouseAsyncMetrics_KeeperMaxLatency {app=~'clickhouse-keeper.*'} > 500
2727 for : 15m
2828 labels :
2929 severity : warning
3030 annotations :
3131 identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
32- summary : " Average amount of time it takes for the server to respond to each client request (since the server was started) ."
32+ summary : " Maximum latency for ClickHouse Keeper requests is high ."
3333 description : |-
34- `avg_latency {pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "avg_latency {pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} ticks {{ end }}
34+ `ClickHouseAsyncMetrics_KeeperMaxLatency {pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ClickHouseAsyncMetrics_KeeperMaxLatency {pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} ms {{ end }}
3535
3636 reset server statistics
3737 ```
@@ -61,15 +61,15 @@ spec:
6161 ```
6262
6363 - alert : ClickHouseKeeperOutstandingRequests
64- expr : zk_outstanding_requests {app=~'clickhouse-keeper.*'} > 10
64+ expr : ClickHouseMetrics_KeeperOutstandingRequests {app=~'clickhouse-keeper.*'} > 10
6565 for : 10m
6666 labels :
6767 severity : high
6868 annotations :
6969 identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
70- summary : " ClickHouseKeeper receives more requests than it can process."
70+ summary : " ClickHouse Keeper receives more requests than it can process."
7171 description : |-
72- `outstanding_requests {pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "outstanding_requests {pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
72+ `ClickHouseMetrics_KeeperOutstandingRequests {pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ClickHouseMetrics_KeeperOutstandingRequests {pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
7373
7474 Look to CPU/Memory node/pod utilization
7575 ```
@@ -93,27 +93,110 @@ spec:
9393 echo "ClickHouseKeeper Write $((($writeEnd - $writeBegin) / 5)) b/s"
9494 ```
9595
96- - alert : ClickHouseKeeperHighFileDescriptors
97- expr : zk_open_file_descriptor_count{app=~'clickhouse-keeper.*'} > 4096
96+
97+ - alert : ClickHouseKeeperHighEphemeralNodes
98+ expr : ClickHouseAsyncMetrics_KeeperEphemeralsCount{app=~'clickhouse-keeper.*'} > 100
9899 for : 10m
99100 labels :
100101 severity : warning
101102 annotations :
102103 identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
103- summary : " Number of file descriptors used over the limit ."
104+ summary : " ClickHouse Keeper has too many ephemeral znodes."
104105 description : |-
105- `zk_open_file_descriptor_count{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "zk_open_file_descriptor_count{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} descriptors{{ end }}
106+ `ClickHouseAsyncMetrics_KeeperEphemeralsCount{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ClickHouseAsyncMetrics_KeeperEphemeralsCount{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} nodes{{ end }}
107+ Look to documentation:
108+ https://clickhouse.com/docs/en/operations/clickhouse-keeper
106109
110+ - alert : ClickHouseKeeperCommitsFailed
111+ expr : increase(ClickHouseProfileEvents_KeeperCommitsFailed{app=~'clickhouse-keeper.*'}[5m]) > 0
112+ for : 5m
113+ labels :
114+ severity : critical
115+ annotations :
116+ identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
117+ summary : " ClickHouse Keeper has failed commits."
118+ description : |-
119+ ClickHouse Keeper is experiencing failed commits which indicates serious issues with the Raft consensus.
120+ `ClickHouseProfileEvents_KeeperCommitsFailed{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` increased in the last 5 minutes.
121+
122+ Check logs for errors:
123+ ```
124+ kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} --tail=100
125+ ```
107126
108- - alert : ClickHouseKeeperHighEphemeralNodes
109- expr : zk_ephemerals_count{app=~'clickhouse-keeper.*'} > 100
127+ - alert : ClickHouseKeeperSnapshotCreationsFailed
128+ expr : increase(ClickHouseProfileEvents_KeeperSnapshotCreationsFailed{app=~'clickhouse-keeper.*'}[10m]) > 0
129+ for : 5m
130+ labels :
131+ severity : high
132+ annotations :
133+ identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
134+ summary : " ClickHouse Keeper snapshot creation failed."
135+ description : |-
136+ ClickHouse Keeper failed to create snapshots which may lead to log accumulation and disk space issues.
137+
138+ Check disk space:
139+ ```
140+ kubectl exec -n {{ $labels.namespace }} {{ $labels.pod_name }} -- df -h
141+ ```
142+
143+ Check logs:
144+ ```
145+ kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} --tail=100 | grep -i snapshot
146+ ```
147+
148+ - alert : ClickHouseKeeperLostQuorum
149+ expr : ClickHouseAsyncMetrics_KeeperSyncedFollowers{app=~'clickhouse-keeper.*'} < 1 and ClickHouseAsyncMetrics_KeeperIsLeader{app=~'clickhouse-keeper.*'} == 1
150+ for : 5m
151+ labels :
152+ severity : critical
153+ annotations :
154+ identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
155+ summary : " ClickHouse Keeper leader has lost quorum."
156+ description : |-
157+ ClickHouse Keeper leader has fewer than the required number of synced followers.
158+ Current synced followers: {{ with printf "ClickHouseAsyncMetrics_KeeperSyncedFollowers{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
159+
160+ This means the cluster cannot commit new operations and is in a degraded state.
161+
162+ Check all keeper pods:
163+ ```
164+ kubectl get pods -n {{ $labels.namespace }} -l app=clickhouse-keeper
165+ kubectl logs -n {{ $labels.namespace }} -l app=clickhouse-keeper --tail=50
166+ ```
167+
168+ - alert : ClickHouseKeeperMemorySoftLimitExceeded
169+ expr : ClickHouseAsyncMetrics_KeeperIsExceedingMemorySoftLimitHit{app=~'clickhouse-keeper.*'} == 1
110170 for : 10m
111171 labels :
112172 severity : warning
113173 annotations :
114174 identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
115- summary : " ClickHouseKeeper have too high ephemeral znodes count ."
175+ summary : " ClickHouse Keeper is exceeding memory soft limit ."
116176 description : |-
117- `zk_ephemerals_count{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ephemerals_count{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} nodes{{ end }}
118- Look to documentation:
119- https://zookeeper.apache.org/doc/current/zookeeperOver.html#Nodes+and+ephemeral+nodes
177+ ClickHouse Keeper is using more memory than the configured soft limit.
178+ This may lead to performance degradation or OOM issues.
179+
180+ Check memory usage:
181+ ```
182+ kubectl top pod -n {{ $labels.namespace }} {{ $labels.pod_name }}
183+ kubectl describe pod -n {{ $labels.namespace }} {{ $labels.pod_name }}
184+ ```
185+
186+ Consider increasing memory limits or investigating memory leaks.
187+
188+ - alert : ClickHouseKeeperHighFileDescriptorUsage
189+ expr : (ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{app=~'clickhouse-keeper.*'} / ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{app=~'clickhouse-keeper.*'}) > 0.8
190+ for : 10m
191+ labels :
192+ severity : warning
193+ annotations :
194+ identifier : " {{ $labels.pod_name }}.{{ $labels.namespace }}"
195+ summary : " ClickHouse Keeper is using a high percentage of available file descriptors."
196+ description : |-
197+ ClickHouse Keeper is using {{ with printf "(ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{pod_name='%s',namespace='%s'} / ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{pod_name='%s',namespace='%s'}) * 100" .Labels.pod_name .Labels.namespace .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.1f" }}{{ end }}% of available file descriptors.
198+
199+ Current open FDs: {{ with printf "ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
200+ Max FDs: {{ with printf "ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
201+
202+ If this continues to increase, the keeper may run out of file descriptors and become unresponsive.
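
The ratio expression behind `ClickHouseKeeperHighFileDescriptorUsage` can be checked offline with a `promtool` rule unit test. The following is a minimal sketch, not part of this change: the file name, the pod names `keeper-0` and `keeper-1`, the `test` namespace, and the sample values are all hypothetical. It uses only `promql_expr_test`, so it exercises the expression itself rather than the alert's `for` duration or annotations.

```yaml
# keeper_fd_usage_test.yaml (hypothetical file name)
# Run with: promtool test rules keeper_fd_usage_test.yaml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # keeper-0 (hypothetical pod): 900 of 1000 descriptors in use, i.e. 90%
      - series: 'ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{app="clickhouse-keeper", pod_name="keeper-0", namespace="test"}'
        values: '900+0x20'
      - series: 'ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{app="clickhouse-keeper", pod_name="keeper-0", namespace="test"}'
        values: '1000+0x20'
      # keeper-1 (hypothetical pod): 500 of 1000 descriptors in use, i.e. 50%
      - series: 'ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{app="clickhouse-keeper", pod_name="keeper-1", namespace="test"}'
        values: '500+0x20'
      - series: 'ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{app="clickhouse-keeper", pod_name="keeper-1", namespace="test"}'
        values: '1000+0x20'

    promql_expr_test:
      # Same expression as the alert rule; only the pod above 80% should match.
      - expr: (ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{app=~'clickhouse-keeper.*'} / ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{app=~'clickhouse-keeper.*'}) > 0.8
        eval_time: 10m
        exp_samples:
          # Division drops the metric name, so only the labels remain.
          - labels: '{app="clickhouse-keeper", pod_name="keeper-0", namespace="test"}'
            value: 0.9
```

To test the full alert, including the `for : 10m` window and the rendered annotations, the `groups` list would first have to be extracted from the `PrometheusRule` manifest into a plain rules file and referenced through `rule_files`, because `promtool` does not read the Kubernetes custom resource directly.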