Commit 2521922

Merge pull request #1864 from discostur/improve-prometheus-rules
Refactor clickhouseKeeper prometheus rules
2 parents 3e96461 + 669e5b8 commit 2521922

1 file changed, +102 -19 lines changed

deploy/prometheus/prometheus-alert-rules-chkeeper.yaml

Lines changed: 102 additions & 19 deletions
@@ -11,27 +11,27 @@ spec:
     - name: ClickHouseKeeperRules
       rules:
         - alert: ClickHouseKeeperDown
-          expr: up{app=~'clickhouse-keeper.*'} == 0 or zk_ruok{app=~'clickhouse-keeper.*'} == 0
+          expr: up{app=~'clickhouse-keeper.*'} == 0
           labels:
             severity: critical
           annotations:
             identifier: "{{ $labels.pod_name }}"
-            summary: "zookeeper possible down"
+            summary: "ClickHouse Keeper possible down"
             description: |-
-              `zookeeper` can't be scraped via prometheus.
+              `ClickHouse Keeper` can't be scraped via prometheus.
               Please check instance status
               ```kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} -f```
 
         - alert: ClickHouseKeeperHighLatency
-          expr: zk_max_latency{app=~'clickhouse-keeper.*'} > 500
+          expr: ClickHouseAsyncMetrics_KeeperMaxLatency{app=~'clickhouse-keeper.*'} > 500
           for: 15m
           labels:
             severity: warning
           annotations:
             identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
-            summary: "Average amount of time it takes for the server to respond to each client request (since the server was started)."
+            summary: "Maximum latency for ClickHouse Keeper requests is high."
             description: |-
-              `avg_latency{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "avg_latency{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} ticks{{ end }}
+              `ClickHouseAsyncMetrics_KeeperMaxLatency{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ClickHouseAsyncMetrics_KeeperMaxLatency{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} ms{{ end }}
 
               reset server statistics
               ```
@@ -61,15 +61,15 @@ spec:
               ```
 
         - alert: ClickHouseKeeperOutstandingRequests
-          expr: zk_outstanding_requests{app=~'clickhouse-keeper.*'} > 10
+          expr: ClickHouseMetrics_KeeperOutstandingRequests{app=~'clickhouse-keeper.*'} > 10
           for: 10m
           labels:
             severity: high
           annotations:
             identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
-            summary: "ClickHouseKeeper receives more requests than it can process."
+            summary: "ClickHouse Keeper receives more requests than it can process."
             description: |-
-              `outstanding_requests{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "outstanding_requests{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
+              `ClickHouseMetrics_KeeperOutstandingRequests{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ClickHouseMetrics_KeeperOutstandingRequests{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }}{{ end }}
 
               Look to CPU/Memory node/pod utilization
               ```
@@ -93,27 +93,110 @@ spec:
               echo "ClickHouseKeeper Write $((($writeEnd - $writeBegin) / 5)) b/s"
               ```
 
-        - alert: ClickHouseKeeperHighFileDescriptors
-          expr: zk_open_file_descriptor_count{app=~'clickhouse-keeper.*'} > 4096
+
+        - alert: ClickHouseKeeperHighEphemeralNodes
+          expr: ClickHouseAsyncMetrics_KeeperEphemeralsCount{app=~'clickhouse-keeper.*'} > 100
           for: 10m
           labels:
             severity: warning
           annotations:
             identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
-            summary: "Number of file descriptors used over the limit."
+            summary: "ClickHouse Keeper has too high ephemeral znodes count."
             description: |-
-              `zk_open_file_descriptor_count{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "zk_open_file_descriptor_count{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} descriptors{{ end }}
+              `ClickHouseAsyncMetrics_KeeperEphemeralsCount{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ClickHouseAsyncMetrics_KeeperEphemeralsCount{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} nodes{{ end }}
+              Look to documentation:
+              https://clickhouse.com/docs/en/operations/clickhouse-keeper
 
+        - alert: ClickHouseKeeperCommitsFailed
+          expr: increase(ClickHouseProfileEvents_KeeperCommitsFailed{app=~'clickhouse-keeper.*'}[5m]) > 0
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
+            summary: "ClickHouse Keeper has failed commits."
+            description: |-
+              ClickHouse Keeper is experiencing failed commits which indicates serious issues with the Raft consensus.
+              `ClickHouseProfileEvents_KeeperCommitsFailed{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` increased in the last 5 minutes.
+
+              Check logs for errors:
+              ```
+              kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} --tail=100
+              ```
 
-        - alert: ClickHouseKeeperHighEphemeralNodes
-          expr: zk_ephemerals_count{app=~'clickhouse-keeper.*'} > 100
+        - alert: ClickHouseKeeperSnapshotCreationsFailed
+          expr: increase(ClickHouseProfileEvents_KeeperSnapshotCreationsFailed{app=~'clickhouse-keeper.*'}[10m]) > 0
+          for: 5m
+          labels:
+            severity: high
+          annotations:
+            identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
+            summary: "ClickHouse Keeper snapshot creation failed."
+            description: |-
+              ClickHouse Keeper failed to create snapshots which may lead to log accumulation and disk space issues.
+
+              Check disk space:
+              ```
+              kubectl exec -n {{ $labels.namespace }} {{ $labels.pod_name }} -- df -h
+              ```
+
+              Check logs:
+              ```
+              kubectl logs -n {{ $labels.namespace }} {{ $labels.pod_name }} --tail=100 | grep -i snapshot
+              ```
+
+        - alert: ClickHouseKeeperLostQuorum
+          expr: ClickHouseAsyncMetrics_KeeperSyncedFollowers{app=~'clickhouse-keeper.*'} < 1 and ClickHouseAsyncMetrics_KeeperIsLeader{app=~'clickhouse-keeper.*'} == 1
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
+            summary: "ClickHouse Keeper leader has lost quorum."
+            description: |-
+              ClickHouse Keeper leader has less than the required number of synced followers.
+              Current synced followers: {{ with printf "ClickHouseAsyncMetrics_KeeperSyncedFollowers{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
+
+              This means the cluster cannot commit new operations and is in a degraded state.
+
+              Check all keeper pods:
+              ```
+              kubectl get pods -n {{ $labels.namespace }} -l app=clickhouse-keeper
+              kubectl logs -n {{ $labels.namespace }} -l app=clickhouse-keeper --tail=50
+              ```
+
+        - alert: ClickHouseKeeperMemorySoftLimitExceeded
+          expr: ClickHouseAsyncMetrics_KeeperIsExceedingMemorySoftLimitHit{app=~'clickhouse-keeper.*'} == 1
           for: 10m
           labels:
             severity: warning
           annotations:
             identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
-            summary: "ClickHouseKeeper have too high ephemeral znodes count."
+            summary: "ClickHouse Keeper is exceeding memory soft limit."
             description: |-
-              `zk_ephemerals_count{pod_name="{{ $labels.pod_name }}",namespace="{{ $labels.namespace }}"}` = {{ with printf "ephemerals_count{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.2f" }} nodes{{ end }}
-              Look to documentation:
-              https://zookeeper.apache.org/doc/current/zookeeperOver.html#Nodes+and+ephemeral+nodes
+              ClickHouse Keeper is using more memory than the configured soft limit.
+              This may lead to performance degradation or OOM issues.
+
+              Check memory usage:
+              ```
+              kubectl top pod -n {{ $labels.namespace }} {{ $labels.pod_name }}
+              kubectl describe pod -n {{ $labels.namespace }} {{ $labels.pod_name }}
+              ```
+
+              Consider increasing memory limits or investigating memory leaks.
+
+        - alert: ClickHouseKeeperHighFileDescriptorUsage
+          expr: (ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{app=~'clickhouse-keeper.*'} / ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{app=~'clickhouse-keeper.*'}) > 0.8
+          for: 10m
+          labels:
+            severity: warning
+          annotations:
+            identifier: "{{ $labels.pod_name }}.{{ $labels.namespace }}"
+            summary: "ClickHouse Keeper is using a high percentage of available file descriptors."
+            description: |-
+              ClickHouse Keeper is using {{ with printf "(ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{pod_name='%s',namespace='%s'} / ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{pod_name='%s',namespace='%s'}) * 100" .Labels.pod_name .Labels.namespace .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.1f" }}{{ end }}% of available file descriptors.
+
+              Current open FDs: {{ with printf "ClickHouseAsyncMetrics_KeeperOpenFileDescriptorCount{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
+              Max FDs: {{ with printf "ClickHouseAsyncMetrics_KeeperMaxFileDescriptorCount{pod_name='%s',namespace='%s'}" .Labels.pod_name .Labels.namespace | query }}{{ . | first | value | printf "%.0f" }}{{ end }}
+
+              If this continues to increase, the keeper may run out of file descriptors and become unresponsive.
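
Note on the ClickHouseKeeperHighFileDescriptorUsage rule: it replaces the old fixed threshold (more than 4096 open descriptors) with a ratio of open to maximum descriptors. A minimal Python sketch of that arithmetic follows. The function names and sample numbers are hypothetical and not part of the commit; only the 0.8 threshold and the `* 100` percentage come from the rule and its description template.

```python
def fd_usage_percent(open_fds: float, max_fds: float) -> float:
    """Percentage shown in the alert description: (open / max) * 100."""
    return open_fds / max_fds * 100


def fd_usage_exceeds(open_fds: float, max_fds: float, threshold: float = 0.8) -> bool:
    """Illustrative mirror of the rule condition `open / max > 0.8`."""
    if max_fds <= 0:
        # Guard for this sketch only (an assumption); it is not a claim
        # about how Prometheus evaluates a zero denominator.
        return False
    return open_fds / max_fds > threshold


if __name__ == "__main__":
    # Hypothetical sample values for a single keeper pod.
    for open_fds, max_fds in [(700, 1000), (800, 1000), (900, 1000)]:
        print(open_fds, max_fds, fd_usage_exceeds(open_fds, max_fds))
```

Because the comparison is strict, a pod sitting at exactly 80% would not satisfy the condition in this sketch, and the rule additionally requires the condition to hold for 10 minutes (`for: 10m`), which the sketch does not model.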
