
Commit 2a19d55

[docs] Clean up internal observability docs (#10454)
#### Description

Now that [4246](open-telemetry/opentelemetry.io#4246), [4322](open-telemetry/opentelemetry.io#4322), and [4529](open-telemetry/opentelemetry.io#4529) have been merged, and the new [Internal telemetry](https://opentelemetry.io/docs/collector/internal-telemetry/) and [Troubleshooting](https://opentelemetry.io/docs/collector/troubleshooting/) pages are live, it's time to clean up the underlying Collector repo docs so that the website is the single source of truth.

I've deleted any content that was moved to the website, and linked to the relevant sections where possible. I've consolidated what content remains in the observability.md file and left troubleshooting.md and monitoring.md as stubs that point to the website. I also searched the Collector repo for cross-references to these files and adjusted links where appropriate.

~~Note that this PR is blocked by [4731](open-telemetry/opentelemetry.io#4731).~~ EDIT: #4731 is merged and no longer a blocker.

#### Link to tracking issue

Fixes #8886
1 parent fead8fc commit 2a19d55


6 files changed: +115, -507 lines


README.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@
 &nbsp;&nbsp;&bull;&nbsp;&nbsp;
 <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration</a>
 &nbsp;&nbsp;&bull;&nbsp;&nbsp;
-<a href="docs/monitoring.md">Monitoring</a>
+<a href="https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector">Monitoring</a>
 &nbsp;&nbsp;&bull;&nbsp;&nbsp;
 <a href="docs/security-best-practices.md">Security</a>
 &nbsp;&nbsp;&bull;&nbsp;&nbsp;

docs/monitoring.md

Lines changed: 4 additions & 67 deletions
@@ -1,70 +1,7 @@
 # Monitoring

-Many metrics are provided by the Collector for its monitoring. Below some
-key recommendations for alerting and monitoring are listed.
+To learn how to monitor the Collector using its own telemetry, see the [Internal
+telemetry] page.

-## Critical Monitoring
-
-### Data Loss
-
-Use rate of `otelcol_processor_dropped_spans > 0` and
-`otelcol_processor_dropped_metric_points > 0` to detect data loss, depending on
-the requirements set up a minimal time window before alerting, avoiding
-notifications for small losses that are not considered outages or within the
-desired reliability level.
-
-### Low on CPU Resources
-
-This depends on the CPU metrics available on the deployment, eg.:
-`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for Kubernetes. Let's call it
-`available_cores` below. The idea here is to have an upper bound of the number
-of available cores, and the maximum expected ingestion rate considered safe,
-let's call it `safe_rate`, per core. This should trigger increase of resources/
-instances (or raise an alert as appropriate) whenever
-`(actual_rate/available_cores) < safe_rate`.
-
-The `safe_rate` depends on the specific configuration being used.
-// TODO: Provide reference `safe_rate` for a few selected configurations.
-
-## Secondary Monitoring
-
-### Queue Length
-
-Most exporters offer a [queue/retry mechanism](../exporter/exporterhelper/README.md)
-that is recommended as the retry mechanism for the Collector and as such should
-be used in any production deployment.
-
-The `otelcol_exporter_queue_capacity` indicates the capacity of the retry queue (in batches). The `otelcol_exporter_queue_size` indicates the current size of retry queue. So you can use these two metrics to check if the queue capacity is enough for your workload.
-
-The `otelcol_exporter_enqueue_failed_spans`, `otelcol_exporter_enqueue_failed_metric_points` and `otelcol_exporter_enqueue_failed_log_records` indicate the number of span/metric points/log records failed to be added to the sending queue. This may be cause by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.
-
-The queue/retry mechanism also supports logging for monitoring. Check
-the logs for messages like `"Dropping data because sending_queue is full"`.
-
-### Receive Failures
-
-Sustained rates of `otelcol_receiver_refused_spans` and
-`otelcol_receiver_refused_metric_points` indicate too many errors returned to
-clients. Depending on the deployment and the client’s resilience this may
-indicate data loss at the clients.
-
-Sustained rates of `otelcol_exporter_send_failed_spans` and
-`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not
-able to export data as expected.
-It doesn't imply data loss per se since there could be retries but a high rate
-of failures could indicate issues with the network or backend receiving the
-data.
-
-## Data Flow
-
-### Data Ingress
-
-The `otelcol_receiver_accepted_spans` and
-`otelcol_receiver_accepted_metric_points` metrics provide information about
-the data ingested by the Collector.
-
-### Data Egress
-
-The `otecol_exporter_sent_spans` and
-`otelcol_exporter_sent_metric_points`metrics provide information about
-the data exported by the Collector.
+[Internal telemetry]:
+  https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector
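The removed guidance above boils down to alerting on a handful of the Collector's own metrics: dropped spans and metric points for data loss, queue size versus capacity, and refused/failed counts. As a minimal sketch of how that advice translates into practice, the Prometheus alerting rules below use the metric names cited in the deleted text; the rule-group name, evaluation windows, and the 0.8 queue threshold are illustrative assumptions, not values from the docs.

```yaml
groups:
  - name: otel-collector-internal # hypothetical rule-group name
    rules:
      # Data loss: any sustained rate of dropped spans or metric points.
      - alert: OtelCollectorDroppedSpans
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 10m # tune the window to your reliability target
        labels:
          severity: critical
      - alert: OtelCollectorDroppedMetricPoints
        expr: rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: critical
      # Secondary: exporter retry queue approaching capacity.
      - alert: OtelCollectorQueueNearlyFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
```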

docs/observability.md

Lines changed: 106 additions & 112 deletions
@@ -1,140 +1,134 @@
-# OpenTelemetry Collector Observability
+# OpenTelemetry Collector internal observability

-## Goal
+The [Internal telemetry] page on OpenTelemetry's website contains the
+documentation for the Collector's internal observability, including:

-The goal of this document is to have a comprehensive description of observability of the Collector and changes needed to achieve observability part of our [vision](vision.md).
+- Which types of observability are emitted by the Collector.
+- How to enable and configure these signals.
+- How to use this telemetry to monitor your Collector instance.

-## What Needs Observation
+If you need to troubleshoot the Collector, see [Troubleshooting].

-The following elements of the Collector need to be observable.
+Read on to learn about experimental features and the project's overall vision
+for internal telemetry.

-### Current Values
+## Experimental trace telemetry

-- Resource consumption: CPU, RAM (in the future also IO - if we implement persistent queues) and any other metrics that may be available to Go apps (e.g. garbage size, etc).
+The Collector does not expose traces by default, but an effort is underway to
+[change this][issue7532]. The work includes supporting configuration of the
+OpenTelemetry SDK used to produce the Collector's internal telemetry. This
+feature is behind two feature gates:

-- Receiving data rate, broken down by receivers and by data type (traces/metrics).
-
-- Exporting data rate, broken down by exporters and by data type (traces/metrics).
-
-- Data drop rate due to throttling, broken down by data type.
-
-- Data drop rate due to invalid data received, broken down by data type.
-
-- Current throttling state: Not Throttled/Throttled by Downstream/Internally Saturated.
-
-- Incoming connection count, broken down by receiver.
-
-- Incoming connection rate (new connections per second), broken down by receiver.
-
-- In-memory queue size (in bytes and in units). Note: measurements in bytes may be difficult / expensive to obtain and should be used cautiously.
-
-- Persistent queue size (when supported).
-
-- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).
-
-- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols).
-
-“Rate” values must reflect the average rate of the last 10 seconds. Rates must exposed in bytes/sec and units/sec (e.g. spans/sec).
-
-Note: some of the current values and rates may be calculated as derivatives of cumulative values in the backend, so it is an open question if we want to expose them separately or no.
-
-### Cumulative Values
-
-- Total received data, broken down by receivers and by data type (traces/metrics).
-
-- Total exported data, broken down by exporters and by data type (traces/metrics).
-
-- Total dropped data due to throttling, broken down by data type.
-
-- Total dropped data due to invalid data received, broken down by data type.
-
-- Total incoming connection count, broken down by receiver.
-
-- Uptime since start.
-
-### Trace or Log on Events
-
-We want to generate the following events (log and/or send as a trace with additional data):
-
-- Collector started/stopped.
-
-- Collector reconfigured (if we support on-the-fly reconfiguration).
-
-- Begin dropping due to throttling (include throttling reason, e.g. local saturation, downstream saturation, downstream unavailable, etc).
-
-- Stop dropping due to throttling.
-
-- Begin dropping due to invalid data (include sample/first invalid data).
-
-- Stop dropping due to invalid data.
-
-- Crash detected (differentiate clean stopping and crash, possibly include crash data if available).
-
-For begin/stop events we need to define an appropriate hysteresis to avoid generating too many events. Note that begin/stop events cannot be detected in the backend simply as derivatives of current rates, the events include additional data that is not present in the current value.
+```bash
+--feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry
+```

-### Host Metrics
+The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector
+to parse any configuration that aligns with the [OpenTelemetry Configuration]
+schema. Support for this schema is experimental, but it does allow telemetry to
+be exported using OTLP.

-The service should collect host resource metrics in addition to service's own process metrics. This may help to understand that the problem that we observe in the service is induced by a different process on the same host.
+The following configuration can be used in combination with the aforementioned
+feature gates to emit internal metrics and traces from the Collector to an OTLP
+backend:

-## How We Expose Telemetry
+```yaml
+service:
+  telemetry:
+    metrics:
+      readers:
+        - periodic:
+            interval: 5000
+            exporter:
+              otlp:
+                protocol: grpc/protobuf
+                endpoint: https://backend:4317
+    traces:
+      processors:
+        - batch:
+            exporter:
+              otlp:
+                protocol: grpc/protobuf
+                endpoint: https://backend2:4317
+```

-By default, the Collector exposes service telemetry in two ways currently:
+See the [example configuration][kitchen-sink] for additional options.

-- internal metrics are exposed via a Prometheus interface which defaults to port `8888`
-- logs are emitted to stdout
+> This configuration does not support emitting logs as there is no support for
+> [logs] in the OpenTelemetry Go SDK at this time.

-Traces are not exposed by default. There is an effort underway to [change this][issue7532]. The work includes supporting
-configuration of the OpenTelemetry SDK used to produce the Collector's internal telemetry. This feature is
-currently behind two feature gates:
+You can also configure the Collector to send its own traces using the OTLP
+exporter. Send the traces to an OTLP server running on the same Collector, so it
+goes through configured pipelines. For example:

-```bash
---feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry
+```yaml
+service:
+  telemetry:
+    traces:
+      processors:
+        batch:
+          exporter:
+            otlp:
+              protocol: grpc/protobuf
+              endpoint: ${MY_POD_IP}:4317
 ```

-The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector to parse configuration
-that aligns with the [OpenTelemetry Configuration] schema. The support for this schema is still
-experimental, but it does allow telemetry to be exported via OTLP.
+## Goals of internal telemetry

-The following configuration can be used in combination with the feature gates aforementioned
-to emit internal metrics and traces from the Collector to an OTLP backend:
+The Collector's internal telemetry is an important part of fulfilling
+OpenTelemetry's [project vision](vision.md). The following section explains the
+priorities for making the Collector an observable service.

-```yaml
-service:
-  telemetry:
-    metrics:
-      readers:
-        - periodic:
-            interval: 5000
-            exporter:
-              otlp:
-                protocol: grpc/protobuf
-                endpoint: https://backend:4317
-    traces:
-      processors:
-        - batch:
-            exporter:
-              otlp:
-                protocol: grpc/protobuf
-                endpoint: https://backend2:4317
-```
+### Observable elements

-See the configuration's [example][kitchen-sink] for additional configuration options.
+The following aspects of the Collector need to be observable.

-Note that this configuration does not support emitting logs as there is no support for [logs] in
-OpenTelemetry Go SDK at this time.
+- [Current values]
+  - Some of the current values and rates might be calculated as derivatives of
+    cumulative values in the backend, so it's an open question whether to expose
+    them separately or not.
+- [Cumulative values]
+- [Trace or log events]
+  - For start or stop events, an appropriate hysteresis must be defined to avoid
+    generating too many events. Note that start and stop events can't be
+    detected in the backend simply as derivatives of current rates. The events
+    include additional data that is not present in the current value.
+- [Host metrics]
+  - Host metrics can help users determine if the observed problem in a service
+    is caused by a different process on the same host.

 ### Impact

-We need to be able to assess the impact of these observability improvements on the core performance of the Collector.
+The impact of these observability improvements on the core performance of the
+Collector must be assessed.

-### Configurable Level of Observability
+### Configurable level of observability

-Some of the metrics/traces can be high volume and may not be desirable to always observe. We should consider adding an observability verboseness “level” that allows configuring the Collector to send more or less observability data (or even finer granularity to allow turning on/off specific metrics).
+Some metrics and traces can be high volume and users might not always want to
+observe them. An observability verboseness “level” allows configuration of the
+Collector to send more or less observability data or with even finer
+granularity, to allow turning on or off specific metrics.

-The default level of observability must be defined in a way that has insignificant performance impact on the service.
+The default level of observability must be defined in a way that has
+insignificant performance impact on the service.

-[issue7532]: https://github.com/open-telemetry/opentelemetry-collector/issues/7532
-[issue7454]: https://github.com/open-telemetry/opentelemetry-collector/issues/7454
+[Internal telemetry]:
+  https://opentelemetry.io/docs/collector/internal-telemetry/
+[Troubleshooting]: https://opentelemetry.io/docs/collector/troubleshooting/
+[issue7532]:
+  https://github.com/open-telemetry/opentelemetry-collector/issues/7532
+[issue7454]:
+  https://github.com/open-telemetry/opentelemetry-collector/issues/7454
 [logs]: https://github.com/open-telemetry/opentelemetry-go/issues/3827
-[OpenTelemetry Configuration]: https://github.com/open-telemetry/opentelemetry-configuration
-[kitchen-sink]: https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml
+[OpenTelemetry Configuration]:
+  https://github.com/open-telemetry/opentelemetry-configuration
+[kitchen-sink]:
+  https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml
+[Current values]:
+  https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics
+[Cumulative values]:
+  https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics
+[Trace or log events]:
+  https://opentelemetry.io/docs/collector/internal-telemetry/#events-observable-with-internal-logs
+[Host metrics]:
+  https://opentelemetry.io/docs/collector/internal-telemetry/#lists-of-internal-metrics
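One detail from the deleted observability.md text worth keeping in mind: by default the Collector still exposes its internal metrics on a Prometheus endpoint at port `8888`, while the OTLP path shown in the new content is opt-in behind the feature gate. As a minimal sketch, a Prometheus scrape job for that default endpoint could look like the following; the job name, interval, and target host are placeholder assumptions.

```yaml
scrape_configs:
  - job_name: otel-collector-internal # hypothetical job name
    scrape_interval: 30s # illustrative interval
    static_configs:
      # Replace with the host(s) where the Collector runs; 8888 is the
      # default port for the Collector's internal Prometheus metrics.
      - targets: ["collector-host:8888"]
```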
