Revisit how collector internal metrics are distributed across telemetry levels #7890

Closed
dmitryax opened this issue Jun 14, 2023 · 10 comments · Fixed by #12525

@dmitryax
Member

dmitryax commented Jun 14, 2023

There are several verbosity levels that can be used to configure how many metrics the collector exposes:

	// Level is the level of telemetry metrics, the possible values are:
	//  - "none" indicates that no telemetry data should be collected;
	//  - "basic" is the recommended and covers the basics of the service telemetry.
	//  - "normal" adds some other indicators on top of basic.
	//  - "detailed" adds dimensions and views to the previous levels.
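
For reference, the level is selected via the collector configuration under service::telemetry::metrics (a minimal sketch; the available values and the default depend on the collector version):

    service:
      telemetry:
        metrics:
          level: detailed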

The problem is that these levels are barely used. Most of the metrics are exposed at the basic level, only one metric in the batch processor is exposed at the detailed level, and the normal level is not used at all. The suggestion is to revisit all the metrics and distribute them across the levels more deliberately.

The default level is basic (the lowest level that emits any metrics), which currently provides most of the metrics. We can move a significant portion of them to the normal level and make normal the new default, so the default behavior doesn't change for the end user, while the basic level is kept to a bare minimum reserved for collector core:

  • process metrics
  • accepted and refused data by receivers
  • sent and failed to send data by exporters
  • sending queue size and enqueue failures
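
For illustration, this bare-minimum set roughly corresponds to metrics such as the following (names as exposed on the collector's self-telemetry Prometheus endpoint; the exact list may vary between versions):

otelcol_process_uptime, otelcol_process_memory_rss
otelcol_receiver_accepted_spans, otelcol_receiver_refused_spans (plus the metric_points/log_records variants)
otelcol_exporter_sent_spans, otelcol_exporter_send_failed_spans (plus variants)
otelcol_exporter_queue_size, otelcol_exporter_enqueue_failed_spans (plus variants)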

OTel component metrics will use only the normal or detailed levels.

Long-term, we can consider giving users an option to override this level per component.
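
Such an override does not exist today; purely as a hypothetical illustration, it could look something like this (option name invented):

    service:
      telemetry:
        metrics:
          level: normal
          per_component_level:   # hypothetical option, not implemented
            batch: detailed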

@hughesjj

From collector wg/sig:

  • basic should only be for internal collector telemetry
  • normal should be the default for component authors
  • detailed should be ... well, the detailed view

I'm strongly in favor of better utilizing verbosity levels

@dmitryax dmitryax self-assigned this Jun 14, 2023
@victoramsantos

Agree. I was testing the difference between these levels and found the same thing: the detailed level exposes only one more metric than the others (namely otelcol_processor_batch_batch_send_size_bytes_bucket).

@dmitryax
Member Author

dmitryax commented Mar 14, 2024

According to the proposed guidelines, I think we can change the default level from Basic to Normal and move the following metric sets from Basic to Normal:

  1. GRPC/HTTP server/client metrics.
otelcol_http_client_duration histogram
otelcol_http_client_request_size counter
otelcol_http_client_response_size counter
otelcol_http_server_duration histogram
otelcol_http_server_request_size counter
otelcol_http_server_response_size counter
otelcol_rpc_server_duration histogram
otelcol_rpc_server_request_size histogram
otelcol_rpc_server_requests_per_rpc histogram
otelcol_rpc_server_response_size histogram
otelcol_rpc_server_responses_per_rpc histogram

These metrics were not emitted before the transition to OTel instrumentation and can be pretty noisy even with telemetry.disableHighCardinalityMetrics enabled. It may even be worth considering moving them (or a portion of them) to the Detailed level.

  2. Batch processor metrics:
otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_metadata_cardinality gauge
otelcol_processor_batch_timeout_trigger_send counter
otelcol_processor_batch_size_trigger_send counter

Enabling them at the Basic level isn't aligned with the suggested guidelines: "custom" component metrics (those not generated by the helpers) should be emitted starting with the Normal level.
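
As a rough illustration of what "emitted starting with the Normal level" can mean in code (a minimal sketch, not the batch processor's actual implementation; how a component obtains the configured level differs between collector versions):

    package example

    import (
        "go.opentelemetry.io/collector/config/configtelemetry"
        "go.opentelemetry.io/otel/metric"
    )

    // registerCustomMetrics creates a component-specific counter only when the
    // configured telemetry level is Normal or higher.
    func registerCustomMetrics(level configtelemetry.Level, meter metric.Meter) (metric.Int64Counter, error) {
        if level < configtelemetry.LevelNormal {
            // At None or Basic, skip "custom" component metrics entirely.
            return nil, nil
        }
        return meter.Int64Counter(
            "processor_example_custom_metric", // illustrative name; real components use their own instrument names
            metric.WithDescription("Example component-specific counter gated on the Normal level"),
        )
    }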

@TylerHelmuth
Member

@dmitryax I agree that component metrics should be emitted with Normal and that Normal should be the default level. I agree with Basic being only those generic receiver/processor/exporter metrics.

For the GRPC/HTTP server/client metrics, are they broken down per component? If so, they feel like Detailed metrics to me.

Side note about Detailed metrics. I foresee users wanting to be selective about which detailed metrics are emitted. Should we provide a way to specify which specific metrics, of any level, are enabled/disabled like we do for scrapers with mdatagen?
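
Nothing like this exists under service::telemetry::metrics today; purely as a hypothetical sketch of what per-metric toggles could look like (option name invented for illustration):

    service:
      telemetry:
        metrics:
          level: detailed
          per_metric:               # hypothetical option, not implemented
            otelcol_http_server_duration:
              enabled: false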

@dmitryax
Member Author

For the GRPC/HTTP server/client metrics, are they broken down per component? If so, they feel like Detailed metrics to me.

Not every component exposes them, only receivers and exporters with HTTP/GRPC clients/servers, but they can be more granular than per component. Client metrics are keyed by net.peer.name, so there is one series per component -> external endpoint pair. GRPC server metrics carry rpc.method and rpc.service, so at least one series per data type. If the telemetry.disableHighCardinalityMetrics feature gate isn't enabled, they also get port and host attributes, which can cause high-cardinality problems; see #7517.
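
(For reference, that feature gate is toggled on the command line; a sketch assuming the stock otelcol binary:)

    otelcol --config=config.yaml --feature-gates=telemetry.disableHighCardinalityMetrics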

I would agree to move HTTP/GRPC client/server metrics to Detailed level. @open-telemetry/collector-approvers WDYT?

Should we provide a way to specify which specific metrics, of any level, are enabled/disabled like we do for scrapers with mdatagen?

There is nothing like that available now. Users can further reduce the set with the filter processor or with metric_relabel_configs on the prometheus receiver, but I agree it would be nice to have this capability right in service::telemetry::metrics.
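
For example, a self-scrape with metric_relabel_configs can already drop the HTTP/GRPC instrumentation metrics (a sketch, assuming the default internal metrics port 8888):

    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: otelcol-internal
              scrape_interval: 10s
              static_configs:
                - targets: ["0.0.0.0:8888"]
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: "otelcol_(http|rpc)_.*"
                  action: drop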

@TylerHelmuth
Member

But I agree it would be nice to have this capability right in service::telemetry::metrics.

Out of scope for this issue for sure, but I agree.

dmitryax added a commit that referenced this issue Apr 16, 2024
**Description:**

This change distributes the reported internal metrics across available
levels and updates the level set by default:

1. The default level is changed from `basic` to `normal`, which can be
overridden with the `service::telemetry::metrics::level` configuration.

2. The following batch processor metrics are updated to be reported
starting from `normal` level instead of `basic` level:
  - `processor_batch_batch_send_size`
  - `processor_batch_metadata_cardinality` 
  - `processor_batch_timeout_trigger_send` 
  - `processor_batch_size_trigger_send` 

3. The following GRPC/HTTP server and client metrics are updated to be
reported starting from `detailed` level:
  - `http.client.*` metrics 
  - `http.server.*` metrics 
  - `rpc.server.*` metrics 
  - `rpc.client.*` metrics

**Link to tracking Issue:**
#7890
@codeboten codeboten added this to the Self observability milestone Aug 7, 2024
@mx-psi
Member

mx-psi commented Dec 11, 2024

What is the status of this issue? What is left to decide?

@mx-psi
Member

mx-psi commented Dec 16, 2024

Discussed in the 2024-12-16 meeting: Pablo will look into adding this to the component guidelines, and @dmitryax will create a follow-up issue.

@mx-psi mx-psi assigned jade-guiton-dd and unassigned dmitryax Feb 24, 2025
@jade-guiton-dd
Contributor

jade-guiton-dd commented Feb 26, 2025

Looking into the current state of things, it looks like there are still some pending issues. Here are the changes I suggest:

  1. The processor helper metrics for incoming/outgoing items, and the upcoming pipeline telemetry metrics have not been discussed in this issue. I believe that they should fall under basic telemetry like the exporter/receiver helper metrics, in which case no changes are needed.
  2. It seems the final decision regarding batch processor metrics was that they should be normal, which was implemented in Distribute internal metrics across different levels #9767. However, there have been multiple (accidental?) changes since then, and they are back to basic, except for processor_batch_batch_send_size_bytes which is detailed. I would like to switch them back to normal, and update the documentation on the OTel website accordingly.
    (Note that the hardcoded views in service.go are a workaround while waiting for [component] Add support for components specifying what instruments to enable depending on metric level #11754)
  3. I think processor_batch_batch_send_size_bytes was changed to detailed because computing the size in bytes requires additional work compared to counting the number of items. I believe this to be a reasonable use case for detailed, so I would like to keep that metric as detailed, and update our guidelines to mention this use case, instead of only cardinality-related issues.
  4. I would also like to move our guidelines for the use of telemetry levels to somewhere other than the Go package doc for configtelemetry, as I'm unsure component devs will go read that. "Component guidelines" were mentioned by Pablo above, although I'm not sure where that is?
  5. I would also like to update the rather vague documentation on the level enum values themselves, since I feel like those are more likely to be seen by component devs than whatever external document we put out.

@jade-guiton-dd jade-guiton-dd moved this from Todo to In Progress in Collector: v1 Feb 26, 2025
@mx-psi
Member

mx-psi commented Feb 26, 2025

"Component guidelines" were mentioned by Pablo above, although I'm not sure where that is?

I discussed this offline with Jade; I believe I meant the "component stability" document we have in the docs folder of this repository.

Regarding the list above, my only comment was that we should think about the migration, possibly with a feature gate for all telemetry-related changes.

@jade-guiton-dd jade-guiton-dd moved this from In Progress to Waiting for reviews in Collector: v1 Feb 28, 2025
github-merge-queue bot pushed a commit that referenced this issue Mar 4, 2025
…l guidelines (#12525)

#### Description

This PR:
- requires "level: normal" before outputting batch processor metrics (aside
from one specific metric, which was already restricted to "level: detailed")
- clarifies wording in the telemetry level guidelines and documentation,
and adds said guidelines to the requirements for stable components.

Some rationale for these changes can be found in the tracking issue and
[this
comment](#7890 (comment)).

#### Link to tracking issue
Resolves #7890

#### To be discussed

Should we add a feature gate for this, in case a user relies on "level:
basic" outputting batch processor metrics? This feels like a niche use
case, so considering the "alpha" stability level of these metrics, I
don't think it's really necessary.

Considering batch processor metrics had already been switched to
"normal" once (#9767), but were turned back to basic at some later point
(not sure when), we might also want to add tests to avoid further
regressions (especially as the handling of telemetry levels is bound to
change further with #11754).

---------

Co-authored-by: Dmitrii Anoshin <[email protected]>
@github-project-automation github-project-automation bot moved this from Waiting for reviews to Done in Collector: v1 Mar 4, 2025