Revisit how collector internal metrics are distributed across telemetry levels #7890

Closed
dmitryax opened this issue Jun 14, 2023 · 10 comments · Fixed by #12525

@dmitryax
Member

dmitryax commented Jun 14, 2023

There are several verbosity levels that can be used to configure how many metrics the collector exposes:

	// Level is the level of telemetry metrics, the possible values are:
	//  - "none" indicates that no telemetry data should be collected;
	//  - "basic" is the recommended and covers the basics of the service telemetry.
	//  - "normal" adds some other indicators on top of basic.
	//  - "detailed" adds dimensions and views to the previous levels.
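
For reference, the level is selected via the collector configuration under service::telemetry::metrics (a minimal sketch; the available values and the default depend on the collector version):

    service:
      telemetry:
        metrics:
          level: detailed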

The problem is that these levels are barely used. Most of the metrics are exposed at the basic level, only one metric in the batch processor is exposed at the detailed level, and the normal level is not used at all. The suggestion is to revisit all the metrics and distribute them across the levels more deliberately.

The default level is basic (the lowest level that emits any metrics), which currently provides most of the metrics. We can move a significant portion of them to the normal level and make normal the new default, so the default behavior doesn't change for the end user, while the basic level is kept to a bare minimum reserved for collector core:

  • process metrics
  • accepted and refused data by receivers
  • sent and failed to send data by exporters
  • sending queue size and enqueue failures
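
For illustration, this bare-minimum set roughly corresponds to metrics such as the following (names as exposed on the collector's self-telemetry Prometheus endpoint; the exact list may vary between versions):

otelcol_process_uptime, otelcol_process_memory_rss
otelcol_receiver_accepted_spans, otelcol_receiver_refused_spans (plus the metric_points/log_records variants)
otelcol_exporter_sent_spans, otelcol_exporter_send_failed_spans (plus variants)
otelcol_exporter_queue_size, otelcol_exporter_enqueue_failed_spans (plus variants)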

OTel component metrics will use only the normal or detailed levels.

Long-term, we can consider giving users an option to override this level per component.
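
Such an override does not exist today; purely as a hypothetical illustration, it could look something like this (option name invented):

    service:
      telemetry:
        metrics:
          level: normal
          per_component_level:   # hypothetical option, not implemented
            batch: detailed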

@hughesjj

From collector wg/sig:

  • basic should only be for internal collector telemetry
  • normal should be the default for component authors
  • detailed should be ... well, the detailed view

I'm strongly in favor of better utilizing verbosity levels

@dmitryax dmitryax self-assigned this Jun 14, 2023
@victoramsantos

Agree. I was testing the difference between these levels and found the same thing: the detailed level exposes only one more metric than the others (namely otelcol_processor_batch_batch_send_size_bytes_bucket).

@dmitryax
Member Author

dmitryax commented Mar 14, 2024

According to the proposed guidelines, I think we can change the default level from Basic to Normal and move the following metric sets from Basic to Normal:

  1. GRPC/HTTP server/client metrics.
otelcol_http_client_duration histogram
otelcol_http_client_request_size counter
otelcol_http_client_response_size counter
otelcol_http_server_duration histogram
otelcol_http_server_request_size counter
otelcol_http_server_response_size counter
otelcol_rpc_server_duration histogram
otelcol_rpc_server_request_size histogram
otelcol_rpc_server_requests_per_rpc histogram
otelcol_rpc_server_response_size histogram
otelcol_rpc_server_responses_per_rpc histogram

These metrics were not emitted before the transition to OTel instrumentation and can be pretty noisy even with telemetry.disableHighCardinalityMetrics enabled. It may even be worth considering moving them (or a portion of them) to the Detailed level.

  2. Batch processor metrics:
otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_metadata_cardinality gauge
otelcol_processor_batch_timeout_trigger_send counter
otelcol_processor_batch_size_trigger_send counter

Enabling them at the Basic level isn't aligned with the suggested guidelines: "custom" component metrics (those not generated by the helpers) should be emitted starting with the Normal level.
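
As a rough illustration of what "emitted starting with the Normal level" can mean in code (a minimal sketch, not the batch processor's actual implementation; how a component obtains the configured level differs between collector versions):

    package example

    import (
        "go.opentelemetry.io/collector/config/configtelemetry"
        "go.opentelemetry.io/otel/metric"
    )

    // registerCustomMetrics creates a component-specific counter only when the
    // configured telemetry level is Normal or higher.
    func registerCustomMetrics(level configtelemetry.Level, meter metric.Meter) (metric.Int64Counter, error) {
        if level < configtelemetry.LevelNormal {
            // At None or Basic, skip "custom" component metrics entirely.
            return nil, nil
        }
        return meter.Int64Counter(
            "processor_example_custom_metric", // illustrative name; real components use their own instrument names
            metric.WithDescription("Example component-specific counter gated on the Normal level"),
        )
    }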

@TylerHelmuth
Member

@dmitryax I agree that component metrics should be emitted with Normal and that Normal should be the default level. I agree with Basic being only those generic receiver/processor/exporter metrics.

For the GRPC/HTTP server/client metrics, are they broken down per component? If so, they feel like Detailed metrics to me.

Side note about Detailed metrics. I foresee users wanting to be selective about which detailed metrics are emitted. Should we provide a way to specify which specific metrics, of any level, are enabled/disabled like we do for scrapers with mdatagen?
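
Nothing like this exists under service::telemetry::metrics today; purely as a hypothetical sketch of what per-metric toggles could look like (option name invented for illustration):

    service:
      telemetry:
        metrics:
          level: detailed
          per_metric:               # hypothetical option, not implemented
            otelcol_http_server_duration:
              enabled: false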

@dmitryax
Member Author

For the GRPC/HTTP server/client metrics, are they broken down per component? If so, they feel like Detailed metrics to me.

Not every component exposes them, only receivers and exporters with HTTP/GRPC clients/servers, but they can be more granular than per component. Client metrics are keyed by net.peer.name, so there is one series per component -> external endpoint pair. GRPC server metrics carry rpc.method and rpc.service, so at least one series per data type. If the telemetry.disableHighCardinalityMetrics feature gate isn't enabled, they also get port and host attributes, which can cause high-cardinality problems; see #7517.
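
(For reference, that feature gate is toggled on the command line; a sketch assuming the stock otelcol binary:)

    otelcol --config=config.yaml --feature-gates=telemetry.disableHighCardinalityMetrics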

I would agree to move HTTP/GRPC client/server metrics to Detailed level. @open-telemetry/collector-approvers WDYT?

Should we provide a way to specify which specific metrics, of any level, are enabled/disabled like we do for scrapers with mdatagen?

There is nothing like that available now. Users can further reduce the set with the filter processor or with metric_relabel_configs on the prometheus receiver, but I agree it would be nice to have this capability right in service::telemetry::metrics.
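
For example, a self-scrape with metric_relabel_configs can already drop the HTTP/GRPC instrumentation metrics (a sketch, assuming the default internal metrics port 8888):

    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: otelcol-internal
              scrape_interval: 10s
              static_configs:
                - targets: ["0.0.0.0:8888"]
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: "otelcol_(http|rpc)_.*"
                  action: drop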

@TylerHelmuth
Member

But I agree it would be nice to have this capability right in service::telemetry::metrics.

Out of scope for this issue for sure, but I agree.

dmitryax added a commit that referenced this issue Apr 16, 2024
**Description:**

This change distributes the reported internal metrics across available
levels and updates the level set by default:

1. The default level is changed from `basic` to `normal`, which can be
overridden with the `service::telemetry::metrics::level` configuration.

2. The following batch processor metrics are updated to be reported
starting from `normal` level instead of `basic` level:
  - `processor_batch_batch_send_size`
  - `processor_batch_metadata_cardinality` 
  - `processor_batch_timeout_trigger_send` 
  - `processor_batch_size_trigger_send` 

3. The following GRPC/HTTP server and client metrics are updated to be
reported starting from `detailed` level:
  - `http.client.*` metrics 
  - `http.server.*` metrics 
  - `rpc.server.*` metrics 
  - `rpc.client.*` metrics

**Link to tracking Issue:**
#7890
@codeboten codeboten added this to the Self observability milestone Aug 7, 2024
@mx-psi
Member

mx-psi commented Dec 11, 2024

What is the status of this issue? What is left to decide?

@mx-psi
Member

mx-psi commented Dec 16, 2024

Discussed in the 2024-12-16 meeting: Pablo will look into adding this to the component guidelines, and @dmitryax will create a follow-up issue.

@mx-psi mx-psi assigned jade-guiton-dd and unassigned dmitryax Feb 24, 2025
@jade-guiton-dd
Contributor

jade-guiton-dd commented Feb 26, 2025

Looking into the current state of things, it looks like there are still some pending issues. Here are the changes I suggest:

  1. The processor helper metrics for incoming/outgoing items, and the upcoming pipeline telemetry metrics have not been discussed in this issue. I believe that they should fall under basic telemetry like the exporter/receiver helper metrics, in which case no changes are needed.
  2. It seems the final decision regarding batch processor metrics was that they should be normal, which was implemented in Distribute internal metrics across different levels #9767. However, there have been multiple (accidental?) changes since then, and they are back to basic, except for processor_batch_batch_send_size_bytes which is detailed. I would like to switch them back to normal, and update the documentation on the OTel website accordingly.
    (Note that the hardcoded views in service.go are a workaround while waiting for [component] Add support for components specifying what instruments to enable depending on metric level #11754)
  3. I think processor_batch_batch_send_size_bytes was changed to detailed because computing the size in bytes requires additional work compared to counting the number of items. I believe this to be a reasonable use case for detailed, so I would like to keep that metric as detailed, and update our guidelines to mention this use case, instead of only cardinality-related issues.
  4. I would also like to move our guidelines for the use of telemetry levels to somewhere other than the Go package doc for configtelemetry, as I'm unsure component devs will go read that. "Component guidelines" were mentioned by Pablo above, although I'm not sure where that is?
  5. I would also like to update the rather vague documentation on the level enum values themselves, since I feel like those are more likely to be seen by component devs than whatever external document we put out.

@jade-guiton-dd jade-guiton-dd moved this from Todo to In Progress in Collector: v1 Feb 26, 2025
@mx-psi
Member

mx-psi commented Feb 26, 2025

"Component guidelines" were mentioned by Pablo above, although I'm not sure where that is?

I discussed this offline with Jade; I believe I meant the "component stability" document we have in the docs folder of this repository.

Regarding the list above, my only comment was that we should think about the migration, possibly with a feature gate for all telemetry-related changes.

@jade-guiton-dd jade-guiton-dd moved this from In Progress to Waiting for reviews in Collector: v1 Feb 28, 2025
github-merge-queue bot pushed a commit that referenced this issue Mar 4, 2025
…l guidelines (#12525)

#### Description

This PR:
- requires "level: normal" before outputting batch processor metrics (aside
from one specific metric, which was already restricted to "level: detailed")
- clarifies wording in the telemetry level guidelines and documentation,
and adds said guidelines to the requirements for stable components.

Some rationale for these changes can be found in the tracking issue and
[this
comment](#7890 (comment)).

#### Link to tracking issue
Resolves #7890

#### To be discussed

Should we add a feature gate for this, in case a user relies on "level:
basic" outputting batch processor metrics? This feels like a niche use
case, so considering the "alpha" stability level of these metrics, I
don't think it's really necessary.

Considering batch processor metrics had already been switched to
"normal" once (#9767), but were turned back to basic at some later point
(not sure when), we might also want to add tests to avoid further
regressions (especially as the handling of telemetry levels is bound to
change further with #11754).

---------

Co-authored-by: Dmitrii Anoshin <[email protected]>
@github-project-automation github-project-automation bot moved this from Waiting for reviews to Done in Collector: v1 Mar 4, 2025