Pipeline Metrics: End-to-End processing latency #542

Open
rghetia opened this issue Feb 12, 2020 · 16 comments

@rghetia
Contributor

rghetia commented Feb 12, 2020

This issue specifically covers end-to-end latency metric which is a part of overall pipeline metrics (#484).

There are a few things to consider here.

Should there be one data point for each metric/span or should it be for a batch?

  • Batch is preferred because metrics in a batch would have the same latency.

What happens when multiple batches are combined?

  • It should keep track of the receive time for each batch and then record a latency for each original batch. For example, if batches A and B are combined into C, then when C is exported it should record two latencies: (Tc-Ta) and (Tc-Tb). (See the Go sketch after this list.)

What if batches are split?

  • If a batch is split, it should again record a latency for each resulting batch. For example, if batch A is split into batches B and C, record latencies (Tb-Ta) and (Tc-Ta).

What labels should it have?

  • Receiver and Exporter at minimum. This should define the pipeline. Processors may not be required.

What if the element is dropped/filtered?

  • Maybe record the latency when the element is dropped/filtered. In that case, instead of Receiver/Exporter labels it should have Begin/End labels: Begin could be any receiver, and End could be an exporter or a processor.

What should be the type of metric?

  • Histogram (Distribution)

What about the latency of individual stages in the pipeline?

  • This would help to debug/isolate the problem when the end-to-end latency is high.
  • The latency can be recorded using the same measure, with Start and End labels containing the pipeline/stage name.
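
A minimal Go sketch of the merge/split bookkeeping described above, assuming a hypothetical batch wrapper that carries the receive times of the original batches (none of these names exist in the collector; they are purely illustrative):

    package pipelinemetrics

    import "time"

    // batch is a hypothetical wrapper that keeps the receive time of every
    // original batch that contributed to it.
    type batch struct {
        receiveTimes []time.Time
        // telemetry payload omitted
    }

    // newBatch records the receive time of a freshly received batch.
    func newBatch() batch {
        return batch{receiveTimes: []time.Time{time.Now()}}
    }

    // merge combines batches A and B into C, keeping both receive times so that
    // exporting C yields two observations: (Tc-Ta) and (Tc-Tb).
    func merge(a, b batch) batch {
        times := append(append([]time.Time{}, a.receiveTimes...), b.receiveTimes...)
        return batch{receiveTimes: times}
    }

    // split produces batches B and C that both inherit A's receive time, so each
    // split later records its own latency: (Tb-Ta) and (Tc-Ta).
    func split(a batch) (batch, batch) {
        return batch{receiveTimes: a.receiveTimes}, batch{receiveTimes: a.receiveTimes}
    }

    // onExport records one latency observation per original receive time.
    func onExport(b batch, record func(time.Duration)) {
        now := time.Now()
        for _, t := range b.receiveTimes {
            record(now.Sub(t))
        }
    }
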
@tigrannajaryan
Member

Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

@rghetia
Contributor Author

rghetia commented Feb 13, 2020

Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

My intention was to record the end-to-end latency, but a breakdown by individual pipeline element would help to debug if there is an issue. I'll update the description to include that.

@x13n
Contributor

x13n commented Mar 5, 2020

Why not add n data points to the metric whenever a batch is exported? Within a single batch, metrics may have different timestamps if they were buffered, so that approach will make it independent of how they were batched, solving the issue of splitting/merging the batches during processing.
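
A rough sketch of that variant, recording one observation per data point from each point's own timestamp rather than from a batch-level receive time (recordLatency is a placeholder callback, not an existing collector API):

    package pointlatency

    import "time"

    // onExport emits one observation per exported data point, using the point's
    // own timestamp, so the measurement does not depend on how points were
    // merged or split into batches during processing.
    func onExport(pointTimestamps []time.Time, recordLatency func(time.Duration)) {
        now := time.Now()
        for _, ts := range pointTimestamps {
            recordLatency(now.Sub(ts))
        }
    }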

@dashpole
Contributor

I'd like to work on this, if that's alright.

I did an initial investigation, but I'm still learning about the structure of the collector. One thing I ran into is that since the pipeline isn't synchronous, I can't just wrap a function call with start/stop functions. One thought I had was to propagate the receive time to the exporters using context, but that seems like it could be fragile (e.g. if a processor doesn't correctly propagate the context). Does anyone have other suggestions?
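
One possible shape for that context-based propagation, sketched with hypothetical helpers rather than existing obsreport functions; the boolean return makes the fragility explicit when a processor fails to forward the incoming context:

    package pipelinelatency

    import (
        "context"
        "time"
    )

    type receiveTimeKey struct{}

    // withReceiveTime stashes the receive time in the context at the receiver.
    func withReceiveTime(ctx context.Context, t time.Time) context.Context {
        return context.WithValue(ctx, receiveTimeKey{}, t)
    }

    // pipelineDuration reads the receive time back at the exporter. The second
    // return value is false if the value was lost along the way, for example
    // because a processor did not propagate the incoming context.
    func pipelineDuration(ctx context.Context) (time.Duration, bool) {
        t, ok := ctx.Value(receiveTimeKey{}).(time.Time)
        if !ok {
            return 0, false
        }
        return time.Since(t), true
    }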

@dashpole
Contributor

Comments from bogdan on 4/28:

  • Context may not be right, since context is lost when we do batching.
  • nit: maybe not call this latency, call it processing time.

@x13n
Contributor

x13n commented Apr 29, 2021

The idea I initially had for this was to re-use metric timestamps and record the delta between metric timestamp and Now() when the metric is successfully exported. That wouldn't be just processing time though - it would be measuring collection (possibly outside of the Collector) as well.

@x13n
Contributor

x13n commented Apr 29, 2021

(In which case it should probably be called metric freshness or something.)

@dashpole
Contributor

dashpole commented May 6, 2021

I have implemented prototypes of a few approaches:

  1. Pipeline latency with startTime propagated via context. Measures from obsreport.StartXXReceiveOp to obsreport.EndXXExportOp: changes
  2. Pipeline latency with startTime embedded in pdata types. Measures from pdata.NewXX to obsreport.RecordPipelineDuration (basically at obsreport.EndXXExportOp): changes
  3. Component latency with startTime propagated via context. Measures from obsreport.StartXXYYOp to obsreport.EndXXYYOp: changes (see the sketch after this list)
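
A minimal sketch of the shape of approach 3, where the start time rides in the context between a component's start and end hooks and a per-component duration is recorded at the end (helper names are made up; the linked prototype is the authoritative version):

    package componentlatency

    import (
        "context"
        "time"
    )

    type opStartKey struct{}

    // startComponentOp marks the start of one component's processing by stashing
    // the current time in the context.
    func startComponentOp(ctx context.Context) context.Context {
        return context.WithValue(ctx, opStartKey{}, time.Now())
    }

    // endComponentOp records the duration for a single component, e.g. into a
    // histogram labeled with the component ID.
    func endComponentOp(ctx context.Context, componentID string, record func(id string, d time.Duration)) {
        if start, ok := ctx.Value(opStartKey{}).(time.Time); ok {
            record(componentID, time.Since(start))
        }
    }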

@dashpole
Contributor

dashpole commented May 7, 2021

A few thoughts:

3 is definitely the simplest, and aligns with the other obsreport telemetry we generate (per-component successes vs failures, per-component traces). Individual component processing durations aren't as good a representation of the end-user experience as e2e latency, but they are probably good enough to at least detect high-latency problems.

If we want to implement a full solution to this issue by adding true end-to-end duration metrics, I tend to prefer approach 1. Approach 2 is a little tricky to get right, at least in my experience implementing it, because components are free to create pdata.Metrics whenever they need to. For example, the batch processor creates new ones to hold batches. We would either have to prevent components from creating new pdata.Metrics (for example, by making them merge or split them with our helper functions), or expose the detail of passing the startTime to all components and make them handle it correctly. With context, that can all be a bit more invisible, as we can just provide something like obsreport.MergeContext(primary, appended context.Context) to handle cases like merging batches.
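
A rough sketch of what a MergeContext-style helper could look like, assuming the pipeline start times are carried in the context as a slice (the key and function names are hypothetical, not existing obsreport API):

    package mergecontextsketch

    import (
        "context"
        "time"
    )

    type startTimesKey struct{}

    // withStartTime is called at the receiver to attach a batch's start time.
    func withStartTime(ctx context.Context, t time.Time) context.Context {
        return context.WithValue(ctx, startTimesKey{}, []time.Time{t})
    }

    // mergeContext keeps one start time per original batch when batches are
    // combined, so each original batch later yields its own duration observation
    // at export time.
    func mergeContext(primary, appended context.Context) context.Context {
        merged := append(append([]time.Time{}, startTimes(primary)...), startTimes(appended)...)
        return context.WithValue(primary, startTimesKey{}, merged)
    }

    // startTimes reads the accumulated start times back out of the context.
    func startTimes(ctx context.Context) []time.Time {
        ts, _ := ctx.Value(startTimesKey{}).([]time.Time)
        return ts
    }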

@dashpole
Contributor

dashpole commented May 12, 2021

Notes from 5/12:

  • Let's try doing something context-based.
  • The batch and groupby processors will be problematic. Needs more prototyping.
  • We need to be specific about exactly what we are measuring. What happens if batches are merged?

@dashpole
Contributor

I have a working draft for context-based e2e metrics that works for the batch processor: #3183.

Here is the description of the metric from the draft, which explains how we handle batching and multiple exporters:

	// mPipelineProcessingMetrics records an observation for each
	// received batch each time it is exported. For example, a pipeline with
	// one receiver and two exporters will record two observations for each
	// batch of telemetry received: one observation for the time taken to be
	// exported by each exporter.
	// It records an observation for each batch received, not for each batch
	// sent. For example, if five received batches of telemetry are merged
	// together by the "batch" processor and exported together, it will still
	// result in five observations: one observation starting from the time each
	// of the five original batches was received.
	mPipelineProcessingMetrics = stats.Float64(
		"pipeline_processing_duration_seconds",
		"Duration between when a batch of telemetry in the pipeline was received and when it was sent by an exporter.",
		stats.UnitSeconds)
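
For readers less familiar with the collector's OpenCensus instrumentation, the sketch below shows roughly how a measure like this would be given a distribution aggregation and fed one observation per original receive time. It is only an illustration; the bucket bounds and the exporter label key are assumptions, not taken from the draft:

    package pipelinemetricssketch

    import (
        "context"
        "time"

        "go.opencensus.io/stats"
        "go.opencensus.io/stats/view"
        "go.opencensus.io/tag"
    )

    var (
        exporterKey = tag.MustNewKey("exporter")

        mPipelineProcessingMetrics = stats.Float64(
            "pipeline_processing_duration_seconds",
            "Duration between when a batch of telemetry in the pipeline was received and when it was sent by an exporter.",
            stats.UnitSeconds)
    )

    // registerView aggregates the observations into a histogram; the bucket
    // boundaries here are illustrative only.
    func registerView() error {
        return view.Register(&view.View{
            Name:        mPipelineProcessingMetrics.Name(),
            Description: mPipelineProcessingMetrics.Description(),
            Measure:     mPipelineProcessingMetrics,
            Aggregation: view.Distribution(0.005, 0.025, 0.1, 0.5, 1, 5, 30),
            TagKeys:     []tag.Key{exporterKey},
        })
    }

    // recordPipelineDuration records one observation per original batch receive
    // time, tagged with the exporter that sent the data.
    func recordPipelineDuration(ctx context.Context, exporter string, receiveTimes []time.Time) {
        for _, t := range receiveTimes {
            _ = stats.RecordWithTags(ctx,
                []tag.Mutator{tag.Upsert(exporterKey, exporter)},
                mPipelineProcessingMetrics.M(time.Since(t).Seconds()))
        }
    }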

Feel free to provide feedback on the draft. If the general approach is acceptable, I'll finish it up and send it out for review.

@alolita
Member

alolita commented Sep 2, 2021

@bogdandrutu @tigrannajaryan, following up on this issue: can you please provide feedback on @dashpole's suggested approaches?

@tigrannajaryan
Member

@alolita I provided the feedback here #3183 (comment)

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
@peachchen0716

Hi @bogdandrutu, I am looking for the e2e metrics too, and the attempt to add this metric seems to be blocked. Do you mind taking a look and sharing some context on the blocker? Thanks!

@peachchen0716

peachchen0716 commented May 8, 2023

Hi @bogdandrutu, I am looking for the e2e metrics too, and the attempt to add this metric seems to be blocked. Do you mind taking a look and sharing some context on the blocker? Thanks!

Hi @codeboten and @dmitryax, can I get some context/status on the PdataContext mentioned in the closed PR for this issue? Is it still the recommended solution? Thanks!

swiatekm pushed a commit to swiatekm/opentelemetry-collector that referenced this issue Oct 9, 2024
@github-actions github-actions bot added the Stale label May 9, 2025
@sccoache

sccoache commented Jun 6, 2025

Hello @bogdandrutu @dashpole, I am looking for a latency/lag time metric as well. Is there any status on this work? I see the closed PR from two years ago but nothing since then.

@github-actions github-actions bot removed the Stale label Jun 7, 2025