Pipeline Metrics: End-to-End processing latency #542

Open
rghetia opened this issue Feb 12, 2020 · 16 comments

@rghetia
Contributor

rghetia commented Feb 12, 2020

This issue specifically covers end-to-end latency metric which is a part of overall pipeline metrics (#484).

There are a few things to consider here.

Should there be one data point for each metric/span or should it be for a batch?

  • Batch is preferred because metrics in a batch would have the same latency.

What happens when multiple batches are combined?

  • It should keep track of the receive time for each batch and then record a latency for each original batch. For example, if batches A and B are combined into C, then when C is exported it should record two latencies: (Tc-Ta) and (Tc-Tb). (See the Go sketch after this list.)

What if batches are split?

  • If a batch is split, it should again record a latency for each resulting batch. For example, if batch A is split into batches B and C, record latencies (Tb-Ta) and (Tc-Ta).

What labels should it have?

  • Receiver and Exporter at minimum. This should define the pipeline. Processors may not be required.

What if the element is dropped/filtered?

  • Maybe record the latency when the element is dropped/filtered. In that case, instead of Receiver/Exporter labels it should have Begin/End labels: Begin could be any receiver, and End could be an exporter or a processor.

What should be the type of metric?

  • Histogram (Distribution)

What about the latency of individual stages in the pipeline?

  • This would help to debug/isolate the problem when the end-to-end latency is high.
  • The latency can be recorded using the same measure, with Start and End labels containing the pipeline/stage name.
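
A minimal Go sketch of the merge/split bookkeeping described above, assuming a hypothetical batch wrapper that carries the receive times of the original batches (none of these names exist in the collector; they are purely illustrative):

    package pipelinemetrics

    import "time"

    // batch is a hypothetical wrapper that keeps the receive time of every
    // original batch that contributed to it.
    type batch struct {
        receiveTimes []time.Time
        // telemetry payload omitted
    }

    // newBatch records the receive time of a freshly received batch.
    func newBatch() batch {
        return batch{receiveTimes: []time.Time{time.Now()}}
    }

    // merge combines batches A and B into C, keeping both receive times so that
    // exporting C yields two observations: (Tc-Ta) and (Tc-Tb).
    func merge(a, b batch) batch {
        times := append(append([]time.Time{}, a.receiveTimes...), b.receiveTimes...)
        return batch{receiveTimes: times}
    }

    // split produces batches B and C that both inherit A's receive time, so each
    // split later records its own latency: (Tb-Ta) and (Tc-Ta).
    func split(a batch) (batch, batch) {
        return batch{receiveTimes: a.receiveTimes}, batch{receiveTimes: a.receiveTimes}
    }

    // onExport records one latency observation per original receive time.
    func onExport(b batch, record func(time.Duration)) {
        now := time.Now()
        for _, t := range b.receiveTimes {
            record(now.Sub(t))
        }
    }
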
@tigrannajaryan
Member

Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

@rghetia
Contributor Author

rghetia commented Feb 13, 2020

Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

My intention was to record the end-to-end latency, but a breakdown by individual pipeline element would help to debug if there is an issue. I'll update the description to include that.

@x13n
Contributor

x13n commented Mar 5, 2020

Why not add n data points to the metric whenever a batch is exported? Within a single batch, metrics may have different timestamps if they were buffered, so that approach will make it independent of how they were batched, solving the issue of splitting/merging the batches during processing.
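
A rough sketch of that variant, recording one observation per data point from each point's own timestamp rather than from a batch-level receive time (recordLatency is a placeholder callback, not an existing collector API):

    package pointlatency

    import "time"

    // onExport emits one observation per exported data point, using the point's
    // own timestamp, so the measurement does not depend on how points were
    // merged or split into batches during processing.
    func onExport(pointTimestamps []time.Time, recordLatency func(time.Duration)) {
        now := time.Now()
        for _, ts := range pointTimestamps {
            recordLatency(now.Sub(ts))
        }
    }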

@dashpole
Contributor

I'd like to work on this, if that's alright.

I did an initial investigation, but I'm still learning about the structure of the collector. One thing I ran into is that since the pipeline isn't synchronous, I can't just wrap a function call with start/stop functions. One thought I had was to propagate the receive time to the exporters using context, but that seems like it could be fragile (e.g. if a processor doesn't correctly propagate the context). Does anyone have other suggestions?
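
One possible shape for that context-based propagation, sketched with hypothetical helpers rather than existing obsreport functions; the boolean return makes the fragility explicit when a processor fails to forward the incoming context:

    package pipelinelatency

    import (
        "context"
        "time"
    )

    type receiveTimeKey struct{}

    // withReceiveTime stashes the receive time in the context at the receiver.
    func withReceiveTime(ctx context.Context, t time.Time) context.Context {
        return context.WithValue(ctx, receiveTimeKey{}, t)
    }

    // pipelineDuration reads the receive time back at the exporter. The second
    // return value is false if the value was lost along the way, for example
    // because a processor did not propagate the incoming context.
    func pipelineDuration(ctx context.Context) (time.Duration, bool) {
        t, ok := ctx.Value(receiveTimeKey{}).(time.Time)
        if !ok {
            return 0, false
        }
        return time.Since(t), true
    }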

@dashpole
Contributor

Comments from bogdan on 4/28:

  • Context may not be right, since context is lost when we do batching.
  • nit: maybe not call this latency, call it processing time.

@x13n
Contributor

x13n commented Apr 29, 2021

The idea I initially had for this was to re-use metric timestamps and record the delta between metric timestamp and Now() when the metric is successfully exported. That wouldn't be just processing time though - it would be measuring collection (possibly outside of the Collector) as well.

@x13n
Contributor

x13n commented Apr 29, 2021

(In which case it should probably be called metric freshness or something.)

@dashpole
Contributor

dashpole commented May 6, 2021

I have implemented prototypes of a few approaches:

  1. Pipeline latency with startTime propagated via context. Measures from obsreport.StartXXReceiveOp to obsreport.EndXXExportOp: changes
  2. Pipeline latency with startTime embedded in pdata types. Measures from pdata.NewXX to obsreport.RecordPipelineDuration (basically at obsreport.EndXXExportOp): changes
  3. Component latency with startTime propagated via context. Measures from obsreport.StartXXYYOp to obsreport.EndXXYYOp: changes (see the sketch after this list)
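
A minimal sketch of the shape of approach 3, where the start time rides in the context between a component's start and end hooks and a per-component duration is recorded at the end (helper names are made up; the linked prototype is the authoritative version):

    package componentlatency

    import (
        "context"
        "time"
    )

    type opStartKey struct{}

    // startComponentOp marks the start of one component's processing by stashing
    // the current time in the context.
    func startComponentOp(ctx context.Context) context.Context {
        return context.WithValue(ctx, opStartKey{}, time.Now())
    }

    // endComponentOp records the duration for a single component, e.g. into a
    // histogram labeled with the component ID.
    func endComponentOp(ctx context.Context, componentID string, record func(id string, d time.Duration)) {
        if start, ok := ctx.Value(opStartKey{}).(time.Time); ok {
            record(componentID, time.Since(start))
        }
    }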

@dashpole
Contributor

dashpole commented May 7, 2021

A few thoughts:

3 is definitely the simplest, and aligns with the other obsreport telemetry we generate (per-component successes vs failures, per-component traces). Individual component processing durations aren't as good a representation of the end-user experience as e2e latency, but they are probably good enough to at least detect high-latency problems.

If we want to implement a full solution to this issue by adding true end-to-end duration metrics, I tend to prefer approach 1. Approach 2 is a little tricky to get right, at least in my experience implementing it, because components are free to create pdata.Metrics whenever they need to. For example, the batch processor creates new ones to hold batches. We would either have to prevent components from creating new pdata.Metrics (for example, by making them merge or split them with our helper functions), or expose the detail of passing the startTime to all components and make them handle it correctly. With context, that can all be a bit more invisible, as we can just provide something like obsreport.MergeContext(primary, appended context.Context) to handle cases like merging batches.
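
A rough sketch of what a MergeContext-style helper could look like, assuming the pipeline start times are carried in the context as a slice (the key and function names are hypothetical, not existing obsreport API):

    package mergecontextsketch

    import (
        "context"
        "time"
    )

    type startTimesKey struct{}

    // withStartTime is called at the receiver to attach a batch's start time.
    func withStartTime(ctx context.Context, t time.Time) context.Context {
        return context.WithValue(ctx, startTimesKey{}, []time.Time{t})
    }

    // mergeContext keeps one start time per original batch when batches are
    // combined, so each original batch later yields its own duration observation
    // at export time.
    func mergeContext(primary, appended context.Context) context.Context {
        merged := append(append([]time.Time{}, startTimes(primary)...), startTimes(appended)...)
        return context.WithValue(primary, startTimesKey{}, merged)
    }

    // startTimes reads the accumulated start times back out of the context.
    func startTimes(ctx context.Context) []time.Time {
        ts, _ := ctx.Value(startTimesKey{}).([]time.Time)
        return ts
    }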

@dashpole
Contributor

dashpole commented May 12, 2021

Notes from 5/12:

  • Let's try doing something context-based.
  • The batch and groupby processors will be problematic. Needs more prototyping.
  • We need to be specific about exactly what we are measuring. What happens if batches are merged?

@dashpole
Contributor

I have a working draft for context-based e2e metrics that works for the batch processor: #3183.

Here is the description of the metric from the draft, which explains how we handle batching and multiple exporters:

	// mPipelineProcessingMetrics records an observation for each
	// received batch each time it is exported. For example, a pipeline with
	// one receiver and two exporters will record two observations for each
	// batch of telemetry received: one observation for the time taken to be
	// exported by each exporter.
	// It records an observation for each batch received, not for each batch
	// sent. For example, if five received batches of telemetry are merged
	// together by the "batch" processor and exported together, it will still
	// result in five observations: one observation starting from the time each
	// of the five original batches was received.
	mPipelineProcessingMetrics = stats.Float64(
		"pipeline_processing_duration_seconds",
		"Duration between when a batch of telemetry in the pipeline was received and when it was sent by an exporter.",
		stats.UnitSeconds)
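
For readers less familiar with the collector's OpenCensus instrumentation, the sketch below shows roughly how a measure like this would be given a distribution aggregation and fed one observation per original receive time. It is only an illustration; the bucket bounds and the exporter label key are assumptions, not taken from the draft:

    package pipelinemetricssketch

    import (
        "context"
        "time"

        "go.opencensus.io/stats"
        "go.opencensus.io/stats/view"
        "go.opencensus.io/tag"
    )

    var (
        exporterKey = tag.MustNewKey("exporter")

        mPipelineProcessingMetrics = stats.Float64(
            "pipeline_processing_duration_seconds",
            "Duration between when a batch of telemetry in the pipeline was received and when it was sent by an exporter.",
            stats.UnitSeconds)
    )

    // registerView aggregates the observations into a histogram; the bucket
    // boundaries here are illustrative only.
    func registerView() error {
        return view.Register(&view.View{
            Name:        mPipelineProcessingMetrics.Name(),
            Description: mPipelineProcessingMetrics.Description(),
            Measure:     mPipelineProcessingMetrics,
            Aggregation: view.Distribution(0.005, 0.025, 0.1, 0.5, 1, 5, 30),
            TagKeys:     []tag.Key{exporterKey},
        })
    }

    // recordPipelineDuration records one observation per original batch receive
    // time, tagged with the exporter that sent the data.
    func recordPipelineDuration(ctx context.Context, exporter string, receiveTimes []time.Time) {
        for _, t := range receiveTimes {
            _ = stats.RecordWithTags(ctx,
                []tag.Mutator{tag.Upsert(exporterKey, exporter)},
                mPipelineProcessingMetrics.M(time.Since(t).Seconds()))
        }
    }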

Feel free to provide feedback on the draft. If the general approach is acceptable, I'll finish it up and send it out for review.

@alolita
Member

alolita commented Sep 2, 2021

@bogdandrutu @tigrannajaryan, following up on this issue: can you please provide feedback on @dashpole's suggested approaches?

@tigrannajaryan
Member

@alolita I provided the feedback here #3183 (comment)

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
@peachchen0716

Hi @bogdandrutu, I am looking for the e2e metrics too, and the attempt to add this metric seems to be blocked. Do you mind taking a look and sharing some context on the blocker? Thanks!

@peachchen0716

peachchen0716 commented May 8, 2023

Hi @bogdandrutu, I am looking for the e2e metrics too, and the attempt to add this metric seems to be blocked. Do you mind taking a look and sharing some context on the blocker? Thanks!

Hi @codeboten and @dmitryax, can I get some context/status on the PdataContext mentioned in the closed PR for this issue? Is it still the recommended solution? Thanks!

swiatekm pushed a commit to swiatekm/opentelemetry-collector that referenced this issue Oct 9, 2024
@github-actions github-actions bot added the Stale label May 9, 2025
@sccoache

sccoache commented Jun 6, 2025

Hello @bogdandrutu @dashpole, I am looking for a latency/lag time metric as well. Is there any status on this work? I see the closed PR from two years ago but nothing since then.

@github-actions github-actions bot removed the Stale label Jun 7, 2025