
Incorrect "span_metrics_calls_total" Metric Value for SpanMetrics when Otel-Collector is Restarted #38262


Open
meSATYA opened this issue Feb 28, 2025 · 3 comments
Labels
bug Something isn't working connector/spanmetrics

Comments


meSATYA commented Feb 28, 2025

Component(s)

connector/spanmetrics

What happened?

Description

We generate span metrics by running the otel-collector as a statefulset behind a loadbalancing exporter with routing_key set to service. The value of span_metrics_calls_total is correct until the collector is restarted. Whenever we restart the collector, the span_metrics_calls_total metric shows a bump or a spike on the graph, which gives the misleading impression that something is wrong with the service, i.e. that calls to it have suddenly dropped or increased.

Steps to Reproduce

Send traces to a collector running as a deployment with the loadbalancing exporter, then forward the traces from that load-balancing collector to another collector running as a statefulset. Use routing_key: service. A sketch of the first-tier configuration is shown below.
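
For context, a minimal sketch of what the first-tier (load-balancing) collector configuration could look like; the resolver hostname, namespace, and endpoints below are placeholders, not the values from our actual setup:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: service   # all spans of a given service go to the same statefulset pod
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # placeholder: headless service in front of the statefulset collectors
        hostname: otel-collector-statefulset-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]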

Expected Result

The calls_total metric shouldn't show a bump or spike when the otel-collector restarts.

Actual Result

(Three screenshots attached to the issue show the span_metrics_calls_total graph with a bump/spike at the time of the collector restart.)

Collector version

0.120.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

mode: "statefulset"

config:
  exporters:
    debug/spanmetrics:
      verbosity: basic 

    prometheusremotewrite/spanmetrics:
      endpoint: http://victoria-metrics-cluster-vminsert.metrics.svc.cluster.local:8480/insert/10/prometheus
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 60s
      compression: gzip
      tls:
        insecure_skip_verify: true

  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [1ms, 10ms, 20ms, 50ms, 100ms, 250ms, 500ms, 800ms, 1s, 2s, 5s, 10s, 15s]
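      # note: with cumulative temporality the connector keeps counter state only in memory,
      # so a collector restart resets calls_total to zero (see the discussion below)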
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      dimensions:
        - name: http.method
        - name: http.status_code
      dimensions_cache_size: 1000
      events:
        enabled: true
        dimensions:
          - name: exception.type
      exclude_dimensions: ['k8s.pod.uid', 'k8s.pod.name', 'k8s.container.name', 'k8s.deployment.name', 'k8s.deployment.uid', 'k8s.job.name', 'k8s.job.uid', 'k8s.namespace.name', 'k8s.node.name', 'k8s.pod.ip', 'k8s.pod.start_time', 'k8s.replicaset.name', 'k8s.replicaset.uid', 'azure.vm.scaleset.name', 'cloud.resource_id', 'host.id', 'host.type', 'instance', 'service.instance.id', 'host.name', 'job', 'dt.entity.host', 'dt.entity.process_group', 'dt.entity.process_group_instance', 'container.id']      
      exemplars:
        enabled: true
        max_per_data_point: 5
      metrics_flush_interval: 1m
      metrics_expiration: 5m
      namespace: span.metrics
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name

  processors:
    batch: {}

    batch/spanmetrics:
      send_batch_max_size: 5000
      send_batch_size: 4500
      timeout: 10s

    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25
    
  receivers:
    otlp/traces:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 12
          
  service:
    extensions:
      - health_check
    pipelines:
      metrics/spanmetrics:
        exporters:
        - prometheusremotewrite/spanmetrics
        processors:
        - batch/spanmetrics
        receivers:
        - spanmetrics

      traces/connector-pipeline:
        exporters:
        - spanmetrics
        processors:
        - batch
        receivers:
        - otlp/traces  
        
    telemetry:
      metrics:
        address: ${env:MY_POD_IP}:8888

Log output

Additional context

No response

meSATYA added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Feb 28, 2025

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

meSATYA changed the title from Incorrect "calls_total" Metric Value for SpanMetrics when Otel-Collector is Restarted to Incorrect "span_metrics_calls_total" Metric Value for SpanMetrics when Otel-Collector is Restarted on Feb 28, 2025
@Frapschen (Contributor) commented:

The spanmetrics connector currently stores its recordings in memory, so after a restart it loses all previous recordings and the calls_total count resets.

We would need to back up the recordings to disk and restore them when the collector restarts.

@pjanotti (Contributor) commented:

@Frapschen @portertech is this something that you would like to address either via code or documentation?
