
Incorrect "span_metrics_calls_total" Metric Value for SpanMetrics when Otel-Collector is Restarted #38262


Open
meSATYA opened this issue Feb 28, 2025 · 3 comments
Labels
bug Something isn't working connector/spanmetrics

Comments


meSATYA commented Feb 28, 2025

Component(s)

connector/spanmetrics

What happened?

Description

We generate span metrics by running the otel-collector as a statefulset behind a loadbalancing exporter with routing_key set to service. The value of span_metrics_calls_total is correct until the collector is restarted. Whenever we restart the collector, the span_metrics_calls_total metric shows a bump or a spike on the graph, which gives the misleading impression that something is wrong with the service, i.e. that calls to it have suddenly dropped or increased.

Steps to Reproduce

Send traces to a collector running as a deployment with the loadbalancing exporter, then forward the traces from that load-balancing collector to another collector running as a statefulset. Use routing_key: service. A sketch of the first-tier configuration is shown below.
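
For context, a minimal sketch of what the first-tier (load-balancing) collector configuration could look like; the resolver hostname, namespace, and endpoints below are placeholders, not the values from our actual setup:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: service   # all spans of a given service go to the same statefulset pod
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # placeholder: headless service in front of the statefulset collectors
        hostname: otel-collector-statefulset-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]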

Expected Result

The calls_total metric shouldn't show a bump or spike when the otel-collector restarts.

Actual Result

(Three screenshots attached to the issue show the span_metrics_calls_total graph with a bump/spike at the time of the collector restart.)

Collector version

0.120.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

mode: "statefulset"

config:
  exporters:
    debug/spanmetrics:
      verbosity: basic 

    prometheusremotewrite/spanmetrics:
      endpoint: http://victoria-metrics-cluster-vminsert.metrics.svc.cluster.local:8480/insert/10/prometheus
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 60s
      compression: gzip
      tls:
        insecure_skip_verify: true

  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [1ms, 10ms, 20ms, 50ms, 100ms, 250ms, 500ms, 800ms, 1s, 2s, 5s, 10s, 15s]
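      # note: with cumulative temporality the connector keeps counter state only in memory,
      # so a collector restart resets calls_total to zero (see the discussion below)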
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      dimensions:
        - name: http.method
        - name: http.status_code
      dimensions_cache_size: 1000
      events:
        enabled: true
        dimensions:
          - name: exception.type
      exclude_dimensions: ['k8s.pod.uid', 'k8s.pod.name', 'k8s.container.name', 'k8s.deployment.name', 'k8s.deployment.uid', 'k8s.job.name', 'k8s.job.uid', 'k8s.namespace.name', 'k8s.node.name', 'k8s.pod.ip', 'k8s.pod.start_time', 'k8s.replicaset.name', 'k8s.replicaset.uid', 'azure.vm.scaleset.name', 'cloud.resource_id', 'host.id', 'host.type', 'instance', 'service.instance.id', 'host.name', 'job', 'dt.entity.host', 'dt.entity.process_group', 'dt.entity.process_group_instance', 'container.id']      
      exemplars:
        enabled: true
        max_per_data_point: 5
      metrics_flush_interval: 1m
      metrics_expiration: 5m
      namespace: span.metrics
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name

  processors:
    batch: {}

    batch/spanmetrics:
      send_batch_max_size: 5000
      send_batch_size: 4500
      timeout: 10s

    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25
    
  receivers:
    otlp/traces:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 12
          
  service:
    extensions:
      - health_check
    pipelines:
      metrics/spanmetrics:
        exporters:
        - prometheusremotewrite/spanmetrics
        processors:
        - batch/spanmetrics
        receivers:
        - spanmetrics

      traces/connector-pipeline:
        exporters:
        - spanmetrics
        processors:
        - batch
        receivers:
        - otlp/traces  
        
    telemetry:
      metrics:
        address: ${env:MY_POD_IP}:8888

Log output

Additional context

No response

meSATYA added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Feb 28, 2025

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

meSATYA changed the title from Incorrect "calls_total" Metric Value for SpanMetrics when Otel-Collector is Restarted to Incorrect "span_metrics_calls_total" Metric Value for SpanMetrics when Otel-Collector is Restarted on Feb 28, 2025
@Frapschen (Contributor) commented:

The spanmetrics connector currently stores its recordings in memory, so after a restart it loses all previous recordings and the calls_total count resets.

We would need to back up the recordings to disk and restore them when the collector restarts.

@pjanotti (Contributor) commented:

@Frapschen @portertech is this something that you would like to address either via code or documentation?
