
Googlemanagedprometheus exporter randomly falls into an infinite error state #31507

@rafal-dudek


Component(s)

exporter/googlemanagedprometheus

What happened?

Description

Sometimes, when a pod running the OpenTelemetry Collector starts up in GKE, it reports the error "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric." every subsequent minute (the scrape interval is 30s). After restarting the pod the problem disappears; after some more restarts, the problem happens again.
It looks like all the metrics are sent to Google Monitoring correctly, but every minute additional duplicated data points are added to the batch, which causes the errors.

Steps to Reproduce

Create a pod in Google Kubernetes Engine running the OpenTelemetry Collector with a config similar to ours. If the problem does not occur, delete the pod and recreate it. Repeat until you see the consistent error logs.
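For reference, a minimal repro pod could look like the sketch below. The image tag, ConfigMap name, and mount path are assumptions for illustration, not our exact deployment; the ConfigMap is assumed to contain a config.yaml key holding the collector configuration shown later in this issue.

# Hypothetical repro pod; image tag, ConfigMap name and paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: otel-collector-repro
spec:
  containers:
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib:0.95.0
      args: ["--config=/etc/otelcol/config.yaml"]
      volumeMounts:
        - name: config
          mountPath: /etc/otelcol
  volumes:
    - name: config
      configMap:
        name: otel-collector-config   # assumed to hold the config.yaml below

Delete and recreate this pod until the error logs described below appear.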

Expected Result

If a problem with writing a data point to Google Monitoring causes a duplicated data point to be sent the next minute, it should not repeat indefinitely every minute.

Actual Result

The error caused by the duplicated data point puts the OpenTelemetry exporter into an infinite error state, which is resolved only by deleting the pod.

Collector version

v0.95.0

Environment information

Environment

Google Kubernetes Engine
Base image: ubi9/ubi
Compiler (if manually compiled): go 1.21.7

OpenTelemetry Collector configuration

receivers:  
  prometheus/otel-metrics:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 30s
          static_configs:
            - targets: ['127.0.0.1:8888']
          metrics_path: /metrics

processors:
  resource/metrics:
    attributes:
      - key: k8s.namespace.name
        value: namespace-name
        action: upsert
      - key: k8s.pod.name
        value: pod-name-tfx4k # The name of the POD - unique name after each recreation
        action: upsert
      - key: k8s.container.name
        value: otel-collector
        action: upsert
      - key: cloud.availability_zone
        value: us-central1-c
        action: upsert
      - key: service.name
        action: delete
      - key: service.version
        action: delete
      - key: service.instance.id
        action: delete
  metricstransform/gmp_otel:
    transforms:
    - include: ^(.*)$$
      match_type: regexp
      action: update
      new_name: otel_internal_$${1}
    - include: \.*
      match_type: regexp
      action: update
      operations:
        - action: add_label
          new_label: source_project_id
          new_value: gke-cluster-project
        - action: add_label
          new_label: pod_name
          new_value: pod-name-tfx4k # The name of the POD - unique name after each recreation
        - action: add_label
          new_label: container_name
          new_value: es-exporter
    - include: ^(.+)_(seconds|bytes)_(.+)$$
      match_type: regexp
      action: update
      new_name: $${1}_$${3}
    - include: ^(.+)_(bytes|total|seconds)$$
      match_type: regexp
      action: update
      new_name: $${1}
  resourcedetection/metrics:
    detectors: [env, gcp]
    timeout: 2s
    override: false
  batch/metrics:
    send_batch_size: 200
    timeout: 5s
    send_batch_max_size: 200
  memory_limiter:
    limit_mib: 297
    spike_limit_mib: 52
    check_interval: 1s
exporters:
  googlemanagedprometheus/otel-metrics:
    project: project-for-metrics
    timeout: 15s
    sending_queue:
      enabled: false
      num_consumers: 10
      queue_size: 5000
    metric:
      prefix: prometheus.googleapis.com
      add_metric_suffixes: False
  logging:
    loglevel: debug
    sampling_initial: 1
    sampling_thereafter: 500
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777
service:
  telemetry:
    logs:
      level: "info"
  extensions: [health_check]
  pipelines:
    metrics/otel:
      receivers: [prometheus/otel-metrics]
      processors: [batch/metrics, resourcedetection/metrics, metricstransform/gmp_otel, resource/metrics]
      exporters: [googlemanagedprometheus/otel-metrics]

Log output

First 11 error logs:
2024-02-29T08:22:00.043Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-c,job:,instance:,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-12]: prometheus.googleapis.com/otel_internal_scrape_series_added/gauge{pod_name:pod-name-tfx4k,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver}\\nerror details: name = Unknown  desc = total_point_count:13  success_point_count:12  errors:{status:{code:9}  point_count:1} "rejected_items": 28}

2024-02-29T08:23:00.169Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,cluster:gke-cluster-name,location:us-central1-c,instance:,namespace:namespace-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_batch_batch_size_trigger_send/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,processor:batch/metrics}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:24:00.330Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{namespace:namespace-name,job:,cluster:gke-cluster-name,location:us-central1-c,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_scrape_samples_scraped/gauge{otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:25:00.450Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,job:,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_receiver_accepted_metric_points/counter{source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,transport:http,service_instance_id:instance-id-c4ef0b0aaf35,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver,service_name:otel-collector-ngp-monitoring,receiver:prometheus/app-metrics,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:26:00.599Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,cluster:gke-cluster-name,namespace:namespace-name,job:,location:us-central1-c} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_send_failed_metric_points/counter{source_project_id:gke-cluster-project,container_name:es-exporter,exporter:googlemanagedprometheus/app-metrics,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,otel_scope_name:otelcol/prometheusreceiver,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,pod_name:pod-name-tfx4k,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:27:00.737Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-c,job:,cluster:gke-cluster-name,namespace:namespace-name,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_grpc_io_client_completed_rpcs/counter{service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,grpc_client_status:INVALID_ARGUMENT,source_project_id:gke-cluster-project,service_name:otel-collector-ngp-monitoring,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,grpc_client_method:google.monitoring.v3.MetricService/CreateTimeSeries}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:28:00.895Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,job:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_grpc_io_client_completed_rpcs/counter{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,grpc_client_method:google.monitoring.v3.MetricService/CreateTimeSeries,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,otel_scope_name:otelcol/prometheusreceiver,grpc_client_status:INVALID_ARGUMENT,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:29:01.042Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,location:us-central1-c,cluster:gke-cluster-name,namespace:namespace-name,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_queue_capacity/gauge{service_name:otel-collector-ngp-monitoring,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,otel_scope_name:otelcol/prometheusreceiver,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,exporter:googlecloud/app-traces}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:30:01.166Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,job:,location:us-central1-c,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_accepted_metric_points/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,processor:memory_limiter,container_name:es-exporter,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:30:56.284Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_sent_metric_points/counter{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,exporter:googlemanagedprometheus/app-metrics,source_project_id:gke-cluster-project,otel_scope_name:otelcol/prometheusreceiver,container_name:es-exporter,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,pod_name:pod-name-tfx4k,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

2024-02-29T08:31:56.439Z\terror\texporterhelper/common.go:213\tExporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.\t{"kind": "exporter "data_type": "metrics "name": "googlemanagedprometheus/otel-metrics "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{cluster:gke-cluster-name,instance:,job:,namespace:namespace-name,location:us-central1-c} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_batch_metadata_cardinality/gauge{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,service_name:otel-collector-ngp-monitoring,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,otel_scope_name:otelcol/prometheusreceiver,container_name:es-exporter}\\nerror details: name = Unknown  desc = total_point_count:37  success_point_count:36  errors:{status:{code:9}  point_count:1} "rejected_items": 35}

Additional context

I ran some additional tests and it looks like the googlemanagedprometheus timeout could be related to the problem.
With a 10s timeout, 5 out of 12 started pods ended up with errors.
With a 15s timeout, 1 out of 20 started pods ended up with errors.
So there may be a problem with the export timeout, but this behavior of endless errors does not look correct.
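As a possible mitigation (not a fix for the duplicate-point error itself), the exporterhelper message in the logs above suggests enabling the sending queue so that a single failed request does not reject the whole batch. A sketch of that change on top of our exporter config, with an illustrative longer timeout, would be:

exporters:
  googlemanagedprometheus/otel-metrics:
    project: project-for-metrics
    timeout: 30s          # illustrative value; we only tested 10s and 15s
    sending_queue:
      enabled: true       # currently false in our config
      num_consumers: 10
      queue_size: 5000
    metric:
      prefix: prometheus.googleapis.com
      add_metric_suffixes: False

This only changes how failures are handled; the duplicated data point would presumably still be produced each minute.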

Histogram for 10s timeout:
[screenshot: error histogram, 10s timeout]

Histogram for 15s timeout:
[screenshot: error histogram, 15s timeout]

Almost 2 hours of errors later (the same pod):
[screenshot: error histogram, same pod, ~2 hours later]

A blue rectangle marks a new pod starting; a red rectangle marks the error described in this issue.
All pods are exactly the same, differing only in the random suffix of their names.
