Component(s)
exporter/googlemanagedprometheus
What happened?
Description
Sometimes, when a pod running the OpenTelemetry Collector starts up in GKE, it reports the error "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric." every subsequent minute (the scrape interval is 30s). After restarting the pod the problem disappears; after a few more restarts it appears again.
It looks like all metrics are sent to Google Monitoring correctly, but every minute additional duplicated data points are added to the batch, which causes the errors.
Steps to Reproduce
Create a pod in Google Kubernetes Engine running the OpenTelemetry Collector with a config similar to ours (a minimal manifest sketch follows). If the problem does not occur, delete the pod and recreate it. Repeat until you see the error logs appearing consistently.
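A minimal manifest sketch for reproducing; the pod name, image, ConfigMap name, and mount path are illustrative assumptions only (our real deployment uses a custom build based on ubi9/ubi):
# Hypothetical reproduction manifest; names and image are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: otel-collector-repro
  namespace: namespace-name
spec:
  containers:
    - name: otel-collector
      # Assumption: upstream contrib image instead of our ubi9-based build.
      image: otel/opentelemetry-collector-contrib:0.95.0
      args: ["--config=/etc/otel/config.yaml"]
      volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
  volumes:
    - name: otel-config
      configMap:
        # Assumption: a ConfigMap holding the collector configuration shown below.
        name: otel-collector-config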
Expected Result
If a problem with writing a data point to Google Monitoring causes a duplicated data point to be sent the next minute, it should not keep repeating indefinitely every minute.
Actual Result
A duplicated data point puts the OpenTelemetry exporter into a state where the error repeats indefinitely, and it is only resolved by deleting the pod.
Collector version
v0.95.0
Environment information
Environment
Google Kubernetes Engine
Base image: ubi9/ubi
Compiler (if manually compiled): go 1.21.7
OpenTelemetry Collector configuration
receivers:
  prometheus/otel-metrics:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 30s
          static_configs:
            - targets: ['127.0.0.1:8888']
          metrics_path: /metrics
processors:
  resource/metrics:
    attributes:
      - key: k8s.namespace.name
        value: namespace-name
        action: upsert
      - key: k8s.pod.name
        value: pod-name-tfx4k # The name of the POD - unique name after each recreation
        action: upsert
      - key: k8s.container.name
        value: otel-collector
        action: upsert
      - key: cloud.availability_zone
        value: us-central1-c
        action: upsert
      - key: service.name
        action: delete
      - key: service.version
        action: delete
      - key: service.instance.id
        action: delete
  metricstransform/gmp_otel:
    transforms:
      - include: ^(.*)$$
        match_type: regexp
        action: update
        new_name: otel_internal_$${1}
      - include: \.*
        match_type: regexp
        action: update
        operations:
          - action: add_label
            new_label: source_project_id
            new_value: gke-cluster-project
          - action: add_label
            new_label: pod_name
            new_value: pod-name-tfx4k # The name of the POD - unique name after each recreation
          - action: add_label
            new_label: container_name
            new_value: es-exporter
      - include: ^(.+)_(seconds|bytes)_(.+)$$
        match_type: regexp
        action: update
        new_name: $${1}_$${3}
      - include: ^(.+)_(bytes|total|seconds)$$
        match_type: regexp
        action: update
        new_name: $${1}
  resourcedetection/metrics:
    detectors: [env, gcp]
    timeout: 2s
    override: false
  batch/metrics:
    send_batch_size: 200
    timeout: 5s
    send_batch_max_size: 200
  memory_limiter:
    limit_mib: 297
    spike_limit_mib: 52
    check_interval: 1s
exporters:
  googlemanagedprometheus/otel-metrics:
    project: project-for-metrics
    timeout: 15s
    sending_queue:
      enabled: false
      num_consumers: 10
      queue_size: 5000
    metric:
      prefix: prometheus.googleapis.com
      add_metric_suffixes: False
  logging:
    loglevel: debug
    sampling_initial: 1
    sampling_thereafter: 500
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: 0.0.0.0:1777
service:
  telemetry:
    logs:
      level: "info"
  extensions: [health_check]
  pipelines:
    metrics/otel:
      receivers: [prometheus/otel-metrics]
      processors: [batch/metrics, resourcedetection/metrics, metricstransform/gmp_otel, resource/metrics]
      exporters: [googlemanagedprometheus/otel-metrics]
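Note: the prometheus/otel-metrics receiver above scrapes the collector's own telemetry endpoint on port 8888. A sketch of the implicit setting, assuming the collector default is in effect since service::telemetry::metrics is not set explicitly above:
service:
  telemetry:
    metrics:
      # Not present in our config; shown only to make the self-scrape target explicit.
      # Assumption: the default address is used, which is what the
      # prometheus/otel-metrics receiver reaches at 127.0.0.1:8888.
      address: ":8888"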
Log output
First 11 error logs:
2024-02-29T08:22:00.043Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-c,job:,instance:,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-12]: prometheus.googleapis.com/otel_internal_scrape_series_added/gauge{pod_name:pod-name-tfx4k,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver}\nerror details: name = Unknown desc = total_point_count:13 success_point_count:12 errors:{status:{code:9} point_count:1}", "rejected_items": 28}
2024-02-29T08:23:00.169Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,cluster:gke-cluster-name,location:us-central1-c,instance:,namespace:namespace-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_batch_batch_size_trigger_send/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,processor:batch/metrics}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:24:00.330Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{namespace:namespace-name,job:,cluster:gke-cluster-name,location:us-central1-c,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_scrape_samples_scraped/gauge{otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:25:00.450Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,job:,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_receiver_accepted_metric_points/counter{source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,transport:http,service_instance_id:instance-id-c4ef0b0aaf35,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,container_name:es-exporter,otel_scope_name:otelcol/prometheusreceiver,service_name:otel-collector-ngp-monitoring,receiver:prometheus/app-metrics,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:26:00.599Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,cluster:gke-cluster-name,namespace:namespace-name,job:,location:us-central1-c} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_send_failed_metric_points/counter{source_project_id:gke-cluster-project,container_name:es-exporter,exporter:googlemanagedprometheus/app-metrics,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,otel_scope_name:otelcol/prometheusreceiver,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,pod_name:pod-name-tfx4k,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:27:00.737Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:us-central1-c,job:,cluster:gke-cluster-name,namespace:namespace-name,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_grpc_io_client_completed_rpcs/counter{service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,grpc_client_status:INVALID_ARGUMENT,source_project_id:gke-cluster-project,service_name:otel-collector-ngp-monitoring,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,grpc_client_method:google.monitoring.v3.MetricService/CreateTimeSeries}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:28:00.895Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,job:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_grpc_io_client_completed_rpcs/counter{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,grpc_client_method:google.monitoring.v3.MetricService/CreateTimeSeries,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,otel_scope_name:otelcol/prometheusreceiver,grpc_client_status:INVALID_ARGUMENT,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:29:01.042Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,location:us-central1-c,cluster:gke-cluster-name,namespace:namespace-name,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_queue_capacity/gauge{service_name:otel-collector-ngp-monitoring,service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,pod_name:pod-name-tfx4k,otel_scope_name:otelcol/prometheusreceiver,service_instance_id:instance-id-c4ef0b0aaf35,container_name:es-exporter,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,exporter:googlecloud/app-traces}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:30:01.166Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{instance:,job:,location:us-central1-c,namespace:namespace-name,cluster:gke-cluster-name} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_accepted_metric_points/counter{service_name:otel-collector-ngp-monitoring,otel_scope_name:otelcol/prometheusreceiver,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,processor:memory_limiter,container_name:es-exporter,source_project_id:gke-cluster-project,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,service_version:0.95.0-rc-9-g87a8be8-20240228-145123}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:30:56.284Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{job:,namespace:namespace-name,cluster:gke-cluster-name,location:us-central1-c,instance:} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_exporter_sent_metric_points/counter{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,exporter:googlemanagedprometheus/app-metrics,source_project_id:gke-cluster-project,otel_scope_name:otelcol/prometheusreceiver,container_name:es-exporter,service_name:otel-collector-ngp-monitoring,service_instance_id:instance-id-c4ef0b0aaf35,pod_name:pod-name-tfx4k,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
2024-02-29T08:31:56.439Z	error	exporterhelper/common.go:213	Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus/otel-metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{cluster:gke-cluster-name,instance:,job:,namespace:namespace-name,location:us-central1-c} timeSeries[0-36]: prometheus.googleapis.com/otel_internal_otelcol_processor_batch_metadata_cardinality/gauge{service_version:0.95.0-rc-9-g87a8be8-20240228-145123,source_project_id:gke-cluster-project,service_name:otel-collector-ngp-monitoring,otel_scope_version:0.95.0-rc-9-g87a8be8-20240228-145123,pod_name:pod-name-tfx4k,service_instance_id:instance-id-c4ef0b0aaf35,otel_scope_name:otelcol/prometheusreceiver,container_name:es-exporter}\nerror details: name = Unknown desc = total_point_count:37 success_point_count:36 errors:{status:{code:9} point_count:1}", "rejected_items": 35}
Additional context
I ran some additional tests, and it looks like the googlemanagedprometheus timeout could be related to the problem.
With a 10s timeout, 5 out of 12 started pods ended up with errors.
With a 15s timeout, 1 out of 20 started pods ended up with errors.
So there may be a problem related to the export timeout, but this behavior of endlessly repeating errors still does not look correct.
Histogram for 15s timeout:
Almost 2 hours of errors later (the same pod):
Blue rectangles mark when a new pod was started; red rectangles mark the error described in this issue.
All pods are exactly the same, differing only in the random suffix in their names.
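For reference, a minimal sketch of a configuration change that could be tested, following the exporter's own log suggestion to enable sending_queue; the concrete values are assumptions taken from our current config, not a confirmed fix:
exporters:
  googlemanagedprometheus/otel-metrics:
    project: project-for-metrics
    timeout: 15s          # 10s vs 15s was the variable in the tests above
    sending_queue:
      enabled: true       # currently false in our config; the exporter log suggests enabling it
      num_consumers: 10
      queue_size: 5000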