Component(s)
collector
receiver/prometheus
What happened?
Description
We're setting up a single collector pod in a StatefulSet configured to monitor "all the rest" of our OTEL components ... this collector is called the otel-collector-cluster-agent, and it uses the TargetAllocator to monitor for ServiceMonitor jobs that have specific labels. We currently have two different ServiceMonitors: one for collecting otelcol.* metrics from the collectors, and one for collecting opentelemetry_.* metrics from the Target Allocators.
We are seeing metrics reported from the TargetAllocator pods duplicated into DataPoints that refer to both the right and the wrong ServiceMonitor:
Metric #14
Descriptor:
-> Name: opentelemetry_allocator_targets
-> Description: Number of targets discovered.
-> Unit:
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> container: Str(ta-container)
-> endpoint: Str(targetallocation)
-> job_name: Str(serviceMonitor/otel/otel-collector-collectors/0) <<<< WRONG
-> namespace: Str(otel)
-> pod: Str(otel-collector-cluster-agent-targetallocator-bdf46447d-6stx7)
-> service: Str(otel-collector-cluster-agent-targetallocator)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-05-02 16:16:00.917 +0000 UTC
Value: 36.000000
NumberDataPoints #1
Data point attributes:
-> container: Str(ta-container)
-> endpoint: Str(targetallocation)
-> job_name: Str(serviceMonitor/otel/otel-collector-target-allocators/0) <<<< RIGHT
-> namespace: Str(otel)
-> pod: Str(otel-collector-cluster-agent-targetallocator-bdf46447d-6stx7)
-> service: Str(otel-collector-cluster-agent-targetallocator)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-05-02 16:16:00.917 +0000 UTC
Value: 36.000000
When we delete the otel-collector-collectors ServiceMonitor, the behavior does not change... which is wild. However, if we delete the entire stack and namespace and then re-create it without the second ServiceMonitor, the data is correct... until we create the second ServiceMonitor, at which point it goes bad again.
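One way to check whether the mis-assignment originates in the allocator itself is to query its HTTP API directly. This is a minimal sketch, assuming a port-forward to the TargetAllocator on its 8080 port; the /jobs and /jobs/<job>/targets endpoints are part of the allocator's HTTP API, but the URL-escaped job name and the collector_id value (the operator's default <name>-collector-0 pod name) are assumptions that may need adjusting:

% kubectl -n otel port-forward deploy/otel-collector-cluster-agent-targetallocator 8080:8080 &
# List the jobs the allocator knows about, with links to their target lists:
% curl -s localhost:8080/jobs | jq .
# Show what the allocator assigned to our single collector pod for the
# collectors job; if the TA pod's address also appears here, the duplication
# starts inside the allocator rather than in the prometheus receiver:
% curl -s --get "localhost:8080/jobs/serviceMonitor%2Fotel%2Fotel-collector-collectors%2F0/targets" \
    --data-urlencode "collector_id=otel-collector-cluster-agent-collector-0" | jq .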
Steps to Reproduce
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-cluster-agent
  namespace: otel
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString
  config: |-
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
    receivers:
      prometheus:
        config:
          scrape_configs: []
        report_extra_scrape_metrics: true
    service:
      extensions:
        - health_check
        - pprof
      pipelines:
        metrics/debug:
          exporters:
            - debug/verbose
          receivers:
            - prometheus
      telemetry:
        logs:
          level: 'info'
  deploymentUpdateStrategy: {}
  env:
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: node.name=$(KUBE_NODE_NAME)
  image: otel/opentelemetry-collector-contrib:0.98.0
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: statefulset
  nodeSelector:
    kubernetes.io/os: linux
  observability:
    metrics: {}
  podDisruptionBudget:
    maxUnavailable: 1
  priorityClassName: otel-collector
  replicas: 1
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: "1"
      memory: 2Gi
  targetAllocator:
    allocationStrategy: consistent-hashing
    enabled: true
    filterStrategy: relabel-config
    image: otel/target-allocator:0.100.0-beta@sha256:fdd7e7f5f8f3903a3de229132ac0cf98c8857b7bdaf3451a764f550f7b037c26
    observability:
      metrics: {}
    podDisruptionBudget:
      maxUnavailable: 1
    prometheusCR:
      enabled: true
      podMonitorSelector:
        monitoring.xx.com/isolated: otel-cluster-agent
      scrapeInterval: 1m0s
      serviceMonitorSelector:
        monitoring.xx.com/isolated: otel-cluster-agent
    replicas: 1
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 256Mi
  updateStrategy: {}
  upgradeStrategy: automatic
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: otel-collector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: otel-collector
    app.kubernetes.io/version: ""
    monitoring.xx.com/isolated: otel-cluster-agent
  name: otel-collector-collectors
  namespace: otel
spec:
  endpoints:
    - interval: 15s
      metricRelabelings:
        - action: keep
          regex: otelcol.*
          sourceLabels:
            - __name__
      port: monitoring
      scrapeTimeout: 5s
  namespaceSelector:
    matchNames:
      - otel
  selector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector
      app.kubernetes.io/managed-by: opentelemetry-operator
      operator.opentelemetry.io/collector-monitoring-service: Exists
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: otel-collector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: otel-collector
    app.kubernetes.io/version: ""
    monitoring.xx.com/isolated: otel-cluster-agent
  name: otel-collector-target-allocators
  namespace: otel
spec:
  endpoints:
    - interval: 15s
      metricRelabelings:
        - action: keep
          regex: opentelemetry.*
          sourceLabels:
            - __name__
      port: targetallocation
      scrapeTimeout: 5s
  namespaceSelector:
    matchNames:
      - otel
  selector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-targetallocator
      app.kubernetes.io/managed-by: opentelemetry-operator
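Both ServiceMonitors carry the monitoring.xx.com/isolated: otel-cluster-agent label that the targetAllocator.prometheusCR.serviceMonitorSelector above matches on, so the allocator should pick up exactly these two. A quick sanity check (just a sketch):

% kubectl -n otel get servicemonitors -l monitoring.xx.com/isolated=otel-cluster-agent -o name
servicemonitor.monitoring.coreos.com/otel-collector-collectors
servicemonitor.monitoring.coreos.com/otel-collector-target-allocators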
This creates a /scrape_configs response that looks like this:
% curl --silent localhost:8080/scrape_configs | jq .
{
  "serviceMonitor/otel/otel-collector-collectors/0": {
    "enable_compression": true,
    "enable_http2": true,
    "follow_redirects": true,
    "honor_timestamps": true,
    "job_name": "serviceMonitor/otel/otel-collector-collectors/0",
    "kubernetes_sd_configs": [
      {
        "enable_http2": true,
        "follow_redirects": true,
        "kubeconfig_file": "",
        "namespaces": {
          "names": [
            "otel"
          ],
          "own_namespace": false
        },
        "role": "endpointslice"
      }
    ],
    "metric_relabel_configs": [
      {
        "action": "keep",
        "regex": "otelcol.*",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__name__"
        ]
      }
    ],
    "metrics_path": "/metrics",
    "relabel_configs": [
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "job"
        ],
        "target_label": "__tmp_prometheus_job_name"
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-collector);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_component",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_component"
        ]
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-operator);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_managed_by",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_managed_by"
        ]
      },
      {
        "action": "keep",
        "regex": "(Exists);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_operator_opentelemetry_io_collector_monitoring_service",
          "__meta_kubernetes_service_labelpresent_operator_opentelemetry_io_collector_monitoring_service"
        ]
      },
      {
        "action": "keep",
        "regex": "monitoring",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_port_name"
        ]
      },
      {
        "action": "replace",
        "regex": "Node;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "node"
      },
      {
        "action": "replace",
        "regex": "Pod;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_namespace"
        ],
        "target_label": "namespace"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "service"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_container_name"
        ],
        "target_label": "container"
      },
      {
        "action": "drop",
        "regex": "(Failed|Succeeded)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_phase"
        ]
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "job"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "monitoring",
        "separator": ";",
        "target_label": "endpoint"
      },
      {
        "action": "hashmod",
        "modulus": 1,
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__address__"
        ],
        "target_label": "__tmp_hash"
      },
      {
        "action": "keep",
        "regex": "$(SHARD)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__tmp_hash"
        ]
      }
    ],
    "scheme": "http",
    "scrape_interval": "15s",
    "scrape_protocols": [
      "OpenMetricsText1.0.0",
      "OpenMetricsText0.0.1",
      "PrometheusText0.0.4"
    ],
    "scrape_timeout": "5s",
    "track_timestamps_staleness": false
  },
  "serviceMonitor/otel/otel-collector-target-allocators/0": {
    "enable_compression": true,
    "enable_http2": true,
    "follow_redirects": true,
    "honor_timestamps": true,
    "job_name": "serviceMonitor/otel/otel-collector-target-allocators/0",
    "kubernetes_sd_configs": [
      {
        "enable_http2": true,
        "follow_redirects": true,
        "kubeconfig_file": "",
        "namespaces": {
          "names": [
            "otel"
          ],
          "own_namespace": false
        },
        "role": "endpointslice"
      }
    ],
    "metric_relabel_configs": [
      {
        "action": "keep",
        "regex": "opentelemetry.*",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__name__"
        ]
      }
    ],
    "metrics_path": "/metrics",
    "relabel_configs": [
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "job"
        ],
        "target_label": "__tmp_prometheus_job_name"
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-targetallocator);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_component",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_component"
        ]
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-operator);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_managed_by",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_managed_by"
        ]
      },
      {
        "action": "keep",
        "regex": "targetallocation",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_port_name"
        ]
      },
      {
        "action": "replace",
        "regex": "Node;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "node"
      },
      {
        "action": "replace",
        "regex": "Pod;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_namespace"
        ],
        "target_label": "namespace"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "service"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_container_name"
        ],
        "target_label": "container"
      },
      {
        "action": "drop",
        "regex": "(Failed|Succeeded)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_phase"
        ]
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "job"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "targetallocation",
        "separator": ";",
        "target_label": "endpoint"
      },
      {
        "action": "hashmod",
        "modulus": 1,
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__address__"
        ],
        "target_label": "__tmp_hash"
      },
      {
        "action": "keep",
        "regex": "$(SHARD)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__tmp_hash"
        ]
      }
    ],
    "scheme": "http",
    "scrape_interval": "15s",
    "scrape_protocols": [
      "OpenMetricsText1.0.0",
      "OpenMetricsText0.0.1",
      "PrometheusText0.0.4"
    ],
    "scrape_timeout": "5s",
    "track_timestamps_staleness": false
  }
}
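To compare the two jobs' selection rules side by side, a jq filter like the following (a sketch against the output above) condenses each job down to its keep rules:

% curl -s localhost:8080/scrape_configs | \
    jq 'to_entries[] | {job: .key, keep: [.value.relabel_configs[] | select(.action == "keep") | .regex]}'

The two jobs keep disjoint service components ((opentelemetry-collector) vs. (opentelemetry-targetallocator)) and disjoint port names (monitoring vs. targetallocation), so nothing in the generated scrape configs should allow the TA pod to be scraped under the collectors job.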
Expected Result
We should see datapoints for opentelemetry_.* metrics that come only from the target allocator pods and are attributed exactly once, meaning one DataPoint per target pod, and that's it:
Metric #14
Descriptor:
-> Name: opentelemetry_allocator_targets
-> Description: Number of targets discovered.
-> Unit:
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> container: Str(ta-container)
-> endpoint: Str(targetallocation)
-> job_name: Str(serviceMonitor/otel/otel-collector-target-allocators/0) <<<< RIGHT
-> namespace: Str(otel)
-> pod: Str(otel-collector-cluster-agent-targetallocator-bdf46447d-6stx7)
-> service: Str(otel-collector-cluster-agent-targetallocator)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-05-02 16:16:00.917 +0000 UTC
Value: 36.000000
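To count how many datapoints actually come out per scrape, grepping the debug/verbose exporter output from the collector logs works; a rough sketch, assuming the operator's default <name>-collector StatefulSet naming:

% kubectl -n otel logs statefulset/otel-collector-cluster-agent-collector \
    | grep -A 20 'Name: opentelemetry_allocator_targets' | grep 'job_name'

In the expected state this prints exactly one job_name line per scrape interval (the target-allocators one); in the buggy state it prints two.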
Collector version
0.98.0
Environment information
Environment
OS: BottleRocket 1.19.2
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response
Kubernetes Version
1.28
Operator version
0.98.0