Description
Component(s)
connector/failover
What happened?
Description
The failover connector periodically retries higher priority pipelines that have failed, so that it can reinstate one of them as the stable pipeline if it starts working again. However, we observe that after a retry succeeds, the connector switches back to the lower priority pipeline even though the higher priority pipeline is working.
Steps to Reproduce
- Create a pipeline with two exporters, connected with a failover connector.
- Establish a job inserting logs so we can observe the output:
while true; do sleep 1; echo 'test' > logs; done
- Start a listener on each of the two receiving ports:
nc -l 127.0.0.1 4278 # the high priority exporter
nc -l 127.0.0.1 4279 # the low priority exporter
- Observe that logs correctly flow to the high priority exporter
- Shut down the high priority exporter
- Observe that logs correctly flow to the low priority exporter
- Restart the high priority exporter
nc -l 127.0.0.1 4278 # the high priority exporter
- Observe that logs correctly flow to the high priority exporter, but after a few seconds fall back to the low priority exporter, and then begin to flip-flop between the two
Expected Result
The logs should be stably redirected to the high priority exporter once it comes back online
Actual Result
The logs flip flop between the high and low priority exporters
Investigation
Adding more logging around the pipeline decisions shows that the lower priority pipeline is being re-inserted at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L105-L107
This happens because the retry loop only terminates for pipelines after the current pipeline, so the current pipeline itself is still included (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L96-L98). As a result, while the lower priority pipeline is active, the loop creates a job that makes it active again, even if a higher priority pipeline has been selected in the meantime.
Collector version
master (failover connector isn't released yet)
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
receivers:
  namedpipe:
    path: ./logs
exporters:
  syslog/a:
    endpoint: localhost
    port: 4278
    retry_on_failure:
      enabled: false
    tls:
      insecure: true
  syslog/b:
    endpoint: localhost
    port: 4279
    retry_on_failure:
      enabled: false
    tls:
      insecure: true
connectors:
  failover:
    retry_interval: 10s
    retry_gap: 3s
    priority_levels:
      - [logs/a]
      - [logs/b]
service:
  pipelines:
    logs:
      receivers: [namedpipe]
      exporters: [failover]
    logs/a:
      receivers: [failover]
      exporters: [syslog/a]
    logs/b:
      receivers: [failover]
      exporters: [syslog/b]
Log output
No response
Additional context
No response