[connector/failover] Failover connector erroneously flips back to lower priority pipelines  #32094

@sinkingpoint

Component(s)

connector/failover

What happened?

Description

The failover connector periodically retries higher priority pipelines that have failed, so that it can reinstate them as the stable pipeline if they start working again. However, we observe that after such a retry it goes on to reinstate the lower priority pipeline, even when the higher priority pipeline is working.

Steps to Reproduce

  1. Create a pipeline with two exporters, connected with a failover connector.
  2. Establish a job inserting logs so we can observe the output:
while true; do sleep 1; echo 'test' > logs; done
  3. Start a listener on each of the two receiving ports:
nc -l 127.0.0.1 4278 # the high priority exporter
nc -l 127.0.0.1 4279 # the low priority exporter
  4. Observe that logs correctly flow to the high priority exporter.
  5. Shut down the high priority exporter.
  6. Observe that logs correctly flow to the low priority exporter.
  7. Restart the high priority exporter:
nc -l 127.0.0.1 4278 # the high priority exporter
  8. Observe that logs correctly flow to the high priority exporter, but after a few seconds fall back to the low priority exporter, and then begin to flip-flop back and forth.

Expected Result

The logs should be stably redirected to the high priority exporter once it comes back online.

Actual Result

The logs flip-flop between the high and low priority exporters.

Investigation

Adding a bit more logging around pipeline decisions shows that the lower priority pipeline is being re-inserted at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L105-L107

This is because the retry loop only terminates for pipelines after the current pipeline, so the current pipeline itself is still included (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/failoverconnector/internal/state/pipeline_selector.go#L96-L98). This means that while the lower priority pipeline is active, the loop creates a job that makes it active again, even if we have since selected a higher priority pipeline.

Collector version

master (failover connector isn't released yet)

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  namedpipe:
    path: ./logs

exporters:
  syslog/a:
    endpoint: localhost
    port: 4278
    retry_on_failure:
      enabled: false
    tls:
      insecure: true
  syslog/b:
    endpoint: localhost
    port: 4279
    retry_on_failure:
      enabled: false
    tls:
      insecure: true

connectors:
  failover:
    retry_interval: 10s
    retry_gap: 3s
    priority_levels:
      - [logs/a]
      - [logs/b]

service:
  pipelines:
    logs:
      receivers: [namedpipe]
      exporters: [failover]
    logs/a:
      receivers: [failover]
      exporters: [syslog/a]
    logs/b:
      receivers: [failover]
      exporters: [syslog/b]

Log output

No response

Additional context

No response
