
Commit 92fae5e

Fiery-Fenix authored and AkhigbeEromo committed
[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings (open-telemetry#36094)
#### Description

##### Problem statement

The `loadbalancing` exporter is a wrapper that creates and manages a set of actual `otlp` exporters. Those `otlp` exporters share the configuration parameters defined at the `loadbalancing` exporter level, including the `sending_queue` configuration. The only difference is the `endpoint` parameter, which is substituted by the `loadbalancing` exporter itself.

This means that the `sending_queue`, `retry_on_failure` and `timeout` settings can be defined only on the `otlp` sub-exporters, while the top-level `loadbalancing` exporter is missing all of those settings.

This configuration approach produces several issues that users have already reported:

* It is impossible to use a Persistent Queue in the `loadbalancing` exporter (see open-telemetry#16826). This happens because the `otlp` sub-exporters share the same configuration, including the queue configuration, so they all use the same `storage` instance at the same time, which is not possible at the moment.
* Data is lost even when `sending_queue` is configured (see open-telemetry#35378). This happens because the queue is defined at the level of the `otlp` sub-exporters. If a sub-exporter cannot flush data from its queue (for example, because the endpoint is no longer available), the only option is to discard the queued data: there is no higher-level queue or persistent storage to which data can be returned in case of permanent failure.

There might be other issues, already tracked, that relate to the current configuration approach.

##### Proposed solution

The easiest way to solve the issues above is to use the standard approach for queue, retry and timeout configuration via `exporterhelper`. This brings queue, retry and timeout functionality to the top level of the `loadbalancing` exporter instead of the `otlp` sub-exporters. With respect to the issues mentioned, it brings:

* A single Persistent Queue that is used by all `otlp` sub-exporters (not directly, of course).
* The queue is not discarded or destroyed if any (or all) of the endpoints become unreachable; the top-level queue keeps data until new endpoints are available.
* Scale-up and scale-down events for the next layer of OpenTelemetry Collectors in K8s environments become more predictable and no longer include data loss (a potential fix for open-telemetry#33959). There is still a significant chance of inconsistency, where some data is sent to an incorrect endpoint, but this is already a better state than the current one.

##### Noticeable changes

* The `loadbalancing` exporter now uses `exporterhelper` at the top level, with all the functionality it supports.
* `sending_queue` is automatically disabled on the `otlp` exporters when it is already present on the top-level `loadbalancing` exporter. This change prevents data loss on the `otlp` exporters, because the queue there does not provide the expected result. It also prevents potential misconfiguration on the user side and, as a result, irrelevant issue reports.
* The `exporter` attribute for metrics generated by the `otlp` sub-exporters now includes the endpoint, for better visibility and to separate them from the top-level `loadbalancing` exporter. It was `"exporter": "loadbalancing"` and is now `"exporter": "loadbalancing/127.0.0.1:4317"`.
* Logs generated by the `otlp` sub-exporters now include an additional `endpoint` attribute holding the endpoint value, for the same reasons as for metrics.

#### Link to tracking issue

Fixes open-telemetry#35378
Fixes open-telemetry#16826

#### Testing

The proposed changes were heavily tested on a large K8s environment with a set of different scale-up/scale-down events using a persistent queue configuration. No data loss was detected and everything worked as expected.

#### Documentation

`README.md` was updated to reflect the new configuration parameters available. The sample `config.yaml` was updated as well.
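The two-level behaviour described in the proposed solution can be sketched without any Collector dependencies. This is a minimal illustration under stated assumptions, not the exporter's implementation: `subExporter`, `topLevel`, `send`, `flush` and `resolve` are hypothetical names, the real queue and retry logic lives in `exporterhelper`, and the real exporter picks an endpoint by routing key rather than taking the first healthy one.

```go
package main

import (
	"errors"
	"fmt"
)

// subExporter is a hypothetical stand-in for one per-endpoint otlp exporter.
type subExporter struct {
	endpoint string
	healthy  bool
}

// send fails when the endpoint is unreachable, as a sub-exporter would
// after exhausting its own retries.
func (s *subExporter) send(item string) error {
	if !s.healthy {
		return errors.New("endpoint unavailable: " + s.endpoint)
	}
	return nil
}

// topLevel is a hypothetical stand-in for the loadbalancing exporter with a
// top-level queue: items that cannot be delivered stay queued.
type topLevel struct {
	queue   []string
	resolve func() []*subExporter // returns the current endpoint list
}

// flush tries to deliver every queued item to the first endpoint that
// accepts it, keeps undelivered items for the next attempt, and returns
// the number of delivered items.
func (t *topLevel) flush() int {
	delivered := 0
	var remaining []string
	for _, item := range t.queue {
		sent := false
		for _, e := range t.resolve() {
			if e.send(item) == nil {
				sent = true
				break
			}
		}
		if sent {
			delivered++
		} else {
			remaining = append(remaining, item)
		}
	}
	t.queue = remaining
	return delivered
}

func main() {
	backends := []*subExporter{{endpoint: "backend-1:4317"}}
	lb := &topLevel{
		queue:   []string{"span-a", "span-b"},
		resolve: func() []*subExporter { return backends },
	}

	// All endpoints are down: nothing is delivered, nothing is dropped.
	fmt.Println(lb.flush(), len(lb.queue)) // prints "0 2"

	// A new healthy endpoint appears (for example after a scale-up event).
	backends = append(backends, &subExporter{endpoint: "backend-2:4317", healthy: true})
	fmt.Println(lb.flush(), len(lb.queue)) // prints "2 0"
}
```

The point of the sketch is only the ownership of the queue: because it sits above the per-endpoint exporters, a change in the endpoint list does not discard queued data.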
1 parent 8efae04 commit 92fae5e

12 files changed, +354 -89 lines changed
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Use this changelog template to create an entry for release notes.
+
+# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
+change_type: enhancement
+
+# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
+component: loadbalancingexporter
+
+# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
+note: Adding sending_queue, retry_on_failure and timeout settings to loadbalancing exporter configuration
+
+# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
+issues: [35378,16826]
+
+# (Optional) One or more lines of additional information to render under the primary note.
+# These lines will be padded with 2 spaces and then inserted directly into the document.
+# Use pipe (|) for multiline entries.
+subtext: |
+  When switching to the top-level sending_queue configuration, users should carefully review the queue size.
+  In some rare cases, setting the top-level queue size to n*queueSize might not be enough to prevent data loss.
+
+# If your change doesn't affect end users or the exported elements of any package,
+# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
+# Optional: The change log or logs in which this entry should be included.
+# e.g. '[user]' or '[user, api]'
+# Include 'user' if the change is relevant to end users.
+# Include 'api' if there is a change to a library API.
+# Default: '[user]'
+change_logs: [user]
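The sizing hint in the subtext can be written out as a small calculation. This is only a sketch of the `n*queueSize` starting point: `topLevelQueueSize` is a hypothetical helper, and the figures (4 endpoints, 1000 items per sub-exporter queue) are taken from the README examples in this commit rather than from any recommendation. As the changelog warns, the result may still be too small in some cases.

```go
package main

import "fmt"

// topLevelQueueSize returns the n*queueSize starting point mentioned in the
// changelog: the combined capacity of the per-endpoint queues that the
// single top-level queue replaces.
func topLevelQueueSize(endpoints, perExporterQueueSize int) int {
	return endpoints * perExporterQueueSize
}

func main() {
	// 4 static backends, each sub-exporter previously queueing 1000 items.
	fmt.Println(topLevelQueueSize(4, 1000)) // prints 4000
}
```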

exporter/loadbalancingexporter/README.md

Lines changed: 97 additions & 6 deletions
@@ -48,14 +48,39 @@ This also supports service name based exporting for traces. If you have two or m
 
 ## Resilience and scaling considerations
 
-The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`, `dns`, `k8s`), create one exporter per endpoint. The exporter conforms to its published configuration regarding sending queue and retry mechanisms. Importantly, the `loadbalancingexporter` will not attempt to re-route data to a healthy endpoint on delivery failure, and data loss is therefore possible if the exporter's target remains unavailable once redelivery is exhausted. Due consideration needs to be given to the exporter queue and retry configuration when running in a highly elastic environment.
+The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`, `dns`, `k8s`), create one `otlp` exporter per endpoint. Each level of exporters, the `loadbalancingexporter` itself and all sub-exporters (one per endpoint), has its own queue, timeout and retry mechanisms. Importantly, the `loadbalancingexporter`, by default, will NOT attempt to re-route data to a healthy endpoint on delivery failure, because the in-memory queue, retry and timeout settings are disabled by default ([more details on queuing, retry and timeout default settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)).
 
-* When using the `static` resolver and a target is unavailable, all the target's load-balanced telemetry will fail to be delivered until either the target is restored or removed from the static list. The same principle applies to the `dns` resolver.
+```
+                                        +------------------+          +---------------+
+resiliency options 1                    |                  |          |               |
+                                       -- otlp exporter 1  ------------   backend 1   |
+         |                       ---/   |                  |          |               |
+         |                   ---/       +----|-------------+          +---------------+
+         |               ---/                |
++-----------------+  ---/                    |
+|                 --/                        |
+|  loadbalancing  |                resiliency options 2
+|    exporter     |                          |
+|                 --\                        |
++-----------------+  ----\                   |
+                          ----\         +----|-------------+          +---------------+
+                               ----\    |                  |          |               |
+                                      --- otlp exporter N  ------------   backend N   |
+                                        |                  |          |               |
+                                        +------------------+          +---------------+
+```
+
+* For all types of resolvers (`static`, `dns`, `k8s`): if one of the endpoints is unavailable, the queue, retry and timeout settings defined for the sub-exporters (under the `otlp` property) apply first. Once redelivery is exhausted at the sub-exporter level, and resiliency options 1 are enabled, telemetry data returns to the `loadbalancingexporter` itself and redelivery happens according to the exporter-level queue, retry and timeout settings.
+* When using the `static` resolver and all targets are unavailable, all load-balanced telemetry will fail to be delivered until either one or all targets are restored or a valid target is added to the static list. The same principle applies to the `dns` and `k8s` resolvers, except that the endpoint list is updated automatically.
 * When using `k8s`, `dns`, and likely future resolvers, topology changes are eventually reflected in the `loadbalancingexporter`. The `k8s` resolver will update more quickly than `dns`, but a window of time in which the true topology doesn't match the view of the `loadbalancingexporter` remains.
+* Resiliency options 1 (`timeout`, `retry_on_failure` and `sending_queue` settings in the `loadbalancing` section) are useful for highly elastic environments (like k8s), where the list of resolved endpoints changes frequently due to deployments and scale-up or scale-down events. If that list changes permanently, these options make it possible to re-route data to the new set of healthy backends. Disabled by default.
+* Resiliency options 2 (`timeout`, `retry_on_failure` and `sending_queue` settings in the `otlp` section) are useful for temporary problems with a specific backend, like network flukes. Persistent Queue is NOT supported here, as all sub-exporters share the same `sending_queue` configuration, including `storage`. Enabled by default.
+
+Unfortunately, data loss is still possible if all of the exporter's targets remain unavailable once redelivery is exhausted. Due consideration needs to be given to the exporter queue and retry configuration when running in a highly elastic environment.
 
 ## Configuration
 
-Refer to [config.yaml](./testdata/config.yaml) for detailed examples on using the processor.
+Refer to [config.yaml](./testdata/config.yaml) for detailed examples on using the exporter.
 
 * The `otlp` property configures the template used for building the OTLP exporter. Refer to the OTLP Exporter documentation for information on which options are available. Note that the `endpoint` property should not be set and will be overridden by this exporter with the backend endpoint.
 * The `resolver` accepts a `static` node, a `dns`, a `k8s` service or `aws_cloud_map`. If all four are specified, an `errMultipleResolversProvided` error will be thrown.
@@ -90,6 +115,7 @@ Refer to [config.yaml](./testdata/config.yaml) for detailed examples on using th
 * `traceID`: Routes spans based on their `traceID`. Invalid for metrics.
 * `metric`: Routes metrics based on their metric name. Invalid for spans.
 * `streamID`: Routes metrics based on their datapoint streamID. That's the unique hash of all its attributes, plus the attributes and identifying information of its resource, scope, and metric data
+* The `loadbalancing` exporter supports the set of standard [queuing, retry and timeout settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md), but they are disabled by default to maintain compatibility
 
 Simple example
 
@@ -117,11 +143,76 @@ exporters:
         - backend-2:4317
         - backend-3:4317
         - backend-4:4317
-      # Notice to config a headless service DNS in Kubernetes
+      # Notice to config a headless service DNS in Kubernetes
+      # dns:
+      #   hostname: otelcol-headless.observability.svc.cluster.local
+
+service:
+  pipelines:
+    traces:
+      receivers:
+        - otlp
+      processors: []
+      exporters:
+        - loadbalancing
+    logs:
+      receivers:
+        - otlp
+      processors: []
+      exporters:
+        - loadbalancing
+```
+
+Persistent queue, retry and timeout usage example:
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: localhost:4317
+
+processors:
+
+exporters:
+  loadbalancing:
+    timeout: 10s
+    retry_on_failure:
+      enabled: true
+      initial_interval: 5s
+      max_interval: 30s
+      max_elapsed_time: 300s
+    sending_queue:
+      enabled: true
+      num_consumers: 2
+      queue_size: 1000
+      storage: file_storage/otc
+    routing_key: "service"
+    protocol:
+      otlp:
+        # all options from the OTLP exporter are supported
+        # except the endpoint
+        timeout: 1s
+        sending_queue:
+          enabled: true
+    resolver:
+      static:
+        hostnames:
+        - backend-1:4317
+        - backend-2:4317
+        - backend-3:4317
+        - backend-4:4317
+      # Notice to config a headless service DNS in Kubernetes
       # dns:
-      #   hostname: otelcol-headless.observability.svc.cluster.local
+      #   hostname: otelcol-headless.observability.svc.cluster.local
+
+extensions:
+  file_storage/otc:
+    directory: /var/lib/storage/otc
+    timeout: 10s
 
 service:
+  extensions: [file_storage/otc]
   pipelines:
     traces:
       receivers:
@@ -334,7 +425,7 @@ service:
 
 ## Metrics
 
-The following metrics are recorded by this processor:
+The following metrics are recorded by this exporter:
 
 * `otelcol_loadbalancer_num_resolutions` represents the total number of resolutions performed by the resolver specified in the tag `resolver`, split by their outcome (`success=true|false`). For the static resolver, this should always be `1` with the tag `success=true`.
 * `otelcol_loadbalancer_num_backends` informs how many backends are currently in use. It should always match the number of items specified in the configuration file in case the `static` resolver is used, and should eventually (seconds) catch up with the DNS changes. Note that DNS caches that might exist between the load balancer and the record authority will influence how long it takes for the load balancer to see the change.

exporter/loadbalancingexporter/config.go

Lines changed: 6 additions & 0 deletions
@@ -7,6 +7,8 @@ import (
 	"time"
 
 	"github.com/aws/aws-sdk-go-v2/service/servicediscovery/types"
+	"go.opentelemetry.io/collector/config/configretry"
+	"go.opentelemetry.io/collector/exporter/exporterhelper"
 	"go.opentelemetry.io/collector/exporter/otlpexporter"
 )
 
@@ -30,6 +32,10 @@ const (
 
 // Config defines configuration for the exporter.
 type Config struct {
+	TimeoutSettings exporterhelper.TimeoutConfig `mapstructure:",squash"`
+	configretry.BackOffConfig `mapstructure:"retry_on_failure"`
+	QueueSettings exporterhelper.QueueConfig `mapstructure:"sending_queue"`
+
 	Protocol Protocol `mapstructure:"protocol"`
 	Resolver ResolverSettings `mapstructure:"resolver"`
 	RoutingKey string `mapstructure:"routing_key"`
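Because the three new fields are left at their zero values by `createDefaultConfig`, an unconfigured exporter enables none of the top-level resilience features. The sketch below illustrates that with simplified stand-ins: `timeoutConfig`, `backOffConfig`, `queueConfig`, `config` and `enabledOptions` are hypothetical names that model only the fields checked in `buildExporterResilienceOptions`, not the real `exporterhelper` or `configretry` types.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for the exporterhelper and configretry
// config types; only the fields inspected by the exporter are modelled.
type timeoutConfig struct{ Timeout time.Duration }
type backOffConfig struct{ Enabled bool }
type queueConfig struct{ Enabled bool }

// config mirrors the shape of the new top-level fields, including the
// embedded retry configuration.
type config struct {
	TimeoutSettings timeoutConfig
	backOffConfig
	QueueSettings queueConfig
}

// enabledOptions lists which resilience features a config would switch on,
// using the same three checks as buildExporterResilienceOptions.
func enabledOptions(c config) []string {
	var opts []string
	if c.TimeoutSettings.Timeout > 0 {
		opts = append(opts, "timeout")
	}
	if c.QueueSettings.Enabled {
		opts = append(opts, "sending_queue")
	}
	if c.backOffConfig.Enabled {
		opts = append(opts, "retry_on_failure")
	}
	return opts
}

func main() {
	// Zero-value config: no top-level resilience options are enabled.
	fmt.Println(enabledOptions(config{})) // prints []

	// Explicit settings switch individual options on.
	fmt.Println(enabledOptions(config{
		TimeoutSettings: timeoutConfig{Timeout: 10 * time.Second},
		QueueSettings:   queueConfig{Enabled: true},
	})) // prints [timeout sending_queue]
}
```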

exporter/loadbalancingexporter/factory.go

Lines changed: 103 additions & 6 deletions
@@ -7,14 +7,21 @@ package loadbalancingexporter // import "github.com/open-telemetry/opentelemetry
 
 import (
 	"context"
+	"fmt"
 
 	"go.opentelemetry.io/collector/component"
 	"go.opentelemetry.io/collector/exporter"
+	"go.opentelemetry.io/collector/exporter/exporterhelper"
 	"go.opentelemetry.io/collector/exporter/otlpexporter"
+	"go.uber.org/zap"
 
 	"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter/internal/metadata"
 )
 
+const (
+	zapEndpointKey = "endpoint"
+)
+
 // NewFactory creates a factory for the exporter.
 func NewFactory() exporter.Factory {
 	return exporter.NewFactory(
@@ -32,20 +39,110 @@ func createDefaultConfig() component.Config {
 	otlpDefaultCfg.Endpoint = "placeholder:4317"
 
 	return &Config{
+		// By default we disable resilience options on loadbalancing exporter level
+		// to maintain compatibility with workflow in previous versions
 		Protocol: Protocol{
 			OTLP: *otlpDefaultCfg,
 		},
 	}
 }
 
-func createTracesExporter(_ context.Context, params exporter.Settings, cfg component.Config) (exporter.Traces, error) {
-	return newTracesExporter(params, cfg)
+func buildExporterConfig(cfg *Config, endpoint string) otlpexporter.Config {
+	oCfg := cfg.Protocol.OTLP
+	oCfg.Endpoint = endpoint
+
+	return oCfg
+}
+
+func buildExporterSettings(params exporter.Settings, endpoint string) exporter.Settings {
+	// Override child exporter ID to segregate metrics from loadbalancing top level
+	childName := endpoint
+	if params.ID.Name() != "" {
+		childName = fmt.Sprintf("%s_%s", params.ID.Name(), childName)
+	}
+	params.ID = component.NewIDWithName(params.ID.Type(), childName)
+	// Add "endpoint" attribute to child exporter logger to segregate logs from loadbalancing top level
+	params.Logger = params.Logger.With(zap.String(zapEndpointKey, endpoint))
+
+	return params
+}
+
+func buildExporterResilienceOptions(options []exporterhelper.Option, cfg *Config) []exporterhelper.Option {
+	if cfg.TimeoutSettings.Timeout > 0 {
+		options = append(options, exporterhelper.WithTimeout(cfg.TimeoutSettings))
+	}
+	if cfg.QueueSettings.Enabled {
+		options = append(options, exporterhelper.WithQueue(cfg.QueueSettings))
+	}
+	if cfg.BackOffConfig.Enabled {
+		options = append(options, exporterhelper.WithRetry(cfg.BackOffConfig))
+	}
+
+	return options
+}
+
+func createTracesExporter(ctx context.Context, params exporter.Settings, cfg component.Config) (exporter.Traces, error) {
+	c := cfg.(*Config)
+	exporter, err := newTracesExporter(params, cfg)
+	if err != nil {
+		return nil, fmt.Errorf("cannot configure loadbalancing traces exporter: %w", err)
+	}
+
+	options := []exporterhelper.Option{
+		exporterhelper.WithStart(exporter.Start),
+		exporterhelper.WithShutdown(exporter.Shutdown),
+		exporterhelper.WithCapabilities(exporter.Capabilities()),
+	}
+
+	return exporterhelper.NewTraces(
+		ctx,
+		params,
+		cfg,
+		exporter.ConsumeTraces,
+		buildExporterResilienceOptions(options, c)...,
+	)
 }
 
-func createLogsExporter(_ context.Context, params exporter.Settings, cfg component.Config) (exporter.Logs, error) {
-	return newLogsExporter(params, cfg)
+func createLogsExporter(ctx context.Context, params exporter.Settings, cfg component.Config) (exporter.Logs, error) {
+	c := cfg.(*Config)
+	exporter, err := newLogsExporter(params, cfg)
+	if err != nil {
+		return nil, fmt.Errorf("cannot configure loadbalancing logs exporter: %w", err)
+	}
+
+	options := []exporterhelper.Option{
+		exporterhelper.WithStart(exporter.Start),
+		exporterhelper.WithShutdown(exporter.Shutdown),
+		exporterhelper.WithCapabilities(exporter.Capabilities()),
+	}
+
+	return exporterhelper.NewLogs(
+		ctx,
+		params,
+		cfg,
+		exporter.ConsumeLogs,
+		buildExporterResilienceOptions(options, c)...,
+	)
 }
 
-func createMetricsExporter(_ context.Context, params exporter.Settings, cfg component.Config) (exporter.Metrics, error) {
-	return newMetricsExporter(params, cfg)
+func createMetricsExporter(ctx context.Context, params exporter.Settings, cfg component.Config) (exporter.Metrics, error) {
+	c := cfg.(*Config)
+	exporter, err := newMetricsExporter(params, cfg)
+	if err != nil {
+		return nil, fmt.Errorf("cannot configure loadbalancing metrics exporter: %w", err)
+	}
+
+	options := []exporterhelper.Option{
+		exporterhelper.WithStart(exporter.Start),
+		exporterhelper.WithShutdown(exporter.Shutdown),
+		exporterhelper.WithCapabilities(exporter.Capabilities()),
+	}
+
+	return exporterhelper.NewMetrics(
+		ctx,
+		params,
+		cfg,
+		exporter.ConsumeMetrics,
+		buildExporterResilienceOptions(options, c)...,
+	)
 }
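The child-exporter naming rule in `buildExporterSettings` is what produces the `"exporter": "loadbalancing/127.0.0.1:4317"` attribute mentioned in the commit message. A dependency-free sketch of just that rule follows; `childName` is a hypothetical helper extracted for illustration, the `traces` instance name is an assumed example, and the `type/name` rendering is written out by hand instead of going through `component.ID`.

```go
package main

import "fmt"

// childName mirrors the naming rule in buildExporterSettings: the endpoint
// becomes the child exporter's name, prefixed with the parent's own name
// (and an underscore) when the parent exporter has one.
func childName(parentName, endpoint string) string {
	if parentName != "" {
		return fmt.Sprintf("%s_%s", parentName, endpoint)
	}
	return endpoint
}

func main() {
	// Unnamed parent, configured as "loadbalancing".
	fmt.Println("loadbalancing/" + childName("", "127.0.0.1:4317"))
	// prints loadbalancing/127.0.0.1:4317

	// Named parent, configured as "loadbalancing/traces" (assumed example).
	fmt.Println("loadbalancing/" + childName("traces", "127.0.0.1:4317"))
	// prints loadbalancing/traces_127.0.0.1:4317
}
```

Keeping the parent name in the child ID means two `loadbalancing` instances that resolve the same endpoint still report under distinct exporter names.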
