
Broken trace context propagation: OTel Trace ID of DD agent spans converted by OTel Col Datadog Receiver are wrong #36926


Closed
cyrille-leclerc opened this issue Dec 23, 2024 · 3 comments · Fixed by #39654


cyrille-leclerc commented Dec 23, 2024

Component(s)

receiver/datadog

What happened?

Description

When W3C Trace Context is propagated from an app instrumented with OTel to an app instrumented with the Datadog agent, and the Datadog agent sends its spans to the OTel Collector Datadog Receiver, the Trace ID reported by the OTel Collector Datadog Receiver is different from the Trace ID of the parent spans, breaking the trace.

I can't confirm it, but I suspect this is caused by the logic the OTel Collector Datadog Receiver uses to produce the OTel Trace ID from the Datadog IDs.

Architecture:

                                                                   
 ┌──────────┐                               ┌───────────┐───┐      
 │OTel Java ┼──────┐                        │           │   │      
 └─────┬────┘      │                        │Receiver   │ O │      
       │           │                        │           │ T │      
       │           │                        │───────────│ e │      
       │           └───────────────────────►│ OTLP      │ l │      
       │                                    │           │   │      
       │traceparent: trace=xyz, parent=abc  │           │ C │      
       │                                    │───────────│ o │      
       │                                    │           │ l │      
 ┌─────▼──────┐ ┌──────────────────────────►┤Datadog    │   │      
 │Datadog Java┼─┘                           └───────────┘───┘      
 └────────────┘                             span:                  
                                              spanId=...           
                                              parent=abc <--CORRECT
                                              traceId=uvw <--WRONG 

OTel Collector debug log.

  • First span: HTTP client call span emitted by a Java Spring Boot app instrumented by the OTel Java Agent v1.44.1 with
    • traceId=37940834c74a2dfc11835c979eca1433
    • spanId=bb4331d223d59950
  • Second span: HTTP Server span emitted by a Spring Boot app instrumented by the Datadog Java Agent v1.44.1 with
    • parentId=bb4331d223d59950 as expected
    • traceId=000000000000000011835c979eca1433, which is NOT expected; we expect 37940834c74a2dfc11835c979eca1433

Resource SchemaURL: https://opentelemetry.io/schemas/1.24.0
Resource attributes:
     -> deployment.environment.name: Str(staging)
     -> host.arch: Str(aarch64)
     -> host.name: Str(cyrille-le-clerc-macbook.local)
     -> os.description: Str(Mac OS X 15.2)
     -> os.type: Str(darwin)
     -> process.command_args: Slice([...,"-jar","target/checkout-1.1-SNAPSHOT.jar"])
     -> process.executable.path: Str(.../bin/java)
     -> process.pid: Int(14768)
     -> process.runtime.description: Str(Homebrew OpenJDK 64-Bit Server VM 17.0.13+0)
     -> process.runtime.name: Str(OpenJDK Runtime Environment)
     -> process.runtime.version: Str(17.0.13+0)
     -> service.instance.id: Str(ccad3c44-aebc-4f8b-96b9-c4ed6a5433c4)
     -> service.name: Str(checkout)
     -> service.namespace: Str(shop)
     -> service.version: Str(1.1)
     -> telemetry.distro.name: Str(opentelemetry-java-instrumentation)
     -> telemetry.distro.version: Str(2.10.0)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.44.1)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope io.opentelemetry.java-http-client 2.10.0-alpha
Span #0
    Trace ID       : 37940834c74a2dfc11835c979eca1433
    Parent ID      : 179ce2ee48649594
    ID             : bb4331d223d59950
    Name           : POST
    Kind           : Client
    Start time     : 2024-12-23 17:34:54.397226541 +0000 UTC
    End time       : 2024-12-23 17:34:54.63652375 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> server.address: Str(shipping.local)
     -> tenant_id: Str(tenant-1)
     -> http.request.method: Str(POST)
     -> network.protocol.version: Str(1.1)
     -> http.response.status_code: Int(200)
     -> thread.id: Int(160)
     -> server.port: Int(8088)
     -> thread.name: Str(grpc-default-executor-36)
     -> url.full: Str(http://shipping.local:8088/shipOrder)

ResourceSpans #1
Resource SchemaURL: https://opentelemetry.io/schemas/1.16.0
Resource attributes:
     -> telemetry.sdk.language: Str(java)
     -> process.runtime.version: Str(17.0.13)
     -> service.version: Str(1.1)
     -> telemetry.sdk.version: Str(Datadog-1.44.1~13a9a2d011)
     -> telemetry.sdk.name: Str(Datadog)
     -> service.name: Str(shipping)
     -> host.name: Str(localhost)
     -> os.type: Str(darwin)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope Datadog 1.44.1~13a9a2d011
Span #0
    Trace ID       : 000000000000000011835c979eca1433
    Parent ID      : bb4331d223d59950
    ID             : 6176a9d3ea94c1f7
    Name           : servlet.request
    Kind           : Server
    Start time     : 2024-12-23 17:34:54.453192875 +0000 UTC
    End time       : 2024-12-23 17:34:54.638774084 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> dd.span.Resource: Str(POST /shipOrder)
     -> sampling.priority: Str(1.000000)
     -> datadog.span.id: Str(7022987396569678327)
     -> datadog.trace.id: Str(1261954126867731507)
     -> servlet.path: Str(/shipOrder)
     -> deployment.environment: Str(production)
     -> peer.ipv4: Str(127.0.0.1)
     -> thread.name: Str(http-nio-8088-exec-1)
     -> language: Str(jvm)
     -> service.version: Str(1.1)
     -> span.kind: Str(server)
     -> http.method: Str(POST)
     -> _dd.p.dm: Str(-0)
     -> http.status_code: Str(200)
     -> _dd.tracer_host: Str(cyrille-le-clerc-macbook.local)
     -> http.url: Str(http://shipping.local:8088/shipOrder)
     -> http.hostname: Str(shipping.local)
     -> _dd.p.tid: Str(37940834c74a2dfc)
     -> servlet.context: Str(/)
     -> http.route: Str(/shipOrder)
     -> runtime-id: Str(8265563a-4256-4741-ba0c-ebbb676d4473)
     -> http.useragent: Str(Java-http-client/17.0.13)
     -> component: Str(tomcat-server)
     -> thread.id: Double(37)
     -> process.pid: Double(73021)
     -> _dd.profiling.enabled: Double(0)
     -> peer.port: Double(64145)
     -> _dd.trace_span_attribute_schema: Double(0)
     -> _sampling_priority_v1: Double(1)
     -> _dd.measured: Double(1)
     -> _dd.top_level: Double(1)

Steps to Reproduce

  • Set up an OTel Collector with both the OTLP and Datadog receivers and the debug exporter
  • Create two Spring Boot apps, an "upstream_app" that calls a "downstream_app" over HTTP
    • In the HTTP handler of the "downstream_app", dump the traceparent HTTP header to verify the context is propagated
  • Instrument the upstream app with the OTel Java auto-instrumentation v2.10.0
  • Instrument the downstream app with dd-trace-java v1.44.1
export DD_TRACE_AGENT_URL="http://localhost:8126"
# disabling remote config to ensure no weird behavior 
export DD_REMOTE_CONFIGURATION_ENABLED=false

java \
     -javaagent:"$DATADOG_AGENT_JAR" \
     -Dserver.port=8088 \
     -jar target/shipping-1.1-SNAPSHOT.jar
  • Invoke the "upstream_app" to trigger an HTTP call to the "downstream_app"
  • Inspect the produced spans in the OTel collector logs

Expected Result

  • The spans in the OTel Collector logs show that the trace context is properly propagated: there is just one traceId, and the parentId of the HTTP handler span of the "downstream_app" matches the spanId of the HTTP call span of the "upstream_app".

Actual Result

The parentId is properly propagated, but the traceId is wrong.

Collector version

v0.116.0

Environment information

Environment

macOS 15.2

Demo app:

OpenTelemetry Collector configuration

receivers:
  datadog:
    endpoint: localhost:8126
    read_timeout: 60s
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [datadog]
      processors: 
      exporters: [debug]

Log output

See bug description

Additional context

No response

cyrille-leclerc added the "bug" and "needs triage" labels Dec 23, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


atoulme commented Mar 8, 2025

@boostchicken @gouthamve @MovieStoreGuy please review as codeowners


xiu commented Apr 25, 2025

I've looked into this today as I've tried a similar setup, but with Go tracing libraries.

Taking Cyrille's example above, I've found that the TraceIDs on spans coming from the Datadog instrumentation contain only the lower 64 bits of the full 128-bit TraceID, and that the upper 64 bits are stored in a different tag/attribute (_dd.p.tid):

  • OTel TraceID: 37940834c74a2dfc11835c979eca1433
  • Datadog TraceID: 000000000000000011835c979eca1433
  • Datadog's _dd.p.tid: 37940834c74a2dfc
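
Concatenating the two Datadog fields reproduces the OTel TraceID. Here's a minimal Go sketch of that reconstruction, using the values from the example above (the helper name and the standalone program are mine, not the receiver's actual code):

package main

import (
	"encoding/binary"
	"fmt"
	"strconv"
)

// rebuildTraceID combines the upper 64 bits carried in the _dd.p.tid tag
// (a hex string) with the 64-bit Datadog trace ID (a decimal uint64) into
// a 16-byte OTel trace ID.
func rebuildTraceID(ddpTid string, ddTraceID uint64) ([16]byte, error) {
	var tid [16]byte
	upper, err := strconv.ParseUint(ddpTid, 16, 64)
	if err != nil {
		return tid, err
	}
	binary.BigEndian.PutUint64(tid[:8], upper)     // upper 64 bits from _dd.p.tid
	binary.BigEndian.PutUint64(tid[8:], ddTraceID) // lower 64 bits from the Datadog trace ID
	return tid, nil
}

func main() {
	// _dd.p.tid and datadog.trace.id taken from the span dump in this issue.
	tid, err := rebuildTraceID("37940834c74a2dfc", 1261954126867731507)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", tid) // prints 37940834c74a2dfc11835c979eca1433
}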

In my tests, I've also found that _dd.p.tid only makes it to the first span that received the trace context, but doesn't make it to any child span in the same Datadog-instrumented service. Example:

Span #0
    Trace ID       : f233b7e1421e8bde1d99f09757cf199d
    Parent ID      : 6b953724b399048a
    ID             : 039f8ec65ed09993
    Name           : http.request
    Kind           : Server
    Start time     : 2025-04-24 20:03:03.974553579 +0000 UTC
    End time       : 2025-04-24 20:03:03.977185472 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> dd.span.Resource: Str(GET /)
     -> sampling.priority: Str(1.000000)
     -> datadog.span.id: Str(261084286056176019)
     -> datadog.trace.id: Str(2133000431340558749)
     -> http.url: Str(http://ddtesttracing:8080/)
     -> http.useragent: Str(Go-http-client/1.1)
     -> _dd.p.tid: Str(f233b7e1421e8bde)
     -> span.kind: Str(server)
     -> runtime-id: Str(5bd3baf5-a909-4b7d-a56f-0bba44f103ec)
     -> http.status_code: Str(200)
     -> component: Str(net/http)
     -> http.route: Str(/)
     -> http.method: Str(GET)
     -> http.host: Str(ddtesttracing:8080)
     -> language: Str(go)
     -> process.pid: Double(1)
     -> _dd.profiling.enabled: Double(0)
     -> _dd.top_level: Double(1)
     -> _sampling_priority_v1: Double(1)
Span #1
    Trace ID       : f233b7e1421e8bde1d99f09757cf199d
    Parent ID      : 039f8ec65ed09993
    ID             : 5ab19b8ebe922796
    Name           : get a dice roll
    Kind           : Unspecified
    Start time     : 2025-04-24 20:03:03.974670204 +0000 UTC
    End time       : 2025-04-24 20:03:03.97703093 +0000 UTC
    Status code    : Ok
    Status message :
Attributes:
     -> dd.span.Resource: Str(get a dice roll)
     -> sampling.priority: Str(1.000000)
     -> datadog.span.id: Str(6535175571676211094)
     -> datadog.trace.id: Str(2133000431340558749)
     -> runtime-id: Str(5bd3baf5-a909-4b7d-a56f-0bba44f103ec)
     -> app.force_sample: Str(true)
     -> language: Str(go)
     -> process.pid: Double(1)
     -> _sampling_priority_v1: Double(1)

Internally, though, the tracing library (at least the Go one) keeps the 128-bit TraceID and is able to propagate it to downstream services. Here's a context dump from the Datadog-instrumented service, in which we can see TraceID (lower 64 bits) and TraceID128 (the full 128-bit TraceID):

Service: ddtesttracing
Resource: GET /
TraceID: 13579461062520570321
TraceID128: c1df79011741b002bc73efe3a8c981d1
SpanID: 8706868100366489105
ParentID: 3781295785910731659
Start: 2025-04-24 20:03:04.680027676 +0000 UTC
Duration: 0s
Error: 0
Type: web
Tags:
	language:go
	http.method:GET
	http.url:http://ddtesttracing:8080/
	span.kind:server
	http.host:ddtesttracing:8080
	component:net/http
	http.route:/
	http.useragent:Go-http-client/1.1
	runtime-id:5bd3baf5-a909-4b7d-a56f-0bba44f103ec
	_sampling_priority_v1:1.000000
	process_id:1.000000
	_dd.profiling.enabled:0.000000
	_dd.top_level:1.000000).WithValue(pprof.labelContextKey, {"local root span id":"8706868100366489105", "span id":"7389149026368848904", "trace endpoint":"GET /"}).WithValue(internal.contextKey, Name: get a dice roll
Service: ddtesttracing
Resource: get a dice roll
TraceID: 13579461062520570321
TraceID128: c1df79011741b002bc73efe3a8c981d1
SpanID: 7389149026368848904
ParentID: 8706868100366489105
Start: 2025-04-24 20:03:04.68088189 +0000 UTC
Duration: 0s
Error: 0
Type: 
Tags:
	language:go
	runtime-id:5bd3baf5-a909-4b7d-a56f-0bba44f103ec
	app.force_sample:true
	_sampling_priority_v1:1.000000
	process_id:1.000000)

In the Datadog receiver, we currently use only the lower 64 bits: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/datadogreceiver/internal/translator/traces_translator.go#L104 (the upper 64 bits are set to 0).

I'll push a PR with a tentative fix.
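
For reference, a hedged sketch of what such a fix could look like on the receiver side, mirroring the reconstruction above but producing a pcommon.TraceID (the helper name is hypothetical, not the code in the PR):

package translator

import (
	"encoding/binary"
	"strconv"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// toTraceID128 is a hypothetical helper: it builds the OTel trace ID from the
// 64-bit Datadog trace ID plus the optional "_dd.p.tid" meta tag carrying the
// upper 64 bits. Without the tag, the upper half stays zero, which matches the
// receiver's previous behavior.
func toTraceID128(lower uint64, meta map[string]string) pcommon.TraceID {
	var tid [16]byte
	if hi, ok := meta["_dd.p.tid"]; ok {
		if upper, err := strconv.ParseUint(hi, 16, 64); err == nil {
			binary.BigEndian.PutUint64(tid[:8], upper)
		}
	}
	binary.BigEndian.PutUint64(tid[8:], lower)
	return pcommon.TraceID(tid)
}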

xiu added a commit to xiu/opentelemetry-collector-contrib that referenced this issue Apr 25, 2025
With this commit, we add support for 128-bit TraceIDs coming from Datadog-instrumented services. This can happen when an OTel-instrumented service calls a downstream Datadog-instrumented one. Datadog instrumentation libraries store the 128-bit TraceID in two different fields:
* TraceID: lower 64 bits of the 128-bit TraceID
* _dd.p.tid: upper 64 bits of the 128-bit TraceID

This commit adds logic that reconstructs the 128-bit TraceID. Before this commit, only the lower 64 bits were used as the TraceID.

Fixes open-telemetry#36926
atoulme pushed a commit that referenced this issue May 8, 2025
#### Description
With this commit, we add support for 128-bit TraceIDs coming from Datadog-instrumented services. This can happen when an OTel-instrumented service calls a downstream Datadog-instrumented one. Datadog instrumentation libraries store the 128-bit TraceID in two different fields:
* TraceID: lower 64 bits of the 128-bit TraceID
* _dd.p.tid: upper 64 bits of the 128-bit TraceID

This commit adds logic that reconstructs the 128-bit TraceID. Before this commit, only the lower 64 bits were used as the TraceID.

#### Link to tracking issue
Fixes #36926

#### Testing
Tested the setup with the following chain: OTel Instrumented Service -> Datadog Instrumented Service -> OTel Instrumented Service. The TraceID was maintained across the whole chain.

Also added a unit test (`TestToTraces64to128bits`).

#### Documentation
Updated README.md in 58129d5