Skip to content

[prometheusremotewritereceiver] fix body decode to snappy format #38899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

perebaj
Copy link
Contributor

@perebaj perebaj commented Mar 24, 2025

Description

This bug was found in the development of this feature. We are not decoding the body in the http handler, that is a MUST-HAVE requirement of this component.

@perebaj
Copy link
Contributor Author

perebaj commented Mar 24, 2025

@ArthurSens I think that this PR doesn't need to have a chlog. Right?

@ArthurSens
Copy link
Member

@ArthurSens I think that this PR doesn't need to have a chlog. Right?

It depends, I think we need to test this by building the collector from main, send a PRWv2 request and see if it works. If it's broken in main, then we need a changelog of type bug_fix

@perebaj
Copy link
Contributor Author

perebaj commented Mar 24, 2025

@ArthurSens I think that this PR doesn't need to have a chlog. Right?

It depends, I think we need to test this by building the collector from main, send a PRWv2 request and see if it works. If it's broken in main, then we need a changelog of type bug_fix

I think we need to test this by building the collector from main

You mean. just to create a test, send a snappy/compressed data, and validate if this new behavior is working properly? Can I do that creating a new test case on receiver_test.go?

@ArthurSens
Copy link
Member

ArthurSens commented Mar 24, 2025

No, I mean actually run the collector and a Prometheus, send prwv2 requests from the Prometheus to otelcollector and see if things are working correctly. (if we get errors decoding the requests)

We'll need to build and configure both tools properly, do you know how to do that? I know we can build the collector with make otelcontribcol, but I'm not sure if the prwreceiver will be available since it's not even alpha yet. We'll need to do some research on how to do that 😬

@perebaj
Copy link
Contributor Author

perebaj commented Mar 24, 2025

No, I mean actually run the collector and a Prometheus, send prwv2 requests from the Prometheus to otelcollector and see if things are working correctly. (if we get errors decoding the requests)

We'll need to build and configure both tools properly, do you know how to do that? I know we can build the collector with make otelcontribcol, but I'm not sure if the prwreceiver will be available since it's not even alpha yet. We'll need to do some research on how to do that 😬

Why the only way to validate if the data is being properly decompressed is creating the whole end2end test? I don't get it.

@perebaj
Copy link
Contributor Author

perebaj commented Mar 24, 2025

I agree that this end2end is pretty important, my push is to simplify the implementation/test. Finalize the target_info and the other things, and in the end, implement this end2end that you are mentioning.

@ArthurSens
Copy link
Member

ArthurSens commented Mar 24, 2025

I agree that this end2end is pretty important, my push is to simplify the implementation/test. Finalize the target_info and the other things, and in the end, implement this end2end that you are mentioning.

Sorry if I was not clear; I didn't mean to implement a whole end-to-end test in code. I meant to test manually! Sometimes, it's easy and enough for the job at hand :)

I just did it. I've built the collector while including the remote-write receiver (inspired by this PR #34747), and I've run it with the following configuration:

receivers:
  prometheusremotewrite:
    endpoint: localhost:9091

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [prometheusremotewrite]
      exporters: [debug]

Then I've built Prometheus from source and started it with the following configuration:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"] # Self-scrape to collect metrics

remote_write: 
  - url: "http://localhost:9091/api/v1/write"
    protobuf_message: io.prometheus.write.v2.Request # To enable PRWv2

Then, in Prometheus logs, I see the following:

time=2025-03-24T21:15:56.004Z level=ERROR source=queue_manager.go:1670 msg="non-recoverable error" component=remote remote_name=2757a8 url=http://localhost:9091/api/v1/write failedSampleCount=563 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: snappy: corrupt input\n"

This proves that the code in main really is broken! Nice catch :)

If we're fixing a behavior, then we definitely need a changelog entry of type bug_fix


Now, I'm doing the same thing as before but building the collector binary from your PR (hopefully with the fix 🤞 )

@perebaj
Copy link
Contributor Author

perebaj commented Mar 24, 2025

I agree that this end2end is pretty important, my push is to simplify the implementation/test. Finalize the target_info and the other things, and in the end, implement this end2end that you are mentioning.

Sorry if I was not clear; I didn't mean to implement a whole end-to-end test in code. I meant to test manually! Sometimes, it's easy and enough for the job at hand :)

Now, I'm doing the same thing as before, but build the collector binary from your PR (hopefully with the fix 🤞 )

Thanks to share it. I will try to test it locally too, nice!

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing the same test from above, but in your branch, I can still see the same error in Prometheus logs:


time=2025-03-24T21:32:44.539Z level=ERROR source=queue_manager.go:1670 msg="non-recoverable error" component=remote remote_name=2757a8 url=http://localhost:9091/api/v1/write failedSampleCount=563 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: snappy: corrupt input\n"

I can also see this error in the collector logs:

2025-03-24T18:32:44.537-0300    warn    [email protected]/receiver.go:155  Error decoding remote write request     {"otelcol.component.id": "prometheusremotewrite", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "error": "snappy: corrupt input"}

This log line comes from here, which means we're failing even before we reach the new code you wrote 😬

@@ -156,8 +157,15 @@ func (prw *prometheusRemoteWriteReceiver) handlePRW(w http.ResponseWriter, req *
return
}

decompressedBody, err := snappy.Decode(nil, body)
if err != nil {
prw.settings.Logger.Warn("Error decoding remote write request to snappy", zapcore.Field{Key: "error", Type: zapcore.ErrorType, Interface: err})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prw.settings.Logger.Warn("Error decoding remote write request to snappy", zapcore.Field{Key: "error", Type: zapcore.ErrorType, Interface: err})
prw.settings.Logger.Warn("Error decoding snappy-encoded remote write request", zapcore.Field{Key: "error", Type: zapcore.ErrorType, Interface: err})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't even need this code actually, we just need to understand what is happening in io.ReadAll(req.Body) 🤔

Why is it corrupted over there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should we decode the snappy before the readall func? Don't know, just a guess

@sujalshah-bit
Copy link
Contributor

sujalshah-bit commented Mar 26, 2025

I believe this PR also resolves this issue. It seems like both are addressing the same underlying problem.

@perebaj perebaj changed the title [prometheuswritereceiver] fix body decode to snappy format [prometheusremotewritereceiver] fix body decode to snappy format Apr 2, 2025
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 17, 2025
@github-actions github-actions bot removed the Stale label Apr 22, 2025
@perebaj
Copy link
Contributor Author

perebaj commented Apr 29, 2025

We already solved it here

@ArthurSens
Copy link
Member

The snappy decoding has been fixed in open-telemetry/opentelemetry-collector#12911, is there anything left to do in this PR?

@perebaj perebaj closed this May 7, 2025
@perebaj
Copy link
Contributor Author

perebaj commented May 7, 2025

The snappy decoding has been fixed in open-telemetry/opentelemetry-collector#12911, is there anything left to do in this PR?

Nope, I leave a message, but forgot to close it, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants