[consumererror] Add OTLP-centric error type #13042

Open: wants to merge 16 commits into base: main

evan-bradley
Contributor

Description

Continuation of #11085.

Link to tracking issue

Fixes #7047

@evan-bradley evan-bradley requested a review from a team as a code owner May 15, 2025 21:32

codecov bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 90.69767% with 8 lines in your changes missing coverage. Please review.

Project coverage is 91.25%. Comparing base (2e61528) to head (0897af5).
Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
...sumererror/internal/statusconversion/conversion.go 80.00% 8 Missing ⚠️

❌ Your patch check has failed because the patch coverage (90.69%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13042      +/-   ##
==========================================
- Coverage   91.48%   91.25%   -0.23%     
==========================================
  Files         506      510       +4     
  Lines       28557    28830     +273     
==========================================
+ Hits        26125    26310     +185     
- Misses       1917     2002      +85     
- Partials      515      518       +3     

@evan-bradley
Contributor Author

I'll look at improving the code coverage tomorrow. In the meantime, this should be in a pretty good state.

@evan-bradley
Contributor Author

The remaining functions missing test coverage are the status code conversion functions. I don't think tests would be very helpful there, since the functions are direct mappings. The only thing I can think of that would meaningfully improve coverage is to store the mappings in a map instead of a switch statement, but that feels like a slightly worse implementation.

Member

@TylerHelmuth TylerHelmuth left a comment

so excited to see this revived

Member

@mx-psi mx-psi left a comment

There seemed to be consensus on this implementation in the last iteration. I think what we need now is to test this in real life, so I am approving this so we can move forward.

@mx-psi
Member

mx-psi commented May 21, 2025

Since this was especially controversial last time, I suggest we wait until either we have more approvals (I suggest 4) or some time has passed (I would suggest Friday next week).

cc @open-telemetry/collector-approvers

Member

@songy23 songy23 left a comment

LGTM, liked the idea

Contributor

@jmacd jmacd left a comment

Really good to see this moving forward. This looks the way I would expect it to look after reviewing earlier feedback from @bogdandrutu.

// data around the error that occurred.
//
// Error should be obtained from a given `error` object using `errors.As`.
type Error struct {
Member

There is another interesting property, "retry-after". I'm not suggesting we fix it now, but I am curious how we plan to support it. Only via gRPC Status?

Contributor Author

I think we will likely add another error type that will go along with this one, but if not we will add an option to this error type that also includes a timer for how long the caller should wait before retrying a request.

Member

I see a problem with gRPC status. The gRPC status carries more information in its "details" part (including the retry-after), and it will be error-prone to create this Error from the status and then multiple other errors for the other parts of the status.

Contributor Author

Thanks for the details, I wasn't aware the gRPC status could contain that, but I found the RetryInfo type, which is what I believe you are talking about.

I believe we'll likely solve this with something like the following, where all error creation goes through a single point:

  1. Have the constructors take options that allow specifying the retry delay. The retry information on the returned error would either live in a RetryInfo or similar struct that can be pulled out with errors.As, or be placed directly on the Error object itself.
  2. Have the gRPC constructor extract the retry info from the *status.Status struct and use that info to populate retry info.

What do you think?

Member

I like the direction, and I'm glad that you know about this now and can propose a solution :)

Member

@bogdandrutu bogdandrutu left a comment

There is a big problem we identified in the past, which is that the default behavior of the errors in the collector pipelines is that they are retryable. It seems that this PR changes that, which I 1000% support, but we need to make sure we document this change and analyze the impact of that.

@evan-bradley
Contributor Author

> There is a big problem we identified in the past, which is that the default behavior of the errors in the collector pipelines is that they are retryable. It seems that this PR changes that, which I 1000% support, but we need to make sure we document this change and analyze the impact of that.

Agreed. That will come in a follow-up once we start to use this, and I will make sure we proceed carefully here.

// See https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md for more details.
//
// If a gRPC code cannot be derived from these three sources then INTERNAL is returned.
func (e *Error) OTLPGRPCStatus() *status.Status {
Member

I would actually change the signature to something like:

func ToGRPCStatus(e error) *status.Status;

This way:

  • Can handle the "errors.As" part.
  • Can handle extra details like retry-after you proposed with the constructor.


// NewOTLPGRPCError records a gRPC status code that was received from a server
// during data submission.
func NewOTLPGRPCError(origErr error, status *status.Status) error {
Member

Is there an "original" error in this case?

Successfully merging this pull request may close these issues.

Investigate how to expose exporterhelper.NewThrottleRetry in the consumererror
6 participants