
[exporter/elasticsearch] report connection health via componentstatus #39562


Conversation

@faec (Contributor) commented Apr 22, 2025

Description

Report component status of the Elasticsearch exporter based on the response code when making ingestion requests.

Testing

Added a unit test confirming that status is reported based on the HTTP response. Also tested manually with the collector and confirmed that the error conditions appear when querying the collector health status.

linux-foundation-easycla bot commented Apr 22, 2025

CLA Signed

The committers are authorized under a signed CLA.

@faec faec changed the title Report elasticsearch exporter health via componentstatus [exporter/elasticsearch] report connection health via componentstatus Apr 24, 2025
@faec faec marked this pull request as ready for review April 24, 2025 21:06
@faec faec requested a review from a team as a code owner April 24, 2025 21:06
@faec faec requested a review from dashpole April 24, 2025 21:06
@carsonip (Contributor) left a comment:

Thanks! Using LogRoundTrip for health reporting is a bit hacky, but I reckon it's the easiest and best way to achieve the goal, because response parsing is done in go-docappender and there are no hooks.
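
For context, go-docappender's HTTP layer exposes per-request outcomes through the elastictransport.Logger interface, whose LogRoundTrip method is invoked with each request, response, and transport error. A minimal sketch of a health-reporting logger built on that hook (the healthLogger type and its componentHost field are illustrative, not the PR's actual code):

package elasticsearchexporter // hypothetical placement

import (
	"net/http"
	"time"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

// healthLogger sketches an elastictransport.Logger implementation that
// reports exporter health from each round trip's outcome.
type healthLogger struct {
	componentHost component.Host
}

// LogRoundTrip runs after every request: a transport error is reported as
// a recoverable error, and a 2xx response as healthy.
func (cl *healthLogger) LogRoundTrip(req *http.Request, resp *http.Response, err error, start time.Time, dur time.Duration) error {
	switch {
	case err != nil:
		componentstatus.ReportStatus(
			cl.componentHost, componentstatus.NewRecoverableErrorEvent(err))
	case resp != nil && resp.StatusCode < 300:
		componentstatus.ReportStatus(
			cl.componentHost, componentstatus.NewEvent(componentstatus.StatusOK))
	}
	return nil
}

// Request and response bodies are not needed for status reporting.
func (cl *healthLogger) RequestBodyEnabled() bool  { return false }
func (cl *healthLogger) ResponseBodyEnabled() bool { return false }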

Comment on lines 71 to 76:

} else if resp.StatusCode >= 300 {
	// Error results
	err := fmt.Errorf("Elasticsearch request failed: %v", resp.Status)
	componentstatus.ReportStatus(
		cl.componentHost, componentstatus.NewRecoverableErrorEvent(err))
}
A reviewer commented:

Are all of these recoverable? I would think a 401 would be non-recoverable without human intervention.

409s are not necessarily an error at all when documents are truly duplicates. Is it worth special-casing those? There should probably be an indicator that they are happening, but if someone is explicitly doing de-duplication via an _id in the document, this will mark the exporter as degraded unnecessarily.

@faec (Contributor, Author) replied:

"non-recoverable" in the collector is very unforgiving, to the extent that almost nothing we could encounter truly satisfies its definition (which is that the component can never again return to a healthy state no matter what for the life of the process -- so its real meaning isn't "requires human intervention" it's "requires intervention and a restart of the process"). For example I've seen some setups where ingest credentials are synced to the ingest workers before they're actually activated upstream (I don't know why, but it shouldn't break us), or similarly a 401 can be a side effect of broken or unreliable proxy settings that can be resolved without restarting the client process.

Good point about 409s though, those should be telemetry numbers rather than a component error state, I will add a special case for that.
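
A sketch of what that special case could look like (reportFromResponse is a hypothetical helper for illustration, not the PR's actual diff):

package elasticsearchexporter // hypothetical placement

import (
	"fmt"
	"net/http"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

// reportFromResponse maps an HTTP response to a component status, treating
// 409 Conflict as telemetry-only rather than as a component error.
func reportFromResponse(host component.Host, resp *http.Response) {
	switch {
	case resp.StatusCode == http.StatusConflict:
		// 409: expected when clients deduplicate via explicit document
		// _id values; count it in metrics, but don't degrade the component.
	case resp.StatusCode >= 300:
		err := fmt.Errorf("Elasticsearch request failed: %v", resp.Status)
		componentstatus.ReportStatus(
			host, componentstatus.NewRecoverableErrorEvent(err))
	default:
		componentstatus.ReportStatus(
			host, componentstatus.NewEvent(componentstatus.StatusOK))
	}
}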

The reviewer replied:

Thanks, just special-casing 409s makes sense to me then.

The reviewer added:

The change looks good. It's probably worth a test case for 409s specifically, since they are now meant not to trigger recovery from an error state.
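
A test along the requested lines might look like this (statusHost is a hypothetical test double, reportFromResponse is the sketch above, and componentstatus.ReportStatus delivers events to any Host that implements componentstatus.Reporter):

package elasticsearchexporter

import (
	"net/http"
	"testing"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

// statusHost records every status event reported through it.
type statusHost struct {
	component.Host
	events []*componentstatus.Event
}

func (h *statusHost) Report(e *componentstatus.Event) {
	h.events = append(h.events, e)
}

// A 409 must report neither an error nor StatusOK, since StatusOK would
// clear a previously reported recoverable error.
func TestConflictReportsNoStatus(t *testing.T) {
	host := &statusHost{}
	reportFromResponse(host, &http.Response{
		StatusCode: http.StatusConflict,
		Status:     "409 Conflict",
	})
	if len(host.events) != 0 {
		t.Fatalf("expected no status events for a 409 response, got %d", len(host.events))
	}
}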

The reviewer followed up:

Thanks for the test!

@atoulme (Contributor) commented May 2, 2025

Please address CI

@faec (Contributor, Author) commented May 2, 2025

Maybe I'm not understanding how to sync module versions: go get on go.opentelemetry.io/collector/component/componentstatus adds version v0.125.0, but check-collector-module-version.sh in CI seemed to want a pseudo-version starting with v0.125.1. I tried running check-collector-module-version.sh locally, but it fails for me, and I can't find the right way to run it or to update to the right module version (other than waiting for CI to fail and copying the value it reports, but that's slow and the version seems to change every couple of days). Presumably there's some workflow I'm missing; where should I be looking?
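
One possibly relevant detail: go get resolves to the latest release tag by default, while a pseudo-version for the newest commit can be requested by targeting a branch instead, for example:

go get go.opentelemetry.io/collector/component/componentstatus@main

Whether that produces the exact pseudo-version check-collector-module-version.sh expects depends on the repository's release workflow.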

@andrzej-stencel (Member) left a comment:

👍

@andrzej-stencel andrzej-stencel merged commit e45dbe3 into open-telemetry:main May 9, 2025
173 checks passed
@github-actions github-actions bot added this to the next release milestone May 9, 2025
dragonlord93 pushed a commit to dragonlord93/opentelemetry-collector-contrib that referenced this pull request May 23, 2025