[scraperhelper] Can't run scrapers in parallel #13113

Open
dehaansa opened this issue May 29, 2025 · 5 comments · May be fixed by #13167
Comments

@dehaansa

dehaansa commented May 29, 2025

Component(s)

scraper/scraperhelper

Describe the issue you're reporting

As reported in this issue for the sqlqueryreceiver, the scraperhelper's controller always runs scrapers in series. In the case of the sqlqueryreceiver this means that each query is run sequentially rather than leveraging the connection pool options and running in parallel.

I think there are a few options for how to improve the behavior.

  1. No change to scraperhelper. To get the benefits of a connection pool, for example, that logic would need to be embedded inside a single scraper instead of the sqlqueryreceiver's current pattern of one scraper per query.

  2. Always parallelize scrapers. This might cause issues in cases where scrapers could conflict with each other. However, it is what I would expect the behavior to be as a user, especially if the scraperhelper package continues to evolve (some examples in Scraper feedback, and what to do next? #11238).

  3. Configurable parameter in the scraper controller to run scrapers in parallel. This provides the benefits of parallelism without potentially breaking existing uses of the package. I think parallel should be the default, but if we don't want to change existing behavior it could be opt-in.

  4. Configurable parameter(s) in the scraper definition to define whether an individual scraper should be run in parallel. This feels excessive to me, but it allows for the case where some scrapers must run exclusively of each other. We could get deep in the weeds here with marking dependencies/conflicts, dividing scrapers into sets that can run in parallel, etc., if that's something we want to support.

I'm in favor of changing the behavior to always parallelize (option 2); however, existing uses of the scraper packages will need to be evaluated to be sure this is safe.

@dehaansa
Author

CC @bogdandrutu as you appear to have done recent work on the scraper packages

@josepcorrea

One of the main reasons we believe this is a bug is that the current sequential execution of scrapers can break the expected collection_interval behavior.

For example, if the collection_interval is set to 3 minutes and one SQL query takes 5 minutes to complete, the following queries won’t start until that one finishes. As a result, even lightweight queries that should run every 3 minutes might actually be delayed significantly, leading to inaccurate or outdated metrics.

This behavior defeats the purpose of defining a consistent scrape interval and can be especially problematic in environments where some queries are much heavier than others.

@josepcorrea

Additionally, the max_open_conn property seems somewhat meaningless in this context, since with sequential execution, there is never more than one connection used at a time. This defeats the purpose of tuning connection pool limits for performance.

It's also worth noting that max_open_conn was mentioned as part of a fix in a related issue: open-telemetry/opentelemetry-collector-contrib#39270 — however, it appears that the underlying sequential execution behavior still limits its effectiveness.

@andrzej-stencel
Member

That's correct, the scrapers are currently run sequentially for both logs and metrics. I agree parallel behavior would be desirable in certain circumstances, though not necessarily in all of them.

I'm in favor of option 3: make the controller programmatically configurable to run scrapers in parallel or sequentially. This way we can keep the current sequential behavior of existing scrapers, parallelize the scrapers we choose, and/or create new scrapers with whichever behavior best fits the scenario.

On the programmatic level, I'd rather either have no default and make it mandatory to choose parallel or sequential behavior, or have sequential as the default. Perhaps this could be a new ControllerOption like WithParallel?
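As a rough sketch of what such an option could look like, following the functional-option pattern the scraperhelper package already uses. The `ControllerOption` and `WithParallel` names follow the comment's suggestion; the `controller` struct here is a stand-in, not the real scraperhelper type:

```go
package main

import "fmt"

// controller is a stand-in for scraperhelper's controller; only the
// field relevant to this sketch is shown.
type controller struct {
	parallel bool
}

// ControllerOption configures a controller (functional-option pattern).
type ControllerOption func(*controller)

// WithParallel is the hypothetical opt-in discussed above: scrapers run
// concurrently only when the scraper author asks for it, keeping
// sequential execution as the default.
func WithParallel() ControllerOption {
	return func(c *controller) { c.parallel = true }
}

func newController(opts ...ControllerOption) *controller {
	c := &controller{} // parallel defaults to false, i.e. sequential
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	fmt.Println(newController().parallel)               // false
	fmt.Println(newController(WithParallel()).parallel) // true
}
```

Keeping sequential as the zero-value default means no existing scraper changes behavior unless its author explicitly passes the option.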

@dehaansa
Author

dehaansa commented Jun 6, 2025

I put together a POC here if anyone would like to review an implementation of option 3 with sequential as the default; I'm going to evaluate it in contrib tomorrow.
