[scraperhelper] Can't run scrapers in parallel #13113
Comments
CC @bogdandrutu as you appear to have done recent work on the scraper packages.
One of the main reasons we believe this is a bug is that the current sequential execution of scrapers can break the expected collection_interval behavior. For example, if the collection_interval is set to 3 minutes and one SQL query takes 5 minutes to complete, the following queries won’t start until that one finishes. As a result, even lightweight queries that should run every 3 minutes might actually be delayed significantly, leading to inaccurate or outdated metrics. This behavior defeats the purpose of defining a consistent scrape interval and can be especially problematic in environments where some queries are much heavier than others.
That's correct, the scrapers are currently run sequentially for both logs and metrics. I agree parallel behavior would be desirable in certain circumstances, though not necessarily in all of them. I'm in favor of 2 (make the controller programmatically configurable to run scrapers in parallel or sequentially). This way we can keep the current sequential behavior of existing scrapers, switch over only the scrapers we want to parallelize, and/or create new scrapers with whichever behavior best fits the scenario. On the programmatic level, I'd rather either have no default and make it mandatory to choose parallel or sequential behavior, or keep sequential as the default. Perhaps this could be a new ControllerOption.
I put together a POC here if anyone would like to review an implementation of 2 with serialized as default; going to evaluate in contrib tomorrow.
Component(s)
scraper/scraperhelper
Describe the issue you're reporting
As reported in this issue for the sqlqueryreceiver, the scraperhelper's controller always runs scrapers in series. In the case of the sqlqueryreceiver this means that each query is run sequentially rather than leveraging the connection pool options and running in parallel.
I think there are a few options for how to improve the behavior.
0. No change to scraperhelper. To get the benefits of a connection pool, for example, that logic will need to be embedded inside a single scraper instead of the current pattern in the sqlqueryreceiver of one scraper per query.
1. Always parallelize scrapers. This might cause issues in some cases if scrapers conflict with each other. However, it is what I would expect the behavior to be as a user, especially if the scraperhelper package continues to evolve (some examples in Scraper feedback, and what to do next? #11238).
2. Configurable parameter in the scraper controller to run scrapers in parallel. The benefits of parallel execution without potentially breaking existing uses of the package. I think parallel should be the default, but if we don't want to change existing behavior it could be opt-in.
3. Configurable parameter(s) in the scraper definition to declare whether an individual scraper should run in parallel. This feels excessive to me, but it allows for the case where some scrapers must run exclusively of each other. We could get deep into the weeds here with marking dependencies/conflicts, dividing scrapers into sets that can run in parallel, etc., if that's something we want to support.
I'm in favor of changing the behavior to always parallelize (1), however existing uses of the scraper packages will need to be evaluated to be sure this is safe.