Skip to content

Conversation

mvolikas
Copy link
Contributor

@mvolikas mvolikas commented May 31, 2025

#621 Batching for Solr Cloud updates and other Cloud improvements

  • I have tested this in Solr Cloud with multiple shards for the status collection, and 2 replicas.
  • I tried to make the minimum set of changes to have "async" support for the Cloud client while keeping most of the advantages offered by solrj out of the box. Of course, this approach is not as robust as the solrj CloudSolrClient, but I think it is enough in the context of crawling.

@mvolikas mvolikas self-assigned this May 31, 2025
/**
* Flush all waiting updates for this slice to the slice leader The request will fail, if the
* leader goes down before handling it
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a retry policy to avoid this, but it will complicate the code even more. Additionally, missed Solr updates should not cause long-term issues in the crawl, since pages get refetched after some time, or am I missing something here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They only get re-fetched, if the user has enabled re-fetch on error or didn't disable re-fetching at all :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, then I will add a simple retry policy to be on the safe side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added logic that re-queues update batches that failed (e.g. the shard leader we sent the update to went down before handling the request). The batch will be dealt with the next time a flush is triggered.

@mvolikas mvolikas requested review from jnioche, tballison and rzo1 May 31, 2025 14:09
@rzo1
Copy link
Contributor

rzo1 commented May 31, 2025

I am wondering, if it would be possible, to add some (simple) Testcontainer based unit test? I guess, it would need zookeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

@mvolikas
Copy link
Contributor Author

mvolikas commented Jun 1, 2025

I am wondering, if it would be possible, to add some (simple) Testcontainer based unit test? I guess, it would need zookeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

In principle, yes, but I had difficulties running everything in containers because the Zookeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs), you cannot resolve "zookeeper".
That being said, I agree that we should have some tests for Solr Cloud, and I will give it another try.

@rzo1
Copy link
Contributor

rzo1 commented Jun 2, 2025

I am wondering, if it would be possible, to add some (simple) Testcontainer based unit test? I guess, it would need zookeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

In principle, yes, but I had difficulties running everything in containers because the Zookeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs), you cannot resolve "zookeeper". That being said, I agree that we should have some tests for Solr Cloud, and I will give it another try.

The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs?

They bascially rely on docker compose (in two variants) to spin it up. Maybe worth a look?

@mvolikas
Copy link
Contributor Author

The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs?

They bascially rely on docker compose (in two variants) to spin it up. Maybe worth a look?

Thanks @rzo1, this helps.
The issue that remains is that the zookeeper gives back to StormCrawler Solr endpoints like "solr1:8983" that StormCrawler cannot resolve because it's not part of the Docker network.
I should be able to find a way around this, though.

@rzo1
Copy link
Contributor

rzo1 commented Jun 10, 2025

We can add tests later for it, so no direct blocker from my side

@mvolikas
Copy link
Contributor Author

mvolikas commented Jun 15, 2025

We can add tests later for it, so no direct blocker from my side

@rzo1, I will create an issue for this and take care after this PR is merged. Is that ok?

@mvolikas mvolikas requested a review from rzo1 June 15, 2025 12:33
}

if (slice == null) {
LOG.error("Could not find an active slice for update {}", update);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not happen unless a shard is being split, which, in turn, should not happen in the context of StormCrawler while the crawl is running.

@rzo1 rzo1 added this to the 3.4.0 milestone Jun 17, 2025
@rzo1 rzo1 added the SOLR label Jun 17, 2025
@mvolikas mvolikas linked an issue Jun 22, 2025 that may be closed by this pull request
@mvolikas mvolikas merged commit 2081397 into apache:main Jun 22, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

asynchronous queries and updates in SOLR?
2 participants