⚠️ Breaking Change: TextExtractor Renamed and Refactored
Applies to: Users who directly used, extended, or overrode TextExtractor via textextractor.class in crawler.yaml.
What Changed:
- The former TextExtractor class has been renamed, and TextExtractor is now an interface.
- The default implementation is now called JSoupTextExtractor.
- If you previously specified TextExtractor via textextractor.class, you must now use the fully qualified name of the new class, textextractor.class: "org.apache.stormcrawler.parse.JSoupTextExtractor", or just remove the line, since it is the default anyway.
No Action Needed If:
- You did not override textextractor.class in your crawler.yaml.
- You did not directly extend the old TextExtractor class.
Migration Notes:
- Update custom implementations to implement the new TextExtractor interface.
- Update any references to the old TextExtractor class to JSoupTextExtractor if applicable.
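The first migration note can be sketched roughly as follows. This is only an illustration under stated assumptions: the TextExtractor interface declared here is a hypothetical stand-in so the snippet compiles on its own, and its single text(Object) method is an assumption, not the confirmed signature. Check the actual interface in the 3.4.0 sources before porting your code. WhitespaceNormalizingTextExtractor is likewise an invented example name.

```java
// Hypothetical stand-in for the new StormCrawler interface.
// The real interface ships with StormCrawler 3.4.0 and its method
// signature may differ from this assumed one.
interface TextExtractor {
    String text(Object element);
}

// Before 3.4.0 a custom extractor would have extended the TextExtractor
// class. After the change it implements the interface instead.
// The class name and behaviour here are purely illustrative.
final class WhitespaceNormalizingTextExtractor implements TextExtractor {

    @Override
    public String text(Object element) {
        if (element == null) {
            return "";
        }
        // Trim the ends and collapse every whitespace run into one space.
        return element.toString().trim().replaceAll("\\s+", " ");
    }
}
```

A custom class like this would then presumably be referenced by its fully qualified name under textextractor.class in crawler.yaml, in the same way as the default JSoupTextExtractor.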
What's Changed
- Rel stormcrawler 3.3.0 rc1 by @tballison in #1507
- Bump junit.version from 5.12.0 to 5.12.1 by @dependabot in #1498
- Bump org.apache:apache from 33 to 34 by @dependabot in #1506
- Bump com.microsoft.playwright:playwright from 1.50.0 to 1.51.0 by @dependabot in #1504
- Bump org.apache.solr:solr-solrj from 9.8.0 to 9.8.1 by @dependabot in #1500
- Bump org.mockito:mockito-core from 5.16.0 to 5.16.1 by @dependabot in #1499
- Bump selenium.version from 4.29.0 to 4.30.0 by @dependabot in #1505
- Regenerated License file after dependency upgrades by @github-actions in #1508
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.2 to 3.5.3 by @dependabot in #1509
- #621 Async queries in Solr by @mvolikas in #1488
- Update README and compiler target to Java 17 in several plugins by @rzo1 in #1518
- #1516 - Add config options to change the response buffer size in OpenSearch by @rzo1 in #1517
- Bump de.thetaphi:forbiddenapis from 3.8 to 3.9 by @dependabot in #1513
- Bump org.jacoco:jacoco-maven-plugin from 0.8.12 to 0.8.13 by @dependabot in #1511
- Bump selenium.version from 4.30.0 to 4.31.0 by @dependabot in #1510
- Bump org.mockito:mockito-core from 5.16.1 to 5.17.0 by @dependabot in #1512
- Regenerated License file after dependency upgrades by @github-actions in #1519
- Bump junit.version from 5.12.1 to 5.12.2 by @dependabot in #1520
- Bump com.ibm.icu:icu4j from 76.1 to 77.1 by @dependabot in #1501
- Regenerated License file after dependency upgrades by @github-actions in #1521
- Fixes Update NOTICE File to Reflect 2025 by @rzo1 in #1522
- #1298 - Re-enable hold on failure (on coverage fail) by @rzo1 in #1523
- Bump testcontainers.version from 1.20.6 to 1.21.0 by @dependabot in #1524
- Bump org.jsoup:jsoup from 1.19.1 to 1.20.1 by @dependabot in #1530
- Regenerated License file after dependency upgrades by @github-actions in #1531
- Bump aws.version from 1.12.782 to 1.12.783 by @dependabot in #1529
- Bump com.microsoft.playwright:playwright from 1.51.0 to 1.52.0 by @dependabot in #1527
- Bump selenium.version from 4.31.0 to 4.32.0 by @dependabot in #1526
- Bump org.wiremock:wiremock from 3.12.1 to 3.13.0 by @dependabot in #1525
- Regenerated License file after dependency upgrades by @github-actions in #1534
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.1 to 3.4.0 by @dependabot in #1535
- Bump org.apache.maven.archetype:archetype-packaging from 3.3.1 to 3.4.0 by @dependabot in #1536
- Bump org.mockito:mockito-core from 5.17.0 to 5.18.0 by @dependabot in #1540
- Bump selenium.version from 4.32.0 to 4.33.0 by @dependabot in #1539
- Regenerated License file after dependency upgrades by @github-actions in #1541
- Remove Incubating references since we have graduated by @rzo1 in #1538
- Fix versions of SC in the READMEs + added instructions in RELEASING by @jnioche in #1543
- #1545 Use same version of URLFrontier as in the module by @jnioche in #1546
- Bump testcontainers.version from 1.21.0 to 1.21.1 by @dependabot in #1549
- Bump junit.version from 5.12.2 to 5.13.0 by @dependabot in #1548
- Bump aws.version from 1.12.783 to 1.12.785 by @dependabot in #1551
- Bump junit.version from 5.13.0 to 5.13.1 by @dependabot in #1550
- Bump tika.version from 3.1.0 to 3.2.0 by @dependabot in #1547
- Bump com.github.ben-manes.caffeine:caffeine from 3.2.0 to 3.2.1 by @dependabot in #1553
- Regenerated License file after dependency upgrades by @github-actions in #1554
- #1555 - Storm 2.8.1 by @rzo1 in #1556
- Regenerated License file after dependency upgrades by @github-actions in #1557
- Bump aws.version from 1.12.785 to 1.12.787 by @dependabot in #1563
- Bump org.apache:apache from 34 to 35 by @dependabot in #1562
- Bump org.wiremock:wiremock from 3.13.0 to 3.13.1 by @dependabot in #1561
- #1246 - Make ProxyManager to return optional incase no proxy is used by @quangdutran in #1532
- Regenerated License file after dependency upgrades by @github-actions in #1564
- Enable GH discussions by @rzo1 in #1565
- #621 add batching for cloud updates, fix cloud requests by @mvolikas in #1544
- #1558 - Add a LLM-based TextExtractor (OpenAI API compatible) by @rzo1 in #1559
- Bump testcontainers.version from 1.21.1 to 1.21.2 by @dependabot in #1568
- Bump org.codehaus.mojo:license-maven-plugin from 2.5.0 to 2.6.0 by @dependabot in #1567
- Regenerated License file after dependency upgrades by @github-actions in #1569
- Bump dev.langchain4j:langchain4j from 1.0.1 to 1.1.0 by @dependabot in #1574
- Regenerated License file after dependency upgrades by @github-actions in #1575
- Bump org.jsoup:jsoup from 1.20.1 to 1.21.1 by @dependabot in #1576
- Regenerated License file after dependency upgrades by @github-actions in #1577
New Contributors
- @quangdutran made their first contribution in #1532
Full Changelog: stormcrawler-3.3.0...stormcrawler-3.4.0