
stormcrawler-3.4.0

@rzo1 released this 24 Jun 05:41

⚠️ Breaking Change: TextExtractor Renamed and Refactored

Applies to: Users who used or extended TextExtractor directly, or who set textextractor.class in crawler.yaml.

What Changed:

  • The former TextExtractor class has been renamed to JSoupTextExtractor; TextExtractor itself is now an interface.

  • JSoupTextExtractor is the default implementation.

  • If you previously specified TextExtractor via textextractor.class, you must now use the fully qualified name of the new class:

    textextractor.class: "org.apache.stormcrawler.parse.JSoupTextExtractor"

Alternatively, remove the line entirely, since JSoupTextExtractor is the default.
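If your configuration is assembled or checked programmatically, the rename can be applied along the following lines. This is only a sketch: the helper class is hypothetical and not part of StormCrawler, and the old fully qualified name is an assumption (that the pre-3.4.0 class lived in the same package as its replacement).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of StormCrawler. */
final class TextExtractorConfigMigration {

    static final String KEY = "textextractor.class";

    // Assumption: the old class lived in the same package as JSoupTextExtractor.
    static final String OLD_CLASS = "org.apache.stormcrawler.parse.TextExtractor";

    // Taken from the release notes above.
    static final String NEW_CLASS = "org.apache.stormcrawler.parse.JSoupTextExtractor";

    private TextExtractorConfigMigration() {}

    /**
     * Returns a copy of the configuration in which an old-style value for
     * textextractor.class is replaced by the new class name. Any other value,
     * or a missing key, is left untouched.
     */
    static Map<String, Object> migrate(Map<String, Object> conf) {
        Map<String, Object> copy = new HashMap<>(conf);
        if (OLD_CLASS.equals(copy.get(KEY))) {
            copy.put(KEY, NEW_CLASS);
        }
        return copy;
    }
}
```

A configuration that never set the key passes through unchanged, which matches the "No Action Needed" case.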

No Action Needed If:

  • You did not override textextractor.class in your crawler.yaml.

  • You did not directly extend the old TextExtractor class.

Migration Notes:

  • Update custom implementations to implement the new TextExtractor interface.

  • Update any references to the old TextExtractor class to JSoupTextExtractor if applicable.
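For a custom implementation, the change amounts to switching from extending a class to implementing an interface. The sketch below shows the shape of that change only. To stay compilable on its own it declares a stand-in interface; the stand-in's name, its single method, and that method's signature are assumptions, so check the real org.apache.stormcrawler.parse.TextExtractor interface before adapting it.

```java
/**
 * Stand-in for the real TextExtractor interface so that this sketch compiles
 * by itself. The method name and signature are assumptions, not the actual API.
 */
interface TextExtractorStandIn {
    String text(String html);
}

/**
 * Hypothetical custom extractor. Before 3.4.0 a class like this would have
 * extended the concrete TextExtractor class; now it implements an interface.
 */
final class WhitespaceCollapsingTextExtractor implements TextExtractorStandIn {

    @Override
    public String text(String html) {
        if (html == null) {
            return "";
        }
        // Naive tag stripping for illustration; a real extractor would use an HTML parser.
        String withoutTags = html.replaceAll("<[^>]*>", " ");
        // Collapse runs of whitespace and trim the ends.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

A class written this way would then be referenced by its fully qualified name in textextractor.class.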

What's Changed

  • Rel stormcrawler 3.3.0 rc1 by @tballison in #1507
  • Bump junit.version from 5.12.0 to 5.12.1 by @dependabot in #1498
  • Bump org.apache:apache from 33 to 34 by @dependabot in #1506
  • Bump com.microsoft.playwright:playwright from 1.50.0 to 1.51.0 by @dependabot in #1504
  • Bump org.apache.solr:solr-solrj from 9.8.0 to 9.8.1 by @dependabot in #1500
  • Bump org.mockito:mockito-core from 5.16.0 to 5.16.1 by @dependabot in #1499
  • Bump selenium.version from 4.29.0 to 4.30.0 by @dependabot in #1505
  • Regenerated License file after dependency upgrades by @github-actions in #1508
  • Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.2 to 3.5.3 by @dependabot in #1509
  • #621 Async queries in Solr by @mvolikas in #1488
  • Update README and compiler target to Java 17 in several plugins by @rzo1 in #1518
  • #1516 - Add config options to change the response buffer size in OpenSearch by @rzo1 in #1517
  • Bump de.thetaphi:forbiddenapis from 3.8 to 3.9 by @dependabot in #1513
  • Bump org.jacoco:jacoco-maven-plugin from 0.8.12 to 0.8.13 by @dependabot in #1511
  • Bump selenium.version from 4.30.0 to 4.31.0 by @dependabot in #1510
  • Bump org.mockito:mockito-core from 5.16.1 to 5.17.0 by @dependabot in #1512
  • Regenerated License file after dependency upgrades by @github-actions in #1519
  • Bump junit.version from 5.12.1 to 5.12.2 by @dependabot in #1520
  • Bump com.ibm.icu:icu4j from 76.1 to 77.1 by @dependabot in #1501
  • Regenerated License file after dependency upgrades by @github-actions in #1521
  • Fixes Update NOTICE File to Reflect 2025 by @rzo1 in #1522
  • #1298 - Re-enable hold on failure (on coverage fail) by @rzo1 in #1523
  • Bump testcontainers.version from 1.20.6 to 1.21.0 by @dependabot in #1524
  • Bump org.jsoup:jsoup from 1.19.1 to 1.20.1 by @dependabot in #1530
  • Regenerated License file after dependency upgrades by @github-actions in #1531
  • Bump aws.version from 1.12.782 to 1.12.783 by @dependabot in #1529
  • Bump com.microsoft.playwright:playwright from 1.51.0 to 1.52.0 by @dependabot in #1527
  • Bump selenium.version from 4.31.0 to 4.32.0 by @dependabot in #1526
  • Bump org.wiremock:wiremock from 3.12.1 to 3.13.0 by @dependabot in #1525
  • Regenerated License file after dependency upgrades by @github-actions in #1534
  • Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.1 to 3.4.0 by @dependabot in #1535
  • Bump org.apache.maven.archetype:archetype-packaging from 3.3.1 to 3.4.0 by @dependabot in #1536
  • Bump org.mockito:mockito-core from 5.17.0 to 5.18.0 by @dependabot in #1540
  • Bump selenium.version from 4.32.0 to 4.33.0 by @dependabot in #1539
  • Regenerated License file after dependency upgrades by @github-actions in #1541
  • Remove Incubating references since we have graduated by @rzo1 in #1538
  • Fix versions of SC in the READMEs + added instructions in RELEASING by @jnioche in #1543
  • #1545 Use same version of URLFrontier as in the module by @jnioche in #1546
  • Bump testcontainers.version from 1.21.0 to 1.21.1 by @dependabot in #1549
  • Bump junit.version from 5.12.2 to 5.13.0 by @dependabot in #1548
  • Bump aws.version from 1.12.783 to 1.12.785 by @dependabot in #1551
  • Bump junit.version from 5.13.0 to 5.13.1 by @dependabot in #1550
  • Bump tika.version from 3.1.0 to 3.2.0 by @dependabot in #1547
  • Bump com.github.ben-manes.caffeine:caffeine from 3.2.0 to 3.2.1 by @dependabot in #1553
  • Regenerated License file after dependency upgrades by @github-actions in #1554
  • #1555 - Storm 2.8.1 by @rzo1 in #1556
  • Regenerated License file after dependency upgrades by @github-actions in #1557
  • Bump aws.version from 1.12.785 to 1.12.787 by @dependabot in #1563
  • Bump org.apache:apache from 34 to 35 by @dependabot in #1562
  • Bump org.wiremock:wiremock from 3.13.0 to 3.13.1 by @dependabot in #1561
  • #1246 - Make ProxyManager to return optional incase no proxy is used by @quangdutran in #1532
  • Regenerated License file after dependency upgrades by @github-actions in #1564
  • Enable GH discussions by @rzo1 in #1565
  • #621 add batching for cloud updates, fix cloud requests by @mvolikas in #1544
  • #1558 - Add a LLM-based TextExtractor (OpenAI API compatible) by @rzo1 in #1559
  • Bump testcontainers.version from 1.21.1 to 1.21.2 by @dependabot in #1568
  • Bump org.codehaus.mojo:license-maven-plugin from 2.5.0 to 2.6.0 by @dependabot in #1567
  • Regenerated License file after dependency upgrades by @github-actions in #1569
  • Bump dev.langchain4j:langchain4j from 1.0.1 to 1.1.0 by @dependabot in #1574
  • Regenerated License file after dependency upgrades by @github-actions in #1575
  • Bump org.jsoup:jsoup from 1.20.1 to 1.21.1 by @dependabot in #1576
  • Regenerated License file after dependency upgrades by @github-actions in #1577

New Contributors

Full Changelog: stormcrawler-3.3.0...stormcrawler-3.4.0