
amin-aoulkadi/ABWCF

Actor-Based Web Crawling Framework

The ABWCF is a customizable, distributed and scalable web crawling framework for the JVM. It is based on Apache Pekko and the Actor model.

The ABWCF was created and developed (up to and including commit 76b52d9) as part of my master's thesis, "Developing an Actor-Based Web Crawling Framework with Apache Pekko".

Features

  • Web Crawling Basics: The ABWCF handles basic web crawling tasks (e.g. normalizing and deduplicating URLs, fetching resources, and parsing links from fetched HTML documents). Fetched resources are processed by user-defined code to perform use case-specific tasks.
  • Polite Crawling: The ABWCF supports crawl delays, the Robots Exclusion Protocol (i.e. robots.txt), X-Robots-Tag HTTP headers, and <meta name="robots"> HTML elements.
  • Crawl Limits: The ABWCF relies on user-defined regular expressions to filter out URLs that should not be crawled. This makes it possible to restrict crawls to certain hosts or domains. Crawls can also be limited by crawl depth.
  • Crawl Priority: The ABWCF includes a customizable mechanism to prioritize pages. Pages with a high crawl priority are more likely to be crawled than pages with a low crawl priority.
  • Persistence: The ABWCF persists which pages have already been crawled and which pages still need to be crawled. This makes it possible to pause and resume crawls.
  • Bandwidth Limits: The ABWCF supports configurable bandwidth limits for fetching.
  • Horizontal Scalability: It is possible to distribute a single crawl across multiple concurrent ABWCF instances.
  • Metrics: The ABWCF supports OpenTelemetry metrics.
  • Extensibility: With some knowledge of ABWCF internals and Pekko, users can add new features or replace existing ABWCF components with custom implementations.
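
To make the URL normalization, deduplication, and crawl limit features more concrete, here is a minimal, framework-independent sketch in plain Java. It is not the ABWCF API: the class `CrawlFilterSketch`, its methods, and the example regular expressions are hypothetical, and the actual ABWCF components may work differently. The sketch assumes that a URL is excluded as soon as any user-defined pattern matches it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; this class is not part of the ABWCF.
class CrawlFilterSketch {
    private final List<Pattern> excludePatterns = new ArrayList<>();
    private final int maxDepth;
    private final Set<String> seen = new HashSet<>();

    CrawlFilterSketch(List<String> excludeRegexes, int maxDepth) {
        for (String regex : excludeRegexes) {
            excludePatterns.add(Pattern.compile(regex));
        }
        this.maxDepth = maxDepth;
    }

    // Lower-cases scheme and host, resolves dot segments, drops the fragment,
    // and uses "/" for an empty path.
    static String normalize(String url) {
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL with a host: " + url);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    // Returns true only for URLs that are within the depth limit, match no
    // exclusion pattern, and have not been accepted before.
    boolean shouldCrawl(String url, int depth) {
        if (depth > maxDepth) {
            return false;
        }
        String normalized = normalize(url);
        for (Pattern pattern : excludePatterns) {
            if (pattern.matcher(normalized).find()) {
                return false;
            }
        }
        return seen.add(normalized); // false if the URL is a duplicate
    }

    public static void main(String[] args) {
        CrawlFilterSketch filter = new CrawlFilterSketch(
            List.of("^(?!https?://example\\.com/)", "\\.pdf$"), 2);
        System.out.println(filter.shouldCrawl("http://example.com/b", 0));
        System.out.println(filter.shouldCrawl("http://other.org/", 0));
    }
}
```

In this sketch, the first pattern uses a negative lookahead to exclude every URL outside `example.com`, which is one way to restrict a crawl to a single host using only exclusion patterns. The second pattern excludes URLs ending in `.pdf`.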

Limitations

  • The ABWCF does not render the pages it visits, and it does not execute any JavaScript.
  • The ABWCF does not support authentication (at least not out of the box).

Documentation

License

This work is licensed under the Apache License 2.0.
