Sparkler Usage

Basics

A Simple Crawl

Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command-line flags to help you do this.

./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test

This example tells Sparkler to crawl bbc.co.uk and label the job with the id test. The id is optional; if you don't supply one, Sparkler generates a job id for you and returns it.

Crawls always happen in two steps: the inject phase seeds the database with your starting URLs, then the crawl phase iterates through the seeded URLs and populates the database with the crawl results.
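The crawl phase also accepts flags that control how it runs. As a hedged sketch, -m sets the Spark master and -i the number of fetch/parse iterations; both appear in Sparkler's README, but confirm them against the help output of your build before relying on them.

./sparkler.sh crawl -id test -m local[*] -i 2   # assumed flags; verify with your build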

Configuration

The default configuration file lives in the conf directory and is named sparkler-default.yaml. In it you will find sensible defaults for most things; you can set active plugins, request headers, Kafka config and more.
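To give a feel for the file's shape, here is an illustrative excerpt. Treat the exact key names and values as assumptions to verify against your own copy, since they vary between releases.

# illustrative excerpt of conf/sparkler-default.yaml; key names are assumptions
fetcher.server.delay: 1000        # pause between requests to one server, in ms
fetcher.headers:                  # headers sent with every request
  User-Agent: "Mozilla/5.0 (compatible; Sparkler)"
kafka.enable: false               # stream crawl output to Kafka when true
plugins.active:                   # the plugin enable/disable block
  - urlfilter-regex
  - urlfilter-samehost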

Fetcher Properties

The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between crawl requests; this stops Sparkler hammering servers with undue load, and it also makes the crawler look a little less like a robot.

You can also set the fetcher headers; these are the standard headers sent with each request to make the crawler look like a browser.

You can also enable the fetcher.user.agents property, which cycles through the user-agent strings in the file it points to.
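Putting those three properties together, the fetcher section of the config looks roughly like this; the values and the user-agents filename are illustrative, not guaranteed defaults.

fetcher.server.delay: 1000                  # wait 1000 ms between requests to the same server
fetcher.headers:                            # browser-like headers sent with each request
  User-Agent: "Mozilla/5.0 (compatible; Sparkler)"
  Accept-Language: "en-US,en"
# uncomment to rotate through the user-agent strings in the named file
#fetcher.user.agents: user-agents.txt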

Enabling Plugins

You enable plugins by editing the plugins.active block. The list covers the plugins shipped with Sparkler, and you can enable or disable any of them by removing or adding the # comment symbol.
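For example, a plugins.active block with just the two default filters enabled looks like this; remove the leading # from a commented entry to switch that plugin on. The htmlunit entry's exact id is an assumption, so check it against your file.

plugins.active:
  - urlfilter-regex        # enabled
  - urlfilter-samehost     # enabled
  #- fetcher-htmlunit      # disabled until the # is removed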

Basic Plugins

Enabled by default are urlfilter-regex and urlfilter-samehost.

These plugins provide a couple of sensible functions that let Sparkler crawl without downloading the world. The regex filter drops URLs and links that would otherwise pull in lots of useless content; the samehost filter, by default, keeps your crawl within the same domain.

Samehost

This plugin does what it says on the tin: it keeps the crawl limited to the same host, so you don't wander off into a completely different domain crawling completely different content. Of course, you may want exactly that, in which case disable this plugin.
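Disabling it is just a matter of commenting it out in plugins.active:

plugins.active:
  - urlfilter-regex
  #- urlfilter-samehost    # commented out, so the crawl may follow links off-host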

Regex

This plugin provides more flexibility than the samehost plugin. Out of the box it prevents a number of file URLs being picked up, so, for example, you don't crawl PDFs, videos, images and so on. It also filters out ftp sites, mailto addresses, infinite loops and local files.

To adjust the filtering, simply edit the regex-urlfilter.txt file, which holds all the regular expressions used for matching.
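The file uses the Nutch-style convention of one pattern per line: a leading - rejects URLs that match, a leading + accepts them, and the first matching rule wins. The rules below are an illustrative sketch of that format, not a verbatim copy of the shipped file.

# reject ftp, mailto and local file urls
-^(file|ftp|mailto):
# reject media and document suffixes such as images, video and PDFs
-\.(gif|jpg|png|mp4|pdf)$
# reject urls with characters that often indicate loops or session state
-[?*!@=]
# accept everything else
+.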

Fetcher HTMLUnit

Also supplied with Sparkler is the fetcher htmlunit plugin. It provides a different browser backend, letting you crawl sites with an engine closer to a real browser. If the basic default (fastest) fetcher doesn't work for a site, have a look at this one; if that doesn't work either, check out the other plugins below for more options.
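To try it, enable it alongside the URL filters in plugins.active; the id fetcher-htmlunit is assumed here, so confirm the exact entry name in your sparkler-default.yaml.

plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  - fetcher-htmlunit       # fetch pages with the HtmlUnit browser engine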

Advanced Usage

Plugins

Fetcher Chrome

URL Injector

POST/PUT Commands

Config Override

Additional Fields
