Sparkler Usage

Basics

A Simple Crawl

Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command-line flags to help you do this.

./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test

This example tells Sparkler to crawl bbc.co.uk and label the job with the id test. The id is optional; if you don't supply one, Sparkler generates a job id for you and returns it.

Crawls always happen in two steps: the inject phase seeds the database with your starting URLs, then the crawl phase iterates through the seeded URLs and populates the database with the crawl results.
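The crawl phase also accepts flags that control how it runs. As a hedged sketch, -m sets the Spark master and -i the number of fetch/parse iterations; both appear in Sparkler's README, but confirm them against the help output of your build before relying on them.

./sparkler.sh crawl -id test -m local[*] -i 2   # assumed flags; verify with your build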

Configuration

The default configuration file lives in the conf directory and is named sparkler-default.yaml. In it you will find sensible defaults for most things; you can set active plugins, request headers, Kafka config and more.
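To give a feel for the file's shape, here is an illustrative excerpt. Treat the exact key names and values as assumptions to verify against your own copy, since they vary between releases.

# illustrative excerpt of conf/sparkler-default.yaml; key names are assumptions
fetcher.server.delay: 1000        # pause between requests to one server, in ms
fetcher.headers:                  # headers sent with every request
  User-Agent: "Mozilla/5.0 (compatible; Sparkler)"
kafka.enable: false               # stream crawl output to Kafka when true
plugins.active:                   # the plugin enable/disable block
  - urlfilter-regex
  - urlfilter-samehost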

Fetcher Properties

The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between crawl requests; this stops Sparkler hammering servers with undue load, and it also makes the crawler look a little less like a robot.

You can also set the fetcher headers; these are the standard headers sent with each request to make the crawler look like a browser.

You can also enable the fetcher.user.agents property, which cycles through the user-agent strings in the file it points to.
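Putting those three properties together, the fetcher section of the config looks roughly like this; the values and the user-agents filename are illustrative, not guaranteed defaults.

fetcher.server.delay: 1000                  # wait 1000 ms between requests to the same server
fetcher.headers:                            # browser-like headers sent with each request
  User-Agent: "Mozilla/5.0 (compatible; Sparkler)"
  Accept-Language: "en-US,en"
# uncomment to rotate through the user-agent strings in the named file
#fetcher.user.agents: user-agents.txt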

Enabling Plugins

You enable plugins by editing the plugins.active block. The list covers the plugins shipped with Sparkler, and you can enable or disable any of them by removing or adding the # comment symbol.
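For example, a plugins.active block with just the two default filters enabled looks like this; remove the leading # from a commented entry to switch that plugin on. The htmlunit entry's exact id is an assumption, so check it against your file.

plugins.active:
  - urlfilter-regex        # enabled
  - urlfilter-samehost     # enabled
  #- fetcher-htmlunit      # disabled until the # is removed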

Basic Plugins

Enabled by default are urlfilter-regex and urlfilter-samehost.

These plugins provide a couple of sensible functions that let Sparkler crawl without downloading the world. The regex filter drops URLs and links that would otherwise pull in lots of useless content; the samehost filter, by default, keeps your crawl within the same domain.

Samehost

This plugin does what it says on the tin: it keeps the crawl limited to the same host, so you don't wander off into a completely different domain crawling completely different content. Of course, you may want exactly that, in which case disable this plugin.
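Disabling it is just a matter of commenting it out in plugins.active:

plugins.active:
  - urlfilter-regex
  #- urlfilter-samehost    # commented out, so the crawl may follow links off-host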

Regex

This plugin provides more flexibility than the samehost plugin. Out of the box it prevents a number of file URLs being picked up, so, for example, you don't crawl PDFs, videos, images and so on. It also filters out ftp sites, mailto addresses, infinite loops and local files.

To adjust the filtering, simply edit the regex-urlfilter.txt file, which holds all the regular expressions used for matching.
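The file uses the Nutch-style convention of one pattern per line: a leading - rejects URLs that match, a leading + accepts them, and the first matching rule wins. The rules below are an illustrative sketch of that format, not a verbatim copy of the shipped file.

# reject ftp, mailto and local file urls
-^(file|ftp|mailto):
# reject media and document suffixes such as images, video and PDFs
-\.(gif|jpg|png|mp4|pdf)$
# reject urls with characters that often indicate loops or session state
-[?*!@=]
# accept everything else
+.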

Fetcher HTMLUnit

Also supplied with Sparkler is the fetcher htmlunit plugin. It provides a different browser backend, letting you crawl sites with an engine closer to a real browser. If the basic default (fastest) fetcher doesn't work for a site, have a look at this one; if that doesn't work either, check out the other plugins below for more options.
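To try it, enable it alongside the URL filters in plugins.active; the id fetcher-htmlunit is assumed here, so confirm the exact entry name in your sparkler-default.yaml.

plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  - fetcher-htmlunit       # fetch pages with the HtmlUnit browser engine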

Advanced Usage

Plugins

Fetcher Chrome

URL Injector

POST/PUT Commands

Config Override

Additional Fields
