v0.9.0

@cyclone-github released this 13 May 13:52 (commit 159eed5)

Spider

Changelog:

  • v0.9.0 by @cyclone-github in #7
    • added flag "-url-match" to only crawl URLs containing a specified keyword; #6
    • added notice to user if no URLs are crawled when using "-crawl 1 -url-match"
    • exits early if zero URLs were crawled (no processing or file output)
    • uses custom User-Agent "Spider/0.9.0 (+https://github.com/cyclone-github/spider)"
    • removed clearScreen function and its imports
    • fixed crawl-depth calculation logic
    • restricted link collection to .html, .htm, .txt, and extension-less paths (sketched below)
    • upgraded dependencies and bumped Go version to v1.24.3
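
For illustration, here is a minimal Go sketch of those link-collection rules: the "-url-match" keyword filter and the extension restriction. All identifiers below are hypothetical, not Spider's actual code:

package main

import (
    "fmt"
    "path"
    "strings"
)

// shouldCollect reports whether a discovered link is worth crawling under
// the v0.9.0 rules above: an optional case-insensitive keyword match
// (-url-match) plus the restriction to .html, .htm, .txt, and
// extension-less paths. A real crawler would parse the URL with net/url
// first; this sketch keeps it to simple string handling.
func shouldCollect(link, urlMatch string) bool {
    if urlMatch != "" && !strings.Contains(strings.ToLower(link), strings.ToLower(urlMatch)) {
        return false // -url-match keyword not present in the URL
    }
    switch strings.ToLower(path.Ext(link)) {
    case "", ".html", ".htm", ".txt": // allowed extensions (or none)
        return true
    default:
        return false
    }
}

func main() {
    links := []string{
        "https://example.com/wordlist.txt",
        "https://example.com/image.png",
        "https://example.com/wordlist-archive",
    }
    for _, link := range links {
        fmt.Printf("%-40s collect=%v\n", link, shouldCollect(link, "wordlist"))
    }
}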

Spider: URL Mode

spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt
 ---------------------- 
| Cyclone's URL Spider |
 ---------------------- 

Crawling URL:   https://forum.hashpwn.net
Base domain:    forum.hashpwn.net
Crawl depth:    2
ngram len:      1-3
Crawl delay:    20ms (increase this to avoid rate limiting)
Timeout:        1 sec
URLs crawled:   2
Processing...   [====================] 100.00%
Unique words:   475
Unique ngrams:  1977
Sorting n-grams by frequency...
Writing...      [====================] 100.00%
Output file:    forum.hashpwn.net_spider.txt
RAM used:       0.02 GB
Runtime:        2.283s
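
The "Crawl depth: 2" setting above drives a depth-limited walk over discovered links. A minimal sketch of that idea, with a hypothetical extractLinks helper standing in for real HTTP fetching and HTML parsing:

package main

import "fmt"

// crawl performs a breadth-first, depth-limited walk over links, the kind
// of "-crawl" depth logic this release fixed. extractLinks is a
// hypothetical stand-in for fetching a page and parsing its links.
func crawl(start string, maxDepth int, extractLinks func(string) []string) []string {
    seen := map[string]bool{start: true}
    frontier := []string{start}
    var crawled []string
    for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
        var next []string
        for _, u := range frontier {
            crawled = append(crawled, u)
            for _, link := range extractLinks(u) {
                if !seen[link] {
                    seen[link] = true
                    next = append(next, link)
                }
            }
        }
        frontier = next
    }
    return crawled
}

func main() {
    // toy link graph standing in for real pages
    graph := map[string][]string{
        "page1": {"page2", "page3"},
        "page2": {"page4"},
    }
    fmt.Println(crawl("page1", 2, func(u string) []string { return graph[u] }))
    // with a depth of 2 this visits page1, then page2/page3; page4 is
    // discovered but not crawled: [page1 page2 page3]
}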

Spider: File Mode

spider -file kjv_bible.txt -sort -ngram 1-3
 ---------------------- 
| Cyclone's URL Spider |
 ---------------------- 

Reading file:   kjv_bible.txt
ngram len:      1-3
Processing...   [====================] 100.00%
Unique words:   35412
Unique ngrams:  877394
Sorting n-grams by frequency...
Writing...      [====================] 100.00%
Output file:    kjv_bible_spider.txt
RAM used:       0.13 GB
Runtime:        1.359s

Spider is a wordlist and n-gram creation tool that crawls a given URL or processes a local file to create wordlists and/or n-grams, depending on the flags given.
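
To make the n-gram side concrete, here is a minimal sketch of n-gram counting and frequency sorting in the spirit of "-ngram 1-3 -sort"; it assumes nothing about Spider's actual implementation:

package main

import (
    "fmt"
    "sort"
    "strings"
)

// ngramCounts builds every n-gram of length minN through maxN from a word
// slice and tallies how often each occurs.
func ngramCounts(words []string, minN, maxN int) map[string]int {
    counts := make(map[string]int)
    for n := minN; n <= maxN; n++ {
        for i := 0; i+n <= len(words); i++ {
            counts[strings.Join(words[i:i+n], " ")]++
        }
    }
    return counts
}

func main() {
    words := strings.Fields("in the beginning god created the heaven and the earth")
    counts := ngramCounts(words, 1, 3)

    // -sort: order n-grams by descending frequency before writing
    keys := make([]string, 0, len(counts))
    for k := range counts {
        keys = append(keys, k)
    }
    sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })

    for _, k := range keys[:5] {
        fmt.Printf("%d %s\n", counts[k], k) // "the" tops the list with 3 hits
    }
}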

Usage Instructions:

  • To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
    • spider -url 'https://github.com/cyclone-github'
  • To set a crawl depth of 2 and create n-grams of length 1-5, use flags "-crawl 2" and "-ngram 1-5"
    • spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
  • To set a custom output file, use flag "-o filename"
    • spider -url 'https://github.com/cyclone-github' -o wordlist.txt
  • To set a delay to avoid being rate-limited, use flag "-delay n", where n is the delay in milliseconds (see the sketch after these instructions)
    • spider -url 'https://github.com/cyclone-github' -delay 100
  • To set a URL timeout, use flag "-timeout n", where n is the timeout in seconds
    • spider -url 'https://github.com/cyclone-github' -timeout 2
  • To create n-grams of length 1-3 and sort output by frequency, use flags "-ngram 1-3" and "-sort"
    • spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
  • To filter crawled URLs by keyword "foobar"
    • spider -url 'https://github.com/cyclone-github' -url-match foobar
  • To process a local text file, create n-grams of length 1-3, and sort output by frequency
    • spider -file foobar.txt -ngram 1-3 -sort
  • Run spider -help to see a list of all options
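
The "-delay" and "-timeout" flags (and the custom User-Agent from the changelog) map naturally onto standard net/http settings. A minimal sketch under that assumption; the fetch helper below is illustrative, not Spider's source:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// fetch retrieves one page with the release's custom User-Agent, honoring a
// per-request timeout and a polite delay between lookups.
func fetch(client *http.Client, rawURL string, delay time.Duration) ([]byte, error) {
    time.Sleep(delay) // -delay: pause between URL lookups to avoid rate limiting
    req, err := http.NewRequest(http.MethodGet, rawURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", "Spider/0.9.0 (+https://github.com/cyclone-github/spider)")
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

func main() {
    client := &http.Client{Timeout: 1 * time.Second} // -timeout 1
    body, err := fetch(client, "https://example.com", 20*time.Millisecond) // -delay 20
    if err != nil {
        fmt.Println("fetch error:", err)
        return
    }
    fmt.Printf("fetched %d bytes\n", len(body))
}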

spider -help

  -crawl int
        Depth of links to crawl (default 1)
  -cyclone
        Display coded message
  -delay int
        Delay in ms between each URL lookup to avoid rate limiting (default 10)
  -file string
        Path to a local file to scrape
  -ngram string
        Lengths of n-grams (e.g., "1-3" for 1, 2, and 3-length n-grams). (default "1")
  -o string
        Output file for the n-grams
  -sort
        Sort output by frequency
  -timeout int
        Timeout for URL crawling in seconds (default 1)
  -url string
        URL of the website to scrape
  -url-match string
        Only crawl URLs containing this keyword (case-insensitive)
  -version
        Display version

Compile from source:

  • If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
  • This assumes you have Go and Git installed
    • git clone https://github.com/cyclone-github/spider.git # clone repo
    • cd spider # enter project directory
    • go mod init spider # initialize Go module (skip this step if go.mod already exists)
    • go mod tidy # download dependencies
    • go build -ldflags="-s -w" . # compile binary in current directory
    • go install -ldflags="-s -w" . # compile binary and install to $GOPATH/bin

Antivirus False Positives:

  • Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious. These false positives primarily affect the Windows executable, but are not limited to it. If this concerns you, I recommend carefully reviewing the source code, then compiling the binary yourself.
  • Uploading your compiled binaries to https://virustotal.com and leaving an up-vote or a comment would be helpful as well.