
crawl_rules.policy needs additional option(s) #231

@DasUberLeo

Description

Problem Description

As of v0.22, crawl_rules.policy appears to support the following behaviours:

  • allow - Page is crawled and indexed
  • deny - Page is not crawled and not indexed
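
For context, a minimal sketch of a crawl config under the current behaviour, assuming the crawl_rules fields (policy, type, pattern) from the Open Crawler YAML config; the domain and patterns here are hypothetical:

```yaml
domains:
  - url: "https://example.com"
    crawl_rules:
      # Today the only way to skip listing pages is to deny them entirely,
      # which also stops the crawler from following their links.
      - policy: deny
        type: begins
        pattern: "/directory/"
      - policy: allow
        type: regex
        pattern: ".*"
```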

For sites that have index pages or site listings (example), this forces you to index those pages, or at least send them to Elasticsearch and remove them with an ingest pipeline - not very elegant.

Proposed Solution

Change the crawl_rules.policy behaviour as follows:

  • allow - Page is crawled and indexed; I would also propose renaming this value to index
  • deny - Page is not crawled and not indexed; I would also propose renaming this value to discard
  • new value - Page is crawled but not indexed; I would propose the value crawl

The addition of the crawl option would let the crawler pass through such pages, following their links for further crawling without indexing the pages themselves. This seems more in line with the deny behaviour of the previous Elastic Crawler.
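
Under this proposal, the same config could express "follow but don't index" directly. A hypothetical sketch (the crawl and index values do not exist today, and the domain and patterns are made up):

```yaml
domains:
  - url: "https://example.com"
    crawl_rules:
      # Proposed: follow links on directory pages without indexing the pages
      - policy: crawl    # hypothetical new value
        type: begins
        pattern: "/directory/"
      # index would be the proposed rename of today's allow
      - policy: index
        type: regex
        pattern: ".*"
```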

Alternatives

Other alternatives would be to have deny behave as it did in the previous crawler, or to add some kind of post-crawl filtering for such pages.

Additional Context

This came to my attention while crawling a site with a large, multi-page directory: when I attempted to filter the directory pages out, I noticed a significant drop in the number of crawled pages.
