Problem Description
As of v0.22, the crawl_rules.policy values appear to behave as follows:
allow
- Page is crawled and indexed

deny
- Page is not crawled and not indexed
For sites that have index pages or site listings (example), this forces you to either index those pages, or at least send them to Elasticsearch and then remove them using ingest pipelines - not very elegant.
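For concreteness, here is a sketch of a config that runs into this problem, assuming the documented domains/crawl_rules YAML layout (the example.com URL and /directory pattern are illustrative, not from a real site):

```yaml
domains:
  - url: https://example.com
    crawl_rules:
      # deny: /directory pages are neither fetched nor indexed, so any
      # pages reachable only through them are never discovered
      - policy: deny
        type: begins
        pattern: /directory
      # everything else is crawled and indexed
      - policy: allow
        type: regex
        pattern: .*
```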
Proposed Solution
Change the crawl_rules.policy behaviour as follows:
allow
- Page is crawled and indexed; I would also propose re-enumerating this as index

deny
- Page is not crawled and not indexed; I would also propose re-enumerating this as discard

crawl (new)
- Page is crawled but not indexed (see the config sketch below)
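Under this proposal, the directory from the earlier example could be traversed without being indexed. A sketch using the same assumed YAML layout (index and crawl here are the proposed values, not currently valid policies):

```yaml
domains:
  - url: https://example.com
    crawl_rules:
      # proposed: fetch /directory pages and follow their links,
      # but do not send the pages themselves to Elasticsearch
      - policy: crawl
        type: begins
        pattern: /directory
      # proposed rename of allow: crawl and index everything else
      - policy: index
        type: regex
        pattern: .*
```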
The addition of the crawl option would bounce through such pages, using their links for further crawling, but without forcing the pages themselves to be indexed. This seems more in line with the deny behaviour of the previous Elastic Crawler.
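In pseudocode terms, the three values would differ only in whether a fetched page is emitted for indexing. A minimal sketch; handle_page, fetch, enqueue_links, and index_document are hypothetical stand-ins, not the crawler's actual internals:

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    links: list[str] = field(default_factory=list)


def fetch(url: str) -> Page:
    # Stand-in for an HTTP fetch; a real crawler would parse out links here.
    return Page(url=url)


def enqueue_links(links: list[str]) -> None:
    print(f"enqueueing {len(links)} links")


def index_document(page: Page) -> None:
    print(f"indexing {page.url}")


def handle_page(url: str, policy: str) -> None:
    """Proposed three-valued policy: index / crawl / discard."""
    if policy == "discard":    # old deny: skip the page entirely
        return
    page = fetch(url)          # index and crawl both fetch the page...
    enqueue_links(page.links)  # ...and follow its outbound links
    if policy == "index":      # only index sends the document onward
        index_document(page)
```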
Alternatives
Other alternatives would be to have deny behave as per the previous crawler, or to add some kind of post-crawl filtering of such pages.
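The post-crawl filtering alternative can already be approximated by hand today, for example by deleting the unwanted documents after each crawl completes. A rough sketch using the Elasticsearch Python client; the index name and url field pattern are assumptions for illustration only:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Delete already-indexed directory listing pages after the crawl.
# "search-my-site" and the url pattern are hypothetical examples.
es.delete_by_query(
    index="search-my-site",
    query={"wildcard": {"url": {"value": "*/directory/*"}}},
)
```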
Additional Context
This came to my attention while crawling a site with a large, multi-page directory: when I attempted to filter the directory pages out, I noticed a significant drop in the number of crawled pages.