A3. Crawler traps
In extreme cases, undesired content can be so voluminous that it “traps” the Archive-It crawler in an endless crawl of unnecessary and frequently invalid URLs. This is typically caused by applications, scripts, or other dynamic web elements or services that can generate an endless supply of possible URLs in response to client requests. A crawler trap is easy to detect: it inevitably produces inappropriately high crawled and queued URL counts, as seen on the crawl report of a timed-out crawl.
Block calendars from trapping crawlers with speculative URL generation
The clearest example of the effect described above is the endless generation of new URLs by a calendar element. The Archive-It Standard web crawler is specifically designed to recognize and reject URL patterns characteristic of this problem, but it can still be trapped by them. The phenomenon is easy to recognize in a crawl report: excessive sequential crawled URLs, and potentially endless queued URLs, that include the word “calendar,” future dates and/or times, or similarly characteristic web calendar URL text. To alleviate this problem, follow the guidelines for using Host Constraints to block the crawl of URLs from a specific path.
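The symptoms described above can be checked for programmatically when reviewing a crawl report. The following is a minimal sketch, not an Archive-It feature: the path fragments, query parameters, and the "more than one year in the future" threshold are all illustrative assumptions you would tune to your own seed sites.

```python
import re
from datetime import date

# Hypothetical patterns characteristic of speculative calendar URLs:
# an explicit calendar path, or date-navigation query parameters.
CALENDAR_TRAP = re.compile(
    r"/calendar/|[?&](?:month|year|date)=",
    re.IGNORECASE,
)

def looks_like_calendar_trap(url: str) -> bool:
    """Return True if a URL matches common calendar-trap patterns."""
    if CALENDAR_TRAP.search(url):
        return True
    # Also flag URLs containing a four-digit year well in the future,
    # since no real page is likely to exist there yet.
    for year in re.findall(r"(?<!\d)(20\d{2})(?!\d)", url):
        if int(year) > date.today().year + 1:
            return True
    return False
```

Running a sample of queued URLs through a check like this can confirm that a timed-out crawl really was spending its document budget on speculative calendar pages, before you add the corresponding Host Constraint.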
Block content management systems (CMSs) from trapping crawlers with speculative URL generation
Content management systems like Drupal tend to distract the Heritrix web crawler with superfluous and ultimately invalid URLs, resulting in less complete capture. The tell-tale sign of this problem is a timed-out crawl that returns seemingly endless recombinations and/or repetitions of the seed site’s directory paths.
Queued URLs from a test crawl of Freeman’s Auction House were seemingly endless and random combinations of the same and frequently repeating directories.
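The repeating-directory symptom can itself be expressed as a regular expression. The sketch below uses a backreference to flag URLs whose path contains the same directory segment repeated consecutively; the function name and the one-repetition threshold are assumptions for illustration, not part of Archive-It.

```python
import re

# Matches a path segment immediately followed by one or more copies of
# itself, e.g. /catalogs/catalogs/catalogs/ -- a common CMS trap symptom.
REPEATED_SEGMENT = re.compile(r"/([^/]+)(?:/\1)+/")

def has_repeated_directories(url: str) -> bool:
    """Return True if the URL's path repeats a directory segment."""
    return REPEATED_SEGMENT.search(url) is not None
```

A pattern like this can be used both to diagnose the problem in an exported list of queued URLs and, adapted to the crawler's regular expression syntax, as the block rule described in the steps that follow.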
Because Archive-It does not support the same kind of crawl-limiting options for directory paths that it offers for total documents from a host, you must instead keep the crawler from getting distracted in these cases by blocking URLs that match a regular expression. To do so: