A3. Crawler traps

In extreme cases, undesired content is so voluminous that it “traps” the Archive-It crawler in an endless crawl of unnecessary and frequently invalid URLs. This is typically the result of applications, scripts, or other dynamic web elements or services that can generate endless possible URLs based on information requested by the client. A crawler trap is usually easy to detect, because it produces disproportionately high crawled and queued URL counts in the crawl report of a timed-out crawl.
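To make the mechanism concrete, the following is a minimal sketch of how a dynamic page can produce URLs without end. The host example.org, the browse path, and the offset parameter are all hypothetical; the point is only that each generated page can link to yet another valid-looking URL, whether or not any content sits behind it.

```python
from itertools import islice

def next_links(base="http://example.org/browse", step=20):
    """Yield the 'next page' URL that each generated page links to, forever.

    Hypothetical script behavior: the link is built from the request alone,
    so it keeps appearing even after the real content has run out.
    """
    offset = 0
    while True:
        offset += step
        yield f"{base}?offset={offset}"

# A crawler that follows every such link never exhausts the queue.
# Here only the first three links are taken, to keep the example finite.
first_three = list(islice(next_links(), 3))
for url in first_three:
    print(url)
```

A crawler has no way to know from the URL alone that these pages are empty, which is why the counts in the crawl report keep growing until the crawl times out.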

Block calendars from trapping crawlers with speculative URL generation

The clearest example of the effect described above is the endless generation of new URLs by a calendar element. The Archive-It Standard web crawler is specifically designed to recognize and reject patterns characteristic of this problem, but it can still be trapped by them. In a crawl report, the problem shows up as an excessive run of sequential crawled URLs, and a potentially endless list of queued URLs, that include the word “calendar,” future dates and/or times, or similarly characteristic web calendar URL text. To limit the effects of this problem, follow the guidelines for using Host Constraints to block the crawl of URLs from a specific path.
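When a report contains thousands of queued URLs, a short script can help find the path worth blocking. The sketch below assumes the queued URLs have been saved as a plain list of strings; the function names and the two heuristics (the word “calendar” and any four-digit year later than the current one) are illustrative assumptions, not Archive-It's own detection logic.

```python
import re
from collections import Counter
from datetime import date
from urllib.parse import urlsplit

# Hypothetical heuristics for triage only.
CALENDAR_WORD = re.compile(r"calendar", re.IGNORECASE)
YEAR = re.compile(r"(?<!\d)(19|20)\d{2}(?!\d)")

def looks_like_calendar_trap(url, today=None):
    """Return True if the URL mentions 'calendar' or contains a future year."""
    today = today or date.today()
    if CALENDAR_WORD.search(url):
        return True
    return any(int(m.group(0)) > today.year for m in YEAR.finditer(url))

def suspect_paths(urls, today=None):
    """Count suspicious URLs per (host, first directory), most frequent first."""
    counts = Counter()
    for url in urls:
        if looks_like_calendar_trap(url, today):
            parts = urlsplit(url)
            first_dir = parts.path.split("/")[1] if "/" in parts.path else ""
            counts[(parts.netloc, "/" + first_dir)] += 1
    return counts.most_common()

queued = [
    "http://example.org/events/calendar/2031/05",
    "http://example.org/events/2031-05-01",
    "http://example.org/about",
]
print(suspect_paths(queued, today=date(2024, 1, 1)))
```

The host and path at the top of the output are candidates for a Host Constraints block; the script only helps with triage and does not change the crawl itself.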

Block content management systems (CMSs) from trapping crawlers with speculative URL generation

Content management systems like Drupal tend to distract the Heritrix web crawler with superfluous and ultimately invalid URLs, resulting in a less complete capture. The tell-tale sign of this problem is a timed-out crawl that returns seemingly endless recombinations and/or repetitions of the seed site’s directory paths.

Queued URLs from a test crawl of Freeman’s Auction House were seemingly endless and random combinations of the same and frequently repeating directories.

Because Archive-It does not offer the same kind of crawl limits for directory paths that it does for the total number of documents from a host, you must instead keep the crawler from getting distracted in these cases by blocking URLs that match a regular expression. To do so:

  1. Navigate to the relevant collection management page and click on the Collection Scope link

  2. Select the “Block URL If…” option from the drop-down menu

  3. Enter the host’s URL in the space provided

  4. Under the Block URL If… section at the bottom, select the option from the drop-down menu that reads “it Matches the Regular Expression:”

  5. To ignore URLs in which any directory repeats, enter this exact text into the box and click Add:

       ^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$

  6. To ignore URLs in which three or more of the listed directory names appear in a row, enter this exact text into the box, removing any irrelevant directory names and adding any other known problematic ones to the end of the list inside the parentheses, and click Add Rule:

       ^.*(/misc|/sites|/all|/themes|/modules|/profiles|/css|/field|/node|/theme){3}.*$
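Before saving the two rules, it can be worth checking them against a few URLs copied from the crawl report. The sketch below uses Python's re module as a stand-in for the crawler's own regular expression engine, which is an assumption; both patterns use only syntax (lazy quantifiers, backreferences, counted repetition) that the two engines are expected to treat alike. The example.com URLs and the would_block helper are hypothetical.

```python
import re

# Rule from step 5: any directory that repeats within the URL.
REPEATING_DIRS = re.compile(r"^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$")

# Rule from step 6: three or more listed directory names in a row.
DIR_RUNS = re.compile(
    r"^.*(/misc|/sites|/all|/themes|/modules|/profiles|/css|/field|/node|/theme){3}.*$"
)

def would_block(url):
    """Return True if either rule matches the URL."""
    return bool(REPEATING_DIRS.match(url) or DIR_RUNS.match(url))

samples = [
    "http://example.com/sites/all/sites/all/themes/logo.png",  # repeated directories
    "http://example.com/modules/modules/node",                 # adjacent repeat
    "http://example.com/node/themes/misc/page",                # run of three listed names
    "http://example.com/sites/default/files/report.pdf",       # ordinary path
    "http://example.com/about/staff",                          # ordinary path
]
for url in samples:
    print(would_block(url), url)
```

In this sketch the first three sample URLs are blocked and the last two are not. Note that the second rule also matches a path such as /sites/all/modules/, so trim its list of directory names if legitimate content on the site lives under such a path.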