Content management systems like Drupal tend to distract the Heritrix web crawler with superfluous and ultimately invalid URLs, resulting in less complete capture. The tell-tale sign of this problem is a timed-out crawl whose queue contains seemingly endless recombinations and repetitions of the seed site’s directory paths.
In a test crawl of Freeman’s Auction House, for example, the queued URLs consisted of endless, random combinations of the same frequently repeating directories.
Because Archive-It does not support per-directory crawl limits the way it supports limits on the total number of documents from a host, you must instead keep the crawler from getting distracted in these cases by blocking URLs that match a regular expression. To do so:
Navigate to the relevant collection management page and click on the Collection Scope link
Select the “Block URL If…” option from the drop-down menu
Enter the host’s URL in the space provided
Under the Block URL If… section at the bottom, select the option from the drop-down menu that reads “it Matches the Regular Expression:”
To ignore URLs in which any directory repeats, enter this exact text into the box: ^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$ and click Add
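Before adding the rule, it can help to sanity-check the expression locally. The sketch below uses Python’s `re` module with hypothetical example URLs; Heritrix itself evaluates rules with Java’s regex engine, which should treat the backreferences and alternation here the same way.

```python
import re

# The first alternative matches a directory segment that repeats anywhere
# later in the URL; the second catches an immediate repeat (e.g. /misc/misc/),
# which the first misses because the two occurrences share a slash.
REPEAT = re.compile(r"^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$")

urls = [
    "http://example.com/sites/all/sites/all/page",  # "/sites/" repeats: blocked
    "http://example.com/misc/misc/jquery.js",       # immediate repeat: blocked
    "http://example.com/about",                     # no repetition: crawled
]
for url in urls:
    print(url, "BLOCKED" if REPEAT.match(url) else "OK")
```

Note that a legitimate URL never matches merely because it is deep; only an actual repeated `/directory/` segment triggers the rule.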
To ignore URLs in which any of these directories appear consecutively three or more times, enter this exact text into the box, removing any directory names that are irrelevant to your site and appending any additional known problem directories to the end of the list: ^.*(/misc|/sites|/all|/themes|/modules|/profiles|/css|/field|/node|/theme){3}.*$ and click Add Rule
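This second expression can be checked the same way. In the hypothetical Python sketch below, the alternation list mirrors the one above; in practice you would edit it to match your own site’s problem directories.

```python
import re

# {3} requires three consecutive matches from the alternation list, so only
# runs of these Drupal directories (e.g. /sites/all/themes) are blocked.
DRUPAL_RUN = re.compile(
    r"^.*(/misc|/sites|/all|/themes|/modules|/profiles"
    r"|/css|/field|/node|/theme){3}.*$"
)

urls = [
    "http://example.com/sites/all/themes/style.css",    # three in a row: blocked
    "http://example.com/sites/default/files/logo.png",  # only one: crawled
]
for url in urls:
    print(url, "BLOCKED" if DRUPAL_RUN.match(url) else "OK")
```

Because the rule blocks any run of three such directories, including some paths a real Drupal theme might serve, review a test crawl’s queue before applying it to a production crawl.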