A2. Excessive undesired content
Just as easily as it may pass over vital content, the Heritrix crawler may expend undue time and data allowance on content that does not belong in the given collection. Limiting the time and data spent on such content improves the chances of capturing the desired elements of a site completely before quality assurance begins, which in turn reduces the need for iterative patch crawls.
To block or limit the crawler’s future interaction with excessive undesired content, apply one of the following host constraints to the collection:
Block host from crawl
Begin by noting each problematic host URL precisely as it appears in the Hosts tab of a crawl report, then:
Navigate to the collection’s Collection Scope tab on its management page.
Click on Add Collection Scope Rule.
Click on the Block Hosts button and enter the hosts from the crawl report above into the text box.
Check the respective box if you wish to block crawler access entirely.
Click the Add Rule button.
The precise host domains and/or subdomains that are problematic enough to block entirely will vary by collection. Because a significant portion of extraneous content comes from ad services, host domains related to ad providers (Amazon, Google, and Facebook, for example) should be blocked for all collections. Social media sites such as Vimeo and Twitter can be another major source of unnecessary content. NYARC QA staff should consult with the Head of the Web Archiving Program before adding any new collection-level host blocks.
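For QA staff who prefer to review a crawl's Hosts report offline before entering blocks, the following Python sketch shows one way to flag candidate hosts. It assumes a hypothetical CSV export of the Hosts tab with columns named host and urls (the actual export's column names may differ), and the block patterns listed are illustrative only.

```python
import csv

# Illustrative substrings for ad-service and social media hosts that are
# commonly blocked; adjust to match the hosts observed in the crawl report.
BLOCK_PATTERNS = ["doubleclick", "googlesyndication", "amazon-adsystem",
                  "facebook", "vimeo", "twitter"]

def candidate_blocks(hosts_report_path):
    """Yield (host, document count) rows whose host matches a block pattern."""
    with open(hosts_report_path, newline="") as f:
        for row in csv.DictReader(f):
            host = row["host"]       # hypothetical column name
            docs = int(row["urls"])  # hypothetical column name
            if any(pattern in host for pattern in BLOCK_PATTERNS):
                yield host, docs

if __name__ == "__main__":
    for host, docs in candidate_blocks("hosts-report.csv"):
        print(f"Candidate for blocking: {host} ({docs} URLs)")
```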
Block URLs from a specific path
If, rather than an entire host domain or subdomain, you wish to block access to file paths whose URLs include a specific string of text:
Click on the Add Rule link in the “Block URLs if” column of the relevant host’s row.
Select the option to “Ignore any URL that contains the following text” and enter the precise string of text as it appears in the relevant URLs (a sketch of this substring match follows these steps).
Click the Add button.
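The rule above performs a simple substring match against each URL. As a rough illustration of its effect, the sketch below filters a list of URLs the same way; the sample URLs and the /calendar? string are hypothetical.

```python
# Hypothetical rule: ignore any URL that contains the text "/calendar?"
IGNORE_TEXT = "/calendar?"

urls = [
    "https://example.org/exhibitions/current",
    "https://example.org/calendar?month=03&year=2015",
    "https://example.org/calendar?month=04&year=2015",
]

# URLs the crawler would still visit under this rule
kept = [url for url in urls if IGNORE_TEXT not in url]
print(kept)  # ['https://example.org/exhibitions/current']
```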
Limit the crawl extent for hosts or specific paths
If instead of blocking you wish merely to limit the number of URLs that the crawler may access from a host: click on the click me text in the “Doc Limit” column of the relevant host’s row and type an appropriate number into the text field. The appropriate limit will vary by host, but default NYARC practice is to set a limit of 1,000 documents on external hosts that offer some desired content alongside excessive undesired content, such as YouTube, Scribd, etc.
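When deciding which external hosts need a document limit, it can help to tally how many documents each host contributed to a previous crawl. The sketch below is a rough illustration with hypothetical per-host counts, treating 1,000 as the default NYARC limit.

```python
DOC_LIMIT = 1000  # default NYARC limit for external hosts with mixed content

# Hypothetical per-host document counts copied from a crawl's Hosts report
host_doc_counts = {
    "www.youtube.com": 7421,
    "www.scribd.com": 1893,
    "fonts.googleapis.com": 112,
}

for host, count in sorted(host_doc_counts.items(), key=lambda kv: kv[1], reverse=True):
    if count > DOC_LIMIT:
        print(f"{host}: {count} documents -- consider a Doc Limit of {DOC_LIMIT}")
```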
Block or limit the crawl extent for speculatively generated URLs
To block or limit the crawl of file paths that include repeat or seemingly randomly generated combinations of directories, as is typical of content management systems like Drupal, follow the instructions for avoiding crawler traps with regular expressions.
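As a rough illustration of the kind of regular expression involved, the sketch below flags URLs whose paths repeat the same directory segment, a pattern typical of speculatively generated URLs. The expression and sample URLs are illustrative only, not the exact rule applied in the crawler configuration.

```python
import re

# Match a URL whose path contains the same "/segment/" more than once,
# e.g. /sites/default/files/sites/default/files/...
REPEATED_SEGMENT = re.compile(r"^.*?(/[^/]+/).*\1.*$")

urls = [
    "https://example.org/sites/default/files/sites/default/files/image.jpg",
    "https://example.org/sites/default/files/image.jpg",
]

for url in urls:
    label = "possible trap" if REPEATED_SEGMENT.match(url) else "ok"
    print(f"{label}: {url}")
```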