Images, PDFs, and other ‘static’ media elements

The Heritrix crawler can deem long lists of linked images, downloadable PDFs, and similar files out of scope when they are hosted on external domains far removed from the crawl’s seed URL(s). To avoid losing some or all of these elements, even after running patch crawls, add the relevant host(s) to the collection’s scoping rules.

Images significant to viewing and understanding the Danziger Gallery website, but hosted elsewhere, were ruled out of scope during an initial crawl.
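To illustrate why externally hosted media can fall out of scope, here is a minimal Python sketch. It assumes a deliberately simplified model in which a URL is in scope only when its host matches a seed's host; this is not Heritrix's actual scoping logic, and the function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host portion of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def in_default_scope(url: str, seeds: list[str]) -> bool:
    """Simplified assumption: in scope only if the host matches a seed host."""
    seed_hosts = {host_of(seed) for seed in seeds}
    return host_of(url) in seed_hosts


# Hypothetical seed and discovered links, for illustration only.
seeds = ["https://www.example-gallery.com/"]
links = [
    "https://www.example-gallery.com/exhibitions/",
    "https://images.example-cdn.net/photos/001.jpg",
]

# Links whose host differs from every seed host are treated as out of scope.
out_of_scope = [url for url in links if not in_default_scope(url, seeds)]
print(out_of_scope)
```

Under this simplified model, the image on the separate host is the only link left out, which mirrors the situation described above.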

Expand scope

To expand a collection’s scoping rules to explicitly include externally hosted content:

  1. Click the Collection Scope tab on the collection’s main management page.

  2. Under the “Add Collection Scope Rule” menu, select the option to “Expand Scope to Include URL if…”

  3. Select the subsequent option “it Contains the text,” then enter the relevant host precisely as it appears in the “Hosts” column of the crawl report.

  4. Click the Add Rule button.
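The effect of a rule like the one created in the steps above can be sketched in Python. This is only an illustration, assuming that “it Contains the text” behaves as a plain substring match against the full URL; the function name, host values, and URLs are hypothetical and do not come from the crawler itself.

```python
def is_in_scope(url: str, default_in_scope: bool, contains_rules: list[str]) -> bool:
    """Return True if the URL is in scope by default or matches an expand rule.

    Assumption: each rule is a literal text fragment, and a URL matches when
    the fragment appears anywhere in it.
    """
    return default_in_scope or any(text in url for text in contains_rules)


# Hypothetical host, entered exactly as it would appear in a crawl report.
rules = ["images.example-cdn.net"]

image_url = "https://images.example-cdn.net/photos/001.jpg"
other_url = "https://ads.example-tracker.org/pixel.gif"

print(is_in_scope(image_url, default_in_scope=False, contains_rules=rules))
print(is_in_scope(other_url, default_in_scope=False, contains_rules=rules))
```

Because the match in this sketch is a literal substring, a misspelled or partial host would fail to match, which is why the host should be copied exactly from the “Hosts” column.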