A1. Hosted content

The Heritrix web crawler may treat linked components as “out of scope” when they are accessible only at URLs on hosts wholly external to the given crawl’s seed URL(s). Such components include images, downloadable documents, stylesheets, and font or script libraries, any of which can be necessary to the completeness, behavior, and appearance of a captured web instance.

The significance of this uncaptured content may be verified through Wayback QA. Expanding the crawler settings to generally “scope-in” these hosts is preferable to patch crawling individual URLs whenever the list of content automatically deemed out of scope is too long to be patch crawled efficiently or too dynamic to be patch crawled effectively (i.e., before the content moves or is taken offline and therefore cannot be crawled at all).
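
As a rough aid to that decision, the sketch below tallies out-of-scope URLs by host so that hosts contributing many URLs stand out as candidates for scoping-in. It is a simplified illustration, not Heritrix's own scoping logic: it compares hostnames only, and the function names, the example URLs, and the threshold value are all assumptions chosen for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def out_of_scope_hosts(seed_urls, linked_urls):
    """Count linked URLs per host, for hosts that no seed URL shares.

    Assumption: "in scope" is approximated as "same host as a seed".
    """
    seed_hosts = {host_of(u) for u in seed_urls}
    counts = Counter()
    for url in linked_urls:
        host = host_of(url)
        if host and host not in seed_hosts:
            counts[host] += 1
    return counts


def hosts_to_scope_in(counts, threshold):
    """Hosts with at least `threshold` out-of-scope URLs, most frequent first."""
    return [host for host, n in counts.most_common() if n >= threshold]


# Hypothetical seed and linked URLs, for illustration only.
seeds = ["https://www.example.org/"]
links = [
    "https://www.example.org/about",
    "https://cdn.example.net/site.css",
    "https://cdn.example.net/site.js",
    "https://fonts.example.com/body.woff",
]
counts = out_of_scope_hosts(seeds, links)
candidates = hosts_to_scope_in(counts, threshold=2)
```

Hosts that fall below the threshold would remain candidates for patch crawling individual URLs; the appropriate threshold is a judgment call for each collection.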