1. Assess scope quality and capture completeness with crawl reports
Access crawl reports
When a one-time or scheduled crawl completes, Archive-It notifies the Head of the NYARC Web Archiving Program by email and includes a direct hyperlink to the crawl report. Alternatively, the QA Technician may access all of NYARC’s crawl reports through the Reports link on the Archive-It interface’s main navigation bar, or by collection through the View Reports link on each collection’s management page. On the default Summary tab that this link opens, note the date on which the crawl was completed and record it on the first page of the QA report form.
Identify scope improvements
Include vital hosted content
Review the Hosts tab’s “Out of Scope” column to identify content that was discovered during the crawl but ultimately not captured, whether hosted on domains external to the crawled seed(s) or on subdomains internal to them.
URLs of embedded media (images, applications), downloadable documents (PDFs), stylesheets, and font/script libraries hosted outside the main seeds’ directories will frequently appear here because the crawler automatically deemed them to be outside the scope of the collection. Recommendations for the exact host URLs to be added to the scope of future crawls, precisely as each appears in the “Host” column, must be recorded on each crawl’s QA report form.
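The scope decision described above is, at its core, a host comparison against the crawled seed. The Python sketch below is a simplification for illustration only (the seed domain and asset URLs are hypothetical, and Heritrix’s real scoping rules are considerably more elaborate); it shows why embedded assets served from external hosts land in the “Out of Scope” column:

```python
from urllib.parse import urlparse

# Hypothetical seed domain for a crawl.
SEED_DOMAIN = "example-museum.org"

def in_scope(url: str) -> bool:
    """Treat a URL as in scope if its host is the seed domain or any
    subdomain of it -- a simplified stand-in for Heritrix scoping."""
    host = urlparse(url).netloc
    return host == SEED_DOMAIN or host.endswith("." + SEED_DOMAIN)

# Embedded assets a page on the seed site might reference:
assets = [
    "https://www.example-museum.org/css/site.css",    # seed subdomain
    "https://cdn.example-cdn.net/fonts/serif.woff2",  # external font host
    "https://images.example-museum.org/hero.jpg",     # seed subdomain
    "https://docs.example-docs.com/report.pdf",       # external document host
]
for url in assets:
    print("in scope " if in_scope(url) else "out of scope", url)
```

The external font and document hosts in this example are exactly the kind of entries that would need to be recommended for addition to the crawl’s scope.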
Access blocked content
Review URLs listed under the “Blocked” column to identify content that a robots.txt exclusion file prevented the Heritrix web crawler from capturing.
Recommendations for the exact host URLs for which robots.txt exclusions need to be ignored, precisely as each appears in the “Host” column, must be recorded on each crawl’s QA report form.
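A robots.txt file blocks compliant crawlers with simple User-agent and Disallow directives. The sketch below (the host and rules are hypothetical) uses Python’s standard urllib.robotparser to show how a single Disallow line excludes an entire directory from a well-behaved crawler such as Heritrix:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be served at
# https://example-gallery.org/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /exhibitions/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks each URL against the rules before fetching.
print(rp.can_fetch("*", "https://example-gallery.org/about"))             # → True
print(rp.can_fetch("*", "https://example-gallery.org/exhibitions/2019"))  # → False
```

Every URL under /exhibitions/ on such a host would surface in the “Blocked” column, which is why the recommendation to ignore the exclusion is recorded at the host level.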
Limit superfluous or undesired content
When a crawl ends because it reached its time limit, or when URLs appear in the “Queued” column, review the data volumes and URLs across all columns of the Hosts tab to identify any hosts crawled and/or content captured beyond an appropriate extent.
While appropriateness is by necessity a subjective metric, examples of hosts sufficiently outside of NYARC’s collecting scope as to be limited across all collections include: advertising services, social networks, and non-NYARC museum collections databases. Crawler “traps” that speculatively generate seemingly endless URL possibilities, such as calendars, databases, and content management systems like Drupal, must also be identified for crawl refinement.
Recommendations for the exact host URLs to be blocked from or limited in future capture, precisely as each appears in the “Host” column, must be recorded on each crawl’s QA report form. Specific directions exist for using “Host Constraints” to selectively block known problem hosts and to limit speculative URL generators in future crawls.
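Calendar-style crawler traps are often recognizable in a host’s URL list because one path accumulates endlessly varying query strings. The Python sketch below is one illustrative way to flag candidates for host constraints (the URLs and the threshold are hypothetical, and this is not an Archive-It feature):

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical queued URLs from a crawl report; a calendar module keeps
# generating new month permutations of the same events page.
queued_urls = [
    "https://example-arts.org/events?month=2019-01",
    "https://example-arts.org/events?month=2019-02",
    "https://example-arts.org/events?month=2019-03",
    "https://example-arts.org/events?month=1997-07",
    "https://example-arts.org/about",
]

# Count how many query-string variants each path accumulates; a path with
# many variants is a likely speculative URL generator (crawler trap).
variants = Counter(urlparse(u).path for u in queued_urls if urlparse(u).query)
suspects = [path for path, n in variants.items() if n >= 3]
print(suspects)  # → ['/events']
```

A path flagged this way would be a candidate for the kind of host constraint recommendation recorded on the QA report form.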