1. Assess scope quality and capture completeness with crawl reports
Follow the directions below to evaluate the quality of a crawl's scope and the extent of its completeness, and to record recommendations for any necessary improvements to either or both:
Access crawl reports
When a one-time or scheduled crawl completes, Archive-It will notify the Head of the NYARC Web Archiving Program by email and include a direct hyperlink to a crawl report. The QA Technician may alternatively access all of NYARC’s crawl reports through the Reports link on the Archive-It interface’s main navigation bar, or selectively by collection through the View Reports link on each collection’s management page. From the default Summary tab to which this link takes you, note the date on which the crawl was completed and record this date on the first page of the QA report form.
Identify scope improvements
To begin assessing scope quality and capture completeness, click on the Hosts tab to retrieve a list of host URLs (the domains and subdomains that the crawler encountered around the web) and the respective volume of content found and captured, ignored, or blocked at each during the crawl.
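If it helps to review these figures outside of the interface, report data can be exported and inspected with a short script. The sketch below is a minimal example, assuming the Hosts report has been downloaded as a CSV file whose columns are named "Host", "URLs", "Out of Scope", "Blocked", and "Queued" (illustrative names; adjust them, and the hypothetical file name hosts-report.csv, to match the actual export):

```python
# Minimal sketch: summarize a downloaded Hosts report so that unusually large
# or unexpected hosts stand out. Column names are assumptions; match them to
# the actual export.
import csv

def summarize_hosts(report_path):
    """Print per-host counts, largest hosts first."""
    with open(report_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Sort by total URLs discovered, descending.
    rows.sort(key=lambda r: int(r.get("URLs", 0) or 0), reverse=True)
    for row in rows:
        print(f"{row['Host']:40} captured={row.get('URLs', '0'):>8} "
              f"out_of_scope={row.get('Out of Scope', '0'):>8} "
              f"blocked={row.get('Blocked', '0'):>6} "
              f"queued={row.get('Queued', '0'):>8}")

if __name__ == "__main__":
    summarize_hosts("hosts-report.csv")  # hypothetical file name
```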
Include vital hosted content
Review the Hosts tab’s “Out of Scope” column to identify content that was crawled but ultimately not captured, hosted on domains external to the crawled seed(s) or on subdomains internal to them.
URLs of embedded media (images, applications), downloadable documents (PDFs), stylesheets, and font or script libraries from directories other than the main seeds’ will frequently appear here because the crawler automatically deemed them to be outside of the scope of the collection. Recommendations for the exact host URLs to be added to the scope of future crawls, precisely as each appears in the “Host” column, must be recorded on each crawl’s QA report form.
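As a rough aid to this review, the following sketch (using the same assumed CSV export and column names as above) shortlists hosts that report out-of-scope URLs and whose names suggest embedded assets such as fonts, images, or script libraries; the keyword list is purely illustrative and should be tuned to each collection:

```python
# Minimal sketch: shortlist out-of-scope hosts that look like embedded-asset
# sources (fonts, images, stylesheets, script libraries). The keyword list and
# column names are assumptions, not a definitive rule set.
import csv

ASSET_HINTS = ("font", "cdn", "static", "img", "media", "assets", "ajax")

def candidate_scope_additions(report_path):
    """Return (host, out-of-scope count) pairs whose names hint at hosted assets."""
    candidates = []
    with open(report_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            out_of_scope = int(row.get("Out of Scope", 0) or 0)
            host = row["Host"].lower()
            if out_of_scope > 0 and any(hint in host for hint in ASSET_HINTS):
                candidates.append((row["Host"], out_of_scope))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for host, count in candidate_scope_additions("hosts-report.csv"):
        print(f"{host}: {count} out-of-scope URLs -- consider adding to scope")
```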
Access blocked content
Review URLs listed under the “Blocked” column to identify content that the Heritrix web crawler was prevented from capturing by a robots.txt exclusion file.
Recommendations for the exact host URLs for which robots.txt exclusions need to be ignored, precisely as each appears in the “Host” column, must be recorded on each crawl’s QA report form.
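Before recommending that a robots.txt exclusion be ignored, it can be useful to confirm that the host’s live robots.txt file still disallows crawling. The sketch below uses Python’s standard urllib.robotparser for that check; the user-agent string archive.org_bot is an assumption and should be replaced with whatever user agent your crawler reports:

```python
# Minimal sketch: confirm that a host's robots.txt actually disallows crawling
# before recommending that it be ignored. The user-agent string is an
# assumption; substitute the one your crawler identifies itself with.
from urllib.robotparser import RobotFileParser

def is_blocked(host, user_agent="archive.org_bot", test_path="/"):
    """Return True if the host's robots.txt disallows the given path."""
    parser = RobotFileParser()
    parser.set_url(f"https://{host}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return not parser.can_fetch(user_agent, f"https://{host}{test_path}")

if __name__ == "__main__":
    for host in ["example.com"]:  # hosts taken from the "Blocked" column
        print(host, "blocked by robots.txt:", is_blocked(host))
```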
Limit superfluous or undesired content
When crawls end due to time limits or when URLs appear in the “Queued” column, review data volumes and URLs across all columns under the Hosts tab to identify any hosts crawled or content captured beyond an appropriate extent.
While appropriateness is by necessity a subjective metric, examples of hosts sufficiently outside of NYARC’s collecting scope as to be limited across all collections include: advertising services, social networks, and non-NYARC museum collections databases. Crawler “traps” that speculatively generate seemingly endless URL possibilities, such as calendars, databases, and content management systems like Drupal, must also be identified for crawl refinement.
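Where a list of queued URLs is available (for example, one URL per line in a text file), a short script can help flag trap-like patterns before they consume a future crawl’s budget. The patterns and the file name queued-urls.txt below are illustrative assumptions, not a definitive rule set:

```python
# Minimal sketch: flag queued URLs whose patterns suggest crawler traps, such
# as calendar pages that generate a URL for every possible date. The patterns
# and the input file (one queued URL per line) are illustrative assumptions.
import re

TRAP_PATTERNS = [
    re.compile(r"calendar", re.IGNORECASE),
    re.compile(r"[?&](date|month|year)=\d+"),  # date-driven query strings
    re.compile(r"[?&]page=\d{3,}"),            # implausibly deep pagination
]

def flag_trap_urls(url_file):
    """Yield queued URLs that match any trap-like pattern."""
    with open(url_file, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url and any(p.search(url) for p in TRAP_PATTERNS):
                yield url

if __name__ == "__main__":
    for url in flag_trap_urls("queued-urls.txt"):
        print("possible trap:", url)
```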
If you cannot determine whether, or by how much, the crawled content exceeds the appropriate amount of content to archive from a given site, it is advisable to use an advanced Google search to estimate the rough extent of that site. Use the search syntax site:example.com to see how many pages a typical Google search retrieves from the website in question, and compare those results to the number of URLs crawled and/or queued in your report.
Recommendations for the exact host URLs to be blocked from or limited in future capture, precisely as each appears in the “Host” column, must be recorded on each crawl’s QA report form. Specific directions exist for using “Host Constraints” to selectively block known problem hosts and to limit speculative URL generators in future crawls.
When you have completed the above, proceed to:
2. Assess capture completeness and render quality through web browsing