B1. Responsive content
Highly user-responsive elements of the live web, such as applications, scripts, structured data, and time-based media, are especially challenging for Archive-It's "Standard" web crawler to capture completely, which has cascading negative impacts on archived web instances' behavior and appearance.
NYARC QA Technicians are generally advised to first attempt capture with the "Standard" crawler, and to then use the Brozzler crawler if the initial capture is significantly incomplete, especially for user-responsive, time-based, or other client-side content (e.g. scripts that display images and other dynamic content that changes based on user input).
The challenge of completely capturing web instances with responsive elements lies in crawling all possible permutations of the instance that these elements enable human web browsers to generate.
Corey Davis’s “simple” diagram (left) demonstrates how many moving parts must in fact be completely captured and synchronized in order to render the modest-looking browsing function on the University of Victoria’s Colonial Despatches website (right).
Insofar as all content required to render these permutations is made available by the host (the server) to the crawler (the client), complete capture and accurate rendition is generally feasible. Problems occur when the client is instead required to perform an action in order to provide information that dynamically generates novel content: a uniquely human-interactive process that web crawlers like Heritrix are not designed to perform. In an archival environment, this problem manifests as dysfunctional navigational tools, unresponsive data querying tools and management systems, and partially or fully nonfunctional time-based media applications.
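As a rough pre-crawl heuristic, a page's raw HTML can be scanned for markers of client-side rendering, such as externally loaded script bundles and inline scripts that fetch data after page load, since these are the elements a link-following crawler is most likely to miss. The sketch below is an illustration only, not an Archive-It or NYARC tool, and the sample HTML is hypothetical:

```python
# A minimal sketch: count markers commonly associated with client-side
# rendering in a page's raw HTML. Pages scoring high on these markers
# are candidates for Brozzler rather than the "Standard" crawler.
from html.parser import HTMLParser

class ClientSideMarkers(HTMLParser):
    """Tallies <script src=...> tags and inline scripts that request data."""
    def __init__(self):
        super().__init__()
        self.external_scripts = []  # src URLs of external <script> tags
        self.inline_fetches = 0     # inline scripts using fetch/XMLHttpRequest
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external_scripts.append(src)
            else:
                self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies are delivered to handle_data; look for data requests
        # that would generate content a non-browsing crawler never sees.
        if self._in_script and ("fetch(" in data or "XMLHttpRequest" in data):
            self.inline_fetches += 1

# Hypothetical page: an empty app root populated entirely by script.
sample = """<html><body>
<div id="app"></div>
<script src="/static/bundle.js"></script>
<script>fetch('/api/items').then(r => r.json());</script>
</body></html>"""

scanner = ClientSideMarkers()
scanner.feed(sample)
print(scanner.external_scripts)  # ['/static/bundle.js']
print(scanner.inline_fetches)    # 1
```

A high count of script-generated content does not guarantee a failed capture, but it flags seeds worth checking closely during QA.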
To capture web presences that rely heavily on server-side scripting most efficiently, follow the guidelines for identifying and mitigating the effects of excessive undesired content.
Externally hosted applications
For information specific to applications and services external to the host seed(s) of a given crawl, such as video and digital publishing applications, refer to Archive-It’s guidelines for scoping specific types of sites.
Social media services (e.g. Facebook, Twitter, Instagram, Pinterest), their respective account pages, and embedded feeds and widgets fall in many cases outside of NYARC's collecting scope. Key exceptions include multimedia content (YouTube, SoundCloud, and Vimeo), which is frequently made available on websites in NYARC's collections, including the sites of the institutions that comprise NYARC.