B1. Responsive content
Highly user-responsive elements of the live web, such as applications, scripts, structured data, and time-based media, are especially challenging for Archive-It's "Standard" web crawler to capture completely, which has cascading negative effects on the behavior and appearance of archived web instances.
It is generally suggested that NYARC QA Technicians first attempt to capture content using the "Standard" crawler, and use the Brozzler crawler only if the initial capture is significantly incomplete, especially for user-responsive, time-based, or other client-side content (e.g. scripts that display images and other dynamic content that changes in response to user input).
It remains NYARC's responsibility to determine, on a crawl-by-crawl basis, precisely how much an incomplete capture can be improved through the available methods of scope adjustment, patch crawling, or Archive-It intervention.
The challenge of completely capturing a web instance with responsive elements lies in crawling all of the possible permutations of that instance that these elements allow a human user's browser to generate.
Corey Davis's "simple" diagram demonstrates how many moving parts must in fact be completely captured and synchronized in order to render the modest-looking browsing function on the University of Victoria's Colonial Despatches website.
Insofar as all content required to render these permutations is made available by the host (the server) to the crawler (the client), complete capture and accurate rendition is generally feasible. Problems occur when the client is instead required to perform an action in order to provide information that dynamically generates novel content, a uniquely human-interactive process that web crawlers like Heritrix are not designed to perform. In an archival environment, this problem manifests as dysfunctional navigational tools, unresponsive data querying tools and management systems, and partially functional or nonfunctional time-based media applications.
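As a rough illustration of this underlying problem (not a depiction of Heritrix internals), the sketch below contrasts what a markup-only link extractor can see in static HTML with content whose URL is only assembled by a client-side script at the moment a visitor acts. The page, script, and URLs are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the full-size image URL is only assembled by JavaScript
# when a visitor clicks a thumbnail, so it never appears in the markup itself.
PAGE = """
<html><body>
  <a href="/exhibitions/2019.html">Exhibitions</a>
  <div id="gallery" data-base="/img/full/"></div>
  <script>
    // Runs only in a browser, in response to a click:
    // document.querySelector('#gallery').src = base + selectedId + '.jpg';
  </script>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collects href/src values the way a markup-only crawler would."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

extractor = LinkExtractor()
extractor.feed(PAGE)
print(extractor.urls)  # prints ['/exhibitions/2019.html']; the script-built image URL is invisible
```

A browser-based crawler like Brozzler can execute such scripts and so discover some of these runtime-only URLs, which is why it is the suggested second attempt for heavily client-side content.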
Client-side scripts
Before seeking assistance from Archive-It, make every attempt to capture embedded client-side scripts such as JavaScript from the live web instance by way of scope adjustment and patch crawling. Follow the guidelines for iteratively activating, detecting, and patch crawling missing dynamic content before concluding that no more missing content can be captured without third-party intervention. Report all such failures to patch crawl missing content on the QA report form, after which the Head, Web Archiving Program may determine whether the significant properties of the web instance that NYARC wishes to archive require further corrective action.
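Purely as an illustration of the "detecting" step, and not as part of the official workflow, a QA technician could script a quick comparison of the resources a live page embeds against what resolves through the archived copy. The sketch below assumes an Archive-It Wayback prefix of the form https://wayback.archive-it.org/<collection>/<timestamp>/, with placeholder seed, collection, and timestamp values.

```python
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin

LIVE_URL = "https://www.example.org/"  # placeholder seed
WAYBACK_PREFIX = "https://wayback.archive-it.org/1234/20240101000000/"  # placeholder collection/timestamp

class ResourceExtractor(HTMLParser):
    """Collects embedded resource URLs (scripts, images, stylesheets) from markup."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.resources.add(urljoin(self.base, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.resources.add(urljoin(self.base, attrs["href"]))

live_html = requests.get(LIVE_URL, timeout=30).text
extractor = ResourceExtractor(LIVE_URL)
extractor.feed(live_html)

# Report any live resource that does not resolve through the archived instance;
# these are candidates for patch crawling.
for resource in sorted(extractor.resources):
    archived = WAYBACK_PREFIX + resource
    status = requests.head(archived, allow_redirects=True, timeout=30).status_code
    if status >= 400:
        print(f"Missing from capture ({status}): {resource}")
```

Note that resources requested only at runtime will not appear in the static markup at all, which is why the guidelines also call for activating the dynamic content before concluding that it is missing.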
Server-side scripts
Server-side scripting languages like PHP or ASP.NET present fewer challenges to accessing and capturing web content than client-side scripts do, but they require careful scoping and management. Archive-It's "Standard" crawler can activate the iterative URLs generated by databases and content management systems that rely on server-side scripts, sometimes to the point that those iterations effectively become a crawler trap.
To capture web presences laden with server-side scripting languages most efficiently, follow the guidelines for identifying and mitigating the effects of excessive undesired content.
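To make the "identifying" half of that work concrete, the sketch below groups crawled URLs by their query-stripped pattern and flags paths whose server-side parameters multiply into near-duplicate captures, a common crawler-trap signature (e.g. calendars or faceted search). The sample URLs and threshold are illustrative only; actual scope rules should be set following the guidelines referenced above.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample of URLs taken from a crawl report.
crawled = [
    "https://www.example.org/events?month=2019-01",
    "https://www.example.org/events?month=2019-02",
    "https://www.example.org/events?month=2019-03",
    "https://www.example.org/events?month=1999-01",   # calendar pages stretch indefinitely
    "https://www.example.org/exhibitions/current",
    "https://www.example.org/search?sort=date&page=1",
    "https://www.example.org/search?sort=title&page=1",
    "https://www.example.org/search?sort=date&page=2",
]

# Count how many query-string variants each path accumulates.
patterns = Counter(
    urlsplit(url)._replace(query="", fragment="").geturl() for url in crawled
)

THRESHOLD = 3  # illustrative cutoff; tune against the real crawl report
for path, count in patterns.most_common():
    if count >= THRESHOLD:
        print(f"Possible crawler trap: {path} ({count} query variants)")
```

A path flagged this way can then be constrained with a scope rule (for example, one that blocks URLs matching a regular expression) so that the desired content is still captured without exhausting the data budget.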
Externally hosted applications
For information specific to applications and services external to the host seed(s) of a given crawl, such as video and digital publishing applications, refer to Archive-It’s guidelines for scoping specific types of sites.
Social media
Social media services (e.g. Facebook, Twitter, Instagram, Pinterest), including their respective account pages and embedded feeds and widgets, in many cases fall outside of NYARC's collecting scope. Key exceptions include multimedia content (e.g. YouTube, SoundCloud, and Vimeo) that is frequently made available on websites in NYARC's collections, including the sites of the institutions that comprise NYARC.
That said, capturing unnecessarily large or extraneous video and audio content is generally inadvisable from a data-budgeting standpoint. Decisions about whether to capture multimedia and/or other social media content on a given website or seed should therefore be made on a case-by-case basis, in consultation with the Head, Web Archiving Program, provided that capturing the particular host is currently technically feasible (which can be determined by checking Archive-It's System Status page).