I. Introduction
Quality assurance (QA) principles for web archives
Definition of QA
Quality Assurance (hereafter “QA”) is the process of verifying the accuracy and integrity of an archived web instance and/or making appropriate interventions to improve them.
Capture, behavior, and appearance
While many different schemata are used to describe the characteristics of the web known to affect quality in an archival environment, all map, at their most general, to a model in which quality is evaluated and assured in terms of 1) capture, 2) behavior, and 3) appearance, wherein:
Capture describes the extent of content accessed and logged by a web crawler; the completeness of that content relative to what is accessible from the live web
Behavior describes the fidelity of actions and responses in the archived web instance relative to its live analog; the degree to which the same functions (navigation, document retrieval, etc.) are supported in both environments
Appearance describes the aesthetic similarity between the archived web instance and its live analog; the effectiveness with which the “look and feel” of the live web instance is replicated in the archival environment
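By way of illustration only, the three dimensions can be recorded per seed as a simple checklist during QA review. The following Python sketch is hypothetical; the class, field, and rating names are assumptions and not part of Archive-It or any NYARC tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Rating(Enum):
    """Coarse quality rating assigned by a QA technician (assumed scale)."""
    ACCEPTABLE = "acceptable"
    NEEDS_PATCH = "needs patch crawl"
    UNACCEPTABLE = "unacceptable"


@dataclass
class SeedQAReport:
    """QA assessment of one archived seed along the three quality dimensions."""
    seed_url: str
    capture: Rating      # completeness of content accessed and logged by the crawler
    behavior: Rating     # fidelity of navigation, retrieval, and other functions on replay
    appearance: Rating   # similarity of "look and feel" to the live web instance
    notes: list[str] = field(default_factory=list)

    def passes(self) -> bool:
        """A seed passes QA only when all three dimensions are acceptable."""
        return all(
            r is Rating.ACCEPTABLE
            for r in (self.capture, self.behavior, self.appearance)
        )


# Example: an archived exhibition site whose image galleries did not replay.
report = SeedQAReport(
    seed_url="https://example.org/exhibitions/",
    capture=Rating.ACCEPTABLE,
    behavior=Rating.NEEDS_PATCH,
    appearance=Rating.NEEDS_PATCH,
    notes=["Embedded image carousel does not load on replay."],
)
print(report.passes())  # False
```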
QA priorities for NYARC collections
Visual culture and visual literacy
In its stewardship of a uniquely visual medium used to disseminate largely visual information, NYARC places especially high priority on behavior and appearance metrics in the assurance of web archival quality. Comprehension of web-based art historical resources requires both text and context; the textual information disseminated among exhibition sites, artists’ websites, and auction catalogs, to name a few, must be viewed in the style and sequence, and with the necessary illustrations, intended by their creators.
Definition of content
For the above reason, the following QA procedures and guidelines define archivable web “content” more broadly than the strictly textual elements that chiefly concern other large web archiving operations. In NYARC’s context, content must be understood to also include embedded and dynamic imagery, responsive web applications and scripts, and externally hosted downloadable documents.
Problematic content types
Each of the above content types, and still others, can present significant barriers to web archiving, and especially to achieving a high-quality rendition of a live site within a new archival environment. For an inventory of the specific issues known to NYARC that may negatively affect quality, how to recognize them, and what strategies (if any) exist to ameliorate them, refer to the known quality problems and improvement strategies.
Social media & related web platforms
Social media (and other popular Web 2.0 content hosting platforms) exist in a constant state of change at the level of their source code, which is one reason such content can be challenging to capture and replay in web archives. Additionally, platforms such as Facebook and Instagram are known to be intentionally developed to resist archival capture. Other common content (such as videos from YouTube or Vimeo) may be relatively easily captured by Archive-It, but replay meets with intermittent success depending on the latest updates to each platform (and their compatibility, in turn, with Archive-It’s own latest version).
In order to determine the best approach to capturing common sites, platforms, and embedded content, NYARC QA technicians should consult Archive-It’s current documentation, which includes not only a real-time System Status page displaying the capture and replay behavior of social media and other embedded content, but also a list of suggestions for how to approach archival capture of many of the most common sources.
NYARC’s current collecting scope does not include discrete social media pages as such (Facebook, Twitter, or Instagram pages, or accounts on YouTube, SoundCloud, or Vimeo), but this content is frequently found embedded on sites within NYARC’s collections, including the sites of the institutions that comprise NYARC. Some such embedded content may be captured with only limited success, depending on the host platform, while other content is relatively unproblematic to capture. The relative ease of capture and replay for each host can be checked against the information in the System Status page.
A primary consideration when approaching such material is that Archive-It recommends its browser-based crawling technology, Brozzler, in particular for crawling social media and pages with embedded multimedia content.
Significant properties
Ultimately, compromises in total archival quality must be struck, lest a QA backlog permanently prevent archived material from reaching its end-users. Lacking explicit prior knowledge of these future end-users’ needs, determining “how good is good enough” is a fraught, subjective proposition. It is therefore incumbent upon web archiving staff to articulate and mutually agree upon significant properties: the content, functionality, and presentation elements that define the purpose and/or value of web instances within NYARC’s scope. Popular content types and presentation styles known to impede the QA process, and at the same time insignificant to NYARC’s broadest collecting scope, are documented among the known quality problems and improvement strategies. The degree to which other problematic content types are necessary to the behavior and appearance of a web instance must be determined on at least a collection-specific, and preferably a seed-specific, basis.
Archive-It capabilities and constraints
The Archive-It software suite is designed to maximize capture completeness, which has significantly positive downstream effects on a web instance’s behavior and appearance in an archival environment. NYARC, in turn, maintains a large measure of control over issues of quality encountered at the scoping and crawling phases. Issues specific to rendering and managing web archival content that are known to NYARC, and their respective mitigation strategies, are documented herein, but these most typically require the intervention of a contracted Archive-It engineer and/or developer.
Capture tool capabilities and limitations
Archive-It’s default tools for discovering web content, writing that content to WARC format, and logging/reporting the activity are the “Standard” web crawler (the default crawler for all Archive-It captures) and Brozzler, the Internet Archive’s browser-based crawling technology. Together, these tools crawl the web as would a search engine and a human user, and provide a modicum of the client-side generated information necessary to activate the responsive scripts on websites that enable further retrieval.
Capture completeness is, then, a reflection of the Archive-It crawlers’ success at discovering, accessing, and navigating all of the file paths necessary to construct a high-quality archival reflection of a given seed’s live web instance. Only the rarest and most narrowly scoped crawls (the product of very deliberate scoping and/or a limited extent of live content) will completely capture a live web instance on a first pass. For the vast majority of cases, a QA technician must initiate patch crawls in order to complete the archival record copy.
The two crawlers differ in important ways. The “Standard” crawler primarily follows hyperlinks and downloads files, while Brozzler “records interactions between servers and web browsers as they occur, more closely resembling how a human user would experience the web” (as Archive-It’s documentation puts it). Brozzler is often the more viable option for websites that contain particularly complex content and/or client-side scripts, especially if this content was not successfully captured on a first pass using Archive-It’s Standard crawling technology.
Consult Archive-It’s documentation on Brozzler for more details.
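To make the distinction concrete, the following sketch illustrates the hyperlink-following approach in simplified, hypothetical form; it is not Archive-It’s Standard crawler, and the scope rule and page limit are assumptions. A browser-based crawler such as Brozzler instead drives a real browser, executes each page’s scripts, and records the requests those scripts make, which is how it reaches content that never appears as a plain hyperlink.

```python
"""Minimal, illustrative link-following crawl: a sketch of the hyperlink-discovery
approach, NOT Archive-It's Standard crawler or Brozzler."""
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags, as a link-following crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, max_pages: int = 10) -> set[str]:
    """Fetch pages breadth-first, staying within the seed's host (a crude scope rule)."""
    scope_host = urlparse(seed).netloc
    queue, seen = deque([seed]), set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # a real crawler would log this failure for later QA review
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == scope_host:
                queue.append(absolute)
    return seen


# A browser-based crawler would instead load each page in a real browser, execute
# its scripts, and record every request the page makes during rendering.
if __name__ == "__main__":
    print(crawl("https://example.org/"))
```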
Patch crawling
Patch crawling, the process of discovering and incorporating web content erroneously omitted from a seed's initial crawl, is a distinguishing feature of NYARC's QA process. It is the most effective strategy to mitigate issues of archival capture completeness, which have downstream effects upon the ways in which archived web instances behave and appear. To date, Archive-It is the only web archiving technology suite known to provide significant automation of this process. It is a process, however, that still requires human selection and frequently tedious management to overcome its own limitations and inefficiencies. To maximize its efficiency, specific directions for conducting this process and guidelines for managing it are provided.
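Archive-It’s patch-crawl workflow runs inside its web application; purely as an illustration of the human selection involved, the hypothetical sketch below triages URLs found missing during replay review. The URL list, extension list, and host list are all assumptions.

```python
"""Hypothetical triage of URLs missing from an initial capture. This only
illustrates the kind of selection a QA technician performs before a patch crawl;
it is not Archive-It's patch-crawl mechanism."""
from urllib.parse import urlparse

# URLs reported missing when the archived seed was reviewed in replay
# (placeholder data for illustration).
missing_urls = [
    "https://example.org/images/installation-view-01.jpg",
    "https://example.org/files/exhibition-checklist.pdf",
    "https://cdn.example-analytics.com/tracker.js",
]

# File types assumed significant to behavior and appearance (illustrative list).
SIGNIFICANT_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".pdf", ".css", ".js"}
# Hosts assumed insignificant to the archived instance (illustrative list).
IGNORED_HOSTS = {"cdn.example-analytics.com"}


def select_for_patch(urls):
    """Keep URLs whose loss plausibly degrades quality; drop known-insignificant hosts."""
    selected = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc in IGNORED_HOSTS:
            continue
        if any(parsed.path.lower().endswith(ext) for ext in SIGNIFICANT_EXTENSIONS):
            selected.append(url)
    return selected


print(select_for_patch(missing_urls))
# ['https://example.org/images/installation-view-01.jpg',
#  'https://example.org/files/exhibition-checklist.pdf']
```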
Deduplication and continuous improvement
Deduplication is the process of automatically omitting from successive captures any live web content that is unchanged between archival crawl periods, replacing it in those successive archival iterations with references to the pre-existing archived content. Deduplication enables great efficiency in managing data budgets, and it also frees the QA technician from reviewing and/or enhancing aspects of archived web instances that have previously been assured for quality. Unless and until a web instance in NYARC’s scope introduces altogether new kinds of features or redesigns its entire site, comprehensive QA performed at the time of its initial capture can preclude the need for significant QA effort in the future.
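The underlying technique can be sketched in a few lines: compare a content digest for each URL against the previous crawl’s digest and store the payload only when it differs. The Python below is a simplified illustration of that general logic, not Archive-It’s implementation; the data and function names are assumptions.

```python
"""Illustrative digest-based deduplication: if a URL's payload hash matches the
previous crawl's, record a lightweight reference instead of storing the bytes again."""
import hashlib

# Payload seen in the previous crawl (placeholder data for illustration).
css_payload = b"body { margin: 0 }"

# Digest index from the previous crawl: URL -> SHA-1 hex digest of the payload.
previous_digests = {
    "https://example.org/styles/main.css": hashlib.sha1(css_payload).hexdigest(),
}


def deduplicate(url: str, payload: bytes):
    """Return ('revisit', digest) when content is unchanged since the last crawl,
    otherwise ('store', digest) and update the index with the new digest."""
    digest = hashlib.sha1(payload).hexdigest()
    if previous_digests.get(url) == digest:
        return ("revisit", digest)   # unchanged: reference the pre-existing capture
    previous_digests[url] = digest   # new or changed: store the fresh payload
    return ("store", digest)


# The unchanged stylesheet is recorded as a revisit; a new page is stored in full.
print(deduplicate("https://example.org/styles/main.css", css_payload))
print(deduplicate("https://example.org/index.html", b"<html>new page</html>"))
```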