Quality assurance (QA) principles for web archives
Definition of QA
Quality Assurance (hereafter “QA”) is the process of verifying and/or making appropriate interventions to improve the accuracy and integrity of an archived web instance.
Capture, behavior, and appearance
While many different schemata describe the characteristics of the web known to affect quality in an archival environment, most map generally to a model in which quality is evaluated and assured in terms of 1) capture, 2) behavior, and 3) appearance, wherein:
Capture describes the extent of content accessed and logged by a web crawler; the completeness of that content relative to what is accessible from the live web
Behavior describes the fidelity of the actions and responses supported by the archived web instance to those of its live analog; the degree to which the same functions (navigational, document retrieval, etc.) work in both environments
Appearance describes the aesthetic similarity between the archived web instance and its live analog; the effectiveness with which the “look and feel” of a web instance is replicated from its live environment
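The three-part model above can be sketched as a simple record type. This is an illustrative sketch only; the class and field names below are assumptions for demonstration, not drawn from any NYARC or Archive-It schema:

```python
from dataclasses import dataclass


@dataclass
class QAEvaluation:
    """One QA review of an archived web instance (hypothetical record shape)."""
    seed_url: str
    capture_complete: bool     # all in-scope content accessed and logged by the crawler
    behavior_accurate: bool    # navigation, document retrieval, etc. work as on the live site
    appearance_faithful: bool  # "look and feel" matches the live instance

    def passes(self) -> bool:
        # An instance is assured only when all three dimensions are acceptable
        return self.capture_complete and self.behavior_accurate and self.appearance_faithful


review = QAEvaluation("https://example.org", True, True, False)
print(review.passes())  # False: appearance problems alone fail the review
```

Structuring reviews this way makes explicit that a capture can be complete yet still fail QA on behavior or appearance grounds, which is the distinction the model is meant to preserve.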
QA priorities for NYARC collections
Visual culture and visual literacy
In its stewardship of a uniquely visual medium used to disseminate largely visual information, NYARC places especially high priority on behavior and appearance metrics when assuring web archival quality. Comprehension of web-based art historical resources requires both text and context; the textual information disseminated among exhibition sites, artists’ websites, and auction catalogs, to name a few, must be viewed in the style and sequence, and with the illustrations, intended by their creators.
Definition of content
For the above reason, the following QA procedures and guidelines define archivable web “content” more broadly than the strictly textual elements that chiefly concern other large web archiving operations. Content must in NYARC’s context be understood to also include embedded and dynamic imagery, responsive web applications and scripts, and externally hosted downloadable documents.
Problematic content types
Each of the above and still more content types can present significant barriers to web archiving, and especially to achieving a high-quality rendition of a live site within a new archival environment. For an inventory of the specific issues known to NYARC that may negatively affect quality, how to recognize them, and what if any strategies exist to ameliorate them, refer to the known quality problems and improvement strategies.
Social media & related web platforms
Ultimately, compromises in total archival quality must be struck, lest the QA backlog permanently prevent archived material from reaching its end-users. Lacking explicit prior knowledge of these future end-users’ needs, determining “how good is good enough” is a fraught, subjective proposition. It is therefore incumbent upon web archiving staff to articulate and mutually agree upon significant properties--the content, functionality, and presentation elements that define the purpose and/or value--of web instances within NYARC’s scope. Popular content types and presentation styles known to impede the QA process yet insignificant to NYARC’s broadest collecting scope are documented among known quality problems and improvement strategies. The degree to which other problematic content types are necessary to the behavior and appearance of a web instance must be determined on at least a collection-specific and preferably a seed-specific basis.
Archive-It capabilities and constraints
Capture tool capabilities and limitations
Patch crawling--the process of discovering and incorporating web content erroneously omitted from a seed’s initial crawl--is a distinguishing feature of NYARC’s QA process. It is the most effective strategy to mitigate incompleteness of archival capture, which has downstream effects upon the ways in which the archived web instances also behave and appear. To date, Archive-It is the only web archiving technology suite known to provide significant automation of this process. It is a process, however, that still requires human selection and frequently tedious management to overcome its own limitations and inefficiencies. Specific directions for conducting this process and guidelines for managing it are therefore provided to maximize its efficiency.
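The discovery step of patch crawling can be sketched as a set difference between the URLs referenced by archived pages and those actually captured; the leftover URLs are the candidates a human reviewer then selects from. This is an illustrative sketch under assumed inputs, not Archive-It’s actual implementation, and `patch_crawl_candidates` is a hypothetical helper name:

```python
def patch_crawl_candidates(referenced_urls, captured_urls):
    """Return URLs referenced by archived pages but missing from the capture.

    A human reviewer then decides which of these are worth patch crawling;
    automation surfaces the gaps, it does not close them.
    """
    return sorted(set(referenced_urls) - set(captured_urls))


missing = patch_crawl_candidates(
    ["https://example.org/", "https://example.org/img/logo.png"],
    ["https://example.org/"],
)
print(missing)  # ['https://example.org/img/logo.png']
```

The human-selection step the text describes corresponds to triaging the returned list: an omitted logo image might warrant a patch crawl, while an omitted third-party tracker would not.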
Deduplication and continuous improvement
Deduplication is the process of automatically omitting from successive captures any live web content that is unchanged between archival crawl periods, replacing it in those successive archival iterations with the pre-existing archived content. Deduplication enables great efficiency in managing data budgets, and also frees the QA technician from reviewing and/or enhancing aspects of archived web instances that have previously been assured for quality. Unless and until a web instance in NYARC’s scope introduces altogether new kinds of features or redesigns its entire site, comprehensive QA performed at the time of its initial capture can preclude the need for significant devotion to QA in the future.
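Deduplication as described above can be sketched with content hashes: if a URL’s payload hashes identically to the previous crawl’s copy, the new copy is omitted and replay points to the pre-existing record. This is a simplified illustration with hypothetical function and parameter names; production crawlers handle this differently (e.g. via WARC revisit records), and nothing here reflects Archive-It internals:

```python
import hashlib


def deduplicate(previous_index, new_captures):
    """Split a crawl's captures into freshly stored vs. deduplicated URLs.

    previous_index: dict mapping URL -> SHA-256 hex digest of the earlier capture
    new_captures:   dict mapping URL -> raw payload bytes from the current crawl
    """
    stored, deduped = [], []
    for url, body in new_captures.items():
        digest = hashlib.sha256(body).hexdigest()
        if previous_index.get(url) == digest:
            deduped.append(url)   # unchanged: replay uses the pre-existing record
        else:
            stored.append(url)    # new or changed: store a fresh capture
            previous_index[url] = digest
    return stored, deduped


prev = {"https://example.org/": hashlib.sha256(b"same body").hexdigest()}
new = {"https://example.org/": b"same body", "https://example.org/new": b"fresh"}
stored, deduped = deduplicate(prev, new)
print(stored, deduped)  # ['https://example.org/new'] ['https://example.org/']
```

The QA benefit follows directly: everything in the deduplicated list was already assured during an earlier crawl, so only the freshly stored URLs need review.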