1. Web Archiving Overview
A). History and Mission:
If libraries cannot successfully manage hybrid collections of digital and print resources, there is a very real danger of a “digital black hole” in the art historical record; many ephemeral web resources have already been lost. In 2001, the estimated lifespan of a website was only 100 days, and today it is not much longer. Recognizing that publication methods for art research materials, such as auction catalogs and catalogues raisonnés, were shifting from analog to digital, NYARC began to explore web archiving as a collection-building strategy in 2010, with a pilot project in partnership with the Internet Archive’s Archive-It subscription service. The pilot was undertaken to evaluate the usefulness of Archive-It for capturing auction sale information that was being distributed only online.
This proved to be an excellent learning experience: the pilot clearly demonstrated the challenges and complexities of capturing web content, such as scoping crawls (restricting them to exclude unwanted content). We succeeded in capturing content, and we also identified limitations in metadata creation and discovery when using Archive-It. NYARC recognized the need for a focused investigation in order to fully develop a program and to understand the organizational, economic, and technical challenges of building and preserving a web archive.
In 2012, NYARC was awarded a one-year grant from The Andrew W. Mellon Foundation, called “Reframing Collections for a Digital Age,” which allowed for research on publishing trends and for input from researchers on their current use of, and needs for, web-based materials. We sought legal advice to develop intellectual property and fair use guidelines, investigated web archiving technologies, and evaluated our current systems and workflows. The project was structured into three successive phases, each led by a separate consultant: 1). The tipping point, 2). Harvesting models, and 3). Infrastructure.
The findings of the Reframing planning study informed the two-year implementation grant, “Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art Resources,” also funded by The Andrew W. Mellon Foundation.
Our two-year program objectives were to capture, make accessible, and preserve important art-rich websites. In capturing this content, we also worked to determine best practices and document a web archiving workflow for the NYARC libraries, one that we are now sharing with the greater art research community. Beyond the implementation grant period, NYARC continues to build web archive collections and refine its program workflow.
B). What is web archiving?
Web archiving refers to collecting websites and preserving them in an archive for future researchers, with the objective of collecting all of the files that make up the content, visual appearance, and functionality of a given website. In NYARC’s web archiving practice, sites are most often captured via the Heritrix web crawler and are then stored in the standard WARC (Web ARChive) file format. A WARC file combines multiple digital resources into a single aggregate archive file, together with metadata about each resource. The captured resources often include HTML, style sheets, video, image files, and JavaScript.
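As an illustration only, the way a WARC file aggregates several resources with their metadata can be sketched in a few lines of Python. This is a simplified sketch of the WARC/1.0 record layout (a version line, named header fields, a blank line, the content block, and two trailing line breaks); it is not a description of how Heritrix or Archive-It write files. The function names, example URLs, and payloads are hypothetical, and production WARC files are normally compressed and handled with dedicated tooling.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one 'resource' record in the basic WARC/1.0 layout (sketch)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_warc_records(data: bytes):
    """Yield (headers, payload) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)  # blank line ends the header block
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        if lines[0] != "WARC/1.0":
            raise ValueError("unexpected record version line: " + lines[0])
        headers = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        pos = start + length + 4  # skip the two CRLFs that close the record


# Hypothetical example: an HTML page and its style sheet in one aggregate file.
archive = (
    build_warc_record("https://example.org/", b"<html><body>Hello</body></html>", "text/html")
    + build_warc_record("https://example.org/style.css", b"body { margin: 0; }", "text/css")
)

for headers, payload in read_warc_records(archive):
    print(headers["WARC-Target-URI"], headers["Content-Type"], len(payload))
```

Each record carries its own metadata (target URI, capture date, media type, length) alongside the captured bytes, which is what allows a single file to hold the many resources that make up one website.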
Our aim is to give our researchers access to both the content and the functionality of these materials, and to preserve web-archived materials over time.
C). Common web archiving terms and definitions:
The Internet Archive's Archive-It service offers a comprehensive glossary of web archiving terms.
D). Audience:
These policies and guidelines are primarily intended to assist NYARC web archiving staff, project administrators, and library departmental staff engaged in collection development, content harvesting, collection management, resource description, and preservation. By making these guidelines publicly accessible, we hope that they will be of use to the greater library and archives community.