5. Collection and Seed Management

A). Creating New Collections and Adding Seed URLs:

NYARC currently maintains 10 web archive collections, all accessible via NYARC's Archive-It partner page. Should NYARC decide to add Archive-It collections in the future, the Head of NYARC's Web Archiving Program will create the new collections, add the nominated seed URLs to each collection, and manage the frequency of capture and scope adjustments via the Archive-It web interface.

B). Scoping Crawls:

Modifying the scope of a crawl includes adding constraints to individual hosts, such as blocking all content from a specific host or blocking only certain URLs on that host. Host constraints can also cap the maximum number of documents (URLs) archived from that host or instruct the crawler to ignore the host's robots.txt. Other crawl limits restrict archiving to PDFs from a given host or cap the amount of data captured. The scope of a crawl can be expanded by including URLs that contain matching text, by adding regular expressions, and by using SURTs (Sort-friendly URI Reordering Transforms) to pass scoping rules to the Heritrix web crawler.
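The scoping ideas above can be illustrated with a minimal Python sketch. It is not Archive-It's or Heritrix's actual implementation: the function names (to_surt, in_scope), the example.org host, and the rule lists are hypothetical. The SURT conversion is simplified (it ignores ports and other normalization), and the precedence shown, where a block rule overrides an inclusion rule, is an assumption made for illustration only.

```python
import re
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Return a simplified SURT form of a URL (host labels reversed)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = ",".join(reversed(host.split(".")))
    return f"{parts.scheme}://({labels},){parts.path}"


def in_scope(url, include_surt_prefixes, block_patterns):
    """Assumed precedence: a matching block pattern excludes the URL;
    otherwise the URL must start with one of the included SURT prefixes."""
    if any(re.search(pattern, url) for pattern in block_patterns):
        return False
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in include_surt_prefixes)


# Hypothetical rules for an illustrative host, example.org.
INCLUDE = ["http://(org,example,"]  # example.org and all of its subdomains
BLOCK = [r"/calendar/"]             # regular expression for pages to skip

print(to_surt("http://www.example.org/exhibitions/2015"))
# http://(org,example,www,)/exhibitions/2015
print(in_scope("http://www.example.org/exhibitions/2015", INCLUDE, BLOCK))  # True
print(in_scope("http://www.example.org/calendar/2015-01", INCLUDE, BLOCK))  # False
print(in_scope("http://www.example.com/exhibitions/", INCLUDE, BLOCK))      # False
```

Because a SURT lists the host labels from the top-level domain downward, a single prefix such as http://(org,example, covers a host and every subdomain beneath it, which is why SURTs are convenient for expanding scope.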

C). Frequency of Capture:

Archive-It's default crawl frequency settings allow users either to run one-time captures of their seed URLs manually or to schedule captures to occur automatically at one of the following intervals:

  • Twice Daily

  • Daily

  • Weekly

  • Monthly

  • Bi-monthly

  • Quarterly

  • Semiannual

  • Annual

D). Public Collection Content:

Websites crawled as part of NYARC’s Archive-It collections are made public once they have been spot-checked and found to be reasonably complete (the collections can be viewed on NYARC’s Archive-It partner page). NYARC’s QA technicians then conduct a more thorough review of each capture, run patch crawls for any missing content, and coordinate efforts to repair incomplete captures of sites. See our Quality Assurance (QA) documentation for more detail.

NYARC’s web archive collections are all indexed, and their full text can be searched either on the public Archive-It partner page or in NYARC Discovery. Bibliographic records are also available in Arcade and WorldCat.