B2. Obstructed content
Refer to the guidelines below for crawling live web content that is obstructed either by barriers to all web crawlers or by barriers to unidentified human visitors:
Content blocked by Robots.txt
Content locked behind credentialing system (login screen, paywall, etc.)
Enabling Archive-It to crawl and browse web environments that are critical to NYARC’s collecting scope but obstructed by credentialing systems is principally an issue to be resolved at the collection development phase. It becomes an issue of web archival quality only when that enabled access has unintended downstream effects on crawled or captured content, or when unanticipated host domains (i.e., domains other than the principal seed or seeds) obstruct access to their content with a credentialing system.
Log in (and don't log out)
Should it be determined that a host domain or specific URL obstructed by a credentialing system is significant enough to the value of the seed or collection to warrant capture, the NYARC Web Archiving Program Coordinator may follow these steps to enable access:
Register for a username and password with the relevant service
Add the desired host domain or specific URL to the collection at the seed level
Edit that seed’s settings to include your credentials in the relevant Login Username and Login Password fields
Ensure in seed settings that the checkbox indicating visibility on Archive-It’s public site is empty
If, however, the credentialing system requires the crawler to provide more information than a username and/or password (e.g., ZIP code or proper name), the obstacle must be reported to Archive-It. No proven mitigation strategy yet exists for this problem.
Credentialing systems can obstruct access to content even after web crawler access has been successfully enabled. In NYARC’s experience, this results from the crawler’s propensity to follow logout links as it would any other link it encounters. Preventing this cycle requires implementing a host constraint on any URLs that match a regular expression of potential logout links. If crawl reports or Wayback reviews indicate that a credentialing system continues to block the crawler from content to which access has been enabled, submit a help ticket to Archive-It describing the problem and specifically requesting a custom regular expression to constrain the crawler.
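To illustrate the kind of pattern such a constraint might rely on, the following Python sketch tests URLs against a hypothetical logout-link regular expression. The pattern, the function names, and the example.org URLs are all assumptions made for illustration; they are not the expression used by NYARC or supplied by Archive-It, and the regular expression syntax accepted by Archive-It may differ from Python's.

```python
import re

# Hypothetical pattern (an assumption, not NYARC's or Archive-It's actual rule):
# matches URLs whose path or query string contains a logout-style token such as
# "logout", "log-out", "sign_out", or "logoff" as a whole word.
LOGOUT_PATTERN = re.compile(
    r"^https?://[^/]+/.*\b(log[-_]?out|sign[-_]?out|log[-_]?off)\b",
    re.IGNORECASE,
)


def is_logout_link(url: str) -> bool:
    """Return True if the URL looks like a logout link under the pattern above."""
    return LOGOUT_PATTERN.search(url) is not None


def filter_crawlable(urls):
    """Keep only the URLs that would not be treated as logout links."""
    return [u for u in urls if not is_logout_link(u)]


if __name__ == "__main__":
    candidates = [
        "https://example.org/account/logout",
        "https://example.org/user/sign-out?next=/",
        "https://example.org/blog/outreach",
    ]
    for url in candidates:
        print(url, "->", "blocked" if is_logout_link(url) else "allowed")
```

Checking a candidate expression against a sample of URLs from a crawl report before requesting the constraint may help confirm that it catches logout links without also excluding wanted content.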
Preventing the crawler from logging out when presented the opportunity, however, required adding a host constraint on URL strings that matched a regular expression:
In keeping with prudent intellectual property rights observance, no mitigation is to be pursued in the case of content obstructed by a subscription “paywall” unless negotiated explicitly with the site owner at the collection development stage.
Skarstedt Gallery links to one of their many listings in The New York Times, but because NYARC lacks a paid subscription to the latter, the web crawler cannot provide the necessary credentials to access and capture the relevant content. No attempt to transcend such an obstacle is permissible at this stage.