B2. Obstructed content

Refer to the guidelines below for crawling live web content that is obstructed either by barriers applied to all web crawlers or by barriers applied to unidentified (unauthenticated) visitors:

Content blocked by Robots.txt

Robots.txt is a plain-text file found throughout the web and is commonly used by site owners to keep web crawlers away from all or some of their online content. Frequently, and especially in the case of browser-based and template-driven web publishing platforms like WordPress, administrative areas of sites are blocked from crawling and may reasonably remain so. When, however, URLs vital to the completeness, behavior, and/or appearance of a captured web presence are blocked, the relevant robots.txt directives must be overridden. This may be done at the host level by adding a host URL, exactly as it appears in the relevant crawl report’s Hosts tab, to its collection’s list of Host Constraints, then checking the “Ignore Robots.txt” box at the end of that host’s row in the list. It may be effected more selectively by checking the “Ignore Robots.txt” box at the point of initiating a patch crawl of select missing URLs.
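
For illustration only, the following sketch (generic Python, not part of Archive-It) shows how a standards-compliant crawler consults robots.txt before fetching a URL. The domain, paths, and user-agent string are placeholders, and the /wp-admin/ Disallow rule is a hypothetical example of the kind of directive described above.

    # Minimal sketch, assuming a site whose robots.txt disallows /wp-admin/
    # for all user agents. Not Archive-It's implementation.
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.org/robots.txt")
    parser.read()  # fetch and parse the live robots.txt file

    # A rule such as "User-agent: *" / "Disallow: /wp-admin/" makes
    # can_fetch() return False for that path, so a compliant crawler skips it
    # unless the directive is deliberately overridden.
    print(parser.can_fetch("example-crawler", "https://example.org/wp-admin/"))
    print(parser.can_fetch("example-crawler", "https://example.org/exhibitions/"))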



Content locked behind credentialing system (login screen, paywall, etc.)

Enabling Archive-It to crawl and browse web environments that are critical to NYARC’s collecting scope but obstructed by credentialing systems is an issue to be resolved principally at the collection development phase. It becomes an issue of web archival quality only when that enabled access has unintended downstream effects on crawled/captured content, or when unanticipated host domains (i.e. those other than the principal seed/s) obstruct access to their content with a credentialing system.

Login (and don't logout)

If a host domain or specific URL obstructed by a credentialing system is determined to be sufficiently significant to the value of the seed/collection to capture, the NYARC Web Archiving Program Coordinator may follow these steps to enable access (a generic illustration follows the list):

    1. Register for a username and password with the relevant service

    2. Add the desired host domain or specific URL to the collection at the seed level

    3. Edit that seed’s settings to include your credentials in the relevant Login Username and Login Password fields

    4. Ensure in seed settings that the checkbox indicating visibility on Archive-It’s public site is empty
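
As a rough illustration of step 3, the sketch below shows in generic Python what supplying a username and password allows a crawler to do: authenticate once, then fetch pages behind the credentialing system within that session. This is not Archive-It’s crawler; the URLs, form field names, and credentials are hypothetical placeholders.

    # Generic sketch of credentialed fetching, assuming a simple form-based
    # login at a hypothetical endpoint. Requires the third-party "requests"
    # library.
    import requests

    session = requests.Session()

    # Submit the site's login form once; the session retains the resulting
    # cookies for later requests.
    session.post(
        "https://example.org/login",           # placeholder login endpoint
        data={"username": "example-user",      # placeholder credentials
              "password": "example-password"},
    )

    # Subsequent requests reuse the authenticated session, so pages behind
    # the credentialing system become reachable for capture.
    response = session.get("https://example.org/catalogue/entry/123")
    print(response.status_code)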

If, however, the credentialing system requires the crawler to provide more information than a username and/or password (e.g. a ZIP code or proper name), the obstacle must be reported to Archive-It. No proven mitigation strategy yet exists for this problem.

Credentialing systems can obstruct access to content even after web crawler access has been successfully enabled. In NYARC’s experience, this is the product of the crawler’s propensity to follow logout links as it would any other path provided to it. Preventing this counterintuitive cycle requires implementing a host constraint on any URLs that match a regular expression of potential logout links (see the sketch below). If crawl report or Wayback reviews indicate that a credentialing system continues to block crawls from content to which access has been enabled, submit a help ticket to Archive-It, describing the problem and specifically requesting a custom regular expression to constrain the crawler.
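
The sketch below is a hypothetical illustration of that kind of constraint; the production regular expression should come from Archive-It support. It shows, in generic Python, how a pattern of common logout paths would exclude matching URLs before a crawler could follow them.

    # Hypothetical example: filter logout-style URLs out of a list of
    # candidate links so an authenticated session is not ended mid-crawl.
    import re

    LOGOUT_PATTERN = re.compile(r"(log[-_]?out|sign[-_]?out|action=logout)",
                                re.IGNORECASE)

    candidate_urls = [
        "https://example.org/catalogue/entry/123",
        "https://example.org/user/logout",
        "https://example.org/index.php?action=logout",
    ]

    # URLs matching the pattern are dropped; everything else remains crawlable.
    to_crawl = [url for url in candidate_urls if not LOGOUT_PATTERN.search(url)]
    print(to_crawl)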

Registering a username and password, then entering both into the principal seed’s settings, enabled initial access to the Isamu Noguchi Catalogue Raisonné:

Preventing the crawler from logging out when presented with the opportunity, however, required adding a host constraint on URL strings that matched a regular expression:

Paywalls

In keeping with prudent observance of intellectual property rights, no mitigation is to be pursued for content obstructed by a subscription “paywall” unless explicitly negotiated with the site owner at the collection development stage.

Live and archived views:

Skarstedt Gallery links to one of its many listings in The New York Times, but because NYARC lacks a paid subscription to the latter, the web crawler cannot provide the credentials needed to access and capture the relevant content. No attempt to circumvent such an obstacle is permissible at this stage.

As with all other instances, however, this inability to patch crawl desired missing content must be documented on the QA report form.