D2. Managing Missing URLs

Because patch crawling is critical to the completeness of each archival capture, yet depends heavily on human selectivity, the process is prone to errors and inefficiencies. NYARC advocates for the development of enhanced tools for managing missing URLs within the Archive-It software service interface, specifically functions that go beyond patching or ignoring URLs on the current, largely one-by-one basis. In the meantime, the management practices below mitigate the most common inefficiencies of the process.

Invalid URLs

Invalid URLs are a class of missing URLs defined by their inactivity on the live web. Because they appear in their pages’ source code in partial, circular, or otherwise malformed forms that confuse Archive-It’s crawling and rendering technologies, they point to no discernible content that can be captured from the live web.

[Figure: These (Facebook-hosted) URLs from Pace Prints are only partially comprehensible to the crawler and therefore return as “invalid.”]
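A partial URL, for example, lacks the scheme or host that a crawler needs to resolve it. The hypothetical Python sketch below shows one way to flag such strings in a list of candidate URLs; the sample URLs and the is_resolvable helper are invented for illustration, and a purely syntactic check like this cannot catch circular URLs, which parse normally.

    from urllib.parse import urlparse

    def is_resolvable(url):
        """Return True if the string parses as an absolute http(s) URL."""
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)

    candidates = [
        "https://www.facebook.com/paceprints",  # complete: resolvable
        "/paceprints/photos",                   # partial: no scheme or host
    ]

    for url in candidates:
        status = "ok" if is_resolvable(url) else "invalid"
        print(f"{status:7s} {url}")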

At present, QA technicians have no way to patch crawl the content to which Invalid URLs are meant to direct. For this reason, and because they can appear in volumes large enough to complicate or even subvert the efficient selection of missing URLs for patch crawling, Archive-It automatically “ignores” all Invalid URLs and removes them from the Missing URLs list.

Missing URLs for successfully captured content

The relatively rare instances in which Invalid URLs refer to content that has in fact been successfully captured tend to occur when the content is detected as missing only because it fails to render through the Wayback interface. Any such failure to render captured content through the Wayback interface must be reported to Archive-It.
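One way to distinguish a true capture failure from a rendering failure is to request the document’s playback URL directly and check the HTTP status. The sketch below assumes the standard Archive-It playback URL pattern; the collection ID, timestamp, and target URL are placeholders to substitute for your own case.

    import requests

    # Placeholders: substitute your own collection ID and the URL under review.
    COLLECTION_ID = "1234"
    TARGET = "http://example.com/exhibitions/"

    # Archive-It playback URLs follow this pattern; a request with any
    # timestamp is redirected to the nearest capture, if one exists.
    playback = f"https://wayback.archive-it.org/{COLLECTION_ID}/20230101000000/{TARGET}"

    response = requests.get(playback, allow_redirects=True, timeout=30)
    if response.status_code == 200:
        print("A capture exists; the problem is likely a rendering failure to report.")
    else:
        print(f"No capture found (HTTP {response.status_code}); the content is truly missing.")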

Excessive missing URLs

Regardless of why they were not captured, excessively high numbers of missing URLs for ultimately undesired content can make selecting truly desired content from the list unnecessarily time-consuming and error-prone.

[Figure: This single missing URL for an advertising service, encoded into several different news articles, litters the Missing URLs list from NYARC’s New York City Galleries Collection. It can be blocked from appearing in future lists by instituting a host constraint.]

QA technicians are advised to note any host domains that generate missing URLs to undesired content so often that they should be blocked or limited in future crawls, and to make explicit recommendations of such scope adjustments in the QA report form.
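To help surface such domains, the sketch below tallies missing URLs by host; it assumes a hypothetical plain-text export of the Missing URLs list, one URL per line.

    from collections import Counter
    from urllib.parse import urlparse

    # Assumes a hypothetical plain-text export of the Missing URLs list,
    # one URL per line.
    with open("missing_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    hosts = Counter(urlparse(url).netloc for url in urls)

    # Hosts that account for many missing URLs are candidates for the host
    # constraints to recommend in the QA report form.
    for host, count in hosts.most_common(10):
        print(f"{count:6d}  {host}")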

For crawls currently under review and improvement, it is in the meantime best practice to “ignore” all such URLs in order to remove them from the active Missing URLs list before performing any other patch crawling function (see the sketch after these steps):

    1. Click on the Missing URL header text to list all URLs in ascending alphabetical order

    2. Select all of the repeating URLs, or those from domains outside of the collection’s scope, by clicking on the checkbox next to each

    3. Click on the Ignore Selected URLs button at the top-left corner of the page
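When the list is long, filtering offline first can make step 2 less error-prone. The sketch below, again assuming a hypothetical one-URL-per-line export, prints the missing URLs that fall under out-of-scope hosts, in the same ascending alphabetical order as the interface, so that each can be located and ignored.

    from urllib.parse import urlparse

    # Hypothetical inputs: an exported Missing URLs list and a set of hosts
    # already judged to be outside the collection's scope.
    OUT_OF_SCOPE = {"ads.example.net", "tracker.example.com"}

    with open("missing_urls.txt") as f:
        urls = sorted(line.strip() for line in f if line.strip())

    # Print in ascending alphabetical order, mirroring the sorted Missing
    # URLs list in the interface, so each URL is easy to find and ignore.
    for url in urls:
        if urlparse(url).netloc in OUT_OF_SCOPE:
            print(url)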

Erroneously patch crawled URLs

Ignoring, and thereby removing, invalid and excessive URLs from a collection’s Missing URLs list is currently the only known safeguard against erroneously patch crawling that same missing content. Otherwise, URLs purposely left ‘unchecked’ during patch crawl selection have a tendency to appear in the list of URLs queued for patch crawling nonetheless. Until Archive-It provides more information about why this error occurs and how QA technicians can avoid it, all instances of it must be reported to the service.
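In the meantime, one way to catch the error before reporting it is to compare the URLs actually queued for a patch crawl against those deliberately selected. The sketch below assumes two hypothetical one-URL-per-line files and reports any URL queued without having been selected.

    def read_urls(path):
        """Read a hypothetical one-URL-per-line export into a set."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    selected = read_urls("selected_urls.txt")  # URLs checked for patch crawling
    queued = read_urls("queued_urls.txt")      # URLs actually queued for the crawl

    # Any URL queued but never selected is an instance of the error to
    # report to Archive-It.
    for url in sorted(queued - selected):
        print(f"erroneously queued: {url}")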