D1. Activating and Detecting Missing URLs

When QA is enabled, the Wayback interface automatically detects some, though frequently not all, of the content left uncaptured, in the form of URLs missing from the current page view. For Wayback to detect a URL missing from a web instance’s initial capture, and therefore add it to the summary list of Missing URLs to be patch crawled later, that URL must first be activated: prompted, that is, by either a web crawler or a human using a web browser, directly from the page rendered by the source code in which it appears. For Heritrix, Wayback, or even the human web surfer, these URLs can be challenging to discover and activate for reasons temporal, structural, or both.

If you do not see the Enable QA link in the banner of your Wayback browsing view, log out of the Archive-It software interface, log back in, and refresh your view.

Activating URLs bound to time-based media

Time-based media elements frequently generate more URLs than the Heritrix crawler can detect during its initial capture of their surrounding web pages. In other words, the crawler may “move on” from these elements without returning to detect any new URLs generated dynamically during their playback. As a result, these media elements will render only partially in an archival environment.

Any page containing one or more time-based media elements that manifest this problem should first be added explicitly to the crawl scope at the seed level. When this strategy is not entirely effective on its own, patch crawl the URLs that Wayback does detect, then return to the subsequent capture of the page and repeat the process for any further missing URLs made newly detectable by the activation of the previous ones.
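The repeated detect-and-patch cycle described above can be pictured with a small simulation. This is only an illustrative model, not an Archive-It API: the function name `patch_until_stable`, the `revealed_by` mapping, and the example URLs are all hypothetical, and in practice each round is carried out by hand through the Wayback QA interface.

```python
def patch_until_stable(initial_missing, revealed_by, max_rounds=10):
    """Simulate repeated detect-and-patch rounds.

    initial_missing: URLs detected as missing on the first visit.
    revealed_by: hypothetical mapping from a URL to the further URLs that
        only become detectable once that URL has been captured.
    Returns a list of rounds; each round is the sorted list of URLs
    patch crawled in that round.
    """
    captured = set()
    missing = set(initial_missing)
    rounds = []
    while missing and len(rounds) < max_rounds:
        rounds.append(sorted(missing))
        captured |= missing
        # Capturing this round's URLs may expose URLs not seen before.
        newly_revealed = set()
        for url in missing:
            newly_revealed.update(revealed_by.get(url, []))
        # Only URLs not already captured need another round.
        missing = newly_revealed - captured
    return rounds


# Hypothetical media player whose URLs surface one layer at a time.
revealed_by = {
    "https://media.example.org/player.js": ["https://media.example.org/manifest.m3u8"],
    "https://media.example.org/manifest.m3u8": [
        "https://media.example.org/seg1.ts",
        "https://media.example.org/seg2.ts",
    ],
}

rounds = patch_until_stable(["https://media.example.org/player.js"], revealed_by)
for number, urls in enumerate(rounds, start=1):
    print(f"Round {number}: {urls}")
```

In this model the process stops once a round reveals nothing new, which mirrors returning to the page until Wayback no longer lists further missing URLs.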

Detecting URLs hidden by design

Wayback will fail to enumerate and summarize missing URLs when those URLs actually belong to the source code of pages other than the one currently open to view. As a result, the list of URLs accessible through the containing page’s View Missing URLs link will not always include content visibly missing from the archived rendering. The most frequent and representative manifestation of this problem occurs when pages contain frames: to the human viewer they appear as normal page content, while in the opened page’s source code each is in fact no more than a single URL.

For Wayback to detect the URLs written into the source code of pages that are themselves embedded into container pages as frames, it is necessary to find the URL for the embedded page within the container page’s source code, and to activate that link in order to open the embedded page on its own. Providing Wayback this unmediated view will enable it to detect any missing URLs and add them to the summary list of URLs to be patch crawled.

The links visible towards the bottom of this Claims Conference/WJRO page could not be detected, and therefore could not be patch crawled, until it was determined that they were in fact located in the source code of a wholly separate page, embedded into the surrounding one with an <iframe> tag.

Sometimes, as in the case of Whyte’s, the entire structure of the website is built upon frames, meaning that the default page view will rarely detect any missing URLs unique to the user’s present view. Each frame must instead be opened as its own page (find the URL for each in the container page’s source code) in order for Wayback to detect that frame’s missing URLs.
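Finding frame URLs in a container page’s source code can also be done with a short script rather than by eye. The following is a minimal sketch using only the Python standard library; the `FrameSourceParser` class, the `find_frame_urls` function, and the sample HTML and URLs are illustrative assumptions, not part of Archive-It or Wayback.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSourceParser(HTMLParser):
    """Collect the src URLs of <frame> and <iframe> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the container page's URL.
                self.frame_urls.append(urljoin(self.base_url, src))


def find_frame_urls(html_text, base_url):
    """Return the absolute URLs of all frames embedded in html_text."""
    parser = FrameSourceParser(base_url)
    parser.feed(html_text)
    return parser.frame_urls


# Hypothetical container page source, for illustration only.
sample = """
<html><body>
  <h1>Container page</h1>
  <iframe src="/embedded/links.html"></iframe>
  <IFRAME SRC="https://other.example.org/widget"></IFRAME>
</body></html>
"""

for url in find_frame_urls(sample, "https://www.example.org/about/"):
    print(url)
```

Each URL printed this way is a candidate to open on its own so that the embedded page’s missing URLs can be detected directly.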

Crawling URLs both time-based and hidden

On rare occasions, the origin of a time-based media element is so obscured that it is virtually impossible for either the Heritrix crawler or a human using a web browser to discover it, activate it, or detect it as missing. This manifests in an archival environment as a partially functional or nonfunctional media element for which no Missing URLs are detected for future patch crawling, either on the rendered page or elsewhere throughout the web instance, regardless of how many times activation of the element itself is attempted.

When the above description characterizes a missing video element embedded from an external domain like YouTube or Vimeo, refer specifically to the directions for scoping-in externally hosted video.

Otherwise, add the page on which the problem appears and, ideally, the broader instance’s (preferably XML-based) sitemap to the collection’s scope at the seed level. Doing so enables more human-intuitive interaction between dynamic elements and Umbra, Archive-It’s integrated browser-based crawling technology. When this strategy is not entirely effective on its own, it is then necessary to contact the site owner or designer to determine the origins and relations of the missing content. Without these paths to crawl, it is not possible to capture the content with Archive-It.