B3. Questionable markup

Source code that is conventional but not wholly compliant with W3C or other standards can confound the Archive-It Heritrix web crawler, leading to incomplete capture. Most importantly, anchored links written in noncompliant markup go unrecognized as links at all: they are first passed over by the web crawler and then remain undetectable to the Wayback interface for later patch crawling.

In the example anchored link above, from MoMA’s Sanja Ivekovic exhibition site, the web crawler recognizes a default URL in the <img> tag, but not the destination image given in the (noncompliant) data-src attribute. As a result, even manually following the image path leads to very different places on the live and archived webs:

Live

Archived
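
One way to find affected images during QA is to scan a page's source for <img> tags whose data-src value differs from the default URL in the tag. The following is a minimal sketch in Python, using only the standard library. It is not part of Archive-It's or NYARC's tooling; the names LazyImageFinder and find_deferred_images and the example.org URLs are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LazyImageFinder(HTMLParser):
    """Collect <img> tags whose data-src points somewhere other than src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative image paths
        self.candidates = []      # (default_url, deferred_url) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        deferred = attr_map.get("data-src")
        if not deferred:
            return  # no data-src value, so nothing for the crawler to miss
        default = attr_map.get("src")
        deferred_url = urljoin(self.base_url, deferred)
        default_url = urljoin(self.base_url, default) if default else None
        if deferred_url != default_url:
            self.candidates.append((default_url, deferred_url))


def find_deferred_images(html, base_url):
    """Return (default_url, deferred_url) pairs found in the given HTML."""
    finder = LazyImageFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.candidates


if __name__ == "__main__":
    # Hypothetical markup resembling the pattern described above.
    sample = (
        '<a href="#"><img src="/img/blank.gif" '
        'data-src="/img/full/work1.jpg" alt=""></a>'
    )
    for default_url, deferred_url in find_deferred_images(
        sample, "https://example.org/exhibit/"
    ):
        print("default:", default_url)
        print("deferred:", deferred_url)
```

Each deferred URL that the sketch reports could then be checked in Wayback and, if it is missing, noted on the QA report form.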

Archive-It technical support staff advise adding pages that manifest this problem to the collection scope at the seed level, so that Umbra, Archive-It’s integrated browser-based crawling technology, can interact with any possible URL in a more human-intuitive way. NYARC, however, has yet to confirm the efficacy of this method. In the meantime, these errors must be reported to Archive-It and, as with all other instances in which desired content cannot be patch crawled, documented on the QA report form.
