Blogs

The Heritrix crawler captures web content fairly reliably from blogs built on popular template-based platforms like WordPress. Problems specific to capturing blogs typically arise when a blog is incorporated into the structure of a larger website that has its own domain. Links to the incorporated blog and its contents actually point to domains external to the main seed’s, so the crawler will sometimes treat these URLs as outside the scope of the collection and leave them uncaptured.

To ensure that the blog attached to a broader web presence, along with its embedded content, is captured completely, follow the directions for expanding a collection’s scope to include a specific host, making sure to include the URLs of both the blog and any associated file-hosting subdomain.
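The effect of adding those hosts can be illustrated with a minimal Python sketch. It is a simplified host comparison, not Heritrix’s actual scoping logic, and the function names and URLs (`www.example.com`, `exampleblog.wordpress.com`, `exampleblog.files.wordpress.com`) are hypothetical stand-ins for a real seed, blog, and ‘files’ subdomain.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(url, seed_url, extra_hosts=()):
    """Simplified scope check: a URL is in scope when its host matches
    the seed's host or one of the hosts explicitly added to the scope.
    This is an illustration only, not the crawler's real rule set."""
    allowed = {host_of(seed_url)} | {h.lower() for h in extra_hosts}
    return host_of(url) in allowed


# Hypothetical seed and the blog URLs embedded in its pages.
seed = "http://www.example.com/"
blog_post = "http://exampleblog.wordpress.com/2012/05/a-post/"
blog_image = "http://exampleblog.files.wordpress.com/2012/05/photo.jpg"

# Hosts added when expanding the scope: the blog and its file-hosting subdomain.
expanded = ["exampleblog.wordpress.com", "exampleblog.files.wordpress.com"]

for url in (blog_post, blog_image):
    print(url, in_scope(url, seed), in_scope(url, seed, extra_hosts=expanded))
```

In this sketch both blog URLs fall outside the scope when only the seed’s host is considered and inside it once the two hosts are added. Adding only the blog’s host would still leave the images on the file-hosting subdomain out, which is why both need to be included.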

This WordPress-hosted blog appears as a frame within the larger Whyte’s site:

It was initially ruled out of scope entirely:

While patch crawling proved incrementally successful at restoring its content and functionality, expanding the wider Auction Houses collection’s crawl scope to include the URLs of the blog and its ‘files’ subdomain enabled a much fuller capture in just one pass: