3. Improve capture, behavior, and appearance

After you have assessed an archived web instance through thorough web browsing, you may take action to improve its capture, behavior, and appearance. Captures that systematically omit desired content, and/or systematically include excessive undesired content, are best addressed at this stage by scope adjustment. Those that require more selective inclusion of erroneously omitted content are best addressed at this stage by patch crawling. Problems that NYARC cannot solve by enhancing capture completeness in these ways must be reported to Archive-It via Support Ticket for resolution.

Adjust crawl scope

QA technicians must use the QA Report Form to recommend the host(s) and/or other specific URLs to be added, blocked, or limited in future crawls, or for which Robots.txt protocols must be overridden, precisely as each host or specific URL appears in the current crawl report’s Hosts tab. The Head of the NYARC Web Archiving Program and/or the QA technician may then follow the specific directions provided:

Patch crawl missing content

After you have identified opportunities to improve future crawls of the applicable seed(s) with scope adjustments, you may proceed to design and initiate a patch crawl of select missing content. (As a general rule, NYARC QA technicians do not patch crawl missing content on a page-by-page basis, as is enabled by the View Missing URLs link on each captured webpage’s Wayback view, but rather on a full crawl basis, as is enabled by the View Missing URLs from Wayback QA link on each collection’s management page in the Archive-It software service interface.)

View missing URLs from Wayback QA

To begin, navigate to the relevant collection’s management page in Archive-It and click on the Wayback QA link:

This provides you access to a list of links activated and thereafter detected by the Wayback interface to be missing from captures throughout the collection:

The list may be sorted by clicking on a column header, which displays items in ascending alphabetical order by the precise text of the missing URL, that of the source page from which it is missing, the possible reason that it is missing from the current capture, the general type of file that the URL represents, or the expected size of that file. Sorting in this way is essential to managing long or heterogeneous lists of missing URLs efficiently.
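For very long lists, the same organization can be approximated outside the interface. The following is a minimal sketch in Python, assuming that the missing URLs have been transcribed by hand into a list of records. The field names (`url`, `reason`, `size`), the sample values, and the "Not captured" reason label are all hypothetical and do not reflect any Archive-It export format.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical records transcribed from the Wayback QA list.
# Field names and values are illustrative assumptions only.
missing = [
    {"url": "http://example.org/css/site.css", "reason": "Not captured", "size": 2048},
    {"url": "http://cdn.example.net/img/logo.png", "reason": "Blocked by Robots.txt", "size": 10240},
    {"url": "http://example.org/js/app.js", "reason": "Not captured", "size": 512},
]


def sort_by(records, column):
    """Return the records in ascending order by one column.

    Text is compared case-insensitively; numbers are compared numerically.
    """
    def key(record):
        value = record[column]
        return value.casefold() if isinstance(value, str) else value

    return sorted(records, key=key)


def group_by_host(records):
    """Group the missing URLs by host, to show which hosts dominate the list."""
    groups = defaultdict(list)
    for record in records:
        groups[urlsplit(record["url"]).hostname].append(record["url"])
    return dict(groups)


if __name__ == "__main__":
    for record in sort_by(missing, "reason"):
        print(record["reason"], record["url"])
    for host, urls in group_by_host(missing).items():
        print(host, len(urls))
```

Grouping by host is an addition to what the interface offers; it can help to distinguish a single host that accounts for most of the missing URLs from scattered, unrelated omissions.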

Run Patch Crawl

To pursue missing content, select all relevant missing URLs by clicking on the checkbox next to each one, then click on the Patch Crawl Selected button in the left corner above the list:

Before your patch crawl runs, Archive-It will prompt you to confirm your selection based on the reason that the selected URLs may be missing from the current capture. For documents “Blocked by Robots.txt,” you must click on the “capture documents blocked by robots.txt” checkbox before clicking on the Run Patch Crawl button.
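To understand why a given document is reported as blocked, it can help to test its URL against the live site's robots.txt rules. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content and URLs are hypothetical, and the crawler user-agent string `archive.org_bot` is an assumption that should be checked against Archive-It's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be copied from a live site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

# Assumed user-agent string for the crawler; verify before relying on it.
CRAWLER_AGENT = "archive.org_bot"


def blocked_by_robots(robots_txt, url, agent=CRAWLER_AGENT):
    """Return True if the given robots.txt rules disallow this agent from fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/images/logo.png", "http://example.org/about.html"):
        print(url, "blocked" if blocked_by_robots(ROBOTS_TXT, url) else "allowed")
```

This only evaluates the rules as Python's parser interprets them; the crawler's own interpretation may differ in edge cases.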

Clicking on the Run Patch Crawl button will queue your patch crawl to run in the order that it was initiated among NYARC’s other one-time, scheduled, and test crawls. You may abort or check on the status of your patch crawl at any time thereafter by hovering over the Crawls link in the Archive-It interface’s top navigational menu, then clicking on the Current Crawls link:

Running a patch crawl in this way removes the selected missing URLs from your collection-wide list, so you may now return to the list by way of the Wayback QA link in order to patch crawl missing content from the list’s successive pages.

Evaluate and document patch crawl results

As with all other crawl types, the completion of a patch crawl triggers an email message to the Archive-It user with a link to a crawl report, which may likewise be accessed directly through the Archive-It software’s collection or reports interface, or through the Wayback QA tab under “Patch Crawl Reports.” The crawl’s report may be assessed immediately; however, its effects upon the previously captured web instance will generally be visible only 24 hours after completion. After the requisite time has passed, access the archived web instance through the Wayback browser interface in order to evaluate the effectiveness of your patch crawl(s). Identify any further opportunities to capture select missing content enabled by your completed patch crawl, including the activation of newly detectable missing URLs, and run successive patch crawls as necessary.

The need for and initiation of any patch crawl(s) must ultimately be reported on the QA Report Form, with a brief description of the missing content pursued through this strategy.

Submit issues to Archive-It

Issues of quality beyond those that can be mitigated through scope adjustment and/or patch crawling may require intervention by Archive-It crawl engineers and/or Wayback developers. To report these issues, click on the Help Center link in the Archive-It interface header, then select Submit a Request on the upper-right side of the Help Center page.

These help requests are coordinated among partners, engineers, and developers by Archive-It partner specialists. To make these frequently technical and occasionally prolonged interactions as effective and efficient as possible, follow these communication principles:

  • Cite precedent: Check existing NYARC help tickets (under the NYARC tab in the help interface) for similar issues before submitting any brand new request. Even when these prior interactions do not provide an immediate mitigation strategy, they may provide the partner specialist with vital information about parallel efforts and progress made by their colleagues. Indicating which other seeds may have manifested a similar problem, or which specialist may have recently solved a seemingly identical one, for instance, will greatly improve their efficiency.

  • Inquire broadly: When requesting help, always be sure to ask your partner specialist (as politely and briefly as possible) to indicate the likeliest causes of your problem; offer informed theories of your own when/if you have them, but always let them know that you are engaged in the problem mitigation yourself. More than any quick fix, information regarding the source of your given issue can help NYARC to anticipate future QA issues and processing needs, and will subsequently reduce the load on Archive-It's partner specialists and engineers.

  • Check Proxy mode: When experiencing any problem with playback (i.e., issues other than accessing and/or crawling a desired file path), remember to first compare your view of the archived resource in Wayback mode to a view in Proxy mode. Always be certain to report that you have performed this step in your initial help request, include any further questions that this comparison raises, and, insofar as is feasible, provide screenshots of each mode.

  • Ignore robots: When experiencing an access-related issue--any problem related to crawling a desired file path, rather than to playing it back--remember first to ignore the Robots.txt protocol. To avoid delays, be certain to report explicitly in your initial help request that you have already done so.

  • Share screenshots: For myriad reasons, Archive-It employees and contractors frequently do not see the same issues that we see in New York manifest at their stations in San Francisco or elsewhere. For this reason, it is critically important to provide them with screenshots whenever feasible. To do so on a Windows PC, navigate to the view most representative of the issue that you want to resolve, tap the "Print Scrn" button on your keyboard, open a graphics editing program (all NYARC PCs should at least have MS Paint), and paste the view from your clipboard onto the canvas, where you may crop/resize and save it as a PNG file on your desktop. (When using a Mac, instead of Print Scrn, hold Command-Shift-4 and tap the spacebar to select a window of which to take a screenshot.) There is no standard naming convention for these files, but good practice is to at least include an indication of the seed name and viewing mode--Live, Proxy, or Wayback. Include screenshots of any/all modes that manifest differences.

  • Specify software: It is important for the specialists and engineers to evaluate your issue(s) in the same context in which you encounter them, or else they may miss vital information. When encountering any rendering or otherwise browsing-related problem, be sure to specify in your initial help request precisely which browser(s) (e.g., Chrome, Firefox, etc.) you have used and which operating system (e.g., Windows 11, macOS Big Sur, etc.) your computer runs.
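The screenshot and software principles above can be supported with a small helper. The following is a minimal sketch in Python, not an established NYARC convention: the file name pattern, the function names, and the sample seed name are all hypothetical, and the browser name must still be supplied by hand.

```python
import platform
import re

# The three viewing modes named in the screenshot guidance above.
VIEWING_MODES = ("Live", "Proxy", "Wayback")


def screenshot_name(seed_name, mode):
    """Build a PNG file name that indicates the seed name and viewing mode.

    The pattern "<seed>_<mode>.png" is a hypothetical suggestion only.
    """
    if mode not in VIEWING_MODES:
        raise ValueError(f"mode must be one of {VIEWING_MODES}")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", seed_name).strip("-").lower()
    return f"{slug}_{mode.lower()}.png"


def environment_line(browser):
    """Summarize the browser (typed by hand) and the operating system as Python reports it."""
    return f"Browser: {browser}; OS: {platform.system()} {platform.release()}"


if __name__ == "__main__":
    for mode in VIEWING_MODES:
        print(screenshot_name("Example Gallery", mode))
    print(environment_line("Firefox"))
```

Note that Python reports the underlying platform name (for example, "Darwin" on a Mac), so the marketing name of the operating system version may still need to be added to the help request by hand.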

When you have completed the necessary steps above, proceed to:

4. Document QA process, problems, and recommendations