SPEC Kit 329: Managing Born-Digital Special Collections and Archival Materials · 175
University of Michigan
Quality Assurance for BHL Web Archives
http://bentley.umich.edu/dchome/webarchives/BHL_WebArchives_QAguidelines.pdf
9/21/2011 10
if you believe that an additional tag (or tags) may be
necessary.
d. Capture History
i. Check the following for potential issues:
1. “Status”: may reveal ongoing technical issues
2. “Files”: could be problematic if extremely low or high
3. “Duration”: could be problematic if extremely short or
timed out
3. Click the “View Results” link to access the Crawl Overview
a. Check seed URL(s) for redirects
b. In case of an extremely small number of files or short duration, check
“Robot Exclusions” statistics to see if the crawler was blocked
c. In case of an extremely large number of files or in the event that the
crawler exceeded the 36-hour duration, check the “Hosts Report” to see
how many URLs are remaining for the main seed URL(s)
d. Pending the review of the archived content, it may be necessary to
examine other crawl reports.
4. View archived website
a. Verify that content is an archived resource (instead of a redirected ‘live’
web page).
b. Verify that CSS files are present (i.e., pages are not text-only).
c. Click on main navigational links (depending upon crawl settings,
additional content may or may not have been intended for capture).
d. For high-priority targets, click through the entire site to ensure that
significant content and features have been captured.
e. Troubleshooting:
i. If a particular resource does not appear in the archive, conduct a
search for the URL (search feature available from the main
Results screen)
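
The Capture History and Crawl Overview checks above can be roughed out as a small triage helper. The following is a minimal sketch only, not part of the BHL procedure: the `Capture` fields, the "completed" status value, and every threshold except the 36-hour duration limit are hypothetical stand-ins for whatever "extremely low or high" means for a given site, and the values are assumed to be transcribed by hand from the capture history.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the guidelines only say "extremely low or high"
# and "extremely short", so these numbers are placeholders to adjust per site.
MIN_FILES = 10
MAX_FILES = 100_000
MIN_DURATION_HOURS = 0.05
MAX_DURATION_HOURS = 36.0  # the 36-hour crawl duration named in the guidelines


@dataclass
class Capture:
    """One capture history row, transcribed by hand (hypothetical fields)."""
    status: str
    files: int
    duration_hours: float


def review_hints(capture: Capture) -> list[str]:
    """Return the follow-up checks the guidelines suggest for one capture."""
    hints = []
    # "Status" may reveal ongoing technical issues.
    # Assumption: a healthy crawl is labeled "completed".
    if capture.status.lower() != "completed":
        hints.append("status: possible ongoing technical issue")
    # Very few files or a very short crawl: the crawler may have been blocked.
    if capture.files < MIN_FILES or capture.duration_hours < MIN_DURATION_HOURS:
        hints.append("check Robot Exclusions: crawler may have been blocked")
    # Very many files or a crawl that hit the time limit: URLs may remain.
    if capture.files > MAX_FILES or capture.duration_hours >= MAX_DURATION_HOURS:
        hints.append("check Hosts Report: URLs may remain for the seed")
    return hints
```

For example, `review_hints(Capture("completed", 2, 0.01))` suggests only the Robot Exclusions check. A helper like this can prioritize which captures to open first, but it does not replace viewing the archived website as described in step 4.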