University of Michigan
Quality Assurance for BHL Web Archives
http://bentley.umich.edu/dchome/webarchives/BHL_WebArchives_QAguidelines.pdf
September 21, 2011
i. Crawl of unusual size
j. Adjust crawl frequency
7. Make recommendations on the QA Spreadsheet regarding:
a. Back up the spreadsheet while working on it.
b. The deletion of a previous crawl.
i. Deletions should be reserved for crawls that were misdirected,
erroneous, or never completed (due to robots.txt or technical
issues).
ii. In some cases, excessively large captures (e.g., greater than 4 GB) may need to be deleted to conserve storage space.
c. The initiation of a new crawl.
d. Reducing the crawl frequency of high-priority sites.
e. Communication with the site owner if it will be necessary to request a modification of the robots.txt file or to resolve another issue with the site. Try to identify and record the name and email address of the site’s webmaster or main contact. (A sketch for checking robots.txt follows this list.)
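
As a concrete illustration of the robots.txt check in item 7.e, the following minimal Python sketch tests whether a seed's robots.txt blocks an archiving crawler before the webmaster is contacted. The seed URL and crawler user-agent string are placeholders, not values prescribed by these guidelines.

    from urllib import robotparser

    SEED_URL = "https://www.example.edu/"   # placeholder seed URL
    CRAWLER_AGENT = "archive.org_bot"       # placeholder crawler user-agent

    rp = robotparser.RobotFileParser()
    rp.set_url(SEED_URL.rstrip("/") + "/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt over the network

    if rp.can_fetch(CRAWLER_AGENT, SEED_URL):
        print("robots.txt permits the crawl; no modification request is needed.")
    else:
        print("robots.txt blocks the crawler; identify the site's webmaster "
              "and request a modification, noting it in the QA Spreadsheet.")
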
8. Edit crawl settings:
a. “Capture Linked Pages”
i. For U of M content:
1. Only “high priority” sites should include the capture of
linked pages.
2. For all other sites, the “Capture Linked Pages” setting should be changed to “No” to avoid capturing an excessive amount of content in the web archives.
ii. For MHC content, the QA specialist may need to
b. If you determine that the web archives need to capture a narrower/wider range of content, make one (or more) of the following changes (and note them in the QA Spreadsheet):
i. Decrease/increase the crawl scope (host, directory, or page); see the sketch below
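
To make the host/directory/page distinction in item 8.b.i concrete, the following illustrative Python sketch (not part of the guidelines) shows how each scope level narrows or widens the set of URLs a crawl will capture. The seed and candidate URLs are examples only; actual scoping is configured in the web archiving service rather than in local code.

    from urllib.parse import urlparse

    SEED = "https://bentley.umich.edu/dchome/webarchives/index.html"  # example seed

    def in_scope(candidate: str, scope: str) -> bool:
        """Return True if the candidate URL falls within the given scope of SEED."""
        seed, cand = urlparse(SEED), urlparse(candidate)
        if scope == "host":       # widest: anything on the seed's host
            return cand.netloc == seed.netloc
        if scope == "directory":  # narrower: same host and same directory subtree
            seed_dir = seed.path.rsplit("/", 1)[0] + "/"
            return cand.netloc == seed.netloc and cand.path.startswith(seed_dir)
        if scope == "page":       # narrowest: the seed page only
            return cand.netloc == seed.netloc and cand.path == seed.path
        raise ValueError(f"unknown scope: {scope}")

    # A sibling page is captured at directory scope but not at page scope.
    print(in_scope("https://bentley.umich.edu/dchome/webarchives/qa.html", "directory"))  # True
    print(in_scope("https://bentley.umich.edu/dchome/webarchives/qa.html", "page"))       # False
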