ii. Decrease/increase maximum crawl time (1 or 36 hours)
iii. Recommend the deletion/addition of seed URLs on the QA Spreadsheet.
c. While crawl schedules should be accurately set at the time of capture,
check with an archivist if the frequency for a site seems too low/high.
Common Issues and Problems with Web Captures
Crawler traps: These are essentially infinite loops from which a robot is unable
to escape. Online calendars are among the most common examples. The
crawler will start with the present date and capture page after page of the
calendar until the crawl expires without preserving more meaningful site
content. The resulting capture may have a very large number of files and will
likely reach the maximum time setting before finishing.
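For illustration only, a reviewer with a plain-text list of captured URLs from a crawl report could flag a probable calendar trap by counting date-like URLs per host. The URL patterns, threshold, and report file name in the following Python sketch are assumptions, not part of these guidelines:

import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed calendar/date URL patterns; real traps vary by site.
CALENDAR_PATTERNS = [
    re.compile(r"/calendar/", re.IGNORECASE),
    re.compile(r"[?&](month|year|date)=\d+", re.IGNORECASE),
    re.compile(r"/\d{4}/\d{1,2}/\d{1,2}/"),  # paths such as /2011/09/21/
]

def flag_probable_traps(captured_urls, threshold=500):
    """Count calendar-like URLs per host and report hosts that exceed
    the (assumed) threshold, which suggests a crawler trap."""
    hits = Counter()
    for url in captured_urls:
        if any(pattern.search(url) for pattern in CALENDAR_PATTERNS):
            hits[urlsplit(url).netloc] += 1
    return {host: count for host, count in hits.items() if count >= threshold}

if __name__ == "__main__":
    # Hypothetical export of one captured URL per line from a crawl report.
    with open("crawl_report_urls.txt") as report:
        urls = [line.strip() for line in report if line.strip()]
    for host, count in flag_probable_traps(urls).items():
        print(host, count, "calendar-like URLs: possible crawler trap")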
Unexpected seed redirects: The web crawler may be unexpectedly redirected
from the target seed URL and begin the crawl on a random page (sometimes
completely unassociated with the original seed URL). The redirection may
truncate the crawl, cause important content (such as a home page) to be
missed, or may lead to a crawler trap.
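One way to spot such redirects during QA is to resolve each seed and compare the final URL with the original. The following Python sketch uses only the standard library; the seed list is illustrative rather than drawn from these guidelines:

from urllib.parse import urlsplit
from urllib.request import urlopen

def resolve_seed(seed_url, timeout=10):
    """Follow any redirects and return the URL the crawler would actually
    start from (urlopen follows HTTP redirects by default)."""
    with urlopen(seed_url, timeout=timeout) as response:
        return response.geturl()

def check_seeds(seed_urls):
    """Report seeds whose final URL differs from the seed, noting whether
    the redirect stays on the same host or leaves it entirely."""
    for seed in seed_urls:
        final = resolve_seed(seed)
        if final != seed:
            same_host = urlsplit(seed).netloc == urlsplit(final).netloc
            note = "same host" if same_host else "DIFFERENT HOST"
            print(seed, "->", final, "(" + note + ")")

if __name__ == "__main__":
    check_seeds(["http://www.law.umich.edu/", "http://umich.edu/"])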
Inaccurate seed URLs: Some sites require the crawler to start at a specific web
page instead of a basic domain name. For instance, the accurate capture of the
U of M Law School required http://www.law.umich.edu/Pages/default.aspx to
be included as a seed (instead of just http://www.law.umich.edu/). Other sites
will require the crawler to start at “…/home” or “…/index.html.” Failure to
include accurate seeds may result in a failed crawl, an unexpected redirect, or a
crawler trap. The BHL QA specialist may need to visit the live website to
identify the exact URL from which the crawler should begin.
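To help identify an accurate starting URL, a reviewer could request a few common starting pages for the target domain and note which respond successfully or redirect. The candidate paths in the following Python sketch are assumptions based on the examples above:

from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# Candidate starting pages drawn from the examples above; adjust per site.
CANDIDATE_PATHS = ["", "home", "index.html", "Pages/default.aspx"]

def probe_start_pages(domain_url):
    """Request each candidate page and report the status code and the
    final URL after redirects, to help choose an accurate seed."""
    for path in CANDIDATE_PATHS:
        candidate = urljoin(domain_url, path)
        try:
            with urlopen(candidate, timeout=10) as response:
                print(candidate, "-> HTTP", response.status, response.geturl())
        except HTTPError as error:
            print(candidate, "-> HTTP", error.code)
        except URLError as error:
            print(candidate, "-> failed:", error.reason)

if __name__ == "__main__":
    probe_start_pages("http://www.law.umich.edu/")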
Robots.txt files: A “robots.txt” file is an Internet convention used by
webmasters to prevent all or certain sections of websites from being captured
by a web crawler. The robots.txt must reside in the root of the site’s domain
and its presence may be verified by typing ‘/robots.txt’ after the root URL (e.g.,
http://umich.edu/robots.txt). By convention, a web crawler or robot will read
the robots.txt file of a target site before doing anything else. This text file will
specify what sections of a site the robot is forbidden to crawl. A typical
robots.txt exclusion statement is as follows:
User-agent: *
Disallow: /
‘User-agent’ refers to the crawler; the asterisk (*) is a wildcard symbol that indicates the
exclusion applies to all robots, and the forward slash (/) applies the exclusion to all pages on the site.
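A page’s status under robots.txt can be checked with the urllib.robotparser module in Python’s standard library. In the following sketch the page URL is hypothetical and used only for illustration:

from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    """Read the site's robots.txt and report whether the given user agent
    is permitted to fetch the URL."""
    parts = urlsplit(url)
    root = parts.scheme + "://" + parts.netloc + "/"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    page = "http://umich.edu/admissions/"  # hypothetical page for illustration
    print(page, "is", "allowed" if is_allowed(page) else "blocked by robots.txt")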