ii. Decrease/increase maximum crawl time (1 or 36 hours)
iii. Recommend the deletion/addition of seed URLs on the QA Spreadsheet.
c. While crawl schedules should be accurately set at the time of capture,
check with an archivist if the frequency for a site seems too low/high.
Common Issues and Problems with Web Captures
Crawler traps: These are essentially infinite loops from which a robot is unable
to escape. Online calendars are among the most common examples. The
crawler will start with the present date and capture page after page of the
calendar until the crawl expires without preserving more meaningful site
content. The resulting capture may have a very large number of files and will
likely reach the maximum time setting before finishing.
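For illustration only, a reviewer with a plain-text list of captured URLs from a crawl report could flag a probable calendar trap by counting date-like URLs per host. The URL patterns, threshold, and report file name in the following Python sketch are assumptions, not part of these guidelines:

import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed calendar/date URL patterns; real traps vary by site.
CALENDAR_PATTERNS = [
    re.compile(r"/calendar/", re.IGNORECASE),
    re.compile(r"[?&](month|year|date)=\d+", re.IGNORECASE),
    re.compile(r"/\d{4}/\d{1,2}/\d{1,2}/"),  # paths such as /2011/09/21/
]

def flag_probable_traps(captured_urls, threshold=500):
    """Count calendar-like URLs per host and report hosts that exceed
    the (assumed) threshold, which suggests a crawler trap."""
    hits = Counter()
    for url in captured_urls:
        if any(pattern.search(url) for pattern in CALENDAR_PATTERNS):
            hits[urlsplit(url).netloc] += 1
    return {host: count for host, count in hits.items() if count >= threshold}

if __name__ == "__main__":
    # Hypothetical export of one captured URL per line from a crawl report.
    with open("crawl_report_urls.txt") as report:
        urls = [line.strip() for line in report if line.strip()]
    for host, count in flag_probable_traps(urls).items():
        print(host, count, "calendar-like URLs: possible crawler trap")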
Unexpected seed redirects: The web crawler may be unexpectedly redirected
from the target seed URL and begin the crawl on a random page (sometimes
completely unassociated with the original seed URL). The redirection may
truncate the crawl, cause important content (such as a home page) to be
missed, or may lead to a crawler trap.
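One way to spot such redirects during QA is to resolve each seed and compare the final URL with the original. The following Python sketch uses only the standard library; the seed list is illustrative rather than drawn from these guidelines:

from urllib.parse import urlsplit
from urllib.request import urlopen

def resolve_seed(seed_url, timeout=10):
    """Follow any redirects and return the URL the crawler would actually
    start from (urlopen follows HTTP redirects by default)."""
    with urlopen(seed_url, timeout=timeout) as response:
        return response.geturl()

def check_seeds(seed_urls):
    """Report seeds whose final URL differs from the seed, noting whether
    the redirect stays on the same host or leaves it entirely."""
    for seed in seed_urls:
        final = resolve_seed(seed)
        if final != seed:
            same_host = urlsplit(seed).netloc == urlsplit(final).netloc
            note = "same host" if same_host else "DIFFERENT HOST"
            print(seed, "->", final, "(" + note + ")")

if __name__ == "__main__":
    check_seeds(["http://www.law.umich.edu/", "http://umich.edu/"])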
Inaccurate seed URLs: Some sites require the crawler to start at a specific web
page instead of a basic domain name. For instance, the accurate capture of the
U of M Law School required http://www.law.umich.edu/Pages/default.aspx to
be included as a seed (instead of just http://www.law.umich.edu/). Other sites
will require the crawler to start at “…/home” or “…/index.html.” Failure to
include accurate seeds may result in a failed crawl, an unexpected redirect, or a
crawler trap. The BHL QA specialist may need to visit the live website to
identify the exact URL from which the crawler should begin.
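To help identify an accurate starting URL, a reviewer could request a few common starting pages for the target domain and note which respond successfully or redirect. The candidate paths in the following Python sketch are assumptions based on the examples above:

from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# Candidate starting pages drawn from the examples above; adjust per site.
CANDIDATE_PATHS = ["", "home", "index.html", "Pages/default.aspx"]

def probe_start_pages(domain_url):
    """Request each candidate page and report the status code and the
    final URL after redirects, to help choose an accurate seed."""
    for path in CANDIDATE_PATHS:
        candidate = urljoin(domain_url, path)
        try:
            with urlopen(candidate, timeout=10) as response:
                print(candidate, "-> HTTP", response.status, response.geturl())
        except HTTPError as error:
            print(candidate, "-> HTTP", error.code)
        except URLError as error:
            print(candidate, "-> failed:", error.reason)

if __name__ == "__main__":
    probe_start_pages("http://www.law.umich.edu/")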
Robots.txt files: A “robots.txt” file is an Internet convention used by
webmasters to prevent all or certain sections of websites from being captured
by a web crawler. The robots.txt must reside in the root of the site’s domain
and its presence may be verified by typing ‘/robots.txt’ after the root URL (e.g.,
http://umich.edu/robots.txt). By convention, a web crawler or robot will read
the robots.txt file of a target site before doing anything else. This text file will
specify what sections of a site the robot is forbidden to crawl. A typical
robots.txt exclusion statement is as follows:
User-agent: *
Disallow: /
‘User-agent’ refers to the crawler; the asterisk (*) is a wildcard symbol that indicates the
exclusion applies to all robots, and the forward slash (/) applies the exclusion to all pages on the site.
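A page’s status under robots.txt can be checked with the urllib.robotparser module in Python’s standard library. In the following sketch the page URL is hypothetical and used only for illustration:

from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="*"):
    """Read the site's robots.txt and report whether the given user agent
    is permitted to fetch the URL."""
    parts = urlsplit(url)
    root = parts.scheme + "://" + parts.netloc + "/"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    page = "http://umich.edu/admissions/"  # hypothetical page for illustration
    print(page, "is", "allowed" if is_allowed(page) else "blocked by robots.txt")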