Representative Documents: Workflows
University of Michigan
BHL Web Archives: Methodology for the Acquisition of Content
http://bentley.umich.edu/dchome/webarchives/BHL_WebArchives_Methodology.pdf
August 2, 2011
Configuration of Web Crawler Settings
WAS utilizes the open-source web crawler Heritrix to archive websites. As a
command-line tool, this application allows for a wide range of user settings; the
curatorial interface in WAS provides a more limited number of options. For each
crawl, archivists may adjust the following settings:
• Scope: defines how much of the site will be captured. The archivist may elect
to capture the entire host site (e.g., http://bentley.umich.edu/), a specific
directory (e.g., http://bentley.umich.edu/exhibits/), or a single page (e.g., a
letter written by Abbie Hoffman to John Sinclair, featured at
http://bentley.umich.edu/exhibits/sinclair/ahletter.php) (see Figure 2).
Figure 2 [image not reproduced]
To thoroughly capture target websites, the Bentley Historical Library
generally uses the “Host site” setting, unless the target is a single directory
or a specific page located on a more extensive host site.
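The difference between the three scope settings can be sketched as follows. This is an illustrative Python snippet only, not WAS or Heritrix code; the in_scope helper and the scope values “host”, “directory”, and “page” are hypothetical names chosen for the example.

from urllib.parse import urlparse

def in_scope(url, seed, scope):
    """Decide whether a URL falls within the selected crawl scope (illustrative)."""
    u, s = urlparse(url), urlparse(seed)
    if scope == "host":        # entire host site, e.g. http://bentley.umich.edu/
        return u.netloc == s.netloc
    if scope == "directory":   # a specific directory, e.g. .../exhibits/
        return u.netloc == s.netloc and u.path.startswith(s.path)
    if scope == "page":        # a single page, e.g. .../sinclair/ahletter.php
        return u.netloc == s.netloc and u.path == s.path
    raise ValueError("unknown scope: " + scope)

# A directory-scoped crawl of the exhibits area keeps only pages under /exhibits/
print(in_scope("http://bentley.umich.edu/exhibits/sinclair/ahletter.php",
               "http://bentley.umich.edu/exhibits/", "directory"))  # True
print(in_scope("http://bentley.umich.edu/research/",
               "http://bentley.umich.edu/exhibits/", "directory"))  # False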
• Linked pages: determines whether or not content from other hosts/URLs
will be captured; archivists have two options for this setting. If set to “No,”
the crawler will only archive materials on the seed URL entered by the
archivist; if set to “Yes,” the crawler will follow hypertext links one ‘hop’ to
capture linked resources. Capturing linked pages will not result in an
indefinite crawl (in which the robot follows link after link after link); instead,
the crawler will only capture the page (and embedded content) that is
specified by the hypertext link. No additional content on this latter site will
be crawled.
To avoid preserving extraneous content, the Bentley Historical Library by
default does not capture linked pages. Archivists will only capture linked
pages if it is required by the design of a website or if it is necessary to
capture contextual information for a high-priority web crawl.
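The one-‘hop’ behaviour described above can be illustrated with a minimal sketch. This is not the WAS/Heritrix implementation; plan_crawl and its inputs are hypothetical, and the pages dictionary stands in for link data a real crawler would discover.

from urllib.parse import urlparse

def plan_crawl(seed_host, pages, capture_linked_pages):
    """Return the URLs a crawl would archive (illustrative only).

    `pages` maps a URL to the links found on it. Pages on the seed host are
    crawled normally; when capture_linked_pages is True, off-host links are
    archived once, but their own links are never followed (the single 'hop').
    """
    frontier = [(url, False) for url in pages if urlparse(url).netloc == seed_host]
    seen = {url for url, _ in frontier}
    archived = []
    while frontier:
        url, is_one_hop = frontier.pop()
        archived.append(url)
        if is_one_hop:
            continue  # a linked page: capture it, but do not follow its links
        for link in pages.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if urlparse(link).netloc == seed_host:
                frontier.append((link, False))
            elif capture_linked_pages:
                frontier.append((link, True))  # one hop off the host, then stop
    return archived

# With linked pages enabled, the off-host www.umich.edu page is captured
# once, but nothing beyond it is crawled.
pages = {
    "http://bentley.umich.edu/": ["http://bentley.umich.edu/exhibits/",
                                  "http://www.umich.edu/"],
    "http://bentley.umich.edu/exhibits/": [],
}
print(plan_crawl("bentley.umich.edu", pages, capture_linked_pages=True))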
• Maximum time: specifies the maximum duration of a crawl. The archivist
may select “Brief Capture (1 hour)” or “Full Capture (36 hours)”; the
crawl will continue until all content has been preserved (in which case it may
end early) or the allotted time period has elapsed. If a session times out
before the crawler has finished, the resulting capture may be incomplete.
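A minimal sketch of this time-limited behaviour, assuming a hypothetical crawl loop (again, illustrative Python rather than Heritrix itself): the crawl ends when the frontier is empty, possibly before the limit, or when the allotted time elapses, in which case the capture is incomplete.

import time

BRIEF_CAPTURE = 1 * 60 * 60     # "Brief Capture": 1 hour, in seconds
FULL_CAPTURE = 36 * 60 * 60     # "Full Capture": 36 hours

def crawl(frontier, capture_page, max_seconds=FULL_CAPTURE):
    """Archive pages until the frontier is empty or the time limit elapses."""
    deadline = time.monotonic() + max_seconds
    archived = []
    while frontier:
        if time.monotonic() >= deadline:
            # Timed out before finishing: the resulting capture is incomplete.
            break
        archived.append(capture_page(frontier.pop()))
    return archived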
To avoid missing content due to time restrictions, the Bentley Historical
Library uses the “Full Capture” option by default. Archivists use the “Brief
Capture” option if the target involves a limited amount of content and the additional