Representative Documents: Workflows
University of Michigan
BHL Web Archives: Methodology for the Acquisition of Content
http://bentley.umich.edu/dchome/webarchives/BHL_WebArchives_Methodology.pdf
August 2, 2011
Configuration of Web Crawler Settings
WAS utilizes the open-source web crawler Heritrix to archive websites. As a
command-line tool, this application allows for a wide range of user settings; the
curatorial interface in WAS provides a more limited number of options. For each
crawl, archivists may adjust the following settings:
• Scope: defines how much of the site will be captured. The archivist may elect
to capture the entire host site (e.g., http://bentley.umich.edu/), a specific
directory (e.g., http://bentley.umich.edu/exhibits/), or a single page (e.g., a
letter written by Abbie Hoffman to John Sinclair, featured at
http://bentley.umich.edu/exhibits/sinclair/ahletter.php) (see Figure 2).
Figure 2 [image not reproduced]
To thoroughly capture target websites, the Bentley Historical Library
generally uses the “Host site” setting, unless the target is a single directory
or a specific page located on a more extensive host site.
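The difference between the three scope settings can be sketched as follows. This is an illustrative Python snippet only, not WAS or Heritrix code; the in_scope helper and the scope values “host”, “directory”, and “page” are hypothetical names chosen for the example.

from urllib.parse import urlparse

def in_scope(url, seed, scope):
    """Decide whether a URL falls within the selected crawl scope (illustrative)."""
    u, s = urlparse(url), urlparse(seed)
    if scope == "host":        # entire host site, e.g. http://bentley.umich.edu/
        return u.netloc == s.netloc
    if scope == "directory":   # a specific directory, e.g. .../exhibits/
        return u.netloc == s.netloc and u.path.startswith(s.path)
    if scope == "page":        # a single page, e.g. .../sinclair/ahletter.php
        return u.netloc == s.netloc and u.path == s.path
    raise ValueError("unknown scope: " + scope)

# A directory-scoped crawl of the exhibits area keeps only pages under /exhibits/
print(in_scope("http://bentley.umich.edu/exhibits/sinclair/ahletter.php",
               "http://bentley.umich.edu/exhibits/", "directory"))  # True
print(in_scope("http://bentley.umich.edu/research/",
               "http://bentley.umich.edu/exhibits/", "directory"))  # False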
• Linked pages: determines whether or not content from other hosts/URLs
will be captured; archivists have two options for this setting. If set to “No,”
the crawler will only archive materials on the seed URL entered by the
archivist; if set to “Yes,” the crawler will follow hypertext links one ‘hop’ to
capture linked resources. Capturing linked pages will not result in an
indefinite crawl (in which the robot follows link after link after link); instead,
the crawler will only capture the page (and embedded content) that is
specified by the hypertext link. No additional content on this latter site will
be crawled.
To avoid preserving extraneous content, the Bentley Historical Library by
default does not capture linked pages. Archivists will only capture linked
pages if it is required by the design of a website or if it is necessary to
capture contextual information for a high-priority web crawl.
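The one-‘hop’ behaviour described above can be illustrated with a minimal sketch. This is not the WAS/Heritrix implementation; plan_crawl and its inputs are hypothetical, and the pages dictionary stands in for link data a real crawler would discover.

from urllib.parse import urlparse

def plan_crawl(seed_host, pages, capture_linked_pages):
    """Return the URLs a crawl would archive (illustrative only).

    `pages` maps a URL to the links found on it. Pages on the seed host are
    crawled normally; when capture_linked_pages is True, off-host links are
    archived once, but their own links are never followed (the single 'hop').
    """
    frontier = [(url, False) for url in pages if urlparse(url).netloc == seed_host]
    seen = {url for url, _ in frontier}
    archived = []
    while frontier:
        url, is_one_hop = frontier.pop()
        archived.append(url)
        if is_one_hop:
            continue  # a linked page: capture it, but do not follow its links
        for link in pages.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if urlparse(link).netloc == seed_host:
                frontier.append((link, False))
            elif capture_linked_pages:
                frontier.append((link, True))  # one hop off the host, then stop
    return archived

# With linked pages enabled, the off-host www.umich.edu page is captured
# once, but nothing beyond it is crawled.
pages = {
    "http://bentley.umich.edu/": ["http://bentley.umich.edu/exhibits/",
                                  "http://www.umich.edu/"],
    "http://bentley.umich.edu/exhibits/": [],
}
print(plan_crawl("bentley.umich.edu", pages, capture_linked_pages=True))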
• Maximum time: specifies the maximum duration of a crawl. The archivist
may select “Brief Capture (1 hour)” or “Full Capture (36 hours)”; the
crawl will continue until all content has been preserved (in which case it may
end early) or the allotted time period has elapsed. If a session times out
before the crawler has finished, the resulting capture may be incomplete.
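A minimal sketch of this time-limited behaviour, assuming a hypothetical crawl loop (again, illustrative Python rather than Heritrix itself): the crawl ends when the frontier is empty, possibly before the limit, or when the allotted time elapses, in which case the capture is incomplete.

import time

BRIEF_CAPTURE = 1 * 60 * 60     # "Brief Capture": 1 hour, in seconds
FULL_CAPTURE = 36 * 60 * 60     # "Full Capture": 36 hours

def crawl(frontier, capture_page, max_seconds=FULL_CAPTURE):
    """Archive pages until the frontier is empty or the time limit elapses."""
    deadline = time.monotonic() + max_seconds
    archived = []
    while frontier:
        if time.monotonic() >= deadline:
            # Timed out before finishing: the resulting capture is incomplete.
            break
        archived.append(capture_page(frontier.pop()))
    return archived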
To avoid missing content due to time restrictions, the Bentley Historical
Library uses the “Full Capture” option by default. Archivists use the “Brief
Capture” option if the target involves a limited amount of content and the additional