Representative Documents: Workflows
University of Michigan
BHL Web Archives: Methodology for the Acquisition of Content
http://bentley.umich.edu/dchome/webarchives/BHL_WebArchives_Methodology.pdf
August 2, 2011
Identification of Content
The Bentley Historical Library employs the Heritrix web crawler (also known as a
spider or robot) to copy and preserve websites. As a subscriber to the California
Digital Library's (CDL) Web Archiving Service (WAS), the Bentley Library relies
upon an implementation of Heritrix specially configured and maintained by the CDL.
A web crawler is an application that starts at a specified URL and then methodically
follows hyperlinks to copy HTML pages and associated files (images, audio files,
style sheets, etc.) as well as the website's underlying structure. To initiate a
web capture, the archivist specifies one or more seed URLs from which the crawler
will preserve the target site.
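To make the fetch-and-follow process concrete, the sketch below (in Python, which is not part of the Bentley workflow) illustrates the core loop a crawler such as Heritrix performs: fetch a seed page, harvest its hyperlinks, and enqueue in-scope pages for copying. It is a conceptual illustration only; the actual captures are performed by the CDL-maintained Heritrix instance.

```python
# Conceptual sketch of a crawler's fetch-and-follow loop (illustration
# only; the Bentley Library uses the CDL's managed Heritrix instance).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=25):
    """Breadth-first crawl from a seed URL, staying on the seed's host."""
    host = urlparse(seed).netloc
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that cannot be fetched
        # A real crawler would write the page and its associated files
        # to archival storage (e.g., a WARC file) at this point.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host:  # stay within the target site
                queue.append(absolute)
    return seen


# e.g., crawl("http://www.rackham.umich.edu/")
```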
Accurate and thorough website preservation requires the archivist to become
familiar with a site’s content and architecture in order to define the exact nature of
the target. This attention to detail is important because content may be hosted
across multiple domains. For example, the University of Michigan’s Horace H. Rackham
School of Graduate Studies hosts the majority of its content at
http://www.rackham.umich.edu/ but maintains information on academic programs
at https://secure.rackham.umich.edu/academic_information/programs/. To
completely capture the Rackham School’s online presence, archivists needed to
identify both domains as seed URLs.
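The Rackham example can be illustrated with a small, hypothetical scope check: under a simple same-host scoping policy (one common crawler default, not necessarily the WAS configuration), declaring only www.rackham.umich.edu as a seed would exclude the academic-programs content on secure.rackham.umich.edu. The function below is illustrative only.

```python
# Hypothetical same-host scope rule showing why both Rackham domains
# must be declared as seeds; the policy shown is illustrative.
from urllib.parse import urlparse


def in_scope(url, seeds):
    """Return True if url's host matches the host of any seed URL."""
    seed_hosts = {urlparse(seed).netloc for seed in seeds}
    return urlparse(url).netloc in seed_hosts


programs = "https://secure.rackham.umich.edu/academic_information/programs/"

seeds = ["http://www.rackham.umich.edu/"]
print(in_scope(programs, seeds))   # False: the programs pages would be missed

seeds.append("https://secure.rackham.umich.edu/")
print(in_scope(programs, seeds))   # True: the second seed brings them into scope
```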
At the same time, multiple domains linked from a single site may merit preservation as
separate websites. For example, the University of Michigan’s Office of the Vice
President of Research (http://research.umich.edu/) maintains a large body of
information related to research administration (http://www.drda.umich.edu/) and
human research compliance (http://www.ohrcr.umich.edu/). Although these latter
sites could be included as secondary seeds for the Vice President of Research’s site,
their scope and informational value led archivists to preserve them separately.
Once the target of the crawl has been identified and defined, the archivist enters the
seed URL(s) and site name in the WAS curatorial interface (see Figure 1).
Figure 1. [Screenshot of the WAS curatorial interface, showing entry of seed URLs and site name]
The Bentley Historical Library standardizes the names of preserved sites by using
the title found at the top of the target web page or, in the absence of a formal
or adequate title, the name of the creator (i.e., the individual or organization
responsible for the intellectual content of the site). The library follows the best
practices for collection titles as established by Describing Archives: A Content
Standard (DACS).
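As a rough illustration of this fallback logic, the hypothetical sketch below approximates the displayed page title with the HTML <title> element (the actual convention relies on the archivist's reading of the page) and substitutes the creator's name when no adequate title is found. All function names and sample values are illustrative.

```python
# Hypothetical sketch of the naming fallback: use the page title when
# present (approximated here by the HTML <title> element), otherwise
# the creator's name. Names and sample values are illustrative.
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Capture the text content of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def site_name(page_html, creator):
    """Return the page title, or the creator's name if no adequate title exists."""
    finder = TitleFinder()
    finder.feed(page_html)
    title = finder.title.strip()
    return title if title else creator


print(site_name("<html><head><title></title></head></html>",
                "University of Michigan. Office of the Vice President of Research"))
# Falls back to the creator's name because the page title is empty.
```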