Representative Documents: Workflows
University of Michigan
BHL Web Archives: Methodology for the Acquisition of Content
http://bentley.umich.edu/dchome/webarchives/BHL_WebArchives_Methodology.pdf
August 2, 2011
Identification of Content
The Bentley Historical Library employs the Heritrix web crawler (also known as a
spider or robot) to copy and preserve websites. As a subscriber to the California
Digital Library's (CDL) Web Archiving Service (WAS), the Bentley Library relies
upon an implementation of Heritrix specially configured and maintained by the CDL.
A web crawler is an application that starts at a specified URL and then methodically
follows hyperlinks to copy HTML pages and associated files (images, audio files,
style sheets, etc.) as well as the website's underlying structure. To initiate a
web capture, the archivist specifies one or more seed URLs from which the crawler
will preserve the target site.
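To make the fetch-and-follow process concrete, the sketch below (in Python, which is not part of the Bentley workflow) illustrates the core loop a crawler such as Heritrix performs: fetch a seed page, harvest its hyperlinks, and enqueue in-scope pages for copying. It is a conceptual illustration only; the actual captures are performed by the CDL-maintained Heritrix instance.

```python
# Conceptual sketch of a crawler's fetch-and-follow loop (illustration
# only; the Bentley Library uses the CDL's managed Heritrix instance).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=25):
    """Breadth-first crawl from a seed URL, staying on the seed's host."""
    host = urlparse(seed).netloc
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that cannot be fetched
        # A real crawler would write the page and its associated files
        # to archival storage (e.g., a WARC file) at this point.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host:  # stay within the target site
                queue.append(absolute)
    return seen


# e.g., crawl("http://www.rackham.umich.edu/")
```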
Accurate and thorough website preservation requires the archivist to become
familiar with a site’s content and architecture in order to define the exact nature of
the target. This attention to detail is important because content may be hosted
across multiple domains. For example, the University of Michigan’s Horace H. Rackham
School of Graduate Studies hosts the majority of its content at
http://www.rackham.umich.edu/ but maintains information on academic programs
at https://secure.rackham.umich.edu/academic_information/programs/. To
completely capture the Rackham School’s online presence, archivists needed to
identify both domains as seed URLs.
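The Rackham example can be illustrated with a small, hypothetical scope check: under a simple same-host scoping policy (one common crawler default, not necessarily the WAS configuration), declaring only www.rackham.umich.edu as a seed would exclude the academic-programs content on secure.rackham.umich.edu. The function below is illustrative only.

```python
# Hypothetical same-host scope rule showing why both Rackham domains
# must be declared as seeds; the policy shown is illustrative.
from urllib.parse import urlparse


def in_scope(url, seeds):
    """Return True if url's host matches the host of any seed URL."""
    seed_hosts = {urlparse(seed).netloc for seed in seeds}
    return urlparse(url).netloc in seed_hosts


programs = "https://secure.rackham.umich.edu/academic_information/programs/"

seeds = ["http://www.rackham.umich.edu/"]
print(in_scope(programs, seeds))   # False: the programs pages would be missed

seeds.append("https://secure.rackham.umich.edu/")
print(in_scope(programs, seeds))   # True: the second seed brings them into scope
```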
At the same time, multiple domains linked from a single site may merit preservation as
separate websites. For example, the University of Michigan’s Office of the Vice
President of Research (http://research.umich.edu/) maintains a large body of
information related to research administration (http://www.drda.umich.edu/) and
human research compliance (http://www.ohrcr.umich.edu/). Although these latter
sites could be included as secondary seeds for the Vice President of Research’s site,
their scope and informational value led archivists to preserve them separately.
Once the target of the crawl has been identified and defined, the archivist enters the
seed URL(s) and site name in the WAS curatorial interface (see Figure 1).
Figure 1. [Screenshot of the WAS curatorial interface, showing entry of seed URLs and site name]
The Bentley Historical Library standardizes the names of preserved sites by using
the title found at the top of the target web page or, in the absence of a formal
or adequate title, the name of the creator (i.e., the individual or organization
responsible for the intellectual content of the site). The library follows the best
practices for collection titles as established by Describing Archives: A Content
Standard (DACS).
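As a rough illustration of this fallback logic, the hypothetical sketch below approximates the displayed page title with the HTML <title> element (the actual convention relies on the archivist's reading of the page) and substitutes the creator's name when no adequate title is found. All function names and sample values are illustrative.

```python
# Hypothetical sketch of the naming fallback: use the page title when
# present (approximated here by the HTML <title> element), otherwise
# the creator's name. Names and sample values are illustrative.
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Capture the text content of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def site_name(page_html, creator):
    """Return the page title, or the creator's name if no adequate title exists."""
    finder = TitleFinder()
    finder.feed(page_html)
    title = finder.title.strip()
    return title if title else creator


print(site_name("<html><head><title></title></head></html>",
                "University of Michigan. Office of the Vice President of Research"))
# Falls back to the creator's name because the page title is empty.
```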