Managing Born-Digital Special Collections and Archival Materials, SPEC Kit 329 (August 2012)

Nelson, Naomi L.; Shaw, Seth; Deromedi, Nancy; Shallcross, Michael; Ghering, Cynthia; Schmidt, Lisa; Belden, Michelle; Esposito, Jackie R.; Goldman, Ben; Pyatt, Tim

42 · Survey Results: Survey Questions and Responses
A procedure using a combination of Adobe Photoshop and Adobe Bridge was developed locally to batch process files
to accomplish this task. Sensitive data: we have yet to work out issues surrounding born-digital institutional records
with restricted access, e.g., promotion & tenure files, President’s Office files, etc. An organization uses an online service
to process applications that in the past had been delivered in paper format. Acquiring the records in a format that is
useable by the archive may require a contract of some sort with the vendor. This remains to be resolved.
In 2010 the library acquired a collection of nearly 50 floppy discs and a number of CDs; most were unlabeled (or
labeled unhelpfully), meaning that we had to view each one and try to deduce at least minimal information so we could
describe the contents. However, the most challenging item was a hard drive, carefully wrapped, with a label reading
“The contents of this drive can only be accessed at the original computer from the New York Times. If installed at any
other computer, you may damage the contents and you may format (wipe out) the drive.” We have no idea quite how to
approach this so have simply left it alone as is!
Inability to access content saved on obsolete media or in obsolete programs. Lack of secure, redundant, geographically
distributed, and reliable preservation storage systems. Lack of a system for managing and providing access to born-digital
materials that will allow for restricting some content for a period of time and will also help automate some processes like
generating checksums, virus checking, extraction of technical metadata from file headers, etc.
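Two of the automated steps named above, generating checksums and extracting technical metadata from file headers, can be illustrated with a minimal Python sketch. It is not the respondent's system: the function names and the very small signature table are assumptions made for illustration, and a production workflow would rely on a full format registry and a dedicated virus scanner.

```python
import hashlib
from pathlib import Path

# Hypothetical, deliberately tiny signature table (an assumption for this
# sketch); real format-identification tools use much larger registries.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def sniff_format(header: bytes) -> str:
    """Guess a format from the leading bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def describe_file(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return size, SHA-256 checksum, and a header-based format guess."""
    digest = hashlib.sha256()
    header = b""
    with path.open("rb") as handle:
        # Read in chunks so large files are never held in memory at once.
        while chunk := handle.read(chunk_size):
            if not header:
                header = chunk[:16]  # keep only the first bytes for sniffing
            digest.update(chunk)
    return {
        "name": path.name,
        "size": path.stat().st_size,
        "sha256": digest.hexdigest(),
        "format": sniff_format(header),
    }
```

Calling `describe_file` on every file at ingest and storing the result gives a baseline; recomputing the checksum later and comparing it against the stored value is one way to detect silent corruption.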
Ingestion of compound/complex objects (i.e., objects made of many types of materials at once). We use Google
Spreadsheets to compile metadata and file locations, but a solution like BagIt is likely to be more effective. Presentation
of complex objects. Determining how to show a user an object consisting of many disparate parts (e.g., a video with
a transcript, screenshots, and an associated web page). This is usually considered a prerequisite to ingestion, since
an object is only considered accessible if it can be usefully retrieved. We still address this question on an ad hoc basis.
Providing granular security options for all content. The technology required to provide very granular control over rights
and permissions makes it difficult to build services for ingesting and reusing repository content. Few repository systems
(we use Fedora) have a fully developed solution in this regard, so we use our own solution based on the university’s
Shibboleth identity system.
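As a rough illustration of why a BagIt-style package could replace a spreadsheet of file locations, the following Python sketch assembles a minimal bag: a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing a checksum for each payload file. The function name `make_bag` is hypothetical, the sketch omits optional tag files such as `bag-info.txt`, and an established BagIt implementation would be preferable in practice.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into a new minimal BagIt-style bag at bag_dir.

    Hypothetical helper for illustration; bag_dir/data must not exist yet.
    """
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir itself

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to the bag>" per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file; acceptable for a sketch only.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

Because the manifest travels with the payload, the parts of a compound object and their checksums stay together when the package is moved, which a separately maintained spreadsheet cannot guarantee.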
Lack of a standard set of best practice guidelines for dealing with original context (e.g., file system hierarchy) of born-
digital files when ingesting. Lack of a policy on file format normalization, and identification of what a “record copy”
means in the born-digital context. Fear and misunderstanding of the nature of born-digital material.
Lack of software and/or hardware to read files and physical media: We rely on library and college IT departments to
access file content, and we acquire legacy hardware when possible. Lack of server space to use for transfer of records
from digital media: We recently acquired server space hosted by the university’s IT department for use in backing up
digital media. Maintaining privacy and security of confidential records while complying with university policy as well as
federal and state laws governing privacy: We have policies governing access to confidential records, but procedures
specific to born-digital materials are still being developed.
Legacy File Format Normalization: We have a collection that includes over 25 different file extensions, mostly text-
based documents, many of which were unrecognized and/or created significant artifacts or “garbage” when rendered
in modern programs. A lot of these files were created on the now defunct and unsupported Nota Bene annotation/
bibliography software. We used a conversion tool called FileMerlin to convert as many of the troubling files as we could
and a Windows command-line script utilizing Microsoft Word to convert WordPerfect and other legacy file formats
that Word would recognize. After a significant amount of manual and automated work, we increased the number of
legible files in the collection from around 40% to around 95%. Legacy Media recovery: Like many institutions, we have
many “hybrid” collections that include legacy media such as 3.5”/5 1/4” floppies, hard drives, CD/DVD, even whole
computing environments. We are building a Legacy Archival Media Migration Platform (LAMMP) and an accompanying
manual as an environment and a workflow for capturing images of these media and generating metadata and capturing

SPEC Kit 329: Managing Born-Digital Special Collections and Archival Materials · 41
are actively working on a preservation plan that will address this issue. Authentication: we don’t currently have a
mechanism to authenticate born-digital objects – we “trust” the source and ingest. We are hoping to make this part of
our Digitization Preservation Policy, which is currently in development.
Developing policies and procedures relating to the acquisition and ingest of born-digital content: the Digital Archivist
recently completed a research leave during which he drafted a digital preservation policy that could apply to born-digital
materials. Developing an open-source digital asset management system: the ingest process for our digital asset
management system has been unreliable in its early stages of development. The Libraries has dedicated an IT person to
this system and has hired a vendor to further development of the system, particularly regarding its stability. Creating an
inventory of born-digital material on legacy media: the Digital Archivist will soon be compiling such an inventory based
on existing finding aids.
Developing secure hardware infrastructure to protect PII collected and retained: we have worked closely with the campus IT
security office. Obtaining secure, backed-up server space for a dark archive. Planning an access strategy for restricted content.
Digital storage space. We have recently conducted an inventory of all of our special collection digital assets (not
just born-digital). This will be used to more effectively plan our storage needs—the amount and types of storage.
Sustainability of digital library and preservation platform. We haven’t yet adequately addressed this issue.
File format is an enormous challenge. We are receiving research data proprietary to specific data collection and
analysis tools, such as the SURF surface mapping data produced by the software MountainsMap. Another is the gene
sequencing data, FASTA, produced by the SOLiD gene sequencing system. We don’t have non-proprietary formats
in which to store this data and we don’t know enough about persistence and backward compatibility for the tools.
Our researchers are skilled at using the tools and interpreting the data but aren’t able to answer our questions about
persistence and longevity for the data. Thus far, our only strategy is to document the instruments that created the
data, document as much as we know about the data (which is often in multiple files) and to bring this issue up in every
research data gathering and suggest that conversations with these instrument providers are needed. File size is another
challenge. Large files take a very long time to process and can make born-digital files difficult to manipulate in the
repository and for end users to download. We currently bundle large files into zip files for downloading but need an
effective background methodology for ingest.
File format on legacy tape drives from punch card data that has Census/private information for different nations. Need
for old hardware on site for conversions and ingest with immediate time demands. Scaling up for the demand.
File formats: e.g., Word 1.0 documents. Hardware: e.g., receipt of records on 5 1/4” or 3 1/2” discs with no computers that
will read such discs. Uncertainty about the authenticity of the records we have received. Do we have the only copy or
are there multiple copies/versions available elsewhere?
Hardware and software. We don’t always have the hardware and/or software to access legacy file formats, and don’t
know how to access files without changing their metadata. We try to collect obsolete hardware when possible, and
sometimes outsource accessing these legacy files. Selection of file formats for streaming media: we are currently
working on this with library IT staff. We face challenges trying to educate the university community about giving us their
born-digital files, and lack confidence that we can preserve them and make them accessible because of lack of resources and
internal technical expertise. We are working on outreach to university offices, and working on developing necessary
skills for archiving born-digital content.
Hardware: lack of secure storage and backup. We are attempting to implement this now, working with university IT. Privacy/
security. We hope to develop written policies.
Images received in digital format but named idiosyncratically by the photographer. In order for these files to be used in a
local digital environment it is necessary to provide meaningful file names in relation to existing or new local directories.
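One way to approach such renaming without losing the photographer's original names is to generate a rename plan first and keep it as a log. The Python sketch below is only an illustration under stated assumptions: the function names, the sequential numbering scheme, and the accession-style prefix `ua2012-015` in the usage note are all hypothetical and do not come from the survey response.

```python
import csv
import io
from pathlib import PurePath


def plan_renames(original_names, prefix, start=1, width=4):
    """Map each original file name to '<prefix>_<sequence><ext>'.

    Nothing is renamed here; the function only returns (old, new) pairs
    so the plan can be reviewed and archived before any file is touched.
    """
    plan = []
    for offset, name in enumerate(sorted(original_names)):
        suffix = PurePath(name).suffix.lower()  # normalize extension case
        plan.append((name, f"{prefix}_{start + offset:0{width}d}{suffix}"))
    return plan


def plan_to_csv(plan):
    """Serialize the plan as CSV text so original names stay on record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["original_name", "new_name"])
    writer.writerows(plan)
    return buffer.getvalue()
```

For example, `plan_renames(["DSC_0042 final.JPG", "beach2.tif"], "ua2012-015")` pairs the first name with `ua2012-015_0001.jpg` and the second with `ua2012-015_0002.tif`. Saving the CSV alongside the renamed files preserves the original names as provenance information.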

SPEC Kit 329: Managing Born-Digital Special Collections and Archival Materials · 43
contents where possible. We have finished developing and testing this process and are ready to image our first batch of
3.5” floppies, followed shortly by 5 1/4” floppies and hard drives after we acquire the hardware (drives, drivers and write
blockers).
Legacy software: needed legacy equipment to access and transfer files. Donated mixed material collections: donor may
not own rights to all of collection that was contributed. Images in dissertations that might have fair use rights but not
necessarily general dissemination rights: how to deal with this.
Limited staff comfortable with ingest. Although we have an ingest process that has now been formalized and
undertaken with more than 50 accessions, we still only have a couple of staff members who possess the sufficient
technical skills and understanding of digital records issues to undertake even the rudimentary steps in the accessioning
process. This leads to resource constraint issues as more and more digital records on media are being taken in, even
if they are not actively collected. To grow this program, we need more, and lower-level, staff to undertake much
of the accessioning process, as they currently do with paper. Minimal description practices don’t match ingest process.
We are following a forensic model of accessions in which we create forensic images of storage media during
accessioning and setting those images aside for further processing. However, the current model for archival accessioning
on paper is to undertake minimal arrangement and description during the accessioning process, thereby eliminating
a backlog requiring future processing. Hardware and software ingest lab development was time consuming and
difficult. Although we have now built up a significant shared lab to enable the ingest of born-digital records from many
different types of storage media, the process of building such a lab took several years, expertise, and funding. Each new
collection seems to bring new technical issues that must be dealt with.
Major issue is technological — especially how to receive content from private donors. Still being worked on.
Media obsolescence/failure. This includes outmoded storage systems like 5.25” floppy disks and zip disks. Even if we
have hardware to accommodate them, we sometimes find that the content is corrupted or otherwise inaccessible.
We have a small collection of old drives and other resources nearby; after that we consider outsourcing, but will often
store as is or even deaccession, depending on resources and anticipated value of the content. Software obsolescence:
sometimes it isn’t even obsolete, it’s just got a small market share, like AskSam. So far, we have been able to find
programs to access and migrate/normalize this content. File formats: we have received proprietary camcorder files, for
example, which we had difficulty assessing the value of. Upon further investigation, these were found to be metadata
files and thumbnails. We determined in the end that we would keep them.
Met with outgoing dean and transferred email account to library servers once he left the position. Outlook PSTs are
highly proprietary. Transferred deceased faculty member’s email account to library servers. Mac to Windows migration
was very time consuming. The email account is in Eudora, and there is no easy way to convert the emails to a less proprietary format.
Transferred digitized president’s office correspondence from CDs to library servers. Transfer process took hours.
Obsolete file formats. Readability of legacy media. Lack of identifying information accompanying legacy media
(unlabeled, no contextual information).
Obsolete media storage. To date we have been able to outsource this to a vendor. Lack of any repository to store
or manage personal materials donated. We’ve taken in a few batches of material and have stored them with only a
promise of byte stream recovery and have temporarily turned other material away.
Obsolete media, file systems, and file formats, e.g., 8” floppy disks, FAT variant disk formats, and WordStar files
(existing converters did not work). Data loss from media corruption. Managing the politics surrounding SEI/PII. Some
disks have content that the donor did not expect to be there, that was private, and that fell outside our collecting scope. Some capture
mechanisms are poor or incomplete compared to the original versions e.g., social media and enterprise systems data.
