Monday, February 16, 2009

Ensuring that digital data last

An expanded version of a paper originally presented at the:
EMELD Symposium on “Endangered Data vs. Enduring Practice,”
Linguistic Society of America annual meeting
8-11 January 2004, Boston, MA

Ensuring that digital data last
The priority of archival form over working form and presentation form

Gary F. Simons
SIL International


Abstract

The paradox of development
What’s a linguist to do?
Characteristics of an enduring format
Three levels of archival practice
Illustrating levels of archival practice
As rock solid as ASCII
Conclusion
References

Abstract

One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records in history are those that were carved into stone by the ancients. By contrast, digital word processing is our most advanced writing technology to date, but it is also the most ephemeral. Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years. Unless linguists take special measures to counter this, their digital records of endangered languages are in danger of dying out before the languages themselves.

A linguist must do two things in order to ensure that digital data endure: (1) the materials must be put into an enduring file format, and (2) the materials must be deposited with an archive that will make a practice of migrating them to new storage media as needed. The paper addresses the first of these issues. Most projects tend to focus on the working form of data (that is, the form in which the materials are stored as they are worked on from day to day) and the presentation form (the form in which the materials will be presented to the public). But these forms are closely tied to particular pieces of software and thus tend to become obsolete when the software does. The paper thus argues for the priority of the archival form (a form that is self-documenting and software independent) as the object of language documentation. Many file formats for textual data are discussed and illustrated with the ultimate conclusion that descriptive XML markup represents best current practice for the archival form.

1. The paradox of development

One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records from antiquity are those that were carved into stone or pressed into kiln-baked clay tablets. Writing on vellum and papyrus was a great advance in that the process was faster and the resulting product was much less bulky; but it was also a step backwards on the durability scale since the medium could be destroyed by fire or by water or even by microbes. With the modern use of paper, writing has advanced further, but it has become less durable yet as the chemicals used in the manufacture of paper can cause the medium to deteriorate from within, even in the best of storage conditions.

To complete the trend, digital word processing, which is our most advanced writing technology to date, is also the most ephemeral. Whereas ink on acid-free paper will endure for centuries, the longevity of digital storage media is an order of magnitude shorter. The industry’s early answer to long-term digital storage was magnetic tape, but this has proved to have a life expectancy of only 10 to 20 years (Van Bogart 1995). The current answer, CD-R, fares better but is still ephemeral from an archival point of view. Manufacturers report that CD-R discs should have a life expectancy of 100 to 200 years, but independent tests conducted at the National Institute of Standards and Technology found the life expectancy of the CD-R discs they tested to be 30 years (Byers 2003:13). The CD-RW medium is significantly less stable; the manufacturers predict a life expectancy of only 25 years. If the lab testing on CD-Rs is any indication, the actual life expectancy is probably more like 5 to 10 years. Byers (2003) gives an excellent description of how CD and DVD technologies work and how the media deteriorate over time.

But the problem is even worse than this, because the hardware devices that read these media become obsolete long before the media reach the end of their life expectancy. For instance, in the last 25 years we have seen removable media on personal computers advance from 8-inch floppies to 5.25-inch floppies to 3.5-inch floppies to Zip drives to CD-Rs to DVD-Rs. Unless one is diligent about migrating all of one’s legacy data to new media each time a new technology takes hold, those data will soon become trapped on media that no available hardware can read.

And the problem is worse yet, because software is changing, too. Though software technology is not advancing as quickly as hardware technology, the effect of software change is more devastating since the migration strategy that works for keeping data files accessible on the latest media cannot ensure that the files remain usable. This is because the functionality associated with those files is tied to particular software, and when the hardware that ran the needed software ceases to be available, then the functionality associated with those files ceases to exist. The fact that software vendors may change the file formats and functionality with each new version of software only exacerbates the problem.

When the results of our word processing are entrusted to the proprietary formats of a single software vendor, then we are completely at the mercy of that vendor as to whether our work will survive into the future. For instance, the author has a number of books and articles that were produced in the 1980s with Microsoft Word and its stylesheet feature (Simons 1989). The data files have been faithfully migrated over the years so that they remain readable today. However, current versions of Word no longer support stylesheets or the particular file format, with the result that the documents can no longer be rendered. The text stream can still be retrieved with any plain text editor since the characters are encoded with the ASCII standard, but the formatting and layout are encoded in a proprietary binary format and thus are completely lost in the absence of software that understands that format.
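The distinction between the recoverable text stream and the lost formatting can be illustrated with a minimal sketch. It assumes a file in which ASCII text is interleaved with binary formatting bytes; the function name extract_ascii_runs and the sample bytes are hypothetical and do not reproduce the actual Word file format. The sketch scans the raw bytes and keeps only runs of printable ASCII, which is roughly what one sees when opening such a file in a plain text editor.

```python
import re
from typing import List


def extract_ascii_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII (plus tab) at least min_len bytes long."""
    # Printable ASCII spans 0x20 through 0x7E; shorter runs are treated
    # as noise that happens to fall inside the binary formatting data.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical sample: two text runs surrounded by binary formatting bytes.
sample = b"\x00\x1f\x02Chapter 1\x00\x03\xffThe paradox of development\x01\x02ab\x00"

for run in extract_ascii_runs(sample):
    print(run)
```

The words survive, but nothing in the output says which run was a heading or how either was laid out; that information lived only in the bytes the sketch discards.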

The phenomenon of digital data loss has become so prevalent that many are beginning to warn of an impending “digital dark age”—the idea that historians of the future will look back to our present age as another Dark Ages since so much important information documenting our current civilization is recorded digitally and will have vanished (Bergeron 2002; Deegan and Tanner 2002). The popular press has chronicled many high-profile cases of digital data loss (Stepanek 1998; McKie and Thorpe 2002). A recent Associated Press story quotes a technologist in the MIT library to relate a state of affairs that hits closer to home for the typical academic (Jesdanun 2003):

Every now and then, a faculty member would come in in tears having some boxes of completely unreadable tapes—they've lost their life's work.

The bottom line is that in these days of short-lived computer media, hardware, and software, linguists need to be particularly careful about the way they use digital technologies lest their work be lost within a decade or two. In the absence of such diligence, our digital data records are even more endangered than the languages we are seeking to document.