Tuesday, April 14, 2009
The World Digital Library will make available on the Internet, free of charge and in multilingual format, significant primary materials from cultures around the world, including manuscripts, maps, rare books, musical scores, recordings, films, prints, photographs, architectural drawings, and other significant cultural materials. The objectives of the World Digital Library are to promote international and inter-cultural understanding and awareness, provide resources to educators, expand non-English and non-Western content on the Internet, and contribute to scholarly research.
Sunday, March 8, 2009
documentation and archiving
by Gary Holton
University of California at Los Angeles
We are in the midst of an intellectual property gold rush. Thousands of fortune-seekers are trying to stake their claims to promising territory, existing claims-holders are seeking increasingly aggressive means of defending their claims, and the original owners are often being ignored. Scholars and enthusiasts whose work uses intellectual property, and archives and libraries that store it, are largely bystanders in this gold rush; but they are profoundly affected by it.
Standards in Language Engineering (ISLE). We are grateful to Dafydd Gibbon, David Nathan, Nicholas Ostler, and the Language editors and anonymous reviewers for comments on earlier versions of this paper.
1 For a lucid discussion of the terms ‘language documentation’ and ‘language description’ we refer the reader to Himmelmann
9 Our purpose in citing specific examples is not to single them out for criticism, but to show how serious work by conscientious
scholars has grappled with a host of technical problems in the course of exploring a large space of imperfect solutions.
15 Further examples may be found on SIL’s page on Linguistic Computing Resources http://www.sil.org/linguistics/computing.html, on the Linguistic Exploration page http://www.ldc.upenn.edu/exploration/, and on the Linguistic Annotation page http://www.ldc.upenn.edu/annotation/.
Bird & Simons, Language 79, 2003 (to appear)
38 Celebrated early grammarians include Pāṇini (5th century BC), Dionysius of Thrace (2nd century BC), and Hesychius of Alexandria (5th century AD).
Monday, February 16, 2009
EMELD Symposium on “Endangered Data vs. Enduring Practice,”
Linguistic Society of America annual meeting
8-11 January 2004, Boston, MA
Ensuring that digital data last
The priority of archival form over working form and presentation form
Gary F. Simons
The paradox of development
What’s a linguist to do?
Characteristics of an enduring format
Three levels of archival practice
Illustrating levels of archival practice
As rock solid as ASCII
One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records in history are those that were carved into stone by the ancients. By contrast, digital word processing is our most advanced writing technology to date, but it is also the most ephemeral. Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years. Unless linguists take special measures to counter this, their digital records of endangered languages are in danger of dying out before the languages themselves.
A linguist must do two things in order to ensure that digital data endure: (1) the materials must be put into an enduring file format, and (2) the materials must be deposited with an archive that will make a practice of migrating them to new storage media as needed. The paper addresses the first of these issues. Most projects tend to focus on the working form of data (that is, the form in which the materials are stored as they are worked on from day to day) and the presentation form (the form in which the materials will be presented to the public). But these forms are closely tied to particular pieces of software and thus tend to become obsolete when the software does. The paper thus argues for the priority of the archival form (a form that is self documenting and software independent) as the object of language documentation. Many file formats for textual data are discussed and illustrated with the ultimate conclusion that descriptive XML markup represents best current practice for the archival form. [full article]
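To make the idea of a self-documenting, software-independent archival form concrete, here is a minimal sketch of descriptive XML markup for a single lexical entry. The element names (`entry`, `headword`, `partOfSpeech`, `gloss`) and the sample data are assumptions chosen for illustration; they are not taken from the paper or from any published schema.

```python
import xml.etree.ElementTree as ET


def build_entry(headword, part_of_speech, gloss):
    """Build one lexical entry whose tags describe content, not appearance.

    The element names are hypothetical, chosen only for this illustration.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "partOfSpeech").text = part_of_speech
    # The language of the gloss is recorded explicitly, not left implicit.
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    return entry


def serialize(entry):
    """Return the entry as plain text that any text editor can open."""
    return ET.tostring(entry, encoding="unicode")


# Hypothetical sample data.
xml_text = serialize(build_entry("example", "noun", "an illustrative item"))

# Any generic XML parser can recover the structure without special software.
roundtrip = ET.fromstring(xml_text)
```

Because each tag names what the content is and says nothing about how to display it, a presentation form (a web page, a printed dictionary) could in principle be derived from such a file later, while the file itself stays readable as plain text.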
1. The paradox of development
One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records from antiquity are those that were carved into stone or pressed into kiln-baked clay tablets. Writing on vellum and papyrus was a great advance in that the process was faster and the resulting product was much less bulky; but it was also a step backwards on the durability scale since the medium could be destroyed by fire or by water or even by microbes. With the modern use of paper, writing has advanced further, but it has become less durable yet as the chemicals used in the manufacture of paper can cause the medium to deteriorate from within, even in the best of storage conditions.
To complete the trend, digital word processing, which is our most advanced writing technology to date, is also the most ephemeral. Whereas ink on acid-free paper will endure for centuries, the longevity of digital storage media is an order of magnitude shorter. The industry’s early answer to long-term digital storage was magnetic tape, but this has proved to have a life expectancy of only 10 to 20 years (Van Bogart 1995). The current answer, CD-R, fares better but is still ephemeral from an archival point of view. Manufacturers report that CD-R discs should have a life expectancy of 100 to 200 years, but independent tests conducted at the National Institute of Standards and Technology found the life expectancy of the CD-R discs they tested to be 30 years (Byers 2003:13). The CD-RW medium is significantly less stable; the manufacturers predict a life expectancy of only 25 years. If the lab testing on CD-Rs is any indication, the actual life expectancy is probably more like 5 to 10 years. Byers (2003) gives an excellent description of how CD and DVD technologies work and how the media deteriorate over time.
But the problem is even worse than this, because the hardware devices that read these media become obsolete long before the media reach the end of their life expectancy. For instance, in the last 25 years we have seen removable media on personal computers advance from 8-inch floppies to 5.25-inch floppies to 3.5-inch floppies to Zip drives to CD-Rs to DVD-Rs. Unless one is diligent about migrating all of one’s legacy data to new media each time a new technology takes hold, those data will soon become trapped on media that no available hardware can read.
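The migration discipline described here can be sketched in a few lines. The following is only an illustration, under the assumption that "migrating" means copying a file to a directory on the new medium and confirming that the copy is bit-for-bit identical; the function names and the choice of SHA-256 as the checksum are assumptions of this sketch, not practices prescribed by the paper.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source, destination_dir):
    """Copy a file to a new location and verify that the copy is intact.

    Raises OSError if the checksums of the original and the copy differ.
    """
    source = Path(source)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Checksum mismatch after copying {source} to {target}")
    return target
```

A check of this kind only confirms that the bits survived the move to new media. It says nothing about whether software will still exist to interpret them.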
And the problem is worse yet, because software is changing, too. Though software technology is not advancing as quickly as hardware technology, the effect of software change is more devastating since the migration strategy that works for keeping data files accessible on the latest media cannot ensure that the files remain usable. This is because the functionality associated with those files is tied to particular software, and when the hardware that ran the needed software ceases to be available, then the functionality associated with those files ceases to exist. The fact that software vendors may change the file formats and functionality with each new version of software only exacerbates the problem.
When the results of our word processing are entrusted to the proprietary formats of a single software vendor, then we are completely at the mercy of that vendor as to whether our work will survive into the future. For instance, the author has a number of books and articles that were produced in the 1980s with Microsoft Word and its stylesheet feature (Simons 1989). The data files have been faithfully migrated over the years so that they remain readable today. However, current versions of Word no longer support stylesheets or the particular file format, with the result that the documents can no longer be rendered. The text stream can still be retrieved with any plain text editor since the characters are encoded with the ASCII standard, but the formatting and layout are encoded in a proprietary binary format and thus are completely lost in the absence of software that understands that format.
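The point that the text stream survives while the formatting is lost can be illustrated with a short sketch. It assumes a file in which readable text is stored as ASCII bytes interleaved with binary formatting data; the sample bytes below are invented for the illustration and do not reproduce the actual Word file format.

```python
import re
from typing import List


def extract_ascii_text(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII characters found in binary data.

    Bytes outside the printable range 0x20-0x7E (here standing in for
    proprietary formatting codes) are skipped, as are runs shorter than
    min_length, which are more likely to be noise than text.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Invented sample: text fragments separated by unreadable binary bytes.
sample = b"\x00\x01Hello, world\x00\xff\x02ab\x00formatting lost\x9c"
recovered = extract_ascii_text(sample)
```

Here `recovered` holds the two text fragments long enough to pass the length filter. Nothing in the skipped bytes can be interpreted without software that understands the original format, which is exactly the loss described above.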
The phenomenon of digital data loss has become so prevalent that many are beginning to warn of an impending “digital dark age”—the idea that historians of the future will look back to our present age as another Dark Ages since so much important information documenting our current civilization is recorded digitally and will have vanished (Bergeron 2002; Deegan and Tanner 2002). The popular press has chronicled many high-profile cases of digital data loss (Stepanek 1998; McKie and Thorpe 2002). A recent Associated Press story quotes a technologist in the MIT library to relate a state of affairs that hits closer to home for the typical academic (Jesdanun 2003):
Every now and then, a faculty member would come in in tears having some boxes of completely unreadable tapes—they've lost their life's work.
The bottom line is that in these days of short-lived computer media, hardware, and software, linguists need to be particularly careful about the way they use digital technologies lest their work be lost within a decade or two. In the absence of such diligence, our digital data records are even more endangered than the languages we are seeking to document. [full article]
Tuesday, January 6, 2009
The Internet is a connection infrastructure, like a road network. Built upon it are various information transport systems, including one called 'http' which supports World Wide Web document transfer. Like the road transport system, the Internet provides point-to-point connection (email) while also supporting mass transport where many can share the same experience simultaneously (the Web). In this sense the Web is a kind of broadcasting: it allows sharing and accumulation of cultural 'capital' and cultural 'memory'. Since 1994 in Australia this broadcasting system has grown phenomenally. It has become an infant medium.
In fact, http has changed society's understanding of computing, by transforming the nature of computer networks from connections between computers into relationships between documents. And properties of documents are in turn being transformed, as discussed further below. Because documents are about authors, audiences, relationships, and power, computers are now, finally, about people. Although the talking robots promised by science fiction in the 1970s and by Artificial Intelligence in the 1980s failed to materialise, millions of people have turned their computers into communication sets and supplied the intelligence themselves. The Web grew so quickly that millions had used it before anyone actually thought of advertising it (Anon. 1997a:156). Its uptake rate has been unprecedented: while it took seventy-five years for the number of telephone users to reach fifty million, "it has taken 10 years to do the same" on the Internet (Riley 1997).
As an evolving medium, the Web is not fundamentally different from the others: its properties are crystallising from a mix of cultural preferences, physical constraints, and other historical and accidental factors. And like other media, once its formats have become conventional, those contributing factors will be reinterpreted as 'natural'. Just as no-one is interested in the workings, specifications or brand of their television, or cares about how its signals are distributed, the Internet will not be mature (or successful) until it disappears from general comment. Right now, the Web is still 'soft'; its shape has not settled. That is why many Indigenous and non-Indigenous peoples have been working to promote expectations of Indigenous participation and content so that these values become built in to the medium. The final shape of the Web will reflect the values and materials of those who have participated in its growth.
Friday, December 26, 2008
DAM-LR proposes to develop and deploy an infrastructure for the European research community that is interested in easy management of, and access to, linguistic resources of all kinds, such as large (multimedia) corpora, lexicons, grammar descriptions and others.
It will not only foster the local developments now taking place in various linguistic data centers by deploying prototypical archive solutions, but also integrate these local archives virtually, such that users of linguistic resources see just one large collection; users have just one identity to access the stored material; ingest mechanisms allow users to integrate new data into this domain of linguistic resources; and managers get efficient tools both to manage resources in the distributed domain and to handle access management.
Therefore, the proposed DAM-LR concept will offer completely new opportunities for the producers of data, for the managers and for the users. In doing so DAM-LR will be a very important contribution toward establishing a Semantic Web of language resources, since only an integrated domain based on interoperable concepts such as unified access mechanisms will allow agents to smoothly find their way in a complex domain of heterogeneous data types.
DAM-LR will be based on 4 pillars that have been discussed intensively at various international meetings.
- The metadata concept for language resources has been developed over the last 4 years and can be regarded as stable and solid.
- The introduction of unique resource identifiers will be important for operating in distributed collections, and DAM-LR can use well-proven technology here.
- A unified user and group management system will give each person one identity for accessing all resources.
- A unified access management system will allow managers to set access rights, and to delegate the ability to set access rights, in the intended distributed domain.
The DAM-LR project is funded by the EC Research Infrastructures