Sunday, March 8, 2009
Ethical practices in language
documentation and archiving
by Gary Holton
http://www.language-archives.org/events/olac05/olac-lsa05-holton.pdf
Intellectual Property and Audiovisual Archives and Collections
Anthony Seeger
University of California at Los Angeles
--------------------------------------------------------------------------------
We are in the midst of an intellectual property gold rush. Thousands of fortune-seekers are trying to stake their claims to promising territory, existing claims-holders are seeking increasingly aggressive means of defending their claims, and the original owners are often being ignored. Scholars and enthusiasts whose work uses intellectual property, and archives and libraries that store it, are largely bystanders in this gold rush; but they are profoundly affected by it.
http://www.loc.gov/folklife/fhcc/propertykey.html
Toward a Global Infrastructure for the Sustainability of Language Resources
http://www.sil.org/~simonsg/preprint/PACLIC22.pdf
Seven Dimensions of Portability (Bibliography)
Standards in Language Engineering (ISLE). We are grateful to Dafydd Gibbon, David Nathan, Nicholas Ostler, and the Language editors and anonymous reviewers for comments on earlier versions of this paper.
1 For a lucid discussion of the terms ‘language documentation’ and ‘language description’ we refer the reader to Himmelmann (1998).
2http://www.linguistics.ucsb.edu/faculty/cumming/WordForLinguists/Interlinear.htm
3http://hctv.humnet.ucla.edu/departments/linguistics/VowelsandConsonants/
4http://www.linguistics.unimelb.edu.au/research/projects/jiwarli/gloss.html
5http://etext.lib.virginia.edu/apache/ChiMesc2.html
6http://coombs.anu.edu.au/WWWVLPages/AborigPages/LANG/GAMDICT/GAMDICT.HTM
7http://www.ldc.upenn.edu/sb/fieldwork/
8http://www.cnc.bc.ca/yinkadene/dakinfo/dulktop.htm
9 Our purpose in citing specific examples is not to single them out for criticism, but to show how serious work by conscientious scholars has grappled with a host of technical problems in the course of exploring a large space of imperfect solutions.
10http://fonetiek-6.leidenuniv.nl/pil/stresstyp/stresstyp.html
11http://www.linguistics.berkeley.edu/CBOLD/
12http://ultratext.hil.unb.ca/Texts/Maliseet/dictionary/index.html
13http://ingush.berkeley.edu:7012/BITC.html
14http://www.rosettaproject.org:8080/live/
15 Further examples may be found on SIL’s page on Linguistic Computing Resources http://www.sil.org/linguistics/computing.html, on the Linguistic Exploration page http://www.ldc.upenn.edu/exploration/, and on the Linguistic Annotation page http://www.ldc.upenn.edu/annotation/.
16http://www.sil.org/computing/shoebox/
17http://fieldworks.sil.org/
18http://fonsg3.hum.uva.nl/praat/
19http://www.sil.org/computing/speechtools/speechanalyzier.htm
20http://childes.psy.cmu.edu/
21http://www.shlrc.mq.edu.au/emu/
22http://sf.net/projects/agtk/
23http://www.etca.fr/CTA/gip/Projets/Transcriber/
24http://sf.net/projects/agtk/
25http://www.xrce.xerox.com/research/mltt/fst/
26http://www.sil.org/computing/catalog/pc-parse.html
27http://www.sil.org/LinguaLinks/LingWksh.html
28http://www.sumerian.org/
29http://www.ailla.org/
30http://www.rosettaproject.org/
31http://www.uaf.edu/anlc/
32http://195.83.92.32/index.html.en
33http://www.nmnh.si.edu/naa/
34http://www.ldc.upenn.edu/exploration/archives.html
35http://www.language-archives.org/
36http://www.ldc.upenn.edu/
Bird & Simons, Language 79, 2003 (to appear)
37http://registry.dfki.de/
38 Celebrated early grammarians include Pāṇini (5th century BC), Dionysius of Thrace (2nd century BC), and Hesychius of Alexandria (5th century AD).
39http://www.unicode.org/
40http://xml.coverpages.org/sgml.html
41http://www.w3.org/XML/
42http://www.language-archives.org/
43http://www.linguistlist.org/olac/
44http://www.openarchives.org/
45http://www.mpi.nl/world/ISLE/documents/draft/ISLE_MetaData_2.5.pdf
46http://dublincore.org/
47http://www.linguistlist.org/olac/
48http://www.sil.org/silewp/citation.html
49http://www.doi.org/
50http://lcweb.loc.gov/preserv/
51http://www.unesco.org/webworld/portal_archives/pages/
52http://www.iasa-web.org/
53http://www.clir.org/
54http://palimpsest.stanford.edu/
55http://palimpsest.stanford.edu/bytopic/audio/
56http://www.oclc.org/research/pmwg/
57http://www.oclc.org/research/pmwg/
58http://www.rlg.org/
59http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html
60http://www.archive.org/
61http://lockss.stanford.edu/
62http://www.language-archives.org/
63http://www.opensource.org/
Monday, February 16, 2009
Ensuring that digital data last
EMELD Symposium on “Endangered Data vs. Enduring Practice,”
Linguistic Society of America annual meeting
8-11 January 2004, Boston, MA
Ensuring that digital data last
The priority of archival form over working form and presentation form
Gary F. Simons
SIL International
Abstract
The paradox of development
What’s a linguist to do?
Characteristics of an enduring format
Three levels of archival practice
Illustrating levels of archival practice
As rock solid as ASCII
Conclusion
References
Abstract
One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records in history are those that were carved into stone by the ancients. By contrast, digital word processing is our most advanced writing technology to date, but it is also the most ephemeral. Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years. Unless linguists take special measures to counter this, their digital records of endangered languages are in danger of dying out before the languages themselves.
A linguist must do two things in order to ensure that digital data endure: (1) the materials must be put into an enduring file format, and (2) the materials must be deposited with an archive that will make a practice of migrating them to new storage media as needed. The paper addresses the first of these issues. Most projects tend to focus on the working form of data (that is, the form in which the materials are stored as they are worked on from day to day) and the presentation form (the form in which the materials will be presented to the public). But these forms are closely tied to particular pieces of software and thus tend to become obsolete when the software does. The paper thus argues for the priority of the archival form (a form that is self documenting and software independent) as the object of language documentation. Many file formats for textual data are discussed and illustrated with the ultimate conclusion that descriptive XML markup represents best current practice for the archival form.
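To make the idea of a self-documenting, software-independent archival form concrete, here is a minimal sketch in Python of descriptive XML markup for a lexicon entry. The element names (lexicon, entry, headword, partOfSpeech, gloss) and the sample values are hypothetical placeholders, not a schema taken from the paper; the point is only that the tags say what each piece of data is rather than how it should look.

```python
import xml.etree.ElementTree as ET

def build_entry(headword, part_of_speech, gloss):
    """Return one entry whose tags describe what each value is, not how it is displayed."""
    entry = ET.Element("entry")  # hypothetical element names throughout
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "partOfSpeech").text = part_of_speech
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    return entry

def to_archival_xml(entries):
    """Wrap the entries in a root element and serialize them as plain Unicode text."""
    root = ET.Element("lexicon")
    root.extend(entries)
    return ET.tostring(root, encoding="unicode")

# Placeholder data only; no real language material is implied.
xml_text = to_archival_xml([build_entry("example-word", "noun", "example gloss")])
print(xml_text)
```

Because the result is plain text with descriptive tags, it can be read in any text editor and converted later into whatever working or presentation form a given tool requires.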
1. The paradox of development
One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records from antiquity are those that were carved into stone or pressed into kiln-baked clay tablets. Writing on vellum and papyrus was a great advance in that the process was faster and the resulting product was much less bulky; but it was also a step backwards on the durability scale since the medium could be destroyed by fire or by water or even by microbes. With the modern use of paper, writing has advanced further, but it has become less durable yet as the chemicals used in the manufacture of paper can cause the medium to deteriorate from within, even in the best of storage conditions.
To complete the trend, digital word processing, which is our most advanced writing technology to date, is also the most ephemeral. Whereas ink on acid-free paper will endure for centuries, the longevity of digital storage media is an order of magnitude shorter. The industry’s early answer to long-term digital storage was magnetic tape, but this has proved to have a life expectancy of only 10 to 20 years (Van Bogart 1995). The current answer, CD-R, fares better but is still ephemeral from an archival point of view. Manufacturers report that CD-R discs should have a life expectancy of 100 to 200 years, but independent tests conducted at the National Institute of Standards and Technology found the life expectancy of the CD-R discs they tested to be 30 years (Byers 2003:13). The CD-RW medium is significantly less stable; the manufacturers predict a life expectancy of only 25 years. If the lab testing on CD-Rs is any indication, the actual life expectancy is probably more like 5 to 10 years. Byers (2003) gives an excellent description of how CD and DVD technologies work and how the media deteriorate over time.
But the problem is even worse than this, because the hardware devices that read these media become obsolete long before the media reach the end of their life expectancy. For instance, in the last 25 years we have seen removable media on personal computers advance from 8-inch floppies to 5.25-inch floppies to 3.5-inch floppies to Zip drives to CD-Rs to DVD-Rs. Unless one is diligent about migrating all of one’s legacy data to new media each time a new technology takes hold, those data will soon become trapped on media that no available hardware can read.
And the problem is worse yet, because software is changing, too. Though software technology is not advancing as quickly as hardware technology, the effect of software change is more devastating since the migration strategy that works for keeping data files accessible on the latest media cannot ensure that the files remain usable. This is because the functionality associated with those files is tied to particular software, and when the hardware that ran the needed software ceases to be available, then the functionality associated with those files ceases to exist. The fact that software vendors may change the file formats and functionality with each new version of software only exacerbates the problem.
When the results of our word processing are entrusted to the proprietary formats of a single software vendor, then we are completely at the mercy of that vendor as to whether our work will survive into the future. For instance, the author has a number of books and articles that were produced in the 1980s with Microsoft Word and its stylesheet feature (Simons 1989). The data files have been faithfully migrated over the years so that they remain readable today. However, current versions of Word no longer support stylesheets or the particular file format, with the result that the documents can no longer be rendered. The text stream can still be retrieved with any plain text editor since the characters are encoded with the ASCII standard, but the formatting and layout are encoded in a proprietary binary format and thus are completely lost in the absence of software that understands that format.
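The point that the text stream survives while the formatting is lost can be illustrated with a small Python sketch. It scans arbitrary bytes for runs of printable ASCII characters, which is roughly what a plain text editor shows when it opens a binary file. The function name, the minimum run length, and the sample bytes are assumptions made for illustration; the sample is not a real Word file.

```python
import re

def extract_ascii_runs(data: bytes, min_length: int = 4) -> list[str]:
    """Return every run of printable ASCII (0x20-0x7E) at least min_length bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

# Invented stand-in for a proprietary file: readable text between opaque binary codes.
sample = b"\x00\x01Chapter one\x00\xff\x02fmt\x00The text survives.\x1f"
runs = extract_ascii_runs(sample)
print(runs)
```

The readable strings come back, but nothing in the output says which run was a heading or how any of it was laid out. That information lived in the bytes this sketch skips.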
The phenomenon of digital data loss has become so prevalent that many are beginning to warn of an impending “digital dark age”—the idea that historians of the future will look back to our present age as another Dark Ages since so much important information documenting our current civilization is recorded digitally and will have vanished (Bergeron 2002; Deegan and Tanner 2002). The popular press has chronicled many high-profile cases of digital data loss (Stepanek 1998; McKie and Thorpe 2002). A recent Associated Press story quotes a technologist in the MIT library to relate a state of affairs that hits closer to home for the typical academic (Jesdanun 2003):
Every now and then, a faculty member would come in in tears having some boxes of completely unreadable tapes—they've lost their life's work.
The bottom line is that in these days of short-lived computer media, hardware, and software, linguists need to be particularly careful about the way they use digital technologies lest their work be lost within a decade or two. In the absence of such diligence, our digital data records are even more endangered than the languages we are seeking to document.
Tuesday, January 6, 2009
David Nathan: Plugging in Indigenous Knowledge: Connections and Innovations
New network, newer media
The Internet is a connection infrastructure, like a road network. Built upon it are various information transport systems, including one called 'http' which supports World Wide Web document transfer. Like the road transport system, the Internet provides point-to-point connection (email) while also supporting mass transport where many can share the same experience simultaneously (the Web). In this sense the Web is a kind of broadcasting: it allows sharing and accumulation of cultural 'capital' and cultural 'memory'. Since 1994 in Australia this broadcasting system has grown phenomenally. It has become an infant medium.
In fact, http has changed society's understanding of computing, by transforming the nature of computer networks from one between computers to one of relationships between documents. And properties of documents are in turn being transformed, as discussed further below. Because documents are about authors, audiences, relationships, and power, computers are now, finally, about people. Although the talking robots promised by science fiction in the 1970s and by Artificial Intelligence in the 1980s failed to materialise, millions of people have turned their computers into communication sets and supplied the intelligence themselves. The Web grew so quickly that millions had used it before anyone actually thought of advertising it (Anon. 1997a:156). Its uptake rate has been unprecedented: while it took seventy-five years for the number of telephone users to reach fifty million, "it has taken 10 years to do the same" on the Internet (Riley 1997).
As an evolving medium, the Web is not fundamentally different from the others: its properties are crystallising from a mix of cultural preferences, physical constraints, and other historical and accidental factors. And like other media, once its formats have become conventional, those contributing factors will be reinterpreted as 'natural'. Just as no-one is interested in the workings, specifications or brand of their television, or cares about how its signals are distributed, the Internet will not be mature (or successful) until it disappears from general comment.
Right now, the Web is still 'soft'; its shape has not settled. That is why many Indigenous and non-Indigenous peoples have been working to promote expectations of Indigenous participation and content so that these values become built in to the medium. The final shape of the Web will reflect the values and materials of those who have participated in its growth.
Friday, December 26, 2008
DAM-LR
Project Description
DAM-LR proposes to develop and deploy an infrastructure for the European research community that is interested in an easy management of and access to linguistic resources of all kinds such as large (multimedia) corpora, lexicons, grammar descriptions and others.
It will not only foster the local developments now taking place in various linguistic data centers by deploying prototypical archive solutions, but also integrate these local archives virtually such that:
- users of linguistic resources see just one large collection
- users have just one identity to access the stored material
- ingest mechanisms allow users to integrate new data into this domain of linguistic resources
- managers get efficient tools to manage resources in the distributed domain
- managers get efficient mechanisms to deal with the access management aspects.
Therefore, the proposed DAM-LR concept will offer completely new opportunities for the producers of data, for the managers and for the users. In doing so DAM-LR will be a very important contribution to establish a Semantic Web of language resources, since only an integrated domain based on interoperable concepts such as unified access mechanisms will allow agents to smoothly find their way in a complex domain of heterogeneous data types.
DAM-LR will be based on 4 pillars that have been discussed intensively at various international meetings.
- The metadata concept for language resources has been developed during the last 4 years and can be seen as being stabilized and solid
- The introduction of unique resource identifiers will be important for operating in distributed collections and DAM-LR can use well-proven technology here
- A unified user and group management system will give persons one identity for accessing all resources and
- A unified access management system will allow managers to set access rights and delegate the possibility to set access rights in the intended distributed domain.
The DAM-LR project is funded by the EC Research Infrastructures
Concluding Documents
The Language Archiving Technology portal
"eScience is about global collaboration in key areas of science and the next generation of infrastructure that will enable it" (John Taylor).
The Language Archiving Technology (LAT) is meant to contribute to the sort of infrastructure that will be required in eHumanities. Its design will finally help to boost research in the humanities and to attract indigenous communities and the interested public to use the rich information in the language archive. It focuses on open accessibility of language resources; it supports dynamic and continuously enriched collections according to the Live Archives ideas; it stresses the need for long-term archiving of our digital collections covering unique material about languages that will probably become extinct in a few decades and it follows the trend towards service oriented architectures.
Language support with I.T.: not a high wire act
David Nathan
Paper presented at Learning IT Together, Brisbane, April 1999.
Introduction
For Australia's Indigenous languages - all of which are endangered - the current information technologies (IT) present an opportunity to develop a radical but pragmatic language practice. Multimedia and networked platforms allow us to tap into the best of available resources for language work, and to slice through some of the rhetorical positions that are holding back localised language ownership and use. IT, which is not only the language of our time but also the most powerful language tool other than natural language itself, can be mobilised to enhance the status and motivation of language work, while at the same time producing effective and enduring language resources.
Many factors contribute to the continuing destruction of Australian languages. The main one may be a western ideology of contempt for minority languages and a suspicion of bilingualism (Dorian 1998), an ideology which takes hold even among its victims. Other factors are associated with the relative status of the Indigenous language and the colonial one: they include the socioeconomic status of the language group; the presence or absence of a middle class "with the social self confidence to insist on traditional identity and heritage"; the existence of a body of literature in the language; and an association of the language with religious or other important practices (Dorian 1998:13).
Reversing or combatting these factors is a complex process, and is not guaranteed to maintain endangered languages, let alone revive languages which have been destroyed. We do know at least that perceived status of the language is one important factor. Another important factor for Indigenous Australians is locality. Around Australia, many Aboriginal people emphasise both local ownership of their ancestral language, as well as the strong relationship between language and land, or territory (see, for example, Jeanie Bell in Nathan 1996:25). This ideology underlies a crucial success factor for language programs: local community initiation and participation (Amery 1994:147-50; SSABSA 1996:44,52; WA Ministry of Education 1992:9,29).
There is a grave shortage of authentic texts in Indigenous languages. With its origins in the destruction of languages, the shortage also reflects the literacy disadvantage of many Aboriginal people, as well as the dominance of English for institutional and other status forms of communication. On the other hand, language is not literacy and we should beware of making written materials the core resources of language work (a point often made by Aboriginal people; see also McKay 1996: 233).
We are now at a historical time where Indigenous languages have attracted renewed interest from both their communities and the State education systems, at the same time as we have new opportunities through IT to generate appropriate resources in forms and contexts that are most effective for language maintenance and revival. IT is an ideal tool for local projects for language recording, preservation, and learning; above all, as a catalyst or platform for participatory practice involving multiple contexts, objectives and skills. IT is a modern, relevant, and, most importantly, highly valued area, with which we can achieve a range of language objectives.
Why is it important not to have a "high wire act?" Firstly, the difficulties involved in language revival are so great that we cannot afford to expend resources and emotions on projects that do not give sufficient return; or, even worse, projects that are born to fail. Secondly, while computers are the best tool we have for assisting language work, as discussed below, they are most effectively used to create modest resources through community participation in localised settings. All the technological pieces have been put in place over the last 10 years; now they wait to be exploited. Thirdly, there is little point using IT simply because it is there. Sometimes we should smile when teachers are too worried about having internet access in their classroom so students can communicate with children in other states, or countries, when they are never offered the use of, say, a telephone to call Granny to ask how to say something in lingo! And finally, the primary technology underlying computer applications for language is the human technology of natural languages, together with the alphabetic system used to encode them.
Sleeping beauties?
The process of language revival is often bound up with deep emotions and ideology. This has also been noted by Indigenous linguists in similar colonial settings in other countries. However, language revival unavoidably has to be something ideological, something that "we do to ourselves", not "done to us", and must be able to embrace all aspects of the social and personal lives of the community.
Nevertheless, there are some statements that seem to dampen progress toward productive language work. For example, I have heard it explained that some endangered or destroyed languages are not dead, but merely "sleeping". Although we might admire the sleeping beauty of these Indigenous languages, the catchcry should not be an excuse for inaction; it can tend to put off developing, within the community, beliefs and processes aiding language revival. It can also be underpinned by a belief that the community has a special talent for learning its own language, leading to disappointment when it is found just how difficult language work can be.
Other similar statements may not be helpful to language revival. For example, claims that language and the culture are inextricably linked, and that the culture can only be expressed through the language, may discourage or disenfranchise the young people who are the primary targets for language revival (Dorian 1998:20).
Drawing on their experience in south-east Alaska, Dauenhauer and Dauenhauer (one of whom is a native speaker of Tlingit, an Alaskan language) report typical responses to the question: do we want to preserve our Indigenous languages?
While it is generally politically and emotionally correct to proclaim resoundingly, "Yes!," the underlying and lingering fears, anxieties and insecurities over traditional language and culture suggest that the answer may really be, "No." ... We often find that those who vote "Yes" to "save the language and culture" expect someone else to "save" it for the others ... [b]ut language and culture do not exist in the abstract, as inalienable "products." They exist as active processes in the here and now.
(Dauenhauer and Dauenhauer 1998:63)
Computers
Perhaps it doesn't matter what tools we use for language work, as long as they foster active participation in the construction of resources that actually tell something of the community's experience. Freire reports of poor Chileans taking part in literacy programs who "wrote words with their tools on the dirt roads where they were working" (Freire 1972:43).
However, the pace and direction of development of the "new media" (multimedia and networked communications) over the past ten years has provided us with two key lessons: first, that what people want is not intelligent computers, but computers that allow us to communicate with other intelligent humans; and secondly, that people prefer messages expressed in "traditional" cultural forms until genuinely effective new genres have been evolved.
Digital platforms can help cut through at least some of the problems that hold back language resource development. Computer activities, and their resulting products, are typically given high status. They offer a goal-directed, quick-feedback context that encourages collaboration between people with a range of skills, in particular, encouraging young people to participate.
Because the new media are now challenging the shape and distribution of information, some people now believe that there are opportunities for Aboriginal people to play more leading roles in communication and publishing, both because of the renewed importance of graphic skills, and also because levels of disadvantage are lowered as people bypass or "leapfrog" the paper-based literacies (cf Nathan 1999).
There are many reasons why the new technologies appear to be perfectly suited to creating and delivering language learning resources:
- language involves authentic, rich and varied interaction. New technologies provide for this more than ever before
- multimedia offers presentation of sound, the true medium of human language, as well as other pedagogically effective media (such as graphics)
- networks support communication and relationships, the function of human language (cf Knobel et al 1997)
- computers offer a fast, cheap, accessible, and relatively painless tool for text work such as making dictionaries and grammars, designing writing systems
- hypertext can link texts of all kinds to dictionaries, grammars and other information
- the convergence of media technologies allows linking cultural artefacts to language activities
- today's computers and software are accessible and powerful enough for small local communities to create and adapt their own texts and resources
Methodologies
A full discussion of methodologies for using multimedia effectively for language work is beyond the scope of this paper. In summary, strategies for successfully undertaking small-scale multimedia resource development are:
- emphasise the acquisition or re-presentation of materials that are language performances, not merely data or evidence for the construction of analyses
- store and present sound; make it as accessible as possible, for example, by appropriate selection of interface or enriching the sound data with searchable annotation
- store different categories of information in robust and neutral formats, so they can be used in the future to create a variety of resources, including those not already foreseen
- pay attention to the design of presentation interfaces; make them attractive and motivating to the intended audiences
- design for access that is not dependent on written forms
- use a publishing approach; aim to produce concrete resources that are suited to their intended audiences and not overambitious
- start now! There is no set of rules for producing effective interactive multimedia, so the best experience comes from "getting hands dirty." Some useful examples of lessons learnt come from the use of cartoons as an elicitation tool (see below).
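Two of the strategies above, searchable annotation of sound and storage in robust, neutral formats, can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the column names (start_sec, end_sec, speaker, text), the tab-separated layout, and the sample rows are hypothetical placeholders and do not come from the AIATSIS project.

```python
import csv
import io

def write_annotations(rows):
    """Serialize time-aligned annotations as tab-separated plain text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["start_sec", "end_sec", "speaker", "text"])  # hypothetical columns
    writer.writerows(rows)
    return buffer.getvalue()

def search_annotations(tsv_text, query):
    """Return (start, end) offsets of every segment whose text contains the query."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    query = query.lower()
    return [
        (float(row["start_sec"]), float(row["end_sec"]))
        for row in reader
        if query in row["text"].lower()
    ]

# Placeholder segments standing in for a transcribed recording.
rows = [
    (0.0, 2.5, "A", "placeholder greeting"),
    (2.5, 6.0, "B", "placeholder reply about fishing"),
    (6.0, 9.2, "A", "placeholder question about fishing"),
]
tsv = write_annotations(rows)
hits = search_annotations(tsv, "fishing")
print(hits)
```

The offsets returned by a search tell a player where to start and stop the recording, so a user can reach the sound directly. The plain-text table itself stays readable without any particular software.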
In a following section I will illustrate a practical application of this framework to the development of useful and authentic resources in the familiar cartoon format.
Cartoons: a case study
Over the past 3 years at AIATSIS we have developed, in collaboration with language speakers, a simple software shell for presenting cartoons. Because there are so many endangered languages, it is wise to create a kind of template into which various language content can be recorded, rather than concentrate entirely on resources for one or two languages. This project has provided many lessons about the use of computers for language work.
The initial suggestion to use an electronic cartoon format came from a Kamilaroi (Gamilaraay) elder and former teacher, Auntie Rose Fernando, of Collarenebri NSW, a prominent and tireless promoter of language preservation in NSW. Cartoons have been used for Aboriginal education before, notably the Streetwize series aimed at health promotion. However, extending them to a computer platform has opened up several new possibilities.
Language use in cartoons combines formula with performance and can be regarded as an extension to the use of songs, which have proved very effective in the Indigenous language classroom (Dauenhauer and Dauenhauer 1998:68, Amery 1994, Hudson 1994).
While the scope for presenting spontaneous language via cartoons is limited, their ability to do so is far superior to most of the other forms used in language teaching or recording. Unlike a dictionary, they present words in context; unlike a grammar, they present sentences with social meanings; unlike many stories, they show how language is used in the context of real contemporary relationships. (In the right circumstances, the cartoons can incorporate real characters from the local community.) Being less formal than other text forms, they encourage us to retain idiomatic and informal expressions that are otherwise often unwittingly censored from printed products.
Most importantly, of course, cartoons allow us to use sound, and to provide access to the sound of language without intermediation by written forms. The cartoon's graphic form, with speech bubbles that objectify text, serves as a transparent interface or screen device for presenting language. Users know more-or-less what to do and what to expect as they interact with it.
Yandrruwanda cartoon. Produced with Greg McKellar and Muda Aboriginal Corporation
source: http://www.it.usyd.edu.au/~djn/papers/NotHighWire.htm