TalkBank Switchboard Corpus, Version 0.1

Speech data and text transcription is Copyright (C) 2000 University of Pennsylvania. Other transcriptions and annotations are the copyright of their individual authors (as identified below). Permission is granted for use of this material in accordance with the Open Content License [http://opencontent.org/opl.shtml], a copy of which is included on this CD-ROM.

This CD-ROM contains speech files, transcripts and annotations for 36 calls from the Switchboard Corpus [www.ldc.upenn.edu/Catalog/LDC93S7.html]. The Switchboard corpus has been enriched with various kinds of annotations since it was first published. Full details are found in [1], which is included on this CD-ROM.

From the original set of 2438 calls, 36 calls were selected which had complete discourse and treebank annotations and significant phonetic annotation. The TalkBank project [www.talkbank.org] seeks to enrich these annotations with as many new kinds of annotation as possible, and also to complete the partial phonetic transcription. A subsequent version of this CD-ROM will include the annotations that others provide to us. We earnestly hope that interested members of the community will contribute annotations, exemplifying their models on a common set of data, leading to a better understanding of the models, and of their similarities and differences.

Where maximal overlap with existing phonetic annotation is required, please begin by annotating calls which have the most ICSI phonetic annotation (2830, 2887, ...). For details on annotation coverage, please see tables/annot-summary.tab.
CONTENTS OF THE CD-ROM

  README              this file
  LICENSE             the Open Content License
  graff-bird.ps,pdf   local copy of Graff & Bird [1]
  speech-mac/         speech files in AIFF format
    OR
  speech-pc/          speech files in RIFF (wav) format
  tables/             ancillary information about the calls and speakers
                      (documentation is not available at present)
  trans-phonetic/     partial phonetic transcription (Greenberg, ICSI)
  trans-text/         orthographic transcription (TI, LDC, BBN, ISIP)
  trans-wordalign/    word-level time alignments (Picone, ISIP)
  annotations/        this directory will house all 3rd party annotations
  discourse/          discourse annotation (Jurafsky, Colorado; Shriberg, SRI)
  treebank/           Penn Treebank annotation, including part-of-speech,
                      disfluency and syntactic tree annotation

Correspondence about this publication should be directed to:

  Dr Steven Bird
  Linguistic Data Consortium
  University of Pennsylvania
  3615 Market St, Suite 200,
  Philadelphia, PA 19104-2608
  Email: sb@ldc.upenn.edu

We gratefully acknowledge the support of Steve Greenberg (UC Berkeley), Dan Jurafsky (University of Colorado), Joe Picone (Mississippi State), and Elizabeth Shriberg (SRI), in furnishing data which is included on this CD-ROM.

[1] David Graff & Steven Bird (2000). Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies. Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 427-433, Paris: European Language Resources Association, 2000. http://arXiv.org/abs/cs/0007024