This is a corpus of 36 calls drawn from the larger LDC Switchboard (SWB) corpus of 2438 calls. These are the calls that have complete discourse and treebank annotations and significant phonetic annotation. The TalkBank project seeks to enrich these annotations with as many new kinds of annotation as possible, and also to complete the partial phonetic transcription. We hope that interested members of the community will contribute annotations, exemplifying their models on a common set of data. Our thanks to Steven Greenberg (UC Berkeley), Dan Jurafsky (University of Colorado), Joe Picone (Mississippi State), and Elizabeth Shriberg (SRI) for furnishing annotations included in this corpus. The complete readme.doc file from the LDC CD-ROM follows.
TalkBank Switchboard Corpus, Version 0.1
Speech data and text transcription is Copyright (C) 2000 University of Pennsylvania. Other transcriptions and annotations are the copyright of their individual authors (as identified below).
Permission is granted for use of this material in accordance with the Open
Content License [http://opencontent.org/opl.shtml], a copy of which is
included on this CD-ROM.
This CD-ROM contains speech files, transcripts and annotations for 36
calls from the Switchboard Corpus [www.ldc.upenn.edu/Catalog/LDC93S7.html].
The Switchboard corpus has been enriched with various kinds of annotations
since it was first published. Full details are found in [1], which is
included on this CD-ROM.
From the original set of 2438 calls, 36 calls were selected which had
complete discourse and treebank annotations and significant phonetic
annotation.
The TalkBank project [www.talkbank.org] seeks to enrich these annotations
with as many new kinds of annotation as possible, and also to complete the
partial phonetic transcription. A subsequent version of this CD-ROM will
include the annotations that others provide to us. We earnestly hope that
interested members of the community will contribute annotations,
exemplifying their models on a common set of data, leading to a better
understanding of the models, and of their similarities and differences.
Where maximal overlap with existing phonetic annotation is required, please
begin by annotating calls which have the most ICSI phonetic annotation
(2830, 2887, ...). For details on annotation coverage, please see
tables/annot-summary.tab.
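As a rough illustration of how the coverage table might be used to pick calls, the sketch below ranks calls by a coverage column. It assumes that the file is tab-delimited with a header row, and the column names "call" and "icsi_phonetic" are hypothetical placeholders: the tables are not documented at present, so check the actual file and adjust the names before use.

```python
import csv
import io


def rank_calls(tab_text, id_col="call", coverage_col="icsi_phonetic"):
    """Return call ids ordered from most to least coverage.

    ASSUMPTION: tab_text is tab-delimited with a header row. The default
    column names are hypothetical, not taken from the corpus documentation.
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    rows = [(row[id_col], float(row[coverage_col])) for row in reader]
    # Sort by coverage, highest first; ties keep their original file order.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return [call_id for call_id, _ in rows]
```

To apply it, read the file into a string (for example with open("tables/annot-summary.tab").read()) and pass the real column names as id_col and coverage_col.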
CONTENTS OF THE CD-ROM
README              this file
LICENSE             the Open Content License
graff-bird.ps,pdf   local copy of Graff & Bird [1]
speech-mac/         speech files in AIFF format
  OR
speech-pc/          speech files in RIFF (wav) format
tables/             ancillary information about the calls and speakers
                    (documentation is not available at present)
trans-phonetic/     partial phonetic transcription (Greenberg, ICSI)
trans-text/         orthographic transcription (TI, LDC, BBN, ISIP)
trans-wordalign/    word-level time alignments (Picone, ISIP)
annotations/        this directory will house all 3rd party annotations
discourse/          discourse annotation (Jurafsky, Colorado; Shriberg, SRI)
treebank/           Penn Treebank annotation, including part-of-speech,
                    disfluency and syntactic tree annotation
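A quick way to confirm that a local copy of the CD-ROM matches this listing is sketched below. The directory names come from the listing; everything else is an assumption. In particular, the listing does not make clear whether discourse/ and treebank/ sit at the top level or inside annotations/, so the hypothetical helper missing_entries accepts either location, and it expects only one of the two speech directories to be present.

```python
from pathlib import Path

# Directory names copied from the contents listing.
TOP_LEVEL_DIRS = ["tables", "trans-phonetic", "trans-text",
                  "trans-wordalign", "annotations"]
# ASSUMPTION: these may be at the top level or under annotations/.
ANNOTATION_DIRS = ["discourse", "treebank"]
# A given disc carries one speech directory or the other.
SPEECH_DIRS = ["speech-mac", "speech-pc"]


def missing_entries(root):
    """Return the names of expected directories not found under root."""
    root = Path(root)
    missing = [name for name in TOP_LEVEL_DIRS if not (root / name).is_dir()]
    for name in ANNOTATION_DIRS:
        at_top = (root / name).is_dir()
        nested = (root / "annotations" / name).is_dir()
        if not (at_top or nested):
            missing.append(name)
    if not any((root / name).is_dir() for name in SPEECH_DIRS):
        missing.append("speech-mac or speech-pc")
    return missing
```

Calling missing_entries with the mount point of the disc returns an empty list when every expected directory is found.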
Correspondence about this publication should be directed to:
Dr Steven Bird
Linguistic Data Consortium
University of Pennsylvania
3615 Market St, Suite 200,
Philadelphia, PA 19104-2608
Email: sb@ldc.upenn.edu
We gratefully acknowledge the support of Steve Greenberg (UC Berkeley),
Dan Jurafsky (University of Colorado), Joe Picone (Mississippi State),
and Elizabeth Shriberg (SRI) in furnishing data which is included on this
CD-ROM.
[1] David Graff & Steven Bird (2000). Many uses, many annotations for large
speech corpora: Switchboard and TDT as case studies. Proceedings of the
Second International Conference on Language Resources and Evaluation,
pp. 427-433, Paris: European Language Resources Association.
http://arXiv.org/abs/cs/0007024