OLAC Record oai:www.ldc.upenn.edu:LDC2003S06 |
Metadata | ||
Title: | Santa Barbara Corpus of Spoken American English Part II | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Du Bois, John W., et al. Santa Barbara Corpus of Spoken American English Part II LDC2003S06. Web Download. Philadelphia: Linguistic Data Consortium, 2003 | |
Contributor: | Du Bois, John W. | |
Chafe, Wallace L. | ||
Meyer, Charles | ||
Thompson, Sandra A. | ||
Martey, Nii | ||
Date (W3CDTF): | 2003 | |
Date Issued (W3CDTF): | 2003-10-02 | |
Description: | *Introduction* Santa Barbara Corpus of Spoken American English Part II was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003S06, ISBN 1-58563-272-4. Santa Barbara Corpus of Spoken American English Part II is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse (Director: John W. Du Bois (UCSB); Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB)). Santa Barbara Corpus of Spoken American English Part II is also part of the International Corpus of English (ICE) (Charles W. Meyer, Director), representing the American Component. For software and additional data resources, please refer to the following sites: TalkBank, International Corpus of English. Part I of the Santa Barbara Corpus of Spoken American English is also available as LDC2000S85. *Data* The audio data consists of 16 wave-format speech files, recorded in two-channel PCM at 22,050 Hz. The speech files total approximately six hours of audio (1.8 GB), representing over 47K words and over 5K unique words in transcription. Each speech file is accompanied by two transcripts in which intonation units are time-stamped with respect to the audio recording. The two types of transcripts are distinguished by file extension: .trn and .ca. The text and coding content of the two transcripts for a given speech file are identical. 
However, the transcripts with the ".ca" extension are in the CHAT format for conversational analysis, formatted for use with the CLAN software, available from TalkBank. The transcripts with the ".trn" extension are structured according to the LDC Callhome format, for use with a variety of annotation tools. (Please also note that transcript coding is not presented as in the ICE standard.) Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair lists the beginning and ending times of the filtered regions. Four .flt files are empty because the corresponding audio files contained no information that needed to be filtered out. The filtering was done using a digital FIR low-pass filter with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of each filtered region over a 1,000-sample span, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform. *Acknowledgements* The completion and release of this corpus were facilitated by funding from the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania. *Updates* There are no updates available at this time. | |
Extent: | Corpus size: 1887436 KB | |
Format: | Sampling Rate: 22050 | |
Sampling Format: pcm | ||
Identifier: | LDC2003S06 | |
https://catalog.ldc.upenn.edu/LDC2003S06 | ||
ISBN: 1-58563-272-4 | ||
ISLRN: 951-825-759-886-6 | ||
DOI: 10.35111/v06j-4w13 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2003S06 | |
Rights Holder: | Portions © 2003 University of California, © 2003 Trustees of the University of Pennsylvania | |
Type (DCMI): | Sound | |
Type (OLAC): | primary_text | |
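The audio figures quoted in the Description and Extent fields can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes 16-bit samples (the record says only "pcm") and a duration of exactly six hours (the record says approximately six), and the function names are not part of any LDC tooling.

```python
SAMPLE_RATE_HZ = 22_050   # from the Format field
CHANNELS = 2              # "two-channel" in the Description
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 16-bit PCM; the record does not state the bit depth
FADE_SAMPLES = 1_000      # fade length stated in the Description


def fade_duration_ms(samples: int = FADE_SAMPLES, rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration in milliseconds of a span of `samples` samples at `rate_hz`."""
    return 1000.0 * samples / rate_hz


def audio_size_kb(hours: float) -> float:
    """Uncompressed PCM size in KB (1 KB = 1024 bytes) for `hours` of audio."""
    total_bytes = hours * 3600 * SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE
    return total_bytes / 1024


if __name__ == "__main__":
    print(f"fade: {fade_duration_ms():.1f} ms")
    print(f"6 h of audio: {audio_size_kb(6.0):,.0f} KB")
```

Under these assumptions the fade works out to about 45.4 ms, in line with the "roughly 45 milliseconds" in the Description, and six hours of audio comes to about 1.86 million KB, in the same range as the 1887436 KB given under Extent.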
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2003S06 | |
DateStamp: | 2021-07-01 | |
GetRecord: | OAI-PMH request for simple DC format | |
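The GetRecord entries above correspond to OAI-PMH requests, which take a `verb`, an `identifier`, and a `metadataPrefix` as query parameters. A minimal sketch of building such a request URL follows. The base URL is a placeholder, since this record does not state the endpoint; `oai_dc` is the standard prefix for simple Dublin Core, and any other prefix (for example one for the OLAC format) would be an assumption to check against the repository.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str,
                   metadata_prefix: str = "oai_dc",
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(get_record_url("oai:www.ldc.upenn.edu:LDC2003S06"))
```

The colons in the identifier are percent-encoded by `urlencode`, which is what an OAI-PMH server expects in a query string.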
Search Info | ||
Citation: | Du Bois, John W.; Chafe, Wallace L.; Meyer, Charles; Thompson, Sandra A.; Martey, Nii. 2003. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text |