OLAC Record oai:www.ldc.upenn.edu:LDC2006T13 |
Metadata | ||
Title: | Web 1T 5-gram Version 1 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006 | |
Contributor: | Brants, Thorsten | |
Franz, Alex | ||
Date (W3CDTF): | 2006 | |
Date Issued (W3CDTF): | 2006-09-19 | |
Description: | *Introduction*
Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts for approximately 1 trillion tokens. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
*Data*
The n-gram counts were generated from text taken from publicly accessible Web pages. The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
* Hyphenated words are usually separated, and hyphenated numbers usually form one token.
* Sequences of numbers separated by slashes (e.g. in dates) form one token.
* Sequences that look like URLs or email addresses form one token.
The files total 24 GB of compressed (gzip'ed) text and contain the following:
Tokens: 1,024,908,267,229
Sentences: 95,119,665,584
Unigrams: 13,588,391
Bigrams: 314,843,401
Trigrams: 977,069,902
Fourgrams: 1,313,818,354
Fivegrams: 1,176,470,663
*Samples*
Text samples (TXT) of the 3-gram and 4-gram data in this corpus are available.
*Updates*
None at this time. | |
Extent: | Corpus size: 20971520 KB | |
Identifier: | LDC2006T13 | |
https://catalog.ldc.upenn.edu/LDC2006T13 | ||
ISBN: 1-58563-397-6 | ||
ISLRN: 831-344-220-094-6 | ||
DOI: 10.35111/cqpa-a498 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | Web 1T 5-gram Version 1 Agreement: https://catalog.ldc.upenn.edu/license/web-1t-5-gram-version-1.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2006T13 | |
Rights Holder: | Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
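The Description field above says the corpus holds word n-grams of length one to five together with their observed frequency counts. As a rough illustration of what such counts are, the sketch below tallies n-grams over a few already-tokenized sentences. It is not the procedure used to build the corpus: the function name `count_ngrams`, the toy sentences, and the choice to count n-grams only within a single sentence are assumptions made for this example, and the tokenization rules described in the record are not reproduced.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_ngrams(sentences: Iterable[Sequence[str]], max_n: int = 5) -> Counter:
    """Count every n-gram of length 1..max_n within each tokenized sentence.

    Hypothetical helper for illustration only; n-grams never span two sentences.
    """
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # A sentence with fewer than n tokens yields no n-grams of that length.
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return counts


# Toy input, already tokenized (made-up data, not taken from the corpus).
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
]
toy_counts = count_ngrams(toy_sentences)
```

Here `toy_counts[("the", "cat")]` gives the frequency of that bigram across both sentences; a real run over Web-scale text would need streaming and sharded counting rather than a single in-memory `Counter`.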
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2006T13 | |
DateStamp: | 2021-02-26 | |
GetRecord: | OAI-PMH request for simple DC format | |
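The GetRecord entries above refer to OAI-PMH requests for this record. As a minimal sketch of how such a request URL is put together, the code below assembles the `verb`, `identifier`, and `metadataPrefix` query arguments that OAI-PMH defines for GetRecord. The base URL is a placeholder, not the actual endpoint for this archive, and the function name is hypothetical. The `oai_dc` prefix is the one OAI-PMH specifies for simple Dublin Core; the `olac` prefix for the OLAC format is an assumption here.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Assemble an OAI-PMH GetRecord request URL (no network access is performed)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2006T13"

dc_url = build_getrecord_url(BASE_URL, RECORD_ID, "oai_dc")  # simple DC format
olac_url = build_getrecord_url(BASE_URL, RECORD_ID, "olac")  # assumed OLAC prefix
```

The colons in the identifier are percent-encoded by `urlencode`, so the identifier can be passed in exactly as it appears in the OaiIdentifier field.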
Search Info | ||
Citation: | Brants, Thorsten; Franz, Alex. 2006. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Text iso639_eng olac_primary_text |