OLAC Record: Web 1T 5-gram Version 1

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T13

Metadata

Title: Web 1T 5-gram Version 1

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Brants, Thorsten

Contributor: Franz, Alex

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-09-19

Description: *Introduction* Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts, derived from approximately 1 trillion tokens of text. The n-grams range in length from unigrams (single words) to five-grams. The data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.

*Data* The n-gram counts were generated from text taken from publicly accessible Web pages. The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:

* Hyphenated words are usually separated, and hyphenated numbers usually form one token.
* Sequences of numbers separated by slashes (e.g., in dates) form one token.
* Sequences that look like URLs or email addresses form one token.

The distribution consists of 24 GB of gzip-compressed text files containing the following:

Tokens: 1,024,908,267,229
Sentences: 95,119,665,584
Unigrams: 13,588,391
Bigrams: 314,843,401
Trigrams: 977,069,902
Fourgrams: 1,313,818,354
Fivegrams: 1,176,470,663

*Samples* For an example of the 3-gram data in this corpus, please review this text sample (TXT). For an example of the 4-gram data in this corpus, please review this text sample (TXT).

*Updates* None at this time.
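As an illustration of how the n-gram counts described above might be read, here is a minimal Python sketch. The record only states that the files are gzip-compressed UTF-8 text. The per-line layout used here (tokens separated by single spaces, then a tab, then the count) is an assumption, and the function names and the sample lines are invented for this example rather than taken from the corpus.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple

NgramRecord = Tuple[Tuple[str, ...], int]


def parse_ngram_line(line: str) -> NgramRecord:
    """Split one line into (tokens, count).

    Assumed layout (not stated in this record): tokens separated by
    single spaces, then a tab, then the integer frequency count.
    """
    ngram, _, count = line.rstrip("\n").rpartition("\t")
    return tuple(ngram.split(" ")), int(count)


def read_ngrams(stream: TextIO) -> Iterator[NgramRecord]:
    """Yield parsed records from an open text stream, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield parse_ngram_line(line)


def open_ngram_file(path: str) -> TextIO:
    """Open a gzip-compressed UTF-8 text file for line-by-line reading."""
    return gzip.open(path, mode="rt", encoding="utf-8")


# Invented sample lines, used only to exercise the parser.
sample = "the quick brown\t42\nthe quick red\t7\n"
records = list(read_ngrams(io.StringIO(sample)))
total = sum(count for _, count in records)
```

With a real file, `read_ngrams(open_ngram_file(path))` would stream records without loading the whole file into memory, which matters at this corpus size.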

Extent: Corpus size: 20971520 KB

Identifier: LDC2006T13

https://catalog.ldc.upenn.edu/LDC2006T13

ISBN: 1-58563-397-6

ISLRN: 831-344-220-094-6

DOI: 10.35111/cqpa-a498

Language: English

Language (ISO639): eng

License: Web 1T 5-gram Version 1 Agreement: https://catalog.ldc.upenn.edu/license/web-1t-5-gram-version-1.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T13

Rights Holder: Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T13

DateStamp: 2021-02-26

GetRecord: OAI-PMH request for simple DC format
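The GetRecord links above correspond to OAI-PMH requests. The following sketch shows how such a request URL could be assembled in Python. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, and `oai_dc` is the protocol's prefix for simple Dublin Core. The base URL is a placeholder, since the actual endpoint is not given in this record, and the function name is invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint: the real repository base URL is not stated in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{BASE_URL}?{query}"


url = get_record_url("oai:www.ldc.upenn.edu:LDC2006T13")
params = parse_qs(urlsplit(url).query)
```

Passing a different `metadata_prefix` selects another format, provided the repository supports it.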

Search Info
Citation: Brants, Thorsten; Franz, Alex. 2006. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T13
Up-to-date as of: Tue Apr 8 1:30:57 EDT 2025
