OLAC Record: Coreference in Universal Dependencies 1.3 (CorefUD 1.3)

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5896

Metadata

Title: Coreference in Universal Dependencies 1.3 (CorefUD 1.3)

Bibliographic Citation: http://hdl.handle.net/11234/1-5896

Creator: Novák, Michal

Popel, Martin

Zeman, Daniel

Žabokrtský, Zdeněk

Nedoluzhko, Anna

Acar, Kutay

Bamman, David

Bourgonje, Peter

Cinková, Silvie

Eckhoff, Hanne

Cebiroğlu Eryiğit, Gülşen

Hajič, Jan

Hardmeier, Christian

Haug, Dag

Jørgensen, Tollef

Kåsen, Andre

Krielke, Pauline

Landragin, Frédéric

Lapshinova-Koltunski, Ekaterina

Mæhlum, Petter

Martí, M. Antònia

Mikulová, Marie

Milintsevich, Kirill

Mujadia, Vandan

Muzerelle, Judith

Nam, Sangha

Nøklestad, Anders

Ogrodniczuk, Maciej

Øvrelid, Lilja

Pamay Arslan, Tuğba

Porada, Ian

Recasens, Marta

Solberg, Per Erik

Stede, Manfred

Straka, Milan

Swanson, Daniel

Toldova, Svetlana

Vadász, Noémi

Velldal, Erik

Vincze, Veronika

Zeldes, Amir

Žitkus, Voldemaras

Date (W3CDTF): 2025-04-22T07:29:39Z

Date Available: 2025-04-22T07:29:39Z

Description: CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In its current version 1.3, CorefUD consists of 28 datasets for 18 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 24 datasets for 17 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 2 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions of all datasets. When using any of the harmonized datasets, please check its license (placed in the same directory as the data) and also cite the original data resource. Compared to the previous version 1.2, version 1.3 adds new languages and corpora, namely French-ANCOR, Hindi-HDTB, and Korean-ECMT. In addition, English-GUM and Czech-PDT have been updated to newer versions, and the conversion of zeros in Hungarian-KorKor has been improved (a list of all changes in each dataset can be found in the corresponding README file).
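
The description says that coreference and bridging information is stored as attribute-value pairs in the MISC column of CoNLL-U files. The following is a minimal sketch of how such pairs could be read, assuming the standard CoNLL-U layout (10 tab-separated columns, MISC last, items separated by `|`, `_` for an empty field). The function names and the attribute `ExampleAttr` in the sample line are illustrative assumptions, not names taken from CorefUD; the actual attribute names are defined in the dataset documentation.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Split a CoNLL-U MISC field into attribute-value pairs."""
    if misc == "_":  # CoNLL-U writes "_" for an empty field
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def misc_of_line(line: str) -> dict[str, str]:
    """Return the parsed MISC column (10th field) of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(columns)}")
    return parse_misc(columns[9])


# Illustrative token line; "ExampleAttr" is a made-up attribute name.
example = "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\tExampleAttr=value1|SpaceAfter=No"
print(misc_of_line(example))
```

Comment lines (starting with `#`) and blank sentence separators in a real file would need to be skipped before calling `misc_of_line`.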

Identifier (URI): http://hdl.handle.net/11234/1-5896

Language: Ancient Greek (to 1453)

Ancient Hebrew

Catalan

Czech

English

French

German

Hindi

Hungarian

Korean

Lithuanian

Norwegian

Church Slavic

Polish

Russian

Spanish

Turkish

Language (ISO639): grc

hbo

cat

ces

eng

fra

deu

hin

hun

kor

lit

nor

chu

pol

rus

spa

tur

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Replaces (URI): http://hdl.handle.net/11234/1-5478

Rights: Licence CorefUD v1.3

https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.3

Subject: coreference

bridging relations

harmonized annotation

dependency

treebank

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-5896

DateStamp: 2025-04-22

GetRecord: OAI-PMH request for simple DC format
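
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request URL can be assembled from the record's OAI identifier and the protocol's `verb`, `identifier`, and `metadataPrefix` parameters. The base URL below is a placeholder, not the archive's real endpoint, and the prefixes `oai_dc` (simple DC) and `olac` are assumed to be the ones this archive exposes.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint (assumption); substitute the archive's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
OAI_ID = "oai:lindat.mff.cuni.cz:11234/1-5896"

print(get_record_url(BASE_URL, OAI_ID, "oai_dc"))  # simple DC format
print(get_record_url(BASE_URL, OAI_ID, "olac"))    # OLAC format
```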

Search Info
Citation: Novák, Michal; Popel, Martin; Zeman, Daniel; Žabokrtský, Zdeněk; Nedoluzhko, Anna; Acar, Kutay; Bamman, David; Bourgonje, Peter; Cinková, Silvie; Eckhoff, Hanne; Cebiroğlu Eryiğit, Gülşen; Hajič, Jan; Hardmeier, Christian; Haug, Dag; Jørgensen, Tollef; Kåsen, Andre; Krielke, Pauline; Landragin, Frédéric; Lapshinova-Koltunski, Ekaterina; Mæhlum, Petter; Martí, M. Antònia; Mikulová, Marie; Milintsevich, Kirill; Mujadia, Vandan; Muzerelle, Judith; Nam, Sangha; Nøklestad, Anders; Ogrodniczuk, Maciej; Øvrelid, Lilja; Pamay Arslan, Tuğba; Porada, Ian; Recasens, Marta; Solberg, Per Erik; Stede, Manfred; Straka, Milan; Swanson, Daniel; Toldova, Svetlana; Vadász, Noémi; Velldal, Erik; Vincze, Veronika; Zeldes, Amir; Žitkus, Voldemaras. 2025. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Asia area_Europe country_CZ country_DE country_ES country_FR country_GB country_GR country_HU country_IL country_IN country_KR country_LT country_NO country_PL country_RU country_TR dcmi_Text iso639_cat iso639_ces iso639_chu iso639_deu iso639_eng iso639_fra iso639_grc iso639_hbo iso639_hin iso639_hun iso639_kor iso639_lit iso639_nor iso639_pol iso639_rus iso639_spa iso639_tur olac_primary_text
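
The Terms line is a space-separated list in which each term appears to combine a facet prefix (such as `area`, `country`, or `iso639`) with a value. A small sketch of grouping the terms by that prefix follows; treating the text before the first underscore as the facet is an assumption based on the naming pattern, and `group_terms` is a hypothetical helper.

```python
from collections import defaultdict


def group_terms(terms: str) -> dict[str, list[str]]:
    """Group space-separated search terms by the prefix before the first underscore."""
    groups = defaultdict(list)
    for term in terms.split():
        facet, _, value = term.partition("_")
        groups[facet].append(value)
    return dict(groups)


# A shortened sample of the Terms line above.
sample = "area_Asia area_Europe country_CZ dcmi_Text iso639_ces iso639_eng olac_primary_text"
for facet, values in group_terms(sample).items():
    print(facet, values)
```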

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5896
Up-to-date as of: Wed Apr 23 1:02:02 EDT 2025
