OLAC Record
oai:catalogue.elra.info:ELRA-S0494

Metadata
Title:EthioSpeech
Access Rights: Rights available for: nonCommercialUse, commercialUse
Date Available (W3CDTF):2025-03-21
Date Issued (W3CDTF):2025-03-21
Description:EthioSpeech Corpora comprises over 391 hours of recorded read speech in six Ethiopian languages, with ca. 200 speakers per language: Amharic (68 hours), Tigrinya (62 hours), Oromo (70 hours), Somali (56 hours), Afar (68 hours), and Sidama (68 hours). The dominant domain is media (mainly newspapers), but for some of the languages texts from other domains were used, including spiritual content. The recordings were made on mobile devices with the LIG-Aikuma speech recording tool installed on them. The corpus is intended as a resource for developing automatic speech recognition (ASR) systems for these six languages (in a monolingual setup) and for other related languages (in a multilingual and/or cross-lingual setup). Use cases for speech recognition systems built on this dataset include dictation systems, transcription systems, assistive technologies, spoken dialogue systems, speech translation, and similar speech technologies. To make the data set representative, the team selected six working languages used across the regional states of Ethiopia and aimed for gender and age balance among readers: the gender split is nearly equal for Amharic, Tigrinya and Oromo, whereas readers for the other three languages are mainly male.
The age distribution is between 18 and 40. More details are given below:
- Amharic: Number of recorded sentences (only verified): 25,610; Number of speakers: 203; Recorded speech length in hours: 68:11
- Tigrinya: Number of recorded sentences (only verified): 26,955; Number of speakers: 210; Recorded speech length in hours: 61:42
- Oromo: Number of recorded sentences (only verified): 25,287; Number of speakers: 200; Recorded speech length in hours: 69:57
- Somali: Number of recorded sentences (only verified): 25,175; Number of speakers: 200; Recorded speech length in hours: 55:57
- Afar: Number of recorded sentences (only verified): 25,659; Number of speakers: 200; Recorded speech length in hours: 67:53
- Sidama: Number of recorded sentences (only verified): 25,113; Number of speakers: 200; Recorded speech length in hours: 67:36
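The per-language figures above can be cross-checked against the headline total with a short sketch. It assumes that durations such as "68:11" mean hours:minutes (the record does not say so explicitly); the names STATS, to_minutes and totals are illustrative only.

```python
# Per-language figures copied from the description above:
# (verified sentences, speakers, recorded speech length).
# Assumption: the speech length is written as hours:minutes.
STATS = {
    "Amharic": (25610, 203, "68:11"),
    "Tigrinya": (26955, 210, "61:42"),
    "Oromo": (25287, 200, "69:57"),
    "Somali": (25175, 200, "55:57"),
    "Afar": (25659, 200, "67:53"),
    "Sidama": (25113, 200, "67:36"),
}


def to_minutes(duration: str) -> int:
    """Convert an 'HH:MM' string to a number of minutes."""
    hours, minutes = duration.split(":")
    return int(hours) * 60 + int(minutes)


def totals(stats):
    """Sum sentences, per-language speaker counts, and duration."""
    sentences = sum(s for s, _, _ in stats.values())
    speakers = sum(p for _, p, _ in stats.values())
    minutes = sum(to_minutes(d) for _, _, d in stats.values())
    return sentences, speakers, divmod(minutes, 60)


sentences, speakers, (hours, minutes) = totals(STATS)
print(f"{sentences} sentences, {speakers} speakers, {hours}:{minutes:02d} hours")
```

Under that assumption the durations add up to 391 hours 16 minutes, which agrees with the "over 391 hours" stated in the description, and the verified sentences total 153,799. The speaker figure is a sum of per-language counts.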
Identifier:ELRA-S0494
ISLRN: 886-456-351-764-8
Identifier (URI):https://catalog.elra.info/en-us/repository/browse/ELRA-S0494/
Language:Somali
Tigrinya
Oromo
Sidamo
Amharic
Language (ISO639):som
tir
orm
sid
amh
Medium:Not specified
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Sound
Type (OLAC):primary_text

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-S0494
DateStamp:  2025-03-21
GetRecord:  OAI-PMH request for simple DC format
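For readers who want to issue the GetRecord requests themselves, the following is a minimal sketch of how such a request URL is assembled under OAI-PMH 2.0. The base URL is a placeholder because this record does not state the archive's endpoint, and the "olac" metadata prefix is an assumption about how the OLAC format is exposed; "oai_dc" is the prefix the protocol defines for simple Dublin Core. The function name get_record_url is illustrative.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-S0494"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core, as named in the GetRecord line above.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
# OLAC format; the prefix "olac" is an assumption.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")

print(dc_url)
print(olac_url)
```

The colons in the identifier are percent-encoded by urlencode, so the resulting URL can be passed as-is to an HTTP client once the real base URL is substituted.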

Search Info

Citation: n.a. 2025. ELRA (European Language Resources Association).
Terms: area_Africa country_ET country_SO dcmi_Sound iso639_amh iso639_orm iso639_sid iso639_som iso639_tir olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0494
Up-to-date as of: Thu Apr 3 2:08:25 EDT 2025