Conference article

Málrómur: A Manually Verified Corpus of Recorded Icelandic Speech

Steinþór Steingrímsson
The Árni Magnússon Institute for Icelandic Studies, Iceland

Jón Guðnason
Reykjavik University, Iceland

Sigrún Helgadóttir
The Árni Magnússon Institute for Icelandic Studies, Iceland

Eiríkur Rögnvaldsson
Department of Icelandic, University of Iceland, Iceland

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:29, p. 237-240

NEALT Proceedings Series 29:29, p. 237-240

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper describes the Málrómur corpus, an open, manually verified Icelandic speech corpus. The recordings were collected in 2011–2012 by Reykjavik University and the Icelandic Center for Language Technology in cooperation with Google. In total, 152 hours of speech were recorded from 563 participants. Evaluators subsequently listened to every segment and judged whether it contained the utterance the participant was supposed to read, and nothing else. Of the 127,286 recorded segments, 108,568 were approved and 18,718 were deemed unsatisfactory.
