Conference article

Polish Read Speech Corpus for Speech Tools and Services

Danijel Koržinek
Polish-Japanese Academy of Information Technology, Warsaw, Poland

Krzysztof Marasek
Polish-Japanese Academy of Information Technology, Warsaw, Poland

Łukasz Brocki
Polish-Japanese Academy of Information Technology, Warsaw, Poland

Krzysztof Wołk
Polish-Japanese Academy of Information Technology, Warsaw, Poland

Published in: Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure

Linköping Electronic Conference Proceedings 136:4, p. 54-62

Published: 2017-05-23

ISBN: 978-91-7685-499-0

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper describes the speech processing activities conducted by the Polish consortium of the CLARIN project. The purpose of this segment of the project was to develop tools for automatic and semi-automatic processing of large quantities of acoustic speech data. The tools cover grapheme-to-phoneme conversion, speech-to-text alignment, voice activity detection, speaker diarization, keyword spotting and automatic speech transcription. To support the development of these tools, a large, high-quality studio speech corpus was recorded and released under an open license, in order to encourage further work in Polish speech research. The corpus is also intended to serve as a reference for studies in phonetics and pronunciation. All the tools and resources were released on the Polish CLARIN website. This paper discusses the current status of the project and plans for its future.


Speech corpora, speech recognition, speech alignment, grapheme-to-phoneme, speaker diarization, voice activity detection, keyword spotting


