Published: 2017-05-08
ISBN: 978-91-7685-601-7
ISSN: 1650-3686 (print), 1650-3740 (online)
This paper describes a Multilingual Open Data corpus for European languages that was built within the scope of the MODEL project. We describe how data sources were selected, which sources were used, how the source data was processed, which tools were used, and what data the project produced. We also report the quality of the resulting data and summarize the challenges encountered and the solutions chosen. The paper may serve as a guide and reference for anyone attempting a similar effort, as well as a guide to the newly obtained open data.