Conference article

Tilde MODEL - Multilingual Open Data for EU Languages

Roberts Rozis
Tilde

Raivis Skadinš
Tilde

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:35, p. 263-265

NEALT Proceedings Series 29:35, p. 263-265

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper describes a Multilingual Open Data corpus for European languages, built within the scope of the MODEL project. We describe how data sources were selected, which sources were used, how the source data was processed, which tools were used, and what data the project produced. We also present the quality of the resulting data and summarize the challenges encountered and the solutions chosen. The paper may serve as a guide and reference for anyone attempting a similar effort, as well as a guide to the newly available open data.
