Conference article

Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus

Starkaður Barkarson
The Árni Magnússon Institute for Icelandic Studies, University of Iceland, Iceland

Steinþór Steingrímsson
The Árni Magnússon Institute for Icelandic Studies, University of Iceland, Iceland

Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland

Linköping Electronic Conference Proceedings 167:15, pp. 140-145

NEALT Proceedings Series 42:15, pp. 140-145

Published: 2019-10-02

ISBN: 978-91-7929-995-8

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We present ParIce, a new English-Icelandic parallel corpus. It is the first parallel corpus built for language technology development and research for Icelandic, although some Icelandic texts can be found in various other multilingual parallel corpora. We map out which Icelandic texts are available for these purposes, collect data that is already aligned, and align the other bilingual texts we acquired. We describe the alignment process and how we filter the data to weed out noise and bad alignments. In total, we collected 43 million Icelandic words in 4.3 million aligned segment pairs; after filtering, our corpus includes 38.8 million Icelandic words in 3.5 million segment pairs. We estimate that approximately 5% of the corpus data is noise or faulty alignments, while more than 50% of the segments we deleted were faulty. We estimate that our filtering process reduced the number of faulty segments in the corpus by more than 60% while reducing the number of good alignments by only approximately 8%.
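The filtering figures in the abstract can be cross-checked with a rough back-of-the-envelope calculation. The sketch below is not the authors' evaluation method: the function name and its parameters are hypothetical, the inputs are the rounded numbers from the abstract, and the share of deleted segments that were faulty is set to 0.5, which the abstract gives only as a lower bound ("more than 50%").

```python
def filtering_effect(total_pairs, kept_pairs, noise_rate_kept, faulty_share_deleted):
    """Estimate how much filtering reduced faulty and good segment pairs.

    All inputs are assumptions taken from the rounded figures in the abstract;
    this is an illustrative consistency check, not the paper's procedure.
    """
    deleted = total_pairs - kept_pairs
    # Split the deleted pairs into faulty and good ones.
    faulty_deleted = deleted * faulty_share_deleted
    good_deleted = deleted - faulty_deleted
    # Split the pairs that survived filtering the same way.
    faulty_kept = kept_pairs * noise_rate_kept
    good_kept = kept_pairs - faulty_kept
    # Fraction of each class that filtering removed.
    faulty_reduction = faulty_deleted / (faulty_deleted + faulty_kept)
    good_reduction = good_deleted / (good_deleted + good_kept)
    return faulty_reduction, good_reduction


# 4.3M pairs collected, 3.5M kept, ~5% noise in the kept pairs,
# and (as an assumed lower bound) 50% of the deleted pairs faulty.
faulty_cut, good_cut = filtering_effect(4.3e6, 3.5e6, 0.05, 0.5)
print(f"faulty segments removed: {faulty_cut:.1%}")
print(f"good segments removed:   {good_cut:.1%}")
```

Under these assumptions the faulty-segment reduction comes out above the 60% stated in the abstract. The good-segment reduction at the 0.5 bound is an upper estimate; raising `faulty_share_deleted` above 0.5 moves it down toward the reported figure of approximately 8%.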

Keywords

Parallel Corpus, Machine Translation, Icelandic Language Technology, Filtering, Parallel Corpus Filtering, Corpus Building, Alignments, Aligning Bilingual Texts, Assessment, Quality assessment
