Starkaður Barkarson
The Árni Magnússon Institute for Icelandic Studies, University of Iceland, Iceland
Steinþór Steingrímsson
The Árni Magnússon Institute for Icelandic Studies, University of Iceland, Iceland
Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland
Linköping Electronic Conference Proceedings 167:15, p. 140--145
NEALT Proceedings Series 42:15, p. 140--145
Published: 2019-10-02
ISBN: 978-91-7929-995-8
ISSN: 1650-3686 (print), 1650-3740 (online)
We present ParIce, a new English-Icelandic parallel corpus. It is the first parallel corpus built specifically for language technology development and research for Icelandic, although some Icelandic texts can be found in various other multilingual parallel corpora. We map out which Icelandic texts are available for these purposes, collect data that is already aligned, and align the other bilingual texts we acquired. We describe the alignment process and how we filter the data to weed out noise and bad alignments. In total we collected 43 million Icelandic words in 4.3 million aligned segment pairs; after filtering, our corpus includes 38.8 million Icelandic words in 3.5 million segment pairs. We estimate that approximately 5% of the corpus data is noise or faulty alignments, while more than 50% of the segments we deleted were faulty. We estimate that our filtering process reduced the number of faulty segments in the corpus by more than 60% while reducing the number of good alignments by only approximately 8%.
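The figures in the abstract can be related to one another with a little arithmetic. The sketch below is only a rough consistency check on the rounded numbers, not the authors' evaluation procedure. It assumes that the 5% noise estimate applies to the corpus after filtering, and it treats the share of deleted segments that were faulty as a free parameter, since the abstract only gives a lower bound ("more than 50%"). The function name and parameter names are hypothetical.

```python
def filtering_effect(pairs_before, pairs_after, noise_rate_after, faulty_share_removed):
    """Relate the abstract's figures: how much filtering cut faulty and good pairs.

    noise_rate_after: assumed share of faulty pairs in the filtered corpus.
    faulty_share_removed: assumed share of deleted pairs that were faulty.
    Returns (fraction of faulty pairs removed, fraction of good pairs removed).
    """
    removed = pairs_before - pairs_after
    faulty_removed = removed * faulty_share_removed
    good_removed = removed - faulty_removed
    faulty_kept = pairs_after * noise_rate_after
    good_kept = pairs_after - faulty_kept
    faulty_reduction = faulty_removed / (faulty_removed + faulty_kept)
    good_reduction = good_removed / (good_removed + good_kept)
    return faulty_reduction, good_reduction


# Rounded figures quoted in the abstract.
PAIRS_BEFORE = 4_300_000
PAIRS_AFTER = 3_500_000
NOISE_RATE_AFTER = 0.05

# The abstract only says "more than 50%", so try a few assumed shares.
for share in (0.50, 0.60, 0.65):
    faulty_cut, good_cut = filtering_effect(
        PAIRS_BEFORE, PAIRS_AFTER, NOISE_RATE_AFTER, share
    )
    print(f"faulty share of deleted = {share:.0%}: "
          f"faulty pairs reduced by {faulty_cut:.1%}, "
          f"good pairs reduced by {good_cut:.1%}")
```

Under these assumptions, every share at or above 50% gives a reduction in faulty segments above 60%, and shares somewhat above 60% bring the loss of good alignments close to the reported 8%. The rounded abstract figures cannot pin the share down more precisely than that.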
Parallel Corpus
Machine Translation
Icelandic
Language Technology
Filtering
Parallel Corpus Filtering
Corpus Building
Alignments
Aligning
Bilingual Texts
Assessment
Quality Assessment