Article | Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland | Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus Linköping University Electronic Press Conference Proceedings

Title:
Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus
Author:
Starkaður Barkarson: The Árni Magnússon Institute for Icelandic Studies, University of Iceland, Iceland
Steinþór Steingrímsson: The Árni Magnússon Institute for Icelandic Studies, University of Iceland, Iceland
Year:
2019
Conference:
Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland
Issue:
167
Article no.:
015
Pages:
140--145
No. of pages:
5
Publication type:
Abstract and Fulltext
Published:
2019-10-02
ISBN:
978-91-7929-995-8
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



We present ParIce, a new English-Icelandic parallel corpus. It is the first parallel corpus built for language technology development and research for Icelandic, although some Icelandic texts can be found in various other multilingual parallel corpora. We map out which Icelandic texts are available for these purposes, collect data that is already aligned, and align the other bilingual texts we acquired. We describe the alignment process and how we filter the data to weed out noise and bad alignments. In total we collected 43 million Icelandic words in 4.3 million aligned segment pairs; after filtering, the corpus contains 38.8 million Icelandic words in 3.5 million segment pairs. We estimate that approximately 5% of the corpus data is noise or faulty alignments, while more than 50% of the segments we deleted were faulty. We estimate that the filtering process reduced the number of faulty segments in the corpus by more than 60% while reducing the number of good alignments by only approximately 8%.
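The figures in the abstract can be cross-checked with a little arithmetic. The sketch below is not the authors' evaluation procedure; it is a minimal consistency check that assumes the rounded counts are exact, that the 5% estimate applies to the segment pairs remaining after filtering, and that the percentages are measured over segment pairs rather than words. The function name `filtering_summary` and its parameters are hypothetical.

```python
def filtering_summary(total_pairs, kept_pairs, kept_faulty_rate, deleted_faulty_rate):
    """Derive reduction rates from pair counts and estimated faulty shares.

    Assumption: both rates are fractions of segment pairs (not words).
    """
    deleted_pairs = total_pairs - kept_pairs

    # Split the kept and the deleted pairs into faulty and good.
    faulty_kept = kept_pairs * kept_faulty_rate
    good_kept = kept_pairs - faulty_kept
    faulty_deleted = deleted_pairs * deleted_faulty_rate
    good_deleted = deleted_pairs - faulty_deleted

    return {
        "deleted_pairs": deleted_pairs,
        # Share of all faulty pairs that the filter removed.
        "faulty_reduction": faulty_deleted / (faulty_deleted + faulty_kept),
        # Share of all good pairs that the filter removed.
        "good_reduction": good_deleted / (good_deleted + good_kept),
    }


# Abstract figures: 4.3M pairs collected, 3.5M kept, about 5% of kept pairs
# faulty, and at least 50% of deleted pairs faulty (used here as a lower bound).
summary = filtering_summary(4_300_000, 3_500_000, 0.05, 0.50)
for name, value in summary.items():
    print(name, round(value, 4))
```

Under these assumptions, a faulty share of exactly 50% among the deleted pairs implies that about 70% of faulty pairs and about 11% of good pairs were removed. The reported loss of approximately 8% of good alignments would correspond to a faulty share of roughly 64% among the deleted pairs, which agrees with the abstract's "more than 50%" and "more than 60%".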

Keywords: Parallel Corpus; Machine Translation; Icelandic; Language Technology; Filtering; Parallel Corpus Filtering; Corpus Building; Alignments; Aligning Bilingual Texts; Assessment; Quality Assessment

