Conference article

DaLAJ - a dataset for linguistic acceptability judgments for Swedish

Elena Volodina

Yousuf Ali Mohammed

Julia Klezl

Published in: Proceedings of the 10th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2021)

Linköping Electronic Conference Proceedings 177:3, pp. 28-37

Published: 2021-05-21

ISBN: 978-91-7929-625-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We present DaLAJ 1.0, a Dataset for Linguistic Acceptability Judgments for Swedish, comprising 9 596 sentences in its first version. DaLAJ is based on the SweLL second language learner data (Volodina et al., 2019), which consists of essays at different levels of proficiency. To ensure that the dataset can be made freely available despite the GDPR, we have sentence-scrambled the learner essays and removed part of the metadata about learners, keeping for each sentence only information about the learner's mother tongue and the level of the course in which the essay was written. We use the normalized version of learner language as the basis for DaLAJ sentences and keep only one error per sentence; the same sentence is repeated for each individual correction tag used in it. For DaLAJ 1.0, four of the 35 error categories available in SweLL are used, all connected to lexical or word-building choices. The dataset is included in the SwedishGlue benchmark. Below, we describe the format of the dataset, our insights, and our motivation for the chosen approach to data sharing.
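
The "one error per sentence" idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code and does not reflect the actual DaLAJ or SweLL file format: the `Correction` class, the `one_error_variants` function, the token-span representation, the English toy sentence, and the tag labels `TAG-A` and `TAG-B` are all assumptions made for illustration. Starting from the normalized (corrected) sentence, the sketch reinserts one learner error at a time, so that a sentence with two correction tags yields two entries, each containing exactly one error.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Correction:
    """Hypothetical record of one correction in a normalized sentence."""
    start: int                  # index of the first corrected token
    end: int                    # index one past the last corrected token
    original: Tuple[str, ...]   # what the learner wrote at that position
    tag: str                    # correction tag (placeholder label)


def one_error_variants(
    normalized: List[str], corrections: List[Correction]
) -> List[Tuple[str, str, str]]:
    """Return one (erroneous, corrected, tag) triple per correction.

    Each erroneous sentence is the normalized sentence with exactly one
    learner error put back; all other corrections stay applied.
    """
    corrected = " ".join(normalized)
    variants = []
    for c in corrections:
        tokens = normalized[:c.start] + list(c.original) + normalized[c.end:]
        variants.append((" ".join(tokens), corrected, c.tag))
    return variants


# Toy English example with made-up tags (not real SweLL data or labels).
normalized = ["she", "made", "a", "decision", "quickly"]
corrections = [
    Correction(start=1, end=2, original=("did",), tag="TAG-A"),
    Correction(start=3, end=4, original=("decide",), tag="TAG-B"),
]

variants = one_error_variants(normalized, corrections)
for erroneous, corrected, tag in variants:
    print(f"{tag}\t{erroneous}\t{corrected}")
```

Under these assumptions, each output row pairs an unacceptable sentence with its acceptable counterpart, which is the shape an acceptability-judgment dataset needs. In the real dataset, each row would additionally carry the retained metadata (mother tongue and course level).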

Keywords

acceptability judgments, Swedish, second language data, SwedishGlue
