Andrey Kutuzov
University of Oslo, Oslo, Norway
Elizaveta Kuzmenko
University of Trento, Trento, Italy
Published in: DL4NLP 2019. Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, 30 September, 2019, University of Turku, Turku, Finland
Linköping Electronic Conference Proceedings 163:3, p. 22-28
NEALT Proceedings Series 38:3, p. 22-28
Published: 2019-09-27
ISBN: 978-91-7929-999-6
ISSN: 1650-3686 (print), 1650-3740 (online)
In this paper, we critically evaluate the widespread assumption that deep learning NLP models do not require lemmatized input. To test this, we trained versions of the contextualised word embedding model ELMo on raw tokenized corpora and on the same corpora with word tokens replaced by their lemmas. These models were then evaluated on the word sense disambiguation task, for English and for Russian. The experiments showed that while lemmatization is indeed unnecessary for English, the situation is different for Russian. For morphologically rich languages, using lemmatized training and testing data yields small but consistent improvements, at least for word sense disambiguation. This means that decisions about text pre-processing before training ELMo should take the linguistic nature of the language in question into account.
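The pre-processing step contrasted in the abstract, replacing each word token with its lemma before training, can be sketched as follows. This is an illustrative sketch, not the authors' code: the lemma table and function names are hypothetical, and in practice a morphological analyzer would supply the lemmas, especially for a morphologically rich language like Russian.

```python
# Illustrative sketch of token-to-lemma replacement before training.
# TOY_LEMMAS is a hypothetical stand-in for a real morphological analyzer.
TOY_LEMMAS = {
    "trained": "train",
    "models": "model",
    "corpora": "corpus",
}

def lemmatize_tokens(tokens):
    """Map each token to its lemma, keeping unknown tokens unchanged."""
    return [TOY_LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

raw = ["We", "trained", "models", "on", "corpora"]
print(lemmatize_tokens(raw))  # ['we', 'train', 'model', 'on', 'corpus']
```

The raw-token and lemmatized versions of the corpus would then each be used to train a separate ELMo model, and the two models compared on word sense disambiguation.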
ELMo, lemmatization, pre-processing, contextualised embeddings, word sense disambiguation