Marina Santini
RISE Research Institutes of Sweden (Division ICT - RISE SICS East), Stockholm, Sweden
Benjamin Danielsson
Department of Computer and Information Science, Linköping University, Linköping, Sweden
Arne Jönsson
RISE Research Institutes of Sweden, Stockholm, Sweden / Department of Computer and Information Science, Linköping University, Linköping, Sweden
Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, 2019, Turku, Finland
Linköping Electronic Conference Proceedings 167:11, pp. 105–114
NEALT Proceedings Series 42:11, pp. 105–114
Published: 2019-10-02
ISBN: 978-91-7929-995-8
ISSN: 1650-3686 (print), 1650-3740 (online)
We explore the effectiveness of four feature representations (bag-of-words, word embeddings, principal components and autoencoders) for the binary categorization of the easy-to-read variety vs. standard language. Standard language refers to the ordinary language variety used by a population as a whole or by a community, while the "easy-to-read" variety is a simpler (or a simplified) version of the standard language. We test these feature representations on three corpora, which differ in size, class balance, unit of analysis, language and topic, using both supervised and unsupervised machine learning algorithms. Results show that bag-of-words is a robust and straightforward feature representation for this task and performs well in many experimental settings. Its performance is comparable, and sometimes equal, to that achieved with principal components and autoencoders, whose preprocessing is, however, more time-consuming. Word embeddings are less accurate than the other feature representations for this classification task.
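As a rough illustration of the comparison described in the abstract, the sketch below contrasts raw bag-of-words features with a principal-component compression of them on a toy binary easy-to-read vs. standard-language task. This is a minimal sketch only: the paper's experiments were run with Weka, and the corpus, labels and model choices here are invented stand-ins, not the authors' pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented stand-in corpus: label 1 = easy-to-read variety,
# label 0 = standard language. Repeated to allow cross-validation.
docs = [
    "The cat sat on the mat. It was warm.",
    "The dog ran in the park. He was happy.",
    "Notwithstanding prior stipulations, the contractual obligations persist.",
    "The epistemological ramifications remain a subject of scholarly debate.",
] * 10
labels = [1, 1, 0, 0] * 10

# Bag-of-words features fed directly to a linear classifier.
bow_clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
print("BoW accuracy:", cross_val_score(bow_clf, docs, labels, cv=5).mean())

# Same features compressed to principal components first
# (TruncatedSVD is the sparse-friendly analogue of PCA).
pc_clf = make_pipeline(CountVectorizer(), TruncatedSVD(n_components=2),
                       LogisticRegression(max_iter=1000))
print("PC accuracy:", cross_val_score(pc_clf, docs, labels, cv=5).mean())
```

The extra decomposition step in the second pipeline is exactly the kind of additional preprocessing cost the abstract notes for principal components and autoencoders relative to plain bag-of-words.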
feature representation
text classification
easy-to-read variety
standard language
Weka
supervised machine learning
deep learning
clustering
bag-of-words
principal components
autoencoders
word embeddings