Conference article

Multiclass Text Classification on Unbalanced, Sparse and Noisy Data

Tillmann Dönicke
Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany

Florian Lux
Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany

Matthias Damaschk
Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany


Published in: DL4NLP 2019. Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, 30 September, 2019, University of Turku, Turku, Finland

Linköping Electronic Conference Proceedings 163:7, p. 58-65

NEALT Proceedings Series 38:7, p. 58-65


Published: 2019-09-27

ISBN: 978-91-7929-999-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper discusses methods to improve the performance of text classification on data that is difficult to classify because it contains a large number of unbalanced classes with noisy examples. A variety of features are tested in combination with three neural-network-based methods of increasing complexity. The classifiers are applied to a songtext–artist dataset that is large, unbalanced and noisy. We conclude that substantial improvement can be obtained by removing unbalancedness and sparsity from the data. This fulfills the classification task only unsatisfactorily; with contemporary methods, however, it is a practical step towards fairly satisfactory results.
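As an illustration of what removing unbalancedness and sparsity can look like in practice, the following minimal sketch drops classes with too few examples and downsamples the remaining classes to a common size. The abstract does not say how the authors did this, so the function name `balance_dataset`, the threshold `min_per_class`, and the toy data are assumptions made only for this example.

```python
import random
from collections import defaultdict


def balance_dataset(examples, min_per_class, seed=0):
    """Hypothetical helper: keep only classes with at least `min_per_class`
    examples and downsample each kept class to exactly that many."""
    # Group the texts by their class label (e.g. the artist).
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < min_per_class:
            continue  # sparse class: too few examples, so it is dropped
        # Downsample without replacement so every kept class has equal size.
        for text in rng.sample(texts, min_per_class):
            balanced.append((text, label))
    return balanced


# Toy data invented for illustration: 5, 3 and 1 texts per artist.
toy = (
    [(f"lyric a{i}", "artist_a") for i in range(5)]
    + [(f"lyric b{i}", "artist_b") for i in range(3)]
    + [("lyric c0", "artist_c")]
)
balanced = balance_dataset(toy, min_per_class=3)
```

With `min_per_class=3`, the class with a single text is discarded and the other two end up with three texts each. This also shows the trade-off the abstract points to: the resulting task is easier, but it no longer covers every class in the original data.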

Keywords

neural network, text classification, sparse data, noisy data, stylometry
