Conference article

Topic modelling applied to a second language: A language adaptation and tool evaluation study

Maria Skeppstedt
The Language Council of Sweden, the Institute for Language and Folklore, Sweden

Magnus Ahltorp
The Language Council of Sweden, the Institute for Language and Folklore, Sweden

Kostiantyn Kucher
Department of Computer Science and Media Technology, Linnaeus University, Växjö, Sweden

Andreas Kerren
Department of Computer Science and Media Technology, Linnaeus University, Växjö, Sweden

Rafal Rzepka
Faculty of Information Science and Technology, Hokkaido University, Sapporo, Japan; RIKEN Center for Advanced Intelligence Project (AIP), Tokyo, Japan

Kenji Araki
Faculty of Information Science and Technology, Hokkaido University, Sapporo, Japan

Download article: https://doi.org/10.3384/ecp2020172017

Published in: Selected Papers from the CLARIN Annual Conference 2019

Linköping Electronic Conference Proceedings 172:17, pp. 145-156

Published: 2020-07-03

ISBN: 978-91-7929-807-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

The Topics2Themes tool, which enables text analysis on the output of topic modelling, was originally developed for the English language. In this study, we explored and evaluated the adaptations required for applying the tool to Japanese texts, that is, to a language that is very different from the one for which the tool was originally developed. Because Japanese does not use white space to indicate word boundaries, the texts had to be pre-tokenised and white space inserted to mark the token boundaries. Topics2Themes was also extended with word translations and phonetic readings to support users who are second-language speakers of Japanese. To evaluate the adaptation to a second language, as well as the reading support, we applied the tool to a corpus of short Japanese texts. Twelve topics were automatically identified, and a total of 183 texts representative of these topics were extracted. A learner of Japanese carried out a manual analysis of these representative texts and identified 35 recurring, fine-grained themes.
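
The pre-tokenisation step described in the abstract can be illustrated with a minimal Python sketch. It is not the tokeniser used in the study, which the abstract does not name. It assumes a greedy longest-match segmenter over a tiny, hypothetical lexicon (TOY_LEXICON), and the function names segment and pre_tokenise are likewise invented for illustration. A real pipeline would most likely rely on a morphological analyser instead.

```python
# Illustrative only: a toy longest-match segmenter, NOT the tokeniser used in the study.
# TOY_LEXICON and both function names are hypothetical.

def segment(text, lexicon):
    """Greedy longest-match segmentation.

    At each position the longest lexicon entry that matches is taken;
    characters not covered by the lexicon become single-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


def pre_tokenise(text, lexicon):
    """Insert white space between tokens so that a whitespace-based tool can split them."""
    return " ".join(segment(text, lexicon))


TOY_LEXICON = {"私", "は", "日本語", "を", "勉強", "します"}

print(pre_tokenise("私は日本語を勉強します", TOY_LEXICON))
# prints: 私 は 日本語 を 勉強 します
```

After this step, a tool that splits on white space, as one built for English typically does, sees one token per word, which is the property the adaptation relies on.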

Keywords

topic modelling, computer-assisted text analysis, language adaptation
