Conference article

Using Data Mining and the CLARIN Infrastructure to Extend Corpus-based Linguistic Research

Thomas Bartz
TU Dortmund University, Department of German Language and Literature, Dortmund, Germany

Christian Pölitz
TU Dortmund University, Artificial Intelligence Group, Dortmund, Germany

Katharina Morik
TU Dortmund University, Artificial Intelligence Group, Dortmund, Germany

Angelika Storrer
Mannheim University, Department of German Philology, Mannheim, Germany


Published in: Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands

Linköping Electronic Conference Proceedings 116:1, p. 1-13


Published: 2015-08-26

ISBN: 978-91-7685-954-4

ISSN: 1650-3686 (print), 1650-3740 (online)


Large digital corpora of written language, such as those held by the CLARIN-D centers, provide excellent opportunities for linguistic research on authentic language data. Nonetheless, the large number of hits that can be retrieved from corpora often poses challenges in concrete linguistic research settings. This is particularly the case if the queried word forms or constructions are (semantically) ambiguous. The joint project ‘Corpus-based Linguistic Research and Analysis Using Data Mining’ (“Korpus-basierte linguistische Recherche und Analyse mit Hilfe von Data-Mining”, ‘KobRA’) is therefore investigating the benefits and issues of using machine learning technologies to perform post-retrieval cleaning and disambiguation tasks automatically. This article gives an overview of the questions, methodologies and current results of the project, specifically in the scope of corpus-based lexicography and historical semantics. In this area, topic models were used to partition KWIC (keyword-in-context) lists of search results, retrieved by querying various corpora for polysemous or homonymous words, according to the individual meanings of these words.
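
The partitioning idea can be illustrated with a minimal sketch. It assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling as the topic model; the abstract does not say which topic model or inference method the project used. The function names, hyperparameters and English toy snippets are hypothetical and serve illustration only.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy KWIC contexts for the ambiguous word "bank",
# already tokenized, with the keyword itself and stop words removed.
SNIPPETS = [
    "money deposit account loan".split(),
    "river water shore fishing".split(),
    "loan interest account money".split(),
    "shore river boat water".split(),
    "deposit interest money credit".split(),
    "boat fishing river shore".split(),
]


def partition_kwic(snippets, num_senses, iterations=200,
                   alpha=0.1, beta=0.01, seed=0):
    """Return one topic label per snippet (assumed LDA, collapsed Gibbs)."""
    rng = random.Random(seed)
    vocab_size = len({w for s in snippets for w in s})
    doc_topic = [[0] * num_senses for _ in snippets]     # tokens per topic, per snippet
    topic_word = [Counter() for _ in range(num_senses)]  # word counts per topic
    topic_total = [0] * num_senses                       # tokens per topic overall
    assignments = []

    # Random initial topic for every token.
    for d, snippet in enumerate(snippets):
        z_doc = []
        for w in snippet:
            z = rng.randrange(num_senses)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, snippet in enumerate(snippets):
            for i, w in enumerate(snippet):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_senses)
                ]
                z = rng.choices(range(num_senses), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Label each snippet with its most frequent token topic.
    return [max(range(num_senses), key=lambda k: row[k]) for row in doc_topic]


def group_by_sense(labels):
    """Map each topic label to the indices of the snippets assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_sense(partition_kwic(SNIPPETS, num_senses=2)))
```

The topic labels are arbitrary integers, and whether a topic coincides with a word sense has to be checked by a linguist; the number of senses is a parameter that must be supplied in advance.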


corpus-based linguistic and lexicographic studies; data mining; disambiguation

