Literary Exploration Machine: A Web-Based Application for Textual Scholars

Maciej Maryl
Institute of Literary Research of the Polish Academy of Sciences, Warsaw, Poland

Maciej Piasecki
Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Poland

Tomasz Walkowiak
Faculty of Electronics, Wroclaw University of Science and Technology, Poland


Published in: Selected papers from the CLARIN Annual Conference 2017, Budapest, 18–20 September 2017

Linköping Electronic Conference Proceedings 147:11, pp. 128–144


Published: 2018-05-16

ISBN: 978-91-7685-273-6

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents the design of a web-based application for textual scholars. The goal of the project is to create a comprehensive and stable research environment in which scholars can upload the texts they analyse and either explore them with a suite of dedicated tools or transform them into a different format (e.g. text, table, list, spreadsheet). The latter functionality is especially important for research on Polish texts: because Polish has rich morphology and weakly constrained word order, transforming the texts allows them to be processed further with tools built for English. The project uses existing CLARIN-PL applications and supplements them with new functionalities.


digital literary studies, natural language processing, text mining, content analysis, web-based system


