The Effect of Author Set Size in Authorship Attribution for Lithuanian

Jurgita Kapočiūtė-Dzikienė
Vytautas Magnus University, Kaunas, Lithuania

Ligita Šarkutė
Kaunas University of Technology, Kaunas, Lithuania

Andrius Utka
Vytautas Magnus University, Kaunas, Lithuania


In: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:13, pp. 87-96

NEALT Proceedings Series 23:13, pp. 87-96


Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper reports the first authorship attribution results for the Lithuanian language that examine the effect of author set size using automatic computational methods. The aim is to determine how quickly attribution results deteriorate as the number of candidate authors gradually increases: starting from 3 and going up to 5, 10, 20, 50, and 100. Using supervised machine learning techniques, we also investigated the effect of dataset balancing and the influence of different feature types (lexical, character, morphological, etc.) and language types (normative parliamentary speeches and non-normative forum posts). The experiments revealed that the effectiveness of the method and feature type depends more on the language type than on the number of candidate authors. Content features based on word lemmas are the most useful type for normative texts, because Lithuanian is a highly inflective language with rich morphology and vocabulary. Character features are the most accurate type for forum posts, where texts are too complicated to be processed effectively with external morphological tools.
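The experimental design described in the abstract (fix a feature type, then measure accuracy while the candidate author set grows) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the nearest-profile cosine classifier is a simple stand-in and not the supervised learner used in the paper, and the names `char_ngrams`, `train_profiles`, `attribute`, `accuracy_by_set_size`, and the toy strings in `TRAIN` and `HELD_OUT` are hypothetical and do not come from the paper's data.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(samples, n=3):
    """Merge the n-gram counts of each author's texts into one profile."""
    profiles = {}
    for author, text in samples:
        profiles.setdefault(author, Counter()).update(char_ngrams(text, n))
    return profiles


def attribute(text, profiles, n=3):
    """Return the candidate author whose profile is most similar to the text."""
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(grams, profiles[author]))


def accuracy_by_set_size(train, held_out, authors, sizes, n=3):
    """Accuracy when only the first k authors are candidates, for each k."""
    results = {}
    for k in sizes:
        chosen = set(authors[:k])
        profiles = train_profiles([(a, t) for a, t in train if a in chosen], n)
        cases = [(a, t) for a, t in held_out if a in chosen]
        hits = sum(attribute(t, profiles, n) == a for a, t in cases)
        results[k] = hits / len(cases) if cases else 0.0
    return results


# Invented toy data (hypothetical, not from the paper's corpora).
TRAIN = [("A", "aaaa bbbb aaaa"), ("B", "cccc dddd cccc"), ("C", "eeee ffff eeee")]
HELD_OUT = [("A", "aaaa aaaa"), ("B", "dddd cccc"), ("C", "ffff")]

if __name__ == "__main__":
    print(accuracy_by_set_size(TRAIN, HELD_OUT, ["A", "B", "C"], [2, 3]))
```

With real data, the `sizes` argument would follow the paper's steps (3, 5, 10, 20, 50, 100), and swapping `char_ngrams` for a lemma-based feature extractor would allow the feature-type comparison the abstract describes.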




