Creating register sub-corpora for the Finnish Internet Parsebank

Veronika Laippala
Turku Institute for Advanced Studies, University of Turku, Finland / School of Languages and Translation Studies, University of Turku, Finland / Turku NLP Group, University of Turku, Finland

Juhani Luotolahti
Turku NLP Group, University of Turku, Finland

Aki-Juhani Kyröläinen
School of Languages and Translation Studies, University of Turku, Finland / Turku NLP Group, University of Turku, Finland

Tapio Salakoski
Turku NLP Group, University of Turku, Finland

Filip Ginter
Turku NLP Group, University of Turku, Finland


Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:18, pp. 152-161

NEALT Proceedings Series 29:18, pp. 152-161


Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper develops register sub-corpora for the Web-crawled Finnish Internet Parsebank. Currently, all the documents belonging to different registers, such as news and user manuals, have an equal status in this corpus. Detecting the text register would be useful for both NLP and linguistics (Giesbrecht and Evert, 2009; Webber, 2009; Sinclair, 1996; Egbert et al., 2015). We assemble the sub-corpora by first naively deducing four register classes from the Parsebank document URLs and then developing a classifier based on these to detect registers for the remaining documents as well. The results show that the naive method of deducing the register is efficient and that the classification can be done sufficiently reliably. The analysis of the prediction errors, however, indicates that texts sharing similar communicative purposes but belonging to different registers, such as news and blogs informing the reader, share similar linguistic characteristics. This attests to the well-known difficulty of defining the notion of register for practical uses. Finally, as a significant improvement to its usability, we release two sets of sub-corpus collections for the Parsebank. The A collection consists of two million documents classified into blogs, forum discussions, encyclopedia articles and news with a naive classification precision of >90%, and the B collection of four million documents with a precision of >80%.




Noushin Rezapour Asheghi, Serge Sharoff, and Katja Markert. 2016. Crowdsourcing for web genre annotation. Language Resources and Evaluation, 50(3):603–641.

Douglas Biber and Jesse Egbert. 2015. Using grammatical features for automatic register identification in an unrestricted corpus of documents from the Open Web. Journal of Research Design and Statistics in Linguistics and Communication Science, 2(1).

Douglas Biber, S. Johansson, G. Leech, Susan Conrad, and E. Finegan. 1999. The Longman Grammar of Spoken and Written English. Longman, London.

Douglas Biber, Jesse Egbert, and Mark Davies. 2015. Exploring the composition of the searchable web: a corpus-based taxonomy of web registers. Corpora, 10(1):11–45.

Douglas Biber. 1989. Variation across speech and writing. Cambridge University Press, Cambridge.

Douglas Biber. 1995. Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge University Press, Cambridge.

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 89–97, Stroudsburg, PA, USA. Association for Computational Linguistics.

Pedro Carpena, Pedro Bernaola-Galván, Michael Hackenberg, Ana V. Coronado, and Jose L. Oliver. 2009. Level statistics of words: Finding keywords in literary texts and symbolic sequences. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 79(3):035102.

Kevin Crowston, Barbara Kwasnik, and Joseph Rubleske, 2011. Genres on the Web: Computational Models and Empirical Studies, chapter Problems in the Use-Centered Development of a Taxonomy of Web Genres, pages 69–84. Springer Netherlands, Dordrecht.

Jesse Egbert, Douglas Biber, and Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology, 66(9):1817–1831.

S. Meyer zu Eissen and Benno Stein. 2004. Genre classification of web pages: User study and feasibility analysis. In Proceedings of the 27th Annual German Conference on Artificial Intelligence, pages 256–259.

Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In Web as Corpus Workshop (WAC5), pages 27–36.

Stefan Th. Gries, John Newman, and Cyrus Shaoul. 2011. N-grams and the clustering of registers. Empirical Language Research.

Stefan Th. Gries, 2012. Methodological and analytic frontiers in lexical research, chapter Behavioral Profiles: a fine-grained and quantitative approach in corpus-based lexical semantics. John Benjamins, Amsterdam and Philadelphia.

Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182.

Jenna Kanerva, Matti Luotolahti, Veronika Laippala, and Filip Ginter. 2014. Syntactic n-gram collection from a large-scale corpus of Internet Finnish. In Proceedings of the Sixth International Conference Baltic HLT 2014, pages 184–191. IOS Press.

Adam Kilgarriff and Gregory Grefenstette. 2003. Introduction to the special issue on Web as Corpus. Computational Linguistics, 29(3).

Christoph Lindemann and Lars Littig, 2011. Genres on the Web: Computational Models and Empirical Studies, chapter Classification of Web Sites at Super-genre Level, pages 211–235. Springer Netherlands, Dordrecht.

Juhani Luotolahti, Jenna Kanerva, Veronika Laippala, Sampo Pyysalo, and Filip Ginter. 2015. Towards universal web parsebanks. In Proceedings of the International Conference on Dependency Linguistics (Depling’15), pages 211–220. Uppsala University.

C.R. Miller. 1984. Genre as social action. Quarterly Journal of Speech, 70(2):151–167.

Marina Santini and Serge Sharoff. 2009. Web genre benchmark under construction. JLCL, 24(1):129–145.

Roland Schäfer and Felix Bildhauer, 2016. Proceedings of the 10th Web as Corpus Workshop, chapter Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison, pages 1–6. Association for Computational Linguistics.

Mike Scott and Christopher Tribble. 2006. Textual Patterns: keyword and corpus analysis in language education. Benjamins, Amsterdam.

Serge Sharoff, Zhili Wu, and Katja Markert. 2010. The web library of Babel: evaluating genre collections.

John Sinclair. 1996. Preliminary recommendations on corpus typology.

John Swales. 1990. Genre analysis: English in academic and research settings. Cambridge University Press, Cambridge.

Vedrana Vidulin, Mitja Luštrek, and Matjaž Gams. 2007. Using genres to improve search engines. In Workshop “Towards genre-enabled Search Engines: The impact of NLP” at RANLP, pages 45–51.

Bonnie Webber. 2009. Genre distinctions for discourse in the Penn treebank. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, pages 674–682. Association for Computational Linguistics.
