The Benefit of Syntactic vs. Linear n-grams for Linguistic Description

Melanie Andresen
Universität Hamburg, Institute for German Language and Literature, Germany

Heike Zinsmeister
Universität Hamburg, Institute for German Language and Literature, Germany


Published in: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Università di Pisa, Italy

Linköping Electronic Conference Proceedings 139:3, pp. 4-14


Published: 2017-09-13

ISBN: 978-91-7685-467-9

ISSN: 1650-3686 (print), 1650-3740 (online)


Automatic dependency annotations have been used in a wide range of language applications. However, they have been exploited far less for the linguistic description of language varieties. This paper presents an attempt to employ dependency annotations for describing style. We argue that for this purpose, linear n-grams (which follow the text's surface) alone do not appropriately represent a language like German, and we support this claim with both theoretical and empirical arguments. We suggest syntactic n-grams (which follow the dependency paths) as a possible solution. To demonstrate their potential, we compare the German academic languages of linguistics and literary studies using both linear and syntactic n-grams. The results show that the approach using syntactic n-grams allows for the detection of linguistically meaningful patterns that do not emerge in a linear n-gram analysis, e.g., complex verbs and light verb constructions.
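The distinction between the two n-gram types can be made concrete with a minimal sketch. The example sentence, its hand-made dependency heads (with the preposition attached to the verb, one of several possible annotation schemes), and the function names `linear_ngrams` and `syntactic_ngrams` are illustrative assumptions and are not taken from the paper. The sketch treats a syntactic n-gram as a head-to-dependent path of n tokens in the dependency tree, which is one common reading of the term.

```python
from typing import Dict, List, Tuple

# Toy German sentence with a separable verb ("fängt ... an").
# HEADS holds 1-based head indices, 0 marks the root.
# The parse is hand-made for illustration only (assumption).
TOKENS = ["Er", "fängt", "heute", "mit", "der", "Arbeit", "an"]
HEADS = [2, 0, 2, 2, 6, 4, 2]


def linear_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """N-grams that follow the surface order of the text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def syntactic_ngrams(tokens: List[str], heads: List[int], n: int) -> List[Tuple[str, ...]]:
    """N-grams that follow head-to-dependent paths in the dependency tree."""
    # Map each head index to the list of its dependents.
    children: Dict[int, List[int]] = {i: [] for i in range(len(tokens) + 1)}
    for dep, head in enumerate(heads, start=1):
        children[head].append(dep)

    def paths_from(node: int, length: int) -> List[List[int]]:
        # All downward paths of exactly `length` nodes starting at `node`.
        if length == 1:
            return [[node]]
        return [
            [node] + rest
            for child in children[node]
            for rest in paths_from(child, length - 1)
        ]

    result = []
    for start in range(1, len(tokens) + 1):
        for path in paths_from(start, n):
            result.append(tuple(tokens[i - 1] for i in path))
    return result


if __name__ == "__main__":
    print(linear_ngrams(TOKENS, 2))
    print(syntactic_ngrams(TOKENS, HEADS, 2))
```

In this toy sentence the verb and its particle are separated by four tokens, so the pair ("fängt", "an") never appears among the linear bigrams, while it does appear as a syntactic bigram because the particle depends directly on the verb.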



