Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora

Simon Clematide
Institute of Computational Linguistics, University of Zurich, Switzerland


Part of: Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 111:2, pp. 6-16

NEALT Proceedings Series 25:2, pp. 6-16


Published: 2015-05-07

ISBN: 978-91-7519-035-8

ISSN: 1650-3686 (print), 1650-3740 (online)


Large and open multiparallel corpora are a valuable resource for contrastive corpus linguists if the data is annotated and stored in a way that allows precise and flexible ad hoc searches. A linguistic query language should also support computational linguists in automated multilingual data mining. We review a broad range of approaches for linguistic query and reporting languages according to usability criteria such as expressibility, expressiveness, and efficiency. We propose an architecture that tries to strike the right balance to suit practical purposes.


multiparallel corpora; treebank query languages; text corpus query languages


Vít Baisa, Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, and Pavel Rychlý. 2014. Bilingual word sketches: the translate button. In Proc EURALEX, pages 505–513.

Piotr Banski, J Bingel, N Diewald, E Frick, M Hanl, M Kupietz, P Pezik, C Schnober, and A Witt. 2013. KorAP: the new corpus analysis platform at IDS Mannheim. In Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference.

Steven Bird, Yi Chen, Susan B. Davidson, Haejoong Lee, and Yifeng Zheng. 2006. Designing and evaluating an XPath dialect for linguistic queries. In Proceedings of the 22nd International Conference on Data Engineering, pages 52–.

Gosse Bouma and Geert Kloosterman. 2007. Mining syntactically annotated corpora with XQuery. In Proceedings of the Linguistic Annotation Workshop, pages 17–24.

Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.

Jean Carletta, Stefan Evert, Ulrich Heid, and Jonathan Kilgour. 2005. The NITE XML toolkit: Data model and query language. Language Resources and Evaluation, 39(4):313–334.

Christian Chiarcos, Julia Ritz, and Manfred Stede. 2009. By all these lovely tokens... merging conflicting tokenizations. In Proceedings of the Third Linguistic Annotation Workshop, pages 35–43.

Christian Chiarcos, Kerstin Eckart, and Julia Ritz. 2010. Creating and exploiting a resource of parallel parses. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 166–171.

Christian Chiarcos. 2012. A generic formalism to represent linguistic corpora in RDF and OWL/DL. In Proc LREC 2012, pages 3205–3212.

Oliver Christ. 1994. A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX’94: 3rd Conference on Computational Lexicography and Text Research, Budapest, Hungary, pages 23–32.

Mark Davies. 2005. The advantage of using relational databases for large corpora: Speed, advanced queries and unlimited annotation. International Journal of Corpus Linguistics, 10(3):307–334.

Lukas Faulstich, Ulf Leser, and Thorsten Vitt. 2006. Implementing a linguistic query language for historic texts. In Current Trends in Database Technology, pages 601–612.

Elena Frick, Carsten Schnober, and Piotr Bański. 2012. Evaluating query languages for a corpus processing system. In Proc LREC 2012, pages 2286–2294.

Sumukh Ghodke and Steven Bird. 2012. Fangorn: A system for querying very large treebanks. In Proceedings of COLING 2012: Demonstration Papers, pages 175–182, December.

Torsten Grust, Maurice Van Keulen, and Jens Teubner. 2004. Accelerating XPath evaluation in any RDBMS. ACM Trans. Database Syst., 29(1):91–131, March.

Markus Gärtner, Gregor Thiele, Wolfgang Seeker, Anders Björkelund, and Jonas Kuhn. 2013. ICARUS – an extensible graphical search tool for dependency treebanks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Najeh Hajlaoui, David Kolovratnik, Jaakko Väyrynen, Ralf Steinberger, and Daniel Varga. 2014. DCEP - digital corpus of the European Parliament. In Proc of LREC 2014, pages 3164–3171.

Andrew Hardie. 2012. CQPweb — combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3):380–409.

Florian Holzschuher and René Peinl. 2013. Performance of graph query languages: Comparison of Cypher, Gremlin and native access in Neo4J. In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pages 195–204.

Milos Jakubicek, Adam Kilgarriff, Diana McCarthy, and Pavel Rychly. 2010. Fast syntactic searching in very large corpora for many languages. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, pages 741–747.

Daniel Janus and Adam Przepiórkowski. 2007. Poliqarp: An open source corpus indexer and search engine with syntactic extensions. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 85–88.

Stephan Kepser. 2003. Finite structure query: A tool for querying syntactically annotated corpora. In Proceedings of EACL, pages 179–186.

Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. 2004. The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress, pages 105–116.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography, 1(1):7–36.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit, volume 5, pages 79–86.

Milen Kouylekov and Stephan Oepen. 2014. Semantic technologies for querying linguistic annotations: An experiment focusing on graph-structured data. In Proc LREC 2014, pages 4331–4336.

Thomas Krause and Amir Zeldes. 2014. Annis3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities.

Esther König and Wolfgang Lezius. 2000. A description language for syntactically annotated corpora. In COLING 2000, July 31 - August 4, 2000, Universität des Saarlandes, Saarbrücken, Germany, pages 1056–1060.

Esther König and Wolfgang Lezius. 2003. The TIGER language. A description language for syntax graphs. Formal definition. Technical report, Institute for Natural Language Processing, University of Stuttgart.

Catherine Lai and Steven Bird. 2004. Querying and updating treebanks: A critical survey and requirements analysis. In Proceedings of the Australasian Language Technology Workshop 2004, pages 139–146.

Catherine Lai and S. G. Bird. 2005. LPath+: A first-order complete language for linguistic tree query. In Proc PACLIC’19, pages 1–12.

Catherine Lai and Steven Bird. 2010. Querying linguistic trees. J. of Logic, Lang. and Inf., 19(1):53–73, January.

Joakim Lundborg, Torsten Marek, Maël Mettler, and Martin Volk. 2007. Using the Stockholm TreeAligner. In Proceedings of the 6th Workshop on Treebanks and Linguistic Theories, pages 73–78.

Torsten Marek, Joakim Lundborg, and Martin Volk. 2008. Extending the TIGER query language with universal quantification. In KONVENS 2008: 9. Konferenz zur Verarbeitung natürlicher Sprache, pages 5–17.

Hendrik Maryns and Stephan Kepser. 2009. Monasearch - a tool for querying linguistic treebanks. In Treebanks and Linguistic Theories 2009, pages 29–40.

Jiří Mírovský. 2008. Netgraph - making searching in treebanks easy. In Proc. of IJCNLP’08, pages 945–950.

Lars Nygaard and J.B. Johannessen. 2004. Searchtree - a user-friendly treebank search interface. In Proc TLT 2004, Tübingen, December 10–11, 2004, pages 183–189.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational linguistics, 29(1):19–51.

Piotr Pezik. 2011. Providing corpus feedback for translators with the PELCRA search engine for NKJP. In Explorations across languages and corpora: PALC 2009, Łódź Studies in Linguistics, pages 135–144, Frankfurt am Main; New York. Peter Lang.

Piotr Pezik. 2013. Indexed graph databases for querying rich TEI annotation. http://digilab2.let.uniroma1.it/teiconf2013/wpcontent/uploads/2013/09/Pezik.pdf.

Alexandre Rafalovitch and Robert Dale. 2009. United Nations General Assembly resolutions: A six-language parallel corpus. In Proceedings of the MT Summit, pages 292–299.

Douglas L. T. Rohde. 2005. TGrep2 user manual. http://tedlab.mit.edu/dr/Tgrep2/tgrep2.pdf.

Viktor Rosenfeld. 2010. An Implementation of the Annis 2 Query Language. Student thesis, Humboldt-Universität zu Berlin.

Pavel Rychlý. 2007. Manatee/Bonito – a modular corpus manager. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2007, pages 65–70.

Roman Schneider. 2013. KoGra-DB: Using MapReduce for language corpora. In 43. Jahrestagung der Gesellschaft für Informatik (GI), pages 140–142.

Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. 2014. Billions of parallel words for free: Building and using the EU Bookshop Corpus. In Proc of LREC 2014, pages 1850–1855.

Ralf Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, and D. Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proc of LREC 2006.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2012. DGT-TM: A freely available Translation Memory in 22 languages. In Proc of LREC 2012, pages 454–459.

Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union’s highly multilingual parallel corpora. Language Resources and Evaluation, 48(4):679–707.

Jörg Tiedemann. 2011. Bitext Alignment, volume 4 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proc of LREC 2012, pages 2214–2218.

Amir Zeldes, Anke Lüdeling, Julia Ritz, and Christian Chiarcos. 2009. ANNIS: A search tool for multi-layer annotated corpora. In Corpus Linguistics 2009.

Jan Štěpánek and Petr Pajas. 2010. Querying diverse treebanks in a uniform way. In Proc LREC 2010, pages 1828–1835.
