Marie-Catherine de Marneffe
Linguistics Department, The Ohio State University, Columbus, OH, USA
Matias Grioni
Computer Science Department, The Ohio State University, Columbus, OH, USA
Jenna Kanerva
Turku NLP group, University of Turku, Finland
Filip Ginter
Turku NLP group, University of Turku, Finland
In: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Università di Pisa, Italy
Linköping Electronic Conference Proceedings 139:14, pp. 108-115
A fundamental issue in annotation efforts is ensuring that the same phenomena within and across corpora are annotated consistently. To date, there has been no clear way to ensure annotation consistency of dependency corpora. Here, we revisit the method of Boyd et al. (2008) to flag inconsistencies in dependency corpora, and evaluate it on three languages with varying degrees of morphology (English, French, and Finnish UD v2). We show that the method is very efficient in finding errors in the annotations. We also build an annotation tool, which we will make available, that helps to streamline the manual annotation required by the method.
Bharat Ram Ambati, Rahul Agarwal, Mridul Gupta, Samar Husain, and Dipti Misra Sharma. 2011. Error detection for treebank validation. In Proceedings of the 9th Workshop on Asian Language Resources, pages 23–30.
Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora Hladká. 2003. The Prague dependency treebank. In Treebanks, pages 103–127. Springer.
Guillaume Bonfante, Bruno Guillaume, Mathieu Morey, and Guy Perrier. 2011. Modular graph rewriting to compute semantics. In Proceedings of the Ninth International Conference on Computational Semantics, pages 65–74.
Adriane Boyd, Markus Dickinson, and W. Detmar Meurers. 2008. On detecting errors in dependency treebanks. Research on Language & Computation, 6(2):113–137.
Koenraad De Smedt, Victoria Rosén, and Paul Meurer. 2016. Studying consistency in UD treebanks with INESS-Search. In Fourteenth Workshop on Treebanks and Linguistic Theories (TLT14).
Markus Dickinson and W. Detmar Meurers. 2003a. Detecting errors in part-of-speech annotation. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics, pages 107–114.
Markus Dickinson and W. Detmar Meurers. 2003b. Detecting inconsistencies in treebanks. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT-03).
Markus Dickinson and W. Detmar Meurers. 2005. Detecting errors in discontinuous structural annotation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 322–329.
Eleazar Eskin. 2000. Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 148–153.
Martin Forst, Núria Bertomeu, Berthold Crysmann, Frederik Fouvry, Silvia Hansen-Schirra, and Valia Kordoni. 2004. Towards a dependency-based gold standard for German parsers – the TiGer Dependency Bank. In Proceedings of the COLING Workshop on Linguistically Interpreted Corpora (LINC04), Geneva, Switzerland.
Markus Gärtner, Gregor Thiele, Wolfgang Seeker, Anders Björkelund, and Jonas Kuhn. 2013. ICARUS – An extensible graphical search tool for dependency treebanks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.
Juhani Luotolahti, Jenna Kanerva, Sampo Pyysalo, and Filip Ginter. 2015. SETS: Scalable and efficient tree search in dependency graphs. In NAACL Demo, pages 51–55.
Joakim Nivre, Jens Nilsson, and Johan Hall. 2006. Talbanken05: A Swedish treebank with phrase structure and dependency annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06).
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666.
Jan Štěpánek and Petr Pajas. 2010. Querying diverse treebanks in a uniform way. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010).
Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Paris, France, May. European Language Resources Association (ELRA).
Gregor Thiele, Wolfgang Seeker, Markus Gärtner, Anders Björkelund, and Jonas Kuhn. 2014. A graphical interface for automatic error mining in corpora. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 57–60. Association for Computational Linguistics.
Hans van Halteren. 2000. The detection of inconsistency in manually tagged text. In Proceedings of the 2nd Workshop on Linguistically Interpreted Corpora.