Preprocessing is a normal first step in parsing, but it is the step that most researchers consider trivial and not worth reporting. The problem is exacerbated by the fact that parsing research often focuses on parsing a treebank rather than parsing a text, since the treebank obscures many of the preprocessing steps that have gone into the curation of the text. In this paper, we argue that preprocessing has a non-negligible effect on parsing, and that we need to document our preprocessing steps carefully in order to ensure replicability. We focus on Arabic because it is more difficult to parse than English in two respects: 1) the orthography has intricacies, such as vocalization, that need to be handled, and 2) the basic units in the treebank do not necessarily correspond to words but are sometimes morphemes. The latter necessitates the use of a segmenter to convert the text to a form that the parser has seen in training. We investigate a scenario in which we combine a morphological analyzer/segmenter, MADAMIRA, with a parser trained on the Arabic Treebank. We mainly examine the differences in orthographic and segmentation decisions between the analyzer and the treebank. We show that normalizing the two representations is not a simple process and that results can be artificially low or misleading if these differences are not handled carefully. This paper is thus an attempt to establish best practices for parsing Arabic and, more generally, for documenting preprocessing more carefully.