Unity in Diversity: A Unified Parsing Strategy for Major Indian Languages

Juhi Tandon
Kohli Center on Intelligent Systems (KCIS), International Institute of Information Technology, Hyderabad (IIIT-H), Gachibowli, Hyderabad, India

Dipti Misra Sharma
Kohli Center on Intelligent Systems (KCIS), International Institute of Information Technology, Hyderabad (IIIT-H), Gachibowli, Hyderabad, India


Part of: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Università di Pisa, Italy

Linköping Electronic Conference Proceedings 139:29, pp. 255-265


Published: 2017-09-13

ISBN: 978-91-7685-467-9

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents our work on applying a non-linear neural network to parsing five resource-poor Indian languages belonging to two major language families, Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While little work has previously been done on linear transition-based parsing of Bengali and Telugu, we present one of the first parsers for Marathi, Kannada and Malayalam. All of these languages have free word order and range from moderately to very rich in morphology. We therefore propose using linguistically motivated morphological features (suffix and postposition) in the non-linear framework to capture the intricacies of both language families. We also capture chunk information and gender, number and person information elegantly in this model, and we put forward ways to represent these features cost-effectively.
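As a rough illustration of the idea in the abstract, the sketch below shows one way morphological features could be fed to a non-linear layer: each feature value is mapped to a small dense vector, the vectors are concatenated, and the result passes through a tanh layer. This is not the authors' implementation. The feature names, the dimensions, the tanh activation and the placeholder token values are all assumptions made for the example.

```python
import math
import random

random.seed(0)

EMBED_DIM = 4   # assumed embedding size per feature
HIDDEN_DIM = 3  # assumed hidden layer size

# Hypothetical feature inventory, loosely following the abstract.
FEATURES = ["word", "pos", "suffix", "postposition", "chunk", "gnp"]


class EmbeddingTable:
    """Maps feature values to dense vectors, creating them on first use."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}

    def lookup(self, value):
        if value not in self.vectors:
            self.vectors[value] = [
                random.uniform(-0.1, 0.1) for _ in range(self.dim)
            ]
        return self.vectors[value]


# One embedding table per feature type.
tables = {name: EmbeddingTable(EMBED_DIM) for name in FEATURES}


def token_vector(token):
    """Concatenate the embeddings of every feature of one token."""
    vec = []
    for name in FEATURES:
        vec.extend(tables[name].lookup(token.get(name, "<none>")))
    return vec


def hidden_layer(inputs, weights, bias):
    """Apply a single non-linear (tanh) layer to the input vector."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


# Placeholder token: the values are dummy symbols, not real annotations.
token = {
    "word": "WORD_A",
    "pos": "NN",
    "suffix": "SUF_X",
    "postposition": "PSP_Y",
    "chunk": "NP",
    "gnp": "m|sg|3",
}

vec = token_vector(token)
weights = [
    [random.uniform(-0.1, 0.1) for _ in range(len(vec))]
    for _ in range(HIDDEN_DIM)
]
bias = [0.0] * HIDDEN_DIM
hidden = hidden_layer(vec, weights, bias)
print(len(vec), len(hidden))
```

Giving each feature type its own small table keeps the added parameters proportional to the number of distinct suffixes, postpositions and agreement values. That is one plausible reading of representing the features cost-effectively, though the abstract does not spell out the scheme.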




