Conference article

Statistical syntactic parsing for Latvian

Lauma Pretkalnina
Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia

Laura Rituma
Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:25, p. 279-289

NEALT Proceedings Series 16:25, p. 279-289

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Syntactic parsing is an important technique in natural language processing, yet Latvian still lacks an efficient general-coverage syntactic parser. This paper reports on the first experiments on statistical syntactic parsing for Latvian, a highly inflective Indo-European language with a relatively free word order. We have induced a statistical parser from a small, non-balanced Latvian Treebank using the MaltParser toolkit and measured the unlabeled attachment score (UAS). As MaltParser is based on the dependency grammar approach, we have also developed a converter from the hybrid dependency-based annotation model used in the Latvian Treebank to the pure dependency annotation model. We have obtained a promising 74.63% UAS in 10-fold cross-validation using only ~2500 sentences. The results revealed that the best results can be achieved using the non-projective stack parsing algorithm with the lazy arc-adding strategy, but comparably good results can be achieved using projective parsing algorithms combined with appropriate projectivization preprocessing.
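As an illustration of the metric named in the abstract, the following minimal sketch computes an unlabeled attachment score as the percentage of tokens whose predicted head index matches the gold head index. It assumes token-level (micro-averaged) scoring over all tokens, including punctuation; the paper's exact evaluation settings are not stated here, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def unlabeled_attachment_score(
    gold_heads: Sequence[Sequence[int]],
    predicted_heads: Sequence[Sequence[int]],
) -> float:
    """Return the percentage of tokens whose predicted head equals the gold head.

    Each sentence is a sequence of head indices, one per token (0 = root).
    Assumption: every token counts equally, punctuation included.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted corpora differ in sentence count")

    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(1 for g, p in zip(gold, pred) if g == p)
        total += len(gold)

    if total == 0:
        raise ValueError("cannot score an empty corpus")
    return 100.0 * correct / total


# Hypothetical toy data: two sentences, five tokens, one wrong head.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
print(f"UAS = {unlabeled_attachment_score(gold, pred):.2f}%")  # prints "UAS = 80.00%"
```

In a 10-fold cross-validation setting, a score of this kind would typically be computed on each held-out fold and then averaged, though the aggregation used in the paper is not specified on this page.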

Keywords

Latvian; treebank; dependency parsing; statistical parsing; MaltParser

