Eckhard Bick
University of Southern Denmark, Denmark
Published in: Proceedings of the NoDaLiDa 2017 Workshop on Constraint Grammar - Methods, Tools and Applications, 22 May 2017, Gothenburg, Sweden
Linköping Electronic Conference Proceedings 140:2, p. 6-9
NEALT Proceedings Series 33:2, p. 6-9
Published: 2017-07-06
ISBN: 978-91-7685-465-5
ISSN: 1650-3686 (print), 1650-3740 (online)
This paper presents a Constraint Grammar-based method for changing the tokenization of existing annotated data. It establishes standard space-based ("atomic") tokenization for corpora that otherwise use multi-word expression (MWE) fusion and contraction splitting for the sake of syntactic transparency or for semantic reasons. The method preserves incoming and outgoing dependency arcs and allows the addition of internal tags and structure for MWEs. We discuss rule examples and evaluate the method against both a Portuguese treebank and live news text annotation.
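To make the arc-preservation idea concrete, the following is a minimal Python sketch, not the paper's Constraint Grammar rules. It splits one fused MWE token into space-based tokens while keeping the arc into the MWE and the arcs out of it. Several things here are assumptions made only for illustration: the `Token` class, the `split_mwe` function, the `=` separator for fused forms, the choice of the first part as internal head, the placeholder label `mwe` for internal arcs, and the dependency labels in the example. Re-fusing split contractions, the other half of the task, is not covered.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int      # 1-based position in the sentence
    form: str
    head: int    # id of the head token, 0 for the root
    deprel: str


def split_mwe(tokens, mwe_id, sep="=", internal_label="mwe"):
    """Split the fused token with id `mwe_id` into one token per part.

    Assumptions (illustrative only): parts are joined by `sep`, the first
    part acts as internal head, and the other parts attach to it with
    `internal_label`.
    """
    parts = tokens[mwe_id - 1].form.split(sep)
    shift = len(parts) - 1  # number of extra tokens created by the split

    def remap(old_id):
        # Ids after the MWE move right; the MWE's own id now denotes its
        # first part, so arcs pointing at the MWE stay attached to it.
        return old_id + shift if old_id > mwe_id else old_id

    result = []
    for tok in tokens:
        if tok.id != mwe_id:
            # Outgoing arcs of the MWE are kept via remap(tok.head).
            result.append(Token(remap(tok.id), tok.form, remap(tok.head), tok.deprel))
            continue
        # The first part inherits the incoming arc of the fused token.
        result.append(Token(mwe_id, parts[0], remap(tok.head), tok.deprel))
        # Remaining parts receive internal structure under the first part.
        for offset, part in enumerate(parts[1:], start=1):
            result.append(Token(mwe_id + offset, part, mwe_id, internal_label))
    return result


# Hypothetical three-token input with one fused MWE.
sentence = [
    Token(1, "Rio=de=Janeiro", 3, "subj"),
    Token(2, "antigo", 1, "mod"),
    Token(3, "encanta", 0, "root"),
]
atomic = split_mwe(sentence, mwe_id=1)
for tok in atomic:
    print(tok.id, tok.form, tok.head, tok.deprel)
```

Choosing the first part as internal head keeps the example short; a real conversion would presumably pick the head and the internal labels per MWE type.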