Using Constraint Grammar for Treebank Retokenization

Eckhard Bick
University of Southern Denmark, Denmark


Part of: Proceedings of the NoDaLiDa 2017 Workshop on Constraint Grammar - Methods, Tools and Applications, 22 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 140:2, pp. 6-9

NEALT Proceedings Series 33:2, pp. 6-9


Published: 2017-07-06

ISBN: 978-91-7685-465-5

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents a Constraint Grammar-based method for changing the tokenization of existing annotated data, establishing standard space-based ("atomic") tokenization for corpora that otherwise use MWE fusion and contraction splitting for the sake of syntactic transparency or for semantic reasons. Our method preserves incoming and outgoing dependency arcs and allows the addition of internal tags and structure for MWEs. We discuss rule examples and evaluate the method against both a Portuguese treebank and live news text annotation.
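
To make the idea of arc-preserving retokenization concrete, here is a minimal Python sketch. It is not the paper's Constraint Grammar rule set; it only illustrates, under stated assumptions, what splitting a fused MWE token into space-based tokens has to do to a dependency tree. The `Token` class, the `split_mwe` function, the underscore separator, and the choice to attach all non-initial parts to the first part are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int    # 1-based position in the sentence
    form: str
    head: int  # id of the head token, 0 for the root


def split_mwe(tokens, mwe_id, sep="_"):
    """Split the fused token `mwe_id` into space-based tokens.

    Assumption: the first part inherits the MWE's head and all of its
    dependents; the remaining parts attach to the first part.
    """
    target = next(t for t in tokens if t.id == mwe_id)
    parts = target.form.split(sep)
    extra = len(parts) - 1  # number of tokens added to the sentence

    def shift(i):
        # ids after the MWE move right; the root (0) and earlier ids stay
        return i + extra if i > mwe_id else i

    out = []
    for t in tokens:
        if t.id == mwe_id:
            # the first part keeps the original id and the outgoing arc
            out.append(Token(mwe_id, parts[0], shift(t.head)))
            # the other parts get new ids and an MWE-internal arc
            for k, form in enumerate(parts[1:], start=1):
                out.append(Token(mwe_id + k, form, mwe_id))
        else:
            out.append(Token(shift(t.id), t.form, shift(t.head)))
    return out


# Illustrative sentence with one fused name token (heads are made up).
sent = [
    Token(1, "Visitou", 0),
    Token(2, "Rio_de_Janeiro", 1),
    Token(3, "ontem", 1),
]
for tok in split_mwe(sent, 2):
    print(tok.id, tok.form, tok.head)
```

Tokens that pointed at the fused token keep pointing at its first part, and every id to the right of the split is renumbered, so no arc into or out of the MWE is lost. A real system would also choose the internal head and internal tags per MWE type rather than always using the first part.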




