Using Constraint Grammar for Treebank Retokenization

Eckhard Bick
University of Southern Denmark, Denmark

In: Proceedings of the NoDaLiDa 2017 Workshop on Constraint Grammar - Methods, Tools and Applications, 22 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 140:2, pp. 6-9

NEALT Proceedings Series 33:2, pp. 6-9

Published: 2017-07-06

ISBN: 978-91-7685-465-5

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents a Constraint Grammar-based method for changing the tokenization of existing annotated data. It establishes standard space-based ("atomic") tokenization for corpora that otherwise use multi-word expression (MWE) fusion and contraction splitting for the sake of syntactic transparency or for semantic reasons. Our method preserves incoming and outgoing dependency arcs and allows the addition of internal tags and structure for MWEs. We discuss rule examples and evaluate the method against both a Portuguese treebank and live news text annotation.
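To make the arc-preservation requirement concrete, the following is a minimal Python sketch of splitting one fused MWE token into space-based tokens. It is not the Constraint Grammar rule set described in the paper. The `Token` class, the `split_mwe` function, the underscore separator, the `mwe` relation label, and the choice to treat the first part as the internal head are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    id: int      # 1-based position in the sentence
    form: str    # surface form; fused MWEs are assumed to use "_" as separator
    head: int    # id of the head token, 0 for the root
    deprel: str  # dependency label


def split_mwe(tokens: List[Token], mwe_id: int, sep: str = "_",
              internal_deprel: str = "mwe") -> List[Token]:
    """Split the fused token `mwe_id` into atomic tokens.

    Assumption: the first part becomes the internal head. It keeps the
    original id, so it inherits the incoming arc and all outgoing arcs.
    The remaining parts attach to it with the hypothetical label
    `internal_deprel`.
    """
    target = next(t for t in tokens if t.id == mwe_id)
    parts = target.form.split(sep)
    shift = len(parts) - 1  # number of newly created tokens

    def new_id(old: int) -> int:
        # Tokens after the MWE move right; the root (0) and earlier ids stay.
        return old + shift if old > mwe_id else old

    result: List[Token] = []
    for t in tokens:
        if t.id != mwe_id:
            result.append(Token(new_id(t.id), t.form, new_id(t.head), t.deprel))
            continue
        # First part: same id, original incoming arc (head remapped if needed).
        result.append(Token(mwe_id, parts[0], new_id(t.head), t.deprel))
        # Remaining parts: new ids, attached to the first part.
        for offset, part in enumerate(parts[1:], start=1):
            result.append(Token(mwe_id + offset, part, mwe_id, internal_deprel))
    return result


# Illustrative input with made-up labels: "Visitei o Rio_de_Janeiro ontem"
sentence = [
    Token(1, "Visitei", 0, "root"),
    Token(2, "o", 3, "det"),
    Token(3, "Rio_de_Janeiro", 1, "obj"),
    Token(4, "ontem", 1, "advl"),
]
atomic = split_mwe(sentence, mwe_id=3)
```

In this sketch the determiner "o" still points at token 3 (now "Rio"), "Rio" keeps its arc to the verb, and "ontem" is renumbered from 4 to 6 while keeping its head. The reverse operation, re-fusing contractions that a corpus has split, would need the mirror-image renumbering and is not shown here.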


