Simple and Accountable Segmentation of Marked-up Text

Jonathon Read
School of Computing, Teesside University, UK

Rebeca Dridan
Department of Informatics, University of Oslo, Norway

Stephan Oepen
Department of Informatics, University of Oslo, Norway

Part of: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:33, pp. 365-373

NEALT Proceedings Series 16:33, pp. 365-373

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


Segmenting documents into discrete, sentence-like units is usually a first step in any natural language processing pipeline. However, current segmentation tools perform poorly on text that contains markup. While stripping markup is a simple solution, we argue for the utility of the extra-linguistic information encoded by markup and present a scheme for normalising markup across disparate formats. We further argue for the need to maintain accountability when preprocessing text, such that a record of modifications to source documents is maintained. Such records are necessary in order to augment documents with information derived from subsequent processing. To facilitate adoption of these principles we present a novel tool for segmenting text that contains inline markup. By converting to plain text and tracking alignment, the tool is capable of state-of-the-art sentence boundary detection using any external segmenter, while producing segments containing normalised markup, with an account of how to recreate the original form.
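
The idea of converting to plain text while tracking alignment can be illustrated with a minimal sketch. This is not the authors' tool: the function names (`strip_markup`, `naive_segmenter`, `segment`), the regex-based tag matching, and the punctuation-based segmenter are all assumptions made for illustration, and the markup normalisation step described in the abstract is omitted. The sketch strips inline tags, records for each plain-text character its offset in the source, runs a pluggable segmenter over the plain text, and maps the resulting spans back to the marked-up source.

```python
import re

# Assumption: markup is simple inline tags of the form <...>.
TAG = re.compile(r"<[^>]+>")


def strip_markup(source):
    """Return (plain, offsets); offsets[i] is the source index of plain[i]."""
    plain, offsets = [], []
    pos = 0
    for match in TAG.finditer(source):
        # Copy the characters between the previous tag and this one.
        for i in range(pos, match.start()):
            plain.append(source[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(source)):
        plain.append(source[i])
        offsets.append(i)
    return "".join(plain), offsets


def naive_segmenter(text):
    """Hypothetical stand-in for an external segmenter.

    Returns (start, end) character spans over the plain text, splitting
    after '.', '!' or '?' when followed by whitespace.
    """
    spans, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        spans.append((start, match.start() + 1))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def segment(source, segmenter=naive_segmenter):
    """Segment marked-up text; return (src_start, src_end, src_text) triples."""
    plain, offsets = strip_markup(source)
    segments = []
    for start, end in segmenter(plain):
        if start == end:
            continue
        src_start = offsets[start]
        src_end = offsets[end - 1] + 1
        segments.append((src_start, src_end, source[src_start:src_end]))
    return segments


example = "<p>Hello <b>world</b>. It works!</p>"
for src_start, src_end, text in segment(example):
    print(src_start, src_end, text)
```

Because each segment carries its source offsets, later annotations can be attached to the original document rather than to the stripped copy. Tags that fall between the first and last character of a segment are retained in the segment text, while tags at the edges (here the enclosing paragraph tags) fall outside it; a fuller treatment would need to decide how to assign and normalise those.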


Accountability; Markup; Normalisation; Sentence Boundary Detection; Traceability


