Simple and Accountable Segmentation of Marked-up Text

Jonathon Read
School of Computing, Teesside University, UK

Rebeca Dridan
Department of Informatics, University of Oslo, Norway

Stephan Oepen
Department of Informatics, University of Oslo, Norway


In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:33, pp. 365-373

NEALT Proceedings Series 16:33, pp. 365-373


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


Segmenting documents into discrete, sentence-like units is usually a first step in any natural language processing pipeline. However, current segmentation tools perform poorly on text that contains markup. While stripping markup is a simple solution, we argue for the utility of the extra-linguistic information encoded by markup and present a scheme for normalising markup across disparate formats. We further argue for the need to maintain accountability when preprocessing text, such that a record of modifications to source documents is maintained. Such records are necessary in order to augment documents with information derived from subsequent processing. To facilitate adoption of these principles we present a novel tool for segmenting text that contains inline markup. By converting to plain text and tracking alignment, the tool is capable of state-of-the-art sentence boundary detection using any external segmenter, while producing segments containing normalised markup, with an account of how to recreate the original form.
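The convert-and-align idea from the abstract can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors' tool: markup is stripped while each retained character's original offset is recorded, a simple regex-based segmenter (a stand-in for any external segmenter) runs on the plain text, and the resulting sentence spans are projected back onto the marked-up source. Note that the projected spans may cut across tags, which is exactly the alignment problem the paper's normalisation scheme addresses.

```python
import re

def strip_markup(text):
    """Remove inline tags, recording each retained character's
    offset in the original string so results can be mapped back."""
    plain_chars, offsets = [], []
    pos = 0
    for match in re.finditer(r"<[^>]+>", text):
        for i in range(pos, match.start()):
            plain_chars.append(text[i])
            offsets.append(i)
        pos = match.end()
    for i in range(pos, len(text)):
        plain_chars.append(text[i])
        offsets.append(i)
    return "".join(plain_chars), offsets

def segment(plain):
    """Stand-in for an external segmenter: split after ., ! or ?
    followed by whitespace; returns (start, end) character spans."""
    spans, start = [], 0
    for m in re.finditer(r"[.!?](?:\s+|$)", plain):
        spans.append((start, m.end()))
        start = m.end()
    if start < len(plain):
        spans.append((start, len(plain)))
    return spans

def segment_markup(text):
    """Segment marked-up text; return sentence substrings cut at
    offsets in the ORIGINAL (marked-up) string."""
    plain, offsets = strip_markup(text)
    return [text[offsets[s]:offsets[e - 1] + 1]
            for s, e in segment(plain)]

sentences = segment_markup("<p>Hello there. <b>Bye</b> now.</p>")
# Each sentence is a span of the original input; the second one
# straddles a closing tag, illustrating the alignment problem.
```

A real pipeline would substitute a trained segmenter (e.g. a Punkt model) for `segment` and keep the offset table as the accountability record for reconstructing the source document.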


Accountability; Markup; Normalisation; Sentence Boundary Detection; Traceability


