Experiments on sentence segmentation in Old Swedish editions

Gerlof Bouma
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Yvonne Adesam
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18

Linköping Electronic Conference Proceedings 87:2, pp. 11-26

NEALT Proceedings Series 18:2, pp. 11-26

Published: 2013-05-17

ISBN: 978-91-7519-587-2

ISSN: 1650-3686 (print), 1650-3740 (online)


We present experiments on automatic segmentation of electronic Old Swedish editions into sentence-like units. Our target material is characterized by great variation in the type of boundaries that are marked orthographically, the extent of boundary marking, and the means of boundary marking. We begin with an exploration of boundary marking in a large, unannotated corpus of Old Swedish texts. Then we show that we are able to improve upon a simple but effective segmenting baseline, using a conditional random field model trained on a manually annotated corpus. A more valuable lesson the modelling teaches us, however, is that we need to address the boundary marking variation explicitly.


Keywords: Sentence-like units; boundary detection; Old Swedish


