Gerlof Bouma
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
Yvonne Adesam
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18
Linköping Electronic Conference Proceedings 87:2, p. 11-26
NEALT Proceedings Series 18:2, p. 11-26
Published: 2013-05-17
ISBN: 978-91-7519-587-2
ISSN: 1650-3686 (print), 1650-3740 (online)
We present experiments on automatic segmentation of electronic Old Swedish editions into sentence-like units. Our target material is characterized by great variation in the type of boundaries that are marked orthographically, the extent of boundary marking, and the means of boundary marking. We begin with an exploration of boundary marking in a large, unannotated corpus of Old Swedish texts. Then we show that we are able to improve upon a simple but effective segmenting baseline, using a conditional random field model trained on a manually annotated corpus. A more valuable lesson the modelling teaches us, however, is that we need to address the boundary marking variation explicitly.