Experiments on sentence segmentation in Old Swedish editions

Gerlof Bouma
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Yvonne Adesam
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18

Linköping Electronic Conference Proceedings 87:2, p. 11-26

NEALT Proceedings Series 18:2, p. 11-26

Published: 2013-05-17

ISBN: 978-91-7519-587-2

ISSN: 1650-3686 (print), 1650-3740 (online)


We present experiments on automatic segmentation of electronic Old Swedish editions into sentence-like units. Our target material is characterized by great variation in the type of boundaries that are marked orthographically, the extent of boundary marking, and the means of boundary marking. We begin with an exploration of boundary marking in a large, unannotated corpus of Old Swedish texts. Then we show that we are able to improve upon a simple but effective segmenting baseline, using a conditional random field model trained on a manually annotated corpus. A more valuable lesson the modelling teaches us, however, is that we need to address the boundary marking variation explicitly.


Sentence-like units; boundary detection; Old Swedish


