Gerlof Bouma
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
Yvonne Adesam
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18
Linköping Electronic Conference Proceedings 87:2, p. 11-26
NEALT Proceedings Series 18:2, p. 11-26
Published: 2013-05-17
ISBN: 978-91-7519-587-2
ISSN: 1650-3686 (print), 1650-3740 (online)
We present experiments on automatic segmentation of electronic Old Swedish editions into sentence-like units. Our target material is characterized by a great variation in the type of boundaries that are marked orthographically, the extent of boundary marking, and the means of boundary marking. We begin with an exploration of boundary marking in a large, unannotated corpus of Old Swedish texts. Then we show that we are able to improve upon a simple but effective segmenting baseline, using a conditional random field model trained on a manually annotated corpus. A more valuable lesson the modelling teaches us, however, is that we need to address the boundary marking variation explicitly.