Conference article

Edit transducers for spelling variation in Old Spanish

Jordi Porta
Departamento de Tecnología y Sistemas, Centro de Estudios de la Real Academia Española, Madrid, Spain

José-Luis Sancho
Departamento de Tecnología y Sistemas, Centro de Estudios de la Real Academia Española, Madrid, Spain

Javier Gómez
Departamento de Tecnología y Sistemas, Centro de Estudios de la Real Academia Española, Madrid, Spain


Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18

Linköping Electronic Conference Proceedings 87:6, pp. 70-79

NEALT Proceedings Series 18:6, pp. 70-79


Published: 2013-05-17

ISBN: 978-91-7519-587-2

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

A system for the analysis of Old Spanish word forms using weighted finite-state transducers is presented. The system uses previously existing resources such as a modern lexicon, a phonological transcriber, and a set of rules implementing the evolution of Spanish from the Middle Ages. The results obtained in all datasets show significant improvements, both in accuracy and in the trade-off between precision and recall, with respect to the baseline and the Levenshtein edit distance. A qualitative error analysis suggests several potential ways to improve the performance of the system.
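
To make the comparison with the Levenshtein edit distance concrete, the following is a minimal sketch of an edit distance in which selected character substitutions can be made cheaper than the default unit cost. It is not the system described in the paper, which uses weighted finite-state transducers together with a lexicon, a phonological transcriber, and historical rules. The function name `weighted_edit_distance`, the table `HYPOTHETICAL_COSTS`, and the cost values are assumptions chosen only for illustration.

```python
def weighted_edit_distance(source, target, sub_cost=None, indel_cost=1.0):
    """Edit distance by dynamic programming.

    With sub_cost=None this is the plain Levenshtein distance (unit costs).
    sub_cost maps (old_char, new_char) pairs to a cheaper substitution cost.
    """
    sub_cost = sub_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string (or back) costs one indel per char.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost

    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            # Matching characters are free; listed pairs use their own cost.
            cost = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,   # delete a
                dist[i][j - 1] + indel_cost,   # insert b
                dist[i - 1][j - 1] + cost,     # substitute a -> b (or match)
            )
    return dist[-1][-1]


# Hypothetical costs for a few old-to-modern grapheme correspondences.
# These values are illustrative assumptions, not taken from the paper.
HYPOTHETICAL_COSTS = {("x", "j"): 0.1, ("f", "h"): 0.1, ("z", "c"): 0.1}

if __name__ == "__main__":
    for old, modern in [("dixo", "dijo"), ("dixo", "dino"), ("fazer", "hacer")]:
        plain = weighted_edit_distance(old, modern)
        weighted = weighted_edit_distance(old, modern, HYPOTHETICAL_COSTS)
        print(old, modern, plain, weighted)
```

Under unit costs, the old spelling "dixo" is one edit away from both "dijo" and "dino", so plain Levenshtein distance cannot rank them. With the illustrative costs above, the x-to-j substitution is cheaper than an arbitrary one, so "dijo" becomes the closer candidate.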

Keywords

Old Spanish; Finite-State Transducers; Spelling Variation; Historical Linguistics
