Improving POS Tagging in Old Spanish Using TEITOK

Maarten Janssen

Josep Ausensi
Universitat Pompeu Fabra, Department of Translation and Language Sciences, Spain

Josep M. Fontana
Universitat Pompeu Fabra, Department of Translation and Language Sciences, Spain


Published in: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Linköping Electronic Conference Proceedings 133:2, pp. 2-6

NEALT Proceedings Series 32:2, pp. 2-6


Published: 2017-05-10

ISBN: 978-91-7685-503-4

ISSN: 1650-3686 (print), 1650-3740 (online)


In this paper, we describe how the TEITOK corpus tools helped to create a diachronic corpus for Old Spanish that contains both paleographic and linguistic information, that is easy for non-specialists to use, and in which automatically assigned POS tags and lemmas are easy to improve manually.




