Jörg Knappen
Sprachwissenschaft und Sprachtechnologie, Universität des Saarlandes, Germany
Stefan Fischer
Sprachwissenschaft und Sprachtechnologie, Universität des Saarlandes, Germany
Hannah Kermes
Sprachwissenschaft und Sprachtechnologie, Universität des Saarlandes, Germany
Elke Teich
Sprachwissenschaft und Sprachtechnologie, Universität des Saarlandes, Germany
Peter Fankhauser
Institut für Deutsche Sprache (IDS), Germany
Published in: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Linköping Electronic Conference Proceedings 133:3, p. 7-11
NEALT Proceedings Series 32:3, p. 7-11
Published: 2017-05-10
ISBN: 978-91-7685-503-4
ISSN: 1650-3686 (print), 1650-3740 (online)
The Royal Society Corpus is a corpus of Early and Late Modern English built in an agile process. It covers publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) and comprises approximately 30 million words. In this paper we provide details on two aspects of the building process: the mining of patterns for OCR correction, and the improvement and evaluation of part-of-speech tagging.
Bea Alex, Claire Grover, Ewan Klein, and Richard Tobin. 2012. Digitised historical text: Does it have to be mediOCRe? In Proceedings of KONVENS 2012 (LThist 2012 workshop), pages 401–409, Vienna, Austria.
Alistair Baron and Paul Rayson. 2008. VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of the Postgraduate Conference in Corpus Linguistics, Birmingham, UK.
Alistair Cockburn. 2001. Agile Software Development. Addison-Wesley Professional, Boston, USA.
Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. The Royal Society Corpus: From uncharted data to corpus. In Proceedings of LREC 2016, Portorož, Slovenia, May 23-28.
Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of NAACL.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing.
Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop.
Ted Underwood and Loretta Auvil. 2012. Basic OCR correction. http://usesofscale.com/gritty-details/basic-ocr-correction/.
Holger Voormann and Ulrike Gut. 2008. Agile corpus building. Corpus Linguistics and Linguistic Theory, 4(2):235–251.