Conference article

Data-driven Morphology and Sociolinguistics for Early Modern Dutch

Marijn Schraagen
Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands

Marjo van Koppen
Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands

Feike Dietz
Institute for Cultural Inquiry, Utrecht University, The Netherlands

Part of: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Linköping Electronic Conference Proceedings 133:9, pp. 47-53

NEALT Proceedings Series 32:9, pp. 47-53

Published: 2017-05-10

ISBN: 978-91-7685-503-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

The advent of Early Modern Dutch (starting around 1550) marked significant developments in language use in the Netherlands. Examples include the loss of the case marking system, the loss of negative particles, and the introduction of new vocabulary. These developments typically lead to considerable variation both within and between language users. Linguistic research aims to characterize and account for such variation patterns. Because digital resources and tools are sparse, research still depends on traditional, qualitative analysis. This paper describes an ongoing effort to increase the number of tools and resources, exploring two different routes: (i) modernization of historical language and (ii) adding linguistic and sociolinguistic annotations to historical language directly. The paper discusses and compares the experimental setup and preliminary results of these two routes, and provides an outlook on the envisioned linguistic and sociolinguistic research approach.

References

Marcel Bax and Nanne Streekstra. 2003. Civil rites: ritual politeness in early modern Dutch letter-writing. Journal of Historical Pragmatics, 4(2):303–325.

Marcel Bollmann, Florian Petran, and Stefanie Dipper. 2011. Rule-based normalization of historical texts. In Proceedings of Language Technologies for Digital Humanities and Cultural Heritage Workshop, pages 34–42. ACL.

Loes Braun. 2002. Information retrieval from Dutch historical corpora. Master’s thesis, Maastricht University.

Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang, and Antal van den Bosch. 2016. Nederlab: Towards a single portal and research environment for diachronic Dutch text corpora. In Proceedings of LREC 2016.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL, pages 249–256. ACL.

Stefan Evert and Andrew Hardie. 2011. Twenty-first century corpus workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics 2011 conference. University of Birmingham.

Frank van Eynde, Jakub Zavrel, and Walter Daelemans. 2000. Part of speech tagging and lemmatisation for the Spoken Dutch Corpus. In Proceedings of LREC 2000.

Maarten van Gompel and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal, 3:63–81.

Hans van Halteren and Margit Rem. 2013. Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4):1233–1259.

Pontus de Heuiter. 1581. Nederduitse orthographie. Edited by G.R.W. Dibbets, 1972, Wolters-Noordhoff.

Jack Hoeksema. 1997. Negation and negative concord in Middle Dutch. Amsterdam Studies in the Theory and History of Linguistic Science, 4:139–156.

Robert Howell. 2006. Immigration and koineisation: the formation of Early Modern Dutch urban vernaculars. Transactions of the Philological Society, 104(2):207–227.

Dieuwke Hupkes and Rens Bod. 2016. POS-tagging of historical Dutch. In Proceedings of LREC 2016.

Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.

Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. 2006. A cross-language approach to historic document retrieval. In ECIR 2006: Proceedings of the 28th European Conference on IR Research, pages 407–419. Springer.

Joos Lambrecht. 1550. Nederlandsche spellijnghe, uutghesteld by vraghe ende antwoorde. Edited by J.F.J. Heremans and F. Vanderhaeghen, 1882, C. Annoot-Braeckman.

Marco van Leeuwen, Ineke Maas, and Andrew Miles. 2002. HISCO: Historical International Standard Classification of Occupations. Cornell University Press.

Judith Nobels and Gijsbert Rutten. 2014. Language norms and language use in seventeenth-century Dutch: negation and the genitive. In Gijsbert Rutten, editor, Norms and usage in language history, 1600-1900. A sociolinguistic and comparative perspective, pages 21–48. John Benjamins Publishing Company.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.

Eva Pettersson, Beáta Megyesi, and Joakim Nivre. 2012. Rule-based normalisation of historical text - a diachronic study. In Proceedings of the First International Workshop on Language Technology for Historical Text, KONVENS, pages 333–341.

Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the workshop on computational historical linguistics at NODALIDA, pages 54–69. Linköping.

Alexander Robertson and Peter Willett. 1992. Searching for historical word-forms in a database of 17th-century English text using spelling-correction methods. In Proceedings of ACM SIGIR ’92, pages 256–265. ACM.

Yves Scherrer and Tomaž Erjavec. 2013. Modernizing historical Slovene words with character-based SMT. In BSNLP 2013 - 4th Biennial Workshop on Balto-Slavic Natural Language Processing.

Yves Scherrer and Tomaž Erjavec. 2016. Modernising historical Slovene words. Natural Language Engineering, 22(6):881–905.

Hendrik Spiegel. 1584. Twe-spraack. Ruygh-bewerp. Kort begrip. Rederijck-kunst. Edited by W.J.H. Caron, 1962, Wolters-Noordhoff.

Simon Stevin. 1586. Uytspraeck van de weerdicheyt der Duytsche tael. Chr. Plantijn.

Erik Tjong Kim Sang. 2016. Improving part-of-speech tagging of historical text by first translating to modern text. In Proceedings of the International Workshop on Computational History and Data-Driven Humanities, pages 54–64. Springer.

Fred Weerman and Petra de Wit. 1999. The decline of the genitive in Dutch. Linguistics, 37(6):1155–1192.
