Leon Strømberg-Derczynski
Manuel Ciosici
Rebekah Baglini
Morten H. Christiansen
Jacob Aarup Dalsgaard
Riccardo Fusaroli
Peter Juel Henrichsen
Rasmus Hvingelby
Andreas Kirkedal
Alex Speed Kjeldsen
Claus Ladefoged
Finn Årup Nielsen
Jens Madsen
Malte Lau Petersen
Jonathan Hvithamar Rystrøm
Daniel Varab
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), May 31-June 2, 2021.
Linköping Electronic Conference Proceedings 178:46, p. 413-421
NEALT Proceedings Series 45:46, p. 413-421
Published: 2021-05-21
ISBN: 978-91-7929-614-8
ISSN: 1650-3686 (print), 1650-3740 (online)
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.