Conference article

The Danish Gigaword Corpus

Leon Strømberg-Derczynski

Manuel Ciosici

Rebekah Baglini

Morten H. Christiansen

Jacob Aarup Dalsgaard

Riccardo Fusaroli

Peter Juel Henrichsen

Rasmus Hvingelby

Andreas Kirkedal

Alex Speed Kjeldsen

Claus Ladefoged

Finn Årup Nielsen

Jens Madsen

Malte Lau Petersen

Jonathan Hvithamar Rystrøm

Daniel Varab


Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), May 31-June 2, 2021.

Linköping Electronic Conference Proceedings 178:46, p. 413-421

NEALT Proceedings Series 45:46, p. 413-421


Published: 2021-05-21

ISBN: 978-91-7929-614-8

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.

Keywords

corpus, unstructured text, Danish
