Conference article

The Danish Gigaword Corpus

Leon Strømberg-Derczynski

Manuel Ciosici

Rebekah Baglini

Morten H. Christiansen

Jacob Aarup Dalsgaard

Riccardo Fusaroli

Peter Juel Henrichsen

Rasmus Hvingelby

Andreas Kirkedal

Alex Speed Kjeldsen

Claus Ladefoged

Finn Årup Nielsen

Jens Madsen

Malte Lau Petersen

Jonathan Hvithamar Rystrøm

Daniel Varab

Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), May 31-June 2, 2021.

Linköping Electronic Conference Proceedings 178:46, pp. 413-421

Published: 2021-05-21

ISBN: 978-91-7929-614-8

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.

Keywords

corpus, unstructured text, Danish
