Conference article

The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP

Stian Rødven Eide
Språkbanken, Dept. of Swedish, University of Gothenburg, Sweden

Nina Tahmasebi
Språkbanken, Dept. of Swedish, University of Gothenburg, Sweden

Lars Borin
Språkbanken, Dept. of Swedish, University of Gothenburg, Sweden

Published in: Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, Proceedings of the Workshop, July 11, 2016, Krakow, Poland

Linköping Electronic Conference Proceedings 126:2, pp. 8–12

Published: 2016-07-08

ISBN: 978-91-7685-733-5

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

In this paper we present a dataset of contemporary Swedish containing one billion words. The dataset is drawn from a wide range of sources, all annotated using a state-of-the-art corpus annotation pipeline, and is intended to be static and clearly versioned. A fixed, versioned release will facilitate reproducibility of experiments across institutions and make it easier to compare NLP algorithms on contemporary Swedish. The dataset contains sentences from 1950 to 2015 and has been carefully designed to feature a good mix of genres balanced over each included decade. The sources include literary, journalistic, academic and legal texts, as well as blogs and web forum entries.

