LBK2013: A balanced, annotated national corpus for Norwegian Bokmål

Rune Lain Knudsen
Institute of Linguistic and Nordic Studies, University of Oslo

Ruth Vatvedt Fjeld
Institute of Linguistic and Nordic Studies, University of Oslo

Published in: Proceedings of the workshop on lexical semantic resources for NLP at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 19

Linköping Electronic Conference Proceedings 88:3, pp. 12-20

NEALT Proceedings Series 19:3, pp. 12-20

Published: 2013-05-17

ISBN: 978-91-7519-586-5

ISSN: 1650-3686 (print), 1650-3740 (online)


At the Department of Linguistics and Scandinavian Studies (ILN) at the University of Oslo, the task of assembling a balanced corpus representing modern Norwegian Bokmål has reached a significant milestone. The Corpus for Bokmål Lexicography (LBK) now consists of more than 100,000,000 words. Its documents have been selected based on a statistical analysis of reading habits in the general population of Norway. Each document has received both manual bibliographic annotation and automatic morphological annotation. LBK will play a central part in a set of interconnected lexical resources, the aim of which is to provide extensive documentation of Norwegian Bokmål covering lexical and other linguistic/lexico-syntactic aspects. This paper presents LBK2013, a subset of LBK that we consider to be an accurate and comprehensive representation of modern written Norwegian Bokmål. It describes the corpus as well as a number of related projects.


NoDaLiDa 2013; Speech and Language Technologies; Northern Europe; Corpora; Lexicography; Lexical Semantics


