LBK2013: A balanced, annotated national corpus for Norwegian Bokmål

Rune Lain Knudsen
Institute of Linguistic and Nordic Studies, University of Oslo

Ruth Vatvedt Fjeld
Institute of Linguistic and Nordic Studies, University of Oslo

Published in: Proceedings of the workshop on lexical semantic resources for NLP at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 19

Linköping Electronic Conference Proceedings 88:3, pp. 12-20

NEALT Proceedings Series 19:3, pp. 12-20

Published: 2013-05-17

ISBN: 978-91-7519-586-5

ISSN: 1650-3686 (print), 1650-3740 (online)


At the Department of Linguistics and Scandinavian Studies (ILN) at the University of Oslo, the task of assembling a balanced corpus representing modern Norwegian Bokmål has reached a significant milestone. The Corpus for Bokmål Lexicography (LBK) now consists of more than 100,000,000 words. Its documents have been selected based on a statistical analysis of reading habits in the general population of Norway. Each document has been subject to both manual bibliographic annotation and automatic morphological annotation. LBK will play a central part in a set of interconnected lexical resources, the aim of which is to provide extensive documentation of Norwegian Bokmål that covers lexical and other linguistic/lexico-syntactic aspects. This paper presents LBK2013, a subset of LBK that we consider to be an accurate and comprehensive representation of modern written Norwegian Bokmål. The paper describes the corpus as well as a number of related projects.


NoDaLiDa 2013; Speech and Language Technologies; Northern Europe; Corpora; Lexicography; Lexical Semantics

