Conference article

Docria: Processing and Storing Linguistic Data with Wikipedia

Marcus Klang
Department of Computer Science, Lund University, Lund, Sweden

Pierre Nugues
Department of Computer Science, Lund University, Lund, Sweden

Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland

Linköping Electronic Conference Proceedings 167:48, p. 401--405

NEALT Proceedings Series 42:48, p. 401--405

Published: 2019-10-02

ISBN: 978-91-7929-995-8

ISSN: 1650-3686 (print), 1650-3740 (online)


The availability of user-generated content has increased significantly over time. Wikipedia is one example of such a corpus: it spans a huge range of topics and is freely available. Storing and processing such corpora requires flexible document models, as they may contain malicious or incorrect data. Docria is a library that attempts to address this issue with a model based on typed property hypergraphs. Docria can be used with small to large corpora, from laptops running Python interactively in a Jupyter notebook to clusters running MapReduce frameworks with optimized compiled code. Docria is available as open-source code at

