Marcus Klang
Department of Computer Science, Lund University, Lund, Sweden
Pierre Nugues
Department of Computer Science, Lund University, Lund, Sweden
Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland
Linköping Electronic Conference Proceedings 167:48, p. 401--405
NEALT Proceedings Series 42:48, p. 401--405
Published: 2019-10-02
ISBN: 978-91-7929-995-8
ISSN: 1650-3686 (print), 1650-3740 (online)
The availability of user-generated content has increased significantly over time. Wikipedia is one example: a freely available corpus that spans a huge range of topics. Storing and processing such corpora requires flexible document models, as they may contain malicious or incorrect data. Docria is a library that attempts to address this issue with a document model based on typed property hypergraphs. Docria can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running MapReduce frameworks with optimized compiled code. Docria is available as open-source code at https://github.com/marcusklang/docria.
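To make the idea of a typed property hypergraph more concrete, the following is a minimal sketch in plain Python. It does not use Docria and does not reflect Docria's actual API: the `Layer` and `Node` classes, the field names, and the example text are all hypothetical, chosen only for illustration. Each layer declares a schema of typed fields, each node carries properties checked against that schema, and a node whose property refers to several other nodes acts as a hyperedge.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Node:
    """A node with a layer name and a dictionary of properties (hypothetical)."""
    layer: str
    props: Dict[str, Any]


class Layer:
    """A typed collection of nodes (hypothetical, not Docria's API).

    The schema maps each field name to the Python type its value must have.
    """

    def __init__(self, name: str, schema: Dict[str, type]):
        self.name = name
        self.schema = schema
        self.nodes: List[Node] = []

    def add(self, **props: Any) -> Node:
        # Reject fields that the schema does not declare.
        unknown = set(props) - set(self.schema)
        if unknown:
            raise KeyError(f"unknown fields for layer {self.name!r}: {sorted(unknown)}")
        # Require every declared field, with a value of the declared type.
        for key, expected in self.schema.items():
            if key not in props:
                raise KeyError(f"missing field {key!r} in layer {self.name!r}")
            if not isinstance(props[key], expected):
                raise TypeError(f"field {key!r} must be {expected.__name__}")
        node = Node(self.name, props)
        self.nodes.append(node)
        return node


text = "Lund University is in Sweden"

# Token nodes store character offsets into the text.
tokens = Layer("token", {"start": int, "end": int})
t0 = tokens.add(start=0, end=4)    # "Lund"
t1 = tokens.add(start=5, end=15)   # "University"

# An entity node references several token nodes at once,
# which is what makes it a hyperedge rather than a plain edge.
entities = Layer("entity", {"tokens": list, "label": str})
org = entities.add(tokens=[t0, t1], label="ORG")

# Recover the surface form by following the references back to the text.
first, last = org.props["tokens"][0], org.props["tokens"][-1]
surface = text[first.props["start"]:last.props["end"]]
print(surface)
```

Checking types when a node is added is one simple way to catch incorrect data early; a call such as `tokens.add(start="0", end=4)` raises a `TypeError` in this sketch instead of silently storing a malformed offset.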