Docria: Processing and Storing Linguistic Data with Wikipedia

Klang, Marcus; Nugues, Pierre

Conference article

Marcus Klang
Department of Computer Science, Lund University, Lund, Sweden

Pierre Nugues
Department of Computer Science, Lund University, Lund, Sweden

Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, 2019, Turku, Finland

Linköping Electronic Conference Proceedings 167:48, p. 401--405

NEALT Proceedings Series 42:48, p. 401--405

Published: 2019-10-02

ISBN: 978-91-7929-995-8

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

The availability of user-generated content has increased significantly over time. Wikipedia is one example of such a corpus: it spans a huge range of topics and is freely available. Storing and processing such corpora requires flexible document models, as they may contain malicious or incorrect data. Docria is a library that attempts to address this issue with a model based on typed property hypergraphs. Docria can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running MapReduce frameworks with optimized compiled code. Docria is available as open-source code at https://github.com/marcusklang/docria.
