Conference article

Using Apache Spark on Hadoop Clusters as Backend for WebLicht Processing Pipelines

Soheila Sahami
NLP Group, Leipzig University, Germany

Thomas Eckart
NLP Group, Leipzig University, Germany

Gerhard Heyer
NLP Group, Leipzig University, Germany


Published in: Selected papers from the CLARIN Annual Conference 2018, Pisa, 8-10 October 2018

Linköping Electronic Conference Proceedings 159:19, p. 188-195


Published: 2019-05-28

ISBN: 978-91-7685-034-3

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Modern annotation tools and pipelines that support automatic text annotation and processing have become indispensable for many linguistic and NLP-driven applications. To simplify their active use and to relieve users from complex configuration tasks, platforms based on service-oriented architectures (SOA) – like CLARIN’s WebLicht – have emerged. However, in many cases the current state of participating endpoints does not allow the processing of “big data”-sized text material or the execution of many user tasks in parallel. A potential solution is the use of distributed computing frameworks as a backend for SOAs. These systems and their corresponding software architecture already support many of the features relevant for processing big data for large user groups. This submission describes such an implementation based on Apache Spark and outlines potential consequences for improved processing pipelines in federated research infrastructures.
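
To give a rough impression of the approach described in the abstract, the following Scala sketch shows a minimal Spark job that could run on a Hadoop cluster as one stage of a WebLicht-style annotation pipeline. It is not taken from the paper: the application name, the HDFS paths, and the whitespace tokenization step are all hypothetical stand-ins for the actual services and data formats (e.g. TCF) handled by the authors' implementation.

import org.apache.spark.sql.SparkSession

object TokenizePipelineSketch {
  def main(args: Array[String]): Unit = {
    // Minimal sketch: one annotation step (naive tokenization) executed
    // as a distributed Spark job instead of a single SOA endpoint.
    val spark = SparkSession.builder()
      .appName("weblicht-backend-sketch") // hypothetical application name
      .getOrCreate()

    // Hypothetical HDFS staging area; in a real setup the service layer
    // would place the text material submitted via WebLicht here.
    val lines = spark.sparkContext.textFile("hdfs:///user/weblicht/input/*.txt")

    // Whitespace tokenization as a stand-in for a real tokenizer service,
    // followed by a simple token frequency count.
    val tokenCounts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(token => (token, 1L))
      .reduceByKey(_ + _)

    // Write results back to HDFS for the next pipeline stage to consume.
    tokenCounts.saveAsTextFile("hdfs:///user/weblicht/output/token_counts")

    spark.stop()
  }
}

Because Spark distributes both the input data and the computation across the Hadoop cluster, many such jobs can be processed in parallel for large user groups, which is the property the abstract identifies as missing from current single-endpoint setups.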

Keywords

WebLicht, Apache Hadoop, Apache Spark, Service-oriented architectures, Processing pipelines, Big data
