Soheila Sahami
NLP Group, Leipzig University, Germany
Thomas Eckart
NLP Group, Leipzig University, Germany
Gerhard Heyer
NLP Group, Leipzig University, Germany
Published in: Selected papers from the CLARIN Annual Conference 2018, Pisa, 8-10 October 2018
Linköping Electronic Conference Proceedings 159:19, p. 188-195
Published: 2019-05-28
ISBN: 978-91-7685-034-3
ISSN: 1650-3686 (print), 1650-3740 (online)
Modern annotation tools and pipelines that support automatic text annotation and processing have become indispensable for many linguistic and NLP-driven applications. To simplify their active use and to relieve users of complex configuration tasks, platforms based on service-oriented architectures (SOA) – like CLARIN’s WebLicht – have emerged. However, in many cases the current state of participating endpoints does not allow the processing of “big data”-sized text material or the execution of many user tasks in parallel. A potential solution is the use of distributed computing frameworks as a backend for SOAs. These systems and their corresponding software architectures already support many of the features relevant for processing big data for large user groups. This submission describes such an implementation based on Apache Spark and outlines potential consequences for improved processing pipelines in federated research infrastructures.
Keywords: WebLicht, Apache Hadoop, Apache Spark, Service-oriented architectures, Processing pipelines, Big data