Using Apache Spark on Hadoop Clusters as Backend for WebLicht Processing Pipelines

Soheila Sahami
NLP Group, Leipzig University, Germany

Thomas Eckart
NLP Group, Leipzig University, Germany

Gerhard Heyer
NLP Group, Leipzig University, Germany


Published in: Selected papers from the CLARIN Annual Conference 2018, Pisa, 8-10 October 2018

Linköping Electronic Conference Proceedings 159:19, pp. 188-195


Published: 2019-05-28

ISBN: 978-91-7685-034-3

ISSN: 1650-3686 (print), 1650-3740 (online)


Modern annotation tools and pipelines that support automatic text annotation and processing have become indispensable for many linguistic and NLP-driven applications. To simplify their use and to relieve users of complex configuration tasks, platforms based on service-oriented architectures (SOA), such as CLARIN’s WebLicht, have emerged. However, in many cases the current state of participating endpoints does not allow the processing of “big data”-sized text material or the parallel execution of many user tasks. A potential solution is to use distributed computing frameworks as a backend for SOAs. These systems and their software architecture already support many of the features needed to process big data for large user groups. This submission describes such an implementation based on Apache Spark and outlines potential consequences for improved processing pipelines in federated research infrastructures.
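The general idea of putting a data-parallel backend behind a service endpoint can be illustrated with a minimal sketch. It is not the implementation described in the paper: it uses Python's standard `concurrent.futures` as a stand-in for a distributed framework such as Apache Spark, and the names `tokenize`, `partition`, `annotate_partition`, and `process_request` are hypothetical. The input is split into partitions, each partition is annotated independently, and the results are merged in the original order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Annotation = List[str]


def tokenize(sentence: str) -> Annotation:
    """Toy annotator (hypothetical): whitespace tokenization."""
    return sentence.split()


def partition(items: List[str], n_parts: int) -> List[List[str]]:
    """Split the input into at most n_parts contiguous chunks."""
    n_parts = max(1, min(n_parts, len(items)))
    size, rest = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `rest` chunks receive one extra item each.
        end = start + size + (1 if i < rest else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def annotate_partition(
    annotator: Callable[[str], Annotation], chunk: List[str]
) -> List[Annotation]:
    """Annotate every sentence of one partition."""
    return [annotator(sentence) for sentence in chunk]


def process_request(
    sentences: List[str],
    annotator: Callable[[str], Annotation] = tokenize,
    n_parts: int = 4,
) -> List[Annotation]:
    """Service entry point (hypothetical): fan out, annotate, merge in order."""
    chunks = partition(sentences, n_parts)
    # Threads stand in for cluster workers here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map yields results in the order of the submitted chunks.
        results = pool.map(lambda c: annotate_partition(annotator, c), chunks)
        return [annotation for part in results for annotation in part]
```

In a real deployment the worker pool would be replaced by the cluster framework, which would also take care of scheduling, data locality, and fault tolerance; the endpoint-facing function would keep the same shape.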


WebLicht, Apache Hadoop, Apache Spark, Service-oriented architectures, Processing pipelines, Big data


