Towards Large-Scale Language Analysis in the Cloud

Emanuele Lapponi
Language Technology Group, Department of Informatics, University of Oslo, Norway

Erik Velldal
Language Technology Group, Department of Informatics, University of Oslo, Norway

Nikolay A. Vazov
Research Support Services Group, University Center for Information Technology, University of Oslo, Norway

Stephan Oepen
Language Technology Group, Department of Informatics, University of Oslo, Norway


Part of: Proceedings of the workshop on Nordic language research infrastructure at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 20

Linköping Electronic Conference Proceedings 89:1, pp. 1-10

NEALT Proceedings Series 20:1, pp. 1-10


Published: 2013-05-17

ISBN: 978-91-7519-585-8

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper documents ongoing work within the Norwegian CLARINO project on building a Language Analysis Portal (LAP). The portal will provide an intuitive and easily accessible web interface to a centralized repository of a wide range of language technology tools, all installed on a high-performance computing cluster. Users will be able to compose and run workflows through an easy-to-use graphical interface, chaining multiple tools and resources together in potentially complex pipelines. Although the project aims to reach a diverse set of user groups, it is intended in particular to facilitate the use of language analysis in the social sciences, humanities, and other fields without strong computational traditions. While development of the portal is still in its early stages, this paper describes the work towards an already operable pilot and gives an overview of long-term goals and visions. At the core of the current pilot implementation is Galaxy, a web-based workflow management system initially developed for data-intensive research in genomics and bioinformatics. An important part of the work on the pilot is therefore to adapt and evaluate Galaxy for the context of a language analysis portal.


Research infrastructure; High-Performance Computing; web portal; CLARINO


