Building an open-source development infrastructure for language technology projects

Sjur N. Moshagen
University of Tromsø, Norway

Tommi A. Pirinen
University of Helsinki, Finland

Trond Trosterud
University of Tromsø, Norway


In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:31, pp. 343-352

NEALT Proceedings Series 16:31, pp. 343-352


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


The article presents the Giellatekno & Divvun language technology resources, more specifically the effort to utilise open-source tools to improve the build infrastructure, and the solutions that help adapt to best practices for software development. The article especially discusses how the infrastructure has been remade to cope with an increasing number of languages without incurring extra overhead for the maintainers, while at the same time letting the linguists concentrate on the linguistic work. Finally, the article discusses how a uniform infrastructure like the one presented can be used to easily compare languages in terms of morphological or computational complexity and coverage, or to support cross-lingual applications.


NoDaLiDa 2013; Infrastructure; Computational linguistics; Finite-state transducers; Language resources; Multilinguality


