Many a Little Makes a Mickle - Infrastructure Component Reuse for a Massively Multilingual Linguistic Study

Lars Borin
University of Gothenburg, Sweden

Shafqat Mumtaz Virk
University of Gothenburg, Sweden

Anju Saxena
Uppsala University, Sweden


Published in: Selected papers from the CLARIN Annual Conference 2017, Budapest, 18–20 September 2017

Linköping Electronic Conference Proceedings 147:8, pp. 86–101


Published: 2018-05-16

ISBN: 978-91-7685-273-6

ISSN: 1650-3686 (print), 1650-3740 (online)


We present ongoing work aiming at turning the linguistic material available in Grierson’s classical Linguistic Survey of India (LSI) into a digital language resource, a database suitable for a broad array of linguistic investigations of the languages of South Asia and studies relating to language typology and contact linguistics. The project has two concrete main aims: (1) to conduct a linguistic investigation of the claim that South Asia constitutes a linguistic area; (2) to develop state-of-the-art language technology for automatically extracting the relevant information from the text of the LSI. In this presentation we focus on how, in the first part of the project, a number of existing research infrastructure components provided by Swe-Clarin, the Swedish CLARIN consortium, have been ‘recycled’ in order to allow the linguists involved in the project to quickly orient themselves in the vast LSI material, and to be able to provide input to the language technologists designing the tools for information extraction from the descriptive grammars.


corpus infrastructure, lexicon infrastructure, Swe-Clarin, large-scale comparative linguistics, linguistic database, language typology, areal linguistics, genetic linguistics, South Asian languages


[Björkelund et al.2009] Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of CoNLL 2009: Shared Task, pages 43–48, Boulder, Colorado. ACL.

[Borin and Forsberg2011] Lars Borin and Markus Forsberg. 2011. A diachronic computational lexical resource for 800 years of Swedish. In Caroline Sporleder, Antal van den Bosch, and Kalliopi Zervanou, editors, Language Technology for Cultural Heritage, pages 41–61. Springer, Berlin.

[Borin et al.2010] Lars Borin, Dana Dannélls, Markus Forsberg, Dimitrios Kokkinakis, and Maria Toporowska Gronostaj. 2010. The past meets the present in Swedish FrameNet++. In 14th EURALEX International Congress, pages 269–281, Leeuwarden. EURALEX.

[Borin et al.2012a] Lars Borin, Markus Forsberg, Leif-Jöran Olsson, and Jonatan Uppström. 2012a. The open lexical infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 3598–3602, Istanbul. ELRA.

[Borin et al.2012b] Lars Borin, Markus Forsberg, and Johan Roxendal. 2012b. Korp – the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 474–478, Istanbul. ELRA.

[Borin et al.2013a] Lars Borin, Markus Forsberg, and Lennart Lönngren. 2013a. SALDO: A touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4):1191–1211.

[Borin et al.2013b] Lars Borin, Markus Forsberg, Leif-Jöran Olsson, Olof Olsson, and Jonatan Uppström. 2013b. The lexical editing system of Karp. In Proceedings of the eLex 2013 Conference, pages 503–516, Tallinn.

[Borin et al.2014] Lars Borin, Anju Saxena, Taraka Rama, and Bernard Comrie. 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. In Proceedings of LREC 2014, pages 3137–3144, Reykjavik. ELRA.

[Chuang et al.2012] Jason Chuang, Daniel Ramage, Christopher D. Manning, and Jeffrey Heer. 2012. Interpretation and trust: Designing model-driven visualizations for text analysis. In ACM Human Factors in Computing Systems (CHI).

[Dryer and Haspelmath2013] Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

[Ebert2006] Karen Ebert. 2006. South Asia as a linguistic area. In Keith Brown, editor, Encyclopedia of Languages and Linguistics. Elsevier, Oxford, 2nd edition.

[Evert and Hardie2011] Stefan Evert and Andrew Hardie. 2011. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. University of Birmingham.

[Fader et al.2011] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of EMNLP 2011, pages 1535–1545, Edinburgh. ACL.

[Georg2017] Stefan Georg. 2017. Other isolated languages of Asia. In Lyle Campbell, editor, Language Isolates, pages 139–161. Routledge, London.

[Grierson1903–1927] George A. Grierson. 1903–1927. A Linguistic Survey of India, volume I–XI. Government of India, Central Publication Branch, Calcutta.

[Hammarstedt et al.2017a] Martin Hammarstedt, Lars Borin, Markus Forsberg, Johan Roxendal, Anne Schumacher, and Maria Öhrman. 2017a. Korp 6 – Användarmanual [Korp 6 – User manual]. Research reports from the Department of Swedish GU-ISS 2017-02, University of Gothenburg, Gothenburg. http://hdl.handle.net/2077/53096.

[Hammarstedt et al.2017b] Martin Hammarstedt, Johan Roxendal, Maria Öhrman, Lars Borin, Markus Forsberg, and Anne Schumacher. 2017b. Korp 6 – Technical report. Research reports from the Department of Swedish GU-ISS 2017-01, University of Gothenburg, Gothenburg. http://hdl.handle.net/2077/53095.

[Havre et al.2000] Susan Havre, Beth Hetzler, and Lucy Nowell. 2000. ThemeRiver: Visualizing theme changes over time. In IEEE Symposium on Information Visualization, 2000. InfoVis 2000, pages 115–123, Salt Lake City. IEEE.

[Hook1977] Peter E. Hook. 1977. The distribution of the compound verb in the languages of North India and the question of its origin. International Journal of Dravidian Linguistics, 6:336–351.

[Janda et al.to appear 2018] Laura A. Janda, Olga Lyashevskaya, Tore Nesset, Ekaterina Rakhilina, and Francis M. Tyers. to appear 2018. A constructicon for Russian: Filling in the gaps. In Benjamin Lyngfelt, Lars Borin, Tiago Timponi Torrent, and Kyoko Hirose Ohara, editors, Constructicons in Contrast. Constructicography as a Fusion Between Construction Grammar and Lexicography. John Benjamins, Amsterdam.

[Krstajic et al.2012] Miloš Krstajic, Mohammad Najm-Araghi, Florian Mansmann, and Daniel A. Keim. 2012. Incremental visual text analytics of news story development. In Proceedings of VDA 2012, Burlingame, California. SPIE.

[Lyngfelt et al.2012] Benjamin Lyngfelt, Lars Borin, Markus Forsberg, Julia Prentice, Rudolf Rydstedt, Emma Sköldberg, and Sofia Tingsell. 2012. Adding a constructicon to the Swedish resource network of Språkbanken. In Proceedings of KONVENS 2012 (LexSem 2012 Workshop), pages 452–461, Vienna. ÖGAI.

[Manning et al.2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL 2014, pages 55–60.

[Masica1976] Colin P. Masica. 1976. Defining a Linguistic Area: South Asia. Chicago University Press, Chicago.

[Nichols2003] Johanna Nichols. 2003. Diversity and stability in language. In Brian D. Joseph and Richard D. Janda, editors, The Handbook of Historical Linguistics, pages 283–310. Blackwell, Oxford.

[Saxena2016] Anju Saxena. 2016. Indo-Aryan in typological and areal perspective. Keynote presentation at SALA-32, Lisbon, 27–29 April, 2016.

[Simons and Fennig2018] Gary F. Simons and Charles D. Fennig, editors. 2018. Ethnologue: Languages of the World. SIL International, Dallas, 21st edition. Online version: http://www.ethnologue.com.

[Sun et al.2013] Guo-Dao Sun, Ying-Cai Wu, Rong-Hua Liang, and Shi-Xia Liu. 2013. A survey of visual analytics techniques and applications: State-of-the-art research and future challenges. Journal of Computer Science and Technology, 28(5):852–867.

[Surdeanu et al.2003] Mihai Surdeanu, Sanda Harabagiu, John Williams, and Paul Aarseth. 2003. Using predicate-argument structures for information extraction. In Proceedings of ACL 2003, pages 8–15, Sapporo. ACL.

[Swadesh1955] Morris Swadesh. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21(2):121–137.

[Ward and Barker2013] Jonathan Stuart Ward and Adam Barker. 2013. Undefined by data: A survey of big data definitions. CoRR, abs/1309.5821.

[Xia and Lewis2007] Fei Xia and William Lewis. 2007. Multilingual structural projection across interlinear text. In Proceedings of HLT 2007, pages 452–459, Rochester, New York. ACL.
