Conference article

Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries

Michael Beißwenger
University of Duisburg-Essen, Germany

Thierry Chanier
Université Clermont, Auvergne, France

Tomaž Erjavec
Jožef Stefan Institute, Ljubljana, Slovenia

Darja Fišer
University of Ljubljana, Ljubljana, Slovenia

Axel Herold
Berlin-Brandenburg Academy of Sciences, Berlin, Germany

Nikola Ljubešić
Jožef Stefan Institute, Ljubljana, Slovenia

Harald Lüngen
Institute for the German Language, Mannheim, Germany

Céline Poudat
Université de Nice, Sophia Antipolis, France

Egon Stemle
Eurac Research, Bolzano, Italy

Angelika Storrer
University of Mannheim, Mannheim, Germany

Ciara Wigham
Université Clermont, Auvergne, France

Published in: Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26–28 October 2016, CLARIN Common Language Resources and Technology Infrastructure

Linköping Electronic Conference Proceedings 136:1, p. 1-18

Published: 2017-05-23

ISBN: 978-91-7685-499-0

ISSN: 1650-3686 (print), 1650-3740 (online)


The paper presents best practices and results from projects in four countries dedicated to the creation of corpora of computer-mediated communication and social media interactions (CMC). Even though many open issues remain in building and annotating corpora of this type, a range of tested solutions already exists that may serve as a starting point for a comprehensive discussion of how future standards for CMC corpora could (and should) be shaped.


CMC corpora, computer-mediated communication, social media corpora, corpus annotation, language resources, TEI, community building


