The Teacher-Student Chatroom Corpus

Andrew Caines
ALTA Institute & Computer Laboratory, University of Cambridge, U.K.

Helen Yannakoudakis
Department of Informatics, King’s College London, U.K.

Helena Edmondson
Theoretical & Applied Linguistics, University of Cambridge, U.K.

Helen Allen
Cambridge Assessment, University of Cambridge, U.K.

Pascual Pérez-Paredes
Faculty of Education, University of Cambridge, U.K.

Bill Byrne
Department of Engineering, University of Cambridge, U.K.

Paula Buttery
ALTA Institute & Computer Laboratory, University of Cambridge, U.K.

Published in: Proceedings of the 9th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2020)

Linköping Electronic Conference Proceedings 175:2, pp. 10-20

Published: 2020-11-20

ISBN: 978-91-7929-732-9

ISSN: 1650-3686 (print), 1650-3740 (online)


The Teacher-Student Chatroom Corpus (TSCC) is a collection of written conversations captured during one-to-one lessons between teachers and learners of English. The lessons took place in an online chatroom and therefore involve more interactive, immediate and informal language than might be found in asynchronous exchanges such as email correspondence. Because the lessons were one-to-one, the teacher was able to focus exclusively on the linguistic abilities and errors of the student, and to offer personalised exercises, scaffolding and correction. The TSCC contains more than one hundred lessons between two teachers and eight students, amounting to 13.5K conversational turns and 133K words, and is freely available for research use. We describe the corpus design, the data collection procedure and the annotations added to the text. We also report preliminary descriptive analyses of the data and consider possible uses of the TSCC.


Keywords: English language learning, discourse analysis, dialogue systems


