Conference article

Short answer grading: When sorting helps and when it doesn’t

Ulrike Pado
HFT Stuttgart, Stuttgart, Germany

Cornelia Kiefer
HFT Stuttgart, Stuttgart, Germany


Published in: Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015, Vilnius, 11th May, 2015

Linköping Electronic Conference Proceedings 114:6, p. 42-50

NEALT Proceedings Series 26:6, p. 42-50


Published: 2015-05-06

ISBN: 978-91-7519-036-5

ISSN: 1650-3686 (print), 1650-3740 (online)


Automatic short-answer grading promises improved student feedback at reduced teacher effort both during and after instruction. Automated grading is, however, controversial in high-stakes testing, and complex systems can be difficult for non-experts to set up, especially for frequently changing questions. We propose a versatile, domain-independent system that assists manual grading by pre-sorting answers according to their similarity to a reference answer. We show near state-of-the-art performance on the task of automatically grading the answers from CREG (Meurers et al., 2011). To evaluate the grader assistance task, we present CSSAG (Computer Science Short Answers in German), a new corpus of German computer science questions answered by native speakers and highly proficient non-native speakers. On this corpus, we demonstrate the positive influence of answer sorting on the slowest-graded, most complex-to-assess questions.
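The pre-sorting idea can be illustrated with a minimal sketch. The abstract does not say which similarity measure the system uses, so the Jaccard token overlap below is only a stand-in assumption, and the function names (`tokens`, `jaccard`, `presort`) and the example answers are hypothetical rather than taken from the paper or from CSSAG.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings (assumed stand-in measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def presort(answers, reference):
    """Order student answers from most to least similar to the reference."""
    return sorted(answers, key=lambda ans: jaccard(ans, reference), reverse=True)


# Hypothetical reference answer and student answers, for illustration only.
reference = "A stack is a last-in first-out data structure"
answers = [
    "I do not know",
    "A stack is a data structure that works last-in first-out",
    "A stack stores data",
]

ranked = presort(answers, reference)
for answer in ranked:
    print(f"{jaccard(answer, reference):.2f}  {answer}")
```

A grader would then work through `ranked` from top to bottom, so that answers close to the reference appear together. Any other similarity function could be substituted for `jaccard` without changing `presort`.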


short-answer grading; assisted grading; short-answer corpora


Enrique Alfonseca and Diana Pérez. 2004. Automatic assessment of open ended questions with a BLEU-inspired algorithm and shallow NLP. In Advances in Natural Language Processing, volume 3230 of Lecture Notes in Computer Science. Springer.

Daniel Bär, Torsten Zesch, and Iryna Gurevych. 2013. DKPro Similarity: An open source framework for text similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121–126.

Sumit Basu, Chuck Jacobs, and Lucy Vanderwende. 2013. Powergrading: A clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1:391–402.

Kathryn Bock. 1986. Syntactic persistence in language production. Cognitive Psychology, 18:355–387.

Michael Brooks, Sumit Basu, Charles Jacobs, and Lucy Vanderwende. 2014. Divide and correct: Using clusters to grade short answers at scale. In Learning @ Scale.

Steven Burrows, Iryna Gurevych, and Benno Stein. 2015. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25:60–117.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11.

Michael Hahn and Detmar Meurers. 2012. Evaluating the meaning of answers to reading comprehension questions: A semantics-based approach. In Proceedings of the 7th Workshop on Innovative Use of NLP for Building Educational Applications (BEA7).

Andrea Horbach, Alexis Palmer, and Magdalena Wolska. 2014. Finding a tradeoff between accuracy and rater’s workload in grading clustered short answers. In Proceedings of the 9th LREC, pages 588–595.

Claudia Leacock and M. Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Detmar Meurers, Ramon Ziai, Niels Ott, and Janina Kopp. 2011. Evaluating answers to reading comprehension questions in context: Results for German and the role of information structure. In Proceedings of the TextInfer 2011 Workshop on Textual Entailment, pages 1–9.

David E. Meyer and Roger W. Schvaneveldt. 1971. Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90(2):227–234.

Michael Mohler, Razvan Bunescu, and Rada Mihalcea. 2011. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 752–762. ACL.

Diana Pérez, Enrique Alfonseca, Pilar Rodríguez, Alfio Gliozzo, Carlo Strapparava, and Bernardo Magnini. 2005. About the effects of combining latent semantic analysis with natural language processing techniques for free-text assessment. Revista Signos: Estudios de Lingüística, 38(59):325–343.

Martin Pickering and Simon Garrod. 2004. The interactive-alignment model: Developments and refinements. Behavioral and Brain Sciences, 27:212–225.

Michael J. Wise. 1996. YAP3: Improved detection of similarities in computer program and other texts. In SIGCSEB: SIGCSE Bulletin (ACM Special Interest Group on Computer Science Education), pages 130–134. ACM Press.
