Work Smart - Reducing Effort in Short-Answer Grading

Margot Mieskes
Hochschule Darmstadt, h_da, Darmstadt, Germany

Ulrike Pado
Hochschule für Technik Stuttgart, HFT, Stuttgart, Germany


Published in: Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL 2018) at SLTC, Stockholm, 7th November 2018

Linköping Electronic Conference Proceedings 152:7, pp. 57-68

NEALT Proceedings Series 36:7, pp. 57-68


Published: 2018-11-02

ISBN: 978-91-7685-173-9

ISSN: 1650-3686 (print), 1650-3740 (online)


In language (and content) instruction, free-text questions are important instruments for gauging student ability. Grading is often done manually, so frequent testing means a high workload for teachers. We propose a new strategy for supporting manual graders: we carefully analyse the performance of automated graders, both individually and as a grader ensemble, and present a procedure to guide manual effort and to estimate the size of the remaining grading error. We evaluate our approach on a range of data sets to demonstrate its robustness.
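The idea of letting a grader ensemble guide manual effort can be illustrated with a small Python sketch. It is not the procedure from the paper: the function names, the unanimity threshold, the toy labels, and the way the remaining error is estimated (the error rate of auto-accepted grades on a hand-checked sample) are all assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label and the share of graders that chose it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def route_answers(predictions, min_agreement=1.0):
    """Split answers into auto-accepted grades and a manual review queue.

    predictions maps an answer id to a list of labels, one per automated
    grader. min_agreement=1.0 (an assumed default) accepts only unanimous
    ensemble decisions; everything else goes to a human grader.
    """
    accepted, manual = {}, []
    for answer_id, labels in predictions.items():
        label, agreement = majority_vote(labels)
        if agreement >= min_agreement:
            accepted[answer_id] = label
        else:
            manual.append(answer_id)
    return accepted, manual


def estimate_remaining_error(accepted, gold):
    """Error rate of accepted grades on the answers that have a gold label."""
    checked = [a for a in accepted if a in gold]
    if not checked:
        return None  # nothing to check against
    wrong = sum(accepted[a] != gold[a] for a in checked)
    return wrong / len(checked)


# Toy data (hypothetical): three automated graders, three student answers.
predictions = {
    "a1": ["correct", "correct", "correct"],
    "a2": ["correct", "incorrect", "correct"],
    "a3": ["incorrect", "incorrect", "incorrect"],
}
accepted, manual = route_answers(predictions)
# A teacher spot-checks the accepted answers; here a3 was graded wrongly.
gold_sample = {"a1": "correct", "a3": "correct"}
print(accepted, manual, estimate_remaining_error(accepted, gold_sample))
```

In this sketch, lowering `min_agreement` shrinks the manual queue at the cost of accepting more uncertain ensemble decisions, which is the kind of trade-off between effort and remaining error that the abstract refers to.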


Keywords: short-answer grading, machine grading, manual grading support


