Conference article

Features indicating readability in Swedish text

Johan Falkenjack
Department of Information and Computer Science, Linköping University, Linköping, Sweden

Katarina Heimann Mühlenbock
Språkbanken, University of Gothenburg, Gothenburg

Arne Jönsson
SICS East Swedish ICT AB, Sweden


Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:8, pp. 27-40

NEALT Proceedings Series 16:8, pp. 27-40


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Studies have shown that modern methods of readability assessment, using automated linguistic analysis and machine learning (ML), are a viable road forward for readability classification and ranking. In this paper we present a study of different levels of analysis and a large number of features, and of how they affect an ML system's accuracy in readability assessment. We test a large number of features proposed for different languages (mainly English), evaluate their usefulness for readability assessment of Swedish, and compare their performance to that of established metrics. We find that the best-performing features are language models based on part-of-speech and dependency type.

Keywords

Readability assessment; Machine learning; Dependency parsing; Weka

