Conference article

Semi-automatic selection of best corpus examples for Swedish: Initial algorithm evaluation

Elena Volodina
Department of Swedish & Språkbanken, University of Gothenburg, Sweden

Richard Johansson
Department of Swedish & Språkbanken, University of Gothenburg, Sweden

Sofie Johansson Kokkinakis
Department of Swedish & Språkbanken, University of Gothenburg, Sweden


Published in: Proceedings of the SLTC 2012 workshop on NLP for CALL, Lund, 25th October 2012

Linköping Electronic Conference Proceedings 80:7, p. 59-70


Published: 2012-11-12


ISSN: 1650-3686 (print), 1650-3740 (online)


The study presented here describes the results of an initial evaluation of two sorting approaches to the automatic ranking of corpus examples for Swedish. Representatives of two potential target user groups, second/foreign language (L2) teachers and lexicographers, were asked to rate the top three hits per approach for sixty search items from the point of view of the needs of their professional groups. The evaluation has shown, on the one hand, which of the two approaches to example rating (called algorithms #1 and #2 in the text below) performs better at finding good examples for each target user group and, on the other hand, which features evaluators associate with good examples. It has also facilitated statistical analysis of the “good” versus “bad” examples with reference to measurable features such as sentence length, word length, lexical frequency profiles, PoS constitution, and dependency structure, with the potential to identify new reliable classifiers.




