Conference article

Using Factual Density to Measure Informativeness of Web Documents

Christopher Horn
Know-Center GmbH, Graz, Austria

Alisa Zhila
Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico

Alexander Gelbukh
Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico

Roman Kern
Know-Center GmbH, Graz, Austria

Elisabeth Lex
Know-Center GmbH, Graz, Austria


Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:21, p. 227-238

NEALT Proceedings Series 16:21, p. 227-238


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Information obtained from the Web is increasingly important for decision making and for everyday tasks. With the growth of uncertified sources, the blogosphere, social media comments, and automatically generated texts, measuring the quality of textual information found on the Internet is becoming crucial. It has been suggested that factual density can be used to measure the informativeness of text documents. However, this has only been shown for very specific texts such as Wikipedia articles. In this work we move to arbitrary Internet texts and show that factual density is applicable to measuring the informativeness of the textual content of arbitrary Web documents. To this end, we compiled a human-annotated reference corpus to serve as ground truth for assessing how well the informativeness of documents can be predicted automatically. Our corpus consists of 50 documents randomly selected from the Web, which were ranked by 13 human annotators using the MaxDiff technique. We then ranked the same documents automatically using ExtrHech, an open information extraction system. The two rankings correlate, with Spearman's coefficient ρ = 0.41 at a significance level of 99.64%.
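The comparison of a human ranking with an automatic one can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes that factual density is the number of extracted facts divided by the number of words in a document (the abstract does not state the normalization), and all function names are hypothetical. Spearman's coefficient is computed as the Pearson correlation of the rank vectors, with tied values sharing their mean rank.

```python
from typing import List, Sequence


def factual_density(num_facts: int, num_words: int) -> float:
    """Facts per word. The normalization by word count is an assumption."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return num_facts / num_words


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all items tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the paper.
    human_scores = [0.9, 0.4, 0.7, 0.1, 0.5]
    fact_counts = [12, 3, 9, 4, 2]
    word_counts = [300, 250, 280, 400, 150]
    densities = [factual_density(f, w) for f, w in zip(fact_counts, word_counts)]
    print(spearman_rho(human_scores, densities))
```

In practice a library routine such as `scipy.stats.spearmanr` would also report the p-value needed to state a significance level; the pure-Python version here only shows how the coefficient itself is obtained.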

Keywords

Quality of texts; Web; fact extraction; open information extraction; informativeness; natural language processing

