The automatic identification of discourse units in Dutch text

Nynke van der Vliet
University of Groningen, The Netherlands

Gosse Bouma
University of Groningen, The Netherlands

Gisela Redeker
University of Groningen, The Netherlands


Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:37, pp. 411-421

NEALT Proceedings Series 16:37, pp. 411-421


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


The identification of discourse units is an essential step in discourse parsing, the automatic construction of a discourse structure from a text. We present a rule-based algorithm to identify elementary discourse units (EDUs) in Dutch written text. Contrary to approaches that focus on the determination of segment boundaries, we identify complete discourse units, which is especially helpful for the recognition of interrupted EDUs that contain embedded discourse units. We use syntactic and lexical information to decompose sentences into EDUs. Experimental results show that our algorithm for EDU identification performs well on texts of various genres.
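
The contrast between marking segment boundaries and identifying complete units can be illustrated with a small sketch. This is not the algorithm from the paper: the `EDU` class, the `identify_edus` function, the English example sentence, and the hand-supplied span of the embedded clause are all hypothetical, and the sketch assumes that a syntactic parse has already located the embedded clause. It only shows why representing a unit as a set of token positions lets an interrupted EDU stay a single unit, where boundary marking would split it into two fragments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EDU:
    """A discourse unit stored as token indices, so it may be discontinuous."""
    token_ids: tuple

    def text(self, tokens):
        return " ".join(tokens[i] for i in self.token_ids)

    def is_interrupted(self):
        # A unit is interrupted when its indices do not form one contiguous run.
        return any(b - a != 1 for a, b in zip(self.token_ids, self.token_ids[1:]))


def identify_edus(tokens, embedded_spans):
    """Return the host EDU followed by the embedded EDUs of one sentence.

    embedded_spans holds non-overlapping, half-open (start, end) token ranges.
    Assumption: a parser has identified these ranges; here they are given by hand.
    """
    embedded = [EDU(tuple(range(start, end))) for start, end in embedded_spans]
    covered = {i for edu in embedded for i in edu.token_ids}
    # The host unit is everything that is not part of an embedded unit.
    host = EDU(tuple(i for i in range(len(tokens)) if i not in covered))
    return [host] + embedded


# Hypothetical example: a relative clause embedded in its host clause.
tokens = ["The", "mayor", ",", "who", "resigned", "yesterday", ",",
          "denied", "the", "claims", "."]
edus = identify_edus(tokens, embedded_spans=[(2, 7)])
for edu in edus:
    print(edu.is_interrupted(), "|", edu.text(tokens))
```

In this toy case the host unit covers the tokens before and after the relative clause and is reported as interrupted, while the embedded clause is a second, contiguous unit. A boundary-based segmenter would instead produce three segments and need a further step to rejoin the first and the third.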


Discourse analysis; elementary discourse units; segmentation


