Annotation and Multimodal Perception of Attitudes: A Study on Video Blogs

Noor Alhusna Madzlan
CLCS, School of Linguistics, Speech and Communication Sciences, Trinity College Dublin, Ireland / ELLD, Faculty of Languages and Communication, UPSI, Malaysia

Justine Reverdy
SCSS, School of Computer Science and Statistics, Trinity College Dublin, Ireland

Francesca Bonin
SCSS, School of Computer Science and Statistics, Trinity College Dublin, Ireland

Loredana Cerrato
SCSS, School of Computer Science and Statistics, Trinity College Dublin, Ireland

Nick Campbell
SCSS, School of Computer Science and Statistics, Trinity College Dublin, Ireland


In: Proceedings from the 3rd European Symposium on Multimodal Communication, Dublin, September 17–18, 2015

Linköping Electronic Conference Proceedings 105:9, pp. 50–54


Published: 2016-09-16

ISBN: 978-91-7685-679-6

ISSN: 1650-3686 (print), 1650-3740 (online)


We report the set-up and results of an experiment designed to verify to what extent attitudes can be identified and labelled using an ad hoc annotation scheme. Respondents were asked to label the multimodal expressions of attitudes of a number of video bloggers selected from a vlog corpus. This study aims to measure respondents’ attitude choices as well as the differences in their attitude judgements. We investigate the contribution of different modalities (audio only, video only, and combined audio-visual) to the process of attitude choice. The results are analysed from three perspectives: inter-annotator agreement, the contribution of each modality, and the certainty level of attitude choice. Participants perceived attitudes more accurately when presented with the combined audio-visual stimuli than with the audio-only or video-only stimuli, and were more certain when selecting “Friendliness” than the other attitudes.
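The inter-annotator agreement mentioned above is conventionally measured with a chance-corrected statistic such as Fleiss’ kappa. As a minimal sketch of that computation (the rating matrix and attitude categories below are illustrative, not data from this study):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a matrix counts[i][j] = number of annotators
    assigning item i to category j. Every item must be rated by the
    same number of annotators."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement: per item, the proportion of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total
             for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 video clips, 5 annotators, 3 hypothetical
# attitude categories (columns).
ratings = [
    [5, 0, 0],
    [4, 1, 0],
    [0, 3, 2],
    [1, 1, 3],
]
print(round(fleiss_kappa(ratings), 3))  # prints 0.32
```

Kappa near 0 indicates chance-level agreement and values toward 1 indicate strong agreement; comparing kappa across the audio-only, video-only, and audio-visual conditions is one way to quantify each modality’s contribution.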


Keywords: multimodal perception, video blogs, annotation, affective states


