TY - GEN
T1 - A multi-modal approach to emotion recognition using undirected topic models
AU - Shah, Mohit
AU - Chakrabarti, Chaitali
AU - Spanias, Andreas
PY - 2014/1/1
Y1 - 2014/1/1
N2 - A multi-modal framework for emotion recognition using bag-of-words features and undirected, replicated softmax topic models is proposed here. Topic models ignore the temporal information between features, allowing them to capture the complex structure without a brute-force collection of statistics. Experiments are performed over face, speech and language features extracted from the USC IEMOCAP database. Performance on facial features yields an unweighted average recall of 60.71%, a relative improvement of 8.89% over state-of-the-art approaches. Comparable performance is achieved when considering speech alone (57.39%) or a fusion of speech and face information (66.05%). Individually, each source is shown to be strong at recognizing sadness (speech), happiness (face), or neutral (language), while multi-modal fusion retains these properties and improves the accuracy to 68.92%. Implementation times for each source and for their combination are provided. Results show that a turn of 1-second duration can be classified in approximately 666.65 ms, making this method highly amenable to real-time implementation.
AB - A multi-modal framework for emotion recognition using bag-of-words features and undirected, replicated softmax topic models is proposed here. Topic models ignore the temporal information between features, allowing them to capture the complex structure without a brute-force collection of statistics. Experiments are performed over face, speech and language features extracted from the USC IEMOCAP database. Performance on facial features yields an unweighted average recall of 60.71%, a relative improvement of 8.89% over state-of-the-art approaches. Comparable performance is achieved when considering speech alone (57.39%) or a fusion of speech and face information (66.05%). Individually, each source is shown to be strong at recognizing sadness (speech), happiness (face), or neutral (language), while multi-modal fusion retains these properties and improves the accuracy to 68.92%. Implementation times for each source and for their combination are provided. Results show that a turn of 1-second duration can be classified in approximately 666.65 ms, making this method highly amenable to real-time implementation.
UR - http://www.scopus.com/inward/record.url?scp=84907409936&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84907409936&partnerID=8YFLogxK
U2 - 10.1109/ISCAS.2014.6865245
DO - 10.1109/ISCAS.2014.6865245
M3 - Conference contribution
AN - SCOPUS:84907409936
SN - 9781479934324
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
SP - 754
EP - 757
BT - 2014 IEEE International Symposium on Circuits and Systems, ISCAS 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 IEEE International Symposium on Circuits and Systems, ISCAS 2014
Y2 - 1 June 2014 through 5 June 2014
ER -