TY - JOUR
T1 - Articulation constrained learning with application to speech emotion recognition
AU - Shah, Mohit
AU - Tu, Ming
AU - Berisha, Visar
AU - Chakrabarti, Chaitali
AU - Spanias, Andreas
N1 - Funding Information:
This work was partially supported by NIH grant R01DC006859, the SenSIP center, and NSF grant CSR-0916099.
Publisher Copyright:
© 2019, The Author(s).
PY - 2019/12/1
Y1 - 2019/12/1
AB - Speech emotion recognition methods combining articulatory information with acoustic features have previously been shown to improve recognition performance. Collecting articulatory data on a large scale may not be feasible in many scenarios, which restricts the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition using both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that require the model to reconstruct articulatory data. This leads to sparse and interpretable representations jointly optimized for both tasks. Furthermore, the model requires articulatory features only during training; only speech features are needed for inference on out-of-sample data. Experiments are conducted to evaluate emotion recognition performance over the vowels /AA/, /AE/, /IY/, and /UW/, as well as over complete utterances. Incorporating articulatory information is shown to significantly improve performance for valence-based classification. Results obtained for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
KW - Articulation
KW - Constrained optimization
KW - Cross-corpus
KW - Emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85071023744&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071023744&partnerID=8YFLogxK
U2 - 10.1186/s13636-019-0157-9
DO - 10.1186/s13636-019-0157-9
M3 - Article
AN - SCOPUS:85071023744
SN - 1687-4714
VL - 2019
JO - EURASIP Journal on Audio, Speech, and Music Processing
JF - EURASIP Journal on Audio, Speech, and Music Processing
IS - 1
M1 - 14
ER -