Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure


Speech emotion recognition (SER) is a challenging task in demanding human-machine interaction systems. Standard approaches based on the categorical model of emotions achieve low performance, probably because they model emotions as distinct and independent affective states. Starting from the recently investigated dimensional circumplex model of emotions, SER can instead be framed as the prediction of valence and arousal on a continuous scale in a two-dimensional domain. In this study, we propose a partial least squares (PLS) regression model, optimized through specific feature selection procedures and trained on the Italian speech corpus EMOVO, and we suggest a way to automatically label the corpus in terms of arousal and valence. The regression model includes new speech features related to speech amplitude modulation, which is caused by slowly varying articulatory motion, together with standard features extracted from the pitch contour. Over the seven primary emotions (including the neutral state), the female model obtains an average coefficient of determination R² of 0.72 (maximum of 0.95 for fear, minimum of 0.60 for sadness), and the male model obtains an R² of 0.81 (maximum of 0.89 for anger, minimum of 0.71 for joy).
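The abstract reports results as a coefficient of determination R², but does not state how it was computed. As a minimal sketch, assuming the standard definition R² = 1 - SS_res / SS_tot, the score for one predicted dimension (arousal or valence) could be calculated as below. The function name `r_squared` and the sample values are hypothetical and serve only as an illustration; they are not data from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the targets around their mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the targets are constant")
    return 1.0 - ss_res / ss_tot


# Made-up arousal targets and predictions, for illustration only
arousal_true = [0.1, 0.4, 0.6, 0.9]
arousal_pred = [0.2, 0.35, 0.65, 0.8]
score = r_squared(arousal_true, arousal_pred)
print(f"R^2 = {score:.3f}")
```

A value close to 1 means the predictions explain most of the variance in the targets, while a value near 0 means they do no better than always predicting the mean.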

Mencattini, A., Martinelli, E., Costantini, G., Todisco, M., Basile, B., Bozzali, M., et al. (2014). Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure. Knowledge-Based Systems, 63 (June 2014), 68-81 [10.1016/j.knosys.2014.03.019].

Mencattini, Arianna; Martinelli, Eugenio; Costantini, Giovanni; Todisco, M.; Basile, B.; Bozzali, M.; Di Natale, Corrado

2014-04-02

Publication date: 2 April 2014
Publication status: Published
Article DOI: https://dx.doi.org/10.1016/j.knosys.2014.03.019
Relevance: International
Type: Article
Referee: Anonymous experts
Disciplinary sector of the article (valid until 24/06/2024): Settore ING-INF/07 - MISURE ELETTRICHE ED ELETTRONICHE (Electrical and Electronic Measurements)
Content language: English
ISI Impact Factor: With ISI Impact Factor
Keywords: Speech emotion recognition (SER); Circumplex model of emotions; Partial least square (PLS) regression; Pearson correlation coefficient; Pitch contour characterization; Audio signal modulation
Alternative URL: http://www.scopus.com/record/display.url?eid=2-s2.0-84899981373&origin=resultslist&sort=plf-f&src=s&st1=mencattini&st2=a&nlo=1&nlr=20&nls=count-f&sid=350A6A319CD20A1D9E4A158A7026B5EF.FZg2ODcJC9ArCe8WOZPvA%3a63&sot=anl&sdt=aut&sl=39&s=AU-ID%28%22Mencattini%2c+Arianna%22+6507158637%29&relpos=0&relpos=0&citeCnt=0&searchTerm=AU-ID%28\%26quot%3BMencattini%2C+Arianna\%26quot%3B+6507158637%29#
Category: Journal article
Appears in categories: 01 - Journal article

Files in this item:

File	Size	Format
1-s2.0-S0950705114001087-main.pdf (main article; authorized users only)	2.16 MB	Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/86630
