Classification of Affect in Spoken Utterances

6.891: Machine Learning Project

Nitin Sawhney {nitin@media.mit.edu}

Dec. 7, 1999

Abstract

Prosodic patterns in speech convey affective state to listeners. In infant-directed speech such characteristics are exaggerated, hence such data was used to conduct a short study on the classification of affect. Pitch and energy measures extracted from the utterances of 12 speakers provided prosodic features for the classifier. A Support Vector Machine (SVM) approach was used to run a series of machine learning experiments for a 3-way classification of attention, prohibition and approval in these utterances. Due to limited data, cross-validation techniques provided better performance. A linear kernel SVM achieved an overall accuracy of 65% using a simple set of acoustic features, comparable to results previously reported in the literature using other techniques.

Introduction

The goal of this project is to develop a representation and utilize learning techniques that allow machines to recognize affect in a human speaker's voice. The focus is on early pre-linguistic affective messages in utterances, typically heard by infants or animals. Across languages it has been shown that prosodic patterns used by parents convey prohibition, praise and attentional bids in a similar manner [Fernald93]. A research question is whether acoustic characteristics in the utterances alone can provide sufficient information to the listener to distinguish such affective content.

Related Work

Prosodic effects in speech are considered to be related to pitch (F0) contour, energy contour and speaking rate [Cahn89]. However, studies indicate that the manner in which acoustic features are correlated with affect may be speaker dependent [Sterrer83]. A preliminary experiment at the Media Lab [Roy96] utilized the Fisher linear discriminant method to find an optimal combination of 6 acoustic measurements to classify approval/disapproval. They achieved classification accuracy of 65%-88% for 300 utterances spoken by 3 speakers. They found that the most discriminating features were different for each speaker; however, average F0 and the ratio of the first 2 harmonics served as better features. They suggest that higher accuracy requires analysis of verbal content and sentence-level prosodic cues such as the F0 contour (they did not utilize a robust pitch tracker).

Recent work at Interval [Slaney98] focused on recognizing 3 classes (approval/attentional bids/prohibition) of affective state in 30-50 utterances each from 12 different speakers. They analyzed speech using 3 classes of features: pitch (variance, slope, range, mean), formant transitions (using MFCCs), and energy variations. A Gaussian mixture model (10 Gaussians per class) was used to model the data. Features that provided the best performance were global pitch range, global MFCC and global pitch slope as well as energy variance in the first segment. Overall, a speaker-dependent classifier worked best, giving 66% accuracy (while human listeners achieved 65%). Curiously, utterances from female speakers were classified more accurately than those from male speakers (67% vs. 57%).

For this project, I considered appropriate features, pre-processing and alternative learning techniques to aid classification of such affective auditory data.

Acquiring Data

After correspondence with Malcolm Slaney in Nov. '99, we obtained access to the infant-directed utterances used in their study [Slaney98]. The 'babyears' dataset consists of recordings from 12 speakers (6 mothers and 6 fathers) talking to infants in a quiet room. As the parents were asked to play and interact with their infants naturally, verbal interaction yielded exaggerated utterances grouped into 3 classes: Approval, Attention, and Prohibition. The dataset contained 30-50 utterances per speaker, segmented into phrases by a speech-silence discriminator. In total it contained 212 approvals, 149 attentional bids and 148 prohibitions.

Subjective listening experiments with 7 adult subjects on this dataset (carried out at Interval) yielded 79% classification accuracy. However, when adults do not have access to the linguistic message in utterances, the performance is lower. This was shown in an earlier study where listeners judged emotional messages in Danish with 65% accuracy [Engberg97].

The data we obtained was a set of 600 audio segments in AIFF format, grouped under 12 speakers. The data was labeled as approval (ap), attention (at), and prohibition (pr), along with additional utterances (ad) of parents speaking without emphasizing affect (post-study interviews). Each audio file was a stereo recording; one channel was from a close-talking microphone, and the other from a stereo microphone. It was important to ensure that all audio was analyzed from the same channel (the close-talking microphone).

Analysis & Processing

Both the Interval study and the paper by Roy and Pentland [Roy96] analyzed a number of features related to pitch, formants and energy to infer prosodic characteristics in the utterances, as mentioned in the related work section. We primarily chose to use pitch and energy features for this study. This requires use of a robust pitch tracker to generate the pitch-related statistics. We utilized Praat [Boersma99], a speech analysis tool developed for use by phoneticians, to do feature extraction from the AIFF audio data. Praat (v.3.8) uses a highly accurate pitch-extraction algorithm that measures F0 with an accuracy of 10^-6, and HNR values up to 60 dB [Boersma93]. A Perl script was written to generate Praat scripts to automatically process all audio files, and analyze the pitch, intensity and formant contours in the utterances (see figure 1).

 

Figure 1: Analysis of the utterance "Good Job!" using the Praat tool. Here we see the waveform, intensity contour (in dB), formants in the spectrum and the pitch track (in Hz).

The generated Praat files were then parsed and analyzed in Matlab to extract pitch and energy statistics. For each utterance, the pitch variance, slope, range, and mean were extracted along with energy variance and range. The pitch measures were converted to octaves by computing the log base 2 of the pitch estimates to place them on a perceptual scale [Slaney98]. No formant information (like MFCCs) was utilized in these experiments. These features (4 pitch measures and 2 energy measures) were obtained for 4 time periods in the utterance: the whole phrase as well as for 3 segments: beginning, middle and end. Hence a 24-element feature vector was extracted for each utterance.
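As a rough illustration of how such a 24-element vector might be assembled, the sketch below computes the six statistics over the whole contour and over three segments. It is written in Python rather than the Matlab used in the study. The function names, the split into equal thirds, the least-squares definition of slope, and the assumption that unvoiced frames have already been removed from the pitch track are all illustrative choices, not details taken from the original analysis.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _variance(xs):
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def _slope(xs):
    """Least-squares slope of xs against frame index (assumed definition)."""
    n = len(xs)
    if n < 2:
        return 0.0
    t_mean = (n - 1) / 2
    x_mean = _mean(xs)
    num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(xs))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def segment_stats(pitch_hz, energy_db):
    """Six statistics for one time period: 4 pitch measures + 2 energy measures."""
    octaves = [math.log2(f) for f in pitch_hz]  # pitch on an octave scale
    return [
        _variance(octaves),
        _slope(octaves),
        max(octaves) - min(octaves),
        _mean(octaves),
        _variance(energy_db),
        max(energy_db) - min(energy_db),
    ]

def thirds(xs):
    """Split a contour into beginning, middle and end (equal thirds, assumed)."""
    n = len(xs)
    a, b = n // 3, (2 * n) // 3
    return [xs[:a], xs[a:b], xs[b:]]

def utterance_features(pitch_hz, energy_db):
    """Whole-phrase statistics followed by those of the 3 segments: 4 x 6 = 24."""
    features = segment_stats(pitch_hz, energy_db)
    for p_seg, e_seg in zip(thirds(pitch_hz), thirds(energy_db)):
        features.extend(segment_stats(p_seg, e_seg))
    return features
```

Each contour needs at least three frames so that no segment is empty. In this sketch the slope is expressed in octaves per frame; a real implementation would probably divide by the frame step to obtain octaves per second.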

To ensure that the features were meaningful we first looked at the correlation between features in Matlab (using corrcoef). Correlations between the pitch measures and between the energy measures were expected. The data was whitened (subtracting the mean from each feature and dividing by its variance) in an attempt to reduce some correlations. However, later experiments showed no improvement in performance (whitening actually increased classification errors slightly).
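The two steps described here can be sketched in a few lines. The Python below is only an illustration with made-up function names. It also makes one deliberate substitution: the text describes dividing by the variance, whereas the sketch divides by the standard deviation, which is the more common convention and gives each feature unit variance.

```python
import math

def column_stats(rows):
    """Per-feature mean and standard deviation over a list of feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return means, stds

def standardize(rows):
    """Subtract each feature's mean and divide by its standard deviation.

    Assumption: standard deviation is used instead of the variance mentioned
    in the text. Constant features are set to zero to avoid dividing by zero.
    """
    means, stds = column_stats(rows)
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(len(r))] for r in rows]

def correlation_matrix(rows):
    """Pearson correlation coefficients between all pairs of features."""
    z = standardize(rows)
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(d)]
            for i in range(d)]
```

Rescaling each feature separately leaves the Pearson correlation between features unchanged, so a step of this kind mainly puts the features on a comparable scale. Removing the correlations themselves would need a decorrelating transform such as PCA whitening.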

 

Figure 2: Matrix of correlation coefficients for the 24 features extracted from the dataset. There is visibly greater correlation (brighter squares) among the pitch measures and among the energy measures.

Classification

The main focus of this study was on running classification experiments using a support vector machine (SVM) approach. An SVM is considered a good candidate because it generalizes well without the need for a priori knowledge, even when the dimensionality of the input space is high or the training examples are sparse. For a set of points belonging to one of two classes, a linear SVM finds the hyperplane that leaves the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. For a good discussion of statistical learning theory see [Vapnik99], and for an illustrative application of SVMs to image classification see [Chapelle99]. Use of SVMs on speech data for phonetic classification was first shown in a recent paper [Clarkson99]. For our experiments we chose to use the SVM Toolbox (v.2) running in Matlab (Steve Gunn, University of Southampton).
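For reference, the standard soft-margin formulation of a linear SVM can be written as below. The notation (training points x_i with labels y_i in {-1, +1}, weight vector w, offset b, slack variables xi_i) is the usual textbook one and is not taken from this report; C is the error penalty parameter that trades margin width against training errors.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\;
  \frac{1}{2}\,\|\mathbf{w}\|^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
  y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \;\ge\; 1 - \xi_i,
  \qquad \xi_i \ge 0, \qquad i = 1, \dots, N

% Decision function and predicted class for a new point x:
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,
\qquad \hat{y} = \operatorname{sign} f(\mathbf{x})
```

The width of the margin is 2/||w||, so minimizing ||w|| maximizes the margin. A very large C penalizes every training error heavily and pushes the solution towards separating the training data completely.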

SVMs are designed for binary classification. Since the affect data in this study has 3 classes, an appropriate method for multi-class learning was needed. One can either modify the design of the SVM to incorporate multi-class learning into the quadratic optimization, or combine several binary classifiers. We took the latter approach, combining binary SVMs trained with either the "One vs. One" or the "One vs. All" method. During testing, the decision functions of the SVMs trained on the data are compared, and the class whose decision function is largest is selected as the predicted class. To run these experiments, the data for each speaker was split into "One vs. One" and "One vs. All" data sets, and separate SVMs were then iteratively trained on the data, and their combined outputs tested with new data.
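A minimal sketch of how binary decision functions might be combined into a 3-way decision is given below, in Python. It assumes each trained binary classifier is available as a plain function returning a signed decision value. The function names are hypothetical, and the majority vote with a tie-break used for the One vs. One case is a common convention that is assumed here rather than taken from the experiments.

```python
from itertools import combinations

CLASSES = ["approval", "attention", "prohibition"]

def predict_one_vs_all(x, scorers):
    """scorers[c](x) is the decision value of the 'c vs. rest' classifier.

    The class whose classifier returns the largest decision value wins.
    """
    return max(CLASSES, key=lambda c: scorers[c](x))

def predict_one_vs_one(x, pair_scorers):
    """pair_scorers[(a, b)](x) > 0 votes for class a, otherwise for class b.

    Each of the 3 pairwise classifiers casts one vote; ties are broken by the
    summed absolute decision values (an assumed tie-breaking rule).
    """
    votes = {c: 0 for c in CLASSES}
    strength = {c: 0.0 for c in CLASSES}
    for a, b in combinations(CLASSES, 2):
        value = pair_scorers[(a, b)](x)
        winner = a if value > 0 else b
        votes[winner] += 1
        strength[winner] += abs(value)
    return max(CLASSES, key=lambda c: (votes[c], strength[c]))
```

With 3 classes both schemes need 3 binary classifiers, but each One vs. All classifier is trained on all of the data, while each One vs. One classifier sees only the utterances of its two classes.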

In addition, since we had such limited data for each speaker (30-50 utterances only), cross-validation was used so that most of the data could be used for training while a test error was still obtained over all of it. Here the data is divided into S distinct segments and the SVM is trained on data from S-1 segments, while its performance is tested on the remaining segment. This process is repeated for each of the S possible choices of held-out segment and the test errors are averaged across all results. This procedure allows a high proportion of the available data to be used in training, while all of the data contributes to the cross-validation error. We chose to divide utterances for each speaker into 10 subsets (or segments) for most of the classification experiments.
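The procedure can be sketched as follows, in Python. The `train` and `predict` arguments are placeholders for whatever classifier is being evaluated, and the round-robin assignment of utterances to segments is an assumption; the report does not say how the segments were formed.

```python
def k_fold_indices(n_items, n_segments):
    """Assign item indices to n_segments segments in round-robin order."""
    folds = [[] for _ in range(n_segments)]
    for i in range(n_items):
        folds[i % n_segments].append(i)
    return folds

def cross_validation_error(data, labels, n_segments, train, predict):
    """Average test error over the S train/test splits.

    train(xs, ys) returns a model; predict(model, x) returns a label.
    """
    fold_errors = []
    for test_idx in k_fold_indices(len(data), n_segments):
        if not test_idx:
            continue  # only possible with fewer items than segments
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(data) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train(train_x, train_y)
        wrong = sum(predict(model, data[i]) != labels[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / len(fold_errors)
```

With 30-50 utterances and S = 10, each test segment holds only 3 to 5 utterances, so the error of a single segment is very coarse and only the average over segments is meaningful. In practice the utterances would usually be shuffled, or the split stratified by class, before segments are assigned.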

Results

Experiment 1:

One Vs. One approach with 2/3 Training data & 1/3 Test data.

Kernel                  SVM Error Rate
Linear                  53.87%
ERBF (sigma = 1)        54.95%
Gaussian (sigma = 1)    76.28%

Such high error rates were expected when training on such a sparse data set. Changing the value of the SVM error penalty parameter C (set to 1000) had no influence on the results, except for providing a better bound when training with all the data (all speakers). Such a large value of C generally enforces full separability for high-dimensional data, for all kernels except the linear case.

By using 9/10ths of the data for training, we managed to reduce the SVM error to 42.24%. This indicated that cross-validation would yield better results.

Experiment 2:

One vs. One Cross-Validation using 10 segments with Linear Kernel SVMs for each speaker:

Speaker    Error
Sp01       40%
Sp02       45.71%
Sp03       42.85%
Sp04       35.55%
Sp05       28.33%
Sp06       27.78%
Sp07       40%
Sp08       26.66%
Sp09       27.77%
Sp10       42%
Sp11       33.33%
Sp12       32.22%
Average    35.18%

The result represents the best average classification performance (64.82%) that was achieved with the SVM Classifiers using a One vs. One training approach. The training errors were usually found to be 0%.

The first 6 speakers were female and the last 6 male. The average error on female speech was 36.70% (a classification rate of 63.29%), and the average error on male speech was 33.66% (a classification rate of 66.33%). These results are similar to those reported in the literature; however, they do not show the higher accuracy for female speakers reported in [Slaney98]. One reason could be that we did not use formant information measured by MFCCs (mel-frequency cepstral coefficients), which may encode gender-specific characteristics (along with pitch). Also, the simple features we extracted here were probably a coarse representation of the prosody, and may not capture more complex characteristics of affect in such utterances.
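The group averages can be checked directly from the per-speaker error rates listed for Experiment 2. The short Python snippet below is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Per-speaker One vs. One error rates (%) from Experiment 2, Sp01..Sp12.
errors = [40.0, 45.71, 42.85, 35.55, 28.33, 27.78,
          40.0, 26.66, 27.77, 42.0, 33.33, 32.22]

def mean(xs):
    return sum(xs) / len(xs)

female = mean(errors[:6])   # Sp01-Sp06
male = mean(errors[6:])     # Sp07-Sp12
overall = mean(errors)

print(f"female {female:.2f}%, male {male:.2f}%, overall {overall:.2f}%")
# prints: female 36.70%, male 33.66%, overall 35.18%
```

With only 6 speakers per group and per-speaker errors ranging from about 27% to 46%, a difference of 3 percentage points between the group means is small relative to the spread within each group.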

Whitening the data improved the performance on some speakers but increased the overall error to 35.87%. Informal tests using different kernels did not improve the performance.

Experiment 3:

One vs. All Cross-Validation using 10 segments with Linear Kernel SVMs for each speaker:

Speaker    Error
Sp01       43.33%
Sp02       50%
Sp03       50%
Sp04       36%
Sp05       40%
Sp06       38%
Sp07       44%
Sp08       35%
Sp09       32%
Sp10       63%
Sp11       33%
Sp12       40%
Average    42.12%

One vs. All training required much more CPU time and performed worse.

In most cases 40-60% of the data were used as support vectors (the data points that define the margin of the hyperplane), hence it did not seem instructive to examine these support vectors to look for common acoustic properties in the utterances, as I had hoped.

Experiment 4:

Experiments were also conducted using Linear Discriminant and Nearest Neighbor techniques with cross-validation using 10 segments:

Approach       Linear Discriminant    Nearest Neighbor
One vs. One    41.85%                 40.92%
One vs. All    37.78%                 36.62%

These observations are somewhat unexpected. First, they indicate that the data is linearly separable (hence results similar to linear SVMs). However, the One vs. All approach performed slightly better in these experiments, which contradicts the results from the SVM classification. I don't see an obvious explanation for that, so I consider these results preliminary and would like to verify them using other discrimination techniques and different sets of auditory features.

Future Work

Having explored the use of machine learning techniques on the 'babyears' dataset, I would like to consider experiments on a different dataset of infant-directed speech. The hypothesis is that one can use prosodic features in utterances for "Semantic Highlighting", i.e. to find the presence of keywords that the speaker is most likely referring to in a phrase, e.g. "Look at this ball" or "Come here John". One would hope that prosodic characteristics in the speaker's voice would be somewhat correlated with the onset of such keywords. In the literature, it has been stated that exaggerated F0 excursions serve as directing signals. Infants consistently direct their attention to speech with wider F0 excursions. Another task would be to consider how infants might segment words in speech. It has been hypothesized and shown that infants identify word boundaries based on cues such as "Strong/Weak" patterns in words [Jusczyk97]. These experiments would require hand-labeling all utterances to indicate the presence of the keyword or strong/weak patterns, making it a binary classification task.

A more extended experiment would use Independent Component Analysis (ICA) to extract possible higher-order structure [Bell96], in the form of statistically independent components in the data. However, it's hard to say what sort of result this would yield on speech utterances, so this would be an exploratory but potentially more interesting experiment.

Conclusions

It is encouraging to see that one can find meaningful information in acoustic data alone to characterize affect in speech; however, higher performance would require some access to the linguistic content or visual stimulus (as an infant has). Simple features seem to provide sufficient prosodic information for such classification, although they do not reveal important differences between speakers or genders. The SVM approaches used here classified the data reasonably well; however, for a multi-class experiment with limited data, much more effort was required to set up the learning. Overall, it is clear that such techniques can be utilized for a range of statistical learning experiments towards a better understanding of how infants may process and segment speech.

Acknowledgements

I'd like to thank Malcolm Slaney at Interval Research for allowing me to use their dataset for this study and Paul Boersma (University of Amsterdam) for use of the Praat analysis tool. Thanks to Aggelos Blestas, Deb Roy, Brian Clarkson, Sumit Basu, Paris Smaragdis, Robert Burke and Kinh Tieu for discussions on the project.

References

[Bell96] Bell, Anthony and Terrence J. Sejnowski. "Learning the higher-order structure of a natural sound". Network: Computation in Neural Systems, 7. 1996. http://www.cnl.salk.edu/~tony/ica.html

[Boersma93] Boersma, Paul. "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound." In Proceedings of the Institute of Phonetic Sciences 17, pp. 97-110, 1993. http://www.fon.hum.uva.nl/paul/papers/Proceedings_1993.ps

[Boersma99] Praat is a freely available tool for speech analysis, developed by Paul Boersma and David Weenink at the Institute of Phonetic Sciences of the University of Amsterdam, The Netherlands. http://www.fon.hum.uva.nl/praat

[Cahn89] Cahn, J.E. "Generation of affect in synthesized speech." Proceedings of the 1989 conference of AVIOS, pp. 251-256, 1989.

[Chapelle99] Chapelle, Olivier, P. Haffner, and V. N. Vapnik. "Support Vector Machines for Histogram-Based Image Classification". IEEE Transactions on Neural Networks, Vol. 10, No. 5, September 1999.

[Clarkson99] Clarkson, Philip and Pedro J. Moreno. "On the use of Support Vector Machines for Phonetic Classification". In the Proceedings of ICASSP '99.

[Engberg97] Engberg, S. I., A.V. Hansen, O. Anderson, P. Dalgaard. "Design, recording and verification of a Danish emotional speech database." Proceedings of EuroSpeech '97, Rhodes, Greece, Vol. 4, pp. 1695-1698, 1997.

[Fernald93] Fernald, A. "Approval and disapproval: Infant responsiveness to vocal affect in familiar and unfamiliar languages." Child Development, Vol. 64, pp. 657-674, 1993.

[Jusczyk97] Jusczyk, Peter. The Discovery of Spoken Language. MIT Press. 1997.

[Roy96] Roy, Deb and Alex Pentland. "Automatic Spoken Affect Analysis and Classification". In the Proceedings of the International Conference on Automatic Face and Gesture Recognition, Killington, VT. 1996. http://www.media.mit.edu/~dkroy/papers/html/fg96/fg96.html

[Slaney98] Slaney, Malcolm and Gerald McRoberts. "Baby Ears: A Recognition System for Affective Vocalizations", ICASSP '98. http://www.interval.com/papers/1997-063/

[Sterrer83] Sterrer, L.A., et al. "Acoustic and perceptual indicators of emotional stress". J. Acoust. Soc. Am. 73 (4), April 1983, 1354-1360.

[Vapnik99] Vapnik, Vladimir N. "An Overview of Statistical Learning Theory". IEEE Transactions on Neural Networks, Vol. 10, No. 5, September 1999.