Situational Awareness from Environmental Sounds

Nitin Sawhney (nitin@media.mit.edu)

Speech Interface Group, MIT Media Lab

Final Project Report for Modeling Adaptive Behavior (MAS 738)

Pattie Maes

June 13, 1997

Abstract

Environmental sounds provide many contextual cues that enable us to recognize important aspects of our surroundings. The goal of this project is to consider techniques that allow machines to extract and classify features from pre-defined classes of sounds in the environment. We present the different phases of the project: capture of environmental audio, pre-processing of audio data, feature extraction using power spectral density and filter-banks, and training and testing via a simple nearest-neighbor algorithm. We discuss some preliminary results and future work on incorporating such techniques on a wearable computer to provide a form of "situational awareness" to the system.

Introduction

Environmental cues related to time, place and level/type of activity, if recognized, can be utilized to automatically present dynamic information to users relevant to their current context. Specifically, users of audio-based wearable systems could be provided with live or pre-recorded audio streams of weather forecasts, traffic information, voice mail or radio news based on the nature of their current environment. Such an application requires several means of indexing the user's context via time, position, and task attributes. Environmental sounds can also be classified to provide yet another important index into the user context. Background sounds in places like offices, classrooms, streets, train stations and cafes can be a rich source for modeling user context. In this paper we focus on a simple classification of five pre-defined classes of environmental sounds, via extraction of several discriminating features. We now describe prior work in the area and the methodology used, and step through the techniques for each phase of the project.

Related Work

Several approaches have been used in the past to distinguish sounds, using multiple features and different classification techniques.

Scheirer and Slaney [5] used a multi-dimensional classification framework, examining 13 features that measure distinct properties of speech and music signals. Some of the successful features include 4 Hz modulation energy, low-energy frame percentage, variance of spectral flux, and a pulse metric. They concluded that not all features are necessary to perform accurate classification, hence a real-time system would gain improved performance by using a limited subset of the best features. Performance is further improved by averaging the results of the frame-by-frame classification over non-overlapping 2.4 second windows. They recorded 80 audio samples of 15 sec duration each from radio stations and used them as test (10%) and training data (90%). A three-way classifier using this feature set to discriminate speech, music, and simultaneous speech and music provided only about 65% accurate performance. They found little difference between the performance of the different classifiers used. This suggests that the topology of the feature space is rather simple, and favors a computationally simple algorithm, such as spatial partitioning, for use in implementations. The MAP Gaussian classifier does a better job of discriminating music from speech and vice versa, whereas the k-d spatial classifier has nearly the same performance on all classes.

In the past, both cluster-based [4][7] and neural net-based [3] approaches have been utilized for some level of sound classification. A cluster analysis on a set of objects can be done by selecting a set of representative objects in the data set. The corresponding clusters can be found by assigning the remaining objects to the nearest representative object (the medoid of the cluster). Yet such techniques (PAM, partitioning around medoids) generally create spherical or elliptical clusters, and are not suited to discovering drawn-out clusters. Recent approaches [4] model a high-dimensional probability density function (PDF) by describing a set of clusters that approximate the PDF. Such a model encodes the most likely transitions of ordered features. The centroid of each cluster, its variances and its relative weight form an estimate of the statistics of the training vectors. For each new sound, the cluster model is compared with those of known templates, and the distances between them give a measure of similarity. This approach may work well for short and highly periodic sounds, yet it is not clear how well it groups complicated background sounds.

Recurrent neural nets could be used to classify sounds via supervised or unsupervised learning. Recurrent nets utilize one or more feedback loops to retain a memory of the non-linear structure of input vectors. If the network is trained to detect certain features as important, a larger number of neurons are utilized to represent that feature, and their synaptic weights are stored in corresponding locations in a layer called a feature map. In a related work [3], Kohonen feature maps were used to represent the similarity between sounds by their distances in the map. The advantage of a neural net approach would be that non-linear mappings in the audio data would be more easily represented. Yet it would be harder to understand how good a model was discovered by the neural net, and to estimate which measured attributes of the sound correlate most strongly with specific outputs. In either approach, a large data set is needed for training, and both approaches are processor intensive.

Such approaches have been primarily applied to discriminating speech from music or partitioning short-duration sounds into classes. We are concerned with sounds recorded in actual environmental settings, where the quality of the microphone, environmental noise and general variance in the sampled data make classification a more difficult and unique task relative to prior approaches (where a pre-labeled repository of well-defined audio samples is utilized). The focus of this work is on distinguishing longer-duration environmental sounds into pre-defined classes, using near-real-time classification techniques. Hence there is a trade-off between simple, holistic features and complex, multi-dimensional features computed on shorter temporal frames of the audio. Computationally intensive extraction and classification techniques are considered relative to simpler yet less robust approaches. A preliminary evaluation of these techniques offers some insight into the design of a sound extraction and classification engine for use in wearable computing.

Methodology

The project proceeded in several phases, each of which was performed with known assumptions and exploration of several alternative techniques. Environmental sounds were recorded via DAT and segmented into training and test data-sets. We experimented with several types of features: RASTA analysis, Power Spectral Density (PSD), and extraction of frequency bands from a filter bank. Finally, two main methods for classification were considered: a recurrent neural network and a simple nearest neighbor classifier.

Figure 1: Overall process for capture, feature extraction and classification of audio data

These techniques will be described in greater detail in the following sections, along with an evaluation of the results.

Audio Capture and Segmentation

A Digital Audio Tape (DAT) recorder was utilized to capture over three hours of environmental sounds from several locations in the Boston/Cambridge area. Specific paths of travel were followed to capture a mix of sounds from traffic, people, subway trains, and general indoor/outdoor noise. All data was hand-labeled, both during the audio capture (high-level labeling while walking and recording audio) and more precisely after the recording sessions. The DAT recorder enabled capture of high quality audio samples (up to 48 kHz) and allowed non-linear access to select clean (noise free) representative samples of each environment for the training set. Hence samples with a great amount of foreground sound were eliminated, whereas samples with consistent and periodic background sounds were retained. Training and test data were kept separate by capturing audio in three different locations, i.e. the MIT campus, Harvard Square and South Station in downtown Boston. The sounds recorded at MIT were primarily used for training, whereas the other audio samples were used for test purposes.

DAT audio was digitized on SGI Indy machines as AIFF files at 16-bit resolution and a sampling rate of 16 kHz. This ensured an appropriate trade-off between sound quality and storage capacity. Captured audio was carefully listened to and re-edited, based on the pre-transcribed labeling, into shorter samples representing specific environments. Audio samples were further edited and initially classified along five main categories: Cafes, Hallway, Outdoors, Subway, Traffic. These categories did not work well during initial training efforts due to the ambiguous nature of their classification, i.e. several similar sounds, such as voices, could easily be found within any two of these classes. Hence to provide better discrimination between classes, the samples were reclassified along: People, Voice, Subway, Traffic, and Other (including outdoor sounds and unclassified samples). For each of the five classes, exactly 10 audio samples with a duration of 15 secs were extracted from the edited audio for each of the training and test sets. All 100 AIFF audio files were then converted to a "raw" format to be easily read for feature extraction.

Feature Extraction

Acoustic aspects of the sounds such as sudden changes in pitch, attacks in the sound, or non-periodicity in the frequency spectrum could indicate foreground sounds, i.e. samples that should be eliminated. In contrast, the sequence, periodicity and co-occurrence of frequency components would give a better indication of consistent background sounds. It is clear that appropriate pre-processing of the sounds would greatly aid in reducing the data and allow extraction of relevant features to improve the results of classification. Due to project time constraints, we did not attempt to do much pre-processing of the data. Three main types of features were considered for analysis:

RASTA

The RelAtive SpecTrAl (RASTA) methodology builds on Perceptual Linear Predictive (PLP) speech analysis and makes it more robust to linear spectral distortions [8], i.e. to steady-state spectral factors such as the frequency response of the communication channel. The short-term absolute spectrum is replaced by a spectral estimate in which each frequency channel is band-pass filtered by a filter with a sharp spectral zero at zero frequency. The new spectral estimate is less sensitive to slow variations in the short-term spectrum. The low-pass portion of the filter helps smooth out some of the fast frame-to-frame spectral changes present in the short-term spectral estimate. The high-pass portion of the band-pass filter helps alleviate the effect of convolutional noise introduced by the channel. Hence RASTA is less sensitive to the choice of microphone or its position relative to the mouth. Yet it also reduces the effect of constant additive background noise, which is the very content we wish to classify. Hence this feature is not ideal for classifying environmental sounds, as it was primarily designed for reducing noise and improving recognition of human speech.

Power Spectral Density

The Power Spectral Density (PSD) describes how the power of a signal is distributed over frequency; here it is estimated using Welch's averaged periodogram method. The signal vector is divided into overlapping sections, each of which is detrended, then windowed with a specified window and zero-padded to length 256. The squared magnitudes of the length-256 DFTs of the sections are averaged to form the PSD. In this project, PSD was utilized as a high-level feature, by computing the PSD of an entire 15 sec test audio sample and comparing it with the PSDs of the training vectors. This feature was useful since it is less computationally intensive and provides an approximate first-pass classification.
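The averaging procedure described above can be sketched in a few lines. This is only an illustration, not the Matlab routine used in the project: it is written in Python with a direct DFT so that it needs no external libraries, and the Hann window, the 50% overlap, the mean-removal detrend and the choice of section length equal to the DFT length (so no zero-padding is needed) are all assumptions made for the example.

```python
import cmath
import math

def welch_psd(x, nfft=256, overlap=128):
    """Average the squared-magnitude DFTs of overlapping, windowed sections.

    Assumptions for this sketch: Hann window, section length == nfft,
    detrending by subtracting the section mean.
    Returns the one-sided estimate (nfft // 2 + 1 bins).
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (nfft - 1)) for n in range(nfft)]
    norm = sum(w * w for w in window)          # window power normalization
    twiddle = [cmath.exp(-2j * math.pi * m / nfft) for m in range(nfft)]
    n_bins = nfft // 2 + 1
    psd = [0.0] * n_bins
    count = 0
    for start in range(0, len(x) - nfft + 1, nfft - overlap):
        section = x[start:start + nfft]
        mean = sum(section) / nfft
        section = [(s - mean) * w for s, w in zip(section, window)]
        for k in range(n_bins):
            acc = sum(s * twiddle[(k * n) % nfft] for n, s in enumerate(section))
            psd[k] += abs(acc) ** 2 / norm
        count += 1
    if count == 0:
        raise ValueError("signal is shorter than one section")
    return [p / count for p in psd]

# Usage: a 1000 Hz tone sampled at 16 kHz; bins are spaced 16000 / 256 = 62.5 Hz apart.
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(1024)]
estimate = welch_psd(tone)
peak_bin = max(range(len(estimate)), key=estimate.__getitem__)
print("peak at", peak_bin * 16000 / 256, "Hz")
```

A direct DFT is slow for long recordings; an FFT routine would be the natural replacement when processing full 15 sec samples.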

Frequency Bands from Filter-Banks

If one is to consider the audio processing done by the human ear, the sound pressure is captured by the eardrum and passed to the cochlea, where a time-frequency transform is done. Hence a representation based on a frequency transform is a reasonable approximation of the filtering done by the human cochlea, and can be produced by several means such as filterbanks.

Roy Patterson has proposed a model of psychoacoustic filtering based on critical bands. Our auditory front-end uses a Gammatone filter bank implemented within the Auditory Toolbox [6] in Matlab. The first step is to compute the filter coefficients for a bank of Gammatone filters, defined by Patterson and Holdsworth for simulating the cochlea (using the MakeERB function in Matlab). Each filter in the filter bank is a fifth order IIR filter with both poles and zeros. We can specify the number of channels in the filter bank, which extend from half the sampling rate down to the lowest frequency specified. We chose to use 21 frequency bands covering a range of 80 Hz to 8000 Hz. The upper limit of 8000 Hz is half the sampling rate of the audio samples used (16 kHz), i.e. the Nyquist frequency. The lower limit of 80 Hz is chosen because little energy is observed below it in most sound textures [4].
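To give a feel for how 21 channels might be laid out between 80 Hz and 8000 Hz, the following Python sketch spaces center frequencies uniformly on an ERB (equivalent rectangular bandwidth) scale. The constants are the Glasberg and Moore values as we recall them being used for this kind of spacing; they, the function names and the exact spacing rule are assumptions of this sketch and should be checked against the toolbox documentation rather than taken as a description of its internals.

```python
import math

EAR_Q = 9.26449   # assumed Glasberg and Moore constant
MIN_BW = 24.7     # assumed minimum filter bandwidth in Hz

def erb_center_frequencies(low_hz, high_hz, n_channels):
    """Center frequencies equally spaced on an ERB scale.

    The list runs from just below high_hz down to exactly low_hz.
    """
    c = EAR_Q * MIN_BW
    step = (math.log(low_hz + c) - math.log(high_hz + c)) / n_channels
    return [-c + math.exp(i * step) * (high_hz + c)
            for i in range(1, n_channels + 1)]

def erb_bandwidth(center_hz):
    """Equivalent rectangular bandwidth (Hz) at a given center frequency."""
    return MIN_BW + center_hz / EAR_Q

# Usage: the 21 bands between 80 Hz and the 8000 Hz Nyquist limit.
for cf in erb_center_frequencies(80.0, 8000.0, 21):
    print(f"{cf:8.1f} Hz  (bandwidth about {erb_bandwidth(cf):6.1f} Hz)")
```

Under this spacing the channels are dense at low frequencies and sparse at high frequencies, which mirrors the critical-band behavior the model is meant to approximate.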

The forward and feedback coefficients produced for the filterbank are used to compute an array of filter outputs for a specified waveform (using the FilterBank function in Matlab). Each channel of the filterbank only allows components within a narrow frequency band to pass through. The Hilbert transform produces a 90-degree phase-shifted version of the narrow-band input signal. Combining the original signal and its Hilbert transform into a complex signal and taking the magnitude gives the energy envelope of the signal. The envelope is much smoother than the original signal, which makes it a domain where it is easier to track energy transitions.
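The envelope computation can be illustrated as follows. This Python sketch builds the complex (analytic) signal in the frequency domain, which is one common way of obtaining the Hilbert transform; it uses a direct DFT purely to stay self-contained, and the function names are hypothetical rather than taken from the project code.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (adequate for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * math.pi * m / n) for m in range(n)]
    out = [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def hilbert_envelope(x):
    """Magnitude of the analytic signal x + j * hilbert(x)."""
    n = len(x)
    spectrum = dft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero the negative frequencies.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    analytic = dft([s * g for s, g in zip(spectrum, gain)], inverse=True)
    return [abs(z) for z in analytic]

# Usage: a carrier whose amplitude varies slowly; the envelope follows
# the slow amplitude variation rather than the fast oscillation.
N = 256
signal = [(1 + 0.5 * math.cos(2 * math.pi * 2 * n / N)) * math.cos(2 * math.pi * 32 * n / N)
          for n in range(N)]
envelope = hilbert_envelope(signal)
print(min(envelope), max(envelope))
```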

For each waveform analyzed, the filterbank output consists of 21 channels of 16000 points each (one second of audio at 16 kHz). These points are sub-sampled by a factor of 100 to reduce the size of the data. Generally such sub-sampling of speech-based audio would be unacceptable, since it would eliminate many of the fast-changing features, yet we are interested only in slow-changing attributes of the environmental sounds in the audio. Each 15 sec audio file is examined on a frame-by-frame basis (where one frame has a 1 sec duration), and the frequency bands are computed for each frame.
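The bookkeeping implied here is simple: a 15 sec file at 16 kHz yields 15 one-second frames of 16000 samples, and keeping every 100th point leaves 160 points per band per frame. A minimal Python sketch, with hypothetical function names and plain decimation (no additional smoothing) assumed:

```python
def split_into_frames(samples, sample_rate=16000, frame_seconds=1.0):
    """Cut a signal into consecutive, non-overlapping frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_seconds)
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def subsample(values, factor=100):
    """Keep every factor-th point of a (slowly varying) band envelope."""
    return values[::factor]

# Usage: a placeholder 15 sec signal at 16 kHz.
audio = [0.0] * (16000 * 15)
frames = split_into_frames(audio)
print(len(frames), "frames;", len(subsample(frames[0])), "points per band after sub-sampling")
```

Plain decimation is reasonable only because the envelopes vary slowly; a faster-changing signal would need low-pass filtering first.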

Classification

Recurrent Neural Network

A Recurrent Neural Network (RNN) was utilized for supervised learning using the RASTA coefficients from training samples. The RNN was trained with 80 hidden layers and a learning rate of 0.003, using 5 output classes specified via target files, over 5000 epochs (iterations). The weights and errors produced for each training run were recorded for evaluation. Test files were then analyzed by the RNN based on the weights from the training data.

Nearest Neighbor

The nearest-neighbor estimator simply places the points of the training set in feature space. New points are classified by examining the local neighborhood of feature space to determine which training point is closest to the test point, and assigning the class of this "nearest neighbor". Here a simple Euclidean distance metric is utilized to compare each test vector with an array of training vectors to find the training point with the minimum distance. For the frequency bands, all 21-D vectors in the training set are compared to each 1 sec frame of a test audio sample. The result of classifying each frame is shown along with its 4 nearest neighbors (closest training points). For an entire audio sample of 15 secs, the class assigned to the largest percentage of frames is used as the overall class of the sample.
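The frame-by-frame procedure can be sketched as below. This is a minimal Python illustration under stated assumptions, not the original Matlab code: the training set is represented as a list of (vector, label) pairs, the function names are hypothetical, and toy 2-D vectors stand in for the 21-D frequency-band vectors.

```python
import math
from collections import Counter

def nearest_neighbors(test_vec, training, k=5):
    """Return the k closest training points as (distance, label) pairs."""
    scored = sorted((math.dist(test_vec, vec), label) for vec, label in training)
    return scored[:k]

def classify_sample(frames, training):
    """Label each frame by its single nearest neighbor, then pick the
    class that wins the largest percentage of frames."""
    votes = Counter(nearest_neighbors(frame, training, k=1)[0][1] for frame in frames)
    total = sum(votes.values())
    percentages = {label: 100.0 * count / total for label, count in votes.items()}
    winner = max(percentages, key=percentages.get)
    return winner, percentages

# Usage with toy data: two training points and five test frames.
training = [((0.0, 0.0), "Other"), ((10.0, 10.0), "Voice")]
frames = [(9.0, 9.0), (1.0, 0.0), (8.0, 10.0), (10.0, 9.0), (9.0, 10.0)]
label, percentages = classify_sample(frames, training)
print(label, percentages)
```

Each frame costs one distance computation per training vector, so the search time grows linearly with the size of the training set.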

Evaluation

The RASTA-RNN approach yielded poor results primarily due to the lack of discriminating features for environmental sounds present in RASTA coefficients (primarily used for speech). The maximum percentage of classification achieved by the RNN for 5 classes was 73.5% on training data itself (audio from Harvard Campus) and 24% on new test data (audio from South Station).

The PSD-NN approach yielded a better high-level classification of 58%. Yet a frame-by-frame classification using PSD yielded a lower rate of 52.53% classification. Individual percentages are as follows:

Other : 73.33

People : 19.33

Subway : 32.00

Traffic : 47.33

Voice : 90.67

Overall Classification Percentage: 52.53

The FilterBank-NN approach yielded a better frame-by-frame classification of 68% overall, and individual class percentages as follows:

Other : 90.00

People : 40.00

Subway : 70.00

Traffic : 40.00

Voice : 100.00

Overall Classification percentage : 68.00
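The overall figures reported above appear consistent with an unweighted mean of the five per-class percentages, which is what one would expect if every class contributes the same number of test items. A quick check in Python, with the values copied from the two lists above (the function name is hypothetical):

```python
def overall_percentage(per_class):
    """Unweighted mean of per-class percentages, rounded to two decimals."""
    return round(sum(per_class.values()) / len(per_class), 2)

psd_results = {"Other": 73.33, "People": 19.33, "Subway": 32.00,
               "Traffic": 47.33, "Voice": 90.67}
filterbank_results = {"Other": 90.00, "People": 40.00, "Subway": 70.00,
                      "Traffic": 40.00, "Voice": 100.00}

print(overall_percentage(psd_results))         # 52.53
print(overall_percentage(filterbank_results))  # 68.0
```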

In addition, we have provided an interface to specify a single audio test file, which can be directly processed and classified on a frame-by-frame basis using the extracted frequency bands. It displays a spectrogram of the specified window of the test file and that of the nearest matching training vector.

Future Work

Spatial Partitioning for the Nearest Neighbor Framework

A spatial partitioning scheme such as k-d spatial classification can be used to conduct a class vote among the k nearest neighbors to a point, considering only those training points in the particular region of space grouped together by a k-d tree partitioning algorithm [5]. This should make the process of determining the closest point in the training set more efficient, and may be faster than true nearest neighbor schemes.
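As a rough illustration of the idea, the Python sketch below builds a k-d tree over labeled training vectors and searches it for the nearest neighbor. Note that this is the standard exact k-d tree search with backtracking, not the region-restricted vote described in [5], and the dictionary-based tree layout and function names are choices made for this example only.

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (vector, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                    # median point splits the region
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return (distance, label, vector) of the training point closest to target."""
    if node is None:
        return best
    vec, label = node["point"]
    dist = math.dist(vec, target)
    if best is None or dist < best[0]:
        best = (dist, label, vec)
    diff = target[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

# Usage with toy 2-D vectors in place of the 21-D band vectors.
training = [((0.0, 0.0), "Other"), ((10.0, 10.0), "Voice"), ((0.0, 9.0), "Subway")]
tree = build_kdtree(training)
print(nearest(tree, (8.0, 9.0)))
```

Whether such a tree actually pays off depends on the dimensionality and size of the training set; with 21 dimensions and a small training set, the saving over a linear scan may be modest.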

Hierarchical Classification

Several ambiguities exist in the sounds implicitly grouped under specific classes. Hence a hierarchical classification approach seems promising, where newer sub-classes are continuously defined as the training data-set gets larger and more diverse. This would enable a finer grain classification of test data and potentially eliminate several misclassification errors.

Contextual Reinforcement of Classification

Evaluation of the auditory perception of environmental sounds with human subjects could reveal the limitations of using the cochlea model for discriminating classes of environmental sounds. Humans classify sounds in their environment based on prior experience and context-dependent knowledge at a specific moment in time. For example, at an intersection one would expect to hear only traffic, even if a sound resembled a train. Hence such contextual cues could be incorporated in the classification analysis done by machines. Cues based on position (via GPS and IR), time of day, and user/domain knowledge can greatly aid in reducing ambiguities in sound classification.

Porting to a wearable system for real-time usage scenarios

The ideas outlined in this paper are worth testing in the context of an actual wearable computing platform. Some of the techniques described for feature extraction and classification could be ported to run efficiently on a wearable platform with continuous audio buffering of environmental sounds. The system could be continuously updated with new training data in different locations and could adaptively build more robust models of granular sound classes from several environments over time.

Conclusions

We have described a process for capturing audio, extracting features, and using simple classification techniques to discriminate pre-defined classes of environmental sound. The overall results indicate that simple high-level features like Power Spectral Density (PSD) provide a rough first-level classification that is less computationally intensive. Frequency bands generated from filterbank analysis of frame-by-frame windows of the audio provide a more robust feature. A simple nearest neighbor classifier performs quite well for such a data set, whereas a Recurrent Neural Network must be investigated further to determine whether it can perform better (earlier results indicated poor convergence). The trade-off between robust and real-time feature extraction and classification is an important issue if such techniques are to be applied in real scenarios of use such as wearable computing.

Acknowledgments

I'd like to thank Alex Westner, Deb Roy, and Tony Jebara for their helpful guidance and insightful comments throughout the project.

References

[1] Bregman, Albert S. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1990.

[2] Ellis, Daniel P. "Prediction-driven Computational Scene Analysis", Ph.D. Thesis in Media Arts and Sciences, MIT. June 1996.

[3] Feiten, B. and S. Gunzel, "Automatic Indexing of a Sound Database using Self-Organizing Neural Nets". Computer Music Journal, 18:3, pp. 53-65, Fall 1994.

[4] Saint-Arnaud, Nicolas, "Classification of Sound Textures". M.S. Thesis in Media Arts and Sciences, MIT. September 1995.

[5] Scheirer, Eric and Malcolm Slaney, "Construction and Evaluation of a Robust Multi-feature Speech/Music Discriminator", Proc. ICASSP-97, Apr 21-24, Munich, Germany, 1997.

[6] Slaney, Malcolm, "Auditory Toolbox: A Matlab Toolbox for Auditory Modeling Work", Apple Advanced Technology Report #45, Apple Computer Inc., Advanced Technology Group.

[7] Wold, E., T. Blum, D. Keislar, and J. Wheaton. "Content-based Classification Search and Retrieval of Audio". IEEE Multimedia Magazine, Fall 1996.

[8] Hermansky, Hynek, N. Morgan, A. Bayya, P. Kohn, "RASTA-PLP Speech Analysis", Technical Report (TR-91-069), International Computer Science Institute, Berkeley, CA., Dec. 1991.

Appendix: Sample Test

Postscript files of Sample Images:

Comparison of 21 Frequency Bands

Comparison of Spectrograms

>> dist_test('Voice','voc_trn11.raw',5)

Loading Training Data ...

Processing Test File: Voice/voc_trn11.raw ...

Extracting 21 Frequency Bands from 5 frames: 1.2.3.4.5.

Testing frame 1 ...

Nearest Neighbors:

1. vector 44 : 1178238 : 0.00

2. vector 42 : 1208434 : 2.56

3. vector 47 : 1229932 : 4.39

4. vector 50 : 1232592 : 4.61

5. vector 10 : 1263136 : 7.21

Class : Voice

Testing frame 2 ...

Nearest Neighbors:

1. vector 10 : 1274004 : 0.00

2. vector 50 : 1280047 : 0.47

3. vector 2 : 1285496 : 0.90

4. vector 46 : 1299290 : 1.98

5. vector 42 : 1307559 : 2.63

Class : Other

Testing frame 3 ...

Nearest Neighbors:

1. vector 50 : 875074 : 0.00

2. vector 2 : 906477 : 3.59

3. vector 42 : 907328 : 3.69

4. vector 10 : 923734 : 5.56

5. vector 46 : 942038 : 7.65

Class : Voice

Testing frame 4 ...

Nearest Neighbors:

1. vector 46 : 951375 : 0.00

2. vector 50 : 957850 : 0.68

3. vector 10 : 1020113 : 7.23

4. vector 2 : 1032137 : 8.49

5. vector 42 : 1032467 : 8.52

Class : Voice

Testing frame 5 ...

Nearest Neighbors:

1. vector 50 : 300540 : 0.00

2. vector 46 : 370967 : 23.43

3. vector 42 : 417099 : 38.78

4. vector 10 : 433945 : 44.39

5. vector 2 : 446633 : 48.61

Class : Voice

Percentage Classification:

Other : 20.00 %

People : 0.00 %

Subway : 0.00 %

Traffic : 0.00 %

Voice : 80.00 %

Class Selected: Voice - 80.00 %