Some Audio-related Demos


On this page I have sounds and videos of various projects I’ve worked on in the past.  You can find the technical papers describing all this work on my publications page.  One would hope this page will get updated regularly :)

  Audio Interfaces

Audio editors are pretty lousy: you can’t use a graphical interface to do anything useful when editing sound mixtures.  In this demo we present an audio-driven interface which allows a user to vocalize the sound they want to select, and an automatic process matches that input to the most appropriate sound.  Once the selection is done we can manipulate sounds independently and then throw them back in the mix.  This ties in with a lot of the work on audio separation shown in a later section.

Demo video of user-assisted audio selection

   Source Separation

One of my pet projects is source separation.  In all honesty I can't figure out why one might want to separate a sound, since it is perfectly possible to use the same reasoning to perform many classification and processing operations directly on mixtures.  But, hey, who am I to judge ... That said, it is a fun thing to try, and it certainly has resulted in a lot of neat research and has diverted a lot of outside interest into audio.  I've worked on and off, and tangentially, on this subject for more than 10 years now.  Here are some highlights in reverse chronological order:

Latent Variable Spectral Decompositions (c. 2005 - today)

This approach is actually a family of related statistical models which attempt to decompose time-frequency distributions into low-rank representations.  They are similar to NMF approaches, but they are easy to combine with fancier machine learning to construct some really neat models.  We’ve made convolutive forms, Markovian models, LDA versions, sparse coders, hierarchical structures, etc.  As far as I know they produce state-of-the-art results for separation of monophonic mixtures (20+ dB SIR on 0 dB mixtures).
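To make the "low-rank decomposition of a time-frequency distribution" idea concrete, here is a minimal sketch of one basic member of this model family: a two-dimensional latent variable decomposition fitted with EM. It is an illustration under stated assumptions, not the code behind the demos; the function name `plca`, its parameters, and the use of NumPy are all hypothetical choices made for this example. The input `V` is assumed to be a non-negative frequency-by-time array such as a magnitude spectrogram.

```python
import numpy as np

def plca(V, n_components, n_iter=100, seed=0):
    """Fit V[f, t] ~ sum_z P(z) P(f|z) P(t|z) with EM (illustrative sketch).

    V is a non-negative (freq x time) array treated as a histogram.
    Returns P(z) with shape (Z,), P(f|z) with shape (F, Z), P(t|z) with shape (T, Z).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    eps = 1e-12

    # Random initial distributions; each column is normalized to sum to 1.
    Pz = np.full(n_components, 1.0 / n_components)
    Pf_z = rng.random((n_freq, n_components))
    Pf_z /= Pf_z.sum(axis=0, keepdims=True)
    Pt_z = rng.random((n_time, n_components))
    Pt_z /= Pt_z.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # Current model of the (normalized) time-frequency distribution.
        model = (Pf_z * Pz) @ Pt_z.T + eps
        ratio = V / model

        # E-step and M-step combined: accumulate the data weighted by the
        # posterior P(z|f,t) = P(z) P(f|z) P(t|z) / model[f, t].
        new_f = Pf_z * Pz * (ratio @ Pt_z)      # shape (F, Z)
        new_t = Pt_z * Pz * (ratio.T @ Pf_z)    # shape (T, Z)

        # Renormalize into proper distributions.
        mass = new_f.sum(axis=0)
        Pf_z = new_f / (mass + eps)
        Pt_z = new_t / (new_t.sum(axis=0) + eps)
        Pz = mass / (mass.sum() + eps)

    return Pz, Pf_z, Pt_z
```

The low-rank approximation of the input is then `V.sum() * (Pf_z * Pz) @ Pt_z.T`. The columns of `Pf_z` play the role of spectral building blocks, much like the basis vectors in NMF, and the probabilistic formulation is what makes extensions such as priors or temporal structure straightforward to bolt on.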

Here's an example of removing a soprano from a recording.  The removed voice was pitch shifted and mixed back in to form a (slightly "off") duet:

Original mixture  Extracted soprano  Remix

Here's a more ethnic version of the above.  The remix consists of transgendrifying the singer so that neighboring animals are not alarmed by the high-pitched tones :)

Original mixture  Extracted vocals  Remix

Here's an example recorded from an AIBO robot as it was walking on a wooden floor (a speech recognition nightmare):

Original mixture  Denoised speech

Same thing with the AIBO moving its head around (resulting in motor and ear-flapping noise):

Original mixture  Denoised speech

Here's another "denoising" example.  Note how the “noise” source is somewhat correlated to the music (although not as much as it should!):

Noisy recording  Noise source  Denoised output

Convolutive NMF (c. 2004)

Recently I developed the idea of convolutive non-negative matrix factorization and applied it to speech mixtures.  One could train on a set of speakers and then, when provided with new input, decompose that input into sets of components which best fit each speaker.  Here's an example mixture and the extracted speakers:

Original mixture  Extracted male voice  Extracted female voice
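The train-then-decompose idea can be sketched in a few lines. Note that this sketch swaps in plain (non-convolutive) NMF with the standard KL-divergence multiplicative updates, which is simpler than the convolutive model used for the demo above, and every name in it (`kl_nmf`, `separate`, the parameters) is hypothetical. It assumes magnitude spectrograms as input: bases are learned per speaker from training data, then held fixed while only the activations are fitted to the mixture.

```python
import numpy as np

def kl_nmf(V, n_components=None, W_fixed=None, n_iter=200, seed=0):
    """Plain NMF, V ~ W @ H, using KL-divergence multiplicative updates.

    If W_fixed is given, the bases stay fixed and only H is estimated;
    otherwise n_components must be given and W is learned as well.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12
    learn_W = W_fixed is None
    W = rng.random((V.shape[0], n_components)) + eps if learn_W else W_fixed
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)
        H = H * (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        if learn_W:
            R = V / (W @ H + eps)
            W = W * (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def separate(V_mix, W_a, W_b):
    """Split a mixture spectrogram using pre-trained bases for two sources."""
    W = np.hstack([W_a, W_b])
    _, H = kl_nmf(V_mix, W_fixed=W)
    k = W_a.shape[1]
    part_a = W_a @ H[:k]    # portion of the model explained by source A
    part_b = W_b @ H[k:]    # portion of the model explained by source B
    total = part_a + part_b + 1e-12
    # Soft masks keep the two estimates summing to the observed mixture.
    return V_mix * part_a / total, V_mix * part_b / total
```

In use, one would call `kl_nmf(V_train, n_components=...)` once per speaker to get `W_a` and `W_b`, then pass the mixture spectrogram to `separate`. Turning the masked magnitudes back into audio (for example by reusing the mixture phase) is left out of the sketch.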

Here is an example of denoising (a special case of multi-source separation).  In this case the speaker is known but the background interference is not, and neither is the speaker utterance in the mixture:

Original mixture  Extracted voice

Frequency domain ICA (c. 1995)

My earliest claim to fame came with my master's thesis.  I applied ICA in the time-frequency domain in order to solve convolutive mixing problems fast.  It worked out fine and also spawned a lot of work on the dreaded bin permutation problem!  Here’s a (simple and contrived) example which just sounds neat.  It is played in “slow motion” so that you can hear each frequency band separate at its own pace.

Input mix  Extracted speech

   Sound Recognition for Content Analysis

First Commercial Step

With Ajay Divakaran, Bhiksha Raj and Regu Radhakrishnan we used sound recognition for video content analysis. The first real-world application was sports highlights detection, which is an extraordinarily hard task in the visual domain (hey, if all goals looked the same the game wouldn't be worth it!), but a trivial task in the audio domain. By recognizing key sounds like crowds going wild, clapping, ball hits, speech, music, etc., we can deduce the state of excitement in the video stream. The resulting system works fine on a variety of sports.  This system was initially released running on the Mitsubishi DVR-HE50W personal video recorders and has since been extended to find highlights in all sorts of sports (soccer, basketball, baseball, sumo, etc.).

Video demonstrating the use of audio cues to detect sports highlights

Surveillance Systems

Just as we can detect highlights in the video content analysis project, we can also detect emergencies. This is a demo video where we have a simulated elevator mugging. As you can see there is not much to see!  The elevator is dark, the contrast is lousy, and trying to figure out when someone is being mugged from the visual information is a very hard task. On the other hand, during such cases people scream and move around hitting things, their tone of voice is distressed, and there is plenty of audio commotion that can be detected reliably. On our data set of a few hundred videos we get almost 100% accuracy in detecting muggings.

Video demonstrating how audio cues can help easily identify emergencies

This is the same idea as the above projects, applied to traffic monitoring.  The videos here are from an intersection in Louisville, KY. There are two cameras pointing at a troublesome intersection. Having the cameras on 24 hours a day means that some poor soul has to watch the footage and find the interesting sections which can help improve the design and safety of the intersection. Instead we can record only when specific sounds are detected. The cameras keep a recording buffer of a few seconds; once we recognize sounds like impacts, tire squealing, car horns, etc., we save that buffer and record the next few seconds. This provides us with a before-and-after glimpse of traffic "highlights". The videos in this section show some of the extracted scenes. Two are real accidents, one is a near accident (these are very useful in determining how to improve signage), and one is just one of those out-of-the-ordinary events. Just as in the previous examples, recognition rates are well into the 90% range. Additionally we can track when sirens are around and manipulate the traffic lights appropriately.

Video demonstrating the use of audio cues to detect traffic incidents
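The "keep a short buffer, save it when a sound triggers, then record a bit longer" scheme described above is easy to sketch. This is only an illustration, not the deployed system: the class name `TriggeredRecorder`, its parameters, and the idea of passing in a boolean `triggered` flag (standing in for the output of a sound recognizer) are all assumptions made for the example.

```python
from collections import deque

class TriggeredRecorder:
    """Keeps the last `pre_frames` frames and, on a trigger, saves them
    together with the triggering frame and the next `post_frames` frames."""

    def __init__(self, pre_frames, post_frames):
        self.pre = deque(maxlen=pre_frames)  # rolling "before" buffer
        self.post_frames = post_frames
        self.remaining = 0                   # frames still to record
        self.current = None                  # clip being recorded, if any
        self.clips = []                      # finished clips

    def push(self, frame, triggered):
        """Feed one frame plus the (hypothetical) detector's decision."""
        if self.current is not None:
            # Already recording: append and count down.
            self.current.append(frame)
            self.remaining -= 1
            if triggered:
                # A new trigger while recording extends the clip.
                self.remaining = self.post_frames
        elif triggered:
            # Start a clip from the buffered "before" frames.
            self.current = list(self.pre) + [frame]
            self.remaining = self.post_frames
            self.pre.clear()
        else:
            # Idle: just keep the rolling buffer up to date.
            self.pre.append(frame)

        if self.current is not None and self.remaining <= 0:
            self.clips.append(self.current)
            self.current = None
```

For example, feeding frames 0 through 19 with `pre_frames=3`, `post_frames=2`, and a single trigger on frame 10 yields one clip containing frames 7 through 12. Everything outside such clips is dropped, which is the point: nobody has to sit through the uneventful footage.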

Analysis of Movies

The same ideas can be applied to content analysis of movies using the audio track.  Using this idea you can search for scenes by their representative sounds.  You can search for sections with guns shooting, cars skidding, people talking, dogs barking, or whatever else makes sound.  Just as in the sports system, these cues are very reliable and relatively easy to track compared to their visual counterparts.  With the metadata from this kind of analysis we can divide a movie into sections, cluster scenes or entire movies, and automatically tag movie databases efficiently.  In this demo video the bar at the bottom displays the likelihood of each detected audio class.  These likelihoods can be used to search for various events in a movie.  Note that unlike generic sound recognition methods this one works even when the sounds are mixed together.

Video demonstrating concurrent sound recognition using various movie clips

   Missing Spectral Data and Bandwidth Expansion

I’m also very interested in missing data theory.  In the following examples I automatically fill in the time-frequency gaps using a latent variable model.  Here are two examples, one with a single large gap and one with many distributed gaps.  In both cases the reconstruction was performed using only the data available from the input.


Corrupted Input Sound                Reconstructed Sound


Corrupted Input Sound                Reconstructed Sound

We can also use the same ideas to perform bandwidth expansion with pre-trained models.  In the following example we start with a band-limited Latin-jazz recording with no low- or high-frequency content.  We then train a model of Latin-jazz sounds by recording gibberish from a synthesizer.  The model itself holds enough audio information to help make up the missing frequencies and provide a reasonable expansion.

Original Band-limited Input

Training Example

Recovered Wideband Output

   Source Localization

Here are two videos demonstrating localization.  The first one is a system that runs on a PTZ camera and automatically turns the camera towards the most interesting sounds.  The camera performs both localization and recognition of sounds in order to decide where to look.  It was designed so that upon recognizing a sound it would turn to that sound's most likely direction (e.g. it would look towards the elevators when it heard the elevator bell).  You’ll see me and Jay Thornton compete for the camera's attention.  In this video the camera only tracks voices.

Demo video of audio-assisted camera

In the following video I run around my office as my computer performs multi-source localization.  The video should be self-explanatory.

Demo video of multi-source localization


     Paris Smaragdis, Jan 22, 2010