Redundancy Reduction for Computational Audition, a Unifying Approach.

Paris Smaragdis,
Massachusetts Institute of Technology, Media Laboratory,
May 2001.

Abstract

Computational audition has always been a subject of multiple theories. Unfortunately, very few place audition in the grander scheme of perception, and even fewer facilitate formal and robust definitions as well as efficient implementations. In our work we set forth to address these issues. We present mathematical principles that unify the objectives of lower-level listening functions, in an attempt to formulate a global and plausible theory of computational audition. Using tools to perform redundancy reduction, and adhering to theories of its incorporation in a perceptual framework, we pursue results that support our approach. Our experiments focus on three major auditory functions: preprocessing, grouping, and scene analysis. For auditory preprocessing, we prove that it is possible to evolve cochlear-like filters by adaptation to natural sounds. Following that, and using the same principles as in preprocessing, we present a treatment that collapses the heuristic set of the gestalt auditory grouping rules down to one efficient and formal rule. We successfully apply the same elements once again to form an auditory scene analysis foundation, capable of detection, autonomous feature extraction, and separation of sources in real-world complex scenes. Our treatment was designed to be independent of parameter estimations and data representations specific to the auditory domain. Some of our experiments have been replicated in other domains of perception, providing equally satisfying results, and a potential for defining global ground rules for computational perception, even outside the realm of our five senses.



Chapters:

The document files are either gzipped postscript files or pdf files (the pdf files are not compressed, so that you can use the acrobat plugin without trouble). On some platforms the pdf files make incorrect font substitutions and many equations end up as gibberish. If you run into that problem, use the .ps files instead (actually, I'd rather you use the .ps files to begin with; pdf has been quirky).

Front Matter ps (111k), pdf (31k)

Table of Contents ps (76k), pdf (17k)

Chapter 1. Introduction ps (533k), pdf (216k)

Chapter 2. Auditory Preprocessing and Basis Selection ps (979k), pdf (879k)

Chapter 3. Perceptual Grouping ps (569k), pdf (308k)

Chapter 4. Auditory Scene Analysis ps (884k), pdf (725k)

Chapter 5. In Closing ps (113k), pdf (32k)

Appendix A. Multimodal Examples ps (399k), pdf (198k)

Bibliography ps (433k), pdf (52k)

And here's the whole thesis if you like big files ps (1.9M), pdf (2.3M)



Sound and Movie Examples

Sound examples are in the WAVE format, which seems to be the most widely recognized. Movies are in either AVI or QuickTime (qt) format; between the two you should have no trouble playing them.

Chapter 4 examples

Appendix A examples