Computational audition has always been a subject of multiple theories. Unfortunately, very few place audition in the grander scheme of perception, and even fewer facilitate formal and robust definitions as well as efficient implementations. In our work we set out to address these issues. We present mathematical principles that unify the objectives of lower-level listening functions, in an attempt to formulate a global and plausible theory of computational audition. Using tools that perform redundancy reduction, and adhering to theories of its incorporation in a perceptual framework, we pursue results that support our approach. Our experiments focus on three major auditory functions: preprocessing, grouping, and scene analysis. For auditory preprocessing, we prove that it is possible to evolve cochlear-like filters by adaptation to natural sounds. Following that, and using the same principles as in preprocessing, we present a treatment that collapses the heuristic set of gestalt auditory grouping rules down to one efficient and formal rule. We successfully apply the same elements once again to form an auditory scene analysis foundation, capable of detection, autonomous feature extraction, and separation of sources in real-world complex scenes. Our treatment was designed to be independent of parameter estimations and data representations specific to the auditory domain. Some of our experiments have been replicated in other domains of perception, providing equally satisfying results and a potential for defining global ground rules for computational perception, even outside the realm of our five senses.
Front Matter ps (111k), pdf (31k)
Table of Contents ps (76k), pdf (17k)
Chapter 1. Introduction ps (533k), pdf (216k)
Chapter 2. Auditory Preprocessing and Basis Selection ps (979k), pdf (879k)
Chapter 3. Perceptual Grouping ps (569k), pdf (308k)
Chapter 4. Auditory Scene Analysis ps (884k), pdf (725k)
Chapter 5. In Closing ps (113k), pdf (32k)
Appendix A. Multimodal Examples ps (399k), pdf (198k)
Bibliography ps (433k), pdf (52k)
And here's the whole thesis, if you like big files: ps (1.9M), pdf (2.3M)
Sound examples are in the WAVE format, which seems to be the most widely recognized. Movies are in either AVI or QuickTime (qt) format; between the two you should have no trouble playing them.
Chapter 4 examples
Here's the piano part (83k) from section 4.3.2, page 91.
For the "Da da da" sound examples, refer to the second movie's sound examples from Appendix A; it is the same sound scene.
Here's the Billie Holiday excerpt (65k) from section 4.3.4, page 96; its vocal components 1 (65k), 2 (65k), 3 (65k); and the full vocal reconstruction (65k).
Appendix A examples
Here's the second movie (avi 387k, qt 621k) from section A.3, page 110.
Here are the second movie's three extracted visual components: one hand (avi 44k, qt 256k), the other hand (avi 46k, qt 283k), and the constant terms (avi 42k, qt 543k).
Here are some of the extracted auditory movie components: the bass (86k), the clave (86k), the vocals (86k), and the snare drum (86k).