Machine Listening

Julius O. Smith
Associate Professor of Music and Electrical Engineering
Stanford University, CCRMA

Since audio signals are interpreted by the human ear-brain system, that complex perceptual mechanism should be simulated in software for "machine listening". In other words, to perform on par with humans, the computer should hear and understand audio content much as humans do. Analyzing audio accurately involves several fields: electrical engineering (spectrum analysis, filtering, and audio transforms); psychoacoustics (sound perception); cognitive sciences (neuroscience and artificial intelligence); acoustics (physics of sound production); and music (harmony, rhythm, and timbre). Furthermore, audio transformations such as pitch shifting, time stretching, and sound object filtering should be perceptually and musically meaningful. For best results, these transformations require perceptual understanding of spectral models, high-level feature extraction, and sound analysis/synthesis. Finally, structuring and coding the content of an audio file (sound and metadata) stand to benefit from efficient compression schemes, which discard inaudible information in the sound.

Written Requirement
The written requirement for this area will consist of a 24-hour take-home exam to be evaluated by Professor Julius O. Smith.

Reading List
Applications of Digital Signal Processing to Audio and Acoustics, edited by Mark Kahrs and Karlheinz Brandenburg, Kluwer Academic Publishers, 1998.
T. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall Signal Processing Series, Alan V. Oppenheim, Series Editor.
M. Bosi and R. Goldberg, Introduction to Digital Audio Coding: Basic Principles and Audio Coding Standards, (Unpublished Manuscript).
DAFX: Digital Audio Effects, edited by Udo Zölzer, John Wiley & Sons, May 2002.
B. Moore, An Introduction to the Psychology of Hearing, Academic Press, 1997.
Musical Signal Processing, edited by Curtis Roads, Stephen Pope, Aldo Piccialli, and Giovanni De Poli, Swets & Zeitlinger Publishers, 1997.

E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer Verlag, 1999.
J. O. Smith, Techniques for Digital Filter Design & System Identification with Application to the Violin, PhD/EE Dissertation, Stanford University, June 1983.
S. Levine, Audio Representations for Data Compression and Compressed Domain Processing, PhD Dissertation, Stanford University, 1998.
T. Verma, A Perceptually Based Audio Signal Model With Application to Scalable Audio Compression, PhD Dissertation, Stanford University, 2000.
X. Serra, A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition, PhD Dissertation, Stanford University, Oct. 1989.
E. Scheirer, Music-Listening Systems, PhD Dissertation, Massachusetts Institute of Technology, Media Lab, April 2000.
D. Ellis, Prediction-driven computational auditory scene analysis, PhD Dissertation, Massachusetts Institute of Technology, Media Laboratory, April 1996.
M. Casey, Auditory Group Theory: with Applications to Statistical Basis Methods for Structured Audio, Ph.D. Thesis, Massachusetts Institute of Technology, Media Laboratory, February 1998.
P. Smaragdis, Redundancy Reduction for Computational Audition, a Unifying Approach, PhD Dissertation, Massachusetts Institute of Technology, Media Laboratory, May 2001.

D. Robinson, Perceptual Model for Assessment of Coded Audio, PhD Dissertation, University of Essex, Department of Electronic Systems Engineering, March 2002.

Relevant publications:
X. Serra, Musical Sound Modeling with Sinusoids plus Noise, Musical Signal Processing, C. Roads et al., Editors. Swets & Zeitlinger Publishers, 1997.
R. J. McAulay and T. F. Quatieri, Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. on Acoust., Speech and Signal Proc., vol. ASSP-34, pp. 744-754, 1986.
K. Brandenburg, MP3 and AAC Explained, in Proceedings of the AES 17th International Conference, Florence, Italy, 1999.
K. Brandenburg and H. Popp, An introduction to MPEG Layer-3, Fraunhofer Institut für Integrierte Schaltungen (IIS), EBU Technical Review, June 2000.
K. Brandenburg and M. Bosi, Overview of MPEG Audio: Current and Future Standards for Low Bit Rate Audio Coding, J. Audio Eng. Soc., Vol. 45, No. 1/2, pp. 4-21, Jan./Feb. 1997.
E. Scheirer and B. Vercoe, SAOL: The MPEG-4 Structured Audio Orchestra Language, Computer Music Journal 23:2 (Summer 1999), pp. 31-51.
E. Scheirer, The MPEG-4 Structured Audio Standard, Proc. 1998 IEEE ICASSP (invited paper), Seattle, May 1998.
D. Robinson and M. Hawksford, Psychoacoustic Models and Non-linear Human Hearing, Proceedings of the 109th Convention of the Audio Engineering Society, Los Angeles, September 2000.
G. Todd et al., AC-3: Flexible Perceptual Coding for Audio Transmission and Storage, Proceedings of the 96th Convention of the Audio Engineering Society, February 1994.
D. Pan, A Tutorial on MPEG Audio Compression, IEEE Multimedia Journal, Summer 1995.