MS in Media Arts and Sciences at MIT


Perceptual Synthesis Engine:
An Audio-Driven Timbre Generator

This thesis presents a real-time synthesis engine that models and predicts the timbre of acoustic instruments from perceptual features extracted from an audio stream. It describes the modeling sequence: the analysis of natural sounds, the inference step that finds the mapping between control and output parameters, the timbre prediction step, and the sound synthesis. The system enables applications such as cross-synthesis, pitch shifting or compression of acoustic instruments, and timbre morphing between instrument families. It is fully implemented in the Max/MSP environment. The Perceptual Synthesis Engine was developed for the Hyperviolin as a novel, generic, and perceptually meaningful synthesis technique for non-discretely pitched instruments.

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the
Massachusetts Institute of Technology
September 2001


Thesis Supervisor: Tod Machover
Professor of Music and Media
MIT Program in Media Arts and Sciences

Thesis Reader: Joe Paradiso
Principal Research Scientist
MIT Media Laboratory

Thesis Reader: Miller Puckette
Professor of Music
University of California, San Diego

Thesis Reader: Barry Vercoe
Professor of Media Arts and Sciences
MIT Program in Media Arts and Sciences

For more details, download my Master's thesis.


Sound examples for the female singing voice:

Original female singing voice

Watch a movie with the peak extraction of 25 harmonics (7.5 Mb)

Re-synthesized female singing voice with 25 sinusoids

Estimation of three perceptual features: pitch, loudness, and brightness
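
To give a rough idea of what estimating these three features can involve, here is a minimal Python sketch. It is not the thesis's Max/MSP implementation: the function name `perceptual_features` is hypothetical, and the estimators (autocorrelation for pitch, RMS level in dB for loudness, spectral centroid for brightness) are common textbook proxies assumed here for illustration only.

```python
import numpy as np

def perceptual_features(frame, sample_rate):
    """Return rough (pitch_hz, loudness_db, brightness_hz) for one audio frame.

    Illustrative proxies only; the actual estimators of the thesis may differ.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Loudness proxy: RMS level in decibels (not a psychoacoustic model).
    rms = np.sqrt(np.mean(frame ** 2))
    loudness_db = 20.0 * np.log10(rms + 1e-12)

    # Brightness proxy: spectral centroid (magnitude-weighted mean frequency).
    brightness_hz = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-12)

    # Pitch proxy: lag of the strongest autocorrelation peak, searched
    # between 50 Hz and 1 kHz (an assumed range).
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    min_lag = int(sample_rate / 1000.0)
    max_lag = min(int(sample_rate / 50.0), len(ac) - 1)
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    pitch_hz = sample_rate / lag
    return pitch_hz, loudness_db, brightness_hz


# Usage: a 200 Hz sine of amplitude 0.5 sampled at 8 kHz.
sr = 8000
n = np.arange(2048)
test_frame = 0.5 * np.sin(2.0 * np.pi * 200.0 * n / sr)
pitch, loudness, brightness = perceptual_features(test_frame, sr)
```

For a pure sine, all three values are easy to sanity-check: the pitch and the centroid both sit near the sine's frequency, and the RMS level is about 9 dB below full scale.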


Unsupervised learning of the relationship between perceptual features and spectrum.
Only two normalized axes are shown here: pitch (x) and loudness (y).
The Gaussian distributions are depicted in red (height is in black).
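
One way to picture how Gaussian distributions in feature space can drive a spectrum prediction is a mixture-style predictor: each Gaussian "owns" a region of the (pitch, loudness) plane and contributes an output spectrum in proportion to how well it explains the current input. The sketch below is an assumption-laden illustration, not the thesis's exact model: `predict_spectrum`, the diagonal covariances, and the constant per-cluster output vectors are all hypothetical simplifications.

```python
import numpy as np

def predict_spectrum(x, means, variances, weights, outputs):
    """Blend per-cluster output vectors by each cluster's posterior given x.

    x:         (D,) input features, e.g. normalized pitch and loudness
    means:     (K, D) cluster centers in feature space
    variances: (K, D) per-dimension variances (diagonal covariances assumed)
    weights:   (K,) prior cluster weights summing to 1
    outputs:   (K, H) one harmonic-amplitude vector per cluster (assumed)
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    outputs = np.asarray(outputs, dtype=float)

    # Log density of each diagonal Gaussian evaluated at x.
    log_density = -0.5 * np.sum(
        (x - means) ** 2 / variances + np.log(2.0 * np.pi * variances), axis=1
    )
    log_post = np.log(weights) + log_density
    log_post -= np.max(log_post)  # stabilize before exponentiating
    posterior = np.exp(log_post)
    posterior /= np.sum(posterior)

    # Posterior-weighted average of the cluster outputs.
    return posterior @ outputs


# Usage: two clusters in a normalized (pitch, loudness) plane.
means = [[0.2, 0.2], [0.8, 0.8]]
variances = [[0.01, 0.01], [0.01, 0.01]]
weights = [0.5, 0.5]
outputs = [[1.0, 0.5, 0.25], [0.2, 0.9, 0.1]]
near_first = predict_spectrum([0.2, 0.2], means, variances, weights, outputs)
midway = predict_spectrum([0.5, 0.5], means, variances, weights, outputs)
```

An input at a cluster center returns essentially that cluster's spectrum, while an input equidistant from both clusters returns their average, so the prediction varies smoothly as the control features move.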

Predicted female singing voice from 3 perceptual parameters (real time)
Original violin glissando sound
Predicted female singing voice from analysis of the previous violin sound (real time)
Loudness control before prediction (real time)
Brightness control before prediction (real time)
Pitch modulation control before prediction (real time, includes input)
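
Once a pitch and a set of harmonic amplitudes have been predicted, the sound itself can be produced by summing sinusoids. The following Python sketch shows that additive step for a stationary tone; the function name `synthesize_harmonics` and the strictly harmonic, fixed-amplitude partials are assumptions made for illustration, whereas the real-time engine updates its parameters continuously.

```python
import numpy as np

def synthesize_harmonics(pitch_hz, amplitudes, duration_s, sample_rate=44100):
    """Sum sinusoids at integer multiples of pitch_hz (stationary sketch)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    signal = np.zeros_like(t)
    for k, amp in enumerate(amplitudes, start=1):
        freq = k * pitch_hz
        if freq >= sample_rate / 2:
            break  # skip partials at or above Nyquist to avoid aliasing
        signal += amp * np.sin(2.0 * np.pi * freq * t)
    return signal


# Usage: 25 harmonics of a 220 Hz tone with a 1/k amplitude rolloff.
amps = [1.0 / k for k in range(1, 26)]
tone = synthesize_harmonics(220.0, amps, duration_s=0.5)
```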


Residue (noise) is modeled with a 25-coefficient polynomial function.
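
As an illustration of what a 25-coefficient polynomial model of a noise spectrum might look like, the sketch below fits a degree-24 polynomial to a residual magnitude envelope in dB. The page does not say which polynomial basis or which domain (linear or log magnitude) is used, so the Chebyshev basis, the dB scale, and the name `fit_noise_envelope` are assumptions; the synthetic residual is made-up test data, not a measurement.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def fit_noise_envelope(freqs_hz, residual_db, n_coeffs=25):
    """Least-squares polynomial fit of a residual spectrum envelope.

    A polynomial with n_coeffs coefficients has degree n_coeffs - 1.
    The Chebyshev basis is chosen here for numerical stability (assumption).
    """
    return Chebyshev.fit(freqs_hz, residual_db, deg=n_coeffs - 1)


# Usage: a smooth, synthetic residual envelope (illustrative data only).
freqs = np.linspace(0.0, 22050.0, 512)
residual = -60.0 + 20.0 * np.exp(-freqs / 5000.0)
envelope = fit_noise_envelope(freqs, residual)
max_error_db = np.max(np.abs(envelope(freqs) - residual))
```

The fitted object can be evaluated at any frequency to shape filtered noise, so the residue is stored as 25 numbers per frame rather than a full spectrum.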


Sound examples for the violin model:

Original violin (a Stradivarius)
Predicted violin from 3 perceptual parameters (real time, 30 sinusoids only)
Morphing from violin to female singing voice, back to violin, with violin input

Special thanks to Tara Rosenberger Shankar, Youngmoo Kim, Hila Plittman, Joshua Bell, Nyssim Leford and Michael Broxton for their help with data collection.