Final class project
Modeling the Noise of Acoustic Instruments


There are many types of noise. Each has a different waveform and spectrum (we say they are colored), and therefore different properties and a different sound.

(Audio examples: white, pink, brown, and gray noise waveforms.)

I'm interested in modeling the noise produced by acoustic instruments.

Why do this?

There are four basic models for sound generation: physical, sample-based, abstract, and spectral.

Physical models may sound very natural and may be musically expressive, but they are usually CPU-expensive and are limited in the range of sounds a given model can generate. Sample-based models sound very natural but are not very expressive from the performer's point of view. Abstract models may be very expressive and computationally cheap, but they hardly sound acoustic.

On the other hand, spectral models may sound very natural, may be very expressive, can generate any type of sound, and are no longer too CPU-expensive. Until recently, the remaining problem was controlling this type of synthesis. Now, non-linear techniques developed at CNMAT (UC Berkeley) [David Wessel, Cyril Drame] and at the Media Lab (MIT) [Bernd Schoner, Chuck Cooper, Chris Douglas and Neil Gershenfeld] have shown that it is possible to control it in real time.

Both techniques can generate the harmonic structure of the sound from perceptual parameters (CNMAT) or from gestures (Media Lab). These techniques are very promising. I believe that one missing element of these models is the additive noise that acoustic instruments usually produce.

More on Analysis/Synthesis

Additive synthesis is a powerful way of generating sound [Spectral Modeling Synthesis by Xavier Serra]. By analysing an 'acoustic' and 'harmonic' instrument, one can extract a series of sinusoids that represent the main frequency content of the sound (Fourier theorem on periodic signals): we call them harmonics. These harmonics have different amplitudes and are more or less equally spaced in frequency. The overall spectral envelope is usually a strong characteristic of the timbre. Each bump in the envelope is called a formant, and formants usually remain stable when the pitch changes.
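
As a minimal sketch of what this resynthesis looks like, here is a Python fragment that sums sinusoids at integer multiples of a fundamental; the fundamental frequency, harmonic amplitudes, and sample rate below are placeholder assumptions, not values from the actual analysis.

```python
import numpy as np

def additive_synthesis(f0, amplitudes, duration, sr=44100):
    """Resynthesize the deterministic (harmonic) part of a sound as a sum
    of sinusoids at integer multiples of the fundamental f0."""
    t = np.arange(int(duration * sr)) / sr
    signal = np.zeros_like(t)
    for k, a_k in enumerate(amplitudes, start=1):
        signal += a_k * np.sin(2 * np.pi * k * f0 * t)
    return signal

# Example: a 440 Hz tone with a few decaying harmonics
harmonic_tone = additive_synthesis(440.0, [1.0, 0.5, 0.25, 0.12], duration=1.0)
```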

By subtracting this deterministic content from the original acoustic sound, we usually find that a lot of noise is left. This residual (unpitched) signal is necessary for representing realistic acoustic sounds, especially at the attack of a new note. It has been demonstrated by [Grey, Wessel, Risset and Mathews] that an important characteristic of timbre is contained in the attack itself.
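
As a sketch, the residual extraction is just a time-domain subtraction once the deterministic part has been resynthesized and aligned with the original recording (the alignment and phase matching, which the real analysis must handle carefully, are assumed here):

```python
import numpy as np

def extract_residual(original, deterministic):
    """Subtract the resynthesized harmonic part from the original sound;
    what remains is the noisy residual studied in this project."""
    n = min(len(original), len(deterministic))
    return original[:n] - deterministic[:n]
```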

Each instrument has its own specific residue: a blowing noise for a flute, a scratchy metallic noise for a violin, a short plucked sound for a guitar, etc. The residue is more or less correlated with pitch and amplitude, in addition to other parameters such as brightness. These parameters change significantly over time, so a linear stochastic process is probably not the optimal representation of what is going on; a non-linear or non-stationary model is probably preferable.

Then what's my project?

My project is to find the best way to describe and generate that residual part of an acoustic sound. As a starting point I'll be looking into some pre-analysed sounds obtained from Rafael A. Irizarry. Using a local harmonic model and a dynamic window size [American Statistical Association PostScript paper], he obtained a pretty convincing separation of the harmonic and additive noise signals.

Here are some examples to start with (audio examples): violin, guitar, and clarinet.

My goal is to make a model of the residual sound and generate it in real time.

First step: Sonogram observation

Here is a preliminary visual analysis (sonogram) of the previous violin sounds. From left to right are displayed the original sound, the deterministic sound, and the residual sound. A spectrum (in red), chosen at a random point in time, is displayed to the right of each sonogram. As you can see, the harmonic structure in the middle spectrum was extracted from the original spectrum on the left. After about 15 harmonics, the energy drops considerably to the noise level and is no longer visible.

On the right is displayed the complex, amplified spectrum of the residue. The white lines in that sonogram are the result of subtracting the second sonogram from the first one. Some residual harmonics are still visible in the mid spectrum at about 5 kHz.
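
For reference, a sonogram like the ones above can be computed with a short-time Fourier transform; this is a generic sketch, and the window size, hop size, and sample rate are assumptions rather than the settings used for these figures:

```python
import numpy as np
from scipy.signal import stft

def sonogram(x, sr=44100, window_size=1024, hop=256):
    """Return time axis, frequency axis and log-magnitude spectrogram (dB)."""
    f, t, Z = stft(x, fs=sr, nperseg=window_size, noverlap=window_size - hop)
    return t, f, 20 * np.log10(np.abs(Z) + 1e-12)
```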

Second step: Database of separated sounds

Another necessary step toward the model I want to build is to perform this "deterministic + noise" separation on a large set of sounds so that a learning algorithm can be trained on them. Since my research is now focused on the violin, I will build a database of violin residue sounds. I first need to implement a separation algorithm. Many thanks to Rafael Irizarry, who kindly sent me his S-Plus code, which I will use to validate my model.

Third step: Perceptual parameter estimation

Since my goal is to drive the final model from sound itself rather than from gestures, I will need real-time perceptual parameter estimates of the input sound (anything acoustic, a voice for instance) as well as of the non-real-time training sounds (the violin). I will first estimate the pitch, the loudness, the noisiness and the brightness. This could be done on the original sound or on the two previously separated sounds; the latter option implies that I'll need to separate the driving input sound in real time.
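
Here is a rough sketch of some of these frame-by-frame estimators (loudness as RMS energy, brightness as the spectral centroid, noisiness as the ratio of residual to total energy). These are common approximations, not necessarily the exact definitions I will end up using, and pitch estimation is left out because it is more involved:

```python
import numpy as np

def loudness_rms(frame):
    """Loudness approximated by the RMS energy of the frame."""
    return np.sqrt(np.mean(frame ** 2))

def brightness_centroid(frame, sr=44100):
    """Brightness approximated by the spectral centroid (in Hz)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

def noisiness(residual_frame, original_frame):
    """Noisiness approximated by the ratio of residual to total energy."""
    return np.sum(residual_frame ** 2) / (np.sum(original_frame ** 2) + 1e-12)
```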

Fourth step: More feeding data

In addition to those parameters, the learning algorithm should know about the state of the loudness curve in order to differentiate the attack of a note from its release, so I'll also estimate the derivative of the waveform envelope: as visible on the amplitude graphs above (in purple), the amount of noise is fairly correlated with the amplitude of the sound, but also with the attack.
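
A sketch of the envelope-derivative estimate follows; the frame and hop sizes are placeholder assumptions. The derivative is positive during the attack and negative during the release:

```python
import numpy as np

def envelope_derivative(x, frame_size=1024, hop=256):
    """Frame-by-frame RMS envelope and its finite-difference derivative."""
    n_frames = 1 + (len(x) - frame_size) // hop
    env = np.array([np.sqrt(np.mean(x[i * hop:i * hop + frame_size] ** 2))
                    for i in range(n_frames)])
    return env, np.diff(env, prepend=env[0])
```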

Fifth step: The noise representation

There are at least two ways to synthesize noise. In the time domain, you can shape white noise with a filter (convolution of white noise with the impulse response of your filter); in that case you need to estimate the coefficients of this filter. Or you can work in the spectral domain, multiply a given spectral envelope with some white-noise distribution, and apply an inverse FFT. I'm still thinking about the best representation to use. My intuition is that the second approach is easier to control, more flexible, and a good enough approximation.
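
A minimal sketch of the second (spectral-domain) approach: a target magnitude envelope is combined with random phases and inverse-FFT'd to produce one frame of shaped noise. In practice, successive frames would be windowed and overlap-added:

```python
import numpy as np

def spectral_noise_frame(envelope):
    """Generate one frame of noise whose magnitude spectrum follows
    `envelope` (one magnitude value per rfft bin)."""
    phases = np.exp(1j * 2 * np.pi * np.random.rand(len(envelope)))
    return np.fft.irfft(envelope * phases)
```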

Yael Maguire and Neil Gershenfeld suggested that I estimate the moment expansion of the distribution (from the characteristic function). Instead of white noise, I can then generate a waveform that has the same time-varying statistical properties as the residue I am studying. For example, the first two moments are the mean and the variance.
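
As a sketch, the first few moments of the residual's sample distribution can be estimated frame by frame; reconstructing the distribution from its characteristic function goes further than this fragment does:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def frame_moments(frame):
    """Mean, variance, skewness and kurtosis of one frame of the residual."""
    return np.mean(frame), np.var(frame), skew(frame), kurtosis(frame)
```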

Using linear least-squares function fitting, I approximated the spectral envelope. In the example below (violin residue), I used a polynomial approximation with 35 coefficients. In green is the original log-magnitude spectrum, in red a low-pass filtered version of it (using an FIR filter), in yellow the polynomial approximation of the filtered spectrum, and in white the polynomial approximation of the original spectrum. The fit is surprisingly good even without any filtering.
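
A sketch of that fit, assuming a 35-coefficient (degree 34) polynomial over the log-magnitude spectrum, with frequencies normalized to [0, 1] for numerical stability; the frame content is assumed to come from the residual analysis:

```python
import numpy as np

def fit_spectral_envelope(frame, n_coeffs=35):
    """Least-squares polynomial fit of the log-magnitude spectrum of a frame."""
    spectrum = 20 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
    freqs = np.linspace(0.0, 1.0, len(spectrum))   # normalized frequency axis
    coeffs = np.polyfit(freqs, spectrum, deg=n_coeffs - 1)
    envelope = np.polyval(coeffs, freqs)
    return coeffs, envelope
```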


Sixth step: The learning algorithm

Different learning techniques studied in class can be used. I'm interested in using neural networks or cluster-weighted modeling to learn how to map the analysis data to the noise-representation data. The training algorithm will take the perceptual and other analysis parameters as input and the noise representation as output.
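
As a stand-in for the neural-network or cluster-weighted models, here is a sketch of the training stage using a small multi-layer perceptron; the input/output shapes and the random placeholder data are assumptions (in the real system, the inputs would be the per-frame analysis parameters and the outputs the per-frame noise-representation coefficients):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: one row per frame of analysis parameters
#    (e.g. pitch, loudness, noisiness, brightness, envelope derivative)
# Y: one row per frame of noise-representation coefficients
#    (e.g. the 35 spectral-envelope coefficients fitted above)
X = np.random.rand(1000, 5)    # placeholder training data
Y = np.random.rand(1000, 35)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
model.fit(X, Y)
predicted_envelopes = model.predict(X)
```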

Seventh step: Validation of the model

First, it is easy to feed the trained algorithm with the sounds used for training and to compare, visualize, and measure the errors between the output of the model and the residue of the original sound. They should sound and look fairly similar. Then new inputs from sounds and noises outside the training set should be tried, and the output again compared with the residue of these new sounds. Finally, other instruments will be tested as input to the model; in every case, the output should sound like a violin. Re-adding the noise to the deterministic part of the sound will also be tested and qualitatively compared to the original sound.
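
One possible objective error measure for these comparisons (an assumption on my part, not a settled choice) is a frame-by-frame log-spectral distance between the synthesized residue and the true residue:

```python
import numpy as np

def log_spectral_distance(frame_a, frame_b):
    """RMS difference between the log-magnitude spectra of two frames (dB)."""
    sa = 20 * np.log10(np.abs(np.fft.rfft(frame_a)) + 1e-12)
    sb = 20 * np.log10(np.abs(np.fft.rfft(frame_b)) + 1e-12)
    return np.sqrt(np.mean((sa - sb) ** 2))
```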

And after that?

Some explorations of the possibilities of this type of analysis/synthesis will include morphing between several sounds, controlling timbre parameters, pitch-shifting, time-stretching, freezing the sound, equalizing, controlling new types of effects, etc.

Deterministic/noise sound separation in real time

So far, I have talked about controlling synthesis/sound processing in real time. If we want to control it with perceptual parameters such as pitch, loudness, noisiness or brightness coming from an instrument, we need to analyze the sound in real time. This raises a new set of technical issues that did not arise when analyzing sounds for training purposes. Moreover, many of the difficulties in the analysis come from the fact that the sound is noisy. Successfully separating the harmonic structure from the noise of the instrument makes the extraction of these parameters much easier and more robust. I am very interested in implementing a real-time version of a separation algorithm.

And there is more

The technique described here can be applied to other music-related problems: sound compression, sound source separation, echo/reverb cancellation, acoustics, music perception, etc.