Automatically Generating Vocal Social Signals
Benjamin N. Waber and William Stoltzman
MIT Media Laboratory

Status: Completed

In previous work we found that certain speech features correlate very highly with persuasion. In particular, the duration of speech segments is a very good indicator of the persuasive power of a communication. Volume regulation (keeping the voice neither too loud nor too quiet) also correlates highly with persuasiveness. Other factors, such as short speech segments in which the speaker says "um," "like," and so forth, were also found to be correlated with persuasion, but we do not modify these features because they are difficult to insert into speech automatically in a believable fashion.

Our method operates only on speaking regions, since the amount of time between utterances was found to have no effect on the persuasiveness of speech. Using a phase vocoder, we expand or contract the duration of an utterance in the time domain without modifying its spectral characteristics. A phase vocoder computes the short-time Fourier transform (STFT) at fixed time intervals and calculates the frequency changes between successive intervals. Recomputing these frequency changes in the Fourier domain on a different time base and then applying the inverse transform changes the time base of the signal.
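
The time-scaling step can be illustrated with a minimal textbook phase vocoder. This is a sketch under stated assumptions, not the implementation used in this project: the function name time_stretch, the parameters rate, n_fft and hop, the Hann window, and the use of NumPy are all choices made for illustration. It treats its input as a single speaking region; locating those regions is not shown.

```python
import numpy as np

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Change the duration of x by a factor of 1/rate while keeping its pitch.

    rate < 1 lengthens the signal, rate > 1 shortens it.
    All names and defaults here are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    n_orig = len(x)
    window = np.hanning(n_fft)

    # Analysis: windowed STFT frames taken every `hop` samples.
    padded = np.concatenate([x, np.zeros(n_fft)])
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    stft = np.fft.rfft(frames, axis=1)

    # Phase advance each bin would show over one hop at its center frequency.
    omega = 2 * np.pi * hop * np.arange(stft.shape[1]) / n_fft

    # New time base: fractional positions in the analysis frame sequence.
    steps = np.arange(0, n_frames - 1, rate)
    phase = np.angle(stft[0])
    out = np.zeros(len(steps) * hop + n_fft)
    norm = np.zeros_like(out)

    for k, t in enumerate(steps):
        i = int(t)
        frac = t - i
        # Interpolate the magnitude between neighboring analysis frames.
        mag = (1 - frac) * np.abs(stft[i]) + frac * np.abs(stft[i + 1])
        frame = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft) * window
        out[k * hop:k * hop + n_fft] += frame
        norm[k * hop:k * hop + n_fft] += window ** 2

        # Measured phase change between analysis frames, expressed as a
        # deviation from the expected advance and wrapped to [-pi, pi].
        dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        # Accumulate the phase so each bin keeps its estimated frequency.
        phase += omega + dphi

    # Undo the overlap-add window gain and trim to the target length.
    out /= np.maximum(norm, 1e-3)
    return out[:int(round(n_orig / rate))]

# Usage: make a one-second 440 Hz tone last about two seconds.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
slower = time_stretch(tone, rate=0.5)
```

Because the phase of each bin is advanced by its estimated frequency rather than copied from the analysis frames, the output keeps the original pitch even though the frames are laid out on a different time base.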

For volume regulation, we process the speech signal and push its magnitude closer to or farther from the mean, then rescale the resulting signal uniformly so that its maximum volume is the same as before volume regulation was performed.
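
One possible reading of this step is sketched below. The text does not say how magnitude is measured, so the sketch assumes a per-frame RMS envelope; the function name regulate_volume, the parameter alpha, the frame length, and the linear interpolation of the gain are likewise assumptions made for illustration rather than details of the original system. The input is assumed to be a non-empty speaking region.

```python
import numpy as np

def regulate_volume(x, alpha, frame=400):
    """Move per-frame loudness toward or away from its mean.

    alpha < 1 pulls each frame's RMS toward the mean RMS (more even volume),
    alpha > 1 pushes it away, and alpha == 1 leaves the signal unchanged.
    All names and defaults here are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    n_frames = int(np.ceil(len(x) / frame))
    padded = np.concatenate([x, np.zeros(n_frames * frame - len(x))])

    # Short-time magnitude: RMS of each non-overlapping frame.
    rms = np.sqrt(np.mean(padded.reshape(n_frames, frame) ** 2, axis=1))
    mean_rms = rms.mean()

    # Target loudness: scale each frame's distance from the mean by alpha.
    target = np.maximum(mean_rms + alpha * (rms - mean_rms), 0.0)

    # Per-frame gain; frames that are effectively silent are left alone.
    gain = np.where(rms > 1e-8, target / np.maximum(rms, 1e-8), 1.0)

    # Interpolate the gain between frame centers to avoid abrupt steps.
    centers = (np.arange(n_frames) + 0.5) * frame
    y = x * np.interp(np.arange(len(x)), centers, gain)

    # Rescale uniformly so the peak level matches the input's peak level.
    peak_in, peak_out = np.max(np.abs(x)), np.max(np.abs(y))
    if peak_out > 0:
        y *= peak_in / peak_out
    return y

# Usage: even out a signal whose second half is ten times louder.
t = np.arange(4000) / 8000
speech_like = np.concatenate([0.1 * np.sin(2 * np.pi * 200 * t),
                              1.0 * np.sin(2 * np.pi * 200 * t)])
evened = regulate_volume(speech_like, alpha=0.5)
```

The final uniform rescaling is what keeps the maximum volume equal to that of the input, so only the relative loudness of the frames changes.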

With regard to testing, we found that listeners could easily distinguish the original speech from the processed version based solely on sound quality. First passing the original speech signal through our system with a very slight transformation made this task very difficult, so we recommend taking this step during testing.

To demonstrate our methods we created a graphical user interface (GUI) that allows the user to load WAV files and change the persuasiveness of the speech using a slider that ranges from 0 (not persuasive) to 1 (very persuasive). The initial setting of this slider indicates the voicing rate of the original speech as determined by our speech analysis program. The user can then save the resulting WAV file for testing or play back the modified speech for immediate evaluation.


Toward a Social Signaling Framework: Activity and Emphasis in Speech
William Stoltzman Master's Thesis