William Oliver, Eric Metois, Chong (John) Yu, and Sharon Daniel
MIT Media Laboratory
Cambridge, MA 02139
This paper discusses the artistic and technical considerations in the development of the Singing Tree, an interface that responds to vocal input with auditory and visual feedback.
The Singing Tree is one of six interfaces used in the Mind Forest (Act I) of the Brain Opera, an interactive opera composed and developed by Tod Machover and the Opera of the Future team at the MIT Media Laboratory. The Brain Opera, based in part on Marvin Minsky's book "Society of Mind," is divided into three parts: the Mind Forest, in which the audience explores and creates music related to the Brain Opera via six novel interfaces; the Internet, in which online participants explore and create music via Java applets; and the Performance, in which three performers use novel interfaces to play written music and introduce audience and Internet contributions to the piece. In "Society of Mind," Minsky draws a metaphor between the human brain and a forest of agents, and herein lies the concept of the Brain Opera's Mind Forest and its interfaces: the Singing Tree, the Marvin Tree, the Rhythm Tree, the Gesture Wall, the Harmonic Driving unit, and the Melody Easel. The participants are the agents who interact with the Brain Opera through the various stations. In the case of the Singing Tree, the participant sings a pitch and, while singing, hears music and watches a video unfold. The goal is to maintain a steady pitch, and the participant's degree of success in achieving this goal is reflected in both the auditory and visual feedback.
There are three Singing Trees in the Mind Forest space, each with a microphone and LCD screen contained in a white hood designed by Ray Kinoshita and Maggie Orth. The hood resembles an ear, and the participant enters this "ear" to sing into the microphone. The hood's height is adjustable, and it provides an enclosed surrounding to reduce the participants' feelings of self-consciousness about public singing.
Each Singing Tree has a unique video stream designed by Sharon Daniel; the three streams show the front view of a human face, the side view of a human face, and a cupped human hand holding a flower bulb. Each stream starts in an inactive state; for example, the human face seen from the front has closed eyes. When the participant starts singing a steady pitch, the video "wakes" and proceeds toward an identifiable goal: for example, an eye opens and one zooms inside the eye to find a dancer spinning, or the flower bulb in the hand blossoms. The mapping of the singer's ability to maintain pitch is direct; the video proceeds toward the goal as long as the singer maintains a steady pitch, and reverses toward the inactive state otherwise.
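This advance-while-steady, rewind-otherwise behavior can be sketched as a simple playback-position update. The function and parameter names below are hypothetical, not taken from the Brain Opera source:

```python
def update_video_position(position, pitch_is_steady, total_frames,
                          advance=2, retreat=1):
    """Move the video toward its goal frame while the pitch is steady;
    otherwise rewind toward the inactive first frame.

    A minimal sketch of the mapping described in the text; frame-step
    sizes and the steadiness test itself are illustrative assumptions.
    """
    if pitch_is_steady:
        position += advance   # video "wakes" and proceeds toward the goal
    else:
        position -= retreat   # video reverses toward the inactive state
    # Clamp to the valid frame range [0, total_frames - 1].
    return max(0, min(position, total_frames - 1))
```

Calling this once per analysis frame makes the video's progress a direct, continuous reflection of how long the singer has held a steady pitch.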
The auditory stream is also a reward-oriented response in which the singer who maintains a steady pitch hears a beautiful, angelic response of bassy strings, arpeggiating woodwinds, and a chorus of voices in harmony with the singer. Deviations from the initial sound that the singer produces result in a gradual increase in dissonance, movement toward brassier and more percussive instruments, and more chaotic rhythms. There are many ways in which a singer can deviate from his or her initial sound, including changes in pitch, vibrato, sung vowel, vocal tone, and loudness. As described in more detail later, various vocal parameters were measured using algorithms written by Eric Metois. These parameters were then used to drive John Yu's sound synthesis engine, "Sharle," using mapping algorithms designed by Will Oliver and coded by John Yu.
From an artistic standpoint, the mappings had to be intuitive and obvious enough to create the impression of sonic gesture and maintain interest, yet subtle and complicated enough to avoid a deterministic impression. While it is difficult to write about an auditory experience, and one should experience the Singing Tree firsthand, many of the mappings can be described in general terms. The amplitude of the singer's voice was mapped primarily to instrument volume and, to a lesser extent, to note density and instrument type. Literal deviations from pitch were measured incrementally as well as with regard to the velocity and direction of change and the acceleration and direction of change. These parameters were mapped primarily to note density, instrument choice, tonal coherence, and rhythm consistency and, to a lesser extent, to volume. Absolute vowel formants proved difficult to measure, but changes in vowel formant were relatively easy to detect and were mapped to the scale on which the musical response was based. The quality of voice was gauged largely through changes in vocal tonality and was mapped to rhythm consistency and instrument choice and, to a lesser extent, to tonal coherence.
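The general shape of these many-to-many mappings can be illustrated with a weighted combination of normalized vocal features. The weights and parameter names below are illustrative assumptions, not the actual values used in the Singing Tree:

```python
def map_voice_to_synthesis(amplitude, pitch_dev, formant_change, tone_change):
    """Map normalized vocal features (each in [0, 1]) to a set of
    synthesis parameters, following the qualitative weightings in the
    text: amplitude drives volume primarily and note density secondarily;
    pitch deviation reduces tonal coherence; tonal change disrupts rhythm
    consistency; formant change selects the underlying scale.

    All weights are hypothetical, chosen only to show the structure.
    """
    clamp = lambda x: max(0.0, min(1.0, x))
    return {
        'volume':             clamp(0.8 * amplitude + 0.2 * pitch_dev),
        'note_density':       clamp(0.6 * pitch_dev + 0.2 * amplitude),
        'tonal_coherence':    clamp(1.0 - (0.7 * pitch_dev + 0.3 * tone_change)),
        'rhythm_consistency': clamp(1.0 - (0.6 * tone_change + 0.4 * pitch_dev)),
        'scale_index':        int(formant_change * 4),  # pick one of e.g. 5 scales
    }
```

Because each synthesis parameter mixes several vocal features, no single gesture produces an isolated, mechanical response, which is one way to keep the mapping from feeling deterministic.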
Other auditory considerations included the speakers and headphones. During the Brain Opera, the singer's voice was routed to the headphones so the singer could hear both the singing voice and the musical response. However, the voice was not sent to the speakers to protect the singer's privacy and encourage those who might otherwise feel embarrassed to sing. Lastly, the Singing Trees always sing. When nobody is singing into the microphone, the tree returns to an incoherent, dissonant, and quiet output meant to resemble a sleeping or inactive brain with random thoughts passing about.
The Singing Tree equipment includes an IBM computer (133 MHz Pentium with 64 MB RAM), a Kurzweil K2500 sampler/synthesizer, a Mackie 1202 audio mixer, an amplifier, an ART preamp, an ART compressor/limiter, an LCD screen, and a microphone. The basic signal path is as follows: the singer's microphone signal goes to the mixing board, from which it splits to the headphones and the computer. The computer analyzes the voice and outputs MIDI to drive the K2500, which sends an audio signal back to the mixer and on to the speakers and headphones. In addition, the computer sends bitmaps of the video to the LCD screen.
The voice enters the computer and is analyzed by Eric Metois' DSP tool kit. This tool kit includes a real-time pitch analyzer that determines the pitch through a time-domain analysis. The amplitude and frequency spectrum are also measured. In addition, a cepstrally smoothed spectral analysis is used to determine the first three vowel formants. For more information on the DSP algorithms used, please refer to Eric Metois' Ph.D. thesis on Pitch Synchronous Embedding Synthesis (Psymbesis), "Musical Sound Information; Musical Gesture and Embedding Synthesis," available on-line through the Brain Opera Website (http://brainop.media.mit.edu). Several alternative methods were considered and tested in MATLAB by Will Oliver, including linear predictive coding using quasi-linearization, Markov modeling, and an energy tracking and separation method. The results, while educational, were all marginal and inaccurate in real time. Eric's cepstral analysis, while unable to reliably determine the exact formant in real time, was very good at determining changes in formant activity. This was perfect for the type of mapping subsequently used.
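The idea behind cepstral smoothing can be shown in a few lines: low-pass "liftering" the real cepstrum removes the fine harmonic structure of the voiced excitation while preserving the broad spectral envelope where formant peaks live. This is a textbook sketch of the technique, not the Metois tool kit itself:

```python
import numpy as np

def cepstrally_smoothed_spectrum(frame, n_coeffs=30):
    """Return a cepstrally smoothed log-magnitude spectrum of one
    audio frame.  Keeping only the low-quefrency cepstral coefficients
    (a standard choice such as ~30 is assumed here) discards harmonic
    ripple and leaves the formant envelope.
    """
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)          # real cepstrum
    cepstrum[n_coeffs:-n_coeffs] = 0.0        # low-pass lifter (keep symmetric ends)
    return np.fft.rfft(cepstrum).real         # smoothed log spectrum
```

Peaks of the returned envelope approximate formant locations; as the paper notes, tracking *changes* in these peaks frame to frame is far more robust in real time than pinning down their exact frequencies.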
John Yu's music synthesis program, "Sharle," is a randomized music generation engine based on melody and rhythm seeds generated through several levels of hierarchy. The original version contains several user-controllable parameters, including cohesion, key, scale, tempo, rhythm consistency, instrument, pitch center, and consonance, to name a few. For the Singing Tree, several of these parameters were controlled through mappings based on the vocal analysis, and several were held constant or near-constant to maintain a consistent environment. For more information on Sharle, please refer to John Yu's master's thesis, "Computer Generated Musical Composition," also available on-line through the Brain Opera Website.
The mappings connecting the vocal analysis parameters to Sharle were based on fuzzy logic, or stochastic set theory. This was a natural way to feed information to Sharle, since Sharle is a randomized music generation engine. In its original form, Sharle used probability assignments, to instruments for example, to probabilistically achieve a desired output through a randomized driver. The mapping algorithms made these probability assignments dynamic and coupled them to several of the vocal analysis parameters at once. Dynamic fuzzy sets give changes and transitions in the music definite trends without making them deterministic. More information on the mappings and the use of fuzzy sets is not yet available.
The Singing Tree was one of six interfaces in the Brain Opera's Mind Forest, which gave participants of varying musical abilities and backgrounds an opportunity to create music in novel ways. A next step in the development of the Singing Tree would be to allow the singer to determine the initial pitch and then sing intervals around that pitch.