Creating Music by Listening

by

Tristan Jehan

 

Diplôme d’Ingénieur en Informatique et Télécommunications
IFSIC, Université de Rennes 1, France, 1997

M.S. Media Arts and Sciences
Massachusetts Institute of Technology, 2000


Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


September 2005

© Massachusetts Institute of Technology 2005. All rights reserved.

 

 

 

Author...................................................................................
Program in Media Arts and Sciences
June 17, 2005

Certified by............................................................................
Tod Machover
Professor of Music and Media
Thesis Supervisor

Accepted by...........................................................................
Andrew B. Lippman
Chairman, Departmental Committee on Graduate Students

 


 

Creating Music by Listening
by Tristan Jehan

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning, on June 17, 2005,
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy

Abstract

Machines have the power and potential to make expressive music on their own. This thesis aims to computationally model the process of creating music from the experience of listening to examples. Our unbiased, signal-based solution models the life cycle of listening, composing, and performing, turning the machine into an active musician rather than simply an instrument. We accomplish this through an analysis-synthesis technique that combines perceptual and structural modeling of the musical surface, leading to a minimal data representation.

We introduce a music cognition framework that results from the interaction of psychoacoustically grounded causal listening, a time-lag embedded feature representation, and perceptual similarity clustering. Our bottom-up analysis aims to be generic and uniform, recursively revealing metrical hierarchies and structures of pitch, rhythm, and timbre. Training is suggested for top-down, unbiased supervision, and is demonstrated with the prediction of downbeat. This musical intelligence enables a range of original manipulations, including song alignment, music restoration, cross-synthesis or song morphing, and ultimately the synthesis of original pieces.

 

Thesis supervisor: Tod Machover, D.M.A.
Title: Professor of Music and Media


 

Thesis Committee

Thesis supervisor......................................................................
Tod Machover
Professor of Music and Media
MIT Program in Media Arts and Sciences

Thesis reader.............................................................................
Peter Cariani
Research Assistant Professor of Physiology
Tufts Medical School

Thesis reader.............................................................................
François Pachet
Senior Researcher
Sony Computer Science Laboratory

Thesis reader.............................................................................
Julius O. Smith III
Associate Professor of Music and (by courtesy) Electrical Engineering
CCRMA, Stanford University

Thesis reader.............................................................................
Barry Vercoe
Professor of Media Arts and Sciences
MIT Program in Media Arts and Sciences


 


 

Acknowledgments

It goes without saying that this thesis is a collaborative piece of work. Much as the system presented here draws musical ideas and sounds from multiple song examples, I personally drew ideas, influences, and inspiration from many people, to whom I am very grateful:

My committee: Tod Machover, Peter Cariani, François Pachet, Julius O. Smith III, Barry Vercoe.

My collaborators and friends: Brian, Mary, Hugo, Carla, Cati, Ben, Ali, Anthony, Jean-Julien, Hedlena, Giordano, Stacie, Shelly, Victor, Bernd, Frédo, Joe, Peter, Marc, Sergio, Joe Paradiso, Glorianna Davenport, Sile O’Modhrain, Deb Roy, Alan Oppenheim.

My Media Lab group and friends: Adam, David, Rob, Gili, Mike, Jacqueline, Ariane, Laird.

My friends outside of the Media Lab: Jad, Vincent, Gaby, Erin, Brazilnut, the Wine and Cheese club, 24 Magazine St., 45 Banks St., 1369, Rustica, Anna’s Taqueria.

My family: Micheline, René, Cécile, François, and Co.

 

 


 

 

A good composer does not imitate; he steals.
      – Igor Stravinsky

 

 


 

Contents

1 Introduction
2 Background
 2.1 Symbolic Algorithmic Composition
 2.2 Hybrid MIDI-Audio Instruments
 2.3 Audio Models
 2.4 Music Information Retrieval
 2.5 Framework
  2.5.1 Music analysis/resynthesis
  2.5.2 Description
  2.5.3 Hierarchical description
  2.5.4 Meaningful sound space
  2.5.5 Personalized music synthesis
3 Music Listening
  3.0.6 Anatomy
  3.0.7 Psychoacoustics
 3.1 Auditory Spectrogram
  3.1.1 Spectral representation
  3.1.2 Outer and middle ear
  3.1.3 Frequency warping
  3.1.4 Frequency masking
  3.1.5 Temporal masking
  3.1.6 Putting it all together
 3.2 Loudness
 3.3 Timbre
 3.4 Onset Detection
  3.4.1 Prior approaches
  3.4.2 Perceptually grounded approach
  3.4.3 Tatum grid
 3.5 Beat and Tempo
  3.5.1 Comparative models
  3.5.2 Our approach
 3.6 Pitch and Harmony
 3.7 Perceptual Feature Space
4 Musical Structures
 4.1 Multiple Similarities
 4.2 Related Work
  4.2.1 Hierarchical representations
  4.2.2 Global timbre methods
  4.2.3 Rhythmic similarities
  4.2.4 Self-similarities
 4.3 Dynamic Programming
 4.4 Sound Segment Similarity
 4.5 Beat Analysis
 4.6 Pattern Recognition
  4.6.1 Pattern length
  4.6.2 Heuristic approach to downbeat detection
  4.6.3 Pattern-synchronous similarities
 4.7 Larger Sections
 4.8 Chapter Conclusion
5 Learning Music Signals
 5.1 Machine Learning
  5.1.1 Supervised, unsupervised, and reinforcement learning
  5.1.2 Generative vs. discriminative learning
 5.2 Prediction
  5.2.1 Regression and classification
  5.2.2 State-space forecasting
  5.2.3 Principal component analysis
  5.2.4 Understanding musical structures
  5.2.5 Learning and forecasting musical structures
  5.2.6 Support Vector Machine
 5.3 Downbeat Prediction
  5.3.1 Downbeat training
  5.3.2 The James Brown case
  5.3.3 Inter-song generalization
 5.4 Time-Axis Redundancy Cancellation
  5.4.1 Introduction
  5.4.2 Nonhierarchical k-means clustering
  5.4.3 Agglomerative hierarchical clustering
  5.4.4 Compression
  5.4.5 Discussion
6 Composing with Sounds
 6.1 Automated DJ
  6.1.1 Beat-matching
  6.1.2 Time-scaling
 6.2 Early Synthesis Experiments
  6.2.1 Scrambled Music
  6.2.2 Reversed Music
 6.3 Music Restoration
  6.3.1 With previously known structure
  6.3.2 With no prior knowledge
 6.4 Music Textures
 6.5 Music Cross-Synthesis
 6.6 Putting It All Together
7 Conclusion
 7.1 Summary
 7.2 Discussion
 7.3 Contributions
  7.3.1 Scientific contributions
  7.3.2 Engineering contributions
  7.3.3 Artistic contributions
 7.4 Future Directions
 7.5 Final Remarks
A “Skeleton”
 A.1 Machine Listening
 A.2 Machine Learning
 A.3 Music Synthesis
 A.4 Software
 A.5 Database
Bibliography




 

List of Figures

1-1 Example of paintings by computer program AARON
1-2 Life cycle of the music making paradigm
2-1 Sound analysis/resynthesis paradigm
2-2 Music analysis/resynthesis paradigm
2-3 Machine listening, transformation, and concatenative synthesis
2-4 Analysis framework
2-5 Example of a song decomposition in a tree structure
2-6 Multidimensional scaling perceptual space
3-1 Anatomy of the ear
3-2 Transfer function of the outer and middle ear
3-3 Cochlea and scales
3-4 Bark and ERB scales compared
3-5 Frequency warping examples: noise and pure tone
3-6 Frequency masking example: two pure tones
3-7 Temporal masking schematic
3-8 Temporal masking examples: four sounds
3-9 Perception of rhythm schematic
3-10 Auditory spectrogram: noise, pure tone, sounds, and music
3-11 Timbre and loudness representations on music
3-12 Segmentation of a music example
3-13 Tatum tracking
3-14 Beat tracking
3-15 Chromagram schematic
3-16 Chroma analysis example: four sounds
3-17 Chromagram of a piano scale
3-18 Pitch-content analysis of a chord progression
3-19 Musical metadata extraction
4-1 Similarities in the visual domain
4-2 3D representation of the hierarchical structure of timbre
4-3 Dynamic time warping schematic
4-4 Weight function for timbre similarity of sound segments
4-5 Chord progression score
4-6 Timbre vs. pitch analysis
4-7 Hierarchical self-similarity matrices of timbre
4-8 Pattern length analysis
4-9 Heuristic analysis of downbeat: simple example
4-10 Heuristic analysis of downbeat: real-world example
4-11 Pattern self-similarity matrices of rhythm and pitch
5-1 PCA schematic
5-2 Manifold examples: electronic, funk, jazz music
5-3 Rhythm prediction with CWM and SVM
5-4 SVM classification schematic
5-5 Time-lag embedding example
5-6 PCA reduced time-lag space
5-7 Supervised learning schematic
5-8 Intra-song downbeat prediction
5-9 Causal downbeat prediction schematic
5-10 Typical Maracatu rhythm score notation
5-11 Inter-song downbeat prediction
5-12 Segment distribution demonstration
5-13 Dendrogram and musical path
5-14 Compression example
6-1 Time-scaling schematic
6-2 Beat matching example
6-3 Beat matching schematic
6-4 Scrambled music source example
6-5 Scrambled music result
6-6 Reversed music result
6-7 Fragment-based image completion
6-8 Restoring music schematic
6-9 Segment-based music completion example
6-10 Video textures
6-11 Music texture schematic
6-12 Music texture example (1600%)
6-13 Cross-synthesis schematic
6-14 Photomosaic
6-15 Cross-synthesis example
A-1 Skeleton software screenshot
A-2 Skeleton software architecture