We train our model with six Maracatu pieces from the band Maracatu Estrela Brilhante. These pieces have different tempi, and include a large number of drum sounds, singing voices, choirs, lyrics, and a variety of complex rhythmic patterns. Best results were found using an embedded 9-segment feature vector and a Gaussian kernel for the SVM. We verify our Maracatu “expert” model both on the training data set (8100 data points) and on a new piece from the same album (1200 data points). The model performs outstandingly on the training data, and does well on the new, untrained data (Figure 5-11). The total computation cost (including listening and modeling) was found to be somewhat significant at the training stage (about 15 minutes on a dual-2.5 GHz Mac G5 for the equivalent of 20 minutes of music), but minimal at the prediction stage (about 15 seconds for a 4-minute song).
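As an illustration only, the sketch below trains such an “expert” with an off-the-shelf SVM using a Gaussian (RBF) kernel. The feature dimensionality, the continuous phase-style labels, and all variable names are assumptions for demonstration, not the exact formulation used in this experiment.

```python
# Hypothetical sketch of the downbeat "expert": an SVM with a Gaussian
# (RBF) kernel trained on embedded 9-segment feature vectors. The data
# below is random placeholder material; dimensions and labels are
# illustrative assumptions only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# 8100 training points, each embedding 9 consecutive segments
# (9 segments x 12 features = 108 dimensions, chosen arbitrarily).
X_train = rng.normal(size=(8100, 9 * 12))
y_train = rng.uniform(0.0, 1.0, size=8100)  # downbeat phase in [0, 1)

model = SVR(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)

# Prediction on a new, untrained piece (e.g., 1200 data points).
X_test = rng.normal(size=(1200, 9 * 12))
phase_estimates = model.predict(X_test)
```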
This experiment demonstrates the feasibility of extracting the downbeat in arbitrarily complex musical structures, through supervised learning, and without requiring beat information. Although we applied it to downbeat extraction, the framework should allow other music information to be learned and predicted, such as beat location, time signature, key, genre, artist, etc. This is left for future work.
Repeating sounds and patterns are widely exploited throughout music. However, although analysis and music information retrieval applications are often concerned with processing speed and music description, they typically ignore the benefits of sound redundancy cancellation. Our method uses unsupervised clustering, allows for a reduction of the data complexity, and enables applications such as compression.
Typical music retrieval applications deal with large databases of audio data. One of the major concerns of these programs is the meaningfulness of the music description, given solely an audio signal. Another concern is the efficiency of searching through a large space of information. With these considerations, some recent techniques for annotating audio include psychoacoustic preprocessing models and/or a collection of frame-based (i.e., 10–20 ms) perceptual audio descriptors. The data is greatly reduced, and the description, ideally, relevant. However, although such annotation is appropriate for sound and timbre, it remains complex and inadequate for describing music.
In this section, two types of clustering algorithms are proposed: nonhierarchical and hierarchical. In nonhierarchical clustering, such as the k-means algorithm, the relationship between clusters is undetermined. Hierarchical clustering, on the other hand, repeatedly links pairs of clusters until every data object is included in the hierarchy. The goal is to group similar segments together to form clusters whose centroid or representative characterizes the group, revealing musical patterns and a certain organization of sounds in time.
K-means clustering is an algorithm used for partitioning (clustering) N data points into K disjoint subsets $S_j$ so as to minimize the sum-of-squares criterion:

$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \left\| x_n - \mu_j \right\|^2$$
where $x_n$ is a vector representing the nth data point and $\mu_j$ is the geometric centroid of the data points in $S_j$. The number of clusters K must be selected at the onset. The data points are assigned at random to initial clusters, and a re-estimation procedure converges to a local, generally non-optimal, minimum. Despite these limitations, and because of its simplicity, k-means clustering is the most popular clustering strategy. An improvement over k-means, called “Spectral Clustering,” consists roughly of a k-means method in the eigenspace, but we have not yet implemented it.
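As a minimal sketch, the standard algorithm that (locally) minimizes this criterion can be written in a few lines; the random initialization and fixed iteration budget below are simplifying assumptions in place of a proper convergence test.

```python
# Minimal k-means sketch: alternates an assignment step and a centroid
# re-estimation step, locally minimizing the sum-of-squares criterion J.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialization: pick K distinct data points as starting centroids.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Re-estimation step: each centroid moves to its cluster mean.
        for j in range(K):
            members = X[labels == j]
            if len(members):
                mu[j] = members.mean(axis=0)
    return labels, mu
```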
We start with the segment metadata as described in section 3.7. Since that MDS space is theoretically normalized and Euclidean (the geometric distance between two points is “equivalent” to their perceptual distance), it is acceptable to use k-means for a first prototype. Perceptually similar segments fall in the same region of the space. An arbitrarily small number of clusters is chosen, depending on the targeted accuracy and compactness. The process is comparable to vector quantization: the smaller the number of clusters, the smaller the lexicon and the stronger the quantization. Figure 5-12 depicts the segment distribution for a short audio excerpt at various segment ratios (defined as the number of retained segments divided by the number of original segments). Redundant segments are naturally clustered together, and can be coded only once. The resynthesis of that excerpt, with 30% of the original segments, is shown in Figure 5-14.
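In practice, the quantization can be driven by a target segment ratio. The sketch below is a hypothetical illustration assuming `features` holds one perceptual feature vector per segment in the MDS space of section 3.7.

```python
# Hypothetical segment quantization at a given segment ratio (number of
# retained segments / number of original segments).
import numpy as np
from sklearn.cluster import KMeans

def quantize_segments(features, segment_ratio, seed=0):
    X = np.asarray(features, dtype=float)
    k = max(1, int(round(segment_ratio * len(X))))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Each segment is replaced by a cluster index: redundant segments
    # share a codeword, and the audio is coded only once per cluster.
    return km.labels_, km.cluster_centers_

# e.g., keep 30% of the original segments, as in Figure 5-14:
# labels, lexicon = quantize_segments(features, segment_ratio=0.30)
```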
One of the main drawbacks of using k-means clustering is that we may not know ahead of time how many clusters we want, or how many would ideally describe the perceptual redundancy of the music: the algorithm does not adapt to the type of data. It therefore makes sense to consider a hierarchical description of segments, organized in clusters that have subclusters, which have subsubclusters, and so on.
Agglomerative hierarchical clustering is a bottom-up procedure that begins with each object as a separate group. These groups are successively combined based on similarity until only one group remains, or a specified termination condition is satisfied. For n objects, n − 1 merges are performed. Agglomerative hierarchical methods produce dendrograms (Figure 5-13), which show the hierarchical relations between objects in the form of a tree.
We can start from a similarity matrix as described in section 4.2.4. We order segment pairs by forming clusters hierarchically, starting from the most similar pairs. At each stage, the method joins together the two clusters that are closest to each other (most similar). Differences between methods arise from the different ways of defining the distance (or similarity) between clusters. The most basic agglomerative model is single linkage, also called nearest neighbor. In single linkage, an object is linked to a cluster if at least one object in the cluster is its closest neighbor. One defect of this distance measure is the creation of unexpectedly elongated clusters, called the “chaining effect.” In complete linkage, on the other hand, two clusters fuse depending on the most distant pair of objects among them. In other words, an object joins a cluster only when its similarity to all the elements in that cluster is equal to or higher than the considered level. Other methods include average linkage clustering, average group linkage, and Ward’s linkage.
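A possible implementation on top of our precomputed similarities is sketched below; converting similarity to distance as 1 − similarity is an assumption, and `scipy` expects a condensed distance matrix.

```python
# Sketch: agglomerative clustering from a segment similarity matrix.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

def build_dendrogram(similarity, method="complete"):
    # Assumed conversion: similarities in [0, 1] become distances.
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    # method="single" is nearest neighbor (prone to chaining);
    # method="complete" fuses clusters by their most distant pair;
    # "average" and "ward" are also available.
    return linkage(condensed, method=method)

# Z = build_dendrogram(similarity)
# dendrogram(Z)  # tree of hierarchical relations, as in Figure 5-13
```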
The main advantages of hierarchical clustering are 1) we can take advantage of our already computed perceptual similarity matrices; 2) the method adapts its number of clusters automatically to the redundancy of the music; and 3) we can choose the level of resolution by defining a similarity threshold. When that threshold is high (fewer clusters), the method leads to a rough quantization of the musical description (Figure 5-13). When it is low enough (more clusters) that it barely represents the just-noticeable difference between segments (a perceptual threshold), the method allows for a reduction of the complexity of the description without altering its perception: redundant segments get clustered, and can be coded only once. This observation leads to a compression application.
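Choosing the resolution then amounts to cutting the dendrogram at a distance threshold, as sketched below; the example value is an arbitrary assumption, standing in for the just-noticeable difference.

```python
# Sketch: flat clusters at a chosen resolution. Z is a linkage matrix
# as built above.
from scipy.cluster.hierarchy import fcluster

def cluster_at_threshold(Z, threshold):
    # Cut the tree where inter-cluster distance exceeds the threshold:
    # a high threshold yields few clusters (rough quantization); a low
    # one clusters only perceptually redundant segments.
    return fcluster(Z, t=threshold, criterion="distance")

# labels = cluster_at_threshold(Z, 0.1)  # 0.1 is illustrative
```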
Compression is the process by which data is reduced into a form that minimizes the space required to store or transmit it. While modern lossy audio coders efficiently exploit the limited perception capacities of human hearing in the frequency domain, they do not take into account the perceptual redundancy of sounds in the time domain. We believe that by canceling such redundancy, we can achieve higher compression rates. In our demonstration, the segment ratio indeed correlates strongly with the compression gain over traditional audio coders.
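As a back-of-the-envelope illustration of that correlation (every figure below is an assumption, not a measurement): storing only the retained segments, plus a small index entry per original segment, scales the payload roughly by the segment ratio.

```python
# Hypothetical size estimate for redundancy-canceled storage, layered
# on top of a conventional coder. All numbers here are assumptions.
def estimated_size(coded_bytes, segment_ratio, n_segments,
                   index_bytes_per_segment=6):
    audio = segment_ratio * coded_bytes           # retained segments only
    index = n_segments * index_bytes_per_segment  # cluster id + location
    return audio + index

# A 4 MB coded song with 800 segments at a 30% segment ratio:
# estimated_size(4_000_000, 0.30, 800)  ->  ~1.2 MB
```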
Perceptual clustering allows us to reduce the audio material to the most perceptually relevant segments, by retaining only one representative (near-centroid) segment per cluster. These segments can be stored along with a list of indexes and locations. Resynthesis of the audio consists of juxtaposing the audio segments from the list at their corresponding locations (Figure 5-14). Note that no cross-fading or interpolation between segments is used at resynthesis.
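A minimal resynthesis sketch follows; the data layout (a list of one-dimensional sample arrays per segment, with k-means labels and centroids as in the earlier sketches) is an assumption.

```python
# Sketch: keep one near-centroid representative per cluster and
# juxtapose its audio at every location where a member of that cluster
# originally occurred. No cross-fading, as in the text.
import numpy as np

def resynthesize(segments, labels, features, centroids):
    labels = np.asarray(labels)
    features = np.asarray(features, dtype=float)
    representative = {}
    for j, c in enumerate(centroids):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue  # empty cluster: no label refers to it
        d = np.linalg.norm(features[members] - c, axis=1)
        representative[j] = members[d.argmin()]
    # Juxtapose representatives in the original segment order.
    return np.concatenate([segments[representative[j]] for j in labels])

# audio_out = resynthesize(segments, labels, features, centroids)
```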
If the threshold is chosen too high, too few clusters may result in musical distortions at resynthesis: the sound quality is fully maintained, but the musical “syntax” may audibly shift from its original form. The ideal threshold is theoretically a constant value across songs; it could be defined through empirical listening tests with human subjects, and is currently set by hand. The clustering algorithm relies on our matrix of segment similarities, as introduced in section 4.4. Using the agglomerative clustering strategy with additional supervised feedback, we can optimize the distance-measure parameters of the dynamic programming algorithm (i.e., parameter h in Figure 4-4, and the edit cost of section 4.3) to minimize the just-noticeable threshold, and to equalize the effect of the algorithm across large varieties of sounds.
Reducing audio information beyond current state-of-the-art perceptual codecs through structural analysis of its musical content is arguably a controversial idea. Purists would certainly question the benefit of cutting some of the original material altogether, especially if the music is entirely performed. There are obvious risks of musical distortion, and the method naturally applies better to certain genres, including electronic music, pop, or rock, where repetition is an inherent part of the aesthetic. Formal experiments could certainly be conducted to measure the entropy of a given piece, and its compressibility across sub-categories.
We believe that, with a truly adaptive strategy and an appropriate perceptually grounded error estimation, the principle has great potential, primarily for devices such as cell phones and PDAs, where bit rate and memory space matter more than sound quality. At the moment, segments are compared and concatenated as raw material; there is no attempt to transform the audio itself. A much more refined system, however, would estimate similarities independently of certain perceptual criteria, such as loudness, duration, aspects of equalization or filtering, and possibly pitch. Resynthesis would then consist of parametrically transforming the retained segment (e.g., amplifying, equalizing, time-stretching, pitch-shifting, etc.) to match its target more closely. This could greatly improve the musical quality and increase the compression rate, while refining the description.
Perceptual coders have already provided us with a valuable strategy for estimating the perceptually relevant audio surface (by discarding what we cannot hear). Describing musical structures at the core of the codec is an attractive concept that may have great signiﬁcance for many higher-level information retrieval applications, including song similarity, genre classiﬁcation, rhythm analysis, transcription tasks, etc.