Traveling Microphones (2020)

In the fall of 2020, I had the opportunity to serve as TA for the course ARTS@ML: Why here? What's next? at the Media Lab. Seeing the interesting things my colleagues in the class had been building, I was inspired to work on a little experiment of my own.

I've always been interested in semi-automated, non-linear explorations of content on Freesound. There's such an amazing range uploaded there, but also just so much of it; it's hard to know where to begin exploring. Previously, I've worked on projects that used visual object detection to retrieve related sounds for soundscape composition, pre-aggregated sounds distributed onto an interactive street-level map, and more. In this case, I built on past work in which I had tried to "sample the world", so to speak. For example, here are kitchens across the planet:

Here's one for horns, where small melodic fragments sometimes emerge:

Following Sound

Given a search term (e.g. "water" or "voice"), I query the Freesound API for sounds tagged with it and recorded (or rather, geotagged as being) near each pair of integer geographic coordinates (i.e. one-degree increments of latitude and longitude). I select at most one sound within a limited radius of each of these points and store its preview.
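As a rough sketch of this step (assuming the Freesound APIv2 text-search endpoint and its Solr-style geofilt filter, with a plain text query standing in for the tag match; the API key, search term, and radius are placeholders):

```python
import requests

API_KEY = "..."  # Freesound API token (placeholder)
SEARCH_URL = "https://freesound.org/apiv2/search/text/"

def sound_near(term, lat, lon, radius_km=50):
    """Return one matching sound geotagged within radius_km of (lat, lon), if any."""
    params = {
        "query": term,
        "filter": f"{{!geofilt sfield=geotag pt={lat},{lon} d={radius_km}}}",
        "fields": "id,name,previews,geotag",
        "page_size": 1,
        "token": API_KEY,
    }
    results = requests.get(SEARCH_URL, params=params).json().get("results", [])
    return results[0] if results else None

# One query per integer (latitude, longitude) pair; in practice this would be
# throttled and cached to respect API rate limits.
previews = {}
for lat in range(-90, 91):
    for lon in range(-180, 180):
        sound = sound_near("water", lat, lon)
        if sound:
            previews[(lat, lon)] = sound["previews"]["preview-hq-mp3"]
```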

To synthesize sequences of sounds, I wanted to produce a path through the audio that allowed for a varied trajectory, but also preserved some parsimony between successive samples. More specifically, I wanted to order the sounds by their content (what they sound like) rather than by any concept (such as their coordinates, tags, etc.).

Getting Started

Obtaining representative segments of the found sounds was an important pre-processing step: concatenating whole files would have produced long pieces, in which each element's individual character would likely have faded from memory before the next had a chance to make any impression.

I used the simplest of all heuristics for auditory salience: peak amplitude. It's quick and easy to seek to the loudest sample and excerpt material from there. This is, of course, hugely noisy and approximate; it really just retrieves something likely to be noticeable, and thus usable. The duration of the excerpt is decided by multiplying a provided base duration (e.g. 0.1 seconds) by a randomly selected low power of two, with the exponent biased toward 0 (yielding 1, i.e. the base duration). This allows a kind of stochastic syncopation when all the segments are sequenced, just to keep things more interesting than a uniform rate would be.
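A minimal sketch of this heuristic, assuming numpy and soundfile; the specific exponent weights below are illustrative rather than the exact values used:

```python
import numpy as np
import soundfile as sf

def excerpt(path, base_duration=0.1, rng=None):
    """Excerpt a short segment starting at the loudest sample of the file."""
    rng = rng or np.random.default_rng()
    y, sr = sf.read(path)
    if y.ndim > 1:
        y = y.mean(axis=1)  # mix down to mono
    peak = int(np.argmax(np.abs(y)))  # crude salience: seek to the loudest sample
    # Duration = base * 2**k, with k biased toward 0 so the base duration is most likely.
    k = rng.choice([0, 1, 2, 3], p=[0.55, 0.25, 0.15, 0.05])
    n_samples = int(base_duration * (2 ** k) * sr)
    return y[peak:peak + n_samples], sr
```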

Sampling Aesthetics

Much of my favorite music relies, in some form or another, on audio sampling, or more generally the manipulation of recorded sound (some examples: Parmegiani, Aaliyah/Timbaland, Noisia, etc.). Sampling constitutes a great variety of recombinative practices.

For the past few years, I’ve really enjoyed music that stretches this concept a little by using tiny segments in large numbers. In a way I think it speaks to aspects of the modern environment: we’re dealing with unprecedented quantities of media and information, increasing summarization and skimming, and impossible juxtapositions of content facilitated by new media platforms (e.g. imagine swiping between videos of ducks swimming and factories operating on Instagram). This brings to mind work like John Oswald’s classic Plunderphonics and Akufen’s microsampling: barely identifiable chunks in rapid succession, not so small as to be rid of any perceived material or causality (e.g. Vaggione’s micromontage and granular work), but not so large that they can independently be attributed to their sources. A student also pointed me to Noah Creshevsky’s hyperrealist music, another really interesting example. I chose this as my starting point: an approach wherein each sample depends on its neighbors to collectively produce narrative and meaning.

Sequencing Sounds

There are a few ways to create a "perceptual path" through a set of audio clips. My initial strategy involved extracting common features, specifically MFCCs, and sorting clips by their Euclidean distance from an anchor. In a second iteration, I re-formulated this as an instance of the classic Traveling Salesman Problem (TSP). The TSP, first posed in 1934 by Hassler Whitney (and later described by Julia Robinson), can be thought of as: given a list of N cities, find the shortest tour that visits each city exactly once and returns to the starting city.
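The first strategy is small enough to sketch directly, assuming librosa is available: each clip is summarized as its mean MFCC vector, and the set is sorted by distance from a chosen anchor clip (the function names here are just illustrative):

```python
import numpy as np
import librosa

def clip_features(path, n_mfcc=13):
    """Summarize a clip as the mean of its MFCC frames."""
    y, sr = librosa.load(path, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def sort_from_anchor(paths, anchor_index=0):
    """Order clips by Euclidean distance from the anchor clip's features."""
    features = np.stack([clip_features(p) for p in paths])
    dists = np.linalg.norm(features - features[anchor_index], axis=1)
    return [paths[i] for i in np.argsort(dists)]
```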

Despite the use of geographic coordinates, our sound "locations" here are actually auditory features. In this case, I used mlrose's implementation for solving the TSP with a genetic algorithm, and a subset of MFCC features. Empirically, I found this consistently worked pretty well and was also approximate enough to sometimes be surprising. Once the sounds are organized, all that remains is to concatenate them. Given an overlap argument (between 0 and 1), the appropriate segments are retrieved, fades are applied on either side, and the segments are overlapped sequentially. Following this synthesis, a CSV file listing each sound's filename, coordinates, and timestamp is produced for visualization.
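A sketch of the ordering and concatenation, assuming per-clip feature vectors (e.g. the mean MFCCs above) and mono numpy arrays at a shared sample rate; the crossfade logic is a simplified stand-in for the actual fade handling:

```python
import numpy as np
import mlrose

def order_by_features(features):
    """Order clips by solving a TSP over pairwise feature distances."""
    n = len(features)
    # All pairwise Euclidean distances, as the (i, j, dist) triples mlrose expects;
    # the small epsilon keeps distances strictly positive.
    distances = [
        (i, j, float(np.linalg.norm(features[i] - features[j])) + 1e-6)
        for i in range(n) for j in range(i + 1, n)
    ]
    fitness = mlrose.TravellingSales(distances=distances)
    problem = mlrose.TSPOpt(length=n, fitness_fn=fitness, maximize=False)
    # Genetic-algorithm search: good enough, and approximate enough to surprise.
    best_state, _ = mlrose.genetic_alg(problem, mutation_prob=0.2,
                                       max_attempts=100, random_state=2)
    return list(best_state)

def concatenate(segments, overlap=0.25):
    """Crossfade successive mono segments, overlapping each by a fraction of its length."""
    out = segments[0].astype(np.float64)
    for seg in segments[1:]:
        seg = seg.astype(np.float64)
        n_fade = max(1, int(len(seg) * overlap))
        n_fade = min(n_fade, len(out), len(seg))
        fade = np.linspace(0.0, 1.0, n_fade)
        out[-n_fade:] *= 1.0 - fade   # fade out the tail of what we have so far
        seg[:n_fade] *= fade          # fade in the head of the next segment
        out = np.concatenate([out[:-n_fade], out[-n_fade:] + seg[:n_fade], seg[n_fade:]])
    return out
```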

Visualizing Distributions

Interestingly, it seems there aren't any particularly inspired tools for this kind of “kinetic cartography” (time-varying maps with video output). I followed a seemingly common procedure, which is to use cartopy to write out a sequence of individual frames, and then apply the ffmpeg concat filter to turn them into a video. The ffmpeg command also takes the sequenced audio as an input, and that's it!
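A bare-bones version of the frame-writing step, assuming matplotlib and cartopy, with the points coming from the CSV above; the filenames and the ffmpeg invocation (shown here in its concat-demuxer variant) are illustrative:

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

def write_frame(points, index, out_dir="frames"):
    """Plot the sounds visited so far on a world map and save a single frame."""
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.set_global()
    ax.coastlines()
    lons, lats = zip(*points)
    ax.scatter(lons, lats, s=12, transform=ccrs.PlateCarree())
    plt.savefig(f"{out_dir}/frame_{index:04d}.png", dpi=150)
    plt.close()

# The frames are then joined with the sequenced audio; frames.txt lists each
# image with a duration taken from the CSV timestamps:
#   ffmpeg -f concat -safe 0 -i frames.txt -i sequence.wav \
#          -c:v libx264 -pix_fmt yuv420p -shortest traveling_microphones.mp4
```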

Here's just one more example with the search term "music", and a little quicker:

Several more examples here.