I recently found this paper from Tristan Jehan at MIT on "Creating Music by Listening". It's pretty long, but it's so interesting and ties in so closely with some things I've been thinking about recently that, rather than just posting a link, I thought I'd write up what I understand from it here. Up front I'd like to say that my interest in the paper has nothing to do with the author being my namesake, nor with the fact that he obviously has good musical taste - he gives examples of his technology applied to Herbie Hancock, James Brown and batucadas, amongst others. Speaking of which, there are loads of audio examples of the technology on his website, and I recommend listening to them at the appropriate points in the text.
The author of the paper wants to know if computers can be creative and compose music. Although there are principles and rules in the process of composition today, these are very complex, and it is probably not practical to go down the route of trying to collate, learn and use them. Instead, the approach is to listen to musical examples, build an internal model of them and use that model to synthesize music. The thesis concentrates on "composing new music automatically by recycling a preexisting one". More specifically, it concerns itself with working on polyphonic digital audio at both micro and macro scales. One motivation behind the work is to personalise music: potentially providing listeners with exactly the music they want rather than restricting them to what has gone before.
The architecture proposed in the paper is basically machine analysis -> machine learning -> machine synthesis. The analysis consists of a number of stages: a listening model representing what is heard by humans, the extraction of musical features, an analysis of the time component of the music (particularly repeating patterns) and an analysis of the macro-structure of the music. The fundamental timescales of these processes are of the order of 10ms, 100ms, 1s and several seconds or more respectively. Most music can be seen to have this structure, from beats to patterns to riffs to verses, choruses and movements, and structural hierarchies appear in frequency as well as time - patterns of notes, chord sequences and so on.
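To make that hierarchy of timescales concrete, here is a minimal sketch that frames audio at roughly those resolutions and computes a loudness proxy at the finest level. The numbers, the function names and the RMS-based loudness measure are my own illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def frame_signal(x, sr, frame_ms):
    """Split a mono signal into non-overlapping frames of frame_ms milliseconds."""
    hop = int(sr * frame_ms / 1000)
    n = len(x) // hop
    return x[: n * hop].reshape(n, hop)

sr = 22050
x = np.random.randn(sr * 4)            # stand-in for 4 seconds of audio

frames   = frame_signal(x, sr, 10)     # listening model:  ~10 ms
segments = frame_signal(x, sr, 100)    # sound segments:   ~100 ms
patterns = frame_signal(x, sr, 1000)   # beats / patterns: ~1 s

# Crude loudness proxy per 10 ms frame (RMS level in dB)
loudness = 20 * np.log10(np.sqrt((frames ** 2).mean(axis=1)) + 1e-12)
print(frames.shape, segments.shape, patterns.shape, loudness.shape)
```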
If we examine the initial analysis in more detail, it looks at features of the music such as loudness, timbre, segmentation into small units (e.g. notes, chords, drum sounds), beats, tempo, pitch and harmony (i.e. chords). Various methods are described in the paper for attempting to determine these from the source audio. To determine the structure of the music, the system looks for self-similarities: it segments the music in time, takes a particular section and compares it to all the other sections. The comparison uses the features extracted previously - pitch, rhythm, timbre and so on - and works at a series of hierarchical timescales to find both short-term and longer-term structure.
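As a rough illustration of the self-similarity idea (using librosa as the analysis tool, which is my choice, not the author's), the sketch below extracts beat-synchronous timbre (MFCC) and harmony (chroma) features and compares every beat segment against every other with cosine similarity; repeated sections show up as bright stripes away from the diagonal of S.

```python
import numpy as np
import librosa

# Load any audio file; librosa's bundled example is used here for convenience.
y, sr = librosa.load(librosa.example("nutcracker"))

# Beat-synchronous features: one timbre+harmony vector per detected beat.
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # timbre
chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # harmony
feats = np.vstack([
    librosa.util.sync(mfcc, beats),
    librosa.util.sync(chroma, beats),
]).T                                                     # (n_segments, n_features)

# Cosine self-similarity between every pair of beat segments.
norm = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
S = norm @ norm.T
print(S.shape)    # square matrix; bright off-diagonal stripes mark repeats
```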
Once the analysis has been done on a piece of music, there are a number of possible applications that the paper looks at, including compression, restoration and composition.
Music compression algorithms like MP3 typically compress in the frequency domain - they remove frequencies that are inaudible or less important. By analysing the structure of music, this system could allow compression in the time domain as well: repeated sections are stored once and reused. Interestingly, too much compression of this type distorts the music itself even though the sound quality of each individual segment stays the same. Obviously this would be most effective on repetitive music, as much modern urban and dance music is.
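Here is a toy, assumption-laden sketch of that idea: near-identical segments are stored once in a codebook and repeats become index references. Dropping the similarity threshold too far is the "musical distortion" case - each segment still sounds fine on its own, but the piece's structure gets flattened.

```python
import numpy as np

def compress(segments, threshold=0.95):
    """Return a codebook of unique segments plus a playlist of indices into it."""
    codebook, playlist = [], []
    for seg in segments:
        best, best_sim = None, -1.0
        for i, ref in enumerate(codebook):
            sim = np.dot(seg, ref) / (np.linalg.norm(seg) * np.linalg.norm(ref) + 1e-12)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            playlist.append(best)                 # reuse an existing segment
        else:
            codebook.append(seg)                  # store a genuinely new segment
            playlist.append(len(codebook) - 1)
    return codebook, playlist

rng = np.random.default_rng(0)
loop = rng.standard_normal(2048)
segments = [loop + 0.01 * rng.standard_normal(2048) for _ in range(8)]  # a "repetitive" track
codebook, playlist = compress(segments)
print(f"{len(segments)} segments stored as {len(codebook)} + playlist {playlist}")
```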
One application of this technology would be to restore corrupted digital audio - if a section of the music is missing, it can be replaced with a new section synthesised from the rest of the music. This can be made easier and more efficient by embedding the structural information as metadata in the audio. Again, it works best with repetitive music.
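A hedged sketch of how that might look: per-segment feature descriptors (imagined here as the embedded metadata, so they survive even when the audio itself does not) are used to find the closest surviving segment, which then stands in for the corrupted one. The function and the stand-in data are my own illustration, not the thesis code.

```python
import numpy as np

def restore(segments, features, missing):
    """Replace segments[missing] with the surviving segment whose features are closest."""
    candidates = [i for i in range(len(segments)) if i != missing]
    distances = [np.linalg.norm(features[i] - features[missing]) for i in candidates]
    donor = candidates[int(np.argmin(distances))]
    segments[missing] = segments[donor].copy()
    return donor

rng = np.random.default_rng(1)
segments = [rng.standard_normal(1024) for _ in range(6)]
features = [np.array([i % 2, i // 2], dtype=float) for i in range(6)]  # stand-in metadata
segments[3] = np.zeros(1024)                                           # simulate a dropout
donor = restore(segments, features, missing=3)
print(f"segment 3 rebuilt from segment {donor}")
```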
The composition aspect of this work is based on what has been learnt from existing songs. The simplest application looked at is an automated DJ doing beat matching and cross-fading between tracks; the examples given are pretty impressive, managing transitions between various tempos and various styles of music. Next, the system can take a clip of music, analyse it and then extend it indefinitely without ever seeming to repeat - the author calls these "music textures". Some of these sounded a bit like the mutant offspring of a skipping CD player, while some were so good that it felt a bit like being permanently stuck in the intro of a song! Finally, the paper looks at music cross-synthesis, which creates a kind of musical mash-up: it takes the musical structure of one piece and the sound content of another and synthesises a completely new piece. This is as far as the system goes at the moment, but the author's next aspiration is to automatically create entirely new compositions.
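To give a flavour of cross-synthesis (a deliberately simplified sketch of my own, not the author's method): walk through the segments of piece A, which supplies the structure, and for each one splice in the segment of piece B whose features match best. A real system would also time-stretch and align the spliced audio; that is omitted here.

```python
import numpy as np

def cross_synthesise(a_features, b_features, b_segments):
    """For each segment of A (the structure), pick the most similar-sounding segment of B."""
    out = []
    for fa in a_features:
        dists = [np.linalg.norm(fa - fb) for fb in b_features]
        out.append(b_segments[int(np.argmin(dists))])
    return np.concatenate(out)

rng = np.random.default_rng(2)
b_segments = [rng.standard_normal(512) for _ in range(4)]   # sound content from piece B
b_features = [rng.standard_normal(8) for _ in range(4)]
a_features = [rng.standard_normal(8) for _ in range(6)]     # structure of piece A
mix = cross_synthesise(a_features, b_features, b_segments)
print(mix.shape)                                            # 6 segments * 512 samples
```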
So if that was interesting, the dissertation itself, some presentations and loads of audio examples can be found online.
The author of the paper is also one of the founders of the Echo Nest. Type a music-related word into the box on their site and it starts playing snippets of sound that seem to be related. I've no idea what it is, but based on this paper it could be pretty interesting. (Guardian interview).
Other miscellaneous facts from the paper:
The "tatum" is "the lowest regular perceived pulse train that a listener intuitively infers from the timing of perceived musical events; a time quantum.". Named after the jazz pianist Art Tatum noted for his virtuosity at the keyboard and his long runs of notes.
The James Brown case. Because JB's music is often characterised by a single repeated chord and syncopated rhythms, it can be difficult to extract the downbeat (i.e. the first beat of the bar), which is typically detected by analysing the rhythm and the chord changes.