Large-scale corpus analysis of historical electronic music

Photo of CDs, stacked and ready for transferring

CDs awaiting ripping and cataloguing attention

We're in the later stages of our Transforming Musicology mini-project, and audio CDs have been gathering around us. Though we have 1000 pieces in our database, we still have quite a number to rip and catalogue, a task that will inevitably run through the summer. We're seeking a strong cross-section of historical electronic music works. Our holdings in later synth pop and electronic dance music also need to be well rounded, but the biggest challenge is surveying earlier works: cross-referencing electronic music history textbooks and seeking out early recordings in an appropriate format. We also want to ensure good international holdings and good representation of female composers, though the institutional male European/North American composer has a certain unfortunate foothold in this area.

Eventually, audio feature extraction will be used across a 2000-track audio corpus to look at changes in electronic music over a fifty-year period. The nature of the audio feature extraction, and the associated assumptions about musical similarity, are critical to the conclusions. We've run some internal experiments on how 'musically meaningful' the features we'll analyse by computer are, especially by the standards of human music analysts. Later this year, we plan to run a follow-up online experiment to further interrogate the best-performing features (those most conspicuous to human ears).

Trevor Wishart's Imago (2002): similarity matrix based on Euclidean distance between MFCC audio feature vectors over the piece's complete 26-minute duration

As part of the project, we've given a number of presentations: in Oxford to the enclosing Transforming Musicology project, in Leicester and Durham for research seminars, and in Huddersfield to a complementary project called TaCEM. During talks with researchers on that project, we discovered through machine analysis a tripartite structure in Trevor Wishart's Imago (2002) that leapt out of a similarity matrix plot. This illustrates that the machine analysis route can provide results of interest to human analysts deeply immersed in specific works.
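
For readers curious how such a plot is built, here is a minimal sketch of a self-similarity matrix computed from Euclidean distances between feature vectors. It is not the project's actual code: the function name and the toy two-dimensional "frames" are purely illustrative stand-ins for real MFCC vectors, which would have to be extracted from the audio beforehand, and a numerical library would normally replace the plain Python loops. Strictly speaking the matrix holds distances, so low values mean similar frames.

```python
import math

def self_similarity_matrix(frames):
    """Pairwise Euclidean distances between feature frames.

    frames: list of equal-length feature vectors, one per time frame
    (hypothetical stand-ins for MFCC vectors).
    Low values mean the two frames are similar.
    """
    n = len(frames)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(frames[i], frames[j])
            matrix[i][j] = d  # the matrix is symmetric,
            matrix[j][i] = d  # so fill both halves at once
    return matrix

# Toy data: three contrasting "sections" of three frames each.
frames = [[0.0, 0.0]] * 3 + [[5.0, 5.0]] * 3 + [[0.0, 10.0]] * 3
matrix = self_similarity_matrix(frames)
```

Plotted as an image, a matrix like this shows blocks of low distance along the diagonal wherever consecutive frames resemble each other, which is how a sectional structure can become visible at a glance.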

Test audio feature extraction runs have been made over the developing corpus; preliminary results were presented in Oxford, Leicester and Durham, though the analysis remains tentative. Comparison corpora from UbuWeb (476 files) and from the first 120 CD releases of the empreintes DIGITALes electroacoustic record label (919 files) have also been analysed. These corpora will help to provide a balanced and comparative picture of the main mini-project database and its scope. We hope to make a data release this summer which incorporates the metadata for the corpus and audio feature data for the pieces, and to further explore machine learning techniques applied to the data. For instance, unsupervised clustering of works can get away from the imposition of genre labels and look at how historic pieces naturally form associations within a larger body of works.
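
As a rough illustration of what unsupervised clustering means here, the sketch below groups works by their feature vectors without any genre labels. We have not settled on a clustering method, so k-means is used purely as a familiar example; the function name, the choice of two clusters, and the toy per-work feature vectors are all hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Very small k-means sketch (illustrative only).

    points: list of equal-length feature vectors, one per work.
    Initialisation is deterministic: the first k points seed the centroids.
    Returns (labels, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each work joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Toy per-work feature summaries (made-up numbers, two loose groups).
works = [[0.1, 0.2], [5.0, 5.1], [0.0, 0.3], [5.2, 4.9], [0.2, 0.1], [4.9, 5.0]]
labels, centroids = kmeans(works, k=2)
```

The resulting labels carry no names of their own; any interpretation of a cluster (a period, a studio, a technique) would come afterwards, from inspecting which works ended up together.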

Nick Collins is a Reader in Composition at Durham University and the principal investigator on the Transforming Musicology mini-project Large-scale corpus analysis of historical electronic music using MIR tools: Informing an ontology of electronic music and cross-validating content-based methods.