Thrash Metal, Wiki Data, and The Problem of Web Scraping
So I was looking at a list of thrash metal bands the other day, like you do, and the first entry caused me an immediate problem.
It read exactly as follows:
HammerFall^1918
An explanation is due at this point: I am currently conducting a study of musical genre, utilising the EchoNest dataset. The procedure I am following is designed to facilitate the plotting of genre inception and proliferation. In order to do this I generate a list of genres (currently around 1,300), and then I grab a list of artists for each genre, along with the 'start_date' parameter as supplied by the EchoNest system. The data is saved in a '^' separated file to facilitate processing. As an aside, I have used the '^' as a delimiter as it is the first symbol I could find that doesn't appear in any band name within the EchoNest data - I may have to start a musical project using this symbol, just to make my own life difficult, but I digress...
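For anyone wanting to read a file like this, here is a minimal Python sketch. It is not the script used for the study: the function name `parse_artist_lines` is hypothetical, and the one-artist-per-line 'name^year' layout is an assumption based on the entries quoted in this post.

```python
from typing import Iterable, List, Optional, Tuple

DELIMITER = "^"  # the separator described in the text


def parse_artist_lines(lines: Iterable[str]) -> List[Tuple[str, Optional[int]]]:
    """Turn 'name^year' lines into (name, year) pairs.

    Assumed layout: one artist per line, with the year as the last field.
    The year is None when it is missing or not a number.
    """
    records: List[Tuple[str, Optional[int]]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # rpartition splits on the last '^', so the final field is the year
        name, sep, year_text = line.rpartition(DELIMITER)
        if not sep:
            # no delimiter at all: keep the name, leave the year unknown
            records.append((line, None))
            continue
        try:
            year: Optional[int] = int(year_text)
        except ValueError:
            year = None
        records.append((name.strip(), year))
    return records


print(parse_artist_lines(["Rigor Mortis^1961"]))
# [('Rigor Mortis', 1961)]
```

Keeping an unparseable year as None, rather than dropping the line, means a missing date stays visible later on instead of silently vanishing.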
The idea is this: if I can generate a list of artists-within-genres and their dates of formation I can plot the inception dates of genres, and then examine the proliferation of these over time.
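In code terms the idea might look something like the sketch below. It assumes each genre has already been reduced to a list of (artist, start year) pairs. The function name `genre_timeline` and the sample entries are made up for illustration; they are not EchoNest data.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def genre_timeline(
    artists: Iterable[Tuple[str, Optional[int]]]
) -> Tuple[Optional[int], Dict[int, int]]:
    """Return (inception year, cumulative artist count per year) for one genre.

    Inception is taken to be the earliest start year in the list, and
    proliferation is the running total of artists formed up to each year.
    """
    years = [year for _, year in artists if year is not None]
    if not years:
        return None, {}
    per_year = Counter(years)
    cumulative: Dict[int, int] = {}
    running = 0
    for year in sorted(per_year):
        running += per_year[year]
        cumulative[year] = running
    return min(years), cumulative


# Made-up entries, purely to show the shape of the result
example = [("Band A", 1983), ("Band B", 1981), ("Band C", 1983), ("Band D", None)]
print(genre_timeline(example))
# (1981, {1981: 1, 1983: 3})
```

Because inception is simply the minimum start year, a single bad date is enough to drag a genre's apparent birth back by decades.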
HammerFall, the Swedish metal band, obviously didn't start in 1918 - they formed in 1993 as it turns out. The question then is how did this discrepancy occur?
A quick search on EchoNest for 'HammerFall +1918' revealed the problem; EchoNest utilises various sources for start date information, including MusicBrainz, Discogs and Wikipedia, and the Wiki page for 'James Michael' came up as a likely candidate for the confusion.
James Michael produced HammerFall's 2011 album 'Infected'. At the top of his Wiki page there is a line that reads:
'For the Anglo-Australian solicitor and poet (1824-1868), see James Lionel Michael. For the U.S. federal judge (1918-2005), see James Harry Michael Jr..'
I could find no other sources that include HammerFall and 1918, so this is currently my best guess as to the cause of the problem. At least it didn't go for the '1824' (but then again, why didn't it?).
The next entry in the 'Thrash Metal' list (Rigor Mortis^1961) caused my next problem.
The thrash band 'Rigor Mortis' was formed in 1983 (featuring the late, great 'Ministry' guitarist Mike Scaccia, among others) and released their eponymous debut album in 1988. A search for 'rigor mortis 1961', however, returns as its first result the OCLC (Online Computer Library Center) WorldCat record for the lead sheet to the 1961 jazz song 'Rigor Mortis' by Henry Glover. My conclusion, as with the 'HammerFall' entry, is that this has been scraped and added to the dataset as being relevant to the band when it clearly is not. Why (or whether) this data source is used by EchoNest is, at this time, unclear. I failed to find any other 1961/Rigor Mortis references though, so I'm sticking with this theory for now.
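Both of the dates in question sit decades before the years the bands actually formed, which suggests one crude way to catch entries like these: flag any start year that falls well before the genre's median. The sketch below is a rough illustration, not a method used in the study. The name `flag_early_outliers`, the 15-year threshold and the placeholder entries are all assumptions; only the two erroneous dates come from this post.

```python
from statistics import median
from typing import List, Tuple


def flag_early_outliers(
    records: List[Tuple[str, int]], max_gap_years: int = 15
) -> List[Tuple[str, int]]:
    """Return entries whose start year is more than max_gap_years
    earlier than the median start year of the list."""
    years = [year for _, year in records]
    if not years:
        return []
    typical = median(years)
    return [(name, year) for name, year in records if typical - year > max_gap_years]


genre_entries = [
    ("HammerFall", 1918),     # erroneous date discussed in this post
    ("Rigor Mortis", 1961),   # erroneous date discussed in this post
    ("Placeholder A", 1981),  # hypothetical entries standing in for the rest of a list
    ("Placeholder B", 1983),
    ("Placeholder C", 1984),
    ("Placeholder D", 1985),
]
print(flag_early_outliers(genre_entries))
# [('HammerFall', 1918), ('Rigor Mortis', 1961)]
```

A check like this only says that a date looks odd, not what the right date is, and it would just as readily flag a genuine early pioneer, so anything it turns up still needs checking by hand.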
The validity of this type of information needs to be carefully considered as it becomes more and more pervasive in applications and systems. Perhaps the next step in the use of the vast amounts of information on the Web is to ask this: how do we, as researchers, ensure that the information we gain from the Web by automated means is accurate? Scale alone may not be enough; a billion versions of wrong is, after all, still wrong.
Justin Gagen is a Transforming Musicology-funded Ph.D student at Goldsmiths' College working on the Musicology of the Social Media strand of the project.