
Duff metadata and out of date databases

Duff metadata is the thorn in the side of any large mp3 collection; even those that have all been ripped firsthand with fanatical attention to detail will still have their moments. For example, for some unknown reason my rip of Carter the Unstoppable Sex Machine’s 101 Damnations has all the tracks incorrectly named, despite my ripping it myself and using FreeDB to look up the track information. (I realise of course that by owning that album I may deserve bad metadata.) Anything that relies on community-submitted data (which in essence all tagging applications do) will be subject to inaccuracies and change, however careful one is.

Because Sleevenotez is an application that relies on the quality of an mp3's metadata to drive it, our initial intention had been to avoid tackling the problem of the duff stuff too early. Instead we planned to rely on a test user with a music collection of over 9,000 "correctly" tagged mp3s (by "correct" in this case we mean tagged with the MusicBrainz tagger). As Audioscrobbler makes a reasonably big thing of their use of the MusicBrainz data, our assumption was that by correctly tagging our test collection and transmitting the MusicBrainz UFID up to Audioscrobbler we would receive that MBID (or GID as MusicBrainz call it) back, allowing us to query the database by GID and ignore any metadata issues for the moment.

(Before you get really confused... UFID, MBID and GID are all the same thing. UFID is how it appears in the ID3v2 tags, MBID is what Audioscrobbler calls it, while GID is how it's actually referred to in the MusicBrainz data model. From now on we'll always refer to GIDs.)

The first stumbling block we hit was that in fact the Audioscrobbler web services do not return the GID. Even though we send it to them, we do not get it back (despite it being in the XML document tree as provided by the web services). This was a surprise, although by now I really should have learned not to believe the specification without testing it first. Part of me is sure that it used to be there, but it certainly isn't now. Quite what reason they have for not returning it to the user that sent it I don't know, but it certainly causes us a problem at this stage of the project.

Still, every problem is an opportunity in disguise (or something... I believe they weren't the words that I used at the time). One of the key pieces of discography information we want to display is other albums upon which the currently playing track appears. Nothing in the MusicBrainz database identifies that "Ace of Spades" on the original album is in fact the same recording as the one on "100 Greatest Guitar Anthems" [1], so we have to do textual searching of the track table against artist and track name. By not getting the GID back from Audioscrobbler we've been forced to deal with this rather thorny issue upfront, instead of procrastinating as I had intended to do.
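To make the textual search concrete, here is a minimal sketch of looking up albums by artist and track name. It assumes a hypothetical, heavily simplified schema (`artist`, `album` and `track` tables with the column names shown) and uses an in-memory SQLite database so it runs anywhere; it is not the real MusicBrainz data model, and the demo rows exist only for the example.

```python
import sqlite3

# Hypothetical, heavily simplified schema for illustration only;
# it is not the real MusicBrainz data model.
SCHEMA = """
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE album  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE track  (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    artist_id INTEGER NOT NULL REFERENCES artist(id),
    album_id  INTEGER NOT NULL REFERENCES album(id)
);
"""


def demo_connection():
    """Build an in-memory database holding a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO artist (id, name) VALUES (1, 'Motörhead')")
    conn.executemany(
        "INSERT INTO album (id, name) VALUES (?, ?)",
        [(1, "Ace of Spades"), (2, "100 Greatest Guitar Anthems")],
    )
    # Two separate track rows, one per album; only the text links them.
    conn.executemany(
        "INSERT INTO track (name, artist_id, album_id) VALUES (?, ?, ?)",
        [("Ace of Spades", 1, 1), ("Ace of Spades", 1, 2)],
    )
    return conn


def albums_for_track(conn, artist_name, track_name):
    """Return the names of albums carrying a track with this artist and title.

    The comparison is exact apart from ASCII case, so any edit to the
    stored title (added or removed punctuation, say) will defeat it.
    """
    rows = conn.execute(
        """
        SELECT DISTINCT album.name
        FROM track
        JOIN artist ON artist.id = track.artist_id
        JOIN album  ON album.id  = track.album_id
        WHERE lower(artist.name) = lower(?)
          AND lower(track.name)  = lower(?)
        ORDER BY album.name
        """,
        (artist_name, track_name),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `albums_for_track(demo_connection(), "Motörhead", "ace of spades")` returns both album names, because nothing but the text ties the two track rows together.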

So. The next problem we hit is a very strange one. We started seeing weird failures of the system. Anything from "Floored Genius Vol 1: The Best of Julian Cope and the Teardrop Explodes 1979 - 1991" broke the system. Also, "Skating Away (On the Thin Ice of a New Day)" by Jethro Tull apparently didn't appear on any albums. Despite Doug being convinced that this was simply a matter of taste it seemed clear that something was up.

I've spent most of this week getting to the bottom of this and then building a workaround to the problem. It turns out that Audioscrobbler has not updated their copy of the MusicBrainz database for some considerable time [2]. The reason that tracks from "Floored Genius" were breaking the system is that until December 2005 MusicBrainz thought that the album artist was "Julian Cope and the Teardrop Explodes", but in December 2005 someone modified the album to make the album artist the more correct "Various Artists". Audioscrobbler is still returning the GID of the now deleted artist "Julian Cope and the Teardrop Explodes", an artist that hasn't existed in the database for 8 months. And the reason that the Jethro Tull track didn't seem to appear on any albums? Again, someone had edited the track title to remove the brackets, but again Audioscrobbler does not reflect that change as it was made after their last update.

What this has forced us to do is to look at identifying textual signatures of track titles and match against those, instead of attempting a direct match (my original naive implementation, based on my assumption that the data I got back from Audioscrobbler was good). We've had to build new indexes using the excellent tsearch2 package that comes in postgresql-contrib (if you're a Debian/Ubuntu user). I'm still tuning the queries, but it looks like I've got identifying the albums upon which the given track appears down to a very reasonable time, following some judicious PostgreSQL tweaking and a lot of EXPLAIN ANALYZE. I'm bookmarking all my PostgreSQL related stuff up at del.icio.us/offmessage/postgresql and will write up the queries that we're using to access the right track once I've got them quick enough and accurate enough.
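As a rough illustration of what a "textual signature" might look like, here is a small Python sketch. It is a stand-in for the idea only, not the tsearch2 queries we are actually tuning, and the function name `title_signature` is made up for the example. It folds case, strips accents and collapses punctuation so that the bracketed and unbracketed versions of a title compare equal.

```python
import re
import unicodedata


def title_signature(title: str) -> str:
    """Reduce a track title to a case- and punctuation-insensitive signature."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase, turn every run of non-alphanumerics into a space,
    # and rejoin the remaining words with single spaces.
    words = re.sub(r"[^a-z0-9]+", " ", unaccented.lower()).split()
    return " ".join(words)


def same_title(a: str, b: str) -> bool:
    """True when two titles share a signature."""
    return title_signature(a) == title_signature(b)
```

With this, `same_title("Skating Away (On the Thin Ice of a New Day)", "Skating Away on the Thin Ice of a New Day")` is true. The obvious weakness is that titles written entirely in non-Latin scripts collapse to an empty signature, so anything real would need to treat those separately.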

Nothing like the unforeseen to really put you off your stride, eh?

[1] PUIDs may help in the future, but are not under the same licence as the rest of the data making it hard for us to use them.

[2] If you're sceptical you can prove it for yourself by checking out pyscrobbler and running some tests. Try Skating Away on the Thin Ice of a New Day, Safesurfer and Crazy with varyingly correct and incorrect metadata and you'll see what I mean.
