Duff metadata is the thorn in the side of any large mp3 collection; even
those that have all been ripped firsthand with fanatical attention to detail will
still have their moments. For example, for some unknown reason my rip of
Carter the Unstoppable Sex Machine’s 101 Damnations has all the tracks
incorrectly named, despite me ripping it myself and using FreeDB to look up the
track information. (I realise, of course, that by owning that album I may
deserve bad metadata.) Anything that relies on community-submitted data (which
in essence all tagging applications do) will be subject to inaccuracies and
change, however careful one is.
Because Sleevenotez is an application that relies on the quality of an
mp3's metadata to drive it, our initial intention had been to avoid tackling the
problem of the duff stuff too early. Instead we planned to rely on a test user
with a music collection of over 9,000 "correctly" tagged mp3s (by "correct" in
this case we mean tagged with the MusicBrainz tagger). As Audioscrobbler makes
quite a thing of their use of the MusicBrainz data, our assumption was that by
correctly tagging our test collection and transmitting the MusicBrainz UFID up
to Audioscrobbler we would receive that MBID (or GID as MusicBrainz call it)
back, allowing us to query the database by GID and ignore any metadata issues
for the moment.
(Before you get really confused... UFID, MBID and GID are all the same thing.
UFID is how it appears in the ID3v2 tags, MBID is what Audioscrobbler calls it,
while GID is how it's actually referred to in the MusicBrainz data model. From
now on we'll always refer to GIDs.)
The first stumbling block we hit was that in fact the Audioscrobbler web
services do not return the GID. Even though we send it to them, we do not get
it back (despite it being in the XML document tree as provided by the web
services). This was a surprise, although by now
I really should have learned not to believe the specification without
testing it first. Part of me is sure that it used to be there, but it
certainly isn't now. Quite what reason they have for not returning it to the
user that sent it I don't know, but it certainly causes us a problem at this
stage of the project.
Still, every problem is an opportunity in disguise (or something... I believe
they weren't the words that I used at the time). One of the key pieces of
discography information we want to display is other albums upon which the
currently playing track appears. Nothing in the MusicBrainz database
identifies that "Ace of Spades" on the original album is in fact the same
recording as the one on "100 Greatest Guitar Anthems" [1], so we have to do
textual searching of the track table against artist and track name. By not
getting the GID back from
Audioscrobbler we've been forced to deal with this rather thorny issue upfront,
instead of procrastinating as I had intended to do.
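A minimal sketch of what that textual search amounts to, in Python rather than SQL and with an in-memory list standing in for the real track table. The `Track` tuple, the `albums_containing` helper and the sample rows are all made up for illustration; they are not the MusicBrainz schema or the Sleevenotez code.

```python
from collections import namedtuple

# Hypothetical, heavily simplified stand-in for a row in the track table.
Track = namedtuple("Track", ["artist", "title", "album"])


def albums_containing(tracks, artist, title):
    """Return the albums holding a track with this artist and title.

    The comparison ignores case but is otherwise an exact text match.
    """
    wanted = (artist.casefold(), title.casefold())
    return sorted(
        {
            t.album
            for t in tracks
            if (t.artist.casefold(), t.title.casefold()) == wanted
        }
    )


# Illustrative sample rows only.
tracks = [
    Track("Motörhead", "Ace of Spades", "Ace of Spades"),
    Track("Motörhead", "Ace of Spades", "100 Greatest Guitar Anthems"),
    Track("Motörhead", "Overkill", "Overkill"),
]

print(albums_containing(tracks, "motörhead", "ace of spades"))
```

With these sample rows the lookup finds both the original album and the compilation. A comparison this strict only works when the incoming artist and title agree with the database text character for character, case aside.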
So. The next problem we hit is a very strange one. We started seeing weird
failures of the system. Anything from "Floored Genius Vol 1: The Best of
Julian Cope and the Teardrop Explodes 1979 - 1991" broke the system. Also,
"Skating Away (On the Thin Ice of a New Day)" by Jethro Tull apparently didn't
appear on any albums. Despite Doug being convinced that this was simply a
matter of taste, it seemed clear that something was up.
I've spent most of this week getting to the bottom of this and then building
a workaround to the problem. It turns out that Audioscrobbler has not updated
their copy of the MusicBrainz database for some considerable time [2]. The
reason that tracks from "Floored Genius" were breaking the system is that until
December 2005 MusicBrainz thought that the album artist was "Julian Cope and
the Teardrop Explodes", but in December 2005 someone modified the album to make
the album artist the more correct "Various Artists". Audioscrobbler is still
returning the GID of the now deleted artist "Julian Cope and the Teardrop
Explodes", an artist that hasn't existed in the database for 8 months. And the
reason that the Jethro Tull track didn't seem to appear on any albums? Again,
someone had edited the track title to remove the brackets, and again
Audioscrobbler does not reflect that change, as it was made after their last
update.
What this has forced us to do is to look at identifying textual signatures
of track titles and match against those, instead of attempting a direct match
(my original naive implementation, based on my assumption that the data I got
back from Audioscrobbler was good). We've had to build new indexes using the
excellent tsearch2 package, which comes in postgresql-contrib if you're a
Debian/Ubuntu user. I'm still tuning the queries, but it looks like I've got
identifying the albums upon which a given track appears down to a very
reasonable time, following some judicious PostgreSQL tweaking and a lot of
EXPLAIN ANALYZE. I'm bookmarking all my PostgreSQL-related stuff up at
del.icio.us/offmessage/postgresql and will write up the queries that we're
using to access the right track once I've got them quick enough and accurate
enough.
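To give a rough flavour of what a textual signature can look like, here is a small Python sketch. It is not the tsearch2 setup described above, nor the queries still being tuned; `title_signature` is a hypothetical helper that simply strips accents, case and punctuation so that two spellings of the same title compare equal.

```python
import re
import unicodedata


def title_signature(title):
    """Reduce a track title to a crude, punctuation-free signature."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lower-case, then keep only runs of ASCII letters and digits.
    words = re.findall(r"[a-z0-9]+", stripped.casefold())
    return " ".join(words)


# The title as tagged in the collection and the title as later edited
# (brackets removed) collapse to the same signature.
tagged = "Skating Away (On the Thin Ice of a New Day)"
edited = "Skating Away on the Thin Ice of a New Day"
print(title_signature(tagged) == title_signature(edited))
```

A signature this blunt will also merge titles that differ only in punctuation, and it throws away anything outside the Latin alphabet, so treat it as an illustration of the idea rather than something to index a real collection with.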
Nothing like the unforeseen to really put you off your stride, eh?
[1] PUIDs may help in the future, but are not under
the same licence as the rest of the data, making it hard for us to use them.
[2] If you're sceptical you can prove it for yourself
by checking out pyscrobbler and running some tests. Try "Skating Away on the
Thin Ice of a New Day", "Safesurfer" and "Crazy" with varyingly correct and
incorrect metadata and you'll see what I mean.