August 31, 2006

Iteration 1 is over

We said we would be open about how we are doing, so it seemed only right to share exactly how our planning and development process is getting on. We (Isotoma) are Agile developers. Part of our version of the Agile methodology is to use Mike Cohn's excellent approach for estimating. To that end we spent approximately a week with Mark developing a total of 141 user stories that describe the way that the application works. We then broke them down by feature and priority until we had a set that we could usefully work with, one that included all the things that might go into the first iteration. Then we played planning poker (zip file) to get down to our core stories for iteration 1.

Iteration 1 is now complete, and this is how we did:

Story ID | Story | Est. Points
50 | As a user I will always have a status bar showing what I am currently listening to | 1
78 | As a music fan, I can see photographs from Flickr in the photo tool | 1
83 | As a music fan, I will see MusicBrainz lists of releases in the discography tool | 0
84 | As a music fan, I will see MusicBrainz lists of tracks in the discography tool | 1
85 | As a music fan, I will have access to a tool showing photos relating to my current context | 2
87 | As a music fan, I will have access to a tool showing the biography for my current context | 2
88 | As a music fan, I will have access to a tool showing a discography for my current context | 2
95 | As a music fan, I will have access to a tool showing products, merchandise and music for my current context that is available for purchase | 2
107 | As a music fan, I will see links to allmusic biographies in the biography tool | 1
108 | As a music fan, I will see wikipedia biographies in the biography tool | 5
110 | As a music fan, I will see US & UK amazon listings in the products tool | 3
139 | As a music fan, I will have access to a tool showing other releases on which this track is present | 2
140 | There will be a local copy of the musicbrainz database installed, configured and tested | 5

As both Doug and I have blogged about, getting MusicBrainz up and running and correctly identifying tracks proved considerably harder than we thought. We had estimated a velocity of 27 story points in 14 days and achieved those points in 15.5. Not bad, particularly as, knowing what we now know, I would re-estimate the MusicBrainz-related stuff as about 3 times as much.

Iteration 2 will be 16 days (looking like 30 story points). I'll write up which ones we are tackling after our next planning session.
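
For anyone who wants to check the sums, here is a rough sketch of the velocity arithmetic using only the figures quoted above. The variable names are purely illustrative, and this is not how we actually plan an iteration (that comes out of the planning session), just the back-of-envelope version.

```python
# Figures quoted in the post; the variable names are illustrative.
points = 27            # story points in iteration 1
planned_days = 14      # days we estimated
actual_days = 15.5     # days it actually took

planned_velocity = points / planned_days   # points per day, as estimated
actual_velocity = points / actual_days     # points per day, as achieved

iteration_2_days = 16
forecast_at_planned = planned_velocity * iteration_2_days
forecast_at_actual = actual_velocity * iteration_2_days

print("planned velocity: %.2f points/day" % planned_velocity)
print("actual velocity:  %.2f points/day" % actual_velocity)
print("16 days at planned velocity: %.1f points" % forecast_at_planned)
print("16 days at actual velocity:  %.1f points" % forecast_at_actual)
```

Sixteen days works out at roughly 31 points at the planned rate and roughly 28 at the achieved rate, so the 30 pencilled in for iteration 2 sits between the two.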

August 29, 2006

datetime strftime methods require year >= 1900

Trying to strftime() a datetime before 1900? The Python source says:

/* Give up if the year is before 1900.
 * Python strftime() plays games with the year, and different
 * games depending on whether envar PYTHON2K is set.  This makes
 * years before 1900 a nightmare, even if the platform strftime
 * supports them (and not all do).
 * We could get a lot farther here by avoiding Python's strftime
 * wrapper and calling the C strftime() directly, but that isn't
 * an option in the Python implementation of this module.
 */

Bitten again. On previous projects where I have hit this problem I've been able to ignore it safely. Ignoring Noel Coward, Humphrey Bogart or Duke Ellington's date of birth is going to be difficult. Ignoring nearly all classical music is going to be pretty much impossible. Some thought is clearly required. I have a feeling it's not vitally important to get it 100% right, but something needs to be done if only to disambiguate the two Engelbert Humperdincks.
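
One possible way round it, sketched here as an assumption rather than as what ends up in Sleevenotez, is to leave strftime() out of it for display purposes and format the date's own fields directly. The function name, the month table and the output format below are all made up for illustration.

```python
from datetime import date

# English month names, indexed by month number minus one.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def format_birth_date(d):
    """Render a date as e.g. '16 December 1899' without calling strftime().

    Only the year, month and day attributes are read, so the year >= 1900
    restriction described above never comes into play.
    """
    return "%d %s %d" % (d.day, MONTH_NAMES[d.month - 1], d.year)


# An illustrative pre-1900 date.
example = format_birth_date(date(1899, 12, 16))
print(example)
```

If a sortable form is all that is needed, date.isoformat() also avoids strftime() and gives back something like 1899-12-16.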

August 26, 2006

Ubuntu image on Amazon EC2

If anyone is interested (I know, you are wetting yourself in anticipation), here’s how I created an Ubuntu dapper image for EC2:

http://developer.amazonwebservices.com/connect/thread.jspa?threadID=11559

Actually very simple. I managed to make pretty much every mistake going though, en route. Unfortunately when you screw up all you get is a remote image that you can’t connect to - no debugging output whatsoever.

doug.

August 25, 2006

Replication

Musicbrainz uses a Postgres contributed script called dbmirror to handle replication. dbmirror was designed to work between two live instances, but they’ve done quite a nice job of decoupling this: packages are regularly written to the musicbrainz ftp area containing data suitable for dbmirror slaves.

I wanted replication to be controllable within the sleevenotez application architecture, and more to the point I didn’t want to have to schedule a bunch of cron jobs running perl scripts on our development environments. I’ve ported the dbmirror slave part to twisted with a MusicbrainzReplicationService running on startup that does the full process of replication - ftping to the server, pulling down the packages, unpacking them, parsing them and updating the database.

It works very nicely and, more to the point, it seems to keep working. I’ll be releasing this bit back to the community once I’m sure it’s bug free - it might be useful for others later.
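
Until the real thing is released, here is a very rough sketch of the shape of one replication pass as described above: fetch the next package, apply it, move on, stop when there is nothing new. It is plain Python rather than Twisted, and everything in it is an assumption made for illustration: the function names, the idea that packages carry sequential numbers, and the two callables standing in for the ftp/unpack/parse step and the database update step.

```python
def replicate(fetch_packet, apply_packet, last_applied):
    """Apply every available packet after last_applied, in order.

    fetch_packet(n) is a hypothetical callable that returns packet n, or
    None if the server has not published it yet. apply_packet(packet) is a
    hypothetical callable that writes the packet's changes to the database.
    Returns the number of the last packet applied.
    """
    sequence = last_applied
    while True:
        packet = fetch_packet(sequence + 1)
        if packet is None:
            # Nothing newer on the server; we are up to date for now.
            return sequence
        apply_packet(packet)
        sequence += 1


# Usage with in-memory stand-ins for the ftp area and the database.
published = {4: "changes-4", 5: "changes-5"}
applied = []
last = replicate(published.get, applied.append, last_applied=3)
print(last, applied)
```

Keeping the fetch and apply steps as separate callables means the same loop can be driven from a test, a script or a long-running service.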

Duff metadata and out of date databases

Duff metadata is the thorn in the side of any large mp3 collection; even those that have all been ripped firsthand with fanatical attention to detail will still have their moments. For example, for some unknown reason my rip of Carter the Unstoppable Sex Machine’s 101 Damnations has all the tracks incorrectly named, despite me ripping it myself and using FreeDB to look up the track information. (I realise of course that by owning that album I may deserve bad metadata). Anything that relies on community submitted data (which in essence all tagging applications do) will be subject to inaccuracies and change, however careful one is.

Because Sleevenotez is an application that relies on the quality of an mp3's metadata to drive it, our initial intention had been to avoid tackling the problem of the duff stuff too early. Instead we planned to rely on a test user with a music collection of over 9,000 "correctly" tagged mp3s (by correct in this case we mean tagged with the MusicBrainz tagger). As Audioscrobbler makes a reasonable thing about their use of the MusicBrainz data, our assumption was that by correctly tagging our test collection and transmitting the MusicBrainz UFID up to Audioscrobbler we would receive that MBID (or GID as MusicBrainz call it) back, allowing us to query the database by GID and ignore any metadata issues for the moment.

(Before you get really confused... UFID, MBID and GID are all the same thing. UFID is how it appears in the ID3v2 tags, MBID is what Audioscrobbler calls it, while GID is how it's actually referred to in the MusicBrainz data model. From now on we'll always refer to GIDs.)

The first stumbling block we hit was that in fact the Audioscrobbler web services do not return the GID. Even though we send it to them, we do not get it back (despite it being in the XML document tree as provided by the web services). This was a surprise, although by now I really should have learned not to believe the specification without testing it first. Part of me is sure that it used to be there, but it certainly isn't now. Quite what reason they have for not returning it to the user that sent it I don't know, but it certainly causes us a problem at this stage of the project.

Still, every problem is an opportunity in disguise (or something... I believe they weren't the words that I used at the time). One of the key pieces of discography information we want to display is other albums upon which the currently playing track appears. Nothing in the MusicBrainz database identifies that "Ace of Spades" on the original album is in fact the same recording as the one on "100 Greatest Guitar Anthems" [1], so we have to do textual searching of the track table against artist and track name. By not getting the GID back from Audioscrobbler we've been forced to deal with this rather thorny issue upfront, instead of procrastinating as I had intended to do.

So. The next problem we hit is a very strange one. We started seeing weird failures of the system. Anything from "Floored Genius Vol 1: The Best of Julian Cope and the Teardrop Explodes 1979 - 1991" broke the system. Also, "Skating Away (On the Thin Ice of a New Day)" by Jethro Tull apparently didn't appear on any albums. Despite Doug being convinced that this was simply a matter of taste it seemed clear that something was up.

I've spent most of this week getting to the bottom of this and then building a workaround to the problem. It turns out that Audioscrobbler has not updated their copy of the MusicBrainz database for some considerable time [2]. The reason that tracks from "Floored Genius" were breaking the system is that until December 2005 MusicBrainz thought that the album artist was "Julian Cope and the Teardrop Explodes" but in December 2005 someone modified the album to make the album artist the more correct "Various Artists". Audioscrobbler is still returning the GID of the now deleted artist "Julian Cope and the Teardrop Explodes", an artist that hasn't existed in the database for 8 months. And the reason that the Jethro Tull track didn't appear to appear on any albums? Again, someone had edited the track title to remove the brackets, but again Audioscrobbler does not reflect that change as it was made after their last update.

What this has forced us to do is to look at identifying textual signatures of track titles and match against those, instead of attempting a direct match (my original naive implementation, based on my assumption that the data I got back from Audioscrobbler was good). We've had to build new indexes using the excellent tsearch2 package that comes in postgresql-contrib (if you're a Debian/Ubuntu user). I'm still tuning the queries, but it looks like I've got identifying the albums upon which the given track appears down to a very reasonable time, following some judicious PostgreSQL tweaking and a lot of EXPLAIN ANALYZE. I'm bookmarking all my PostgreSQL related stuff up at del.icio.us/offmessage/postgresql and will write up the queries that we're using to access the right track once I've got them quick enough and accurate enough.
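
To give a flavour of what matching on a textual signature means, here is a small Python sketch. It is not our tsearch2 implementation (the real queries will follow in a later write-up); it just illustrates the idea of reducing a title to something that ignores case, brackets and punctuation. The function names are invented for the example.

```python
import re


def title_signature(text):
    """Reduce a title to lower-case words separated by single spaces.

    Brackets, punctuation and runs of whitespace are dropped, so two
    titles that differ only in those details get the same signature.
    """
    # [^\W_]+ matches runs of letters and digits, including accented ones.
    words = re.findall(r"[^\W_]+", text.lower())
    return " ".join(words)


def looks_like_same_track(artist_a, title_a, artist_b, title_b):
    """Compare two (artist, title) pairs by signature rather than exactly."""
    return (title_signature(artist_a) == title_signature(artist_b)
            and title_signature(title_a) == title_signature(title_b))


old_title = "Skating Away (On the Thin Ice of a New Day)"
new_title = "Skating Away on the Thin Ice of a New Day"
print(title_signature(old_title))
print(looks_like_same_track("Jethro Tull", old_title, "Jethro Tull", new_title))
```

A direct string comparison treats those two titles as different tracks, while the signature comparison treats them as the same one. Something this crude will also produce false matches, which is part of why the tuning takes time.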

Nothing like the unforeseen to really put you off your stride, eh?

[1] PUIDs may help in the future, but are not under the same licence as the rest of the data making it hard for us to use them.

[2] If you're sceptical you can prove it for yourself by checking out pyscrobbler and running some tests. Try Skating Away on the Thin Ice of a New Day, Safesurfer and Crazy with varyingly correct and incorrect metadata and you'll see what I mean.

Sleevenotez... on Amazon's Elastic Compute Cloud?

Last week we were sat around discussing the possible problems of an open alpha. One of the big problems is if people start using the service - obviously this is sort of good, but it brings with it a lot of potential issues. Part of what we are trying to gain is an understanding of real operational issues early, which can only improve the service, but that’s no good if the problems cause the whole thing to fall over - for example, because a lot of people are using it.

Well, we said, what we really need is someone (we thought Google) to come out with a flexible hosting system, where we could throw virtual machines at the problem if the load got too high. And lo, yesterday Amazon announced their beta of their Elastic Compute Cloud Service. Amazon have been doing some very cool stuff with their Web Services - their Simple Storage Service and Simple Queue Service are core building blocks of good applications, and I’ve been looking forward to an opportunity to use them.

Well, this Elastic Compute Cloud is just what we need, and as it happens I was invited onto the limited beta. So I’ve been hacking since 6am getting an Ubuntu 6.06 LTS Server image prepared for loading up to their cloud. It’s uploading right now - fingers crossed it’ll work. If so, we should have sleevenotez running on their cloud later today!

August 20, 2006

Sleevenotez Architecture

Hi I’m Doug, and I’m the software architect on this project (and one of the coders too). Most of my posts are going to be painfully technical, but I’ll try to keep this one at least a bit readable. We’re using some pretty left-field technology on this project and the purpose of this post is to introduce a few of the architectural aspects of the application, and how these led to selecting this software.

These left-field components are Twisted, Nevow & Axiom. All of these are built on the no-longer left-field Python.

What we are building is, in essence, a massive Mash Up. The majority of the data we display is going to be fetched from elsewhere, processed and passed on to the user. Each page view by a user might lead to a dozen or so queries going out to service providers, and then incremental asynchronous updating of the user’s display.

What this means is that the part of our application that needs the most thought is the part where it is doing nothing: when it’s waiting. We’re going to spend an awful lot of time waiting. The aim is to do all that waiting for as low a cost as possible.

Traditional architectures don’t handle waiting very well. A normal web application might have a server with 2-10 general purpose processing threads. Each thread will have some associated in-memory caches, it’ll probably hold a database connection or two, it’ll have some thread-local storage to handle context and all sorts of other stuff. These threads are expensive.

In an application like that, you do all your waiting on your own time. You connect to Amazon, for example, and then you sit there, blocking, until Amazon returns. We’re going to be issuing hundreds of these sorts of requests a second, possibly. They aren’t inherently resource intensive (they are only processing a few TCP packets after all), but the blocking is an absolute killer for a traditional architecture.

There are two feasible alternatives here, both of which have their merits, both of which are different approaches to cooperative multitasking. Twisted provides a single-threaded model using a single select reactor internally. This is hugely efficient, although the style of programming takes some getting used to.
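
To make the "cheap waiting" point concrete, here is a minimal sketch of many requests waiting at once on a single thread. Note that it uses Python's standard-library asyncio rather than Twisted, purely so that it runs without installing anything; Twisted's Deferred-based API looks different, but the single-threaded event loop idea is the same. The provider names and the sleep standing in for network latency are assumptions for illustration.

```python
import asyncio
import time


async def fetch(provider, delay):
    """Stand-in for one outbound request; the sleep models network waiting."""
    await asyncio.sleep(delay)
    return provider


async def fetch_all(providers, delay=0.05):
    # Every request waits concurrently on the same thread, so the total
    # wait is roughly one delay rather than len(providers) delays.
    return await asyncio.gather(*(fetch(p, delay) for p in providers))


providers = ["provider-%d" % i for i in range(50)]
started = time.perf_counter()
results = asyncio.run(fetch_all(providers))
elapsed = time.perf_counter() - started
print("%d responses in %.2fs" % (len(results), elapsed))
```

Fifty blocking calls made one after another would take about fifty times the delay. Here the waits overlap, and no thread sits idle holding caches and database connections while it waits.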

A valid alternative would be a stackless, lightweight cooperative multithreading environment, like Erlang or Stackless Python. If we used lightweight threads, we could run the thousands of concurrent threads we’re going to need for this application, and they would again block using virtually no resources while waiting.

Twisted has a large and very capable toolset built around it, with Nevow providing an extremely elegant and effective web framework. Nevow also provides Athena, which is some pretty cunning wiring to hook up deferreds in MochiKit to deferreds in the Twisted server, providing end-to-end asynchronicity. Very smart.

All of these factors make Twisted a very useful, and interesting, choice. Hopefully we’ll be borne out by experience :)

August 18, 2006

Welcome to the Sleevenotez blog

Sleevenotez is roughly four days old. Doug and Andy from Isotoma and I started work in anger last Tuesday, after four days of planning. Since then we have built an app that could be released into the wild without fear of embarrassment. How did this happen? Creativity, focus, agile development practices, a clear understanding of the problem space and the project's goals - the usual stuff. No great surprises, no big revelations. It helps that we're a small team, with complementary skills. It's probably a benefit that we've known each other for over a decade. It's a big plus that Orange pays me to research emerging consumer behaviour, track new technologies and develop insights for driving new product development. But the idea behind Sleevenotez is a simple one - anyone with a large MP3 collection could have thought of it. However, I do believe that the complete vision for Sleevenotez satisfies both a consumer and industry need - and there is also a business model behind it. Still, the maxim that 'a vision without the ability to execute is a hallucination' is true. Luckily we have that ability, so... we executed on it.

It's now Saturday afternoon and - based on my extensive personal experience - we've achieved more in one week than the combined might of France Telecom's R&D facilities, and hundreds of thousands of dollars, could achieve in twelve months. It's not their fault - institutionalised R&D isn't set up to respond to disruptive, agile innovation. More often than not, when a product emerges from inside a Telco's borders it's too little too late. Working with Doug and Andy has, in contrast, been a breath of fresh air.

We've decided to 'do an Amigo' and launch Sleevenotez in its raw, pre-alpha state, based on the tried-and-tested assumption that releasing early and often is best. Our 'bare naked app' will evolve rapidly over the next two months, and hopefully our transparent approach will yield some useful feedback en route.

We're going to figure out how / when to provide access to the demo site shortly. In the meantime, this blog will provide a blow-by-blow account of our progress, from zero to coolio.