Archive for January, 2008

You don’t know what you do until you know what you don’t do

31 Jan

Raymond Chen is a wise old programmer from Microsoft. He’s been in the business a long time. This is an old post of his that I have brought up at Grooveshark several times, so I figured I should post a link to it here for handy reference.

You don’t know what you do until you know what you don’t do.

I’ve seen a lot of software projects, and one thing I’ve learned is that you don’t have a product until you start saying “No”.

In the early phases of product design, you’re all giddy with excitement. This new product will be so awesome. It will slice bread. It will solve world hunger. It’s designed for everybody, from the technology-averse grandmother who wants to see pictures of her grandkids to the IT manager who is in charge of 10,000 computers. It’ll run equally well on a handheld device as in a data center.

When I see a product with an all-encompassing description like this, I say to myself, “They have no idea what their product is.” You don’t know what you do until you know what you don’t do. And the sooner you figure out what you don’t do the better, because a product that promises to do everything will never ship.



30 Jan

One constant source of database/MySQL-related frustration is the need to fetch a random row from a given table. We randomize a lot, especially for anything recommendation-related, so that users don’t see the exact same set of recommendations over and over again. This worked fine when our tables were small and everything fit into memory easily, but now that our tables have 6+ million rows in them, the old standby ORDER BY RAND() LIMIT 1 is no good.

ORDER BY RAND() in and of itself isn’t the worst thing in the universe, although it does use a temp table to do all that sorting, which is less than ideal to begin with. The real problem is that, by the very nature of ORDERs and LIMITs, MySQL can’t apply the LIMIT 1 until it has sorted your entire result set. Good luck to you if that result set contains thousands of rows. That’s why we’ve taken to calling it ORDER BY SLOW(). You really have to think about how to apply ORDER BY RAND() so that it is functional and fast on massive data sets.

Whatever you do, don’t do it like this.

One of the comments has a pretty decent solution:

Now, a better solution would be:
SELECT count(*) FROM foo
and save the result as num_rows. Then
SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
you just selected a random row in two quick queries! congrats!

There’s nothing really wrong with that approach, but it involves an unnecessary extra round trip to the DB server and just feels inelegant. Also, don’t think you can avoid coming back into PHP between queries; MySQL will not allow you to use a MySQL variable (or sub-select clause or function call) as a LIMIT. In PostgreSQL this would be relatively simple: SELECT * FROM table OFFSET RANDOM() LIMIT 1; (obviously you would need slightly more complex logic to make sure RANDOM() returns a legitimate value within the range of allowable OFFSETs, but that can all be done inside SQL).

Thomas Baggelaar almost had it right:

SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;

Unfortunately, MySQL executes the inner select once for every single row comparison, so this is at least as slow as the original.
I posted my solution in response:
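
A query in that spirit, sketched with placeholder names (`foo`, `id`); the exact query may have differed:

```sql
-- Join against a derived table so that FLOOR(MAX(id) * RAND())
-- is computed exactly once, instead of once per row.
SELECT f.*
FROM foo AS f
JOIN (SELECT FLOOR(MAX(id) * RAND()) AS id FROM foo) AS r
  ON f.id >= r.id
ORDER BY f.id
LIMIT 1;
```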


By joining on a nested (derived) select in this way, the inner select statement is executed just once. There are potential complications with this solution as well: if your IDs are non-sequential (e.g. some rows have been deleted, or the ID is not an auto_increment field), the result will be biased towards higher numbers and not truly random. Even if that is the case, though, you might decide it’s better to lose true randomness in order to get your query to finish running in 0.02 seconds instead of 4 minutes.

How do you make that work if you want more than one row? I currently don’t know of a way to do it, although I’m sure it’s possible. One trick would be to find a way to make the inner query return Y rows, where Y is the number of rows you want the whole query to return. I know that’s possible, but I haven’t put enough thought into it yet to have a solution.

Another thing we are going to try is seeding every row with a random value, doing our selects based on that column, and periodically re-seeding the rows with new random values. My understanding is that this is how Wikipedia does it. I might also experiment with having a stored procedure build the SQL, dynamically setting a LIMIT before executing the query, thereby preventing MySQL from complaining about a variable being used as a LIMIT.
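
A sketch of what that seeding scheme might look like, with placeholder names (`foo`, `rand_seed`); this is an illustration, not the query we actually run:

```sql
-- Give every row a random seed, and index it (names are hypothetical).
ALTER TABLE foo ADD COLUMN rand_seed FLOAT, ADD INDEX (rand_seed);
UPDATE foo SET rand_seed = RAND();  -- re-run periodically to reshuffle

-- Picking a random row is then a fast indexed range scan:
SELECT * FROM foo
WHERE rand_seed >= RAND()
ORDER BY rand_seed
LIMIT 1;
-- (If this returns nothing because RAND() landed above the largest seed,
-- retry with <= and ORDER BY rand_seed DESC.)
```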


Posted in SQL


Flexible schedules

29 Jan

One thing I love about working at Grooveshark is the ability to work flexible hours. We are all salaried, so it doesn’t really matter when we come in as long as we are putting in appropriate hours.

Anyone who knows me knows that I am not a morning person. Not even a little bit. I have fallen into the general schedule of coming in around 3:00PM and working until 3:00AM. Grooveshark benefits immensely from this setup for two reasons:

  1. I’m alert, awake and productive when I come in.
  2. I don’t mind working a 12 hour day because I’m not all groggy.

On days when for some reason I have to come in earlier, I end up leaving much earlier too, barely managing to pull 8 hours, during which I feel like I accomplish very little.


Paul Dateh + Inka One

28 Jan

One of the cool things about working at Grooveshark is that we are constantly exposing each other to new and interesting music. Today Vishal played this for us, and we were blown away:


Posted in music



27 Jan

What happens when your UI doesn’t behave the way you expect it to?
Yesterday, Katy managed to create a new type: BooStringlean.

Oddly, her IDE did not complain that it had never heard of a BooStringlean; it was her auto-documenter that complained.


Posted in fun


The Importance of Being Earnest

27 Jan

One of the facts of life you have to work with when working at a startup is that your code will always be changing. New features are added, old code gets re-factored, the database layer gets overhauled, etc. That, coupled with the fact that there is always more to do than time to do it, adds up to having almost no documentation for anything.

This is why it is critically important to have self-documenting code. If your code is not self-documenting and I need to interface with it, then I either have to wait for you to have time to answer my questions, or spend a lot of time digging through your code trying to figure out what is actually going on.

There is actually something worse than non-self-documenting code, however, and that’s self-documenting code that is incorrect. If you do not have real documentation and the names you use imply one thing but mean another, you are going to cause serious problems.

Of course, I have examples of both problems. In the first, we have two values that I needed to derive information from: Duration and Bitrate. Do you see the problem? No? What if I ask you for the file size of a song given the following: Bitrate: 190000, Duration: 211749000. Can you tell me at a glance what math would be required in order to get the file size? You can’t, because the units are missing. (If the bitrate happens to be in bits per second and the duration in microseconds, that’s a 190 kbps file lasting about 212 seconds, or roughly 5 MB; but nothing in the names tells you that.)

For the second example, we have something that actually caused a major bug in Grooveshark that is just now in the process of being fixed. We have a field called isOnline. Do you think isOnline tells you whether a file is online? It doesn’t. Well, not always. If isOnline is 0, the file is definitely offline. But if isOnline is 1, it might still be offline, because what the field really tracks is whether the file has been removed from a user’s library (or rather, the inverse of that). With a name like isOnline, it’s quite tempting to assume that this is how you check whether a file is online, and so that is what happened. That is why, often, when you click play on a song in Grooveshark, you get a peer-offline error: the file isn’t online because the peer isn’t online, but isOnline is set to 1 because the user has not removed the file.
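
One way out, sketched with hypothetical names (the real table and column names may differ), is to rename the field after what it actually tracks and treat availability as a separate question:

```sql
-- Rename the flag to reflect its real meaning (hypothetical names).
ALTER TABLE Files CHANGE isOnline isInLibrary TINYINT(1) NOT NULL DEFAULT 1;
```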


Posted in Coding, SQL


SQL Performance

25 Jan

Skyler’s post about triggers briefly mentioned denormalization to increase SQL performance. Denormalization is just one of the many techniques we have had to resort to in order to keep SQL performance up to snuff.

The funny thing about SQL is that queries perform vastly differently depending on the size of your dataset, and often the performance is completely non-linear. For example, a query that is blazingly fast on 4 million rows of data might be unbelievably slow on 6 million rows. Paradoxically, you might also have a query that is slower than the first query on 4 million rows, but faster than it on 6 million.

From the time I started working at Grooveshark just a couple of months ago until very recently, the number of users and songs in our system grew fairly steadily, but SQL overhead degraded the performance of our site dramatically, to the point where some pages were suddenly taking 30 seconds to load. The strange part was that it seemed to happen virtually overnight.

Part of the problem is that we all work on a development copy of the database when we’re experimenting with queries, so as not to risk breaking anything on production. But since our dev box has much more modest specifications than our production server, we work with a much smaller data set, so we don’t always get a good picture of how our queries are going to perform “out in the wild.”

Methods we have used to tweak the performance of our queries include:

  1. The aforementioned denormalization techniques.
  2. Writing and rewriting the same queries over and over again until we find something that works; it’s often hard to predict how queries will perform ahead of time, and MySQL’s EXPLAIN often leaves much to be desired.
  3. Making sure proper indexes exist for everything and that they are actually being used; sometimes MySQL guesses wrong about which index is best.
  4. Switching certain tables to InnoDB to prevent the table-deadlock issues that had started popping up.
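
For the cases where MySQL guesses wrong about which index to use, index hints let you override the optimizer. A minimal sketch, with made-up table and index names:

```sql
-- Force the optimizer to use a specific index (names are hypothetical).
SELECT SongID, Name
FROM Songs FORCE INDEX (idx_artist)
WHERE ArtistID = 1234;
```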

It took us about a week to get everything sorted out and optimized, but it really paid off. I just opened up the page that had been taking 30 seconds to load, and it took 6 seconds, about 0.5 seconds of which can be attributed to PHP + MySQL execution. That is a very impressive increase in performance for running on the exact same hardware with no load balancing to speak of. There are certainly more performance improvements to be had, and we will soon be making significant improvements to the way our PHP utilizes MySQL. Even so, I’m sure this isn’t the last speed bump we will run into, but at least now we have a pretty good idea of what we need to do to get things running smoothly again.


Posted in SQL