Archive for the ‘SQL’ Category

ORDER BY SLOW()

30 Jan

One constant source of database/mySQL-related frustration is the need to fetch a random row from a given table. We do randomization a lot, especially for anything recommendation-related, so that users don’t see the exact same set of recommendations over and over again. This worked fine when our tables were small and everything fit into memory easily, but now that our tables have 6+ million rows in them, the old standby ORDER BY RAND() LIMIT 1 is no good.

ORDER BY RAND() in and of itself isn’t the worst thing in the universe. It does use a temp table to do all that sorting, which is less than ideal to begin with, but the real problem is that by the very nature of ORDER BY and LIMIT, mySQL can’t apply the LIMIT 1 until it has sorted your entire result set. Good luck to you if that result set contains thousands of rows. That’s why we’ve taken to calling it ORDER BY SLOW(). You really have to think about how to apply ORDER BY RAND() so that it is functional and fast on massive data sets.

Whatever you do, don’t do it like this.

One of the comments has a pretty decent solution:

Now, a better solution would be:
SELECT count(*) FROM foo
and save the result as num_rows. Then
SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
you just selected a random row in two quick queries! congrats!

There’s nothing really wrong with that approach, but it involves an unnecessary extra round-trip to the DB server and just feels inelegant. Also, don’t think you can avoid coming back into PHP between the queries; mySQL will not allow you to use a mySQL variable (or a sub-select clause, or a function call) as a LIMIT. In PostgreSQL this would be relatively simple: SELECT * FROM table OFFSET RANDOM() LIMIT 1; (obviously you would need slightly more complex logic to make sure RANDOM() returns a legitimate value within the range of allowable OFFSETs, but all of that can be done inside SQL).
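
To make that concrete, the two-query version looks something like this (the Songs table name is just a placeholder, and the offset has to be computed in PHP, since mySQL won’t take an expression there):

SELECT COUNT(*) FROM Songs;
-- in PHP: $offset = mt_rand(0, $numRows - 1);
SELECT * FROM Songs LIMIT 123456, 1;  -- 123456 stands in for the offset computed in PHP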

Thomas Baggelaar almost had it right:

SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;

Unfortunately, mySQL executes the inner select for every single row comparison, so that approach ends up at least as slow as the original. I posted my solution in response:

SELECT * FROM `Table` T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM `Table`) AS x ON T.ID >= x.ID LIMIT 1;

By joining on a nested (or delayed) select in this way, the inner select statement is executed just once. There are potential complications with this solution as well: if your IDs are not sequential (e.g. some rows get deleted, or the ID is not an auto_increment field), the result is biased toward the IDs that sit just after gaps, so it isn’t truly random. Even so, you might decide that it’s better to lose true randomness in order to get your query to finish in 0.02 seconds instead of 4 minutes.

How do you make that work if you want more than one row? I currently don’t know of a way to do it, although I’m sure it’s possible. One trick would be to find a way to make the inner query return Y rows, where Y is the number of rows you want the entire query to return. I know that’s possible, but I haven’t put quite enough thought into how to accomplish it to have a solution yet.

Another thing we are going to try out is seeding every row with a random value, and then doing our selects based on those, periodically giving those rows new random values. My understanding is that this is how Wikipedia does it. I might also experiment with having a stored procedure build the SQL, dynamically setting a LIMIT before executing the query, thereby preventing mySQL from complaining about using a variable for LIMIT.
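
Roughly sketched, with table, column, and index names made up for illustration, the seeded-row idea looks like this:

ALTER TABLE Songs ADD COLUMN RandomSeed FLOAT, ADD INDEX idx_RandomSeed (RandomSeed);
UPDATE Songs SET RandomSeed = RAND();        -- re-run periodically to reshuffle the seeds
SELECT * FROM Songs
WHERE RandomSeed >= 0.7312                   -- 0.7312 stands in for an app-generated random value
ORDER BY RandomSeed LIMIT 1;                 -- if nothing matches, retry with <= and ORDER BY ... DESC

And the dynamic-LIMIT experiment boils down to building the query as a string and preparing it, which sidesteps the complaint about variables in LIMIT:

SET @numRows = (SELECT COUNT(*) FROM Songs);
SET @sql = CONCAT('SELECT * FROM Songs LIMIT ', FLOOR(RAND() * @numRows), ', 1');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;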

 
20 Comments

Posted in SQL

 

The Importance of Being Earnest

27 Jan

One of the facts of life you have to accept when working at a startup is that your code will always be changing. New features are added, old code gets re-factored, the database layer gets overhauled, etc. That, coupled with the fact that there is always more that needs to be done than time to do it, adds up to having almost no documentation for anything.

This is why it is critically important to have self-documenting code. If your code is not self-documenting, and I need to interface with it, then I have to wait for you to have time to answer my questions, or spend a lot of time digging through your code trying to figure out what is actually going on.

There is actually something worse than non-self-documenting code, however, and that’s code whose names claim to document it but document the wrong thing. If you have no real documentation and the names you use imply one thing but the code does another, you are going to cause serious problems.

Of course, I have examples of both problems. For the first, we have two entities that I needed to derive information from: Duration and Bitrate. Do you see the problem? No? What if I ask you for the file size of a song given the following: Bitrate: 190000, Duration: 211749000. Can you tell me at a glance what math would be required to get the file size? You can’t, because the units are missing.
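
For what it’s worth, the cheapest fix is to put the units into the names. A minimal sketch, guessing that the table is called Files and that the numbers are bits per second and microseconds (the whole point being that nobody should have to guess):

ALTER TABLE Files
  CHANGE Bitrate BitrateBitsPerSecond INT UNSIGNED,
  CHANGE Duration DurationMicroseconds BIGINT UNSIGNED;
-- now the math is readable at a glance:
-- FileSizeBytes = BitrateBitsPerSecond / 8 * DurationMicroseconds / 1000000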

For the second example, we have something that actually caused a major bug in Grooveshark, one that is just now in the process of being fixed. We have a field called isOnline. Do you think isOnline tells you whether a file is online? It doesn’t. Well, not always. If isOnline is 0, the file is definitely offline. But if isOnline is 1, it might still be offline, because what the field really tracks is whether the file has been removed from a user’s library (or rather, the inverse of that). With a name like isOnline, it’s quite tempting to assume that this is how you check whether a file is online, and so that is what happened. That is why, oftentimes, when you click play on a song in Grooveshark, you get a peer offline error: the file isn’t available because the peer isn’t online, but isOnline is set to 1 because the user has not removed the file.
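
A minimal sketch of the rename that would have prevented this, with all names and types hypothetical:

ALTER TABLE Files CHANGE isOnline isInLibrary TINYINT(1) NOT NULL DEFAULT 1;
-- isInLibrary answers the question the flag actually tracks: has the user removed this file?
-- Whether the peer (and therefore the file) is actually reachable has to be checked separately.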

 
1 Comment

Posted in Coding, SQL

 

SQL Performance

25 Jan

Skyler’s post about triggers briefly mentioned denormalization to increase SQL performance. Denormalization is just one of the many techniques we have had to resort to in order to keep SQL performance up to snuff.

The funny thing about SQL is that queries perform vastly differently depending on the size of your dataset, and the relationship is often completely non-linear. For example, a query that is blazingly fast on 4 million rows of data might be unbelievably slow on 6 million. Paradoxically, you might also have a second query that is slower than the first one on 4 million rows but faster than it on 6 million.

From the time I started working at Grooveshark just a couple of months ago until very recently, the number of users and songs on our system grew fairly steadily, but the performance of our site, due to SQL overhead, got dramatically worse, to the point where some pages were suddenly taking 30 seconds to load. The strange part was that it seemed to happen virtually overnight.

Part of the problem is that we all work on a development copy of the database when we’re experimenting with queries, so as not to risk breaking anything on production. But since our dev box has much more modest specifications than our production server, we work with a much smaller data set, so we don’t always get a good picture of how our queries are going to perform “out in the wild.”

The methods we have used to tweak the performance of our queries range from the aforementioned denormalization techniques, to writing and rewriting the same queries over and over until we find something that works (it’s often hard to predict how queries will perform ahead of time, and mySQL’s EXPLAIN often leaves much to be desired), to making sure proper indexes exist for everything and are actually being used (sometimes mySQL guesses wrong about which index is best), to switching certain tables to InnoDB to prevent the table deadlock issues that started popping up.
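
For concreteness, the day-to-day of that tuning mostly comes down to statements along these lines (table and index names invented for the example):

EXPLAIN SELECT * FROM Songs WHERE ArtistID = 42;      -- see which index mySQL actually picks, if any
ALTER TABLE Songs ADD INDEX idx_ArtistID (ArtistID);  -- add the index it should have been using
SELECT * FROM Songs FORCE INDEX (idx_ArtistID)
WHERE ArtistID = 42;                                  -- override mySQL when it guesses wrong
ALTER TABLE Playlists ENGINE = InnoDB;                -- row-level locking for the tables that were locking up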

It took us about a week to get everything sorted out and optimized, but it really paid off. I just opened up the page that had been taking 30 seconds to load, and it took 6 seconds, only about 0.5 seconds of which can be attributed to PHP + mySQL execution. That is a very impressive increase in performance for running on the exact same hardware, with no load balancing to speak of. There are certainly more performance improvements to be had, and we will soon be making significant changes to the way our PHP utilizes mySQL. Even so, I’m sure this isn’t the last speed bump we’ll run into, but at least now we have a pretty good idea of what we need to do to get things running smoothly again.

 
1 Comment

Posted in SQL