RSS
 

ORDER BY SLOW()

30 Jan

One constant source of database/mySQL related frustration is the need to fetch a random row in a given table. We do randomization a lot, especially for anything recommendation-related so that users don’t see the exact same set of recommendations over and over again. This worked fine when our tables were small and everything fit into memory easily, but now that our tables have 6+ million rows in them, the old standby ORDER BY RAND() LIMIT 1 is no good.

ORDER BY RAND() in and of itself isn’t the worst thing in the universe, it does use a temp table to do all that sorting which is certainly less than ideal to begin with, but by the very nature of ORDERs and LIMITs, mySQL can’t apply the LIMIT 1 until it has ordered your result set. Good luck to you if your result set contains thousands of rows. That’s why we’ve taken to calling it ORDER BY SLOW(). You really have to think about how to apply ORDER BY RAND() so that it is functional and fast on massive data sets.

Whatever you do, don’t do it like this.

One of the comments has a pretty decent solution:

Now, a better solution would be:
SELECT count(*) FROM foo
and save the result as num_rows. Then
SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
you just selected a random row in two quick queries! congrats!

There’s nothing really wrong with that way but it involves an unnecessary extra round-trip to the DB server and just feels inelegant. Also don’t think you can avoid coming back in to PHP between queries; mySQL will not allow you to use a mySQL variable (or sub-select clause or function call) as a LIMIT. In PostgreSQL this would be relatively simple: SELECT * FROM table OFFSET RANDOM() LIMIT 1; (obviously you would need slightly more complex logic to make sure RANDOM() is returning a legitimate value within the range of allowable OFFSETs, but this can all be done inside of SQL)

Thomas Baggelaar Almost had it right:

SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;

Unfortunately mySQL executes the inner select for every single row comparison, so that is at least as slow as the original.
I posted my solution in response:

SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;

By joining on a nested (or delayed) select in this way, the inner select statement is executed just once. Again there are potential complications with this solution as well: if your IDs are non-sequential (i.e. some get deleted, or they are not an auto_increment field), it will be biased towards higher numbers and not truly random. However even if that is the case you might decide that it’s better to lose true randomness in order to get your query to finish running in .02 seconds instead of 4 minutes.

How do you make that work if you want more than one row? I currently don’t know of any way to do it, although I’m sure it’s possible. One trick would be to find a way to make the inner query return Y rows where Y is the number of rows you want the entire query to return. I know that it’s possible to do that but I haven’t put quite enough thought into how that can be accomplished to have a solution yet.

Another thing we are going to try out is seeding every row with a random value, and then doing our selects based on those, periodically giving those rows new random values. My understanding is that this is how Wikipedia does it. I might also experiment with having a stored procedure build the SQL, dynamically setting a LIMIT before executing the query, thereby preventing mySQL from complaining about using a variable for LIMIT.

 
20 Comments

Posted in SQL

 

Leave a Reply

 

 
  1. Jay

    January 31, 2008 at 7:16 am

    Hmm, it occurs to me that I can make the inner select return multiple random numbers at once if I do a cross-join with any table if I apply a limit (limit = however many rands I want), all within the nested select. How will that perform? I’ll test it out later and find out.

     
  2. Jay

    January 31, 2008 at 7:18 am

    Also I forgot to mention that Skyler should actually get credit for ORDER BY SLOW() as he came up with it. (I wasn’t sure if it was Skyler or Travis or I would have given credit in the first place)

     
  3. Jan

    March 5, 2008 at 10:29 am

    That’s actually pretty clever! I’m wondering if you could also select 10 random rows with that technique.

     
  4. Jay

    March 9, 2008 at 12:10 am

    Hey Jan, thanks! Unfortunately there is not a quick and easy way to do 10 at a time that I am aware of, aside from doing a bunch of unions (ugly) or making a stored procedure that loops over as many times as you ask it to…

    It’s funny, MySQL is usually quite happy to re-execute a query/function call over and over again when that is completely unnecessary, but I can’t find any way to trick the query optimizer into re-executing the RAND() calculation. I tried joining across a dummy table but it just gives me the same random ID 10 times rather than giving me 10 random IDs. Bummer. Please do let me know if you ever stumble across an answer to this problem.

     
  5. Erik

    April 3, 2008 at 11:18 am

    That could easy return null set if entries in the db has been removed. Getting tired of reading all the internet “pro” comments. More often that not you guys are experts on buggy code.

    Sorry for the sarcasm..

     
  6. Jay

    April 3, 2008 at 6:03 pm

    I appreciate your frustration, but the real problem is that MySQL does not have a simple and elegant solution to this problem; that’s probably why you haven’t found one you like yet.

    That said, I did mention the caveat that you need to have sequential IDs for this to work as intended. For our data sets, that is not a problem, but it might be for yours.

    That said, it should not be possible for this to return a null set; that’s why it’s x ON T.ID >= x.ID instead of x ON T.ID = x.ID

    There is another bug in my code though: since MySQL auto-increments start at 1 rather than 0, FLOOR(MAX(ID)*RAND()) should be CEIL(MAX(ID)*RAND()) – I have edited the original post to reflect that.

     
  7. Patrick

    August 4, 2008 at 11:20 pm

    can you explain the Sql a little more. Im not sure if t is a table or..

    Sorry, new a the complex selects

     
  8. Jay

    August 4, 2008 at 11:31 pm

    Hi Patrick, sure I understand, we’ve all been there before. :)

    I assume you mean this part: SELECT * FROM Table T – T is an alias for ‘Table’ in this case. I was just being lazy, the longer, more obvious way to type the same thing would be SELECT * FROM Table AS T

    I’m not sure if that was the only part you were having trouble with. If you have any other questions about the SQL I’d be glad to try to explain…

     
  9. Andreas

    August 20, 2008 at 6:05 pm

    I’m a newbie but this looks like a good and fast solution.
    My problem is that I want a random row but also only select a row with a specific country/age/gender etc..
    Is it possible to include a WHERE statement somewhere in this? I’ve tried but can’t get it right.. thanks!

     
  10. Jay

    August 21, 2008 at 1:42 am

    Hi Andreas, unfortunately that’s where this solution falls apart. As soon as you’re dealing with a partial data set, there aren’t any elegant solutions to the problem (that I know of). If you do find something that works, is quick and elegant, please do share!

     
  11. Gabe da Silveira

    October 20, 2008 at 8:27 pm

    A few notes:

    First, if you are limiting the data set via a where clause, if the set is small enough, then ORDER BY RAND() may be fast enough for you.

    Another thing to keep in mind is that:

    SELECT id FROM table ORDER BY rand();

    Can be orders of magnitude faster than:

    SELECT * FROM table ORDER BY rand();

    Due to the temporary table creation. Granted you probably will then need to do another query to pull the data, but…

    Regarding multiple queries to the database, you should run benchmarks with your particular setup to find out if multiple queries will actually hurt you. In some cases what you’re trying to do can be more efficient in code than SQL even considering connection latency. On my setup even though the database is on a separate cluster, it is still very low latency and multiple queries have not been a problem.

    With that in mind, I think the original fast solution you post is pretty good (and no less elegant to me than a weird subselect). When I started database programming 10 years ago I gravitated in the direction of as many efficient joins as possible to minimize queries. It worked pretty well, but after working on more complex applications I found the benefits of a nice ORM system and simple queries to ultimately be more beneficial for productivity without hurting performance (ie. knowing where to optimize).

    Finally, another thing to watch out for in your solution is that in sparse tables the randomness can suffer severely. Imagine a table with rows 1-100. Delete 2-99 . Now you have a 99% chance of getting 100 and a 1% chance of getting 1.

     
  12. Frederick R

    October 31, 2008 at 1:41 am

    Alternative random result; I think more flexible and faster result.

    SELECT * FROM (SELECT id,name,rand() as ran FROM table LIMIT 10) AS x ORDER BY x.ran

    Cheers!

     
  13. Jay

    October 31, 2008 at 5:14 am

    What that query will do is grab the first 10 rows in the table, and order them by rand. If you only want the first 10 rows to be considered, and your table is much larger, then yes this is faster. Otherwise you are just doing explicitly what MySQL does implicitly when you do an ORDER BY RAND().

     
  14. Frederick R

    October 31, 2008 at 6:30 am

    It is different from ORDER BY RAND() in sense that you actually take only 10 rows given by mysql then order it by the randomize number ran.

    Unlike ORDER BY RAND(), mysql will take all 6m tuples and randomize it.

    One idea to improve the code would be to increase the samples and do another limit. Even better if you play with the result a little more…

    SELECT * FROM (SELECT id,name,rand() as ran FROM table WHERE id like '345' OR name like 'exa' LIMIT 100) AS x ORDER BY x.ran LIMIT 10

    Feeding SQL another random number or random string from your script will give an even more mixed-up result. Still a lot faster than RAND().

     
  15. Matthew Montgomery

    January 28, 2009 at 2:05 pm

     
  16. Robert Goodyear

    December 22, 2010 at 9:14 pm

    I wound up simply determining a column with non-sequential data in it, and then using ORDER BY md5(‘column’) LIMIT n

    Was super fast, and subsequent pulls could add the offset method for LIMIT to grab more random chunks.

     
  17. Jay

    December 23, 2010 at 7:22 am

    Robert, I don’t see how that solution could be fast unless you are dealing with small sets of data, it’s going to apply the md5 function for every row in the table, store all of those results, and then sort them. What does the EXPLAIN look like?

     
  18. augusto

    March 31, 2011 at 1:49 pm

    well done!
    thanks for sharing

     
  19. Ash

    June 30, 2011 at 12:46 am

    Robert, your solution always returns the same results whenever I execute the query. It’s only random once but not until a new row has been inserted. I prefer Gabe’s explanation and Frederick’s solution. Sorry for the grammar.

     
  20. manu

    March 7, 2013 at 10:02 pm

    this truy a great way to fetch random rows.