Have some tips!

Categories: SQL life

In order to improve the quality of your life, I am providing some tips! I’ll start with some geeky stuff and move on to life tips.

Did you know that there is built in pager functionality in MySQL command line? I sure didn’t! It’s pretty cool, and extremely simple to use:
mysql>pager less
PAGER set to less

or, as Travis suggests you can even use vim:
mysql>pager vim -
PAGER set to vim -

Another MySQL tip is one that I figured out for my own uses a while ago and didn’t think much of until both Skyler and Travis also needed the use of it: it’s quite easy to search for tables containing a certain column.
For example, if you need to find all tables that contain a UserID, the following will do the trick:
SELECT table_name FROM information_schema.columns WHERE table_schema=’your_database_name’ AND column_name=’UserID’;
It’s great if you name all of your columns consistently and are doing this just because you don’t want to have to look that information up, but don’t rely on it when you’re trying to mess with an externally developed project. For example, Bugzilla actually has several different names for the UserID column depending on which table it’s in.

One last MySQL tip: disable name resolution if you haven’t done so already. MySQL attemps to do a reverse lookup on IP every time a connection is made, whether it needs to or not. Normally that behavior is fine, but if your DNS server goes down, becomes unreachable or (even worse) becomes slow, it can take MySQL up to 10 seconds per connection to authenticate or time out. It becomes incredibly easy to use up all of your allowed connections when this happens.

On to some life tips:
I recently bought a 2005 Scion xB. Aside from being a hideous solar yellow, it’s a great car and I got a great deal on it, so I couldn’t say no. A lot of people were surprised that I was able to pay cash for the car, considering that I don’t make a lot of money. The trick is that years ago I resolved to only ever pay cash for a vehicle, so I’ve only purchased vehicles I could afford at the time, while setting aside what would otherwise go towards a car payment. For the past 3 years I’ve been driving a jalopy that I bought for $800. About a year ago I bought a Yamaha C3 scooter for commuting to work and other in-town driving to save gas, for about $2,000. So for the past 3 years I’ve spent $2,800 for transportation. How much would I have spent on car payments? Let’s be conservative and estimate a payment of $300/month.
$300/month * 12 months/year * 3 years = $10,800
Having saved that money (while collecting interest!), it should be no surprise that I had enough to pay cash for a nicer car. And of course I still have my scooter for in town commuting, so I’ll continue to get excellent gas mileage in town and keep the wear and tear on the car down to a minimum, and based on the above math the xB only has to last me about 3 years to pay for itself. Chances are, of course, that I’ll get many more years of service out of it than that. Moral of the story? Don’t pay any more for a vehicle than you can afford. Sounds obvious when you put it that way, doesn’t it?

And now, a housekeeping tip from Skyler. If you enjoy cooking but hate doing dishes, a nice “trick” is to cook a relatively simple meal, something like rice where it’s not very involved but a bit time consuming, where you kind of have to be around keeping an eye on it. Once you get your dish going, you kind of have to sit around the kitchen anyway, so you might as well do some cleaning. It feels much less like a boring waste of time if you’re already stuck in the kitchen doing something you like. I tried this last night and it was much more tolerable. Now if only I could find a similar trick for making picking up after myself more enjoyable…

On the topic of being messy, I highly, highly recommend getting bunches of tackle organizers for all sorts of organizational needs. Tackle organizers come in all different shapes and sizes, and many have removable dividers (like this one) for ultimate flexibility. Have a junk drawer, or two or three? Put your crap in these and then put these in the drawer, it’ll be easier to find your stuff and you’ll probably be able to make more efficient use of the space. I also use one of these in my toolbox, and one for my sewing supplies.

MySQL Crash

Categories: SQL

On 3/3/08, beta.grooveshark.com was down for several hours. It took us a few minutes to figure out what was wrong. PHP logs showed that Auth was crashing on a bind_param error. Specifically, bind_param was complaining that hte number of arguments was different from the number of placeholders, which is really bind_param’s way of saying “something is broken, and I don’t know what.” I skimmed through everything Auth related to see if someone had uploaded a file to the live server recently by hand, bypassing the subversion/snapping proces, but all the timestamps were from when we last snapped, a few days prior.

While I was doing that, Colin thought to check the MySQL error log since the errors were SQL related. Sure enough, MySQL had crashed and restarted itself, but it left many of our MyISAM tables in a corrupted state. I ran REPAIR TABLE on all of the tables listed in the log but the site still wasn’t working properly. I dropped into the shell* and ran myisamchk on all of the MyISAM tables to see which ones were corrupted and to my surprise, some of those tables were ones I had already REPAIRed, so I ran REPAIR TABLE … EXTENDED on each of those and then, finally, the site worked again!

It’s worth noting that all of our InnoDB tables survived the crash completely unharmed. Moral of the story: don’t use MyISAM tables unless you absolutely have to. It’s too bad that MySQL uses MyISAM by default and doesn’t have a single fully-featured storage engine available. As a result, everything needing fulltext indexing will remain on MyISAM for the time being. We still don’t know the exact cause of the crash, but it’s been smooth sailing since we moved all those tables over to InnoDB, knock on wood.

*Handy tip: If you’re in MySQL and you need to drop to the shell, ^Z (CTRL-Z) is a quick and easy way to do so. Once you’ve finished what you need to do in the shell, just run fg, and assuming you haven’t backgrounded any other tasks since dropping to the shell, you’ll be back in MySQL, exactly where you left off.

‡That was the first time I have had to run myisamchk, so on the off chance that you’ve never used it before either, here’s a tip that it took me a few minutes to figure out: run myisamchk on the folder containing your DB files, and give it *.MYI. I initially thought myisamchk would be smart enough to find the DB files — it’s not.

SQL Schema and Graphs/Maps

Categories: Coding SQL

A while ago I wrote about my auto-query generator project. I only just recently got around to finishing it up because other things had higher priority, and also because I wasn’t entirely convinced that I was doing things the bes way, and I wanted to take some time to experiment.

Matt sat down with me and analyzed the problem, and we decided that we could use the schema to create a graph with all of the edges (our ID columns are consistently named in each table), and then use a shortest-path-finding algorithm, and then I could write a SQL generator that works off of the path. Getting all of the IDs in our tables in MySQL is pretty easy:
SELECT DISTINCT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%ID'

Then getting all the tables for a given ID:
SELECT TABLE_NAME, COLUMN_KEY FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = '$id'
AND TABLE_SCHEMA = 'yourschemahere'

From that information, I simply built a graph which I represented with an adjacency map.

Unfortunately, that did not work. The path finding algorithm was basically too good, finding paths that were the shortest, but not necessarily the correct way to get from one table to another. For example, two tables might contain a GenreID, but maybe they are actually linked by ArtistID. Ok, so what about only making an edge when that column is a primary key in one of the nodes representing a table? That wasn’t hard to do, either, but it still gave wrong results in some cases. Sometimes it’s just more efficient (but wrong) to route through the Genres table than go the right way.

I considered making a directed graph so that connections would only be one-way to the table with the ID as a primary key, but I realized that wouldn’t work either, because sometimes you do need to join tables based in IDs that are not primary keys. Essentially, our schema does not completely represent the full complexity of the relationships that it contains.

So I went back to my original method, which was to map out the paths by hand. Tedious though it may have been, it’s still a pretty clever solution, in my opinion.

I created two maps. The first simply says “if you have this ID, and you are trying to get to this table, start at this other table,” for every possible ID, and the next one simply says “if you’re at this table, and you’re trying to get to this other table, here is the next table you need to go through.”

The great thing about this is that most of those steps can be reused, but I only had to create them once. For example, it’s always true that to get from Users to Files you must go through UsersFiles, no matter what your starting point is, although you may be trying to find all of the Songs, Albums or Artists that a user has in their library.

Having spelled things out this way, there is no guesswork for a path finding algorithm, because there is literally only one path. In fact it hardly counts as a path finder or an algorithm; it just iterates through the map until it reaches its destination. And it works. I will post as many details as I’m allowed about exactly how the actual SQL building algorithm works, and about how I am able to merge multiple paths, so for example you can have an ArtistID and a PlaylistID to get all of the artists on a given playlist. Stay tuned.

ALTER TABLE performance

Categories: SQL

This tip may be obvious to many of you, but I just discovered it recently, and it was news to Skyler when I told him, so I’ll share it here too:

If you need to make a bunch of ALTERs to the same table, it’s much more efficient to make them all in one huge comma-delimited statement rather than running each ALTER separately.

Why? Because every time you make an ALTER, MySQL makes a new table that meets your new altered definition, copies the entire contents into it, deletes the old one and renames the new one to the name of the old one. So if you make 10 alters, MySQL will do that 10 times, but if you make one alter with 10 changes in it, MySQL will only have to copy the table once.

Skyler estimates that trick will probably save us about 10 hours when we port the old (current) database to the new schema for file/song.

Hot Cache

Categories: SQL

One of the worst things about restarting a MySQL server is losing the cache. Your whole site might be slow for hours and hours before MySQL has a good cache and is loading data from memory instead of disc. How can you quickly get your caches hot, or at least lukewarm? Well, if you know which tables are most important, you can ask MySQL to load them up. How? By taking advantage of MySQL’s terrible query optimizer:
SELECT 1 FROM (SELECT * FROM ImportantTable) x LIMIT 1;
Any intelligent or sane optimizer would look at that query and at the very lease decide that it only needs the first row from ImportantTable, or preferably decide that it doesn’t need to select anything from ImportantTable at all. But MySQL’s query optimizer is neither intelligent nor apparently very sane, so it actually reads every row from ImportantTable, and hopefully sticks it in cache.

In our case we have a table with 10 million rows that nearly everything runs through. The explain shows the results:

EXPLAIN SELECT 1 FROM (SELECT * FROM UsersSongs) x LIMIT 1;
+----+-------------+------------+------+---------------+------+---------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+----------+-------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10111671 | |
| 2 | DERIVED | UsersSongs | ALL | NULL | NULL | NULL | NULL | 10149257 | |
+----+-------------+------------+------+---------------+------+---------+------+----------+-------+

Note: if you wanted to be even more slick, you could craft your inner SELECT to load indexes into memory, which are even more helpful than having actual rows in memory, of course. So for example:
SELECT 1 FROM (SELECT UserID, SongID FROM UsersSongs) x LIMIT 1; assuming UsersSongs has a multi-column index consisting of UserID and SongID, of course.

Denormalizing MySQL

Categories: SQL

Some of you may have noticed that Grooveshark is a bit sluggish most of the time and an be extremely slow when doing certain things (searching, loading the discover page, looking at large libraries, etc.)

The biggest source of this slowdown comes from three major, related factors:

  1. We need to check whether each song is offline before displaying it.
  2. There are several different ways a song can be off- or online.
  3. Our schema is highly normalized.

The upshot of all this is that our queries to get song info, including online status of a song, require LEFT JOINing across multiple tables, onf of which a the time of this writing has over 10 million rows, another about 8 million.

As part of the major backend overhaul currently in process code named file/song, we have decided to de-normalize information about online status so that it is contained within the Songs table and updated by triggers.

The trick to keeping this fast for inserts and updates where these triggers are fired is to keep a running total for the number of files that are online for any given song, in a field called NumOnline, rather than re-calculating that value for every insert/update. To start out, we populate NumOnline for every single row in the Songs table by using a huge nasty query that aggregates the online statuses using LEFT JOINs and all that fun stuff that currently has to happen every time we want to display a song.

It takes a while to fill in that information (about 6 minutes) but once it’s there, all the hard work has been done. When a file goes on- or offline, we merely need to increment or decrement NumOnline - we can safely skip doing the long counts of everything as long as we make sure to handle every action that might cause that number to change.

That’s not quite as easy as it sounds, of course. Let’s look at the list of ways that number can change:

  • File is added to or removed from a user’s library.
  • User goes offline or comes online
  • User connects/disconnects external drive
  • File is added to the cache
  • Two songs are merged by the editing system

Each of these cases is a different update to a different table or field and we must be certain to handle each one correctly, but if we do that we gain the major advantage of being able to eliminate 3-4 LEFT JOINs from our Song-related queries and these SELECTS are blazing fast again. Definitely worth it.

Quick and Easy MySQL Benchmarking

Categories: SQL

When trying to tweak a query in MySQL it’s often difficult to tell if the changes you’ve made are actually making a difference, and if so by how much.
You may have noticed that using SQL_NO_CACHE still gives a very poor picture of what the performance is going to be like. Even with SQL_NO_CACHE, the first execution of the query might take 5 seconds, and the next execution might take 0.0 seconds. What gives? Well, even if you bypass MySQL’s caches, Linux is doing its own file system caches.

I’ve seen explanations of how to get around this issue before and none of them are particularly nice. Some say restarting MySQL is the way to go, others say you have to reboot the system. Travis’s original solution was to cat a huge file to /dev/null. All of these solutions are slow and somewhat annoying to do when you’re trying to make a bunch of tweaks to a query and test the effects.

A few days ago Travis found a solution that makes life easy again.

Effectively, the solution is this:

  1. Use SQL_NO_CACHE in your query each time.
  2. From the command line logged in as root, execute: echo 3 > /proc/sys/vm/drop_caches after each execution

That’s it! You must have kernel version 2.6.16 or greater, but that’s the only caveat.

Armed with this new technique to easily benchmark queries, I was able to tweak an intensive query running on his box to bring the execution time down from about 33 seconds to around 8 seconds, approximately 75% faster. To be perfectly honest if I didn’t have this easy instant feedback I probably would have quit with the optimization that brought execution time down to 25 seconds, because I would have known that it was faster but not by how much and I wouldn’t have been able to tell if any subsequent tweaks were really making much of a difference.

Auto-query-generator

Categories: Coding SQL

They say that Computer Scientists aren’t satisfied with putting other people out of jobs; they want to put themselves out of a job as well. In that vein I am writing a function that, given basic information about our database tables and their relationships, can build efficient queries automatically based on a complex filter consisting of:

  • The information requested
  • Any combination of IDs to filter the results based on

For example, you can tell it that you want the names and IDs of Playlists belonging to a user with a certain artist and my function will build the SQL on the fly based on what it knows.
This is extremely useful because I am working on a super secret project* with Katy and her app has no knowledge of hte database but needs to be able to ask for some extremely specific sets of data depending on user behavior. Instead of manually writing N! queries to handle every possible combination of requests, my function does all of the heavy lifting for me.

The code isn’t quite done yet but the code that figures out the join sequence and which key to join on is done, and surprisingly it only takes about 10 lines of code. I still need to add code to handle special cases, and I need to write the code that takes that sequence and turns it into real live SQL so it will surely grow, but that is still far less than I imagined would be required to get this far.

*not actually a secret project, but I haven’t written about it yet and it’s beyond the scope of this post.

ORDER BY SLOW()

Categories: SQL

One constant source of database/mySQL related frustration is the need to fetch a random row in a given table. We do randomization a lot, especially for anything recommendation-related so that users don’t see the exact same set of recommendations over and over again. This worked fine when our tables were small and everything fit into memory easily, but now that our tables have 6+ million rows in them, the old standby ORDER BY RAND() LIMIT 1 is no good.

ORDER BY RAND() in and of itself isn’t the worst thing in the universe, it does use a temp table to do all that sorting which is certainly less than ideal to begin with, but by the very nature of ORDERs and LIMITs, mySQL can’t apply the LIMIT 1 until it has ordered your result set. Good luck to you if your result set contains thousands of rows. That’s why we’ve taken to calling it ORDER BY SLOW(). You really have to think about how to apply ORDER BY RAND() so that it is functional and fast on massive data sets.

Whatever you do, don’t do it like this.

One of the comments has a pretty decent solution:

Now, a better solution would be:
SELECT count(*) FROM foo
and save the result as num_rows. Then
SELECT * FROM foo LIMIT [random number between 0 and num_rows],1
you just selected a random row in two quick queries! congrats!

There’s nothing really wrong with that way but it involves an unnecessary extra round-trip to the DB server and just feels inelegant. Also don’t think you can avoid coming back in to PHP between queries; mySQL will not allow you to use a mySQL variable (or sub-select clause or function call) as a LIMIT. In PostgreSQL this would be relatively simple: SELECT * FROM table OFFSET RANDOM() LIMIT 1; (obviously you would need slightly more complex logic to make sure RANDOM() is returning a legitimate value within the range of allowable OFFSETs, but this can all be done inside of SQL)

Thomas Baggelaar Almost had it right:

SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;

Unfortunately mySQL executes the inner select for every single row comparison, so that is at least as slow as the original.
I posted my solution in response:

SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;

By joining on a nested (or delayed) select in this way, the inner select statement is executed just once. Again there are potential complications with this solution as well: if your IDs are non-sequential (i.e. some get deleted, or they are not an auto_increment field), it will be biased towards higher numbers and not truly random. However even if that is the case you might decide that it’s better to lose true randomness in order to get your query to finish running in .02 seconds instead of 4 minutes.

How do you make that work if you want more than one row? I currently don’t know of any way to do it, although I’m sure it’s possible. One trick would be to find a way to make the inner query return Y rows where Y is the number of rows you want the entire query to return. I know that it’s possible to do that but I haven’t put quite enough thought into how that can be accomplished to have a solution yet.

Another thing we are going to try out is seeding every row with a random value, and then doing our selects based on those, periodically giving those rows new random values. My understanding is that this is how Wikipedia does it. I might also experiment with having a stored procedure build the SQL, dynamically setting a LIMIT before executing the query, thereby preventing mySQL from complaining about using a variable for LIMIT.

The Importance of Being Earnest

Categories: Coding SQL

One of the facts of life you have to work with when working at a startup is that your code will always be changing. New features are added, old code gets re-factored, the database layer gets overhauled, etc. That, coupled with the fact that there is always more that needs to be done than time to do it, add up to having almost no documentation for anything.

This is why it is critically important to have self-documenting code. If your code is not self-documenting, and I need to interface with it, then I have to wait for you to have time to answer my questions, or spend a lot of time digging through your code trying to figure out what is actually going on.

There is actually something worse than non-self-documenting code, however, and that’s self-documenting code that is incorrect. If you do not have real documentation and the names you use imply one thing but do another, you are going to cause serious problems.

Of course, I have examples of both problems. In the first, we have two entities that I needed to extrapolate information from: Duration and Bitrate. Do you see the problem? No? What if I ask you for the file size of a song given the following: Bitrate: 190000, Duration: 211749000. Can you tell me at a glance what math would be required in order to get the file size? You can’t, because the units are missing.

For the second example we have something that actually caused a major bug in Grooveshark that is just now in the process of being fixed. We have a field called isOnline. Do you think that isOnline tells whether a file is online? It doesn’t. Well, not always. If isOnline is 0, the file is definitely offline. But if isOnline is 1, it might still be offline, because what is really being tracked by this field is whether the file has been removed from a user’s library (or rather, the inverse of that). With a name like isOnline, it’s quite tempting to assume that that is how you check whether a file is online, and so that is what happened. Which is why often times when you click play on a song in Grooveshark, you get a peer offline error. The file isn’t online because the peer isn’t online, but isOnline is set to 1 because the user has not removed the file.