SQL « Jay Paroline

RSS

Archive for the ‘SQL’ Category

Bypassing Magic

18 Aug

In my post about how we are adding client-side caching to Grooveshark 2.0, I mentioned one of the ways we are taking advantage of the fact that thanks to using Flash, we have a full-blown stateful application.

As Grooveshark evolves and the application becomes more sophisticated, the PHP layer is more and more becoming just an interface to the DB. The application just needs the data; it knows exactly what to do with it from there. It also only needs to ask for one type of data at a time, whereas a traditional webpage would need to load dozens of different pieces of information at the same time. So for our type of application, magic methods and ORM can really just get in the way when all we really need is to run a query, fill up an array with the results of that query, and return it.

Our old libraries employing ORM, magic methods and collections, were designed to meet the needs of a typical website and don’t necessarily make sense for a full-fledged application. On a webpage, you might only show 20 results at a time, so the overhead of having a bunch of getters and setters automatically fire whenever you load up your data is probably not noticeable. But in an application, you often load far more results than can be displayed, and allow the user to interact with them more richly. When you’re loading 500, or 5000 results as opposed to 20, the overhead of ORM and magic can start to really bog you down. I first noticed the overhead issue when testing new method calls for lite2, when in some cases fetching the data would take over 30 seconds, triggering my locally defined maximum execution time, even when the data was already cached.

Like any responsible developer considering making changes to code for performance reasons, I profiled our collections code using XDebug and KCachegrind, and then I rewrote the code to bypass collections, magic and all that stuff, loading data from the DB (or memcache) into an array and returning it. The difference? In the worst case, bypassing magic was an order of magnitude less work, often times far better than that. My > 30 second example took less than 1 second in the new code.

For Grooveshark 2.0 code, wherever makes sense, we are bypassing magic, ORM and collections and loading data directly. This of course means that Grooveshark is faster, but it also means that we can load more data at once. In most cases we can now afford to load up entire lists of songs without having to paginate the results, which in turn means fewer calls to the backend *and* much less work for the database. Whenever you must LIMIT results, you must also ORDER BY the results so they come back in an order that makes sense. Not having to ORDER results means in many cases we save an expensive filesort which often requires a temporary table in MySQL. Returning the full data set also allows the client to do more with the data, like decide how the results should actually be sorted and displayed to the user. But that’s another post…

2 Comments

Posted in Coding, grooveshark, SQL

Indexes Matter (or: Memcache Will Only Take You So Far)

28 Jan

About a week ago, I was doing some work on the DB in the middle of the night and noticed that my simple queries were running a bit sluggish. I dropped out of mysql and ran top, and noticed that load averages were way higher than I was used to seeing. I ran SHOW FULL PROCESSLIST a bunch of times, and noticed two queries popping up frequently, one was a backend processing query which did not belong on the production database, and the other was the query used to build Widget objects. My first suspect was the backend process, since it did not belong, so we took that off and moved it to a more appropriate server, which brought down the load average by 1; a significant improvement, but the load averages were still pretty high, however the server was usable and responsive enough again, so I forgot about it.

A couple of days later, I noticed our load averages were still pretty high and the main recurring query was still the widget one, so I ran an explain on it, and although the query looked innocent enough, it was missing an index, so instead of a quick lookup it was a full table scan across millions of rows. Ouch.

I knew we wouldn’t have a chance to have some downtime to run the necessary ALTERs to get the indexes in there until after the weekend, so I asked Chanel to put in memcache support so that widgets would only need to be loaded once from SQL. Chanel got that done on Sunday, and on Monday night we were able to get the proper indexes added.

Because of the time span involved, combined with the fact that we monitor server metrics with Zabbix, means that we can look back at a nice little graph of our performance before and after each of the changes.

The days with the grey background are Saturday and Sunday, before memcache was added. The next day, with memcache added the peak load is cut in half. The day after that, with proper indexes, the peak load is barely perceptible, roughly 1/4 of what the load was with just memecache.

The lesson to be learned from this is that while memcache can help quite a bit, there’s a lot to be said for making sure your SQL queries are optimized.

1 Comment

Posted in grooveshark, SQL

Have some tips!

26 Jun

In order to improve the quality of your life, I am providing some tips! I’ll start with some geeky stuff and move on to life tips.

Did you know that there is built in pager functionality in MySQL command line? I sure didn’t! It’s pretty cool, and extremely simple to use:
mysql>pager less PAGER set to less
or, as Travis suggests you can even use vim:
mysql>pager vim - PAGER set to vim -

Another MySQL tip is one that I figured out for my own uses a while ago and didn’t think much of until both Skyler and Travis also needed the use of it: it’s quite easy to search for tables containing a certain column.
For example, if you need to find all tables that contain a UserID, the following will do the trick:
SELECT table_name FROM information_schema.columns WHERE table_schema=’your_database_name’ AND column_name=’UserID’;
It’s great if you name all of your columns consistently and are doing this just because you don’t want to have to look that information up, but don’t rely on it when you’re trying to mess with an externally developed project. For example, Bugzilla actually has several different names for the UserID column depending on which table it’s in.

One last MySQL tip: disable name resolution if you haven’t done so already. MySQL attemps to do a reverse lookup on IP every time a connection is made, whether it needs to or not. Normally that behavior is fine, but if your DNS server goes down, becomes unreachable or (even worse) becomes slow, it can take MySQL up to 10 seconds per connection to authenticate or time out. It becomes incredibly easy to use up all of your allowed connections when this happens.

On to some life tips:
I recently bought a 2005 Scion xB. Aside from being a hideous solar yellow, it’s a great car and I got a great deal on it, so I couldn’t say no. A lot of people were surprised that I was able to pay cash for the car, considering that I don’t make a lot of money. The trick is that years ago I resolved to only ever pay cash for a vehicle, so I’ve only purchased vehicles I could afford at the time, while setting aside what would otherwise go towards a car payment. For the past 3 years I’ve been driving a jalopy that I bought for $800. About a year ago I bought a Yamaha C3 scooter for commuting to work and other in-town driving to save gas, for about $2,000. So for the past 3 years I’ve spent $2,800 for transportation. How much would I have spent on car payments? Let’s be conservative and estimate a payment of $300/month.
$300/month * 12 months/year * 3 years = $10,800
Having saved that money (while collecting interest!), it should be no surprise that I had enough to pay cash for a nicer car. And of course I still have my scooter for in town commuting, so I’ll continue to get excellent gas mileage in town and keep the wear and tear on the car down to a minimum, and based on the above math the xB only has to last me about 3 years to pay for itself. Chances are, of course, that I’ll get many more years of service out of it than that. Moral of the story? Don’t pay any more for a vehicle than you can afford. Sounds obvious when you put it that way, doesn’t it?

And now, a housekeeping tip from Skyler. If you enjoy cooking but hate doing dishes, a nice “trick” is to cook a relatively simple meal, something like rice where it’s not very involved but a bit time consuming, where you kind of have to be around keeping an eye on it. Once you get your dish going, you kind of have to sit around the kitchen anyway, so you might as well do some cleaning. It feels much less like a boring waste of time if you’re already stuck in the kitchen doing something you like. I tried this last night and it was much more tolerable. Now if only I could find a similar trick for making picking up after myself more enjoyable…

On the topic of being messy, I highly, highly recommend getting bunches of tackle organizers for all sorts of organizational needs. Tackle organizers come in all different shapes and sizes, and many have removable dividers (like this one) for ultimate flexibility. Have a junk drawer, or two or three? Put your crap in these and then put these in the drawer, it’ll be easier to find your stuff and you’ll probably be able to make more efficient use of the space. I also use one of these in my toolbox, and one for my sewing supplies.

2 Comments

Posted in life, SQL

MySQL Crash

19 Mar

On 3/3/08, beta.grooveshark.com was down for several hours. It took us a few minutes to figure out what was wrong. PHP logs showed that Auth was crashing on a bind_param error. Specifically, bind_param was complaining that hte number of arguments was different from the number of placeholders, which is really bind_param’s way of saying “something is broken, and I don’t know what.” I skimmed through everything Auth related to see if someone had uploaded a file to the live server recently by hand, bypassing the subversion/snapping proces, but all the timestamps were from when we last snapped, a few days prior.

While I was doing that, Colin thought to check the MySQL error log since the errors were SQL related. Sure enough, MySQL had crashed and restarted itself, but it left many of our MyISAM tables in a corrupted state. I ran REPAIR TABLE on all of the tables listed in the log but the site still wasn’t working properly. I dropped into the shell^* and ran myisamchk on all of the MyISAM tables^‡ to see which ones were corrupted and to my surprise, some of those tables were ones I had already REPAIRed, so I ran REPAIR TABLE … EXTENDED on each of those and then, finally, the site worked again!

It’s worth noting that all of our InnoDB tables survived the crash completely unharmed. Moral of the story: don’t use MyISAM tables unless you absolutely have to. It’s too bad that MySQL uses MyISAM by default and doesn’t have a single fully-featured storage engine available. As a result, everything needing fulltext indexing will remain on MyISAM for the time being. We still don’t know the exact cause of the crash, but it’s been smooth sailing since we moved all those tables over to InnoDB, knock on wood.

*Handy tip: If you’re in MySQL and you need to drop to the shell, ^Z (CTRL-Z) is a quick and easy way to do so. Once you’ve finished what you need to do in the shell, just run fg, and assuming you haven’t backgrounded any other tasks since dropping to the shell, you’ll be back in MySQL, exactly where you left off.

‡That was the first time I have had to run myisamchk, so on the off chance that you’ve never used it before either, here’s a tip that it took me a few minutes to figure out: run myisamchk on the folder containing your DB files, and give it *.MYI. I initially thought myisamchk would be smart enough to find the DB files — it’s not.

No Comments

Posted in SQL

SQL Schema and Graphs/Maps

18 Mar

A while ago I wrote about my auto-query generator project. I only just recently got around to finishing it up because other things had higher priority, and also because I wasn’t entirely convinced that I was doing things the bes way, and I wanted to take some time to experiment.

Matt sat down with me and analyzed the problem, and we decided that we could use the schema to create a graph with all of the edges (our ID columns are consistently named in each table), and then use a shortest-path-finding algorithm, and then I could write a SQL generator that works off of the path. Getting all of the IDs in our tables in MySQL is pretty easy:
SELECT DISTINCT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME LIKE '%ID'
Then getting all the tables for a given ID:
SELECT TABLE_NAME, COLUMN_KEY FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME = '$id' AND TABLE_SCHEMA = 'yourschemahere'

From that information, I simply built a graph which I represented with an adjacency map.

Unfortunately, that did not work. The path finding algorithm was basically too good, finding paths that were the shortest, but not necessarily the correct way to get from one table to another. For example, two tables might contain a GenreID, but maybe they are actually linked by ArtistID. Ok, so what about only making an edge when that column is a primary key in one of the nodes representing a table? That wasn’t hard to do, either, but it still gave wrong results in some cases. Sometimes it’s just more efficient (but wrong) to route through the Genres table than go the right way.

I considered making a directed graph so that connections would only be one-way to the table with the ID as a primary key, but I realized that wouldn’t work either, because sometimes you do need to join tables based in IDs that are not primary keys. Essentially, our schema does not completely represent the full complexity of the relationships that it contains.

So I went back to my original method, which was to map out the paths by hand. Tedious though it may have been, it’s still a pretty clever solution, in my opinion.

I created two maps. The first simply says “if you have this ID, and you are trying to get to this table, start at this other table,” for every possible ID, and the next one simply says “if you’re at this table, and you’re trying to get to this other table, here is the next table you need to go through.”

The great thing about this is that most of those steps can be reused, but I only had to create them once. For example, it’s always true that to get from Users to Files you must go through UsersFiles, no matter what your starting point is, although you may be trying to find all of the Songs, Albums or Artists that a user has in their library.

Having spelled things out this way, there is no guesswork for a path finding algorithm, because there is literally only one path. In fact it hardly counts as a path finder or an algorithm; it just iterates through the map until it reaches its destination. And it works. I will post as many details as I’m allowed about exactly how the actual SQL building algorithm works, and about how I am able to merge multiple paths, so for example you can have an ArtistID and a PlaylistID to get all of the artists on a given playlist. Stay tuned.

No Comments

Posted in Coding, SQL

ALTER TABLE performance

17 Mar

This tip may be obvious to many of you, but I just discovered it recently, and it was news to Skyler when I told him, so I’ll share it here too:

If you need to make a bunch of ALTERs to the same table, it’s much more efficient to make them all in one huge comma-delimited statement rather than running each ALTER separately.

Why? Because every time you make an ALTER, MySQL makes a new table that meets your new altered definition, copies the entire contents into it, deletes the old one and renames the new one to the name of the old one. So if you make 10 alters, MySQL will do that 10 times, but if you make one alter with 10 changes in it, MySQL will only have to copy the table once.

Skyler estimates that trick will probably save us about 10 hours when we port the old (current) database to the new schema for file/song.

No Comments

Posted in SQL

Hot Cache

11 Mar

One of the worst things about restarting a MySQL server is losing the cache. Your whole site might be slow for hours and hours before MySQL has a good cache and is loading data from memory instead of disc. How can you quickly get your caches hot, or at least lukewarm? Well, if you know which tables are most important, you can ask MySQL to load them up. How? By taking advantage of MySQL’s terrible query optimizer:
SELECT 1 FROM (SELECT * FROM ImportantTable) x LIMIT 1;
Any intelligent or sane optimizer would look at that query and at the very lease decide that it only needs the first row from ImportantTable, or preferably decide that it doesn’t need to select anything from ImportantTable at all. But MySQL’s query optimizer is neither intelligent nor apparently very sane, so it actually reads every row from ImportantTable, and hopefully sticks it in cache.

Note: if you wanted to be even more slick, you could craft your inner SELECT to load indexes into memory, which are even more helpful than having actual rows in memory, of course. So for example:
SELECT 1 FROM (SELECT UserID, SongID FROM UsersSongs) x LIMIT 1; assuming UsersSongs has a multi-column index consisting of UserID and SongID, of course.

1 Comment

Posted in SQL

Denormalizing MySQL

05 Mar

Some of you may have noticed that Grooveshark is a bit sluggish most of the time and an be extremely slow when doing certain things (searching, loading the discover page, looking at large libraries, etc.)

The biggest source of this slowdown comes from three major, related factors:

We need to check whether each song is offline before displaying it.
There are several different ways a song can be off- or online.
Our schema is highly normalized.

The upshot of all this is that our queries to get song info, including online status of a song, require LEFT JOINing across multiple tables, onf of which a the time of this writing has over 10 million rows, another about 8 million.

As part of the major backend overhaul currently in process code named file/song, we have decided to de-normalize information about online status so that it is contained within the Songs table and updated by triggers.

The trick to keeping this fast for inserts and updates where these triggers are fired is to keep a running total for the number of files that are online for any given song, in a field called NumOnline, rather than re-calculating that value for every insert/update. To start out, we populate NumOnline for every single row in the Songs table by using a huge nasty query that aggregates the online statuses using LEFT JOINs and all that fun stuff that currently has to happen every time we want to display a song.

It takes a while to fill in that information (about 6 minutes) but once it’s there, all the hard work has been done. When a file goes on- or offline, we merely need to increment or decrement NumOnline – we can safely skip doing the long counts of everything as long as we make sure to handle every action that might cause that number to change.

That’s not quite as easy as it sounds, of course. Let’s look at the list of ways that number can change:

File is added to or removed from a user’s library.
User goes offline or comes online
User connects/disconnects external drive
File is added to the cache
Two songs are merged by the editing system

Each of these cases is a different update to a different table or field and we must be certain to handle each one correctly, but if we do that we gain the major advantage of being able to eliminate 3-4 LEFT JOINs from our Song-related queries and these SELECTS are blazing fast again. Definitely worth it.

2 Comments

Posted in SQL

Quick and Easy MySQL Benchmarking

19 Feb

When trying to tweak a query in MySQL it’s often difficult to tell if the changes you’ve made are actually making a difference, and if so by how much.
You may have noticed that using SQL_NO_CACHE still gives a very poor picture of what the performance is going to be like. Even with SQL_NO_CACHE, the first execution of the query might take 5 seconds, and the next execution might take 0.0 seconds. What gives? Well, even if you bypass MySQL’s caches, Linux is doing its own file system caches.

I’ve seen explanations of how to get around this issue before and none of them are particularly nice. Some say restarting MySQL is the way to go, others say you have to reboot the system. Travis’s original solution was to cat a huge file to /dev/null. All of these solutions are slow and somewhat annoying to do when you’re trying to make a bunch of tweaks to a query and test the effects.

A few days ago Travis found a solution that makes life easy again.

Effectively, the solution is this:

Use SQL_NO_CACHE in your query each time.
From the command line logged in as root, execute: echo 3 > /proc/sys/vm/drop_caches after each execution

That’s it! You must have kernel version 2.6.16 or greater, but that’s the only caveat.

Armed with this new technique to easily benchmark queries, I was able to tweak an intensive query running on his box to bring the execution time down from about 33 seconds to around 8 seconds, approximately 75% faster. To be perfectly honest if I didn’t have this easy instant feedback I probably would have quit with the optimization that brought execution time down to 25 seconds, because I would have known that it was faster but not by how much and I wouldn’t have been able to tell if any subsequent tweaks were really making much of a difference.

No Comments

Posted in SQL

Auto-query-generator

02 Feb

They say that Computer Scientists aren’t satisfied with putting other people out of jobs; they want to put themselves out of a job as well. In that vein I am writing a function that, given basic information about our database tables and their relationships, can build efficient queries automatically based on a complex filter consisting of:

The information requested
Any combination of IDs to filter the results based on

For example, you can tell it that you want the names and IDs of Playlists belonging to a user with a certain artist and my function will build the SQL on the fly based on what it knows.
This is extremely useful because I am working on a super secret project* with Katy and her app has no knowledge of hte database but needs to be able to ask for some extremely specific sets of data depending on user behavior. Instead of manually writing N! queries to handle every possible combination of requests, my function does all of the heavy lifting for me.

The code isn’t quite done yet but the code that figures out the join sequence and which key to join on is done, and surprisingly it only takes about 10 lines of code. I still need to add code to handle special cases, and I need to write the code that takes that sequence and turns it into real live SQL so it will surely grow, but that is still far less than I imagined would be required to get this far.

*not actually a secret project, but I haven’t written about it yet and it’s beyond the scope of this post.

1 Comment

Posted in Coding, SQL

Jay Paroline

Archive for the ‘SQL’ Category

Bypassing Magic

Indexes Matter (or: Memcache Will Only Take You So Far)

Have some tips!

MySQL Crash

SQL Schema and Graphs/Maps

ALTER TABLE performance

Hot Cache

Denormalizing MySQL

Quick and Easy MySQL Benchmarking

Auto-query-generator

Pages

Archives

Categories