If you look at my recent posting (and tweeting) history, a new pattern becomes clear: Grooveshark has been down a lot lately. This morning, things broke yet again.
I don’t think we’ve been this unreliable since the beta days. If you don’t know what that means, consider yourself lucky. The point is that this is not the level of service we are aiming to provide, and not the level of service we are used to providing. So what’s going on?
Issue #1: Servers are over capacity
We hit some major snags getting our new servers, so we have been running over capacity for a while now. That means that at best, our servers are a bit slower than they should be and at worst, things are failing intermittently. Most of the other issues on this list are at least tangentially related to this fact, either because of compromises we had to make to keep things running, or because servers literally just couldn’t handle the loads that were being thrown at them. I probably shouldn’t disclose any actual numbers, but our User|Server ratio is at least an order of magnitude bigger than the most efficient comparable services we’re aware of, and at least two orders of magnitude bigger than Facebook…so it’s basically a miracle that the site hasn’t completely fallen apart at the seams.
Status: In Progress
Some of the new servers arrived recently and will be going into production as soon as we can get them ready. We’re playing catch up now though, so we probably already need more.
Issue #2: conntrack
Conntrack is basically (from my understanding) a built in part of Linux (or at least CentOS) related to the firewall. It keeps track of connections and enables some throttling to prevent DOS attacks. Unfortunately it doesn’t seem to be able to scale with the massive number of concurrent connections each server is handling now; once the number of connections reaches a certain size, cleanup/garbage collection takes too long and the number of connections tracked just grows out of control. Raising the limits helps for a little while, but eventually the numbers grow to catch up. Once a server is over the limits, packets are dropped en mass, and from a user perspective connections just time out.
Colin was considering removing conntrack from the kernel, but that would have caused some issues for our load balancer (I don’t fully understand what it has to do with the load balancer, sorry!). Fortunately he located some obscure setting that allows us to limit what conntrack is applied to, by port, so we can keep the load balancer happy without breaking everything when the servers are under heavy load. The fix seems to work well, so it should be deployed to all servers in the next couple of days. In the meantime, it’s already on the servers with the heaviest load, so we don’t expect to be affected by this again.
Issue #3: Bad code (we ran out of integers)
Last week we found out that playlist saving was completely broken. Worse, anyone trying to save changes to an existing playlist during that 3 hour period had their playlist completely wiped out.
There were really two issues here: a surface issue that directly caused the breakage, and an underlying issue that caused the surface issue.
The surface issue: the PlaylistsSongs table has an auto_increment field for uniquely identifying each row, which was a 32 bit unsigned int. Once that field is maxed out, it’s no longer possible to insert any more rows.
Underlying issue: the playlist class is an abomination. It’s both horrible and complex, but at the same time incredibly stupid. Any time a playlist is ever modified, the entries in PlaylistsSongs are deleted, and then reinserted. That means if a user creates a playlist with 5 songs and edits it 10 times, 50 IDs are used up forever. MySQL just has no way of going back and locating and reusing the gaps. How bad are the gaps? When we ran out of IDs there were over 3.5 billion of them; under sane usage scenarios, enough to last us years even at our current incredible growth rate.
We’ve known about the horror of this class and have been wanting to rewrite it for over a year, but due to its complexity and the number of projects that use the class, it’s not a quick fix, and for better or worse the focus at Grooveshark is heavily slanted towards releasing new feaures as quickly as possible, with little attention given to paying down code debt.
Status: Temporarily fixed
We fixed the problem in the quickest way that would get things working again — by making more integers available. That is, we altered the table and made the auto increment field a 64bit unsigned int. The Playlist class is still hugely wasteful of IDs and we’ll still run out eventually with this strategy, we’ve just bought ourselves a little bit of time. Now that disaster has struck in a major way, chances are pretty good that we’ll be able to justify spending the time to make it behave in a more sane manner. Additionally, we still haven’t had the chance to export the previous day’s backup somewhere so that people whose playlists were wiped out can have a chance to restore them. Some have argued that we should have been using a 64bit integer in the first place, but it should be obvious that that would only have delayed the problem and in the meantime, it wastes memory and resources.
Issue #4: Script went nuts
This was today’s issue. The details still aren’t completely clear, but basically someone who shall remain nameless wrote a bash script to archive some data from a file into the master database. That script apparently didn’t make use of a lockfile and somehow got spawned hundreds or maybe even thousands of times. The end result was that it managed to completely fill the database server. It’s actually surprising how elegantly MySQL handled this. All queries hung, but the server didn’t actually crash, which is honestly what I expected would happen in that situation. Once we identified the culprit, cut off its access to the database and moved things around enough to free up some space, things went back to normal.
The server is obviously running fine now, but the script needs to be repaired. In the meantime it’s disabled. One could say that there was an underlying issue that caused this problem as well, which is that it was possible for such a misbehaving script to go into production in the first place. I agree, and we have a new policy effective immediately that no code that touches the DB can go live without a review. Honestly, that policy already existed, but now it will be taken seriously.
Issue #5: Load Balancer crapped out
I almost forgot about this one, so I’m adding it after the fact. We were having some issues with our load balancer due to the fact that it was completely overloaded, but even once the load went down it was still acting funny. We did a reboot to restore normalcy, but after the reboot the load balancer was completely unreachable because our new switch thought it detected the routing equivalent of an infinite loop. At that point the only way to get things going was to have one of our techs make the 2 hour drive up to our data center to fix it manually.
This issue would have been annoying but not catastrophic had we remembered to reconnect the serial cable to the load balancer after everything got moved around to accommodate the new switch. It also wouldn’t have been so bad if we had someone on call near the data center who would have been able to fix the issue, but right now everyone is in Gainesville. Unless Gainesville wins the Google Fiber thing, there’s no way we can have the data center in Gainesville because there just isn’t enough bandwidth coming into the city for our needs (yes, we’re that big).
Status: Mostly fixed
We understand what happened with the switch and know how to fix the issue remotely now. We don’t yet know how to prevent the switch from incorrectly identifying an infinite loop when the load balancer boots up, but we know to expect it and how to work around it. We now also have the serial cable hooked up, and a backup load balancer in place, so if something happens again we’ll be able to get things working again remotely now. It would still be nice to not have to send someone on a 2 hour drive if there is a major issue in the future, but hopefully we have minimized the potential for such issues as much as possible.
Issue #6: Streams down
This issue popped up this week and was relatively minor compared to everything else that has gone wrong, since I believe users were affected for less than 20 minutes, and only certain streams failed. The unplanned downtime paid off in the long run because the changes that caused the downtime ultimately mean the stream servers are faster and more reliable.
We had been using MySQL to track streams, with a MySQL server running on every stream server, just for tracking streams that happen on that server. We thought this would scale out nicely, as more stream servers automatically means more write capacity. Unfortunately, due to locking issues, MySQL was ultimately unable to scale up nearly as far as we have been able to get our stream output to scale, so MySQL became a limiting factor in our stream capacity. We switched the stream servers over to Redis, which scales up much better than MySQL, has little to no locking issues, and is a perfect match for the kind of key-value storage we need for tracking streams.
Unfortunately, due to a simple oversight, some of the web servers were missing a critical component, or rather they thought they were because Apache needed to be reloaded before it would see the new component. This situation was made worse by testing that was less thorough than it should have been, so it took longer to identify the issue than would be idea. Fortunately, the fix was extremely simple so the overall downtime or crappy user experience did not last very long.
Status: Fixed with better procedures on the way
The issue was simple to fix, but it also helps to highlight the need for better procedures both for putting new code live and for testing. These new changes should be going into effect some time this week, before any more big changes are made. In the meantime, streams should now be more reliable than they have been in the past few weeks.