Known unknowns: Grooveshark downtime situation (updated)

23 Apr

Grooveshark is in the middle of some major unplanned downtime right now, after having major server issues just about all day.

What’s going on? We don’t fully know, but we have some clues.

-We are fairly certain that our load balancer is running well over capacity, and we’ve had a new one on order for a little while already.

-Some time this morning, one of our servers broke. Not only did it break, but the alerts that should have told us about it were also broken due to a misconfiguration (but only for that server).

-Today near peak time, our two biggest servers stopped getting any web traffic except for very short bursts of very tiny amounts of traffic. At the same time, our other servers started seeing insane load averages. /var/log/messages showed warnings about dropping packets because of conntrack limits. The limits were the same as other boxes that had been doing just fine, and in normal circumstances we never come close to those limits, but the counters showed we were definitely hitting those limits and keeping them pegged.

-After we upped the limits to ridiculously high numbers, those servers started getting web traffic again, and the other servers saw their load averages return to sane values. The connection counts were hovering somewhat close to the new limits, but seemingly with enough margin that they never actually hit them; good enough for a temporary solution.

-Things were seemingly stable for a couple of hours, but connection numbers were still weird. One of our server techs, who really is awesome at what he does but shall remain nameless here for his own protection, was investigating the issue and noticed a configuration issue on the load balancer that might have been causing the weird traffic patterns and definitely needed to be fixed. Once it was late enough, we decided to restart the load balancer to apply the settings. Under normal circumstances that is a disruption of service that brings the site down for a minute or so, which seemed well worth it for the chance of having things behave tomorrow.

-Obviously, things did not go quite as planned. The load balancer decided not to come back up, leaving us with the only option of sending a tech out to the data center 1.5 hours away to resurrect the thing.
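
For anyone who wants to watch for the conntrack problem described above, here is a minimal sketch of the kind of check involved. It is not the tooling we actually used. It assumes a Linux kernel that exposes nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter (older kernels use ip_conntrack names instead), and the 80% warning threshold and the function names are made up for illustration.

```python
from pathlib import Path

# Assumed locations; these exist only when the nf_conntrack module is loaded.
# Older kernels expose ip_conntrack_* equivalents instead.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, limit: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def classify(count: int, limit: int, warn_at: float = 0.8) -> str:
    """Label the table as 'ok', 'close', or 'pegged'.

    warn_at is a hypothetical threshold, not a recommended value.
    """
    usage = conntrack_usage(count, limit)
    if usage >= 1.0:
        return "pegged"  # the kernel will be dropping new connections
    if usage >= warn_at:
        return "close"
    return "ok"


def read_int(path: Path) -> int:
    """Read a single integer from a /proc file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    try:
        count, limit = read_int(COUNT_PATH), read_int(MAX_PATH)
        print(f"{count}/{limit} tracked connections: {classify(count, limit)}")
    except (OSError, ValueError) as exc:
        # Expected on machines without conntrack loaded, or non-Linux systems.
        print(f"could not read conntrack counters: {exc}")
```

Run periodically, something like this would flag a box whose counter stays pegged at the limit, which is the state our two biggest servers were in.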

Update 4/23/2010 11:50am:
Well, obviously Grooveshark is back up and running. Total downtime was 2-3 hours, and we have a slightly better picture of what some of the problems were. The load balancer itself wasn’t exactly the problem; it was actually the interaction between the load balancer and the new core switch we installed last weekend. When the load balancer restarted, the switch got confused and essentially got a false alarm on the routing equivalent of an infinite loop, so it cut off the port, making the load balancer completely inaccessible. We now have it set up so that we can still connect to the load balancer even if the switch cuts it off, and we also figured out how to work around that issue if it happens again.
There’s still some weird voodoo going on with the servers that we haven’t fully explained, so be prepared for more slowness today while we continue to look into it.

  1. Jeuhrn

    April 23, 2010 at 1:49 am

    Thanks for not hosting this blog on! Also it would be nice with timestamps on your entries (but luckily the date is in the URL).

  2. Jay

    April 23, 2010 at 5:21 am

    Ha, good point! There was a time when we considered hosting it on Grooveshark, I guess it’s a good thing we never did. :)

    As for the dates, whoops, I’m still looking for the ideal theme and hadn’t noticed that this one is missing the post dates. I hate when blogs do that! I’ll either try to hack the theme or find another one soon.

  3. ignacio

    April 23, 2010 at 11:05 am

    Guys, this theme has the date on the top-left corner of the post. It says 23 Apr right now.

    Anyways, glad we have Grooveshark back :)

  4. Jay

    April 25, 2010 at 9:21 am

    lol so it does! Shows you how much time I’ve spent looking at it…

  5. Last Year « Jay Paroline – Grooveshark Dev

    May 20, 2010 at 7:33 am

    […] won’t rehash all the capacity issues we’ve had lately, but needless to say things have been at least as bad as I worried about a […]