Grooveshark is in the middle of some major unplanned downtime right now, after having major server issues just about all day.
What’s going on? We don’t fully know, but we have some clues.
-We are fairly certain that our load balancer is running well over capacity, and we’ve had a new one on order for a little while already.
-Some time this morning, one of our servers broke. Not only did it break, but the alerts that should have told us about it were also broken, due to a misconfiguration that affected only that server.
-Today near peak time, our two biggest servers stopped getting any web traffic except for very short bursts of very tiny amounts of traffic. At the same time, our other servers started seeing insane load averages. /var/log/messages showed warnings about dropping packets because of conntrack limits. The limits were the same as other boxes that had been doing just fine, and in normal circumstances we never come close to those limits, but the counters showed we were definitely hitting those limits and keeping them pegged.
-After upping the limits to ridiculously high numbers, the servers started getting web traffic again, and the other servers saw their load averages return to sane values. The connection counts were hovering somewhat close to the new limits, but seemingly with enough margin that they never actually hit them; good enough for a temporary solution.
-Things were seemingly stable for a couple of hours, but connection numbers were still weird. One of our server techs, who really is awesome at what he does but shall remain nameless here for his own protection, was investigating the issue and noticed a configuration issue on the load balancer that might have been causing the weird traffic patterns and definitely needed to be fixed. Once it was late enough, we decided to restart the load balancer to apply the settings. Under normal circumstances that disruption would bring the site down for a minute or so, which seemed well worth it for the chance of having things behave tomorrow.
-Obviously, things did not go quite as planned. The load balancer decided not to come back up, leaving us with only one option: sending a tech out to the data center 1.5 hours away to resurrect the thing.
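For anyone curious what "hitting conntrack limits" looks like in practice, here is a rough sketch of the kind of check that could have flagged it earlier. It is not what we actually run. It assumes a Linux box that exposes the counters at /proc/sys/net/netfilter/nf_conntrack_count and nf_conntrack_max (older kernels use ip_conntrack names instead), and the 80% warning threshold and function names are made up for illustration.

```python
from pathlib import Path

# Assumed locations of the connection-tracking counters; older kernels
# expose them under ip_conntrack names instead.
COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def read_int(path):
    """Read a single integer from a /proc-style file."""
    return int(Path(path).read_text().strip())


def conntrack_usage(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Return (tracked connections, limit, fraction of the limit in use)."""
    count = read_int(count_path)
    limit = read_int(max_path)
    ratio = count / limit if limit else 0.0
    return count, limit, ratio


def is_near_limit(count, limit, threshold=0.8):
    """True when the tracked connections reach the (hypothetical) threshold."""
    return count >= threshold * limit


if __name__ == "__main__":
    count, limit, ratio = conntrack_usage()
    status = "WARNING" if is_near_limit(count, limit) else "ok"
    print(f"{status}: {count}/{limit} tracked connections ({ratio:.0%})")
```

Run from cron or a monitoring agent, something like this warns while there is still headroom, instead of leaving you to find dropped-packet warnings in /var/log/messages after the fact.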
Update 4/23/2010 11:50am:
Well, obviously Grooveshark is back up and running. Total downtime was 2-3 hours, and we have a slightly better picture of what some of the problems were. The load balancer itself wasn’t exactly the problem; it was actually the interaction between the load balancer and the new core switch we installed last weekend. When the load balancer restarted, the switch got confused and essentially raised a false alarm on the routing equivalent of an infinite loop, so it cut off the port, making the load balancer completely inaccessible. We now have it set up so that we can still connect to it even if the switch cuts it off, and we also figured out how to work around that issue if it happens again.
There’s still some weird voodoo going on with the servers that we haven’t fully explained, so be prepared for more slowness today while we continue to look into it.