Technology Stack

06 May

Ever wonder what technology powers Grooveshark?

Well that’s too bad, ’cause I’m going to tell you anyway. Most sites these days run on the LAMP stack (Linux, Apache, MySQL, PHP). Grooveshark runs on the LALMMRSPJG stack, more or less. Don’t try to pronounce that, you’ll only end up hurting yourself.

Linux: (CentOS primarily) for all of our servers except one lone Solaris box, which will be taken out back and shot one of these days, I hope.

Apache: All of our front-end nodes run Apache. By front-end node I mean everything serving up HTTP traffic except for stream servers. For example, listen.grooveshark.com, www.grooveshark.com, tinysong.com, widgets.grooveshark.com and cowbell.grooveshark.com are all hosted on our front-end nodes.

Lighttpd: Affectionately called lighty around here, it’s super efficient at serving up static content, so we use it on all of our stream servers instead of Apache.

MySQL: We have several database servers, and they all run MySQL, much to my chagrin. We’d be using PostgreSQL if it had been up to me, but it wasn’t so we stick with MySQL. Now that Drizzle is coming along nicely, we are contemplating eventually moving fully or partially onto Drizzle, meaning our stack would be LALDMRSPJG or LALMDMRSPJG. Not much more pronounceable, I’m afraid.

Memcached: Without memcached, we would certainly not be where we are today. At this point nearly everything runs through memcached, reducing database load significantly and increasing site performance at the same time.

Redis: Redis is a new addition to our stack, but a welcome one. Redis is very similar to memcached in that it’s a key-value store, and it’s almost as fast, but it has the advantage of being disk-backed, so if you have to restart the server, you haven’t lost anything when it comes back up. Where memcached helps us save reads from MySQL, Redis helps us save reads and writes, because we can actually use it to store data that we intend to keep around.

Sphinx: MySQL fulltext indexes are absolutely horrible for search, so instead we use a technology called Sphinx. Sphinx recently got moved off of the front end servers and onto its own server, significantly reducing the load on the front end servers and improving the performance of search. Win-win!

PHP: Most of the code that makes Grooveshark work is written in PHP. That covers all the websites I listed above, including the RPC services. Plenty of people hate PHP out there, including (or especially) those of us who program in it. It definitely has its warts, but it’s a language that is quick to develop in and it performs relatively well if you play to its strengths.

Java: Some of the code running on our servers, especially things that need to maintain state, is written in Java. Things that come to mind are the ad server, and some crazy stuff written in Scala for keeping our stream servers in sync.

Gearman: Gearman is an awesome piece of the puzzle that we’re just starting to harness, and it’s going to help us scale out even more in the future. Gearmand is an extremely lightweight job queuing server with support for synchronous and asynchronous jobs. Workers can live on different servers and be written in different languages from clients. Gearman is great for map/reduce jobs or for allowing things that might be slow to be processed in the background without slowing down the user experience. For example, if our ad server needs to display an ad as quickly as possible *and* it needs to log the fact that it displayed the ad, it can fire off an asynchronous Gearman job for the logging and get right to work on serving up the ad. Even if the logging portion is running incredibly slowly, nothing front-facing has to wait on it.
We have a super secret feature launching in about two weeks that would essentially not be possible without Gearman (and Redis). I’ll update in a couple of weeks to explain how Gearman makes it possible, once I can talk about what it is. :)
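
To make the ad-logging example a bit more concrete, here is a rough sketch of the fire-and-forget idea. It is not our code and it does not use Gearman at all: an in-process queue and a thread stand in for gearmand and a worker, and names like serve_ad and log_worker are made up for illustration.

```python
import queue
import threading
import time

# Stand-in for gearmand: an in-process queue. Real Gearman runs as a
# separate server, and workers can live on other machines.
job_queue = queue.Queue()
logged = []  # where the hypothetical worker records impressions


def log_worker():
    """Hypothetical background worker: drains jobs and 'logs' them slowly."""
    while True:
        ad_id = job_queue.get()
        if ad_id is None:  # sentinel: shut the worker down
            job_queue.task_done()
            break
        time.sleep(0.05)  # pretend logging is slow
        logged.append(ad_id)
        job_queue.task_done()


def serve_ad(ad_id):
    """Queue the logging job without waiting on it, then return the ad."""
    job_queue.put(ad_id)  # asynchronous: returns immediately
    return "<ad %d>" % ad_id


worker = threading.Thread(target=log_worker, daemon=True)
worker.start()

# Serving never touches the slow path, however far behind the worker is.
ads = [serve_ad(i) for i in range(5)]

job_queue.put(None)
job_queue.join()  # only this demo waits here; a front end never would
```

The point of the design is that serve_ad only pays for a queue insert; the slow work happens somewhere else, on its own schedule.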

Please note that this list only includes the backend of the stack. We also have front-end clients written in HTML+JS, Flash+Flex, J2ME, Java, Objective C and some others on the way. It also doesn’t yet include Cassandra, but I’m hoping we can add that soon.
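
As a footnote to the memcached entry above, this is roughly what "saving reads from MySQL" means in practice. It is a minimal sketch under obvious assumptions: plain dicts stand in for both memcached and the database, and the key and value are made up.

```python
# Plain dicts stand in for memcached and for MySQL.
cache = {}
database = {"song:1": "Example Song"}  # made-up row
db_reads = 0


def db_lookup(key):
    """Hypothetical slow path: pretend this is a MySQL query."""
    global db_reads
    db_reads += 1
    return database.get(key)


def get(key):
    """Try the cache first, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = db_lookup(key)
        if value is not None:
            cache[key] = value  # populate so the next read skips the database
    return value


first = get("song:1")   # miss: goes to the database
second = get("song:1")  # hit: served from the cache
```

Two reads, one database query. Multiply that by nearly every request and it adds up quickly.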

 
 

A long series of mostly unrelated issues

02 May

If you look at my recent posting (and tweeting) history, a new pattern becomes clear: Grooveshark has been down a lot lately. This morning, things broke yet again.

I don’t think we’ve been this unreliable since the beta days. If you don’t know what that means, consider yourself lucky. The point is that this is not the level of service we are aiming to provide, and not the level of service we are used to providing. So what’s going on?

Issue #1: Servers are over capacity

We hit some major snags getting our new servers, so we have been running over capacity for a while now. That means that at best, our servers are a bit slower than they should be and at worst, things are failing intermittently. Most of the other issues on this list are at least tangentially related to this fact, either because of compromises we had to make to keep things running, or because servers literally just couldn’t handle the loads that were being thrown at them. I probably shouldn’t disclose any actual numbers, but our User|Server ratio is at least an order of magnitude bigger than the most efficient comparable services we’re aware of, and at least two orders of magnitude bigger than Facebook…so it’s basically a miracle that the site hasn’t completely fallen apart at the seams.

Status: In Progress
Some of the new servers arrived recently and will be going into production as soon as we can get them ready. We’re playing catch up now though, so we probably already need more.

Issue #2: conntrack

Conntrack is basically (from my understanding) a built-in part of Linux (or at least CentOS) related to the firewall. It keeps track of connections and enables some throttling to prevent DoS attacks. Unfortunately it doesn’t seem to be able to scale with the massive number of concurrent connections each server is handling now; once the number of connections reaches a certain size, cleanup/garbage collection takes too long and the number of connections tracked just grows out of control. Raising the limits helps for a little while, but eventually the numbers grow to catch up. Once a server is over the limits, packets are dropped en masse, and from a user perspective connections just time out.

Status: Fixed
Colin was considering removing conntrack from the kernel, but that would have caused some issues for our load balancer (I don’t fully understand what it has to do with the load balancer, sorry!). Fortunately he located some obscure setting that allows us to limit what conntrack is applied to, by port, so we can keep the load balancer happy without breaking everything when the servers are under heavy load. The fix seems to work well, so it should be deployed to all servers in the next couple of days. In the meantime, it’s already on the servers with the heaviest load, so we don’t expect to be affected by this again.

Issue #3: Bad code (we ran out of integers)

Last week we found out that playlist saving was completely broken. Worse, anyone trying to save changes to an existing playlist during that 3 hour period had their playlist completely wiped out.

There were really two issues here: a surface issue that directly caused the breakage, and an underlying issue that caused the surface issue.

The surface issue: the PlaylistsSongs table has an auto_increment field for uniquely identifying each row, which was a 32 bit unsigned int. Once that field is maxed out, it’s no longer possible to insert any more rows.

Underlying issue: the playlist class is an abomination. It’s both horrible and complex, but at the same time incredibly stupid. Any time a playlist is ever modified, the entries in PlaylistsSongs are deleted, and then reinserted. That means if a user creates a playlist with 5 songs and edits it 10 times, 50 IDs are used up forever. MySQL just has no way of going back and locating and reusing the gaps. How bad are the gaps? When we ran out of IDs there were over 3.5 billion of them; under sane usage scenarios, enough to last us years even at our current incredible growth rate.
We’ve known about the horror of this class and have been wanting to rewrite it for over a year, but due to its complexity and the number of projects that use the class, it’s not a quick fix, and for better or worse the focus at Grooveshark is heavily slanted towards releasing new features as quickly as possible, with little attention given to paying down code debt.
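
The arithmetic behind that example is easy to check. This is just a sketch of the counting, not the Playlist class itself, and it assumes the playlist keeps the same number of songs across edits:

```python
UINT32_MAX = 2**32 - 1  # largest value of a 32-bit unsigned auto_increment


def ids_consumed(songs, edits):
    """IDs burned when every edit deletes and reinserts all rows.

    Assumes, for simplicity, that the playlist keeps the same number of
    songs across edits.
    """
    initial = songs            # rows inserted when the playlist is created
    rewritten = songs * edits  # each edit reinserts every row with new IDs
    return initial + rewritten


total = ids_consumed(songs=5, edits=10)
live = 5               # rows that still exist after the last edit
wasted = total - live  # IDs that point at nothing and are never reused
```

So a 5-song playlist edited 10 times consumes 55 IDs in total, 50 of which are pure waste, out of a pool of 4,294,967,295.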

Status: Temporarily fixed
We fixed the problem in the quickest way that would get things working again: by making more integers available. That is, we altered the table and made the auto increment field a 64-bit unsigned int. The Playlist class is still hugely wasteful of IDs and we’ll still run out eventually with this strategy; we’ve just bought ourselves a little bit of time. Now that disaster has struck in a major way, chances are pretty good that we’ll be able to justify spending the time to make it behave in a more sane manner. Additionally, we still haven’t had the chance to export the previous day’s backup somewhere so that people whose playlists were wiped out can have a chance to restore them. Some have argued that we should have been using a 64-bit integer in the first place, but it should be obvious that that would only have delayed the problem and in the meantime, it wastes memory and resources.

Issue #4: Script went nuts

This was today’s issue. The details still aren’t completely clear, but basically someone who shall remain nameless wrote a bash script to archive some data from a file into the master database. That script apparently didn’t make use of a lockfile and somehow got spawned hundreds or maybe even thousands of times. The end result was that it managed to completely fill the database server. It’s actually surprising how elegantly MySQL handled this. All queries hung, but the server didn’t actually crash, which is honestly what I expected would happen in that situation. Once we identified the culprit, cut off its access to the database and moved things around enough to free up some space, things went back to normal.

Status: Fixed
The server is obviously running fine now, but the script needs to be repaired. In the meantime it’s disabled. One could say that there was an underlying issue that caused this problem as well, which is that it was possible for such a misbehaving script to go into production in the first place. I agree, and we have a new policy effective immediately that no code that touches the DB can go live without a review. Honestly, that policy already existed, but now it will be taken seriously.
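
For anyone wondering what the missing lockfile would have bought us, here is a minimal sketch. The real script was bash; this is Python purely to show the idea, it is Unix-only, and the lock file name is made up.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it's taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes this fail immediately instead of waiting in line.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Hypothetical lock path; a real script would use a fixed, well-known one.
lock_path = os.path.join(tempfile.mkdtemp(), "archive-job.lock")

first = try_lock(lock_path)   # the first copy of the script gets the lock
second = try_lock(lock_path)  # a second copy is turned away
if first is not None:
    first.close()             # closing the file releases the lock
third = try_lock(lock_path)   # so a later run can proceed
if third is not None:
    third.close()
```

A script that gets None back just exits, so hundreds of accidental copies turn into one running copy and a lot of very short-lived ones.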

Issue #5: Load Balancer crapped out

I almost forgot about this one, so I’m adding it after the fact. We were having some issues with our load balancer due to the fact that it was completely overloaded, but even once the load went down it was still acting funny. We did a reboot to restore normalcy, but after the reboot the load balancer was completely unreachable because our new switch thought it detected the routing equivalent of an infinite loop. At that point the only way to get things going was to have one of our techs make the 2 hour drive up to our data center to fix it manually.

This issue would have been annoying but not catastrophic had we remembered to reconnect the serial cable to the load balancer after everything got moved around to accommodate the new switch. It also wouldn’t have been so bad if we had someone on call near the data center who would have been able to fix the issue, but right now everyone is in Gainesville. Unless Gainesville wins the Google Fiber thing, there’s no way we can have the data center in Gainesville because there just isn’t enough bandwidth coming into the city for our needs (yes, we’re that big).

Status: Mostly fixed
We understand what happened with the switch and know how to fix the issue remotely now. We don’t yet know how to prevent the switch from incorrectly identifying an infinite loop when the load balancer boots up, but we know to expect it and how to work around it. We now also have the serial cable hooked up, and a backup load balancer in place, so if something happens again we’ll be able to get things working remotely. It would still be nice to not have to send someone on a 2 hour drive if there is a major issue in the future, but hopefully we have minimized the potential for such issues as much as possible.

Issue #6: Streams down

This issue popped up this week and was relatively minor compared to everything else that has gone wrong, since I believe users were affected for less than 20 minutes, and only certain streams failed. The unplanned downtime paid off in the long run because the changes that caused the downtime ultimately mean the stream servers are faster and more reliable.

We had been using MySQL to track streams, with a MySQL server running on every stream server, just for tracking streams that happen on that server. We thought this would scale out nicely, as more stream servers automatically means more write capacity. Unfortunately, due to locking issues, MySQL was ultimately unable to scale up nearly as far as we have been able to get our stream output to scale, so MySQL became a limiting factor in our stream capacity. We switched the stream servers over to Redis, which scales up much better than MySQL, has little to no locking issues, and is a perfect match for the kind of key-value storage we need for tracking streams.

Unfortunately, due to a simple oversight, some of the web servers were missing a critical component, or rather they thought they were because Apache needed to be reloaded before it would see the new component. This situation was made worse by testing that was less thorough than it should have been, so it took longer to identify the issue than would be ideal. Fortunately, the fix was extremely simple so the overall downtime or crappy user experience did not last very long.

Status: Fixed with better procedures on the way
The issue was simple to fix, but it also helps to highlight the need for better procedures both for putting new code live and for testing. These new changes should be going into effect some time this week, before any more big changes are made. In the meantime, streams should now be more reliable than they have been in the past few weeks.

 
 

A quick note on the openness of Flash

30 Apr

Thanks to Steve Jobs’ Thoughts on Flash post, there’s been a whole new flurry of posts on the subject of Flash vs. HTML5, this time with some focus on the issue of openness, since Steve made such a point to bring that up.

Some people have already pointed out that Adobe has been moving Flash to be more and more open over time, including the Open Screen Project, contributing Tamarin to Mozilla, and Flex being completely open source. That’s all well and good, but people seem to be forgetting that historically, Flash had a very good reason for being closed. If you remember the early days of the web (the wild west era, as I do, since it was such an exciting time for a young whippersnapper like me), you only need to think back to what happened to Java to realize why making Flash closed was a very smart move.

For those who don’t remember or weren’t on the web back then, I’ll share what I remember which may be a bit off (that was quite a while ago now!). I’m talking about the hairy days when Netscape was revolutionizing the world and threatening to make desktop operating systems irrelevant and Microsoft was playing catch up. Java was an open and exciting platform to write once and run anywhere, promising to make proprietary operating systems even more irrelevant.

Of course, what ended up happening was that Microsoft created their own implementation of Java that not only failed to completely follow the Java spec, but also added some proprietary extensions. That completely broke the “write once, run anywhere” paradigm and helped marginalize Java on the web, something Java on the web seems never to have fully recovered from, even though Microsoft was sued by Sun, settled, and has since dropped its custom JVM.

This directly affected my personal experience with the language, when I took a Java class at the local community college while I was in high school. I wanted to write apps for the web, but even basic apps which compiled and ran fine locally would not work on IE, Netscape or both, the only solution being to spend hours fiddling and making custom versions for each browser. Needless to say, I quickly lost interest and haven’t really touched Java since I finished that class.

The atmosphere has certainly changed since those early days, and I don’t think it’s nearly as dangerous to “go open” now as it was back then, so Adobe’s choice to open up more and more of the platform now makes perfect sense, just as the decision to keep it closed until relatively recently also made sense.

 
 

Another hiccup

28 Apr

The saga continues. Today while checking up on one of our servers, I noticed that its load average was insanely low. It’s one of our “biggest” servers, so it typically deals with a large percentage of our traffic. Load averages over 10 are quite normal, and it was seeing a load average of 2-3.

root@RHL039:~# tail /var/log/messages
Apr 28 10:59:56 RHL039 kernel: printk: 9445 messages suppressed.
Apr 28 10:59:56 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:01 RHL039 kernel: printk: 9756 messages suppressed.
Apr 28 11:00:01 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:06 RHL039 kernel: printk: 6111 messages suppressed.
Apr 28 11:00:06 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:11 RHL039 kernel: printk: 3900 messages suppressed.
Apr 28 11:00:11 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:16 RHL039 kernel: printk: 2063 messages suppressed.
Apr 28 11:00:16 RHL039 kernel: ip_conntrack: table full, dropping packet.

root@RHL039:~# cat /proc/sys/net/ipv4/ip_conntrack_max
800000
root@RHL039:~# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count
799995

When the server drops packets like this, things seem randomly and intermittently broken, so we apologize for that. I upped the limit (again) to get things working, but we still don’t know why so many connections are being held open in the first place.
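
Since the whole problem is a counter creeping up on a limit, it is the kind of thing that is easy to watch for. Here is a rough sketch of such a check, not something we actually run: the two paths are the ones from the session above (they may be named differently on other kernels), and the 90% threshold is an arbitrary assumption.

```python
# Paths as shown in the shell session above; other kernels may expose
# these counters under different names, so treat them as an assumption.
MAX_PATH = "/proc/sys/net/ipv4/ip_conntrack_max"
COUNT_PATH = "/proc/sys/net/ipv4/netfilter/ip_conntrack_count"


def read_int(path):
    """Read a single integer from a /proc file."""
    with open(path) as handle:
        return int(handle.read().strip())


def conntrack_usage(count, limit):
    """Fraction of the conntrack table in use (0.0 to 1.0)."""
    return count / limit


def should_alert(count, limit, threshold=0.9):
    """True when usage has reached the (arbitrary) alert threshold."""
    return conntrack_usage(count, limit) >= threshold


# Values from the session above, so this runs even without conntrack loaded.
usage = conntrack_usage(799995, 800000)
alert = should_alert(799995, 800000)
```

On a live box the call would be should_alert(read_int(COUNT_PATH), read_int(MAX_PATH)), run from cron or whatever monitoring is already in place.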

 
 

Known unknowns: Grooveshark downtime situation (updated)

23 Apr

Grooveshark is in the middle of some major unplanned downtime right now, after having major server issues just about all day.

What’s going on? We don’t fully know, but we have some clues.

-We are fairly certain that our load balancer is running well over capacity, and we’ve had a new one on order for a little while already.

-Some time this morning, one of our servers broke. Not only did it break, but our alerts that tell us about our servers were also broken due to a misconfiguration (but only for that server).

-Today near peak time, our two biggest servers stopped getting any web traffic except for very short bursts of very tiny amounts of traffic. At the same time, our other servers started seeing insane load averages. /var/log/messages showed warnings about dropping packets because of conntrack limits. The limits were the same as other boxes that had been doing just fine, and in normal circumstances we never come close to those limits, but the counters showed we were definitely hitting those limits and keeping them pegged.

-After upping the limits to even ridiculously higher numbers, the servers started getting web traffic again, and the other servers saw their load averages return to sane values. The connection counts were hovering somewhat close to the new limits, but seemingly with enough margin that they were never hitting the new limits; good enough for a temporary solution.

-Things were seemingly stable for a couple of hours, but connection numbers were still weird. One of our server techs, who really is awesome at what he does but shall remain nameless here for his own protection, was investigating the issue and noticed a configuration issue on the load balancer that might have been causing the weird traffic patterns and definitely needed to be fixed. Once it was late enough, we decided to restart the load balancer to apply the settings, a disruption of service that would bring the site down for a minute or so under normal circumstances, well worth it for the chance of having things behave tomorrow.

-Obviously, things did not go quite as planned. The load balancer decided not to come back up, leaving us with the only option of sending a tech out to the data center 1.5 hours away to resurrect the thing.

Update 4/23/2010 11:50am:
Well obviously Grooveshark is back up and running. Total downtime was 2-3 hours, and we have a slightly better picture of what some of the problems were. The load balancer itself wasn’t exactly the problem, it was actually the interaction between the load balancer and the new core switch we installed last weekend. When the load balancer restarted, the switch got confused and essentially got a false alarm on the routing equivalent of an infinite loop, so it cut off the port, making the load balancer completely inaccessible. We now have it set up so that we can still connect to it even if the switch cuts it off, but we also figured out how to work around that issue if it happens again.
There’s still some weird voodoo going on with the servers that we haven’t fully explained, so be prepared for more slowness today while we continue to look into it.

 
 

Memcached status like APC status

15 Apr

Just a quick note to point out a cool little utility I discovered recently even though it’s been around for a while:

memcache.php stats like apc.php

If you are at all familiar with the APC status page, you know it’s nothing fancy, but it is a very usable way to get a quick overview of what is going on, and much prettier than the CLI options.

Here is a screenshot from the author’s post:

It requires virtually no setup; it’s quite amazing. Optionally change the user/pass combo and add info about your buckets to an array, and boom, you’re done. This lets you easily see how much memory a bucket is using, and you can even see the contents of a slab. It does have one very dangerous feature that I’m tempted to strip out entirely: the ability to flush the contents of a bucket, seemingly with the click of a mouse (I haven’t tested it to find out). Some things should be hard.

Anyway, I highly recommend checking it out, it takes virtually no time and might actually be useful!

 
 

Grooveshark Loves Spain

12 Apr

Alternate title: Why is the flag for Spanish the Mexican flag?

As you have probably already noticed, Grooveshark is translated into a couple of different languages now. The language selection dropdown contains the name of the language (in that language), and a tiny icon of a flag. The main purpose of the flag is to be eye-catching, in the hopes that it will be more obvious to non-English speakers that there might be a language that would work better for them. The secondary purpose of the flag is to indicate the locale. This is where it fails miserably.

The people of Spain are, apparently, quite proud of their country. We frequently receive the complaint that we are Wrong to use the Mexican flag for Spanish. The good people of Argentina, Chile and Colombia don’t seem to mind, but the Spanish, they want to see their flag up there. I guess I can’t really blame them, I mean it is sort of the native home of the language, after all.

The observant will also notice that English shows the American flag, even though English really belongs to the English (that is to say, by that definition, English should have the UK flag next to it). In both cases, the country flags indicate the variant of the language being used. English has the US flag because it’s the US dialect. In the case of Spanish, the translations were done by our very own Carlos Perez who, as you may have deduced by now, is from Mexico! Grooveshark uses the Mexican flag for our translation of Spanish because it is written in Mexican Spanish. For that reason, it would be disingenuous for us to use the Spanish flag.

Grooveshark doesn’t have anything against Spain. In fact, Spain is the only Spanish speaking country I have visited, and I found it to be quite nice. Grooveshark has nothing against Britain either, though I have never been there so I can’t speak to the niceness of said country. In any case, as soon as we have the chance, you can bet we’ll be getting the site translated into Spain-Spanish and UK-English, although as you can imagine there are lots and lots of other languages that we need a bit more urgently, such as French, German and Russian. Although it might hurt a little bit to see the “wrong” country up there, at least Spaniards can read the site in their native language, aside from the little idiosyncrasies that arise from it being a different dialect.

 
 

Attn Hackers: Opening up Grooveshark

21 Dec

At Grooveshark, we get a lot of requests for features that aren’t possible to create in Adobe Flash/Flex/Air or that we just plain don’t have time to do.

In that vein, we’re working on opening up Grooveshark to make it a bit more extensible.

The first thing we’re doing is making the player status really easy to get to. People want their chat clients to be able to reflect what’s playing in the status when they are listening to the desktop app. We can’t do that through AIR directly, and we’re not going to have time to learn about and write special code for each chat client out there, but now anyone who already knows about that stuff or just really wants to put the time into it can do it.
There is a file that we are storing in documentsDirectory\Grooveshark\currentSong.txt
documentsDirectory is defined by Adobe to be the current user’s documents directory. On Windows that’s %HOMEPATH%\Documents\Grooveshark\currentSong.txt

The format of the file is pretty obvious if you open it up, but I’ll spell it out here:
SongName\tAlbumName\tArtistName\tStatus
where \t is a hard tab (note: names are guaranteed not to have tabs in them)
Valid statuses are “playing,” “paused” and “stopped” – we may add others later if it makes sense to do so. Note that if a user clears their queue or quits Grooveshark, the last song to play will still be listed with a status of “stopped”.
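
If you want a head start, a parser for that format could look something like the sketch below. It is only an illustration of the format described above: the function and type names are made up, and the sample line is invented rather than taken from a real file.

```python
from collections import namedtuple

CurrentSong = namedtuple("CurrentSong", "song album artist status")

VALID_STATUSES = {"playing", "paused", "stopped"}


def parse_current_song(line):
    """Parse one SongName\\tAlbumName\\tArtistName\\tStatus line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got %d" % len(fields))
    song = CurrentSong(*fields)
    if song.status not in VALID_STATUSES:
        # More statuses may be added later, so a real client might
        # prefer to tolerate unknown values instead of raising.
        raise ValueError("unknown status: %r" % song.status)
    return song


# Made-up sample line in the documented format.
sample = "Some Song\tSome Album\tSome Artist\tplaying\n"
parsed = parse_current_song(sample)
```

Because names are guaranteed not to contain tabs, a plain split is enough and no quoting or escaping rules are needed. A chat-client plugin would read the file, call parse_current_song on its contents, and update the status whenever the result changes.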

Other ways we plan to open up Grooveshark to 3rd party developers include creating an external interface through JavaScript for the web client so people can make Firefox or Chromium extensions to control playback, and creating something similar that can allow the desktop app to accept connections from 3rd party apps running on the desktop alongside, so for example someone could write a plugin that registers global keyboard shortcuts for play/pause, back and next. Again, AIR will not allow us to register those shortcuts directly.

 
 

The Case of the Crazy Driver

09 Dec

I often work nights at Grooveshark, which means I also drive home very late or very early, depending on your perspective.

A few times on my way home, I’ve noticed somebody driving in a fashion that can only be summed up as ‘crazy.’ From a distance it looks like they start to pull up into somebody’s driveway, then turn onto the sidewalk, then back onto the road, and then back up onto the sidewalk again, until I or some other car gets near, then they zoom off down some side street.

Tonight I saw that again and I was on my scooter, so I decided to follow the car down the side street. It didn’t take them long to realize they were being followed, so they started driving even more crazily, speeding and running stop signs, turning off on even more side streets, but still sometimes pulling up onto the sidewalk and then veering off. I started to get freaked out at how freaked out this person was, and how crazy they were driving, afraid that they would decide to try to run me over or something, so I gave up. I turned around and took off, but not before looking back. And at that moment the car I had been following was pulled onto the sidewalk again, and I finally saw the detail I had been missing before. What I saw was a newspaper flying out the window of the car. My crazy driver was just a person delivering papers.

 

 

Bugzilla error

23 Sep

Setting up Bugzilla for a friend on a slicehost account running RHEL5, I ran into issues following the official installation guide.
After running:
./checksetup.pl --check-modules
/usr/bin/perl install-module.pl --all

The error message I received was:

Can't exec "--decompress": No such file or directory

Amazingly, I found no exact matches for this in Google (my google-fu is generally weak, however). It turns out that in order for the module installer to be able to install the modules it needs, there’s a certain module it needs!

yum install perl-Compress-Zlib.x86_64
yum install perl-Archive-Tar.noarch

makes everything happy again.

Nothing earth shattering, but maybe now Google will have an answer if anyone else runs into this.