How Grooveshark Uses Gearman

27 Mar

At Grooveshark, Gearman is an integral part of our backend technology stack.

30 Second Introduction to Gearman

  • Gearman is a simple, fast job queuing server.
  • Gearman is an anagram for manager. Gearman does not do any work itself, it just distributes jobs to workers.
  • Clients, workers and servers can all be on different boxes.
  • Jobs can be synchronous or asynchronous (see the sketch just after this list).
  • Jobs can have priorities.
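
To make those bullets concrete, here is a minimal sketch using the pecl gearman extension. The reverseString job is made up for illustration, and doNormal() is called do() on older versions of the extension:

    <?php
    // worker.php -- registers a function and waits for jobs
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);               // default gearmand port
    $worker->addFunction('reverseString', function (GearmanJob $job) {
        return strrev($job->workload());                 // return value goes back to synchronous clients
    });
    while ($worker->work());

    <?php
    // client.php -- submits jobs
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $result = $client->doNormal('reverseString', 'hello');      // synchronous: blocks until a worker replies
    $handle = $client->doBackground('reverseString', 'world');  // asynchronous: returns a job handle immediately
    $client->doHighBackground('reverseString', 'urgent');       // same job, queued at high priority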

To learn more about gearman, visit gearman.org, and peruse the presentations. I got started with this intro (or one much like it), but there may be better presentations available now; I haven’t checked.
The rest of this article will assume that you understand the basics of Gearman, including the terminology, and are looking for some use cases involving a real live high-traffic website.

Architecture

With some exceptions, our architecture with respect to Gearman is a bit unconventional. In a typical deployment, you would have a large set of Apache + PHP servers (at Grooveshark we call these “front end nodes” or FENs for short) communicating with a smaller set of Gearman job servers, and a set of workers that are each connected to all of the job servers. In our setup, we have a gearman job server running on each FEN, and jobs are submitted over localhost. That’s because most of the jobs we submit are asynchronous, and we want the latency to be as low as possible so the FENs can fire off a job and get back to handling the user’s request. We then have workers running on other boxes which connect to the gearman servers on the FENs and process the jobs. Where the workers run depends on their purpose; for example, workers that insert data into a data store usually live on the same box as the data store, which again cuts down on network latency. This architecture means that, in general, each FEN is isolated from the rest of the FENs, and the Gearman servers are not another potential point of failure or even a source of slowdowns. The only way a Gearman server is unavailable is if the FEN itself is out of commission, and the only way a Gearman server is running slow is if the whole FEN is running slow.
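
Roughly, that split looks like this with the pecl gearman extension (the hostnames, the someJob name, and the payload are made up for illustration):

    <?php
    // On a FEN: submit an asynchronous job to the gearmand running on this same box.
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);   // localhost only, no cross-network hop
    $client->doBackground('someJob', json_encode(array('userID' => 42)));
    // ...and get right back to handling the user's request.

    <?php
    // On a worker box: connect out to the gearmand instances running on the FENs.
    function processSomeJob(GearmanJob $job) {
        $data = json_decode($job->workload(), true);
        // ...do the actual work here...
    }

    $worker = new GearmanWorker();
    foreach (array('fen01', 'fen02', 'fen03') as $fen) {   // hypothetical FEN hostnames
        $worker->addServer($fen, 4730);
    }
    $worker->addFunction('someJob', 'processSomeJob');
    while ($worker->work());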

Rate Limiting

One of the things that is really neat about this gearman architecture, especially when used asynchronously, is that jobs that need to happen eventually but not necessarily immediately can be easily rate limited by simply controlling the number of workers that are running. For example, we recently migrated Playlists from MySQL to MongoDB. Because many playlists have been abandoned over the years, we didn’t want to just blindly import all playlists into mongo. Instead, we import them from MySQL as they are accessed. Once the data is in MongoDB, it is no longer needed in MySQL, so we would like to be able to delete that data to free up some memory. Deleting that data is by no means an urgent task, and we know that deletes of this nature cannot usefully run in parallel; running more than one at a time just results in extra lock contention.

Our solution is to insert a job into the gearman queue to delete a given playlist from MySQL. We then have a single worker connecting to all of the FENs asking for playlist deletion jobs and running the deletes one at a time against the MySQL server. Not surprisingly, when we flipped the switch, deletion jobs came in much faster than they could be processed; at the peak we had a backlog of 800,000 deletion jobs waiting to be processed, and it took us about 2.5 weeks to get that number down to zero. During that time we had no DB hiccups, and server load was kept low.
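
That worker is the rate limiter: because exactly one copy of it runs, the deletes are serialized no matter how fast jobs pile up. A sketch of what it looks like (the job name, hostnames, DSN, and table names are placeholders, not our real schema):

    <?php
    // Single playlist-deletion worker; runs on (or near) the MySQL server.
    $worker = new GearmanWorker();
    foreach (array('fen01', 'fen02', 'fen03') as $fen) {   // one connection per FEN
        $worker->addServer($fen, 4730);
    }

    $db = new PDO('mysql:host=127.0.0.1;dbname=example', 'user', 'pass');   // placeholder DSN
    $deleteSongs    = $db->prepare('DELETE FROM PlaylistSongs WHERE playlistID = ?');
    $deletePlaylist = $db->prepare('DELETE FROM Playlists WHERE playlistID = ?');

    $worker->addFunction('deletePlaylistFromMySQL', function (GearmanJob $job) use ($deleteSongs, $deletePlaylist) {
        $playlistID = (int) $job->workload();
        $deleteSongs->execute(array($playlistID));      // deletes run strictly one at a time,
        $deletePlaylist->execute(array($playlistID));   // so lock contention stays minimal
    });
    while ($worker->work());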

Data Logging

We have certain high-volume data that must be logged, such as song plays for accounting purposes, and searches performed so we can make our search algorithm better. We need to be able to log this high-volume data in real time, without affecting the responsiveness of the site. Logging jobs are submitted asynchronously to Gearman over localhost. On our hadoop cluster, we have multiple workers per FEN collecting and processing jobs as quickly as possible. Each worker only connects to one FEN; in fact, each FEN has about 12 workers just for processing logging jobs. For a more in-depth explanation of why we went with this setup, see lessons learned.
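
On the FEN side a logging call is just a fire-and-forget background job, and on the hadoop side each worker is pinned to a single FEN. A sketch (the logSearch job, its fields, and the hostname are made up for illustration):

    <?php
    // FEN side: log a search without slowing down the request.
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $client->doBackground('logSearch', json_encode(array(
        'query'  => 'the beatles',
        'userID' => 42,
        'time'   => time(),
    )));

    <?php
    // Worker side, running on the hadoop cluster. Note the single addServer() call:
    // each of the ~12 logging workers per FEN talks to exactly one FEN.
    $worker = new GearmanWorker();
    $worker->addServer('fen01', 4730);
    $worker->addFunction('logSearch', function (GearmanJob $job) {
        $entry = json_decode($job->workload(), true);
        // ...append $entry to the log files that hadoop will process...
    });
    while ($worker->work());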

Backend API

We have some disjointed systems written in various languages that need to be able to interface with our core library, which is written in PHP. We considered making a simple API over HTTP much like the one that powers the website, but decided that it was silly to pay all the overhead of HTTP for an internal API. Instead, a set of PHP workers handle the incoming messages from these systems and respond accordingly. This also provides some level of rate limiting or control over how parallelized we want the processing to be. If a well-meaning junior developer writes some crazy piece of software with 2048 processes all trying to look up song information at once, we can rest assured that the database won’t actually be swamped with that much concurrency, because at most it will be limited to the number of workers that we have running.
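
Under the hood this is just a synchronous (foreground) job whose return value is the response. A rough sketch, with a made-up coreAPI job and getSongInfo method (the real core library calls are stubbed out):

    <?php
    // Worker side: exposes a slice of the PHP core library to our non-PHP systems.
    $worker = new GearmanWorker();
    $worker->addServer('fen01', 4730);   // hypothetical hostname
    $worker->addFunction('coreAPI', function (GearmanJob $job) {
        $request = json_decode($job->workload(), true);
        switch ($request['method']) {
            case 'getSongInfo':
                // Stand-in for the real lookup in the core library.
                return json_encode(array('songID' => $request['songID'], 'name' => '...'));
            default:
                return json_encode(array('error' => 'unknown method'));
        }
    });
    while ($worker->work());

    // A client in any language with a gearman library submits a foreground job, e.g.
    //   doNormal('coreAPI', '{"method":"getSongInfo","songID":123}')
    // and blocks on the JSON reply. Database concurrency is capped at the number of
    // coreAPI workers we choose to run, no matter how many clients call in at once.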

Lessons Learned/Caveats

No technology is perfect, especially when you are using it in a way other than how it was intended to be used.
We found that gearman workers (at least the pecl gearman extension’s implementation) connect to and process jobs on gearman servers in a round-robin fashion, draining all jobs from one server before moving to the next. That creates a few different headaches for us:

  • If one server has a large backlog of jobs and workers get to it first, they will process those jobs exclusively until they are all done, leaving the other servers to end up with a huge backlog.
  • If one server is unreachable, workers will wait for the configured timeout every time they run through the round-robin list. Even if the timeout is as low as 1 second, that is 1 second out of 20 that the worker cannot be processing any jobs. In a high-volume logging situation, those jobs can add up quickly.
  • Gearman doesn’t give memory that was used for long queues back to the OS when it’s done with it. It will reuse this memory, but if your normal gearman memory needs are 60MB and an epic backlog caused by these interactions leads it to use 2GB of memory, you won’t get that memory back until Gearman is restarted.

Our solution to these issues is, unless there is a strong need to rate limit the work, to configure a separate worker for each FEN, so that if one FEN is having weird issues it won’t affect the others.
Our architecture, combined with the fact that each request from a user will go to a different FEN, means that we can’t take advantage of one really cool gearman feature: unique jobs. Unique jobs would mean that we could fire asynchronous jobs to prefetch data we know the client is going to ask for, and if the client asked for it before it was ready, we could have a synchronous request hook into the same job and wait for the response.
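
If all of a user’s requests did land on the same gearmand, the pattern would look roughly like this (the prefetchPlaylist job is made up; the third argument to doBackground()/doNormal() in the pecl extension is the unique ID that gearmand uses to coalesce duplicate jobs):

    <?php
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);

    $unique = 'prefetchPlaylist:12345';   // same key for any request about this playlist

    // Fire the prefetch in the background as soon as we suspect the client will want it.
    $client->doBackground('prefetchPlaylist', '12345', $unique);

    // If the client asks before the prefetch has finished, a foreground submit with the
    // same unique ID attaches to the in-flight job and waits for its result instead of
    // queuing a duplicate.
    $result = $client->doNormal('prefetchPlaylist', '12345', $unique);
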
Talking to a Gearman server over localhost is not the fastest thing in the world. We considered using Gearman to handle geolocation lookups by IP address so we can provide localized concert recommendations, since those jobs could be asynchronous, but we found that submitting an asynchronous job to Gearman was an order of magnitude slower than doing the lookup directly with the geoip PHP extension once we compiled it with mmap support. Gearman was still insanely fast, but this goes to show that not everything is better served being processed through Gearman.
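
A micro-benchmark along these lines shows the difference (the geoLookup job name is hypothetical, geoip_record_by_name() comes from the pecl geoip extension with a city database installed, and in a real test a worker would also need to drain the queued jobs):

    <?php
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);

    $ip = '8.8.8.8';
    $n  = 10000;

    $start = microtime(true);
    for ($i = 0; $i < $n; $i++) {
        $client->doBackground('geoLookup', $ip);    // async submit over localhost
    }
    printf("gearman async submit: %.1f us/call\n", (microtime(true) - $start) / $n * 1e6);

    $start = microtime(true);
    for ($i = 0; $i < $n; $i++) {
        geoip_record_by_name($ip);                  // direct, mmap-backed lookup
    }
    printf("geoip extension:      %.1f us/call\n", (microtime(true) - $start) / $n * 1e6);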

Wish List

From reading the above you can probably guess what our wish list is:

  • Gearman should return memory to the OS when it is no longer needed. The argument here is that if you don’t want Gearman to use 2GB of memory, you can set a ulimit or take other measures to make sure you don’t ever get that far behind. That’s fine, but in our case we would usually rather allow Gearman to use 2GB when it is needed; we’d still like to have it back when it’s done!
  • Workers should be better at balancing. Even if one server is far behind, it should not be able to monopolize a worker to the detriment of all other job servers.
  • Workers should be more aware of timeouts. Workers should have a way to remember when they failed to connect to a server and not try again for a configurable number of seconds. Or connections should be established to all job servers in a non-blocking manner, so that one timing out doesn’t affect the others.
  • Servers should be capable of replication/aggregation. This is more of a want than a need, but sometimes it would be nice if one job server could be configured to pull jobs off of other job servers and pool them. That way jobs could be submitted over localhost on each FEN, but aggregated elsewhere so that one worker could process them in serial if rate limiting is desired, without potentially being slowed down by a malfunctioning FEN.
  • Reduce latency for submitting asynchronous jobs locally. Submitting asynchronous jobs over localhost is fast, but it could probably be even faster. For example, I’d love to see how it could perform using unix sockets.

Even with these niggles and wants, Gearman has been a great, reliable, performant product that helps us keep the site fast and responsive for our users.

Supervisord

When talking about Gearman, I would be remiss if I did not mention Supervisord, which we use to babysit all of our workers. Supervisord is a nice little Python utility that will daemonize any process for you, handling things like redirecting stdout to a log file, auto-restarting the process if it fails, starting as many instances of the process as you specify, and automatically backing off if your process fails to start a specified number of times in a row. It also has an RPC interface so you can manage it remotely; for instance, if you notice a backlog of jobs piling up on one of your gearman servers, you can tell supervisord to fire up another 20 workers.
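
As an example, a supervisord program section for the logging workers might look something like this (the script path, log path, and hostname are placeholders; one such section, or a generated config, would exist per FEN):

    ; one program section per FEN; names and paths below are placeholders
    [program:logworker-fen01]
    command=php /path/to/logging_worker.php fen01
    ; a unique process_name pattern is required when numprocs > 1
    process_name=%(program_name)s_%(process_num)02d
    ; the ~12 logging workers per FEN mentioned above
    numprocs=12
    autostart=true
    autorestart=true
    ; retry with automatic backoff if the worker keeps dying on startup
    startretries=10
    redirect_stderr=true
    stdout_logfile=/var/log/workers/logworker-fen01.log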

 
 
  1. Victor

    November 29, 1999 at 7:00 pm

    Thanks for this write-up — pretty cool to read about some of the tech behind Grooveshark (I know, I’m half a year late…)

     
  2. expert

    March 27, 2011 at 10:28 pm

    how does it compare to celery?

     
  3. Chris Han

    March 27, 2011 at 11:41 pm

    Thanks for the writeup! I have an application that uses almost exactly the same stack (Python, Gearman, Supervisor, MySQL). What made you switch to MongoDB? Also, what do you use to manage and watch the workers?

     
  4. Chris Han

    March 27, 2011 at 11:54 pm

    Nevermind. Completely missed your last paragraph. But still curious about your switch to MongoDB.

     
  5. Robert S.

    March 28, 2011 at 12:15 am

    Big fan of the service, really enjoyed reading about your exact set-up – thanks!

     
  6. Brian Moon

    March 28, 2011 at 12:41 am

    Memory – This is a product of how malloc on Linux works. If you were using BSD, the memory would be given back.

    Localhost – connecting via localhost is not really any faster than connecting on the network. unix sockets won’t help much either. You may find that the benefits of using centralized servers and having the ability to use unique ids is much better than the few milliseconds you save on localhost.

    pecl/gearman is not really all that well maintained. It was kind of written and forgotten about. I use (and maintain) the PEAR libraries for gearman. Sure, PECL tests a little faster, but I don’t see the issues you mention when I use it. I am kind of confused why you see this knowing how gearmand works.

    Timeout issues – again, this is because you have N gearmand servers instead of a few centralized servers. I am curious if you actually tried running central gearmand servers before putting them on localhost hoping that localhost would be faster. Can you comment about that?

    Supervisord is nice. If you are using all PHP workers though, I suggest using (my) GearmanManger https://github.com/brianlmoon/GearmanManager/ It has some nice features that you don’t get with a generic daemon manager.

     
  7. Jay

    March 28, 2011 at 2:44 am

    We haven’t moved everything over to MongoDB and probably never will, but we are migrating some of our less relational data out of MySQL and into MongoDB because it handles heavy write workloads and lock contention better, and unlike MySQL it’s very easy to scale by just throwing more servers at the problem. I wouldn’t use it for any data that is not easily denormalized, or for data where the lookups would be spread across lots of records (called documents in mongo), as we have found MySQL to be generally faster at that, and it is relatively easy to scale reads in MySQL by simply adding more slaves.

     
  8. Jay

    March 28, 2011 at 2:58 am

    Running gearman servers locally on each FEN has very little to do with trying to avoid network latency (although some of our servers are in different data centers, and some of the links between them can be high latency), and more about avoiding having another physical server in the mix that could add another point of failure or slowness when things go wrong.

    pecl vs PEAR: I’ll definitely check that out. Which problems in particular do you find that the PEAR driver solves that the pecl one doesn’t? Switching to anything slower, even slightly so, is a tough sell around here but if it addresses enough annoyances, that might make it easier. :)

    We have never tried running with multiple dedicated gearman servers. Besides the philosophical opposition to adding another potential point of failure/slowness to the mix, gearman is so damn fast and efficient on its own it would be a shame to waste some perfectly good hardware just to run that, but the more other things you allow those servers to do the higher the likelihood of something else going wrong on the box.

    I’ll also look into GearmanManager, thanks for pointing that out. :) Not all of our workers are in PHP, but the majority are so it could be a valuable resource.

     
  9. Brian Moon

    March 28, 2011 at 10:56 am

    @Jay

    I don’t see the issues with balancing you talk about. The way gearmand works, it simply loops through its connected workers and asks who can do work. The workers don’t initiate that conversation. So, why PECL would do this is very strange to me. Do you have your workers all listening for all jobs? If so, perhaps that is what you are seeing. Each job is basically its own stack within gearmand and the first job that a worker told it could do is the first one that will get looked at by gearmand. Hmm, I need to write a blog post on that.

    We do run gearmand on servers that do other work. We don’t have dedicated servers for them. We have several processing servers that are great candidates for that. Sure, other things are running and *could* cause issues. But, in reality, it doesn’t. Not in our environment.

    I do want to add some uptime logic to the PEAR library so that it won’t keep trying to connect to the same down daemon. It will have to use some external system like perhaps memcached or Redis to store up down logic. That adds more latency. I suppose you could use files in /tmp or something as well.

    We also run across the WAN (VPN). We have gearmand servers in each data center. Clients connect to the gearmand in their datacenter and workers connect across the WAN to them for services that are only provided in one data center. This works great because when the VPN drops, logging jobs just queue up in the remote data center and are all done when the connection comes back up. We only do this for background jobs. I would never run a foreground (sync) job over the WAN.

     
  10. Jay

    March 28, 2011 at 8:58 pm

    @Brian that is odd. Perhaps what is happening is that one server is keeping the worker so busy that it can’t even get around to connecting to the other servers?
    Example:
    On 2 gearman servers, insert 1000 jobs each called sleepyTime
    Create 1 worker that processes sleepyTime by sleeping for 2 seconds, and have it connect to both gearman servers.
    If you watch status on each of the servers, you will see a status like sleepyTime 998 1 1 on one of them, and on the other you will see sleepyTime 1000 0 0, and it will stay that way until the first server gets down to 0 jobs remaining. This is using pecl gearman so it’s possible that the behavior does not happen in the PEAR library, which would be a good reason to consider switching. We have some workers that process many different jobs, but we also have a lot of single-purpose workers and we definitely see the behavior in those as well.

    The difference in our setups might just be the reliability/stability of our systems. We aim to run on cheap commodity hardware, which is expected to flake out at any time and sometimes does. We’re also growing faster than we can keep up with sometimes, so servers are frequently shuffling roles or being taken offline for maintenance, running out of disk space, etc., so while having dedicated roles is unrealistic, having shared roles is pretty much guaranteeing that there will be some downtime at times. Aside from data storage, each server is set up to be an island so that these fluctuations in availability don’t actually cause interruptions to service. If there were servers in each data center that we could say “this will realistically always be available, and never overloaded” then a slightly less distributed gearman configuration would probably make sense for us too.

     
  11. Chris S

    March 31, 2011 at 10:13 pm

    To fix your round-robin issue, I would randomise (php has a fast shuffle()) the hosts array before starting work on your worker.

    This will allow it to connect to a different server first each time, which should work out to draining them all evenly.

    I know what you mean about memory usage – I doubt it’s a linux memory management byproduct but rather the fault of PHP not properly freeing its objects. Even a small memory leak will cause php to hit its memory ceiling if it never stops!

    The solution is to have a “parent” worker that is managed by supervisord, and that worker forks one (or more) children to handle tasks given to it by the queue; that way the process that does the work is separate from the process that talks to the queue. Check out the docs for PCNTL (looks like Brian’s script uses that too).

    The trickiest part about a decent gearman system is proper reporting and transparency into the queues, jobs and workers (look at resqueue for a great example of this). Every gearman implementation I’ve seen has had to implement this themselves and all in a different way… The job server’s telnet interface is nowhere near good enough.

     
  12. Jay

    April 14, 2011 at 2:05 am

    Yeah, shuffle() would work for balancing the workers, it’s just odd that the driver doesn’t handle that itself.

    The memory usage is coming from the gearman server, not PHP. Gearmand is written in C…all our PHP workers are very well behaved and have no memory leaks. :)

    Thanks for the tip about resqueue, I’ll check it out!

     
  13. Marc Easen

    April 21, 2011 at 12:10 pm

    Nice write up and it’s nice to know that Grooveshark use Gearman.

    You’ve mentioned that sometimes your workers get bogged down with one Gearman server. Have you tried installing a single Gearman job server per worker node and pointing all the workers on that node to just use the local instance of gearman? Then put these nodes under a load balancer/VIP. However, you must ensure each node is the same as the next with regards to what workers it runs. The idea is that this will stop one of your Gearman servers getting overloaded, as the load balancer will sort this out at the network level.

    I have also noticed the memory issue with Gearman. What I’m currently looking at is using the libdrizzle persistent queue adapter and attaching all the gearmand servers to their own database tables on a MySQL cluster. So if one of the nodes goes down for whatever reason, you can easily export and import the jobs into another gearman job list and process the backlog.

    Anyway, it’s nice to know that someone is actually using gearman too!

     
  14. Jay

    April 21, 2011 at 8:53 pm

    Hi Marc, your suggestion of using a load balancer sounds interesting, but I’m not sure I entirely follow when you say a single gearman job server per worker node. What I would envision is all workers connecting to a load balancer, and each time they send a “get jobs” command the load balancer would pick a gearman server, either randomly or based on some weighting, for the message to pass through to…then if a gearman server is down, the load balancer would just stop sending traffic to it. That seems like it could solve our complaint and eliminate the need to run a separate worker for each job server, which would be awesome…but do you know of any load balancers that could do this? The only load balancers I have looked at are HTTP only.

    We have considered using persistent queues, but nobody has had time to test them out; obviously we want to make sure that they won’t slow things down, especially when submitting asynchronous jobs, since we want those to be fire-and-forget and to introduce as little latency as possible. Have you done any tests to see if persistent queues add any latency? If not, I’m sure we’ll get around to it eventually. :)