Redis Saves, and Ruins, the Day

16 May

Redis saves the day

Recently, we had to make an emergency switch from MySQL to Redis for all of our PHP session handling needs. We’ve been using MySQL for sessions for years, literally, with no problems. Along the way we’ve optimized things a bit, for example by making it so that calls made by the client don’t load up a session unless it’s needed, and more recently by removing an auto increment id column to prevent the need for global table locks whenever a new session is created.

But then we started running into a brick wall. Connections would pile up on the master while hundreds of queries against sessions sat in the ‘statistics’ state. Each connection storm lasted only a second, but that was long enough to make us run out of connections, even when we doubled or tripled the usual limits. ‘Statistics’ means that the optimizer is trying to come up with an execution plan, but these are queries that touch a single row by primary key, so something else was obviously going on. As far as we’ve been able to tell, it’s not related to load in any way: iostat and load averages both show calm, steady loads when the connection storms happen, and they happen at seemingly random times, even when traffic is at the lowest points of the day.

Our master DB still runs MySQL 5.0, so we thought maybe the combination of giving sessions their own server and running a Percona build of 5.1 would resolve whatever bizarre optimizer issues we were having, but no luck. It definitely seems like a software issue; it may simply be that the massive size of the table, combined with the high level of concurrency, makes MySQL lose its marbles every so often. Either way, we needed to come up with a solution fast, because the site was extremely flaky while sessions were randomly failing.

We evaluated our options: what could we get up and running as quickly as possible on our one spare server that would have a chance of handling the load? We considered Redis, Cassandra, Postgres, Drizzle and Memcached, but decided to go with Redis as a temporary solution. We have been using it successfully in some other high load situations, all the other options besides Memcached are thus far untested by us, and Memcached doesn’t have the durability that we require for sessions (we don’t want everyone to get logged out if the box needs to be rebooted).

Nate got Redis up and running while I spent 20 minutes hacking our session handler to use Redis instead of MySQL. There was no time to copy all the session data to Redis, so instead I made it check Redis for the session first, and then fall back to reading from MySQL if it’s not already in Redis. Quick tests on staging showed that it seemed to be working, so we pushed it live. Miraculously, everything just worked! Redis didn’t buckle from the load and my code was seemingly bug free. That is definitely the least time I’ve ever spent writing or testing such a critical piece of code before deploying, but desperate times call for desperate measures, right?
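The read-Redis-first, fall-back-to-MySQL approach described above can be sketched roughly like this (a Python illustration, not our actual PHP handler; the store objects with get/set methods are hypothetical stand-ins):

```python
# Sketch of a read-through session lookup: prefer Redis, fall back to
# MySQL for legacy sessions, and lazily migrate hits into Redis so the
# next read for that session is served from Redis alone.

def read_session(session_id, redis_store, mysql_store):
    data = redis_store.get(session_id)
    if data is not None:
        return data
    data = mysql_store.get(session_id)  # legacy copy, if one exists
    if data is not None:
        redis_store.set(session_id, data)  # migrate lazily
    return data
```

The nice property of the lazy migration is that no bulk copy is needed up front: active sessions move themselves to Redis on first read, and abandoned ones simply expire in MySQL.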

Since the switch, we haven’t had a single session related issue, and that’s how Redis saved the day.

Redis ruins the day

As I have mentioned in previous blog posts, we have been using Redis on our stream servers for tracking stream keys before they get permanently archived on a DB server. Redis has been serving us well in this role for what seems like a couple of months now. Starting yesterday, however, our stream servers started going deep into swap and becoming intermittently unreachable. This was especially odd because under normal circumstances we had about 10GB of memory free.

Turns out, Redis was using twice as much memory as usual every time it went to flush to disk, which is every 15 minutes with our configuration. So every 15 minutes, Redis would go from using 15GB of memory to using 30GB.
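For context, the snapshot cadence is controlled by the `save` directive in redis.conf; a 15-minute flush corresponds to a line like the one below (the exact thresholds here are an assumption, not our actual config). Snapshots run in a forked background process, which is why memory use climbs during a save at all; doubling it, though, was the bug.

```
# redis.conf (illustrative) -- snapshot to disk if at least one key
# has changed in the last 900 seconds (15 minutes)
save 900 1
```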

After talking to James Hartig for a little while we found out that this was a known issue with the version of Redis we were using (1.2.1), which had been fixed in the very next release. Ed upgraded to the latest version of Redis, and things have been fine since. But that’s how Redis ruined the day.


Our setup with Redis on our stream servers should continue to work for us for the foreseeable future. They provide a natural and obvious sharding mechanism, because storing information about the streams on the actual stream server that handles the request means that adding more stream servers automatically means adding more capacity.

On the other hand, Redis for sessions is a very temporary solution. We have 1-3 months before we’re out of capacity on one server for all session information, because Redis currently requires that everything it stores be kept in memory. There isn’t a natural or easy way to shard something like sessions, aside from using a hashing algorithm of some sort, which would require us to shuffle data around every time we add a new server, or using another server to keep track of all of our shards. Redis is soon adding support for virtual memory, which will make it possible to store more information than there is memory available, but we feel it still doesn’t adequately address the need to scale out. That need will eventually come, just not as quickly as it did with MySQL.

The lead candidate for handling sessions long term is Cassandra, because it handles the difficult and annoying tasks of sharding, moving data around and figuring out where it lives for you. We need to do some extensive performance testing to make sure that it’s truly going to be a good long term fit for our uses, but I am optimistic. After all, it’s working for Facebook, Digg, Twitter and Reddit. On the other hand, Reddit has run into some speed bumps with it, and I still get the Twitter fail whale regularly, so clearly Cassandra is not made entirely of magic. The clock is ticking, and we still need a permanent home for playlists, which we’re also hoping will be a good fit for Cassandra, so we will begin testing and have at least some preliminary answers in the next couple of weeks, as soon as we get some servers to run it on.
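To make the sharding tradeoff above concrete, here is a minimal sketch (Python, purely illustrative; the server names are hypothetical) of hashing a session key to pick a shard. The mapping is deterministic for a fixed server list, but adding a server changes the modulus and remaps most keys, which is exactly the data-shuffling problem:

```python
import hashlib

def shard_for(session_id, servers):
    """Pick a server for a session key via simple modulo hashing."""
    digest = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["sessions-1", "sessions-2", "sessions-3"]
shard = shard_for("some-session-id", servers)
# The same key always maps to the same server -- until the server
# list changes, at which point most keys land somewhere new.
```

Consistent hashing would limit how many keys move when a server is added, but either way something has to track and migrate the data, which is the work Cassandra does for you.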

  1. Last Year « Jay Paroline – Grooveshark Dev

    May 20, 2010 at 7:33 am

    […] won’t rehash all the capacity issues we’ve had lately, but needless to say things have been at least as bad as I worried about a year ago. Partially that […]

  2. Shekhar

    May 13, 2011 at 8:12 am

    Good one. Thanks for sharing the story.
    Helped me in taking some db decisions.

  3. Scott

    February 12, 2015 at 6:23 pm

Where are you at today with session handling? I am going through something similar and am trying to figure out whether to go with Redis or Cassandra to replace our current MySQL sessions solution.

    Have the newer features in Redis allowed you to stay on it longer, perhaps permanently?

    If not, where are you at with using Cassandra or have you found a third option?

  4. Jay

    March 30, 2015 at 1:44 pm

    Sorry, I didn’t see your comment earlier. We are using Mongo for session handling currently and it seems to be working fairly well. We will probably re-evaluate using Redis since it has come a long way, but for now Mongo is fine.