
Archive for the ‘Uncategorized’ Category

PHP Autoload: Put the error where the error is

13 Oct

At Grooveshark we use PHP’s __autoload function to handle automatically loading files for us when we instantiate objects. In the docs they have an example autoload method:

function __autoload($class_name) {
    require_once $class_name . '.php';
}

Until recently Grooveshark’s autoload worked this way too, but there are two issues with this code:

  1. It throws a fatal error inside the method.

    If you make a typo when instantiating an object and ask it to load something that doesn’t exist, your error logs will say something like this: require_once() [function.require]: Failed opening required 'FakeClass.php' (include_path='.') in conf.php on line 83, which gives you no clue at all about what part of your code is trying to instantiate FakeClass. Not remotely helpful.

  2. It’s not as efficient as it could be.

    include_once and require_once have slightly more overhead than include and require because they have to do an extra check to make sure the file hasn’t already been included. It’s not much extra overhead, but it’s completely unnecessary because __autoload will only trigger if a class hasn’t already been defined. If you’re inside your __autoload function, the file can’t have been included yet, or the class would already be defined.

The better way:
function __autoload($class_name) {
    include $class_name . '.php';
}

Now if you make a typo, your logs will look like this:
PHP Warning: include() [function.include]: Failed opening 'FakeClass.php' for inclusion (include_path='.') in conf.php on line 83
PHP Fatal error: Class 'FakeClass' not found in stupidErrorPage.php on line 24

Isn’t that better?
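Incidentally, the same principle applies if you register your loader with spl_autoload_register instead of defining __autoload directly. Here is a minimal sketch (the function name is made up; this is not our actual loader):

function grooveshark_autoload($class_name) {
    // include, not require_once: if the file is missing, PHP still throws a
    // fatal "Class 'Foo' not found" at the line that referenced the class.
    include $class_name . '.php';
}
spl_autoload_register('grooveshark_autoload');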

Edit 2010-10-19: Hot Bananas! The docs have already been updated (at least in SVN).

 
 

Grooveshark Keyboard Shortcuts for Linux

21 Jun

Attention Linux users: Grooveshark friend Intars Students has created a keyboard shortcut helper for Linux. He could use some help testing it, so go help him! Once a few people have tested it and we’re pretty confident that it works well for users, we will add it to the links inside the app as well.

Intars is also the author of KeySharky, a Firefox extension that allows you to use keyboard shortcuts in Firefox to control Grooveshark playback from another tab, even if you’re not VIP! Pretty cool stuff.

 
 

Keyboard Shortcuts for Grooveshark Desktop

05 Jun

In a further effort to open Grooveshark to 3rd party developers, we have added an External Player Control API. (Side note: yes, it’s a hack to be polling a file all the time, but it’s also the only option we have until AIR 2.0 is out and most users have it installed.) Right now that means that we have support for keyboard shortcuts for OSX and Windows computers. To enable:

First open up desktop options by clicking on your username in the upper-right corner and selecting Desktop Options.

Notice the new option: Enable Global Keyboard Shortcuts (requires helper application)

Check that box (checking this box turns on polling the file mentioned in the External Player Control API) and select the proper client for your machine. In my case I chose Windows, and I’ll go through setting that up now.

When you click on the link to download the keyboard helper, it should download via your browser. Once it is downloaded, run the app.

If prompted, tell Windows not to always ask before opening this file and choose Run.

The first time the application runs, it shows a list of keyboard shortcuts.

At this point the keyboard shortcuts listed should work! If they don’t, switch back to the desktop options window and make sure you click Apply or OK, then try again.

In your system tray you should notice a new icon (note: subject to change in future versions) that looks like the desktop icon with a blue bar in the lower-right corner. Rumor has it that it’s supposed to look like a keyboard.

If you right-click on the tray icon, you will see that there are a few options. The default configuration is pretty good, but I recommend setting it to start with Windows so that it’s always ready.

Big thanks go to James Hartig for hacking together the external player control API and the Windows keyboard shortcut helper, and to Terin Stock for the OSX helper. You can learn more about both keyboard shortcut helpers here. Hackers/developers: please feel free to extend functionality, create your own keyboard helpers (especially for Linux) or add integration to your current apps. Just show off what you’ve done so we can link to it!

Edit: Terin has supplied some Mac screenshots as well.

 
 

Redis: Converting from RDB to AOF

27 May

This is just a quick note about a question we had that we couldn’t find an easy answer to.

We decided to switch from Redis’s default behavior of background saving (bgsave) a .rdb file to using append-only file (AOF) mode. We thought we could just change the conf and restart; instead, Redis created an empty AOF and was missing all the data from our .rdb.

Apparently the correct way to transition between the two is:
-While running in normal bgsave mode, run:
redis-cli BGREWRITEAOF

When that finishes, shut down the server, change the conf and start it back up. Now all your data should be present, and it should be using the AOF exclusively.

 
 

Redis Saves, and Ruins, the Day

16 May

Redis saves the day

Recently, we had to make an emergency switch from MySQL to Redis for all of our PHP session handling needs. We’ve been using MySQL for sessions for years, literally, with no problems. Along the way we’ve optimized things a bit, for example by making it so that calls made by the client don’t load up a session unless it’s needed, and more recently by removing an auto increment id column to prevent the need for global table locks whenever a new session is created.

But then we started running into a brick wall. Connections would pile up on the master while hundreds of queries against sessions would sit in a state of ‘statistics’, each connection storm lasting only a second, but long enough to cause us to run out of connections, even if we doubled or tripled the usual limits. ‘Statistics’ means the optimizer is trying to come up with an execution plan, but these are queries that interact with a single row based on the primary key, so something else was obviously going on there. As far as we’ve been able to tell, it’s not related to load in any way: iostat and load averages both show calm and steady loads when the connection storms happen, and they happen at seemingly random times, even when traffic is at the lowest points of the day.

Our master DB still runs 5.0, so we thought maybe the combination of giving sessions their own server and running on a Percona build of 5.1 would resolve whatever bizarre optimizer issues we were having, but no luck. It definitely seems like a software issue, and it may just be due to the massive size of the table combined with the high level of concurrency that just makes MySQL lose its marbles every so often. Either way, we needed to come up with a solution fast, because the site was extremely flaky while sessions were randomly crashing.

We evaluated our options: what could we get up and running as quickly as possible on our one spare server that would have a chance of handling the load? We considered Redis, Cassandra, Postgres, Drizzle and Memcached, but decided to go with Redis as a temporary solution because we have been using it successfully in some other high-load situations, all the other options besides Memcached are thus far untested by us, and Memcached doesn’t have the durability that we require for sessions (we don’t want everyone to get logged out if the box needs to be rebooted).

Nate got Redis up and running while I spent 20 minutes hacking our session handler to use Redis instead of MySQL. There was no time to copy all the session data to Redis, so instead I made it check Redis for the session first, and then fall back to reading from MySQL if it’s not already in Redis. Quick tests on staging showed that it seemed to be working, so we pushed it live. Miraculously, everything just worked! Redis didn’t buckle from the load and my code was seemingly bug free. That is definitely the least time I’ve ever spent writing or testing such a critical piece of code before deploying, but desperate times call for desperate measures, right?
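The read path was roughly the following shape. This is only a sketch, not our actual handler: the key prefix, table and column names, and TTL are all made up for illustration, and it assumes the phpredis extension plus PDO for MySQL.

function readSession($sessionId, Redis $redis, PDO $db) {
    $key = 'session:' . $sessionId;            // hypothetical key scheme

    // Check Redis first; sessions written since the switch live here.
    $data = $redis->get($key);
    if ($data !== false) {
        return $data;
    }

    // Fall back to the old MySQL sessions table for anything that predates
    // the switch, then copy it into Redis so the next read is cheap.
    $stmt = $db->prepare('SELECT data FROM Sessions WHERE sessionID = ?');
    $stmt->execute(array($sessionId));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        return '';                             // brand new session
    }

    $redis->setex($key, 86400, $row['data']);  // 1-day TTL, also an assumption
    return $row['data'];
}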

Since the switch, we haven’t had a single session related issue, and that’s how Redis saved the day.

Redis ruins the day

As I have mentioned in previous blog posts, we have been using Redis on our stream servers for tracking stream keys before they get permanently archived on a DB server. Redis has been serving us well in this role for what seems like a couple of months now. Starting yesterday, however, our stream servers started going deeply into swap and becoming intermittently unreachable. This was especially odd because under normal circumstances we had about 10GB of memory free.

Turns out, Redis was using twice as much memory as usual every time it went to flush to disk, which is every 15 minutes with our configuration. So every 15 minutes, Redis would go from using 15GB of memory to using 30GB.

After talking to James Hartig for a little while we found out that this was a known issue with the version of Redis we were using (1.2.1), which had been fixed in the very next release. Ed upgraded to the latest version of Redis, and things have been fine since. But that’s how Redis ruined the day.

Epilogue

Our setup with Redis on our stream servers should continue to work for us for the foreseeable future. They provide a natural and obvious sharding mechanism, because storing information about the streams on the actual stream server that handles the request means that adding more stream servers automatically means adding more capacity.

On the other hand, Redis for sessions is a very temporary solution. We have 1-3 months before we’re out of capacity on one server for all session information, because Redis currently requires that everything it stores be kept in memory. There isn’t a natural or easy way to shard something like sessions, aside from using a hashing algorithm of some sort, which will require us to shuffle data around every time we add a new server, or using another server to keep track of where all of our shards live. Redis is soon adding support for virtual memory, so it will be possible to store more information than there is memory available, but we feel it still doesn’t adequately address the need to scale out, which will eventually come, just not as quickly as it did with MySQL.

The lead candidate for handling sessions long term is Cassandra, because it handles the difficult and annoying tasks of sharding, moving data around and figuring out where it lives for you. We need to do some extensive performance testing to make sure that it’s truly going to be a good long-term fit for our uses, but I am optimistic. After all, it’s working for Facebook, Digg, Twitter and Reddit. On the other hand, Reddit has run into some speed bumps with it, and I still get the Twitter fail whale regularly, so clearly Cassandra is not made entirely of magic. The clock is ticking, and we still need a permanent home for playlists, which we’re also hoping will be a good fit for Cassandra, so we will begin testing and have at least some preliminary answers in the next couple of weeks, as soon as we get some servers to run it on.
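To make the reshuffling problem mentioned above concrete, here is a toy sketch of naive hash-based sharding (purely illustrative; not anything we actually run):

// The shard a session lands on depends on the total server count, so changing
// that count silently remaps most existing keys.
function shardFor($sessionId, $numServers) {
    return abs(crc32($sessionId)) % $numServers;
}

$before = shardFor('some-session-id', 3);  // shard chosen with 3 servers
$after  = shardFor('some-session-id', 4);  // usually a different shard with 4

// Unless each remapped session is physically moved to its new shard, it
// effectively disappears, which is why adding a server forces a data shuffle.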

 
 

Introducing Activity Feeds

10 May

Grooveshark’s new preloader image doesn’t really have anything to do with the new Activity Feeds feature (aside from also being new), but it sure does look nice, doesn’t it?

Saturday night/Sunday morning, Grooveshark released a pretty major set of features to VIP users onto preview.grooveshark.com and the desktop app. The major new feature is activity feeds, and it’s certainly the most interesting one, so I’ll cover that first.

When you visit the site, on the sidebar you should notice a new item called “Community” — next to that, it displays the number of events since the last time you logged in.

The actual community activity page is pretty awesome! It shows you aggregated information about the activity of all your friends! Possible activities include:
-Obsession (playing the same song a lot)
-Being really into an artist or album (listening to a lot of songs on that album or by that artist)
-Sharing
-Adding songs to your favorites or library.
Essentially, the power of this feature lies in being able to find out about new music that your friends are into. This feature turns Grooveshark into the best social music discovery service I’ve ever heard of.

Each user also has a profile page, with activity displayed by default, and links to their music library, community and playlist pages.

If you’re the kind of user who doesn’t want other users to know what you’re listening to at all times, then you have a couple of options on the settings page now. You can temporarily enable an “incognito mode” style setting which turns off logging to feeds for the duration of your session. This setting is perfect for parties, or if you’re a hipster but just can’t resist the urge to listen to Miley Cyrus. No one has to know.
The other option is the more extreme “nuclear option” type of setting. It permanently disables logging your feed activity, and it permanently deletes all feed information we might have already stored.

Grooveshark is now available in Portuguese! Translated by our very own Brazilian, Paulo. (Note: We will be removing country flags from languages soon, for those who are bothered by that sort of thing)

Shortcuts to playlists you are subscribed to now show up in the sidebar, below playlists you have created. The blue ones are your playlists, and the grey ones are subscribed playlists.

We now have autocomplete or “search suggest” functionality integrated into the search bar on the home screen.

Wondering if an artist is on tour? Want to buy tickets? Well now you can, thanks to our partner Songkick.

The library page has been revamped, and now playlists are contained within it. In the example pictured above, you can see that columns are collapsible: I collapsed the Albums column, while leaving the Artists column open.
Note: you can still get to your favorites by clicking on the Favorites smart playlist, or by going to My Music and then clicking on the button in the header that says “<3 Favorites”.

 
 

A long series of mostly unrelated issues

02 May

If you look at my recent posting (and tweeting) history, a new pattern becomes clear: Grooveshark has been down a lot lately. This morning, things broke yet again.

I don’t think we’ve been this unreliable since the beta days. If you don’t know what that means, consider yourself lucky. The point is that this is not the level of service we are aiming to provide, and not the level of service we are used to providing. So what’s going on?

Issue #1: Servers are over capacity

We hit some major snags getting our new servers, so we have been running over capacity for a while now. That means that at best, our servers are a bit slower than they should be and at worst, things are failing intermittently. Most of the other issues on this list are at least tangentially related to this fact, either because of compromises we had to make to keep things running, or because servers literally just couldn’t handle the loads that were being thrown at them. I probably shouldn’t disclose any actual numbers, but our user-to-server ratio is at least an order of magnitude bigger than that of the most efficient comparable services we’re aware of, and at least two orders of magnitude bigger than Facebook’s… so it’s basically a miracle that the site hasn’t completely fallen apart at the seams.

Status: In Progress
Some of the new servers arrived recently and will be going into production as soon as we can get them ready. We’re playing catch up now though, so we probably already need more.

Issue #2: conntrack

Conntrack is basically (from my understanding) a built-in part of Linux (or at least CentOS) related to the firewall. It keeps track of connections and enables some throttling to prevent DoS attacks. Unfortunately it doesn’t seem to be able to scale with the massive number of concurrent connections each server is handling now; once the number of connections reaches a certain size, cleanup/garbage collection takes too long and the number of tracked connections just grows out of control. Raising the limits helps for a little while, but eventually the numbers catch up. Once a server is over the limit, packets are dropped en masse, and from a user perspective connections just time out.

Status: Fixed
Colin was considering removing conntrack from the kernel, but that would have caused some issues for our load balancer (I don’t fully understand what it has to do with the load balancer, sorry!). Fortunately he located some obscure setting that allows us to limit what conntrack is applied to, by port, so we can keep the load balancer happy without breaking everything when the servers are under heavy load. The fix seems to work well, so it should be deployed to all servers in the next couple of days. In the meantime, it’s already on the servers with the heaviest load, so we don’t expect to be affected by this again.

Issue #3: Bad code (we ran out of integers)

Last week we found out that playlist saving was completely broken. Worse, anyone who tried to save changes to an existing playlist during the roughly 3-hour window before we fixed it had their playlist completely wiped out.

There were really two issues here: a surface issue that directly caused the breakage, and an underlying issue that caused the surface issue.

The surface issue: the PlaylistsSongs table has an auto_increment field for uniquely identifying each row, which was a 32-bit unsigned int. Once that field hits its maximum value (4,294,967,295), it’s no longer possible to insert any more rows.

Underlying issue: the playlist class is an abomination. It’s both horrible and complex, but at the same time incredibly stupid. Any time a playlist is ever modified, the entries in PlaylistsSongs are deleted, and then reinserted. That means if a user creates a playlist with 5 songs and edits it 10 times, 50 IDs are used up forever. MySQL just has no way of going back and locating and reusing the gaps. How bad are the gaps? When we ran out of IDs there were over 3.5 billion of them; under sane usage scenarios, enough to last us years even at our current incredible growth rate.
We’ve known about the horror of this class and have been wanting to rewrite it for over a year, but due to its complexity and the number of projects that use the class, it’s not a quick fix, and for better or worse the focus at Grooveshark is heavily slanted towards releasing new features as quickly as possible, with little attention given to paying down code debt.
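To illustrate, the wasteful pattern looks roughly like this (a sketch with invented column names; not the actual class):

// Every save throws away the old rows and reinserts them, so a playlist with
// N songs burns N fresh auto_increment IDs on every edit. The deleted IDs are
// never reused, which is how the gaps pile up.
function savePlaylist(PDO $db, $playlistId, array $songIds) {
    $db->beginTransaction();

    $db->prepare('DELETE FROM PlaylistsSongs WHERE playlistID = ?')
       ->execute(array($playlistId));

    $insert = $db->prepare(
        'INSERT INTO PlaylistsSongs (playlistID, songID, sort) VALUES (?, ?, ?)'
    );
    foreach ($songIds as $position => $songId) {
        $insert->execute(array($playlistId, $songId, $position));
    }

    $db->commit();
}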

Status: Temporarily fixed
We fixed the problem in the quickest way that would get things working again — by making more integers available. That is, we altered the table and made the auto increment field a 64-bit unsigned int. The Playlist class is still hugely wasteful of IDs and we’ll still run out eventually with this strategy; we’ve just bought ourselves some time. Now that disaster has struck in a major way, chances are pretty good that we’ll be able to justify spending the time to make it behave in a more sane manner. Additionally, we still haven’t had the chance to export the previous day’s backup somewhere so that people whose playlists were wiped out can have a chance to restore them. Some have argued that we should have been using a 64-bit integer in the first place, but it should be obvious that that would only have delayed the problem, and in the meantime it wastes memory and resources.

Issue #4: Script went nuts

This was today’s issue. The details still aren’t completely clear, but basically someone who shall remain nameless wrote a bash script to archive some data from a file into the master database. That script apparently didn’t make use of a lockfile and somehow got spawned hundreds or maybe even thousands of times. The end result was that it managed to completely fill up the database server’s disk. It’s actually surprising how elegantly MySQL handled this. All queries hung, but the server didn’t actually crash, which is honestly what I expected would happen in that situation. Once we identified the culprit, cut off its access to the database and moved things around enough to free up some space, things went back to normal.

Status: Fixed
The server is obviously running fine now, but the script needs to be repaired. In the meantime it’s disabled. One could say that there was an underlying issue that caused this problem as well, which is that it was possible for such a misbehaving script to go into production in the first place. I agree, and we have a new policy effective immediately that no code that touches the DB can go live without a review. Honestly, that policy already existed, but now it will be taken seriously.

Issue #5: Load Balancer crapped out

I almost forgot about this one, so I’m adding it after the fact. We were having some issues with our load balancer due to the fact that it was completely overloaded, but even once the load went down it was still acting funny. We did a reboot to restore normalcy, but after the reboot the load balancer was completely unreachable because our new switch thought it detected the routing equivalent of an infinite loop. At that point the only way to get things going was to have one of our techs make the 2 hour drive up to our data center to fix it manually.

This issue would have been annoying but not catastrophic had we remembered to reconnect the serial cable to the load balancer after everything got moved around to accommodate the new switch. It also wouldn’t have been so bad if we had someone on call near the data center who would have been able to fix the issue, but right now everyone is in Gainesville. Unless Gainesville wins the Google Fiber thing, there’s no way we can have the data center in Gainesville because there just isn’t enough bandwidth coming into the city for our needs (yes, we’re that big).

Status: Mostly fixed
We understand what happened with the switch and know how to fix the issue remotely now. We don’t yet know how to prevent the switch from incorrectly identifying an infinite loop when the load balancer boots up, but we know to expect it and how to work around it. We now also have the serial cable hooked up, and a backup load balancer in place, so if something like this happens again we’ll be able to recover remotely. It would still be nice to not have to send someone on a 2-hour drive if there is a major issue in the future, but hopefully we have minimized the potential for such issues as much as possible.

Issue #6: Streams down

This issue popped up this week and was relatively minor compared to everything else that has gone wrong, since I believe users were affected for less than 20 minutes, and only certain streams failed. The unplanned downtime paid off in the long run because the changes that caused the downtime ultimately mean the stream servers are faster and more reliable.

We had been using MySQL to track streams, with a MySQL server running on every stream server, just for tracking streams that happen on that server. We thought this would scale out nicely, as more stream servers automatically means more write capacity. Unfortunately, due to locking issues, MySQL was ultimately unable to scale up nearly as far as we have been able to get our stream output to scale, so MySQL became a limiting factor in our stream capacity. We switched the stream servers over to Redis, which scales up much better than MySQL, has little to no locking issues, and is a perfect match for the kind of key-value storage we need for tracking streams.
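For a rough idea of what that looks like, here is a sketch using the phpredis extension. The key layout, fields and TTL are all invented for illustration; this is not our actual schema.

// Each stream server writes stream keys into its own local Redis instance,
// so write capacity scales with the number of stream servers.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);            // local Redis on the stream server

$someStreamKeyId = 'abc123';                   // would come from the request (made up)
$songId = 42;                                  // likewise made up

$streamKey = 'streamKey:' . $someStreamKeyId;  // hypothetical key naming
$redis->hSet($streamKey, 'songID', $songId);
$redis->hSet($streamKey, 'startTime', time());
$redis->expire($streamKey, 3600);              // archived to the DB well before expiry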

Unfortunately, due to a simple oversight, some of the web servers were missing a critical component, or rather they thought they were, because Apache needed to be reloaded before it would see the new component. This situation was made worse by testing that was less thorough than it should have been, so it took longer to identify the issue than would be ideal. Fortunately, the fix was extremely simple, so the overall downtime or crappy user experience did not last very long.

Status: Fixed with better procedures on the way
The issue was simple to fix, but it also helps to highlight the need for better procedures both for putting new code live and for testing. These new changes should be going into effect some time this week, before any more big changes are made. In the meantime, streams should now be more reliable than they have been in the past few weeks.

 
 

A quick note on the openness of Flash

30 Apr

Thanks to Steve Jobs’ Thoughts on Flash post, there’s been a whole new flurry of posts on the subject of Flash vs. HTML5, this time with some focus on the issue of openness, since Steve made such a point of bringing that up.

Some people have already pointed out that Adobe has been moving Flash to be more and more open over time, including the Open Screen Project, contributing Tamarin to Mozilla, and making Flex completely open source. That’s all well and good, but people seem to be forgetting that historically, Flash had a very good reason for being closed. If you remember back to the early days of the web, the wild west era, as I do (it was such an exciting time for a young whippersnapper like me), you only need to think back to what happened to Java to realize why keeping Flash closed was a very smart move.

For those who don’t remember or weren’t on the web back then, I’ll share what I remember, which may be a bit off (that was quite a while ago now!). I’m talking about the hairy days when Netscape was revolutionizing the world and threatening to make desktop operating systems irrelevant, and Microsoft was playing catch-up. Java was an open and exciting platform where you could write once and run anywhere, promising to make proprietary operating systems even more irrelevant.

Of course, what ended up happening was that Microsoft created their own implementation of Java that not only failed to completely follow the Java spec, but added some proprietary extensions, completely breaking the “write once, run anywhere” paradigm and helping to marginalize Java on the web, something it has seemingly never fully recovered from, even though Microsoft has since settled with Sun and dropped their custom JVM after being sued.

This directly affected my personal experience with the language when I took a Java class at the local community college while I was in high school. I wanted to write apps for the web, but even basic apps that compiled and ran fine locally would not work in IE, Netscape, or both; the only solution was to spend hours fiddling and making custom versions for each browser. Needless to say, I quickly lost interest and haven’t really touched Java since I finished that class.

The atmosphere has certainly changed since those early days, and I don’t think it’s nearly as dangerous to “go open” now as it was back then, so Adobe’s choice to open up more and more of the platform now makes perfect sense, just as the decision to keep it closed until relatively recently also made sense.

 
 

Another hiccup

28 Apr

The saga continues. Today while checking up on one of our servers, I noticed that its load average was insanely low. It’s one of our “biggest” servers, so it typically deals with a large percentage of our traffic. Load averages over 10 are quite normal, and it was seeing a load average of 2-3.

root@RHL039:~# tail /var/log/messages
Apr 28 10:59:56 RHL039 kernel: printk: 9445 messages suppressed.
Apr 28 10:59:56 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:01 RHL039 kernel: printk: 9756 messages suppressed.
Apr 28 11:00:01 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:06 RHL039 kernel: printk: 6111 messages suppressed.
Apr 28 11:00:06 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:11 RHL039 kernel: printk: 3900 messages suppressed.
Apr 28 11:00:11 RHL039 kernel: ip_conntrack: table full, dropping packet.
Apr 28 11:00:16 RHL039 kernel: printk: 2063 messages suppressed.
Apr 28 11:00:16 RHL039 kernel: ip_conntrack: table full, dropping packet.

root@RHL039:~# cat /proc/sys/net/ipv4/ip_conntrack_max
800000
root@RHL039:~# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count
799995

When the server drops packets like this, things seem randomly and intermittently broken, so we apologize for that. I upped the limit (again) to get things working, but we still don’t know why so many connections are being held open in the first place.

 
 

Known unknowns: Grooveshark downtime situation (updated)

23 Apr

Grooveshark is in the middle of some major unplanned downtime right now, after having major server issues just about all day.

What’s going on? We don’t fully know, but we have some clues.

-We are fairly certain that our load balancer is running well over capacity, and we’ve had a new one on order for a little while already.

-Some time this morning, one of our servers broke. Not only did it break, but our alerts that tell us about our servers were also broken due to a misconfiguration (but only for that server).

-Today near peak time, our two biggest servers stopped getting any web traffic except for very short bursts of very tiny amounts of traffic. At the same time, our other servers started seeing insane load averages. /var/log/messages showed warnings about dropping packets because of conntrack limits. The limits were the same as other boxes that had been doing just fine, and in normal circumstances we never come close to those limits, but the counters showed we were definitely hitting those limits and keeping them pegged.

-After upping the limits to even ridiculously higher numbers, the servers started getting web traffic again, and the other servers saw their load averages return to sane values. The connection counts were hovering somewhat close to the new limits, but seemingly with enough margin that they were never hitting the new limits; good enough for a temporary solution.

-Things were seemingly stable for a couple of hours, but connection numbers were still weird. One of our server techs, who really is awesome at what he does but shall remain nameless here for his own protection, was investigating the issue and noticed a configuration issue on the load balancer that might have been causing the weird traffic patterns and definitely needed to be fixed. Once it was late enough, we decided to restart the load balancer to apply the settings, a disruption of service that would bring the site down for a minute or so under normal circumstances; well worth it for the chance of having things behave tomorrow.

-Obviously, things did not go quite as planned. The load balancer decided not to come back up, leaving us with the only option of sending a tech out to the data center 1.5 hours away to resurrect the thing.

Update 4/23/2010 11:50am:
Well, obviously, Grooveshark is back up and running. Total downtime was 2-3 hours, and we have a slightly better picture of what some of the problems were. The load balancer itself wasn’t exactly the problem; it was actually the interaction between the load balancer and the new core switch we installed last weekend. When the load balancer restarted, the switch got confused and essentially got a false alarm on the routing equivalent of an infinite loop, so it cut off the port, making the load balancer completely inaccessible. We now have it set up so that we can still connect to it even if the switch cuts it off, and we also figured out how to work around that issue if it happens again.
There’s still some weird voodoo going on with the servers that we haven’t fully explained, so be prepared for more slowness today while we continue to look into it.