500,000 Users and scaling

Categories: grooveshark

Grooveshark surpassed the 500,000 registered user mark today.
Ignoring the fact that many of our users never bother to register (it’s not necessary in order to use the site), 500k is an absolutely phenomenal number, especially compared to where we were just a year ago: 33k. The scary thing is that at our current growth rate, we will have over a million registered users in roughly 3 months.
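
As a rough sanity check on that projection, here is a small sketch that derives a doubling time from the two user counts above. It assumes growth was steadily exponential over the year, which is a simplification, and the function name is purely illustrative.

```php
<?php
// Back-of-the-envelope doubling time, assuming steady exponential growth.
// Only the two user counts come from the post; everything else is illustrative.
function doublingTimeMonths($startUsers, $endUsers, $monthsElapsed)
{
    // Number of doublings needed to get from the start count to the end count.
    $doublings = log($endUsers / $startUsers, 2);
    return $monthsElapsed / $doublings;
}

$months = doublingTimeMonths(33000, 500000, 12);
printf("Doubling time: %.1f months\n", $months);
```

Going from 33k to 500k is a little under four doublings in twelve months, so this works out to roughly three months per doubling, in line with the estimate above.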

Can we double our capacity in just 3 months? History implies that it’s possible; we’ve already grown by much more than that. In fact, we’ve done better than merely keeping up: with little change in infrastructure and much of the same server capacity, we’ve managed to make Grooveshark faster while scaling at the same time.

On the other hand, much of the low-hanging scalability fruit has been picked now. We use memcached extensively, use a master/slave DB configuration with a data warehouse for logging or writes that don’t need to be processed in real time, and have begun doing some rudimentary sharding for stream-related activities.
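
For readers unfamiliar with the term, a minimal sketch of the simplest kind of sharding might look like the following. This is not Grooveshark's actual scheme: the host names are placeholders, and choosing a shard by user ID modulo the shard count is an assumption made for illustration.

```php
<?php
// Hypothetical shard map; the host names are placeholders, not real servers.
$streamShards = array(
    'streamdb0.example',
    'streamdb1.example',
    'streamdb2.example',
    'streamdb3.example',
);

// Pick a shard from a numeric user ID so that all of one user's
// stream-related rows land on the same host.
function pickShard(array $shards, $userId)
{
    return $shards[$userId % count($shards)];
}

echo pickShard($streamShards, 5), "\n"; // 5 % 4 = 1, so streamdb1.example
```

The catch with plain modulo sharding is that changing the number of shards moves most keys to a different host, which is one reason it only counts as rudimentary.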

What’s left? Well, we aren’t yet at the point where we can scale linearly simply by adding more servers, except probably for streaming servers. For that we primarily need more sharding. There are still some SQL optimizations to be made, like bringing session ids down to 16 bytes from 32 (32 on disk and 96 in memory, thanks to utf8) and ultimately getting them out of the database altogether, and we can use memcached even more heavily, but all of those things only buy us time. Not that there is anything wrong with buying time: we also need it to work on new features like last.fm scrobbling, a super-secret redesign, launching on half a dozen mobile platforms, etc., all with a relatively small dev team. Ultimately, though, there are some fundamental architecture changes coming, and if we’re going to keep doubling our number of users every 3 months, they will have to come very soon.
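
To illustrate the session id saving, here is a sketch that assumes the ids are 32-character hexadecimal strings (the post only gives byte counts, so that is a guess). Stored as text in a utf8 column, 32 characters can take up to 3 bytes each in memory, which is presumably where the 96 comes from; packed into raw bytes, the same id fits in 16. The function names are hypothetical.

```php
<?php
// Assumes session IDs are 32-character hexadecimal strings (an assumption;
// the post gives only the byte counts).
function packSessionId($hexId)
{
    if (!preg_match('/^[0-9a-f]{32}$/i', $hexId)) {
        throw new InvalidArgumentException('Expected a 32-character hex session ID');
    }
    // pack('H*') turns each pair of hex digits into one raw byte: 32 chars -> 16 bytes.
    return pack('H*', $hexId);
}

function unpackSessionId($binaryId)
{
    // bin2hex() reverses the packing, e.g. for cookies and logs.
    return bin2hex($binaryId);
}

$hex = md5('example-session'); // any 32-character hex string works here
$raw = packSessionId($hex);
echo strlen($hex), ' chars -> ', strlen($raw), " bytes\n";
```

The packed value would need a binary column type (for example a 16-byte binary column) rather than a utf8 text column, otherwise the raw bytes could be mangled by character set handling.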

Moving Windows Partitions

Categories: Uncategorized

I’m pretty much the last Windows holdout among Grooveshark developers, but this is actually about my Windows install at home. I had a fairly unusual setup:
500GB drive: C:\ (windows x86-32)
500GB drive: D:\ (windows x86-64)
500GB drive: E:\ (data)

I used this configuration to dual-boot between x64 and x86 versions of Windows XP, since I wasn’t sure how many compatibility issues I’d have but I wanted to try to take advantage of all 6GB of RAM I have installed. C:\ was the boot drive, and D:\ was a non-bootable drive. C:\boot.ini simply pointed to D:\ for the x64 version.

I’ve gotten to the point where I almost never boot into the x86 version of XP, so having my windows install essentially span two drives was starting to make me slightly nervous. If C:\ died, I couldn’t get into either version of the OS, but if D:\ died I’d be locked out of x64, which would really be almost as bad. And of course 2 drives = 2x the chance of failure vs 1 drive.

I recently got a 1TB drive, so I thought the perfect solution would be to clone both drives to separate partitions on the same 1TB drive, change boot.ini to point to the right locations and be good to go. Of course, nothing is that simple. When I did that and removed the old drives from the system, I successfully got the boot menu and Windows would start, but it would hang on the login screen, right before showing the list of users.

I looked around online and generally couldn’t find anyone else trying to do exactly what I was trying to do, and most of the advice for even simply moving Windows from one drive to another was “don’t,” so I thought I might be out of luck, but then I stumbled across

Microsoft + SeeqPod?

Categories: grooveshark music

There’s a rumor floating around that Microsoft has bought SeeqPod, mainly fueled, it seems, by the fact that they have a link to Microsoft live search on their home page.

I may regret saying this, but I think that link is a red herring. Microsoft is the last company I would expect to have an interest in SeeqPod, unless their search technology is incredibly impressive and Microsoft intends to apply it to other forms of search. A possibility, but it seems pretty slim. Besides the poor fit in terms of corporate culture, SeeqPod would probably be under an NDA and would most likely be in big trouble for leaking that sort of information early.

If Microsoft is buying SeeqPod for their search technology, don’t expect to see the free streaming service re-launched after the acquisition, at least not until Microsoft has signed deals with the majors, which as we know is a lengthy and expensive process. Of course Microsoft can afford it, but can they profit from it?

In the meantime, Grooveshark is still running, still growing, and we have an API as well, for all those developers left out in the cold after SeeqPod shut down.

Grooveshark has outgrown Gainesville

Categories: Uncategorized

We have just about completed moving all of our servers from Gainesville to Colo5 in Jacksonville. Why did we have to move? Bandwidth! We simply could not get a fat enough tube to handle all of your music streaming demands into Gainesville. Why Jacksonville? Because it’s cheap, and close! We were considering moving to a similar facility in Atlanta, but Colo5 came in cheaper, while offering the same sort of bandwidth that we could expect to get in Atlanta. How much bandwidth? We should have room to grow up to about 20Gbps before we will have to consider expanding to other facilities. We’re going to need a lot more servers before that can happen.

Although we now have plenty of bandwidth for the near to mid-term future, there are still plenty of other bottlenecks that are starting to pinch us, so don’t be surprised if playback is still laggy sometimes or resorts to buffering. The next big improvement we need to make is the “bandwidth” we get from our disks. No point having 20Gbps of headroom in the tube if we can’t actually get that much off of our servers. We have several strategies we are applying to this end, and I may post more about them later if I have time.
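
To put the disk problem in perspective, here is a rough bit of arithmetic. Only the 20Gbps figure comes from above; the 100 MB/s of sustained disk reads per server is a made-up placeholder, not a measurement, and the function name is illustrative.

```php
<?php
// Rough capacity arithmetic. The 20 Gbps figure is from the post; the
// per-server disk throughput passed in below is a made-up placeholder.
function serversToFillLink($linkGbps, $perServerMBps)
{
    // Gigabits/s -> megabytes/s: multiply by 1000, divide by 8 bits per byte.
    $linkMBps = $linkGbps * 1000 / 8;
    return (int) ceil($linkMBps / $perServerMBps);
}

echo serversToFillLink(20, 100), " servers\n"; // 2500 MB/s over 100 MB/s each
```

20Gbps is 2,500 MB/s, so under that placeholder assumption it would take 25 servers reading flat out from disk to saturate the link.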

The transition from Gainesville to Jacksonville was, of course, not as smooth as we had hoped. Everything was going along swimmingly until we went to install our crappy “downtime server” in Gainesville to let users know that we were down and why, while allowing us to take the real web servers with us. The server and router simply would not acknowledge each other’s presence. Our laptops could connect directly to either one just fine, but both thought they had no connection when interfacing directly. The solution? A crappy old Linksys hub, which both the router and server were able to see.

Loading up all of our servers containing all of our data into a UHaul was more than a little scary, but we packed everything very carefully using lots of cardboard, furniture blankets and rope. Not a single server was harmed in the moving process!

Ben got a lot of pictures of the whole event, if you’d like to see more. Big thanks to Ed, Skyler, Joe, Colin and Nate for all working so hard to make the transition as quick and painless as possible, and thanks to Paloma for driving us home and letting us sleep on the way back; we were definitely in no shape to drive after all that.

Washington

Categories: Uncategorized

The following song has infected my brain: Clementine by Washington

Jay does some front-end work

Categories: grooveshark

It’s no secret that I am deeply entrenched in the back-end world. I can optimize the hell out of some queries, and write some pretty complex php, but when it comes to html, css and javascript, I am a barbarian. In fact, I stopped learning about html in the 90’s, before css and back when javascript was pretty useless. I just don’t care for working on visual stuff, especially if it’s going to be finicky and inconsistent on different platforms.

Anyway, a longstanding complaint I’ve had when using Tinysong is that I can’t listen to a song before I share it. How do I make sure it’s the song I’m thinking of? Sure, I could copy the URL and paste it in the location bar and load up lite, but by then my other song options are gone, not to mention that I’m lazy.

I finally decided to take a stab at adding playback to Tinysong myself, and I’m happy to report that everything seems to be working quite well. I can’t take credit for the whole thing, or even most of it. Katy wrote the streaming code, and Chanel made some beautiful javascript wrappers for the whole thing, both for other projects. I simply took what they did and wrote the css and javascript to glue it to Tinysong. I hope I’m not the only one who appreciates this enhancement. Give it a try and let me know what you think.

Goodbye xampp, hello Portable Ubuntu

Categories: Uncategorized

For a long time now I’ve used xampp for testing my scripts locally before committing to the codebase and testing in our dev environment. Apache for Windows, at least in this configuration, was incredibly unstable: it had major memory leaks, and it crashed if it received two connections at once. It was good enough for rudimentary testing, but always a bit annoying. The memory leak issue led me to shut down Apache as soon as I was done testing any script.

I recently discovered Portable Ubuntu or pubuntu for short, which allows you to run ubuntu inside of Windows (yes, I still prefer windows as my desktop environment). I set up apache and php inside of pubuntu, and after following this suggestion (with my own hacks to make it work over the network: see thread), and pointing apache to the mount point for my workspace (also living in windows), everything is working quite smoothly. Apache is both more stable and faster. Ironically, pubuntu with apache running inside of it is more lightweight than the Apache for Windows distribution that came with xampp was for me.

What’s especially cool is that because of the way pubuntu mounts drives, pubuntu is seamlessly accessing my workspace to serve up files. That means I can edit a php file in my windows IDE of choice (notably not Vim), and that change is immediately reflected in pubuntu with no effort on my part.

Detect crawlers with PHP faster

Categories: Coding grooveshark

At Grooveshark we use DB-based php sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like Google end up creating tens of thousands of sessions every day, because they of course do not carry cookies around with them.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    // Each entry is array(substring to look for, display name).
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Case-insensitive substring search, one crawler at a time.
    foreach ($crawlers as $c) {
        if (stristr($USER_AGENT, $c[0])) {
            return($c[1]);
        }
    }
    return false;
}

Essentially, this loops over the entire list of possible clients and searches the user agent string for each one, one at a time. That seems way too slow and inefficient for something that is going to have to run on essentially every call on a high-volume website, so I rewrote it to look like this:

// Static method on one of our classes.
public static function getIsCrawler($userAgent)
{
    // Same names as the list above, joined into a single alternation.
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    // Note: there is no /i flag, so unlike stristr() this match is case-sensitive.
    $isCrawler = (preg_match("/$crawlers/", $userAgent) > 0);
    return $isCrawler;
}
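
The step of not giving crawlers a session is not shown above, so here is one way it might be wired up. This is a sketch under assumptions: the function names are hypothetical, the pattern list is shortened, and a DB-backed session handler is assumed to be registered elsewhere before `session_start()` runs.

```php
<?php
// Sketch only: function names and the shortened pattern list are
// illustrative, not the code that runs in production.
function looksLikeCrawler($userAgent)
{
    $crawlers = 'Google|msnbot|Rambler|Yahoo|Gigabot|Lycos|Scooter|AltaVista';
    return preg_match("/$crawlers/", $userAgent) === 1;
}

// Returns true if a session is active afterwards, false if it was skipped.
function startSessionUnlessCrawler($userAgent)
{
    if (looksLikeCrawler($userAgent)) {
        return false; // crawlers never touch the sessions table
    }
    if (session_status() === PHP_SESSION_NONE) {
        session_start();
    }
    return true;
}

// Typical call site; a missing header is treated as a normal visitor.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
```

Any code that reads `$_SESSION` afterwards has to cope with the session being absent for crawler requests.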

In my not-very-scientific testing, running on my local box, my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers to do 1 million comparisons takes 70 seconds. So there you have it: using a single regex for string matching rather than looping over an array can be more than 6 times faster. I suspect, but have not tested, that the performance gap gets bigger the more strings you are testing against.
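
For anyone who wants to repeat the comparison, a small timing harness along these lines should do. It is not the script used for the numbers above: the iteration count, the sample user agent and the shortened crawler list are arbitrary choices, and the regex here uses the `i` flag so that both approaches are case-insensitive.

```php
<?php
// Minimal timing harness; all names and inputs here are illustrative.
$names = array('Google', 'msnbot', 'Rambler', 'Yahoo', 'Gigabot', 'Lycos', 'Scooter', 'AltaVista');
$pattern = '/' . implode('|', $names) . '/i';
// This user agent matches nothing, so the loop has to scan the whole list.
$userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/2009 Firefox/3.0';

function matchByLoop(array $names, $userAgent)
{
    foreach ($names as $name) {
        if (stristr($userAgent, $name) !== false) {
            return true;
        }
    }
    return false;
}

function matchByRegex($pattern, $userAgent)
{
    return preg_match($pattern, $userAgent) === 1;
}

// Runs $callback $iterations times and returns the elapsed wall-clock seconds.
function timeIt($callback, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $callback();
    }
    return microtime(true) - $start;
}

$iterations = 100000;
$loopSeconds = timeIt(function () use ($names, $userAgent) {
    matchByLoop($names, $userAgent);
}, $iterations);
$regexSeconds = timeIt(function () use ($pattern, $userAgent) {
    matchByRegex($pattern, $userAgent);
}, $iterations);

printf("loop: %.3fs, regex: %.3fs\n", $loopSeconds, $regexSeconds);
```

The ratio will vary with the PHP version, the length of the list and whether the user agent matches early or not at all, so it is worth measuring with real traffic samples.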

A reason to upgrade

Categories: grooveshark music

I use Flash Player 9 at work, and until recently I also used it at home. Version 10 is out, but most people still have 9, so for testing purposes I wanted to stick with 9, especially since Katy has switched to 10. Someone has to catch old bugs!

Anyway my Flash Player accidentally got upgraded to 10 at home, and I recently discovered this delicious Tom Waits song: Invitation to the Blues.

Alas, when I tried to play it at work, it sounded like a very sad, drunken donkey was wailing at me. That is usually indicative of a sample rate issue (Flash is very picky about sample rates), but close inspection of the file hasn’t turned up anything wrong with it, including sample rate. If I want to listen to that song at work, I’ll have to upgrade to Flash 10.

Tom Waits, or reliability? Hmm, a tough choice indeed.

Back from FOWA Miami

Categories: Uncategorized

Several of us Grooveshark developers went down to FOWA Miami this year (and Barcamp Miami too!), which was a fun and educational experience. We got to see my personal hero Joel Spolsky, whom Katy and I briefly met after his talk, and we got to learn about some cool new and upcoming technologies.

There were also, of course, some great pre- and post-conference parties, so we got to meet lots of cool people and network.

Aside from meeting Joel Spolsky, my favorite experience of the entire trip was being accosted by fans of Grooveshark several times over the course of the weekend. It’s really incredibly gratifying to know that Grooveshark has fans, and they’re real people!

Over the next week or so, if I’m not too lazy, I will be posting my thoughts on the FOWA talks, and about some of the conversations I had while I was there, so stay tuned.