Archive for the ‘grooveshark’ Category

Microsoft + SeeqPod?

09 May

There’s a rumor floating around that Microsoft has bought SeeqPod, fueled mainly, it seems, by the fact that SeeqPod has a link to Microsoft Live Search on its home page.

I may regret saying this, but I think that link is a red herring. Microsoft is the last company I would expect to have an interest in SeeqPod, unless SeeqPod’s search technology is incredibly impressive and Microsoft intends to apply it to other forms of search. That’s a possibility, but it seems pretty slim. Besides, the two companies would be a bad fit in terms of corporate culture, and if a deal were in the works, SeeqPod would probably be under an NDA and would most likely be in big trouble for leaking that sort of information early.

If Microsoft is buying SeeqPod for their search technology, don’t expect to see the free streaming service re-launched after the acquisition, at least not until Microsoft has signed deals with the majors, which as we know is a lengthy and expensive process. Of course Microsoft can afford it, but can they profit from it?

In the meantime, Grooveshark is still running, still growing, and we have an API as well, for all those developers left out in the cold after SeeqPod shut down.

 
 

Jay does some front-end work

20 Apr

It’s no secret that I am deeply entrenched in the back-end world. I can optimize the hell out of some queries, and write some pretty complex php, but when it comes to html, css and javascript, I am a barbarian. In fact, I stopped learning about html in the 90s, before css existed and back when javascript was pretty useless. I just don’t care for working on visual stuff, especially if it’s going to be finicky and inconsistent on different platforms.

Anyway, a longstanding complaint I’ve had when using Tinysong is that I can’t listen to a song before I share it. How do I make sure it’s the song I’m thinking of? Sure, I could copy the URL and paste it in the location bar and load up lite, but by then my other song options are gone, not to mention that I’m lazy.

I finally decided to take a stab at adding playback to Tinysong myself, and I’m happy to report that everything seems to be working quite well. I can’t take credit for the whole thing, or even most of it. Katy wrote the streaming code, and Chanel made some beautiful javascript wrappers for it, both for other projects. I simply took what they did and wrote the css and javascript to glue it to Tinysong. I hope I’m not the only one who appreciates this enhancement. Give it a try and let me know what you think.

 
 

Detect crawlers with PHP faster

08 Apr

At Grooveshark we use DB-based PHP sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like Google end up creating tens of thousands of sessions every day, because they of course do not carry cookies around with them, so every request they make starts a brand new session.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    // Each entry is array(substring to look for, display name of the crawler).
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // One case-insensitive substring search per crawler, until something matches.
    foreach ($crawlers as $c) {
        if (stristr($USER_AGENT, $c[0])) {
            return($c[1]);
        }
    }
    return false;
}

Essentially, this does a foreach loop over the entire list of possible crawlers and searches the user agent string for each one, one at a time. That seems way too slow and inefficient for something that has to run on essentially every call on a high-volume website, so I rewrote it to look like this:

public static function getIsCrawler($userAgent)
{
    // A single alternation pattern, so the user agent is handed to the regex engine once.
    // Note: unlike stristr() above, this match is case-sensitive; append the "i" modifier
    // to the pattern if you need case-insensitive matching.
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/", $userAgent) > 0);
    return $isCrawler;
}

In my not-very-scientific testing on my local box, my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers to do 1 million comparisons takes 70 seconds. So there you have it: using a single regex for string matching rather than looping over an array can be more than 6 times faster. I suspect, but have not tested, that the performance gap gets bigger the more strings you are testing against.
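To show how a check like this might be wired into session handling, here is a minimal sketch. The function names buildCrawlerPattern() and isCrawlerUserAgent() are made up for illustration, the crawler list is a shortened copy of the one above, and the "i" modifier is an addition that restores the case-insensitive behavior of the stristr() version. It assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: build one case-insensitive alternation pattern from a
// list of crawler names. preg_quote() escapes regex metacharacters so that a
// name is always matched literally.
function buildCrawlerPattern(array $names): string
{
    $quoted = array_map(function ($name) {
        return preg_quote($name, '/');
    }, $names);

    return '/' . implode('|', $quoted) . '/i';
}

// Hypothetical helper: true if the user agent contains any of the names.
function isCrawlerUserAgent(string $userAgent, string $pattern): bool
{
    return preg_match($pattern, $userAgent) === 1;
}

// Shortened list, for illustration only.
$pattern = buildCrawlerPattern(array('Google', 'msnbot', 'Yahoo', 'FAST-WebCrawler'));

// In a real request handler the session would only be started for non-crawlers:
//
//     $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
//     if (!isCrawlerUserAgent($userAgent, $pattern)) {
//         session_start();
//     }
```

Building the pattern once and reusing it keeps the per-request work down to a single preg_match() call.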

 
 

A reason to upgrade

24 Mar

I use Flash Player 9 at work, and until recently I also used it at home. Version 10 is out, but most people still have 9, so for testing purposes I wanted to stick with 9, especially since Katy has switched to 10. Someone has to catch old bugs!

Anyway my Flash Player accidentally got upgraded to 10 at home, and I recently discovered this delicious Tom Waits song: Invitation to the Blues.

Alas, when I tried to play it at work, it sounded like a very sad, drunken donkey was wailing at me. That is usually indicative of a sample rate issue (Flash is very picky about sample rates), but close inspection of the file hasn’t turned up anything wrong with it, including sample rate. If I want to listen to that song at work, I’ll have to upgrade to Flash 10.

Tom Waits, or reliability? Hmm, a tough choice indeed.

 
 

The Art of Elegant Code: Eliminating special cases that aren’t

18 Feb

One of my personal pet peeves is code that contains a bunch of conditional logic to handle seemingly special cases that really aren’t.

For example, let’s say we have a page that shows a user their playlists if they are logged in, but this page also has other useful information on it, and users aren’t required to log in. In our original design for our authentication class, if authentication “failed” (i.e. the user was not logged in), a call to Auth::getUserID() would return an empty string.

With that implementation, every piece that might have something to display to a logged in user has to have a bunch of conditional logic checking to see if the user is logged in, doing one thing if they are and another if they are not, adding unnecessary complexity and, of course, more potential for bugs to crop up.

This type of logic is completely unnecessary. I changed Auth to return a userID of 0 if the user is not logged in, and now it is not necessary to have any special handling for that case. If the user is logged in, they get playlists. If the user is not logged in, userID 0 does not have any playlists so they do not get playlists. I estimate that this simple change made over 100 lines of code obsolete. Not a whole lot of code in the grand scheme of things, but how many bugs can hide in 100 lines of code?
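A minimal sketch of the idea, not our actual Auth class: SketchAuth, getPlaylists() and the in-memory $playlistsByUser array are all hypothetical stand-ins (the real lookup would be a query). The point is that the calling code has no logged-in check at all, because userID 0 simply has no playlists.

```php
<?php
// Hypothetical stand-in for the authentication class described above.
class SketchAuth
{
    // Always returns an int: the user's ID, or 0 when nobody is logged in.
    public static function getUserID(array $session): int
    {
        return isset($session['userID']) ? (int) $session['userID'] : 0;
    }
}

// Hypothetical lookup; an array stands in for the playlists table.
// Always returns an array, possibly empty.
function getPlaylists(int $userID, array $playlistsByUser): array
{
    return isset($playlistsByUser[$userID]) ? $playlistsByUser[$userID] : array();
}

$playlistsByUser = array(42 => array('Road trip', 'Tom Waits'));

// The same code path serves a logged-in session and an anonymous one.
foreach (array(array('userID' => 42), array()) as $session) {
    $playlists = getPlaylists(SketchAuth::getUserID($session), $playlistsByUser);
    echo count($playlists), " playlist(s)\n";
}
```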

Another example of the same principle involves exceptions (which, as regular readers of Raymond Chen know, are hard to deal with). Under certain circumstances, recommendations from an external source can be included with our normal set of recommendations. The author of the recommendation-supplementing code originally had it throwing exceptions whenever it had problems. Of course, the case of not having supplemental recommendations because of an error is not really a special case. As far as my code is concerned, you just don’t have supplemental recommendations. In this case modifying the original code eliminates unnecessary handling for a special case and eliminates the potential for an uncaught exception to slip through.

A general rule of thumb to help avoid unnecessary special cases, at least with PHP, is to always return what you say you are going to return (and don’t throw exceptions). If your method processes some data which results in an array, return an array even if the processing has no results. This is the way most native functions in PHP already work, if you think about it. count(array()) doesn’t return an empty string or null or raise an exception, nor does count(null) in PHP 5 (PHP 7.2 and later emit a warning for count(null), and PHP 8 throws a TypeError). Really, neither of these is a special case. In either case, the number of elements is zero, and developers are not required to care about the differences.
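Applied to the supplemental recommendations example, the rule might look something like the sketch below. The function names and the callable parameter are hypothetical; the real code talks to an external source, which is replaced here by whatever callable you pass in.

```php
<?php
// Hypothetical wrapper: always returns an array. A failure in the external
// source is treated as "no supplemental recommendations", not as an error
// the caller has to handle.
function getSupplementalRecommendations(callable $fetchFromExternalSource): array
{
    try {
        $result = $fetchFromExternalSource();
    } catch (Exception $e) {
        return array();
    }

    return is_array($result) ? $result : array();
}

// The caller needs no special case: merging an empty array changes nothing.
function getRecommendations(array $normal, callable $fetchFromExternalSource): array
{
    return array_merge($normal, getSupplementalRecommendations($fetchFromExternalSource));
}
```

If the failure matters for monitoring, the catch block is a natural place to log it; the return type stays the same either way.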

 
 

Lisa Hannigan

11 Feb

Sometimes you find out about things happening at your own company in the strangest ways. I discovered Lisa Hannigan through a friend, who discovered Lisa Hannigan because we are promoting her music. I had no idea. It’s actually really good, which is why I’m posting it here.

 
 

Optimize for concurrency, not throughput

06 Feb

When it comes to disks, raid and otherwise, there’s a lot of confusion about how to optimize for different server configurations, and unfortunately much more emphasis seems to be placed on sustained throughput, with little mention going to IOPS or concurrency.

Here at Grooveshark we use Sun’s X4500, aka “Thumper,” for housing your MP3s. Lately we’ve been hitting some buffering issues during peak hours, where it seemed like the 4500 was not able to keep up with the requests coming in. Although we are growing rapidly, we should still be nowhere near saturating the IOPS of 48 drives, but iostat -Cx was showing that the drives dedicated to serving up content had transactions waiting 100% of the time, and the disks were busy 100% of the time. Service time was in the triple digits. Insanity. Something was obviously misconfigured.

We were using 3 zpools for our content, each configured in raidz with about 8 drives (if I remember correctly). Ok, so we actually only have the IOPS of 24 drives in that configuration, but we still should not be anywhere near saturating that. After a fair amount of digging, I discovered that raidz is probably the worst configuration possible for serving up thousands of 3-5 MB files concurrently. That’s because raidz causes every drive to engage in every read request, no matter how small the file. That means your theoretical transfer rate for any given file is N times the transfer rate of each drive, but your IOPS is the same as one drive, no matter how many drives you have in your pool. In other words, adding more drives to a raidz increases disk transfer bandwidth, but does nothing to alleviate the overhead associated with random access seeks. This configuration is ideal if you are transferring multi-gigabyte files frequently, but at 3-5 MB, seek time makes up almost all of the overhead, and it would be much better to have N drives all seeking to N files simultaneously.

I brought up these concerns to our sysadmins, and they set up a small mirrored pool to handle our most popular content, and the performance difference is quite astounding even with just two disks in that pool. In a mirrored configuration like this, each drive can respond to requests for data separately because they each have the full set of data. Adding 2 more drives to the 24 drives already in production increased our IOPS capacity by about two thirds, because we went from the IOPS of 3 drives to the IOPS of 5. Our system admins will be adding more disks to the mirror to give us even more breathing room. We’ll have to wait and see tomorrow if the added IOPS capacity eliminates the buffering issues that users have been running into lately, but I bet it will.
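The arithmetic can be sketched as a back-of-the-envelope model. It is a simplification under stated assumptions: each raidz group is treated as having the random-read IOPS of a single drive, each drive in a mirror is treated as serving reads independently, and the figure of 100 IOPS per drive is an assumed round number, not something we measured.

```php
<?php
// Assumed round number for illustration only; not a measurement.
const ASSUMED_IOPS_PER_DRIVE = 100;

// Simplified model: every drive in a raidz group takes part in every read,
// so each group behaves like one drive for random reads.
function raidzRandomReadIops(int $raidzGroups, int $iopsPerDrive): int
{
    return $raidzGroups * $iopsPerDrive;
}

// Simplified model: every drive in a mirror holds all the data and can
// serve a different read at the same time.
function mirrorRandomReadIops(int $mirrorDrives, int $iopsPerDrive): int
{
    return $mirrorDrives * $iopsPerDrive;
}

$before = raidzRandomReadIops(3, ASSUMED_IOPS_PER_DRIVE);                      // 3 raidz pools
$after  = $before + mirrorRandomReadIops(2, ASSUMED_IOPS_PER_DRIVE);           // plus a 2-disk mirror

printf("before: %d IOPS, after: %d IOPS (%.2fx)\n", $before, $after, $after / $before);
// prints "before: 300 IOPS, after: 500 IOPS (1.67x)"
```

The per-drive figure cancels out of the ratio, so the 3-to-5 comparison holds whatever the real number is, as long as the model's assumptions do.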

 
 

Indexes Matter (or: Memcache Will Only Take You So Far)

28 Jan

About a week ago, I was doing some work on the DB in the middle of the night and noticed that my simple queries were running a bit sluggish. I dropped out of mysql and ran top, and noticed that load averages were way higher than I was used to seeing. I ran SHOW FULL PROCESSLIST a bunch of times, and noticed two queries popping up frequently: one was a backend processing query which did not belong on the production database, and the other was the query used to build Widget objects. My first suspect was the backend process, since it did not belong, so we moved it to a more appropriate server, which brought the load average down by 1. That was a significant improvement, and although the load averages were still pretty high, the server was usable and responsive enough again, so I forgot about it.

A couple of days later, I noticed our load averages were still pretty high and the main recurring query was still the widget one, so I ran an explain on it, and although the query looked innocent enough, it was missing an index, so instead of a quick lookup it was a full table scan across millions of rows. Ouch.

I knew we wouldn’t have a chance to have some downtime to run the necessary ALTERs to get the indexes in there until after the weekend, so I asked Chanel to put in memcache support so that widgets would only need to be loaded once from SQL. Chanel got that done on Sunday, and on Monday night we were able to get the proper indexes added.
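The caching Chanel added follows the usual read-through pattern: check the cache, fall back to SQL on a miss, and store the result. The sketch below is not her code. A plain PHP array stands in for memcache so the example runs without a server, and loadWidget(), the key format and the loader callable are all hypothetical.

```php
<?php
// Hypothetical read-through cache. $cache is a plain array standing in for
// memcache; $loadFromSql stands in for the (slow) widget query.
function loadWidget(int $widgetID, array &$cache, callable $loadFromSql): array
{
    $key = 'widget:' . $widgetID;

    // Cache hit: no SQL at all.
    if (isset($cache[$key])) {
        return $cache[$key];
    }

    // Cache miss: run the query once and remember the result.
    $widget = $loadFromSql($widgetID);
    $cache[$key] = $widget;

    return $widget;
}

// Demonstration with a fake loader that counts how often "SQL" is hit.
$sqlCalls = 0;
$fakeLoader = function (int $widgetID) use (&$sqlCalls): array {
    $sqlCalls++;
    return array('id' => $widgetID, 'songs' => array('a', 'b'));
};

$cache  = array();
$first  = loadWidget(7, $cache, $fakeLoader);
$second = loadWidget(7, $cache, $fakeLoader);
echo "SQL calls: $sqlCalls\n"; // prints "SQL calls: 1"
```

A cache like this only hides the slow query on repeat requests; every miss still pays for the full table scan, which is why the index was still needed.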

Because of the time span involved, and because we monitor server metrics with Zabbix, we can look back at a nice little graph of our performance before and after each of the changes.

The days with the grey background are Saturday and Sunday, before memcache was added. The next day, with memcache added, the peak load is cut in half. The day after that, with proper indexes, the peak load is barely perceptible, roughly 1/4 of what the load was with just memcache.

The lesson to be learned from this is that while memcache can help quite a bit, there’s a lot to be said for making sure your SQL queries are optimized.

 
 

Grooveshark is growing!

14 Jan

I’m probably not at liberty to speak about specific numbers, but I have to share my elation at Grooveshark’s current growth rate.
Last month I predicted that given our current growth rate, we should double our total number of users every 3.6 months.

We are one day away from hitting a big round number, so I thought I’d look back and see how long ago we were at half of that. What do you know, 3.5 months ago. A growth rate slightly better than I had projected.

This, my friends, is exponential growth, and it turns out that when you have exponential growth it is insanely easy to estimate how long it will take to double your numbers. It’s called the rule of 72, and it basically states that you simply divide 72 by your growth rate (in percent per period) to get your approximate doubling time. For example, 72/(20% per month) = 3.6 months.
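For the curious, the rule of 72 is a shortcut for the exact formula, doubling time = ln(2) / ln(1 + rate). A small sketch comparing the two, using the 20% per month from the example (the function names are made up for illustration):

```php
<?php
// Rule-of-72 estimate: growth rate is given in percent per period.
function ruleOf72DoublingTime(float $growthPercentPerPeriod): float
{
    return 72.0 / $growthPercentPerPeriod;
}

// Exact doubling time for compound growth at the same rate.
function exactDoublingTime(float $growthPercentPerPeriod): float
{
    return log(2.0) / log(1.0 + $growthPercentPerPeriod / 100.0);
}

printf("rule of 72: %.1f months\n", ruleOf72DoublingTime(20.0)); // prints "rule of 72: 3.6 months"
printf("exact:      %.1f months\n", exactDoublingTime(20.0));    // prints "exact:      3.8 months"
```

The shortcut is closest at single-digit growth rates and drifts a little at higher ones, but for a quick mental estimate it is close enough.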

I must credit Dr. Albert Bartlett for teaching me about the rule of 72. According to him, “the greatest shortcoming of the human race is our inability to understand the exponential function,” and I highly recommend watching his lecture on the topic. (Sorry, it’s a .ram file. Here is an alternate version on Google Video, which I haven’t watched.)

 
 

Grooveshark just got more album art

20 Dec

I’m still getting used to not calling it “Grooveshark Lite.” For those who haven’t noticed, it’s been re-branded as just plain old Grooveshark. On one hand, it’s an awesome testament to just how successful the project has been. On the other hand, it’s a little harder to talk about it now. I can take credit for a large portion of Grooveshark Lite, but I can’t really take credit for a large portion of Grooveshark; Grooveshark represents so much more to me than just our player.

Anyway, thanks to a nifty little Perl script that Travis helpfully rewrote for me to be more compatible with our hosting environment, I was able to grab art for an extra 57,000 or so albums that we had not been able to get from our partners previously. What does this mean for you, the user? You should see less of this:

in your queue and song info panels, and more real album art, like this:

What’s the magic trick? Simple: the script looks for art embedded inside the mp3s that we have associated with an album, and guesses at which piece of art is best. It’s surprising how many mp3s actually have art embedded in them.
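The actual script is in Perl and its guessing logic isn’t shown here, so the following is only one plausible heuristic, written in PHP for illustration: pick the image that is embedded in the most tracks of the album. The function name pickAlbumArt() is hypothetical, and the input is assumed to be the raw image bytes already extracted from each track (null or an empty string where a track has no art).

```php
<?php
// Hypothetical heuristic: choose the embedded image shared by the most
// tracks. Ties go to the image that was seen first.
function pickAlbumArt(array $embeddedImages): ?string
{
    $counts = array();
    $bytesByHash = array();

    foreach ($embeddedImages as $bytes) {
        if ($bytes === null || $bytes === '') {
            continue; // this track has no embedded art
        }
        $hash = md5($bytes);
        if (!isset($counts[$hash])) {
            $counts[$hash] = 0;
            $bytesByHash[$hash] = $bytes;
        }
        $counts[$hash]++;
    }

    $bestHash = null;
    $bestCount = 0;
    foreach ($counts as $hash => $count) {
        if ($count > $bestCount) {
            $bestHash = $hash;
            $bestCount = $count;
        }
    }

    // null means no track on the album had any art.
    return $bestHash === null ? null : $bytesByHash[$bestHash];
}
```

Other signals, such as image dimensions or file size, could just as reasonably be used to break ties.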

Hopefully this will make Grooveshark more usable. I know I find it very frustrating when I have a large queue filled mostly with the default album art; it’s very hard to tell where you are when everything looks the same.