
Archive for the ‘grooveshark’ Category

Google Analytics: Drop in Safari Visits

28 Sep

Starting around Thursday, 9/19/2013, Google Analytics has been showing about 50% fewer Grooveshark visits from Safari users than on the Thursday before. (Aside: by default we compare each day to the same day of the previous week, because usage patterns on Grooveshark depend heavily on the day of the week. Weekday and weekend usage differ most wildly, but even a Monday session is very different from a Thursday session.)

I suspect it’s related to a longstanding issue explained in these two blog posts:

http://sunpig.com/martin/archives/2010/01/08/how-to-detect-a-page-request-from-safari-4s-top-sites-feature.html
(2010!)
http://blog.splitwise.com/2013/05/13/safari-top-sites-pollutes-google-analytics-and-mixpanel/ (2013)

It appears that the GA team is finally detecting these “preview” loads and not counting them, although I have not been able to get official confirmation of that.
The evidence:

  • It’s not just happening to Grooveshark:
    Google Analytics Forum question
    Superuser question (deleted: google archive)
    Another forum post asking about the drop
  • Average visit duration is up by 163% for Safari users. That makes sense because the Safari preview visits would by definition be extremely short.
  • Our bounce rate for Safari users is down by 82%. We had previously determined that Safari preview visits were all being counted as bounces, artificially inflating that number for Safari users.
  • I also dug into different Safari versions, and at least for the 3 most popular versions on Grooveshark (all that I checked), the % drop in visits is the same, which strongly implies that this drop is not due to some recent update from Apple. We also tested the site to make sure that everything still appears to work in Safari, and we have not seen an uptick in complaints from Safari users which we would expect to see if we had introduced a Safari-breaking bug.
  • New vs Returning: GA reports that “new” Safari visitors are down 36%, while “returning” Safari visitors are down a whopping 57%. This also points towards the Safari preview issue because users who visit Grooveshark more frequently are less likely to be considered new and are more likely to have Grooveshark show up in Safari preview.
  • Pages/Visit by Safari users is up a whopping 76%. This makes sense since a “preview” visit would only hit the homepage, but real users are very likely to visit other pages.

Note: Grooveshark’s numbers are generally considered proprietary, which is why all the stats above are in relative percentages. Sorry if that made it harder to follow.

I have reached out to Google Analytics via Twitter and also commented on the Google Analytics discussion thread, but we have not heard any official word yet. Ideally this is the sort of thing we’d hear about from Google directly before it happened. Barring that, it would be nice to see a blog post from them about it, which also hasn’t happened yet.

Unrelated:
On the off chance that I now have a Googler’s attention: a probably unrelated issue is that for Friday, 9/20/2013, our stats are mysteriously low and internally inconsistent. Visits for that day are 50% lower than the previous Friday, yet our internal bandwidth and user activity numbers look normal for that day. If I look at New vs Returning, we had significantly more returning visitors than we had total visits…how could that even be possible?! We have seen anomalies like that before, but they usually only appear for a short time and are corrected a day or so later.

 
 

Git Tips (just a couple)

05 Jun

Below are a couple of git tips for the advanced git user. This is not meant to be comprehensive; it’s just a couple of things I have figured out that I haven’t seen floating around out there yet.

Always rebase

At Grooveshark, we like to keep our history clean, such that the history of the main repo reflects when code was actually committed/accepted, regardless of when that code was written. As such, we almost always want to rebase. If you search for how to rebase by default, most of the advice you’ll find says to add the following to your ~/.gitconfig:

[branch]
    autosetuprebase = always

What this actually does, however, is tell git to set up each branch to rebase as you create it. Any branch git already knows about will not be set up to rebase; you’ll have to do that manually. And if you later change your mind and decide you don’t want this behavior, you will have to remove it from each branch one by one. Instead, put this in your ~/.gitconfig:

[pull]
    rebase = true

Now the default behavior for pulling is to rebase. If you want to override that for a particular branch, just set rebase = false in the repo’s .git/config, under the section for that branch.
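For example, to keep ordinary merge-pulls on one branch (here a hypothetical branch named legacy) while rebasing everywhere else, the override in that repo’s .git/config would look like:

```ini
[branch "legacy"]
    rebase = false
```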

Github pull requests by the number

As mentioned earlier, when we take a pull request, we want it rebased onto the latest target branch before it gets merged, so the merge is a simple fast-forward. Add the following to your ~/.gitconfig:

[alias]
    review = !sh -c 'REF=`git rev-parse --symbolic-full-name --abbrev-ref HEAD` && git fetch $1 $2:mergeReview -f && git checkout mergeReview && git rebase $REF && git checkout $REF && git diff HEAD..mergeReview' -
    pr = !git review origin refs/pull/$1/head
    approve = merge mergeReview

What’s going on here?
First we make a new command for reviewing arbitrary refs from arbitrary forks, called ‘review’. To use it, first check out the target branch that you want to merge into, then run ‘git review {remote} {ref}’. Behind the scenes, that fetches the remote ref into a temporary local branch called mergeReview, based on the remote ref you are reviewing. It rebases that onto the target branch you were in, switches you back to the target branch, and then does a diff for you to review.
Next we make a new command for reviewing pull requests from github, called ‘pr’. This really just saves you from having to remember where github keeps pull request refs. It just calls git review and passes in the remote ref for the pull request.
Finally, ‘approve’ just merges the temporary mergeReview branch into the branch you’re currently on, which, if you followed the instructions above, is the target branch.
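To see the whole flow end to end, here’s a throwaway walkthrough with the ‘review’ alias expanded by hand, using a local bare repo as the remote (all paths and branch names here are made up for illustration):

```shell
#!/bin/sh
# Demo of the review/approve flow in a scratch directory.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git clone -q "$tmp/origin.git" "$tmp/work"
cd "$tmp/work"
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m 'base commit'
git push -q origin HEAD
target=$(git rev-parse --abbrev-ref HEAD)

# A contributor branch we want to review:
git checkout -qb feature
echo change > file.txt
git add file.txt
git commit -qm 'feature commit'
git push -q origin feature
git checkout -q "$target"

# 'git review origin refs/heads/feature', expanded by hand:
REF=$(git rev-parse --symbolic-full-name --abbrev-ref HEAD)
git fetch -q origin refs/heads/feature:mergeReview -f
git checkout -q mergeReview
git rebase -q "$REF"
git checkout -q "$REF"
git diff HEAD..mergeReview   # review the changes here

# 'git approve' is then a clean fast-forward:
git merge -q mergeReview
```

After the merge, the target branch and mergeReview point at the same commit, which is exactly the fast-forward the aliases are designed to guarantee.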

 

Migrating to github

26 Jan

I needed to migrate a couple of repositories with lots of tags and branches to GitHub, and GitHub’s instructions just weren’t cutting it:

Existing Git Repo?
cd existing_git_repo
git remote add origin [email protected]:[orgname]/[reponame].git
git push -u origin master

That’s great as long as all I care about is moving my master branch, but what I really wanted was to make GitHub have everything that my current origin did, whether or not I was tracking it. A cursory Google search didn’t turn up instructions for how to do this, so I had to figure it out the old-fashioned way, by reading git help pages. The quickest and easiest way I could find, after creating the repo on GitHub, is:

git clone git@[originurl]:[orgname]/[reponame].git [local temp dir] --mirror
cd [local temp dir]
git remote add github [email protected]:[orgname]/[reponame].git
git push -f --mirror github
cd ..
rm -rf [local temp dir]

That’s it! At this point GitHub should have everything that your original origin had, including everything under refs/. It won’t have any of your local branches, since you did everything from a clean checkout. You might also want to change where origin points:

cd [dir where your repo is checked out]
git remote add oldOrigin git@[oldgitserver]:[orgname]/[reponame].git
git remote set-url origin [email protected]:[orgname]/[reponame].git

Of course, this same procedure should work for moving from any git server to any other git server too.
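If you want to convince yourself the procedure really carries everything over, here’s a throwaway sketch that plays it out with two local bare repositories standing in for the old server and GitHub (all paths, branch, and tag names are made up):

```shell
#!/bin/sh
# Demo: mirror-migrate "old.git" to "github.git", both local bare repos.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/old.git"
git init -q --bare "$tmp/github.git"

# Seed the "old" server with a commit, an extra branch, and a tag.
git clone -q "$tmp/old.git" "$tmp/seed"
cd "$tmp/seed"
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m 'initial commit'
git tag v1.0
git branch topic
git push -q origin HEAD v1.0 topic
cd "$tmp"

# The migration itself: mirror-clone, add the new remote, push --mirror.
git clone -q --mirror "$tmp/old.git" mirror
cd mirror
git remote add github "$tmp/github.git"
git push -q --mirror github

# "github" now has every branch and tag the old server had:
git ls-remote github
```

The final ls-remote should list refs/heads/topic and refs/tags/v1.0 alongside the default branch, which is the “everything in refs/” guarantee the post is after.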

 

Grooveshark is Hiring (Part 2 – JavaScript Edition)

19 May

Grooveshark is looking for talented web developers. We are looking to fill a wide variety of web dev positions. The next few posts I make will be job descriptions for each of the major positions we are looking to fill. For part 2, I’m listing our JavaScript developer position. If your skillset leans elsewhere but you feel like you would be a good fit for Grooveshark, by all means apply now rather than waiting for me to post that job description. :)

Grooveshark is looking for a hardcore JavaScript developer

NOTE: Grooveshark is a web application, written with HTML/CSS, JavaScript, PHP, and ActionScript 3, not a web page or collection of webpages. This position involves working directly with the client-side application code in JavaScript: the “backend of the frontend” if you will.

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities:
Maintaining existing client-side code, creating new features and improving existing ones
Writing good, clean, fast, secure code on tight deadlines
Ensuring that client architecture is performant without sacrificing maintainability and flexible enough for rapid feature changes
Striking a balance between optimizing client performance versus minimizing load on the backend
Integration of third-party APIs

Desired Qualities:
Enjoy writing high quality, easy to read, self-documenting code
A passion for learning about new technologies and pushing yourself
A deep understanding of writing bug-free code in an event-driven, asynchronous environment
Attention to detail
A high LOC/bug ratio
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance and with occasional micromanagement
More pragmatic than idealistic

Experience:
Extensive JavaScript experience, preferably in the browser environment
Experience with jQuery
Experience with jQueryMX or another MVC framework
HTML & CSS experience, though writing it will not be a primary responsibility
Some PHP experience, though you won’t be required to write it
Knowledge of cross-browser compatibility ‘gotchas’
Experience with EJS, smarty, or other templating systems
Experience with version control software (especially git or another dvcs)

Bonus points for:
Having written a client application (in any language) that relies on a remote server for data storage and retrieval
Having written a non-trivial jQuery plugin
Experience with JavaScript/Flash communication via ExternalInterface
Experience with integrating popular web APIs (OAuth, Facebook, Twitter, Google, etc) into client applications
Experience with ActionScript 3 outside the Flash Professional environment (i.e., non-timelined code, compiling with mxmlc or similar)
Experience developing on the LAMP stack (able to set up a LAMP install with multiple vhosts on your own)
Experience with profiling/debugging tools
Being well read in Software Engineering practices
Useful contributions to the open source community
Fluency in lots of different programming languages
BS or higher in Computer Science or related field
Being more of an ‘evening’ person than a ‘morning’ person
A passion for music and a desire to revolutionize the industry

Who we don’t want:
Architecture astronauts
Trolls
Complete n00bs (apply for internship or enroll in Grooveshark University instead!)
People who want to work a 9-5 job
People who would rather pretend to know everything than actually learn
Religious adherents to The Right Way To Do Software Development
Anyone who would rather use XML over JSON for RPC

Send us:
Resume
Code samples you love
Code samples you hate
Links to projects you’ve worked on
Favorite reading materials on Software Engineering (e.g. books, blogs)
What you love about JavaScript
What you hate about JavaScript (and not just the differences in browser implementations)
Prototypal vs Classical inheritance – what are the differences and how do you feel about each?
If you could change one thing about the way Grooveshark works, what would it be and how would you implement it?

 

If you want a job:  jay at groovesharkdotcom
If you want an internship: [email protected]

 

Grooveshark is Hiring (Part 1 – PHP edition)

16 May

Grooveshark is looking for talented web developers. We are looking to fill a wide variety of web dev positions. The next few posts I make will be job descriptions for each of the major positions we are looking to fill. For part 1, I’m listing our backend PHP position. If your skillset is more front-end leaning but you feel like you would be a good fit for Grooveshark, by all means apply now rather than waiting for me to post the job description. :)

 

Grooveshark is seeking awesome PHP developers.

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities:
Maintaining existing backend code & APIs, creating new features and improving existing ones
Writing good, clean, fast, secure code on tight deadlines
Identifying and eliminating bottlenecks
Writing and optimizing queries for high-concurrency workloads in SQL, MongoDB, memcached, etc
Identifying and implementing new technologies and strategies to help us scale to the next level

Desired Qualities:
Enjoy writing high quality, easy to read, self-documenting code
A passion for learning about new technologies and pushing yourself
Attention to detail
A high LOC/bug ratio
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance and with occasional micromanagement
More pragmatic than idealistic

Experience:
Experience developing on the LAMP stack (able to set up a LAMP install with multiple vhosts on your own)
Extensive experience with PHP
Extensive experience with SQL
Some experience with JavaScript, HTML & CSS, though you won’t be required to write them
Some experience with lower level languages such as C/C++
Experience with version control software (especially dvcs)

Bonus points for:
Well read in Software Engineering practices
Experience with a SQL database and optimizing queries for high concurrency on large data sets
Experience with NoSQL databases like MongoDB, Redis, memcached
Experience with Nginx
Experience creating APIs
Knowledge of Linux internals
Experience working on large scale systems with high volume of traffic
Useful contributions to the open source community
Fluency in lots of different programming languages
Experience with browser compatibility weirdness
Experience with smarty or other templating systems
BS or higher in Computer Science or related field
Experience with Gearman, RabbitMQ, ActiveMQ or some other job distribution/message passing system for distributing work
A passion for music and a desire to revolutionize the industry

Who we don’t want:
Architecture astronauts
Trolls
Complete n00bs (apply for internship or enroll in Grooveshark University instead!)
People who want to work a 9-5 job
People who would rather pretend to know everything than actually learn
Religious adherents to The Right Way To Do Software Development
Anyone who loves SOAP

Send us your:
Resume
Code samples you love
Code samples you hate
Favorite reading materials on Software Engineering (e.g. books, blogs)
Tell us when you would use a framework, and when you would avoid using a framework
ORM: Pros, cons?
Unit testing: pros, cons?
Magic: pros, cons?
When/why would you denormalize?
Thoughts on SOAP vs REST

If you want a job: jay at groovesharkdotcom

If you want an internship: [email protected]

 

Grooveshark Playlists now in MongoDB

06 Mar

As of about 5:30am last night (this morning?) Grooveshark is now using MongoDB to house playlist information.

Until now playlists have lived in MySQL, but there were some big problems that occasionally led to data loss, due (mostly) to deadlocks. Needless to say, users don’t like it when you lose their data. Moving to Mongo should resolve all of these issues.

Grooveshark has been using MongoDB for sessions and feed data for a while now, so we are comfortable with the technology and know that it is capable of handling massive amounts of traffic. While it’s certainly not perfect, we are confident that it will be easy to scale out to maintain reliability as our user base continues to grow rapidly.

 
 

Why You Should Always Wrap Your Package

30 Jul

Ok, the title is a bit of a stretch, but it’s a good one isn’t it?

What I really want to talk about is an example of why it’s a good idea to make wrappers for PHP extensions instead of just using them directly.

When Grooveshark started using memcached ever-so-long-ago with the memcache pecl extension, we decided to create a GMemcache class that extends memcache. Our main reasons for doing this were to add some convenience (like having the constructor register all the servers) and to add some features the extension was missing (like key prefixes). We recently decided that it’s time to move from the stagnant memcache extension to the pecl memcached extension, which is based on libmemcached and supports many nifty features we’ve been longing for, such as:

  • Binary protocol
  • Timeouts in milliseconds, not seconds
  • getByKey
  • CAS
  • Efficient consistent hashing
  • Buffered writes
  • Asynchronous I/O

Normally such a transition would be a nightmare. Our codebase talks to memcached in a million different places. But since we’ve been using a wrapper from day one, I was able to make a new version of GMemcache with the same interface as the old one that extends memcached instead. It handles all the minor differences between how the two work, so the thousands of other lines in the app that talk to memcached do not have to change. That made the conversion a less-than-one-day project, when it probably would have otherwise taken a month. It also means that if we decide for some reason to go back to pecl memcache, we only have to revert one file.
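As a rough sketch of the pattern (names here are illustrative, not our actual code; the real GMemcache extends the pecl extension directly rather than wrapping an injected client):

```php
<?php
// Illustrative wrapper sketch: call sites only ever see GMemcache,
// which owns server registration and key prefixing, so the underlying
// client library can be swapped without touching call sites.
class GMemcache
{
    private $client; // pecl memcache, pecl memcached, or anything get/set-like
    private $prefix;

    // Convenience: the constructor registers every server up front.
    public function __construct($client, array $servers, $prefix = 'gs:')
    {
        $this->client = $client;
        $this->prefix = $prefix;
        foreach ($servers as $server) {
            $this->client->addServer($server['host'], $server['port']);
        }
    }

    // A feature the old extension lacked: transparent key prefixes.
    public function get($key)
    {
        return $this->client->get($this->prefix . $key);
    }

    public function set($key, $value, $ttl = 0)
    {
        return $this->client->set($this->prefix . $key, $value, $ttl);
    }
}
```

Moving from one memcached client to another then means papering over the minor API differences inside this one class; every call site that uses get()/set() stays untouched.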

 

Last Year

19 May

It was almost a year ago that I made a post about hitting the 500,000 user mark and talked about some of our anticipated scaling issues.

In the past year we’ve grown to over 3.5 million registered users. In the past 30 days we added nearly 500,000. It’s absolutely incredible to me that it took us several years to reach 500,000, and now we add that many without even blinking.

I won’t rehash all the capacity issues we’ve had lately, but needless to say things have been at least as bad as I worried about a year ago. That is partially because our server capacity has not grown nearly as quickly as our user base. Fortunately all that is changing: we are getting a bunch more servers in and should be growing our capacity more regularly from this point on. We still have infrastructure work to do to scale horizontally better, but at least we will soon have servers to scale horizontally to, which is a fairly critical piece of the puzzle, I would say. :)

Oh, and here’s what the office was like a year ago:

Thanks for voting for us at Grooveshark! from ben westermann-clark on Vimeo.

 
 

Technology Stack

06 May

Ever wonder what technology powers Grooveshark?

Well that’s too bad, ’cause I’m going to tell you anyway. Most sites these days run on the LAMP stack (Linux, Apache, MySQL, PHP). Grooveshark runs on the LALMMRSPJG stack, more or less. Don’t try to pronounce that, you’ll only end up hurting yourself.

Linux: (CentOS primarily) for all of our servers except one lone Solaris box, which will be taken out back and shot one of these days, I hope.

Apache: All of our front-end nodes run Apache. By front-end node I mean everything serving up HTTP traffic except for stream servers. For example, listen.grooveshark.com, www.grooveshark.com, tinysong.com, widgets.grooveshark.com, and cowbell.grooveshark.com are all hosted on our front-end nodes.

Lighttpd: Affectionately called lighty around here, it’s super efficient at serving up static content, so we use it on all of our stream servers instead of Apache.

MySQL: We have several database servers, and they all run MySQL, much to my chagrin. We’d be using PostgreSQL if it had been up to me, but it wasn’t so we stick with MySQL. Now that Drizzle is coming along nicely, we are contemplating eventually moving fully or partially onto Drizzle, meaning our stack would be LALDMRSPJG or LALMDMRSPJG. Not much more pronounceable, I’m afraid.

Memcached: Without memcached, we would certainly not be where we are today. At this point nearly everything runs through memcached, reducing database load significantly and increasing site performance at the same time.

Redis: Redis is a new addition to our stack, but a welcome one. Redis is very similar to memcached in that it’s a key-value store, and it’s almost as fast, but it has the advantage of being disk-backed, so if you have to restart the server, you haven’t lost anything when it comes back up. Where memcached helps us save reads from MySQL, Redis helps us save reads and writes, because we can actually use it to store data that we intend to keep around.

Sphinx: MySQL fulltext indexes are absolutely horrible for search, so instead we use a technology called Sphinx. Sphinx recently got moved off of the front end servers and onto its own server, significantly reducing the load on the front end servers and improving the performance of search. Win-win!

PHP: Most of the code that makes Grooveshark work is written in PHP, including all the websites I listed above and the RPC services. Plenty of people out there hate PHP, including (or especially) those of us who program in it. It definitely has its warts, but it’s a language that is quick to develop in, and it performs relatively well if you play to its strengths.

Java: Some of the code running on our servers, especially things that need to maintain state, is written in Java. Things that come to mind are the ad server and some crazy stuff written in Scala for keeping our stream servers in sync.

Gearman: Gearman is an awesome piece of the puzzle that we’re just starting to harness, and it’s going to help us scale out even more in the future. Gearmand is an extremely lightweight job queuing server with support for synchronous and asynchronous jobs. Workers can live on different servers and be written in different languages than their clients. Gearman is great for map/reduce jobs or for allowing things that might be slow to be processed in the background without slowing down the user experience. For example, if our ad server needs to display an ad as quickly as possible *and* it needs to log the fact that it displayed the ad, it can fire off an asynchronous Gearman job for the logging and get right to work on serving up the ad. Even if the logging portion is running incredibly slowly, nothing front-facing has to wait on it.
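The ad-logging example might look something like this sketch (function and payload names are invented; it assumes the pecl gearman extension and a gearmand running on the default port):

```php
<?php
// Client side: fire-and-forget the logging job and return immediately,
// so serving the ad never waits on the (possibly slow) logger.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('logAdImpression', json_encode(array(
    'adID'   => 123,
    'userID' => 456,
    'time'   => time(),
)));

// Worker side -- possibly on a different server -- consumes jobs
// at whatever pace it can manage:
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('logAdImpression', function ($job) {
    $data = json_decode($job->workload(), true);
    // ...write $data to the log store here...
});
while ($worker->work());
```

Because doBackground returns as soon as gearmand has queued the job, the front-facing request is done with the logging step in well under a millisecond of extra work.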
We have a super secret feature launching in about two weeks that would essentially not be possible without Gearman (and Redis). I’ll update in a couple of weeks to explain how Gearman makes it possible, once I can talk about what it is. :)

Please note that this list only covers the backend of the stack. We also have front-end clients written in HTML+JS, Flash+Flex, J2ME, Java, Objective-C, and some others on the way. It also doesn’t yet include Cassandra, but I’m hoping we can add that soon.

 
 

Bypassing Magic

18 Aug

In my post about how we are adding client-side caching to Grooveshark 2.0, I mentioned one of the ways we are taking advantage of the fact that thanks to using Flash, we have a full-blown stateful application.

As Grooveshark evolves and the application becomes more sophisticated, the PHP layer is more and more becoming just an interface to the DB. The application just needs the data; it knows exactly what to do with it from there. It also only needs to ask for one type of data at a time, whereas a traditional webpage would need to load dozens of different pieces of information at the same time. So for our type of application, magic methods and ORM can really just get in the way when all we really need is to run a query, fill up an array with the results of that query, and return it.

Our old libraries, employing ORM, magic methods, and collections, were designed to meet the needs of a typical website and don’t necessarily make sense for a full-fledged application. On a webpage, you might only show 20 results at a time, so the overhead of having a bunch of getters and setters automatically fire whenever you load your data is probably not noticeable. But in an application, you often load far more results than can be displayed and allow the user to interact with them more richly. When you’re loading 500 or 5,000 results as opposed to 20, the overhead of ORM and magic can really start to bog you down. I first noticed the overhead issue when testing new method calls for lite2, when in some cases fetching the data would take over 30 seconds, triggering my locally defined maximum execution time, even when the data was already cached.

Like any responsible developer considering changing code for performance reasons, I profiled our collections code using XDebug and KCachegrind, and then I rewrote the code to bypass collections, magic, and all that stuff, loading data from the DB (or memcache) into an array and returning it. The difference? In the worst case, bypassing magic was an order of magnitude less work, and oftentimes far better than that. My 30+ second example took less than 1 second in the new code.
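The rewritten data-loading style is roughly this shape (a hypothetical sketch with invented table and column names, not our actual library code):

```php
<?php
// Hypothetical sketch: run the query and hand back a plain array of
// rows, instead of hydrating one ORM object per row and paying for
// __get/__set magic and collection wrappers on every access.
function getPlaylistSongs(PDO $db, $playlistID)
{
    $stmt = $db->prepare(
        'SELECT SongID, Name, ArtistName
           FROM PlaylistSongs
          WHERE PlaylistID = ?'
    );
    $stmt->execute(array($playlistID));

    // One plain associative array per row; the client-side application
    // knows exactly what to do with the raw data from here.
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The function does nothing but move rows into an array, which is exactly the point: for an application client that just needs the data, anything more is overhead.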

For Grooveshark 2.0 code, wherever it makes sense, we are bypassing magic, ORM, and collections and loading data directly. This of course means that Grooveshark is faster, but it also means we can load more data at once. In most cases we can now afford to load entire lists of songs without having to paginate the results, which in turn means fewer calls to the backend *and* much less work for the database. Whenever you must LIMIT results, you must also ORDER BY them so they come back in an order that makes sense; not having to ORDER results means that in many cases we save an expensive filesort, which often requires a temporary table in MySQL. Returning the full data set also allows the client to do more with the data, like decide how the results should actually be sorted and displayed to the user. But that’s another post…