Google Analytics: Drop in Safari Visits

28 Sep

Starting around Thursday, 9/19/2013, Google Analytics has been showing about 50% fewer visits from Safari users to Grooveshark than it showed the Thursday before. (Aside: by default we compare each day to the same day of the week from the previous week, because usage patterns on Grooveshark depend heavily on the day of the week. Usage differs most wildly between weekdays and weekends, but even a Monday session is very different from a Thursday session.)

I suspect it’s related to a longstanding issue as explained in these 2 blog posts:

http://sunpig.com/martin/archives/2010/01/08/how-to-detect-a-page-request-from-safari-4s-top-sites-feature.html (2010!)
http://blog.splitwise.com/2013/05/13/safari-top-sites-pollutes-google-analytics-and-mixpanel/ (2013)

It appears that the GA team is finally detecting these “preview” loads and not counting them, although I have not been able to get official confirmation of that.
The evidence:

  • It’s not just happening to Grooveshark:
    Google Analytics Forum question
    Superuser question (deleted: google archive)
    Another forum post asking about the drop
  • Average visit duration is up by 163% for Safari users. That makes sense, because the Safari preview visits would by definition be extremely short.
  • Our bounce rate for Safari users is down by 82%. We had previously determined that Safari preview visits were all being counted as bounces, artificially inflating that number for Safari users.
  • I also dug into different Safari versions, and at least for the 3 most popular versions on Grooveshark (all that I checked), the % drop in visits is the same, which strongly implies that this drop is not due to some recent update from Apple. We also tested the site to make sure that everything still appears to work in Safari, and we have not seen an uptick in complaints from Safari users which we would expect to see if we had introduced a Safari-breaking bug.
  • New vs Returning: GA reports that “new” Safari visitors are down 36%, while “returning” Safari visitors are down a whopping 57%. This also points towards the Safari preview issue because users who visit Grooveshark more frequently are less likely to be considered new and are more likely to have Grooveshark show up in Safari preview.
  • Pages/Visit by Safari users is up 76%. This makes sense, since a “preview” visit would only hit the homepage, but real users are very likely to visit other pages.

Note: Grooveshark’s numbers are generally considered proprietary, which is why all the stats above are in relative percentages. Sorry if that made it harder to follow.

I have reached out to Google Analytics via Twitter and also commented on the Google Analytics discussion thread, but we have not heard any official word yet. Ideally this is the sort of thing we’d hear about from Google directly before it happened. Barring that, it would be nice to see a blog post from them about it, which also hasn’t happened yet.

Unrelated:
On the off chance that I now have a Googler’s attention, a probably unrelated issue: for Friday, 9/20/2013, our stats are mysteriously low and internally inconsistent. Visits for that day are 50% lower than the previous Friday, yet our internal bandwidth and user activity numbers look normal for that day. If I look at New vs Returning, we had significantly more returning visitors than we had total visits…how could that even be possible?! We have seen anomalies like that before, but they usually only appear for a short time and are corrected a day or so later.

 
 

Git Tips (just a couple)

05 Jun

Below are a couple of git tips for the advanced git user. This is not meant to be remotely comprehensive, just a couple of things I have figured out that I haven’t seen floating around out there yet.

Always rebase

At Grooveshark, we like to keep our history clean, so that the history of the main repo reflects when code was actually committed/accepted, regardless of when that code was written. As such, we almost always want to rebase. If you search for how to rebase by default, most of the advice you’ll find says to add the following to your ~/.gitconfig:

[branch]
    autosetuprebase = always

What this does, however, is tell git to set up each branch to rebase as you create it. Any branch that git already knows about will not be set up to rebase; you’ll have to do that manually. And if you later change your mind and decide you don’t want this behavior, you will have to remove it from each branch one by one. Instead, put this in your ~/.gitconfig:

[pull]
    rebase = true

Now the default behavior for pulling is to rebase. If you want to override that behavior for a particular branch, just set rebase = false in the .git/config file for the applicable repo, under the particular branch you want to do that for.
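For example, if you had one branch (call it legacy; the name here is hypothetical) where you wanted pulls to keep merging, the override in that repo’s .git/config would look like:

[branch "legacy"]
    rebase = false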

Github pull requests by the number

As mentioned earlier, when we do a pull request, we want it to be rebased onto the latest before it gets merged, so the merge is a simple fast-forward. Add the following to your ~/.gitconfig:

[alias]
    review = !sh -c 'REF=`git rev-parse --symbolic-full-name --abbrev-ref HEAD` && git fetch $1 $2:mergeReview -f && git checkout mergeReview && git rebase $REF && git checkout $REF && git diff HEAD..mergeReview' -
    pr = !sh -c 'git review origin refs/pull/$1/head' -
    approve = merge mergeReview

What’s going on here?
First we make a new command for reviewing arbitrary refs from arbitrary forks, called ‘review’. To use it, first check out the target branch that you want to merge into, then run ‘git review {remote} {ref}’. Behind the scenes, that fetches the remote ref into a temporary local branch called mergeReview, based on the remote ref you are reviewing. It rebases that onto the target branch you were on, switches you back to the target branch, and then does a diff for you to review.
Next we make a new command for reviewing pull requests from github, called ‘pr’. This really just saves you from having to remember where github keeps pull request refs. It just calls git review and passes in the remote ref for the pull request.
Finally, ‘approve’ just merges the temporary mergeReview branch into the branch you’re currently in, which if you followed the instructions above is the target branch.
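Putting it all together, reviewing a hypothetical pull request #42 into master looks something like this:

git checkout master
git pr 42
# look over the diff, run the tests...
git approve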

 

XSS cheat sheet

26 Apr

Discovered this amazing XSS cheat sheet while trying to prove to a co-worker that using a regex to prevent <script> tags embedded in HTML was not going to be effective. I knew about the UTF-8 vulnerabilities and all of the obvious ones, but the US-ASCII encoding one especially was new to me. Impressive!

 
 

Ultimate Steak Marinade

03 Sep

This is a bit off topic from the general theme of this blog, but this was just so delicious I wanted to save it so I can make it again.

Adapted from this recipe which claims to be from Alton Brown. I didn’t have all the ingredients, so I made modifications to end up with this:

Ingredients
5 tbsp soy sauce
4 tbsp Worcestershire sauce
2 tbsp balsamic vinegar
1 tbsp brown sugar
2 tbsp lemon juice
1 tsp garlic salt
1 tbsp yellow mustard
1 tbsp extra virgin olive oil

Mixed that all together and let 2 top sirloin steaks marinate for 2 days before cooking as usual on the grill. Served with a side of grilled Brussels sprouts and grilled potato wedges. Also cooked the au jus to use as a dipping sauce on the side.

 

 

Migrating to github

26 Jan

I needed to migrate a couple of repositories with lots of tags and branches to github, and Github’s instructions just weren’t cutting it:

Existing Git Repo?
cd existing_git_repo
git remote add origin git@github.com:[orgname]/[reponame].git
git push -u origin master

That’s great as long as all I care about is moving my master branch, but what I really wanted was to make github have everything that my current origin did, whether or not I was tracking it. A cursory Google search didn’t find any instructions for how to do this, so I had to figure it out the old-fashioned way, by reading git help pages. The quickest and easiest way I could find to do this, after creating the repo on github, is:

git clone git@[originurl]:[orgname]/[reponame].git [local temp dir] --mirror
cd [local temp dir]
git remote add github git@github.com:[orgname]/[reponame].git
git push -f --mirror github
cd ..
rm -rf [local temp dir]

That’s it! At this point github should have everything that your original origin had, including everything in refs/. It won’t have any of your local branches, since you did everything from a clean clone. You might also want to change where origin is pointing:

cd [dir where your repo is checked out]
git remote add oldOrigin git@[oldgitserver]:[orgname]/[reponame].git
git remote set-url origin git@github.com:[orgname]/[reponame].git

Of course, this same procedure should work for moving from any git server to any other git server too.
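If you want to double-check that nothing was missed, comparing the refs on the two servers is quick (this uses the remote names set up above; the counts should match):

git ls-remote oldOrigin | wc -l
git ls-remote origin | wc -l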

 

Grooveshark is Hiring (Part 3 – Devsigner Edition)

26 May

Grooveshark is looking for talented web developers. For part 3, I’m listing our front end developer/designer (affectionately known here as “devsigner”) position.

Grooveshark Web Developer / Designers

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities include:
Maintaining existing HTML templates and JavaScript code
Writing clean well-formed HTML and CSS
Writing JavaScript to be able to achieve the desired user experience
Using Photoshop and Illustrator to convert images to markup and JavaScript
Optimizing HTML and CSS for speed considerations

Desired Qualities:
Enjoy writing high quality, easy to read, well-documented code
High level of detail given to converting designs
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance
Capable of prototyping features (either visually or concretely)
Capable of having a flexible schedule

Experience:
Experience with JavaScript, HTML & CSS
Experience in Photoshop and designing websites
Familiar with jQuery, including JavaScriptMVC
Familiar with Actionscript
Experience with version control software

Not Desired:
Uses WYSIWYG editors
Only searches for a jQuery plugin that fits the job

 
 

Grooveshark is Hiring (Part 2 – Javascript Edition)

19 May

Grooveshark is looking for talented web developers. We are looking to fill a wide variety of web dev positions. The next few posts I make will be job descriptions for each of the major positions we are looking to fill. For part 2, I’m listing our Javascript developer position. If your skillset is more front-end leaning but you feel like you would be a good fit for Grooveshark, by all means apply now rather than waiting for me to post the job description. :)

Grooveshark is looking for a hardcore JavaScript developer

NOTE: Grooveshark is a web application, written with HTML/CSS, JavaScript, PHP, and ActionScript 3, not a web page or collection of webpages. This position involves working directly with the client-side application code in JavaScript: the “backend of the frontend” if you will.

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities:
Maintaining existing client-side code, creating new features and improving existing ones
Writing good, clean, fast, secure code on tight deadlines
Ensuring that client architecture is performant without sacrificing maintainability and flexible enough for rapid feature changes
Striking a balance between optimizing client performance versus minimizing load on the backend
Integration of third-party APIs

Desired Qualities:
Enjoy writing high quality, easy to read, self-documenting code
A passion for learning about new technologies and pushing yourself
A deep understanding of writing bug-free code in an event-driven, asynchronous environment
Attention to detail
A high LOC/bug ratio
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance and with occasional micromanagement
More pragmatic than idealistic

Experience:
Extensive JavaScript experience, preferably in the browser environment
Experience with jQuery
Experience with jQueryMX or another MVC framework
HTML & CSS experience, though writing it will not be a primary responsibility
Some PHP experience, though you won’t be required to write it
Knowledge of cross-browser compatibility ‘gotchas’
Experience with EJS, smarty, or other templating systems
Experience with version control software (especially git or another dvcs)

Bonus points for:
Having written a client application (in any language) that relies on a remote server for data storage and retrieval
Having written a non-trivial jQuery plugin
Experience with JavaScript/Flash communication via ExternalInterface
Experience with integrating popular web APIs (OAuth, Facebook, Twitter, Google, etc) into client applications
Experience with ActionScript 3 outside the Flash Professional environment (i.e., non-timelined code, compiling with mxmlc or similar)
Experience developing on the LAMP stack (able to set up a LAMP install with multiple vhosts on your own)
Experience with profiling/debugging tools
Being well read in Software Engineering practices
Useful contributions to the open source community
Fluency in lots of different programming languages
BS or higher in Computer Science or related field
Being more of an ‘evening’ person than a ‘morning’ person
A passion for music and a desire to revolutionize the industry

Who we don’t want:
Architecture astronauts
Trolls
Complete n00bs (apply for internship or enroll in Grooveshark University instead!)
People who want to work a 9-5 job
People who would rather pretend to know everything than actually learn
Religious adherents to The Right Way To Do Software Development
Anyone who would rather use XML over JSON for RPC

Send us:
Resume
Code samples you love
Code samples you hate
Links to projects you’ve worked on
Favorite reading materials on Software Engineering (e.g. books, blogs)
What you love about JavaScript
What you hate about JavaScript (and not just the differences in browser implementations)
Prototypal vs Classical inheritance – what are the differences and how do you feel about each?
If you could change one thing about the way Grooveshark works, what would it be and how would you implement it?

 

If you want a job: jay at groovesharkdotcom
If you want an internship: [email protected]

 

Grooveshark is Hiring (Part 1 – PHP edition)

16 May

Grooveshark is looking for talented web developers. We are looking to fill a wide variety of web dev positions. The next few posts I make will be job descriptions for each of the major positions we are looking to fill. For part 1, I’m listing our backend PHP position. If your skillset is more front-end leaning but you feel like you would be a good fit for Grooveshark, by all means apply now rather than waiting for me to post the job description. :)

 

Grooveshark is seeking awesome PHP developers.

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities:
Maintaining existing backend code & APIs, creating new features and improving existing ones
Writing good, clean, fast, secure code on tight deadlines
Identifying and eliminating bottlenecks
Writing and optimizing queries for high-concurrency workloads in SQL, MongoDB, memcached, etc
Identifying and implementing new technologies and strategies to help us scale to the next level

Desired Qualities:
Enjoy writing high quality, easy to read, self-documenting code
A passion for learning about new technologies and pushing yourself
Attention to detail
A high LOC/bug ratio
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance and with occasional micromanagement
More pragmatic than idealistic

Experience:
Experience developing on the LAMP stack (able to set up a LAMP install with multiple vhosts on your own)
Extensive experience with PHP
Extensive experience with SQL
Some experience with JavaScript, HTML & CSS, though you won’t be required to write it
Some experience with lower level languages such as C/C++
Experience with version control software (especially dvcs)

Bonus points for:
Well read in Software Engineering practices
Experience with a SQL database and optimizing queries for high concurrency on large data sets
Experience with noSQL databases like MongoDB, Redis, and memcached
Experience with Nginx
Experience creating APIs
Knowledge of Linux internals
Experience working on large scale systems with high volume of traffic
Useful contributions to the open source community
Fluency in lots of different programming languages
Experience with browser compatibility weirdness
Experience with smarty or other templating systems
BS or higher in Computer Science or related field
Experience with Gearman, RabbitMQ, ActiveMQ or some other job distribution/message passing system for distributing work
A passion for music and a desire to revolutionize the industry

Who we don’t want:
Architecture astronauts
Trolls
Complete n00bs (apply for internship or enroll in Grooveshark University instead!)
People who want to work a 9-5 job
People who would rather pretend to know everything than actually learn
Religious adherents to The Right Way To Do Software Development
Anyone who loves SOAP

Send us your:
Resume
Code samples you love
Code samples you hate
Favorite reading materials on Software Engineering (e.g. books, blogs)
Tell us when you would use a framework, and when you would avoid using a framework
ORM: Pros, cons?
Unit testing: pros, cons?
Magic: pros, cons?
When/why would you denormalize?
Thoughts on SOAP vs REST

If you want a job: jay at groovesharkdotcom

If you want an internship: [email protected]

 

How to Jailbreak Chrome and Install Grooveshark

15 Apr

As you may already know, Grooveshark’s Android app was recently pulled from the Android Market (http://news.cnet.com/8301-31001_3-20051156-261.htm) due to label pressure, concerns about Google’s own soon-to-launch competing music service, or both. That event has been covered in great detail all over the net, so I won’t rehash it here; that’s not what this blog post is about.

 

Yesterday we discovered that Grooveshark has also been pulled from the Chrome Web Store. Fortunately, if you already installed Grooveshark, Google does not appear to be revoking the app from users’ Chrome home pages.

Users who had not yet installed Grooveshark are out of luck as far as the web store goes. However, we have discovered that it is possible to “jail break” your Chrome installation to get the full Grooveshark + Chrome experience.

Step 1: Visit http://grooveshark.com in your Chrome browser.

Step 2: Close the Grooveshark tab.

Step 3: Open the New Tab page by typing CTRL+T (CMD+T in OSX?), or by clicking the small + icon next to the rightmost tab.

Step 4: If Grooveshark appears under “Most visited,” simply drag it to the top-left position, hover over the site preview for a second, and click on the pin icon that appears. Skip to step 7.

Step 5: If Grooveshark does not appear under Most visited, scroll down to Recently closed, and click on Grooveshark.

Step 6: Proceed to step 1.

Step 7: Collapse the Apps section, or remove it entirely.

It may take several thousand iterations of this loop to bring Grooveshark into your Most visited section, depending on how obsessively you check the 8 websites that appear there for you, but we feel this added effort on your part is well worth it. When you are done, Chrome should look something like this:

P.S.: Grooveshark was ranked #8, right behind YouTube. An interesting juxtaposition: these services are both powered by user-contributed content, and both are legal, but one is owned by Google and the other is not.

P.P.S.: For those of you who are really bad at detecting sarcasm, I am being facetious. As with every blog post I make here, I also am not speaking for Grooveshark in any sort of official capacity.

 
 

How Grooveshark Uses Gearman

27 Mar

At Grooveshark, Gearman is an integral part of our backend technology stack.

30 Second Introduction to Gearman

  • Gearman is a simple, fast job queuing server.
  • Gearman is an anagram for manager. Gearman does not do any work itself, it just distributes jobs to workers.
  • Clients, workers and servers can all be on different boxes.
  • Jobs can be synchronous or asynchronous.
  • Jobs can have priorities.

To learn more about Gearman, visit gearman.org and peruse the presentations. I got started with this intro (or one much like it), but there may be better presentations available now; I haven’t checked.
The rest of this article assumes that you understand the basics of Gearman, including the terminology, and that you are looking for some use cases involving a real live high-traffic website.

Architecture

With some exceptions, our architecture with respect to Gearman is a bit unconventional. In a typical deployment, you would have a large set of Apache + PHP servers (at Grooveshark we call these “front end nodes,” or FENs for short) communicating with a smaller set of Gearman job servers, and a set of workers each connected to all of the job servers. In our setup, we instead run a Gearman job server on each FEN, and jobs are submitted over localhost. That’s because most of the jobs we submit are asynchronous, and we want the latency to be as low as possible so the FENs can fire off a job and get back to handling the user’s request. Workers running on other boxes then connect to the Gearman servers on the FENs and process the jobs. Where the workers run depends on their purpose; for example, workers that insert data into a data store usually live on the same box as that data store, which again cuts down on network latency.

This architecture means that, in general, each FEN is isolated from the rest of the FENs, and the Gearman servers are not another potential point of failure or slowdown. The only way a Gearman server is unavailable is if the FEN itself is out of commission; the only way a Gearman server is running slow is if the whole FEN is running slow.
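As a rough sketch (not our actual code; the function name and payload are hypothetical), submitting an asynchronous job over localhost with the pecl gearman extension looks something like this:

<?php
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // the job server running on this FEN

// doBackground() queues the job and returns immediately, so the request
// handler is not left waiting for a worker to pick the job up.
$client->doBackground('logSongPlay', json_encode(array('songID' => 123)));
if ($client->returnCode() !== GEARMAN_SUCCESS) {
    error_log('failed to queue logSongPlay job');
}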

Rate Limiting

One of the things that is really neat about this gearman architecture, especially when used asynchronously, is that jobs that need to happen eventually but not necessarily immediately can be easily rate limited by simply controlling the number of workers that are running. For example, we recently migrated Playlists from MySQL to MongoDB. Because many playlists have been abandoned over the years, we didn’t want to just blindly import all playlists into mongo. Instead we import them from MySQL as they are accessed. Once the data is in MongoDB, it is no longer needed in MySQL, so we would like to be able to delete that data to free up some memory. Deleting that data is by no means an urgent task, and we know that deletes of this nature cannot run in parallel; running more than one at a time just results in extra lock contention.

Our solution is to insert a job into the gearman queue to delete a given playlist from MySQL. We then have a single worker connecting to all of the FENs, asking for playlist deletion jobs and running the deletes one at a time on the MySQL server. Not surprisingly, when we flipped the switch, deletion jobs came in much faster than they could be processed; at the peak we had a backlog of 800,000 deletion jobs, and it took us about 2.5 weeks to get that number down to zero. During that time we had no DB hiccups, and server load stayed low.
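A minimal sketch of that single worker (the FEN hostnames and the function name are hypothetical):

<?php
$worker = new GearmanWorker();
// connect to the gearman server running on every FEN
foreach (array('fen01', 'fen02', 'fen03') as $fen) {
    $worker->addServer($fen, 4730);
}
$worker->addFunction('deletePlaylist', function (GearmanJob $job) {
    $playlistID = (int)$job->workload();
    // run the DELETE against MySQL here; since this is the only worker
    // registered for this function, deletes never run in parallel
});
while ($worker->work()); // handle one job at a time, forever

Because the rate limiting falls out of the worker count, speeding the deletes up later would just be a matter of starting more copies of this process.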

Data Logging

We have certain high-volume data that must be logged, such as song plays for accounting purposes and searches performed so we can make our search algorithm better. We need to be able to log this data in real time without affecting the responsiveness of the site. Logging jobs are submitted asynchronously to Gearman over localhost. On our hadoop cluster, we have multiple workers per FEN collecting and processing jobs as quickly as possible. Each worker only connects to one FEN; in fact, each FEN has about 12 workers just for processing logging jobs. For a more in-depth explanation of why we went with this setup, see lessons learned.

Backend API

We have some disjointed systems written in various languages that need to be able to interface with our core library, which is written in PHP. We considered making a simple API over HTTP, much like the one that powers the website, but decided that it was silly to pay the cost of all the overhead of HTTP for an internal API. Instead, a set of PHP workers handles the incoming messages from these systems and responds accordingly. This also provides some level of rate limiting, or control over how parallelized we want the processing to be. If a well-meaning junior developer writes some crazy piece of software with 2048 processes all trying to look up song information at once, we can rest assured that the database won’t actually be swamped with that much concurrency, because at most it will be limited to the number of workers that we have running.
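For the synchronous case, a caller sketch might look like this (the host and function names are hypothetical; doNormal() is the synchronous submit in recent versions of the pecl extension, which older versions called do()):

<?php
$client = new GearmanClient();
$client->addServer('api-gearman-host', 4730);
$response = $client->doNormal('getSongInfo', json_encode(array('songID' => 123)));
if ($client->returnCode() === GEARMAN_SUCCESS) {
    // the PHP worker on the other end called into our core library for us
    $songInfo = json_decode($response, true);
}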

Lessons Learned/Caveats

No technology is perfect, especially when you are using it in a way other than how it was intended to be used.
We found that gearman workers (at least in the pecl gearman extension’s implementation) connect to and process jobs on gearman servers in a round-robin fashion, draining all jobs from one server before moving on to the next. That creates a few different headaches for us:

  • If one server has a large backlog of jobs and workers get to it first, they will process those jobs exclusively until they are all done, leaving the other servers to end up with a huge backlog
  • If one server is unreachable, workers will wait however long the timeout is configured for every time they run through the round-robin list. Even if the timeout is as low as 1 second, that is 1 second out of 20 that the worker cannot be processing any jobs. In a high volume logging situation, those jobs can add up quickly
  • Gearman doesn’t give memory that was used for long queues back to the OS when it’s done with it. It will reuse this memory, but if your normal gearman memory needs are 60MB and an epic backlog caused by these interactions leads it to use 2GB of memory, you won’t get that memory back until Gearman is restarted.

Our solution to these issues, unless there is a strong need to rate limit the work, is to just configure a separate worker for each FEN, so that if one FEN is having weird issues it won’t affect the others.
Our architecture, combined with the fact that each request from a user will go to a different FEN, means that we can’t take advantage of one really cool Gearman feature: unique jobs. With unique jobs, we could fire asynchronous jobs to prefetch data we know the client is going to ask for, and if the client asked for it before it was ready, a synchronous request could hook into the same job and wait for the response.
Talking to a Gearman server over localhost is not the fastest thing in the world. We considered using Gearman to handle geolocation lookups by IP address so we can provide localized concert recommendations, since those jobs could be asynchronous, but we found that submitting an asynchronous job to Gearman was an order of magnitude slower than doing the lookup directly with the geoip PHP extension once we compiled it with mmap support. Gearman was still insanely fast, but this goes to show that not everything is better served being processed through Gearman.

Wish List

From reading the above you can probably guess what our wish list is:

  • Gearman should return memory to the OS when it is no longer needed. The argument here is that if you don’t want Gearman to use 2GB of memory, you can set a ulimit or take other measures to make sure you don’t ever get that far behind. That’s fine but in our case we would usually rather allow Gearman to use 2GB when it is needed, but we’d still like to have it back when it’s done!
  • Workers should be better at balancing. Even if one server is far behind it should not be able to monopolize a worker to the detriment of all other job servers.
  • Workers should be more aware of timeouts. Workers should have a way to remember when they failed to connect to a server and not try again for a configurable number of seconds. Or connections should be established to all job servers in a non-blocking manner, so that one timing out doesn’t affect the others.
  • Servers should be capable of replication/aggregation. This is more of a want than a need, but sometimes it would be nice if one job server could be configured to pull jobs off of other job servers and pool them. That way jobs could be submitted over localhost on each FEN, but aggregated elsewhere so that one worker could process them in serial if rate limiting is desired, without potentially being slowed down by a malfunctioning FEN.
  • Reduce latency for submitting asynchronous jobs locally. Submitting asynchronous jobs over localhost is fast, but it could probably be even faster. For example, I’d love to see how it could perform using unix sockets.

Even with these niggles and wants, Gearman has been a great, reliable, performant product that we count on to help keep the site fast and responsive for our users.

Supervisord

When talking about Gearman, I would be remiss if I did not mention Supervisord, which we use to babysit all of our workers. Supervisord is a nice little Python utility that will daemonize any process for you, handling things like redirecting stdout to a log file, auto-restarting the process if it fails, starting as many instances of the process as you specify, and automatically backing off if your process fails to start a specified number of times in a row. It also has an RPC interface so you can manage it remotely; for instance, if you notice a backlog of jobs piling up on one of your gearman servers, you can tell supervisord to fire up another 20 workers.
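As a sketch, a supervisord config for a pool of logging workers might look something like this (the program name and paths are hypothetical):

[program:loggingWorker]
command=php /data/workers/logging_worker.php
process_name=%(program_name)s_%(process_num)02d
numprocs=12
autorestart=true
startretries=10
stdout_logfile=/var/log/workers/loggingWorker.log

Bumping numprocs and asking supervisord to reread its config is all it takes to throw more workers at a backlog.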