
Archive for the ‘Uncategorized’ Category

XSS cheat sheet

26 Apr

Discovered this amazing XSS cheat sheet while trying to prove to a co-worker that using a regex to strip <script> tags embedded in HTML was not going to be effective. I knew about the UTF-8 vulnerabilities and all of the obvious ones, but the US-ASCII encoding one in particular was new to me. Impressive!
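To illustrate the point, here is a minimal sketch of a naive filter and a few inputs that sail right past it. The payloads are common, well-known bypasses, not taken from the cheat sheet itself:

<?php
// A naive filter that only strips literal <script>...</script> blocks.
function naive_filter($html) {
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

$payloads = array(
    // No <script> tag at all: the event handler fires on its own.
    '<img src="x" onerror="alert(1)">',
    // Nested tags: removing the inner <script></script> pairs reassembles
    // the string into <script>alert(1)</script>.
    '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>',
    // javascript: URL, again nothing for the regex to match.
    '<a href="javascript:alert(1)">click</a>',
);

foreach ($payloads as $payload) {
    echo naive_filter($payload), "\n";
}

The first and third payloads come through untouched, and the second one is actually turned into a working <script> tag by the filter itself.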

 
 

Grooveshark is Hiring (Part 3 – Devsigner Edition)

26 May

Grooveshark is looking for talented web developers. For part 3, I’m listing our front end developer/designer (affectionately known here as “devsigner”) position.

Grooveshark Web Developer / Designers

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities include:
  • Maintaining existing HTML templates and JavaScript code
  • Writing clean, well-formed HTML and CSS
  • Writing JavaScript to achieve the desired user experience
  • Using Photoshop and Illustrator, and converting images to markup and JavaScript
  • Optimizing HTML and CSS for speed

Desired Qualities:
  • Enjoys writing high quality, easy to read, well-documented code
  • High attention to detail when converting designs
  • Able to follow coding standards
  • Good written and verbal communication skills
  • Well versed in best practices & security concerns for web development
  • Able to work independently and on teams, with little guidance
  • Capable of prototyping features (either visually or concretely)
  • Able to keep a flexible schedule

Experience:
  • Experience with JavaScript, HTML & CSS
  • Experience with Photoshop and designing websites
  • Familiarity with jQuery and jQuery-based frameworks like JavaScriptMVC
  • Familiarity with ActionScript
  • Experience with version control software

Not Desired:
  • Uses WYSIWYG editors
  • Only searches for a jQuery plugin that fits the job

 
 

How to Jailbreak Chrome and Install Grooveshark

15 Apr

As you may already know, Grooveshark's Android app was recently pulled from the Android Market (http://news.cnet.com/8301-31001_3-20051156-261.htm) due to either label pressure or concerns related to the competing music service Google is soon to launch (or both). That event has been covered in great detail all over the net, so I won't rehash it here; that's not what this blog post is about.

 

Yesterday we discovered that Grooveshark has also been pulled from the Chrome Web Store. Fortunately, if you already installed Grooveshark, Google does not appear to be revoking the app from users' Chrome home pages.

Users who had not yet installed Grooveshark are out of luck as far as the Web Store goes. However, we have discovered that it is possible to "jailbreak" your Chrome installation to get the full Grooveshark + Chrome experience.

Step 1: Visit http://grooveshark.com in your chrome browser.

Step 2: Close the Grooveshark tab.

Step 3: Open the New Tab page by typing CTRL+T (CMD+T on OS X), or by clicking the small + icon next to the rightmost tab.

Step 4: If Grooveshark appears under “Most visited,” simply drag it to the top-left position, hover over the site preview for a second, and click on the pin icon that appears. Skip to step 7.

Step 5: If Grooveshark does not appear under Most visited, scroll down to Recently closed, and click on Grooveshark.

Step 6: Proceed to step 1.

Step 7: Collapse the Apps section, or remove it entirely.

It may take several thousand iterations of this loop to bring Grooveshark into your Most visited section depending on how obsessively you are checking the 8 websites that appear there for you, but we feel this added effort on your part is well worth it. When you are done Chrome should look something like this:

P.S.: Grooveshark was ranked #8, right behind YouTube, an interesting juxtaposition: both services are powered by user-contributed content and both are legal, but one is owned by Google and the other is not.

P.P.S.: For those of you who are really bad at detecting sarcasm, I am being facetious. As with every blog post I make here, I also am not speaking for Grooveshark in any sort of official capacity.

 
 

How Grooveshark Uses Gearman

27 Mar

At Grooveshark, Gearman is an integral part of our backend technology stack.

30 Second Introduction to Gearman

  • Gearman is a simple, fast job queuing server.
  • Gearman is an anagram for manager. Gearman does not do any work itself, it just distributes jobs to workers.
  • Clients, workers and servers can all be on different boxes.
  • Jobs can be synchronous or asynchronous.
  • Jobs can have priorities.

To learn more about Gearman, visit gearman.org and peruse the presentations. I got started with this intro (or one much like it), but there may be better presentations available now; I haven't checked.
The rest of this article assumes that you understand the basics of Gearman, including the terminology, and that you are looking for some use cases involving a real live high-traffic website.

Architecture

With some exceptions, our architecture with respect to Gearman is a bit unconventional. In a typical deployment, you would have a large set of Apache + PHP servers (at Grooveshark we call these "front end nodes" or FENs for short) communicating with a smaller set of Gearman job servers, and a set of workers that are each connected to all of the job servers. In our setup, we instead have a Gearman job server running on each FEN, and jobs are submitted over localhost. That's because most of the jobs we submit are asynchronous, and we want the latency to be as low as possible so the FENs can fire off a job and get back to handling the user's request. We then have workers running on other boxes which connect to the Gearman servers on the FENs and process the jobs. Where the workers run depends on their purpose; for example, workers that insert data into a data store usually live on the same box as the data store, which again cuts down on network latency.

This architecture means that in general, each FEN is isolated from the rest of the FENs, and the Gearman servers are not another potential point of failure or even a source of slowdowns. The only way a Gearman server is unavailable is if the FEN itself is out of commission. The only way a Gearman server is running slow is if the whole FEN is running slow.
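As a rough sketch of what this looks like from a FEN (assuming the pecl gearman extension; the function name and payload here are made up for illustration), firing an asynchronous job at the local gearmand is just:

<?php
// Hypothetical example: queue a background job on the gearmand running on this FEN.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // gearmand listening on localhost

$payload = json_encode(array(
    'songID' => 42,
    'userID' => 1337,
    'time'   => time(),
));

// doBackground() hands the job to gearmand and returns immediately,
// so the request handler is not blocked waiting for a worker.
$client->doBackground('logSongPlay', $payload);

if ($client->returnCode() !== GEARMAN_SUCCESS) {
    error_log('Failed to queue logSongPlay job');
}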

Rate Limiting

One of the things that is really neat about this gearman architecture, especially when used asynchronously, is that jobs that need to happen eventually but not necessarily immediately can be easily rate limited by simply controlling the number of workers that are running. For example, we recently migrated Playlists from MySQL to MongoDB. Because many playlists have been abandoned over the years, we didn’t want to just blindly import all playlists into mongo. Instead we import them from MySQL as they are accessed. Once the data is in MongoDB, it is no longer needed in MySQL, so we would like to be able to delete that data to free up some memory. Deleting that data is by no means an urgent task, and we know that deletes of this nature cannot run in parallel; running more than one at a time just results in extra lock contention.

Our solution is to insert a job into the Gearman queue to delete a given playlist from MySQL. We then have a single worker connecting to all of the FENs, asking for playlist deletion jobs and running the deletes one at a time on the MySQL server. Not surprisingly, when we flipped the switch, deletion jobs came in much faster than they could be processed; at the peak we had a backlog of 800,000 deletion jobs waiting to be processed, and it took us about 2.5 weeks to get that number down to zero. During that time we had no DB hiccups, and server load was kept low.
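A single worker along these lines is all it takes, because the effective concurrency is simply the number of worker processes you keep running. This is a sketch using the pecl gearman extension; the hostnames and the function name are placeholders:

<?php
// Hypothetical rate-limited worker: registers with the gearmand on every
// FEN and processes playlist-deletion jobs one at a time.
$worker = new GearmanWorker();
foreach (array('fen01', 'fen02', 'fen03') as $fen) {
    $worker->addServer($fen, 4730);
}

$worker->addFunction('deletePlaylistFromMySQL', function (GearmanJob $job) {
    $playlistID = (int) $job->workload();
    // Run the actual DELETE here; with only one worker running, deletes
    // never compete with each other for locks.
    error_log("Deleting playlist $playlistID from MySQL");
});

// work() blocks until a job arrives and dispatches it, so this loop
// drains the backlog serially, forever.
while ($worker->work()) {
}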

Data Logging

We have certain high-volume data that must be logged, such as song plays for accounting purposes and searches performed so we can make our search algorithm better. We need to be able to log this high-volume data in real time without affecting the responsiveness of the site. Logging jobs are submitted asynchronously to Gearman over localhost. On our Hadoop cluster, we have multiple workers per FEN collecting and processing jobs as quickly as possible. Each worker only connects to one FEN; in fact, each FEN has about 12 workers just for processing logging jobs. For a more in-depth explanation of why we went with this setup, see the lessons learned below.

Backend API

We have some disjointed systems written in various languages that need to be able to interface with our core library, which is written in PHP. We considered making a simple API over HTTP, much like the one that powers the website, but decided that it was silly to pay all the overhead of HTTP for an internal API. Instead, a set of PHP workers handles the incoming messages from these systems and responds accordingly. This also provides some level of rate limiting, or control over how parallelized we want the processing to be. If a well-meaning junior developer writes some crazy piece of software with 2048 processes all trying to look up song information at once, we can rest assured that the database won't actually be swamped with that much concurrency, because at most it will be limited to the number of workers that we have running.

Lessons Learned/Caveats

No technology is perfect, especially when you are using it in a way other than how it was intended to be used.
We found that Gearman workers (at least in the pecl gearman extension's implementation) connect to and process jobs on Gearman servers in a round-robin fashion, draining all jobs from one server before moving on to the next. That creates a few different headaches for us:

  • If one server has a large backlog of jobs and workers get to it first, they will process those jobs exclusively until they are all done, leaving the other servers to build up a huge backlog of their own.
  • If one server is unreachable, workers will wait for the configured timeout every time they run through the round-robin list. Even if the timeout is as low as 1 second, that is 1 second out of 20 during which the worker cannot process any jobs. In a high-volume logging situation, those jobs can add up quickly.
  • Gearman doesn't give memory that was used for long queues back to the OS when it's done with it. It will reuse this memory, but if your normal Gearman memory needs are 60MB and an epic backlog caused by these interactions leads it to use 2GB of memory, you won't get that memory back until Gearman is restarted.

Our solution to these issues is, unless there is a strong need to rate limit the work, to just configure a separate worker for each FEN, so that if one FEN is having weird issues it won't affect the others.
Our architecture, combined with the fact that each request from a user will go to a different FEN, means that we can't take advantage of one really cool Gearman feature: unique jobs. With unique jobs, we could fire asynchronous jobs to prefetch data we know the client is going to ask for, and if the client asks for it before it is ready, we could have a synchronous request hook into the same job, waiting for the response.
Talking to a Gearman server over localhost is not the fastest thing in the world. We considered using Gearman to handle geolocation lookups by IP address so we can provide localized concert recommendations, since those jobs could be asynchronous, but we found that submitting an asynchronous job to Gearman was an order of magnitude slower than doing the lookup directly with the geoip PHP extension once we compiled it with mmap support. Gearman was still insanely fast, but this goes to show that not everything is better served being processed through Gearman.

Wish List

From reading the above you can probably guess what our wish list is:

  • Gearman should return memory to the OS when it is no longer needed. The counter-argument is that if you don't want Gearman to use 2GB of memory, you can set a ulimit or take other measures to make sure you never get that far behind. That's fine, but in our case we would usually rather allow Gearman to use 2GB when it's needed, and still get that memory back when it's done!
  • Workers should be better at balancing. Even if one server is far behind it should not be able to monopolize a worker to the detriment of all other job servers.
  • Workers should be more aware of timeouts. Workers should have a way to remember when they failed to connect to a server and not try again for a configurable number of seconds. Or connections should be established to all job servers in a non-blocking manner, so that one timing out doesn’t affect the others.
  • Servers should be capable of replication/aggregation. This is more of a want than a need, but sometimes it would be nice if one job server could be configured to pull jobs off of other job servers and pool them. That way jobs could be submitted over localhost on each FEN, but aggregated elsewhere so that one worker could process them in serial if rate limiting is desired, without potentially being slowed down by a malfunctioning FEN.
  • Reduce latency for submitting asynchronous jobs locally. Submitting asynchronous jobs over localhost is fast, but it could probably be even faster. For example, I’d love to see how it could perform using unix sockets.

Even with these niggles and wants, Gearman has been a great, reliable, performant product that we count on to help keep the site fast and stable for our users.

Supervisord

When talking about Gearman, I would be remiss if I did not mention Supervisord, which we use to babysit all of our workers. Supervisord is a nice little Python utility you can use to daemonize any process, and it will handle things like redirecting stdout to a log file, auto-restarting the process if it fails, starting as many instances of the process as you specify, and automatically backing off if your process fails to start a specified number of times in a row. It also has an RPC interface so you can manage it remotely; for instance, if you notice a backlog of jobs piling up on one of your Gearman servers, you can tell Supervisord to fire up another 20 workers.
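For reference, a worker's supervisord entry might look roughly like this. The option names are standard supervisord settings, but the program name and paths are placeholders:

; Hypothetical supervisord config for one of the workers described above.
[program:playlist_delete_worker]
command=php /path/to/workers/deletePlaylistWorker.php
; raise numprocs to run more copies in parallel (or keep it at 1 to rate limit)
numprocs=1
process_name=%(program_name)s_%(process_num)02d
; restart the worker if it dies, and back off if it repeatedly fails to start
autorestart=true
startretries=10
; capture the worker's output in a log file
redirect_stderr=true
stdout_logfile=/var/log/workers/playlist_delete_worker.log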

 
 

Grooveshark IE bug

29 Jan

I hate the idea that this blog might be turning into nothing but a journal of all the things at Grooveshark that have ever broken, but some of the most interesting challenges we face are when things go terribly awry, so I’m not going to avoid talking about it just because it involves something breaking at Grooveshark, again.

What happened was, out of the blue, IE8 could no longer run the site. Users were getting a message about making sure they did not have flash block enabled, which means the swf was failing in some way. We determined that the swf was in fact loading, so why was it lying to us? There is one file that swfs need in order to talk to other domains: crossdomain.xml. If that file fails to load, the swf isn't going to work. I suspected that was happening in this case, so I loaded up http://cowbell.grooveshark.com/crossdomain.xml and IE complained that it wasn't valid XML. Viewing the source showed me that IE was right: it was in fact what looked like binary garbage. Loading the same file in Firefox and Chrome worked perfectly fine, but IE8 on 4 different computers all showed the invalid XML.

Some months ago, we switched from serving pages up directly from Apache, to running Nginx in front of Apache as a reverse proxy with caching. The difference that made on our front end servers in terms of memory usage and CPU load is phenomenal. Although Nginx serves 30% of requests from cache now, the drop in server load was much more than 30%. Nginx is truly a wonderful addition to our http stack…but as you’ve guessed by now, it played a key role in the latest breakage at Grooveshark.

Force clearing the cache in IE8 and in nginx would sometimes fix the file, but not always. I then turned to wget and found the same thing: whenever the file was broken for IE8, it was identical in wget. wget was showing the exact same file size that Firebug was showing, which was the biggest clue: Firefox received the file gzipped because it supports deflate, but wget also received the file gzipped even though it doesn’t support deflate. My theory, which proved correct, was that IE8 was for some reason asking for the non-gzipped version, but receiving the gzipped version and barfing.

Why would that happen? Well, remember that we are using nginx as a reverse proxy cache. It turns out that we had just recently added some auto-gzipping for certain file types to Apache. What was happening was that nginx would get a request for a file not in its cache and forward the request (with all headers intact) to Apache. If this request came from a client that supports deflate, Apache would respond with a gzipped file. Nginx would store that gzipped file in its cache, and the next request that came in asking for that file, with or without deflate support, would get the gzipped version served up.

The fix was relatively simple: add a variable in the nginx conf tracking whether or not the current client supports deflate, and append the value of that variable to the proxy cache key. That way gzipped and non-gzipped versions of the files are cached separately, and each is served appropriately depending on what the client supports.
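In nginx terms, the fix looks roughly like this. This is only a sketch: the directives are real nginx ones, but the cache zone, upstream name, and key layout are placeholders for whatever the real config uses:

# Normalize Accept-Encoding into a flag we can key the cache on.
map $http_accept_encoding $cache_gzip_flag {
    default   "";
    ~*gzip    "gzip";
}

server {
    location / {
        proxy_pass http://apache_backend;
        proxy_cache cowbell_cache;
        # gzipped and plain responses now live under different cache keys
        proxy_cache_key "$scheme$host$request_uri$cache_gzip_flag";
    }
}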

What’s not clear to me at this time is why IE8 would refuse to accept gzipped content for that file, and whether that applies to all .xml files in IE8…but at least it helped us catch what would have otherwise been an extremely obscure issue!

 
 

Old sphinx bug

21 Nov

I’m posting this mostly as a note to myself.

When trying to build the pecl sphinx extension, I ran into problems because I could not build libsphinxclient.

It looks like the bug causing that problem has been around for at least a year, but at least the fix is simple. Change line 280 from:

void sock_close ( int sock );

to:

static void sock_close ( int sock );

Note: this applies to sphinx 0.9.9, the last “stable” release at the time of this writing.

 
 

Google AJAX Support: Awesome but Disappointing

18 Oct

Google has added support for crawling AJAX URLs. This is great news for us and any other site that makes heavy use of AJAX or is more of a web app than a collection of individual pages.

We have long worked around the issue of AJAX URLs not being crawl-able by having two versions of our URLs, with and without the hash. Users who are actually using the site will obviously get AJAX URLs like http://listen.grooveshark.com/#/user/jay/42, but if a crawler goes to http://listen.grooveshark.com/user/jay/42 it will get content for that page as well, while real users will be automatically redirected to the proper URL with the hash. Crawlers aren't smart enough to go there on their own of course, but we provide a sitemap, and all links we present to crawlers omit the hash. Likewise, when users post links to Facebook via the app, we automatically give them the URL without the hash so Facebook can put up a pretty preview of the link in the user's news feed. The problem is, users also like to share by copying URLs from the URL bar. If users post those links anywhere, crawlers don't know how to crawl them, so they either don't, or they just count it as a link to http://listen.grooveshark.com/, which isn't great for us and is lousy for users too.

Google’s solution is for sites like ours to switch from using # to using #! and then opting in to having those URLs crawled. The crawler will take everything after the #! and convert the “pretty” URL into an ugly one. For example, /#!/user/jay/42 presumably becomes something like /?_escaped_fragment_=%2Fuser%2Fjay%2F42 when the crawler sends the request to us.
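On our end, handling those crawler requests would presumably boil down to something like this. This is a hypothetical sketch; serve_crawler_page() stands in for whatever actually renders the hashless version of the page:

<?php
// Hypothetical handler for Google's _escaped_fragment_ requests.
if (isset($_GET['_escaped_fragment_'])) {
    // PHP has already URL-decoded the parameter, so
    // ?_escaped_fragment_=%2Fuser%2Fjay%2F42 arrives as "/user/jay/42".
    $route = $_GET['_escaped_fragment_'];

    // Serve the same crawler-friendly content the hashless URL serves.
    serve_crawler_page($route); // placeholder for the real renderer
    exit;
}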

This is annoying and frustrating for several reasons:

  1. All our URLs have to change
    We have to change all URLs to have a #! instead of just a #. This not only requires developer effort but makes our URLs slightly uglier.
  2. All links that users have shared in the past will continue to be non-crawlable forever.
  3. We now have to support 3 URL formats instead of 2
  4. One of those URL formats no human will ever see; we are building a feature solely for the benefit of a robot.

Again, we greatly appreciate that Google is making an effort to crawl AJAX URLs; it's a huge step forward. It's just not as elegant as it could be. It seems like we could accomplish the same goals more simply by:

  1. Requiring opt-in just like the current system
  2. Using robots.txt to dictate which AJAX URLs should not be crawled
  3. Allowing webmasters to specify what the # should be replaced with
    In our case it would be replaced by nothing, just stripped out. For less sophisticated URL schemes that just use #x=y, webmasters could specify that the # should be replaced by a ?

That solution would have the same benefits as the current one, with the additional benefits of allowing all crawling permissions to be specified in the same place (robots.txt) and automatically making links already in the wild crawl-able, without requiring support for yet another URL format.

 
 

PHP Autoload: Put the error where the error is

13 Oct

At Grooveshark we use PHP's __autoload function to automatically load class files for us when we instantiate objects. The docs include an example autoload function:

function __autoload($class_name) {
    require_once $class_name . '.php';
}

Until recently Grooveshark’s autoload worked this way too, but there are two issues with this code:

  1. It throws a fatal error inside the method.

    If you make a typo when instantiating an object and ask it to load something that doesn't exist, your error logs will say something like this: require_once() [function.require]: Failed opening required 'FakeClass.php' (include_path='.') in conf.php on line 83, which gives you no clues at all about what part of your code is trying to instantiate FakeClass. Not remotely helpful.

  2. It’s not as efficient as it could be.

    include_once and require_once have slightly more overhead than include and require because they have to do an extra check to make sure the file hasn't already been included. It's not much extra overhead, but it's completely unnecessary, because __autoload will only trigger if a class hasn't already been defined. If you're inside your __autoload function, you haven't already included the file, or the class would already be defined.

The better way:
function __autoload($class_name) {
    include $class_name . '.php';
}

Now if you make a typo, your logs will look like this:
PHP Warning: include() [function.include]: Failed opening 'FakeClass.php' for inclusion (include_path='.') in conf.php on line 83
PHP Fatal error: Class 'FakeClass' not found in stupidErrorPage.php on line 24

Isn’t that better?

Edit 2010-10-19: Hot Bananas! The docs have already been updated. (at least in svn)

 
 

Grooveshark Keyboard Shortcuts for Linux

21 Jun

Attention Linux users: Grooveshark friend Intars Students has created a keyboard shortcut helper for Linux. He could use some help testing it, so go help him! Once a few people have tested it and we’re pretty confident that it works well for users, we will add it to the links inside the app as well.

Intars is also the author of KeySharky, a Firefox extension that allows you to use keyboard shortcuts in Firefox to control Grooveshark playback from another tab, even if you’re not VIP! Pretty cool stuff.

 
 

Keyboard Shortcuts for Grooveshark Desktop

05 Jun

In a further effort to open Grooveshark to 3rd party developers, we have added an External Player Control API. (Side note: yes, it’s a hack to be polling a file all the time, but it’s also the only option we have until AIR 2.0 is out and most users have it installed.) Right now that means that we have support for keyboard shortcuts for OSX and Windows computers. To enable:

First open up desktop options by clicking on your username in the upper-right corner and selecting Desktop Options.

Notice the new option: Enable Global Keyboard Shortcuts (requires helper application)

Check that box (checking this box turns on polling the file mentioned in the External Player Control API) and select the proper client for your machine. In my case I chose Windows, and I'll go through setting that up now.

When you click on the link to download the keyboard helper, it should download via your browser. Once it is downloaded, run the app.

If prompted, tell Windows not to always ask before opening this file and choose Run.

The first time the application runs, it shows a list of keyboard shortcuts.

At this point the keyboard shortcuts listed should work! If they don’t, switch back to the desktop options window and make sure you click Apply or OK, then try again.

In your system tray you should notice a new icon (note: subject to change in future versions) that looks like the desktop icon with a blue bar in the lower-right corner. Rumor has it that is supposed to look like a keyboard.

If you right-click on the tray icon, you will see that there are a few options. The default configuration is pretty good, but I recommend setting it to start with Windows, so that it is always ready.

Big thanks go to James Hartig for hacking together the external player control API and the Windows keyboard shortcut helper, and to Terin Stock for the OSX helper. You can learn more about both keyboard shortcut helpers here. Hackers/developers: please feel free to extend functionality, create your own keyboard helpers (especially for Linux) or add integration to your current apps. Just show off what you’ve done so we can link to it!

Edit: Terin has supplied some Mac screenshots as well: