Archive for October, 2010

Google AJAX Support: Awesome but Disappointing

18 Oct

Google has added support for crawling AJAX URLs. This is great news for us and any other site that makes heavy use of AJAX or is more of a web app than a collection of individual pages.

We have long worked around the issue of AJAX URLs not being crawlable by having two versions of our URLs, with and without the hash. Users who are actually using the site will obviously get AJAX URLs like http://listen.grooveshark.com/#/user/jay/42, but if a crawler goes to http://listen.grooveshark.com/user/jay/42 it will get content for that page as well, while real users will be automatically redirected to the proper URL with the hash. Crawlers aren't smart enough to go there on their own, of course, but we provide a sitemap, and all links we present to crawlers are absent of the hash. Likewise, when users post links to Facebook via the app, we automatically give them the URL without the hash so that Facebook can put up a pretty preview of the link in the user's news feed. The problem is, users also like to share by copying URLs from the URL bar. If users post those links anywhere, crawlers don't know how to crawl them, so they either don't or they just count them as links to http://listen.grooveshark.com/, which isn't great for us and is lousy for users too.
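
In server-side terms, that two-URL workaround boils down to something like the sketch below. This is purely illustrative, not our actual code; isCrawler() and renderStaticPage() are invented names, and this post doesn't describe how crawler detection really works.

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); // e.g. "/user/jay/42"

if (isCrawler($_SERVER['HTTP_USER_AGENT'])) {
    // Crawlers get plain, indexable HTML for this page.
    renderStaticPage($path);
} else {
    // Real browsers get redirected to the AJAX app at the hash URL.
    header('Location: http://listen.grooveshark.com/#' . $path);
}
exit;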

Google’s solution is for sites like ours to switch from using # to using #! and then opting in to having those URLs crawled. The crawler will take everything after the #! and convert the “pretty” URL into an ugly one. For example, /#!/user/jay/42 presumably becomes something like /?_escaped_fragment_=%2Fuser%2Fjay%2F42 when the crawler sends the request to us.
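
On our end, handling that ugly URL would presumably look something like the following. This is only a sketch under that assumption; renderSnapshotFor() is a made-up name for whatever code serves crawler-readable HTML.

if (isset($_GET['_escaped_fragment_'])) {
    // PHP has already URL-decoded the parameter, so this is "/user/jay/42".
    $path = $_GET['_escaped_fragment_'];
    renderSnapshotFor($path); // invented helper: render crawlable HTML for $path
    exit;
}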

This is annoying and frustrating for several reasons:

  1. All our URLs have to change
    We have to change all URLs to have a #! instead of just a #. This not only requires developer effort but makes our URLs slightly uglier.
  2. All links that users have shared in the past will continue to be non-crawlable forever.
  3. We now have to support 3 URL formats instead of 2
  4. One of those URL formats no human will ever see; we are building a feature solely for the benefit of a robot.

Again, we greatly appreciate that Google is making an effort to crawl AJAX URLs; it's a huge step forward. It's just not as elegant as it could be. It seems like we could accomplish the same goals more simply by:

  1. Requiring opt-in just like the current system
  2. Using robots.txt to dictate which AJAX URLs should not be crawled
  3. Allowing webmasters to specify what the # should be replaced with
    In our case it would be replaced by nothing, just stripped out. For less sophisticated URL schemes that just use #x=y, webmasters could specify that the # should be replaced by a ?

That solution would have the same benefits as the current one, with the additional benefits of keeping all crawling permissions in one place (robots.txt) and automatically making links already in the wild crawlable, all without requiring support for yet another URL format.
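
To make that concrete, a robots.txt under such a scheme might look something like the following. The Hash-replacement directive is purely hypothetical (it is not part of robots.txt today), and /search is just an example of a path a site might exclude.

User-agent: *
# Hypothetical directive: an empty value means the # is simply stripped,
# so /#/user/jay/42 is crawled as /user/jay/42
Hash-replacement:
Disallow: /search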

 
 

Improve Code by Removing It

17 Oct

I've started going through O'Reilly's 97 Things Every Programmer Should Know, and I plan to post the best ones here at random intervals: the ones I think we do a great job of following at Grooveshark, and the ones I wish we followed better.

The first one is Improve Code by Removing It.

It should come as no surprise that the fastest code is code that never has to execute. Put another way, you can usually only go faster by doing less. And of course, code that never runs also exhibits no bugs. :)

Since I started at Grooveshark, I've deleted a lot of code: 20% more code than I've ever written, and more code than all but one of our developers has contributed. Despite that, I think we've only actually ever removed one feature, and we've added many more.

One of the things I see in the old code is an overenthusiastic attempt to make everything extremely flexible and adaptive. The original authors obviously made a noble effort to imagine every possible future scenario the code might some day need to handle, and to come up with an abstraction that could handle all of those cases. The problem is, those scenarios almost never come up. Instead, different features are requested that do not fit the goals of the original abstraction at all, so you end up working around it in weird ways that make the code more difficult to understand and less efficient.

Let me try to provide a more concrete example so you can see what I’m talking about. We have an Auth class that really only needs to handle 3 things:

  • Let users log in (something like Auth->login(username, password))
  • Let users log out (something like Auth->logout())
  • Find out who the logged-in user is, if a user is logged in (something like Auth->getUser())

Should be extremely straightforward, right? Well, the original author decided that the class should allow for multiple authentication strategies over various protocols, in scenarios that could never possibly arise (such as not having access to sessions), even though at the time only one was needed. Instead of ~100 lines of code to just get the job done and nothing else, we ended up with 1,176 lines spanning 5 files. The vast majority of that code was useless; our libraries sit behind other front-facing code, so the protocol and "strategy" for authenticating are handled one level up, and we always use sessions so that no matter how a user logged in, they are logged in to all Grooveshark properties. When we finally did add support for a truly new way to log in (via Facebook Connect), none of that code was useful at all, because Facebook Connect works in a way the author could never have anticipated 2 years ago. Finally, because the original author anticipated a scenario that cannot possibly arise (that we might know the user's username but not their user ID), fetching the logged-in user's information from the database required a less efficient lookup by username rather than a primary key lookup by ID.
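
For contrast, a minimal session-based version along the lines of the three bullets above could look something like the sketch below. It is only an illustration; DB::fetchRow, Password::verify, User::getByID, and the table/column names are invented stand-ins, not our actual library code.

class Auth
{
    public function login($username, $password)
    {
        // DB::fetchRow and Password::verify stand in for whatever query
        // and password-hashing helpers the application already has.
        $row = DB::fetchRow(
            'SELECT UserID, PasswordHash FROM Users WHERE Username = ?',
            array($username)
        );
        if (!$row || !Password::verify($password, $row['PasswordHash'])) {
            return false;
        }
        $_SESSION['userID'] = (int) $row['UserID']; // store only the primary key
        return true;
    }

    public function logout()
    {
        unset($_SESSION['userID']);
    }

    public function getUser()
    {
        if (empty($_SESSION['userID'])) {
            return null; // nobody is logged in
        }
        return User::getByID($_SESSION['userID']); // primary key lookup, not username
    }
}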

Let's step back a moment and pretend that the author had in fact been able to anticipate how we were going to incorporate Facebook Connect, and had made the class just flexible enough and just abstract enough in just the right ways to accommodate the feature we only now got around to implementing. What would have been the benefit? Well, most of the effort of implementing that feature is in handling all of the Facebook-specific details, so that part would still need to be written. I'd say at best it could have saved me from having to write about 10 lines of code. In the meantime, we would have been carrying around all of that extra code, unused, for two whole years before it was finally needed!

Let's apply YAGNI (You Aren't Gonna Need It) whenever possible, and pay the cost of adding features when they actually need to be added.

Edit:
Closely related: Beauty is in Simplicity.

 

PHP Autoload: Put the error where the error is

13 Oct

At Grooveshark we use PHP’s __autoload function to handle automatically loading files for us when we instantiate objects. In the docs they have an example autoload method:

function __autoload($class_name) {
    require_once $class_name . '.php';
}

Until recently Grooveshark’s autoload worked this way too, but there are two issues with this code:

  1. It throws a fatal error inside the method.

    If you make a typo when instantiating an object and ask it to load something that doesn't exist, your error logs will say something like this: require_once() [function.require]: Failed opening required 'FakeClass.php' (include_path='.') in conf.php on line 83, which gives you no clues at all about what part of your code is trying to create FakeClass. Not remotely helpful.

  2. It’s not as efficient as it could be.

    include_once and require_once have slightly more overhead than include and require because they have to do an extra check to make sure the file hasn't already been included. It's not much extra overhead, but it's completely unnecessary, because __autoload will only trigger if a class hasn't already been defined. If you're inside your __autoload function, the file can't have been included before, or the class would already be defined (see the short demo after this list).
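
For instance, with made-up names Widget and Widget.php:

function __autoload($class_name) {
    echo "__autoload called for $class_name\n";
    require_once $class_name . '.php';
}

$a = new Widget(); // prints the message once and includes Widget.php
$b = new Widget(); // Widget is already defined, so __autoload never runs again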

The better way:
function __autoload($class_name) {
    include $class_name . '.php';
}

Now if you make a typo, your logs will look like this:
PHP Warning: include() [function.include]: Failed opening 'FakeClass.php' for inclusion (include_path='.') in conf.php on line 83
PHP Fatal error: Class 'FakeClass' not found in stupidErrorPage.php on line 24

Isn’t that better?

Edit 2010-10-19: Hot Bananas! The docs have already been updated. (at least in svn)