Google AJAX Support: Awesome but Disappointing

18 Oct 2010

Google has added support for crawling AJAX URLs. This is great news for us and any other site that makes heavy use of AJAX or is more of a web app than a collection of individual pages.

We have long worked around the issue of AJAX URLs not being crawlable by having two versions of our URLs, with and without the hash. Users who are actually using the site will obviously get AJAX URLs like http://listen.grooveshark.com/#/user/jay/42, but if a crawler goes to http://listen.grooveshark.com/user/jay/42 it gets the content for that page as well, while real users are automatically redirected to the proper URL with the hash. Crawlers aren't smart enough to find those URLs on their own, of course, but we provide a sitemap, and all the links we present to crawlers omit the hash. Likewise, when users post links to Facebook via the app, we automatically give Facebook the URL without the hash so it can put up a pretty preview of the link in the user's news feed. The problem is that users also like to share by copying URLs straight from the address bar. When those links get posted anywhere, crawlers don't know how to crawl them, so they either don't, or they count them as links to http://listen.grooveshark.com/, which isn't great for us and is lousy for users too.
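
The redirect half of that is roughly this simple (a rough TypeScript sketch, not our actual code — the function name and exact checks are made up for illustration; the one real assumption is that the redirect happens in JavaScript, which crawlers don't execute):

    // A real browser landing on the crawler-friendly URL ("/user/jay/42")
    // gets bounced to the AJAX URL ("/#/user/jay/42"). Crawlers never run
    // this script, so they just index the static content instead.
    function redirectBrowserToHashUrl(): void {
      const loc = window.location;
      if (loc.pathname === "/" || loc.hash !== "") {
        return; // already on the home page or on an AJAX (hash) URL
      }
      loc.replace("/#" + loc.pathname + loc.search);
    }

    redirectBrowserToHashUrl();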

Google’s solution is for sites like ours to switch from using # to using #! and then opting in to having those URLs crawled. The crawler will take everything after the #! and convert the “pretty” URL into an ugly one. For example, /#!/user/jay/42 presumably becomes something like /?_escaped_fragment_=%2Fuser%2Fjay%2F42 when the crawler sends the request to us.
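
To make that mapping concrete, here is a rough TypeScript sketch of both directions (the function names are made up, and the exact escaping rules are Google's, so treat the details as approximate):

    // What the crawler presumably does with a pretty URL it finds:
    //   "/#!/user/jay/42"  ->  "/?_escaped_fragment_=%2Fuser%2Fjay%2F42"
    function toUglyUrl(prettyUrl: string): string {
      const [base, fragment = ""] = prettyUrl.split("#!");
      return base + "?_escaped_fragment_=" + encodeURIComponent(fragment);
    }

    // What our server has to do with the ugly request to recover the route:
    //   "?_escaped_fragment_=%2Fuser%2Fjay%2F42"  ->  "/user/jay/42"
    function fromUglyQuery(query: string): string | null {
      return new URLSearchParams(query).get("_escaped_fragment_");
    }

    console.log(toUglyUrl("/#!/user/jay/42"));
    // "/?_escaped_fragment_=%2Fuser%2Fjay%2F42"
    console.log(fromUglyQuery("?_escaped_fragment_=%2Fuser%2Fjay%2F42"));
    // "/user/jay/42"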

This is annoying and frustrating for several reasons:

  1. All our URLs have to change
    We have to change all URLs to have a #! instead of just a #. This not only requires developer effort but makes our URLs slightly uglier.
  2. All links that users have shared in the past will continue to be non-crawlable forever.
  3. We now have to support 3 URL formats instead of 2
  4. One of those URL formats no human will ever see; we are building a feature solely for the benefit of a robot.

Again, we greatly appreciate that Google is making an effort to crawl AJAX URLs; it's a huge step forward. It's just not as elegant as it could be. It seems like we could accomplish the same goals more simply by:

  1. Requiring opt-in just like the current system
  2. Using robots.txt to dictate which AJAX URLs should not be crawled
  3. Allowing webmasters to specify what the # should be replaced with
    In our case it would be replaced by nothing, just stripped out. For less sophisticated URL schemes that just use #x=y, webmasters could specify that the # should be replaced by a ?

That solution would have the same benefits as the current one, with the added benefits of keeping all crawling permissions in one place (robots.txt) and automatically making links already in the wild crawlable, all without requiring support for yet another URL format.
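
For illustration, under that scheme turning a shared URL into a crawlable one would be a single substitution controlled by the webmaster. This is a hypothetical TypeScript sketch of the suggestion, not anything Google actually supports:

    // Hypothetical: if webmasters could declare what "#" maps to, the crawler
    // could turn any URL shared in the wild into a crawlable one with a
    // single substitution.
    function toCrawlableUrl(sharedUrl: string, hashReplacement: string): string {
      const [base, fragment = ""] = sharedUrl.split("#");
      if (fragment === "") {
        return sharedUrl; // no hash, nothing to rewrite
      }
      // Avoid a double slash when the hash is simply stripped and the
      // fragment already starts with "/" (our case).
      const joiner = base.endsWith("/") && fragment.startsWith("/")
        ? base.slice(0, -1)
        : base;
      return joiner + hashReplacement + fragment;
    }

    // Our scheme: the "#" is just stripped out.
    console.log(toCrawlableUrl("http://listen.grooveshark.com/#/user/jay/42", ""));
    // "http://listen.grooveshark.com/user/jay/42"

    // A simpler "#x=y" scheme: the "#" becomes a "?".
    console.log(toCrawlableUrl("http://example.com/page#x=y", "?"));
    // "http://example.com/page?x=y"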

 
 
  1. Terin Stock

    October 18, 2010 at 5:14 pm

    1) You are effectively opting in, by using the special hash-bang format. In the wild, they aren’t used all that much.

    2) You can use robots.txt to block certain keys from being crawled by putting the “*_escaped_fragment_” form in the file.

    3) I’m not sure I see too much of an issue here either. Google does say that you can use a 302 to redirect to whatever URL you want, which is especially useful if you already have static HTML forms of your pages. In this case, can’t you just do a redirect in .htaccess?

     
  2. Jay

    October 19, 2010 at 12:08 am

    1. I know Google’s current solution is opt-in-by-having-an-ugly-URL. My suggestion is that it would be possible to opt in without resorting to the hashbang.

    2. Again, I’m not concerned with the ability to block keys from being crawled the ugly way; I’m just pointing out that it would be equally feasible to do so with the more elegant solution.

    3. We could, but we already have a set of those for our existing pretty URLs, and systems isn’t exactly happy about the current number. I don’t want to unnecessarily add even more regexes that must run on every page load if it can possibly be avoided.

     
  3. Adam

    January 30, 2011 at 5:38 pm

    Is there any support planned for using the History API in newer browsers, so we don’t need a hash at all and the address in the top bar is exactly the same as the crawlable version?
    Google’s AJAX URLs system seems to be more of a stopgap solution for smaller sites that can’t or won’t support static versions, and hence is quite flawed for larger sites like yours, as you explain.

     
  4. Jay

    February 4, 2011 at 4:45 am

    Hi Adam,

    There definitely are plans to start using the history API, but it’s probably a few weeks out still.
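
    For anyone wondering what that looks like, the History API approach boils down to something like this rough TypeScript sketch (illustrative only, not Grooveshark’s code; renderView is a made-up stand-in for the app’s own rendering):

        // Hypothetical stand-in for the app's own view rendering.
        function renderView(path: string): void {
          console.log("render view for", path);
        }

        // Navigate without a page reload: the address bar shows a real path
        // like "/user/jay/42", identical to the crawlable URL.
        function navigate(path: string): void {
          history.pushState({ path }, "", path);
          renderView(path);
        }

        // Back/forward buttons fire popstate; re-render from the current path.
        window.addEventListener("popstate", () => {
          renderView(window.location.pathname);
        });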