Google has added support for crawling AJAX URLs. This is great news for us and any other site that makes heavy use of AJAX or is more of a web app than a collection of individual pages.
We have long worked around the issue of AJAX URLs not being crawlable by maintaining two versions of our URLs, with and without the hash. Users who are actually using the site get AJAX URLs like http://listen.grooveshark.com/#/user/jay/42, but if a crawler requests http://listen.grooveshark.com/user/jay/42 it gets content for that page as well, while real users who land there are automatically redirected to the proper URL with the hash. Crawlers aren’t smart enough to find those URLs on their own, of course, but we provide a sitemap, and every link we present to crawlers omits the hash. Likewise, when users post links to Facebook via the app, we automatically hand over the URL without the hash so Facebook can show a pretty preview of the link in the user’s news feed.

The problem is that users also like to share by copying URLs straight from the URL bar. If users post those links anywhere, crawlers don’t know how to crawl them, so they either don’t, or they just count them as links to http://listen.grooveshark.com/, which isn’t great for us and is lousy for users too.
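To make that workaround concrete, the user-facing half is essentially a tiny client-side redirect. This is just a sketch of the idea in TypeScript, not our actual code:

```typescript
// Sketch of the client-side half of the workaround: if a real user (i.e. a browser
// running JavaScript) lands on a crawler-style URL like /user/jay/42, bounce them
// to the hash version that the AJAX app actually understands. Illustrative only.

function redirectToHashUrl(): void {
  const { pathname, hash } = window.location;
  if (pathname !== "/" && hash === "") {
    // e.g. /user/jay/42  ->  /#/user/jay/42
    window.location.replace(`/#${pathname}`);
  }
}

redirectToHashUrl();
```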
Google’s solution is for sites like ours to switch from using # to using #! and then opting in to having those URLs crawled. The crawler will take everything after the #! and convert the “pretty” URL into an ugly one. For example, /#!/user/jay/42 presumably becomes something like /?_escaped_fragment_=%2Fuser%2Fjay%2F42 when the crawler sends the request to us.
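To make that concrete, here is a rough sketch (not our actual code, just an illustration of the scheme as we understand it) of how a server might translate the crawler’s ugly URL back into the hash-bang URL it stands for. The toPrettyUrl name is made up for this example:

```typescript
// Sketch: mapping an _escaped_fragment_ request back to the hash-bang URL it
// represents, using only the standard URL API. Helper names are illustrative.

function toPrettyUrl(requestUrl: string, origin = "http://listen.grooveshark.com"): string {
  const url = new URL(requestUrl, origin);
  // The crawler moves everything after "#!" into this query parameter, URL-encoded.
  const fragment = url.searchParams.get("_escaped_fragment_"); // already percent-decoded
  if (fragment === null) {
    return url.toString(); // a normal request; nothing to translate
  }
  url.searchParams.delete("_escaped_fragment_");
  return `${url.origin}${url.pathname}#!${fragment}`;
}

// "/?_escaped_fragment_=%2Fuser%2Fjay%2F42" -> "http://listen.grooveshark.com/#!/user/jay/42"
console.log(toPrettyUrl("/?_escaped_fragment_=%2Fuser%2Fjay%2F42"));
```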
This is annoying and frustrating for several reasons:
- All our URLs have to change. We have to change every URL to use #! instead of just #, which not only requires developer effort but also makes our URLs slightly uglier.
- All links that users have shared in the past will continue to be non-crawlable forever.
- We now have to support 3 URL formats instead of 2 (a rough sketch of that juggling follows this list). One of those formats no human will ever see; we are building a feature solely for the benefit of a robot.
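Here is that juggling act spelled out: a purely hypothetical normalizer, with made-up names, just to show the three shapes the same page can now take:

```typescript
// Hypothetical sketch: reducing the three URL shapes to one internal route.
// None of these names come from our actual codebase.

function routeFor(rawUrl: string): string {
  const url = new URL(rawUrl, "http://listen.grooveshark.com");
  const escaped = url.searchParams.get("_escaped_fragment_");
  if (escaped !== null) {
    return escaped;               // /?_escaped_fragment_=%2Fuser%2Fjay%2F42 (crawler)
  }
  if (url.hash.startsWith("#!")) {
    return url.hash.slice(2);     // /#!/user/jay/42 (real users)
  }
  return url.pathname;            // /user/jay/42 (sitemap and shared links)
}

[
  "http://listen.grooveshark.com/#!/user/jay/42",
  "http://listen.grooveshark.com/user/jay/42",
  "http://listen.grooveshark.com/?_escaped_fragment_=%2Fuser%2Fjay%2F42",
].forEach(u => console.log(routeFor(u))); // "/user/jay/42" three times
```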
Again, we greatly appreciate that Google is making an effort to crawl AJAX URLs; it’s a huge step forward. It’s just not as elegant as it could be. It seems like we could accomplish the same goals more simply by:
- Requiring opt-in just like the current system
- Using robots.txt to dictate which AJAX URLs should not be crawled
- Allowing webmasters to specify what the # should be replaced with
In our case it would be replaced by nothing, just stripped out. For less sophisticated URL schemes that just use #x=y, webmasters could specify that the # should be replaced with a ? (a rough sketch of that rule follows).
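To illustrate what we mean (and to be clear, this is our hypothetical proposal, not anything Google actually offers), the entire replacement rule a crawler would need fits in a few lines:

```typescript
// Purely hypothetical: the proposed "replace the # with X" rule, applied by a
// crawler before requesting a page. Nothing here is a real Google feature.

function crawlerUrlFor(prettyUrl: string, hashReplacement: string): string {
  const [base, fragment = ""] = prettyUrl.split("#");
  // Join the two halves, collapsing the double slash that "/#/" would otherwise leave.
  return (base + hashReplacement + fragment).replace(/([^:])\/\//, "$1/");
}

console.log(crawlerUrlFor("http://listen.grooveshark.com/#/user/jay/42", ""));
// -> http://listen.grooveshark.com/user/jay/42
console.log(crawlerUrlFor("http://example.com/page#x=y", "?"));
// -> http://example.com/page?x=y
```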
That solution would have the same benefits as the current one, with the added benefits of keeping all crawling permissions in one place (robots.txt) and automatically making links already in the wild crawlable, all without requiring support for yet another URL format.