Detect crawlers with PHP faster

08 Apr 2009

At Grooveshark we use DB-based PHP sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like Google end up creating tens of thousands of sessions every day, because of course they do not carry cookies around with them.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    // Each entry is array(substring to look for, human-readable crawler name).
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Check the user agent against each crawler substring, one at a time.
    foreach ($crawlers as $c) {
        if (stristr($USER_AGENT, $c[0])) {
            return $c[1];
        }
    }
    return false;
}

Essentially, it does a loop over the entire list of possible clients, searching the user agent string for each one, one at a time. That seems far too slow and inefficient for something that has to run on essentially every call on a high-volume website, so I rewrote it to look like this:

public static function getIsCrawler($userAgent)
{
    // Same list as above, collapsed into a single regex alternation.
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/", $userAgent) > 0);
    return $isCrawler;
}
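
To tie this back to the motivation above: once the check exists, crawlers can simply be refused a session. Here is a minimal sketch of how that wiring might look; the UserAgent class name and this bootstrap snippet are illustrative assumptions, not our actual code.

// Hypothetical request bootstrap: only start a (DB-backed) session for real users.
// UserAgent is a placeholder class name for wherever getIsCrawler() lives.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!UserAgent::getIsCrawler($ua)) {
    session_start(); // crawlers never reach this, so no session rows get created
}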

In my not-very-scientific testing on my local box, my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers to do 1 million comparisons takes 70 seconds. So there you have it: using a single regex for string matching rather than looping over an array can be more than six times faster. I suspect, but have not tested, that the performance gap grows as the number of strings being tested against increases.
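
For anyone who wants to reproduce the comparison, a rough harness along these lines will do this kind of million-iteration timing; the sample user agent string, the loop structure, and the UserAgent placeholder class are assumptions for illustration, not the original test code.

// Rough benchmark sketch: time 1,000,000 checks of a sample crawler user agent
// with each approach. Absolute numbers will vary by machine and PHP version.
$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    crawlerDetect($ua);              // loop-over-array version
}
echo 'loop:  ', microtime(true) - $start, " seconds\n";

$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    UserAgent::getIsCrawler($ua);    // single-regex version (placeholder class)
}
echo 'regex: ', microtime(true) - $start, " seconds\n";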

 
  1. Christopher Suter

    April 8, 2009 at 4:39 pm

    given the relatively static nature of the list, you could probably define a very small, very fast hashing algorithm with a small range set (ideally, the range would be the set {0,1,…num_crawlers}). Then it could be blazing fast!

     
  2. James Hartig

    April 8, 2009 at 8:13 pm

    You should try using memcache for your sessions, with the DB as a backup in case memcache gets erased (at shutdown). You could also just make a mount on /dev/shm and load the sessions into memory for faster access?

    -fastest963

     
  3. Jay

    April 8, 2009 at 8:25 pm

    We’ve explored using memcache for sessions in the past but there are some issues with concurrency. If a user makes two requests back-to-back, each one might try to update the session. Memcache has no locking, so one update will overwrite the other, leading to lost data. This can be mitigated somewhat by using sessions for less, or by breaking up the session data so that it’s not one large serialized piece of data.

    /dev/shm isn’t shared so that won’t work for us for sessions either, but I would like to get us using /dev/shm for mysql temp space, so that filesorts can happen in-memory.

     
  4. Jay

    April 8, 2009 at 8:44 pm

    @chris I’m not really sure how that would work, the set to match against is relatively small but the set of all user agent strings approaches infinity…

     
  5. elPas0

    April 27, 2009 at 4:20 am

    You are quite right about the loop,
    but why use preg_match?
    Didn’t you think about strpos() or strstr() instead? It should be faster.

     
  6. Cult-foo » Detect crawlers with PHP

    April 27, 2009 at 4:59 am

    […] After reading this i decide to update my code a bit. Change is connected to usage of function on high volume […]

     
  7. Jay

    April 27, 2009 at 10:58 am

    Honestly I’m not intimately familiar with the differences — I tend to just use preg_match whenever I need to match a regex. I’ll check it out!

     
  8. Mike

    June 18, 2009 at 6:42 pm

    Great post, very helpful.

    I added some more crawlers:

    Bloglines subscriber|Dumbot|Sosoimagespider|QihooBot|FAST-WebCrawler|Superdownloads Spiderman|LinkWalker|msnbot|ASPSeek|WebAlta Crawler|Lycos|FeedFetcher-Google|Yahoo|YoudaoBot|AdsBot-Google|Googlebot|Scooter|Gigabot|Charlotte|eStyle|AcioRobot|GeonaBot|msnbot-media|Baidu|CocoCrawler|Google|Charlotte t|Yahoo! Slurp China|Sogou web spider|YodaoBot|MSRBOT|AbachoBOT|Sogou head spider|AltaVista|IDBot|Sosospider|Yahoo! Slurp|Java VM|DotBot|LiteFinder|Yeti|Rambler|Scrubby|Baiduspider|accoona

    From http://www.httpuseragent.org/list/Robot, Spider, Crawler-c16.htm

    Also, maybe you should use the ‘i’ flag, e.g:

    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);

    To do case-insensitive matching?

     
  9. IWM - Marketing Internet

    October 23, 2010 at 10:25 am

    Great post, it’s going to speed up my crawler detection a great deal.

    Is Bing’s crawler included?

     
  10. www.syso.pl

    November 16, 2010 at 9:00 pm

    I’m using strpos instead of preg_match; I think it should be faster.

     
  11. Karla Walsh

    September 3, 2012 at 2:51 am

    Using stripos rather than regex makes a slight difference.
    1,000,000 regex repetitions in 15.2711 seconds
    1,000,000 string repetitions in 13.9157 seconds
    A revised list including the bots from Mike is attached.

    function is_crawler($user_agent) {
    $crawlers=''
    . 'AbachoBOT|accoona|AcioRobot|AdsBot-Google|AltaVista|ASPSeek|Baidu|'
    . 'Charlotte|Charlotte t|CocoCrawler|DotBot|Dumbot|eStyle|'
    . 'FeedFetcher-Google|GeonaBot|Gigabot|Google|Googlebot|IDBot|Java VM|'
    . 'LiteFinder|Lycos|msnbot|msnbot-media|MSRBOT|QihooBot|Rambler|Scooter|'
    . 'ScrubbyBloglines subscriber|Sogou head spider|Sogou web spider|'
    . 'Sosospider|Superdownloads Spiderman|WebAlta Crawler|Yahoo|'
    . 'Yahoo! Slurp China|Yeti|YoudaoBot|' ;
    //$is_crawler = (preg_match("/$crawlers/i", $user_agent) > 0); // 1 million reps = 15.2711 secs
    $is_crawler = ((stripos($crawlers, $user_agent) !== false) ? true : false); // 1 million reps = 13.9157 secs
    return $is_crawler;
    }

     
  12. Brett Miller

    December 6, 2012 at 5:52 pm

    There is a fundamental flaw in your reasoning, Karla.

    In the preg_match, they are looking for a pattern within the HTTP_USER_AGENT value and that will look something like the following:
    crawl-66-249-70-244.googlebot.com

    As you can see, there is more to the HTTP_USER_AGENT value than just googlebot, so the stripos call would fail for just about every crawler.

     
  13. Bacon

    March 4, 2013 at 8:44 pm

    Why the second-level array? Why not put it all inside the parent array, i.e. array('google' => 'Google', 'msnbot' => 'Bing', …)?

     
  14. Ben

    June 12, 2013 at 4:38 am

    Be careful with this, it gives false positives – I am using Firefox and it returns true for my user agent.

     
  15. James Lehmann

    April 22, 2015 at 10:14 pm

    Why not add “bot” to the list and remove “msnbot”, “AbachoBOT”, “AcioRobot”, “Dumbot”, “GenoaBot”, “Gigabot”, “MSRBOT”, and “IDBot”? That should save some time also.