Detect crawlers with PHP faster

08 Apr 2009

At Grooveshark we use DB-based PHP sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like Google end up creating tens of thousands of sessions every day, because of course they do not carry cookies around with them.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    // Each entry is array(substring to look for, human-readable crawler name).
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Check the user agent against each crawler substring, one at a time.
    foreach ($crawlers as $c) {
        if (stristr($USER_AGENT, $c[0])) {
            return $c[1];
        }
    }
    return false;
}

Essentially, it does a loop over the entire list of possible clients, searching the user agent string for each one, one at a time. That seems far too slow and inefficient for something that has to run on essentially every call on a high-volume website, so I rewrote it to look like this:

public static function getIsCrawler($userAgent)
{
    // Same list as above, collapsed into a single regex alternation.
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/", $userAgent) > 0);
    return $isCrawler;
}
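
To tie this back to the motivation above: once the check exists, crawlers can simply be refused a session. Here is a minimal sketch of how that wiring might look; the UserAgent class name and this bootstrap snippet are illustrative assumptions, not our actual code.

// Hypothetical request bootstrap: only start a (DB-backed) session for real users.
// UserAgent is a placeholder class name for wherever getIsCrawler() lives.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!UserAgent::getIsCrawler($ua)) {
    session_start(); // crawlers never reach this, so no session rows get created
}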

In my not-very-scientific testing on my local box, my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers to do 1 million comparisons takes 70 seconds. So there you have it: using a single regex for string matching rather than looping over an array can be more than six times faster. I suspect, but have not tested, that the performance gap grows as the number of strings being tested against increases.
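
For anyone who wants to reproduce the comparison, a rough harness along these lines will do this kind of million-iteration timing; the sample user agent string, the loop structure, and the UserAgent placeholder class are assumptions for illustration, not the original test code.

// Rough benchmark sketch: time 1,000,000 checks of a sample crawler user agent
// with each approach. Absolute numbers will vary by machine and PHP version.
$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    crawlerDetect($ua);              // loop-over-array version
}
echo 'loop:  ', microtime(true) - $start, " seconds\n";

$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    UserAgent::getIsCrawler($ua);    // single-regex version (placeholder class)
}
echo 'regex: ', microtime(true) - $start, " seconds\n";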

 
  1. Christopher Suter

    April 8, 2009 at 4:39 pm

    given the relatively static nature of the list, you could probably define a very small, very fast hashing algorithm with a small range set (ideally, the range would be the set {0,1,…num_crawlers}). Then it could be blazing fast!

     
  2. James Hartig

    April 8, 2009 at 8:13 pm

    You should try using memcache for your sessions, with the DB as a backup in case memcache gets erased (at shutdown). You could also just make a mount on /dev/shm and load the sessions into memory for faster access?

    -fastest963

     
  3. Jay

    April 8, 2009 at 8:25 pm

    We’ve explored using memcache for sessions in the past but there are some issues with concurrency. If a user makes two requests back-to-back, each one might try to update the session. Memcache has no locking, so one update will overwrite the other, leading to lost data. This can be mitigated somewhat by using sessions for less, or by breaking up the session data so that it’s not one large serialized piece of data.

    /dev/shm isn’t shared so that won’t work for us for sessions either, but I would like to get us using /dev/shm for mysql temp space, so that filesorts can happen in-memory.

     
  4. Jay

    April 8, 2009 at 8:44 pm

    @chris I’m not really sure how that would work, the set to match against is relatively small but the set of all user agent strings approaches infinity…

     
  5. elPas0

    April 27, 2009 at 4:20 am

    You are quite right about the loop,
    but why use preg_match?
    Didn’t you think about strpos() or strstr() instead? It should be faster.

     
  6. Cult-foo » Detect crawlers with PHP

    April 27, 2009 at 4:59 am

    […] After reading this i decide to update my code a bit. Change is connected to usage of function on high volume […]

     
  7. Jay

    April 27, 2009 at 10:58 am

    Honestly I’m not intimately familiar with the differences — I tend to just use preg_match whenever I need to match a regex. I’ll check it out!

     
  8. Mike

    June 18, 2009 at 6:42 pm

    Great post, very helpful.

    I added some more crawlers:

    Bloglines subscriber|Dumbot|Sosoimagespider|QihooBot|FAST-WebCrawler|Superdownloads Spiderman|LinkWalker|msnbot|ASPSeek|WebAlta Crawler|Lycos|FeedFetcher-Google|Yahoo|YoudaoBot|AdsBot-Google|Googlebot|Scooter|Gigabot|Charlotte|eStyle|AcioRobot|GeonaBot|msnbot-media|Baidu|CocoCrawler|Google|Charlotte t|Yahoo! Slurp China|Sogou web spider|YodaoBot|MSRBOT|AbachoBOT|Sogou head spider|AltaVista|IDBot|Sosospider|Yahoo! Slurp|Java VM|DotBot|LiteFinder|Yeti|Rambler|Scrubby|Baiduspider|accoona

    From http://www.httpuseragent.org/list/Robot, Spider, Crawler-c16.htm

    Also, maybe you should use the ‘i’ flag, e.g:

    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);

    To do case-insensitive matching?

     
  9. IWM - Marketing Internet

    October 23, 2010 at 10:25 am

    Great post, it’s going to speed up my crawler detection a great deal.

    Is Bing’s crawler included?

     
  10. www.syso.pl

    November 16, 2010 at 9:00 pm

    I’m using strpos instead of preg_match; I think it should be faster.

     
  11. Karla Walsh

    September 3, 2012 at 2:51 am

    Using stripos rather than regex makes a slight difference.
    1,000,000 regex repetitions in 15.2711 seconds
    1,000,000 string repetitions in 13.9157 seconds
    A revised list including the bots from Mike is attached.

    function is_crawler($user_agent) {
    $crawlers=''
    . 'AbachoBOT|accoona|AcioRobot|AdsBot-Google|AltaVista|ASPSeek|Baidu|'
    . 'Charlotte|Charlotte t|CocoCrawler|DotBot|Dumbot|eStyle|'
    . 'FeedFetcher-Google|GeonaBot|Gigabot|Google|Googlebot|IDBot|Java VM|'
    . 'LiteFinder|Lycos|msnbot|msnbot-media|MSRBOT|QihooBot|Rambler|Scooter|'
    . 'ScrubbyBloglines subscriber|Sogou head spider|Sogou web spider|'
    . 'Sosospider|Superdownloads Spiderman|WebAlta Crawler|Yahoo|'
    . 'Yahoo! Slurp China|Yeti|YoudaoBot|' ;
    //$is_crawler = (preg_match("/$crawlers/i", $user_agent) > 0); // 1 million reps = 15.2711 secs
    $is_crawler = ((stripos($crawlers, $user_agent) !== false) ? true : false); // 1 million reps = 13.9157 secs
    return $is_crawler;
    }

     
  12. Brett Miller

    December 6, 2012 at 5:52 pm

    There is a fundamental flaw in your reasoning, Karla.

    In the preg_match, they are looking for a pattern within the HTTP_USER_AGENT value and that will look something like the following:
    crawl-66-249-70-244.googlebot.com

    As you can see, there is more to the HTTP_USER_AGENT value than just googlebot, so the stripos call would fail for just about every crawler.

     
  13. Bacon

    March 4, 2013 at 8:44 pm

    Why the second-level array? Why not put it all inside the parent array, i.e. array('google' => 'Google', 'msnbot' => 'Bing', …)?

     
  14. Ben

    June 12, 2013 at 4:38 am

    Be careful with this, it gives false positives – I am using Firefox and it returns true for my user agent.

     
  15. James Lehmann

    April 22, 2015 at 10:14 pm

    Why not add “bot” to the list and remove “msnbot”, “AbachoBOT”, “AcioRobot”, “Dumbot”, “GenoaBot”, “Gigabot”, “MSRBOT”, and “IDBot”? That should save some time also.