
Detect crawlers with PHP faster

08 Apr

At Grooveshark we use database-backed PHP sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” since just about every request to do anything, ever, requires a session. We noticed that web crawlers like Google's end up creating tens of thousands of sessions every day because they, of course, do not carry cookies around with them.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    // Each entry is array(substring to search for, display name).
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Case-insensitive substring search, one crawler at a time.
    foreach ($crawlers as $c) {
        if (stristr($USER_AGENT, $c[0])) {
            return($c[1]);
        }
    }
    return false;
}

Essentially, this loops over the entire list of possible clients and searches the user agent string for each one, one at a time. That seems way too slow and inefficient for something that has to run on essentially every call on a high-volume website, so I rewrote it to look like this:

public static function getIsCrawler($userAgent)
{
    // One alternation pattern instead of one search per crawler.
    // Note: unlike stristr() above, this match is case-sensitive.
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/", $userAgent) > 0);
    return $isCrawler;
}
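
To connect this back to the session problem, here is a minimal sketch of how such a check might gate session creation. The helper names `isCrawler()` and `startSessionUnlessCrawler()` are hypothetical and the crawler list is shortened; the post does not show how the check is actually wired into the session code, so treat this as one possible arrangement.

```php
<?php
// Standalone copy of the regex check so this sketch runs on its own.
// (Hypothetical helper; the version in the post is a static class method.)
function isCrawler($userAgent)
{
    $crawlers = 'Google|msnbot|Rambler|Yahoo|Lycos|AltaVista'; // shortened list
    return preg_match("/$crawlers/", $userAgent) === 1;
}

// Start a session only for clients that do not look like crawlers.
// Returns true if a session was started, false if it was skipped.
function startSessionUnlessCrawler(array $server)
{
    $userAgent = isset($server['HTTP_USER_AGENT']) ? $server['HTTP_USER_AGENT'] : '';
    if (isCrawler($userAgent)) {
        return false; // crawler: skip the session entirely
    }
    return session_start();
}

// Typical call early in a request:
//     startSessionUnlessCrawler($_SERVER);
```

Any code that reads `$_SESSION` later in the request would need to cope with the session being absent for crawlers.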

In my not-very-scientific testing on my local box, my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers takes 70 seconds for the same 1 million comparisons. So there you have it: using a single regex for string matching rather than looping over an array can be more than 6 times faster. I suspect, but have not tested, that the performance gap grows with the number of strings you are testing against.
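
A comparison like this can be reproduced with a small timing harness. The sketch below is not the script used for the numbers above; the function names, the shortened crawler list, the sample user agent, and the iteration count are all assumptions chosen for illustration. It adds the `i` flag to the regex so that both variants are case-insensitive and answer the same question.

```php
<?php
// Shortened crawler list shared by both variants (illustrative only).
$names = array('Google', 'msnbot', 'Rambler', 'Yahoo', 'Lycos', 'AltaVista');

// Variant 1: loop over the list, one case-insensitive search per name.
function loopDetect($userAgent, array $names)
{
    foreach ($names as $name) {
        if (stristr($userAgent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// Variant 2: a single alternation regex.
function regexDetect($userAgent, $pattern)
{
    return preg_match($pattern, $userAgent) === 1;
}

// Call $fn $iterations times and return the elapsed seconds.
function timeIt($fn, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// None of the names above contain regex metacharacters, so a plain
// implode() is enough to build the pattern here.
$pattern = '/' . implode('|', $names) . '/i';

// A browser user agent that matches nothing, so the loop must try every name.
$agent = 'Mozilla/5.0 (Windows NT 6.1) Gecko/20100101 Firefox/3.0';
$iterations = 100000;

$loopSeconds = timeIt(function () use ($agent, $names) {
    loopDetect($agent, $names);
}, $iterations);
$regexSeconds = timeIt(function () use ($agent, $pattern) {
    regexDetect($agent, $pattern);
}, $iterations);

printf("loop: %.3fs  regex: %.3fs\n", $loopSeconds, $regexSeconds);
```

The absolute numbers and the ratio will depend on the PHP version, the length of the list, and whether the user agent matches early or not at all, so it is worth measuring with realistic traffic samples.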

 
 

Leave a Reply

 
 
  1. Christopher Suter

    April 8, 2009 at 4:39 pm

    given the relatively static nature of the list, you could probably define a very small, very fast hashing algorithm with a small range set (ideally, the range would be the set {0,1,…num_crawlers}). Then it could be blazing fast!

     
  2. James Hartig

    April 8, 2009 at 8:13 pm

    You should try using memcache for your sessions and DB backups, in case the memcache gets erased (at shutdown). You could also just make a mount on /dev/shm and load the sessions into memory for faster access?

    -fastest963

     
  3. Jay

    April 8, 2009 at 8:25 pm

    We’ve explored using memcache for sessions in the past, but there are some issues with concurrency. If a user makes two requests back-to-back, each one might try to update the session. Memcache has no locking, so one update will overwrite the other, leading to lost data. This can be mitigated somewhat by using sessions for less, or by breaking up the session data so that it’s not one large serialized piece of data.

    /dev/shm isn’t shared so that won’t work for us for sessions either, but I would like to get us using /dev/shm for mysql temp space, so that filesorts can happen in-memory.

     
  4. Jay

    April 8, 2009 at 8:44 pm

    @chris I’m not really sure how that would work, the set to match against is relatively small but the set of all user agent strings approaches infinity…

     
  5. elPas0

    April 27, 2009 at 4:20 am

    You are quite right about the loop,
    but why use preg_match?
    Have you thought about strpos() or strstr() instead? They should be faster.

     
  6. Cult-foo » Detect crawlers with PHP

    April 27, 2009 at 4:59 am

    [...] After reading this i decide to update my code a bit. Change is connected to usage of function on high volume [...]

     
  7. Jay

    April 27, 2009 at 10:58 am

    Honestly I’m not intimately familiar with the differences — I tend to just use preg_match whenever I need to match a regex. I’ll check it out!

     
  8. Mike

    June 18, 2009 at 6:42 pm

    Great post, very helpful.

    I added some more crawlers:

    Bloglines subscriber|Dumbot|Sosoimagespider|QihooBot|FAST-WebCrawler|Superdownloads Spiderman|LinkWalker|msnbot|ASPSeek|WebAlta Crawler|Lycos|FeedFetcher-Google|Yahoo|YoudaoBot|AdsBot-Google|Googlebot|Scooter|Gigabot|Charlotte|eStyle|AcioRobot|GeonaBot|msnbot-media|Baidu|CocoCrawler|Google|Charlotte t|Yahoo! Slurp China|Sogou web spider|YodaoBot|MSRBOT|AbachoBOT|Sogou head spider|AltaVista|IDBot|Sosospider|Yahoo! Slurp|Java VM|DotBot|LiteFinder|Yeti|Rambler|Scrubby|Baiduspider|accoona

    From http://www.httpuseragent.org/list/Robot, Spider, Crawler-c16.htm

    Also, maybe you should use the ‘i’ flag, e.g:

    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);

    To do case-insensitive matching?
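
As the list grows to include names with punctuation (such as "Yahoo! Slurp" above), it may be safer to build the pattern from an array and escape each name. The following is a minimal sketch of that idea; `buildCrawlerPattern()` and `matchesCrawler()` are hypothetical names, and the four crawler names are just a sample from the longer list.

```php
<?php
// Build one case-insensitive alternation pattern from a list of names.
// preg_quote() escapes regex metacharacters in each name; passing '/'
// as the second argument also escapes the pattern delimiter.
function buildCrawlerPattern(array $names)
{
    $quoted = array();
    foreach ($names as $name) {
        $quoted[] = preg_quote($name, '/');
    }
    return '/' . implode('|', $quoted) . '/i';
}

// Return true if the user agent contains any of the crawler names.
function matchesCrawler($userAgent, $pattern)
{
    return preg_match($pattern, $userAgent) === 1;
}

// Build the pattern once and reuse it for every check.
$pattern = buildCrawlerPattern(array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'Baiduspider'));
```

Keeping the names in an array also makes duplicates easier to spot and remove than in one long pipe-separated string.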