At Grooveshark we use DB-based php sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like google end up creating tens of thousands of sessions every day, because they of course do not carry cookies around with them.
The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
array('Google', 'Google'),
array('msnbot', 'MSN'),
array('Rambler', 'Rambler'),
array('Yahoo', 'Yahoo'),
array('AbachoBOT', 'AbachoBOT'),
array('accoona', 'Accoona'),
array('AcoiRobot', 'AcoiRobot'),
array('ASPSeek', 'ASPSeek'),
array('CrocCrawler', 'CrocCrawler'),
array('Dumbot', 'Dumbot'),
array('FAST-WebCrawler', 'FAST-WebCrawler'),
array('GeonaBot', 'GeonaBot'),
array('Gigabot', 'Gigabot'),
array('Lycos', 'Lycos spider'),
array('MSRBOT', 'MSRBOT'),
array('Scooter', 'Altavista robot'),
array('AltaVista', 'Altavista robot'),
array('IDBot', 'ID-Search Bot'),
array('eStyle', 'eStyle Bot'),
array('Scrubby', 'Scrubby robot')
);
foreach ($crawler as $c) {
if (stristr($USER_AGENT, $c[0])) {
return($c[1]);
}
}
return false;
}
Essentially, doing a for loop over the entire list of possible clients, and searching the user agent string for each one, one at a time. This seems way too slow and inefficient for something that is going to have to run on essentially every call on a high volume website, so I rewrote it to look like this:
public static function getIsCrawler($userAgent)
{
$crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
$isCrawler = (preg_match("/$crawlers/", $userAgent) > 0);
return $isCrawler;
}
In my not-very-scientific testing, running on my local box my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers to do 1 million comparisons takes 70 seconds. So there you have it, using a single regex for string matching rather than looping over an array can be 7 times faster. I suspect, but have not tested, that the performance gap gets bigger the more strings you are testing against.