
Archive for April, 2009

Washington

30 Apr

The following song has infected my brain: Clementine by Washington

 
 

Jay does some front-end work

20 Apr

It’s no secret that I am deeply entrenched in the back-end world. I can optimize the hell out of some queries, and write some pretty complex php, but when it comes to html, css and javascript, I am a barbarian. In fact, I stopped learning about html in the 90’s, before css and back when javascript was pretty useless. I just don’t care for working on visual stuff, especially if it’s going to be finicky and inconsistent on different platforms.

Anyway, a longstanding complaint I’ve had when using Tinysong is that I can’t listen to a song before I share it. How do I make sure it’s the song I’m thinking of? Sure, I could copy the URL and paste it in the location bar and load up lite, but by then my other song options are gone, not to mention that I’m lazy.

I finally decided to take a stab at adding playback to Tinysong myself, and I’m happy to report that everything seems to be working quite well. I can’t take credit for the whole thing, or even most of it: Katy wrote the streaming code, and Chanel made some beautiful javascript wrappers for it, both for other projects. I simply took what they did and wrote the css and javascript to glue it to Tinysong. I hope I’m not the only one who appreciates this enhancement. Give it a try and let me know what you think.

 
 

Goodbye xampp, hello Portable Ubuntu

16 Apr

For a long time now I’ve used xampp for testing my scripts locally before committing to the codebase and testing in our dev environment. Apache for Windows, at least in this configuration, was incredibly unstable: it had major memory leaks, and it crashed if it received two connections at once. It was good enough for rudimentary testing, but always a bit annoying; the memory leaks meant I had to shut down apache as soon as I was done testing any script.

I recently discovered Portable Ubuntu, or pubuntu for short, which allows you to run ubuntu inside of Windows (yes, I still prefer Windows as my desktop environment). I set up apache and php inside of pubuntu, and after following this suggestion (with my own hacks to make it work over the network: see thread) and pointing apache to the mount point for my workspace (which also lives in Windows), everything is working quite smoothly. Apache is both more stable and faster. Ironically, pubuntu with apache running inside of it is more lightweight than the Apache for Windows distribution that came with xampp ever was for me.

What’s especially cool is that because of the way pubuntu mounts drives, pubuntu is seamlessly accessing my workspace to serve up files. That means I can edit a php file in my windows IDE of choice (notably not Vim), and that change is immediately reflected in pubuntu with no effort on my part.

 
 

Detect crawlers with PHP faster

08 Apr

At Grooveshark we use DB-based php sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like Google end up creating tens of thousands of sessions every day, because of course they don’t carry cookies around with them.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    // Map of user agent substrings to crawler names
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Case-insensitive search of the user agent string, one crawler at a time
    foreach ($crawlers as $c) {
        if (stristr($USER_AGENT, $c[0])) {
            return $c[1];
        }
    }
    return false;
}

Essentially, it loops over the entire list of possible clients and searches the user agent string for each one, one at a time. That seems way too slow and inefficient for something that is going to run on essentially every call on a high-volume website, so I rewrote it to look like this:

public static function getIsCrawler($userAgent)
{
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    // Single case-insensitive regex, matching the stristr behavior above
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

In my not-very-scientific testing, running on my local box my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers to do 1 million comparisons takes 70 seconds. So there you have it, using a single regex for string matching rather than looping over an array can be 7 times faster. I suspect, but have not tested, that the performance gap gets bigger the more strings you are testing against.
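
For the curious, the “not-very-scientific” numbers above come from a timing loop along these lines. The harness below is a sketch with a placeholder user agent and the same hypothetical Browser class as above, not the exact script, but it should reproduce the ballpark difference on your own box:

// Rough benchmark sketch: time 1 million checks of a single user agent string
// against each implementation. Assumes crawlerDetect() and a hypothetical
// Browser::getIsCrawler() from above are already defined.
$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$iterations = 1000000;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    crawlerDetect($ua);
}
printf("array loop:   %.2f seconds\n", microtime(true) - $start);

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    Browser::getIsCrawler($ua);
}
printf("single regex: %.2f seconds\n", microtime(true) - $start);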