
Archive for the ‘Coding’ Category

Tail error logs to slack for fun and profit

12 May

We love Slack, so we thought: wouldn’t it be great if our error logs posted to an #errors channel in Slack? This is obviously a very bad idea if you have noisy error logs, but we try to keep the chatter in our logs down so that people will actually pay attention to the errors, and it’s working out nicely. It was a little complicated to get set up, though, so I thought I’d share.

First we installed and set up https://github.com/paulhammond/slackcat on our servers using the instructions there. It’s pretty simple and straightforward to get going, which is awesome!
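Once slackcat is configured, a quick sanity check is to pipe something to it by hand (this assumes you installed the binary at /root/slackcat, which is the path the scripts below use):

echo "slackcat test" | /root/slackcat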

Then we created two files on each server we wanted to monitor:
watchlogdir.sh

#!/bin/sh
# on any signal, TERM our whole process group so we don't leave orphans behind
trap "pkill -TERM -g $$; exit" INT TERM EXIT
while true; do
    /path/to/tailtoslack.sh &
    PID=$!
    # block until a new file appears in the log directory
    inotifywait -e create /var/log/xxx/
    # a new log file exists; kill the old tail pipeline and start over
    pkill -TERM -P $PID
    kill $PID
done

tailtoslack.sh

#!/bin/sh
while true; do
    # find the most recently modified log file
    FILETOWATCH=$(ls -t /var/log/xxx/*.log | head -1)
    # note the \| -- grep needs the alternation escaped in a basic regex
    tail $FILETOWATCH -f -n1 | grep -v "DEBUG:\|^$" --color=never --line-buffered | /root/slackcat
    # if the pipeline dies, nap briefly and try again
    sleep 31
done

So let’s look at what this is actually doing line by line:
trap "pkill -TERM -g $$; exit" INT TERM EXIT
This says: if we get any of the signals INT, TERM or EXIT, execute “pkill -TERM -g $$; exit”. Ok, well what does THAT mean? It uses the pkill command to send the TERM signal to every process in process group $$, and $$ means “my PID”. So essentially, we send TERM to all the processes we spawned, and if we get killed, we don’t leave orphaned children lying around.

while true; do
Pretty self-explanatory: all the code inside this block will run forever, until we are killed.

/path/to/tailtoslack.sh &
this spawns tailtoslack.sh as a background process. tailtoslack.sh basically needs to run forever, and we don’t want it to block what we’re about to do next

PID=$!
this grabs the PID of tailtoslack.sh and stores it in a variable called PID

inotifywait -e create /var/log/xxx/
this utilizes inotifywait to watch the /var/log/xxx/ directory for new files added. Our log files are written in the format of /var/log/xxx/yyyymmdd.log so when the day changes, a new file is written and that means we want to start tailing that log file and ignore the old one. inotifywait is blocking, so our code will sit here waiting for the new log file to be written, doing nothing until that moment.

pkill -TERM -P $PID
if we got here, that means a new file has appeared, so we want to kill the old tailtoslack.sh and start the process over again. This line sends the TERM signal to all the child processes spawned by the old tailtoslack.sh (that’s what -P means).

kill $PID
now we send a signal to tailtoslack.sh itself as well (kill sends TERM by default).

done
That’s it! on to the next file:

while true; do
this time we need a while loop because sometimes slackcat will exit (on a bad response from Slack, or if you send it whitespace), but we don’t actually want to give up, so we loop

FILETOWATCH=$(ls -t /var/log/xxx/*.log | head -1)
this finds the most recently modified log file: ls -t sorts by most recently modified, and “head -1” grabs the first line of output; we store the result in a variable called FILETOWATCH

tail $FILETOWATCH -f -n1 | grep -v "DEBUG:\|^$" --color=never --line-buffered | /root/slackcat
here we tail the file that we just determined was the most recently modified one, then we strip out lines starting with DEBUG: (since we don’t care about them) as well as empty lines (since empty lines crash slackcat). We also have to tell grep not to colorize the output and to line-buffer it, so it only sends complete lines to slackcat, since slackcat sends one message per line.

sleep 31
if we get to this line, it means something in the tail pipeline on the previous line crashed. We don’t know why, but we’re hoping whatever condition caused the crash will pass soon, so we take a nap before iterating through the loop again

done
we crashed, we took a nap, time to start over

that’s it! I am by no means a bash expert, so it’s possible that some of this could be done better, but it works, and it has been surprisingly robust!

For bonus points, here’s how I added it to systemd in CentOS 7:

Create file: /etc/systemd/system/logwatch.service :
[Unit]
Description=Watch web log files and pipe to slack
After=php-fpm.service

[Service]
ExecStart=/usr/bin/bash -c '/path/to/watchlogdir.sh'
Type=simple
User=youruser
Group=yourgroup
Restart=always
LimitNOFILE=4096

[Install]
WantedBy=multi-user.target
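One gotcha: after creating or editing the unit file, you have to tell systemd to pick it up:
systemctl daemon-reload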

Then you just need to run:
systemctl start logwatch
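You can verify it’s actually running with:
systemctl status logwatch
(journalctl -u logwatch will show any output from the scripts)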

and once everything looks good:
systemctl enable logwatch

 

 

Migrating to github

26 Jan

I needed to migrate a couple of repositories with lots of tags and branches to github, and Github’s instructions just weren’t cutting it:

Existing Git Repo?
cd existing_git_repo
git remote add origin git@github.com:[orgname]/[reponame].git
git push -u origin master

That’s great as long as all I care about is moving my master branch, but what I really wanted was just to make github have everything that my current origin did, whether or not I was tracking it. A quick cursory Google search didn’t find any instructions how to do this, so I had to figure it out the old fashioned way, by reading git help pages. The quickest and easiest way I could find to do this after creating the repo on github is:

git clone git@[originurl]:[orgname]/[reponame].git [local temp dir] --mirror
cd [local temp dir]
git remote add github git@github.com:[orgname]/[reponame].git
git push -f --mirror github
cd ..
rm -rf [local temp dir]
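If you want to double-check that everything made it over, git ls-remote can compare the refs on the two servers from anywhere (purely optional):

git ls-remote git@[oldgitserver]:[orgname]/[reponame].git > old-refs.txt
git ls-remote git@github.com:[orgname]/[reponame].git > new-refs.txt
diff old-refs.txt new-refs.txt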

that’s it! At this point github should have everything that your original origin had, including everything in /refs. It won’t have any of your local branches, since you did everything from a clean checkout. You might also want to change where origin is pointing:

cd [dir where your repo is checked out]
git remote add oldOrigin git@[oldgitserver]:[orgname]/[reponame].git
git remote set-url origin git@github.com:[orgname]/[reponame].git

Of course, this same procedure should work for moving from any git server to any other git server too.

 

Grooveshark is Hiring (Part 2 – Javascript Edition)

19 May

Grooveshark is looking for talented web developers. We are looking to fill a wide variety of web dev positions. The next few posts I make will be job descriptions for each of the major positions we are looking to fill. For part 2, I’m listing our JavaScript developer position. If your skillset is more front-end leaning but you feel like you would be a good fit for Grooveshark, by all means apply now rather than waiting for me to post the job description. :)

Grooveshark is looking for a hardcore JavaScript developer

NOTE: Grooveshark is a web application, written with HTML/CSS, JavaScript, PHP, and ActionScript 3, not a web page or collection of webpages. This position involves working directly with the client-side application code in JavaScript: the “backend of the frontend” if you will.

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities:
Maintaining existing client-side code, creating new features and improving existing ones
Writing good, clean, fast, secure code on tight deadlines
Ensuring that client architecture is performant without sacrificing maintainability and flexible enough for rapid feature changes
Striking a balance between optimizing client performance versus minimizing load on the backend
Integration of third-party APIs

Desired Qualities:
Enjoy writing high quality, easy to read, self-documenting code
A passion for learning about new technologies and pushing yourself
A deep understanding of writing bug-free code in an event-driven, asynchronous environment
Attention to detail
A high LOC/bug ratio
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance and with occasional micromanagement
More pragmatic than idealistic

Experience:
Extensive JavaScript experience, preferably in the browser environment
Experience with jQuery
Experience with jQueryMX or another MVC framework
HTML & CSS experience, though writing it will not be a primary responsibility
Some PHP experience, though you won’t be required to write it
Knowledge of cross-browser compatibility ‘gotchas’
Experience with EJS, smarty, or other templating systems
Experience with version control software (especially git or another dvcs)

Bonus points for:
Having written a client application (in any language) that relies on a remote server for data storage and retrieval
Having written a non-trivial jQuery plugin
Experience with JavaScript/Flash communication via ExternalInterface
Experience with integrating popular web APIs (OAuth, Facebook, Twitter, Google, etc) into client applications
Experience with ActionScript 3 outside the Flash Professional environment (ie, non-timelined code, compiling with mxmlc or similar)
Experience developing on the LAMP stack (able to set up a LAMP install with multiple vhosts on your own)
Experience with profiling/debugging tools
Being well read in Software Engineering practices
Useful contributions to the open source community
Fluency in lots of different programming languages
BS or higher in Computer Science or related field
Being more of an ‘evening’ person than a ‘morning’ person
A passion for music and a desire to revolutionize the industry

Who we don’t want:
Architecture astronauts
Trolls
Complete n00bs (apply for internship or enroll in Grooveshark University instead!)
People who want to work a 9-5 job
People who would rather pretend to know everything than actually learn
Religious adherents to The Right Way To Do Software Development
Anyone who would rather use XML over JSON for RPC

Send us:
Resume
Code samples you love
Code samples you hate
Links to projects you’ve worked on
Favorite reading materials on Software Engineering (e.g. books, blogs)
What you love about JavaScript
What you hate about JavaScript (and not just the differences in browser implementations)
Prototypal vs Classical inheritance – what are the differences and how do you feel about each?
If you could change one thing about the way Grooveshark works, what would it be and how would you implement it?

 

If you want a job:  jay at groovesharkdotcom
If you want an internship: [email protected]

 

Grooveshark is Hiring (Part 1 – PHP edition)

16 May

Grooveshark is looking for talented web developers. We are looking to fill a wide variety of web dev positions. The next few posts I make will be job descriptions for each of the major positions we are looking to fill. For part 1, I’m listing our backend PHP position. If your skillset is more front-end leaning but you feel like you would be a good fit for Grooveshark, by all means apply now rather than waiting for me to post the job description. :)

 

Grooveshark is seeking awesome PHP developers.

Must be willing to relocate to Gainesville, FL and legally work in the US. Relocation assistance is available.
Responsibilities:
Maintaining existing backend code & APIs, creating new features and improving existing ones
Writing good, clean, fast, secure code on tight deadlines
Identifying and eliminating bottlenecks
Writing and optimizing queries for high-concurrency workloads in SQL, MongoDB, memcached, etc
Identifying and implementing new technologies and strategies to help us scale to the next level

Desired Qualities:
Enjoy writing high quality, easy to read, self-documenting code
A passion for learning about new technologies and pushing yourself
Attention to detail
A high LOC/bug ratio
Able to follow coding standards
Good written and verbal communication skills
Well versed in best practices & security concerns for web development
Ability to work independently and on teams, with little guidance and with occasional micromanagement
More pragmatic than idealistic

Experience:
Experience developing on the LAMP stack (able to set up a LAMP install with multiple vhosts on your own)
Extensive experience with PHP
Extensive experience with SQL
Some experience with JavaScript, HTML & CSS, though you won’t be required to write it
Some experience with lower level languages such as C/C++
Experience with version control software (especially dvcs)

Bonus points for:
Well read in Software Engineering practices
Experience with a SQL database and optimizing queries for high concurrency on large data sets.
Experience with noSQL databases like MongoDB, Redis, memcached.
Experience with Nginx
Experience creating APIs
Knowledge of Linux internals
Experience working on large scale systems with high volume of traffic
Useful contributions to the open source community
Fluency in lots of different programming languages
Experience with browser compatibility weirdness
Experience with smarty or other templating systems
BS or higher in Computer Science or related field
Experience with Gearman, RabbitMQ, ActiveMQ or some other job distribution/message passing system for distributing work
A passion for music and a desire to revolutionize the industry

Who we don’t want:
Architecture astronauts
Trolls
Complete n00bs (apply for internship or enroll in Grooveshark University instead!)
People who want to work a 9-5 job
People who would rather pretend to know everything than actually learn
Religious adherents to The Right Way To Do Software Development
Anyone who loves SOAP

Send us your:
Resume
Code samples you love
Code samples you hate
Favorite reading materials on Software Engineering (e.g. books, blogs)
Tell us when you would use a framework, and when you would avoid using a framework
ORM: Pros, cons?
Unit testing: pros, cons?
Magic: pros, cons?
When/why would you denormalize?
Thoughts on SOAP vs REST

If you want a job: jay at groovesharkdotcom

If you want an internship: [email protected]

 

Controlling Arduino via Serial USB + PHP

28 Dec

Paloma bought me an Arduino for Christmas, and I’ve been having lots of fun with it, first following some tutorials to get familiar with the platform. Of course, the real excitement for me is being able to control it from my computer, so I’ve started playing around with serial communication over the USB port.

I messed around with Processing a bit, and while Processing seems pretty cool and pretty powerful for drawing and making a quick interface, I want to be able to get up and running fast communicating with all of our existing systems, so I thought PHP would be ideal.

I’ve never used PHP to communicate over a serial port before, so I did some digging and actually came across an example of using PHP to communicate with an Arduino, but it wasn’t working properly for me. My LED was blinking twice every time, and if I changed it to a value comparison (i.e. sending 1 in PHP and checking for 1 on the Arduino) it never matched. It turns out that, by default, the COM port wasn’t running in the right mode from PHP. The solution is a simple additional line:

exec("mode com3: BAUD=9600 PARITY=N data=8 stop=1 xon=off");

…but make sure you adjust to match the settings appropriate for your computer.

Of course I had to write my own little demo once I got it working. This PHP script takes command line arguments and passes them along to the Arduino, which looks for ones and zeroes to turn the LEDs on or off. Unlike the example I linked to, I’m working with a shift register to drive 8 LEDs, but you’ll get the idea, and it should be obvious how to convert it to work with a single LED.

<?php
ini_set('display_errors', 'On');
// put the COM port in the right mode before opening it (Windows)
exec("mode com3: BAUD=9600 PARITY=N data=8 stop=1 xon=off");
$fp = fopen("COM3", "w");

foreach ($argv as $i => $arg) {
    if ($i > 0) { // skip the first arg since it's the name of the file
        print "writing " . $arg . "\n";
        $arg = chr($arg); // fwrite takes a string, so convert the number to a byte
        fwrite($fp, $arg);
        sleep(1);
    }
}
print "closing\n";
fclose($fp);
?>
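Running it looks something like this (assuming you saved the script as leds.php; each argument is sent as a single byte, one second apart):

php leds.php 1 0 1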

And for the Arduino:

// Pins connected to the 74HC595 shift register
const int latchPin = 4; // ST_CP
const int clockPin = 3; // SH_CP
const int dataPin = 2;  // DS
void setup() {
    pinMode(latchPin, OUTPUT);
    pinMode(dataPin, OUTPUT);  
    pinMode(clockPin, OUTPUT);
    Serial.begin(9600);
    //reset LEDs to all off
    digitalWrite(latchPin, LOW);
    shiftOut(dataPin, clockPin, MSBFIRST, B00000000);
    digitalWrite(latchPin, HIGH);
}

void loop() {
    byte val = B11111111;
    if (Serial.available() > 0) {
        byte x = Serial.read();
        if (x == 1) {
            val = B11111111;
        } else {
            val = B00000000;
        }

        digitalWrite(latchPin, LOW);
        shiftOut(dataPin, clockPin, MSBFIRST, val);
        digitalWrite(latchPin, HIGH);
        delay(500);
    }
}

For bonus debugging points, if you’re using a shift register like this, send x instead of val to shiftOut, and you can actually see which bytes are being sent, very handy if you’re not getting what you are expecting (like I was)!

 

 

Improve Code by Removing It

17 Oct

I’ve started going through O’Reilly’s 97 Things Every Programmer Should Know, and I plan to post the best ones (the ones I think we do a great job of following at Grooveshark, and the ones I wish we did better) here at random intervals.

The first one is Improve Code by Removing It.

It should come as no surprise that the fastest code is code that never has to execute. Put another way, you can usually only go faster by doing less. And of course, code that never runs also exhibits no bugs. :)

Since I started at Grooveshark, I’ve deleted a lot of code: 20% more code than I’ve ever written, and more code than all but one of our developers has contributed. Despite that, I think we’ve only ever actually removed one feature, and we’ve added many more.

One of the things I see in the old code is an overenthusiastic attempt to make everything extremely flexible and adaptive. The original authors obviously made a noble effort to imagine every possible future scenario that the code might some day need to handle, and then come up with an abstraction that could handle all of those cases. The problem is, those scenarios almost never come up. Instead, different features are requested which do not fit the goals of the original abstraction at all, so you end up having to work around it in weird ways that make the code more difficult to understand and less efficient.

Let me try to provide a more concrete example so you can see what I’m talking about. We have an Auth class that really only needs to handle 3 things:

  • Let users log in (something like Auth->login(username, password))
  • Let users log out (something like Auth->logout())
  • Find out who the logged in user is, if a user is logged in (something like Auth->getUser())

Should be extremely straightforward, right? Well, the original author decided that the class should allow for multiple authentication strategies over various protocols in some scenarios that could never possibly arise (such as not having access to sessions) even though at the time only one was needed. Instead of ~100 lines of code to just get the job done and nothing else, we ended up with 1,176 lines spanning 5 files. The vast majority of that code was useless; our libraries are behind other front-facing code so the protocol and “strategy” for authenticating is handled at one level higher up, and we always use sessions so that no matter how a user logged in, they are logged in to all Grooveshark properties. When we finally did add support for a truly new way to log in (via Facebook Connect), none of that code was useful at all because Facebook Connect works in a way the author could never have anticipated 2 years ago. Finally, because the original author anticipated a scenario that cannot possibly arise (that we might know the user’s username but not their user ID), fetching the logged-in User’s information from the database required a less efficient lookup by username rather than a primary key lookup by ID.
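For contrast, the whole straightforward version needs little more than this (a sketch, not our actual code; the User lookup helpers are hypothetical, and it assumes sessions have already been started):

<?php
class Auth
{
    public static function login($username, $password)
    {
        $user = User::getByUsername($username); // hypothetical lookup helper
        if ($user && $user->checkPassword($password)) {
            $_SESSION['userID'] = $user->id;
            return true;
        }
        return false;
    }

    public static function logout()
    {
        unset($_SESSION['userID']);
    }

    public static function getUser()
    {
        if (empty($_SESSION['userID'])) {
            return null;
        }
        return User::getByID($_SESSION['userID']); // primary key lookup by ID
    }
}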

Let’s step back a moment and pretend that the author had in fact been able to anticipate how we were going to incorporate Facebook Connect and made the class just flexible enough and just abstract enough in just the right ways to accommodate that feature that we just now got around to implementing. What would have been the benefit? Well, most of the effort of implementing that feature is handling all of the Facebook specific things, so that part would still need to be written. I’d say at best it could have saved me from having to write about 10 lines of code. In the meantime, we would have still been carrying around all of that extra code for no reason for a whole two years before it was finally needed!

Let’s apply YAGNI whenever possible, and pay the cost of adding features when they actually need to be added.

Edit:
Closely related: Beauty is in Simplicity.

 

Why You Should Always Wrap Your Package

30 Jul

Ok, the title is a bit of a stretch, but it’s a good one isn’t it?

What I really want to talk about is an example of why it’s a good idea to make wrappers for PHP extensions instead of just using them directly.

When Grooveshark started using memcached ever-so-long-ago, with the memcache pecl extension, we decided to create a GMemcache class which extends memcache. Our main reason for doing this was to add some convenience (like having the constructor register all the servers) and to add some features that the extension was missing (like key prefixes). We recently decided that it’s time to move from the stagnant memcache extension to the pecl memcached extension, which is based on libmemcached, which supports many nifty features we’ve been longing for, such as:

  • Binary protocol
  • Timeouts in milliseconds, not seconds
  • getByKey
  • CAS
  • Efficient consistent hashing
  • Buffered writes
  • Asynchronous I/O

Normally such a transition would be a nightmare. Our codebase talks to memcached in a million different places. But since we’ve been using a wrapper from day 1, I was able to make a new version of GMemcache with the same interface as the old one, that extends memcached. It handles all the minor differences between how the two work, so all the thousands of other lines in the app that talk to memcached do not have to change. That made the conversion a <1 day project, when it probably would have otherwise been a month long project. It also has the advantage that if we decide for some reason to go back to using pecl memcache, we only have to revert one file.
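To make the pattern concrete, the shape of such a wrapper is roughly this (a hedged sketch against the pecl memcached 2.x API, not our actual GMemcache, which handles much more):

<?php
class GMemcache extends Memcached
{
    private $prefix;

    public function __construct(array $servers, $prefix = '')
    {
        parent::__construct();
        $this->prefix = $prefix;
        $this->addServers($servers); // the constructor registers all the servers
    }

    public function set($key, $value, $expiration = 0)
    {
        return parent::set($this->prefix . $key, $value, $expiration);
    }

    public function get($key, $cache_cb = null, &$cas_token = null)
    {
        return parent::get($this->prefix . $key, $cache_cb, $cas_token);
    }
}

Because every call site talks to GMemcache instead of the extension directly, swapping memcache for memcached (or back) only touches this one file.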

 

PHP Order of Operations Gotcha

04 Sep

PHP’s decision to give addition, subtraction and string concatenation equal precedence has caused some difficult-to-track-down bugs on several occasions. It’s so non-intuitive that I have difficulty remembering this one, and hence keep writing wrong code.

Example:

$tacos = "Robots: " . 1 + 2 . " for the win!";

I think a normal human would expect $tacos to be equal to “Robots: 3 for the win!”. But the result is actually “2 for the win!”

What gives? Well, the PHP docs say that plus, minus and string concatenation all get equal precedence, with left associativity. So, going from left to right:

"Robots: " . 1 => "Robots: 1" (so far, so good)
"Robots: 1" + 2 => (int)"Robots: 1" + 2 => 0 + 2 ("Robots: 1" converts to 0)
2 . " for the win!" => "2 for the win!" D’oh!

I think it would make a lot more sense for string concatenation to take a lower precedence than any arithmetic.

The correct way to write the above code is:

$tacos = "Robots: " . (1 + 2) . " for the win!";

 

 

Bypassing Magic

18 Aug

In my post about how we are adding client-side caching to Grooveshark 2.0, I mentioned one of the ways we are taking advantage of the fact that, thanks to using Flash, we have a full-blown stateful application.

As Grooveshark evolves and the application becomes more sophisticated, the PHP layer is more and more becoming just an interface to the DB. The application just needs the data; it knows exactly what to do with it from there. It also only needs to ask for one type of data at a time, whereas a traditional webpage would need to load dozens of different pieces of information at the same time. So for our type of application, magic methods and ORM can really just get in the way when all we really need is to run a query, fill up an array with the results of that query, and return it.

Our old libraries, employing ORM, magic methods and collections, were designed to meet the needs of a typical website and don’t necessarily make sense for a full-fledged application. On a webpage, you might only show 20 results at a time, so the overhead of having a bunch of getters and setters automatically fire whenever you load up your data is probably not noticeable. But in an application, you often load far more results than can be displayed, and allow the user to interact with them more richly. When you’re loading 500 or 5,000 results as opposed to 20, the overhead of ORM and magic can really start to bog you down. I first noticed the overhead issue when testing new method calls for lite2, when in some cases fetching the data would take over 30 seconds, triggering my locally defined maximum execution time, even when the data was already cached.

Like any responsible developer considering changing code for performance reasons, I profiled our collections code using XDebug and KCachegrind, and then I rewrote the code to bypass collections, magic and all that stuff, loading data from the DB (or memcache) into an array and returning it. The difference? In the worst case, bypassing magic was an order of magnitude less work, and often far better than that. My >30 second example took less than 1 second in the new code.
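The “bypass” version is nothing fancy; in spirit it is just this (a sketch assuming PDO, with hypothetical table and column names; the real code also checks memcache first):

<?php
function getSongsByArtist(PDO $db, $artistID)
{
    $stmt = $db->prepare('SELECT SongID, Name, AlbumID FROM Songs WHERE ArtistID = ?');
    $stmt->execute(array($artistID));
    // plain arrays straight through to the client; no getters or setters fire
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}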

For Grooveshark 2.0 code, wherever makes sense, we are bypassing magic, ORM and collections and loading data directly. This of course means that Grooveshark is faster, but it also means that we can load more data at once. In most cases we can now afford to load up entire lists of songs without having to paginate the results, which in turn means fewer calls to the backend *and* much less work for the database. Whenever you must LIMIT results, you must also ORDER BY the results so they come back in an order that makes sense. Not having to ORDER results means in many cases we save an expensive filesort which often requires a temporary table in MySQL. Returning the full data set also allows the client to do more with the data, like decide how the results should actually be sorted and displayed to the user. But that’s another post…

 
 

Detect crawlers with PHP faster

08 Apr

At Grooveshark we use DB-based PHP sessions so they can be accessed across multiple front-end nodes. As you would expect, the sessions table is very “hot,” as just about every request to do anything, ever, requires using a session. We noticed that web crawlers like Google end up creating tens of thousands of sessions every day, because they of course do not carry cookies around with them.

The solution? Add a way to detect crawlers, and don’t give them sessions. Most of the solutions I’ve seen online look something like this:

function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    foreach ($crawlers as $c) { // note: the version I found online had a typo here ($crawler)
        if (stristr($USER_AGENT, $c[0])) {
            return $c[1];
        }
    }
    return false;
}

Essentially, it loops over the entire list of possible clients, searching the user agent string for each one, one at a time. This seemed way too slow and inefficient for something that has to run on essentially every call on a high-volume website, so I rewrote it to look like this:
public static function getIsCrawler($userAgent)
{
    $crawlers = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
        'AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|' .
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    // the /i keeps the match case-insensitive, like stristr in the original
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

In my not-very-scientific testing, running on my local box, my version takes 11 seconds to do 1 million comparisons, whereas looping through an array of crawlers takes 70 seconds to do the same. So there you have it: using a single regex for string matching rather than looping over an array can be more than 6 times faster. I suspect, but have not tested, that the performance gap gets bigger the more strings you are testing against.
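For completeness, here is roughly how it gets used, so crawlers never create a session row at all (a sketch; the class name and surrounding structure are hypothetical):

<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!Bot::getIsCrawler($ua)) {
    session_start(); // only real users get a DB-backed session
}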