Leaky Abstractions

Categories: Coding grooveshark software engineering

As with nearly every issue in Software Engineering worth thinking about, Joel Spolsky has written an article about leaky abstractions that is very relevant to some problems I ran into tonight.

“Abstractions do not really simplify our lives as much as they were meant to. [...] all abstractions leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning [...] and all this means that paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder.”

In my case, I was not working with programming tools per se, more of an abstraction of an abstraction built into the Grooveshark framework that is meant to make life as a programmer easier. And normally it does. But in this case, that abstraction was wrapped in a couple more layers of abstraction away from where my code needed the information, and somewhere in there the information was, for lack of a better description, being mangled. The particular form of mangling is actually due to a lower level abstraction at the DB layer, but is not handled in the higher level of abstraction because for most uses it doesn’t matter.

From the level of abstraction where my code was sitting, the information needed to un-mangle the data was simply not available. I went through the various layers to find a convenient place to put the un-mangling, but by the time I found a place, everything was so abstract that I couldn’t confidently modify that code and foresee all potential ramifications, so I just did an end-run around many of the layers of abstraction. This is certainly not ideal, but it works. The point is that an abstraction intended to make life as a programmer slightly easier most of the time can easily make it significantly more difficult in edge cases. Unfortunately, once a code base has been established, most cases are edge cases: feature additions, new products built on top of the old infrastructure, etc., all or most unforeseen, and therefore unaccounted for during the original design process.

Who will monitor the monitors?

Categories: Coding software engineering

This is the second time that an error in a monitoring/testing tool that we use has caused me to waste a good deal of time trying to solve a non-problem.

The first was when we were using a tool to test the scaling performance of a new feature. According to this tool the performance was absolutely abysmal with only 50 virtual clients, and I spent a couple of days writing alternative algorithms and trying to test them with this tool. The scalability was horrible no matter what I did, so I added some logging to see if something was going on that wasn’t supposed to be. It was. The testing tool was sending completely wrong information, creating completely unrealistic scenarios. The testing implementation was done by another developer who was sure that he had set it up right, and I was not familiar with the tool so I assumed that it was a problem with my code. Having another developer write the tests was supposed to save me time.

The latest monitoring snag was caused by a monitoring tool that, for our convenience, tails the PHP error log and sends the last couple hundred lines to us whenever there is an error. We kept getting the same error periodically and could not track down the source; everything seemed to be working perfectly, but we still got the error. Site usage is breaking records every day, so we assumed that it had something to do with the unprecedented load on the servers, but we could never pinpoint the source of the errors. Today I looked a bit more closely at the error log only to discover that the log was from 3 days ago!

So the question is, when your testing or monitoring tools report an error, how do you ensure that the error is real and not just in the tool itself?
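One partial answer for the stale-log case is to have the monitor check how old the log is before it reports anything. This is only a minimal sketch under assumptions: the logIsFresh() helper, the 300-second threshold, and the temporary file standing in for the real error log are all hypothetical, not part of the tool we actually use.

```php
<?php
// Hypothetical helper: true when the log at $path was modified within
// $maxAgeSeconds of $now. Names and threshold are illustrative assumptions.
function logIsFresh(string $path, int $maxAgeSeconds, ?int $now = null): bool
{
    $now = $now ?? time();
    $mtime = @filemtime($path);
    if ($mtime === false) {
        return false; // missing or unreadable log: treat as not fresh
    }
    return ($now - $mtime) <= $maxAgeSeconds;
}

// Demo: a temporary file stands in for the real error log.
$log = tempnam(sys_get_temp_dir(), 'errlog');
file_put_contents($log, "PHP Warning: example\n");

// Just written, so it counts as fresh.
$freshNow = logIsFresh($log, 300);

// Pretend three days have passed: the same file is now stale.
$staleIn3Days = logIsFresh($log, 300, time() + 3 * 86400);

unlink($log);
var_dump($freshNow, $staleIn3Days);
```

A check like this only catches one kind of tool failure (reporting old data), but it would have flagged the three-day-old log before anyone went hunting for a load problem.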

On being a DJ

Categories: Coding grooveshark life music

When I was in college (oh so long ago…) I was a DJ for our radio station, and then I was a music director. I loved being a DJ: having lots of new, interesting and unreleased music on tap, from Smashing Pumpkins to Underwater Boxer; having a channel to share that music with other people; being able to make a small band’s day by playing their stuff and reporting it to CMJ. Well, there was one part I didn’t care for so much: talking on the radio. I’m a bit shy, which is why although I loved being a DJ and music director at Eckerd College, I knew it wasn’t ever going to be a career path for me.

It’s interesting, then, that I work at Grooveshark where much of that dream is being fulfilled by participating in this movement. The one piece that is missing is having a channel to share music with other people and subsequently helping small bands by making them more discoverable. Well, now with the release of Autoplay in Grooveshark Lite, it’s kind of like I get to be everybody’s DJ. Of course a computer scientist would write a DJing program rather than doing the manual labor of DJing.

As Professor Fishman, the best professor who ever lived, was fond of saying in our classes, a computer scientist isn’t satisfied with just using computers to put other people out of a job, they won’t settle until they manage to put themselves out of a job too. To be fair, he usually talked about that in the context of AI and specifically programming languages such as LISP, where the program can rewrite itself, but I think it applies here as well.

Now I get to be everyone’s DJ, but with everyone’s help too. If the system is currently a bad DJ, keep giving it feedback and it will learn. Imagine if you got to call up your local radio station and yell at them every time they played something you didn’t like, and congratulate them every time they played something you liked. If they didn’t block your phone number, you’d end up with the ultimate radio station for you, and that’s what Grooveshark aims to be, although we admit it will take some time to get there.

Check out Autoplay, and let me know what you think.

Making it easy

Categories: Coding grooveshark

There is an effort underway at GS to modify our framework to make writing code easier than ever before (for the second time). It’s a great idea in theory and the results of what I have seen so far certainly make doing certain things more convenient, which is great when you’re writing code.

But I can’t help but feel that we are perhaps barking up the wrong tree a bit.

Peter Hallam points out that programmers spend most of their time reading code, not writing it. So rather than focusing on making code easier to write, we should be sure that we are making it easy to read, understand and modify. Peter surmises that a 10% reduction in the time it takes to understand code is equivalent to a 100% reduction in the time that it takes to write code. That’s very significant.

One of the mantras I have heard a bit too much is “always favor composition over inheritance.” As Phil Haack points out, composition is great sometimes, but it’s not a perfect design either (because there isn’t one). Personally, I think it’s best to keep both composition and inheritance in mind and always prefer whichever one is going to lead to easy-to-understand code. Many times I find that to be inheritance. Sometimes the solution is even minor code duplication, such as having each page that requires authentication do an explicit auth check, rather than having the framework infer whether or not an auth check is required based on the name or, and I shudder at the thought, comments in the code.
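To make the explicit-check idea concrete, here is a minimal sketch. The requireAuth() and renderAccountPage() functions and the 'userID' session key are hypothetical names made up for illustration; they are not taken from the Grooveshark framework.

```php
<?php
// Hypothetical helper: refuses to continue when no user is logged in.
function requireAuth(array $session): void
{
    if (empty($session['userID'])) {
        throw new RuntimeException('Authentication required');
    }
}

// Hypothetical page: the auth check is one duplicated line, but a reader
// sees at a glance that this page is protected.
function renderAccountPage(array $session): string
{
    requireAuth($session);
    return 'Account page for user ' . $session['userID'];
}

$page = renderAccountPage(array('userID' => 42));

try {
    renderAccountPage(array());
    $rejected = false;
} catch (RuntimeException $e) {
    $rejected = true;
}

var_dump($page, $rejected);
```

The cost is one repeated line per protected page; the benefit is that nobody has to know a naming convention or read framework internals to tell whether a page is protected.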

Quip

Categories: Coding

Much like with financial investments, past performance of software is not necessarily indicative of future results.

It’s a good thing to keep in mind when debugging: test even the stuff you know isn’t broken.

Be careful what you return

Categories: Coding grooveshark

(and how you handle what has been returned)

Things have been busy at Grooveshark, as usual. These past couple of days I have been hunting a bug both Cthulhu-like in its scary-strangeness and ninja-like in its stealthy manner. I went through all my code related to this particular project several times with a fine-toothed comb and didn’t catch it until today.

Turns out it wasn’t so strange after all. The fault was definitely mine, but PHP’s quirks certainly didn’t help matters any.
I was using array_search in a straightforward manner, not to find the particular position of an item but to find whether or not the item was in the array at all. The one thing about array_search, especially in the context of PHP’s loose typing, is that if it finds nothing it returns false. Of course, PHP happily treats false as zero, so how do you tell whether array_search is saying that the item wasn’t found, or that it’s the first element in the array (index 0)? You have to check using strict comparison (===), which I remembered, so I wrote my code like this:


$found = array_search('something', $arr);
if ($found === false) {
    // handle what you do when it's not in there
}

Then, a little while ago, the specs changed and there was another case that had to be handled exactly the same way as when 'something' wasn’t in $arr. So I did this:


$found = array_search('something', $arr);
$found = $found || (SOME_OTHER_CASE);
if ($found === false) {
    // handle what you do when it's not in there
}

In other words, when I went back and looked at that code later, I didn’t notice the triple equals instead of the double, so without much thought I assumed that $found was already a boolean, when it was really only a sometimes-boolean. The || operator then turned an index of 0 into false, which made “found at position zero” indistinguishable from “not found.” In a strictly typed language this mistake would, of course, not have been possible. More practically, if array_search returned something other than practically-zero, I would have been able to explicitly handle that special case and store the result of that explicit handling as a boolean. If you read the documentation you will see that array_search actually returned NULL before version 4.2.0. I have to wonder why they decided to change it.

This bug was so hard to find because it only showed up when the item being searched for was the first item in the array, which it turns out is not that often. By the time I found it, there were hundreds of newer lines of code to check first.
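For anyone who wants to see the failure in isolation, here is a small sketch. The $otherCase variable is a stand-in for SOME_OTHER_CASE, and the array contents are made up; the two "safer" variants are just options, not necessarily what ended up in the real code.

```php
<?php
$arr = array('something', 'else');  // 'something' sits at index 0
$otherCase = false;                 // hypothetical stand-in for SOME_OTHER_CASE

// Buggy version: array_search returns int 0 here, and 0 is falsy,
// so (0 || false) collapses to false even though the item was found.
$found = array_search('something', $arr);
$buggy = $found || $otherCase;

// Safer: convert the result to a real boolean before combining it.
$fixed = (array_search('something', $arr) !== false) || $otherCase;

// in_array() answers the membership question directly;
// the third argument enables strict comparison.
$viaInArray = in_array('something', $arr, true) || $otherCase;

var_dump($buggy, $fixed, $viaInArray);
// bool(false)
// bool(true)
// bool(true)
```

Move 'something' to any other position in $arr and all three variants agree, which is exactly why the bug stayed hidden for so long.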

Now that the bug is solved and now that I am far into this very technical post I think it’s safe to leak a tiny bit of information about what you, dear user, can expect to see in Grooveshark Lite in the near future: autoplay. We have decided that we want to be your personal DJ. Our tack on this feature aims to get around the chicken and egg problem: how do you build recommendations without user feedback, and how do you get user feedback if you don’t have recommendations to make them want to use the system? I’m not going to answer that question directly, but we hope that you will find the autoplay sessions to be enjoyable, and as you provide feedback to the system, we’ll take that data and make it even better.

SQL Schema and Graphs/Maps

Categories: Coding SQL

A while ago I wrote about my auto-query generator project. I only just recently got around to finishing it up because other things had higher priority, and also because I wasn’t entirely convinced that I was doing things the best way, and I wanted to take some time to experiment.

Matt sat down with me and analyzed the problem, and we decided that we could use the schema to create a graph with all of the edges (our ID columns are consistently named in each table), and then use a shortest-path-finding algorithm, and then I could write a SQL generator that works off of the path. Getting all of the IDs in our tables in MySQL is pretty easy:
SELECT DISTINCT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%ID'

Then getting all the tables for a given ID:
SELECT TABLE_NAME, COLUMN_KEY FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = '$id'
AND TABLE_SCHEMA = 'yourschemahere'

From that information, I simply built a graph which I represented with an adjacency map.

Unfortunately, that did not work. The path finding algorithm was basically too good, finding paths that were the shortest, but not necessarily the correct way to get from one table to another. For example, two tables might contain a GenreID, but maybe they are actually linked by ArtistID. Ok, so what about only making an edge when that column is a primary key in one of the nodes representing a table? That wasn’t hard to do, either, but it still gave wrong results in some cases. Sometimes it’s just more efficient (but wrong) to route through the Genres table than go the right way.
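The "too good" path finder is easy to reproduce on a toy schema. The sketch below is an assumption-heavy illustration: the column lists (including a GenreID column on Users) are invented, and breadth-first search stands in for whatever shortest-path algorithm was actually used. Edges follow the primary-key rule described above: two tables are linked when one table's primary key appears as a column in the other.

```php
<?php
// Hypothetical schema: 'pk' is the primary key (null if none is modeled),
// 'ids' lists every ID column in the table. All columns are made up.
$schema = array(
    'Users'      => array('pk' => 'UserID',  'ids' => array('UserID', 'GenreID')),
    'UsersFiles' => array('pk' => null,      'ids' => array('UserID', 'FileID')),
    'Files'      => array('pk' => 'FileID',  'ids' => array('FileID', 'SongID')),
    'Songs'      => array('pk' => 'SongID',  'ids' => array('SongID', 'GenreID')),
    'Genres'     => array('pk' => 'GenreID', 'ids' => array('GenreID')),
);

// Build an undirected adjacency map using the primary-key rule.
function buildGraph(array $schema): array
{
    $graph = array_fill_keys(array_keys($schema), array());
    foreach ($schema as $a => $defA) {
        foreach ($schema as $b => $defB) {
            if ($a !== $b && $defA['pk'] !== null
                && in_array($defA['pk'], $defB['ids'], true)) {
                $graph[$a][$b] = true;
                $graph[$b][$a] = true;
            }
        }
    }
    return $graph;
}

// Plain breadth-first search: returns the path with the fewest hops.
function shortestPath(array $graph, string $from, string $to): ?array
{
    $prev = array($from => null);
    $queue = array($from);
    while ($queue) {
        $node = array_shift($queue);
        if ($node === $to) {
            $path = array();
            for ($n = $to; $n !== null; $n = $prev[$n]) {
                array_unshift($path, $n);
            }
            return $path;
        }
        foreach (array_keys($graph[$node]) as $next) {
            if (!array_key_exists($next, $prev)) {
                $prev[$next] = $node;
                $queue[] = $next;
            }
        }
    }
    return null; // no route between the two tables
}

$path = shortestPath(buildGraph($schema), 'Users', 'Songs');
echo implode(' -> ', $path), "\n";
// Users -> Genres -> Songs
```

The search is doing its job: two hops through Genres really is shorter than three hops through UsersFiles and Files. It just answers "which songs share a genre with this user" instead of "which songs are in this user's library."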

I considered making a directed graph so that connections would only be one-way to the table with the ID as a primary key, but I realized that wouldn’t work either, because sometimes you do need to join tables based on IDs that are not primary keys. Essentially, our schema does not completely represent the full complexity of the relationships that it contains.

So I went back to my original method, which was to map out the paths by hand. Tedious though it may have been, it’s still a pretty clever solution, in my opinion.

I created two maps. The first simply says “if you have this ID, and you are trying to get to this table, start at this other table,” for every possible ID, and the next one simply says “if you’re at this table, and you’re trying to get to this other table, here is the next table you need to go through.”

The great thing about this is that most of those steps can be reused, but I only had to create them once. For example, it’s always true that to get from Users to Files you must go through UsersFiles, no matter what your starting point is, although you may be trying to find all of the Songs, Albums or Artists that a user has in their library.

Having spelled things out this way, there is no guesswork for a path finding algorithm, because there is literally only one path. In fact it hardly counts as a path finder or an algorithm; it just iterates through the map until it reaches its destination. And it works. I will post as many details as I’m allowed about exactly how the actual SQL building algorithm works, and about how I am able to merge multiple paths, so for example you can have an ArtistID and a PlaylistID to get all of the artists on a given playlist. Stay tuned.
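Until then, here is a rough sketch of how two hand-written maps like these can be walked. The map contents, the resolvePath() function, and the assumption that Files links directly to Songs are all illustrative; only the Users to UsersFiles to Files step comes from the example above.

```php
<?php
// Map 1 (hypothetical contents): given an ID and a target table,
// which table do we start at?
$startAt = array(
    'UserID' => array('Files' => 'Users', 'Songs' => 'Users'),
);

// Map 2 (hypothetical contents): given the current table and the target,
// which table comes next?
$nextHop = array(
    'Users'      => array('Files' => 'UsersFiles', 'Songs' => 'UsersFiles'),
    'UsersFiles' => array('Files' => 'Files',      'Songs' => 'Files'),
    'Files'      => array('Songs' => 'Songs'),
);

// Follow the maps until the target is reached. There is no searching:
// every step is a single lookup.
function resolvePath(array $startAt, array $nextHop, string $id, string $target): array
{
    if (!isset($startAt[$id][$target])) {
        throw new InvalidArgumentException("No start table for $id -> $target");
    }
    $table = $startAt[$id][$target];
    $path = array($table);
    while ($table !== $target) {
        if (!isset($nextHop[$table][$target])) {
            throw new RuntimeException("No next hop from $table to $target");
        }
        // Guard against a typo in the hand-written map creating a loop.
        if (count($path) > count($nextHop) + 1) {
            throw new RuntimeException("Cycle detected on the way to $target");
        }
        $table = $nextHop[$table][$target];
        $path[] = $table;
    }
    return $path;
}

$path = resolvePath($startAt, $nextHop, 'UserID', 'Songs');
echo implode(' -> ', $path), "\n";
// Users -> UsersFiles -> Files -> Songs
```

Note how the UsersFiles and Files entries are shared: a path to Albums or Artists would reuse the same first hops and only add entries for the tables past Files.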

Let me drive

Categories: Coding

One odd quirk that I have noticed about myself as I have been building a reputation as an‡ SQL guru at Grooveshark, and therefore being regularly pulled aside to look at queries, is that I have a hard time thinking about an SQL query when I’m just looking at it over someone’s shoulder. It’s even harder for me to think about how I would change the query.

For some reason, I need to be in the proverbial driver’s seat. Let me sit down in front of the screen, give me a GUI text editor that I can use easily (vim does not qualify), and my brain is prepared to evaluate the problem. I may not even need to type anything out in the process of solving the problem, but the brain juice just won’t even start to flow if I don’t have that.

I seem to have that problem much less when looking at PHP (although I prefer to look at the code in my IDE of choice), and I’m not entirely sure why that is. The only thing I can think of is that maybe SQL requires a higher level of abstract thought than PHP does most of the time, so I am more dependent on having the right setup before I can get into the right mode?

Do any other coders out there have this problem?

‡ ‘an’ is the correct usage here because I expect you to read that as S Q L, not Sequel.

OhHai->I->HasA(UPDAET)

Categories: Coding

As an update to my previous lolcode post, we are fixing the GetGenre()->GetGenre() issue by calling them names: GetGenre()->GetName()

There is talk of adding __toString() functions to classes like Genres and Tags but I tend to not be a fan of automagic functions. __toString() would enable us to just call GetGenre() and if we treat the resulting object as a string, it will call GetName() behind the scenes, and if we treat it like an object it will still be an object. That is a “neat” language feature, but I believe it leads to obscurity and inconsistent behavior in certain cases.

For example, if the object is not directly treated as a string even though it needs to be a string, __toString() will not be called, and problems will ensue. Confusing problems, because the object acts like a string, sometimes.

sloppy example code:

class notAString
{
    public $what;
    public function __construct($val)
    {
        $this->what = $val;
    }
    public function __toString()
    {
        return $this->what;
    }
}
$whatIsIt = new notAString("a string");
$isAString = is_string($whatIsIt);
$isAnObject = is_object($whatIsIt);
var_dump(array('isAString' => $isAString, 'isAnObject' => $isAnObject));
echo $whatIsIt;

Output:
array
  'isAString' => boolean false
  'isAnObject' => boolean true
a string

It fails the is_string check, so if you pass the object to a function that expects a string, and the function is smart enough to check for a string before doing anything with it, your call is going to fail and you’re going to be scratching your head wondering why.
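A short sketch of that failure and two ways around it. The Genre class and the shout() function here are hypothetical stand-ins, not our real classes; shout() simply plays the role of "a function that is smart enough to check for a string."

```php
<?php
// Hypothetical stand-in for a class with an automagic __toString().
class Genre
{
    private $name;

    public function __construct($name)
    {
        $this->name = $name;
    }

    public function GetName()
    {
        return $this->name;
    }

    public function __toString()
    {
        return $this->name;
    }
}

// Hypothetical function that validates its input before using it.
function shout($text)
{
    if (!is_string($text)) {
        return false; // objects are rejected, even ones with __toString()
    }
    return strtoupper($text) . '!';
}

$genre = new Genre('Jazz');

$direct   = shout($genre);             // object passed as-is: rejected
$cast     = shout((string) $genre);    // explicit cast invokes __toString()
$explicit = shout($genre->GetName());  // explicit getter, no magic involved

var_dump($direct, $cast, $explicit);
// bool(false)
// string(5) "JAZZ!"
// string(5) "JAZZ!"
```

Both workarounds require the caller to know it is holding an object, which is the same knowledge needed to call GetName() in the first place.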

Now imagine how confusing this would be if you were trying to debug a piece of code that you had no hand in writing. You see this object being used as a string, only you don’t know it’s an object, because it’s being used as a string and that part of the code works. “It should be declared right there, just look and you’ll see it’s an object.” Sure, or it could be passed in from another function and you haven’t looked that far up the ladder yet.

Worse, you finally figure out that it’s an object, and now you can’t figure out why it’s successfully being treated as a string elsewhere. You look at the class and you don’t see a __toString() function. You look at the parent class, no __toString() there either. Ah well, a red herring, time to move on right? Or did you give up before looking at the parent class’s parent? Was there a __toString() there? How much time was wasted trying to find that, compared to how much time the automagic __toString() function might save you as a developer?

I’d wager it’s not worth the lost time, and the added frustration.

OhHai->IHasASong()->ICanHasGenre()

Categories: Coding

Sometimes naming conventions can have weird side effects.

Consider this example:
Our tables and fields are named in CamelCase (or StudlyCaps, or StudleyCamels as we like to call them). This is not my preferred way of naming tables/fields, but it’s what we’ve got, so we work with it. We* decided that Genres are entities, so the Genres table contains a Genre field, which is a VARCHAR.
In PHP we are using our own flavor of ORM that avoids the pitfalls of most ORM systems while keeping the benefits. Our database objects all have Get methods for each object property (which may or may not be a field in the table that the object represents). If you want a song’s Genre, you call Song->GetGenre(), which gives you a Genre object. If you then want the text-representation of the Genre, you have to call…GetGenre(). So if you want, from the Song object, the text representation of the Song’s Genre, you call: Song->GetGenre()->GetGenre()
We might as well have made it say Song->GetGenre()->PrettyPlease() or ->NoSeriously()
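Stripped down to almost nothing, the collision looks like the sketch below. These two classes are hypothetical, hand-written stand-ins; our real ORM objects are not built this way, and only the method names come from the description above.

```php
<?php
// Hypothetical stand-in for the Genres entity. Its getter is named after
// the Genre field, which is where the trouble starts.
class Genre
{
    private $genre;

    public function __construct($genre)
    {
        $this->genre = $genre;
    }

    public function GetGenre()
    {
        return $this->genre; // the VARCHAR text
    }
}

// Hypothetical stand-in for a Song. Its getter is named after the
// related entity, and happens to have the same name.
class Song
{
    private $genre;

    public function __construct(Genre $genre)
    {
        $this->genre = $genre;
    }

    public function GetGenre()
    {
        return $this->genre; // a Genre object, not text
    }
}

$song = new Song(new Genre('Jazz'));
$text = $song->GetGenre()->GetGenre();
echo $text, "\n";
// Jazz
```

Same method name, two different return types, and nothing at the call site to tell a reader which GetGenre() hands back an object and which hands back text.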

At least we aren’t writing our code lolcats style (see: lolcode) or it might look more like the title:
OhHai->I->HasASong()->CanHasGenre()->CanHasGenreNoowwwwwwws()

*actually someone else decided before I ever started working here

Kthxbye.