
Archive for the ‘software engineering’ Category

You don’t know when you’re done if you don’t know what done is

08 Aug

As I mentioned in an earlier blog post, we are working hard on a 2.0 version of the product. One of the questions our team is asked quite frequently is “when will it be ready?” This question is impossible for us to answer, because ready is not defined. It’s one of the dangers of working without a real spec.

There are a lot of reasons why we don’t use a real spec, none of which are up to me, so I’ll not discuss them here.

Whenever 2.0 gets close to what everyone thinks we have agreed on, people look and poke at it, decide they don’t like things, or realize that nobody ever asked for critical feature X or that it somehow didn’t make it onto our bug list, and then we have to go back and get new designs, file a bunch of bugs, and set a new milestone. Repeat, repeat, repeat. It’s not just 2.0 that works this way; it’s any project without a complete spec. When the goal posts for done are constantly moving, the question of “when will it be done?” is really a question of “when will the goal posts stop moving?”

To break the cycle, we’re picking a done date, and mandating that the goal posts stop moving some time before that date. Working towards that, we’ve submitted our “last chance” milestone, meaning after this milestone any decisions/changes/designs must be final, because we’re going to call it done when those have been implemented, or when we hit our chosen date, whichever comes first.

 

Leaky Abstractions

12 Sep

As with nearly every issue in Software Engineering worth thinking about, Joel Spolsky has written an article about leaky abstractions that is very relevant to some problems I ran into tonight.

Abstractions do not really simplify our lives as much as they were meant to. [...] all abstractions leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning [...] and all this means that paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder

In my case, I was not working with programming tools per se, more of an abstraction of an abstraction built into the Grooveshark framework that is meant to make life as a programmer easier. And normally it does. But in this case, that abstraction was wrapped in a couple more layers of abstraction away from where my code needed the information, and somewhere in there the information was, for lack of a better description, being mangled. The particular form of mangling is actually due to a lower level abstraction at the DB layer, but is not handled in the higher level of abstraction because for most uses it doesn’t matter.

From the level of abstraction where my code was sitting, the information needed in order to un-mangle the data was simply not available. I went through the various layers to find a convenient place to put the un-mangling, but by the time I found a place, everything was so abstract that I couldn’t confidently modify that code and see all potential ramifications, so I just did an end-run around many of the layers of abstraction. This is certainly not ideal, but it works. The point is, an abstraction intended to make life as a programmer slightly easier most of the time can easily make life as a programmer significantly more difficult on edge cases. Unfortunately, once a code base has been established, most cases are edge cases: feature additions, new products built on top of the old infrastructure, etc., all or most unforeseen, and therefore unaccounted for during the original design process.

 

Who will monitor the monitors?

21 Aug

This is the second time that an error in a monitoring/testing tool that we use has caused me to waste a good deal of time trying to solve a non-problem.

The first was when we were using a tool to test the scaling performance of a new feature. According to this tool the performance was absolutely abysmal with only 50 virtual clients, and I spent a couple of days writing alternative algorithms and trying to test them with this tool. The scalability was horrible no matter what I did, so I added some logging to see if something was going on that wasn’t supposed to be. It was. The testing tool was sending completely wrong information, creating completely unrealistic scenarios. The testing implementation was done by another developer who was sure that he had set it up right, and I was not familiar with the tool so I assumed that it was a problem with my code. Having another developer write the tests was supposed to save me time.

The latest monitoring snag was caused by a monitoring tool that, for our convenience, tails the PHP error log and sends the last couple hundred lines to us whenever there is an error. We kept getting the same error periodically and could not track down the source; everything seemed to be working perfectly, but we still got the error. Site usage is breaking records every day, so we assumed that it had something to do with the unprecedented load on the servers, but could never pinpoint the source of the errors. Today I looked a bit more closely at the error log, only to discover that the log was from 3 days ago!

So the question is, when your testing or monitoring tools report an error, how do you ensure that the error is real and not just in the tool itself?
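
For the stale-log case specifically, one partial answer is to have the tool check how old the log is before it sends anything. The following is only a minimal sketch in Python, not the tool we actually use: the function names, the 200-line tail, and the 15-minute freshness threshold are all assumptions made up for illustration.

```python
import os
import time

# Assumed threshold: treat the log as stale if nothing has been written
# to it for 15 minutes. Tune this to how often the log normally changes.
MAX_LOG_AGE_SECONDS = 15 * 60


def log_age_seconds(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def tail_if_fresh(path, max_lines=200, max_age=MAX_LOG_AGE_SECONDS, now=None):
    """Return the last `max_lines` lines of the log, or None if it is stale.

    A None result means the alert should say "the log has not been
    written to recently" instead of forwarding old errors as new ones.
    """
    if log_age_seconds(path, now) > max_age:
        return None
    with open(path, "r", errors="replace") as f:
        return f.readlines()[-max_lines:]
```

A check like this would not have caught the load-testing problem, but it would have turned three days of phantom errors into a single "your log is stale" message.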

 

The pain of project management

30 May

For the second time now, Grooveshark is making a serious effort to utilize project management software.

I don’t like the new project management solution, even though it is a vast improvement over the old one, and I think I’ve discovered why: I don’t like project management. I don’t like project management because project management is not for me; it’s for managers. Managers and developers have vastly different interests. Developers want to know what bugs are in their code, what features they need to develop, a way to view dependencies, and a way to see which bugs/features are most important. Bugzilla fits those needs perfectly. Managers, on the other hand, need to maximize productivity by making sure devs are never sitting idle, and they need to know what is going to happen when; they need ship dates.

The thing about bug tracking is that the people who benefit from it are the same people who have to do the work. If you want to see a bug get fixed, you file it. If you want to find bugs to fix, use the bug tracker. Project management tools, on the other hand, require a vast amount of work from the people who don’t need them: developers. That drastically decreases the chances that they will be used consistently. It also means that project managers end up working against their own goal: they reduce productivity. This is consistent with Le Chatelier’s Principle: Complex systems tend to oppose their own proper function.

It’s actually a more pronounced effect than just wasting dev time on project management. At least for me, having multiple people assign tasks to me with various arbitrary deadlines tends to make me feel like I am being micromanaged, which increases stress and also keeps me from being able to focus on any one thing for long enough to accomplish much. (“I’ve been working on this thing for 3 hours and I have all these other things to do, I should put it on the back burner and move on…”)

What’s the solution? I don’t know, but a good start would be to combine project management with bug tracking, à la FogBugz, set overarching deadlines for projects, let your devs work out the details, and be flexible. Plan for other things coming up, because new and very important bugs will pop up all the time, servers will break, and meetings will happen (sadly). If your developers are good, hardworking employees (and Grooveshark devs are), they will strive to meet or exceed the goals you set for them, without being micromanaged.

 

Always test in context

03 May

When testing your work, it’s essential to always test it in the context that it’s going to be used in.

Last weekend, I helped a friend move her stuff down to the Orlando area. UHaul trailers are incredibly cheap compared to trucks, and my car has a hitch, so we got one of those, loaded up her stuff and headed south. Somewhere along the way, the hookup for the lights became disconnected and then dragged along the ground, getting all nice and melted in the process. I noticed this when we got to her new apartment and called UHaul. Awesomely, they have free roadside assistance for any time something happens to their trailer.

They sent out a guy, he looked at the hookup on the trailer, rewired it, and tested it with his wire testing doohickey before taking off. We had already disconnected the trailer from the car for parking purposes, and when he pulled in to fix the trailer he blocked access to my car, so we didn’t actually hook it up to my car to test it.

A few hours later, it’s time to go home and return the trailer and oh, crap, the lights don’t work. I had to call UHaul and have them come fix it again. Because of the way the trailer was hooked up originally, the tongue of the trailer managed to get on top of one of the wires on my side of the connector and eventually wore through it, something I hadn’t noticed previously. It also managed to blow a fuse at some point so my brake lights didn’t work at all even when the wiring was fixed. Fortunately, the UHaul guy was very friendly and also had all the parts one could ever need for fixing minor car problems, so he easily re-rewired the connections and fixed my fuse.

Still, a few extra minutes of checking the trailer in the context it was going to be used in (i.e. attached to my car) would have revealed the problems and saved UHaul several hours’ worth of labor (driving to and from the apartment complex; they were far away).

That was a non-technical example of the principle, but there are certainly plenty of technical ones as well:
When I wrote the code to handle our PayPal processing, PayPal helpfully provided a development sandbox for testing. All my code was thoroughly tested and seemed to be handling every sort of error that I threw at it, processing successful charges properly and everything. Then when it was time to release, we pointed it at the ‘real’ PayPal servers and suddenly it didn’t work anymore. A few (real) payments later, some minor differences between how the real servers worked and how the dev servers worked were revealed. No big deal, but again, some testing up front in the correct context would have prevented the issue from ever appearing in the first place.

 

Back

24 Apr

I’m back from my mini-vacation and it sounds like things are crazy at the office, but I think even so that I should have time to start writing again and I look forward to doing so.

My first order of business tonight is to start reading Systemantics: How Systems Work and Especially How They Fail by John Gall which was recommended by Raymond Chen and therefore automatically must be worth reading. Because I’m sure many of you are too lazy to actually follow that link, here’s the relevant RChen review:

Systemantics is very much like The Mythical Man-Month, but with a lot more attitude. The most important lessons I learned are a reinterpretation of Le Chatelier’s Principle for complex systems (“Every complex system resists its proper functioning”) and the Fundamental Failure-Mode Theorem (“Every complex system is operating in an error mode”).

You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed.

That’s why I’m skeptical of people who look at some catastrophic failure of a complex system and say, “Wow, the odds of this happening are astronomical. Five different safety systems had to fail simultaneously!” What they don’t realize is that one or two of those systems are failing all the time, and it’s up to the other three systems to prevent the failure from turning into a disaster. You never see a news story that says “A gas refinery did not explode today because simultaneous failures in the first, second, fourth, and fifth safety systems did not lead to a disaster thanks to a correctly-functioning third system.” The role of the failure and the savior may change over time, until eventually all of the systems choose to have a bad day all on the same day, and something goes boom.

Time to hit the books!

 

Development Timeline as a Contract

20 Mar

Development timelines are a contract, in many ways.

Contract negotiation happens when developers sit down with management to hash out a release date for a product or feature. As with any other contract negotiation, both sides come to the table with their own demands, and there are concessions on both sides, but hopefully when the negotiation is over, both parties can be happy with the results. It is also essential that the terms of the contract are clear: if developers and management have different understandings of what feature XYZ entails, they might think they have come to an agreement when they haven’t; this will only lead to problems later, usually altered feature sets or later release dates.

It is important to keep in mind that contracts can be breached by either party, and this is certainly the case when dealing with timelines. If either side fails to hold up its end of the bargain, the timeline will slip and the contract will be broken. It’s obvious how developers can be in breach of this contract, and they are certainly usually the ones held responsible for it, but how can management be at fault? Well, it entirely depends on the negotiation process. If, for example, management assures the team that they will have a certain server in place and ready for production N days before release, and server deployment is delayed, development cannot be held accountable for the schedule slippage. The same is true if the development team asks for feature lockdown and management continues changing the specifications throughout the development cycle.

In an ideal situation, both parties to the contract are working towards the same goal: a product that will make the company more successful. What type of product that is exactly depends on the company, but goals that any team would strive for include an intuitive interface, a useful feature set, and a bug-free user experience. With a set of shared goals, if both parties are able to hold each other accountable and enforce the terms of the contract, the end result is usually an on-time release that both parties can be satisfied with.

In situations that are less ideal, where developers cannot expect management to live up to their terms of the contract, compromises will have to be made somewhere internally. A feature will silently go missing, the product will be poorly tested, or release dates will be moved back.

 

Music 1.0 is Dead

28 Feb

Music exec: “Music 1.0 is dead.”

Five hundred top members of the music business gathered today in New York to hear that “music 1.0 is dead.” Ted Cohen, a former EMI exec who used the phrase, opened the Digital Music Forum East by pleading with the industry to be wildly creative with new business models but not to “be desperate” during this transitional period.

Consider the statements that were made today without controversy:

  • DRM on purchased music is dead
  • A utility pricing model or flat-rate fee for music might be the way to go
  • Ad-supported streaming music sites like iMeem are legitimate players
  • Indie music accounts for upwards of 30 percent of music sales
  • Napster isn’t losing $70 million per quarter (and is breaking even)
  • The music business is a bastion of creativity and experimentation

Just within the last year, we’ve seen an array of experiments that include ad-supported streaming, “album cards” from labels like Sony BMG, and allowing Amazon to offer MP3s from all four majors. Some labels even allow user-generated content to make use of their music in return for a revenue share from sites like YouTube—unthinkable a few years ago to a business wedded to control over its music and marketing.

All of this bodes very, very well for Grooveshark, aside from the fact that we weren’t used as an example. We’ll soon be getting much more attention in that vein, but hopefully not until we’ve had a chance to improve the site in these areas:

  1. Differentiating between files and songs properly
  2. Faster loading times
  3. Better searching
  4. More user-friendly interface
  5. Eliminate silent failure

So clearly, we have our work cut out for us in a market that is on the verge of exploding, but if we can focus our resources I think we can be mostly there within a few weeks. Now to get management on board.

 

No Room for Ego

18 Feb

There’s no room for ego in software development, and especially if you are a small startup setting out to change the world, there’s no room for ego on any team.

Here at Grooveshark, we have the smartest bunch of people I’ve ever worked with (and they’re not paying me to say that), but we all make mistakes, have a bad brain day and make less than optimal decisions sometimes. When there is no ego involved, we can all point out to each other when one of us is being stupid, and we can easily own up to our own mistakes and learn from them (and make the site better), and we can ask questions when we don’t know something.

If ego is involved, it’s no longer easy to tell someone they are doing something wrong or question their decisions, because you might hurt their feelings. Conversely, if you are maintaining an ego, you’re going to have a hard time admitting that you don’t know things that you need to.

Either way, precious time and resources end up being wasted, whether on fixing bugs, redesigning architecture, or spinning your wheels trying to figure out something that someone else probably already has the answer to.

I ask my co-workers probably hundreds of questions a day. I’m not an expert on the way every single piece of our site works and it doesn’t hurt my ego to admit that. So when I find a bug in, say, the recommendation engine, I can go over to Travis who seems to know that system inside and out, and ask him exactly what is happening. I don’t think Travis has any less respect for me for it, but now I know more about how that part works and I fixed the bug, so now we have a better product, and it only took me a few minutes because I didn’t have to analyze each line of code tracing through the object hierarchies, etc.

When we have to design an important new feature or revamp the way part of our architecture works, we have a meeting about it and everyone provides their input. We usually start out with a few different ideas and everyone argues and makes a case for their idea. In the process of doing that we change our minds and come to a consensus. Because there’s no ego involved, we’re each prepared to give up our idea if a better one is presented, and we usually end up coming up with a solution that is better than what any of us would have designed on our own.

This concept applies to other teams/departments as well. In my experience, the teams most open to constructive criticism from outside the group are the most effective, and those least receptive to feedback tend to have lower-quality solutions.