Coding from the Trenches: Everything I Need to Know about Debugging I Learned from CSI

I started to write this blog entry a few months ago, but it quickly got out of hand. A tester at Microsoft encouraged me to submit it to a magazine, but things at work got busy, and I just never found the time. Since I am the quintessential procrastinator, I've decided to just publish it here, so that it will at least get published in some form. So, without further ado, I present it for your general amusement.

—Mike

Okay, I admit it. I'm a CSI dweeb. No, I don't like CSI: Miami (it blows). I like the original CSI. And although all of those shows are equally implausible and unrealistic (let's face it, no CSI team is that thorough, that precise, or that good), the very premise of crime scene investigation and its parallels to defect resolution hit me like a ton of bricks recently.

At first thought, it's just a corny idea. But then, the more I thought about it, the more I realized that the idea isn't as silly as it sounds. I thought about writing this article as a parody, but when I set out to do so, it didn't turn out that way.

Some folks might look at this and laugh their butts off. But read it, think about it, chew on it, and then, after you're done, if you still think the parallels aren't striking, go ahead and laugh.

The Basics of Crime Scene Investigation and Defect Resolution

We’ve all done the dirty work in software development: defect resolution. In many companies, it’s the first place where new developers are unceremoniously dumped when they are brought on board. The thinking is that it will familiarize them with the product. “You’ll learn the code!” they say. “Then you can move into the development.” These poor saps are armed with reams of source code, an IDE, and a compiler, and sent marching into the battlefield with a stack of defect reports and an order to make progress repairing a system with which they typically have no experience whatsoever.

This thinking is fundamentally flawed. You don’t want someone who doesn’t know a lick about a complex software system trying to resolve its defects. But that’s a subject for another article.

When confronted with a defect report, there are a certain number of predictable responses that tend to flash through every developer’s mind when he hears it:

“It’s an ID10T error.” This one’s my favorite. It can’t possibly be a defect in any code that I wrote. The user must have done something wrong. Everyone knows users are st00pid. I mean, just look at them. They’re like, lame. And stuff.

In the Dark Ages, this might have flown, but this is the 21st century. There’s this thing called a presumption of innocence.
“We already fixed that.” Another keen insight. If it’s already fixed, why is it still happening? If it’s already been fixed, you’ll have to provide proof that it’s been fixed in a build in your test environment. If you can’t prove that, then what you’ve likely done is fixed the wrong thing and claimed victory. As we’ll see later, this will fall under the novel concept of “Convicting the wrong suspect.”
“He did what? You’re not supposed to do that.” Okay, let’s get something straight: just because a developer might not do something doesn’t mean that a user won’t. Users do unpredictable things all the time. And your lack of coding for it doesn’t mean that it’s not a defect. Holding the users at fault for being unpredictable is not an acceptable excuse. Inadequate code coverage in the test plan is a defect. Get used to it.
“It’s a known issue and we can’t do anything about it.” Okay, that’s marginally acceptable. Sometimes. But have you made an effort in the software to barricade the users from the effects? Trained them? Documented it? Why are users still running into it?
“I know exactly what causes that. Let me fix that right now.” The most fatal of all the answers. This knee-jerk reaction is what leads to reaction #2. This is always a bad response, and only in the rarest of cases is it ever right. I would estimate the chances of it being right as roughly equivalent to those of a stray cosmic ray setting off a nuclear disaster that ended the world within three seconds of your reading this sentence. Okay, that’s extreme. But you get my point.

Defects resolved this way are rarely documented properly. Test plans are rarely updated to ensure that the fix is correctly tested. They’re just quietly slipped in, like an Easter egg, and no one is any wiser. The only thing that gives them away is that there’s a new version of the file in the source code repository. (You do have a source code repository, right?) And that’s assuming it’s the only change in that version of the file.
“Oh God. What now?!” Don’t even tell me this has never crossed your mind. We’re all swamped. Products slip, schedules get crazy, we work overtime, and work piles up. We try to prioritize, but things get missed. Defects get buried in a stack, and some of them just don’t get fixed. We don’t see defects as challenges, we see them as annoyances, burdens, more junk sitting on our plate when we’re already seriously overtaxed.

We all know that there are more clever and witty responses out there. Some of them I just can’t put in print. But for the purposes of this article, I think we’ve painted a pretty accurate picture of defect resolution as it stands today: it’s viewed as a dull job, one that’s resented, a pain in the neck, and one that no one looks forward to.

Let’s face it. You have to essentially tell the developers that their code is broken. Or, you have to tell the users that they don’t know what they’re doing, or that there’s nothing that can be done, and that they just have to wait. Either way, it’s a no-win scenario for you. You always come out the bad guy. No one wants to cooperate with you, because they know that you’re only going to give someone bad news. If you’re new to the company, you probably don’t even know anything about the product to begin with, so you’re flying by the seat of your pants as well. And if your company is like most, you don’t have the best equipment or software to make finding those defects as easy as it could or should be.

Who the heck would want that job?

Now, turn your attention to another group of individuals who are stuck in the very same situation. Their job is no different. They have to do the same basic thing. They have to wade into a situation that they know nothing about, typically understaffed and underequipped, and determine whether or not a problem occurred. Then they have to accuse someone of being in the wrong, or tell both sides that no wrongdoing took place at all (potentially angering both sides). Through it all, their job is to figure out the who, the what, the where, the why, and the how of it all. Crime scene investigators do this every day. They wade into a new crime scene, knowing only that a crime may have been committed, that one or more suspects are at large, and that they have a crime scene to work with. They’re given the evidence, and told to run with it. Sound familiar? It should.

Crime scene investigation is essentially the act of solving a complex problem: finding the truth in a vaguely described problem when you’ve got few hands, little money, a lack of resources, a finite amount of time, and every witness can be a suspect. At the end of each case, they have to render their findings, and simply state the facts, regardless of whether or not the victim or the justice system likes it. Sometimes they’re praised, sometimes they’re despised. But they’re frequently overworked and underpaid, and the amount of care they have to take to get their jobs done is mind-boggling. If they make a mistake that tampers with the evidence, an entire case can get thrown out of court.

The Process

The job of the crime scene investigator is to determine the following:

Whether or not a crime was committed.
If a crime was committed, what the crime was.
If a crime was committed, who committed it.
If a crime was committed, how it was committed.

You’ll note that the investigator is not responsible for prosecuting the crime. His job is simply to collect the evidence, analyze it, and form a theory that fits the facts and leads to the perpetrators of the crime (if any).

The crime scene investigator uses the scientific method to arrive at his or her conclusions. The American Heritage Dictionary defines the Scientific Method as:

n. The principles and empirical processes of discovery and demonstration considered characteristic of or necessary for scientific investigation, generally involving the observation of phenomena, the formulation of a hypothesis concerning the phenomena, experimentation to demonstrate the truth or falseness of the hypothesis, and a conclusion that validates or modifies the hypothesis.

In other words, “Prove it, buster.”

Here’s the gist of it: You need to gather the facts, form a hypothesis based on the facts, and then prove your hypothesis. In crime scene analysis, proving the hypothesis leads you to one or more suspects who are more than likely guilty of committing the crime. You don’t rely on the “hunch.” Hunches put innocent people behind bars, waste taxpayer dollars, and get cases thrown out of court or convictions overturned on appeal.

In defect resolution, the same practice applies. You gather the facts, determine whether or not an actual defect exists, and then review the facts to create a theory. Then you prove the theory. If you can’t prove the theory, you don’t have a case. You’ll likely fix the wrong code, incorrectly mark it as “not reproducible,” “by design,” or “user error,” or fix part of the problem while the other parts that contributed to the problem remain uncorrected.

As a CSI conducts his investigation, certain guiding principles govern the way that the investigation is conducted. These are:

Humans lie and make mistakes; evidence doesn’t. When you can’t rely on the witnesses, keep going back to the evidence to find the truth.
You always want to convict the guilty party. You never want to convict the wrong party of the crime. When you do get a conviction, you want it to stick; you never want it to be overturned on an appeal.
Your first suspect is usually not the right suspect. Knee-jerk reactions tend to be wrong, and based on faulty assumptions. Careful evaluation of the evidence leads you to the right suspect(s).
You want to convict all of the guilty parties, not just one or some of them.
Don’t be swayed by your emotions or personal involvement. Always remain detached and objective.
Expensive tools aren’t always required to analyze the evidence. Sometimes, it’s simple tools that can be found at our fingertips every day that will do the trick.
Patience and persistence rule the day.
There ain't nothin' glamorous about this job. It's full of blood, gore, hate, anger, greed, fecal matter, tire tread, and a lot of pavement. No one ever cooperates willingly, but they all want answers now. And no one is ever guilty. Get used to it.

So without further ado, let's see how the CSI process parallels defect resolution. Hold onto your butts, people, it's going to be a bumpy ride.

The Process

Identify the Crime

Any time an alleged defect occurs in your product, treat it as a crime. After all, some part of your code has theoretically failed to meet its contractual obligation to the end user (or, so we’re assuming for the purposes of this article). You’ll first want to know what this alleged crime was. Was data corrupted? Did the software simply vanish off the screen? Did an error message appear? Did the screen lock up? Was sensitive data compromised?

Once you identify the crime, you'll need to categorize it. Its severity helps you to determine how quickly it needs to be resolved.

It's important to note, however, that at this point, you don't know that a defect has actually occurred. All you really know is that something happened. You still have to prove that it's a defect. So you start taking copious notes. This is why you need a defect tracking system. You need a place where you can record as much information about the event as you possibly can--preferably in one place.
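For readers who like to see it in code, here is a minimal sketch of what one entry in such a tracking system might hold. Everything in it is an assumption made up for illustration: the `Severity` scale, the `CaseFile` fields, and the `triage` helper are hypothetical and not taken from any real defect tracker. The one idea it borrows from the text is that a report starts out unconfirmed and stays that way until you prove it.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a lower number means more urgent."""
    DATA_LOSS = 1
    CRASH = 2
    WRONG_RESULT = 3
    COSMETIC = 4


@dataclass
class CaseFile:
    """One alleged defect. It stays unconfirmed until the evidence proves it."""
    title: str
    severity: Severity
    confirmed: bool = False  # all we know so far is that something happened
    notes: list[str] = field(default_factory=list)

    def add_note(self, text: str) -> None:
        """Record an observation in the one place everyone will look."""
        self.notes.append(text)


def triage(cases: list[CaseFile]) -> list[CaseFile]:
    """Order cases so the most severe are investigated first."""
    return sorted(cases, key=lambda case: case.severity)
```

Calling `triage` on a list of open cases hands back the data-loss reports ahead of the cosmetic ones, which is about all the severity is for at this stage.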

Identify the Victim and Witnesses

The victim and witnesses provide valuable insight to what happened when the crime occurred. But it’s important to realize that witness accounts tend to be fuzzy at best.

Crimes and defects tend to catch people by surprise—they’re usually not paying close attention when these things happen, and the panic factor is pretty high, so relevant and often important details tend to escape their notice. You’ll still find their input valuable for recreating the series of events that led up to the event, and certain general information about it; but you will do well to remember that witnesses typically are not an authoritative source of information.

Identify the Crime Scene

When an alleged defect occurs in your product, treat the event as a crime scene.

You’ll want to know where and when the event occurred, what version of your product was being used, what OS it was being used on, what browser was being used, any plug-ins or service packs applied, what the user load was at the time, and so on. Any of these might have a bearing on the crime that was committed. You’ll need to know this information so that you know exactly which version of your software to use when you recreate the "crime scene" later.
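As a rough sketch of capturing that information automatically, the snippet below records a few environment facts at the moment something goes wrong. It assumes, purely for illustration, that the product runs on Python and that the caller can supply its own version string and current user load; the function name `describe_crime_scene` and its fields are hypothetical.

```python
import platform
import sys
from datetime import datetime, timezone


def describe_crime_scene(product_version: str, user_load: int) -> dict:
    """Snapshot the environment so the scene can be recreated later."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),  # when
        "product_version": product_version,  # supplied by the caller
        "os": platform.system(),             # e.g. the OS family name
        "os_release": platform.release(),
        "machine": platform.machine(),
        "runtime": sys.version.split()[0],   # interpreter version (assumed Python product)
        "user_load": user_load,              # supplied by the caller
    }
```

Attaching a record like this to every defect report saves you from asking the victim, days later, which version they were running.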

Preserve the Crime Scene

It is absolutely imperative that you preserve the state of your software while you are attempting to identify the cause of the defect that occurred. If the environment is changing, someone is tampering with the evidence, and the evidence can no longer be relied upon to point you to the right suspect.

This is why a solid revision control process is critical to defect resolution. Every build must be labeled in your source code repository so that you can recreate it, and test it for defects. You must be able to recreate the environment later, and that means being able to use the same version of the software that the defect occurred in. You’ll hopefully have the means to do it on the same OS, with the same browser and plug-ins that the victim was using, but that’s not always feasible due to cost constraints. But having access to the source code that was used to create the software is absolutely essential. Your suspect may be hiding in there somewhere.
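Revision control covers the source code, but the other files you collect (logs, data files, screen shots) can change under you as well. One simple way to notice tampering, offered here only as a sketch, is to fingerprint every file in an evidence folder when you collect it and compare later. The helper names `fingerprint`, `seal`, and `tampered` are hypothetical, and the folder layout is an assumption.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seal(evidence_dir: Path) -> dict[str, str]:
    """Fingerprint every file under the evidence folder."""
    return {
        str(path.relative_to(evidence_dir)): fingerprint(path)
        for path in sorted(evidence_dir.rglob("*"))
        if path.is_file()
    }


def tampered(evidence_dir: Path, sealed: dict[str, str]) -> list[str]:
    """List files that were added, removed, or changed since sealing."""
    current = seal(evidence_dir)
    names = sealed.keys() | current.keys()
    return sorted(n for n in names if sealed.get(n) != current.get(n))
```

If `tampered` returns anything other than an empty list, the scene has changed since you sealed it, and conclusions drawn from those files deserve a second look.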

Collect the Evidence

The evidence is what you will base your findings on. Everything else will be ignored, because only the evidence can be relied upon to tell you the truth. Evidence includes the source code for the build in question, a fresh copy of the database, any exceptions that occurred, event log entries, data files, screen shots, and other output from the software generated at the time that the defect occurred.

Do not include email communications as evidence unless they were system-generated; interpersonal communications are testimony, not evidence.

Collect Testimony

Testimony includes emails, voice mails, and oral accounts from users that describe what happened when the defect occurred. It is vitally important to note that testimony is not evidence. Rather, testimony helps you to evaluate the evidence. Testimony is subject to witness credibility and the fallibility of human recollection.

That probably sounds pretty harsh, but it’s a simple statement of fact. As we’ve mentioned before, folks tend to be caught off guard when something goes wrong. They aren’t expecting someone to snatch their purse, jack their car, or corrupt their data. It takes them by surprise. Consequently, they don’t tend to be looking for the vital details that you need from them when you are trying to figure out what happened. They’ll tend to remember vague details, but not the specifics.

There’s also the uncomfortable truth that we simply don’t like to admit that we might have done something wrong. So we’re reluctant to divulge information. And we’re emotional when our data is corrupted or software that we’re required to use doesn’t behave as it’s supposed to and we’ve got tons of work to do. We get angry, even hostile. It’s human nature. We all do it. But the net effect is that our testimony in those situations isn’t always as objective as it might be. It’s subjective, defensive, and guarded.

Finally, witnesses can only tell you what they saw, not what happened internally. If it were a medical condition, we would say that they saw the symptoms, and not the underlying disease. As a professional, you don’t want to treat the symptom; you want to root out the disease. But a witness can’t tell you anything about the disease because she simply can’t see it.

So the witnesses’ testimony helps you to evaluate the evidence, but it isn’t evidence in and of itself.
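The distinction can be written down as a small rule of thumb: whatever the system produced is evidence, and whatever a person said or wrote is testimony. The sketch below encodes that rule. The source labels (`"event_log"`, `"oral_account"`, and so on) are invented for illustration, and a real team would pick its own.

```python
from dataclasses import dataclass

# Assumed labels: things the system produced on its own.
SYSTEM_SOURCES = {
    "source_code", "database_snapshot", "exception",
    "event_log", "data_file", "screenshot", "system_email",
}
# Assumed labels: things a person told you.
HUMAN_SOURCES = {"email", "voice_mail", "oral_account"}


@dataclass(frozen=True)
class Item:
    source: str
    description: str


def classify(item: Item) -> str:
    """Return 'evidence' or 'testimony' for one collected item."""
    if item.source in SYSTEM_SOURCES:
        return "evidence"
    if item.source in HUMAN_SOURCES:
        return "testimony"
    raise ValueError(f"unknown source: {item.source!r}")
```

Note that a system-generated email lands on the evidence side while an email from a user lands on the testimony side, which matches the rule given earlier.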

Be careful, however, that you do not treat witnesses with hostility. Just because witnesses may not be accurate sources of information does not mean that they are dishonest sources of information. Always treat them with respect and understanding. Remember the golden rule when interviewing the witness: You’ll get more with honey than you will with vinegar.

Analyze the Evidence

Once you have all of the evidence, you must analyze it to determine what happened. Sometimes, the crime that was reported turns out not to be the crime that occurred. It turns out to be the wrong crime altogether. Or there was no crime at all. Careful analysis of the evidence determines whether or not a crime occurred at all; if one did, analysis of the evidence determines when it occurred, where, and who the likely suspects are. You can use the testimony to evaluate the evidence, such as to reconstruct the order of events that led up to the crime in question. But, again, do not treat the testimony as evidence.

You should use every available tool at your disposal to evaluate the evidence, including your application’s debugger, tracing tools, the event viewer, query tools, hex viewers, file parsers and viewers, system diagnostic utilities, network utilities, and so on. Expensive tools aren’t always required. A simple text editor is often sufficient for viewing data files, and a baseline graphic editor will suffice for viewing graphics files in most cases. Wherever possible, use the simplest tool that will accurately evaluate the evidence before you. There’s no need to inflate the costs of your investigation.

The outcome of your analysis should be a theory of the crime. You should have one or more suspects: the portion(s) of your code or the external components that caused the defect.

The next step is not to rush off and fix the code. Rather, you need to prove your theory. After all, you want to convict the suspect, and you want to convict the right suspect. And you don’t want to put this suspect in jail, only to find out that the same exact crime is being committed by another suspect that you hadn’t considered. This is especially true if your suspect is the victim.

Recreate the Crime

So now you have a theory. You just have to prove it. And you have to prove it beyond a reasonable doubt. So, you have to recreate the crime scene, and then walk through the crime itself. That means putting the victim and the witnesses back where they were at the scene of the crime, and taking all the steps that lead up to the moment when the crime occurred.

At the end of the recreation, if your theory called for one suspect, there can only be one suspect who contributed to the crime. If there’s more than one suspect at the end of your recreation, you’ve got a problem. If, in the course of recreating the crime you find that some other unexpected entity was involved, you need to go back to the evidence collection step, and start over. You’ve got another suspect out there somewhere that you didn’t know about.

Once you’ve identified all your suspects and you are reasonably sure you can prove they are the cause of the problem, you need to verify that your theory cannot be disproved. Is there any possible way that the crime could have been committed by another suspect? After all, that’s what a defense attorney would claim; the defense is going to do everything in their power to shoot your case full of holes. You want to be absolutely sure that your case is ironclad. As a developer attempting to identify the cause of a defect, you want to make sure that you’ve eliminated all the possible causes of a defect. Is there any other way that this defect might be caused that you haven’t considered? If there is, you need to account for it.

One last caveat: Never make the assumption that the victim is the suspect; conversely, never assume that the product is the suspect. Prove it. Be sure. And make sure that the evidence proves your case. Don’t rely on hunches or speculation. Neither the end-users nor the developers are going to appreciate being accused of being in error. If the suspect is an external entity, the developers are even less likely to be happy, because you’ve just identified extra work for them; be absolutely certain you have the facts to back up your case.

If you cannot disprove your theory, you’re ready to move on and prepare your case.
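In code, recreating the crime usually means writing down the exact steps as something you can replay on demand. The sketch below is one hedged way to do that. The `reproduce` helper and the `apply_discount` suspect are both invented for illustration; the bug in `apply_discount` is planted on purpose so there is something to catch.

```python
from typing import Callable


def reproduce(scenario: Callable[[], None], runs: int = 5) -> int:
    """Replay the steps `runs` times; return how many runs showed the defect."""
    failures = 0
    for _ in range(runs):
        try:
            scenario()
        except AssertionError:
            failures += 1
    return failures


# Hypothetical suspect: a discount routine that mishandles a 100% discount.
def apply_discount(price_cents: int, percent: int) -> int:
    if percent >= 100:
        return price_cents  # planted bug: should return 0
    return price_cents * (100 - percent) // 100


def crime_scenario() -> None:
    """The steps the victim reported: a fully discounted order."""
    assert apply_discount(2000, 100) == 0, "customer was charged full price"


def alibi_scenario() -> None:
    """A nearby case that should NOT show the defect, to rule out other suspects."""
    assert apply_discount(2000, 25) == 1500
```

Here `reproduce(crime_scenario)` reports a failure on every run, while `reproduce(alibi_scenario)` reports none. That is the shape of a proven theory: the defect appears every time under the suspected conditions and never outside them. Keep the scenario around after the fix, and it doubles as a regression test.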

Prepare Your Case

Document everything. Preserve the evidence. If this case ever presents itself again, you’re going to want to know what you did to research it. A decent defect tracking tool is invaluable in this regard. If you lack one, there’s no reason you can’t keep it in your source code repository (unless there’s a storage limitation on it).

Even if you can’t identify the suspect, and this case is unsolved, you can keep this one in your Cold Case Files. If it rears its ugly head again, you can reopen the case and you’ll have all that evidence from the previous investigation at your disposal.

Obtain a Warrant

Now, with evidence in hand, and a solid provable case, you’re ready to obtain a warrant for the suspect. Up until now, you haven’t had enough to do that. But with the evidence, which doesn’t lie, and a thoroughly documented case, you can make your case to the development team. You’ll have the information to convince them that a defect exists, or it doesn’t.

Carefully lay out the facts, tell them what happened, how you determined that it occurred, and how you eliminated all the other possible suspects. Rely on facts, not conjecture. This is where your personal detachment is critical. You’re not supposed to be on their side or the users’ side. You’re on the truth’s side.

If no defect exists, say so. If one does exist, say so. Don’t make an issue out of it or point fingers. Simply state the facts. Be sure to point out how severe the issue is in terms of data corruption, application downtime, usability and so forth—so long as those pieces of information are based on facts and not opinions. These pieces of information will help the project team decide how quickly the defect should be resolved.

Arrest the Suspect(s)

Once the development team is convinced that the defect is real, the development team will take the information you’ve collected and use it to prioritize and correct the defect. This particular defect should no longer victimize anyone. Once the defect is resolved, the case is closed.

In Closing...

Most developers are constantly burdened with having to research and resolve defects. I know I am. But the problem is that we tend to treat them as annoyances. We don't give them the weight that they deserve. To us, they're just "something that went wrong" that needs to be addressed. So we quickly glance at the code, make a best guess and accuse the first suspect that walks by. All too often, we accuse the wrong suspect. Just as frequently, we take the defect report that’s given to us, add it to the growing list of things to do, and hope to get around to it at a later date when “more pressing concerns” don’t occupy our attention.

Perhaps the problem stems from how we view defects. Perhaps we see them as just another blip on the radar—another defect. I suspect that as developers we tend to think of defects solely in the light of the code base, and rarely in the light of the victim: the end user affected by the defect itself.

But what would happen if we changed our thinking by creating teams that viewed defects as offenses against victims that set out to prosecute or acquit the suspect by collecting the evidence and evaluating the testimony from witnesses? By elevating the perceived seriousness of the defect, perhaps we can increase the desire to get them corrected, and get them corrected correctly the first time. Too farfetched? Corny? Maybe. Maybe not.

Don’t make the mistake of thinking that I’m advocating the creation of a real CSI unit in your software shop or IT department (that would be an absolute disaster and insanity in and of itself). I don’t think you need to treat end-users as hostile witnesses. What I am advocating is the application of the scientific method to the resolution of defects: Know the difference between evidence and testimony and the value of proving your case. It’s likely to be far superior to scratching the first itch that irritates you. Think your way through a defect; get serious about resolving it correctly, deterministically, cost-effectively. There’s nothing in that statement that detracts from code quality. In fact, it enhances it.

I would hope that we all want to write better, more stable software, and that when defects are found, we'd address them quickly and with a minimal amount of cost. Part of minimizing that cost is identifying the right cause of the defect the first time.

What started out as a parody for me led me to rethink my process for defect resolution. There are ways that I can tighten it up, and improve my process. I might get laughed out of the door, but when it comes down to it, the only real thing that matters is whether or not I nailed the real perpetrator, and did so more efficiently than I did before. And isn't that the whole point of this exercise?

Coding from the Trenches

Thursday, June 21, 2007
