Despite the availability of industrial-strength debuggers and integrated program development environments, professional programmers still have to engage in far more detective work than they ought to. This is their story.
Psychological studies of computer programming and debugging [1, 2, 5, 8, 9, 10, 11, 12], while important in their own right, have tended to overlook the potential benefit of self-reports by programmers reflecting the phenomenology of debugging, i.e. what it's like "out there in the trenches" from the programmer's perspective. Two exceptions to this are (a) the detailed account by Knuth [6] of the log book that documented all the errors he encountered over a ten-year development period working on TeX, and (b) a log book of the development efforts of a team implementing the Smalltalk-80 virtual machine [7]. Such self-reports and log books are valuable sources of insight into the nature of software design, development, and maintenance. The work reported here attempts to expand this single-user-log-book approach to investigate the phenomenology of debugging across a large population of users, with the ultimate aim of understanding and addressing the problems faced by professional programmers working on very large programming tasks.
Toward this end, I conducted a survey of professional programmers, asking them to provide stories describing their most difficult bugs involving large pieces of software. The survey was conducted by electronic mail and conferencing/bulletin board facilities with world-wide access (Usenet newsgroups, BIX, CompuServe, and AppleLink). My contribution is to gather, edit and annotate the stories, and to categorise them in a way which may help to shed some light on the nature of the debugging enterprise. In particular, I look at the lessons learned from the stories, and discuss what they tell us about what is needed in the design of future debugging tools.
In early 1992, I posted a request for
debugging anecdotes on an electronic bulletin board called BIX,
the "BYTE Information Exchange", and followed this with
similar messages posted to AppleLink, CompuServe, various Usenet
newsgroups, and the Open University's own conferencing system
(OU CoSy). The original message is shown in Figure 1.
c.language/tools #2842, from meisenstadt
771 chars, Tue Mar 3 06:09:28 1992
TITLE: Trawl for debugging anecdotes (w/emphasis on tools side)...
I'm looking for some (serious) anecdotes describing debugging experiences. In particular, I want to know about particularly thorny bugs in LARGE pieces of software which caused you lots of headaches. It would be handy if the large piece of software were written in C or C++, but this is not absolutely essential. I'd like to know how you cracked the problem-- what techniques/tools you used: did you 'home in' on the bug systematically, did the solution suddenly come to you in your sleep, etc. A VERY brief stream-of-consciousness reply (right now!) would be much much better than a carefully-worked-out story. I can then get back to you with further questions if necessary.
The trawl request elicited 110 messages from 78 "informants", mostly in the USA and UK. The group included implementors of very well-known commercial C compilers, members of the ANSI C++ definition group, and other known commercial software developers. Of the 78 informants, 50 specifically told a story about a nasty bug. A few informants provided several anecdotes, and in all a total of 59 bug anecdotes were collected. Figure 2 shows some typical replies to the original request (the full set of replies and analyses thereof is available from the author).
[Story A, complete] I had a bug in a compiler for 8086's running MSDOS once that stands out in my mind. The compiler returned function values on the stack and once in a while such a value would be wrong. When I looked at the assembly code, all seemed fine. The value was getting stored at the correct location of the stack. When I stepped thru it in the assembly-level debugger and got to that store, sure enough, the effective address was correct in the stack frame, and the right value was in the register to be stored. Here's the weird thing --- when I stepped through the store instruction the value on the stack didn't change. It seems obvious in retrospect, but it took some hours for me to figure out that the effective address was below the stack pointer (stacks grow down here), and the stored value was being wiped out by os interrupt handlers (that don't switch stacks) about 18 times a second. The stack pointer was being decremented too late in the compiled code.
[Story B, excerpt] ...I once had a program that only worked properly on Wednesdays...The documentation claimed that the day of the week was returned in a doubleword, 8 bytes. In actual fact, Wednesday is 9 characters long, and the system routine actually expected 12 bytes of space to put the day of the week. Since I was supplying only 8 bytes, it was writing 4 bytes on top of storage area intended for another purpose. As it turned out, that space was where a "y" was supposed to be stored to compare to the users answer. Six days a week the system would wipe out the "y" with blanks, but on Wednesdays a "y" would be stored in its correct place.
[Story C, excerpt] ...The program only crashed after running about 45000 iterations of the main simulation loop... Somewhere, somehow, someone was walking over memory. But that somewhere could have been *anywhere* - writing in one of the many global arrays, for example....The bug turned out to be a case of an array of shorts (max value 32k) that was having certain elements incremented every time they were "used", the fastest use being about every 1.5 iterations of the simulator. So an element of an array would be incremented past 32k, back down to -32k. This value was then used as an array index. ....But of course the actual seg fault was happening several iterations after the error - the bogus write into memory. It took 3 hours for the program to crash, so creating test cases took forever. I couldn't use any of the heavier powered debugging malloc()s, or use watchpoints, because those slow a program down at least 10 fold, resulting in 30 hours to track a bug.
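Story C's failure mode can be reproduced in miniature. The sketch below is written in Python rather than the informant's C, so the 16-bit wraparound has to be emulated by hand; the names wrap_int16, first_negative_use and checked_index are illustrative and do not come from the story. It shows the counter silently going negative, and a cheap range check that would turn the eventual, distant crash into an immediate failure at the point of misuse.

```python
INT16_MIN = -32768


def wrap_int16(value):
    """Reduce an integer to the signed 16-bit range, as two's-complement storage would."""
    return ((value - INT16_MIN) & 0xFFFF) + INT16_MIN


def first_negative_use(start=0):
    """Count the increments needed before a 16-bit usage counter goes negative."""
    counter, uses = start, 0
    while counter >= 0:
        counter = wrap_int16(counter + 1)
        uses += 1
    return uses


def checked_index(counter, table_size):
    """Fail at the point of misuse instead of corrupting memory and crashing later."""
    if not 0 <= counter < table_size:
        raise IndexError(f"counter {counter} is not a valid index into {table_size} slots")
    return counter


print(wrap_int16(32767 + 1))   # the value the counter holds after one increment too many
print(first_negative_use())    # number of "uses" before the counter turns negative
```

A check of this kind costs one comparison per use, which is far cheaper than the tenfold slowdown the informant reports for watchpoints and debugging allocators.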
Although the "root cause" of reported bugs is of a priori interest, in order to fully characterise the phenomenology of the debugging experiences I needed to look at more than the causes of the bugs. After several iterations of summarising the data, it became apparent that it would be necessary to say something about (i) why a bug was hard to find (which might or might not be related to the underlying cause), and (ii) how it was found (which might or might not be related to the underlying cause and the reason for the difficulty) in addition to (iii) the root cause (what really went wrong).
We know something about each of these dimensions from previous studies. Vesey [12] attempted to address the first dimension (why difficult) by asking how the time to find a bug depended upon its location in a program's structure and its level in a propositional analysis of the program (answers: location in serial structure has no effect, and level in propositional structure is inconclusive). Regarding techniques for bug finding (second dimension), Katz & Anderson [5] reported a variety of bug-location strategies among experienced Lisp subjects in a laboratory setting involving small (10-line) programs. They distinguished among (i) strategies which detected a heuristic mapping between a bug's manifestation and its origin, (ii) those which relied on a hand simulation of execution, and (iii) those which resorted to some kind of causal reasoning. Goal-driven reasoning (either heuristic mapping or causal reasoning) was predominant among subjects who were debugging their own code, whereas data-driven reasoning (typically hand simulation) was predominant among subjects who were debugging other programmers' code. For the kind of programming-in-the-large being studied here, the need for a bottom-up data gathering phase, which helps the programmer get some approximate notion of where the bug might be located, becomes apparent.
As far as root causes are concerned (dimension three), two main approaches to the development of bug taxonomies have been followed: a deep plan analysis approach (e.g. [4, 11]) and a phenomenological account (e.g. [6]). Johnson [4] worked on the premise that a large number of bugs could be accounted for by analysing the high level abstract plans underlying specific programs, and specifying both the possible fates that a plan component could undergo (e.g. missing, spurious, misplaced) and the nature of the program constructs involved (e.g. inputs, outputs, initialisations, conditionals). Spohrer et al. [11] refined this analysis by pointing out the critical nature of bug interdependencies and problem-dependent goals and plans. An alternative characterisation of bugs was provided by Knuth's analyses [6], which uncovered the following nine (problem-independent) categories: A = algorithm awry; B = blunder or botch; D = data structure debacle; F = forgotten function; L = language liability, i.e. misuse or misunderstanding of the tools/language/hardware ("imperfectly knowing the tools"); M = mismatch between modules ("imperfectly knowing the specifications", e.g. interface errors involving functions called with reversed arguments); R = reinforcement of robustness (e.g. handling erroneous input); S = surprise scenario (bad bugs which forced design change, unforeseen interactions); T = trivial typo.
For both approaches (plan analysis vs. phenomenological) the "true" cause of a bug can really only be resolved by the original programmer, because it is necessary to understand the programmer's state of mind at the time the bug was spawned in order to be able to assess the cause properly. I found it informative to evolve my own categories in a largely bottom-up fashion after extensive inspection of the data, and then compare them specifically with the ones provided by Knuth. The criterion I have adopted for identifying root causes is as follows: when the programmer is essentially satisfied that several hours or days of bewilderment have come to an end once a particular culprit is identified, then that culprit is the root cause, even when deeper causes can be found. I have adopted this approach (a) because a possible infinite regress is nipped in the bud, (b) because it is consistent with my emphasis on the phenomenology of debugging, i.e. what is apparently taking place as far as the front-line programmer is concerned, and (c) because it enables me to concentrate on what the programmers reported, and not try to second-guess them.
The subsections which follow describe the three dimensions of analysis (why difficult; how found; root cause) in turn.
The reasons that the bugs were hard to trap fell into five categories, as described below:
cause/effect chasm: Often the symptom is far removed in space and/or time from the root cause, and this can make the cause hard to detect. Specific instances can involve timing or synchronisation problems, bugs which are intermittent, inconsistent, or infrequent, and bugs which materialise "far away" (e.g. thousands of iterations) from the actual place they are spawned.
tools inapplicable or hampered: Most programmers have encountered so-called "Heisenbugs", named after the Heisenberg uncertainty principle in physics: the bug goes away when you switch on the debugging tools! Other variations within this category are stealth bug (i.e. the error itself consumes the evidence) and context precludes (i.e. some configuration or memory constraints make it impractical or impossible to use the debugging tool).
WYSIPIG (What you see is probably illusory, guv'nor): I have coined this expression to reflect the cases in which the programmer stares at something which simply is not there, or is dramatically different from what it appears to be (e.g. "10" in an octal display being misinterpreted as meaning 7+3 rather than 7+1).
faulty assumption/model or mis-directed blame: If you think that stacks grow up rather than down (as did the informant in Story A, Figure 2), then bugs which are related to this behaviour are going to be hard to detect.
spaghetti (unstructured) code: Informants sometimes complain about "ugly" code invariably written by "someone else".
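The octal example given under WYSIPIG above can be checked in a couple of lines. This is only an illustrative sketch; the variable names are mine, not the informant's.

```python
displayed = "10"                  # what the programmer sees in the display

as_octal = int(displayed, 8)      # what an octal display actually means
as_decimal = int(displayed, 10)   # what the programmer assumes it means

print(as_octal == 7 + 1)          # the correct reading
print(as_decimal == 7 + 3)        # the illusory reading
```

Both comparisons hold, which is exactly the trap: the same two characters are a perfectly plausible value under either reading, so nothing on the screen signals the mistake.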
In this and subsequent sections, I report the frequency of occurrence of the different categories, not because it supports an a priori hypothesis at some level of statistical significance, but rather because it gives us a convenient overview of the nature of the problems that the informants chose to share with us. The frequency of occurrence of the different reasons for having difficulty is shown in Table 1.
cause/effect chasm 15
tools inapplicable or hampered 12
WYSIPIG: What you see is probably illusory, guv'nor 7
faulty assumption/model or mis-directed blame 6
spaghetti (unstructured) code 3
??? (no information) 8
Thus, 53% of the difficulties are attributable to just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. The high frequency of reports of cause/effect chasms accords well with the analyses of Vesey [12] and Pennington [8], which argue that the programmer must form a robust mental model of correct program behaviour in order to detect bugs; the cause/effect chasm seriously undermines the programmer's efforts to construct such a model. The relationship of this finding to the analysis of the other dimensions is reported below.
The informants reported four major bug-catching techniques, as follows:
gather data: This category refers to cases in which the informant decided to "find out more", e.g. by planting print statements or breakpoints. Here are the six sub-categories reported by the informants:
step & study: the programmer single-steps through the code, and studies the behaviour, typically monitoring changes to data structures
wrap & profile: tailor-made performance, metric, or other profiling information is collected by "wrapping" (enclosing) a suspect function inside a one-off variant of that function which calls (say) a timer or data-structure printout both before and after the suspect function.
print & peruse: print statements are inserted at particular points in the code, and their output is observed during subsequent runs of the program
dump & diff: either a true core dump or else some variation (e.g. extensive print statements) is saved to two text files corresponding to two different execution runs; the two files are then compared using a source-compare ("diff") utility
conditional break & inspect: a breakpoint is inserted into the code, typically triggered by some specific behaviour; data values are then inspected to determine what is happening
specialist profile tool (MEM or Heap Scramble): there are several off-the-shelf tools which detect memory leaks and corrupt or illegal memory references
"inspeculation": This name is meant to be a hybrid of "inspection" (code inspection), "simulation" (hand-simulation), and "speculation", which were among a wide variety of techniques mentioned explicitly or implicitly by informants. In other words, they either go away and think about something else for a while, or else spend a lot of time reading through the code and thinking about it, possibly hand-simulating an execution run.
expert recognised cliché: These are cases where the programmer called upon a cohort, and the cohort was able to spot the bug relatively simply. This recognition corresponds to the heuristic mapping reported in [5].
controlled experiments: Informants resorted to specific controlled experiments when they had a clear idea about what the root cause of the bug might be.
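Two of the data-gathering sub-categories listed above, "wrap & profile" and "dump & diff", are easy to sketch. The code below is a minimal illustration in Python, not a reconstruction of any informant's tooling: wrap_and_profile, dump_state and diff_dumps are names invented here, the suspect function normalise is hypothetical, and the "dumps" are small dictionaries standing in for a core dump or a file of print output.

```python
import difflib
import functools
import time


def wrap_and_profile(func, log):
    """Return a one-off variant of func that logs its arguments, result and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        result = func(*args, **kwargs)
        log.append({
            "name": func.__name__,
            "args": args,
            "result": result,
            "seconds": time.perf_counter() - started,
        })
        return result
    return wrapper


def dump_state(state):
    """Render a dictionary of program state as sorted 'name = value' lines."""
    return [f"{name} = {state[name]!r}" for name in sorted(state)]


def diff_dumps(good_state, bad_state):
    """Return only the lines that differ between two dumps, in unified-diff style."""
    lines = list(difflib.unified_diff(
        dump_state(good_state), dump_state(bad_state),
        fromfile="good_run", tofile="bad_run", lineterm="", n=0))
    # Skip the two file-header lines and the '@@' hunk markers.
    return [line for line in lines[2:] if not line.startswith("@@")]


# Hypothetical suspect function, wrapped for a single debugging session.
def normalise(values):
    return sorted(values)


calls = []
normalise = wrap_and_profile(normalise, calls)
normalise([3, 1, 2])
print(calls[0]["name"], calls[0]["args"], calls[0]["result"])

# Two hypothetical runs whose dumps differ in a single entry.
print(diff_dumps({"answer": "y", "count": 12}, {"answer": " ", "count": 12}))
```

In both cases the point is the one the informants make: the tool is thrown together for one bug, gathers data the standard debugger does not offer, and is discarded afterwards.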
The frequency of occurrence of the different debugging techniques is shown in Table 2.
gather data 27
inspeculation 13
expert recognised cliché 5
controlled experiments 4
??? (no information) 2
Techniques for bug-finding are clearly dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together account for 78% of the reported techniques, and highlight the kind of "groping" to which the programmer is reduced in difficult debugging situations. Let us now turn to an analysis of the root causes of the bugs before we go on to see how the different dimensions interrelate.
The bug causes reported by the informants fell into the following nine categories:
mem: Memory clobbered or used up. This cause has a variety of manifestations (e.g. overwriting a reserved portion of memory, and thereby causing the system to crash) and may even have deeper causes (e.g. array subscript out of bounds), yet is often singled out by the informants as being the source of the difficulty. Knuth has an analogous category, which he calls "D = Data structure debacle".
vendor: Vendor's problem (hardware or software). Some informants report buggy compilers or faulty logic boards, for which they either need to develop a workaround or else wait for the vendor to provide corrective measures.
des.logic: Unanticipated case (faulty design logic). In such cases, the algorithm itself has gone awry, because the programmer has not worked through all the cases correctly. This category encompasses both those which Knuth labels as "A = algorithm awry" and also those labelled as "S=surprise scenario".
init: Wrong initialisation; wrong type; definition clash. A programmer will sometimes make an erroneous type declaration, or re-define the meaning of some system keyword, or incorrectly initialise a variable. I refer to all of these as "init" errors.
var: Wrong variable or operator. Somehow, the wrong term has been used. The informant may not provide enough information to deduce whether this was really due to faulty design logic (des.logic) or whether it was a trivial lexical error (lex), though in the latter case trivial typos are normally mentioned explicitly as the root cause.
lex: Lexical problem, bad parse, or ambiguous syntax. These are meant to be trivial problems, not due to the algorithm itself, nor to faulty variables or declarations. This class of errors encompasses Knuth's "B=Blunder" and "T=Typo", which are hard to distinguish in informants' reports.
unsolved: Unknown and still unsolved to this day. Some informants never solved their problem!
lang: Language semantics ambiguous or misunderstood. In one case, an informant reports that he thought that 256K meant 256000, which is incorrect, and can be thought of as a semantic confusion.
behav: End-user's (or programmer's) subtle behaviour. For example, in one case the bug was caused by an end-user mysteriously depressing several keys on the keyboard at once, and in another case the bug involved some mischievous code inserted as a joke.
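The "mem" category is well illustrated by Story B in Figure 2. The sketch below simulates that bug in Python with a bytearray standing in for raw memory. The exact layout is an assumption made for illustration: the story says only that the routine wrote 12 bytes into an 8-byte area and that the stored "y" lay in the overwritten region, so placing the "y" immediately after the 8-byte field, and all the names used here, are mine.

```python
DAY_FIELD = 8   # bytes the caller reserved, following the documentation
WRITTEN = 12    # bytes the system routine actually writes

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def fresh_memory():
    """Lay out the 8-byte day field followed directly by the stored 'y' (assumed layout)."""
    memory = bytearray(b" " * 16)
    memory[DAY_FIELD] = ord("y")
    return memory


def store_day(memory, day):
    """Mimic the system routine: write the day name, blank-padded to 12 bytes."""
    memory[0:WRITTEN] = day.ljust(WRITTEN).encode("ascii")


def answer_flag_survives(day):
    """Report whether the stored 'y' is still intact after the day name is written."""
    memory = fresh_memory()
    store_day(memory, day)
    return memory[DAY_FIELD:DAY_FIELD + 1] == b"y"


print([day for day in DAYS if answer_flag_survives(day)])
```

Under this layout the ninth character of "Wednesday" happens to be the very "y" that the overrun destroys on every other day, which is why the program appeared to work once a week.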
Table 3 displays the frequency of occurrence of the nine underlying causes. The table indicates that the biggest culprits were memory overwrites and vendor-supplied hardware/software problems. Even ignoring vendor-specific difficulties, one implication of Table 3 is that 37% of the nastiest bugs reported by professionals could be addressed by (a) memory-analysis tools and (b) smarter compilers which trapped initialisation errors. Of course, these results are dominated by stories from C and C++ programmers. As Java gains in popularity, we will observe a concomitant decline in 'memory clobbering' errors, which simply can't occur (although we know anecdotally that vendor-specific variations in the Java Virtual Machine are still causing debugging woes that fall into the 'vendor-supplied hardware/software problem' category).
mem: Memory clobbered or used up 13
vendor: Vendor's problem (hardware or software) 9
des.logic: Unanticipated case (faulty design logic) 7
init: Wrong initialisation; wrong type; definition clash 6
lex: Lexical problem, bad parse, or ambiguous syntax 4
var: Wrong variable or operator 3
unsolved: unknown and still unsolved to this day 3
lang: language semantics ambiguous or misunderstood 2
behav: end-user's (or programmer's) subtle behaviour 2
??? (no information) 2
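The percentages quoted for Table 3 follow directly from the counts. As a quick check, the short Python sketch below recomputes them; the dictionary simply transcribes the table, and the helper name share is mine.

```python
ROOT_CAUSES = {
    "mem": 13, "vendor": 9, "des.logic": 7, "init": 6, "lex": 4,
    "var": 3, "unsolved": 3, "lang": 2, "behav": 2, "???": 2,
}


def share(categories, counts=ROOT_CAUSES):
    """Percentage of all anecdotes that fall into the given categories."""
    return 100 * sum(counts[name] for name in categories) / sum(counts.values())


print(sum(ROOT_CAUSES.values()))          # anecdotes classified in Table 3
print(round(share(["mem", "init"])))      # memory plus initialisation errors
print(round(share(["mem", "vendor"])))    # the two largest categories
```

The second figure is the 37% attributed to bugs that memory-analysis tools and initialisation-checking compilers could address; the third is the share taken by memory overwrites and vendor problems together.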
To understand the ways in which the three dimensions of analysis interrelate, we can place every anecdote precisely in our three-dimensional space. For expository purposes let's consider just a single two-dimensional comparison: how found vs. why difficult.
Table 4 compares reasons for difficulty (row labels) against bug-finding techniques (column labels). Cells with large numbers are noteworthy: indeed, the magnitude of certain cell entries is greater than that predictable by chance from the row and column totals alone (χ² = 33.50, df = 20, p < .05), suggesting in particular that data-gathering activities are of special relevance when a cause/effect chasm is involved or when the built-in debugging tools are somehow rendered inapplicable.
[Table 4: reasons for difficulty (rows) against bug-finding techniques (columns). Only the "??? (no info)" row survives in this copy: 4, 2, 2; row total 8.]
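For readers who want to see what "predictable by chance from the row and column totals" means, here is a minimal sketch of the standard Pearson chi-square calculation in Python. It is not the author's analysis script, and the 2 x 2 table at the end is made up purely to exercise the function, since the full cell counts of Table 4 are not reproduced here. The reported 20 degrees of freedom are consistent with a table of 6 rows (five reasons plus "no information") by 5 columns (four techniques plus "no information"), since (6 - 1) x (5 - 1) = 20.

```python
def expected_count(row_total, column_total, grand_total):
    """Cell count expected if rows and columns were independent."""
    return row_total * column_total / grand_total


def chi_square(observed):
    """Return (statistic, degrees of freedom) for a table given as a list of rows.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(observed, row_totals):
        for cell, column_total in zip(row, column_totals):
            expected = expected_count(row_total, column_total, grand_total)
            statistic += (cell - expected) ** 2 / expected
    degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)
    return statistic, degrees_of_freedom


# Marginal totals from Tables 1 and 2: 15 "cause/effect chasm" anecdotes,
# 27 "gather data" anecdotes, 51 anecdotes in all.
print(round(expected_count(15, 27, 51), 2))

# A made-up 2 x 2 table, used only to exercise the function.
print(chi_square([[10, 20], [20, 10]]))
```

A cell such as "cause/effect chasm" by "gather data" is noteworthy to the extent that its observed count exceeds the chance expectation computed in the first print statement.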
A niche of potential interest (and profit) to tool vendors is highlighted by looking at the relationship among the three dimensions: the most heavily populated cells are those involving data-gathering, cause-effect chasms and memory or initialisation errors. The implications of this finding are discussed in the next section.
A side-effect of this study is the realisation that complete strangers, with very little prompting and no incentive, are not only articulate in their reminiscences, but also very forthcoming with details. These people clearly enjoyed relating their debugging experiences. Moreover, the depth of supplied details seemed to be independent of whether I had explicitly posted my motivation (as I did on BIX and AppleLink) or not (as was the case on Usenet and CompuServe). Clearly, this is a self-selecting audience of email users and conference browsers who enjoy electronic "chatting" anyway, and some may even have felt an inner need to tell a good (and hence boastful) war story-- so much the better! I have no reason to distrust the sources, and the detailed stories certainly exhibit their own self-consistency. It is already widely accepted that the Internet is a gold-mine of information. This collection of anecdotes suggests that it may also be a rich repository of willing subjects ready to supply detailed knowledge in a fairly rigorous manner which may then serve as a resource for others. These stories, even without a definitive taxonomy, could help to provide a valuable adjunct to FAQ (Frequently Asked Question) repositories found on the World Wide Web and in growing "terabyte archives" such as those at http://www.dejanews.com. FAQ and generic Usenet discussion group repositories are a wonderful resource, but can be frustrating to access sensibly when an urgent debugging need arises.
It would be easy to say that what programmers really need are more robust design approaches, plus smarter compilers and debuggers! Fortunately, the analyses presented throughout this paper suggest that we can be more precise than simply demanding "robustness" from programmers/designers and "smartness" from tool developers. For one thing, we have identified a niche that really needs attention: the most heavily populated cell in our three-dimensional analysis suggests that a winning tool would be one which employed some data-gathering or traversal method to resolve large cause/effect chasms in the case of memory-clobbering errors (indeed Purify, described below, does precisely this). Secondly, we can propose solutions to the "why difficult" problems by considering the specific cases brought to light by the stories themselves. One way or another, most of the problems mentioned in the stories are connected with "directness" and "navigation". For example, the need to go through indirect steps, intermediate subgoals or obtuse lines of reasoning plagues the user encountering the most frequent problems in Table 1, and each of these problems can be addressed specifically. A possible way forward, described in more detail in a comparative fine-grained analysis which I undertook in [3], would involve paying heed to the following advice:
Computable relations should be computed on request, rather than be deduced by the user. A software tool can perform important and complicated deductions on the programmer's behalf, and thereby liberate the programmer from some tedious work. A good example of just such a tool is Purify, which analyses run-time memory leaks (e.g. lost memory cells, overflowed arrays) in C programs. Purify works by patching the object code at link time, and pinpoints the root cause of the leak by traversing many indirect dataflow links back to the offending source code. Thus, it already solves a much harder dataflow traversal problem than that required to deal with indirect pointer traversing such as that reported by several informants, and suggests a highly promising direction for the development of future tools.
Displayable states should be displayed on request, rather than be deduced by the user. Minimising deductive work is an important aspect of tools such as Zstep95, described in the paper by Ungar et al. in this volume.
Atomic user-goals should be mapped onto atomic actions. In other words, try to infer the programmer's likely intentions, so that frequently-occurring "reasonable" behaviours on the part of the programmer can be anticipated, with a concomitant reduction in wasteful "fine-tuning" activities (e.g. those which require the programmer to deal with digressions and irrelevant sub-goals just to get the tools working). Key steps in this direction are described in Fry's paper in this volume.
Allow full functionality at all times. Debugging environments which prevent access to certain facilities just make matters worse. The Kansas/Oz environment described in the accompanying paper by Smith et al. pushes this notion to its logical limits.
Viewers should be provided for "key players" (any evaluable expression) rather than just "variables". Several of the papers in this volume, particularly that of Baecker et al., take to heart the notion that more than variable-watching is at issue during the debugging process!
Provide a variety of navigation tools at different levels of granularity. Changing granularity is one of the hallmarks of the system underlying the work of Domingue and Mulholland in this volume, and offers programmers the opportunity to see appropriate views at appropriate times.
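The advice that viewers should cover any evaluable expression, rather than just variables, can be made concrete with a small sketch. The Python code below is my own illustration and does not describe how any of the tools cited above work: watch_expression is a hypothetical helper built on the interpreter's standard tracing hook (sys.settrace), and running_total is a made-up function under suspicion. It re-evaluates an arbitrary expression before each line of the traced function and records the points at which its value changes, which is one way of narrowing a cause/effect chasm to the first moment an expectation is violated.

```python
import sys


def watch_expression(func, expression, *args, **kwargs):
    """Run func(*args, **kwargs) and record each change in the value of `expression`.

    The expression is evaluated in func's own frame just before every line it
    executes and once more as it returns. Returns (result, changes), where
    changes is a list of (line offset within func, new value) pairs.
    """
    target = func.__code__
    changes = []

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace other functions
        if event in ("line", "return"):
            try:
                value = eval(expression, frame.f_globals, frame.f_locals)
            except NameError:
                return tracer  # some name in the expression is not bound yet
            if not changes or changes[-1][1] != value:
                changes.append((frame.f_lineno - target.co_firstlineno, value))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)
    return result, changes


# Hypothetical function under suspicion.
def running_total(values):
    total = 0
    for value in values:
        total += value
    return total


result, changes = watch_expression(running_total, "total < 0", [5, -7, 3])
print(result, [value for _, value in changes])
```

Because the watched item is an expression ("total < 0") and not a variable, the record shows when the property of interest flips, which is usually what the programmer wants to know. The helper calls eval on its argument, so it is suitable only for a programmer's own debugging sessions.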
The suggestions above are not necessarily easy to implement, but an increasing number of tools, both in the research community and in the marketplace, illustrate key aspects of them. These suggestions indicate that specific debugging needs can be addressed systematically, and that a detailed account of programmers' continuing problems is an important step in facilitating the evolution of appropriate solutions.
An analysis of the debugging anecdotes collected from a world-wide email trawl revealed three primary dimensions of interest: why the bugs were difficult to find, how the bugs were found, and root causes of bugs. Just over half of the difficulties arose from just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. Techniques for bug-finding were dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together accounted for almost 80% of the reported techniques. The two biggest causes of bugs were (i) memory overwrites and (ii) vendor-supplied hardware or software faults, which together accounted for more than 40% of the reported bugs. The analysis pinpoints a winning niche for future tools: data-gathering or traversal methods to resolve large cause/effect chasms in the case of memory-clobbering errors. Other specific suggestions emerge by analysing the underlying issues of "directness" and "navigation". The investigation highlights a potential wealth of information available on the Internet, and indicates that it may well be possible to establish an on-line repository for perusal by those with an urgent need to solve complex debugging problems. The indexed repository could offer stories in a manner more accessible than the type found in FAQ (Frequently Asked Questions) stories. FAQs can be informative when a relevant one is found, but can be frustrating to access sensibly in the heat of a debugging session.
Parts of this research were funded by the UK EPSRC/ESRC/MRC Joint Council Initiative on Cognitive Science and Human Computer Interaction, by the Commission of the European Communities ESPRIT-II Project 5365 (VITAL), and by Apple Computer, Inc.'s Advanced Technology Group, now Apple Research Labs.
[1] Brooks, R. E. Studying Programmer Behavior Experimentally: the problems of a proper methodology. Comm. ACM, 23, 4 (1980), 207-213.
[2] Curtis, W. By the way, did anyone study any real programmers? In E. Soloway & S. Iyengar (Eds.), Empirical Studies of Programmers. Norwood, NJ: Ablex, 1986.
[3] Eisenstadt, M. Why HyperTalk debugging is more painful than it ought to be. In J. Alty, D. Diaper and S.P. Guest (Eds.), People and Computers VIII. Cambridge, UK: Cambridge University Press, 1993.
[4] Johnson, W. L. An Effective Bug Classification Scheme Must Take the Programmer into Account. In Proceedings of the Workshop on High-Level Debugging. Palo Alto, CA, 1983.
[5] Katz, I. R. & Anderson, J. R. Debugging: An analysis of bug-location strategies. Human-Computer Interaction, 3, 4 (1988), 351-399.
[6] Knuth, D. E. The Errors of TeX. Software: Practice and Experience, 19, 7 (1989), 607-685.
[7] McCullough, P. L. Implementing the Smalltalk-80 System: The Tektronix Experience. In G. Krasner (Ed.), Smalltalk-80: Bits of History, Words of Advice. Reading, MA: Addison-Wesley, 1983, pp. 59-78.
[8] Pennington, N. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19 (1987), 295-341.
[9] Shneiderman, B. Software Psychology. Cambridge, MA: Winthrop, 1980.
[10] Soloway, E. & Iyengar, S. (Eds.). Empirical Studies of Programmers. Norwood, NJ: Ablex, 1986.
[11] Spohrer, J. C., Soloway, E., & Pope, E. A Goal/Plan Analysis of Buggy Pascal Programs. Human-Computer Interaction, 1, 2 (1985), 163-207.
[12] Vesey, I. Toward a theory of computer program bugs: an empirical test. International Journal of Man-Machine Studies, 30 (1989), 123-46.
Professor Marc Eisenstadt is Director
of the Knowledge Media Institute at the UK's Open University.
His interests lie in Cognitive Science, Artificial Intelligence,
Human-Computer Interaction, and New Media, particularly those
promising novel learning opportunities via remote telepresence.