"My hairiest bug" war stories
Marc Eisenstadt

Despite the availability of industrial-strength debuggers and integrated program development environments, professional programmers still have to engage in far more detective work than they ought to. This is their story.

Psychological studies of computer programming and debugging [1, 2, 5, 8, 9, 10, 11, 12], while important in their own right, have tended to overlook the potential benefit of self-reports by programmers reflecting the phenomenology of debugging, i.e. what it's like "out there in the trenches" from the programmer's perspective. Two exceptions to this are (a) the detailed account by Knuth of the log book that documented all the errors he encountered over a ten-year development period working on TeX [6], and (b) a log book of the development efforts of a team implementing the Smalltalk-80 virtual machine [7]. Such self-reports and log books are valuable sources of insight into the nature of software design, development, and maintenance. The work reported here attempts to expand this single-user-log-book approach to investigate the phenomenology of debugging across a large population of users, with the ultimate aim of understanding and addressing the problems faced by professional programmers working on very large programming tasks.

Toward this end, I conducted a survey of professional programmers, asking them to provide stories describing their most difficult bugs involving large pieces of software. The survey was conducted by electronic mail and conferencing/bulletin board facilities with world-wide access (Usenet newsgroups, BIX, CompuServe, and AppleLink). My contribution is to gather, edit and annotate the stories, and to categorise them in a way which may help to shed some light on the nature of the debugging enterprise. In particular, I look at the lessons learned from the stories, and discuss what they tell us about what is needed in the design of future debugging tools.

The Trawl

In early 1992, I posted a request for debugging anecdotes on an electronic bulletin board called BIX, the "BYTE Information Exchange", and followed this with similar messages posted to AppleLink, CompuServe, various Usenet newsgroups, and the Open University's own conferencing system (OU CoSy). The original message is shown in figure 1.

c.language/tools #2842, from meisenstadt
771 chars, Tue Mar 3 06:09:28 1992
Comment(s).

TITLE: Trawl for debugging anecdotes (w/emphasis on tools side)...

I'm looking for some (serious) anecdotes describing debugging experiences. In particular, I want to know about particularly thorny bugs in LARGE pieces of software which caused you lots of headaches. It would be handy if the large piece of software were written in C or C++, but this is not absolutely essential. I'd like to know how you cracked the problem-- what techniques/tools you used: did you 'home in' on the bug systematically, did the solution suddenly come to you in your sleep, etc. A VERY brief stream-of-consciousness reply (right now!) would be much much better than a carefully-worked-out story. I can then get back to you with further questions if necessary.

Thanks!

-Marc

Figure 1. The original "trawl" request. A second message, explaining my motivation, was also posted.

The trawl request elicited 110 messages from 78 "informants", mostly in the USA and UK. The group included implementors of very well-known commercial C compilers, members of the ANSI C++ definition group, and other known commercial software developers. Of the 78 informants, 50 specifically told a story about a nasty bug. A few informants provided several anecdotes, so that a total of 59 bug anecdotes were collected. Figure 2 shows some typical replies to the original request (the full set of replies and analyses thereof is available from the author).

[Story A, complete] I had a bug in a compiler for 8086's running MSDOS once that stands out in my mind. The compiler returned function values on the stack and once in a while such a value would be wrong. When I looked at the assembly code, all seemed fine. The value was getting stored at the correct location of the stack. When I stepped thru it in the assembly-level debugger and got to that store, sure enough, the effective address was correct in the stack frame, and the right value was in the register to be stored. Here's the weird thing --- when I stepped through the store instruction the value on the stack didn't change. It seems obvious in retrospect, but it took some hours for me to figure out that the effective address was below the stack pointer (stacks grow down here), and the stored value was being wiped out by os interrupt handlers (that don't switch stacks) about 18 times a second. The stack pointer was being decremented too late in the compiled code.

==========================================================================

[Story B, excerpt] ...I once had a program that only worked properly on Wednesdays...The documentation claimed that the day of the week was returned in a doubleword, 8 bytes. In actual fact, Wednesday is 9 characters long, and the system routine actually expected 12 bytes of space to put the day of the week. Since I was supplying only 8 bytes, it was writing 4 bytes on top of storage area intended for another purpose. As it turned out, that space was where a "y" was supposed to be stored to compare to the users answer. Six days a week the system would wipe out the "y" with blanks, but on Wednesdays a "y" would be stored in its correct place.

==========================================================================

[Story C, excerpt] ...The program only crashed after running about 45000 iterations of the main simulation loop... Somewhere, somehow, someone was walking over memory. But that somewhere could have been *anywhere* - writing in one of the many global arrays, for example....The bug turned out to be a case of an array of shorts (max value 32k) that was having certain elements incremented every time they were "used", the fastest use being about every 1.5 iterations of the simulator. So an element of an array would be incremented past 32k, back down to -32k. This value was then used as an array index. ....But of course the actual seg fault was happening several iterations after the error - the bogus write into memory. It took 3 hours for the program to crash, so creating test cases took forever. I couldn't use any of the heavier powered debugging malloc()s, or use watchpoints, because those slow a program down at least 10 fold, resulting in 30 hours to track a bug.

Figure 2. Some typical debugging anecdotes.
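
Story B can be replayed in a few lines. The sketch below is a hypothetical reconstruction, not the informant's code: it assumes the "y" constant sat in the byte immediately after the 8-byte buffer and that the system routine blank-padded the day name to 12 bytes. The names `get_day_name` and `flag_survives` are invented for illustration.

```python
# Hypothetical simulation of Story B: a 12-byte write into an 8-byte buffer.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

BUFFER_SIZE = 8      # space the caller reserved, as the documentation claimed
ROUTINE_WIDTH = 12   # space the system routine actually filled
FLAG_OFFSET = 8      # assumption: the "y" constant sits right after the buffer


def get_day_name(memory: bytearray, day: str) -> None:
    """Mimic the system routine: write the blank-padded name, 12 bytes wide."""
    padded = day.ljust(ROUTINE_WIDTH).encode("ascii")
    memory[0:ROUTINE_WIDTH] = padded


def flag_survives(day: str) -> bool:
    """Return True if the stored "y" is still a "y" after the call."""
    memory = bytearray(b" " * BUFFER_SIZE + b"y" + b" " * 3)
    get_day_name(memory, day)
    return memory[FLAG_OFFSET:FLAG_OFFSET + 1] == b"y"


working_days = [day for day in DAYS if flag_survives(day)]
print(working_days)
```

Under these assumptions only "Wednesday" is long enough for its ninth character, a "y", to land in the overwritten byte; every other day name leaves a blank there.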

Analysis of the anecdotes

Dimensions of analysis: why difficult, how found, and root cause

Although the "root cause" of reported bugs is of a priori interest, in order to fully characterise the phenomenology of the debugging experiences I needed to look at more than the causes of the bugs. After several iterations of summarising the data, it became apparent that it would be necessary to say something about (i) why a bug was hard to find (which might or might not be related to the underlying cause), and (ii) how it was found (which might or might not be related to the underlying cause and the reason for the difficulty) in addition to (iii) the root cause (what really went wrong).

We know something about each of these dimensions from previous studies. Vesey [12] attempted to address the first dimension (why difficult) by asking how the time to find a bug depended upon its location in a program's structure and its level in a propositional analysis of the program (answers: location in serial structure has no effect, and level in propositional structure is inconclusive). Regarding techniques for bug finding (second dimension), Katz & Anderson [5] reported a variety of bug-location strategies among experienced Lisp subjects in a laboratory setting involving small (10-line) programs. They distinguished among (i) strategies which detected a heuristic mapping between a bug's manifestation and its origin, (ii) those which relied on a hand simulation of execution, and (iii) those which resorted to some kind of causal reasoning. Goal-driven reasoning (either heuristic mapping or causal reasoning) was predominant among subjects who were debugging their own code, whereas data-driven reasoning (typically hand simulation) was predominant among subjects who were debugging other programmers' code. For the kind of programming-in-the-large being studied here, the need for a bottom-up data gathering phase, which helps the programmer get some approximate notion of where the bug might be located, becomes apparent.

As far as root causes are concerned (dimension three), two main approaches to the development of bug taxonomies have been followed: a deep plan analysis approach (e.g. [4, 11]) and a phenomenological account (e.g. [6]). Johnson [4] worked on the premise that a large number of bugs could be accounted for by analysing the high level abstract plans underlying specific programs, and specifying both the possible fates that a plan component could undergo (e.g. missing, spurious, misplaced) and the nature of the program constructs involved (e.g. inputs, outputs, initialisations, conditionals). Spohrer et al. [11] refined this analysis by pointing out the critical nature of bug interdependencies and problem-dependent goals and plans. An alternative characterisation of bugs was provided by Knuth's analyses [6], which uncovered the following nine (problem-independent) categories: A = algorithm awry; B = blunder or botch; D = data structure debacle; F = forgotten function; L = language liability, i.e. misuse or misunderstanding of the tools/language/hardware ("imperfectly knowing the tools"); M = mismatch between modules ("imperfectly knowing the specifications", e.g. interface errors involving functions called with reversed arguments); R = reinforcement of robustness (e.g. handling erroneous input); S = surprise scenario (bad bugs which forced design change, unforeseen interactions); T = trivial typo.

For both approaches (plan analysis vs. phenomenological) the "true" cause of a bug can really only be resolved by the original programmer, because it is necessary to understand the programmer's state of mind at the time the bug was spawned in order to be able to assess the cause properly. I found it informative to evolve my own categories in a largely bottom-up fashion after extensive inspection of the data, and then compare them specifically with the ones provided by Knuth. The criterion I have adopted for identifying root causes is as follows: when the programmer is essentially satisfied that several hours or days of bewilderment have come to an end once a particular culprit is identified, then that culprit is the root cause, even when deeper causes can be found. I have adopted this approach (a) because a possible infinite regress is nipped in the bud, (b) because it is consistent with my emphasis on the phenomenology of debugging, i.e. what is apparently taking place as far as the front-line programmer is concerned, and (c) because it enables me to concentrate on what the programmers reported, and not try to second-guess them.

The subsections which follow describe the three dimensions of analysis (why difficult; how found; root cause) in turn.

Dimension 1: Why difficult

Results

In this and subsequent sections, I report the frequency of occurrence of the different categories, not because it supports an a priori hypothesis at some level of statistical significance, but rather because it gives us a convenient overview of the nature of the problems that the informants chose to share with us. The frequency of occurrence of the different reasons for having difficulty is shown in Table 1.

Category                                              Occurrences

cause/effect chasm                                    15
tools inapplicable or hampered                        12
WYSIPIG: What you see is probably illusory, guv'nor   7
faulty assumption/model or mis-directed blame         6
spaghetti (unstructured) code                         3
??? (no information)                                  8

Table 1. Why the bugs were difficult to track down.

Thus, 53% of the difficulties are attributable to just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. The high frequency of reports of cause/effect chasms accords well with the analyses of Vesey [12] and Pennington [8], which argue that the programmer must form a robust mental model of correct program behaviour in order to detect bugs; the cause/effect chasm seriously undermines the programmer's efforts to construct a robust mental model. The relationship of this finding to the analysis of the other dimensions is reported below.
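
Story C in Figure 2 is a textbook temporal chasm: a counter silently wraps, and the crash surfaces only later. The sketch below is a hypothetical model rather than the informant's program. It assumes a 16-bit two's-complement counter starting at zero and approximates "one use every 1.5 iterations" as two increments in every three iterations; `increment_int16`, `iteration_of_wrap` and `checked_index` are invented names.

```python
INT16_MIN = -2**15


def increment_int16(value: int) -> int:
    """Add 1 with 16-bit two's-complement wraparound (assumed model of a C short)."""
    return (value + 1 - INT16_MIN) % 2**16 + INT16_MIN


def iteration_of_wrap() -> int:
    """Return the simulator iteration at which the counter first goes negative."""
    counter = 0
    iteration = 0
    while counter >= 0:
        iteration += 1
        if iteration % 3 != 0:  # two increments in every three iterations
            counter = increment_int16(counter)
    return iteration


def checked_index(value: int, size: int) -> int:
    """Fail at the point of use instead of letting a bad index reach memory."""
    if not 0 <= value < size:
        raise IndexError(f"index {value} outside 0..{size - 1}")
    return value


wrap_iteration = iteration_of_wrap()
print(wrap_iteration)
```

Under these assumptions the wrap occurs at iteration 49151, the same order of magnitude as the roughly 45000 iterations the informant reported. A range check such as `checked_index` reports the fault at the moment the bad index is used, which is exactly the gap between cause and symptom that made the original bug so costly to find.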

Dimension 2: How found

Results

The frequency of occurrence of the different debugging techniques is shown in Table 2.

Category                   Occurrences

gather data                27
inspeculation              13
expert recognised cliché   5
controlled experiments     4
??? (no information)       2

Table 2. Techniques used to track down the bugs.

Techniques for bug-finding are clearly dominated by reports of data-gathering (e.g. print statements) and hand-simulation (the "gather data" and "inspeculation" rows of Table 2), which together account for 78% of the reported techniques and highlight the kind of "groping" to which the programmer is reduced in difficult debugging situations. Let us now turn to an analysis of the root causes of the bugs before we go on to see how the different dimensions interrelate.

Dimension 3: Root cause

Results

Table 3 displays the frequency of occurrence of the nine underlying causes. The table indicates that the biggest culprits were memory overwrites and vendor-supplied hardware/software problems. Even ignoring vendor-specific difficulties, one implication of Table 3 is that 37% of the nastiest bugs reported by professionals could be addressed by (a) memory-analysis tools and (b) smarter compilers which trapped initialisation errors. Of course, these results are dominated by stories from C and C++ programmers. As Java gains in popularity, we will observe a concomitant decline in 'memory clobbering' errors, which simply can't occur (although we know anecdotally that vendor-specific variations in the Java Virtual Machine are still causing debugging woes that fall into the 'vendor-supplied hardware/software problem' category).

Category                                                     Occurrences

mem: Memory clobbered or used up                             13
vendor: Vendor's problem (hardware or software)              9
des.logic: Unanticipated case (faulty design logic)          7
init: Wrong initialisation; wrong type; definition clash     6
lex: Lexical problem, bad parse, or ambiguous syntax         4
var: Wrong variable or operator                              3
unsolved: unknown and still unsolved to this day             3
lang: language semantics ambiguous or misunderstood          2
behav: end-user's (or programmer's) subtle behaviour         2
??? (no information)                                         2

Table 3. Underlying causes of the reported bugs.
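
The percentages quoted alongside Tables 1 to 3 follow directly from the tallies. The sketch below recomputes them; the dictionaries copy the table counts (with category labels shortened), and the helper name `share` is an illustrative assumption.

```python
TABLE_1 = {  # why difficult
    "cause/effect chasm": 15, "tools hampered": 12, "WYSIPIG": 7,
    "faulty assumption": 6, "spaghetti": 3, "no information": 8,
}
TABLE_2 = {  # how found
    "gather data": 27, "inspeculation": 13, "expert recognised cliché": 5,
    "controlled experiments": 4, "no information": 2,
}
TABLE_3 = {  # root cause
    "mem": 13, "vendor": 9, "des.logic": 7, "init": 6, "lex": 4,
    "var": 3, "unsolved": 3, "lang": 2, "behav": 2, "no information": 2,
}


def share(table: dict, *categories: str) -> int:
    """Percentage of all entries in `table` that fall in `categories`, rounded."""
    total = sum(table.values())
    return round(100 * sum(table[name] for name in categories) / total)


print(share(TABLE_1, "cause/effect chasm", "tools hampered"))  # chasm + tools
print(share(TABLE_2, "gather data", "inspeculation"))          # dominant techniques
print(share(TABLE_3, "mem", "init"))                           # memory + initialisation
print(share(TABLE_3, "mem", "vendor"))                         # two biggest causes
```

Each table sums to 51 entries, and the four shares come out as 53, 78, 37 and 43 percent respectively.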

Relating the dimensions

To understand the ways in which the three dimensions of analysis interrelate, we can place every anecdote precisely in our three-dimensional space. For expository purposes let's consider just a single two-dimensional comparison: how found vs. why difficult.

Table 4 compares reasons for difficulty (row labels) against bug-finding techniques (column labels). Cells with large numbers are noteworthy: indeed, the magnitude of certain cell entries is greater than that predictable by chance from the row and column totals alone (χ² = 33.50, df = 20, p < .05), suggesting in particular that data-gathering activities are of special relevance when a cause/effect chasm is involved or when the built-in debugging tools are somehow rendered inapplicable.

WHY vs. HOW         gather data  inspeculation  expert recognised cliché  controlled experiments  ??? (no info)  Totals

cause/effect chasm  9.83         3.00           1.50                      2.50                                   16.83
tools hampered      9.83         2.00                                     2.00                                   13.83
WYSIPIG             2.00         2.00           1.50                                              2.00           7.50
faulty assumption   2.50         3.00           1.00                                                             6.50
spaghetti           1.33         1.00                                                                            2.33
??? (no info)       4.00         2.00           2.00                                                             8.00

TOTALS              29.50        13.00          6.00                      4.50                    2.00           55.00

Table 4. Tally of why bugs were difficult (rows) vs. how found (columns). Each cell entry (e.g. 9.83) is a tally of the number of anecdotes reporting that cell's row label (i.e. why difficult) and column label (i.e. how found). Fractional entries reflect anecdotes which have been divided into multiple categories, so that an anecdote reporting three reasons for difficulty scores .33 in each of three relevant cells.
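
The fractional scoring described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: each anecdote is taken to report one technique and one or more reasons for difficulty, and the two sample anecdotes are invented, not drawn from the survey data. How anecdotes reporting several techniques were scored is not specified here, so the sketch does not attempt it.

```python
from collections import defaultdict
from fractions import Fraction


def tally(anecdotes):
    """Split each anecdote's single unit of weight evenly across its reasons.

    `anecdotes` is a list of (reasons, technique) pairs. The result maps
    (reason, technique) cells to fractional counts.
    """
    cells = defaultdict(Fraction)
    for reasons, technique in anecdotes:
        weight = Fraction(1, len(reasons))
        for reason in reasons:
            cells[(reason, technique)] += weight
    return cells


# Invented sample data for illustration only.
sample = [
    (["cause/effect chasm", "tools hampered", "spaghetti"], "gather data"),
    (["cause/effect chasm"], "inspeculation"),
]
cells = tally(sample)
for (reason, technique), count in sorted(cells.items()):
    print(f"{reason:20s} {technique:15s} {float(count):.2f}")
```

Because each anecdote contributes exactly one unit in total, the grand total of such a table equals the number of anecdotes tallied.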

A niche of potential interest (and profit) to tool vendors is highlighted by looking at the relationship among the three dimensions: the most heavily populated cells are those involving data-gathering, cause/effect chasms, and memory or initialisation errors. The implications of this finding are discussed in the next section.

Discussion

From boasting war stories to on-line repository

A side-effect of this study is the realisation that complete strangers, with very little prompting and no incentive, are not only articulate in their reminiscences, but also very forthcoming with details. These people clearly enjoyed relating their debugging experiences. Moreover, the depth of supplied details seemed to be independent of whether I had explicitly posted my motivation (as I did on BIX and AppleLink) or not (as was the case on Usenet and CompuServe). Clearly, this is a self-selecting audience of email users and conference browsers who enjoy electronic "chatting" anyway, and some may even have felt an inner need to tell a good (and hence boastful) war story, so much the better! I have no reason to distrust the sources, and the detailed stories certainly exhibit their own self-consistency. It is already widely accepted that the Internet is a gold-mine of information. This collection of anecdotes suggests that it may also be a rich repository of willing subjects ready to supply detailed knowledge in a fairly rigorous manner which may then serve as a resource for others. These stories, even without a definitive taxonomy, could help to provide a valuable adjunct to FAQ (Frequently Asked Questions) repositories found on the World Wide Web and in growing "terabyte archives" such as those at http://www.dejanews.com. FAQ and generic Usenet discussion group repositories are a wonderful resource, but can be frustrating to access sensibly when an urgent debugging need arises.

Possible ways forward

It would be easy to say that what programmers really need are more robust design approaches, plus smarter compilers and debuggers! Fortunately, the analyses presented throughout this paper suggest that we can be more precise than simply demanding "robustness" from programmers/designers and "smartness" from tool developers. For one thing, we have identified a niche that really needs attention: the most heavily populated cell in our three-dimensional analysis suggests that a winning tool would be one which employed some data-gathering or traversal method to resolve large cause/effect chasms in the case of memory-clobbering errors (indeed Purify, described below, does precisely this). Secondly, we can propose solutions to the "why difficult" problems by considering the specific cases brought to light by the stories themselves. One way or another, most of the problems mentioned in the stories are connected with "directness" and "navigation". For example, the need to go through indirect steps, intermediate subgoals or obtuse lines of reasoning plagues the user encountering the most frequent problems in Table 1, and each of these problems can be addressed specifically. A possible way forward, described in more detail in a comparative fine-grained analysis which I undertook in [3], would involve paying heed to the following advice:

Computable relations should be computed on request, rather than be deduced by the user. A software tool can perform important and complicated deductions on the programmer's behalf, and thereby liberate the programmer from some tedious work. A good example of just such a tool is Purify, which analyses run-time memory leaks (e.g. lost memory cells, overflowed arrays) in C programs. Purify works by patching the object code at link time, and pinpoints the root cause of the leak by traversing many indirect dataflow links back to the offending source code. Thus, it already solves a much harder dataflow traversal problem than that required to deal with indirect pointer traversing such as that reported by several informants, and suggests a highly promising direction for the development of future tools.

Displayable states should be displayed on request, rather than be deduced by the user. Minimising deductive work is an important aspect of tools such as Zstep95, described in the paper by Ungar et al. in this volume.

Atomic user-goals should be mapped onto atomic actions. In other words, try to infer the programmer's likely intentions, so that frequently-occurring "reasonable" behaviours on the part of the programmer can be anticipated, with a concomitant reduction in wasteful "fine-tuning" activities (e.g. those which require the programmer to deal with digressions and irrelevant sub-goals just to get the tools working). Key steps in this direction are described in Fry's paper in this volume.

Allow full functionality at all times. Debugging environments which prevent access to certain facilities just make matters worse. The Kansas/Oz environment described in the accompanying paper by Smith et al. pushes this notion to its logical limits.

Viewers should be provided for "key players" (any evaluable expression) rather than just "variables". Several of the papers in this volume, particularly that of Baecker et al., take to heart the notion that more than variable-watching is at issue during the debugging process!

Provide a variety of navigation tools at different levels of granularity. Changing granularity is one of the hallmarks of the system underlying the work of Domingue and Mulholland in this volume, and offers programmers the opportunity to see appropriate views at appropriate times.

The suggestions above are not necessarily easy to implement, but an increasing number of tools illustrating key aspects of them are appearing both in the research community and in the marketplace. They indicate that specific debugging needs can be addressed systematically, and that a detailed account of programmers' continuing problems is an important step in facilitating the evolution of appropriate solutions.
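
As a toy illustration of the first piece of advice (computable relations should be computed on request), the sketch below records which caller last wrote each cell of a buffer, so that "who clobbered this value?" becomes a query rather than a deduction. It is an assumption-laden sketch and not how Purify works (Purify patches object code at link time); `TracedBuffer`, `initialise` and `stray_update` are invented names.

```python
import traceback


class TracedBuffer:
    """Fixed-size buffer that remembers which function last wrote each cell."""

    def __init__(self, size: int) -> None:
        self._cells = [0] * size
        self._last_writer = [None] * size

    def write(self, index: int, value: int) -> None:
        # The second-newest stack frame is the caller of write().
        caller = traceback.extract_stack(limit=2)[0]
        if not 0 <= index < len(self._cells):
            raise IndexError(
                f"out-of-bounds write at index {index} "
                f"from {caller.name}, line {caller.lineno}"
            )
        self._cells[index] = value
        self._last_writer[index] = caller.name

    def read(self, index: int) -> int:
        return self._cells[index]

    def last_writer(self, index: int):
        """Answer on request: which function last wrote this cell?"""
        return self._last_writer[index]


def initialise(buffer: TracedBuffer) -> None:
    buffer.write(0, 42)


def stray_update(buffer: TracedBuffer) -> None:
    buffer.write(0, -1)  # the "clobbering" write in this toy example


buf = TracedBuffer(4)
initialise(buf)
stray_update(buf)
culprit = buf.last_writer(0)
print(culprit)
```

The design choice is to pay a small cost on every write so that the cause is on hand when the symptom appears. Several informants noted that heavyweight instrumentation can slow a program tenfold, so a real tool would need to make this bookkeeping far cheaper than the sketch does.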

Summary and conclusions

An analysis of the debugging anecdotes collected from a world-wide email trawl revealed three primary dimensions of interest: why the bugs were difficult to find, how the bugs were found, and root causes of bugs. Half of the difficulties arose from just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. Techniques for bug-finding were dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together accounted for almost 80% of the reported techniques. The two biggest causes of bugs were (i) memory overwrites and (ii) vendor-supplied hardware or software faults, which together accounted for more than 40% of the reported bugs. The analysis pinpoints a winning niche for future tools: data-gathering or traversal methods to resolve large cause/effect chasms in the case of memory-clobbering errors. Other specific suggestions emerge by analysing the underlying issues of "directness" and "navigation". The investigation highlights a potential wealth of information available on the Internet, and indicates that it may well be possible to establish an on-line repository for perusal by those with an urgent need to solve complex debugging problems. The indexed repository could offer stories in a manner more accessible than the type found in FAQ (Frequently Asked Questions) stories. FAQs can be informative when a relevant one is found, but can be frustrating to access sensibly in the heat of a debugging session.

Acknowledgements

Parts of this research were funded by the UK EPSRC/ESRC/MRC Joint Council Initiative on Cognitive Science and Human Computer Interaction, by the Commission of the European Communities ESPRIT-II Project 5365 (VITAL), and by Apple Computer, Inc.'s Advanced Technology Group, now Apple Research Labs.

References

[1] Brooks, R. E. Studying Programmer Behavior Experimentally: the problems of a proper methodology. Comm. ACM, 23, 4 (1980), 207-213.

[2] Curtis, W. By the way, did anyone study any real programmers? In E. Soloway & S. Iyengar (Eds.), Empirical Studies of Programmers. Norwood, NJ: Ablex, 1986.

[3] Eisenstadt, M. Why HyperTalk debugging is more painful than it ought to be. In J. Alty, D. Diaper and S.P. Guest (Eds.), People and Computers VIII. Cambridge, UK: Cambridge University Press, 1993.

[4] Johnson, W. L. An Effective Bug Classification Scheme Must Take the Programmer into Account. In Proceedings of the Workshop on High-Level Debugging. Palo Alto, CA, 1983.

[5] Katz, I. R. & Anderson, J. R. Debugging: An analysis of bug-location strategies. Human-Computer Interaction, 3, 4 (1988), 351-399.

[6] Knuth, D. E. The Errors of TeX. Software-Practice and Experience, 19, 7 (1989), 607-685.

[7] McCullough, P. L. Implementing the Smalltalk-80 System: The Tektronix Experience. In G. Krasner (Ed.), Smalltalk-80: Bits of History, Words of Advice. Reading, MA, USA: Addison-Wesley, 1983, pp. 59-78.

[8] Pennington, N. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19 (1987), 295-341.

[9] Shneiderman, B. Software Psychology. Cambridge, MA: Winthrop, 1980.

[10] Soloway, E. & Iyengar, S. (Eds.). Empirical Studies of Programmers. Norwood, NJ: Ablex, 1986.

[11] Spohrer, J. C., Soloway, E., & Pope, E. A Goal/Plan Analysis of Buggy Pascal Programs. Human-Computer Interaction, 1, 2 (1985), 163-207.

[12] Vesey, I. Toward a theory of computer program bugs: an empirical test. International Journal of Man-Machine Studies, 30 (1989), 123-146.

Biography

Professor Marc Eisenstadt is Director of the Knowledge Media Institute at the UK's Open University. His interests lie in Cognitive Science, Artificial Intelligence, Human-Computer Interaction, and New Media, particularly those promising novel learning opportunities via remote telepresence.
