Thrown from Kansas to Oz
Collaborative Debugging when a Shared World Breaks

Randall B. Smith, Mario Wolczko, and David Ungar

{randall.smith, mario.wolczko, david.ungar}@sun.com

Sun Microsystems Laboratories

Banner: What happens when someone in a multi-user programmable virtual reality makes a mistake and breaks the world? We show that the world can be constructed to help heal itself from within itself. But sometimes the world must send the users into a metaworld where they can collaboratively debug the problem.

Why don't the laws of physics ever break? As they carry out their inconceivably complex task, the processes that handle the motion and interaction of real world objects seem to be incredibly consistent and reliable. Of course systems of physical objects (cars, toasters) sometimes adopt what we consider "broken" configurations, but this is not due to some glitch in the laws of physics. Rather, physical systems break because of unintended changes to the state of objects from which they are composed.

But the laws of physics lie beyond our reach, and this is perhaps the secret to the reliability of physical law. For if the law of gravity, say, were itself a physical object, then we could in principle break it. In an effort to improve gravity, some well-intentioned engineer might cause a divide by zero, and then what would happen?

Computers enable us to fashion our own universes, to choose our own set of "physical laws." Today, for example, most virtual reality designers have chosen for convenience not to have a gravitational attraction between objects. Although nowhere near as reliable as reality, many of these computer-based systems seem to have their physics well debugged. Some multi-user realities routinely run weeks on end while hosting dozens or even hundreds of visitors (for an overview, see [5][6]). As with reality, these systems benefit by having the underlying laws out of reach - virtual inhabitants usually cannot make deep changes to the system.

Some collaborative virtual environments do allow their users to "program." For example, LambdaMOO [1] allows users to make new kinds of objects with new behavior. In Kansas, the system we report on here, we go much further. Nearly anything can be reprogrammed by the virtual Kansans, including the computations that display an object, and the mechanisms underlying arithmetic. Indeed, Kansas is a multi-user programming environment for the Self language [2], out of which the system is built.
In Kansas, two plus two can equal five, at least until the resulting inconsistency brings the entire system crashing to a halt.

Though sometimes dangerous, such malleability can be very useful. The ability to change Kansas while it is running has enabled its developers to use the system to collaboratively build it. Those using Kansas for experiments in real-time collaboration can work together, sometimes across thousands of miles, to modify the system for their use. Furthermore, end users can (and do) request new features - a participant with enough expertise can make modifications while the other users continue their work.

But where there is programming, there are programming errors. In Kansas, we identify three levels of error. At one end of the spectrum lies the "benign" error, in which the shared world continues to run. Users are notified of the error from within Kansas, and they get an opportunity to debug. At the opposite end of the spectrum lie "fatal" errors: programming changes which ruin the system's ability to create and maintain consistent worlds. The illusion of a shared reality crashes, though a user with sufficient expertise may still repair things from a (single user) command line interface.

But between these two extremes lies an interesting third realm. Sometimes the original shared reality can no longer run, but it can be programmatically repaired and restarted with an appropriate debugger placed among the inhabitants. If this is insufficient, often the underlying system can still make new shared worlds. In this case we have the system create a new collaborative world for debugging problems in Kansas, a world we call Oz.

We will discuss Kansas in more detail, and describe facilities which support the collaborative repair of Kansas from within Oz.
We believe the Oz metaphor and the kind of facilities we provide can be fruitfully adopted by any shared world malleable enough to support its own reprogramming and reflective enough to examine its own computational state.

Kansas

Kansas, like its real world namesake, is a large flat space with multiple people in it. Each Kansas user sees a rectangular patch of the surface. Users can pan their rectangular window over Kansas to move among the various objects that lie scattered across the vast plain. Users may move apart to work in isolation, or may come together to view the same set of objects for collaborative work (see Figure 1).

Figure 1. An overview of Kansas. Three users are present, each with a bounded window onto a larger world of objects, which include desktop video images.


Each participant's window boundary is depicted on the Kansas surface. If someone is panning their view over the surface, you might see his or her window bounds move across your screen as they slide through your territory. Users can arrange to have their windows partially overlap so that they have some objects in common, and some to themselves.

Generally, a wide variety of objects are scattered across the surface of Kansas. A strolling user may encounter parts of interactive animated simulations, programming or interface construction tools, communication objects such as live video images, and so on.

Figure 2. A user encounters a "benign" error while working with an outliner for the red particle. The process created to handle the evaluation is suspended and a debugger is attached to the cursor of the guilty party.

By default, all Kansas users are equal. Any user can grab and carry any object, operate any slider, or press any button. Recent pilot studies with distributed groups of high school students led us to add mechanisms that can block arbitrary mouse or keyboard events to an object on a per-user basis, an example of how the system can be extended to explore new sharing semantics.

The principal programming tool in Kansas is the "outliner" (the gray object in Figure 2). An outliner is the Self-level (i.e., programmer's) view of an object. The outliner enables a user to make arbitrary changes to an object - such as the addition, removal, or editing of attributes and methods. During intensive programming, Kansas can have dozens of outliners scattered around. (For a more complete discussion of Kansas as a user interface construction environment, see [4].)

The objects visible in Kansas are all kinds of "morphs." The term comes from the Greek for "form" or "thing." A morph can have other morphs stuck onto its surface to create more complex graphical effects. For example, the outliner in Figure 2 is composed of 149 morphs: even the labels on the buttons are morphs. Any morph can be pulled out of its hosting morph structure, or otherwise directly edited while Kansas is running. To examine or modify a morph at the Self level, its outliner may be summoned by selecting from a pop-up menu.

A brief story may illustrate the utility of Kansas's malleability. During a remote simulation session, students were finding it difficult and tedious to quickly pan across Kansas to retrieve a simulated projectile after it had been launched at high speed. A facilitator (one of the authors) simply summoned the outliner for the projectile, and added a "come home" method, which set the projectile's velocity to zero and reset its position. By selecting from a pop-up menu, the facilitator then created a button that sent this message: pressing this button made the wayward projectile appear at rest, back "home" on the students' screens. Of course, making such modifications requires expertise, but our point is that the session could continue uninterrupted while the improvements were made.
Editing in a live world like this may also help in expertise transfer - one student learned how to change the projectile mass by watching an expert respond to another student's request for lightweight projectiles.
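The "come home" change in the story above can be pictured with a small sketch. It is written in Python rather than Self, and every name in it (Projectile, come_home, Button) is a hypothetical stand-in, not the code used in the session. It assumes only what the story states: the method zeroes the velocity and resets the position, and a button sends that message.

```python
from dataclasses import dataclass
from typing import Tuple

Vector = Tuple[float, float]


@dataclass
class Projectile:
    """Hypothetical stand-in for the simulated projectile morph."""
    home: Vector
    position: Vector
    velocity: Vector

    def step(self, dt: float) -> None:
        # Advance the projectile along its current velocity.
        x, y = self.position
        vx, vy = self.velocity
        self.position = (x + vx * dt, y + vy * dt)

    def come_home(self) -> None:
        # The behavior added mid-session: stop, then return to the start.
        self.velocity = (0.0, 0.0)
        self.position = self.home


class Button:
    """Hypothetical button that sends one named message to one target."""

    def __init__(self, target, message: str):
        self.target = target
        self.message = message

    def press(self) -> None:
        # Look the message up by name and send it to the target.
        getattr(self.target, self.message)()
```

Pressing `Button(projectile, "come_home")` leaves the projectile at rest at its home position, while any other objects in the world are untouched.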

Debugging in Kansas

The Kansas environment and the Self language arose from the same design center [3]. Kansas and Self are intended to work hand-in-hand in creating an overall experience of operating in a world of tangible, yet very malleable objects. Those inhabitants of Kansas who are willing to evaluate Self expressions or even change a few objects are really Self programmers. Adventurous programmers can change Kansas itself.

What happens when a Kansas user makes a programming error? The environment is intended to encourage exploratory programming, so we must not penalize errors unduly. In particular, we must not punish all users for a single person's mistake. Fortunately, it is not possible to cause the underlying Self Virtual Machine to fail through a programming error, so it is possible in Kansas to attempt to handle errors gracefully.

To illustrate, let's watch a user building a simulation to illustrate relativistic effects such as time dilation and length contraction. As a first step, the user may take physics book in hand and experiment with the Lorentz contraction formula. In Figure 2, the user has typed in the formula, but has not defined c anywhere (c is the speed of light) so evaluation will result in an error.

For any error in Kansas we are faced with two problems unique to large, shared, programmable worlds. First, we want to keep the environment running if at all possible. While true in any system, this is arguably more urgent in multi-user systems, where the activities of one user should not normally interfere with those of others. Second, we want to know which of the users in the large space to inform of the error.

We can solve these two problems in the benign case as follows. In Kansas, a button press typically creates a new process, so these evaluations are isolated from the fundamental operation of Kansas itself. (A separate process also allows the main display loop of Kansas to continue running smoothly in the face of lengthy computations.) When a user input such as a button press launches a process, the process is given a reference to the cursor which caused its birth. This reference comes in handy when it is time to inform the guilty party of the error: when the virtual machine informs a process of an error, the process suspends, a debugger is created for the process, and the debugger is attached to the cursor which initiated the process. Thus the user who caused the error ends up "holding" a debugger, as shown in Figure 2.

The debugger can show the execution stack, and has several conventional controls for stepping through evaluation, continuing, or aborting. Multiple stack frames can be simultaneously inspected, and the code shown in any stack frame can be directly edited and recompiled. As with any morph, the buttons and method editors of the debugger can be operated collaboratively, a facility which has proven useful during development of the system.

Our user quickly discovers that the definition of c was forgotten, and adds it. Unless they were in the neighborhood at the time, other inhabitants of Kansas remain unaware of any problem.
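The benign-error mechanism described in this section can be sketched roughly as follows. This is Python, not Self, and the names Cursor, Debugger, and launch are hypothetical. One deliberate simplification is that Python cannot suspend a failed thread for later stepping, so the sketch only captures the stack at the point of failure instead of keeping a resumable process.

```python
import threading
import traceback
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Debugger:
    """Hypothetical debugger object: the error plus a printable stack."""
    error: BaseException
    stack: List[str]


@dataclass
class Cursor:
    """Hypothetical cursor; `holding` is whatever is attached to it."""
    owner: str
    holding: Optional[Debugger] = None


def launch(cursor: Cursor, action: Callable[[], object]) -> threading.Thread:
    """Run `action` in its own thread, remembering which cursor started it."""

    def run() -> None:
        try:
            action()
        except Exception as err:
            # Assumption: capturing the traceback stands in for suspending
            # the process. The debugger goes to the initiating cursor only.
            stack = traceback.format_tb(err.__traceback__)
            cursor.holding = Debugger(err, stack)

    worker = threading.Thread(target=run)
    worker.start()
    return worker
```

Because each evaluation carries a reference to its own cursor, an error raised on behalf of one user never lands on another user's cursor, and the caller's main loop is free to keep running while the worker executes.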

When Kansas fails: The journey to Oz

The above mechanism can catch errors which do not impact the operation of Kansas itself. But virtual Kansans are allowed - indeed, encouraged - to modify their environment. Can we still keep the shared world illusion alive in the situation in which a modification causes a failure in the operation of Kansas itself? We will describe how, in some error cases, we actually can "repair" Kansas so that it can run again, add an appropriate debugger in Kansas, and restart it. But in the general case we create Oz, another shared world.

First, we need to explain something about the internal structure of Kansas, and how users typically make modifications. Kansas is driven by a process that loops on a basic cycle. Kansas maintains a list of "active" morphs: each time through the cycle, each active morph is sent the "step" message. During a step, a morph might run a bit of computation or update internal state. For example, during its step, a simulated subatomic particle checks the current time to determine if it has decayed and should remove itself from the world. Each step should complete quickly, so that the UI process cycle time is short, animation proceeds smoothly, and the illusion of a continuously operating universe is maintained.

At the beginning of a cycle, Kansas takes input events from the users and posts them to their respective target morphs. Then all active morphs are sent the step message, and finally all morphs requiring redisplay are asked to repaint themselves. In classical object-oriented fashion, users normally customize the system by modifying the behaviors of morphs (e.g., changing a step or display method, or by adding a new kind of morph) rather than by changing the loop structure. Hence, the typical failure mode for Kansas is that a morph fails during display or during a step because it sends a message that is not understood, or performs some other erroneous operation (such as a division by zero).
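The three phases of the cycle described above (post input events, step the active morphs, repaint what needs redisplay) can be sketched in Python as follows. The class and method names are hypothetical, and the `trace` list exists only so that the order of operations is observable.

```python
from collections import deque


class Morph:
    """Hypothetical minimal morph that records what happens to it."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace
        self.needs_redisplay = False

    def handle(self, event):
        self.trace.append(f"{self.name}:event:{event}")
        self.needs_redisplay = True

    def step(self):
        self.trace.append(f"{self.name}:step")
        self.needs_redisplay = True

    def display(self):
        self.trace.append(f"{self.name}:display")
        self.needs_redisplay = False


class World:
    """Hypothetical world holding morphs, the active list, and input events."""

    def __init__(self):
        self.morphs = []       # everything in the world
        self.active = []       # the subset that receives step messages
        self.events = deque()  # pending (target morph, event) pairs

    def post(self, target, event):
        self.events.append((target, event))

    def cycle(self):
        # Phase 1: deliver pending input events to their target morphs.
        while self.events:
            target, event = self.events.popleft()
            target.handle(event)
        # Phase 2: send step to every active morph (copy the list, since a
        # morph may remove itself from the world during its step).
        for morph in list(self.active):
            morph.step()
        # Phase 3: repaint only the morphs that asked for redisplay.
        for morph in self.morphs:
            if morph.needs_redisplay:
                morph.display()
```

In this arrangement a user customizes behavior by overriding `step` or `display` on a morph, never by editing `cycle`, which is why an error in either method surfaces inside the loop itself.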

Making Kansas resilient

In the first version of Kansas, when an operation failed inside the UI cycle the operation of Kansas would also cease, and all users (except perhaps the one who initiated the most recent change) would be left pondering why the system suddenly died. Only the user who started Kansas and hence had access to the command line interface could restart the system, and if he or she was not responsible for the failure and therefore did not attempt a fix, the restarted system would typically fail again immediately. At this point there would be little alternative except for the users to talk to each other in the hallway or by telephone to resolve the problem, or to discard this incarnation of Kansas and start again from a previously saved version.



Figure 3. Sketch of the main threads ("processes") in the Kansas system. When a new world is opened (be it called Kansas or Oz), the "start new world" procedure creates a UI process and the watcher process.

To preclude the possibility of a failure in a single morph suspending the operation of the entire world, we decided to monitor the UI process with another process, called the watcher process, whose responsibility would be to decide on a recovery action for the world, or to launch a new world, Oz (see Figure 3).

Recall from the previous section that when there is a failure in a process, even in the Kansas UI process, the failing process is suspended, and a debugger on it is added into Kansas. So if the UI process fails, the Kansas world will suspend, but end up with a debugger among its own objects. The watcher waits for suspension in the Kansas UI process; when the watcher is awakened it knows that there was a problem. The watcher examines the failed process (by looking through the activation stack) to see what was happening when it failed. If a morph was executing its step method, that morph is removed from Kansas's list of active morphs. The watcher also adds a notifier into Kansas, warning that it has removed a problematic morph from the active list. Finally, a new Kansas UI process is started. Of course users do not see all of this behind-the-scenes action. They simply see a debugger and a notifier pop into existence, and the illusion of a shared reality continues (see Figure 4). When the users have fixed the step method, the repaired morph can be added to the active list, thereby resuming its operation.
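The recovery steps just described can be sketched in Python. All names here are hypothetical, and two simplifications are assumed: the "activation stack" is approximated by the Python traceback of the exception, and "starting a new UI process" is approximated by simply continuing the loop.

```python
def stepping_morph(err):
    """Walk the traceback of `err`; return the object whose `step` method
    was executing when the failure occurred, or None if there was none."""
    found = None
    tb = err.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        if frame.f_code.co_name == "step" and "self" in frame.f_locals:
            found = frame.f_locals["self"]
        tb = tb.tb_next
    return found


class World:
    """Hypothetical world: an active list plus the objects users can see."""

    def __init__(self, name):
        self.name = name
        self.active = []
        self.contents = []  # debuggers and notifiers placed in the world

    def ui_cycle(self):
        for morph in list(self.active):
            morph.step()


def run_with_watcher(world, cycles):
    """Run `cycles` UI cycles, recovering from failures in step methods.
    Returns how many times the UI loop had to be restarted."""
    restarts = 0
    for _ in range(cycles):
        try:
            world.ui_cycle()
        except Exception as err:
            # A debugger on the failed cycle appears inside the world.
            world.contents.append(("debugger", err))
            culprit = stepping_morph(err)
            if culprit in world.active:
                world.active.remove(culprit)
                world.contents.append(
                    ("notifier", f"removed {culprit.name} from the active list"))
            restarts += 1  # continuing the loop plays the role of a new UI process
    return restarts


class Counter:
    """Hypothetical well-behaved morph."""

    def __init__(self, name):
        self.name = name
        self.steps = 0

    def step(self):
        self.steps += 1


class Broken:
    """Hypothetical morph whose step method always fails."""

    def __init__(self, name):
        self.name = name

    def step(self):
        return 1 / 0
```

Once the faulty `step` has been repaired, appending the morph back onto `world.active` resumes its operation, which mirrors the last step of the procedure in the text.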

Figure 4. The watcher process has detected an error in the Kansas UI process during a step method in an active morph. The watcher removes the offending morph from the active list, adds a notifier in Kansas, and starts a new UI process, thereby helping repair the world from within the world.


Another watching process (not shown in Figure 3) is used to detect a step method that takes too long. It is incumbent on the users modifying and adding morphs to ensure that the computations they specify do not take so long as to adversely impact the interactiveness of Kansas. To this end, if a morph takes too long to execute its step method, a watcher will suspend the process, and as above, notify the users of the offending morph and remove it from the active list. This also deals with a non-terminating computation in a step method.

If the watcher awakes, but does not find a problem with a stepping morph, then it merely restarts the UI process. If the problem is not recurrent, the shared world illusion continues. For example, a subatomic particle morph that had just decayed might display by drawing itself and playing a "pop" noise to the speakers. If there was an error during this one-time playing of the sound, the UI process will suspend one time, but a new UI process will be started by the watcher, and continue without subsequent problems. Again, the dwellers of Kansas will see a debugger on the old UI process suddenly appear without any other disturbance in their shared world.

Thus with enough fancy footwork, the watcher process can often keep Kansas going, while informing its inhabitants of errors in the fundamental driving UI process. Of course, this is not always possible. Kansas can host such severe problems that it is necessary to travel to Oz.
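The time limit on step methods mentioned above can be approximated in Python as shown below. This is only a sketch under stated assumptions: the function names and the budget value are hypothetical, and because Python threads cannot be suspended from outside, an overrunning step is abandoned on a daemon thread rather than suspended for debugging.

```python
import threading
import time


def step_within_budget(morph, budget_seconds):
    """Run morph.step() on a worker thread; report whether it finished
    within the budget. An overrunning step is left on a daemon thread."""
    worker = threading.Thread(target=morph.step, daemon=True)
    worker.start()
    worker.join(budget_seconds)
    return not worker.is_alive()


def run_cycle(active, notices, budget_seconds):
    """Step every active morph, deactivating any that overrun the budget."""
    for morph in list(active):
        if not step_within_budget(morph, budget_seconds):
            active.remove(morph)
            notices.append(f"{morph.name} exceeded its step budget and was deactivated")


class Quick:
    """Hypothetical morph with a fast step."""

    def __init__(self, name):
        self.name = name
        self.steps = 0

    def step(self):
        self.steps += 1


class Slow:
    """Hypothetical morph whose step takes far too long."""

    def __init__(self, name):
        self.name = name

    def step(self):
        time.sleep(1.5)
```

The same check covers a step that never terminates, since the decision depends only on whether the worker is still alive when the budget expires.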


The Sudden Journey to Oz

If any activity fails recurrently, then restarting the world with a new process will not fix the problem: the new process will also fail. For example, if we accidentally move a subatomic particle morph faster than light, it may try to take the square root of a negative number during its display method. New UI processes will repeatedly fail whenever they reach the point of trying to refresh the display. In this case the watcher detects two successive failures of the UI process, and launches Oz. It places into Oz an (emerald) debugger on the failing UI process, together with a message to the users as to why it was created, and a button which can be used to restart Kansas by restarting the UI process (see Figure 5). Until then, operation of Kansas is suspended, so it is not responsive to user input, and does not update the screen.
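The escalation rule in this paragraph (restart after a single failure, launch Oz after two in a row) can be sketched in Python. The names are hypothetical, and the display computation is an assumption: the text says only that the morph takes the square root of a negative number, so the Lorentz factor is used here as a plausible example.

```python
import math

C = 299_792_458.0  # speed of light in metres per second


class Particle:
    """Hypothetical particle morph whose display uses a Lorentz factor."""

    def __init__(self, speed):
        self.speed = speed

    def display(self):
        # math.sqrt raises ValueError for a negative argument (speed > C).
        return math.sqrt(1.0 - (self.speed / C) ** 2)


class World:
    """Hypothetical shared world."""

    def __init__(self, name, morphs):
        self.name = name
        self.morphs = morphs
        self.contents = []
        self.suspended = False

    def ui_cycle(self):
        for morph in self.morphs:
            morph.display()


def watch(world, max_cycles=10):
    """Run the world's UI cycle. Return None if it keeps running, or a new
    Oz world once two successive cycles have failed."""
    consecutive_failures = 0
    for _ in range(max_cycles):
        try:
            world.ui_cycle()
            consecutive_failures = 0  # a clean cycle resets the count
        except Exception as err:
            consecutive_failures += 1
            if consecutive_failures == 2:
                world.suspended = True
                oz = World("Oz", [])
                oz.contents = [
                    ("debugger", err),
                    ("message", f"{world.name} failed twice in a row"),
                    ("restart button", world),
                ]
                return oz
    return None
```

A one-time failure never reaches the threshold, so it is absorbed by a restart, while a recurrent failure in `display` trips it on the very next cycle.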

Figure 5. The watcher process has encountered an error that cannot be handled within Kansas. The watcher opens a new world, Oz, having prepared an emerald debugger on the suspended Kansas UI process. The normal policy is to send all Kansas inhabitants to Oz, though in general some may be left behind, or users from outside Kansas (possibly wizards) might also be sent to Oz.


When Oz is launched, it normally appears on all the screens that also displayed Kansas, so that all users are suddenly sucked up into Oz, and any user can attempt to fix the problem. Of course depending on how the system is set up, not all users need be brought to Oz. Conversely, not all Oz residents need come from Kansas: an expert user might well be brought into Oz when it is created. In this case visitors to Oz will find a wizard among them. We plan to use this strategy in helping to maintain Kansas at a remote site.

Any user can manipulate the debugger in Oz to find out what broke. The full collaborative potential of Kansas is available in Oz. In the movie, travelers to Oz solve their problems when key objects become tangible (the scarecrow gets a diploma, the lion a medal). In our Oz, any user can reify a key object by summoning its outliner to inspect and change its state, send messages, fix old objects, or build new ones.

If, during this experimentation and debugging, Oz itself breaks, then Oz's Oz is created, and so on. Except for the objects it contains, Oz really differs from Kansas only in name: it is itself just another shared world with similar UI and watcher processes. No matter how far up the Ozzian chain one travels, there is always a watcher process ready to act in the case of errors.

When the users want to try out a fix for the problem that broke Kansas, one of them can press the button in the Ozzian debugger to restart Kansas, which starts a new UI process in Kansas. If this fails again, another Oz is created parallel to the first. (It would also be possible to detect the case where Oz already exists, and reuse it.) Otherwise, Kansas is once again in business, and any morph or outliner created in Oz can be lifted from Oz and dropped into Kansas (or vice versa) by any user, simply by picking it up with the mouse and dragging it from one window to the other.

Fatal errors

Some errors resist even this other-worldly solution. Errors that ruin the shared world illusion are called fatal errors. For example, perhaps the code that creates Oz has broken. The system does have a constructive strategy for handling this case, using the same basic mechanism: if the watcher notices that the first iteration of the UI process for Oz has failed, it assumes that Oz cannot operate. Under this scenario, it is futile (or at least very dangerous) to attempt to build Oz's Oz, so the watcher suspends the operation of both Kansas and Oz, printing an explanatory message on the command line console. But this is only visible to the user at the host where the system is running: a wizard might be able to fix or work around the problem, but this situation is fortunately rare in practice.
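The fatal-error policy above can be sketched in Python as a check on the first UI cycle of a freshly created Oz. The names are hypothetical and the message wording is invented for illustration. The `console` parameter stands in for the command line console of the host.

```python
import sys


class World:
    """Hypothetical world whose first UI cycle may be broken."""

    def __init__(self, name, first_cycle_fails=False):
        self.name = name
        self.first_cycle_fails = first_cycle_fails
        self.suspended = False

    def ui_cycle(self):
        if self.first_cycle_fails:
            raise RuntimeError(f"{self.name} cannot draw itself")


def launch_oz(kansas, make_oz, console=sys.stderr):
    """Create Oz and try its first UI cycle. If that fails, treat the error
    as fatal: suspend both worlds, report on the console, return None."""
    oz = make_oz()
    try:
        oz.ui_cycle()
    except Exception as err:
        # No attempt is made to build Oz's Oz in this situation.
        kansas.suspended = True
        oz.suspended = True
        print(
            f"fatal: {oz.name} failed on its first UI cycle ({err}); "
            f"{kansas.name} and {oz.name} are suspended",
            file=console,
        )
        return None
    return oz
```

Distinguishing a failure on the first cycle of Oz from a later failure is what separates this case from the ordinary chain of Oz worlds: a later failure would still get its own Oz, while an immediate one stops the recursion.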

Conclusion

A multi-user virtual space malleable enough to be reprogrammed from within itself has many uses, from exploratory collaborative development of shared systems, to simple extensions in real-time response to user requests. This vision would be unrealistic if the debugging issue were ignored. Programming errors are commonplace, and good debugging support is a must. But debugging in a shared world raises interesting issues. Foremost among these: how can the shared world be maintained if an error occurs during its operation?

We identify three classes of error: simplest among the three is the benign error which occurs in a process that is independent of the operation of the shared world. Since there are potentially many users, the system tries to find out who caused the error, so it can notify the appropriate participant by creating a debugger at the participant's location. At the other extreme is the fatal error in which the system cannot maintain or create shared worlds. Thankfully rare, the fatal error brings the collaborative experience to a halt.

But we have found an intermediate class of error in which the shared reality metaphor can be maintained even though the state and behavior of Kansas itself is bound up in the error. The key player is a "watcher" process which detects when errors are non-recurring or are special cases the watcher knows how to handle. After making the suspended world safe again, the watcher reflects such errors back into Kansas (as debugger objects) and restarts the world. Users are unaware of any interruption. In cases where the attempt to restart Kansas has failed, the watcher knows to create a new world, Oz, with a debugger on the broken Kansas process. The users find themselves suddenly thrown into Oz where they can collaborate in repairing the problem, and can then resume and reenter Kansas.
We have been pleased with the results: in our experience many errors are caught inside an apparently continuously running Kansas, and the majority of the remaining errors are caught with the Oz mechanism.

Once we identified this intermediate case and began using the watcher process, the experience of working in Kansas became qualitatively different. The need to get up and run down the hall or to do blind debugging over the phone has been eliminated, and we can almost always continue our collaborative shared world experience, though sometimes at the price of an unexpected trip to Oz.

Acknowledgments

Lars Bak and John Maloney are due kudos for the many kinds of Self morphs they have created over the years, and for their contribution to the underlying Self user interface substrate. Also thanks to Ron Hixon for his work on network audio and video morphs.

About the Authors

Randall B. Smith is a Senior Staff Engineer at Sun Microsystems Laboratories. Author's present address: Sun Microsystems Laboratories, 2550 Garcia Ave, MTV29-110, Mountain View, CA 94043; e-mail: Randall.Smith@sun.com

David Ungar is a Senior Staff Engineer at Sun Microsystems Laboratories. Author's present address: Sun Microsystems Laboratories, 2550 Garcia Ave, MTV29-117, Mountain View, CA 94043; e-mail: David.Ungar@sun.com

Mario Wolczko is a Senior Staff Engineer at Sun Microsystems Laboratories. Author's present address: Sun Microsystems Laboratories, 2550 Garcia Ave, MTV29-117, Mountain View, CA 94043; e-mail: Mario.Wolczko@sun.com

References

[1] Curtis, P. The LambdaMOO Programmer's Manual, available by FTP from parcftp.xerox.com in pub/MOO.

[2] Ungar, D., and Smith, R. B. Self: The Power of Simplicity. In Proceedings of OOPSLA'87, (Oct 4-8, Orlando, Florida) ACM/SIGPLAN, New York, 1987. 227-241.

[3] Smith, R. B., and Ungar, D. Programming as an Experience: The Inspiration for Self. In W. Olthoff, Ed., ECOOP '95 - Object-Oriented Programming. Springer-Verlag, Berlin/Heidelberg, 1995. 303-330.

[4] Maloney, J. and Smith, R. B. Directness and Liveness in the Morphic User Interface. In Proceedings of UIST'95, (Nov. 15-17, Pittsburgh, Penn.) ACM/SIGGRAPH, ACM/SIGCHI, New York, 1995.

[5] Damer, B. Avatars! Exploring and Building Virtual Worlds on the Internet, Addison-Wesley, Reading, Mass., 1997.

[6] Damer, B. Inhabited Virtual Worlds. ACM Interactions (Sept.-Oct. 96), 27-46.