The Ultimate Interface



We already have the ultimate computer interface. Or, at least one half of it. That interface is what you and I present to the computer: a body that can speak, look, and gesture.

What we ordinarily do not have on the other side of the interface--the machine's side--are the sensors and program intelligence that can capture such outputs from the person, interpret those outputs, and make an appropriate response.

Modes and meanings

Consider the following scene:

Three young people sit around a table in an outdoor cafe. Susan talks excitedly about her new car, pointing it out at curbside. Gerry, one hand arcing in mid-air toward the other, shows how his car got caught in a fender-bender with a neighbor's van. He glances toward a menu just beyond his reach and asks "Could I look at that?" Across the table, Julie, using one hand to represent a sofa and the other to represent a side table, shows how she is setting up the living room in her new apartment. As she talks, she glances back and forth between her hands and the eyes of her tablemates to check how well they are following her account.

On and around the table, these three relate to one another through a spontaneous mix of speech, gesture, and gaze. They are good at it. Their ancestors have been doing it just so for thousands of years.

While we shall continue to communicate with one another just as these people about the table do, what about our dealings with machines: computers and computer-based media; robots, both industrial and personal; the world of reactive, intelligent objects as envisaged by the Things That Think Consortium?

Will we someday be able to turn to a machine, act in the same way--speak, gesture, look about--and be as readily understood? Will it listen to our words, track our gestures, watch our eyes, make sense of it all, and respond appropriately?

Assuredly, yes. Interacting with machines via speech, gesture, and gaze will not only be possible, but in fact will become the way most people--most of the time, and for most purposes--will deal with them.

Some people may never care to talk to and gesture at computers--the programmer fluent in a programming language, the journalist skilled in some word-processor, the accountant facile with some spreadsheet. For such, the classic keyboard and mouse may be sufficient, even preferred.

However, as widespread as computers are coming to be, most people are not used to pecking on keyboards and jockeying a mouse about a menu display. In contrast, and across countries and cultures, most people do speak, look, and gesture. Clearly, what is at stake is an interface for most of the world.

Sharing space and time

Concurrent speaking, pointing, and looking as a way of dealing with computers works best when the subject matter is concrete rather than abstract. Consider the following anecdote--in effect, a mini-theory of human-computer communication--authored by the MIT Media Laboratory Director, Nicholas Negroponte:

At a dinner party in a foreign land one can understand the conversation with only meagre knowledge of the tongue as long as people are talking about bread, butter, second helpings, wine, and the like. As soon as your tablemates break into a discussion of history or politics, you can participate only if you are totally fluent in their language. A misleading assumption is to attribute the entire difference to the sophistication of one topic versus the other or one vocabulary versus the other. In fact, the reason is that the bread-and-butter talk is about subjects that are in the same space and time as you, at which you can point, look, and nod, thus calling forth parallel and strongly redundant channels of communication.

--from Dedication Brochure, opening of MIT Media Lab, October 1985.

Shifting from the dinner table to the computer context, the counterparts of "...subjects that are in the same space and time as you..." are virtual objects set forth on the computer's display. Or, if interacting with computer-controlled robots, real objects in real space.

The computer--because it is either generating or modeling that domain of discourse--knows where everything is. And, if the machine can track our gestures and gaze and listen to our voice with respect to that domain of discourse, then we are, in effect, there with the computer.

Modes helping modes

When we and the computer share common time and space, our speech, gesture, and gaze become complementary to one another. That is, information that might be missing in any one of these modalities can be searched for in the others.

For instance, suppose I say to the machine, "What's that?" If the machine relies upon words alone, the meaning of "that" is ambiguous--given that there is more than one item on display, or more than one sound emanating from different spots in audio space.

However, if I am looking and/or pointing toward some particular thing, or in some particular direction, then the machine--provided it has the means to track my eyes and sense my hand position--can combine that information with what I uttered in words, and figure out what I meant by "that."
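A minimal sketch of how such a combination might work, assuming the machine already knows the display coordinates of each item and receives gaze and pointing as 2-D display points. The names DisplayItem and resolve_deictic, the averaging of distances, and the max_distance cutoff are all illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class DisplayItem:
    name: str
    x: float
    y: float


def resolve_deictic(items, gaze=None, pointing=None, max_distance=100.0):
    """Pick the item the user most plausibly means by "that".

    gaze and pointing are (x, y) display coordinates, or None when that
    mode is unavailable. Returns None when there is no cue at all, or
    when no item lies within max_distance of the cues.
    """
    cues = [c for c in (gaze, pointing) if c is not None]
    if not cues or not items:
        return None  # words alone: "that" stays ambiguous

    def score(item):
        # Mean distance from the item to every available cue.
        return sum(hypot(item.x - cx, item.y - cy) for cx, cy in cues) / len(cues)

    best = min(items, key=score)
    return best if score(best) <= max_distance else None
```

With a block at (100, 100) and a sphere at (400, 120), a gaze near (390, 130) plus a pointing hand near (410, 110) selects the sphere; with neither cue available the function returns None, mirroring the ambiguity of the words taken by themselves.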

Specifically, the use of coordinated, redundant modes lessens the burden on speech by permitting gesture and glance to disambiguate and supplement words. Overall, fewer words are needed. Such benefits from multimodal communication can happen not only between people sitting in each other's presence in a cafe, but also between people and computers.

The main benefit from thus combining modes lies in enabling everyday social and linguistic skills to access computing power. The computer captures user actions in speech, gesture, and gaze, interprets those actions in context, and generates an appropriate response in graphics and sound.

One powerful result is that of opening up computing power to the non-expert--namely, to most of the world. Dealing with a computer will shift from a purely technical to an increasingly social relationship. The user will experience the computer less as a tool, and more as a human associate.

Multimodal Natural Dialog

I call this kind of communication multimodal natural dialog, or MMND for short:

Only where the person at the interface can avail themself of concurrent speech input, is eye-tracked, and may use two-handed, free-hand gesture does there occur the possibility of the kind of multimodal natural dialog that I am referring to.

In contrast, traditional natural language processing (NLP) has concentrated upon verbal input, typed or spoken, while disregarding gesture and/or gaze. Too, one finds certain research labeled "human-computer natural language" employing typed-in and printed-out verbal communication, the gestural portion using a mouse or a stylus to point. Such work clearly differs from work wherein the verbal part is auditory--spoken aloud by the human and, through speech synthesis, by computer as well--and wherein the gestural input by the person is by the direct use of the hands, and not via some hand-held implement.

Also, some interface research has been done on speech-plus-gesture, but where the gesture is either via mouse or stylus, or via one hand only. Even in some two-handed gesture work done in recent years, the hands manipulate tools (e.g., mice) rather than employ two-handed, free-hand gesture. While perhaps worthy efforts, such studies are not reflective of the sort of multimodal interaction that I'm talking about.

Scope of MMND

The scope of the kind of multimodal natural dialog (MMND) contemplated here includes:

Again, note: in all of the above, I mean two-handed, free-hand gesture, not gesture using any kind of implement such as a wand, pen, mouse, or the like. Beyond the scope of MMND are:

Prime applications: interacting and ideation

MMND provides a powerful, naturalistic command style for real-time interaction with computation, whether that computation be resident in 3-D audiovisual displays, on-screen agents, personal robots, toasters or teddy bears.

In addition, a primary application of multimodal natural dialog is ideation: the kind of reflective and thoughtful "what-iffing" intrinsic to the scoping out of ideas. Such exploratory thinking could range from designing an addition to a house, strategizing the rescue of flood victims, laying out a new factory site, to planning orthopedic surgery.

The final frontier

I believe the multimodal interface to be--without exception--the most important frontier in human-computer dialog because it uniquely exploits the "native equipment" of people--what the unadorned person brings to the interface.

When we turn around 180 degrees from the hardware ensemble and take a fresh look at what the human brings to the interface situation, we find a being who uses hands and eyes, and who speaks. Current interfaces simply don't support that ensemble of traits.

To bring in speech, gesture, and gaze does not mean throwing away the classic keyboard-and-mouse combination. It's too useful. Nor does it mean throwing away trackballs, tablets, joysticks, or any other devices that people find of help when dealing with computers. Interaction via speech, gesture, and gaze is not in competition with but in addition to any other modes or styles of interacting, rounding out the repertoire of ways whereby you can address computation.

Person literate, not "person-like"

Dealing with computers as you would another person does not necessarily imply--at least to my way of thinking--a machine that gratuitously talks back, has a humanoid appearance, or exhibits human-like "personality" or "emotions."

My preference is for minimal commentary from the machine beyond simply doing what I ask it to do. Specifically, the machine's carrying out the request is the feedback.

For instance, I tell the display "Twist that block (looking at some block) ninety degrees (hand gesture indicating clockwise)." It does that, maybe "blinking" the block, plus a "click" when it slides into place. The blink and the click are the sum total of the "commentary."

It may at times be useful for an interface to have a "face," as when dealing with several embodied "agents" about various things on display. But it's less for conviviality than for efficiency in communication. An on-screen face offers a "there" to look at, and at which to direct your speech, as opposed to just speaking into thin air. Such "localization" can be especially helpful when dealing with multiple agents where each face corresponds to some specific "point-of-view," and you'd like directly to address this or that aspect of the topic under discussion.

If they are to be present, I see such agent faces as quasi-humanoid and exhibiting little or no affect, not unlike smiley-faces with subdued smiles. They glance about as feedback that they are paying attention, and as a signal to me that they are following what I am saying about items on display. While they may have "arms" for gesture, I see any such gesturing as having real semantic added-value, and not just there to make them seem "vivacious."

As for the machine, or any embodied agents, exhibiting affect or being "emotional" in their actions or behavior, I would find that as annoying as I would find chattiness on their part cloying.

(In this connection, I note in the June '98 FRAMES Prof. Roz Picard saying that interaction may be improved by making machines sensitive to user affect, but not necessarily by the computer itself being "emotional" back at the human.)

Others may differ, and want a more chatty, "personable" flavor of interface. Fine. It's a matter of preference.

The real difference lies in the machine being able to recognize the difference between user looks and gestures which

as opposed to looking and gesturing which may reflect "personality" but lack semantic significance.

When will all this happen?

I believe the MMND interface to computers and computation will happen sooner than most of us might imagine.

To date, most research in HCI has concentrated on one mode or another, e.g., natural dialog in speech, gesture interfaces, eye-responsive displays. Increasingly, though, researchers are putting the modes together and playing them off one against the other to maximize the gains in power and economy of expression that their concurrent use offers. The recent addition of multimodal interaction as a CHI Conference category reflects this development in the broader world of human-computer research.

Win-Win for person and machine

The anticipated gains from MMND for both the human and the machine are manifold.

The primary gain for the person is the ability to interact with computers via his or her own native equipment rather than in arbitrary, machine-oriented ways. In particular:

The primary gain for the computer from MMND is the possibility of gesture and glance enabling fewer words to convey meaning, thus reducing dependence upon speech input under conditions of ambient noise, unclear enunciation, and speaker variability.

Sensing voice, hand, and eye

Of the technologies to capture speech, gesture, and gaze, it is the technology for speech recognition that is the least obtrusive. Current-day technologies for capturing gesture and gaze are yet less than ideal from the standpoint of convenience and comfort.

Speech Recognition

Speech recognition systems that will handle connected speech--speech spoken with no appreciable pauses between words--have been available for some time now. Such systems include Naturally Speaking(TM) by Dragon Systems and the HARK(TM) system by BBN Systems and Technologies.

However, in order to use any speech recognition along with concurrent gesture and gaze, it is critical that the computing system know precisely when any word is uttered. Unless you have access to the internal coding of a speech recognition system--and usually you do not--it is vital that the system itself provide you with the timing information so you can synchronize when each word is spoken with events in the other modes. For instance, when the user utters "...there...", you need to link that temporally with where the user was either pointing or looking.

How precise does such information need to be? That, of course, relates to the time-scale of events in speech, gesture, and gaze.

While one can expect considerable variation in response rates across individuals and situations, a reasonable guesstimate is that the user will take about one-third of a second to utter a word, and perhaps shift their gaze 1-3 times per second. Gesture times can vary widely, from about 100 milliseconds (simple reaction time) for something like finger-point shifts to a couple of seconds or so for more elaborate, perhaps two-handed, gestures. Thus, a rough guess would be that timing information on speech material should be read out with a resolution on the order of 1/10 of a second. Resolution coarser than that may make it difficult to properly align events in speech, i.e., just when a certain word was spoken, with events in gesture and/or looking.
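To make the alignment concrete, here is a small sketch that looks up where the user was gazing when a given word was spoken. It assumes the recognizer reports a time for each word and the eyetracker delivers time-stamped samples on the same clock; the function name sample_at and the data layout are hypothetical, and the default tolerance of 0.1 second simply reflects the rough guess given above.

```python
from bisect import bisect_left


def sample_at(samples, t, tolerance=0.1):
    """Return the sample value nearest in time to t.

    samples is a list of (timestamp_seconds, value) pairs sorted by
    timestamp. Returns None when the nearest sample is further than
    tolerance seconds from t, i.e. when no trustworthy match exists.
    """
    if not samples:
        return None
    times = [ts for ts, _ in samples]
    i = bisect_left(times, t)
    # Only the neighbors on either side of t can be the nearest sample.
    candidates = samples[max(i - 1, 0): i + 1]
    ts, value = min(candidates, key=lambda s: abs(s[0] - t))
    return value if abs(ts - t) <= tolerance else None
```

For example, if gaze samples exist at 0.40 and 0.45 seconds and the word "there" is reported at 0.42 seconds, the sample at 0.40 is returned. A word reported at a moment with no gaze sample within a tenth of a second yields None, leaving it to the interpreting software to fall back on another mode.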

The microphonics for speech recognition can vary from table-top models to the pencil-stub sized clip-on mikes that guests wear on TV talk shows. The big difference in microphones is whether or not they are noise-canceling.

One approach to noise-cancellation is covering the microphone with some type of sound-absorbing material, the intention being to dampen whatever ambient sound there may be in the vicinity of the speaker; background noise becomes softened, the up-close voice of the speaker becomes crisp by contrast.

Another type of noise-cancellation involves an additional microphone placed away from the speaker to capture the ambient noise of the surroundings. That sound signal is subtracted from the speech-plus-ambient-noise signal from the speaker's microphone, so that, in theory, what is left for the speech-recognition system to examine is just the voice of the speaker.
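The two-microphone idea can be sketched in idealized form as a sample-by-sample subtraction. This assumes both signals are sampled in step and that the ambient noise reaches both microphones at the same moment and level, which is rarely true in practice; real systems generally need some form of adaptive filtering. The function name subtract_ambient and the gain parameter are illustrative only.

```python
def subtract_ambient(speaker_mic, ambient_mic, gain=1.0):
    """Subtract the ambient-noise signal from the speaker's signal.

    Both arguments are equal-length sequences of samples taken at the
    same instants. gain scales the ambient signal before subtraction
    (an assumed knob for level differences between the microphones).
    """
    if len(speaker_mic) != len(ambient_mic):
        raise ValueError("signals must have the same number of samples")
    return [s - gain * a for s, a in zip(speaker_mic, ambient_mic)]
```

In this idealized case, whatever the speaker's microphone picked up beyond the ambient signal is what remains for the recognizer to examine.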

The prime issue for the speaker from the aspect of convenience and comfort, though, is whether or not the microphone is obtrusive: Is it large and clunky enough to interfere with feeling normal and natural? Is it "in the way"?

Gesture-sensing

Perhaps the most well-known technologies for free-hand gesture-sensing are the "glove" technologies, brought to public notice by the hype of "virtual reality": images of people wearing special visors wherein they see a 3-D graphics scene, and reaching into that scene with esoteric-looking gauntlets.

The most popular of the early glove technologies was the DataGlove(TM) by the former VPL Research. The fingers and thumb of a stretch-fabric glove were outfitted with fiber-optic loops, two on each digit, the optic fiber abraded so that it would "leak" light in proportion to how much it was bent. Thus, by sending light into one end of the fiber loop and measuring how much light was being returned at the other end, a measure of the degree of "bentness" of the loop could be obtained.
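As a rough illustration of turning returned light into "bentness," the sketch below assumes two calibration readings per loop, one with the finger held straight and one with the hand in a fist, and takes light loss to be proportional to bend as described above. The function name bend_fraction and the calibration procedure are assumptions for illustration, not a description of how the DataGlove itself was calibrated.

```python
def bend_fraction(returned_light, flat_reading, fist_reading):
    """Estimate how bent a fiber loop is, from 0.0 (straight) to 1.0 (fist).

    flat_reading is the light returned with the finger straight and
    fist_reading the light returned with the finger fully bent; more
    bend leaks more light, so fist_reading is the smaller value.
    """
    span = flat_reading - fist_reading
    if span <= 0:
        raise ValueError("flat reading must exceed fist reading")
    fraction = (flat_reading - returned_light) / span
    # Clamp readings that stray outside the calibrated range.
    return min(1.0, max(0.0, fraction))
```

A reading halfway between the two calibration values comes out as a bend of 0.5; readings outside the calibrated range are clamped rather than extrapolated.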

A successor to the DataGlove(TM) is the CyberGlove(TM) by Virtual Technologies, Inc. This glove incorporates a set of up to 22 strain gauges embedded in its fabric. Virtual Technologies, Inc. also offers:

The big drawback to glove-based gesture-sensing technology is that you have to put something on.

Far better from the aspect of unobtrusiveness would be video cameras at a distance as input to an image processing system. Such systems have their own drawbacks, such as the inability to resolve in real-time the fingers of the hand, instead offering a mitten-like representation. Such a relatively coarse level of representation may nonetheless be highly useful when interpreted along with what the user is saying and where the user is looking in the context of the interface situation.

Eyetracking

A commercially available technique of tracking an observer's gaze which does not involve placing apparatus on their person is the pupil-center corneal-reflection distance method. Such systems are made by Applied Science Laboratories and also by ISCAN, Inc.

In this method, a small TV camera sensitive in the infrared range is situated several feet away from the observer and is zoomed close-in upon the observer's eye. The up-close image of the observer's eye is analyzed in real-time to locate: 1) the center of the pupil; 2) the center of a corneal reflection of the filament of a nearby infrared lamp. The distance between these two artifacts varies systematically with changes in the observer's point-of-regard in the surround, independent of small head movements. (The web site of LC Technologies, Inc. has a good diagram of the essentials of such a system.)
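One simple way to picture the mapping from the pupil-to-reflection offset to a point of regard is a per-axis straight-line fit made during a brief calibration, in which the observer looks at a few known screen points. The linear form, the function names fit_axis and calibrate, and the sample layout are all assumptions made for illustration; the text above says only that the distance varies systematically, and commercial systems may well use a different mapping.

```python
def fit_axis(offsets, targets):
    """Least-squares line: target = a * offset + b. Returns (a, b)."""
    n = len(offsets)
    mean_o = sum(offsets) / n
    mean_t = sum(targets) / n
    var = sum((o - mean_o) ** 2 for o in offsets)
    if var == 0:
        raise ValueError("calibration offsets must differ")
    a = sum((o - mean_o) * (t - mean_t) for o, t in zip(offsets, targets)) / var
    return a, mean_t - a * mean_o


def calibrate(samples):
    """Build a gaze estimator from calibration samples.

    Each sample is (pupil_center, reflection_center, screen_point),
    all given as (x, y) pairs. Returns a function that maps a new
    pupil/reflection pair to an estimated screen point.
    """
    dx = [p[0] - r[0] for p, r, _ in samples]
    dy = [p[1] - r[1] for p, r, _ in samples]
    ax, bx = fit_axis(dx, [s[0] for _, _, s in samples])
    ay, by = fit_axis(dy, [s[1] for _, _, s in samples])

    def point_of_regard(pupil, reflection):
        # Only the offset matters, so a small head shift that moves
        # pupil and reflection together leaves the estimate unchanged.
        return (ax * (pupil[0] - reflection[0]) + bx,
                ay * (pupil[1] - reflection[1]) + by)

    return point_of_regard
```

Because the estimator uses only the difference between the two image positions, moving both by the same amount, as a small head movement would in this simplified picture, gives the same estimated point of regard.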

In this way, a measure of the observer's point of regard is taken, with less than 1-degree error, sixty times per second. The method works well through most eyeglasses and contact lenses, and, when the apparatus is equipped with tracking mirrors and "autofocus," the observer has a fair amount of freedom (about a cubic-foot of space) to move the head about.

Often, however, ideal conditions do not obtain, and not only can accuracy be much less, but--given an observer who is actively moving about--the eye being tracked can easily slip away from the cameras. Too, such systems, with their optical set-up and tracking mirrors, can be expensive--on the order of 20 to 50 thousand dollars.

Another approach to eye-tracking image processing is one of employing neural nets. Dean Pomerleau and others at Carnegie-Mellon have demonstrated an eyetracker that works not on the basis of isolating specific artifacts in the image of the eye such as the pupil and a corneal reflection, but instead uses the aggregate appearance of the eye as it gazes in different directions. The procedure apparently works well; however, the system would have to be retrained should the observer change their position relative to the tracking camera.

As with gesture-tracking, a central issue is whether eyetracking can be done without placing apparatus on the user. The systems referred to above can work with cameras and optics remote from the user--as far away as 8 to 10 feet, given more elaborate optics and tracking mirrors. In some applications, the cameras and optics are miniaturized and set up on a head-band worn by the user. Such set-ups work well, but clearly are not ideal from the aspect of unobtrusiveness.

The future of sensing technologies

The big problem with current sensing technology--especially for eye and gesture tracking--is that it tends to be obtrusive and clunky. If people are to spontaneously interact with computing power on as casual a basis as with other people sitting around a cafe table, then the sensors have to be much improved. Also, any technology would have to be robust enough to stand up to "normal" everyday usage, i.e., be at least as robust as your shirt and shoes.

And reliable as well. The performance of the hardware and software technology would have to be consistent and accurate enough so as not to exasperate people so much that they conclude they would be better off without it. That threshold of exasperation probably varies across people, but for most people it probably is fairly low--meaning that sensors would have to perform at least as reliably as ATMs and power brakes.

Perhaps advances in computer vision and parallel computing will enable cameras at a distance to capture hand and eye actions with sufficient accuracy so that there will be no necessity to place optic apparatus or gloves on the person.

Perhaps wearable circuitry and space- and body-sensors will become robust and unobtrusive enough to provide consistent data without imposing on the person more than simply getting dressed in the morning, or changing into activity clothes, e.g., tennis or ski togs.

Sensors in the surround and not directly on the person may help. New work in non-contact measurement may enable gesture-sensing via measuring the body's effect on an electric field (cf. work by Neil Gershenfeld's group at the Media Lab); concurrent use of computer vision plus non-contact sensor measurement may prove a powerful combination.

Modularity

Meanwhile, currently available sensing technologies--though "clunky"--nevertheless are good enough to let research on the development of multimodal software intelligence proceed, and, in the long run, it probably is in software intelligence that most of the challenge lies.

Thus, a vitally important system consideration is that the sensing technologies be modular.

That is, the physical techniques which, for example, measure the position and dynamics of the hand in space, ought to be independent of any processes that attempt to place a meaning upon the position or actions of the hand.

Whether we measure the actions and placement of the hand mechanically via a glove, or at a distance by cameras, or by some other means, that data ought to feed into a general model of the hand's placement and actions. In turn, higher-level, interpretative strategies are to be applied to that model.

The overall system, being thus modular with respect to sensing technologies, can allow a range of currently available devices to work, while making the system upstream of the sensing technologies impervious to changes and developments in those technologies.
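The modular arrangement described above might be sketched as follows: each sensing technology fills in the same general hand model, and the interpretive layer reads only that model. All names here (HandState, HandSensor, GloveSensor, CameraSensor, side_of_display) are hypothetical, and the two sensor classes are stand-ins that return fixed values rather than drivers for any actual device.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple


@dataclass
class HandState:
    """General model of one hand, independent of how it was sensed."""
    position: Tuple[float, float, float]             # hand location in space
    finger_bend: Optional[Tuple[float, ...]] = None  # None if fingers unresolved


class HandSensor(Protocol):
    """Anything that can report the current state of the hand."""
    def read(self) -> HandState: ...


class GloveSensor:
    """Stand-in for a glove: reports position plus per-finger bend."""
    def __init__(self, position, finger_bend):
        self._position = position
        self._finger_bend = finger_bend

    def read(self) -> HandState:
        return HandState(self._position, tuple(self._finger_bend))


class CameraSensor:
    """Stand-in for a camera system with a mitten-like view of the hand."""
    def __init__(self, position):
        self._position = position

    def read(self) -> HandState:
        return HandState(self._position, None)


def side_of_display(sensor: HandSensor) -> str:
    """Interpretive layer: sees only HandState, never the device behind it."""
    x, _, _ = sensor.read().position
    return "left" if x < 0 else "right"
```

Swapping a glove for a camera changes nothing in side_of_display; a richer interpretation that wants finger detail would check whether finger_bend is available and degrade gracefully when it is not.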


...to be continued...