The Multimodal Interface



SUMMARY

Multi-modal interfaces, especially those which combine speaking, gesture, and looking, can make human/computer interaction more conversational in nature.

While "conversation" as a metaphor for human/computer interaction may not be appropriate for tasks such as using spreadsheets and word processing, it may well be appropriate for the unhurried elaboration and exploration of ideas. A "self-disclosing" system is described. This system monitors the user's actions in several modes, especially eye movements over a graphics display, and responds in synthesized speech and graphical actions according to what it infers the user's interests to be.

INTRODUCTION

At the Massachusetts Institute of Technology's Media Laboratory, when I headed its Advanced Human Interface Group, I was very much interested in multi-modal interfaces: interfaces which accept and interpret input from the user in two or more modes simultaneously. This kind of interaction contrasts with the more usual way of interacting with a computer by way of a single mode, for example, via the traditional keyboard.

In particular, I was interested in exploiting at the human/computer interface the three primary modes whereby people communicate in face-to-face encounters with one another: speaking, gesture, and looking. By making available to the human user the possibility of using any of these modes, alone or in combination with the others, I hoped to make dealing with a computer more like conversing with another person.

There are those who feel that the metaphor of "conversational systems" is inappropriate when dealing with computers (for example, Nickerson, 1986) and that the metaphor of using a tool or instrument is preferable. This may be true for many or even most situations today where people sit down with computers. Yet, I feel there is a great and as yet unexplored realm of computer usage where the spirit of the exchange is more like a "conversation."

Consider an exchange with a computer where the main intent is the elaboration and exploration of ideas, the unhurried examination and cross-comparison of information, the "thinking aloud" about this or that topic. Concretely, this could be the executive who wants to consider a possible plant merger, the physics student who wants to review topics in optics, the person at home who wants to take a guided tour of European cathedrals - and to do this with the help of a computer. The spirit and tone of such pursuits are less like using a spreadsheet or word processor than dealing in an ordinary conversational way with some agent or agency who knows about the topic and is willing to spend some time with you on it.

TECHNOLOGIES AND INTELLIGENCE

There are two aspects to the task of creating such conversational quality at the user/computer interface. On one side are the technologies to capture inputs from the user's speaking, looking, and gesture. These technologies are, respectively, automatic speech recognition, eye-tracking, and manual input devices. We can expect these technologies to improve over time, become more comfortable to use, and become cheaper through better engineering, improved materials, miniaturization, and the discovery and development of better "transducers." What will remain throughout is that speaking, looking, and pointing will persist as the chief ways humans express themselves. The other side of the picture is the machine intelligence to interpret inputs captured in these modes and map them to some appropriate response, usually an action on the graphics display, output in speech or sound, or both.

Are these modes of speaking, looking, and pointing to be substitutes for the ubiquitous keyboard - and often "mouse" - we use now? No, don't get rid of the keyboard. It is too useful for the input of symbolic information via character strings. And, don't get rid of the mouse. It's a great tool. Though people speak, look, and point, we still use tools: hammers to drive nails, scissors to cut paper. Here, though, the emphasis is upon the unencumbered, non-instrumented user who need not wear gadgetry on their person as they speak, point, and look about.

MULTI-MODALITIES

What are the advantages of the multi-modal approach? There are many, but three especially important benefits arise when two or more modes are orchestrated together or used in parallel: "unburdening", summation, and redundancy.

Unburdening

When we have only one mode available to use we tend to get the "one-armed paperhanger effect" - an overworking of the single available mode. Recall the silent movies? The lack of a sound track meant that the actors had to express everything in the visual mode. That led to the exaggerated grimaces and gestures in the old silent-era pictures. To convey anger, say, the actor waved arms about, put on an overblown scowl, and strongly mouthed threatening words. Even this mouthing of words was overdone, trying to talk through the "glass-wall" of the silent screen. With the advent of the soundtrack, the overload on the visual mode suddenly became relaxed. The auditory mode shared the burden of conveying meaning. The visual and the auditory modes could now come into a just balance, one with the other. The benefit was not only for the viewer, who could now use ears as well as eyes, but especially for the actor who could portray whatever character in a more natural and spontaneous way.

Information summation

Another big benefit from multiple modes is that meagre or poor information in any mode taken singly can summate to more robust information when the information from other modes is factored in with it. Suppose you are about two-thirds effective in each mode in getting information from me as I speak, point, or glance about. That is, any of these modes is about 66.7% effective in getting across information to you. I look up at the ceiling, point my hand up at it, and say "Up there...". If you deal with each mode singly, then the probability that you will get my intent will be at best about two-thirds. But if you combine the inputs from all three modes, and the modes fail independently of one another, the probability that you will get my meaning in at least one of the modes is much greater: about 96.3%. An easy way to picture this is to visualize the famous Rubik's Cube puzzle with its three sub-cubes along each edge, 27 sub-cubes in all. The probability that you would NOT get my intent via any of the 3 modes is proportional to the volume of one of the sub-cubes - 1/27th of the total volume or about 3.7%. The point is that combining information from several not very efficient modes can summate to a much higher information throughput.
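The arithmetic in this example can be checked with a short sketch. It assumes, as the cube picture implicitly does, that the three modes fail independently of one another; the function name is illustrative only.

```python
from fractions import Fraction

def prob_at_least_one(per_mode_success, n_modes):
    """Probability that at least one of n independent modes gets the message across."""
    # The message is lost only if every mode fails.
    miss_all = (1 - per_mode_success) ** n_modes
    return 1 - miss_all

p = Fraction(2, 3)                    # each mode is about two-thirds effective
combined = prob_at_least_one(p, 3)    # speaking, pointing, and looking together
print(combined)                       # 26/27
print(round(float(combined) * 100, 1))  # 96.3
```

With only two modes the same calculation gives 8/9, so most of the gain already comes from adding a second channel.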

Redundancy

Closely related to "information summation" is redundancy, arising as much from the context in which the speaking, looking, and pointing occur as from the fact of using these modes in unison. A case in point is speech recognition. A vendor of speech recognition equipment advertises that their equipment is, say, 99.7% accurate. But such high figures presume among other factors trained speakers, high-information (non-confusable) vocabularies, and noise-free environments. Tested under more realistic conditions, speech recognition drops down to perhaps 60-65% accuracy. Yet speech interpretation, if not raw recognition, can be quite high even when we barely know the language, given the right conditions. Consider the following:

At a dinner party in a foreign land one can understand the conversation with only meagre knowledge of the tongue as long as people are talking about bread, butter, second helpings, wine, and the like. As soon as your tablemates break into a discussion of history or politics, you can participate only if you are totally fluent in their language. A misleading assumption is to attribute the entire difference to the sophistication of one topic versus the other or one vocabulary versus the other. In fact, the reason is that the bread-and-butter talk is about subjects that are in the same space and time as you, at which you can point, look, and nod, thus calling forth parallel and strongly redundant channels of communication. (Negroponte, 1985)

At the user/computer interface, the computer's graphics display is the analogue of the immediate surrounding that two or more people share in ordinary conversation. What makes speaking, pointing, and looking so effective in face-to-face conversation - even when we don't know the language very well, as above - is shared context. We talk about, look at, point to objects and items about us. Similarly, the graphics display and its contents are what the computer is "offering" to us, what it - the computer - is ready and willing to "talk about." Our speaking, looking, and pointing actions are provoked by what the computer is offering to us on its screen. We formulate and execute our actions in those modalities with respect to, and in the light of, what's on the graphics display.

THE INFORMATION IN EYES

Imagine your favorite uncle come to visit your new apartment:

He enters and looks about while you comment on the decor. He studies a set of prints by the bookcase. "I got them in London," you say, and tell about them. He fixates one and asks "What's that?" "Covent Garden in 1770," you reply, and chat on about it.

You both sit down. He tries to light his cigar but is out of matches. You offer your lighter, and show him (it's tricky) how it snaps lit. You notice he wasn't watching closely, so you demonstrate it again.

He asks "How's the car running these days?" but he's looking at your roommate, not you. Your roommate praises Volvos. Your uncle asks again "How's the car running these days?" now looking at you.

"It got demolished last week," you explain. "Here I was coming out onto Main Street..." you say, glancing at your right hand as you describe in space the path of your car, "...and this truck came roaring along..." glancing at your left hand as it marks the path of the truck. You denounce people who run red lights, waving your hands about for emphasis.

Throughout, eye actions signify interest, attention, and reference.

Interest

As your uncle enters the room, you, the attentive host, respond not only to what he says but to his "body language" as well, including where he is looking. While he looks about the room at large you comment on the room in general. When you notice him surveying the prints you talk about them as a set. When he picks out one in particular, you comment about that one. Thus, you not only pick up cues to where his interests lie, but adjust the specificity of your comments according to what you infer the scope and level of that interest to be, as indexed by the "range" of his looking and the time devoted to any area or item.

Your responses "work" and are appropriate because people tend to look at what attracts them, especially at what they find curious, novel, or unanticipated (Berlyne, 1966, 1958; Loftus and Mackworth, 1978). Eye movements in looking over a scene tend to take on distinctive patterns depending upon interests and intentions. In his classic eye tracking studies, Soviet scientist Alfred Yarbus asked observers to examine a copy of a famous Russian painting - "They Did Not Expect Him," by Repin - depicting a young man just returned from political exile to the midst of his startled family. Before looking at the picture for three minutes, each observer was asked one of a number of questions: What are the ages of the people? What are the material circumstances of the family? What was the family doing before the young man arrived? Observer looking patterns differed markedly depending upon the goals set by the questions. When asked about ages, observers looked mostly at the people's faces where naturally enough the best age clues would be. If asked about wealth and social position of the people in the scene, fewer fixations were made on the faces of people in the picture, and many more on the clothing and on the objects in the room where they were (Yarbus, 1967).

Attention

While showing your uncle how to use the cigar lighter you notice he was not watching closely and decide you must show him how again. Your observation of his eyes serves as feedback on whether or not he is attentive when you instruct him in the lighter's operation. You can tell whether or not he is "following" you and can take appropriate action - here, repeating the demonstration - if he is not.

It is possible to be paying visual attention to something yet not be looking directly at it (Posner, 1980). Conversely, it is possible to be looking at something and not attending to it, as when "staring off into space" in introspection or day-dreaming. By and large, though, when we are in fact paying visual attention to something in the surround the eyes' point-of-regard is a robust index of the distribution of that attention (Kahneman, 1973, pp. 50-65).

Observing the eyes opens a new channel into where another's attention is directed. The effect can be compared to what children gain when they discover that where a parent is looking is useful to them in comprehending what is transpiring between them and their parent, and, in turn, the world about them. "What has been mastered is a procedure for homing in on the attitudinal locus of another: Learning where to look in order to be tuned to another's attention...". This is how psychologist Jerome Bruner describes the significance of the child's discovery of the meaningfulness of the point-of-regard of the mother when she looks at the cat and utters "kitty," looks at the door and utters "go out," and so forth (Bruner, 1974/5, p. 269).

Reference

There are several referential uses of eyes in the scenario of your uncle's visit. Your uncle utters "What's that?" while looking at some particular thing. To the linguist, words like "this," "that," and "there" are deictic words. "Deixis" comes from a Greek word that means pointing or indicating. Words like "chair" and "table" are non-deictic in that their commonly understood referent is part of their meaning. In contrast, words like "this" and "that" have no fixed referent, but gain meaning from their particular use by the speaker (Miller, 1981, p. 128). When your uncle asks "What's that?" the meaning is completed by your awareness of where he is looking (at a certain picture).

Another kind of reference concerns to whom we are speaking. Your uncle inquires twice about cars, using the exact same words both times. The difference lies in whom he is looking at, you or your roommate. The question is different because it has a different addressee. Mutual gaze and eye breaks assist conversation generally, indicating whose turn it is, synchronizing the contributions of the various speakers, and so on (Argyle and Cook, 1975; Cumming, 1978).

The eyes can also function in a kind of inter-modal cross reference. In our scenario, you describe your auto collision with your hands, looking at them as you do so. Where your uncle sees you looking - at your hands - implicitly signals him also to pay attention to their position and motion. When next you "beat the air" with your hands as you denounce red-light runners, you look instead at your uncle, by default implying not that your hands offer unique information, but simply that they underscore your speech.

EYE-RESPONSIVE GRAPHICS AND SOUND

The eye-tracking literature from experimental and applied psychology may not be particularly helpful in bringing in eye-tracking as a computer input mode. The object of such studies is how people look, not how looked-at things might respond to that looking. One early experimental set-up wherein the issue was how looked-at things might react to that looking was a project at MIT's former Architecture Machine Group entitled "Gaze-Orchestrated Dynamic Windows" wherein we tracked an observer's eye to control a dynamic display of many video episodes (Bolt, 1984, Chapter 4).

The intent was to create the visual analogue of the kind of informational world that besets the modern executive - one of brevity, fragmentation, and variety (Mintzberg, 1980) - yet enable the observer to "filter" their contact with that world via their own, built-in mechanisms of visual selective attention. The observer sat in front of a wall-sized color TV display which held at times up to forty simultaneous TV sub-images, their soundtracks playing together like background "cocktail party" noise. In our experimental set-up, the observer wore special eye-tracking glasses rather than the remote corneal reflection type of eye-tracker.

Should the observer's gaze rest on some specific episode for a certain number of consecutive seconds (the duration was varied), the display first would narrow the soundtrack to only the sound of the looked-at window - a kind of auditory "zoom." If looking at that episode persisted, the system would "freeze-frame" that video image - while letting the others go on - and then "cut" to a full-screen version thereof. To return to the full-screen ensemble of images, the observer simply pulled on a small lever on the arm of the chair.
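The dwell-driven sequence just described can be sketched as a small state machine. This is only an illustration, not the original implementation: the threshold values, the state names, and the class `GazeDisplay` are all assumptions (the text says only that the duration was varied).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical dwell thresholds in seconds; the original set-up varied the duration.
AUDIO_ZOOM_AFTER = 2.0
FREEZE_AFTER = 4.0
FULL_SCREEN_AFTER = 5.0

@dataclass
class GazeDisplay:
    window: Optional[int] = None   # sub-image currently being dwelt upon
    dwell: float = 0.0             # consecutive seconds on that sub-image
    state: str = "ensemble"

    def update(self, looked_at: Optional[int], dt: float) -> str:
        """Advance by dt seconds, given the sub-image looked at (None if none)."""
        if self.state == "full_screen":
            return self.state      # only the lever returns to the ensemble
        if looked_at is None or looked_at != self.window:
            # Gaze moved: restart the dwell count on the new target.
            self.window, self.dwell, self.state = looked_at, 0.0, "ensemble"
            return self.state
        self.dwell += dt
        if self.dwell >= FULL_SCREEN_AFTER:
            self.state = "full_screen"
        elif self.dwell >= FREEZE_AFTER:
            self.state = "freeze_frame"
        elif self.dwell >= AUDIO_ZOOM_AFTER:
            self.state = "audio_zoom"
        return self.state

    def pull_lever(self) -> None:
        """Return to the full ensemble of sub-images."""
        self.window, self.dwell, self.state = None, 0.0, "ensemble"
```

Feeding the sketch one-second gaze samples on the same sub-image steps it through audio zoom, freeze-frame, and full-screen in turn; a glance elsewhere before full-screen resets the count.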

In essence, you dealt with the display much as you would with a crowd of people in your office all talking at once and competing for your attention, namely picking out one person and keeping on looking at them. The usual response of people is for those not looked at to yield the floor and for the looked-at person to keep on talking. (Interactions between the President and the White House Press Corps during TV news conferences illustrate this point.) After dealing as much as you care to with the person you picked out, you then look more widely about, implicitly "throwing open the floor" to the rest of the crowd and they start clamoring again. In our system, the lever-pull was analogous to your un-fixating the current person and looking more widely about to again take in the rest of the crowd.

More broadly, the possibility of a graphics display that responds to your patterns of looking raises the prospect of a new kind of computer display graphics - "lookable graphics."

But haven't graphic artists always assumed that what they make will be looked at? Haven't graphic artists always been ultra-aware that they were dealing in "visuals?" Don't they speak of lines and edges that "draw the eye on," that direct the pattern of looking? Yes, all of these. But still the traditional concern has been with the "look" of things, not with what the things should do upon being looked at. For the observer, too, eye-responsive graphics are unprecedented, the only previous experience remotely similar being eye contact with people and animals. Sounds too, in a 3-D stereo space, may well be "addressed" by eye. It is well known that we look in the direction of sounds that alert or interest us. It has been discovered that we even tend to listen where we look (Reisberg et al., 1981). Thus, the scope of eye-sensitive graphics ought to include sound as well as visibles.

A MULTI-MODAL "SELF-DISCLOSING" SYSTEM

The object in all of this is multi-modal interaction. How might the modes of speaking, gesture, and looking work together?

Let us consider a specific application: a gaze-contingent computer display which is in effect "self-disclosing" (Bolt, 1984, Chapter 6). This system is instrumented to respond to your presence and normal behavior. It has a full-color graphics display, and an eye-tracking capability to determine where upon that display you are looking. You would be able to talk to it via automatic speech recognition, and to touch or point to items on view. It would be able to communicate back via text or graphics and synthesized or recorded speech.

It would disclose its contents and information base according to the interests you exhibit through your actions vis-a-vis the things on display, and would do so at a pace that complements your own. Its behavior would be not unlike your own as the attentive host showing off your apartment to your visiting uncle, as in the scenario earlier in this paper.

The Computer as Obliging Host

Suppose the computer has on its display screen the wall of a living room, with paintings, a fireplace, with a ship model and brass candlesticks on the mantlepiece, and large brass andirons in the fireplace to hold flaming logs. For its part, the machine emulates the obliging host, commenting about and showing off what the user seems most interested in - as suggested by the user's looking - among the objects on view. The machine watches the user's looking patterns over time and gauges its responses accordingly. A stored-text database whose organization parallels the figural aspects of the displayed scene and the structural aspects of the depicted items (paintings, ship model, etc.) provides the "script" for the machine's narration in synthesized speech about the scene.

If the looking is spread more or less evenly about the items in the room, then the machine will talk in a general way about the room as a whole: "This is our favorite spot in the house and where we spend most of our time. We've filled it with things we've picked up in our travels from all over. . . " and so on. It doesn't go into depth about any particular thing.

In contrast, should the user's looking dwell upon some specific item, the machine will begin to talk about that item. For example, the user spends some time looking at one or both of the candlesticks on the mantle. A reasonable inference, given "extended" looking, is that the user is interested in the candlesticks, and the machine starts telling about them: "Ah, yes, we got those candlesticks last year in Philadelphia. They were made about 1760, so we were told, by . . ." Given sustained interest, evinced by continued looking, the machine "zooms them out" to give the user a closer, more detailed look.

However, suppose that the user's looking is distributed between the candlesticks and the andirons as well: a few glances at one or other of the candlesticks, now a few at either or both of the andirons. A reasonable inference is that the user is not interested in just one or the other item, but both of them at some more inclusive level of interest. Perhaps the user is attracted to bright shiny things; perhaps the user collects antique brass. In any event, the machine, up to now telling about the candlesticks only, widens its focus now to tell about both items: both are made of brass, both bought at the same auction, etc. In general, the focus and inclusiveness of the machine's narration and exposition is a function of the user's distribution of attention and interest as disclosed in large part via looking patterns.
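One way to picture how narration scope might follow the distribution of looking is sketched here. It is an illustration under stated assumptions, not the system's actual logic: the topic table, the 70% share threshold, and the function name `narration_topic` are all hypothetical.

```python
# Hypothetical scene hierarchy: each narration topic covers a set of displayed items.
TOPICS = {
    "candlesticks": {"candlesticks"},
    "andirons": {"andirons"},
    "ship model": {"ship model"},
    "paintings": {"paintings"},
    "brass items": {"candlesticks", "andirons"},
    "room": {"candlesticks", "andirons", "ship model", "paintings"},
}

def narration_topic(dwell_seconds, share_needed=0.7):
    """Pick the narrowest topic whose items account for most of the recent looking."""
    total = sum(dwell_seconds.values())
    if total == 0:
        return "room"  # no looking data: talk about the room in general
    # Try topics from narrowest (fewest items) to broadest.
    for topic, items in sorted(TOPICS.items(), key=lambda kv: len(kv[1])):
        share = sum(dwell_seconds.get(item, 0.0) for item in items) / total
        if share >= share_needed:
            return topic
    return "room"

print(narration_topic({"candlesticks": 8, "paintings": 1, "ship model": 1}))
print(narration_topic({"candlesticks": 4, "andirons": 4, "paintings": 1, "ship model": 1}))
print(narration_topic({"candlesticks": 1, "andirons": 1, "paintings": 1, "ship model": 1}))
```

The three calls correspond to the three cases in the text: looking concentrated on one item, looking shared between the candlesticks and andirons, and looking spread evenly about the room.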

The Locus of Initiative

What is the relationship in such an interchange between the user and the machine in the sense of who drives what? Where is the "center of initiative"?

Consider the practice backboard tennis players use to sharpen their swing. The player "drives" the exchange, the backboard returning the ball according to the angle and velocity of the ball as hit, plus any "spin" the player might have added. Suppose now a backboard that could add its own "twist" to the return, with velocity shifts and surprise angles. Suppose further this backboard - in a lull in the practice session - can somehow pick up the ball and start to volley.

In this system the machine functions much like this hypothetical backboard. The display screen itself is an implicit invitation to the user somehow to respond to it just as the very presence of a backboard "invites" the tennis player to lob a ball at it. The user looks about the display, and the machine responds largely as a function of the patterns of that looking. The machine takes local initiatives occasionally to take the situation "off dead center" should lulls in user looking activity occur. (A "lull" may be relatively less active patterns of looking, or looking patterns becoming relatively less correlated with the display image contents, suggesting distraction or less focussed interest.) Human dialog is similarly "episodic." We chat about this and that, and, having exhausted a topic or pursued it as far as we care to, a pause or "lull" occurs, the dialog reviving when either conversant ventures a remark on a new or revived theme.
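The two signs of a "lull" named in the parenthesis above could be tested along the following lines. The measures, baselines, and the 50% drop factor are assumptions made for illustration; the text does not specify how a lull is to be quantified.

```python
def is_lull(fixations_per_sec, on_content_fraction,
            baseline_rate, baseline_on_content, drop=0.5):
    """Flag a lull when looking activity, or its match to display contents,
    falls well below the user's own recent baseline (hypothetical measures)."""
    less_active = fixations_per_sec < drop * baseline_rate
    less_correlated = on_content_fraction < drop * baseline_on_content
    return less_active or less_correlated
```

Either condition alone is enough to trigger a local initiative by the machine in this sketch; a stricter design might require both.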

Overall, the exchange between the user and the machine is one of mutual provocation and evocation, with the user taking primary initiative through his or her store of curiosity about the displayed scene, and the machine taking local initiatives when the user seems "stalled." Any particular exchange on any particular topic ends when the user no longer looks about the related item or the machine has exhausted its store of narration about it, whichever comes sooner. Thus, while the machine embodies and specifically is the "self-disclosing" system, the user's actions are themselves essential components of the self-disclosing process. Both parties to the dialog disclose information to the other, one party by way of eye movements and fixations, with occasional words and gestures, the other party via actions in graphics and synthesized speech.

Changing the Subject

The system could set the initial topic of the dialog simply by having something on its screen on display. Or, the user could set the initial topic by saying something like "Tell me about 16th century Japanese architecture." Given that the system had in its database information on Japanese architecture of the 16th century, it generates a beginning display on that topic and the exchange starts. But given that an exchange on some topic is underway, how might sub-topic give way to sub-topic, or even the major subject change?

One way change in topic might occur is through a subtle change in the user's looking patterns and the system's observation thereof. People, at least when attentive, tend to look at the things in a scene that are being named (Cooper, 1974). Suppose the user looks at the candlesticks on the mantlepiece in the living-room scene and is told, among other things, that the candlesticks once belonged to some famous person, say Thomas Jefferson, as did the ship model. Suppose further that the system subsequently observes the user looking now at the candlestick, now at the ship model, but that fine-grain looking when regarding either item is largely uncorrelated with the narration about the details of the item: the features named are for the most part not looked at. One reasonable inference is that the user is not interested in the items for their own sake but because of something the two have in common, in this instance having once belonged to a famous person. An appropriate action might then be to display images and items pertaining to Thomas Jefferson, his home at Monticello, the American Constitution, and so forth. It depends upon the system's information base. If the system has no particular information, pictorial or anecdotal, about Thomas Jefferson, and is unable to get information from outside itself, it must simply say so and continue with the current scene.
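The inference described in this paragraph can be sketched in a few lines. Everything named here is hypothetical: the attribute table, the feature labels, the 30% cut-off, and the function names are illustrative assumptions rather than part of the system as described.

```python
# Hypothetical attribute table for items in the displayed scene.
ATTRIBUTES = {
    "candlesticks": {"brass", "owned by Thomas Jefferson"},
    "andirons": {"brass"},
    "ship model": {"owned by Thomas Jefferson"},
}

def follow_rate(named_features, looked_features):
    """Fraction of the features named in the narration that the user also looked at."""
    if not named_features:
        return 0.0
    hits = sum(1 for feature in named_features if feature in looked_features)
    return hits / len(named_features)

def infer_shared_interest(items_looked_at, named_features, looked_features,
                          max_follow=0.3):
    """Return attributes common to the looked-at items when the user is
    not following the item-level narration; otherwise return an empty set."""
    if len(items_looked_at) < 2:
        return set()
    if follow_rate(named_features, looked_features) > max_follow:
        return set()  # the details are being followed; stay on the current item
    return set.intersection(*(ATTRIBUTES.get(item, set()) for item in items_looked_at))

shared = infer_shared_interest(
    ["candlesticks", "ship model"],
    named_features=["base", "stem", "rigging", "hull"],
    looked_features={"stem"},
)
print(shared)  # {'owned by Thomas Jefferson'}
```

An empty result corresponds to the case in which the system has nothing in common to offer and simply continues with the current scene.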

The Conversational "Contract"

Whether and how much to digress also depends upon the implicit conversational "contract" between the user and the system (cf. Martinich, 1984). Did the user specifically ask to be shown about the room? Does some prior understanding exist that it's just the room itself to be shown off? Then the system cannot simply digress or "free associate" from material on Thomas Jefferson to the Flag to the Statue of Liberty to New York harbor and so on indefinitely. Holding to the conversational "contract" is even more necessary if, for instance, the user has explicitly asked the system to instruct him or her on such-and-such a topic; the system must try to stick to the theme - unless, through repeated "meandering," the user reveals little determination to stay on subject, and the trajectory through topic space becomes a resultant of the system's push to remain "on topic" and the user's push to diverge. How indulgent or resistant the system is respecting diversions is yet a higher order aspect of "personalization," depending in part on user reaction when the system has given in to or frustrated diversions.

AN "EXPERT CONVERSATIONALIST"?

Having supported "expert systems" knowledgeable in medicine, petroleum geology, and configuring computer systems, can the computer now become an "expert conversationalist?" Consider a human conversing about politics, last summer's vacation, the new car they just bought, whatever. Across subject matter, and independent of the specific topic, they show practical skills in dealing with whomever they are talking with. The skills themselves are basic: breaking eye contact when you want to speak; noting whether the other person is looking in the right spot when you point something out to them; describing things and events with your hands - the way to your house or the size of the fish you caught. The skills used are by and large unconscious, having been honed through much practice in dealing with others. Can such general, practical conversational expertise be imparted to computers?

There is no necessity that the approach to that goal resemble current work in so-called "expert systems". Development of conversational machine intelligence will at least as likely combine insights from psychology, linguistics, artificial intelligence (AI), and computer science. Even so, such insights will not necessarily come from psychologists, linguists, AI specialists, or computer scientists as such, that is, from professionals who work mainly within their own discipline. Rather, progress toward the kinds of machine intelligence that will lend true conversational quality to human/computer interaction will involve insight and invention involving any and all of these disciplines, and perhaps others as well.

REFERENCES

Argyle, M. and M. Cook. Gaze and mutual gaze. Cambridge, England: Cambridge University Press, 1975.

Berlyne, Daniel E. Curiosity and exploration. Science, 1966, 153, 25-33.

Berlyne, Daniel E. The influence of complexity and novelty in visual figures on orienting responses. Journal of Experimental Psychology, 1958, 55, 289-296.

Bolt, Richard A. The human interface. New York: Van Nostrand Reinhold, 1984. Translated into Japanese and distributed in Japan through the Tuttle-Mori Agency, Inc., Tokyo.

Bruner, Jerome S. "From Communication to Language - a Psychological Perspective." In Cognition, Vol. 3, No. 3, 1974/1975, 255-287.

Cooper, Roger M. The control of eye fixations by the meaning of spoken language. Cognitive Psychology, 1974, 6, 84-107.

Cumming, G. D. Eye movements and visual perception. In E. C. Carterette and M. P. Friedman, Eds., Handbook of perception: Vol. IX, Perceptual processing. New York: Academic Press, 1978.

Kahneman, Daniel. Attention and effort. Englewood Cliffs, New Jersey: Prentice-Hall, 1973.

Loftus, Geoffrey R. and Norman H. Mackworth. Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 1978, 4(4), 565-572.

Mackworth, Norman H. and Anthony J. Morandi. The gaze selects informative details within pictures. Perception and Psychophysics, 1967, 2(11), 547-552.

Martinich, A. P. Communication and reference. New York: Walter de Gruyter, 1984.

Miller, George A. Language and speech. San Francisco: W. H. Freeman and Company, 1981.

Mintzberg, Henry. The nature of managerial work. (Theory of Management Policy Series.) Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

Negroponte, N. The sensory apparatus of computers. MIT Media Lab: Brochure for the Media Lab Dedication Ceremonies, October 1985.

Nickerson, Raymond S. Using computers: The human factors of information systems. Cambridge, Massachusetts: MIT Press, 1986.

Posner, Michael I. Orienting of attention. Quarterly Journal of Experimental Psychology, 1980, 32, 3-25.

Posner, Michael I., Mary Jo Nissen, and Raymond M. Klein. Visual dominance: an information-processing account of its origins and significance. Psychological Review, 1976, 83(2), 157-171.

Reisberg, Daniel, Roslyn Scheiber, and Linda Potemken. Eye position and the control of auditory attention. Journal of Experimental Psychology: Human Perception and Performance, 1981, 7(2), 318-323.

Yarbus, Alfred L. Eye movements and vision. Translated by B. Haigh. New York: Plenum Press, 1967.