The Ultimate Interface, continued



Understanding voice, hand, and eye

Upsetting the equilibrium

Any conversational exchange between one person and another, as well as between a human and a computer, has a particular form: an initial state of equilibrium; disturbance of that equilibrium; the restoration of equilibrium.

Even a simple exchange of greetings has that form. Person A says "Good morning!" to person B. The prior "neutral" state has been upset; person B "owes" person A some kind of reply. That reply could be a "Good morning!" in return, or even just a nod. The prior equilibrium is restored.

Imagine the user before a large-screen display. On view are a number of items, in this case a set of blocks representing buildings that the user--an architect--is shifting about a plot of land, trying out different layouts for a factory site.

The situation currently is "in equilibrium": the user has shifted some of the blocks around and is now contemplating the way they are set up. After some moments, the user says:

"Twist it (glancing at a particular block)... 30 degrees..."

with the hands making a motion as if turning a steering wheel counterclockwise.

What has just happened? The equilibrium between the user and the system has been disturbed. The user has made some kind of utterance with the intent to alter or adjust some aspect of the scene, and expects the computer to act upon that utterance.

The user makes their utterance in light of, in the context of, the contents of the screen; the scene, if you will, occasions what the user says. To restore equilibrium, the system must interpret the user's actions and carry out whatever intention was expressed.

Working from speech outward

The interpretation strategy assumes that speech is primary. The speech input is examined first. If it is semantically complete, then the system's job is to execute the command. Thus, if the user had said

"Twist the blue block counterclockwise 30 degrees..."

and there were only one blue block present in the immediate context, then what the user said is semantically complete and the system just goes ahead and does it: rotates the blue block 30 degrees counterclockwise.
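As a minimal sketch of this speech-first check, assume the spoken command has already been parsed into a frame with three slots. The `TwistCommand` frame, its slot names, and the `resolve_by_color` helper are hypothetical, introduced only for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TwistCommand:
    """Hypothetical frame for the spoken command "twist"."""
    target: Optional[str] = None      # which block to turn
    direction: Optional[str] = None   # "clockwise" or "counterclockwise"
    degrees: Optional[float] = None   # how far to turn


def resolve_by_color(color: str, scene: Dict[str, str]) -> Optional[str]:
    """Return the one block of this color, or None if zero or several match."""
    matches = [block_id for block_id, c in scene.items() if c == color]
    return matches[0] if len(matches) == 1 else None


def missing_slots(cmd: TwistCommand) -> List[str]:
    """Name the slots that speech alone did not fill."""
    return [name for name in ("target", "direction", "degrees")
            if getattr(cmd, name) is None]


# Assumed scene: block id -> color. Only one blue block is present.
scene = {"b1": "blue", "b2": "red", "b3": "red"}
cmd = TwistCommand(resolve_by_color("blue", scene), "counterclockwise", 30.0)
ready_to_execute = not missing_slots(cmd)
```

When `missing_slots` returns an empty list, the utterance is semantically complete and the system can simply act on it.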

The axis about which to twist the blue block is, in this case, reasonably assumed by the machine to be the vertical axis. The blocks on view represent buildings, and the work at hand is trying them positionally this way and that about a graphical surface representing a plot of land.

Some of the semantics of the situation--such as the items being "buildings" situated on "land"--derives from the nature of the database: what the items represent, what properties they might have. This is a complex subject, which I'll return to in a later section entitled "Topic-Independent Dialog Expertise."

For now, though, let's assume a fairly rudimentary "blocks world" with sufficient built-in semantics to let the user maneuver various blocks representing buildings about a graphical surface representing a plot of land.

Going beyond speech

Now, the user in our example did not say "Twist the blue block counterclockwise 30 degrees," that is, a sentence wherein all the necessary information to interpret the command is contained in the speech. Instead, they said:

"Twist it (glancing at some block)... 30 degrees..."

Thus, there's information missing from the utterance, and the strategy now becomes to search the situation for that missing information. Part of that "situation" to be searched is where the user might have been looking when they uttered the words, and what they might have been doing with their hands.

The command "twist" designates turning something--"it"--about one of its axes. What is "it"? In the present case, a logical source of such information would be where and at what object the user may have been looking at the time they uttered the sentence.

Let us say they were glancing at a certain block in the lower left portion of the depicted scene. Maybe not directly at the block--the eye is somewhat labile--nor even steadily at it, but near enough to it and long enough so that the machine may reasonably conclude that it is the intended item.
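One way to make "near enough and long enough" concrete is to credit each fixation to the nearest block and require a minimum total dwell. This is only a sketch: the function name, the pixel radius, and the dwell threshold are assumptions, not measured values.

```python
import math
from typing import Dict, List, Optional, Tuple


def resolve_gaze_target(
    fixations: List[Tuple[float, float, float]],   # (x, y, duration in seconds)
    blocks: Dict[str, Tuple[float, float]],        # block name -> screen position
    max_dist: float = 60.0,                        # assumed "near enough" radius
    min_dwell: float = 0.3,                        # assumed "long enough" total
) -> Optional[str]:
    """Pick the block the user most plausibly meant, or None if unclear."""
    if not blocks:
        return None
    dwell = {name: 0.0 for name in blocks}
    for fx, fy, duration in fixations:
        # Find the block closest to this fixation.
        name, dist = min(
            ((n, math.hypot(fx - bx, fy - by)) for n, (bx, by) in blocks.items()),
            key=lambda pair: pair[1],
        )
        # The eye wanders, so accept fixations that land merely close by.
        if dist <= max_dist:
            dwell[name] += duration
    best = max(dwell, key=dwell.get)
    return best if dwell[best] >= min_dwell else None
```

A glance that never settles near any block long enough yields `None`, which leaves the "it" unresolved for some other mode to settle.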

How about which direction to twist the block? A reasonable source of the information is what the person is doing with their hands. Here, the user is in fact making a gesture not unlike the position and actions of both hands of a bus driver swinging a left turn.

The gesture could as well be performed by one hand, e.g., making a motion as if unscrewing a jar cap or, perhaps, dialing a rotary phone. The point is that the precise form of the manual gesture doesn't matter. What is at issue is the direction of the twist--one bit of uncertainty. The task of the machine is to look for anything in the realm of gesture--even a twist of the head, a swing of the eyes, whatever--that suggests the twist be counterclockwise rather than clockwise.

Note that the axis about which to twist is readily offered by the axis of twist of the user's hand, or the axis of the "steering-wheel" shaft if that is the kind of gesture used. In the current example, the twist axis is fixed by the assumption in the database, and hence by the machine, that the only free axis is the vertical one; with some different topic matter and associated database, the versatility of hand attitude to specify axis-of-rotation is a distinct plus.

What is also at issue is how far to twist. In this example, the user specified that by voice:

"...30 degrees...".

Had the user instead intended just a clockwise "nudge" of the block, the amount of twist could perhaps be left unspoken, to be estimated by the machine from the manual input: how far the hand was rotated, or how far the imaginary steering wheel was turned.
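The division of labor between voice and hand might be sketched as follows. The signed-angle convention (positive meaning counterclockwise) and the function name are assumptions made for the example:

```python
from typing import Optional, Tuple


def interpret_twist(
    spoken_degrees: Optional[float],
    hand_rotation_deg: Optional[float],
) -> Tuple[Optional[str], Optional[float]]:
    """Combine speech and gesture into (direction, degrees).

    hand_rotation_deg is the signed rotation of the hand or imaginary
    steering wheel; positive is taken to mean counterclockwise (assumed).
    """
    # Direction is the one bit the gesture must supply.
    if hand_rotation_deg is None or hand_rotation_deg == 0:
        direction = None
    elif hand_rotation_deg > 0:
        direction = "counterclockwise"
    else:
        direction = "clockwise"

    # A spoken amount wins; otherwise estimate it from the size of the gesture.
    if spoken_degrees is not None:
        degrees = spoken_degrees
    elif hand_rotation_deg:
        degrees = abs(hand_rotation_deg)
    else:
        degrees = None
    return direction, degrees
```

Note that the size of the gesture is ignored whenever an amount was spoken: a large swing of the arms accompanying "30 degrees" still means 30 degrees.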

The machine response

The response of the machine is to do what the user asked.

On the view that what people and machine best talk about are--as in the case of the young people in the sidewalk cafe, or Negroponte's diners in a foreign land--things which are concrete and share the same space and time, the response of the machine is some action in that same space and time.

The user sees and perhaps as well hears the intended action taking place. The blue block does in fact twist on its vertical axis. Perhaps it makes a squeaking sound as it turns--what they call in show biz "traveling music"--to make more obvious to the user that action is taking place.

There is no concurrent chatty "As you wish" or equivalent vocalization from the machine. This is a preference, perhaps a prejudice, on my part. I am not an advocate of anthropomorphizing computers, for many reasons, some of which I'll be touching upon in later sections. For now, let me simply express my preference that the computer carry out the user's command directly and ostensively, providing the user first-hand, explicit confirmation that what they wanted done has, in fact, been done.

What about errors?

Nonsense

An utterance by the user can be "nonsense" in that the semantics of the utterance bear no relation to any items on display or plausible actions that might be taken upon them.

That can happen inadvertently, or perhaps by way of mischief. Suppose, in the midst of the scenario about twisting the blue block, the user said instead:

"Make all bananas very salty this Fall"

there being no bananas on display, or indeed in the machine's current database vocabulary, and the system as currently constituted having no database knowledge concerning saltiness or seasons of the year.

This is important, and is part of what Grice calls the "Cooperative Principle":

Our talk exchanges do not normally consist of a succession of disconnected remarks, and would not be rational if they did. They are characteristically, to some degree at least, cooperative efforts; and each participant recognizes in them, to some extent, a common purpose or set of purposes, or at least a mutually accepted direction...We might then formulate a rough general principle which participants will be expected...to observe, namely: Make your conversational contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged. One might label this the COOPERATIVE PRINCIPLE [emphasis in original]. (Grice, 1975, p. 45.)

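Following on this requirement, Grice indicates a number of rules, which he calls "Conversational Maxims," centered about four tenets, which may be briefly summarized as follows: Quantity (be as informative as is required, and no more so); Quality (do not say what you believe to be false, or that for which you lack adequate evidence); Relation (be relevant); and Manner (be clear, brief, and orderly, avoiding obscurity and ambiguity).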

(For a fuller statement of these maxims, see Grice, 1975, p. 46; for extended critical commentary on Grice's Cooperative Principle and related maxims, see Martinich, 1984, pp. 18-39.)

All this amounts to the notion that, in a conversation, the two (or more) participants ought to say things to one another that potentially make sense, and avoid deliberate inanities such as in an old Marx Brothers movie. Specifically, they are not out to trip one another up.

But should they do so--deliberately or inadvertently--should the machine say anything at all, such as "That does not compute..." or "I don't understand that statement"? My take is that it should not, but should just consider the input as something of an "aside" on the user's part, and simply wait for the next input.
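A crude sketch of this "treat it as an aside" policy follows. The tiny command vocabulary and the function names are assumptions; a real system would test the utterance against its whole database of items and plausible actions, not a word list.

```python
import re
from typing import List

# Assumed command vocabulary for the rudimentary blocks world.
KNOWN_COMMAND_WORDS = {"twist", "move", "more"}


def tokenize(utterance: str) -> List[str]:
    """Lower-case the utterance and keep only its alphabetic words."""
    return re.findall(r"[a-z]+", utterance.lower())


def triage(utterance: str) -> str:
    """Return "interpret" for a candidate command, "ignore" for an aside.

    "ignore" means the machine says nothing and waits for the next input.
    """
    words = tokenize(utterance)
    if any(word in KNOWN_COMMAND_WORDS for word in words):
        return "interpret"
    return "ignore"
```

Here "Make all bananas very salty this Fall" falls through to `"ignore"`, while "Twist it... 30 degrees..." is passed on for multimodal interpretation.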

This strategy perhaps makes for a "stuffy" machine. Might the machine instead attempt some Groucho Marx style come-back to the user?

Interesting possibility. But very complicated (impossibly so..??), because the rules underlying such banter mean breaking the rules (even that one...).

Too, whimsy is open-ended, whereas a plausible and serviceable repertoire of commands to manipulate concrete objects is relatively finite. To do the ordinary is hard enough. Maybe someone will come up with an effective "computational model" of stand-up repartee. But, I doubt it. "Computational humor" is probably an oxymoron.

Incomplete statements

A different case entirely is that of incomplete statements. Semantically incomplete utterances are potentially repairable, and in terms of the "conversational contract" between user and machine, it is safe in most cases to assume that the user simply has made an error, and is not, as it were, pulling the machine's (metaphorical) leg.

A statement can be incomplete because some piece of information is missing, either by mistake or because garbled. Suppose, with "Twist it...30 degrees," the gestural behavior was lost because of some hardware glitch, but otherwise the utterance seems well-formed. The appropriate thing for the machine to do is ask the user "Which direction?" The reply can be verbal ("Counter-clockwise") or via gesture, either a repeat of what was motioned before, or some variant.
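The repair step can be sketched as a lookup from the first unfilled slot to a short question. "Which direction?" comes from the example above; the slot names and the other two questions are assumptions added for illustration.

```python
from typing import Dict, Optional

# Assumed slot order and wording; only "Which direction?" is from the text.
QUESTIONS = [
    ("target", "Which one?"),
    ("direction", "Which direction?"),
    ("degrees", "How far?"),
]


def clarification(frame: Dict[str, Optional[object]]) -> Optional[str]:
    """Return the question for the first missing slot, or None if complete."""
    for slot, question in QUESTIONS:
        if frame.get(slot) is None:
            return question
    return None


# "Twist it...30 degrees" with the gesture lost to a hardware glitch:
frame = {"target": "block_7", "direction": None, "degrees": 30.0}
prompt = clarification(frame)
```

The user's answer, whether spoken or gestured, then fills the same slot that the lost input would have filled.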

Conflicting input

All the input can be syntactically present, but semantically ambiguous.

For instance, the user's glance falls between two side-by-side items, either item a plausible candidate for "twisting." The system need only ask "Which one?" perhaps blinking the candidate items alternately. The user can specify the intended item by a word or two, a pointing gesture, or a nod of the head to either side.
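Detecting this kind of ambiguity might be sketched as follows: any item nearly as close to the point of gaze as the nearest one stays in the running. The `margin` value and the function name are assumptions.

```python
import math
from typing import Dict, List, Tuple


def gaze_candidates(
    gaze: Tuple[float, float],
    items: Dict[str, Tuple[float, float]],
    margin: float = 20.0,   # assumed: how close a runner-up must be to count
) -> List[str]:
    """Return every item about as near to the gaze point as the nearest one."""
    def distance(name: str) -> float:
        x, y = items[name]
        return math.hypot(gaze[0] - x, gaze[1] - y)

    ranked = sorted(items, key=distance)
    if not ranked:
        return []
    nearest = distance(ranked[0])
    return [name for name in ranked if distance(name) - nearest <= margin]
```

One candidate means the referent is settled; two or more is the cue for the system to ask "Which one?" and blink the returned items.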

The immediate past as context

Exploiting temporal contiguity

Suppose the user is working on some particular block representing a building, trying to jockey it into place so that it "looks right" to them in light of the overall plan they are trying to evolve for the factory site.

They say something like

"Twist it (glancing at some block)... 30 degrees..."

followed a moment later by something like

"...more..."

What is a reasonable strategy for the machine to take in interpreting that last statement? One is to use the immediate past as context.

"More..." may be taken in an adverbial sense, modifying what occurred just beforehand. Treated as a "follow-up command," it has a temporal context; it logically applies to the command which just preceded it. But, how much time to let go by before the system might be well-advised to ask the user to re-specify the entire command?

It depends. The machine should take its cue from the pace at which the user has (recently) been issuing commands. Five to ten seconds may be right if the user's pace of producing utterances has been fairly rapid; longer, if the pace at which the user has been issuing commands has been relatively slow.
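A sketch of such a pace-sensitive window appears below. The scaling factor and the upper limit are assumptions; the five-second floor simply echoes the lower end of the range suggested above.

```python
from typing import List


def follow_up_window(
    command_times: List[float],   # timestamps (seconds) of recent commands
    factor: float = 2.0,          # assumed multiple of the user's average gap
    floor: float = 5.0,           # shortest window, for a fast-paced user
    ceiling: float = 30.0,        # assumed longest window worth allowing
) -> float:
    """How long after the last command a bare "more..." still attaches to it."""
    if len(command_times) < 2:
        return floor
    gaps = [later - earlier
            for earlier, later in zip(command_times, command_times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return max(floor, min(ceiling, factor * mean_gap))


def is_follow_up(now: float, command_times: List[float]) -> bool:
    """True if an utterance at time `now` falls within the current window."""
    if not command_times:
        return False
    return now - command_times[-1] <= follow_up_window(command_times)
```

Outside the window, the system would do better to ask the user to re-specify the whole command than to guess.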

Quantifying the adverb

But, how much is meant by "more..."?

If unaccompanied by any gestural input which could, by the scale of the movement, enable some estimate of "a whole lot more" vs. "just a little bit more," perhaps the reasonable thing for the machine to do is to make a follow-up movement of, say, 50% of the magnitude of what it did before.

Perhaps any number of other rules of thumb would be equally reasonable. It is all very subjective to the user, and probably interacts perceptually with the shape, size, etc. of the items to which the action is being applied. The user, in any case, can intervene at any time and make more explicit adjustments, as in saying, e.g., "10 degrees more...".
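The order of precedence might be sketched like this: an explicit spoken amount first, then an amount estimated from gesture, then the 50% rule of thumb. The parameter names are assumptions, and the default fraction is only the example figure given above.

```python
from typing import Optional


def more_amount(
    previous: float,                          # magnitude of the prior action
    spoken: Optional[float] = None,           # e.g. "10 degrees more..."
    gestured: Optional[float] = None,         # magnitude estimated from the hand
    default_fraction: float = 0.5,            # the 50% rule of thumb
) -> float:
    """Decide how much "more..." should add to the previous action."""
    if spoken is not None:
        return spoken          # the user said exactly how much
    if gestured is not None:
        return gestured        # the scale of the movement gives an estimate
    return previous * default_fraction
```

For a prior twist of 30 degrees, a bare "more..." gives 15 degrees, whereas "10 degrees more..." gives exactly 10.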

Being alert to user nuance

But, is it always the case that the right thing for the system to do "more" of is the last thing it did? In our example, that would mean twisting the block a little more counter-clockwise.

Perhaps. Perhaps not.

As with any command, the system should always take into account what the user is doing in any or all of the modes available to them.

For instance, the user may be saying "more..." unaccompanied by any gestural action. Fine. Then it's "more" of what was just done. Or, the user could be making a twisting gesture counter-clockwise, that is, in the same direction as the just prior command. That's fine, too; it's "more" in the same direction.

However, the twist could be in the opposite direction: clockwise.

Thus, the rule of thumb: the system does "more" of the last thing it did unless overridden by some component of the follow-up command, such as co-presence with the follow-up command of a twisting gesture in the opposite direction of the prior twist.

The point is that any "follow-up" command whereby the user is trying to modify some action that went on a short time before is also a multimodal command.

Multimodal "follow-up" commands

If the system is consistent in applying a multimodal analysis to all commands, even syntactically incomplete ones, then this affords the user much flexibility--through gesture, for example--to modify actions just taken in the immediate past.

For instance, suppose, sometime prior to performing twisting actions on a block representing a building, the user had also been moving the block back into the perspective of the scene by saying "Move it..." or "Move it back..." accompanied by a "pushing" gesture. Then, should the user later say "More..."--but instead of a twisting gesture perform a pushing gesture--the system ought to apply the "more..." modifier to the action referenced by the current gesture--pushing--not to that of twisting.

Thus, the content of gesture--if present in a follow-up command--can not only override the specifics of some action (e.g., the direction in which to twist) but, by the nature of the gesture, indicate to which of the recently given commands the follow-up modification does in fact refer.
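Both roles of gesture in a follow-up, overriding a detail and selecting which earlier command is meant, can be sketched over a short history of actions. The `Action` and `Gesture` records, their fields, and the 50% fraction are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    kind: str         # "twist" or "move"
    direction: str    # e.g. "counterclockwise", "back"
    amount: float


@dataclass(frozen=True)
class Gesture:
    kind: str         # what the motion resembles: "twist" or "move"
    direction: str


def interpret_more(
    history: List[Action],
    gesture: Optional[Gesture] = None,
    fraction: float = 0.5,
) -> Optional[Action]:
    """Turn "more..." into a concrete action, or None if nothing fits."""
    if not history:
        return None
    if gesture is None:
        # No gesture: "more" of whatever was done last.
        last = history[-1]
        return replace(last, amount=last.amount * fraction)
    # Gesture present: find the most recent action of the same kind,
    # and let the gesture's direction override the earlier one.
    for past in reversed(history):
        if past.kind == gesture.kind:
            return Action(past.kind, gesture.direction, past.amount * fraction)
    return None
```

With a "move back" followed by a counterclockwise twist in the history, a pushing gesture makes "more..." refer to the move, and a clockwise twisting gesture reverses the direction of the twist.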

Types of gestural input

What gesture is...and isn't

Not every movement is gesture. It's important to distinguish what movements probably are gestures, and which probably are not.

Psychologist Adam Kendon conducted a study to find out "...whether or not people did consistently recognize only certain aspects of action as belong[ing] to gesture." The following summarizes his main observations (Kendon, 1986, pp. 26-31):

Distinguishing gesture or deliberately expressive movement from movement that is "natural," "ordinary," or of "no significance": "Deliberately expressive movement was movement with a sharp boundary of onset, marked by an excursion, rather than as resulting in any sustained change of position."

For example, the following types of actions were seen by Kendon's subjects as being gesture:

Movement involving the manipulation of an object: manipulations such as changing the position of an object were never seen [by subjects] as expression; instead, were seen as "practical." Movements touching the clothing or one's body were never seen as part of deliberate expression. (Kendon, 1986, pp. 27-28).

Kendon's definition of gesture: "The word 'gesture' serves as a label for that domain of visible action that participants routinely separate out and treat as governed by an openly acknowledged communicative intent." (Kendon, 1986, p. 28.)

A gestural taxonomy

A useful taxonomy of gestural types is the one offered by Rime and Schiaratura (1991):

Referring to ideation

Evocative

Depictive

Those items of the taxonomy marked with an asterisk (*) are viewed as those types of co-verbal gesture that would be involved when the speaker is talking about concrete objects, and situations, with reference toward them.

Such gestures as "beats" and ideographic gestures may be of importance to the speaker in helping them to organize their thoughts, the pace of their presentation, or as motor accompaniments to thinking. However, they do not seem to bear any information about the content of what the speaker is saying (though perhaps revelatory of the speaker's state of mind, or level of arousal or nervousness).

My focus in this discussion is upon the deictic, iconic, and pantomimic types of gesture, namely those gesture types which contribute referential or semantic content to the speaker's utterance.

The role of the eyes

The eyes operate as an input channel to someone who is observing, and as an output channel to others who may witness the activity of the observer's eyes. Thus, in developing a taxonomy of looking we have to concern ourselves with:

The first taxonomy of looking behavior offered below--that of psychologist Daniel Kahneman--is elaborated from the vantage of the observing person. Kahneman's taxonomy follows from the motivation of the psychologist to understand and characterize human looking behavior.

The second taxonomy--my own, in the section below entitled "Inferring user states from eye-tracking"--is elaborated from the aspect of someone observing another's behavior. This second taxonomy stems from a much different question than that of the psychologist, namely what interpretation a computer-based display might place upon certain kinds of looking patterns and, in turn, how the system might respond. Put another way, this second taxonomy is concerned not so much with how people look as with how looked-at items might behave.

How and why people look

Kahneman (1973) cites three types of looking

...distinguished by the situation in which they occur: spontaneous looking in the absence of a specific task set; looking that serves to acquire task-relevant information; and looking that accompanies internal processing events. (Kahneman, p. 64, italics added.)

Kahneman summarizes each mode thus:

Spontaneous looking is controlled by collative features of stimuli, such as novelty, complexity, and incongruity. The antecedents of these enduring dispositions are found in innate dispositions to orient toward contours and toward moving objects. The enduring dispositions that control spontaneous looking serve the function of information-seeking, rather than the function of pleasure-seeking.
Task-relevant looking...[is] an allocation problem. Because the area of sharp vision is narrow, it must be directed to those portions of the field which are likely to be richest in relevant information. The decision[s] often require a sophisticated weighting of many factors, and thus are made quickly, for the eye changes position 3-5 times a second. The sequential allocation of glances is a highly skilled performance. The system generally makes decisions about the locus of individual fixations rather than about their duration, which is often quite stable. In complex visual discriminations, however, the duration of individual fixations may vary, within rather narrow limits, according to the demands of the task.
Finally, eye movements are a salient manifestation of the changing orientations which occur whenever the focus of thought refers to a direction in space. This orientation occurs even when it cannot possibly aid in the acquisition of new information. Movements of the eye also accompany, and perhaps influence, the balance of activity between the cerebral hemispheres, and the rate of eye movements often corresponds to the rate of mental activity. (Kahneman, p. 65, italics added.)

With regard to how deliberate looking is in ordinary life, Kahneman notes:

Looking is obviously under voluntary control, because one can decide where to fixate, but conscious and deliberate control of fixation is actually infrequent. As with other highly skilled components of voluntary performance, such as walking or the maintenance of balance, looking is controlled by a general intention, and consciousness plays a minor role in the execution of the intended sequence of fixations. The processes that determine the locus of individual fixations are psychologically silent, and their feedback is so poor that people do not usually know precisely where they are looking. (Kahneman, p. 51.)

This commentary is relevant to the possible use of the eye as a "pointer."

The eye as a "pointer"

A number of researchers in human-computer interaction regard the eyes potentially as a pointer to material on display: the user looks at some item or area, and either by blinking or sustaining the gaze for some time-out period, the item or area is selected, the whole process not unlike clicking a mouse.

The eye is, in fact, an excellent "pointer" in that one can fixate some spot, look away, and come back right on target. And, I would agree that if the only output modality available to the user is line-of-sight, as with someone severely disabled, then such use makes much sense.

On the whole, though, while the eye can be enlisted at the interface as a "pointer"--i.e., used in a deliberate fashion by the observer to "pick out" items on display--to my mind it represents a mis-use of the eyes.

It's much more compatible with the way we normally function to--in the interface situation--point with the hand while the eye functions more globally to gather information about what is in the surround.

In any event, those researchers in human-computer interaction who consider the role of eyes mainly as "pointers" in fact miss the point (...the pun perhaps intentional..). For the most part, the role of the eyes is social.

The eye in "address"

The act of looking may assign the role of addressee in the sense of designating whom I am addressing when I say "Happy Birthday" to someone in a group of people, only one of whom is having a birthday.

From my side of things, I don't experience myself as "pointing" at them. I am just "looking" at them, something less formal, less exacting, less effortful for me than if I felt I had somehow to "use" my eye to single them out.

In any event, the person to whom I am directing my words gets the message very clearly and easily, and those for whom the greeting is not intended readily pick up on that from observing where my eyes are trained, i.e., not at them.

But, if I'm congratulating the group--it's the first anniversary of our bowling team--then the fact that I automatically tend to sweep my gaze across the set of faces as I speak effectively signals that I am addressing the whole group. Thus, both narrow-focus and wide-focus "address" are equally well conveyed by what I am doing with my eyes, and with no deliberate effort on my part.

Eye-pointing as "gesture"

There is, in everyday life, a species of pointing-by-eye, but it's more akin to gesture than to pointing as such.

For instance, deliberate pointing with the eyes can occur when we signal to a checkout clerk, our arms being filled with grocery bags, to put that last packet "up here" (while glancing at the top of one of the bags we are holding).

We also see such gestural use of the eyes in flirting, winking, or as carefully nuanced by an actor in a movie or a play. In those contexts, the concern is less with where the line-of-sight is actually directed than with how the eyes might appear to another.

The Midas Look

Some researchers fret about what they term the "Midas touch" effect of eyes as an interface input device.

The eyes--except when shut or during the moment of blinking--are always to be viewed as looking somewhere or other. The eyes are always "on." Thus, a display which monitors point-of-gaze potentially has this "stuck button" in the form of eyes that are always outputting.

This, however, is a moot problem in the instance of the multimodal interface. The ready solution is to track the eyes, yes, but only act upon point-of-regard information when there is a signal in some other mode--speech or gesture--to "do something": to delete a file, select an icon or object, or whatever.
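This gating can be sketched as a small class in which gaze samples are only recorded, and nothing happens until a verb arrives from another mode. The class and method names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class GazeGate:
    """Track the eyes continuously, but act only on an explicit trigger."""

    def __init__(self) -> None:
        self.trace: List[Point] = []   # every gaze sample, kept but not acted upon

    def on_gaze(self, x: float, y: float) -> None:
        """Record a gaze sample. Deliberately triggers no action."""
        self.trace.append((x, y))

    def on_trigger(self, verb: str) -> Optional[Tuple[str, Point]]:
        """Called when speech or gesture says "do something".

        Pairs the verb with the most recent point of regard, or returns
        None if the eyes have not yet been seen anywhere.
        """
        if not self.trace:
            return None
        return verb, self.trace[-1]
```

Because `on_gaze` never returns an action, the eyes can wander freely over the display without anything being selected or deleted along the way.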

Nonetheless, there is a lot of potential utility in the "trace" of the eye--the trail of individual fixations the eye makes--even when those fixations are not acted upon as such.

The patterns of eye fixations can reflect interest in, and attention to, items and areas on display, both in general and in particular. Such patterns, observed and analyzed in the light of what is on display, can be useful in orchestrating the presentation of information to the user--what they seem to be attracted to, or not paying attention to when, in fact, they should.

However, while the correlation between interest and attention with where the eyes are trained is strong, it is not absolute. That is, while patterns of eye movements are excellent indices of what the user is interested in, what attracts their attention, it's nonetheless possible to be paying attention to something while not looking at it, and to be looking at something but not actually paying attention to it, as while "daydreaming." Interpretation of what the eyes may "mean" must always be done in light of the overall situation.

Inferring user states from eye-tracking

My own informal taxonomy of classes of information that might be inferred concerning the state or intention of a person being eyetracked is the following:

Conflict between eye and hand

Information given via hand and eye may at times seem to conflict. For example, the user might say:

"Move that (looking at some item) ...over here (pointing to some spot, but looking at some spot in a distinctly separate part of the display)."

Where to place the item? Put it where the user is pointing? Or, where the user is looking?

As a general rule, pointing by hand is a more effortful and deliberate act than looking; on that basis, a reasonable strategy is--all else being equal--to place the item where the user is pointing.

In instances wherein the placement of the item is critical and the circumstances of the entire operation are of such a nature as to not tolerate errors, the system might well insist that the user be both looking and pointing at the same spot at the same time. (While looking is inherently more volatile than manual pointing, the system ought to insist that the user at least be looking at the same general area as where they are pointing, and not at some spot at some far remove.)
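The two policies, hand wins by default and eye and hand must agree when errors cannot be tolerated, can be sketched in a few lines. The `strict` flag, the `tolerance` radius, and the function name are assumptions.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def placement_point(
    hand: Point,
    gaze: Point,
    strict: bool = False,
    tolerance: float = 100.0,   # assumed size of "the same general area"
) -> Optional[Point]:
    """Decide where "over here" is when hand and eye may disagree."""
    if not strict:
        # Pointing is the more deliberate act, so the hand decides.
        return hand
    separation = math.hypot(hand[0] - gaze[0], hand[1] - gaze[1])
    # In critical settings, accept the hand only if the eye roughly agrees.
    return hand if separation <= tolerance else None
```

A `None` result in strict mode would be the system's cue to withhold the move until the user is looking at roughly where they are pointing.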


...to be continued...


References

Grice, H. P. "Logic and Conversation." In: Peter Cole and Jerry L. Morgan (eds.), Syntax and Semantics, vol. 3: Speech Acts. New York: Academic Press, 1975.

Kahneman, Daniel. "Looking." Chap. 4, pp. 50-65, in Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall, 1973.

Kendon, Adam. "Current issues in the study of gesture." In: The Biological Foundations of Gestures: Motor and Semiotic Aspects. Jean-Luc Nespoulous, Paul Perron, and Andre Roch Lecours, eds. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc., 1986, 23-47.

Martinich, A. P. Communication and reference. New York: Walter de Gruyter, 1984.

Rime, Bernard and Loris Schiaratura. Gesture and speech. In Robert S. Feldman and Bernard Rime (eds.), Fundamentals of nonverbal behavior. Cambridge, England: Cambridge University Press, 1991, Chapter 7, 239-281.