The Ultimate Interface, continued



Meaning from hands and eyes

Factoring in gesture

Gesture can interact with actions in speech and looking. Consider the observation attributed to the late Warren McCulloch of MIT to the effect that the difference between dogs and people is that when you point, a person will look where you are pointing while a dog will look at your hand. That difference between people's and dogs' ways of interpreting the meaning of some hand action has to do with noticing as well where the eyes of the speaker are trained.

Suppose someone says "The best fishing around here is at Lake Winnebago," all the while chopping the air with their hands apart but looking straight into your eyes. Then, suppose that speaker says and does the exact same things, except that now you see them looking at their own hands. Everything is the same, except that what you as hearer understand is that in the first instance the speaker's hands merely emphasize the assertion being made, whereas in the second instance something is being indicated by the hands, namely the size of the fish to be caught.

Hands and handles

Objects proffer "handles" toward us. Sometimes, these handles are literally there, as in the case of teacups, hammers, briefcases, ice cream cones. There is some part of the object that is so shaped as to invite or imply how it is to be picked up and used. (These are what perceptual psychologist J. J. Gibson terms "affordances.") Some part or aspect of the object plays a role, by virtue of its shape, in our relationship to it.

Other items do not have explicit handles, but have shapes to which we conform the shape and actions of our hands. As we refer by gesture to these items, our actions are influenced by their shape and size.

For example, we ask someone to move a table--"Swing it around that way..."--all the while holding our hands horizontally apart with palms facing to indicate the left and right edges of the table; the movement of the implicit tabletop held in our hands indicates the direction and amount of "swing."

The shapes of things both invite us to mime them and suggest to us the pattern our miming might take. This ranges all the way from showing someone how to wash a dog to scribing in mid-air the paths of vehicles when describing an automobile collision.

Facial gesture

A more exotic kind of gesture-sensing is computer reading of facial expressions. Facial expression can communicate emotion, as is well known, and may communicate semantic information as well. Computers may learn to register and recognize emotion in the face, perhaps as well through thermographic video sensitive to heat changes in regions of the face, and perhaps through image analysis of facial regions and contours.

What the eyes tell us

The basic facts of how and where people look have been well researched, and can easily be summarized.

Eye basics

(The following discussion is drawn from Spoehr and Lehmkuhle, 1982, and Cumming, 1978.)

A human observer has a tiny volume of highly acute vision, less than one degree across, corresponding to a local concentration of receptor cells on the axis of the eye's lens, called the fovea. In any single fixation, from one to five degrees of visual information is processed. Thus, to preserve a clear view of a scene larger than a few degrees, the eye moves about in a succession of quick movements called saccades.

Saccades ordinarily occur several times per second, under voluntary but unnoticed control. The eye rarely moves more than 15 degrees in any saccade, the great majority of saccades being smaller than 10 degrees (Yarbus, 1967).

The movements are fast. If under 5 degrees, they take about 30 milliseconds to execute, with the time taken for larger saccades increasing linearly to about 100 milliseconds for 40 degree movements. The velocity of a saccade increases smoothly to a maximum, over 1000 degrees/second for large movements, then decreases smoothly to zero. Of the total time spent looking about a picture, saccades take about 5 percent, during which time visual information is blurred across the retina and more or less suppressed by the visual system.
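
The timing figures just quoted can be turned into a rough estimate of how long a saccade of a given size takes. The sketch below is only an illustration: the function name saccade_duration_ms is hypothetical, and the constants are simply the 30 millisecond, 100 millisecond, 5 degree, and 40 degree figures from the text, joined by a straight line.

```python
def saccade_duration_ms(amplitude_deg: float) -> float:
    """Rough saccade duration built from the figures quoted in the text.

    Assumption: about 30 ms for movements of 5 degrees or less, rising
    linearly to about 100 ms at 40 degrees. The text gives no figure
    beyond 40 degrees, so this sketch refuses to extrapolate.
    """
    if amplitude_deg < 0:
        raise ValueError("amplitude must be non-negative")
    if amplitude_deg > 40:
        raise ValueError("no figure given beyond 40 degrees")
    if amplitude_deg <= 5:
        return 30.0
    slope = (100.0 - 30.0) / (40.0 - 5.0)  # 2 ms per additional degree
    return 30.0 + slope * (amplitude_deg - 5.0)
```

Under these assumptions a 22.5 degree movement, halfway between 5 and 40 degrees, comes out at 65 milliseconds.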

An eye fixation typically lasts between 200 and 500 milliseconds. The latency for a saccade to a sudden shift in the position of what is being looked at is about 200 milliseconds. A saccade is ballistic in nature; that is, once started, it goes to completion and the end point cannot be changed.

How are end points selected? Why, for example, do people look where they do when viewing a picture?

Looking where we do

First, people look more at regions containing unusual objects and unpredictable contours than at regions composed mostly of high figural redundancy (Mackworth and Morandi, 1967). Upon initial exposure to a picture, observers tend to fixate briefly on some spot, then make a lengthy saccade. From then on, fixations become somewhat longer and saccades shorter. Observers' eyes tend to go to areas of high information, fixate on them briefly, then explore locally by making short excursions to nearby areas of less information (Antes, 1974). The long saccades to new regions are not at random, but are made to points of high informativeness, suggesting that information gained from the visual periphery in part determines where the eye moves next.

It is not mere figural information that determines where the observer looks next. Loftus and Mackworth (1978) found that observers tended to make more fixations upon items and objects in a scene which were incongruent or unexpected, like an octopus in a farmyard scene, rather than at a barn, silo, or wagon. Further, that such informative (in the sense of having low predictability) objects were fixated sooner in the viewing period suggests again that observers must be able to gain information from peripheral vision about where next to look.

In his pioneering work on eye movements, Yarbus (1967) found that observers did not necessarily devote fixations to the darkest or brightest regions of a picture, nor to regions with the greatest detail unless somehow particularly informative. Contours or outlines of figures as such tend not to be followed by the eye unless as in, for example, a facial profile, where the informative details lie along that contour.

Yarbus also found that the observer's purposes while looking at a picture affected the pattern of looking. For instance, when asked to judge the ages of people depicted in a painting, observers looked mostly at the people's faces where naturally enough the best age clues would be. If asked instead to gauge the wealth and social position of the people in the scene, fewer fixations were made on the faces of people in the picture, and many more on the clothing and on the objects in the room where they were.

Other cues from eyes

Clues from the eyes can provide important ongoing feedback to a computer system as to how well a presentation or exposition of some topic is going. For instance, since people tend to look at things named (Cooper, 1974), the system can use that phenomenon to aid in assessing the user's current level of attention to what is currently being presented.

Also, pupil diameter--a measure available from corneal-reflection eyetrackers, and one found to be correlated with attraction, interest, and suspense (Hess, 1965; Hess and Polt, 1964)--can provide ancillary clues to the level of user involvement. Do the pupils widen upon the presentation of something? If so, this is evidence of interest and attraction; if not, the observer may be bored or satiated, and the pace of the presentation might well be picked up. Pupil-diameter information by itself is not definitive, but, combined with other information such as the "briskness" of looking activity and the pace of user interaction generally, it can be most helpful.
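
The "do the pupils widen?" question can be made concrete as a comparison of diameter samples taken just before and just after something is presented. What follows is a minimal sketch, not a validated measure: the function names are hypothetical, and the 5 percent threshold is an arbitrary illustrative value rather than a figure from Hess or any other source. As noted above, such a cue is not definitive on its own.

```python
from statistics import mean
from typing import Sequence


def pupil_change_ratio(baseline_mm: Sequence[float],
                       response_mm: Sequence[float]) -> float:
    """Relative change in mean pupil diameter after a stimulus appears."""
    if not baseline_mm or not response_mm:
        raise ValueError("need samples both before and after the stimulus")
    base = mean(baseline_mm)
    return (mean(response_mm) - base) / base


def suggests_interest(baseline_mm: Sequence[float],
                      response_mm: Sequence[float],
                      threshold: float = 0.05) -> bool:
    # threshold is an assumed, illustrative value; a real system would
    # calibrate it per user and combine it with other cues.
    return pupil_change_ratio(baseline_mm, response_mm) > threshold
```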

Head-tracking vs. eye-tracking

Let us note in passing that head-tracking is not necessarily a substitute for eye-tracking--and vice versa.

Head-tracking may serve well in interface applications where the program needs to know where the user is facing rather than where specifically they are training their gaze. However, the average person can shift their eyes about 15 degrees before having to move their head (Robinson, 1979). Thus, in cases where it is important to know where in fact the eyes are trained, head-tracking is not a substitute for eye-tracking.

Eye tracking as measurement

For the vendor of eye tracking equipment, the issue is measurement. Because the vendor cannot necessarily predict to which uses their equipment will be put, the vendor must state the performance of the eyetracker in terms of accuracy: for example, error being less than plus or minus one degree of arc. (For comparison, the thumb held out at arm's length subtends about 2 degrees of arc.)
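
To see what an angular accuracy figure means on a display, the angle can be converted to a distance on the viewed surface with ordinary trigonometry. The sketch below assumes a flat surface squarely facing the viewer; the function names and the example viewing distance are illustrative choices, not anything a vendor specifies.

```python
import math


def subtended_size(distance: float, angle_deg: float) -> float:
    """Width on a flat, viewer-facing surface that subtends angle_deg at the eye."""
    return 2.0 * distance * math.tan(math.radians(angle_deg) / 2.0)


def error_offset(distance: float, error_deg: float) -> float:
    """Offset on the surface produced by an angular error of error_deg."""
    return distance * math.tan(math.radians(error_deg))
```

At an assumed viewing distance of 60 cm, an error of one degree corresponds to an offset of roughly 1 cm on the screen, which may or may not matter depending on how large and how closely spaced the displayed items are.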

However, more often than not the broader interface issue is less where precisely the eye is aimed at any moment than what items or areas are being looked at. Or, put another way, where is visual attention being paid? The answer to this question is often less a measurement as such than an interpretation.

Making that interpretation involves more than measurement of where the eye is looking--which measurement, according to circumstances, may be more or less precise. Factors affecting that precision include: possible "slippage" of the measuring apparatus, e.g., a head-mounted camera rig sliding a bit on the wearer's head; a tendency on the part of the observer--for whatever reason--to look a bit to the right or left of some item, or maybe a bit above it; or, perhaps, the apparatus having--again, for whatever reason--drifted a bit out of calibration.

Eye tracking as interpretation

Fortunately, measurement is only part of the story. Context plays a key part.

For instance, I notice that someone opposite me, while mostly looking me in the eye as we chat, seems also to be looking now and then to their lower right. I can "eye track" fairly well, in fact quite well, when their gaze is largely toward me; their off-to-the-side and downwards glances, though, permit me to do it less well.

However, I have a fair idea of what is there to be seen, based on the memory that during a hurried breakfast that morning some eggs benedict landed on my left cuff. The unattractive spot forms a kind of "attractive nuisance" for my conversational partner's eye. A reasonable assumption is that they are glancing, every now and then, where part of my breakfast fell.

Thus, the knowledge--more or less precise in the measurement sense--of where another's eye is aimed, coupled with knowledge of what is out there to be seen, can allow a serviceable interpretation of where another's visual attention is being paid.
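
One simple way to picture this coupling of imprecise measurement with knowledge of the scene is to assign a measured gaze point to the nearest thing known to be in view, provided it lies within some tolerance. This is a sketch under stated assumptions, not a description of any particular eyetracker: the function name, the item table, and the tolerance value are all hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def interpret_gaze(gaze_xy: Point,
                   items: Dict[str, Point],
                   tolerance: float) -> Optional[str]:
    """Return the name of the known item nearest the measured gaze point.

    items maps a name to the (x, y) centre of something known to be in view.
    Returns None when nothing lies within tolerance of the gaze point.
    """
    best_name, best_dist = None, tolerance
    for name, (x, y) in items.items():
        dist = math.hypot(gaze_xy[0] - x, gaze_xy[1] - y)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name
```

With items {"face": (0, 0), "cuff": (10, -8)} and a tolerance of 3 units, a measured point of (8.5, -6.5) is interpreted as a glance at the cuff even though the measurement does not land exactly on it.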

Reference: the key to dialog

The notion of "reference" is fundamental to dialog. In almost any human utterance, someone is trying to refer someone to something (Carter, 1978).

Rochester and Martin (1977) speak of the "art" of reference: the speaker's ability to guide listeners to select the intended referents. Other writers, e.g., Halliday and Hasan (1976), have called this referring function "phoricity," a phoric speech act being one that in effect instructs the listener to retrieve from elsewhere the information for interpreting the communication in question.

Something in the communication says "look somewhere else for the rest of the meaning." The listener (receiver of messages) "searches the situation" for the referent, the situation including:

In multimodal human-computer dialog, the "art of reference" consists of combining speech, gesture, and looking in the context of the shared situation. The human speaker (the sender of the message) is always referring someone to some thing or to some action.

The speaker exercises a "phoric" ability: the skill to refer the listener to the intended object or action. For instance, the speaker's looking or eye activity in the context of the graphics display can be such a phoric act. The meaning is incomplete until what the user is looking at is discovered and examined.

What the computer must gain are intelligent strategies to search for the referent, including knowing when that search is appropriate.

The where and when of dialog

Dialog is a complex set of communicative acts between two (or more) parties, about things in a specific locale, and occurring within a specific time frame.

Dialog occurs in a setting which both "occasions" and facilitates the actions of the participants. Specifically, the graphics/audio display "externalizes" the topic, i.e., puts the topic "out there," specifically:

The graphics/audio display may be:

The Role of the Graphic

The linkage between user actions in speech, gestures, and eyes, and the intelligence in the machine is precisely the shared graphic. The machine displays a certain scene, and the user witnesses this scene and acts in terms of its presence to form the linkage which makes human/machine dialog a practical possibility.

The presence of the mutually shared graphic opens the possibility for both differentiation and reference. Given a graphical arena spread out before the human user in the x, y plane (and possibly extending in z), the human as well as the machine are able to make distinctions on the presented surface: there is this item... then, there is this item... and another over here.

The items, thus spread out, can be referenced by both human and machine: by the human, as we have described, in terms of words and gesture ("That..."); and by the machine through highlighting or blinking the item as it speaks of it ("...that..." while blinking the item), or by having a graphical "persona" glance at the item while speaking of it in synthesized speech.

If 3-D audio accompanies the display, eye tracking can show - because we tend to look where we listen - which of two or more spatially distinct sound sources is being attended to (Reisberg, 1981). In turn, our speaking, looking, and pointing actions are provoked by what the computer is offering to us on its screen. We formulate and execute our actions in those modalities with respect to, and in the light of, what's on the graphics and audio display.

Specifically, a total speech act can be comprised of:

Eyetracking bears a special relationship to the graphics display. Despite the fact that the graphics display is by far the most powerful output channel of today's computers, the machine that cannot eyetrack - which is virtually all machines - has no knowledge of where and how on its display the human observer is looking. This lack of knowledge about how its human user is specifically responding to its displayed output is significant, as eye actions can signify interest, attention, and reference.

Eyes and the graphic

People tend to look at what attracts them, especially at what they find curious, novel, or unanticipated. Eye movements in looking over a scene tend to take on distinctive patterns depending upon interests and intentions.

In his classic eye tracking studies, Russian scientist Alfred Yarbus asked observers to examine a painting of a family scene. Before looking at the picture, each observer was asked one of a number of questions, such as: What are the ages of the people? What are the material circumstances of the family?

Observer looking patterns differed markedly depending upon the goals set by the questions. When asked about ages, observers looked mostly at the people's faces, wherein the best age cues would appear. If asked about the wealth and social position of the people in the scene, fewer fixations were made on the faces and many more on the people's clothing and on the objects in the room.

It is possible to be paying visual attention to something yet not be looking directly at it. Conversely, it is possible to be looking at something and not attending to it, as when "staring off into space" in introspection or day-dreaming. By and large, though, when we are in fact paying visual attention to something in the surround, the eyes' point-of-regard is a robust index of the distribution of that attention.

Observing the eyes opens a new channel into where another's attention is directed. The effect can be compared to what children gain when they discover that where a parent is looking is useful to them in comprehending what is transpiring between them and their parent, and, in turn, the world about them.

"What has been mastered is a procedure for homing in on the attitudinal locus of another: Learning where to look in order to be tuned to another's attention...".

This is how Harvard psychologist Jerome Bruner (Bruner, 1974/75) describes the significance of the child's discovery of the meaningfulness of the point-of-regard of the mother when she looks at the cat and utters "kitty," looks at the door and utters "go out," and so forth.

A friend utters "What's that?" while looking at something. You follow their line of gaze and learn which thing they mean. The word token "that" gains its meaning (referent) from the context of the situation. To the linguist, words like "this," "that," and "there" are deictic words. "Deixis" comes from a Greek word that means pointing or indicating. Words like "chair" and "table" are non-deictic in that their commonly understood referent is part of their meaning. In contrast, words like "this" and "that" have no fixed referent, but gain meaning from their particular use by the speaker (Miller, 1981).

There is, of course, ambiguity inherent in the words "this" and "that": the potential for the intended referent not to be clear. For instance, I point to a book lying on a table and say, "That's great." To what might "that" refer?

I could mean I like the book as object; the story inside; the "series" to which the book belongs; the fact that the author just won the Pulitzer Prize; the expensive binding; the picture on the cover; the "look" of the book as it blends with the background of the coffee table.

I could mean any one of these. But, while all of them are possible, not all of them may be plausible. The immediate conversational context can help. If we have just been talking about stories, then it's reasonable that I mean the book's story. If we have been talking about writers, then I probably mean the author winning the prize. In the case of no plausible prior setting of context, then maybe "What do you mean?" is in order--asking for clarification.
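
The narrowing from possible to plausible readings can be sketched as a filter over candidate referents, keyed by what has recently been talked about. This is a deliberately simple illustration: the function name and the idea of tagging each reading with a single topic are assumptions made for the example, not a claim about how a real dialog system would represent context.

```python
from typing import Dict, Optional, Set


def resolve_referent(candidates: Dict[str, str],
                     recent_topics: Set[str]) -> Optional[str]:
    """Pick the one reading made plausible by the recent conversation.

    candidates maps a reading (e.g. "the book's story") to the topic that
    would make it plausible. Returns that reading if exactly one fits, or
    None, meaning a clarifying "What do you mean?" is in order.
    """
    plausible = [reading for reading, topic in candidates.items()
                 if topic in recent_topics]
    return plausible[0] if len(plausible) == 1 else None
```

Returning None both when no reading fits and when several do reflects the point made above: when context does not settle the matter, the sensible move is to ask.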

The graphic as "common ground"

That the user looks while speaking, and the system can follow that looking, has immediate relevance for speech understanding. Even when we share the language, not all of speech is, as such, spoken. A speaker says "What's that?" while fixating some item. To follow their line-of-regard is part and parcel of speech understanding in the larger sense of comprehending natural dialogue.

Carter (1978) speaks of the importance of the shared visual scene that serves as the common ground between conversants:

The percentage of human communication that involves pointing out or drawing attention to an immediately perceptible object is undoubtedly extremely high. There is in fact an informal sense in which virtually every communication of any sort involves or implies the expression of reference to something in order to convey one's ideas, feeling or whatever is communicated about, and very often this indicated "something" is a concrete entity in the immediate environment of both sender and receiver . . . (Carter, 1978, p.309)

Not only can shared graphical space be the occasion for user speech acts, but it can itself become ingredient to those speech acts. Shields (1978) speaks of the elements of language which connect the utterances of speakers with a shared field of reference. Three of these are noted, with parenthetical comments:

Specifically, a total speech act can be comprised of: a) what the user says concerning something on the graphic display; b) the thing on the graphic; and c) gestural actions or the line-of-gaze of the user relative to the graphic at the time of utterance.
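
The three components (a), (b), and (c) can be pictured as a single record that a system fills in by lining up spoken words with eye fixations in time. The sketch below is one hypothetical way to do that for gaze only, leaving gesture aside; the class names, field names, and the assumption that word onset times and named fixations are already available are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DEICTIC = {"this", "that", "there"}


@dataclass
class Fixation:
    item: str      # name of the displayed item being fixated
    start_ms: int
    end_ms: int


@dataclass
class SpeechAct:
    utterance: str                 # (a) what the user says
    item: Optional[str]            # (b) the thing on the graphic
    fixation: Optional[Fixation]   # (c) line of gaze at the time of utterance


def build_speech_act(utterance: str,
                     word_times_ms: Sequence[int],
                     fixations: Sequence[Fixation]) -> SpeechAct:
    """Bind a deictic word to the item fixated at the moment it was spoken."""
    words = utterance.lower().replace("?", "").replace(".", "").split()
    for word, t in zip(words, word_times_ms):
        if word in DEICTIC:
            for fix in fixations:
                if fix.start_ms <= t <= fix.end_ms:
                    return SpeechAct(utterance, fix.item, fix)
    # No deictic word, or no fixation coinciding with one: leave unresolved.
    return SpeechAct(utterance, None, None)
```

Given fixations on "map" from 0 to 250 ms and on "ship" from 260 to 600 ms, the utterance "What's that?" with "that" spoken at 300 ms is bound to "ship".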

The primacy of the visual sharing

The presence of a visual scene shared by discussants--a computer and a person as well--goes a long way to enable discourse.

Some years back, Britain's Sir Kenneth Clark produced the TV series "Civilisation," a program seen in the USA on public television (PBS) stations. In an interview about the program, he was asked why the series did not contain any material on, or mention of, developments in the areas of law and philosophy, surely important aspects of civilized life. He replied that he had considered such topics, but ended up omitting them as they did not lend themselves to visual presentation.

Fiction writers have appreciated the amount of words it takes to "set the scene," given their intuitions concerning would-be readers' familiarity with story settings. Crime novelist Gregory MacDonald, author of the "Fletch" series, has noted that he could devote most of his pages simply to dialogue in the confidence that his readers were familiar with the things and places he was talking about.

In contrast, because relatively few people in his time travelled beyond the confines of their home village or town, a writer like Charles Dickens felt constrained to include considerable description of, say, London and its sights and sounds. MacDonald, however, would assume that most of his readers had seen London, if not in person then in the movies, on TV, or in picture books, and accordingly felt he need not describe it.

Externalizing the Referent

The graphic both enables and constrains. To provide a common, shared basis for dialog--to enable dialog--the system proffers or puts "out there" on its graphics screen what it is currently willing and able to talk about. At the same time, the graphic constrains. In so publicizing its willingness and ability to talk about X--the topic on display--the system implicitly signals it is not ready and able - at the moment at any rate - to dialog about topic Y or Z. In so doing, the machine lets the user know something of the bounds of its potential to comprehend what the user may try to express.

This externalization of what may be talked about can dramatically aid speech recognition and understanding by having things "out there" in view of both user and machine. At the user/computer interface, the computer's graphics display is the analog of the immediate surrounding that two or more people share in ordinary conversation.

What makes speaking abetted by pointing and/or looking so effective in face-to-face conversation - even when we don't know the language very well, as in Professor Negroponte's anecdote about diners in a foreign land - is shared context. We talk about, look at, point to, objects and items about us. For the user at the interface, the graphical image is an occasion for looking, speaking, and pointing. For the machine, the graphical image constitutes the means by which it interprets the user's glances, gestures, and remarks - all made in the light of that graphic.


...to be continued...



References

Antes, James R. The time course of picture viewing. Journal of Experimental Psychology, 1974, 103(1), 62-70.

Bruner, Jerome S. From communication to language - a psychological perspective. Cognition, 1974/1975, 3(3), 255-287.

Carter, A. Sensori-motor vocalizations to words: A case study of the evolution of attention-directing communication in the second year. In Lock, A. (ed.), Action, gesture and the symbol: the emergence of language. New York: Academic Press, 1978, p. 309.

Cooper, Roger M. The control of eye fixations by the meaning of spoken language. Cognitive Psychology, 1974, 6, 84-107.

Cumming, G. D. Eyemovements and visual perceptions. In E. C. Carterette and M. P. Friedman Eds., Handbook of perception: Vol. IX, Perceptual processing. New York: Academic Press, 1978, pp. 223-4.

Hess, E. H. Attitude and pupil size. Scientific American, 1965, Vol. 212, 46-54.

Hess, E. H. and J. M. Polt. Pupil size as related to interest value of visual stimuli. Science, 1964, August 5, Vol. 132, No. 3423, 349-350.

Halliday, M. A. K. and Hasan, R. Cohesion in English. London, England: Longmans, 1976.

Loftus, Geoffrey R. and Norman H. Mackworth. Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology, 1978, 4(4), 565-572.

Mackworth, Norman H. and Anthony J. Morandi. The gaze selects informative details within pictures. Perception and Psychophysics, 1967, 2(11), 547-552.

Miller, George A. Language and speech. San Francisco: W. H. Freeman and Company, 1981.

Reisberg, Daniel, Roslyn Schreiver, and Linda Potenken. Eye position and the control of auditory perception. Journal of Experimental Psychology: Human Perception and Performance, 1981, 7(2), 318-323.

Robinson, Gordon H. Dynamics of the eye and head during movement between displays: a qualitative and quantitative guide for designers. Human Factors, 1979, 21(3), 343-352.

Rochester, S. and J. R. Martin. The art of referring: the speaker's use of noun phrases to instruct the listener. In: R. O. Freedle (ed.), Discourse production and comprehension. Norwood, New Jersey: Ablex, 1977.

Shields, M. M. The child as psychologist: construing the social world. In Lock, A. (ed.), Action, gesture and the symbol: the emergence of language. New York: Academic Press, 1978, p. 544.

Spoehr, Kathryn T. and Stephen W. Lehmkuhle. Visual information processing. San Francisco: W. H. Freeman and Company, 1982, pp. 161+.

Yarbus, Alfred L. Eye movements and vision. Translated by B. Haigh. New York: Plenum Press, 1967.