3 Key Issues in Multimodes
The Advanced Human Interface Group (AHIG), which I formerly directed, developed a distinctive approach to handling multiple user inputs in speech, gesture, and gaze. What follows is an overview of 3 key issues we identified, and how we attempted to deal with them.
Our system approach to resolving concurrent inputs from speech, two-handed gesture, and gaze consisted of four levels:
• device-dependent input
• the “body model”
• analysis/segmentation layer
• the interpreter
We wanted whatever scheme we developed to be device-independent as concerns the initial capture of speech, gesture, and eye gaze, so that our system prototype would remain of value and use over time and be transferable across particular applications and sites. Thus, as new technologies developed, the system would not become obsolete; any new device for speech, hand, or eye input would need only to be integrated at the level of the device-dependent input layer, leaving the balance of the system unaffected.
This level, the body model, was an articulated representation of the body of the user, and of its actions and outputs in hand and eye. (Speech was a special case in that the string of recognized words went directly to the top-level interpreter.) The model attempted to be comprehensive: it tried to include all of the articulated parts of the body that would be of potential interest to the interpretation of gesture and looking.
Suppose the particular device we currently use to measure hand motion does not, for example, measure the angle or “spread” between the fingers. The body model should nonetheless have parameters reflecting such information about finger posture even though the currently used device does not supply it; the values for those parameters are simply null, reflecting that they are not available. If and when a new device is used that does in fact supply such values, the body model, and hence every system level above it, would be ready to receive them. Meanwhile, system levels above the body model would know that certain values are missing because they were not measured (or derived), so that processing at those levels would not fail for lack of that specific input.
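The idea of null-valued parameters can be sketched in a few lines of Python. This is only an illustration, not the AHIG system's actual code: the class `HandModel`, its three parameters, and the helper names are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HandModel:
    """One hand in a hypothetical body model; None means 'not measured'."""
    palm_pitch_deg: Optional[float] = None
    index_flex_deg: Optional[float] = None
    finger_spread_deg: Optional[float] = None  # assumed absent from the current glove


def update_from_device(hand: HandModel, reading: dict) -> HandModel:
    """Copy whatever the device supplies; every other parameter stays None."""
    known = {f.name for f in fields(hand)}
    for name, value in reading.items():
        if name in known:
            setattr(hand, name, value)
    return hand


def fingers_joined(hand: HandModel, max_spread_deg: float = 5.0) -> Optional[bool]:
    """Higher layers receive None ('unknown') instead of failing on missing data."""
    if hand.finger_spread_deg is None:
        return None
    return hand.finger_spread_deg <= max_spread_deg


hand = update_from_device(HandModel(), {"palm_pitch_deg": 80.0, "index_flex_deg": 10.0})
print(fingers_joined(hand))  # None: the device reported no finger spread
```

A later device that does report `finger_spread_deg` would need no change above the device layer; `fingers_joined` would simply begin returning True or False.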
At this level of the system, the elements of the body model were “parsed” into yet higher-order elements. For example, in the domain of gesture, a temporal sequence of spatial positions of the body parts representing the right arm and hand was aggregated and condensed into a particular movement. That movement might be one which started from the hand at rest at the user’s lower right, crossed the body with palm upwards and fingers extended and joined, and ended with the arm extended in space in front of the user’s left shoulder, palm still upwards, fingers still extended and joined.
This kind of segmentation developed the body model data into a gestural representation which was as yet uninterpreted. In this example, the movement on the part of the user could be the expression of various intentions, all quite different: that an area of the display is to be erased; that a certain object is to be moved on the display from the lower right to the upper left; that the edge of some object is to be tipped at a certain angle; that a line or border is to be drawn across the screen; that items below a demarcation are all to be colored green; or perhaps the user is simply stretching their arm out of fatigue. Thus, though the analysis/segmentation process took information from the body model and recast it as an integrated rest-movement-rest sequence, the movement remained as yet uninterpreted; interpreting the movement and assigning it meaning or semantic content occurred at the next system level, that of the interpreter (see next section).
Our implementation of the gestural segmentation enabled us to successfully capture and characterize (but not, at its own level, interpret) a wide variety of one- and two-handed gestures.[1] From there, we planned a more sophisticated gesture segmentation module, which would support more fluid patterns of user movement than our gestural segmenter was then prepared to handle, as well as integrate them into more extended patterns in time.
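A rest-movement-rest segmentation of this kind might be sketched as follows. The speed-threshold rule, the threshold value, and the sample format are assumptions made for illustration, not a description of our segmenter; timestamps are assumed to be strictly increasing.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float, float, float]  # (time_s, x, y, z) of the hand


def segment_movements(samples: List[Sample], rest_speed: float = 0.05) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of each movement bounded by rest."""
    segments = []
    start = None
    for i in range(1, len(samples)):
        t0, *p0 = samples[i - 1]
        t1, *p1 = samples[i]
        speed = math.dist(p0, p1) / (t1 - t0)
        if speed > rest_speed and start is None:
            start = i - 1                      # the hand leaves rest
        elif speed <= rest_speed and start is not None:
            segments.append((start, i - 1))    # the hand has come to rest again
            start = None
    return segments  # a movement still in progress at the end is not reported


track = [(0.0, 0, 0, 0), (0.1, 0, 0, 0), (0.2, 0, 0, 0),
         (0.3, 0.1, 0, 0), (0.4, 0.2, 0, 0),
         (0.5, 0.2, 0, 0), (0.6, 0.2, 0, 0)]
print(segment_movements(track))
```

The output names only where a movement began and ended; what the movement means is deliberately left to a higher layer.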
In the case of eyetracking input, the stream of x,y point-of-regard coordinates produced by the eyetracker was condensed and summarized as a succession of fixation points. This represented considerable data reduction, as the tracker logic produced such coordinates at the rate of sixty coordinate pairs per second, while the maximum rate at which the average human observer can shift their point-of-regard is on the order of five times per second.
With the eye modality, the body model system level could well be bypassed, the raw data from the tracking device level going directly into the segmentation level; we were uncertain on this point. We were less interested, at least early on, in the position of the eyes as such than in where on the display the user was looking. Thus, processing of eye data could well go directly to the analysis/segmentation stage, and not feed into a body model of eye position within the head. We noted that the eyes can be used in a purely expressive manner, such as when the eyes are rolled upwards in the head, not in the act of looking at anything, but as a gesture of dismay;[2] to register that kind of eye usage, the use of the body model to reflect position of the eye within the head may be required.
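One common way to perform this kind of reduction is a dispersion-threshold rule: consecutive samples that stay within a small spatial spread are merged into one fixation. The sketch below uses that rule as a stand-in; it is not necessarily the method our tracker logic used, and the thresholds and names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fixations(points: List[Point], rate_hz: float = 60.0,
              max_dispersion: float = 30.0, min_duration_s: float = 0.1):
    """Condense raw gaze samples into (centroid_x, centroid_y, duration_s) tuples."""
    result = []
    window: List[Point] = []

    def flush():
        duration = len(window) / rate_hz
        if window and duration >= min_duration_s:
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            result.append((cx, cy, duration))

    for p in points:
        candidate = window + [p]
        xs = [q[0] for q in candidate]
        ys = [q[1] for q in candidate]
        dispersion = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if dispersion <= max_dispersion:
            window = candidate       # still the same fixation
        else:
            flush()                  # close the old fixation, start a new one
            window = [p]
    flush()
    return result


raw = [(100.0, 100.0)] * 12 + [(400.0, 300.0)] * 18   # half a second of samples
print(len(raw), "samples ->", len(fixations(raw)), "fixations")
```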
As noted in the previous section, in the case of our particular speech recognizer (we had employed the BBN Hark system), the string of recognized words went directly from the device to the interpreter layer of the system. The software logic of the speech recognition system incorporated a grammar model, constituting, in effect, a “segmentation” layer. It was not clear what a “body model” might be with respect to speech output from user to machine, except possibly for automatic lip-reading as an adjunct to speech recognition, or perhaps visual monitoring of mouth expressions (smiles; grimaces). Our system did not attempt to deal with either lip-reading or mouth expressions.
In the interpretive layer of our system model, speech was primary in that no system response was executed unless there were some spoken input. That is, neither gesture nor gaze input was sufficient by itself to cause a system response. Without accompanying speech, input from the gesture-sensing gloves and from the eyetracker continued to be processed, but was simply discarded and not acted upon.
When speech input did occur, the interpreter analyzed the utterance for its semantic content, specifically, what the user was commanding or requesting. The speech input could be of itself semantically complete. For instance, the user could say “Delete the blue square,” where there was only one blue square on display. However, the input from speech could be ambiguous, as when there were two or more blue squares on the display. The interpreter then had to examine the input in gesture and/or looking to see if there were any evidence to indicate which blue square was meant.
In gesture, the hand did not need to take on the posture of a “pointing” gesture as such, where one or other of the user’s hands was raised, the index finger extended, the remaining fingers and the thumb curled. The evidence gesturally could be any motion or posture of the hand that could conceivably distinguish in the context of the situation which square was meant. Again, suppose that there were in fact two blue squares, one placed above the other. From context, there is one “bit” of ambiguity: is it the top square that is meant, or the bottom one? Any gesture by either or both hands, however slight, that might be distinguished as denoting “up” or “down” would suffice; so would an up or down shift of the head, or of the eyes. Or, there might be an up or down shift of the hands with a concurrent tip of the head or eyes in the same direction, each gesture fragment reinforcing the clue that the other furnishes.
Thus, our approach was not one of “template matching,” but one of looking for any kind of “evidence” in posture or movement of either or both hands, the head, or the eyes that would support the selection of one particular square out of the two on display as being the intended square. This approach to parsing and interpreting user input in the several modes can be viewed as a kind of "top-down, bottom-up" approach. Information about user gesture and looking behavior filtered up from the system's lower layers (the device level; the body model; the analysis/segmentation layer) and, while progressively refined and reduced, was not interpreted there. Simultaneously, the topmost interpreter layer, when it needed additional information to resolve ambiguity arising from the speech input, formed hypotheses about the kind of supporting evidence it needed (e.g., Any “up” indications from hand, head, or eye? Any “down” indications from hand, head, or eye?) and proceeded to consult the lower layers to seek such evidence.
With respect to such actions as pointing or looking, which imply some kind of accuracy with respect to items or areas on display, our overall approach was one of interpretation, not mere measurement. A vendor of eyetracking equipment, for instance, has to provide in its promotional literature some estimate of the eyetracker’s resolution. This could take the form of such statements as "less than one degree of error" or "accurate to within 2 degrees." The vendor cannot know beforehand to what use a purchaser might put the equipment; the purchaser has to consider what needs to be accomplished, and somehow decide whether the equipment will serve the purpose. In our situation, however, the issue was always "what item is being looked at?" rather than "where is the observer looking?" That question was always posed in the context of the "granularity" of the items on display. Because looking behavior is inherently volatile at best, and an observer is not always precisely sure of where their eye (as measured by external apparatus) is aimed, the question was resolved on a relative basis: which item, of those on display, was the observer most likely looking at when they said such-and-such?
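The two-blue-squares case can be sketched as a small evidence count. Everything here is illustrative: the function name, the cue format, and the simple one-vote-per-cue rule are assumptions, not our interpreter's actual logic.

```python
from typing import Dict, List, Optional


def resolve_referent(speech: Optional[str],
                     candidates: Dict[str, str],
                     cues: List[Dict[str, str]]) -> Optional[str]:
    """candidates maps a direction label ('up'/'down') to an object id.

    cues are uninterpreted fragments from the lower layers, for example
    {'source': 'hand', 'direction': 'up'}.
    """
    if not speech:
        return None  # speech is primary: gesture and gaze alone trigger nothing
    if len(candidates) == 1:
        return next(iter(candidates.values()))  # the utterance was unambiguous
    votes = {label: 0 for label in candidates}
    for cue in cues:
        if cue.get("direction") in votes:
            votes[cue["direction"]] += 1        # any source counts as evidence
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == 0 or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None  # no evidence, or a tie: the ambiguity remains
    return candidates[ranked[0][0]]


squares = {"up": "blue_square_top", "down": "blue_square_bottom"}
cues = [{"source": "hand", "direction": "up"}, {"source": "eye", "direction": "up"}]
print(resolve_referent("Delete the blue square", squares, cues))
```

Note that the hand cue and the eye cue reinforce one another, and that no particular "pointing" posture is required of either.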
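Resolving gaze "on a relative basis" can be sketched by weighting each displayed item by its distance from the measured point-of-regard and normalizing. The Gaussian weighting and the `sigma` value below are assumptions chosen for illustration; the source describes only the relative principle.

```python
import math
from typing import Dict, Tuple


def likely_target(gaze: Tuple[float, float],
                  items: Dict[str, Tuple[float, float]],
                  sigma: float = 50.0) -> Tuple[str, float]:
    """Return (item_id, relative likelihood) for the item most probably looked at."""
    weights = {
        name: math.exp(-math.dist(gaze, center) ** 2 / (2 * sigma ** 2))
        for name, center in items.items()
    }
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, (weights[best] / total if total > 0 else 0.0)


display = {"red_circle": (100.0, 100.0), "blue_square": (300.0, 100.0)}
print(likely_target((140.0, 110.0), display))
```

With widely spaced items even a noisy gaze sample picks out one item decisively; with closely packed items the likelihoods approach one another, which is the point at which other evidence is needed.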
Our rule that no action occur except following upon some kind of speech input need not always hold. We weighed the possibility of having gestures initiate actions, but under specific circumstances. Suppose the user said: “Move that over like this (the hand or hands indicating a leftwards motion).” A few seconds later, without further utterance, the user waves their hand leftward intending to indicate that the same item should be moved leftwards a bit more. It would be reasonable to have the system move the item over a bit more in response, especially if the user is also looking at the just moved item.
Possibly, the mere act of looking at any item and “waving” it left or right, or up or down, ought to be sufficient to initiate motion. Or, the better rule might be that the system ought to respond to such “silent” hand/eye indications if and only if there was at least one preceding act in the current interactive session wherein the user indicated in speech, with accompanying gesture, that some item’s position was to change this way or that. To insist on some such precedent is to make sure that this user, in this current situation, is of a “mind set” to use such types of indications, so expressed. Having had that bit of evidence that the user is disposed to express themselves in such fashion, the system might then venture to allow such abbreviated expression of user intentions, and to act upon them.
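The precedent rule could be sketched as a small piece of session state. The class and method names are hypothetical, and the sketch takes the stricter reading in which the user must also be looking at the item just moved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    precedent_seen: bool = False           # a spoken move command came with a gesture
    last_moved_item: Optional[str] = None

    def spoken_move(self, item: str, with_gesture: bool) -> str:
        self.last_moved_item = item
        if with_gesture:
            self.precedent_seen = True
        return f"move {item}"

    def silent_wave(self, direction: str, gazed_item: Optional[str]) -> Optional[str]:
        """Act on a wordless wave only after a precedent, and only on the gazed-at item just moved."""
        if not self.precedent_seen:
            return None
        if gazed_item is None or gazed_item != self.last_moved_item:
            return None
        return f"nudge {gazed_item} {direction}"


s = Session()
print(s.silent_wave("left", "chair"))      # ignored: no precedent yet
s.spoken_move("chair", with_gesture=True)
print(s.silent_wave("left", "chair"))      # now acted upon
```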
Our taxonomy of gestural types was one adopted from Rime-Schiaratura:[3]
• Speech-marking (“beats”): gestures which accompany speech, and which somehow “mark” speech by 1) stressing some elements for clarity or emphasis; 2) introducing some new element into the talk; 3) “chunking” the sentence according to the underlying reasoning. Examples: the speaking style of U.S. Presidents John Kennedy and, more recently, Bill Clinton, wherein the making of logical points is accented by stabbing the air with the forefinger.
• Ideographic: hand or finger movements sketching in space the logical track followed by the speaker’s thinking. Example: saying “On the one hand (the left hand raised toward the speaker’s left) we have...But, on the other hand we have (the right hand raised toward the speaker’s right)...”.
• Deictic*: pointing gestures. Example: saying “That’s him…(pointing at some person).”
• Symbolic: gestural representation without any shape relationship to the visual or logical object expressed. Example: waving the hand by way of greeting or bidding farewell.
• Iconic*: hand movements that parallel speech and present some figural representation of the object evoked simultaneously. Example: saying “Empty that cup (the hand describes a grasping shape, thumb and fingers a bit apart, mimicking the shape of a cup, the hand then tipped as if pouring something from the cup-shaped hand).”
• Pantomimic*: hand movements that enact some action or function, as when the described actions are imitated by the speaker’s hands. Example: saying “He grasped the box,” while shaping with the hands the imaginary box.
Those items of the taxonomy marked with an asterisk (*) are viewed as those types of co-verbal gesture that would be involved when the speaker is talking about concrete objects and situations, with reference toward them. Such gestures as “beats” and ideographic gestures may be of importance to the speaker in helping them to organize their thoughts or the pace of their presentation, or as motor accompaniments to thinking. However, they do not seem to bear any information about the content of what the speaker is saying (though they are perhaps revelatory of the speaker’s state of mind, or level of arousal or nervousness). Our research focus centered upon the deictic, iconic, and pantomimic types of gesture, namely those gesture types which contribute referential or semantic content to the speaker’s utterance.
After a period of research in which we had been concentrating upon integrating gesture and speech, in particular the coupling of "iconic" gesture with speech, we had planned to re-introduce gaze into the multi-modal picture. We had envisaged a number of distinctive roles for the eyes, with concurrent gesture and speech.
The eyes may play a strong deictic role; however, the use of the eyes in pointing should not be confused with pointing by hand. Because of their structure and appearance, the eyes of another, unless closed, are always to be observed as pointing in some direction or other. However, the subjective sense of the observer is, for the most part, that they are simply "looking about." That is, we normally pay no more attention to the position of our eyes than we do to the exact placement of our feet when we walk down a hallway; while our intention to go down some particular hallway may be under conscious control, the placement of each step ordinarily is not.[4] There may be occasions when we deliberately "point" with the eyes, as when we wish to indicate some direction or other while our hands and arms are loaded up with things we are carrying; or, perhaps we are gossiping about someone and wish discreetly to indicate "...him...over there...", not wishing to point with our hand. Otherwise, the use of the eyes intentionally to point is probably relatively rare; the fact that another person may notice our eyes aimed in some or other direction is a by-product of our looking behavior, which is guided more by our own seeking of information in the surroundings than by any intention to "point."
A major role of the eyes is that of addressing the person or persons with whom we are interacting or speaking. We tend to look at the person to whom we are speaking, and often in the general direction of what we are talking about. For example, we may say to someone "What a beautiful view from here," glancing now at the shared view, and back to the person. In this way, in the context of talking with someone about some thing, we "address" both the person and the topic of our utterances by means of the eyes.
"Address" by eye can also include things and actions. Consider the command: "Move that (looking at some item) ...over there (hand point where...)." Which item is to be moved is "picked up" by eye, while, in turn, the manual gesture indicates its destination. This "picking up" by eye and subsequent "placement" by means of the hand is perhaps the most natural and rhythmic use of eyes and hands in commands to move items about; it most closely resembles the usage of eye and hand we ordinarily employ when talking with people, as, for example, when directing someone to move furniture about a room.
Information given via hand and eye may at times seem to conflict. For example, the user might say: "Move that (looking at some item) ...over here (pointing to some spot, but looking at some spot in a distinctly separate part of the display)." Where to place the item? In the place where the user is pointing? In the spot where the user is looking?
As a general rule, pointing by hand is a more effortful and deliberate act than looking; on that basis, a reasonable conclusion is to place the item where the user is pointing. Where the placement of the item is critical and the circumstances of the entire operation are of such a nature as to not tolerate errors, then the system might well insist that the user be both looking and pointing at the same spot at the same time. (While looking is inherently more volatile than manual pointing, the system might insist that the user at least be looking at the same general area as where they are pointing, and not at some spot at some far remove.)
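This rule of thumb is simple enough to state as a short function. The name `placement`, the `critical` flag, and the agreement radius are hypothetical; the sketch only restates the preference for the hand and the stricter requirement for error-intolerant operations.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def placement(hand: Point, gaze: Point, critical: bool = False,
              agree_radius: float = 150.0) -> Optional[Point]:
    """Prefer the hand; for critical operations require gaze in the same general area."""
    if critical and math.dist(hand, gaze) > agree_radius:
        return None  # decline to act; the system would ask the user to confirm
    return hand


print(placement((100.0, 100.0), (600.0, 400.0)))                 # hand wins
print(placement((100.0, 100.0), (600.0, 400.0), critical=True))  # no placement
```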
When we speak to someone about something, we can observe whether or not they are looking back at us, and at the thing (assuming it is in our mutual presence) we are talking about. We can readily see from the patterns of the other's looking whether or not we have their attention, and whether or not they are paying attention to what we are talking about. While it is possible for someone to be paying attention to something and yet not looking at it, or, conversely to be looking at something yet not be paying attention to it, people for the most part tend to look toward where their attention is directed.[5]
The fact that people tend to look at where their interest lies can let us use their looking patterns to infer what those interests are. This use of eyetracking at the interface has already been illustrated for visual material.[6] People also tend to look toward the apparent source of sounds in which they are interested.[7] These phenomena of human behavior can enable us to infer and respond to patterns of user interest as disclosed both through immediate glances and through aggregate patterns of looking over time.
For example, in the midst of a situation where the user is monitoring a display, and interacting with that display, the system may observe them glancing periodically toward the upper left portion of the display, which, if a map display, might reasonably be inferred to reflect some kind of expectation or anticipation of some event (the arrival of an airplane flight? the arrival of a message, where messages of a particular sort are always posted in certain spots on the display?). The person may never have said explicitly to the system that they have a special interest in such an anticipated event; yet the system, on the basis of the frequency and duration of repeated glances toward that area, might reasonably infer such interest, and set up a special warning when in fact the sought-for item or material is about to arrive (but is yet off-screen).
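Inferring interest from the frequency and duration of glances might be sketched as a simple tally per display region. The region names, thresholds, and function name are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def regions_of_interest(glances: Iterable[Tuple[str, float]],
                        min_count: int = 3,
                        min_total_dwell_s: float = 1.0) -> List[str]:
    """glances: (region_name, dwell_seconds) pairs taken from the fixation stream."""
    count: Dict[str, int] = defaultdict(int)
    dwell: Dict[str, float] = defaultdict(float)
    for region, seconds in glances:
        count[region] += 1
        dwell[region] += seconds
    # A region qualifies only if it is revisited often AND looked at long enough.
    return sorted(r for r in count
                  if count[r] >= min_count and dwell[r] >= min_total_dwell_s)


observed = [("upper_left", 0.4), ("center", 2.0),
            ("upper_left", 0.5), ("upper_left", 0.3)]
print(regions_of_interest(observed))
```

A fuller version would presumably discount the region where the user is currently working and age out old glances; both are left out here.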
In the case where we move some item, "Move that (looking at item)...over here (pointing at spot)," we might be inclined to add a little wave of the hand, say leftward or to the right, to effect a more exact placement of the item. Or, we might say "More...more...stop..." again, with a wave of the hand to one side or other. In this second, or "follow-up" command, there is an implicit "it" being referenced, namely, the item just moved.
The system, instead of each time insisting upon a "complete" or fully stated command, such as "Move that (looking and/or pointing at some item)...there (pointing at some location)," might accommodate a more abbreviated form of the command: "Over (waving the hand in some direction)...over...stop"; or simply waving the hand...then giving a "halting" (hand upraised) gesture. This would work provided the system can know reasonably reliably which item is being referenced; that could be given by eye.
The eyes, through making and breaking eye contact with others, tend to orchestrate social interaction. The pattern of eye interaction amongst a pair or more of people engaged in dialogue helps to indicate whose turn it is in the on-going conversation, who "has the floor." Gaze ought to play a similar role in the case of the user interacting with one or more on-screen computer "agents." The point-of-regard of the human user is monitored via eyetracking. The apparent point-of-regard of the on-screen agents is rendered graphically: each agent has a pair of graphical "eyes" which register where they are looking, namely where their "attention" is directed, and whom they are "addressing" when they speak. This mutuality of gaze creates a situation of "full-duplex" eyes, where mutual gaze between the human and the computer-generated participants in a dialogue at the interface can both aid human and machine understanding of the dialogue situation and help orchestrate the conduct of that dialogue.
In our interface work, we discerned two distinct "arenas" of reference for the user and for the system. One arena was that of the immediate space about the body of the user in which the user makes their one- or two-handed gestures. In that space, the user may stipulate a particular kind of space to be created, as for example, when the user says "Make a room..." while extending both hands in front to indicate the relative scale of the room. The computer displays on its screen its ideas of what the user means; specifically, it shows on the screen the layout of a room. The user may next say "Place a table here...," while indicating with one or the other hand where, with respect to the spatial schema of the room they just created in the space before them, the table is to go. That is, the user places the table with reference to the space they just a moment before created with their hands. However, they might have pointed to the pictorial space displayed on the screen, which space is, as noted, the machine's idea of what the user is talking about, and said "Place a table there...".
Now, the use of the word "...there..." in lieu of "...here...", as well as the posture of the hand (a "pointing" posture, with the index finger extended, perhaps), ought to suggest to the computer system that it should use the pointing trajectory of the hand with respect to the image on the screen to compute the position on the floor of the room where the table is to be placed, rather than the position of the hand with respect to the spatial schema of the floor indicated in mid-air just before the body of the user. In addition, however, where the user is looking at the time they give the command to place the table (whether into graphical space on the screen or at the physical space just before their body) can furnish an additional clue as to which of these two spaces the user is using to position the table. We noted generally that information as to whether or not the user is looking at their own hands as they speak and describe items and actions can be a powerful aid in interpreting their speech, and in turn in interpreting user intent.[8]
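The choice between the two arenas can be sketched as a vote among the three clues just described: the deictic word, the hand posture, and the gaze target. Treating the clues as equal votes with a simple majority is an assumption made for illustration; the labels are hypothetical.

```python
def choose_arena(deictic_word: str, hand_posture: str, gaze_target: str) -> str:
    """Return 'screen' or 'body_space' from three weak clues, by majority."""
    screen_votes = 0
    screen_votes += deictic_word.lower() == "there"   # "there" suggests the screen
    screen_votes += hand_posture == "pointing"        # index finger extended
    screen_votes += gaze_target == "screen"           # looking at the display
    return "screen" if screen_votes >= 2 else "body_space"


print(choose_arena("there", "pointing", "hands"))
print(choose_arena("here", "open", "hands"))
```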
The essence of “circumstantial indexing” is the reference to, or retrieval of, some part of the dialogue or its contents by way of some occurrence or by some state of the data, past or anticipated. An example from real life might be: “Get me that green folder I gave you late yesterday by the water cooler.” Or, “Let me know when those packages are picked up.”
In order for this kind of reference or retrieval to happen, the system needs to make some kind of record of events occurring in the user/machine dialogue. Some kind of schema for recording the history of the user/machine dialogue is needed. Events in this dialogue include:
• What happened in the world of objects the user and the computer are dealing with
• What happened in the human/computer dialogue
Circumstances of interest to be recorded include:
• actions of the user: e.g., “When I created the red table...”
• actions of the machine: e.g., “When you (the machine/system) deleted the red dots...”
• state of objects: e.g., “When the blue square was below the red circle...”
• action of objects: e.g., “When the planes were entering from the northeast...”
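A record of such circumstances might be sketched as a timestamped event log that can be queried by kind and attribute. The schema below (the `Kind` enumeration, the free-form description fields, and the `when` query) is entirely hypothetical and is offered only to make the four categories concrete.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Kind(Enum):
    USER_ACTION = "user_action"        # "When I created the red table..."
    MACHINE_ACTION = "machine_action"  # "When you deleted the red dots..."
    OBJECT_STATE = "object_state"      # "When the blue square was below..."
    OBJECT_ACTION = "object_action"    # "When the planes were entering..."


@dataclass
class Event:
    time_s: float
    kind: Kind
    description: Dict[str, str]


@dataclass
class History:
    events: List[Event] = field(default_factory=list)

    def record(self, time_s: float, kind: Kind, **description: str) -> None:
        self.events.append(Event(time_s, kind, description))

    def when(self, kind: Kind, **match: str) -> List[float]:
        """Times at which an event of this kind matched every given attribute."""
        return [e.time_s for e in self.events
                if e.kind is kind
                and all(e.description.get(k) == v for k, v in match.items())]


h = History()
h.record(12.0, Kind.USER_ACTION, verb="create", object="table", color="red")
h.record(30.0, Kind.MACHINE_ACTION, verb="delete", object="dots", color="red")
print(h.when(Kind.USER_ACTION, verb="create", color="red"))
```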
Psychologists working in the field of human memory have identified at least three levels of store:
• very short-term memory (VSTM): refers to the consequences of stimulation over the first 1000 milliseconds (variously referred to as "iconic" or "echoic" memory)
• short-term memory (STM): refers to the memorial processes over the first 20 seconds (material processed beyond the VSTM stage to be "encoded," "named," or "recognized")
• long-term memory (LTM): refers to the later memory stage in the distinction that has been made on the basis of experiments which demonstrate differential effects of the same variables on immediate (8-20 seconds) and delayed (longer than 20 seconds) recall; "permanent" memory.
Very-short-term memory is highly volatile, subject to erasure by subsequently presented material. Unless encoded immediately upon reception by rehearsal, and thus passed into short-term store, material in very-short-term memory will be lost. While transitory, the capacity of such sensory store is apparently quite large as compared with the capacity of short-term store.
Short-term, or "immediate," memory is the kind of memory that holds, for example, the telephone number that you were just told, or just looked up. Such memory is very susceptible to interference; unless the material in this form of memory is maintained by either overt or sub-vocal rehearsal, it will fade. It may be erased by new input, such as being given an additional telephone number. Unless material in short-term memory is maintained or rehearsed long enough to pass over into long-term memory, it will be lost. The capacity of immediate memory is held to be limited to about seven plus-or-minus two items or "chunks."
Long-term memory is memory for material recallable 20 seconds or more after its presentation, despite intervening cognitive effort devoted elsewhere. If material persists that long in a person’s memory, it is deemed to have passed over into long-term store. Once established, such memory traces are resistant to erasure and displacement by subsequently presented material involving cognitive effort on the part of the person. Such memories are, however, subject over time to elaboration, distortion, and consolidation with other memory traces.
There probably should be at least two levels of temporal resolution of the past record:
• short-term, or “what just happened”
• longer-term, or “what happened some time ago”
These two levels of memory are the types of memory that people experience in their interaction with one another: there is memory for things in the immediate present, for things just said and done; there is memory for things that happened several minutes or hours ago, and for what happened in previous days. While the sensory physiology of people includes memory processes of the very-short-term "iconic" or "echoic" variety, such memorial processes play only a negligible part in the conscious experience of individuals. Except for so-called "fleeting glimpses," wherein we are not certain whether we actually saw something or not, or instances where we somehow "play back" to ourselves some just-heard but not-attended-to sound, our awareness tends to focus on material processed to the point of recognition or material recalled from long-term store. Thus, it would seem that the computer system need deal with people at least at the level of the person's long-term store (20 seconds or more past), and probably at the level of their short-term store (first 20 seconds) as well.
It is likely that the human user will expect the computer to remember everything. While a person would not expect another person to remember everything, and would be tolerant of the memory faults of another, at least to some extent, the computer as a “machine” would be held accountable for every detail of all transactions and interactions with it. This kind of expectation might lead to certain difficulties.
For example, a human’s memory of some past event may be faulty, yet the person may nonetheless request some act of retrieval based upon that faulty recollection. That the machine cannot perform what is asked because it has no similar memory (because what is referred to did not happen in precisely the way the human remembered it) might well become an annoyance to the user: “Of course the machine should remember what I am talking about; after all, it is a computer and must have a perfect memory!”
What strategy might the machine adopt in such instances? It may be irritating to the user to be continually faulted: “That never happened...There never was a blue square below the red circle...,” and so forth. Perhaps the machine could go beyond simple denial that such-and-such happened, and attempt to disentangle the user’s request.
One approach might be to take the user’s descriptive referral to the past situation or event (“Go back to when the blue square was below the red circle...”) on a more relaxed or “fuzzy” basis. The relaxation could be along lines of what is known about “consolidation” processes in human long-term memory. For instance, while memory for spatial location is ordinarily very good, memory for color is much less good: we are very likely to remember where certain people were seated around a table, but unlikely to remember what color clothes they wore; you are likely to remember where you gave me a file last week, but not the file’s color (unless of special significance), nor precisely which day (Thursday or Friday), though you remember quite clearly that it was toward the end of the week.
A strategy for the machine to adopt when the user queries about a blue square below a red circle (a circumstance which in fact never happened, at least within the current dialogue session) might be to examine past situations in the current session where there was a square of some color (though not blue) below the red circle, and ask or suggest whether that was the instance they might be referring to. This kind of “negotiation” between the machine and the user, mapping between occasionally faulty human memories and the machine’s trace of what happened and what was where, can be as elaborate as one cares to make it. In action, though, it must be efficient in order to be acceptable to the user.
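Such a relaxed lookup might be sketched as follows: try an exact match first, then stop insisting on the attributes people remember least reliably (color, in the example above) and offer what remains as a suggestion. The record format, the function name `recall`, and the relaxation order are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def recall(query: Record, history: List[Record],
           relax_order: Tuple[str, ...] = ("color",)) -> Tuple[List[Record], List[str]]:
    """Return (matches, relaxed_attributes); an exact match is tried first."""
    def matches(q: Record) -> List[Record]:
        return [r for r in history if all(r.get(k) == v for k, v in q.items())]

    relaxed: List[str] = []
    remaining = dict(query)
    found = matches(remaining)
    for attr in relax_order:
        if found:
            break
        if attr in remaining:
            del remaining[attr]       # stop insisting on the least reliable attribute
            relaxed.append(attr)
            found = matches(remaining)
    return found, relaxed


past = [
    {"shape": "square", "color": "green", "position": "below red circle"},
    {"shape": "square", "color": "blue", "position": "left of red circle"},
]
asked = {"shape": "square", "color": "blue", "position": "below red circle"}
found, relaxed = recall(asked, past)
print(found, relaxed)
```

A non-empty `relaxed` list is the cue for the machine to ask ("Do you mean the green square?") rather than to act silently on its guess.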
In long-term memory, consolidation of memories with corresponding omissions, condensations, merging, over time is the rule. Consider what happens when we attend a lecture. We listen sentence by sentence as the logic of the argument unfolds. At any moment, we could, if challenged immediately reproduce almost word for word what the speaker just said. We could, though not word for word, reproduce or para-phrase the several points the speaker made a few minutes earlier. However, perhaps 15 or 20 minutes into the talk, we would be hard pressed to give a thought-by-thought account of just what the speaker said, much less word for word. Later, when leaving the lecture hall, we perhaps could sum up the main points readily enough; a few hours later, those remembered main points become fewer. A week or so later, about all we are able to do is to summarize a ghost-like remnant of the original talk. A month later, we are lucky if we can recall what the speech was about.
Consider now what a computer system could and must “remember” to enable reference by the user to past events and episodes. What are the crucial features to capture? Reference to a just-past event in the human/system exchange can be, on the part of the human, quite exact with respect to specific graphic and auditory features of the exchange. However, over time, as with our attendee at the lecture, reference becomes less immediate. Can some of those features be dropped? Which ones? What are the key features five minutes later... ten minutes? What kind of memory consolidation happens over the near term? Over the longer term?
People experience life events as “episodes”: rising in the morning; making breakfast; driving to work, and so forth. Any of these episodes contain “sub-episodes.” Driving to work can be broken down into such steps as: opening the garage; starting up and backing out the car; driving to the expressway; driving along the expressway, and so on. It may be that the actions of driving to work are so well-practiced, so interconnected and “fluid,” that we do not ordinarily think of the act of driving to work as consisting of a set of sequential episodes. However, when later we talk about some event which occurred during our trip from home to work, we tend readily to refer it to some particular segment of our trip: when I was leaving the garage; when I arrived at the company parking lot, and so forth. We tend to cut up the continuous flow of experience on the basis of markers or “anchor points” alluding to time and space, the where and when of experience.
Having extracted such markers from amidst the flow of experience, we say such things as “...while I was on the expressway...” or “...as I was leaving the company parking lot...”. We tend to cut up and segment experience thus, especially when looking back upon previous happenings. We can expect people to extend such mental segmentation to their experience at the multimodal interface. The issue for our work is how to make the computer system segment the ongoing dialogue with the user such that, when the user refers back to some previous period or point in time, the computer has so arranged its memory schemata that it may respond readily.
At a basic level, the machine must be able to represent the "states" and "events" occurring in the three dimensions of space and the fourth dimension of time. "States" are encoded as a static set of values specifying a spatial relation that exists at some point in time. "Events" are encoded as a simple program for driving a value or set of values attached to a spatial relation through a range at a specified rate. States, usually representing some significant point in 4D space, are tied together and related by events. When a motion changes significantly and can no longer be encoded as a single event, a state will be introduced (as a kind of edge in temporal space) and a new event started. States and events serve as the basic temporal elements in the machine's representation of the interaction domain.
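To make the state/event encoding concrete, here is a minimal Python sketch under simplifying assumptions: a single scalar value (rather than a full set of values attached to a spatial relation), sampled at strictly increasing times, with a "significant change" taken to mean that the rate of change departs from the current event's initial rate by more than a tolerance. The names `State`, `Event`, `segment`, and the parameter `tol` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class State:
    time: float
    value: float


@dataclass
class Event:
    start: float
    end: float
    from_value: float
    to_value: float

    @property
    def rate(self) -> float:
        # Rate at which the value is driven through its range.
        return (self.to_value - self.from_value) / (self.end - self.start)


def segment(samples: List[Tuple[float, float]],
            tol: float = 0.1) -> List[Union[State, Event]]:
    """Encode (time, value) samples as states joined by constant-rate events.

    Assumes sample times are strictly increasing.
    """
    if len(samples) < 2:
        return [State(t, v) for t, v in samples]
    timeline: List[Union[State, Event]] = [State(*samples[0])]
    seg_start = 0      # index of the sample where the current event began
    cur_rate = None    # rate of the first step of the current event
    for i in range(1, len(samples)):
        (t0, v0), (t1, v1) = samples[i - 1], samples[i]
        rate = (v1 - v0) / (t1 - t0)
        if cur_rate is None:
            cur_rate = rate
        elif abs(rate - cur_rate) > tol:
            # The motion changed: close the event, mark a state, start anew.
            ts, vs = samples[seg_start]
            timeline.append(Event(ts, t0, vs, v0))
            timeline.append(State(t0, v0))
            seg_start = i - 1
            cur_rate = rate
    ts, vs = samples[seg_start]
    te, ve = samples[-1]
    timeline.append(Event(ts, te, vs, ve))
    timeline.append(State(te, ve))
    return timeline
```

For samples that rise steadily and then hold still, `segment` yields a state, a rising event, a state at the point where the motion stops, a zero-rate event, and a closing state.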
In addition to the primitive links between states and events are the "categorical" temporal relations: "during", "just after", etc. These abstract relations are useful in relating not only states and events but allow the construction of whole systems of actions. Chains of states and events can be abstracted into larger episodes and related to other episodes by these higher-level categorical relations.
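As a rough illustration of such categorical relations, the Python sketch below treats anything with a start and end time as an interval, defines "during" and "just after" as predicates over intervals, and abstracts a chain of parts into a larger episode. All names (`Interval`, `during`, `just_after`, `episode`) and the `max_gap` threshold are assumptions made for the example, not part of the system described here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Interval:
    label: str
    start: float
    end: float


def during(a: Interval, b: Interval) -> bool:
    """a lies entirely within b."""
    return b.start <= a.start and a.end <= b.end


def just_after(a: Interval, b: Interval, max_gap: float = 1.0) -> bool:
    """a begins at or shortly after the end of b (within max_gap)."""
    return 0.0 <= a.start - b.end <= max_gap


def episode(label: str, parts: Sequence[Interval]) -> Interval:
    """Abstract a chain of intervals into one larger episode spanning them."""
    return Interval(label,
                    min(p.start for p in parts),
                    max(p.end for p in parts))


# Usage, borrowing the driving-to-work example from the text.
backing_out = Interval("backing out", 0, 2)
to_expressway = Interval("driving to the expressway", 2, 10)
on_expressway = Interval("on the expressway", 10, 30)
drive = episode("driving to work",
                [backing_out, to_expressway, on_expressway])
```

Because an episode is itself an interval, the same relations apply at every level: an event can be "during" a sub-episode, and one episode can come "just after" another.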
Our approach to system-building in the domain of multimodal human-computer natural dialogue is modularity. Specifically, we are attempting to make our system design modular on two important levels:
• device-independence: this means that the body model layer of our system, as well as other layers above the body model layer, stay the same whatever changes and developments may occur in sensory technology. For example, measurement of the user's manual gestures may derive from a "glove-like" device or from machine vision, or yet some other approach, without affecting the usefulness of the remaining parts of the overall system.
• topic-independence: this means that dialogue skills built into the system for both interpreting user actions and utterances as well as for generating machine output in graphics and sound can effectively operate whatever the particular subject matter might be.
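Device-independence of this kind can be sketched as follows. The Python below is a hypothetical illustration, not our actual code: `HandPosture`, `GloveDevice`, `VisionDevice`, the raw field names, and the 20-degree threshold are all assumptions. Each device adapter fills in only what it can measure, unmeasured body-model parameters stay null (`None`), and the layer above checks for missing values instead of failing.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class HandPosture:
    """Body-model slice for one hand; None means "not measured"."""
    finger_flex: Optional[float] = None    # 0.0 (straight) .. 1.0 (curled)
    finger_spread: Optional[float] = None  # degrees between fingers


class HandDevice(Protocol):
    def read(self) -> HandPosture: ...


class GloveDevice:
    """Hypothetical glove that reports flex but not spread."""

    def __init__(self, raw: Dict[str, float]) -> None:
        self._raw = raw

    def read(self) -> HandPosture:
        return HandPosture(finger_flex=self._raw["flex"])


class VisionDevice:
    """Hypothetical camera tracker that reports both flex and spread."""

    def __init__(self, raw: Dict[str, float]) -> None:
        self._raw = raw

    def read(self) -> HandPosture:
        return HandPosture(finger_flex=self._raw["bend"],
                           finger_spread=self._raw["spread"])


def is_spread_hand(p: HandPosture) -> Optional[bool]:
    """Higher-layer test: True/False if spread is known, None if unmeasured."""
    if p.finger_spread is None:
        return None
    return p.finger_spread > 20.0  # threshold is an arbitrary assumption


def observe(device: HandDevice) -> Optional[bool]:
    # The layer above the body model never needs to know which device it has.
    return is_spread_hand(device.read())
```

Swapping the glove for the vision tracker changes only which adapter is constructed; `observe` and everything above it are untouched.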
As regards topic-independence, an underlying assumption of this work is that there exists in people, and potentially in machines, a certain interpersonal dialogue expertise, which socio-linguistic skill is independent of the subject matter under discussion. It is further assumed that this skill is of two types:
• the skill of the sender of messages. This skill lies in the ability of the speaker to take their linguistic intention and distribute its expression over the modes of speech, gesture and gaze, that is, to "de-construct" their intended message and to differentially load its expression on these several modes.
• the skill of the receiver of messages to witness a multi-modal expression by some person (or agency), to apprehend the meaning conveyed in each of several channels, and to combine those several streams of meanings—which may be redundant, complementary and partial—into a single, overall meaning.
These skills are assumed, in humans, to be over-learned, well-practiced, and for the most part to proceed without undue effort. It is possible to hold the view that the way in which a person discusses, for example, the construction of a house or the decorative layout of a room, is driven primarily or exclusively by the nature of the material under discussion: that the way in which people talk about some topic is inherent in the topic. It is an observational fact, however, that people can and do talk about, refer to, and articulate through combinations of concurrent speech, gesture, and gaze a great many topics; they are even able to articulate their non-knowledge, or lack of expertise about something or some topic, using combinations of speech, gesture, and gaze. That is, they are able to discuss their lack of subject matter knowledge in an articulate way! This observation in particular belies the assertion that articulatory skill concerning specific topics resides primarily or exclusively in the subject matter itself.
It is possible that multimodal articulatory skills are rooted in a generalized knowledge about space and geometry, and these general skills or notions are applied to whatever items or topics are under discussion. In this way, the articulatory skill could seem to be independent of subject matter, yet rooted not in articulatory skills as such, but only in so far as such skills themselves are yet further rooted in a generalized familiarity with space and spatial concepts. On this view, spatial knowledge and articulateness mediate the exposition of, and reference to, concrete items.
Additionally, it could be the case that the generalized ability of people to talk about the objects about them is rooted as well in an unconscious but highly developed knowledge about how other people perceive things. That is, in showing something off, certain simple but highly important presentational principles need to be observed. For example, the item to be shown off must be held up in clear view of the person to whom it is being described. The item should be held at an angle of view that exhibits maximal information. In showing off to someone else a new type of pencil, we do not hold it end-on to the person so that all they see is its circular end; rather, we hold it at an angle so that the person sees a "three-quarter" view of the pencil, and in one glance can take in the top end of the pencil and its barrel. In addition, we may turn the pencil about, so that the viewer can look at it from several aspects. Beyond such spatial aspects of expository skill, there are temporal aspects: we turn the pencil slowly so that its features can be appreciated. Another temporal aspect is that we coordinate our narration with our physical handling of the pencil so that whatever aspect of the pencil's design we are talking about is apparent to the viewer as we talk about it.
We have a skilled sense of looking as well. We glance back and forth between the item being described (here, a pencil) and the face and eyes of the person to whom we are describing it. This pattern of looking is unconsciously and spontaneously orchestrated. We look toward the pencil when we are making some detailed point about it, a signal to the other person that they also should look there because so doing will aid their understanding; and we glance directly at the other person by way of sustaining the interpersonal dialog linkage with them.
Further, having made some point about the item being shown off, all the while looking between it and the person, we can observe whether or not the person is (or had been) looking in the right places. That the other person be at least looking in the appropriate direction when we are explaining some aspect of some item (or situation) is a minimal condition for their understanding of what is being shown off, or, at least, is a minimal assurance to us that the other person is paying attention, and that we may feel confident enough in their monitoring our exposition or explanation to proceed further.
Suppose there are some details to the item we are showing off to which we wish to draw particular attention. In the case of the pencil, there may be some functional or decorative detail toward the pencil's cap that we wish to emphasize. We point with our finger to that detail to highlight it. That is, we somehow intuitively know that, in order to draw the listener's attention to that bit of detail, we must lead and direct their attention through the use of overt signals that "point the way" out. In other words, we implicitly know something about the other person as "knower": we have a model of how other people in general assimilate and take up information, and we now act toward this particular person in terms of that model as applied to the item to be shown off. That model may have, in addition to general assumptions about how people assimilate information, bits of information concerning the cognitive style of the particular person we are currently dealing with; thus, our model may be both generic and person-specific.
Does such expository expertise exist apart from the item or topic to be shown off? Or, is the exposition or "showing off" of something driven by the inherent nature of the object or activity itself?
The answer to the first question would seem to be a clear "yes" for such conversational conventions as "turn-taking": the fact that people tend to talk one after the other, and not simultaneously, as well as tending not to interrupt one another. It is less clear concerning referring someone to this or that aspect of an item or topic. It may also be the case that any referential act becomes more efficient the more specific the domain. On the other hand, different people are more or less adept at explaining the nature of something; and, people can improve with practice their ability to instruct someone about something. The true answer may be that "dialogue expertise" is a generalized ability whose exercise is tempered and shaped by the specific content under discussion. If so, this suggests that the skill might to some extent be built into computers, and that the specific subject matter can be modular.
* * * * * *
[1]Sparrell, Carlton J. Coverbal iconic gesture in human-computer interaction. Unpublished master’s thesis, MIT, June 1993.
[2]It is not clear whether an eyetracker of the corneal reflection type, such as is ours, could reliably capture such a “rolling up” of the eyes. A remote camera with sophisticated neural net or autocorrelational image analysis techniques might better handle such “expressive” use of the eyes.
[3]Rimé, Bernard and Loris Schiaratura. Gesture and speech. In Robert S. Feldman and Bernard Rimé (eds.), Fundamentals of nonverbal behavior. Cambridge, England: Cambridge University Press, 1991, Chapter 7, 239-281.
[4]Kahneman, Daniel. Attention and effort. Englewood Cliffs, N. J.: Prentice-Hall, 1973, pp. 50-51.
[5]Cf. Jonides, J. Voluntary versus automatic control over the mind’s eye’s movement. In: John Long and Alan Baddeley (eds.), Attention and Performance IX. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1981; Posner, Michael I. Orienting of attention. Quarterly Journal of Experimental Psychology, 32, 1980, pp. 3-25.
[6]Starker, India and Richard A. Bolt. A gaze-responsive self-disclosing display. In: Jane C. Chew and John Whiteside (eds.), Proceedings of CHI '90 (Seattle, Washington, April 1-5, 1990), ACM, New York, 1990, 3-9.
[7]Reisberg, Daniel, Roslyn Schreiber, and Linda Potemkem. Eye position and the control of auditory attention. Journal of Experimental Psychology: Human Perception and Performance, 7(2), 1981, pp. 318-323.
[8]Bolt, Richard A. and Edward Herranz. Two-handed gesture in multi-modal natural dialog. In: Proceedings of UIST '92, Fifth Annual Symposium on User Interface Software and Technology, Monterey, CA, November 15-18, 1992. New York: ACM Press, 1992.