The development of people-literate machines is one of socializing the machine. Additionally, the machine has to gain a certain level of linguistic competence, including dealing with gesture as part of language. Also, the machine has to mirror, or at least to take into account, certain psychological qualities of its human users. These various competencies that the machine must gain are at the heart of multimodal natural language.
By multimodal natural language, I mean the concurrent use of speech, gestures, and gaze. While there has been a good deal of research in natural language input to computers, perhaps the bulk of that research has involved typed input or spoken input in the absence of gesture, and where gesture has been present, often it has been confined to pointing by mouse or by finger on a touch-sensitive surface.
In my own outlook, and in my researches, what I intend by the phrase multimodal natural language is, as previously stated, the concurrent use of speech, one- or two-handed gestural input, and gaze. It is through these modes, used singly and in freely mixed combination, that people in ordinary life express themselves to one another. The challenge at the human-computer interface, then, is that of enabling the un-adorned person (e.g., no mice, batons, etc.), using their "native equipment" as it were, to express themselves in freely admixed combinations of speech, gestures, and gaze, and make themselves as readily understood at the computer interface as they would be in the presence of other people.
Certainly, gesture is part of everyday "natural language." Professor David McNeill of the University of Chicago puts it thus:
"...speech and gesture are co-expressive manifestation of a single underlying process. The underlying process is equally speech and gesture, and there is a subsequent evolution of expressive action with outputs in both channels concurrently. The channels, moreover, have a constant relationship in time, with the gesture manifesting the primitive stage of the shared process and speech its final socially presentable stage." (McNeill, 1992, p. 31.)
Eyes, too, play a central role in everyday interpersonal dialog. Their main role is perhaps one of "address": other people gauge to whom--where more than one person is present--we are talking. Gaze and mutual gaze between conversants work to orchestrate the flow of conversation, signaling "breaks" and turn-taking. We monitor how well someone else is "following" what we say by noting the eyes: are they, for example, looking at the lawnmower controls while we explain their operation? Or are their eyes trained elsewhere, suggesting that their real attention is elsewhere as well?
On the view that the "receiver" skill is that of integrating meaning from the several modes, the receiver is operating under the hypothesis that the sender is indeed trying to express a unified thought, though that thought is distributed over the several modes...
Looking patterns can index level and quality of comprehension. Recall psychiatric "glove anesthesia," where the whole-hand, wrist-forward feeling loss mirrors body gestalts, not neuro-anatomy. Do looking patterns favor subject matter logic or figural qualities like saliency and conspicuity? With a displayed chess-board image, does the observer focus more on the pieces (antique collector?) or their patterns (player)? If on patterns, is looking willy-nilly (beginner) or economic (tournament player)?
Who - the computer or the user - takes the initiative in the exchange, and how the initiative is passed back and forth, is of central importance. The initiative can be assigned to the machine by the user, by asking, for instance, "Tell me about that," while looking at an item or set of items. Or, the machine may take the initiative in telling about something, having inferred user interest in it from patterns of looking. However the dialogue starts, once underway the "lead" in that dialogue may well pass back and forth, the user picking up the pace via vigorous looking and/or questions, the machine volunteering some new tack into the subject upon inferring some momentary slack in the user's reactivity.
The showing off of things by machine should not be driven in an ultra-tight, "direct" way by user looking, i.e., items on display shouldn't instantly jump about as if "hit" by user glance. Rather, system response in graphic actions and synthesized speech should occur as a result of inferences about user curiosity and interest. The system neither "jumps" in response to user acts, nor is unduly sluggish. Ideally, the "pace" of system response approximates that of an alert, attentive human conversational partner.
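One way to picture a response that neither "jumps" nor lags is to let interest in an item build up with looking time and fade when the eye moves on, with the system responding only once the accumulated interest passes a threshold. The following is a minimal sketch under that assumption, not a description of any implemented system; the class name InterestTracker and the gain, decay, and threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InterestTracker:
    """Hypothetical model: interest grows with gaze dwell and decays otherwise."""
    gain: float = 1.0       # interest added per second of looking (assumed value)
    decay: float = 0.5      # fraction of interest lost per second elsewhere (assumed value)
    threshold: float = 1.5  # level at which the system may respond (assumed value)
    levels: Dict[str, float] = field(default_factory=dict)

    def update(self, looked_at: Optional[str], dt: float) -> None:
        """Advance the model by dt seconds; looked_at is the item under gaze, or None."""
        for item in list(self.levels):
            if item != looked_at:
                # Items not being looked at lose interest gradually, not at once.
                self.levels[item] *= max(0.0, 1.0 - self.decay * dt)
        if looked_at is not None:
            self.levels[looked_at] = self.levels.get(looked_at, 0.0) + self.gain * dt

    def item_of_interest(self) -> Optional[str]:
        """Return the most interesting item if it has crossed the threshold, else None."""
        if not self.levels:
            return None
        item, level = max(self.levels.items(), key=lambda pair: pair[1])
        return item if level >= self.threshold else None
```

With these values a single short glance does not trigger anything, sustained looking does, and a brief glance away does not immediately cancel the inferred interest. Tuning the three parameters is one way of adjusting the "pace" of the system.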
The "direction" of the system's exposition or explanation of an item or set of items can be "top-down" or "bottom-up," or some mix thereof. If user looking initially focuses on one of a group of items, then system explication can well start with that item, working out to others, possibly to the group as a whole, should user looking widen out. Conversely, user interest as evinced by looking and questions may proceed from the group as a whole to particular items. Should there be some optimal order in which to examine the set of items or parts of an item, the system may "take the lead" in steering the user in examining things in that order. This would especially be the case where the user had started things off by requesting to be told about them. To the extent that the "conversational contract" is tutorial in nature, a tendency toward the system steering things is in order. This is not ironclad, and the system should be continuously attempting to infer user state with respect to how much initiative to take in both the pace of things and their direction.
"User state," however, is only one factor in determining such aspects of system action as initiative, pace, and direction of the dialogue. There are at least two sides to any dialogue, and it would be a mistake to determine how the system should respond solely by consulting and inferring "user state." Between people, I don't slavishly modulate myself to what I infer you are most willing and wanting to hear. Such actions on my part would rapidly become cloying.
It is vital to conversational qualities in dialogue, then, that the system not attempt simply to model or otherwise infer "user state" and strive solely to accommodate. Such attempts slavishly to accommodate would certainly be self-defeating. Certain levels of initiative on the part of the system are in order, and probably welcome to the human user.
A couple of helpful references on "system initiative" are:
Bucheit, Paul and Thomas Moher. Response assertiveness in human-computer dialogues. International Journal of Man-Machine Studies, 1990, Vol. 32, 109-117.
Cook, John and Gavriel Salvendy. Perception of computer dialogue personality: an exploratory study. International Journal of Man-Machine Studies, 1989, Vol. 31, 717-728.
See also:
Reeves, Byron & Clifford Nass. The Media Equation: How people treat computers, television and new media like real people and places. Cambridge University Press/CSLI Publications, 1996.
The 3 main ways in which we get tasks done are:
The role of person vis-a-vis machine can be variously seen as:
Many of the rote things we now do with computers may well be taken over by agents. As a corollary, most of what we will be doing with computers will be describing to machines what we want done.
In this connection, we may well ask: Why would anyone want to communicate with a computer? What for?
The answer lies in how people get things done:
Whether writing a letter, doing income taxes, carrying groceries to the car, planning a summer home, or washing the dog - you can do it yourself, ask someone else to do it, or work in tandem with others.
In doing a task directly, or via "tools," the computer becomes a "tool"; you "operate" the computer as you would use a tool.
Software people create "tools" (or the proverbial "toolkit") to map the task to the invariant keyboard-and-mouse.
Keyboard-and-mouse access serves to let us input characters and numbers into the computer. Current interfaces serve specialists well for certain tasks (word-processing, spreadsheets, along with most, if not all, games), provided they know, or have the patience/motivation to learn, how to control the software via keyboard-and-mouse.
Multiple modes, like speech with gesture, have been used in this context, e.g., for graphic manipulations; but speech often is functionally reduced to a "button," and the hand to a "valuator."
However, the standard keyboard-and-mouse interface serves the non-specialist (read: most people in the world) not at all well.
Delegating is leveraging the abilities of others; here, leveraging the abilities of computer "agents."
Autonomous agents do their work in a non-interactive way (cf. "situated actions" vs. pre-ordained actions).
Communication with conventional agents is limited; there is, or has been at any rate, little ability interactively to direct, discuss, or negotiate concerning the task-in-progress.
Interactive multi-modal communication with computer agents about things to be done would open up many new doors to human-computer interaction for many, if not most, people. They would be interactively describing, directing, explaining, negotiating what is to be done.
Examples of dialoging with agents include:
Examples of application areas for the above include:
All this is not an issue of grafting a "front-end" onto existing agents, but re-examining shared tasks along social as well as task dimensions. Relevant disciplines include:
The visible agent could be, in the most general sense, some combination of graphics and/or sound that enables the core agent (or "the system") to refer the user to some aspect of the situation. While the on-screen agent could be some abstract form, the use of a "body" and "face" may be most efficient.
The agent is the active repository of two levels of skill:
The proper aim, I believe, is not to simulate a person, but rather to support and sustain machine dialog with the human user, leading to the person having a social rather than a technical relationship with the machine.
To the human user, the agent is an "other." With traditional "autonomous" agents, there has been little ability interactively to direct, discuss, negotiate, etc. about the task-in-progress. There is a need to "socialize" the agent: action selection by agents should contemplate not only goal, situation, planning task-space, etc., but also inputs in speech, hand, and eye from the user.
Specifically, the agents would embody special complementary skills, expertise, viewpoints, etc. working with user on common task.
The locus of agent action would be doing things in real world, e.g., directing robotic agents (e.g., digging a ditch); doing things in graphics world, via graphical agents.
Should the agent have some intention it wishes to express, that intention ought to be "externalized," its expression distributed over the agent's modes of speech, gestures, and gaze.
In the area of mutual gaze with a machine, there is the issue of whether there might be a distinction between what we are looking at and some "agent" with whom we are sharing the visual scene. In everyday life when we are talking about things with another person, we ordinarily look back and forth as we talk, from the thing about us that we are discussing to the person's eyes, and so on, back and forth. That is, our eyes move alternately between the things discussed and direct eye contact with the person with whom we are talking.
Such eye contacts and eye breaks also serve to orchestrate the conversation, signaling attentiveness, when we wish to speak, whose turn it is to speak, and so forth. Thus it is of theoretical and practical interest to establish at the interface in eye-supported dialogue a graphic "persona" or graphical personage with whom to speak. How humanoid or human like it might be is a sub-issue for research. The desirability of its presence and/or absence is also a sub-issue for research.
The role and function of this persona is: 1) to establish some place or agency within the graphical display, but distinct from the to-be-talked-about contents of the display; 2) to establish a distinction between the subject matter contents of the display - the material or topic that is the subject of the dialogue or conversation - and the agency with whom we are conversing.
That agency of course could be disembodied--we could be talking with some disembodied voice--but this approach does not encourage or take advantage of the interpersonal subcarrier of information concerning the progress, as opposed to the content matter, of the conversation that comes with things like mutual eye contact and eye breaks. Thus, the establishment of a visible persona re-instates the possibility of such interpersonal, or rather interagency, actions as mutual gaze. The gaze of the human is of course monitored by an eyetracking apparatus; the graphics of the "agent" are animated to exhibit graphical "eyes" which look out at the user, and which move about, blink, look at (read: "point" at) this or that item under discussion, or otherwise reference the subject matter of the dialogue.
When we want to communicate with someone, we (or our message) have to find them. That is, we and/or our message have to go where they are. And where are they? --Where their bodies are.
We know where a person is because that is where their body is. In the case where the agency with which we wish to communicate is a computer, we go to where the computer is, stand in front of it and begin to speak, point, look and so forth. This is precisely what my students and I did in the initial versions of our multi-modal dialog systems: the user stood before the screen, wearing eyetracker and gesture-sensing gloves, microphone clipped in place, and began to speak, gesture, and look in the "presence" of the system.
In our initial version, "the machine" is not as such visible, but is omni-present because not situated anywhere in particular, being an offstage presence which listens to our commands, notes our gestures, and where our eyes are pointing, and does its best to carry out the given commands and requests.
Even with this initial, presumably simple set-up there are some issues in "address" as in addressing the system. The system, when you speak, does not necessarily know whether you are addressing your remarks to it, or to someone else (human) in the room. In most cases, some command will not accidentally be triggered, as the command format is generally dissimilar to the kinds of thing you might be saying to people with you about the room; more likely, you will say some things that will cause the machine to process them, but not be able to make sense of them, and thus they will not be executed.
One stratagem to avoid confusion is to require the user to say something like "Wake up" to the system before issuing specific commands or inquiries to the machine; then the computer would know you were in fact speaking to it and that it should now attempt not only to interpret what you say, but also to act upon it. Similarly, when you are through giving commands, you would say something like "Stop listening." (Or, "Computer--stop listening" by way of specific address to the machine.) Of course, the machine would keep on "listening" to whatever spoken input might come in over its microphone; it simply would no longer act upon it--a kind of mode shift.
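The mode shift just described can be sketched as a small two-state gate: every utterance is still heard, but only utterances arriving while the gate is in its attending state are passed on to be acted upon. This is an illustrative sketch only; the class name ListeningGate, its methods, and the exact matching of the phrases are assumptions, and a real system would sit behind a speech recognizer rather than take text strings.

```python
from typing import List, Optional


class ListeningGate:
    """Hypothetical two-state gate: always hears, acts only while attending."""

    WAKE = "wake up"
    SLEEP = "stop listening"
    ADDRESS_PREFIXES = ("computer--", "computer,", "computer")

    def __init__(self) -> None:
        self.attending = False
        self.heard: List[str] = []  # the machine keeps "listening" in either mode

    def hear(self, utterance: str) -> Optional[str]:
        """Return the utterance if it should be acted upon, otherwise None."""
        self.heard.append(utterance)
        text = utterance.strip().lower().rstrip(".!?")
        # Allow an optional salutatory word, as in "Computer--stop listening".
        for prefix in self.ADDRESS_PREFIXES:
            if text.startswith(prefix):
                text = text[len(prefix):].strip()
                break
        if text == self.WAKE:
            self.attending = True
            return None
        if text == self.SLEEP:
            self.attending = False
            return None
        return utterance if self.attending else None
```

For example, "Place the green truck on that bridge" is ignored before "Wake up" and passed along for interpretation afterward; after "Computer--stop listening" the gate again returns nothing, although the utterances continue to accumulate in the heard list.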
Consider, now, there being not just one agency in the room with you but two or more. That is, in addition to the machine there is another person in the room with you, with whom you may well be speaking, as well as they to you.
This creates the beginnings, at least, of a need for you to differentiate to whom you are speaking. The utterance "Place the green truck on that bridge?" is heard by your human companion as an inquiry directed toward them (you are in fact looking at them, and the overall context of the dialog is consistent with your looking at them and making such a remark). But the machine may hear it as well, and, if not especially sensitive to inflection (the raised tone of voice at the end), all else being equal, it may well interpret your utterance as a command to be carried out (placing the green truck on the bridge, asking you "What bridge?", or whatever it, the system, may conclude is the appropriate response for it to make).
Even when you are alone in the room with the machine there are potentially two parties in there with you--that is, one beyond the machine. You may utter some phrase or thought aloud intending it "for yourself" as in talking to oneself or "thinking aloud" but with no conscious intention for the machine to either hear or to act upon it. The reader may regard this possibility as a further complication to the situation, or perhaps speculate that it makes for an interesting situation: a new "mode" wherein the computer hears you as doing just that--"thinking aloud," and may have some response or offering specifically to make to you.
Thus, the external address space in the room splits into not just one undifferentiated "other" with whom you are communicating, but now two agents: the machine and the other person. The focus of your remarks to the other person becomes them, or specifically their body as a "target" toward which you direct your gaze and the attitude of your head. (Of course, any salutatory word, e.g., "Jim..." or "Computer...", serves to direct your remarks.)
But now suppose there is an additional agency resident on the screen. It could be an "inset" real-time image of a collaborating colleague, as in a teleconferencing situation. Suddenly, it becomes more important to direct your remarks. And, even easier than the mandatory use of a salutatory word is the glance of the eye: toward the inset image when you want to address the tele-present colleague, toward the person who may be in the room with you, or toward the "system."
But where to look to address the system?
As the space before and within the screen becomes populated with agencies, there arises the need to differentiate, and hence embody, the machine. Just as our own bodies serve to localize us and make us "addressable" where there are a multiplicity of agents participant in the ongoing process, so some kind of "body" becomes necessary for the machine.
What our body does is to localize us, and, amongst other things, establish a point in space toward which others can--literally--direct their remarks. Beyond any salutatory word they may use, they ordinarily turn their head toward us and look in our direction.
Thus, amidst all its other functions, the body serves to localize us, for purposes of "address" by others. The differentiation and specialization of portions of the visual display--flat, round, 2-D, 3-D, holographic, whatever--serves the same function for machine "agencies."
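To make the idea of "address" by localization concrete, here is a minimal sketch in which each on-screen agency occupies a rectangle of the display and the addressee of a remark is taken from a salutatory word if one is given, and otherwise from where the eye is pointing. The names Region and addressee, the rectangle representation, and the rule that a salutatory word outranks the glance are all assumptions made for illustration, not a description of an existing system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical screen rectangle that gives one agency a 'body' to look at."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, point: Tuple[float, float]) -> bool:
        px, py = point
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def addressee(gaze: Optional[Tuple[float, float]],
              regions: Sequence[Region],
              salutation: Optional[str] = None) -> Optional[str]:
    """Return the name of the agency being addressed, or None if no on-screen agency is."""
    if salutation is not None:
        wanted = salutation.strip().lower()
        for region in regions:
            if region.name == wanted:
                return region.name
    if gaze is not None:
        # gaze is the eyetracker's point of regard in screen coordinates (assumed).
        for region in regions:
            if region.contains(gaze):
                return region.name
    return None
```

A result of None here would cover the cases where the user is looking at the subject matter of the display, at a person in the room, or at nothing in particular; telling those apart would need information beyond this sketch.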
Just as the body gives us a place, a point in space for others to return to in order to address us, so does the face, within the body, serve such a function. The body could, of course, be all face; on some displays this is true, as with the anchor people on nightly newscasts. In any event, the face is the place within the place of the body that has maximum malleability, and hence the most potential for bearing information. (Note: manual sign language in the case of "signers" is beyond the scope of this discussion, though needing treatment in its own right.)
Consider the face in functional terms, rather than the emotive or aesthetic. Its features are all collected in a limited locale, which facilitates looking at it, in contrast to what would be the case were they distributed over the body, the eyes here, the mouth over there, and so forth.
The individual features of the face provide at their own level certain types of information. The eyes, we note, provide address, attention, reference, and their overall pattern of movement can reveal the distribution of interest and attention. They can also "wink" to express coquettishness or complicity, or "cross" to express dismay.
The mouth externalizes what we say as visible speech (for lip-readers), as well as serving as a vehicle for expression as in smiling or sticking out one's tongue. The nose may wrinkle, as in disgust. The face as a whole, as a collection of features, can express a wide range of attitudes and emotions.
Here, though, the justification for the computer or computer-resident agents having a face rests in the functional aspect of the face as:
The issue on output is distributing the expression of the agent's intention over the modes of speech, gesture, and glance.
That is, the agent has something to say about the situation. Part of the expression of that intention will be in words, part will be in gesture, and part of it will be in glance. The task of the software intelligence that governs the agent's action will be to differentially load the expression of that intention over the modes of speech, gesture, and gaze.
For example, suppose the user and the on-screen agent are engaged in specifying the layout of a room. The room may be a real one that the user is planning to set up and is modeling its layout in advance, or the room could reside only in the computer as a virtual room in a virtual house. It does not matter. The discussion takes place mainly through declarative expressions from the user (e. g., "Move that table nearer to that wall..."), with an occasional question, tip, or kibitz from the agent. The agent would have interior decorating expertise, and perhaps some record of the likes, dislikes, tastes, etc., garnered over time, of the user.
Suppose the user does in fact say, "Move that table nearer to that wall." Suppose that the agent, its algorithms and strategies churning away, comes to the conclusion--derived in part from its deliberations about the aesthetics of the developing room--that the just-moved table should be wider. That then is the kernel notion to be expressed: that the table should be wider. Its task is to externalize that kernel notion--to express it--via its powers of synthesized speech, graphical gestures, and graphical gaze.
The speech output may well be: "The table would look better if it were wider." Which table? The one we were just talking about? Perhaps. Or, should the agent, via its graphical representation, be either pointing or looking at some other table, then that is the one intended.
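The rule of thumb in this example, that a table being pointed at or looked at takes precedence over the table most recently talked about, can be written down as a small resolution procedure. The sketch below is an assumption-laden illustration: the function name resolve_referent, the object identifiers, and the flat mapping from identifier to kind of object are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def resolve_referent(noun: str,
                     deictic_target: Optional[str],
                     discourse_history: Sequence[str],
                     objects: Dict[str, str]) -> Optional[str]:
    """Pick the object meant by a noun such as 'table'.

    deictic_target is the object currently pointed at or looked at (or None);
    discourse_history lists recently discussed objects, most recent last;
    objects maps each object identifier to its kind.
    """
    # A matching object indicated by gesture or gaze wins.
    if deictic_target is not None and objects.get(deictic_target) == noun:
        return deictic_target
    # Otherwise fall back to the most recently discussed object of that kind.
    for candidate in reversed(discourse_history):
        if objects.get(candidate) == noun:
            return candidate
    return None
```

With two tables in the room and the first one just moved, "the table" resolves to the first table unless the agent is pointing or looking at the second, in which case the second is the one intended.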
On-view agents thus are personae as vehicles for dialog management. Among other things, they manage dialog "turn-taking" by, for instance, eye contact and eye breaks. Other interpersonal cues and clues used by the agent representation include inflection of voice and modulation of posture. Think of the exchange as a "conversational waltz," user and agent both immersed in a shared set of expectations, a mutual sense of the occasion.
As noted, the individual agent acts as a repository of specific expertise about some aspect of the subject matter at hand. Similarly, a collection of agents can serve as an articulation of the organizational nodes of the body of knowledge about some topic.
The relationship between the agent's interpersonal skills at dialoging (psychosocial-linguistic) and its subject matter expertise is a matter of much importance. Certain topics might suggest their own articulation, e.g., moving furniture about a room, or changing the size and shape of things on display, naturally involves combinations of glance, gesture and words. Imagine yourself directing someone in rearranging your office. Act it out. That's what the agent should be doing; that's how the agent would behave.
The kind of agent I have in mind is one that is "on-screen"--i.e., visible to the user. It has graphical eyes so that it can "look" at you--a signal that you have its attention, or that it is directing its synthesized speech to you. It has graphical arms and hands, so that it can gesture to you, and indicate things on the screen. The way you interact with such an agent is to look at it and speak or gesture, any or all at once.
Consider, on your graphics display, several such agents visible on screen. The details of their appearance are not now important, except to say that one agent is identifiably an electrician, another is a plumber, another is an architect, another is a general contractor. You, the user, are planning to build a house, and, just as you would discuss your plans with human construction professionals, you are seated with these agents to discuss your plans and ideas for your house.
You block out your ideas for the layout of some rooms, using your hands and speech to describe the layout and location of the rooms. For example, you say "The bedroom is here (using your two hands to show its position and orientation), and the new bath should be here (designating its position with your hands as you speak)...". The graphics display responds by placing rectangular blocks to mark the location of the new rooms.
However, the "plumber" agent begins to speak up, looking out at you with graphical eyes: "If you put the new bath there, you are so far from the rest of the pipes in the house that it will cost you twice as much as if you put the new bath there (showing with graphical hands where it thinks the new bath should be placed...)."
While you consider this suggested alternate location for the bath, the "electrician" agent speaks up: "If you (looking straight out at you) put the bath where he (glancing over at the "plumber" agent) is suggesting, the electrical work will be a lot more complicated and expensive...." And so it would go, on and on throughout your design session with these (nonhuman) construction people.
By virtue of the on-screen presence of this collection of agents who are bringing their individual expertise to bear on the layout of your house design as you develop it, you are able to go through a design sequence in a manner as much social as it is technical. You see the agents, each reflecting a complementary role in their common task, and bringing to bear on the situation their particular expertise.
...to be continued...