
Given that we have a multimodal interface, we have a number of options with regard to whether there is an "agent"—either on-screen or off-screen—in the system. One option, at the left of the chart above, is to have no on-screen agent; in effect the system is the agent. The user talks and gestures toward the graphics scene, stipulating this or that graphical manipulation, and the system simply does it.
Where we do opt to have an on-screen agent, as in the right-hand branches of the diagram above, there are a number of options. We can have an on-screen agent who serves as the "face" of the system. (This is the first left-hand branch under "on-screen agent" in the diagram above.) This kind of agent could be a purely "flunky" type agent, who simply executes or enacts the commands and inquiries of the user. Or, this agent could have a great deal of "intelligence" about the situation or the subject matter.
A variant is to have the agent be distinct from the system (the right-hand branch of the diagram, outlined). Here, the on-screen agent is not intended to be perceived by the user as being "the system," but is independent thereof. The user separately addresses either the system or the agent. When the user wants to effect some change in the display, the user directly requests it of the system, for example, "Color those items (glancing and/or gesturing) blue...". The user does not ask the agent to change the color; they simply order the system to do so. The system constitutes the shared environment of the human user and the agent. The relationship between the user and the agent is one of collegiality, not master/slave.
There are two sub-options to this option of having an on-screen agent which is distinct from the system. One option, at the bottom left of the chart, is to have no on-screen system agent. When the user addresses the system to request, for instance, some graphic manipulation, the user is speaking to some behind-the-scenes "agency," rather than to an embodied agent who is visible on the display and who represents the system to which they are putting the request.
The other sub-option (at the bottom right of the above chart) is that there are on-screen both an agent who is "the system," to whom requests and inquiries about the display are directly put, and (at least one) on-screen agent who is distinct from the system; that is, an agent to whom the user does not give commands to alter the display (the "flunky" role), but who rather has a level of knowledge or expertise about the subject matter domain, and who functions as an informed "colleague" of the user.
It is probably best to make the presence/non-presence of on-screen agents a personal option of the user. However, in order to explore the particular case of the on-screen agent in the multimodal setting, we shall adopt the configuration where the on-screen agent is distinct from the system, i.e., has its own expertise re the display subject matter domain, and relates in a collegial way to the user. (This is the outlined option in the above diagram. The particular sub-option we shall be most concerned with is the case where there is no on-screen system agent, although we may have occasion to introduce an on-screen agent which does in fact embody "the system," where "the system" means the display as encountered by the user.)

User/Agent: collegial
• suggestions
• requests for opinion
• cautions, caveats
• other??
User/Graphic Display: master/servant
• commands
• inquiries
• other??
Agent/Graphic Display: master/servant
• commands
• inquiries
• other?? (interlocutor for system..??)
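The three relationships listed above can be written down as a small lookup table. The following is only a sketch: the names `Party`, `RELATIONS`, and `is_listed_act` are hypothetical, the user/agent relationship is assumed to be symmetric, and the open "other??" entries are deliberately not modeled.

```python
from enum import Enum
from typing import Dict, Set, Tuple


class Party(Enum):
    USER = "user"
    AGENT = "agent"
    DISPLAY = "graphic display"


# Kinds of acts taken from the outline above; "other??" is left open.
COLLEGIAL_ACTS = {"suggestion", "request for opinion", "caution"}
SERVANT_ACTS = {"command", "inquiry"}

# (sender, receiver) -> (relationship, acts listed for that relationship).
# Assumption: the collegial user/agent relationship runs in both directions.
RELATIONS: Dict[Tuple[Party, Party], Tuple[str, Set[str]]] = {
    (Party.USER, Party.AGENT): ("collegial", COLLEGIAL_ACTS),
    (Party.AGENT, Party.USER): ("collegial", COLLEGIAL_ACTS),
    (Party.USER, Party.DISPLAY): ("master/servant", SERVANT_ACTS),
    (Party.AGENT, Party.DISPLAY): ("master/servant", SERVANT_ACTS),
}


def is_listed_act(sender: Party, receiver: Party, act: str) -> bool:
    """True if the outline lists this kind of act for this sender/receiver pair."""
    relation = RELATIONS.get((sender, receiver))
    return relation is not None and act in relation[1]
```

Note that the table has no entry for the user commanding the agent: commands go to the display, not to the colleague.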
The graphics display, including 3-D audio, is the "situation" shared by the user and any on-screen agent(s).
The on-screen agent is not the "system," but is, as is the user, a collegial "onlooker" with respect to the visual and audio situation.
The user and the on-screen agent can command the system to do this or that. The command can be given directly to the system, as in the user (or on-screen agent) looking/pointing at some item and saying "Turn that this way (making an iconic gesture)." The user or agent may also refer back to some prior act, by user or on-screen agent, and say something like "...do it...".
In any event, the user or on-screen agent is addressing the system as "agency," rather than as a visible persona, that is, yet another on-screen agent representing "the system."

Note that the user can also look toward ("visually address") their own near-body space, as well as that of the agent:

Places where the user can look (visually "address"), cont.:

• at agent:
    • at agent face
    • at agent's gestural space
• at own gestural space
• at graphics display:
    • at items, areas, etc. in visual scene
    • at apparent sources of 3-D audio
• "Elsewhere"
The agent has a similar range of visual options, including looking toward its own near-body (gestural) space, or that of the user.

The agent's near-body space (gestural space) is a kind of display space.
In the same sense that the user's gesture and speech relative to some item on the graphics display can introduce a variant upon that object, the agent's speech and gestures may describe a "virtual" variant of some item or aspect of the situation, which is not depicted upon the graphics display but, if anywhere, is realized in the user's imagination by way of visual (and/or auditory) imagery. This "suggestion" by the agent becomes part of the entire dialog between user and agent for purposes of maintaining a "history" of the exchange.
Note that the agent—with its graphical eyes—looks both at its own gestural space, particularly when uttering "...like this," as well as at the user to establish "eye contact." The agent, via the system's eyetracker, should also confirm that the user is attending to (i.e., looking at) the right places.
The orchestration of the agent's graphical eyes and user's point-of-gaze as revealed through eyetracking is both subtle and important for issues of gaze and mutual gaze. For instance, from the standpoint of the user, can the agent still see (via the eyetracker) where the user is looking, even while its graphical eyes are trained elsewhere? It is, of course, legitimate for the agent via its graphical eyes to look at the user's hands, as it can "see" them by means of the gloves; it is also legitimate for the agent to look toward the display screen (or source of 3-D audio), as all that is "public", even though the agent necessarily knows it only via a privileged link to the system.
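The distinction between where the agent's graphical eyes are drawn and what the agent knows through the eyetracker can be made explicit in a small data model. This is a sketch only: `GazeTarget`, `GazeState`, `mutual_gaze`, and `user_attending` are hypothetical names, and the code does not settle the question of how the situation should appear to the user; it merely keeps the two channels separate.

```python
from dataclasses import dataclass
from enum import Enum, auto


class GazeTarget(Enum):
    """Places a participant can visually 'address' (illustrative names)."""
    OTHER_FACE = auto()           # the other party's face
    OTHER_GESTURE_SPACE = auto()  # the other party's near-body space
    OWN_GESTURE_SPACE = auto()    # one's own near-body space
    DISPLAY_ITEM = auto()         # item or area in the visual scene
    AUDIO_SOURCE = auto()         # apparent source of 3-D audio
    ELSEWHERE = auto()


@dataclass(frozen=True)
class GazeState:
    user_gaze: GazeTarget   # sensed through the eyetracker
    agent_eyes: GazeTarget  # where the agent's graphical eyes are drawn


def mutual_gaze(state: GazeState) -> bool:
    """Eye contact: each party is looking at the other's face."""
    return (state.user_gaze is GazeTarget.OTHER_FACE
            and state.agent_eyes is GazeTarget.OTHER_FACE)


def user_attending(state: GazeState, expected: GazeTarget) -> bool:
    """Check the tracked user gaze against where attention is wanted.

    Only the eyetracker reading is consulted, so the result does not
    depend on where the agent's graphical eyes happen to be pointed.
    """
    return state.user_gaze is expected
```

For example, while the agent says "...like this" and looks at its own gestural space, `user_attending(state, GazeTarget.OTHER_GESTURE_SPACE)` reports whether the user is watching the gesture.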
The user agrees with the agent. The user makes the change, thus:

The change in the display graphic is a function of the former state of the Object Knowledge Base (OKB) and the operation upon it requested by the user.
In the above instance, the change is explicitly indicated by the user. An alternative is that the user "assents" to the suggestion of the agent and "the system" carries it out, thus:

In either case, the graphics display is altered as a function of its former contents and the operation of "widening".
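The idea that the new display is a function of the former OKB state and the requested operation can be sketched as a pure state transition. Everything named here (`SceneObject`, `widen`, `next_state`, and the dictionary representation of the OKB) is an assumption made for illustration, not a description of the actual OKB.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class SceneObject:
    name: str
    width: float
    length: float


# Assumption: the OKB is modeled as a mapping from object name to object.
OKB = Dict[str, SceneObject]
Operation = Callable[[OKB], OKB]


def widen(name: str, factor: float) -> Operation:
    """Build a 'widening' operation for the named object."""
    def apply(okb: OKB) -> OKB:
        updated = dict(okb)  # copy, so the former state is left untouched
        obj = okb[name]
        updated[name] = replace(obj, width=obj.width * factor)
        return updated
    return apply


def next_state(okb: OKB, operation: Operation) -> OKB:
    """New OKB state = operation applied to the former state."""
    return operation(okb)
```

Whether the user states the change explicitly or merely assents to the agent's suggestion, the same operation would be applied; only the party who originated it differs.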

"Ideation": planning; "brainstorming"; shaping new concepts, or amendment to old ones
• displacement of concepts
• setting forth original ideas
• amending these initial representations of the ideas
could be:
• planning family vacation
• planning architectural addition
• designing device
Some have raised the issue whether there ought not to be an on-screen agent, or "face," on the view that the presence of such confers no added value to the situation for the user, and that any explanations of system actions or responses could as well, if not better, be carried out by animations, narrations, etc., that is, by multimedia. They further point out that, in the field of multimedia output, the classic problem is "...how best to represent this or that point/issue to the user..."
Indeed, it is hard to make the case for the on-screen agent. To say that it is more natural is to beg the question. I have always avoided the issue of what is natural on the input side, merely noting that when people are together, they do behave this way. Whether this is the more natural can be seen as moot: e.g., maybe telepathy would be more natural. People just do act this way... the pragmatic approach.
For purposes of communicating with you, you are where your body is. When I want to talk with you, I have either to go where your body is, or use some kind of "tele" technology—tele-phone, tele-vision, tele-conferencing—to bridge the distance to where your body is. Getting "in touch" with you always translates into some such concrete action.
The situation with a computer agent is the same. When I want to communicate with a computer-resident agent, I must first establish a physical link: I must go to the computer, or bring it to me. We need some kind of technology to establish a physical connection between ourselves and the computer wherein the agent resides. That could be by going to where the computer is (my office?) or by bringing the computer to me (using a mobile monitor?).
Given that the body is where I may find you or anyone else with whom I want to communicate, why might we want ... The aspects of the face and body that are functionally relevant for our purposes are as follows. For the face, these are the eyes, the lips, and head attitude. For the body, these are body attitude, head attitude, and hand gestures.
If, as I argued before, the essence of dialog is that we are always referring someone to something, then it makes sense to have those someones and somethings occupy a distinct locale in space. One model of human/computer dialog is to have the user (or users) communicate with an off-screen, "omnipresent" agent: the "system." When the user speaks to the system they are addressing an off-screen agency which is not "bodily" present. This model ought to work well where on the system's side of the exchange there is essentially a unified "viewpoint." An instance of this might be when the user is
Suppose the user is engaged in designing the layout of furniture in a room, a kitchen, say. There are not a lot of "interests" operating in the dialog situation: the user merely wants to place the table in this spot or other, to put a chair here or there, to try out different window arrangements, and the like. In this instance, the system "agent" could well be invisible. The things the user wants done are simply done. When the agent speaks to the user, to complain or to give advice, it speaks out as a single entity. In contrast, where the user is considering the layout of a more complex situation, or a situation with more dimensions, one where there are inherently, or by virtue of the situation, more levels or aspects, then the overall dialog on the side of the system might better be articulated by means of two or more agents.
Several on-screen agents might, for instance, be useful in representing the multiplicity of "interests" or "demands" of the design of an addition to a house. The user is considering expanding the north wall of their house about forty feet, adding several new rooms, with a porch and upstairs space. There are inherently several facets of the addition to be considered: the overall design, the electrical wiring, the plumbing, the costs, the zoning laws of the local community, and so forth. These various aspects of the addition might well be represented each by its own special agent, rather than as multiple concerns residing in a single on- or off-screen agent.
Suppose the user stipulated a second bath to be situated in such-and-such a spot on the developing plan. ...where there is not inherently a multiplicity of "viewpoints" on the side of the system.
The number of entailments of an issue (what I have been calling "viewpoints") can vary as the complexity of the issue waxes and wanes over time, as problems and subproblems arise and are resolved. An issue, such as "Where will we spend our summer vacation?" may arise. At first it is unitary: where shall we go. But it may quickly become more complex as each member of the family expresses their preference. Complications may set in: where do we board the dog? Who will house-sit? What about conflicts with kids' summer camps, etc., etc. One can imagine a single soap-bubble splitting and dividing, representing the increasing differentiation and growing complexity of what was originally a single question, the constituent bubbles being burst one-by-one as intra-familial discussions (and arguments!) become resolved.
Is it better to have a single agent, whether on- or off-screen, hold and represent the various aspects of a situation, or to have multiple agents, each assigned to, representing, and advocating some aspect of the overall issue? There is empirical research done some years ago that speaks to this issue, that of Douwe Yntema and others, about whether we can more readily deal with a small number of variables which can assume many states, or with a larger number of variables each of which represents fewer states. If we are surrounded with a set of people with like mind, we find ourselves overwhelmed with dealing with people rather than issues; if we are confronted with a single person who is full of multiple attitudes and opinions, it becomes hard to keep track of what that person's attitude is on any one of them. A situation less difficult to deal with is one where the issues or aspects of a situation are more equably dispersed amidst, and assigned to, specific agents or persons.
Consider the automobile. Basically, it consists of four wheels, engine, steering wheel, brakes, headlights, and so on. These are the functional parts of the car; you need them if you are to have an automobile.
However, few of us are satisfied with just the basics. More often than not we don't just buy the basic model. We want some comforts and conveniences: air-conditioner, tilt-back seats, a radio with tape deck, maybe a CD player. We like some styling: maybe a convertible, with racing stripes, and wheel-covers. We don't just drive a car to get from point A to point B. We like to get there in a certain style. We don't just drive a car, we make a statement. And, people tend to judge us somewhat on the kind of car we drive. A station wagon means married with kids, stable, dependable. A two-seater sports car means single and adventurous.
The same point could be made about most of the material possessions we care about. We can live in a plain house with a roof, walls, windows of basic materials and nondescript design, or we might live in a house with "character," constructed of special materials and being designed by a name architect. We either have the basic, generic model, or something with a bit extra which makes it interesting in some way, as well as specializing it in terms of our perceptions and assumptions about its owner.
What is true of automobiles, houses, and whatever else is also true of the face. To be sure, we need the basics: eyes, nose, mouth, ears, etc. But we are a bit more pleased with ourselves if we are, in addition, fortunate enough to have some "looks." We have more confidence, and have better expectations that we will be well-received. Conversely, we tend to react more positively to people who are attractive. Systematic studies have shown that, other things being equal, people feel generally more well-disposed to others judged to be attractive than to others who are judged to be of only average or plain looks.
The point is that we tend to react to things. We react to things on the basis of the associations and memories we have about those things, as well as upon the associations that the specific things elicit in us. Consider the case of a clerk at a store counter. If the person is especially good-looking (or ugly), flirtatious (or argumentative), we tend to have a strong reaction to them rather than to the business at hand (purchasing some item). Sometimes the reaction to the person can be so pronounced as to swamp all interest in whatever it was that we were trying to do. If, on the other hand, the counter person is of "average" appearance, is of "neutral" affect, neither over-friendly nor aloof, then we tend to focus on our errand and not the person who is there to help us. They tend to "fade into the woodwork," letting us tend to business and not to personalities.
These same considerations operate at the system interface as well. If an interface has too much "personality," it can get in the way of our sticking to the work at hand. Too much "personality" can range from a sticky "e" key on a keyboard interface, which key distracts us from focusing upon whatever we'd like to type in, to a chatty, impish on-screen interface agent character who has so much going on in their behavior that we are perpetually distracted from whatever practical task it is that we are trying to accomplish.
Some commentators on the interface have pointed out how a natural language interface can potentially mislead the user into assuming the interface has much more sophistication than it in reality has. That is, an interface that responds to you in well-formed sentences seems as well to carry the implication that it is generally smart and can do a lot more besides.[1] Such "overblown expectations" can well mislead the user into making more of the system than it has to offer.
Perhaps the best way to deal with this specious setting up of false expectations is to squelch it at the onset, or, at the very least, to try not to set up false expectations. This means giving the on-screen agent the minimal functional face needed to subserve communication, but nothing to set up expectations beyond that.
Does this mean that the face of an on-screen intelligent agent need always be minimalist? No. As acquaintanceship proceeds, the agent's face could become more differentiated, either through actions by the user or by system-resident procedures. The point in time where minimalism is most critical is early on, in the first encounter with this particular agent.
The utterance of the user is processed across a set of levels:
• syntactic: is it well-formed?
• semantic: is it meaningful in the current context?
• pragmatic: can it be done? (or answered, if a question?)
• advisability: if something to be done, is it "advisable" to do?
The processing of the first three levels is done by the System; the last by the "agent."
The system output, which is done in a multiMEDIA style without the use of a "face," has to do with clarifying syntactic, semantic, and pragmatic issues with the user's utterance. Where the on-screen agent comes into play is at the level of "advisability" of the input: the agent processes the change in the "situation" just stipulated by the user against its internal set of criteria and either concurs with the user's input (with some kind of affirmative reaction such as "unh-huh's," or possibly something more elaborate, like "That is a good place to put that chair...") or cautions the user about it.
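The division of labor across the four levels can be sketched as a routing step: each level has a check, the checks run in level order, and the first complaint is attributed to the system (multimedia) or to the agent (multimodal). The names `Level`, `Response`, `Check`, and `process` are hypothetical, and the checks themselves are placeholders to be supplied by the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Level(Enum):
    SYNTACTIC = 1     # is it well-formed?
    SEMANTIC = 2      # is it meaningful in the current context?
    PRAGMATIC = 3     # can it be done (or answered)?
    ADVISABILITY = 4  # is it "advisable" to do?


# The first three levels belong to the system; the last to the agent.
SYSTEM_LEVELS = {Level.SYNTACTIC, Level.SEMANTIC, Level.PRAGMATIC}


@dataclass(frozen=True)
class Response:
    level: Level
    responder: str  # "system" or "agent"
    style: str      # "multimedia" or "multimodal"
    message: str


# A check returns a complaint string, or None if the utterance passes.
Check = Callable[[str], Optional[str]]


def process(utterance: str,
            checks: List[Tuple[Level, Check]]) -> Optional[Response]:
    """Run the checks in level order and route the first complaint."""
    for level, check in sorted(checks, key=lambda pair: pair[0].value):
        complaint = check(utterance)
        if complaint is not None:
            by_system = level in SYSTEM_LEVELS
            return Response(
                level=level,
                responder="system" if by_system else "agent",
                style="multimedia" if by_system else "multimodal",
                message=complaint,
            )
    return None  # nothing to clarify and nothing to caution about
```

An agent-level response here is advice rather than a refusal; the user remains free to proceed.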
In addition to the Object Knowledge Base (OKB) representing "the situation", we have agent-specific knowledge resident within the one or more on-screen agents, e.g., the plumber, the architect, the electrician, etc. The knowledge of the agents concerns the advisability, in context, of the user doing this or that, e.g., making the table that long, putting that many doors in one wall, etc., etc. The agent is there to help the user, by way of timely advice; that is, the spirit is one of collegiality. The user can always do what the user wants to do, that is, can always override the agent's advice.
For the Agent Knowledge Base (AKB), we need to develop a system of representing such knowledge, which system is compatible with the representations in the OKB. A basic tenet here is that knowledge is always rooted in the "concrete" aspects of the situation, or may be linked thereto.
All AKB concerns have to be:
1) about concrete features of the situation, e.g., "...all surfaces have to be 'golden sections'...".
OR
2) about factors rooted in concrete features of the situation, e.g., "...the cost is exceeded if floor area over X sq feet, given $X/sq ft, and a budget cap of $X dollars...".
All abstract concerns, e.g., "cost overrun," have to be rooted somehow in concrete 3-D features of the situation.
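As an illustration of rooting an abstract concern in concrete features, the "cost overrun" example can be computed directly from floor areas. This is a minimal sketch under stated assumptions: `Room` and `BudgetConcern` are hypothetical names, rooms are treated as rectangles, and the rate and cap are supplied by the caller rather than taken from any real AKB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Room:
    """A concrete feature of the situation (assumed rectangular)."""
    name: str
    length_ft: float
    width_ft: float

    @property
    def floor_area_sq_ft(self) -> float:
        return self.length_ft * self.width_ft


@dataclass(frozen=True)
class BudgetConcern:
    """An abstract concern ("cost overrun") tied to concrete floor area."""
    cost_per_sq_ft: float
    budget_cap: float

    def evaluate(self, rooms: List[Room]) -> Optional[str]:
        """Return a caution if the estimated cost exceeds the cap, else None."""
        total_area = sum(room.floor_area_sq_ft for room in rooms)
        cost = total_area * self.cost_per_sq_ft
        if cost > self.budget_cap:
            return (f"estimated cost ${cost:,.0f} exceeds the budget cap "
                    f"of ${self.budget_cap:,.0f}")
        return None
```

Because the concern is evaluated over the same objects the display depicts, the agent can point at the rooms responsible when it voices the caution.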
One-way interactions:
user-agent: collegial
user-system: master/servant
agent-system: master/servant
Two-way interactions, between the party on the left and the other two seen as a combination:
user: agent/system (user is "onlooker")
agent: user/system (agent is "onlooker")
system: agent/user (system is "servant" for either)
The agent:
- is there to help the user vis-à-vis the user's task, mainly by timely comments on the possible consequences of the actions the user stipulates
- is NOT the "system"
The agent is there to help the user as it may vis-à-vis whatever the user is doing in the context of the situation. This help consists mainly of telling the user about the advisability of doing this or that when the user has given a situation-changing command the results of which interact with domain-specific criteria resident in the agent. For example: the user asks that a table be lengthened; the system does that (lengthens the table); the agent checks the results of this action against its internal criteria, one of which is relevant to table lengths, and finds that the new length violates some criterion, such as the tabletop being no longer a "golden section"; the agent then tells the user, via concurrent words, looks, and gestures, that "...The table (looking/pointing at the relevant table) will not look as attractive if the top (glancing back at the user) is lengthened like that (iconic gesture of "lengthening")...".
NOTE: If the user had stipulated the table to be lengthened such that it would no longer fit in the room (a pragmatic concern, as opposed to violating an aesthetic "advisability" criterion of the agent about golden sections), it is the system, not the agent, that would complain (voice or text: "Table would be too long to fit," with graphic highlighting or animation to make the point).
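The table example can be traced in a few lines: the system applies a pragmatic "does it fit" test, and only if the command is carried out does the agent compare the result with its golden-section criterion. The names, the tolerance value, and the simple axis-aligned fit test are all assumptions for illustration; in particular, the sketch assumes the system does not carry out a command that fails the pragmatic test.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2  # about 1.618


@dataclass(frozen=True)
class Table:
    length: float
    width: float


@dataclass(frozen=True)
class Room:
    length: float
    width: float


def system_pragmatic_check(table: Table, room: Room) -> Optional[str]:
    """System level: can the command be carried out at all?

    Simplification: axis-aligned fit only, ignoring rotation and other furniture.
    """
    fits = table.length <= room.length and table.width <= room.width
    return None if fits else "Table would be too long to fit"


def agent_advisability_check(table: Table, tolerance: float = 0.05) -> Optional[str]:
    """Agent level: is the tabletop still close to a golden section?"""
    ratio = max(table.length, table.width) / min(table.length, table.width)
    if abs(ratio - GOLDEN_RATIO) > tolerance:
        return "The table will not look as attractive if the top is lengthened like that"
    return None


def lengthen(table: Table, new_length: float,
             room: Room) -> Tuple[Table, Optional[str], Optional[str]]:
    """Return (resulting table, responder, message) for a lengthening command."""
    candidate = Table(new_length, table.width)
    complaint = system_pragmatic_check(candidate, room)
    if complaint is not None:
        return table, "system", complaint  # assumed: command not carried out
    advice = agent_advisability_check(candidate)
    if advice is not None:
        return candidate, "agent", advice  # change made; agent comments afterward
    return candidate, None, None
```

The order matters: the agent comments on a change that has already appeared in the display, whereas the system's complaint concerns a change that cannot be made.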
The on-screen agent(s) is NOT the system. The system response modality is multiMEDIA, not multiMODAL, as with an agent.
User/agent "dialog" occurs primarily when the user and the agent negotiate a point about the "free factors" of the situation, e.g.,
- user stipulates change X
- agent advises user re consequences of doing X
- user accepts or rejects agent's advice
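The three-step exchange above can be recorded as entries in a running dialog history. This is a sketch with hypothetical names (`Exchange`, `Dialog`, `negotiate`), and it assumes the agent's advice is a caution against the change, so that accepting the advice withdraws the change and rejecting it lets the change stand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Exchange:
    change: str                       # what the user stipulated
    advice: Optional[str]             # the agent's caution, if any
    accepted_advice: Optional[bool]   # None when the agent gave no advice


@dataclass
class Dialog:
    history: List[Exchange] = field(default_factory=list)

    def negotiate(self, change: str, advice: Optional[str],
                  user_accepts: bool) -> bool:
        """Record one exchange and return True if the change stands.

        The user always has the last word: rejecting the advice
        overrides the agent and keeps the change.
        """
        if advice is None:
            self.history.append(Exchange(change, None, None))
            return True
        self.history.append(Exchange(change, advice, user_accepts))
        return not user_accepts
```

Keeping the rejected advice in the history lets a later reference such as "...do it..." be resolved against what was actually said.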
The agent combines:
• topic-independent dialog expertise, the psycho-socio-linguistic skills of the sender of messages
• application-specific knowledge
• The agent is NOT the "face" of the system
- design of topic-independent dialog expertise (TIDE) feature of the agent; representations of the domain-specific knowledge of the agent, plus relationship to OKB
- creation of the agent software
- the user, when giving commands to the system, ordinarily "addresses" the situation; the user's outputs are assertions or questions about the situation
- output back to the user re the syntactic, semantic, and pragmatic aspects of the user's input multimodal statement is done by the system (not the on-screen agent), and is multiMEDIA not multiMODAL in nature (i.e., graphics, animations, highlighting, etc., with speech output)
- output back to the user re "advisability" issues is from the on-screen agent as representative of application domain knowledge
[1] Norman, Donald A. "How might people interact with agents." Communications of the ACM, special issue on intelligent agents, July 1994, 68-71.