The Ultimate Interface, continued



People-literate computers

The development of sensing technologies to support basic recognition of speech, gesture, and gaze continues apace. Meanwhile, the concurrent challenge is to create software to support multimodal natural dialog. Development of such software intelligence surely will occur, rooted as it is in the nature of people and their ordinary styles of expression.

Topic-independent dialog expertise

One of the salient qualities of human dialog is how readily people use combinations of speech, gesture, and gaze regardless of the topic under discussion. Consider a human conversing about last summer's vacation, the new car they just bought, how to wash a dog, whatever. Across subject matter, and independent of the specific topic, they exhibit great practical skill in expressing their thoughts and in dealing with whomever they are talking to.

This remarkable skill--let us call it "topic-independent dialog expertise," or TIDE for short--seems to have two sides. One side is the ability of the sender of messages to decompose their intended message and distribute its expression across the modes of speech, gesture, and gaze. The other, complementary, skill is that of the receiver of messages to parse, interpret, and integrate outputs in speech, gesture, and gaze from the speaker or "sender."

Sender skills

The elements of such sender skills are basic: breaking eye contact when you want to speak; noting whether the other person is looking in the right spot when you point something out to them; describing things and events with your hands--directions to your house or the size of the fish you caught.

Much of such topic-independent dialog expertise (TIDE) is necessarily rooted in conversational conventions, many of them straightforward and seemingly mundane.

These are the traits possessed in great measure by managers, interviewers, teachers--and especially by professionals whose main task is to build rapport and to motivate. They are for the most part unconscious, having been honed through much practice in dealing with others.

Receiver skills

Receiver skills include the ability properly to associate gestures with speech, in particular to ascertain when a gesture is "meaningful," as well as determining in a room full of people that a remark is targeted to oneself, to someone else, to some, or to all.

Receiver skills also seem to include the ability to help orchestrate the dialog by projecting the appearance of being alert and interested, perhaps through posture and eye contact, as well as being able appropriately to send signals that one "wants the floor." (Cf. Duncan, 1972.)

Dialog skills for machines

In attempting to endow the computer with such sender and receiver skills, the aim ought to be--on my view at least--not to simulate a person, but rather to support and sustain dialog with a human user. That is, I see the issue not as how people function, but as what the machine must be able to do in order to maintain a dialog relationship with its human user.

That is, irrespective of why and how the mechanisms in people which underlie their psycho-social-linguistic behavior work, the challenge is: how can we make the computer behave in a comparable fashion, and thus behave in a manner compatible with human assumptions and expectations about dialog?

Is TIDE "real"?

The display of skill by people in managing multimodal dialog across widely varying subject matter implies that that skill--for people, at least--is independent of any specific topic. At the same time, which specific words we use are dictated by what we intend to talk about; we don't say "dog" when we want to refer to a door. And, when we are using our hands to indicate how two automobiles collided, we use our hand actions to portray the paths of the vehicles, and not to mimic the actions our hands would take when scrubbing down the family dog.

We may speculate, then, that there resides in people at a general, abstract level the ability to orchestrate together the modes of speech, gesture, and gaze no matter what the topic, including the ability readily to invent and extemporize novel gestures--for example, to discuss things and actions we have encountered for the first time. This ability seems as well to exist along with collections of specific templates, schemata, or whatever for each and every topic we might discuss.

Domain-specific vs. topic-independent knowledge

The supposed generalized multimodal ability of the person to refer another person to some aspect of their mutual surround is over-learned, well-practiced, and for the most part proceeds without undue effort. It is possible that multimodal articulatory skills are rooted in a generalized knowledge about space and geometry, and these general skills or notions are applied to whatever items or topics are under discussion. In this way, the articulatory skill could seem to be independent of subject matter, yet be rooted not in articulatory skill as such but in the generalized familiarity with space and spatial concepts that underlies it. On this view, spatial knowledge and articulateness mediate the exposition of, and reference to, concrete items.

On the other hand, it is possible to hold the view that the way in which a person discusses, for example, the construction of a house or the decorative layout of a room, is driven primarily or exclusively by the nature of the material under discussion: that the way in which people talk about some topic is inherent in the topic. It is an observational fact that people can and do talk about, refer to, and articulate through combinations of concurrent speech, gesture, and gaze a great many topics; they are even able to articulate their non-knowledge, or lack of expertise about something or some topic, using combinations of speech, gesture, and gaze. That is, they are able to discuss their lack of subject matter knowledge in an articulate way! This observation in particular belies the assertion that articulatory skill concerning specific topics resides primarily or exclusively in the subject matter itself.

Additionally, it could be the case that the generalized ability of people to talk about the objects around them is rooted as well in an unconscious but highly developed knowledge about how other people perceive things. That is, in showing something off, certain simple but highly important presentational principles need to be observed. For example, the item to be shown off must be held up in clear view of the person to whom it is being described. The item should be held at an angle of view that exhibits maximal information. In showing off to someone else a new type of pencil, we do not hold it end-on to the person so that all they see is its circular end; rather, we hold it at an angle so that the person sees a "three-quarter" view of the pencil, and in one glance can take in the top end of the pencil and its barrel. In addition, we may turn the pencil about, so that the viewer can look at it from several aspects. Beyond such spatial aspects of expository skill, there are temporal aspects: we turn the pencil slowly so that its features can be appreciated. Another temporal aspect is that we synchronize our narration with our physical handling of the pencil so that whatever aspect of the pencil's design we are describing is apparent to the viewer as we talk about it.

We have a skilled sense of looking as well. We glance back and forth between the item being described--here, a pencil--and the face and eyes of the person to whom we are describing it. This pattern of looking is unconsciously and spontaneously orchestrated. We look toward the pencil when we are making some detailed point about it--a signal to the other person that they also should look there because so doing will aid their understanding; and, we glance directly at the other person by way of sustaining the interpersonal dialog linkage with them.

Further, having made some point about the item being shown off, all the while looking between it and the person, we can observe whether or not the person is (or had been) looking in the right places. That the other person be at least looking in the appropriate direction when we are explaining some aspect of some item (or situation) is a minimal condition for their understanding of what is being shown off, or, at least, is a minimal assurance to us that the other person is paying attention, and that we may feel confident enough in their monitoring of our exposition or explanation to proceed further.

Suppose there are some details to the item we are showing off to which we wish to draw particular attention. In the case of the pencil, there may be some functional or decorative detail toward the pencil's cap that we wish to emphasize. We point with our finger to that detail to highlight it. That is, we somehow intuitively know that, in order to draw the listener's attention to that bit of detail, we must lead and direct their attention through the use of overt signals that "point the way" out. In other words, we implicitly know something about the other person as "knower": we have, as it were, a model of how other people in general assimilate and take up information, and we now act toward this particular person in terms of that model as applied to the item to be shown off. That model may have, in addition to general assumptions about how people assimilate information, bits of information concerning the cognitive style of the particular person we are currently dealing with; thus, our model may be both generic and person-specific.

Is dialog expertise truly topic-independent?

The puzzle persists. Does there exist, in people, an expository expertise apart from the item or topic to be shown off? Or, is the exposition or "showing off" of something driven by the inherent nature of the object or activity itself?

The answer to the first question would seem to be a clear "yes" for such conversational conventions as "turn-taking": the fact that people tend to talk one after the other, and not simultaneously, as well as tending not to interrupt one another. It is less clear concerning the referral of someone to this or that aspect of an item or topic. It may also be the case that any referential act becomes more efficient the more specific the domain. On the other hand, different people are more or less adept at explaining the nature of something; and, people can improve with practice their ability to instruct someone about something. The true answer may be that "dialog expertise" is a generalized ability whose exercise is tempered and shaped by the specific content under discussion. If so, this suggests that the skill might to some extent be built into computers, and that the specific subject matter can be modular.

Endowing the machine with dialog expertise

If TIDE is real, then there is a dichotomy between the generalized dialog skill possessed by the person--and potentially by the computer--and domain-specific knowledge, that is, knowledge about the things a person, or a machine, may talk about. If domain-specific knowledge can be separated from system dialog skills, then we have the possibility of a generalized, non-specific kernel program operating upon whatever topic might arise.

Put another way, if TIDE--at least with people--is real, this encourages the concept of a generalized program containing dialog skills, such a program to operate in conjunction with a range of modules relating to specific domains of discourse. This view is not dissimilar to the proposition that I can in general walk and leap about, but precisely where I walk and leap about can be in my back yard, a parking lot, a baseball diamond--wherever, and the specific locale shapes and modulates the manner in which I walk and leap therein.
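The separation proposed above, a general dialog kernel working with interchangeable domain modules, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Utterance, DomainModule, BlocksWorld, and DialogKernel are invented here and do not describe any existing system, and the "dialog skill" shown is reduced to routing an utterance to whichever module knows the item being looked at.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Utterance:
    """One multimodal input, assumed already recognized by the sensing layer."""
    verb: str          # spoken action word, e.g. "turn"
    gaze_target: str   # name of the item the user was looking at


class DomainModule(Protocol):
    """Topic-specific knowledge: which items exist and what can be done to them."""
    def knows(self, item: str) -> bool: ...
    def perform(self, verb: str, item: str) -> str: ...


class BlocksWorld:
    """Hypothetical domain module for blocks on a graphics display."""
    def __init__(self) -> None:
        self.items = {"block-1", "block-2"}

    def knows(self, item: str) -> bool:
        return item in self.items

    def perform(self, verb: str, item: str) -> str:
        return f"{verb} applied to {item}"


class DialogKernel:
    """Topic-independent part: hands each utterance to the domain that owns the item."""
    def __init__(self) -> None:
        self.domains: list[DomainModule] = []

    def register(self, domain: DomainModule) -> None:
        self.domains.append(domain)

    def handle(self, utterance: Utterance) -> str:
        for domain in self.domains:
            if domain.knows(utterance.gaze_target):
                return domain.perform(utterance.verb, utterance.gaze_target)
        # A general skill that needs no domain: stating non-knowledge plainly.
        return f"I do not know what {utterance.gaze_target} is"
```

The kernel never mentions blocks; registering a second module for a different topic would require no change to it, which is the point of the dichotomy.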

Modular object-resident knowledge

From where might domain-specific knowledge arise? If the domain is that of concrete items and graphical, depictable spaces, and dynamic actions of items in those spaces, then much if not most of the domain-specific knowledge necessarily arises out of the items themselves, and their disposition in the user-viewable graphical space.

The most ready and plausible source of domain-specific knowledge lies in the objects themselves. Depictable things have dimensions, shapes, colors--that is, they have concrete attributes by means of which I may reference them. For instance, I point to some item on display, and say "...that." The general skill is one of pointing and uttering a non-specific pronoun (that). What makes it specific is the presence in a certain place and time of some item (to which I wish to refer you), that item existing in the same space and time that you and I share.

Consider the following. I speak to the computer's graphics display thusly: "Turn that (looking at some item, say, a rectangular block) around...like this (a two-handed turning gesture, as if turning a steering wheel)."

The item, by being "out there" and visible on the graphics display, permits my glancing at it to serve as an indication that--out of the several blocks or items on display--it is the one I wish something to happen to. The item, by the very fact of having a shape, of having dimensions, affords me "handles" in that its abstracted form serves as a model for how I might perform my gestures in order to "address" it.

The longish end-to-end shape prompts me to set my hands apart, palms facing, the space separating my hands as it were "holding" the item. I twist my hands. The direction of my twist gives the direction in which the display is to turn the item, and the amount the item turns is proportional to how much I twist my hands: a little, or a lot.

A more exact amount of turn could be given by speech, as in "Turn that (looking at item) ...like this (two-handed twisting gesture)...eleven and one-half degrees." The direction of twist is given in gesture, the amount of twist in speech.

Or, perhaps I say "Twist that (looking at the relevant item)...eleven and one-half degrees counter-clockwise," wherein which item is disclosed by eye, and the desired action is given entirely in speech.
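The three variants of the turning command can be read as one fusion rule: gaze supplies the item, and the direction and amount come from whichever of gesture or speech provides them. A minimal Python sketch follows. Everything in it is an assumption made for illustration: the names TurnCommand, fuse_turn, and GESTURE_GAIN, the sign convention (positive means counter-clockwise), and the precedence rule that an explicit spoken value overrides the gesture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnCommand:
    item: str
    degrees: float   # signed; positive = counter-clockwise (assumed convention)


# Assumed scale: degrees of on-screen turn per degree of hand twist.
GESTURE_GAIN = 1.0


def fuse_turn(gaze_item: str,
              gesture_twist_deg: Optional[float] = None,
              spoken_degrees: Optional[float] = None,
              spoken_direction: Optional[str] = None) -> TurnCommand:
    """Combine gaze, gesture, and speech into one turn command.

    gaze_item         -- the item the user was looking at ("that")
    gesture_twist_deg -- signed hand twist, if a gesture was made
    spoken_degrees    -- an exact amount given in speech, if any
    spoken_direction  -- "clockwise" or "counter-clockwise", if spoken
    """
    # Direction: use speech if stated, otherwise the sign of the gesture.
    if spoken_direction is not None:
        sign = -1.0 if spoken_direction == "clockwise" else 1.0
    elif gesture_twist_deg is not None:
        sign = 1.0 if gesture_twist_deg >= 0 else -1.0
    else:
        raise ValueError("no direction given in speech or gesture")

    # Amount: an exact spoken figure overrides the gesture's rough magnitude.
    if spoken_degrees is not None:
        amount = spoken_degrees
    elif gesture_twist_deg is not None:
        amount = abs(gesture_twist_deg) * GESTURE_GAIN
    else:
        raise ValueError("no amount given in speech or gesture")

    return TurnCommand(gaze_item, sign * amount)
```

Calling fuse_turn("block", gesture_twist_deg=-30.0) corresponds to the gesture-only case; adding spoken_degrees=11.5 gives the case where gesture sets the direction and speech sets the amount; passing only spoken_degrees and spoken_direction gives the all-speech case.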

Note that certain items, such as a dial set upon a flat panel, may have their degrees of freedom "preset," as it were, by their shape and physical context as determined by the object database that encodes them. The dial, for instance, may be turned to the left or right (counter-clockwise; clockwise). While a two-handed gesture, then, which operates like a bus-driver turning a steering wheel, can indicate both a direction of turn as well as a plane of turning, it is the direction of turn which would ordinarily be relevant where the item addressed already has its plane of turning fixed. In contrast, if the item were like a beach ball, that is, a globe-like object suspended in graphical space, it may well be that the plane of turning is relevant as well as the direction.
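The contrast between the dial and the beach ball can be sketched as a lookup against the object's stored degrees of freedom. The Python below is illustrative only: DisplayObject, TwistGesture, and interpret_twist are hypothetical names, and the object database is reduced to a single optional fixed_axis field.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class DisplayObject:
    name: str
    fixed_axis: Optional[Vec3]   # None means the object may turn about any axis


@dataclass
class TwistGesture:
    axis: Vec3        # axis implied by the plane of the two-handed turn
    clockwise: bool   # direction of the turn


def interpret_twist(obj: DisplayObject, gesture: TwistGesture) -> Tuple[Vec3, bool]:
    """Return the (axis, clockwise) pair the display should apply."""
    if obj.fixed_axis is not None:
        # Plane of turning is preset by the object; only direction is taken
        # from the gesture.
        return obj.fixed_axis, gesture.clockwise
    # Free object: both plane and direction come from the gesture.
    return gesture.axis, gesture.clockwise
```

The sketch does not address how "clockwise" should be re-expressed when the gesture's plane differs from the object's preset plane; a fuller treatment would need a rule for that.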

Putting TIDE into machines

Thus it seems that TIDE can be made available to machines through two complementary components: 1) a generalized repertoire of actions in speech and gesture resident in the user; 2) the properties presented to the user by some object, which collectively form the occasion for the user to speak and act in certain ways. The sum of such expressions by the user is subsequently analyzed by the machine, that analysis proceeding in the light of a) the properties of the object--including its size, shape, and color, and b) whatever process (e.g., "move," "delete," "twist," "color it," etc.) is stipulated by name by the user. The goal of the system is to form a complete reference to some object, and to an action to be performed upon that object.
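The stated goal, a complete reference to an object plus an action, can be pictured as filling two slots and checking whether both are filled. The Python below is a rough sketch under assumptions of my own: the names SceneObject, Reference, resolve, and KNOWN_ACTIONS are hypothetical, speech is taken as an already-recognized list of words, and gaze is taken as a list of candidate objects.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative action vocabulary, loosely following the examples in the text.
KNOWN_ACTIONS = {"move", "delete", "twist", "color"}


@dataclass
class SceneObject:
    name: str
    shape: str
    color: str
    size: str


@dataclass
class Reference:
    target: Optional[SceneObject]
    action: Optional[str]

    def complete(self) -> bool:
        return self.target is not None and self.action is not None


def resolve(words: List[str], gazed_at: List[SceneObject]) -> Reference:
    """Build an (object, action) reference from spoken words and gaze candidates."""
    # Action slot: the first spoken word that names a known process.
    action = next((w for w in words if w in KNOWN_ACTIONS), None)

    # Object slot: narrow the gaze candidates by each spoken word that names
    # a property (shape, color, or size) of at least one of them.
    candidates = gazed_at
    for w in words:
        matching = [o for o in candidates if w in (o.shape, o.color, o.size)]
        if matching:
            candidates = matching

    # Only an unambiguous candidate counts as a resolved referent.
    target = candidates[0] if len(candidates) == 1 else None
    return Reference(target, action)
```

When complete() is false, a dialog-capable system would presumably ask the user for whichever slot is missing rather than guess, though how that follow-up is conducted is outside this sketch.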


...to be continued...


References

Duncan, Starkey, Jr. Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology, 1972, Vol. 23, No. 2, 283-292.