NEGROPONTE: Talking With Computers
"Okay, where did you hide it?"
"Well, where do you think?"
The scene comes from an MIT proposal on human-computer interaction submitted to ARPA twenty years ago by Chris Herot (now at Lotus), Joe Markowitz (now at the CIA), and me. It made two important points: speech is interactive, and meaning, between people who know each other well, can be expressed in a shorthand that would probably be meaningless to others.
It may be difficult for the reader to believe the degree to which speech I/O has been studied separately in the past. Like Benedictine monks, each research team developed and guarded a special voice input or output technique, rarely fussing over the conversational brew. Understanding speech as a component of a conversation is very different from understanding it as a monologue.
I have told the following story a million times (admittedly, a figure of speech!). In 1978, our lab at MIT was building a management information system for generals, CEOs, and 6-year-old children, namely an MIS that could be learned in less than ten seconds. As part of this project we received NEC's top-of-the-line, speaker-dependent, connected-speech-recognition system. Like all such systems, then and now, it was subject to error when the user showed even the lowest level of stress in his or her voice. Mind you, this would not necessarily be audible to you or me.
ARPA, the sponsors of that research, made periodic "site visits" to review our progress. On these occasions, the graduate students prepared what we thought were bug-free demonstrations. We all wanted the system to work absolutely perfectly during these reviews. The very nature of our earnestness produced enough stress to cause the system to crash and burn in front of the ARPA brass.
Like a self-fulfilling prophecy, the system almost never worked for important demos; our graduates were just too nervous and their voices reflected their condition.
A few years later, one student had an idea: find the pauses in the user's speech and program the machine to generate the utterance "ah ha" at judicious times. Thus, as one spoke to the machine, it would periodically say: ah hha, ahhh ha, or ah ha. This had such a comforting effect (the machine seemed to be encouraging the user to converse) that the user relaxed a bit more, and the performance of the system skyrocketed.
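The student's fix amounts to a simple back-channel loop: watch for pauses and answer each one with an encouraging cue. A minimal sketch in Python, assuming speech arrives as a stream of per-frame energy values; the threshold, frame counts, and function names are illustrative, not the lab's actual system:

```python
import random

# Illustrative back-channel cues, echoing the variants in the story.
BACKCHANNELS = ["ah ha", "ahhh ha", "ah hha"]

def find_pauses(energies, threshold=0.1, min_frames=3):
    """Return start indices of silent runs at least min_frames long.

    A frame counts as silent when its energy falls below the threshold.
    A sentinel loud frame is appended so a trailing run is flushed.
    """
    pauses, run_start = [], None
    for i, e in enumerate(energies + [threshold + 1]):
        if e < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                pauses.append(run_start)
            run_start = None
    return pauses

def backchannel(energies):
    """Pair each detected pause with a randomly chosen 'ah ha'."""
    return [(p, random.choice(BACKCHANNELS)) for p in find_pauses(energies)]
```

Detecting pauses by an energy threshold is the crudest possible approach; a real system would also weigh prosody and timing. But even this one-bit acknowledgment is enough to keep a speaker at ease.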
Our idea was criticized as sophisticated charlatanry. Rubbish. It was not a gimmick at all, but an enlightened fix. It revealed two important points: For one, not all utterances need have lexical meaning to be valuable in communications; for another, some utterances are purely protocols, like network handshaking. Think of yourself on the telephone. If you do not say "ah ha" to the caller at appropriate intervals, the person will become nervous and, ultimately, inquire: "Are you there?" You see, the "ah ha" is not saying "yes," "no," or "maybe," but is basically transmitting one bit of information to say, "I'm still here and listening."
The reason for revisiting this long story is that some of the most sophisticated people within the speech-recognition community failed to understand what I have just illustrated. In fact, in many labs today, speech recognition and speech production are still studied in different departments or labs! I frequently ask, "Why?" One conclusion is that these people are interested not in communication but in transcription. That is to say, people in speech recognition wish to make something like a "listening" typewriter that can take dictation and produce a document. Good luck! People are not good at that. Have you ever read a transcription of your own speech?
Instead of transcription, let's look at speech as an interactive medium, as part of a conversation. This perspective is well presented in the forthcoming book by Chris Schmandt entitled Voice Communication with Computers: Conversational Systems (Van Nostrand Reinhold, 1994).
Talking with computers goes beyond speech alone. Imagine the following situation. You are sitting around a table where everyone is speaking French, a language you do not speak. One person turns to you and says: "Voulez-vous encore du vin?" You understand perfectly. Subsequently, that same person changes the conversation to, say, politics in France. You will understand nothing unless you are fluent in French (and even then it is not certain).
You may think that "Would you like some more wine?" is baby-talk, whereas politics requires sophisticated language skills. So, obviously the first case is simple. Yes, that is right, but that is not the important difference between the two conversations.
When the person asked you if you wanted more wine, he or she probably had an arm stretched toward the wine bottle and eyes pointed at your empty wine glass. Namely, the signals you were decoding were parallel and redundant, not just acoustic. Furthermore, all the subjects and objects were in the same space and time. This is what made it possible for you to understand.
The point is that redundancy is good. The use of parallel channels (gesture, gaze, and speech) should be the essence of human-computer communications. In a foreign land, one uses every means possible to transmit intentions and read all the signals to determine even minimal levels of understanding. Think of a computer as being in such a foreign land, ours, and being expected to do everything through the single channel of hearing.
Humans naturally gravitate to concurrent means of expression. Those of you who know a second language, but do not know it very well, will avoid, if at all possible, using the telephone. If you arrive at an Italian hotel and find no soap in the room, you will go down to the concierge and use your best Berlitz to ask for soap. You may even make a few bathing gestures. That says a lot.
When I talk with my computers in the future, I will expect the same plural interface. If I do too much talking at one of my computers, I will not be surprised if it asks me one day, "Can we have a conversation about this?"
Next Issue: The Fax of Life
[Copyright 1994, WIRED Ventures Ltd. All Rights Reserved. Issue 2.03 March 1994.]