WIRED 2.02 - Talking to Computers: Time for a New Perspective

N E G R O P O N T E

Message: 8
Date: 2.1.94
From: <nicholas@media.mit.edu>
To: <lr@wired.com>
Subject:

Talking to Computers: Time for a New Perspective

In contrast to the gain in graphical richness of computers, speech recognition has progressed very little over the past fifteen years. And yet, fifteen years from now, the bulk of our interaction with computers will be through the spoken word. It is time to move on this interface backwater and correct the fact that computers are hearing impaired.

In my opinion, the primary reason for so few advances is perspective, not technology. People have been working on the wrong problems and hold misguided views about the voice channel. When I see speech recognition demonstrations or advertisements with people holding microphones to their mouths, I wonder: Have they really overlooked the fact that one of the major values of speech is that it leaves your hands free? When I see people with their faces poked into the screen - talking - I wonder: Have they forgotten that the ability to function from a distance is a reason to use voice? In short, most people developing speech systems need a lesson in communications interfaces.

Speech Goes Around Corners
Using computers today is so overt that the activity demands absolute and full attention. Usually, you must be seated. Then you must attend, more or less exclusively, to both the process and content of the interaction. There is almost no way to use a computer in passing or to have it be one of several conversations. This is oversight number one.

Computing at and beyond arm's length is very important. Imagine if talking to a person required that his or her nose always be in your face. We commonly talk to people at a distance, we momentarily turn away and do something else, and it is not uncommon to be out of sight while still talking.

That is what I want to be able to do with a computer: have it be in "earshot." But this requires an aspect of speech input that has been almost totally ignored: sound separation and capture. It is not trivial to segregate speech from the sounds of the air conditioner or an airplane overhead. But such separation is crucial because speech has little value if the user is limited to talking from one noise-free place.

Aural Text
Oversight number two: Speech is more than words. Anyone who has a child or a pet knows that how something is said can be as important as what is said. In fact, dogs respond to tone of voice far more than they rely on any innate ability to do complex lexical analysis. I frequently ask people how many words they think their dogs know, and I have received answers as high as 500 to 1,000. I suspect the number is closer to 20 or 30.

Spoken words carry a vast amount of information beyond the words themselves, which is something that my friends in speech recognition seem to ignore. While talking, one can convey passion, sarcasm, exasperation, equivocation, subservience, exhaustion, and so on, with the exact same words. In speech recognition, these subcarriers of information are ignored or, worse, treated as bugs rather than features. They are, however, the very features that make speaking a richer medium than typing.

The Three Dimensions of Speech
Speech recognition can be viewed as a problem defined by three axes: vocabulary size, degree of speaker independence, and the extent to which words can be slurred together (their connectedness). Think of this as a cube whose lower left-hand near corner is a small vocabulary of totally speaker-dependent words that must be uttered with distinct pauses between them. This is the simplest corner of the problem space.

As you move out along any axis, making the vocabulary larger, making the system work for any speaker, or allowing words to be run together, speech recognition gets harder and harder for the computer. In this regard, the upper right-hand far corner of this cube represents the most difficult place to be. Namely, this is where we expect the computer to recognize any word, spoken by anybody, "inneny" degree of connectedness.

A common assumption has been that we must be far out on all three of these axes for speech recognition to be at all useful. I do not agree.

One might ask, when it comes to vocabulary size, how big is big enough: 500, 5,000, or 50,000 words? The question is wrong. It should be: How many recognizable words need to be in the computer's memory at any one time? This question suggests subsetting vocabularies, such that chunks can be folded into the machine as needed. When I ask my computer to place a phone call, my Rolodex is loaded. When I am planning a trip, the names of places are there instead. If one views vocabulary size as the set of words needed at any one time, then the computer needs to select from a far less daunting number of words; closer to 500 than to the superset of 50,000.
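The subsetting idea above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the task names, the word lists, and the ActiveVocabulary class are hypothetical, and fuzzy string matching from the standard difflib module stands in for real acoustic matching, which this sketch does not attempt.

```python
from difflib import get_close_matches

# Hypothetical per-task word lists; a real system would hold far more entries.
VOCABULARIES = {
    "phone": ["call", "dial", "hang up", "alice", "bob", "carol"],
    "travel": ["book", "flight", "boston", "paris", "tokyo"],
}


class ActiveVocabulary:
    """Keeps only the words for the current task available for matching."""

    def __init__(self, vocabularies):
        self.vocabularies = vocabularies
        self.active = []

    def load(self, task):
        # Fold in the chunk for this task, replacing whatever was loaded before.
        self.active = list(self.vocabularies[task])

    def recognize(self, heard):
        # String similarity is an assumed stand-in for acoustic similarity.
        matches = get_close_matches(heard.lower(), self.active, n=1, cutoff=0.75)
        return matches[0] if matches else None


recognizer = ActiveVocabulary(VOCABULARIES)
recognizer.load("travel")
print(recognizer.recognize("bostun"))
recognizer.load("phone")
print(recognizer.recognize("bostun"))
```

The same garbled input resolves to a place name while the travel list is loaded and to nothing at all while the phone list is loaded, because the matcher only ever searches the small active set.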

Looking at speaker independence: Is this really so important? I believe it is not. In fact, I think I would be more comfortable if my computer were trained to understand my spoken commands and maybe only mine. The presumed need for speaker independence is derived in large part from earlier days, when the phone company wanted anybody to be able to talk to a remote database. The central computer needed to be able to understand anybody, a kind of "universal service." Today, we can do the recognition in the handset, so to speak. What if I want to talk with an airline's computer from a telephone booth? I call my computer or take it out of my pocket and let it do the translation from voice to ASCII. Once again, we can do a great deal at the "easier" end of this axis.

Finally, connectedness. Surely we do not want to talk to a computer like a tourist addressing a foreign child, mouthing each word as if in an elocution class. Agreed. And this axis is the most challenging in my mind. But even here, there is a way out in the short term: Look at vocabulary as multiword utterances, not as just single words. These utterances can be short, slurred phrases of all kinds, which endow the machine with sufficient connected speech recognition to be very useful. In fact, handling runtogetherspeech in this fashion may well be part of the personalization and training of my computer.
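One way to picture multiword utterances as vocabulary items is the Python sketch below. It is a hedged illustration, not a recognizer: the PhraseVocabulary class, the squash helper, and the sample phrases are all hypothetical, and text with the spaces removed is an assumed stand-in for slurred audio.

```python
from difflib import get_close_matches


def squash(text):
    """Drop spaces and case so a phrase becomes one run-together token."""
    return "".join(text.lower().split())


class PhraseVocabulary:
    """Stores whole phrases as single entries instead of separate words."""

    def __init__(self, phrases):
        # Each phrase is indexed by its run-together form.
        self.by_key = {squash(phrase): phrase for phrase in phrases}

    def recognize(self, heard):
        key = squash(heard)
        if key in self.by_key:
            return self.by_key[key]
        # Tolerate small distortions; the 0.8 cutoff is an arbitrary assumption.
        close = get_close_matches(key, list(self.by_key), n=1, cutoff=0.8)
        return self.by_key[close[0]] if close else None


phrases = PhraseVocabulary(["call my office", "what time is it", "read my mail"])
print(phrases.recognize("callmyoffice"))
print(phrases.recognize("Whattimeizit"))
```

Because each phrase is matched as one unit, the sketch never has to find the boundaries between words, which is the hard part of connected speech that this shortcut sidesteps.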

My purpose is not to argue any one of these three points to death, but to show more generally that one can work much closer to the easiest corner of speech space than has been assumed and that the hard and important problems are elsewhere. Said in another way: It is time to look at talking from a different perspective.

Next Issue: Talking WITH Computers
