The Tyranny of Evaluation

Henry Lieberman

MIT Media Lab

 

Reject! This paper has no empirical evaluation of the user interface. Reject! The methodology of this evaluation is flawed. Reject! Too few subjects. Reject! How do we know that this interface is any good?

Most people in the user interface research community have heard (or voiced) these kinds of comments again and again in program committees, proposal evaluations, or reviews. Criticism of user interface research for lack of evaluation, or for flawed evaluation, is probably the single largest cause of mortality for papers and research projects. Some see this as a matter of maintaining high standards and making user interface research more scientific. Sure, it would be a great thing if we had a foolproof way of numerically measuring which user interface was better than which other one, so we could, in the words of Leibniz, "replace judgment with calculation".

The truth of the matter is that pretty much all of our methodologies for quantitatively evaluating user interfaces suck. Nobody wants to admit it. The evaluationistas would have you believe that their user interface experiments are every bit as definitive as Galileo dropping balls from the Leaning Tower of Pisa. User interface research has a bad case of physics envy.

First of all, for an experiment to yield a definitive result, all the variables need to be controlled. In Galileo's experiments, he was able to control the experiment such that the only variable that differed was the weight of the balls. Trouble is, people aren't balls. There are so many variables when presenting a user interface to someone that it is very difficult to make sure you've controlled all the relevant ones. There is no "ISO standard human".

Among the more difficult-to-control variables are:

Task. We gave them some task to do. Interface A was better than Interface B for the task. Therefore Interface A is a better interface. Whoa, wait just a second. Even if the task is fairly representative, how much variation is there between the tasks that the user might do over a long period of time? How much variation is there between users in the kinds of tasks they do? How do your results generalize to other tasks? There is no linear "task space" that it makes sense to measure. I even occasionally see papers where the author averages some sort of score over several unrelated tasks. What could that possibly mean?

What if you're not even "doing a task"? Suppose you're evaluating Web browsers. What's the "task" of a Web browser? Well, there could be lots of tasks. You could be buying plane tickets or searching for information on Egyptian pottery, but you could be just wandering around to see what's interesting. How ya gonna test that? Or do you say that since we can't test it, it doesn't matter whether Web browsers do it well or not? This is like the drunk looking for his keys under the lamppost because the light is better.

Experience. Often experiments are controlled for simple variables like "novice or expert?". But the kind and extent of people's experience, both in real life and in the use of computer interfaces, vary a lot more than the experimenters are likely to admit. Not easy to boil it down to one bit.

Cognitive style. People differ significantly in cognitive styles that affect their reactions to interfaces. Nothing wrong with that, but it screws up experiments. Many interfaces, especially new and innovative ones, are controversial. Some people like them and some people hate them. Do you average the results and say that the interface is so-so? Some kinds of interfaces are good for some people, and other kinds are good for other people. It might not depend on some easily measured variable such as gender or age. It might even be a matter of (horrors!) taste. And why does everybody have to have the same interfaces as everybody else, anyway?

By quantifying the results and running statistics and drawing graphs, the experiments give a comforting allure of exactitude. But that's an illusion. Most of the time, these experiments are running statistics on bogus numbers. Ask people what they thought of an interface? How do you calibrate the scales on which people are answering? Stopwatch the time it takes them to perform a task? Did you calibrate the times against individual differences and performing the task with all relevant alternative methods? Did you independently test the reproducibility and error rates of each measurement? What self-respecting scientist would do averages on uncalibrated scales?
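To make the averaging worry concrete, here is a minimal sketch (in Python, using made-up ratings on an arbitrary 1-to-7 scale; the numbers are purely illustrative, not data from any real study). A polarizing interface and a uniformly lukewarm one can come out with exactly the same mean, so the headline average calls both of them "so-so".

    from statistics import mean, stdev
    from collections import Counter

    # Hypothetical ratings on a 1-7 scale; invented purely for illustration.
    polarizing = [7, 7, 7, 7, 7, 1, 1, 1, 1, 1]   # half love it, half hate it
    lukewarm = [4] * 10                           # everybody shrugs

    for name, ratings in [("polarizing", polarizing), ("lukewarm", lukewarm)]:
        print(f"{name}: mean={mean(ratings):.1f}, stdev={stdev(ratings):.1f}, "
              f"distribution={dict(Counter(ratings))}")

    # polarizing: mean=4.0, stdev=3.2, distribution={7: 5, 1: 5}
    # lukewarm: mean=4.0, stdev=0.0, distribution={4: 10}

The distribution is where the real story lives, and it is exactly what gets thrown away when the average becomes the headline number.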

And if people like or don't like an interface, or it helps or hinders a specific task, how can you be sure what it is they are reacting to? They may like or dislike an inessential aspect of the interface, not one you are trying to test. This happens to me all the time in user tests. You might be trying to test how your spiffy new groupware encourages collaboration, but the users complain if it doesn't have a spelling checker. Conversely, users have been so habituated to being forced to perform cryptic incantations with commercial software that they don't criticize the baseline software for some clunky procedure that your innovation completely eliminates. Detailed analysis of user interviews teases these factors out, but they easily get washed out by questionnaires, numerical ratings, and statistical averages.

There are some situations where experimental evaluation of user interfaces yields definitive results. If you have 15 items, and you want to know: is it better to have 3 menus of 5 items each, or 5 menus of 3 items each? Then user testing can give you the definitive answer. There's just one variable to be controlled, and more importantly, changing the variable doesn't change the paradigm of interaction. But when the alternatives being tested are radically different from each other, you've got a problem. Is it better to use a command-line interface or an iconic interface? Speech recognition or typing? Keeping your calendar on-line or in a paper notebook?

In medicine, they would be horrified at the idea of someone who developed a technique being the one to evaluate it. Medicine developed double-blind studies for a reason -- studies by people originally involved with whatever is under study were shown to be biased. Experimentally shown to be biased.

Well, say the evaluationistas, of course we can't do all this stuff perfectly, but, hey, we do the best we can. The drunk under the lamppost, again. The experimentalists look for the effects that are measurable and ignore the rest. They insist that all interfaces be judged solely by the criteria that their experimental methodology can quantify.

Anyone ever counted up how much time and money it would take to do an experiment that met all the criteria for a "good" evaluation? You'd have to run the experiment with enough subjects so that you could expect the differences between people to wash out statistically. You'd have to run it long enough so that you weren't fooled by either novelty or learning effects. You'd have to matrix all the relevant variables and test all combinations. You'd have to do double-blind studies. Excuse me, I have to go now, I have to work on my paper for CHI 2025.
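Just for the sake of argument, here is a back-of-the-envelope sketch (in Python; every number in it is a hypothetical assumption, not data from any actual study, and it counts nothing but payments to subjects).

    # Hypothetical figures for a "complete" factorial study; adjust to taste.
    tasks = 5               # "representative" tasks
    experience_levels = 3   # novice / intermediate / expert -- itself a simplification
    interfaces = 2          # the new design vs. a baseline
    subjects_per_cell = 20  # hoping individual differences wash out
    sessions = 4            # repeated sessions to separate novelty from learning effects
    hours_per_session = 1.5
    pay_per_hour = 25.0     # subject payment only; no staff, lab, or analysis time

    cells = tasks * experience_levels * interfaces
    subject_hours = cells * subjects_per_cell * sessions * hours_per_session
    print(f"{cells} cells, {subject_hours:.0f} subject-hours, "
          f"~${subject_hours * pay_per_hour:,.0f} in subject payments alone")
    # 30 cells, 3600 subject-hours, ~$90,000 in subject payments alone

And that figure is before you hire the independent, blinded experimenters that the double-blind requirement would demand, or pay anyone to analyze the data.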

User interface evaluation has a lousy track record. It isn't much good at predicting which interfaces will be accepted and appreciated by users, or which will be commercially successful. Even when there is a definitive result from testing, people often ignore it, most often wrongly. Studies show that pie menus are better than linear menus. The results make sense and have been empirically verified. Yet few commercial systems use pie menus. And, for God's sake, why don't we yet have two-handed pointing despite experiments showing it is superior for many tasks to a single mouse?

On the other hand, studies originally predicted no productivity gains for color displays over black-and-white displays, for time-sharing over batch processing, for buying airline tickets on the Web rather than over the phone (a current example), or for visual programming languages over textual ones. When some new kind of proposed interface makes a big enough change, it is hard to prove its worth in short-term, direct-comparison tests. But fortunately, that did not (and will not) stop the adoption of these technologies.

I hate to break the news to some of you, but user interface design is, in no small part, art, as much as it is science or engineering. That's not a bad thing. It doesn't mean that we shouldn't ask scientific and engineering questions about our interfaces, just that they are not the whole story.

The brilliant conceptual artists Vitaly Komar and Alex Melamid (http://www.diacenter.org/km/) conducted surveys asking people questions like "What's your favorite color?" and "Do you prefer landscapes to portraits?". Then they produced exhibitions of perfectly "user-centered art". The results were profoundly disturbing. The works were completely lacking in innovation or finesse of craftsmanship, disliked even by the very same survey respondents. Good art is not an optimal point in a multidimensional space; that was, of course, their point. Perfectly "user-centered interfaces" would be disturbing as well, precisely because they would lack that artistry.

The user interface research community, in particular, is heading for a crisis. In the name of tightening up evaluation standards at the annual CHI conference, more and more papers that present innovative user interfaces but lack airtight evaluations are being rejected. The results are predictable. More papers describing studies of existing or only incrementally different interfaces are being submitted, in order to reduce the risk of the studies being criticized. Fewer papers presenting radically innovative interfaces are being submitted or accepted. Especially among the long papers, what tends to get accepted are carefully crafted studies of uninteresting questions, leaving most of the real action at CHI to the short papers and demo track.

Don't get me wrong. I'm not actually against user testing. I like user testing. I do user testing. I think you can learn a lot of things from user testing. But, in my experience, it's chancy. Everybody's got a wonderful anecdote about how user testing uncovered some surprising and important aspect that the original designers didn't anticipate. When that happens, it's great. But you can't rely on it, you can't depend on it, and you can't insist on it.

My point is to take user interface evaluation with a grain of salt. Let's continue to do, and to encourage doing, user testing, but let's keep in mind that the results will not be definitive and need to be understood and interpreted in depth. Let's not reject out of hand new interface innovations that have not been tested or don't immediately show spectacular results in testing. Let's not let knee-jerk "tests have shown that" replace detailed analysis. We need more judgment in evaluation of user interfaces, not just more calculation.