Designing Speech User Interfaces

Nicole Yankelovich
Sun Microsystems Laboratories
Two Elizabeth Drive
Chelmsford, MA, USA 01824
(978) 442-0441
[email protected]

Jennifer Lai
IBM Thomas J. Watson Research Center
30 Saw Mill River Road
Hawthorne, NY 10532
(914) 784-6515
[email protected]

ABSTRACT
This tutorial focuses on techniques for designing speech interfaces. Topics covered include an introduction to speech input and output, a discussion of speech user interface design issues, and an exploration of ways to involve users in the design process.

KEYWORDS: speech user interface design, speech recognition, speech synthesis.

INTRODUCTION
The state-of-the-art in speech technology has progressed to the point where it is now practical for designers to consider integrating speech input and output into their applications. Adding speech to a multimodal application or creating a speech-only interface presents design challenges that are different from those presented in a purely graphical environment. This intermediate-level tutorial is intended for user interface designers and application developers who are interested in understanding the issues involved in designing effective speech interfaces.

SPEECH INPUT AND OUTPUT
Using speech in an application interface has advantages in a variety of circumstances. It makes data entry possible when no keyboard or screen is available. It is an excellent choice of technology for tasks in which the user's hands and/or eyes are busy. It also alleviates the need for typing, making it an effective enabling tool for people with disabilities.

Speech does pose substantial design challenges as well. Speech input is error-prone and speech output is often difficult to understand. Since speech is a time-based medium, it is transient, taxing users' short-term memory and making it unsuitable for delivering large amounts of data.

Before embarking on the design of a speech application, it is crucial to understand the underlying technology [5]. Speech synthesizers come in different flavors. Some, called "parameterized," are small and fast, but do not sound very natural, while others, called "concatenative," sound more natural but are more resource intensive.
Speech recognizers can be classified along a number of different dimensions. Some need to be trained to a user's voice, while others are "speaker-independent." Some allow users to speak continuously [4], while others constrain users to speaking with pauses between each word. Finally, some speech recognizers require the use of a rule grammar, others allow only fixed phrases for command-and-control, and others are based purely on statistical models. The variations noted above have a significant impact on application design.
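To make the rule-grammar style concrete, the Python sketch below enumerates the legal utterances of a tiny command-and-control grammar. It is an illustration only: the rule notation and all names are invented, not drawn from any particular recognizer's grammar format.

    # A toy command-and-control rule grammar (invented notation):
    #   <command> = (call | dial) <person>
    #   <person>  = kate | mom | the office
    RULES = {
        "<command>": [["call", "<person>"], ["dial", "<person>"]],
        "<person>": [["kate"], ["mom"], ["the", "office"]],
    }

    def expand(rule):
        """Enumerate every word sequence a rule can produce."""
        for alternative in RULES[rule]:
            seqs = [[]]
            for token in alternative:
                if token in RULES:
                    seqs = [s + e for s in seqs for e in expand(token)]
                else:
                    seqs = [s + [token] for s in seqs]
            yield from seqs

    LEGAL = {" ".join(seq) for seq in expand("<command>")}
    print("call mom" in LEGAL)     # True: in-grammar utterance
    print("phone mom" in LEGAL)    # False: out of vocabulary

A recognizer constrained by such a grammar can only ever return a phrase in LEGAL, which is one reason out-of-vocabulary speech is a guaranteed source of the recognition errors discussed below.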
SPEECH UI DESIGN ISSUES
Designing speech applications, particularly speech-only ones, is substantially different from designing graphical applications [3, 6]. Designers new to the process often find themselves at a loss because there are no concrete visuals to sketch. Designing speech interfaces involves understanding conversational style and the different ways that people use language to communicate. Since language use is deeply ingrained in human behavior, successful speech interfaces must not violate conversational conventions [1]. It is important to establish a common ground with users and make proper use of "discourse cues." These issues of language use are fundamentally important in both the design of prompts and feedback.

It is also important to take the characteristics of speech input and output into consideration. For example, speech is a slow output channel, so feedback must be sufficient, but not verbose. With this constraint in mind, speech interface designers have developed a variety of prompting techniques (tapered, incremental, expanded) which attempt to make conversational systems understandable to first-time users, but efficient and fast for experienced users [7].
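As an illustration of tapering, the sketch below shortens a prompt each time the user revisits the same dialogue state. The state name, prompt wording, and structure are all invented for this example.

    # Tapered prompting: the prompt for a dialogue state gets
    # shorter as the user hears it repeatedly. (Invented names.)
    PROMPTS = {
        "get_date": [
            "Please say the date of the appointment, for example "
            "'next Tuesday' or 'March third'.",   # novice: expanded
            "Say the appointment date.",          # repeat visit: shorter
            "Date?",                              # experienced: terse
        ],
    }

    visit_counts = {}

    def prompt_for(state):
        """Return a progressively shorter prompt on repeated visits."""
        n = visit_counts.get(state, 0)
        visit_counts[state] = n + 1
        variants = PROMPTS[state]
        return variants[min(n, len(variants) - 1)]

    for _ in range(4):
        print(prompt_for("get_date"))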
Another characteristic of speech input that dramatically affects application design is the probability of errors. Even with the most sophisticated speech recognizers, designers must assume that recognition errors will occur and design for this eventuality. Such errors can occur for a whole host of reasons, including excess noise in the environment, speaking words that are outside of the recognizer's vocabulary, or speaking before the recognizer is ready to listen. When these errors do occur, it is not always easy for the application to detect them, because the speech recognizer often returns a legal phrase, but not one that matches what the user actually said. If either the application or the user detects an error, an effective speech user interface should provide one or more mechanisms for correcting it. While this seems obvious, correcting a speech input error is not always easy! If the user speaks a word or phrase again, the same error is likely to recur. Many techniques, including switching input modalities, progressively changing prompts, or simplifying the recognition grammar, can be used alone or in combination to create an effective error detection and correction mechanism.
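The sketch below combines two of these techniques: progressively more directive reprompts, followed by a switch to typed input. The recognize() interface and its confidence score are assumptions made for illustration, not any particular recognizer's API.

    # Error recovery: reprompt with progressively more directive
    # wording, then fall back to a different input modality rather
    # than asking the user to respeak indefinitely. The recognize()
    # callable and its confidence score are hypothetical.
    RETRY_PROMPTS = [
        "What city are you flying to?",
        "Sorry, I didn't catch that. Please say just the city name.",
        "Please say the name of the destination city, like 'Boston'.",
    ]

    CONFIDENCE_THRESHOLD = 0.6   # below this, treat the result as an error

    def get_city(recognize, keyboard_input):
        for prompt in RETRY_PROMPTS:
            print(prompt)
            text, confidence = recognize()   # hypothetical recognizer call
            if confidence >= CONFIDENCE_THRESHOLD:
                return text
        # Respeaking the same phrase often fails the same way, so
        # after repeated misrecognitions, switch modality instead.
        print("Please type the city name instead.")
        return keyboard_input()

In a real application, recognize would wrap whatever result and score the recognizer actually returns, and keyboard_input would collect typed text from the available fallback device.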
INVOLVING USERS
As with all user interface design endeavors, involving users in the design process throughout the lifecycle of a speech application is crucial. At the very early stages of design, users can help to define application functionality and, critical to speech interface design, provide input on how humans carry out conversations in the domain of the application. Natural dialog pre-design studies are an effective technique for collecting vocabulary, establishing commonly used grammatical patterns, and providing ideas for prompt and feedback design.

Once a preliminary application design is complete, wizard-of-oz studies can help to test and refine the interface. In these studies, a human "wizard" (sometimes using software tools) simulates the speech interface [2], as in the sketch at the end of this section. Major usability problems are often uncovered with these types of simulations.

With a preliminary software implementation, more standard usability testing, in the laboratory or in the field, can be used to evaluate interaction techniques and uncover usability problems. It is usually necessary to test a real system in order to uncover problems due to recognition errors; these are difficult to simulate effectively in wizard-of-oz studies.

Usability tests of speech applications tend to be different from usability studies involving graphical interfaces. For example, it is difficult to have a facilitator in the room with the study participant because any human-human conversation can interfere with the human-computer conversation, causing recognition errors. It is also not possible to use the popular "speak aloud" protocols: in many speech applications, the voice channel is needed exclusively for speech input.
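Below is a minimal sketch of the kind of software tool a wizard might use: the wizard types what the participant said, and the dialogue logic runs unchanged, so prompts and flow can be exercised before any recognizer is integrated. All names are invented for this example.

    # Wizard-of-oz harness: a human stands in for the recognizer
    # by transcribing the participant's utterances, while the
    # dialogue logic runs exactly as it would in the real system.
    def wizard_recognize():
        """The human wizard substitutes for speech recognition."""
        return input("[wizard] type what the participant said: ")

    def run_dialogue(recognize):
        print("SYSTEM: Welcome to the calendar. What would you like to do?")
        while True:
            utterance = recognize()
            if utterance in ("goodbye", "quit"):
                print("SYSTEM: Goodbye.")
                break
            print(f"SYSTEM: (dialogue logic handles {utterance!r})")

    if __name__ == "__main__":
        run_dialogue(wizard_recognize)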
CONCLUSION

By examining fundamental characteristics of speech technology, exploring speech user interface design issues, and detailing options for involving users in the design process, we hope this tutorial will provide the practical advice and concrete examples that practitioners will need to design effective speech user interfaces.

REFERENCES

1. Clark, Herbert H. Arenas of Language Use. University of Chicago Press, Chicago, IL, 1992.

2. Fraser, Norman M., and G. Nigel Gilbert. "Simulating Speech Systems," Computer Speech and Language, Vol. 5, Academic Press Limited, 1991.

3. Kamm, Candace. "User Interfaces for Voice Applications," in Voice Communication Between Humans and Machines. National Academy Press, Washington, D.C., 1994.

4. Lai, Jennifer, and John Vergo. "MedSpeak: Report Creation with Continuous Speech Recognition," CHI '97, Atlanta, GA, March 22-27, 1997.

5. Schmandt, Christopher. Voice Communication with Computers: Conversational Systems. Van Nostrand Reinhold, New York, 1994.

6. Yankelovich, Nicole, Gina-Anne Levow, and Matt Marx. "SpeechActs: Issues in Speech Interfaces," CHI '95, Denver, CO, May 7-11, 1995.

7. Yankelovich, Nicole. "How Do Users Know What to Say?" ACM Interactions, Vol. 3, No. 6, November/December 1996.