IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 4, JULY 2002

From Members to Teams to Committee—A Robust Approach to Gestural and Multimodal Recognition

Lizhong Wu, Sharon L. Oviatt, and Philip R. Cohen

Abstract—When building a complex pattern recognizer with high-dimensional input features, a number of selection uncertainties arise. Traditional approaches to resolving these uncertainties typically rely either on the researcher's intuition or on performance evaluation over validation data, both of which result in poor generalization and robustness on test data. This paper describes a novel recognition technique called members to teams to committee (MTC), which is designed to reduce modeling uncertainty. In particular, the MTC posterior estimator is based on a coordinated set of divide-and-conquer estimators that derive from a three-tiered architectural structure corresponding to individual members, teams, and the overall committee. The MTC recognition decision is determined by the whole empirical posterior distribution, rather than by a single estimate. This paper describes the application of the MTC technique to handwritten gesture recognition and multimodal system integration and presents a comprehensive analysis of the characteristics and advantages of the MTC approach.

Index Terms—Combination of multiple classifiers, decision making, gesture recognition, learning, multimodal integration, pattern recognition, uncertainty.

Manuscript received January 17, 2001; revised October 29, 2001. L. Wu is with HNC Software Inc., San Diego, CA 92121-3728 USA. S. L. Oviatt and P. R. Cohen are with the Center for Human and Computer Communication, Oregon Graduate Institute, Portland, OR 97291-1000 USA. Publisher Item Identifier S 1045-9227(02)04431-4.

I. INTRODUCTION

A pattern recognizer consists of three important components: a feature extractor, a posterior estimator, and a decision maker. The feature extractor transforms the raw, highly redundant input into a representative dimension-reduced feature vector. The posterior estimator evaluates the posterior probabilities, given the extracted feature. The decision maker then sorts the posterior probabilities and assigns the interpretation with the largest posterior probability as the correct choice.

When building a recognizer, a series of selection choices and uncertainties must be resolved. These include selection of the input representation and features, the model type, the model complexity, and the training and validation data. Traditional approaches to handling these uncertainties typically rely either on intuitive selections or on evaluation of performance over validation data [16], [20]. They also usually rely on a single preselected choice for each decision. For example, in handwriting recognition, the image size for ink representations is decided beforehand. However, too small an image will fail to distinguish certain symbols, and too large an image will begin to pick up noise in handwriting styles. The use of one preselected size (i.e., typically determined by the training and validation data) can only optimize for the average of all recognition symbols, but clearly cannot be tailored to be optimal for the full range of all of them.

Likewise, in feature extraction it is hardly possible to find a single feature set that is optimal for all recognition targets. Instead, different targets or subsets of targets will have their own optimal feature sets. Traditional approaches usually append all potential features into a single large-dimension stream. For example, in acoustic modeling, one representation is universal to all target triphones. Conventionally, a 39-dimensional vector is formed by a basic 13-dimensional acoustic vector (i.e., the signal energy and first 12 cepstral coefficients) and its first- and second-order differentials [53]. However, such a singular feature representation risks greatly reducing the robustness of the model. The first 12 cepstral coefficients that are selected may be efficient for recognizing voiced sounds, but redundant for voiceless fricatives in a manner that introduces noise into the model. Another example of a single-stream large-dimension feature representation is Apple Computer's Newton handwriting recognizer, in which the input representation consists of a 14 × 14 image, a 20 × 9 stroke feature, a five-dimensional stroke count, and a single-dimension aspect ratio [52].

In training a posterior estimator, traditional pattern recognition techniques usually take a black-box approach (e.g., the backpropagation algorithm for training a neural network [2]), in which all potential features are included as input. This black-box procedure searches over the whole system parameter space to find the best mapping between its input and the target. When the dimensions of the parameter space are large, the optimization process tends to converge slowly and in some cases may not converge due to poor initialization.

Finally, in decision making, traditional approaches usually are based on a single posterior estimate, which assumes an optimality that rarely exists [2], [16], [20]. For example, an estimator might deviate greatly from its trained trajectory in the presence of a small perturbation of its input or model parameters. As a result, traditional recognizers fail to handle questions such as the following: is the input a target $T_a$ or a target $T_b$ if the following three different posterior estimates have been observed?

[Table of three differing posterior estimates of $P(T_a \mid I)$ and $P(T_b \mid I)$ omitted.]

To address the above problems, this paper proposes a new technique referred to as members to teams to committee (MTC). The MTC has a three-tiered divide-and-conquer architecture


that consists of multiple members, multiple teams, and a committee. The members provide a spectrum of diversified local posterior estimates, with the teams coordinating output from the members and the committee finalizing the recognition decision based on the empirical posterior distribution passed up from the teams. The MTC approach is designed to reduce modeling uncertainty by providing multiple automatically weighted factors. In feature extraction, the MTC takes a divide-and-conquer approach, with its feature representation composed of multiple small-dimension streams. The MTC posterior estimator is formed by multiple coordinating local estimators. Furthermore, the training and validation data, the input features, and the model type and complexity of each estimator can differ. The MTC recognition decision also derives from the whole empirical posterior distribution, rather than a single estimate. Finally, the MTC is built layer by layer, with a bottom–up optimization procedure. Search is conducted over multiple small subspaces, with parameters in the upper layer initialized by the immediate results from the lower layer.

The MTC approach is related to, but different from, previously reported techniques on combining multiple classifiers [15], [25], [41]. For example, the MTC member is related to Stacking [49], Bagging [4], and AdaBoost [17], [18] in subsampling development data. The MTC member is also related to work on multiple-stream feature representation [6], [7]. However, the MTC approach is more comprehensive in that the pool of MTC members is able to take into account all potential uncertainties.

The MTC approach is related to automatic relevance determination (ARD) [28], [32] in simultaneously inferring the utility of large numbers of possible input variables. In ARD, a hyperparameter is associated with each input feature, and it controls the size of the weights associated with connections out of that input. ARD is implemented in a Bayesian scheme. It usually assumes that the network weights follow a Gamma distribution, and a Gibbs sampling procedure can be used for the hyperparameters. Both the network weights and the hyperparameters are learnt at the same time, given a sufficiently large number of training patterns. In the MTC framework, the process of building models for individual input features or their combinations and the process of evaluating the relevance of these features are separated. Model building is conducted at the member level, and relevance is evaluated at the team level. The input space to the teams, i.e., the output space of the members, is usually much smaller than the original feature space. By mapping the problem to separated and dimension-reduced spaces, one can expect to reduce the curse of dimensionality and to avoid overfitting.

In addition, the MTC team is related to the gating network in the mixtures of local experts [23]. However, there are two fundamental differences. The gating network, which combines multiple local experts, is driven by all input features of the local experts, and both the local experts and the gating network are trained simultaneously. In the MTC, the teams are driven by the members, and the members and the teams are trained sequentially. The gating network will work only when the curse of dimensionality [16] is low. However, for applications involving high-dimensional input features, such as the pen/voice multimodal system to be discussed in this paper, the gating network will not be appropriate due to the modeling complexity and lack of training data.

The MTC approach also is distinct from previous weighted combination techniques [1], [18], [33], [38] in at least two respects. First, the team parameters are both input and class dependent; previous techniques usually have employed class-independent weights. Second, the MTC also takes into account weighting uncertainty by providing multiple teams. The previous literature has not addressed this latter problem. Finally, the MTC committee also is different from other committee approaches described previously in the literature. Previous committee approaches [8] generally have been based on the mean, median, or majority vote of the members' output, which may be weighted or unweighted. To our knowledge, committee decision making based on the empirical posterior distribution has not been reported before.

The outline of this paper is as follows. Section II introduces the MTC technique: its overall architecture, the functionality of each component, and its learning algorithms. Section III presents handwritten gesture recognition as an example of a general pattern recognition problem and shows how to build an MTC recognizer for this purpose. Section IV demonstrates the application of the MTC approach to integrating a multimodal pen/voice system, which is an example of a high-dimensional input feature pattern recognition problem. Section V analyzes the advantages of the MTC technique empirically and discusses the philosophy behind the MTC approach. Section VI summarizes the paper's conclusions.

Fig. 1. The three-tiered bottom–up MTC architecture, comprising multiple members, multiple teams, and a decision-making committee.

II. MTC TECHNIQUE

A. MTC Architecture

The MTC architecture consists of three layers, as shown in Fig. 1. The bottom layer is formed by multiple recognizer members. Each member is a local posterior estimator with an assigned input variable subset, a specified model type and complexity, and a given training and validation data set. The members cooperate with each other via the multiple teams built at the midlayer. Different teams observe different training data and are initialized and trained differently.¹ Output from the teams forms an empirical posterior distribution that then is sent to the committee at the upper layer. The committee makes a final decision after comparing the empirical posterior distributions of the different targets.

¹The team integrates the members. Multiple teams are built to reduce integration uncertainty. More discussion on the need for multiple teams is presented in Sections II-B and V-C.
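For concreteness, the bottom–up dataflow through the three layers can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: members, teams, and the committee are supplied as callables, and the function and variable names are ours, not the paper's.

import numpy as np

def mtc_forward(members, teams, committee, raw_input):
    # members:   list of s callables; each applies its own feature extraction
    #            and model, returning an n-vector of local posterior estimates.
    # teams:     list of t callables; each maps the stacked (s, n) member
    #            output to a coordinated n-vector of posteriors.
    # committee: callable mapping the stacked (t, n) team output, i.e., the
    #            empirical posterior distribution, to a final decision.
    member_out = np.stack([m(raw_input) for m in members])   # shape (s, n)
    team_out = np.stack([tm(member_out) for tm in teams])    # shape (t, n)
    return committee(team_out)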


B. MTC Recognition Algorithm

First, we define the input feature set $I = \{I_1, I_2, \ldots, I_m\}$ and the recognition target set $T = \{T_1, T_2, \ldots, T_n\}$. The input feature is formed by $m$ streams, whose dimensions may differ. The target consists of $n$ different classes, for example $n$ different multimodal commands. Our goal is to evaluate the posterior probability of each target, given the input feature set,

$$P(T_j \mid I) \quad \text{for } j = 1, 2, \ldots, n \tag{1}$$

with

$$\sum_{j=1}^{n} P(T_j \mid I) = 1$$

and

$$0 \le P(T_j \mid I) \le 1$$

and to assign the input to the target with the largest posterior probability. The MTC recognition algorithm goes through three bottom–up steps.

1) Estimating the local posteriors of members: Each member computes a local posterior estimate under the specified modeling condition. The modeling specifications include the model type, the model complexity, the extraction of input features, the training and validation data, and the learning algorithm. If there is a total of $s$ combinations of modeling specifications in which we are interested, then we would compute $s$ local posterior estimates from the $s$ members as follows:

$$\hat{P}(T_j \mid I, S_i) \quad \text{for } i = 1, 2, \ldots, s \tag{2}$$

where $S_i$ stands for the $i$th combination of modeling specifications.²

2) Coordinating the local posteriors into teams: The team integrates the local posterior estimates of different specifications. We have

$$\hat{P}(T_j \mid I) = \sum_{i=1}^{s} \hat{P}(T_j \mid I, S_i)\, P(S_i \mid T_j) \quad \text{for } j = 1, 2, \ldots, n \tag{3}$$

where $P(S_i \mid T_j)$ is the mode probability of the $j$th target associated with the $i$th combination of modeling specifications. The team is trained to learn the mode probability matrix. Different training data and approaches will result in different mode probability estimates. Subsequently, the multiple team posteriors are obtained,

$$\hat{P}_k(T_j \mid I) \quad \text{for } k = 1, 2, \ldots, t \tag{4}$$

where $k$ is the index of ways of estimating the mode probability and $t$ is the total number of ways in which we are interested.

3) Making a recognition decision via committee: The output from multiple teams forms an empirical distribution of the posterior $\hat{P}_k(T_j \mid I)$, which is approximated by a normal or Student's t distribution, depending on the size of the samples. Given a confidence level, the committee runs through a series of pair-by-pair hypothesis tests and obtains a significance matrix $G$. $G$ is an $n \times n$ square matrix, where element $g_{jj'}$ of $G$ is 1 when the posterior estimate of $T_j$ is significantly greater than that of $T_{j'}$ within the given confidence level, $-1$ when it is significantly less, and otherwise 0. By definition, the diagonal elements of $G$ are all zeros; moreover, $g_{jj'} = 1$ if $g_{j'j} = -1$, and $g_{jj'} = -1$ if $g_{j'j} = 1$. The recognition targets then are ranked by summing over each row in $G$, and this summary value is called a significance number. All significance numbers form an $n$-dimensional significance vector. The maximal significance number is $n - 1$. If there is a significance number that equals $n - 1$, the input is recognized as the target corresponding to the row index of this maximal significance number. If all significance numbers are smaller than $n - 1$, then the current input cannot be recognized with confidence and further external information is required. An example of a 4 × 4 significance matrix with its corresponding significance vector and rank is given as follows (values illustrative):

$$G = \begin{pmatrix} 0 & 1 & 1 & 1 \\ -1 & 0 & 1 & 1 \\ -1 & -1 & 0 & 0 \\ -1 & -1 & 0 & 0 \end{pmatrix}, \qquad \text{Sig} = \begin{pmatrix} 3 \\ 1 \\ -2 \\ -2 \end{pmatrix}, \qquad \text{Rank} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3 \end{pmatrix}$$

²Note that the notation $\hat{P}(T_j \mid I, S_i)$ does not imply that the input to the $i$th member must be the complete feature set $I$. The actual input is specified by the feature extraction operation in $S_i$.
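A minimal sketch of this committee step in Python follows. The stacking of team posteriors into a t × n array is our assumption about data layout, and a paired t-test across teams stands in for the pair-by-pair hypothesis tests, whose exact statistic the text does not specify.

import numpy as np
from scipy import stats

def significance_matrix(team_posteriors, confidence=0.95):
    # team_posteriors: (t, n) array, one row of n target posteriors per team.
    # Returns the n x n matrix G with g[j, jp] = +1 when target j is
    # significantly greater than target jp, -1 when significantly less,
    # and 0 otherwise (zero diagonal and antisymmetry hold by construction).
    t, n = team_posteriors.shape
    alpha = 1.0 - confidence
    G = np.zeros((n, n), dtype=int)
    for j in range(n):
        for jp in range(j + 1, n):
            stat, p = stats.ttest_rel(team_posteriors[:, j],
                                      team_posteriors[:, jp])
            if p < alpha:
                G[j, jp] = 1 if stat > 0 else -1
                G[jp, j] = -G[j, jp]
    return G

def committee_decision(team_posteriors, confidence=0.95):
    # Rank targets by significance number; reject when no row reaches n - 1.
    G = significance_matrix(team_posteriors, confidence)
    sig = G.sum(axis=1)               # the n-dimensional significance vector
    best = int(np.argmax(sig))
    if sig[best] == G.shape[0] - 1:   # significantly beats every other target
        return best, sig
    return None, sig                  # cannot recognize with confidence

With roughly ten teams, as reported in Section III, the Student's t reference distribution is the relevant one, which is why a t-test rather than a normal approximation appears in this sketch.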

C. MTC Training Procedure

This section describes the building process for the MTC. It consists of constructing the members, training the teams, and configuring the committee.

1) Members: The goal of the MTC's members is to learn a set of local posterior estimates, as indicated in (2). The key is to identify the modeling specifications and their combinations for the members. The members within the MTC can represent different types of models, with the training algorithm for the members being model-dependent. Among the various modeling specifications, the most important one is the extraction of input features via exploratory data analyses. Once the input features have been extracted, the model type is selected to fit the characteristics of the input features. In the MTC, a variety of input features can be extracted, and different types of models can be selected to fit different types of input features. In Sections III and IV, we will present examples of the design of members for a specific application.

2) Teams: The goal of the teams is to learn the mode probability matrix in (3). By adjusting the mode probability matrix $w$, we maximize (i.e., reward) the $j$th posterior $\hat{P}_k(T_j \mid I)$ and simultaneously reduce (i.e., penalize) the other posteriors $\hat{P}_k(T_{j'} \mid I)$, for $j' = 1, 2, \ldots, n$ and $j' \neq j$, when the $j$th class pattern is applied. In order to meet the constraint that the sum of all posteriors must equal one, we impose a softmax function on the output. This task is illustrated in Fig. 2 and described by the following equations:

$$y_{jk} = \sum_{i=1}^{s} w_{ijk}\, x_{ij} \quad \text{for } j = 1, 2, \ldots, n \tag{5}$$

where $x_{ij} = \hat{P}(T_j \mid I, S_i)$ is the output of the $i$th member for the $j$th target and $w_{ijk}$ is the corresponding parameter of the $k$th team, and


$$z_{jk} = \hat{P}_k(T_j \mid I) = \frac{\exp(y_{jk})}{\sum_{j'=1}^{n} \exp(y_{j'k})} \quad \text{for } j = 1, 2, \ldots, n. \tag{6}$$

Fig. 2. Architecture of teams in the MTC approach, with $x$, $y$, $z$, and $w$ defined by (5) and (6).


Equation (5) is a rewritten version of (3), and (6) is a softmax function [5] that ensures the normalization constraint. The detailed learning algorithm is given in the Appendix.

3) Committee: The team integrates the members' posterior estimates, and multiple teams are built to reduce integration uncertainty. The goal of the committee is to compare the empirical posterior distribution formed by the teams and make a final recognition decision. To train the committee, no free system parameter is needed. The confidence level for recognition is predetermined, and different confidence levels will result in different tradeoffs between the system's error rate and its rejection rate. The higher the confidence level, the lower the error rate but the higher the rejection rate.
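For concreteness, a team's forward pass through (5) and (6) is a few lines of Python; the array names x, W, y, and z mirror Fig. 2, and the max shift inside the softmax is a standard numerical-stability detail that the paper does not discuss.

import numpy as np

def team_posterior(x, W):
    # x: (s, n) array of member outputs, x[i, j] = P^(T_j | I, S_i).
    # W: (s, n) array of team parameters (the mode probability matrix).
    y = (W * x).sum(axis=0)      # (5): y_j = sum_i w_ij * x_ij
    e = np.exp(y - y.max())      # shift by max(y) for numerical stability
    return e / e.sum()           # (6): softmax, so the outputs sum to one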

III. MTC HANDWRITTEN GESTURE RECOGNITION

In this section, we demonstrate how to build a real-world MTC system, first by describing our application task and then by going through the system building procedure in a step-by-step manner. Finally, we report on the performance of the MTC system, as applied to the gesture recognition application and corpus.

A handwritten gesture recognizer was developed to serve our Quickset multimodal human-computer interface [11]. It recognizes common pen gestures such as points, lines, routes, circles, arrows, checkmarks, question marks, crosses, and alphas. It also recognizes specialized military symbols, such as fortified lines and barbed wire. In total, there are 191 different gesture symbols in our collected database. For the present purpose, the MTC recognizer was built to recognize a subset of 31 of these gestures. For these 31 gestures, a total of 1618 patterns were collected from multiple Quickset system users. These data were randomly partitioned into a development set consisting of 1326 patterns used for training and validation and a test set consisting of 292 patterns used for final recognition testing.

After completing routine exploratory data analyses, the following multiple features were extracted.
1) Eigen components of gestural images: The eigen components of gestural images were obtained via principal component analysis (PCA), which is a valuable tool for reducing input dimensions [20], [30]. In handwritten gesture recognition, to preserve as much information as possible, a larger sized image should be used when ink is transformed into a gestural image. However, model complexity and computational requirements increase with the image size. Because our gestural images were black and white, with a large percentage of white pixels, the image dimension was reducible via PCA from 441 (i.e., 21 × 21) to about 30 without significant reconstruction distortion.
2) Number of strokes: Although the precise number of strokes can change from one pattern to another for any particular gesture, the inclusion of this feature easily separates the most image-confusable gesture pairs (e.g., checkmark versus arrow).
3) Normalized stroke length: The normalized stroke length is defined as the number of black pixels normalized by the total number of pixels in the image. This feature helps distinguish relatively simple gestures (e.g., line) from more complex ones.
4) Image centroid: The image centroid is defined as the average pixel location normalized by the size of the image. The image centroid especially contributes to distinguishing direction-sensitive gestures (e.g., arrows pointing north, northeast, east, southeast).

Any individual feature differs in its contribution to recognizing a particular target. In contrast to other conventional approaches, the MTC approach does not append all extracted features into a single large-dimensional stream. Instead, it builds different recognizer "members" for different extracted features. The member with eigen components of gestural images is modeled by a mixture of Gaussian distributions [16]. The number of strokes is a discrete variable, and its associated member is modeled by a frequency table of the number of strokes. The normalized stroke length is modeled by a Gamma distribution [21]. The image centroid is modeled by a two-dimensional Gaussian process. In addition, multiple image sizes and eigen dimension cutoffs are modeled. In total, there are 60 different combinations of modeling specifications, and therefore 60 recognizer members are built. All of these models are trained using data randomly bootstrapped from the development corpus and then validated using the whole development corpus. To reduce the uncertainty of data selection, more than one model is trained, with each model using a different bootstrap replica of the development data.

For the member recognizers, the average test error rate was 18.24 ± 20.22%, which confirms that the members alone cannot act in a robust manner. The average correlation coefficient of posteriors between different members was just 0.60, which also indicates that a significant difference exists between members.

Using the mechanism shown in Fig. 2 and the learning algorithm listed in the Appendix, multiple MTC teams were trained using different training data (i.e., different bootstrap replicas of the development data). The average team performance was greatly improved over member performance, with an average test error rate of 4.42 ± 0.95% for the teams, compared with 18.24 ± 20.22% for the 60 member recognizers. This result was obtained based on ten teams; it is noteworthy that more than ten teams did not yield any significant further reduction in error rate.

The empirical posterior distribution then is passed from the multiple teams to the committee. The output from the committee is the ranked N-best list, together with the mean, standard deviation, vote, and our defined significance number of the posterior estimates. The average error rate based on the committee was 3.77% (i.e., without rejection) and 2.14% (i.e., with rejection). These results were obtained with a 95% confidence level. An example of the described MTC gesture recognition process is provided in Fig. 3 and Table I.

Fig. 3. Example of the recognition process using the MTC handwritten gesture recognizer. Top: boxplots of the empirical posterior distributions obtained from the multiple recognizer teams. Middle: responses from the recognizer members as local posterior estimates. Bottom: gestural ink and the corresponding 21 × 21 image.
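The feature streams above are straightforward to compute. The sketch below shows plausible implementations, assuming a binary image array with ink pixels equal to 1 and a PCA basis fitted offline on training images (the components matrix); the paper's actual preprocessing may differ in detail. The number-of-strokes feature comes from the captured ink rather than the rendered image, so it is omitted here.

import numpy as np

def eigen_features(image, components):
    # Project the flattened 21 x 21 image onto the leading PCA eigenvectors,
    # reducing 441 dimensions to about 30; components has shape (30, 441).
    return components @ image.ravel().astype(float)

def normalized_stroke_length(image):
    # Number of ink pixels divided by the total number of pixels.
    return float(image.sum()) / image.size

def image_centroid(image):
    # Average ink-pixel location, normalized by the image size.
    rows, cols = np.nonzero(image)
    return np.array([rows.mean() / image.shape[0],
                     cols.mean() / image.shape[1]])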

TABLE I
OUTPUT FROM THE COMMITTEE CORRESPONDING TO THE RECOGNITION PROCESS IN FIG. 3, INCLUDING THE MEAN, STANDARD DEVIATION (STD), VOTE, SIGNIFICANCE NUMBER (SIG.), AND RANK OF POSTERIOR ESTIMATES. IN THIS EXAMPLE, THE 17TH GESTURE (SWARROW) RECEIVES THE TOP RANK, THE SECOND GESTURE (NESWLINE) SECOND RANK, AND THE 30TH AND 31ST GESTURES (CHECK AND SCRATCH) TIE FOR THIRD RANK

IV. MTC MULTIMODAL STATISTICAL INTEGRATION

This section demonstrates the application of the MTC approach to multimodal integration. We first review studies on multimodal integration and introduce our testbed, a pen/voice system called Quickset. We then describe the MTC-based multimodal statistical integrator. Finally, we compare this integrator to a benchmark using a data corpus that was collected using Quickset.

There are two main types of multimodal systems: one integrates signals at the feature level and the other at the semantic level. Feature fusion generally is considered more appropriate for closely coupled and synchronized modalities (e.g., speech and lip movement) [29], [37], [45], whereas semantic fusion is widely applied to modalities that differ in the time scale characteristic of their features (e.g., speech and pen input) [43]. This paper focuses on multimodal systems that are more appropriately developed to rely on semantic fusion of input signals.

Our multimodal system consists of a set of recognizers for different input modalities and levels of recognition accuracy. A command in a multimodal system is jointly represented by a set of constituents. For example, to control the system's map display, one can say "zoom in" and simultaneously use the pen to circle an area of the map. Each of the two constituents, the spoken input and the pen input, is a target that is identifiable by an individual recognizer. During recognition, an individual recognizer analyzes its relevant set of input features for a given mode and then formulates an N-best list of targets along with their estimated posterior probabilities. The estimated posterior probabilities from different mode recognizers are integrated to generate a final multimodal N-best list, which is sent for semantic unification. The probabilistically top-ranked, semantically compatible command then is selected and confirmed as the system's multimodal interpretation.

Multimodal systems with fusion at the semantic level include Bolt's seminal work "Put-That-There" [3], ShopTalk [12], [13], CUBRICON [31], Virtual World [10], Finger-Pointer [19], Visual Man [48], Jeanie [46], and others as described in [26], [42]. However, these previous studies have ignored basic statistical integration issues, concentrating instead on issues like semantic representation, dialogue management, and the development of new recognition modalities. As reviewed in [43], previous studies usually have assumed that the individual modes are independent of each other, so the final multimodal command probability is set as the cross product of the posterior probabilities of the associated constituents. Although use of the independence assumption was a reasonable starting point and it simplified the integration process, it is never strictly true, since a semantic constituent in one mode usually only associates with a subset of constituents in the other mode. Recognition accuracies also usually vary from one mode to another. Even in the same mode, they vary from one constituent to another. By giving different weights to different modes and different constituents, we can compensate for some recognition errors that otherwise would occur in individual recognizers. For example, the study in [39] has shown the possibility of improving continuous gesture recognition results based on the co-occurrence analysis of different gestures with some spoken keywords. Performance improvement also has been found in audio–visual speech recognition systems by weighting both audio and visual recognition channels [29], [40], [47].

Our proposed MTC technique is well suited to experimenting with ways to integrate multiple modes on the basis of posterior probabilities and other factors. Using this technique, the recognizers of different modes become the members of an MTC statistical integrator. Multiple teams built in the MTC integrator are trained to coordinate and weight the output from different modes. Each team establishes the posterior estimate for a multimodal command, given the current multimodal input received. The committee of the MTC integrator analyzes the empirical distribution of the posteriors and establishes the N-best ranking for each multimodal command.

Our testbed used to experiment with and evaluate the MTC integrator was Quickset [11], a collaborative handheld agent-based multimodal system that processes simultaneous pen and voice input. Quickset has been used for various applications that enable users to set up and control distributed interactive simulations. Using a "Wizard of Oz" research paradigm, it was demonstrated that a multimodal interface parallel to Quickset supported 36% fewer task errors and 10% faster task completion time than a unimodal spoken interface [34]. The pen gesture recognition for Quickset was developed at OGI, and the speech recognizer was Microsoft's Whisper 3.0.

Quickset integrates multimodal input in the following three sequential steps.
1) Temporally, Quickset combines speech and gesture input that is overlapped, or that falls within a certain lag relation when the signals arrive sequentially. The temporal constraints of Quickset's integration were determined by empirical research with users [36]. It was found that when users speak and gesture in a sequential manner, they gesture first and then speak within a relatively short time window; speech rarely precedes gesture. As a consequence, the multimodal synchronizer in Quickset prefers to integrate a gesture with speech that follows within a 4-s interval, rather than integrating it with preceding speech. If speech arrives after that interval, the gesture is interpreted unimodally. The precise lag threshold adopted when signals arrive sequentially can be learned by the system using training data, or preset by the system developer for a particular domain.
2) Statistically, Quickset integrates the posterior probabilities of constituents from individual modes and then generates an N-best list for a multimodal command that includes posterior probabilities for each final interpretation. The original version of Quickset relied on the independence assumption and took the cross product of the probabilities of individual modes as the multimodal probability for each item in the final multimodal N-best list. One goal of the present work is to supersede the independence assumption by developing a more powerful statistical integrator based on the realities of empirical data.
3) Semantically, Quickset determines whether a given gestural element and a spoken element in the N-best lists can be combined legally into a coherent multimodal interpretation that is executable by the system. The semantic information contained within the two modes in Quickset is represented as typed feature structures that can be unified if the elements are compatible semantically. The unification of typed feature structures in Quickset has been detailed elsewhere [24].

The data corpus used in the present work was collected using Quickset to set up simulations involving community fire and flood control activities. All commands were multimodal and required both speech and pen input. The corpus consisted of 1539 labeled commands collected from 16 users: eight native speakers of English and eight accented nonnative speakers. We randomly assigned the data from the first eight users for development and the rest for test purposes. The development data included 780 command patterns, and the test data included 759 patterns. A more detailed description of this corpus and its analysis is available elsewhere [9], [35].

The operation of the MTC-based multimodal statistical integrator can be summarized by the following three procedures:
1) Both the speech and gesture recognizers are members of the MTC statistical integrator. Each of these members computes its local posterior estimates as follows:

Gesture recognition:

$$\hat{P}(g_u \mid I_g) \quad \text{for } u = 1, 2, \ldots, N_g \tag{7}$$

Speech recognition:

$$\hat{P}(s_v \mid I_s) \quad \text{for } v = 1, 2, \ldots, N_s \tag{8}$$


where $I_g$ and $I_s$ are the synchronized gestural and spoken input features, $g_u$ is the $u$th gestural constituent, $s_v$ is the $v$th spoken constituent, and $N_g$ and $N_s$ are the numbers of gestural and spoken constituents.
2) The team in the MTC integrator combines the modal posterior estimates based on an associative map involving the command and its constituents.³ Assuming that command $C_c$ is associated with the constituents $g_u$ and $s_v$, we have

$$\hat{P}_k(C_c \mid I_g, I_s) = \operatorname{Softmax}\bigl(w_{gck}\, \hat{P}(g_u \mid I_g) + w_{sck}\, \hat{P}(s_v \mid I_s)\bigr) \quad \text{for } c = 1, 2, \ldots, N_c \text{ and } k = 1, 2, \ldots, t \tag{9}$$


where $N_c$ is the number of commands, $t$ is the number of teams, and $k$ is the index of the team. Equation (9) is a rewritten version of (5) and (6), with the number of members $s = 2$, $x_{1c} = \hat{P}(g_u \mid I_g)$, $x_{2c} = \hat{P}(s_v \mid I_s)$, and $n = N_c$.
3) The committee in the MTC integrator ranks the multimodal commands based on the empirical distribution of their posterior probabilities formed by the multiple teams. The N-best list of multimodal commands is then sent for semantic unification.

TABLE II
COMPARISON OF MULTIMODAL RECOGNITION ERROR RATES FOR THE MTC-BASED STATISTICAL INTEGRATOR VERSUS A PREVIOUS INTEGRATOR BASED ON THE CROSS-PRODUCT OF PROBABILITY ESTIMATES FROM INDIVIDUAL MODES

Table II lists the MTC architecture's error rates for multimodal commands for both the development and test data of the fire and flood control multimodal corpus. These error rates are compared to a benchmark based on the previous architecture, in which the multimodal posterior probability was the cross product of the posterior probabilities of individual modes. As shown, the MTC's test error rate has been reduced to less than half of the original.
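Putting procedures 1)–3) together, the integrator's inner loop might be sketched as follows. The constituent dictionaries, associative map, and per-team weight tables are hypothetical stand-ins for the structures described above, and the final ranking would reuse the committee procedure of Section II-B.

import numpy as np

def integrate(gesture_nbest, speech_nbest, assoc_map, team_weights):
    # gesture_nbest: dict, gestural constituent -> posterior from (7).
    # speech_nbest:  dict, spoken constituent -> posterior from (8).
    # assoc_map:     dict, command -> (gestural constituent, spoken constituent).
    # team_weights:  list over teams; each maps command -> (w_g, w_s) as in (9).
    commands = list(assoc_map)
    P = np.zeros((len(team_weights), len(commands)))
    for k, w in enumerate(team_weights):
        y = np.array([w[c][0] * gesture_nbest.get(assoc_map[c][0], 0.0)
                      + w[c][1] * speech_nbest.get(assoc_map[c][1], 0.0)
                      for c in commands])
        e = np.exp(y - y.max())
        P[k] = e / e.sum()        # softmax over commands, as in (9)
    return P, commands            # hand P to the committee for N-best ranking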

V. WHY MTC?—A PULL-SOME-OUT ANALYSIS

To better understand the functionality and performance contribution of the MTC technique's components, this section reports on a series of "pull-some-out" empirical analyses, with the gesture recognition task described in Section III as a testbed. For each analysis, we pull some components out of the MTC architecture, and the remaining MTC components form a new recognizer that is retrained. We then compare the performance of the modified recognizer with that of the complete MTC system.

A. Pull Some Members Out

There are a total of 60 members in the MTC gesture recognizer.⁴ Here, we analyze performance changes associated with seven of the most important system component perturbations.
1) The gesture input features consist of only a single image size.
2) The gesture input features consist of only a single eigen dimension.

3) The members only analyze one bootstrap replica of the development data.
4) The gesture input features consist of only the eigen components of images.
5) The number of strokes is excluded from the original gesture input features.
6) The normalized stroke length is excluded from the original gesture input features.
7) The image centroid is excluded from the original gesture input features.

Indexes 1–7 in Table III summarize the results of the above experiments. Compared to the original performance, although the changes in validation performance are variable, most of the changes in test performance involved degradation. One exception was the sixth case, involving exclusion of the normalized stroke length, which did not alter test performance although validation performance worsened. In conclusion, the multiple-choice, automatically weighted MTC approach generalized better than any of these seven individual approaches involving hand-picked or validated elements.

³The multimodal associative map for the present corpus is detailed in [51].
⁴The total number of possible combinations therefore is $2^{60}$, an infeasible number to examine experimentally.

B. Pull the Teams Out

In this case, all member output was weighted equally and sent to the committee directly, effectively bypassing team deliberations. The result was that the error rates increased from 2.19% to 2.57% for the development data and from 3.77% to 10.96% for the test data. This impact of pulling out the teams is summarized in index 8 of Table III, compared with the original MTC performance in index 0.

TABLE III
SUMMARY IMPACT OF THE VARIOUS PULL-SOME-OUT ANALYSES, REPRESENTED AS A PERCENT ERROR RATE FOR VALIDATION (I.E., DEVELOPMENT) DATA AND TEST DATA

We also compare the posterior probability output from the teams with that from the members. For each gesture labeled by $T_j$, we first computed the mean posterior of the $j$th posterior estimator from the multiple member output [see (2)]

$$\bar{x}_j = \frac{1}{s} \sum_{i=1}^{s} \hat{P}(T_j \mid I, S_i)$$

and that from the multiple team output [see (4)]

$$\bar{z}_j = \frac{1}{t} \sum_{k=1}^{t} \hat{P}_k(T_j \mid I).$$

We then computed the mean posteriors that had the largest response among the other posterior estimators, i.e.,

$$\bar{x}_{j^*} = \max_{j' \neq j} \bar{x}_{j'} \quad \text{and} \quad \bar{z}_{j^*} = \max_{j' \neq j} \bar{z}_{j'}$$

for $j = 1, 2, \ldots, n$; each pair finally was renormalized by

$$\bar{x}_j \leftarrow \frac{\bar{x}_j}{\bar{x}_j + \bar{x}_{j^*}}, \qquad \bar{z}_j \leftarrow \frac{\bar{z}_j}{\bar{z}_j + \bar{z}_{j^*}}.$$

Note that if $\bar{z}_j > \bar{z}_{j^*}$ (i.e., $\bar{z}_j > 0.5$ after renormalization), then the gesture is correctly recognized. The region in which $\bar{z}_j < \bar{z}_{j^*}$ measures the average error, and the distance between $\bar{z}_j$ and $\bar{z}_{j^*}$ represents the discriminant ability of the recognizer. Fig. 4 illustrates the histograms of these normalized posteriors for both the members and the teams and shows clearly that the teams are more discriminative. In this figure, cases in which the correct posterior responses are smaller than the wrong posterior responses would indicate an error.

To further analyze the functionality of the team, Fig. 5 illustrates a comparison of the team parameter histograms before and after training. The team parameters are initialized by the recognition accuracy of an individual target contributed by an individual member and evaluated by a bootstrap replica of the development data. The recognition accuracies range from zero to one and are the initial approximations of the mode probability given in (3). Due to the divide-and-conquer strategy adopted by the members, inclusion of a specific member may only contribute to a small number of targets. This produces a large proportion of zero weights in the team parameter distribution. During training, some parameters are increased, or rewarded, while others are decreased, or essentially penalized. The histogram of the trained team parameters therefore expands beyond zero and one, which is acceptable as a probability definition because of the softmax constraint imposed on the output [see (6)].

C. Pull the Committee Out

Without the committee, each team acts individually. Index 9 in Table III lists the mean and standard deviation of the error rate over all the teams. We note that the error rate increases from 3.77% to 4.42%. Index 10 shows the error rate that corresponds to the team that achieved the best validation performance. Note, however, that this "best" team did not generalize best with respect to performance on the test data, with an error rate worse than the average. This comparison clarifies why multiple teams are needed in the MTC architecture and further confirms that the committee approach generalizes better than a validation approach that involves selecting a single team. Finally, the committee architecture component provides a technique for handling recognition rejection and confirmation.

VI. CONCLUSION

The MTC approach has a three-tiered divide-and-conquer architecture that consists of multiple members, multiple teams, and a committee. The members provide a spectrum of diversified local posterior estimates, with the teams coordinating output from the members and the committee finalizing the recognition decision based on the empirical posterior distribution passed up from the multiple teams. The MTC technique is distinct from other techniques in the following four aspects: 1) feature extraction and input representation—the MTC takes a multiple-stream divide-and-conquer approach, rather than the single-stream approach used in traditional models; 2) model selection—the MTC provides automatic weighting of multiple factors, rather than the single hand-picked or validated selection process used in traditional techniques; 3) training—the MTC is built in a layer-by-layer bottom–up approach, rather than the slower black-box training approach used in many traditional models; and 4) decision making—the MTC is based on the empirical posterior distribution, rather than a single posterior estimate.

The advantages of the MTC include: 1) reducing the uncertainties in model selection, input variable selection, training data selection, and decision making; 2) minimizing potential selection bias, which can result in a greater likelihood of overfitting; and 3) improving the overall generalization and robustness of system recognition performance. The MTC approach appears particularly efficacious in handling difficult data modeling problems that involve high-dimensional input features, which are typical of areas such as gestural and multimodal pattern recognition. More direct comparisons between the MTC and other methods based on the same benchmark data are available in [51].

As a tradeoff, the MTC structure may require more computational power and memory space than other traditional approaches. Exact computation and space requirements will depend on the number of members, the number of teams, and the model complexity of the members. As necessary, it would be possible to adjust the number of members and teams to meet specific computation or space constraints by removing those components that affect overall performance the least. The contribution of individual components can be estimated using an approach similar to the pull-some-out analysis demonstrated in Section V, sensitivity analyses as in [27], or saliency analyses as in [14] and [22].


Fig. 4. Histograms of the posterior probability output from (top) the teams and (bottom) members. The “dark” bars represent distributions of normalized correct posterior responses and the “gray” ones represent wrong responses.

Fig. 5. Comparison of team parameters (top) after training and (bottom) before training.

APPENDIX
TEAM TRAINING ALGORITHM

The total cross-entropy over the whole training set is

$$E = -\sum_{p=1}^{P} \sum_{j=1}^{n} d_{jp} \ln z_{jp} \tag{10}$$

where $p$ is the index of patterns and $P$ is the number of patterns in the training data set. $d_{jp}$ is the labeling symbol of the $p$th pattern for the $j$th target; if the $p$th pattern belongs to the $j$th target, a common setting is $d_{jp} = 1$ and $d_{j'p} = 0$ for $j' = 1, 2, \ldots, n$ but $j' \neq j$. $z_{jp}$ is the output from the team as described by (5) and (6).

By backpropagating the cross entropy $E_p$ of the $p$th pattern, and using (10), (6), and (5), we have

$$\frac{\partial E_p}{\partial w_{ij}} = \sum_{j'=1}^{n} \frac{\partial E_p}{\partial z_{j'p}} \frac{\partial z_{j'p}}{\partial y_{jp}} \frac{\partial y_{jp}}{\partial w_{ij}} \tag{11}$$

with

$$\frac{\partial E_p}{\partial z_{j'p}} = -\frac{d_{j'p}}{z_{j'p}} \tag{12}$$

$$\frac{\partial z_{j'p}}{\partial y_{jp}} = z_{j'p}\left(\delta_{jj'} - z_{jp}\right) \tag{13}$$

where $\delta_{jj'}$ is the Kronecker delta, and

$$\frac{\partial y_{jp}}{\partial w_{ij}} = x_{ijp}. \tag{14}$$

Substituting (12), (13), and (14) into (11), we obtain

$$\frac{\partial E_p}{\partial w_{ij}} = \left(z_{jp} - d_{jp}\right) x_{ijp}. \tag{15}$$
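Equation (15), together with the update rule (16) given below, yields a simple stochastic gradient descent loop. The sketch below is ours: member outputs are assumed precomputed into an array, the initialization is illustrative, and the early stopping and regularization mentioned at the end of this appendix are omitted for brevity.

import numpy as np

def train_team(X, D, lr=0.1, epochs=100):
    # X: (P, s, n) array of member outputs x_ijp for the P training patterns.
    # D: (P, n) array of labels, d_jp = 1 for the true target and 0 otherwise.
    num_patterns, s, n = X.shape
    W = np.full((s, n), 1.0 / s)          # illustrative initialization
    for _ in range(epochs):
        for p in range(num_patterns):
            y = (W * X[p]).sum(axis=0)    # (5)
            z = np.exp(y - y.max())
            z /= z.sum()                  # (6)
            W -= lr * X[p] * (z - D[p])   # (15) and (16): (z_j - d_jp) x_ijp
    return W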


The parameter $w_{ij}$ is then updated by

$$w_{ij} \leftarrow w_{ij} - \eta\, \frac{\partial E_p}{\partial w_{ij}} \tag{16}$$

with a given learning parameter $\eta$. To prevent overfitting, early stopping and regularization techniques [44], [50] have been used. The optimal parameters are set by the minimal error rate over the validation data.

REFERENCES


[1] K. Ali and M. J. Pazzani, “Error reduction through learning multiple descriptions,” Machine Learning, vol. 24, p. 173, 1996.
[2] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.
[3] R. A. Bolt, “Put that there: Voice and gesture at the graphics interface,” Comput. Graphics, vol. 14, no. 3, pp. 262–270, 1980.
[4] L. Breiman, “Bagging predictors,” Machine Learning, vol. 26, no. 2, pp. 123–140, 1996.
[5] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing: Algorithms, Architectures and Applications, F. F. Soulié and J. Hérault, Eds. New York: Springer-Verlag, 1990, pp. 227–236.
[6] K. Chen and H. Chi, “A method of combining multiple probabilistic classifiers through soft competition on different feature sets,” Neurocomput., vol. 20, no. 1–3, pp. 227–252, 1998.
[7] K. J. Cherkauer, “Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks,” in Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, P. Chan, Ed., 1996, pp. 15–21.
[8] R. Clemen, “Combining forecasts: A review and annotated bibliography,” Int. J. Forecasting, vol. 5, pp. 559–583, 1989.
[9] J. Clow and S. Oviatt, “STAMP: A suite of tools for analyzing multimodal system processing,” in Proc. Int. Conf. Spoken Language Processing, 1998.
[10] C. Codella, R. Jalili, L. Koved, J. Lewis, D. Ling, J. Lipscomb, D. Rabenhorst, C. Wang, A. Norton, P. Sweeney, and C. Turk, “Interactive simulation in a multiperson virtual world,” in Proc. ACM Conf. Human Factors Comput. Syst.–CHI’92, pp. 329–334.
[11] P. Cohen, M. Johnston, D. McGee, S. Oviatt, J. Pittman, I. Smith, L. Chen, and J. Clow, “Quickset: Multimodal interaction for distributed applications,” in Proc. 5th ACM Int. Multimedia Conf., New York, 1997, pp. 31–40.
[12] P. R. Cohen, M. Dalrymple, D. B. Moran, and F. C. N. Pereira, “Shoptalk: An integrated interface for decision support in manufacturing,” in Working Notes AAAI Spring Symp. Series, AI in Manufacturing, Stanford, CA, Mar. 1989, pp. 11–15.
[13] P. R. Cohen, M. Dalrymple, D. B. Moran, F. C. N. Pereira, J. W. Sullivan, R. A. Gargan, J. L. Schlossberg, and S. W. Tyler, “Synergistic use of direct manipulation and natural language,” in Proc. CHI’89 Conf. Human Factors Comput. Syst., New York, Apr. 1989, pp. 227–234.
[14] Y. Le Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems, D. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 598–605.
[15] T. G. Dietterich, “Machine-learning research: Four current directions,” AI Mag., vol. 18, no. 4, pp. 97–136, 1997.
[16] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[17] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Proc. 2nd European Conf. Comput. Learning Theory, Mar. 1995.
[18] ——, “Experiments with a new boosting algorithm,” in Proc. 13th Int. Conf. Machine Learning, L. Saitta, Ed., 1996, pp. 148–156.
[19] M. Fukumoto, Y. Suenaga, and K. Mase, “Finger-pointer: Pointing interface by image processing,” Comput. Graphics, vol. 18, no. 5, pp. 633–642, 1994.
[20] K. Fukunaga, Statistical Pattern Recognition, 2nd ed. New York: Academic, 1990.
[21] G. J. Hahn and S. S. Shapiro, Statistical Models in Engineering. New York: Wiley, 1994.

[22] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural Information Processing Systems, S. J. Hanson, J. D. Cowan, and C. L. Giles, Eds. San Mateo, CA: Morgan Kaufmann, 1993, pp. 164–171.
[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Comput., vol. 3, no. 1, pp. 79–87, 1991.
[24] M. Johnston, P. Cohen, D. McGee, S. Oviatt, J. Pittman, and I. Smith, “Unification-based multimodal integration,” in Proc. 35th Annu. Meeting Assoc. Comput. Linguistics, San Francisco, CA, 1997, pp. 281–288.
[25] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 226–239, Mar. 1998.
[26] D. B. Koons, C. J. Sparrell, and K. R. Thorisson, “Integrating simultaneous input from speech, gaze and hand gestures,” in Intelligent Multimedia Interfaces, M. Maybury, Ed. Cambridge, MA: MIT Press, 1993, pp. 257–276.
[27] Y. Liao and J. Moody, “A neural network visualization and sensitivity analysis toolkit,” in Proc. Int. Conf. Neural Inform. Processing, S. Amari, L. Xu, L. W. Chan, I. King, and K. S. Leung, Eds., Hong Kong, Sept. 1996, pp. 1069–1074.
[28] D. J. C. Mackay, “Bayesian nonlinear modeling for the energy prediction competition,” ASHRAE Trans., pt. 2, vol. 100, pp. 1053–1062, 1994.
[29] U. Meier, W. Hurst, and P. Duchnowski, “Adaptive bimodal sensor fusion for automatic speechreading,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, 1996, pp. 833–836.
[30] B. Moghaddam and A. Pentland, “Probabilistic visual learning for object representation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 696–710, July 1997.
[31] J. G. Neal and S. C. Shapiro, “Intelligent multimedia interface technology,” in Intelligent User Interfaces, J. Sullivan and S. Tyler, Eds. New York: ACM, 1991, pp. 11–43.
[32] R. M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics no. 118. New York: Springer-Verlag, 1996.
[33] D. W. Opitz and J. W. Shavlik, “Generating accurate and diverse members of a neural-network ensemble,” in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996, vol. 8, pp. 535–541.
[34] S. Oviatt, “Multimodal interfaces for dynamic interactive maps,” in Proc. Conf. Human Factors Comput. Syst. CHI’96, Vancouver, BC, Canada, pp. 95–102.
[35] ——, “Mutual disambiguation of recognition errors in a multimodal architecture,” in Proc. Conf. Human Factors Comput. Syst. CHI’99, Pittsburgh, PA, pp. 576–583.
[36] S. Oviatt, A. DeAngeli, and K. Kuhn, “Integration and synchronization of input modes during multimodal human-computer interaction,” in Proc. Conf. Human Factors Comput. Syst. CHI’97, Atlanta, GA, pp. 415–422.
[37] V. Pavlović and T. S. Huang, “Multimodal prediction and classification on audiovisual features,” in AAAI 1998 Workshop Representations Multi-Modal Human-Comput. Interaction, Menlo Park, CA, 1998, pp. 55–59.
[38] M. P. Perrone and L. N. Cooper, “When networks disagree: Ensemble methods for hybrid neural networks,” in Artificial Neural Networks for Speech and Vision, R. J. Mammone, Ed. London, U.K.: Chapman and Hall, 1993, pp. 126–142.
[39] I. Poddar, Y. Sethi, E. Ozyildiz, and R. Sharma, “Toward natural gesture/speech HCI: A case study of weather narration,” in Proc. 1998 Workshop Perceptual User Interfaces–PUI’98, M. Turk, Ed., San Francisco, CA, Nov. 1998, pp. 1–6.
[40] T. Sejnowski, B. Yuhas, M. Goldstein, and R. Jenkins, “Combining visual and acoustic speech signals with a neural network improves intelligibility,” in Advances in Neural Information Processing Systems, D. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 232–239.
[41] A. J. Sharkey, “On combining artificial neural nets,” Connection Sci., vol. 8, no. 3–4, pp. 383–404, Dec. 1996.
[42] R. Sharma, T. S. Huang, V. I. Pavlović, Y. Zhao, Z. Lo, S. Chu, K. Schulten, A. Dalke, J. Phillips, M. Zeller, and W. Humphrey, “Speech/gesture interface to a visual computing environment for molecular biologists,” in Proc. Int. Conf. Pattern Recognition, Aug. 1996, pp. 964–968.
[43] R. Sharma, V. I. Pavlović, and T. S. Huang, “Toward multimodal human-computer interface,” Proc. IEEE (Special Issue on Multimedia Signal Processing), vol. 86, no. 5, pp. 853–869, May 1998.
[44] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, U.K.: Chapman and Hall, 1986.

[45] M. T. Vo, R. Houghton, J. Yang, U. Bub, U. Meier, A. Waibel, and P. Duchnowski, “Multimodal learning interfaces,” in Proc. ARPA SLT Workshop, Austin, TX, 1995.
[46] M. T. Vo and C. Wood, “Building an application framework for speech and pen input integration in multimodal learning interfaces,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Atlanta, GA, 1996, pp. 3545–3548.
[47] A. Waibel, M. T. Vo, P. Duchnowski, and S. Manke, “Multimodal interfaces,” Artificial Intell. Rev. (Special Volume on Integration of Natural Language and Vision Processing), vol. 10, no. 3–4, pp. 299–319, Aug. 1995.
[48] J. Wang, “Integration of eye-gaze, voice and manual response in multimodal user interface,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., 1995, pp. 3938–3942.
[49] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[50] L. Wu and J. Moody, “A smoothing regularizer for feedforward and recurrent neural networks,” Neural Comput., vol. 8, no. 3, pp. 463–491, 1996.
[51] L. Wu, S. Oviatt, and P. Cohen, “Multimodal integration—A statistical view,” IEEE Trans. Multimedia, vol. 1, no. 4, pp. 334–341, Dec. 1999.
[52] L. S. Yaeger, B. J. Webb, and R. F. Lyon, “Combining neural networks and context-driven search for online, printed handwriting recognition in the Newton,” AI Mag., vol. 19, no. 1, pp. 73–89, 1998.
[53] S. Young, “Large vocabulary continuous speech recognition: A review,” Cambridge Univ. Eng. Dept., Cambridge, U.K., Tech. Rep., 1996.