Applied Soft Computing 7 (2007) 353–363 www.elsevier.com/locate/asoc

Learning from imperfect data

Pitoyo Hartono a,*, Shuji Hashimoto b

a Department of Media Architecture, Future University-Hakodate, Kamedanakanocho 116-2, Hakodate 041-8655, Japan
b Department of Applied Physics, Waseda University, Ohkubo 3-4-1, Shinjuku-ku, Tokyo 169-8555, Japan

Received 22 September 2004; received in revised form 6 July 2005; accepted 22 July 2005

Abstract

For a supervised learning method, the quality of the training data or the training supervisor is very important in generating reliable neural networks. However, for real-world problems, it is not always easy to obtain high-quality training data sets. In this research, we propose a learning method for a neural network ensemble model that can be trained with an imperfect training data set, i.e. a data set containing erroneous training samples. With a competitive training mechanism, the ensemble is able to exclude erroneous samples from the training process, thus generating a reliable neural network. Through experiments, we show that the proposed model tolerates the existence of erroneous training samples while generating a reliable neural network. This tolerance lessens the costly task of analyzing and arranging the training data, thus increasing the usability of neural networks for real-world problems.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Neural network ensemble; Imperfect supervisor; Temperature; Competitive learning

1. Introduction

* Corresponding author. E-mail addresses: [email protected] (P. Hartono), [email protected] (S. Hashimoto).

In this study, we consider the problem of training a supervised-type neural network with imperfect training data or an imperfect supervisor, in which some percentage of incorrect training samples is included in the data. It is known that one of the important factors that determine the fidelity of a neural network is the quality of the training data. For simple or artificial

problems, it is easy to design perfect training data sets, in which no errors are included. However, for real-world problems, it is not always easy to create perfect training data. For example, when a human expert has to create the training data, it is always possible that the expert classifies some inputs incorrectly when the problem is complex. In the case of automatic data generation, noise at various stages of data collection may cause data corruption. In these cases, the neural network has to learn from an imperfect data set or supervisor. Although neural networks are known to have some ability to learn from noisy data, performance degradation is very hard to avoid.

1568-4946/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2005.07.005


In the past, we proposed a neural network ensemble model that could be trained using an imperfect training supervisor [4–7] and produced a reliable neural network to be used for a classification task. However, there were some internal parameters in the ensemble that had to be set empirically, which can be troublesome because they are problem- and structure-dependent. In this paper, we modify the ensemble by utilizing automatic parameter settings and also mathematically clarify the roles of those parameters in shaping the behavior of the ensemble. The proposed ensemble model is different from the multi-neural-network models in Refs. [9–11], which automatically assigned different sub-data to different modules in the training process and used the combined outputs of all the modules to achieve better generalization ability in the running process. The difference is that the learning target of our ensemble is tainted by erroneous data, so the ensemble has to automatically bias a particular member toward learning only from correct training data. The proposed ensemble is similar to the models proposed in Refs. [16–18,12], which deal with a class of problems called "switching dynamics" problems. The imperfect supervisor can be considered an instance of switching dynamics, because the supervisor can be thought of as having two dynamics, one generating correct teacher signals and the other generating incorrect ones. However, the training objective is significantly different in that the previous models assume the switching dynamics occur in both the training phase and the running phase, while we deal with switching dynamics that occur only in the learning phase. As in boosting [1,2], which can deal nicely with noisy data through competitive learning, the proposed model also executes a kind of competitive learning mechanism.
They differ significantly in the method of competition, in that our proposed method executes concurrent competitions among the members, and the parameter that triggers the competition is embedded in each member. There have been studies on training neural networks with deficient data. For example, Ref. [21] dealt with noisy input data, and Ref. [26] dealt with imperfect data where part of the attributes of the input vectors are

missing, while Ref. [3] dealt with various kinds of incompleteness, such as missing input components, missing class labels, missing classes, or lack of training data. Although the objective of our proposed model is similar to these previous studies, in the sense of training neural networks with imperfect data, the imperfection dealt with in this study differs from the previous ones. The proposed model deals with the problem of labeling inconsistency: not fixed mislabeling but stochastic mislabeling. This kind of mislabeling realistically occurs in real-world problems. For example, clinical data are usually sampled for diagnostic or prognostic purposes by human experts, whose subjectivity may cause mislabeling [13]. Inconsistency in data labeling also happens when several human experts have to label the same data set, because of subjectivity or differences in the level of expertise. Some methods for correcting data labeling have been proposed [8,27]. Our proposed model differs from these label-correction methods in that it tolerates the existence of mislabeled data without compromising the quality of the neural network, so the requirement of preprocessing the learning data prior to training can be relaxed. We consider that the most distinguishing characteristic of the proposed model is in the utilization of the ensemble itself. Our model uses the approach of "ensemble-training, single-running": a number of neural networks are group-trained, but once the learning procedure ends, only one of them is used in the running phase. In Section 2, the problem formulation is explained. Section 3 elaborates on the ensemble model, while the parameter called "Temperature", introduced to trigger the competition between the neural networks, is explained in detail in Section 4. Experimental results are given in Section 5, and the conclusions are given in the final section.
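The stochastic mislabeling discussed above can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper; the function name and interface are our own, and a single binary label is assumed.

```python
import random

def imperfect_supervisor(d_cor, epsilon, rng):
    """Return the teacher signal for a sample whose correct binary label
    is d_cor: with probability epsilon the label is flipped (stochastic
    mislabeling), otherwise the correct label is returned."""
    if rng.random() < epsilon:
        return 1 - d_cor  # incorrect teacher signal
    return d_cor          # correct teacher signal
```

Over many presentations of the same sample, a fraction of roughly epsilon of the teacher signals is wrong, which is exactly the inconsistency a label-correction preprocessing step would have to detect.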

2. Problem formulation

In this paper we deal with an "imperfect" supervisor, which stochastically produces mislabeled data. The behavior of the imperfect supervisor is


defined as follows:

$P(d_{inc}(x) \mid x) = \epsilon$  (1)

$P(d_{cor}(x) \mid x) = 1 - \epsilon$  (2)

with $0 < \epsilon < 0.5$, $x \in R^n$, $d_{inc}, d_{cor} \in \{0,1\}^m$, $d_{inc} \cdot d_{cor} = 0$.

$P(d_{inc}(x) \mid x)$ and $P(d_{cor}(x) \mid x)$ are the conditional probabilities that, for an input $x$, the supervisor generates the incorrect teacher signal $d_{inc}$ and the correct teacher signal $d_{cor}$, respectively. $\epsilon$ is the error rate of the supervisor. As shown above, the imperfect supervisor generates correct teacher signals in the majority of cases, with only occasional errors. $n$ and $m$ are the dimensions of the input vector and output vector, respectively.

As shown above, the data imperfection to be dealt with in this study is a deficiency in the quality of the training data caused by labeling inconsistency. Although in real-world problems the probability distribution of mislabeling is biased toward particularly difficult sub-data, in this study we generalize the problem by setting a uniform distribution of mislabeling for the whole data set.

In this paper, for simplicity, we use a neural network with only one output neuron, but fundamentally there is no limitation on the number of output neurons. The objective is to train a neural network which generates an output $O^{out}$ such that

$O^{out}(x) \approx d_{cor}(x).$  (3)

In this problem, the error rate $\epsilon$ is unknown; the only assumption is that, in the majority, the teacher signals are correctly generated. For the binary problems defined in Eq. (1), it is easy to calculate that a conventional MLP will produce $O^{out}$ [7] such that

$O^{out}(x) \approx \epsilon d_{inc}(x) + (1-\epsilon) d_{cor}(x).$  (4)

The learning result of a single MLP shown in Eq. (4) cannot satisfy the objective shown in Eq. (3). In this paper, we deal with the problem of the imperfect supervisor by utilizing the neural network ensemble explained in the next section.

3. Neural network ensemble

The proposed neural network ensemble is shown in Fig. 1. The ensemble consists of N independent MLPs (called members hereafter). All of the members receive the same input and independently process it to generate outputs. The output of each member is given to the temperature controller, which then regulates the temperature of each member; the temperature in turn affects the learning intensity of each member based on its relative performance with regard to the teacher signal generated by the supervisor. All of the members have the same number of neurons in the input layer and the same number of neurons in the output layer, while, to bias their learning abilities, we set different numbers of neurons in their hidden layers. A member whose relative performance on a particular training set is good has the privilege of learning further from that training set. On the contrary, members with bad performance are punished so that their intensity of learning from the training set is low. The temperature controller triggers competitive learning between the members, so that different members tend to acquire different input–output relations. For the problem of learning from an imperfect supervisor as defined in Eq. (1), this means that a particular member will learn only from the training sets relating input x with the correct teacher

Fig. 1. Neural network ensemble.


signal $d_{cor}(x)$, while the connection of x to $d_{inc}(x)$ is learned by another member. The behavior of the ensemble is explained as follows. In the explanation throughout this paper, for simplicity, we use members that have only one output neuron, although in principle there is no limitation on the number of output neurons. We also use three-layered perceptrons as the members of the ensemble here. However, the discussion below can be applied to MLPs with an arbitrary number of hidden layers.

3.1. Ensemble dynamics

The output of the k-th neuron in the hidden layer of the i-th member, $O_k^{i,mid}$, is

$O_k^{i,mid}(t) = \frac{1}{1 + \exp(-I_k^{i,mid}(t))}$  (5)

$I_k^{i,mid}(t) = \sum_{j=1}^{N^{in}} w_{jk}^{i,in}(t) O_j^{in}(t),$

where $N^{in}$ is the number of elements in the input and $w_{jk}^{i,in}(t)$ is the connection weight between the j-th neuron in the input layer and the k-th neuron in the hidden layer of the i-th member at time t. The value of the output neuron of the i-th member at time t is calculated as

$O^{i,out}(t) = \frac{1}{1 + \exp(-I^{i,out}(t)/T^i(t))}$  (6)

$I^{i,out}(t) = \sum_{k=1}^{N^{i,mid}} w_k^{i,mid}(t) O_k^{i,mid}(t),$

where $N^{i,mid}$ is the number of hidden neurons in the i-th member and $w_k^{i,mid}$ is the connection weight between the k-th neuron in the hidden layer and the neuron in the output layer of the i-th member. $T^i(t)$ is a scaling parameter called the "temperature" of the i-th member at time t. The influence of the temperature on the output of the member is illustrated in Fig. 2.

Fig. 2. Sigmoid function and temperature.

It is obvious that a very large T will cause the member to output values in the vicinity of 0.5. For binary problems requiring (0, 1) output, this value is irrelevant, so such a member is considered an "inactive" member. It should be noted that the error of an inactive member with regard to a binary problem is maximal; hence the performance of the inactive member will be the worst. Because the performance affects the learning intensity (explained in the next section), the intensity of learning for an inactive member will be the lowest. The parameter T is named temperature because its influence on the output is similar to the influence of the temperature in the Boltzmann distribution utilized to stochastically decide the action of an agent in Reinforcement Learning [20]. In the Boltzmann distribution, a high temperature implies a random decision, while in the proposed method a high temperature implies an output value of 0.5, which for binary problems also implies a kind of random decision.

3.2. Ensemble learning

For the learning method we adopt Backpropagation learning [19], in which the correction for the connection weight between the neurons in the hidden and output layers is

$\Delta w_k^{i,mid}(t) = -\frac{\partial E^i(t)}{\partial w_k^{i,mid}(t)} = \frac{1}{T^i(t)} O_k^{i,mid}(t)\, \delta^{i,out}(t)$  (7)

$\delta^{i,out}(t) = (d(x(t)) - O^{i,out}(t))\, O^{i,out}(t)(1 - O^{i,out}(t))$

$E^i(t) = \frac{1}{2}(O^{i,out}(t) - d(x(t)))^2.$


$\Delta w_k^{i,mid}(t)$ is the correction for the connection weight from the k-th neuron in the hidden layer to the neuron in the output layer at time t. $E^i(t)$ is the error of the i-th member at time t, and $d(x(t))$ is the teacher signal given by the supervisor in response to the input x at time t. The correction for the connection weight between the neurons in the input and hidden layers is

$\Delta w_{jk}^{i,in}(t) = \frac{1}{T^i(t)} O_j^{in}(t)\, O_k^{i,mid}(t)(1 - O_k^{i,mid}(t))\, \delta_k^{mid}(t)$  (8)

$\delta_k^{mid}(t) = w_k^{i,mid}(t)\, \delta^{i,out}(t).$

Eqs. (7) and (8) show that for a large value of $T^i(t)$, the corrections to the connection weights become insignificant. We can bias the learning of each member by controlling the temperature during the learning process. For example, the temperature of a member that has to some extent learned the connection from input x to output $d_{cor}$ should be increased upon presentation of the training pair $(x, d_{inc})$, so that the presented pair will not "contaminate" the input–output relation of the member. Hence, the temperature T works to bias the function acquired by each member. It should be noted that the temperature works not only to regulate the learning intensity of each member but also to affect its output: a temperature increase causes the member to become inactive, which eventually prevents the member from further learning from the presented training sample. It can be expected that a particular member develops a tendency to learn only from $(x, d_{cor})$, in accordance with the objective of this study.
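As a concrete illustration, Eqs. (5)–(8) can be written out in plain Python. This is a minimal sketch with list-based vectors; bias terms and a learning-rate factor are omitted, as in the equations above, and all names are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def member_forward(x, w_in, w_mid, T):
    """Forward pass of one member: plain sigmoid hidden units (Eq. (5));
    the output unit's net input is divided by the member's temperature T
    (Eq. (6)), so a very hot member outputs values near 0.5."""
    o_mid = [sigmoid(sum(w_in[j][k] * x[j] for j in range(len(x))))
             for k in range(len(w_mid))]
    o_out = sigmoid(sum(w_mid[k] * o_mid[k] for k in range(len(w_mid))) / T)
    return o_mid, o_out

def member_corrections(x, o_mid, o_out, d, w_mid, T):
    """Temperature-scaled backpropagation corrections (Eqs. (7)-(8)):
    both corrections carry a 1/T factor, so a hot member effectively
    skips the presented training sample."""
    delta_out = (d - o_out) * o_out * (1.0 - o_out)
    dw_mid = [o_mid[k] * delta_out / T for k in range(len(w_mid))]
    dw_in = [[x[j] * o_mid[k] * (1.0 - o_mid[k]) * w_mid[k] * delta_out / T
              for k in range(len(w_mid))]
             for j in range(len(x))]
    return dw_in, dw_mid
```

Raising T both flattens the member's output toward 0.5 and scales its weight corrections down, which is the inactivation mechanism the temperature controller relies on.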

4. Temperature control

The temperature control mechanism is introduced to trigger competition among the members, so that a member with relatively good performance is automatically activated to learn the training example while the irrelevant members are deactivated. The mechanism of temperature control is to decrease the temperatures of the members that performed relatively well on a particular training example and increase the temperatures of the members that performed worse. For binary-valued output neurons, it is clear that if a member performs well with respect to the correct training sample, it will perform badly on the same input with the incorrect (flipped) teacher signal. In this case, the increase in temperature prevents the member from learning from the incorrect training sample. Because it is favorable that there is only one active member in each learning iteration, we also apply competition for domination between the members, so that a particular member that performs well not only rewards itself by decreasing its temperature but also suppresses the other members by increasing their temperatures. On the other hand, a member that performs badly is penalized by an increase of its temperature and gives its learning opportunity to others by decreasing their temperatures. Because of the assumption that, in the majority, the teacher signals are correct, and because of the variety in the structure of the members, it can be expected that, through competitive learning, a particular member will have a bias or tendency to learn only from the correct training data. The temperature control is expressed as follows:

$T^i(t+1) = T^i(t) + \Delta T^i(t)$  (9)

$\Delta T^i(t) = -p_{self}^i(t)\,(1 - N\tau^i(t)) + p_{cross}^i(t) \sum_{j \neq i}^{N} (1 - N\tau^j(t))$

$\Delta T^i(t)$ in Eq. (9) is the correction of the i-th member's temperature at time t. The first term on the right side is the self-penalty term, in which the relative performance of the i-th member influences that member's temperature, while the second term is the cross-penalty term, which is the influence of the other members' performances on the i-th member's temperature. N is the number of members in the ensemble. The relative performance $\tau^i(t)$ of the i-th member is calculated as follows:

$\tau^i(t) = \frac{(d(t) - O^{i,out}(t))^2}{\sum_{j=1}^{N} (d(t) - O^{j,out}(t))^2}$  (10)

$d(t)$ is the teacher signal at time t.
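A minimal sketch of the temperature update of Eqs. (9) and (10) follows. This is our own reconstruction: the penalty rates are passed in as plain numbers here, whereas the paper defines them per member in Eq. (13), and the clipping anticipates the bound of Eq. (11).

```python
def update_temperatures(T, outputs, d, p_self, p_cross, T_max):
    """One temperature-control step, Eqs. (9)-(10). tau[i] is member i's
    share of the ensemble's squared error, so (1 - N*tau[i]) > 0 means
    better-than-average performance: the self-penalty term then lowers
    T[i] (more learning), while bad performance raises T[i] and, through
    the cross term, lowers everyone else's temperature."""
    N = len(T)
    sq_err = [(d - o) ** 2 for o in outputs]
    total = sum(sq_err) or 1e-12  # guard: all members exactly correct
    tau = [e / total for e in sq_err]
    new_T = []
    for i in range(N):
        dT = (-p_self * (1.0 - N * tau[i])
              + p_cross * sum(1.0 - N * tau[j] for j in range(N) if j != i))
        new_T.append(min(max(T[i] + dT, 1.0), T_max))  # clip to [1, T_max]
    return new_T
```

Because the relative performances sum to one, the best-performing member cools down (and so learns more intensely) while the worst heats up.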


Further, to prevent the temperature $T^i$ from growing so large as to cause a member to become permanently inactive, we limit the value of $T^i$ as follows:

$1 \leq T^i(t) \leq T_{max}^i(t)$  (11)

In previous studies [4–7], the value of $T_{max}$ was set empirically, but setting this value could be troublesome because it depends heavily on the problem and on the structures of the members. In this study, we introduce an automatically tuned $T_{max}^i(t)$. $T_{max}$ has to be set to ensure that a member that has reached the maximum temperature becomes an inactive member (its outputs are always in the vicinity of 0.5) and is irrelevant to the learning procedure. With this consideration, $T_{max}$ is set as follows:

$T_{max}^i(t) = a \sum_{j=1}^{N^{i,mid}} |w_j^{i,mid}(t)|$ for $\sum_{j=1}^{N^{i,mid}} |w_j^{i,mid}(t)| \geq 1$

$T_{max}^i(t) = a$ for $\sum_{j=1}^{N^{i,mid}} |w_j^{i,mid}(t)| \leq 1$  (12)

where $a$ is a positive constant.

The self-penalty rate $p_{self}^i$ and the cross-penalty rate $p_{cross}^i$ are set as follows:

$p_{self}^i(t) = \frac{T_{max}^i(t)}{M}, \qquad p_{cross}^i(t) = \frac{p_{self}^i(t)}{N}$  (13)

M is a positive constant that determines the "speed" at which the i-th member reaches $T_{max}^i$ when it constantly performs badly.

After the learning process is terminated, we have to choose from the ensemble's members a "winner", which is the member with the greatest tendency to learn only from the correct training samples. The assumption that, in the majority, the training samples are correct implies that the member that benefits most from the training process is the winner. A member that learns most of the time during the training process is the member with the lowest average temperature; hence, the winner can be chosen as

$win = \arg\min_i \{\langle T^i \rangle\},$

in which $\langle T^i \rangle$ is the average temperature of the i-th member during the training process.

4.1. Output of an inactive member

For the proposed ensemble model, it is important to guarantee that the output of any member that has reached the temperature maximum $T_{max}$ is in the vicinity of 0.5. The proof that the selection of $T_{max}$ in Eq. (12) ensures

$O^{i,out}(t) = \frac{1}{1 + \exp(-u^{i,out}(t)/T_{max}^i(t))} \approx 0.5$  (14)

is given as follows:

$u^{i,out}(t) = \sum_{k=1}^{N^{i,mid}} w_k^{i,mid}(t) O_k^{i,mid}(t)$  (15)

$\left| \sum_{k=1}^{N^{i,mid}} w_k^{i,mid}(t) O_k^{i,mid}(t) \right| \leq \sum_{k=1}^{N^{i,mid}} |w_k^{i,mid}(t) O_k^{i,mid}(t)| < \sum_{k=1}^{N^{i,mid}} |w_k^{i,mid}(t)|,$  (16)

because the sigmoid function guarantees

$0 < O_k^{i,mid}(t) < 1.$  (17)

Eq. (16) implies that

$-\frac{1}{a} < \frac{\sum_{k=1}^{N^{i,mid}} w_k^{i,mid}(t) O_k^{i,mid}(t)}{a \sum_{k=1}^{N^{i,mid}} |w_k^{i,mid}(t)|} < \frac{1}{a},$  (18)

so that

$\frac{1}{1 + \exp(1/a)} < O^{i,out}(t) < \frac{1}{1 + \exp(-1/a)}$  (19)
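As an illustrative reconstruction (not the paper's code; the function names and list-based weight representation are ours, and Eq. (13) is read as $p_{self} = T_{max}/M$, $p_{cross} = p_{self}/N$), the auto-tuned ceiling of Eq. (12), the penalty rates, and the winner rule reduce to a few lines:

```python
import math

def t_max(w_mid, a):
    """Auto-tuned temperature ceiling, Eq. (12): a times the L1 norm of
    the hidden-to-output weights, floored at a when that norm is <= 1."""
    s = sum(abs(w) for w in w_mid)
    return a * s if s > 1.0 else a

def penalty_rates(t_max_i, M, N):
    """Penalty rates as reconstructed from Eq. (13): the self-penalty is
    T_max/M and the cross-penalty is the self-penalty divided by N."""
    p_self = t_max_i / M
    return p_self, p_self / N

def pick_winner(avg_T):
    """Winner = member with the lowest average temperature over training,
    i.e. the member that was allowed to learn most often."""
    return min(range(len(avg_T)), key=lambda i: avg_T[i])
```

For example, with $a = 100$ and hidden-to-output weights whose absolute values sum to 4.3, the ceiling is 430; since the output unit's net input can never exceed 4.3 in magnitude (Eq. (16)), a member at this ceiling computes the sigmoid of at most $4.3/430 = 1/a = 0.01$ and therefore stays within about $1/(4a) = 0.0025$ of 0.5.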


Considering the Taylor expansion

$\frac{1}{1 + \exp(-1/a)} = \frac{1}{2} + \frac{1}{4}\left(\frac{1}{a}\right) - \frac{1}{48}\left(\frac{1}{a}\right)^3 + \frac{1}{480}\left(\frac{1}{a}\right)^5 - \cdots,$  (20)

Eq. (19) can be approximated as

$\frac{1}{2} - \frac{1}{4}\frac{1}{a} < O^{i,out}(t) < \frac{1}{2} + \frac{1}{4}\frac{1}{a}$  (21)

Eq. (21) guarantees that the output of a member that has reached the maximum temperature $T_{max}$ will be in the vicinity of 0.5, while the deviation from 0.5 is determined by the parameter $a$. Fig. 3 illustrates the influence of $a$ on the inactivity level of a member that has reached $T_{max}$.

4.2. Learning of an inactive member

In the proposed competitive learning method, ideally an inactive member should not be affected by the learning procedure, so that it can protect its expertise when the given training samples are not relevant. Hence, it is important that the following equation holds:

$\Delta W^i \approx 0$ for $T^i(t) = T_{max}^i(t)$  (22)

The proof that the selection of $T_{max}$ in Eq. (12) ensures that Eq. (22) is true is given as follows. The correction of the connection weight from the k-th neuron in the middle layer to the neuron in the output layer is

$\Delta w_k^{i,mid}(t) = -\frac{\partial E^i(t)}{\partial w_k^{i,mid}(t)} = -\frac{\partial E^i(t)}{\partial O^{i,out}(t)} \frac{\partial O^{i,out}(t)}{\partial I^{i,out}(t)} \frac{\partial I^{i,out}(t)}{\partial w_k^{i,mid}(t)},$  (23)

and the term that is relevant to $T_{max}$ is

$\left| \frac{\partial O^{i,out}(t)}{\partial I^{i,out}(t)} \right| = \frac{1}{T_{max}^i(t)} O^{i,out}(t)(1 - O^{i,out}(t)) \leq \frac{1}{4} \frac{1}{a \sum_{k=1}^{N^{i,mid}} |w_k^{i,mid}(t)|} < \frac{1}{a}.$  (24)

Eq. (24) indicates that it is sufficient to set $a \gg 1$ to nullify the learning process for the connection weights between the middle and output layers. For the connection weights between the neurons in the input and hidden layers, the weight correction can be written as follows:

$\Delta w_{nm}^{i,in}(t) = \frac{1}{T_{max}^i(t)} O_n^{i,in}(t)\, O_m^{i,mid}(t)(1 - O_m^{i,mid}(t))\, \delta_m^{mid}(t)$  (25)

$\delta_m^{mid}(t) = w_m^{i,mid}(t)\,(d(t) - O^{i,out}(t))\, O^{i,out}(t)(1 - O^{i,out}(t))$

Hence,

$|\Delta w_{nm}^{i,in}(t)| \leq \frac{1}{a \sum_{k=1}^{N^{i,mid}} |w_k^{i,mid}(t)|}\, O_n^{i,in}(t)\, w_m^{i,mid}(t)$