
Pattern Recognition Vol. 14, Nos. 1-6, pp. 111-115, 1981. Printed in Great Britain.

0031-3203/81/070111-05 $2.00/0 Pergamon Press Ltd. © 1981 Pattern Recognition Society

DEALING WITH A PRIORI KNOWLEDGE BY FUZZY LABELS

FRITS T. BEUKEMA TOE WATER and ROBERT P. W. DUIN†

Pattern Recognition Group, Department of Applied Physics, Delft University of Technology, The Netherlands

(Received 9 January 1980; in revised form 1 May 1980; received for publication 22 December 1980)

Abstract - The performances of two different estimators of a discriminant function of a statistical pattern recognizer are compared. One estimator is based on binary label values of the objects of the learning set (hard labels) and the other on continuous or multi-discrete label values in the interval [0, 1] (fuzzy labels). The latter estimator uses more detailed a priori knowledge of the contributing learning objects. In a discrete feature space, in which a multinomial distribution function has been assumed to exist, the expected classification error based on fuzzy labels can be smaller than the one based on hard labels.

Statistical pattern recognition    Classification error    A priori knowledge    Fuzzy labels    Discriminant function

1. INTRODUCTION

The labels of the learning samples used in the learning stage of a statistical pattern recognizer are usually given in a hard manner. As far as the labeling is concerned all the objects are treated in the same rigid way: an object belongs to a class or it does not. By doing so one might lose valuable information, as not necessarily all of the objects have an equally strong resemblance to the ideal representatives of the classes. A way in which this loss of information may be avoided is by giving fuzzy (or soft) labels to the samples. The term fuzzy has been derived from Zadeh's fuzzy set theory.(1) Each object is given a membership value for each class indicating the extent of its resemblance to the ideal representatives of that class. The term fuzzy label, indicating the possibility of label values between 0 and 1, is chosen here instead of soft label in order to emphasize the relation with the fuzzy set concept as presented by Zadeh. The idea is that the classes are fuzzy sets and that the resemblance of each object to each class can be represented by a membership value. In this paper, however, no further use has been made of the fuzzy set theory. Furthermore we realize that what are called fuzzy labels here have been used by many authors before without using that name. The goal of our research is to assess the usefulness of a fuzzy-labelled learning set. An important aspect is that these labels are given by the teacher on the basis of the whole object, or at least on the basis of more or different features than are used in the automatic classification procedure. It should be noticed that this kind of labeling is different from the fuzzification of hard labels on the basis of some distance measure to the discriminant function as proposed by Tsokos and Welch,(2) or on the basis of class densities as proposed by Lissack and Fu(3) and by Glick(4) for error estimation. These authors start with hard labels and give some objective transformation to fuzzy labels. In this paper we will start with originally fuzzy-labelled objects. However, the underlying idea is the same: estimators based on a continuous or multi-discrete variable between 0 and 1 may be more accurate than estimators based on their clipped binary values. A slightly different approach, the so-called probabilistic teacher, has been studied by Agrawala(5) and by Cooper.(6) In this case the (hard) labeling of the objects has been randomized. However, if the teacher assigns the probabilities instead of randomly chosen hard labels using those probabilities, the approach is comparable with the fuzzy label approach. Still a difference exists: probabilities necessarily sum to one while fuzzy membership values do not. It has been pointed out by Zadeh(7) that membership values should rather be seen as possibilities than as probabilities.

†Address for correspondence: R. P. W. Duin, Dept. of Applied Physics, room 329, Delft University of Technology, P.O. Box 5046, 2600 GA Delft, The Netherlands.

In Section 2 some remarks on the transformation of a priori knowledge into a posteriori knowledge are made. Sections 3 and 4 contain the general description of the comparison of fuzzy labels with hard labels and of a special case which has been studied. In Section 5 the results of computer simulations and of analytic approximations are presented. Conclusions with a short discussion of the results are given in Section 6.


2. TRANSFORMATION OF A PRIORI KNOWLEDGE INTO A POSTERIORI KNOWLEDGE

One of the main goals of statistical pattern recognition is the establishment of an optimal classifier by means of statistical techniques. Many authors have emphasized the application of a priori knowledge while using these techniques. In the case of supervised learning, which is the only case studied here, a source of a priori knowledge is the labelling of the individual objects of the learning set. As stated in the introduction, the loss of information which might occur while using hard labels, i.e. the label value can only be 1 for one class and 0 for the other classes, could be reduced by fuzzy labelling, i.e. the label value is somewhere between 0 and 1 for each class. In fact more precise knowledge of the individual learning objects is used in this way. The question is whether this will lead to more accurate a posteriori knowledge. In other words, will the classification error diminish? As will be shown in the following sections, there are indeed situations in which this happens.

Now that we have considered hard labels and fuzzy labels as two different forms of a priori knowledge, we could do the same for hard and fuzzy decisions with respect to a posteriori knowledge. In completely automated classifiers one often desires hard decisions, but in cases where the final decision is human, fuzzy representations can be more useful. In Table 1 the possible decisions in relation to the types of labelling are shown. Some characteristics are mentioned. The combination of hard labels and hard decisions is called classical, because it has been studied the most until now. Hard labelled data may be transformed into a posteriori probabilities used for a following decision stage. Especially in cases in which the final decision has to be made by man, as in the field of medical diagnosis, this may be important; see Hermans and Habbema.(8) In this paper the combination of fuzzy labels and hard decisions is partly investigated, while the last combination, as far as is known, has not yet been studied. In the next section a set-up for the comparison of fuzzy and hard labels is made.

Table 1. Relations between two types of label and two types of decision

                     Labels of the learning set
  Decisions          Hard                            Fuzzy
  Hard               Classical                       Studied in this paper
  Fuzzy              For following decision stage    Not yet researched

3. THE COMPARISON OF FUZZY AND HARD LABELS

A sensible comparison demands certain conditions and a good measure of comparison. These will be formulated in this section. First of all a restriction to two-class problems has been made. If A and B are the two distinct classes, then the fuzzy labels can be defined as membership values

    m_i ∈ [0, 1],   i = A, B.                                          (1)

Hard labels can be seen as a special case of fuzzy labels. They will be denoted by

    v_i ∈ {0, 1},   i = A, B.                                          (2)

So as to make a sensible comparison there has to be some consistency between the hard and the fuzzy labelling. For that reason the fuzzy labels m_i and the hard labels v_i will be related by

    m_i > 0.5  ↔  v_i = 1,
    m_i = 0.5  ↔  v_i = 1 by random choice,    i = A, B,               (3)
    m_i < 0.5  ↔  v_i = 0,

which means that a hard label value 1 (0) will be given if the fuzzy label value is greater than 0.5 (less than 0.5) and vice versa. If a fuzzy label value of 0.5 is given, the hard label value will be 0 or 1 by random choice. In addition to this, as v_i is restricted by

    v_A + v_B = 1,                                                     (4)

m_i will be restricted by

    m_A + m_B = 1.                                                     (5)

The conditions (3), (4) and (5) make it possible to give only one label value to an object, which will be denoted by m = m_A. Condition (5) is not strictly necessary, as has already been indicated in Section 1, and will be further discussed in Section 6. Furthermore, let x denote a feature vector (x_1 ... x_k)^T in a k-dimensional feature space and let f(x, m) be the joint distribution function of x and the label value m. If discriminant functions D_j(x), j = h, f, in the hard as well as the fuzzy case are computed such that

    D_j(x) > 0,  j = h, f,  then x ∈ class A,
    D_j(x) < 0,  j = h, f,  then x ∈ class B,                          (6)

then the classification error can be computed. With a finite learning set only estimates D̂_j(x) of the discriminant functions D_j(x) can be made. Usually the density f(x, m) also has to be estimated, but in computer simulations this is not necessary. The classification errors are then defined as

    ε_j = ∫₀¹ ∫_{D̂_j(x) < 0} m f(x, m) dx dm + ∫₀¹ ∫_{D̂_j(x) > 0} (1 − m) f(x, m) dx dm,   j = h, f.   (7)

4. A SPECIAL CASE

The special case studied here is a discrete feature space in which a multinomial distribution function has been assumed to exist. In such a space the classification of a point x_0 can be based directly on the labels of the learning objects observed in x_0. In the hard label case the relative frequency

    f̂ = N(m > 0.5)/n + N(m = 0.5)/(2n)

is used with the assignment rule

    f̂ > 0.5  then x_0 is assigned to class A,
    f̂ = 0.5  then x_0 is assigned to class A or class B by random choice,
    f̂ < 0.5  then x_0 is assigned to class B,                          (8)

in which N(m > 0.5) is the number of times that the label value m is greater than 0.5, N(m = 0.5) the number of times that m is equal to 0.5, and n the number of learning objects. N(m = 0.5) is only counted half as a consequence of the random choice in equation (3). In the fuzzy case likewise assignment rules are used in which the relative frequency is replaced by the average label value

    m̄ = (1/n) Σ_{l=1}^{n} m_l.
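To make the two assignment rules concrete, here is a small sketch in Python (our illustration, not part of the paper; the names hard_labels, assign, f_hat and m_bar are ours). It derives hard labels from fuzzy ones according to equation (3) and applies rule (8) with both the relative frequency f̂ and the average label value m̄.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_labels(m):
    """Hard labels v from fuzzy labels m, following equation (3):
    v = 1 if m > 0.5, v = 0 if m < 0.5, a random 0/1 if m = 0.5."""
    v = (m > 0.5).astype(float)
    ties = (m == 0.5)
    v[ties] = rng.integers(0, 2, ties.sum())
    return v

def assign(statistic):
    """Assignment rule (8): class A above 0.5, class B below,
    a random choice at exactly 0.5."""
    if statistic > 0.5:
        return "A"
    if statistic < 0.5:
        return "B"
    return rng.choice(["A", "B"])

# Fuzzy labels of the learning objects observed in a cell x0 (made-up values).
m = np.array([0.9, 0.6, 0.5, 0.4, 0.3])

v = hard_labels(m)
f_hat = v.mean()   # equals N(m > 0.5)/n + N(m = 0.5)/(2n) in expectation
m_bar = m.mean()   # average fuzzy label value

print("hard :", f_hat, assign(f_hat))
print("fuzzy:", m_bar, assign(m_bar))
```

With these made-up labels m̄ = 0.54, so the fuzzy rule always assigns class A, whereas the hard rule depends on the random tie-break of the label 0.5 and may assign either class.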

With these assignment rules, and stating that p = P(A | x = x_0) is the probability of class A in x_0, the expected classification errors in x = x_0 are given by

    E(ε_h) = p · Prob(f̂ < 0.5) + (1 − p) · Prob(f̂ > 0.5) + 0.5 · Prob(f̂ = 0.5),      (9)

    E(ε_f) = p · Prob(m̄ < 0.5) + (1 − p) · Prob(m̄ > 0.5) + 0.5 · Prob(m̄ = 0.5).     (10)

Here it has been assumed that the labels of the learning objects in x_0 are independent random variables. Note that for n → ∞ the probabilities in equations (9) and (10), except p, tend to 0 or 1, by which the expected classification errors approach p or 1 − p. The classification errors will depend on the distribution of n·f̂, which is binomial, and of m̄, which tends to a normal distribution if the number of learning objects is large, as a consequence of the central limit theorem. If the size of the learning set is small, the probabilities Prob(·) in equation (10) will depend on the distribution f(m | x = x_0) of the label values and E(ε_f) has to be computed by Monte Carlo procedures.

It should be noted that equations (9) and (10) can also be used to evaluate the performance of a discriminant function by means of test objects which have a distribution function f(m | x = x_0). In the hard label situation one only takes into account the fact that an object lies on a particular side of the discriminant function; in the fuzzy label situation, in addition to this, the distance |m| to the discriminant function is used (see Lissack and Fu(3) and Glick(4)). Note that m can now be in the interval (−∞, +∞). By equations (9) and (10) two different expected classification errors are formulated, of which the former is based on the average of the binary values 0 and 1 and the latter on the average of values between 0 and 1. By and large the costs of the fuzzy labelling will be higher than those of the hard labelling. However, in exchange for these costs a learning set which contains more detailed knowledge of the contributing objects is acquired. The effort that has been made should be repaid in some way or another. This might be done by a faster converging estimator m̄. To demonstrate the performance of the estimators f̂ in equation (9) and m̄ in equation (10), comparisons of the expected classification errors E(ε_h) and E(ε_f) for four different types of distribution functions are shown in the next section.
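As an illustration of how E(ε_h) and E(ε_f) can be computed by such Monte Carlo procedures, the following Python sketch (our own, under the assumption of density (a) of the next section, for which p = 0.611) draws n fuzzy labels per run, forms f̂ and m̄, and averages the errors of the resulting assignments according to equations (9) and (10).

```python
import numpy as np

rng = np.random.default_rng(1)

# Density (a) of Section 5: discrete uniform on {0.2, 0.3, ..., 1.0}.
VALUES = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
P = 0.611   # P(A | x = x0) for this density: (5 + 0.5)/9

def expected_errors(n, runs=200_000):
    """Monte Carlo estimates of E(eps_h) and E(eps_f), equations (9) and (10)."""
    m = rng.choice(VALUES, size=(runs, n))
    # Hard labels via equation (3); ties at m = 0.5 are broken at random.
    v = np.where(m > 0.5, 1.0,
                 np.where(m < 0.5, 0.0, rng.integers(0, 2, (runs, n))))
    f_hat = v.mean(axis=1)
    m_bar = m.mean(axis=1).round(12)   # rounding guards floating-point ties at 0.5

    def error(stat):
        # p if class B is chosen, 1 - p if class A, 0.5 expected at a tie.
        return np.mean(P * (stat < 0.5) + (1 - P) * (stat > 0.5)
                       + 0.5 * (stat == 0.5))

    return error(f_hat), error(m_bar)

for n in (1, 3, 5, 10):
    e_h, e_f = expected_errors(n)
    print(f"n = {n:2d}:  E(eps_h) ~ {e_h:.3f},  E(eps_f) ~ {e_f:.3f}")
```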

5. RESULTS OF COMPUTER SIMULATIONS AND ANALYTIC APPROXIMATIONS

The classification of a point x based on a hard labelled learning set may only differ from the one based on a fuzzy labelled learning set if x is in the region of overlap between the classes. The classifications differ if m̄ > 0.5 and f̂ < 0.5, or the other way around. The distributions of the fuzzy label values should therefore include the point 0.5. Below, four different density functions f(m | x = x_0) for the fuzzy label values are investigated. They are chosen in such a way that there is a large class overlap and that they illustrate the various effects that may occur. For those reasons they are not representative for an arbitrary point x. The density functions are as listed below.

(a) Discrete uniform with m ∈ {0.2, 0.3, ..., 1.0}. Then p equals 0.611.

(b) Truncated normal with m ∈ [0, 1]. In this case p equals 0.612, as the mean of the untruncated distribution is chosen as 0.75 and the standard deviation as 0.5.

(c) Special type, with 40% of m distributed uniformly between 0 and 0.25 and 60% between 0.5 and 0.75; p = 0.6.

(d) Standard normal, with m ∈ (−∞, +∞); p equals 0.69 as the mean has been chosen as 0.5. This example is presented here to demonstrate that equations (9) and (10) can also be used to evaluate the performance of a discriminant function by means of test objects, as mentioned in the previous section.

These are the label densities in a single point of the discrete feature space. The results shown in Figs 1-4 are the expected classification errors of that point. In the cases (a) and (c) the label values were generated: the expected classification errors in equations (9) and (10) have been computed as functions of the number n of learning objects, with 1000 runs for each value of n less than or equal to 10; normal approximations were used for values of n greater than or equal to 10. In cases (b) and (d) normal approximations have been used for each value of n shown.
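For reference, the four label densities could be sampled as in the Python sketch below (our own reconstruction; the paper gives no code). For (b) simple rejection sampling is used; for (d) a unit standard deviation is our assumption, as the text only calls the density standard normal with mean 0.5.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_a(size):
    """(a) Discrete uniform on {0.2, 0.3, ..., 1.0}; p = 0.611."""
    return rng.choice(np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
                      size=size)

def sample_b(size):
    """(b) Normal (mean 0.75, s.d. 0.5) truncated to [0, 1] by rejection;
    p = 0.612."""
    out = np.empty(0)
    while out.size < size:
        cand = rng.normal(0.75, 0.5, size)
        out = np.concatenate([out, cand[(cand >= 0.0) & (cand <= 1.0)]])
    return out[:size]

def sample_c(size):
    """(c) Special type: 40% uniform on [0, 0.25], 60% uniform on [0.5, 0.75];
    p = 0.6."""
    low = rng.random(size) < 0.4
    return np.where(low, rng.uniform(0.0, 0.25, size),
                    rng.uniform(0.5, 0.75, size))

def sample_d(size):
    """(d) Normal with mean 0.5 and (assumed) unit standard deviation,
    unbounded support; p = 0.69."""
    return rng.normal(0.5, 1.0, size)
```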

Fig. 1. The expected classification errors E(ε_j), j = h, f, as a function of the number of learning objects n. The fuzzy label values have been drawn randomly from a discrete uniform distribution function as described under (a) in Section 5. One thousand runs for each value of n less than or equal to 10 have been made. For n = 10 and 15 normal approximations have been made.

Fig. 2. The expected classification errors E(ε_j), j = h, f, as a function of the number of learning objects n. The errors have been computed by normal approximation. The distribution function of the fuzzy label values m is described under (b) in Section 5.

Fig. 3. The expected classification errors E(ε_j), j = h, f, as a function of the number of learning objects n. The fuzzy label values have been drawn randomly from a distribution function as described under (c) in Section 5. 1000 runs for each value of n shown have been made.

Fig. 4. The expected classification errors E(ε_j), j = h, f, as a function of the number of learning objects n. The errors have been computed by normal approximation. The distribution function of the fuzzy label values is described under (d) in Section 5.

6. DISCUSSION AND CONCLUSIONS

The results of the simulations and approximations of the previous section lead to the following conclusions. Depending on the number of learning objects used, the distribution function of the fuzzy labels and the probabilities p [= Prob(A | x = x_0)] and 1 − p, the expected classification error in the fuzzy labelled situation can converge faster as a function of the number of learning objects than the same kind of error in the hard labelled situation. This may result in an expected classification error that is up to 3% better in absolute terms, or in up to 30% fewer learning objects needed to reach the same error.


However, as demonstrated in example (c) of Section 5, there exists a class of distribution functions of the fuzzy label value for which the expected classification error increases. This will be the case when the mean and the median of the distribution function are on opposite sides of the label value 0.5. The fuzzy and the hard class assignments will then be different. For density (c), for instance, the mean label value is 0.425 while 60% of the probability mass lies above 0.5, so that for large n the estimator m̄ converges to a value below 0.5 while f̂ converges to 0.6. The condition (5), m_A + m_B = 1, used in Section 3 could be dropped. Membership values or fuzzy label values are not necessarily restricted to this, as can be seen in equation (1). If one drops condition (5) a different estimator should be used, for instance the average of the difference of the fuzzy label values:

    d̄ = (1/n) Σ_{l=1}^{n} (m_{A,l} − m_{B,l}).

Simulations with this estimator have given results similar to those shown in the previous section. Furthermore it should be noticed that in the multi-class case estimators such as m̄, f̂ and d̄ cannot be used any more. Maxima of the average label values for each class could be used instead. A successful application of fuzzy labels in practice depends on the way in which they are assigned to the learning objects. There should be a correspondence between the local amount of class overlap and the membership value. Otherwise the labelling and the feature values are inconsistent, which may result in worse instead of better results.

The research has been limited to a discrete feature space in which a multinomial distribution function has been assumed. Further research should certainly involve the classification error in the continuous space as expressed in equation (7).
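The two alternative estimators touched upon above can be sketched as follows (our Python illustration; the symbol d̄ for the average label difference and both function names are ours).

```python
import numpy as np

def d_bar(m_A, m_B):
    """Average difference of the fuzzy label values of the two classes,
    usable when condition (5), m_A + m_B = 1, is dropped; a positive
    value suggests class A."""
    return float(np.mean(np.asarray(m_A) - np.asarray(m_B)))

def multiclass_assign(memberships):
    """Multi-class rule suggested in the text: choose the class whose
    average membership value over the learning objects in x0 is maximal.
    `memberships` has shape (n_objects, n_classes)."""
    return int(np.argmax(np.asarray(memberships).mean(axis=0)))
```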

REFERENCES

1. L. A. Zadeh, Fuzzy sets, Inf. Control 8, 338-353 (1965).
2. C. P. Tsokos and R. L. W. Welch, Bayes discrimination with mean square error loss, Pattern Recognition 10, 113-123 (1978).
3. T. Lissack and K. S. Fu, Error estimation in pattern recognition via Lα-distance between posterior density functions, IEEE Trans. Inf. Theory IT-22, 34-45 (1976).
4. N. Glick, Additive estimators for probabilities of correct classification, Pattern Recognition 10, 211-222 (1978).
5. A. K. Agrawala, Learning with a probabilistic teacher, IEEE Trans. Inf. Theory IT-16, 373-379 (1970).
6. D. B. Cooper, On some convergence properties of "Teaching with a probabilistic teacher" algorithms, IEEE Trans. Inf. Theory IT-21, 699-702 (1975).
7. L. A. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets Syst. 1, 3-28 (1978).
8. J. Hermans and J. D. F. Habbema, Comparison of five methods to estimate a posteriori probabilities, EDV Med. Biol. 6, 14-19 (1975).

About the Author - FRITS T. BEUKEMA TOE WATER was born in 1953 in Oegstgeest, The Netherlands. He received the Ingenieur degree in applied physics from the Delft University of Technology, Delft, The Netherlands, in 1980. His graduate work included theoretical investigations in the field of statistical pattern recognition. He joined the Physics Laboratories of the Dutch Defence Research Organization RVO-TNO, The Hague, The Netherlands, in 1980.

About the Author - ROBERT P. W. DUIN was born in 1946 in Maasniel, The Netherlands. He received the Ingenieur degree in applied physics in 1970 and the Doctoral degree in 1978, both from the Delft University of Technology, Delft, The Netherlands. Since 1970 he has been a staff member of the Pattern Recognition group of the Department of Applied Physics of the Delft University of Technology. He is responsible for the development of statistical approaches in the research on pattern recognition and complex data. Mr. Duin is a member of the Pattern Recognition Society, the Biometric Society, and the Netherlands Society for Statistics, Biometrics, Econometrics and Operational Research.