Fuzzy Sets and Systems 157 (2006) 3109 – 3122 www.elsevier.com/locate/fss
An omission approach for detecting outliers in fuzzy regression models
Wen-Liang Hung (a,∗), Miin-Shen Yang (b)
(a) Graduate Institute of Computer Science, National Hsinchu University of Education, Hsin-Chu, Taiwan
(b) Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan
Received 17 September 2005; received in revised form 28 July 2006; accepted 17 August 2006 Available online 14 September 2006
Abstract
Since Tanaka et al. proposed linear regression with a fuzzy model in 1982, fuzzy regression analysis has been widely studied and applied in various areas. However, Tanaka’s approach may give an incorrect interpretation of the fuzzy linear regression results when outliers are present in the data set. To handle the outlier problem, we propose an omission approach for Tanaka’s linear programming method. This approach examines how the value of the objective function of a fuzzy regression model changes when observations are omitted. Furthermore, we use a simple visual display, the box plot, to define the cutoffs for outliers. Numerical experiments are performed to assess the performance of the proposed approach, and the results indicate that it performs well.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Box plot; Fuzzy linear regression; Linear programming; Omission approach; Outlier
1. Introduction

Tanaka et al. [13] initiated research in fuzzy linear regression (FLR) analysis. The parameter estimations of the FLR model were considered under two factors, namely, the degree of the fit and the vagueness of the model. The estimation problems were then transformed to a linear programming method based on these two factors. This type of analysis of FLR models is called Tanaka’s approach throughout this paper. Extensions of FLR models and different estimation methods have been proposed by many researchers along with Tanaka’s approach (see [4,5,7,8,10,11]).

A brief survey of Tanaka’s approach to FLR models is now given. A fuzzy number $F$ is defined as a convex normalized fuzzy set of the real line $\mathbb{R}$ such that there exists exactly one $x_0 \in \mathbb{R}$ with $F(x_0) = 1$, and its membership function $F(x)$ is piecewise continuous. A fuzzy number $M$ is of the LR-type if there are $m$, $\alpha > 0$, $\beta > 0$ in $\mathbb{R}$ so that

\[
M(x) =
\begin{cases}
L\left(\dfrac{m - x}{\alpha}\right) & \text{if } x \le m, \\[2mm]
R\left(\dfrac{x - m}{\beta}\right) & \text{if } x \ge m,
\end{cases}
\]

∗ Corresponding author. Tel.: +886 3 5213132; fax: +886 3 5611228.
E-mail address:
[email protected] (W.-L. Hung). 0165-0114/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.fss.2006.08.004
W.-L. Hung, M.-S. Yang / Fuzzy Sets and Systems 157 (2006) 3109 – 3122
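where $L$ and $R$ are decreasing functions from $\mathbb{R}^+$ to $[0, 1]$, and $L(x) = R(x) = 1$ for $x \le 0$; $0$ for $x \ge 1$. $m$ is called the center value of $M$, and $\alpha$ and $\beta$ are called the left and right spreads, respectively. Symbolically, $M$ is denoted by $(m, \alpha, \beta)_{LR}$. Let $M$ and $N$ be two LR-type fuzzy numbers with $M = (m, \alpha, \beta)_{LR}$ and $N = (n, \gamma, \delta)_{LR}$. Then by the extension principle, the following operations are defined (see [2]):

\[
\begin{aligned}
M + N &= (m + n,\ \alpha + \gamma,\ \beta + \delta)_{LR}, \\
-N &= (-n,\ \delta,\ \gamma)_{RL}, \\
(m, \alpha, \beta)_{LR} - (n, \gamma, \delta)_{RL} &= (m - n,\ \alpha + \delta,\ \beta + \gamma)_{LR}, \\
\lambda \cdot (m, \alpha, \beta)_{LR} &= (\lambda m,\ \lambda\alpha,\ \lambda\beta)_{LR} \quad \text{when } \lambda > 0, \\
\lambda \cdot (m, \alpha, \beta)_{LR} &= (\lambda m,\ -\lambda\beta,\ -\lambda\alpha)_{RL} \quad \text{when } \lambda < 0.
\end{aligned}
\]

For an LR-type fuzzy number $A = (a, \alpha, \beta)_{LR}$, if $L$ and $R$ are of the form $T(x) = 1 - x$ for $0 \le x \le 1$ and $0$ otherwise, then $A$ is called a triangular fuzzy number, denoted by $A = (a, \alpha, \beta)_T$. If $\alpha = \beta$, then $A = (a, \alpha, \beta)_T$ is called a symmetrical triangular fuzzy number, denoted by $A = (a, \alpha)_T$.

Tanaka et al. [13] considered the following FLR model:

\[
Y^* = A_0 + A_1 x_1 + \cdots + A_p x_p = A x, \qquad (1)
\]

where $A = (A_0, A_1, \ldots, A_p)$ is a vector of fuzzy parameters with $A_i = (a_i, \alpha_i)_T$, $i = 0, 1, \ldots, p$, and $x = (1, x_1, \ldots, x_p)^t$ is a vector of non-fuzzy input. Thus, the fuzzy output $Y^*$ will also be a symmetrical triangular fuzzy number with

\[
Y^* = \left( \sum_{i=0}^{p} a_i x_i,\ \sum_{i=0}^{p} \alpha_i |x_i| \right)_T \quad (\text{for } x_0 = 1).
\]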
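The fuzzy output of model (1) can be illustrated with a short sketch. The class name `SymTriangular`, the function `flr_output`, and the parameter values below are hypothetical and chosen only for illustration; the code simply evaluates the center $\sum a_i x_i$ and the spread $\sum \alpha_i |x_i|$ for one crisp input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymTriangular:
    """Symmetrical triangular fuzzy number (center, spread)_T."""
    center: float
    spread: float

    def membership(self, y: float) -> float:
        """Triangular membership: 1 at the center, 0 at distance >= spread."""
        if self.spread == 0:
            return 1.0 if y == self.center else 0.0
        return max(0.0, 1.0 - abs(y - self.center) / self.spread)


def flr_output(a, alpha, x):
    """Fuzzy output Y* = (sum a_i x_i, sum alpha_i |x_i|)_T for crisp x with x[0] = 1."""
    center = sum(ai * xi for ai, xi in zip(a, x))
    spread = sum(si * abs(xi) for si, xi in zip(alpha, x))
    return SymTriangular(center, spread)


# Hypothetical parameters (not taken from the paper).
y_star = flr_output(a=[1.0, 2.0], alpha=[0.5, 0.25], x=[1.0, -2.0])
print(y_star)                   # SymTriangular(center=-3.0, spread=1.0)
print(y_star.membership(-2.5))  # 0.5
```

Note that the spread uses $|x_i|$, so a negative input still widens the output rather than narrowing it.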
Let us identify the FLR model (1) when the data are given as $(x_j, Y_j)$ $(j = 1, \ldots, n)$, where $x_j = (1, x_{j1}, \ldots, x_{jp})^t$ is the $j$th non-fuzzy input vector, $Y_j = (y_j, e_j)_T$ is the $j$th fuzzy output, and $n$ is the data size. Tanaka et al. [13] proposed the parameter estimation method under the minimization of the vagueness of the FLR model subject to at least some degree of the fitting of the estimated values to the observed values. Let $Y_j^*$ be the estimates of $Y_j$, $j = 1, \ldots, n$. The degree of the fitting of $Y_j^*$ to $Y_j$ is given under the condition that $(Y_j)_h \subset (Y_j^*)_h$, $h \ge H$, for all $j = 1, \ldots, n$, with $(A)_h$ defined as the $h$-level set of the fuzzy set $A$ and $H \in [0, 1)$ a degree of fitting chosen by the decision-maker. The vagueness of the FLR model is defined as the sum of spreads of all fuzzy parameters, $J = \alpha_0 + \alpha_1 + \cdots + \alpha_p$. The estimation problem is to obtain the fuzzy parameter estimates $A_i^*$ which minimize $J$, subject to the condition of the degree of fit. Tanaka et al. [13] formulated this optimization problem as the following linear programming problem:

\[
\begin{aligned}
\min_{a,\,\alpha}\quad & J = \alpha_0 + \alpha_1 + \cdots + \alpha_p \\
\text{s.t.}\quad & \alpha_\ell \ge 0,\ \ell = 0, 1, \ldots, p, \text{ and} && (2) \\
& a^t x_j + (1 - H)\,\alpha^t |x_j| \ge y_j + (1 - H)\,e_j, && (3) \\
& a^t x_j - (1 - H)\,\alpha^t |x_j| \le y_j - (1 - H)\,e_j, \quad j = 1, \ldots, n, && (4)
\end{aligned}
\]

where $a = (a_0, a_1, \ldots, a_p)^t$ and $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)^t$ are the center values and spreads of the fuzzy parameters $A = (A_0, A_1, \ldots, A_p)$, and $|x_j| = (1, |x_{j1}|, \ldots, |x_{jp}|)^t$. Several years afterward, Tanaka [10] revised his definition of vagueness in the FLR model to $J = \sum_{j=1}^{n} \alpha^t |x_j|$. That is, the estimation is to minimize the sum of spreads of all estimated outputs $Y_j^*$, subject to the above conditions (2)–(4).

Outliers may occur because of gross error during the collection, recording, or transcribing of the data. They may also be genuine observations. Once an outlier has been detected, it should be put under scrutiny; one should not mechanically reject outliers and proceed with the analysis. If the outliers are bona fide observations, they may indicate the inadequacy of the model under some specific conditions, and they often provide valuable clues to the analyst for constructing a better model. It is therefore important for a data analyst to be able to identify outliers and assess their effect on various aspects of the analysis. Some methods have been presented to detect outliers in FLR (cf. [1,7]). However, the existing methods have two drawbacks: they must pre-assign values to some parameters, and they cannot conduct a formal test for the outliers. To overcome these drawbacks, we use an omission approach which examines how the value of the objective function changes when some of the observations are omitted from the data set in the FLR model. Moreover, to define the cutoffs for outliers, we use the box plot procedure as a visual display tool.

The remainder of the paper is organized as follows. In Section 2, we propose an omission approach for outlier detection. Section 3 reviews the box plot procedure. In Section 4, we present several examples to demonstrate the performance of the proposed omission approach with box plots. Section 5 extends the detection of a single outlier to multiple outliers. Conclusions are given in Section 6.
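Returning to the fitting constraints (3) and (4) of Tanaka’s linear program: for symmetrical triangular numbers they say that the observed $H$-level interval lies inside the estimated one. The sketch below checks this for given parameters and evaluates the revised vagueness $J = \sum_j \alpha^t |x_j|$. The helper names and the toy data are hypothetical, not taken from the paper.

```python
def h_level(center, spread, h):
    """Closed interval forming the h-level set of a symmetrical triangular number."""
    half = (1.0 - h) * spread
    return center - half, center + half


def satisfies_fit(a, alpha, data, H=0.0, tol=1e-9):
    """True if, for every observation, the observed H-level set lies inside the estimated one."""
    for x, y, e in data:  # x includes the leading 1
        c = sum(ai * xi for ai, xi in zip(a, x))
        s = sum(si * abs(xi) for si, xi in zip(alpha, x))
        lo_est, hi_est = h_level(c, s, H)
        lo_obs, hi_obs = h_level(y, e, H)
        if hi_est < hi_obs - tol or lo_est > lo_obs + tol:
            return False
    return True


def vagueness(alpha, data):
    """Revised objective J = sum_j alpha^t |x_j|."""
    return sum(sum(si * abs(xi) for si, xi in zip(alpha, x)) for x, _, _ in data)


# Hypothetical toy data (x, y, e) and parameters, for illustration only.
data = [((1.0, 1.0), 2.0, 0.5), ((1.0, 2.0), 3.0, 0.5), ((1.0, 3.0), 4.5, 1.0)]
a, alpha = (1.0, 1.0), (1.0, 0.2)
print(satisfies_fit(a, alpha, data, H=0.0))  # True
print(satisfies_fit(a, alpha, data, H=0.5))  # False
print(round(vagueness(alpha, data), 6))      # 4.2
```

In this toy case the same parameters pass at $H = 0$ but fail at $H = 0.5$, which illustrates why a larger $H$ generally forces larger spreads.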
2. An omission approach for outlier detection

Frequently in FLR applications, the data set contains some elements that are outlying or extreme; that is, the observations for these cases are well separated from the remainder of the data. The cases may be outlying or extreme with respect to the response $Y$ value, the predictor $X$ value, or both. Peters [7] considered outliers or extremes with respect to the $Y$ value and considered systems with non-fuzzy input and non-fuzzy output data types, where Tanaka’s approach is modified to treat the bounds of the interval as fuzzy. Peters introduced a new variable $\lambda$ which represents the membership degree to which the solution belongs to the set “good solution”. The fuzzy linear programming problem is given as follows:

\[
\begin{aligned}
\max\quad & \lambda \\
\text{s.t.}\quad & (1 - \lambda)\,p_0 - \sum_{j=1}^{n} \alpha^t |x_j| \ge -d_0 && \text{(“objective function”)}, \\
& (1 - \lambda)\,p_j + a^t x_j + (1 - H)\,\alpha^t |x_j| \ge y_j && \text{(“upper limit”)}, \\
& (1 - \lambda)\,p_j - a^t x_j + (1 - H)\,\alpha^t |x_j| \ge -y_j && \text{(“lower limit”)}, \\
& -\lambda \ge -1, \quad \lambda, c \ge 0, \quad j = 1, 2, \ldots, n,
\end{aligned}
\]

where $d_0$ represents the desired value of the objective function, $p_0$ can be considered as the tolerance of the desired lower bound, and $p_j$ as the width of the tolerance interval of $y_j$. The main problem with Peters’ approach is how to select the values of $d_0$, $p_0$ and $p_j$. He suggested selecting $d_0 = 0$, which makes $\sum_{j=1}^{n} \alpha^t |x_j| = 0$ the desired value of total vagueness. This means we prefer to obtain a model as crisp as possible. But the values of $p_0$ and $p_j$ must be determined in a context-dependent way and cannot be easily determined.

Chen [1] then considered the outlier problem in FLR analysis with non-fuzzy input and fuzzy output. Based on the idea of detecting the difference in width between $\alpha^t |x_j|$ and $e_j$, Chen proposed the following approach:

\[
\begin{aligned}
\min_{a,\,\alpha}\quad & J = \sum_{j=1}^{n} \alpha^t |x_j| \\
\text{s.t.}\quad & a^t x_j + (1 - H)\,\alpha^t |x_j| \ge y_j + (1 - H)\,e_j, \\
& a^t x_j - (1 - H)\,\alpha^t |x_j| \le y_j - (1 - H)\,e_j, \\
& \alpha^t |x_j| - e_j \le k, \quad j = 1, 2, \ldots, n.
\end{aligned}
\]
The problem with Chen’s approach is similar to that of Peters, i.e., how to decide the value of $k$. If the value of $k$ is too small, normal values may be flagged as abnormal. On the other hand, if it is too big, abnormal values will go undetected. Thus, the value of $k$ must be decided with great care by the decision-maker. In order to solve this problem, Chen [1] proposed seven different approaches for obtaining the $k$ value, all based on $e_j$. Because it is difficult to find the relationship between $k$ and $e_j$, the $k$ value cannot be obtained systematically.

In order to overcome the drawbacks of Peters’ and Chen’s methods, we propose a method using a refined measure to identify the cases with outlying $Y$ observations. This refinement measures the influence of the $i$th observation on the value of the objective function in Tanaka’s approach when the $i$th observation is omitted. Based on this idea, we develop an omission approach for detecting a single outlier in a data set as follows. The procedure is to first delete the $i$th observation. We then apply Tanaka’s approach to the remaining $(n - 1)$ observations and obtain the minimized value of the objective function, which is denoted by $J_M^{(i)}$. After deleting the $i$th observation, Tanaka’s approach becomes

\[
\begin{aligned}
\min_{a,\,\alpha}\quad & J_M^{(i)} = \sum_{j \ne i} \alpha^t |x_j| \\
\text{s.t.}\quad & \alpha_\ell \ge 0,\ \ell = 0, 1, \ldots, p, \text{ and} && (5) \\
& a^t x_j + (1 - H)\,\alpha^t |x_j| \ge y_j + (1 - H)\,e_j, && (6) \\
& a^t x_j - (1 - H)\,\alpha^t |x_j| \le y_j - (1 - H)\,e_j, \quad j \ne i. && (7)
\end{aligned}
\]
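As a rough illustration of the omitted-observation program above, the sketch below solves the linear program for a single predictor ($p = 1$) using only the standard library. It is not the solver used in the paper: it finds the optimum by brute-force enumeration of the vertices of the feasible region (every choice of four active constraints), which is only practical for tiny problems, whereas a real application would call a simplex or interior-point routine. The function names are hypothetical; the data are those of Table 1(A) with $H = 0$.

```python
from itertools import combinations

# Table 1(A) of the paper: (x_j, y_j, e_j)
TABLE_1A = [(1, 8.0, 1.8), (2, 6.4, 2.2), (3, 9.5, 2.6), (4, 13.5, 2.6),
            (5, 13.0, 2.4), (6, 15.2, 2.3), (7, 17.0, 2.2), (8, 19.3, 4.8),
            (9, 20.1, 1.9), (10, 24.3, 2.0)]


def solve4(rows, rhs):
    """Solve a 4x4 linear system by Gauss-Jordan elimination; None if singular."""
    m = [list(r) + [b] for r, b in zip(rows, rhs)]
    for c in range(4):
        piv = max(range(c, 4), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) < 1e-12:
            return None
        m[c], m[piv] = m[piv], m[c]
        for r in range(4):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][4] / m[i][i] for i in range(4)]


def tanaka_lp(data, H=0.0):
    """Minimum of J = sum_j alpha^t |x_j| over z = (a0, a1, alpha0, alpha1)."""
    k = 1.0 - H
    # Every constraint is stored as (g, b), meaning g . z >= b.
    cons = [((0, 0, 1, 0), 0.0), ((0, 0, 0, 1), 0.0)]       # alpha >= 0
    for x, y, e in data:
        cons.append(((1, x, k, k * abs(x)), y + k * e))     # upper fit constraint
        cons.append(((-1, -x, k, k * abs(x)), -y + k * e))  # lower fit constraint
    cost = (0, 0, len(data), sum(abs(x) for x, _, _ in data))
    best = None
    for active in combinations(cons, 4):                    # candidate vertices
        z = solve4([g for g, _ in active], [b for _, b in active])
        if z is None:
            continue
        J = sum(c * v for c, v in zip(cost, z))
        if best is not None and J >= best:
            continue
        if all(sum(gi * zi for gi, zi in zip(g, z)) >= b - 1e-7 for g, b in cons):
            best = J
    return best


J_full = tanaka_lp(TABLE_1A)
J_omit = {i + 1: tanaka_lp(TABLE_1A[:i] + TABLE_1A[i + 1:])
          for i in range(len(TABLE_1A))}
print(round(J_full, 4))     # 44.4583
print(round(J_omit[8], 4))  # 35.55
```

Under these assumptions the sketch agrees with the values the paper reports for Table 1(A), for example $J_M^A = 44.4583$ and $J_M^{(8)} = 35.55$. The sketch leaves the center values $a$ unrestricted in sign, as the program above is written.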
Let $J_M$ be the minimized value of the objective function obtained by applying Tanaka’s approach to all observations. The absolute difference between $J_M$ and $J_M^{(i)}$ will be denoted by $d_i$,

\[
d_i = |J_M - J_M^{(i)}|, \quad i = 1, 2, \ldots, n.
\]

The ratio of $d_i$ to $J_M$ is called the normalized absolute difference and will be denoted by $r_i$,

\[
r_i = \frac{d_i}{J_M}, \quad i = 1, 2, \ldots, n. \qquad (8)
\]
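A minimal sketch of definition (8), using the $J_M^{(i)}$ values that the paper reports in Table 2(A) together with $J_M^A = 44.4583$; the function name `normalized_differences` is hypothetical.

```python
J_M = 44.4583
# J_M^(i), i = 1..10, copied from Table 2(A).
J_OMIT = [40.65, 38.64, 40.3667, 39.775, 40.0833,
          39.9417, 39.8, 35.55, 39.5167, 39.375]


def normalized_differences(j_full, j_omitted):
    """r_i = |J_M - J_M^(i)| / J_M for every omitted observation i."""
    return [abs(j_full - j) / j_full for j in j_omitted]


r = normalized_differences(J_M, J_OMIT)
r_max = max(r)
i_max = r.index(r_max) + 1      # 1-based observation number
print([round(v, 3) for v in r])
print(i_max, round(r_max, 3))   # 8 0.2
```

Rounded to three decimals, the computed values coincide with the $r_i$ column of Table 2(A), and the largest one belongs to observation No. 8.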
The $r_i$ value shows the size of the absolute difference relative to $J_M$. A large value of $r_i$ indicates a large impact of the $i$th observation on the value of the objective function. We usually assume that there is at most one outlier in a given data set and that the label of the outlying observation is unknown. We therefore use $r_{\max} = \max\{r_i \mid 1 \le i \le n\}$ to detect a single outlier in an FLR model. But determining the critical value for $r_{\max}$ is difficult. Therefore, we use a box plot method to compare the values of $r_i$ relative to each other for the determination of outliers, as described and demonstrated in the next sections.

3. Box plot procedure

In 1977, Tukey [14] proposed a box-and-whisker plot to display the 5-number summary (extremes, hinges, median). To identify certain values as “outside” or “far out”, Tukey [14] also used the H-spread to give a rule for these values. The rule is as follows:

(i) H-spread = difference between values of hinges.
(ii) Step = 1.5 times H-spread.
(iii) Inner fences are 1 step outside hinges; outer fences are 2 steps outside hinges.
(iv) The values between an inner fence and its neighboring outer fence are “outside”; values beyond outer fences are “far out”.

Following a technique similar to that of Tukey [14], Emerson and Strenio [3] and Hoaglin [4] considered the median, fourths, and extremes to be the elements of the 5-number summary. To examine the data for outlying values, they provided the F-spread, defined by F-spread = (upper fourth) − (lower fourth) = $F_U - F_L$, to define the cutoffs for outliers. They defined

\[
F_L - 1.5 \times \text{F-spread} \quad \text{and} \quad F_U + 1.5 \times \text{F-spread}
\]
as the outlier cutoffs. Data values that lie outside these cutoffs are called outliers. The first and third quartiles are nearly the same as the lower and upper fourths, and these two quartiles are easily obtained in statistical software programs. Therefore, we use quartiles instead of fourths. A box plot is a short term for a box-and-whisker plot (see [3]), and it is available in most general purpose statistical software programs. This plot has the following strengths (cf. [3]):

(i) It graphically displays a variable’s location and spread at a glance.
(ii) It provides some indication of the data’s symmetry and skewness.
(iii) Unlike many other methods of data display, box plots show outliers.

Thus, a box plot is an important exploratory data analysis tool. We use $r_{\max}$ proposed in Section 2 to detect outliers through the box plot. By arguments similar to those of [3,4,14], we use the interquartile range (IQR), that is, the difference between the first and third quartiles, to define the outlier cutoffs as follows. In a box plot, inner fences are constructed to the left and right of the box at a distance of 1.5 times the IQR. Outer fences are constructed in the same way at a distance of 3 times the IQR. It is well known that the median and the first and third quartiles are insensitive to outlying data values. Since a box plot itself contains these values, a box plot will automatically guard against undue influence of outlying cases. Moreover, the outlier cutoffs are determined by the first and third quartiles, so they can dampen the influence of even a single wild data value. Therefore, we use a box plot to determine whether $r_{\max}$ is an outlier or not. In SPSS statistical software, data points that lie between the inner and outer fences are denoted by a circle “◦” and called outliers. Data points that lie beyond the outer fences are denoted by an asterisk “∗” and called extreme outliers. These SPSS box plots will be used to demonstrate our numerical examples.

4. Numerical examples

An outlier with respect to its $Y$ value may be outlying with respect to its spread value, its center value, or both. In the following, we show examples of each of the three cases to illustrate the proposed approach and choose the threshold level $H = 0$.

Example 1. In this example, comparisons between the proposed approach and Chen’s approach are made. The data given in Table 1 were used by [1]. The minimized values of the objective function are

\[
J_M^A = 44.4583, \quad J_M^B = 123.75, \quad J_M^C = 144.0476
\]

for Table 1(A)–(C), respectively. The results of $J_M^{(i)}$ and $r_i$ obtained by the omission approach are shown in Table 2(A)–(C). For Table 2(A), the median and the first and third quartiles of $r_i$, denoted by Me, $Q_1$, $Q_3$, respectively, are

\[
\mathrm{Me} = 0.105, \quad Q_1 = 0.097, \quad Q_3 = 0.118.
\]
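The quartiles quoted above and the fences of Section 3 can be reproduced with a short sketch. It assumes the quartile convention of Python’s `statistics.quantiles(..., method="exclusive")`, which for this data set happens to agree with the reported values to three decimals; the paper itself relies on SPSS. The function names `tukey_fences` and `classify` are hypothetical.

```python
from statistics import quantiles

# r_i column of Table 2(A)
R_TABLE_2A = [0.086, 0.131, 0.092, 0.105, 0.098, 0.102, 0.105, 0.200, 0.111, 0.114]


def tukey_fences(values):
    """Quartiles, IQR, and inner (1.5 IQR) / outer (3 IQR) fences."""
    q1, med, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return {"Q1": q1, "Me": med, "Q3": q3, "IQR": iqr,
            "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
            "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr)}


def classify(value, fences):
    """Label a value the way the SPSS box plot in the paper does."""
    lo_in, hi_in = fences["inner"]
    lo_out, hi_out = fences["outer"]
    if value < lo_out or value > hi_out:
        return "extreme outlier"
    if value < lo_in or value > hi_in:
        return "outlier"
    return "regular"


fences = tukey_fences(R_TABLE_2A)
labels = {i + 1: classify(v, fences) for i, v in enumerate(R_TABLE_2A)}
print({k: round(v, 3) for k, v in fences.items() if k in ("Q1", "Me", "Q3")})
print({i: lab for i, lab in labels.items() if lab != "regular"})  # {8: 'extreme outlier'}
```

Because the sketch keeps unrounded quartiles, its fences can differ from the paper’s in the third decimal place; the classification of observation No. 8 is unaffected.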
Table 1
Three kinds of trends of e_j as x increases

No.   x    A: Constant spread   B: Increasing spread (DS-A2)   C: Decreasing spread (DS-A1)
           (y_j, e_j)           (y_j, e_j)                     (y_j, e_j)
1     1    (8.0, 1.8)           (11, 2)                        (11, 12)
2     2    (6.4, 2.2)           (13, 2)                        (13, 12)
3     3    (9.5, 2.6)           (21, 4)                        (21, 10)
4     4    (13.5, 2.6)          (29, 4)                        (24, 10)
5     5    (13.0, 2.4)          (29, 6)                        (31, 8)
6     6    (15.2, 2.3)          (34, 6)                        (34, 8)
7     7    (17.0, 2.2)          (45, 15)^a                     (42, 4)
8     8    (19.3, 4.8)^a        (44, 8)                        (44, 15)^a
9     9    (20.1, 1.9)          (48, 12)                       (51, 2)
10    10   (24.3, 2.0)          (54, 12)                       (54, 2)

^a Represents outlier.
Table 2
The values of J_M^(i) and r_i for Table 1

       A                     B                      C
No.    J_M^(i)    r_i        J_M^(i)     r_i        J_M^(i)     r_i
1      40.65      0.086      120.1225    0.029      126.6       0.121
2      38.64      0.131      116.4107    0.059      126.8571    0.119
3      40.3667    0.092      116.7024    0.057      130.2381    0.096
4      39.775     0.105      114.5714    0.074      130.0       0.098
5      40.0833    0.098      112.4405    0.091      129.7619    0.099
6      39.9417    0.102      110.3095    0.109      129.5238    0.101
7      39.8       0.105      78.4286     0.366^a    129.2857    0.102
8      35.55      0.200^a    106.0476    0.143      117.0       0.188^a
9      39.5167    0.111      100.7166    0.186      128.8095    0.106
10     39.375     0.114      101.7857    0.177      128.5714    0.107

^a Represents outlier.
Fig. 1. Box plots of the values of the normalized absolute difference: Table 2(A) is on the left, Table 2(B) is in the middle and Table 2(C) is on the right.

Table 3
Outlier with respect to its center value

No.   x    A (y_j, e_j)     B (y_j, e_j)   C (y_j, e_j)
1     1    (8.0, 1.8)       (11, 2)        (11, 12)
2     2    (6.4, 2.2)       (13, 2)        (13, 12)
3     3    (9.5, 2.6)       (21, 4)        (21, 10)
4     4    (13.5, 2.6)      (29, 4)        (24, 10)
5     5    (13.0, 2.4)      (29, 6)        (31, 8)
6     6    (15.2, 2.3)      (34, 6)        (34, 8)
7     7    (17.0, 2.2)      (55, 8)^a      (42, 4)
8     8    (29.3, 2.2)^a    (44, 8)        (64, 4)^a
9     9    (20.1, 1.9)      (48, 12)       (51, 2)
10    10   (24.3, 2.0)      (54, 12)       (54, 2)

^a Represents outlier.
Therefore, IQR = 0.021 and the cutoffs are

\[
Q_3 + 1.5\,\mathrm{IQR} = 0.149, \quad Q_3 + 3\,\mathrm{IQR} = 0.181, \quad Q_1 - 1.5\,\mathrm{IQR} = 0.066, \quad Q_1 - 3\,\mathrm{IQR} = 0.034.
\]

Hence, the data point of No. 8 is an extreme outlier. By a similar argument, we obtain Table 7, which shows the above values for Table 2(B) and (C). From Table 7, the data points of No. 7 in Table 2(B) and No. 8 in Table 2(C) are outliers. The corresponding box plots are shown in Fig. 1. These results are consistent with [1].

Example 2. In this example, we consider an outlier with respect to its center value. Table 3(A)–(C) lists the numerical values, obtained with some modification from Table 1(A)–(C). The minimized values of the objective function are

\[
J_M^A = 62.75, \quad J_M^B = 135, \quad J_M^C = 145.2679
\]

for Table 3(A)–(C), respectively. The results of $J_M^{(i)}$ and $r_i$ obtained by the proposed approach are presented in Table 4(A)–(C). From Table 7 and Fig. 2, we conclude that the outliers are Nos. 8, 7 and 8 in Table 3(A)–(C), respectively.
Table 4
The values of J_M^(i) and r_i for Table 3

       A                     B                      C
No.    J_M^(i)    r_i        J_M^(i)     r_i        J_M^(i)     r_i
1      52.5       0.163      131.6939    0.024      128.2716    0.117
2      57.0       0.092      127.7679    0.054      130.9286    0.099
3      57.85      0.078      127.4524    0.056      130.9643    0.098
4      57.3       0.087      125.0714    0.074      130.875     0.099
5      56.75      0.096      122.6905    0.091      130.7857    0.100
6      56.2       0.104      120.3095    0.109      128.7589    0.114
7      55.65      0.113      78.4286     0.419^a    130.6071    0.101
8      35.55      0.433^a    115.5476    0.144      117.0       0.195^a
9      52.87      0.157      108.9167    0.193      130.4286    0.102
10     54.0       0.139      110.7857    0.179      130.3393    0.103

^a Represents outlier.
Fig. 2. Box plots of the values of the normalized absolute difference: Table 4(A) is on the left, Table 4(B) is in the middle and Table 4(C) is on the right.
In Example 2, we consider an outlier with respect to its center value, whereas Chen [1] only considers outliers with respect to the spread. Thus, Chen’s method does not work in this example.

Example 3. In this example, we consider an outlier with respect to both its spread and center values. Table 5(A)–(C) lists numerical values which are a combination of Tables 1(A)–(C) and 3(A)–(C). The minimized values of the objective function are

\[
J_M^A = 71.1071, \quad J_M^B = 161.6327, \quad J_M^C = 180.625
\]

for Table 5(A)–(C), respectively. The results of $J_M^{(i)}$ and $r_i$ obtained by the proposed approach are presented in Table 6(A)–(C). From Table 7 and Fig. 3, we conclude that the outliers are Nos. 8, 7 and 8 in Table 5(A)–(C), respectively. Chen’s method gives the same results.

Example 4. In this example, we use the data set (see Table 8) given by Peters [7] to compare the proposed approach with Peters’ approach. According to Peters’ results, the data point of No. 5 is an outlier. Using the proposed approach,
Table 5
Outlier with respect to its spread and center values

No.   x    A (y_j, e_j)     B (y_j, e_j)   C (y_j, e_j)
1     1    (8.0, 1.8)       (11, 2)        (11, 12)
2     2    (6.4, 2.2)       (13, 2)        (13, 12)
3     3    (9.5, 2.6)       (21, 4)        (21, 10)
4     4    (13.5, 2.6)      (29, 4)        (24, 10)
5     5    (13.0, 2.4)      (29, 6)        (31, 8)
6     6    (15.2, 2.3)      (34, 6)        (34, 8)
7     7    (17.0, 2.2)      (55, 15)^a     (42, 4)
8     8    (29.3, 4.8)^a    (44, 8)        (64, 15)^a
9     9    (20.1, 1.9)      (48, 12)       (51, 2)
10    10   (24.3, 2.0)      (54, 12)       (54, 2)

^a Represents outlier.
Table 6
The values of J_M^(i) and r_i for Table 5

       A                     B                      C
No.    J_M^(i)    r_i        J_M^(i)     r_i        J_M^(i)     r_i
1      60.4125    0.150      158.6939    0.018      162.5625    0.100
2      65.1714    0.083      154.2679    0.046      165.5       0.084
3      65.8357    0.074      152.8165    0.055      164.75      0.088
4      65.1       0.084      149.8775    0.073      163.875     0.093
5      64.3643    0.095      146.9388    0.091      163.0       0.098
6      63.6286    0.105      144.0       0.109      160.1875    0.113
7      62.8928    0.116      78.4286     0.515^a    161.25      0.107
8      35.55      0.500^a    138.1225    0.145      117.0       0.352^a
9      59.7414    0.160      130.5       0.193      159.5       0.117
10     60.6857    0.147      132.2449    0.182      158.625     0.122

^a Represents outlier.
Table 7
The values of Me, Q1, Q3, IQR and cutoffs for Tables 2, 4 and 6

                 Table 2                     Table 4                     Table 6
                 A        B        C         A        B        C         A        B        C
Me               0.105    0.100    0.104     0.109    0.100    0.102     0.111    0.100    0.104
Q1               0.097    0.059    0.099     0.091    0.056    0.099     0.084    0.053    0.092
Q3               0.118    0.179    0.120     0.158    0.183    0.115     0.153    0.185    0.118
IQR              0.021    0.120    0.021     0.067    0.127    0.016     0.069    0.132    0.026
Q1 − 1.5 IQR     0.066    −0.121   0.068     −0.010   −0.135   0.075     −0.019   −0.145   0.053
Q1 − 3 IQR       0.034    −0.301   0.036     −0.110   −0.325   0.051     −0.123   −0.343   0.014
Q3 + 1.5 IQR     0.149    0.359    0.151     0.259    0.374    0.139     0.256    0.383    0.157
Q3 + 3 IQR       0.181    0.539    0.183     0.359    0.564    0.163     0.360    0.581    0.196
the minimized value of the objective function is $J_M = 24$, and the values of $J_M^{(i)}$ and $r_i$ are presented in Table 8. From Table 8, the median and the first and third quartiles of $r_i$, denoted by Me, $Q_1$, $Q_3$, respectively, are

\[
\mathrm{Me} = 0.01, \quad Q_1 = 0.01, \quad Q_3 = 0.115.
\]

Therefore, IQR = 0.105 and the cutoffs are

\[
Q_3 + 1.5\,\mathrm{IQR} = 0.2725, \quad Q_3 + 3\,\mathrm{IQR} = 0.43, \quad Q_1 - 1.5\,\mathrm{IQR} = -0.1475, \quad Q_1 - 3\,\mathrm{IQR} = -0.305.
\]

Hence, the data point of No. 5 is an extreme outlier. This result is consistent with [7].
Fig. 3. Box plots of the values of the normalized absolute difference: Table 6(A) is on the left, Table 6(B) is in the middle and Table 6(C) is on the right.

Table 8
Data of Example 4 and the values of J_M^(i) and r_i

No.   x     y       J_M^(i)   r_i
1     1     1.5     21.6      0.100
2     2     2.3     21.6      0.100
3     3     2.7     20.61     0.141
4     4     4.4     21.6      0.100
5     5     9.4^a   4.175     0.826^a
6     6     6.3     21.6      0.100
7     7     6.5     21.45     0.106
8     8     7.8     21.6      0.100
9     9     8.5     21.6      0.100
10    10    10.5    21.6      0.100

^a Represents outlier.
Example 5. In this example, we consider an outlier with respect to the predictor $x$ value. Note that Peters [7] does not consider this case. The input–output data are shown in Table 9. This data set is derived from Table 8 by changing the $y$-value of data point No. 5 and the $x$-value of data point No. 10. Clearly, the data point of No. 10 is an outlier with respect to its $x$-value. Using the proposed approach, the minimized value of the objective function is $J_M = 27.6531$, and the values of $J_M^{(i)}$ and $r_i$ are presented in Table 9. From Table 9, the median and the first and third quartiles of $r_i$, denoted by Me, $Q_1$, $Q_3$, respectively, are

\[
\mathrm{Me} = 0.1, \quad Q_1 = 0.1, \quad Q_3 = 0.18444.
\]

Therefore, IQR = 0.08444 and the cutoffs are

\[
Q_3 + 1.5\,\mathrm{IQR} = 0.3111, \quad Q_3 + 3\,\mathrm{IQR} = 0.43776, \quad Q_1 - 1.5\,\mathrm{IQR} = -0.02666, \quad Q_1 - 3\,\mathrm{IQR} = -0.15332.
\]

Hence, the data point of No. 10 is an extreme outlier.

Compared with Peters’ approach, the proposed approach requires more computational time. In the omission approach for detecting a single outlier, we need to compute $J_M^{(i)}$ for $i = 1, \ldots, n$. Its computational complexity is $O(n t_1)$,
Table 9
Data of Example 5 and the values of J_M^(i) and r_i

No.   x      y       J_M^(i)    r_i
1     1      1.5     22.5187    0.186
2     2      2.3     24.8878    0.100
3     3      2.7     24.8878    0.100
4     4      4.4     24.8878    0.100
5     5      5.4     24.8878    0.100
6     6      6.3     24.8878    0.100
7     7      6.5     24.8878    0.100
8     8      7.8     24.8878    0.100
9     9      8.5     22.5643    0.184
10    50^a   10.5    3.6        0.870^a

^a Represents outlier.
where $t_1$ is the number of iterations in Tanaka’s approach. The computational complexity of Peters’ approach is only $O(t_2)$, where $t_2$ is the number of iterations in Peters’ approach. Since the omission approach solves one linear program per omitted observation, it naturally requires more computational time than Peters’ approach. However, the proposed approach still runs fast, and the reward is that $p_0$ and $p_j$ need not be determined as in Peters’ approach.

5. Detection of multiple outliers

In Section 2, we discussed a simple method for the detection of an observation that can be individually considered an outlier. In this section, we extend the single-observation procedure to the more general case of multiple observations. Our goal is to detect joint outliers in FLR models. A question that naturally arises here is: why should we consider the detection of multiple outliers? This problem is important from the theoretical as well as the practical point of view. From the theoretical point of view, there may exist situations in which observations are jointly but not individually influential, or the other way around. An illustration is given in Fig. 4. Points 1 and 2 are not singly influential, but jointly they have a large influence on the fit. This situation is sometimes referred to as the masking effect, because the influence of one observation is masked by the presence of another observation. The problem of joint influence is also important from the practical side; when present, the joint influence is generally much more severe than single-case influence, and it is most often overlooked by practitioners because it is much more difficult to detect than single-case influence.

There are two inherent problems in the multiple outlier case. The first is how to determine the size of the subset of joint outliers. Suppose we are interested in the detection of all subsets of size $m = 2, 3, \ldots$ of data points that are considered to be joint outliers. How do we determine $m$? When it is difficult to determine $m$, a sequential method may be useful, e.g., we start with $m = 2$, then $m = 3$, etc.; but even then, when do we stop? The second problem is computational. We have seen in Section 2 that for the diagnostic measure $r_i$, we compute $n$ quantities $r_1, r_2, \ldots, r_n$, one for each observation in the data set. In the multiple outlier case, however, for each subset size $m$, there are $n!/(m!(n - m)!)$ possible subsets for which the diagnostic measure of interest can be computed. Even with today’s fast computers, this can be computationally prohibitive when $n$ and $m$ are large. In this section, we give, at least in part, answers to the above-mentioned problems.

The approach described in Section 2 can be generalized to the case of omitting $m$ observations. Let

\[
I = \{i_1, i_2, \ldots, i_m\}, \quad 2 \le m < n - p - 1,
\]

be the set containing the indices of the $m$ observations to be omitted. After omitting the $m$ observations, Tanaka’s approach becomes

\[
\begin{aligned}
\min_{a,\,\alpha}\quad & J_M^{(I)} = \sum_{j \notin I} \alpha^t |x_j| \\
\text{s.t.}\quad & \alpha_\ell \ge 0,\ \ell = 0, 1, \ldots, p, \text{ and} && (9) \\
& a^t x_j + (1 - H)\,\alpha^t |x_j| \ge y_j + (1 - H)\,e_j, && (10) \\
& a^t x_j - (1 - H)\,\alpha^t |x_j| \le y_j - (1 - H)\,e_j, \quad j \notin I. && (11)
\end{aligned}
\]
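The combinatorial cost discussed above can be made concrete with a small sketch that enumerates the admissible index sets $I$ for $2 \le m < n - p - 1$ and counts them. The function names are hypothetical; the counts come from the binomial coefficient $n!/(m!(n-m)!)$ quoted in the text.

```python
from itertools import combinations
from math import comb


def candidate_subsets(n, p):
    """Yield (m, I) for every index set I of size m with 2 <= m < n - p - 1 (1-based)."""
    for m in range(2, n - p - 1):
        for I in combinations(range(1, n + 1), m):
            yield m, I


def subset_counts(n, p):
    """Number of linear programs to solve for each admissible subset size m."""
    return {m: comb(n, m) for m in range(2, n - p - 1)}


# n = 6 observations and p = 1 predictor, as in the small data set used in this section.
print(subset_counts(6, 1))                      # {2: 15, 3: 20}
print(sum(1 for _ in candidate_subsets(6, 1)))  # 35
# The count grows quickly: subsets of size 5 out of 50 observations.
print(comb(50, 5))                              # 2118760
```

Each enumerated subset requires one run of the linear program above, which is why the authors describe the general case as computationally demanding.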
Fig. 4. An illustration of joint influence (axes: X and Y; legend: center value, end points). Points 1 and 2 are joint outliers.

Table 10
The data set with joint outliers

No.   x    (y_j, e_j)
1     1    (8.0, 10.8)^a
2     2    (6.4, 12.2)^a
3     3    (9.5, 2.6)
4     4    (13.5, 2.6)
5     5    (13.0, 2.4)
6     6    (15.2, 2.3)

^a Represents outlier.
Let $J_M^{(I)}$ be the minimized value of the objective function obtained by applying Tanaka’s approach to the remaining $(n - m)$ observations. The absolute difference between $J_M$ and $J_M^{(I)}$ will be denoted by $d_I$ when the $m$ observations indexed by $I$ are omitted. That is,

\[
d_I = |J_M - J_M^{(I)}|.
\]

The ratio of $d_I$ to $J_M$ is called the normalized absolute difference and will be denoted by $r_I$,

\[
r_I = \frac{d_I}{J_M}. \qquad (12)
\]

Note that $r_I$ is the general analog of $r_i$ defined in (8). When $m = 1$, $r_I$ reduces to $r_i$. Assume that there are at most $m$ outliers in the data set. In practice, we compute $r_I$ for all $n!/(m!(n - m)!)$ possible subsets of size $m$ and then look at the largest $r_I$. We also use the box plot to determine whether the largest $r_I$ is an outlier or not.

Next, we use the data set in Table 10, which contains two outliers (Nos. 1 and 2), to illustrate our approach. The minimized value of the objective function is $J_M = 73.8$. In this data set, $2 \le m < 4$ (note that $2 \le m < n - p - 1$). Therefore, we consider two cases of simultaneously omitting $m = 2$ and $m = 3$ observations in this data set. First, we compute all values of $J_M^{(I)}$ and $r_I$ for the 15 possible subsets of size $m = 2$, which are shown in Table 11. From Table 11, the median and the first and third quartiles of $r_I$ are

\[
\mathrm{Me} = 0.339, \quad Q_1 = 0.333, \quad Q_3 = 0.415.
\]

Therefore, IQR = 0.082 and the cutoffs are

\[
Q_3 + 1.5\,\mathrm{IQR} = 0.538, \quad Q_3 + 3\,\mathrm{IQR} = 0.661, \quad Q_1 - 1.5\,\mathrm{IQR} = 0.210, \quad Q_1 - 3\,\mathrm{IQR} = 0.087.
\]
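The $m = 2$ computation above can be sketched from the tabulated values. The code takes the $J_M^{(I)}$ values reported in Table 11 as given (it does not re-solve the linear programs), applies definition (12), and compares each $r_I$ with the upper outer fence $Q_3 + 3\,\mathrm{IQR}$. The quartile convention (`method="exclusive"`) is an assumption that happens to match the reported quartiles to three decimals.

```python
from statistics import quantiles

J_M = 73.8
# J_M^(I) for the 15 subsets of size m = 2, copied from Table 11.
J_SUBSETS = {
    (1, 2): 14.7, (1, 3): 48.8, (1, 4): 48.8, (1, 5): 48.8, (1, 6): 48.8,
    (2, 3): 43.2, (2, 4): 43.2, (2, 5): 43.2, (2, 6): 43.2,
    (3, 4): 49.2, (3, 5): 49.2, (3, 6): 49.2,
    (4, 5): 49.2, (4, 6): 49.2, (5, 6): 49.2,
}

# Normalized absolute difference r_I = |J_M - J_M^(I)| / J_M for every subset I.
r = {I: abs(J_M - J) / J_M for I, J in J_SUBSETS.items()}

q1, me, q3 = quantiles(r.values(), n=4, method="exclusive")
upper_outer_fence = q3 + 3.0 * (q3 - q1)

largest = max(r, key=r.get)
extreme = [I for I, v in r.items() if v > upper_outer_fence]
print(round(q1, 3), round(me, 3), round(q3, 3))  # 0.333 0.339 0.415
print(largest, round(r[largest], 3))             # (1, 2) 0.801
print(extreme)                                   # [(1, 2)]
```

Only the pair $\{1, 2\}$ exceeds the outer fence, in line with the conclusion the paper draws from Table 11.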
Table 11
The values of J_M^(I) and r_I for Table 10 (m = 2)

I         J_M^(I)   r_I
{1, 2}    14.7      0.801^a
{1, 3}    48.8      0.339
{1, 4}    48.8      0.339
{1, 5}    48.8      0.339
{1, 6}    48.8      0.339
{2, 3}    43.2      0.415
{2, 4}    43.2      0.415
{2, 5}    43.2      0.415
{2, 6}    43.2      0.415
{3, 4}    49.2      0.333
{3, 5}    49.2      0.333
{3, 6}    49.2      0.333
{4, 5}    49.2      0.333
{4, 6}    49.2      0.333
{5, 6}    49.2      0.333

^a Represents outlier.
Fig. 5. Box plot for Table 11.
The corresponding box plot is shown in Fig. 5. An examination of the cutoffs and the box plot shows that $r_{\{1,2\}}$ is an extreme outlier. That is, observations 1 and 2 are joint outliers.

Next, we compute all values of $J_M^{(I)}$ and $r_I$ for the 20 possible subsets of size $m = 3$. The results are shown in Table 12. From Table 12, the median and the first and third quartiles of $r_I$ are

\[
\mathrm{Me} = 0.53252, \quad Q_1 = 0.50407, \quad Q_3 = 0.56098.
\]
Table 12
The values of J_M^(I) and r_I for Table 10 (m = 3)

I            J_M^(I)   r_I
{1, 2, 3}    9.3       0.874^a
{1, 2, 4}    7.8       0.894^a
{1, 2, 5}    10.8      0.854^a
{1, 2, 6}    11.025    0.851^a
{1, 3, 4}    36.6      0.504
{1, 3, 5}    36.6      0.504
{1, 3, 6}    36.6      0.504
{1, 4, 5}    36.6      0.504
{1, 4, 6}    36.6      0.504
{1, 5, 6}    36.6      0.504
{2, 3, 4}    32.4      0.561
{2, 3, 5}    32.4      0.561
{2, 3, 6}    32.4      0.561
{2, 4, 5}    32.4      0.561
{2, 4, 6}    32.4      0.561
{2, 5, 6}    32.4      0.561
{3, 4, 5}    36.9      0.500
{3, 4, 6}    36.9      0.500
{3, 5, 6}    36.9      0.500
{4, 5, 6}    36.9      0.500

^a Represents outlier.
Therefore, IQR = 0.05691 and the cutoffs are

\[
Q_3 + 1.5\,\mathrm{IQR} = 0.646345, \quad Q_3 + 3\,\mathrm{IQR} = 0.73171, \quad Q_1 - 1.5\,\mathrm{IQR} = 0.418705, \quad Q_1 - 3\,\mathrm{IQR} = 0.33334.
\]

These cutoffs show that $r_{\{1,2,3\}}$, $r_{\{1,2,4\}}$, $r_{\{1,2,5\}}$ and $r_{\{1,2,6\}}$ are all extreme outliers. However, it is implausible that every data point in the data set, i.e., observations 1, . . . , 6, is an outlier. As we can see from Table 11, observations 1 and 2 are joint outliers. It is therefore likely that $r_{\{1,2,3\}}$, $r_{\{1,2,4\}}$, $r_{\{1,2,5\}}$ and $r_{\{1,2,6\}}$ are flagged because of the impact of the joint outlier $I = \{1, 2\}$. Looking at the other values of $J_M^{(I)}$ and $r_I$ in Table 12, we find that $r_{\{1,3,4\}}, \ldots, r_{\{1,5,6\}}$ (without observation 2), $r_{\{2,3,4\}}, \ldots, r_{\{2,5,6\}}$ (without observation 1) and $r_{\{3,4,5\}}, \ldots, r_{\{4,5,6\}}$ (without observations 1 and 2) are not outliers. From this analysis, we conclude that only subsets containing $I = \{1, 2\}$ have a large impact on the objective function, while the impact of observations 3, 4, 5 and 6 is not significant. Hence, observations 1 and 2 are the only joint outliers.

6. Conclusions

In this paper, we focused on the detection of outliers in FLR models. The main disadvantage of several existing procedures for detecting outliers is the lack of defined cutoffs for outliers. To overcome this drawback, we used an omission approach to detect outliers. This approach examines how the value of the objective function in a fuzzy regression analysis changes when some of the observations are omitted. Furthermore, we used a simple visual display, the box plot procedure, to define the cutoffs for outliers. In the numerical examples, the proposed approach identified the designated outliers with respect to the spread, the center value, both, and the predictor value, and its results were consistent with those of Chen [1] and Peters [7] where those methods apply.

References

[1] Y.S. Chen, Outliers detection and confidence interval modification in fuzzy regression, Fuzzy Sets and Systems 119 (2001) 259–272.
[2] D. Dubois, H. Prade, Fuzzy Sets and Systems: Theory and Applications, Academic Press, New York, 1980.
[3] J.D. Emerson, J. Strenio, Boxplots and batch comparison, in: D.C. Hoaglin, F. Mosteller, J.W. Tukey (Eds.), Understanding Robust and Exploratory Data Analysis, Wiley, New York, 1982, pp. 58–96.
[4] D.C. Hoaglin, Letter values: a set of selected order statistics, in: D.C. Hoaglin, F. Mosteller, J.W. Tukey (Eds.), Understanding Robust and Exploratory Data Analysis, Wiley, New York, 1982, pp. 33–57.
[5] M. Hojati, C.R. Bector, K. Smimou, A simple method for computation of fuzzy linear regression, European J. Oper. Res. 166 (2005) 172–184.
[7] G. Peters, Fuzzy linear regression with fuzzy intervals, Fuzzy Sets and Systems 63 (1994) 45–55.
[8] J.D.A. Sanchez, A.T. Gomez, Estimating a fuzzy term structure of interest rates using fuzzy regression techniques, European J. Oper. Res. 154 (2004) 804–818.
[10] H. Tanaka, Fuzzy data analysis by possibilistic linear model, Fuzzy Sets and Systems 24 (1987) 363–375.
[11] H. Tanaka, I. Hayashi, J. Watada, Possibilistic linear regression analysis for fuzzy data, European J. Oper. Res. 40 (1989) 389–396.
[13] H. Tanaka, S. Uejima, K. Asai, Linear regression analysis with fuzzy model, IEEE Trans. Systems Man Cybernet. 12 (1982) 903–907.
[14] J.W. Tukey, Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.