A Linear Regression Model for Nonlinear Fuzzy Data

Juan C. Figueroa-García and Jesus Rodriguez-Lopez
Universidad Distrital Francisco José de Caldas, Bogotá - Colombia
Abstract. Fuzzy linear regression is an interesting tool for handling uncertain data samples as an alternative to a probabilistic approach. This paper sets forth a linear regression model for fuzzy variables; the model is optimized through convex methods. A fuzzy linear programming model has been designed to solve the problem with nonlinear fuzzy data by combining fuzzy arithmetic theory with convex optimization methods. Two examples are solved through different approaches, followed by a goodness-of-fit statistical analysis based on the residuals of the model.
1 Introduction and Motivation

Linear regression analysis, in the form of the Classical Linear Regression Model (CLRM), is an important statistical tool for establishing the relation between a set of independent variables and a dependent one. A mathematical representation of the CLRM is:

$$y_j = \sum_{i=0}^{m} \beta_i x_{ij} + \xi_j \quad \forall\, j \in \mathbb{N}_m \tag{1}$$
where $y_j$ is a dependent variable, $x_{ij}$ are the observed variables, $\beta_i$ is the weight of the $i$th independent variable and $\xi_j$ is the residual of the $j$th observation, with $i \in \mathbb{N}_n$ and $j \in \mathbb{N}_m$. As Bargiela et al. expressed in [1], classical linear regression analysis cannot find the assignment rule between a collection of variables when these are not numerical (crisp) entities, such as fuzzy numbers (see Zadeh in [9]). To address this problem, Tanaka et al. [8] introduced the fuzzy linear regression (FLR) model:

$$\tilde{y}_j = \sum_{i=0}^{m} \beta_i \tilde{x}_{ij} + \xi_j \quad \forall\, j \in \mathbb{N}_m \tag{2}$$

where $\tilde{y}_j$ is a fuzzy dependent variable, $\tilde{x}_{ij}$ are fuzzy observations, $\beta_i$ is the weight of the $i$th independent variable and $\xi_j$ is the residual of the $j$th observation, with $i \in \mathbb{N}_n$ and $j \in \mathbb{N}_m$. Fuzzy linear regression can deal with linguistic variables through different methods such as the least squares method or the gradient-descent algorithm (see Bargiela in [1] and Gladysz in [2]). However, those methods are designed for
Corresponding authors.
D.-S. Huang et al. (Eds.): ICIC 2011, LNBI 6840, pp. 353–360, 2012. © Springer-Verlag Berlin Heidelberg 2012
analyzing the most commonly used sets, the symmetrical triangular fuzzy sets. Moreover, the solution routines involve algorithms that require considerable resources, e.g. software, computing power and time. The present work presents a Linear Programming (LP) model capable of handling, in a simple way, all type-1 fuzzy data. To decompose the information that each fuzzy datum contains, we analyze several parameters that represent it. These parameters are called interesting values and are characterized by the following definition.

Definition 1. Let $A$, $B$ and $C$ be fuzzy sets with membership functions $A(x)$, $B(x)$ and $C(x)$ such that:

$$A(x) = \sum_{i=0}^{m} \beta_i B_i(x) \tag{3}$$

and let $\tau(\cdot)$ be a function whose output is an interesting value of a given set $(\cdot)$. Hence, for $A(x)$ we have

$$\tau(A(x)) = \sum_{i=0}^{m} \beta_i \tau(B_i(x)) \tag{4}$$

where $\beta_i$ is the weight of the fuzzy set $B_i(x)$, so $\tau(A(x))$ is a linear combination of the values $\tau(B_i(x))$ with weights $\beta_i$. That is, each parameter of $A(x)$ can be expressed as a linear combination of $\beta_i$ and $\tau(B_i(x))$.

Interesting values have an important attribute: the interesting values of a fuzzy set that is a linear combination of a group of fuzzy sets equal the same linear combination of the interesting values of those fuzzy sets. An LP model is designed based on the interesting values. Each constraint tries to set an equivalence between the interesting values of the sets $Y$ and $X$, allowing for slack or surplus, and the objective function is the minimization of their sum. The LP method is compared to three other proposals in order to evaluate their efficiency and efficacy. In addition to the comparisons we also discuss a case study, including a statistical analysis of its residuals.
2 A Linear Programming Fuzzy Regression Model

2.1 The Independent Variables

In the classical linear regression model (CLRM) each observation corresponds to a single crisp value which measures a variable; these values, however, cannot encapsulate all the information about the variable itself. The variables can carry noise or imprecision in their measurement; moreover, the measurement process might not be accurate. These imprecise measures can be represented by fuzzy sets, so a regression model should deal with the imprecision through fuzzy sets. An L-R fuzzy set is characterized by its position, spread and shape, which are defined by a central value, by the lower and upper distances, and by the lower and upper areas.
Let $A$ denote an L-R fuzzy set and $A(x)$ its membership function. According to the properties of fuzzy numbers (see Klir in [5] and [4]), the central value ($v_c$) of $A$ has membership value 1, that is, $A(v_c) = 1$. In addition, the lower value ($v_l$) and the upper value ($v_u$) of $A$ are respectively the left and right boundaries of the support of $A$. Let $d_l$ denote the distance from the lower value to the central value of a fuzzy set, i.e. the lower distance of $A$; hence $d_l = v_c - v_l$, and the upper distance is $d_u = v_u - v_c$. Let $a_l$ denote the area between the lower and the central value of $A$, i.e. the lower area of $A$; thus $a_l$ can be expressed as:

$$a_l = \int_{v_l}^{v_c} A(x)\,dx$$

Analogously, the upper area $a_u$ can be calculated as:

$$a_u = \int_{v_c}^{v_u} A(x)\,dx$$
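The five interesting values defined above can be obtained numerically for any membership function. The following is a minimal sketch, not the authors' implementation: the function names, the trapezoidal integration rule and the triangular test set with support [1, 4] and central value 3 are assumptions made only for illustration.

```python
def interesting_values(membership, v_l, v_c, v_u, steps=10_000):
    """Return (v_c, d_l, d_u, a_l, a_u) for a fuzzy number with support [v_l, v_u]."""

    def area(lo, hi):
        # Composite trapezoidal rule for the integral of the membership function
        h = (hi - lo) / steps
        total = 0.5 * (membership(lo) + membership(hi))
        for k in range(1, steps):
            total += membership(lo + k * h)
        return total * h

    return v_c, v_c - v_l, v_u - v_c, area(v_l, v_c), area(v_c, v_u)


def triangular(x, lo=1.0, peak=3.0, hi=4.0):
    """Hypothetical triangular membership function used only as test data."""
    if lo <= x <= peak:
        return (x - lo) / (peak - lo)
    if peak < x <= hi:
        return (hi - x) / (hi - peak)
    return 0.0


v_c, d_l, d_u, a_l, a_u = interesting_values(triangular, 1.0, 3.0, 4.0)
print(v_c, d_l, d_u, round(a_l, 6), round(a_u, 6))
```

For this triangular set the areas are those of two right triangles of height 1, so the numerical result can be checked by hand against $d_l/2$ and $d_u/2$.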
The central value of a fuzzy set is the element of its support whose membership degree equals 1. Figure 1 shows the graphical representation of these values.

2.2 Fuzzy Arithmetic of Interesting Values

The interesting values of a fuzzy number that results from operating on several fuzzy numbers can be calculated from the interesting values of each of those fuzzy numbers. According to Definition 1, Klir and Yuan [5] and Klir and Folger [4], we derive the following operations on fuzzy sets. Let $B$ and $C$ denote two L-R fuzzy sets, and $B(x)$ and $C(x)$ their respective membership functions. Let also $v_{cA}$, $v_{cB}$ and $v_{cC}$ indicate the central values of the fuzzy sets $A$, $B$ and $C$ respectively, and let $n$ indicate any real number.

If $C = A + B$ then $v_{cC} = v_{cA} + v_{cB}$.
If $C = A - B$ then $v_{cC} = v_{cA} - v_{cB}$.
If $C = nA$ then $v_{cC} = n \cdot v_{cA}$.
Fig. 1. Solution of the example as a function of α
Thus the central value of a fuzzy set is a linear combination of the central values of the fuzzy sets that compose it. Let $Y$ and $X_i$ denote fuzzy numbers, $v_{cY}$ and $v_{cX_i}$ their respective central values, and let $\beta_i$ be the coefficient that multiplies each fuzzy number $X_i$.

If $Y = \sum_{i=0}^{n} \beta_i \cdot X_i$ then $v_{cY} = \sum_{i=0}^{n} \beta_i \cdot v_{cX_i}$.

On the other hand, let $d_{lA}$, $d_{lB}$ and $d_{lC}$ indicate the lower distances of the fuzzy sets $A$, $B$ and $C$ respectively, and let $d_{uA}$, $d_{uB}$ and $d_{uC}$ denote their upper distances.

If $C = A + B$ then $d_{lC} = d_{lA} + d_{lB}$ and $d_{uC} = d_{uA} + d_{uB}$.
If $C = A - B$ then $d_{lC} = d_{lA} + d_{uB}$ and $d_{uC} = d_{uA} + d_{lB}$.
If $n \geq 0$ and $C = n \cdot A$ then $d_{lC} = n \cdot d_{lA}$ and $d_{uC} = n \cdot d_{uA}$.
If $n < 0$ and $C = n \cdot A$ then $d_{lC} = |n| \cdot d_{uA}$ and $d_{uC} = |n| \cdot d_{lA}$.

Let $\beta_i^{+}$ and $\beta_i^{-}$ denote the possible values for $\beta_i$ such that:

$$\beta_i = \begin{cases} \beta_i^{+} & \text{if } \beta_i \geq 0 \\ \beta_i^{-} & \text{if } \beta_i < 0 \end{cases} \tag{5}$$

Let $d_{lY}$ and $d_{uY}$ denote the lower and upper distances of the fuzzy number $Y$, and $d_{lX_i}$ and $d_{uX_i}$ the lower and upper distances of the fuzzy number $X_i$. Hence

$$d_{lY} = \sum_{i=0}^{n} \beta_i^{+} d_{lX_i} - \sum_{i=0}^{n} \beta_i^{-} d_{uX_i}$$

$$d_{uY} = \sum_{i=0}^{n} \beta_i^{+} d_{uX_i} - \sum_{i=0}^{n} \beta_i^{-} d_{lX_i}$$
Finally, let $a_{lA}$, $a_{lB}$ and $a_{lC}$ be the lower areas of the fuzzy sets $A$, $B$ and $C$ respectively, and let $a_{uA}$, $a_{uB}$ and $a_{uC}$ denote their upper areas.

If $C = A + B$ then $a_{lC} = a_{lA} + a_{lB}$ and $a_{uC} = a_{uA} + a_{uB}$.
If $C = A - B$ then $a_{lC} = a_{lA} + a_{uB}$ and $a_{uC} = a_{uA} + a_{lB}$.
If $n \geq 0$ and $C = n \cdot A$ then $a_{lC} = n \cdot a_{lA}$ and $a_{uC} = n \cdot a_{uA}$.
If $n < 0$ and $C = n \cdot A$ then $a_{lC} = |n| \cdot a_{uA}$ and $a_{uC} = |n| \cdot a_{lA}$.

Let $a_{lY}$ and $a_{uY}$ denote the lower and upper areas of the fuzzy number $Y$, and $a_{lX_i}$ and $a_{uX_i}$ the lower and upper areas of the fuzzy number $X_i$. Thus,

$$a_{lY} = \sum_{i=0}^{n} \beta_i^{+} a_{lX_i} - \sum_{i=0}^{n} \beta_i^{-} a_{uX_i}$$

$$a_{uY} = \sum_{i=0}^{n} \beta_i^{+} a_{uX_i} - \sum_{i=0}^{n} \beta_i^{-} a_{lX_i}$$
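The addition and subtraction rules for distances and areas can be checked on a small case. The sketch below assumes triangular fuzzy numbers, for which each area is half the corresponding distance, and uses standard interval arithmetic on the support; the class `Triangular` and the two sample sets are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangular:
    """Triangular fuzzy number given by its lower, central and upper values."""
    v_l: float
    v_c: float
    v_u: float

    @property
    def d_l(self):
        return self.v_c - self.v_l

    @property
    def d_u(self):
        return self.v_u - self.v_c

    @property
    def a_l(self):
        return self.d_l / 2  # right triangle of height 1

    @property
    def a_u(self):
        return self.d_u / 2

    def __add__(self, other):
        return Triangular(self.v_l + other.v_l, self.v_c + other.v_c, self.v_u + other.v_u)

    def __sub__(self, other):
        # The lower bound of A - B pairs the lower bound of A with the upper bound of B
        return Triangular(self.v_l - other.v_u, self.v_c - other.v_c, self.v_u - other.v_l)


a = Triangular(1.0, 3.0, 4.0)
b = Triangular(2.0, 2.5, 4.5)
total = a + b
diff = a - b

print("sum :", total.d_l, total.d_u, total.a_l, total.a_u)
print("diff:", diff.d_l, diff.d_u, diff.a_l, diff.a_u)
```

In the subtraction, the lower quantities of the result combine the lower quantities of the first set with the upper quantities of the second, which is the crossing that the rules above describe.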
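2.3 Linear Programming Fuzzy Regression Model

Based on the above results, we need two sets of variables: the first one for slack, $s_{(\cdot)}$, and the other one for surplus, $f_{(\cdot)}$. These variables are added to each constraint for each observation $j$. This allows the $\beta_i$ coefficients to make each equation fit, where $Y_j$ is the dependent variable and $X_{ij}$ are the explanatory variables. Finally, the objective function is the minimization of the sum of the slack and surplus variables. Formally,

$$
\begin{aligned}
\min\ z = {} & \sum_{j=1}^{m} \left( s_{v_c j} + f_{v_c j} + s_{d_l j} + f_{d_l j} + s_{d_u j} + f_{d_u j} + s_{a_l j} + f_{a_l j} + s_{a_u j} + f_{a_u j} \right) \\
\text{s.t.}\quad v_{cY_j} = {} & \sum_{i=0}^{n} \beta_i \cdot v_{cX_{ij}} + s_{v_c j} - f_{v_c j} && \forall\, j \in \mathbb{N}_m \\
d_{lY_j} = {} & \sum_{i=0}^{n} \beta_i^{+} d_{lX_{ij}} - \sum_{i=0}^{n} \beta_i^{-} d_{uX_{ij}} + s_{d_l j} - f_{d_l j} && \forall\, j \in \mathbb{N}_m \\
d_{uY_j} = {} & \sum_{i=0}^{n} \beta_i^{+} d_{uX_{ij}} - \sum_{i=0}^{n} \beta_i^{-} d_{lX_{ij}} + s_{d_u j} - f_{d_u j} && \forall\, j \in \mathbb{N}_m \\
a_{lY_j} = {} & \sum_{i=0}^{n} \beta_i^{+} a_{lX_{ij}} - \sum_{i=0}^{n} \beta_i^{-} a_{uX_{ij}} + s_{a_l j} - f_{a_l j} && \forall\, j \in \mathbb{N}_m \\
a_{uY_j} = {} & \sum_{i=0}^{n} \beta_i^{+} a_{uX_{ij}} - \sum_{i=0}^{n} \beta_i^{-} a_{lX_{ij}} + s_{a_u j} - f_{a_u j} && \forall\, j \in \mathbb{N}_m
\end{aligned}
\tag{6}
$$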
The first constraint refers to the central value. The second and third constraints estimate the lower and upper distances, and the fourth and fifth constraints bound the lower and upper area values. The model in (6) is defined for type-1 L-R fuzzy numbers; its main goal is a regression model that fits a set of fuzzy dependent variables $Y_j$ through a set of independent fuzzy variables $X_{ij}$. The model seeks an approximation of the complete membership function $Y_j(x)$ of $Y_j$, represented by its parameters and by its area decomposed into a lower and an upper area, through the constraints of (6).
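A reduced version of model (6) might be set up with a general-purpose LP solver as sketched below. Several simplifications are assumed here and are not part of the original formulation: the data are symmetric triangular sets, so a single spread stands in for both distances and the area constraints become redundant; there is one explanatory variable whose coefficient is restricted to be non-negative, so the $\beta_i^{-}$ terms vanish; the intercept is crisp; and the observations are synthetic. The solver `scipy.optimize.linprog` is used only for illustration, since the paper does not name a solver.

```python
import numpy as np
from scipy.optimize import linprog

# Synthetic symmetric triangular data: central values and spreads (assumed)
vc_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d_x = np.array([0.2, 0.4, 0.3, 0.5, 0.1])
# Dependent data built to satisfy Y = 2 + 0.5 X exactly, so the optimum is known
vc_y = 2.0 + 0.5 * vc_x
d_y = 0.5 * d_x


def fit_lp(vc_x, d_x, vc_y, d_y):
    """Fit beta0 (free) and beta1 (>= 0) by minimising total slack plus surplus."""
    m = len(vc_x)
    # Variable layout: [beta0, beta1, s_vc (m), f_vc (m), s_d (m), f_d (m)]
    n_var = 2 + 4 * m
    cost = np.concatenate([np.zeros(2), np.ones(4 * m)])
    a_eq = np.zeros((2 * m, n_var))
    b_eq = np.concatenate([vc_y, d_y])
    eye = np.eye(m)
    # Central-value rows: beta0 + beta1 * vc_x + s_vc - f_vc = vc_y
    a_eq[:m, 0] = 1.0
    a_eq[:m, 1] = vc_x
    a_eq[:m, 2:2 + m] = eye
    a_eq[:m, 2 + m:2 + 2 * m] = -eye
    # Distance rows: beta1 * d_x + s_d - f_d = d_y (the crisp intercept has no spread)
    a_eq[m:, 1] = d_x
    a_eq[m:, 2 + 2 * m:2 + 3 * m] = eye
    a_eq[m:, 2 + 3 * m:] = -eye
    bounds = [(None, None), (0, None)] + [(0, None)] * (4 * m)
    return linprog(cost, A_eq=a_eq, b_eq=b_eq, bounds=bounds, method="highs")


result = fit_lp(vc_x, d_x, vc_y, d_y)
beta0, beta1 = result.x[:2]
total_deviation = result.fun
print(beta0, beta1, total_deviation)
```

Because the synthetic data lie exactly on a line, the solver should recover the generating coefficients with zero total deviation; with real observations the slack and surplus variables absorb the misfit of each constraint.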
3 Validation of the Model - A Comparison Case

To measure its effectiveness, the model is compared to the models proposed by Kao, Tanaka and Bargiela (see [3]). The problem consists of a single-variable regression analysis. The input values $(v_l, v_c, v_u)$ characterize symmetrical triangular fuzzy sets, so we need fewer constraints since the areas and distances are linear functions of their shapes.

Table 1. Results for comparison case

Proposal               Bargiela      Tanaka        Kao           Present proposal
β0                     3.4467        3.201         3.565         2.6154
β1                     0.536         0.579         0.522         0.6923
Central value error    0.64627579    0.692704031   0.643133875   1.131409359
Distance error         0.094192      0.077542938   0.09996175    0.041422189
Total error (e)        0.860504381   0.877637151   0.862029944   1.082973475
After computing the interesting values of the variables and applying the LP model (6) to eight observations, the obtained βs, the error values and the RMSE-based error

$$e = \sqrt{\frac{1}{8}\sum_{j=1}^{8}\left(v_{cj} - v_{cj}^{*}\right)^2 + \frac{1}{8}\sum_{j=1}^{8}\left(d_j - d_j^{*}\right)^2}$$

are shown in Table 1, where $v_{cj}^{*}$ and $d_j^{*}$ are estimations of $v_{cj}$ and $d_j$ respectively. Although the error of $v_{cj}^{*}$ obtained by the LP model is the highest, the error of $d_j^{*}$ is the lowest, which leads to smaller area errors. Moreover, its efficiency is improved since the structure of an LP model is even simpler than Tanaka's proposal (see [8]).
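The relation between the three error rows of Table 1 can be verified directly: the total error is the square root of the sum of the central value error and the distance error. The short check below uses only the figures reported in Table 1; the dictionary and function names are illustrative.

```python
import math

# (central value error, distance error, reported total error e) from Table 1
table_1 = {
    "Bargiela": (0.64627579, 0.094192, 0.860504381),
    "Tanaka": (0.692704031, 0.077542938, 0.877637151),
    "Kao": (0.643133875, 0.09996175, 0.862029944),
    "Present proposal": (1.131409359, 0.041422189, 1.082973475),
}


def total_error(central_error, distance_error):
    """Square root of the summed mean squared errors."""
    return math.sqrt(central_error + distance_error)


recomputed = {name: total_error(c, d) for name, (c, d, _) in table_1.items()}
for name, (_, _, reported) in table_1.items():
    print(f"{name:17s} recomputed={recomputed[name]:.6f} reported={reported:.6f}")
```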
3.1 Shipping Company Case Study

In this case, a shipping company wants to identify the role of several factors in its profit. The factors considered are: price of service, shipping time, package weight and the return time of the service vehicle. The linguistic label for each $X_j$ is its expected value. Each $X_j$ is defined by the average of the observations, and each constraint $j$ uses the $(i, j)$ observation as $v_{cj}$, so the membership function for each $X_j$ is defined as follows.

Price:

$$X_1(x) = \begin{cases} 1 - \left(\dfrac{1.05 - x}{0.28}\right)^{2} & \text{for } 0.77 < x \leq 1.05 \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

Shipping time:

$$X_2(x) = \begin{cases} \left(\dfrac{2.331 - x}{0.669}\right)^{2} & \text{for } 2.331 < x \leq 3 \\ 1 - \left(\dfrac{3 - x}{0.669}\right)^{4} & \text{for } 3 < x \leq 3.669 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

Weight:

$$X_3(x) = \begin{cases} \left(\dfrac{x - 15.07}{9.42}\right)^{2} & \text{for } 5.65 < x \leq 15.07 \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Return time:

$$X_4(x) = \begin{cases} \left(\dfrac{x - 6.606}{4.386}\right)^{4} & \text{for } 2.22 < x \leq 6.066 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

Profit:

$$Y(x) = \begin{cases} 1 - \left(\dfrac{7.38 - x}{2.265}\right)^{2} & \text{for } 5.115 < x \leq 7.38 \\ \left(\dfrac{x - 11.793}{4.413}\right)^{2} & \text{for } 7.38 < x \leq 11.793 \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
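As a worked illustration, the interesting values of the Profit set can be read off or integrated from its membership function in (11), taking the left branch as $1 - ((7.38 - x)/2.265)^2$ on $(5.115, 7.38]$ and the right branch as $((x - 11.793)/4.413)^2$ on $(7.38, 11.793]$. The substitutions $t$ below are introduced only for this calculation and do not appear in the original text.

```latex
\begin{aligned}
v_c &= 7.38, \qquad d_l = 7.38 - 5.115 = 2.265, \qquad d_u = 11.793 - 7.38 = 4.413 \\
a_l &= \int_{5.115}^{7.38} \left[ 1 - \left( \frac{7.38 - x}{2.265} \right)^{2} \right] dx
     = 2.265 \int_{0}^{1} \left( 1 - t^{2} \right) dt
     = 2.265 \cdot \tfrac{2}{3} = 1.51,
     \qquad t = \frac{7.38 - x}{2.265} \\
a_u &= \int_{7.38}^{11.793} \left( \frac{x - 11.793}{4.413} \right)^{2} dx
     = 4.413 \int_{0}^{1} t^{2}\, dt
     = \frac{4.413}{3} = 1.471,
     \qquad t = \frac{11.793 - x}{4.413}
\end{aligned}
```

The two branches have different shapes, so the areas are not proportional to the distances in the same way on each side: the lower area is two thirds of $d_l$ while the upper area is one third of $d_u$. This is why model (6) treats areas and distances as separate constraints for nonlinear sets.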
The model was applied to 49 observations and 7 $X_j$, with the following results:

$$Y^{*} = 0.6216 \cdot X_1 + 7.9743 \cdot X_2 + 0.4685 \cdot X_3 - 0.0073 \cdot X_4 - 5.1762 \tag{12}$$
Figure 2 shows the comparison between the estimated dependent variables (gray area) and the expected dependent variables (black line), using the averages of $Y_j$ and $\hat{Y}_j$ as central values. At first glance, Figure 2 shows that the LP model reaches a good approximation of the original $Y_j(x)$, but for decision making the selected defuzzification method is the centroid, since it can be obtained by a linear combination of the position, spread and area of the fuzzy sets $X_i$, viewed as a fuzzy relational equation.

Fig. 2. Graphical comparison of the results of the shipping company case study

Analysis of the Results. The βs obtained by the regression are used to obtain the estimated centroids, which yield the following error measures: RMSE=2.29 and MSE=5.19, computed through $\xi_j$ (see Equations (1) and (2)). The coefficient of determination obtained is $R^2 = 72.46$, so 72.476% of the behavior of the dependent variable is explained by the independent variables through the application of the model. In addition, some desirable properties of the residuals are tested, as shown below.

Absence of Autocorrelation in the Residuals. Based on a 95% confidence level, the results of the autocorrelation analysis are shown in Table 2. Based on these results, there is no autocorrelation effect; therefore the residuals are randomly distributed.

Table 2. Ljung-Box autocorrelation test on the residuals

Lag                    1         2         3         4
Autocorrelation        -0.138    -0.295    -0.12     0.094
Ljung-Box statistic    0.99      5.624     6.409     6.903
Significance           0.32      0.06      0.093     0.141
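The Ljung-Box statistics in Table 2 can be approximately reproduced from the listed autocorrelations using the standard formula $Q_h = n(n+2)\sum_{k=1}^{h} \rho_k^2/(n-k)$. The sketch below assumes $n = 49$ residuals, one per observation of the case study; small differences from the table are expected because the autocorrelations are rounded to three decimals.

```python
# Sample autocorrelations of the residuals at lags 1 to 4, as listed in Table 2
autocorrelations = [-0.138, -0.295, -0.12, 0.094]
n_obs = 49  # assumed number of residuals (one per observation)


def ljung_box(rho, n):
    """Return the cumulative Ljung-Box Q statistic for lags 1..len(rho)."""
    stats = []
    running = 0.0
    for lag, r in enumerate(rho, start=1):
        running += r * r / (n - lag)
        stats.append(n * (n + 2) * running)
    return stats


q_stats = ljung_box(autocorrelations, n_obs)
for lag, q in enumerate(q_stats, start=1):
    print(f"lag {lag}: Q = {q:.3f}")
```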
Normal Distribution in the Residuals. For a 95% confidence level, the Kolmogorov-Smirnov test reaches a p-value of 0.150 and the Shapiro-Wilk test reaches a p-value of 0.121, so we do not reject the hypothesis that the residuals are normally distributed.

Zero Mean Residuals. A one-sample test was performed to test whether $\bar{\xi} = \mu = 0$. The test, based on the asymptotically normal behavior of the mean of the residuals, gives $\bar{\xi} = 0.320$ with an obtained significance of 0.333. We conclude that there is no statistical evidence against $\xi$ having zero mean.

Homoscedasticity of the Residuals. By dividing the residuals into three balanced groups and applying the F-test for variances to each pair, with a 95% confidence level, it is concluded that the residuals are homoscedastic (see Table 3).
Table 3. Homoscedasticity test

Group    F Sample    F Statistic    Significance
1-2      1.034       2.403          0.473
1-3      1.845       2.352          0.117
2-3      1.783       2.352          0.131
Main Conclusions. The LP model had good results since it produced normally distributed, independent, zero-mean and homoscedastic residuals. Thus, the βs and the regression analysis are valid. The βs show that $X_1$ and $X_3$ (see (7) and (9)) have the highest contribution to the profit, so it is recommended to review the pricing policy of the company.
4 Concluding Remarks

The LP model presented in this paper focuses on minimizing the errors of a linear combination of $X_j$ that estimates $Y$. The method can deal with nonlinear fuzzy data; therefore our proposal is an alternative formulation for fuzzy regression. Some real problems involve uncertainty that can be treated as fuzzy sets; the LP model presented in this paper is more efficient than other proposals because it can be handled through mixed fuzzy-convex optimization methods. Finally, we recommend the use of Type-2 fuzzy sets as uncertainty measures to deal with the perception about a linguistic variable of a fuzzy set held by multiple experts. For further information see Melgarejo in [6] and Mendel in [7].

Acknowledgments. The authors would like to thank Jesica Rodriguez-Lopez for her invaluable support.
References

1. Bargiela, A., et al.: Multiple regression with fuzzy data. Fuzzy Sets and Systems 158(4), 2169–2188 (2007)
2. Gladysz, B., Kuchta, D.: Least squares method for L-R fuzzy variable. In: 8th International Workshop on Fuzzy Logic and Applications, vol. 8, pp. 36–43. IEEE, Los Alamitos (2009)
3. Kao, C., Chyu, C.: Least-squares estimates in fuzzy regression analysis. European Journal of Operational Research 148(2), 426–435 (2003)
4. Klir, G.J., Folger, T.A.: Fuzzy Sets, Uncertainty and Information. Prentice Hall, Englewood Cliffs (1992)
5. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, Englewood Cliffs (1995)
6. Melgarejo, M.A.: Implementing Interval Type-2 Fuzzy processors. IEEE Computational Intelligence Magazine 2(1), 63–71 (2007)
7. Mendel, J.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice Hall, Englewood Cliffs (1994)
8. Tanaka, H., et al.: Linear regression analysis with fuzzy model. IEEE Transactions on Systems, Man and Cybernetics 12(4), 903–907 (1982)
9. Zadeh, L.A.: Toward a generalized theory of uncertainty (GTU): an outline. Information Sciences 172(1), 1–40 (2005)