Data Analysis and Utilization Method Based on Genetic Programming in Ship Design

Kyung Ho Lee¹, Yun Seog Yeun², Young Soon Yang³, Jang Hyun Lee¹ and June Oh¹

¹ Inha University, Department of Naval Architecture & Ocean Engineering, 253 Yonghyun-dong, Nam-gu, Inchon, Korea
² Daejin University, Department of Mechanical Design Engineering, San 11-1 Sundan-dong, Pocheon, Kyonggi-do, Korea
³ Seoul National University, Department of Naval Architecture & Ocean Engineering, San 56-1 Shillim-dong, Gwanak-gu, Seoul, Korea

Abstract. Although Korean shipyards have accumulated a great amount of data, they do not have appropriate tools to utilize the data in practical work. Engineering data contains the experience and know-how of experts. Data mining techniques are useful for extracting knowledge or information from the accumulated data. This paper presents a machine learning method based on genetic programming (GP), which can serve as one of the components for realizing data mining. The paper deals with linear models of GP for regression or approximation problems when the given learning samples are not sufficient.
M. Gavrilova et al. (Eds.): ICCSA 2006, LNCS 3981, pp. 1199-1209, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Recently, intelligent systems for ship design have been slowly changing from those developed by the knowledge-based approach, whose knowledge is difficult to extract and represent, to those developed by the data-driven approach, which is relatively easy to handle. Because engineering data contains meaningful information such as the experience and know-how of experts, the development of data analysis and utilization methods is very important. In the ship design process, the utilization of existing data is one of the most important issues. Recent research has focused on data mining for the extraction of useful knowledge or information from accumulated engineering data [1]. Among the several functions of data mining, this paper focuses on data prediction. In most engineering fields, an artificial neural network (ANN) is the usual training system for prediction. But if the training data are nonlinear and discontinuous, the performance of the training result deteriorates.
Moreover, the trained result cannot be inspected inside the artificial neural network; that is, an ANN is a black-box system. To overcome these shortcomings of ANN, this paper adopts genetic programming (GP) [2], which has an excellent ability to approximate non-linear and discontinuous data. Above all, GP can present the trained result as a function tree. Generally, training with ANN or GP needs a large amount of accumulated data, but in real ship design practice there is not enough data for the training procedure. Therefore, this paper presents an enhanced genetic programming technique that fits an approximating function from a small set of learning data. In addition, a data prediction tool based on the genetic programming algorithm is developed. Finally, some application examples in ship design are presented to verify the implemented GP program.
2 Implementing Linear Models in Genetic Programming

Regression, or function approximation, finds the underlying model that best explains the given samples while taking generalization capability into account. Perhaps one of the most important tasks is the selection of an appropriate functional form for the model. For instance, polynomials, sigmoid-based neural networks, and radial basis function networks can be considered. After the base functions are selected, a proper model form is constructed by combining these bases in a predefined way. Usually the model contains numerical coefficients or weights, which are estimated by an optimization method so that the learning error or another criterion is minimized.

Genetic programming (GP), on the other hand, offers a very different alternative for regression problems. GP deals with a tree-structured program, called a GP tree, whose structure evolves towards the minimization of its fitness value by using genetic operators. Unlike the traditional approximation methods, where the structure of the approximate model is fixed, the structure of the GP tree itself is modified and optimized, and thus GP trees can be more appropriate or accurate approximate models. Much research has been done on GP for regression problems and system identification [3].

2.1 Generation of Optimal Linear Model by Minimum Description Length Method

Although GP has great potential, it still has major disadvantages originating from the evolved structure of the GP tree. In most engineering applications related to regression, a good tree structure alone is not enough. The tree also needs numerical weights, which are estimated through optimization techniques so as to further minimize the fitness value.
For example, consider the tree $\theta_1 (+\ \theta_2 x_1\ \theta_3 (*\ \theta_4 x_2\ \theta_5 (\sin\ \theta_6 (+\ \theta_7 x_3\ \theta_8 x_4))))$, where $\theta_i$ is the weight attached to every node of the tree, by which the output value of the node is always multiplied, and the $x_i$ can be variables or constants [2].
Since the GP tree is a nonlinear function, there are no cheap methods for estimating $\theta_i$, so a computationally expensive nonlinear optimization technique such as simulated annealing (SA) is the only option. If the population includes several hundred trees, several hundred optimization processes are required for each generation. This computational cost is a heavy burden on the use of GP in many engineering applications, such as the response surface method in optimization problems [3]. Although a fast estimation technique based on linear associative memories has been proposed [4], this method also has difficulty in providing accurate values of the weights. Moreover, the GP tree that gives the minimum fitness value based only on the error measure of the learning samples need not be the one that generalizes best and describes the underlying regularity of the samples well. To select the best GP trees, statistical inference such as Rissanen's modern minimum description length (MDL) principle [5] may be required. As can be expected, it is nearly intractable to use the MDL criterion in GP because it requires multiple integrations of the nonlinear GP tree. This paper focuses on Rissanen's modern MDL for computing the fitness of the GP tree. MDL tries to find the model that is encoded with the shortest code length and at the same time best describes all learning samples. The philosophical foundation behind MDL is thus closely related to Ockham's razor, which holds that the simplest model with a good fit to the samples is the best one. This paper investigates linear model GP (LM-GP) and MDL with the directional derivative based smoothing (DDBS) method. A common linear model can be written as (1).
$$y = \sum_i \theta_i x_i = \theta^T x \qquad (1)$$

where $\theta$ is the vector of the unknown parameters $\theta_i$, and $x$ is the $d$-dimensional vector of variables. Since $y$ is a linear function of the $\theta_i$, it can be called a linear model. Equation (1) can be extended as shown in (2).

$$y = \sum_{i=1}^{\kappa-1} \theta_i b_i = \theta^T b \qquad (2)$$
where $\kappa - 1$ is the number of base functions, $b = (b_i)_{i=1}^{\kappa-1}$ is a vector of arbitrary continuous functions of $p$ variables chosen from the $d$-dimensional vector $x$, and no $b_i$ should be a linear combination of the other functions. Equation (2) is still a linear model, but $b_i$ is not a standard base function, since $b_i$ and $b_j$ ($i \neq j$) are not of the same functional form. Herein, however, the term 'base function' is used to refer to $b_i$. Given a learning set $L = \{(z_i, t_i)\}_{i=1}^{n}$, where $z_i$ is the $d$-dimensional vector $(z_1^i, \ldots, z_d^i)$ and $t_i$ is the target value, GP must find $b$ using only the information contained in $L$. To build the linear model from the GP tree, the base functions $b_i$ in (2) must be extracted first. For example, consider the GP tree:
(- (* 0.7 (* 1.5 (sin (+ x1 (* 0.3 (exp x2)))))) (* (* 0.1 (* x1 x3)) (* x1 (cos (- x2 1)))))

If the tree is expanded and expressed in standard mathematical form, we have the following function.

$$1.05 \sin(x_1 + 0.3 \exp(x_2)) + (-0.1)\, x_1^2 x_3 \cos(x_2 - 1) \qquad (3)$$
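The expansion of the prefix-notation tree into (3) can be checked numerically. The following is a minimal sketch, not the translation algorithm of the paper: the nested-tuple encoding of the tree and the names `TREE`, `evaluate` and `expanded_form` are assumptions made only for this illustration.

```python
import math

# Hypothetical encoding of the prefix-notation GP tree shown above:
# (operator, child, ...) for function nodes, str for variables, float for constants.
TREE = ("-",
        ("*", 0.7, ("*", 1.5, ("sin", ("+", "x1", ("*", 0.3, ("exp", "x2")))))),
        ("*", ("*", 0.1, ("*", "x1", "x3")),
              ("*", "x1", ("cos", ("-", "x2", 1.0)))))

BINARY = {"+": lambda a, b: a + b,
          "-": lambda a, b: a - b,
          "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos, "exp": math.exp}


def evaluate(node, env):
    """Recursively evaluate a GP tree for the variable values given in env."""
    if isinstance(node, str):
        return env[node]
    if isinstance(node, (int, float)):
        return float(node)
    op, *args = node
    values = [evaluate(arg, env) for arg in args]
    if op in BINARY:
        return BINARY[op](*values)
    return UNARY[op](*values)


def expanded_form(x1, x2, x3):
    """Equation (3) written out directly."""
    return (1.05 * math.sin(x1 + 0.3 * math.exp(x2))
            - 0.1 * x1 ** 2 * x3 * math.cos(x2 - 1.0))


point = {"x1": 0.4, "x2": -0.2, "x3": 1.3}
tree_value = evaluate(TREE, point)
formula_value = expanded_form(**point)
print(tree_value, formula_value)  # the two values agree up to rounding
```

The two evaluations differ only by floating-point rounding, because the constants 0.7 and 1.5 collapse into 1.05 and the two occurrences of x1 in the second subtree collapse into x1^2.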
From (3), we can identify two base functions. If a constant term is always added to the linear model, then there are three base functions:

$$b_1 = 1, \quad b_2 = \sin(x_1 + 0.3 \exp(x_2)), \quad b_3 = x_1^2 x_3 \cos(x_2 - 1) \qquad (4)$$
Note that the real numbers attached to the base functions, such as 1.05 and -0.1, are ignored because the $\theta_i$ will be estimated later by the ordinary least squares (OLS) method. Basically, the translation algorithm collects all possible base functions from the tree. Before discussing the algorithm, the sets of terminals and GP functions need to be defined.
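To illustrate why the constants 1.05 and -0.1 can be dropped, the sketch below re-estimates the weights of the three base functions in (4) by least squares from noiseless samples of (3). It is only an illustration under stated assumptions: the sample range, the number of samples, the random seed and the helper names (`base_functions`, `solve`, `ols`) are hypothetical, and the normal equations are solved with a small hand-written elimination routine rather than with whatever solver the authors used.

```python
import math
import random


def base_functions(x1, x2, x3):
    """Base functions b1, b2, b3 of equation (4)."""
    return [1.0,
            math.sin(x1 + 0.3 * math.exp(x2)),
            x1 ** 2 * x3 * math.cos(x2 - 1.0)]


def target(x1, x2, x3):
    """Equation (3), used here only to generate noiseless samples."""
    return (1.05 * math.sin(x1 + 0.3 * math.exp(x2))
            - 0.1 * x1 ** 2 * x3 * math.cos(x2 - 1.0))


def solve(a, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    theta = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (m[r][n] - known) / m[r][r]
    return theta


def ols(rows, targets):
    """Least-squares weights from the normal equations (B^T B) theta = B^T t."""
    k = len(rows[0])
    btb = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    btt = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(btb, btt)


rng = random.Random(0)  # assumed seed, for repeatability only
samples = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(30)]
rows = [base_functions(*z) for z in samples]
targets = [target(*z) for z in samples]
theta = ols(rows, targets)
print(theta)  # close to the weights 0, 1.05 and -0.1 of equation (3)
```

Because the samples are noiseless and the model contains exactly the functions that generated them, the estimated weights reproduce the constants of (3) up to rounding, with a weight near zero for the added constant term.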
$$T_{GP} = \{x_1, \ldots, x_d, R, \mathrm{one}\}$$

where $R$ is a random number such that $|R| < 1$, and 'one' is 1.

$$F_{GP} = \{g_1, g_2, \ldots, +, -, *\}$$

where $g_i$ can be any continuous function. For $g_i$, we use various mathematical functions. Instead of mathematical functions, polynomial functions can also be considered for $g_i$; in that case the set of low-order Taylor series in Table 1 is used. Iba et al. have carried out much work related to GP with polynomials [6]. Unlike their work, we tackle the problem by using LM-GP with Taylor series, which can handle very high order polynomials with smoothness (hereinafter, LM-GP with polynomials is denoted PLM-GP). PLM-GP has many advantages: it is computationally more efficient and numerically more stable when the ordinary least squares (OLS) method is used. In the following subsections, the data structure for the linear model and the symbolic processing algorithm are presented [7].
Table 1. Taylor series used for GP functions

Symbol     Math. function    Taylor series
tcos       cos(x)            1 - 1/2 x^2
tsec       sec(x)            1 + 1/2 x^2
tsin       sin(x)            x - 1/6 x^3
ttan       tan(x)            x + 1/3 x^3
tcosh      cosh(x)           1 + 1/2 x^2
tsinh      sinh(x)           x + 1/6 x^3
ttanh      tanh(x)           x - 1/3 x^3
tlogcos    log(cos(x))       -1/2 x^2
t1sqrt     (1 + x)^(1/2)     1 + 1/2 x - 1/8 x^2 + 1/16 x^3
ti1sqrt    (1 + x)^(-1/2)    1 - 1/2 x + 3/8 x^2 - 5/16 x^3
texp       exp(x)            1 + x + 1/2 x^2 + 1/6 x^3
t1log      log(1 + x)        x - 1/2 x^2 + 1/3 x^3
ti1px      (1 + x)^(-1)      1 - x + x^2 - x^3
ti2px      (1 + x)^(-2)      1 - 2x + 3x^2 - 4x^3
texpsin    exp(sin(x))       1 + x + 1/2 x^2
texptan    exp(tan(x))       1 + x + 1/2 x^2 + 1/2 x^3
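As an illustration of how the entries of Table 1 behave as GP functions, the sketch below implements a few of them and compares each with the function it approximates. The dictionary layout and the helper `truncation_error` are assumptions for this example only; the coefficients are those listed in Table 1.

```python
import math

# A few of the truncated Taylor series of Table 1, keyed by their GP symbol.
TAYLOR = {
    "tcos": lambda x: 1 - x ** 2 / 2,
    "tsin": lambda x: x - x ** 3 / 6,
    "texp": lambda x: 1 + x + x ** 2 / 2 + x ** 3 / 6,
    "t1log": lambda x: x - x ** 2 / 2 + x ** 3 / 3,
    "ti1px": lambda x: 1 - x + x ** 2 - x ** 3,
}

# The mathematical functions that the series approximate.
EXACT = {
    "tcos": math.cos,
    "tsin": math.sin,
    "texp": math.exp,
    "t1log": math.log1p,
    "ti1px": lambda x: 1 / (1 + x),
}


def truncation_error(symbol, x):
    """Absolute difference between the truncated series and the exact function."""
    return abs(TAYLOR[symbol](x) - EXACT[symbol](x))


errors_near = {s: truncation_error(s, 0.1) for s in TAYLOR}
errors_far = {s: truncation_error(s, 0.9) for s in TAYLOR}
for symbol in TAYLOR:
    print(symbol, errors_near[symbol], errors_far[symbol])
```

Each series is accurate near zero and departs from its function further away. Inside PLM-GP this is not a defect, since the series serve as polynomial building blocks rather than as approximations of the original functions.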
2.2 Generation of Virtual Data Set by DDBS
With small samples, there is no guarantee that the correct model will be selected by simply choosing the model that shows the shortest code length. Moreover, our focus is on problems where the available samples are very limited and the base functions of the linear model become highly nonlinear during the evolving process. Because of the high nonlinearity and the limited number of samples, the linear model shows extreme overfitting behaviors, especially in the regions away from the sample points, even though the model was chosen according to the MDL selection criterion. To avoid this serious problem, the directional derivative based smoothing (DDBS) method is introduced. The basic idea of DDBS is that, once the linear model $y$ is given, its behavior is inspected by traveling along the line ${}^{i,j}l$ connecting two nearest sample points, from $z_i$ to $z_j$. If unwanted peaks or valleys are detected, they are suppressed by forcing the directional derivatives of $y$ in the direction of ${}^{i,j}l$ to approach the slope ${}^{i,j}s = (t_j - t_i)/|z_j - z_i|$ by adjusting the parameters of $y$. For simplicity of presentation, it is assumed in this section that $y$ is not a function of the chosen $p$ ($< d$) variables but a function of the full $d$ variables. Fig. 1 shows the nearest sample points with the lines connecting them. If the distance between $z_i$ and $z_j$ is denoted by ${}^{i,j}u = |z_j - z_i|$, then the slope is ${}^{i,j}s = (t_j - t_i)/{}^{i,j}u$, and the directional vector ${}^{i,j}h = ({}^{i,j}h_k)_{k=1}^{d} = (z_j - z_i)/{}^{i,j}u$ can be defined. At the points ${}^{i,j}l_k$ ($k = 1, \ldots, \gamma$) on the line ${}^{i,j}l$, we can compute the derivative of $y$ in the direction of ${}^{i,j}h$ as ${}^{i,j}D_k = \nabla y({}^{i,j}l_k) \cdot {}^{i,j}h$, where $\nabla y$ is the gradient of $y$. DDBS tries to make ${}^{i,j}D_k$ approach ${}^{i,j}s$. Unless $y$ is a linear function, ${}^{i,j}D_k$ is very different from ${}^{i,j}s$, but DDBS is very effective for smoothing out $y$ and, more importantly, no additional sample points are required.
Fig. 1. Generation of Virtual Data from Sample Points (sample points $z_i$, $z_j$, $z_k$, $z_l$ joined by the lines ${}^{i,j}l$, ${}^{i,l}l$ and ${}^{k,i}l$)
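The quantities used by DDBS can be computed for one pair of sample points as in the sketch below. Several choices are assumptions rather than details taken from the text: the points on the line are taken as evenly spaced interior points, the gradient is estimated by central differences, and the squared-deviation `penalty` is only one possible way to measure how far the directional derivatives are from the slope. The function names are hypothetical.

```python
import math


def gradient(f, point, step=1e-6):
    """Central-difference estimate of the gradient of f at point."""
    grad = []
    for k in range(len(point)):
        upper = list(point)
        lower = list(point)
        upper[k] += step
        lower[k] -= step
        grad.append((f(upper) - f(lower)) / (2 * step))
    return grad


def ddbs_terms(f, z_i, t_i, z_j, t_j, gamma=5):
    """Return the slope s and the directional derivatives D_k on the line z_i -> z_j."""
    u = math.dist(z_i, z_j)                       # distance u = |z_j - z_i|
    s = (t_j - t_i) / u                           # slope between the two targets
    h = [(b - a) / u for a, b in zip(z_i, z_j)]   # unit directional vector
    derivs = []
    for k in range(1, gamma + 1):
        frac = k / (gamma + 1)                    # assumed: evenly spaced interior points
        l_k = [a + frac * (b - a) for a, b in zip(z_i, z_j)]
        g = gradient(f, l_k)
        derivs.append(sum(gk * hk for gk, hk in zip(g, h)))
    return s, derivs


def penalty(s, derivs):
    """Assumed measure of roughness: squared deviation of D_k from the slope s."""
    return sum((d - s) ** 2 for d in derivs)


def linear(p):
    return 2 * p[0] + 3 * p[1]


def wiggly(p):
    # Passes through the same two samples but oscillates in between.
    return linear(p) + 2 * math.sin(2 * math.pi * p[0] / 3)


z_i, z_j = (0.0, 0.0), (3.0, 4.0)
t_i, t_j = linear(z_i), linear(z_j)
s_lin, d_lin = ddbs_terms(linear, z_i, t_i, z_j, t_j)
s_wig, d_wig = ddbs_terms(wiggly, z_i, t_i, z_j, t_j)
print(penalty(s_lin, d_lin), penalty(s_wig, d_wig))
```

For the linear model every directional derivative equals the slope, so the penalty vanishes. The oscillating model interpolates the same two samples yet gives a clearly positive penalty, which is the kind of behavior DDBS is meant to suppress.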
3 Validation Test for the Developed Method

The developed method is compared with standard GP to show that a GP tree trained with a limited number of samples frequently shows abnormal behaviors, which can be dealt with effectively by adopting LM/PLM-GP with virtual samples. The test function, Rosenbrock's function, is given as follows.

$$y = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2, \qquad -2 \le x_i \le 2, \quad i = 1, 2$$

Learning and test samples are prepared on 6x6 and 25x25 grids, respectively. For PLM-GP and standard GP, the best results are shown in Fig. 2(d) and (e). As mentioned, standard GP often shows abnormal behaviors such as very large peaks or discontinuities when the number of learning samples is not large enough. Fig. 2(f) shows one typical example where a sudden peak appears. The best linear model of PLM-GP and its corresponding polynomial are shown in Fig. 3.
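The learning and test sets of this validation test can be generated as in the following sketch. It assumes that the grids are evenly spaced and include the end points of the interval, which the text does not state explicitly; the helper names `grid` and `rmse` are hypothetical.

```python
import math


def rosenbrock(x1, x2):
    """Rosenbrock's function as given in the text."""
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2


def grid(n, low=-2.0, high=2.0):
    """n x n evenly spaced points covering [low, high]^2 (end points included)."""
    axis = [low + (high - low) * i / (n - 1) for i in range(n)]
    return [(a, b) for a in axis for b in axis]


def rmse(model, samples):
    """Root mean squared error of model over (point, target) pairs."""
    total = sum((model(*z) - t) ** 2 for z, t in samples)
    return math.sqrt(total / len(samples))


learning = [(z, rosenbrock(*z)) for z in grid(6)]   # 36 noiseless learning samples
test = [(z, rosenbrock(*z)) for z in grid(25)]      # 625 noiseless test samples
print(len(learning), len(test))
```

An evolved model would be passed to `rmse` in place of `rosenbrock` to obtain learning and test errors of the kind quoted in the caption of Fig. 2.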
Fig. 2. Fitting results of Rosenbrock's function with noiseless samples: (a) the original function; (b) generated lines for creating virtual samples; (c) the best result of LM-GP; (d) the best result of PLM-GP; (e) the best result of standard GP; (f) an example of a bad model found by standard GP. In Fig. 2(f), the RMSEs of the learning and test samples are 0.002348 and 0.3113, respectively, and the number of the tree's nodes is 19.
0.995 - 1.404x2 - 2.859x1 + 6.883E-1x2^2 + 3.217x1x2 - 1.407E1x1^2 - 4.718E-1x1x2^2 - 1.667E1x2^3 + 1.182E2x1^3 - 1.481x1^2x2 + 6.665E-3x1x2^3 - 7.959E-1x1^3x2 - 1.957E-1x1^2x2^2 4.123E2x1^4 + 2.651E-1x1x2^4 - 9.611E-1x1^4x2 + 9.560E2x1^5 + 7.688E-1x1^2x2^3 2.171x1^3x2^2 - 1.663E3x1^6 - 1.050x1^4x2^2 - 2.941x1^5x2 - 1.750Ee1x1^2x2^4 …………………………………………………………………………………………..……… -1.017E-1x1^7x2^7 - 1.533E1x1^14 + 1.186E-1x1^12x2^2 - 2.209E-1x1^8x2^7 + 6.746E-2x1^11x2^4 - 2.791E-1x1^10x2^5 - 3.374E-2x1^12x2^3 - 5.438E-1x1^9x2^6 + 3.389E-2x1^9x2^7 + 2.283E1x1^10x2^6 - 3.858E-2x1^12x2^4 - 7.216E-2x1^11x2^5 + 3.624E-2x1^12x2^5 + 2.455E-2x1^10x2^7 + 6.042E-2x1^11x2^6 - 3.008E-2x1^12x2^6 - 3.765E-3x1^11x2^7
Fig. 3. The Polynomial transformed from the best linear model of PLM-GP
4 Function Approximations by GP in Ship Design Process

4.1 Data Miner for Ship Design by Using Enhanced Genetic Programming
In this paper, a data mining tool for data analysis and utilization is developed using enhanced genetic programming with linear models. The tool is designed for ship design cases in which the accumulated data are not sufficient for the learning process. Fig. 4 shows the developed data mining system based on GP. The data miner can build fitting functions with three types of GP: GP with high order polynomials, linear model GP with polynomials (PLM-GP), and linear model GP with math functions (LM-GP). Users can carry out function approximation by selecting the functions they want to use. The generated function tree can also be converted to C code so that it can be integrated with other programs.
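How a function tree might be turned into C source can be sketched as follows. This is not the converter of the developed tool, whose internals are not described here; the nested-tuple tree encoding and the names `to_c`, `to_c_function` and `gp_model` are hypothetical.

```python
def to_c(node):
    """Render a nested-tuple GP tree as a C expression string."""
    if isinstance(node, str):
        return node                      # variable name
    if isinstance(node, (int, float)):
        return repr(float(node))         # numeric constant
    op, *args = node
    rendered = [to_c(arg) for arg in args]
    if op in ("+", "-", "*"):
        return "(" + f" {op} ".join(rendered) + ")"
    return f"{op}({', '.join(rendered)})"   # math.h call such as sin(...)


def to_c_function(name, variables, tree):
    """Wrap the expression in a C function that takes the variables as doubles."""
    params = ", ".join(f"double {v}" for v in variables)
    return (f"#include <math.h>\n\n"
            f"double {name}({params})\n{{\n    return {to_c(tree)};\n}}\n")


tree = ("+", ("*", 1.05, ("sin", "x1")), ("*", -0.1, ("*", "x1", "x2")))
c_source = to_c_function("gp_model", ["x1", "x2"], tree)
print(c_source)
```

Every binary operation is parenthesized, so the generated expression does not depend on C operator precedence, at the cost of some redundant brackets.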
Fig. 4. Data Miner for Ship Design by using GP
4.2 Estimating Principal Dimensions of a Ship
In the problem of designing bulk carriers, the design requirements given by the ship owner include the deadweight (ton) DWT, the service speed (knot) VS, and the draft (m) T. In the conceptual design stage, the principal dimensions, such as the length between perpendiculars LBP, depth D, breadth B, block coefficient CB, and so forth, are determined to meet the design requirements. Typically, the design expert tries to choose appropriate values for them by utilizing the information gained from a mother ship built in the past and close to the current design, or empirical formulas based on such mother ships. Once the principal dimensions are determined, they are further refined through a cyclic design process, called a design spiral, to satisfy other requirements by repeatedly using mother ships' data or empirical formulas.

We have real data from only 80 ships, and in this subsection we construct linear models that can estimate the principal dimensions of a bulk carrier. The dimensions LBP, D, and B can be considered as functions of DWT, VS, and T. The most important variable is DWT, and sometimes the principal dimensions are considered as functions of DWT only. But the 80 samples contain data showing different values of LBP, D and B for almost the same value of DWT, so it is difficult to build linear models that fully explain the given samples using only DWT. For this reason, constructing the linear model may have to be treated as a 3-dimensional fitting problem. If we examine the samples, the principal dimensions show a certain pattern, but the samples seem to be contaminated by noise or, even worse, to contain outliers. A ship owner sometimes demands special specifications for his/her ship, and thus data gathered from such ships could display somewhat different trends. Also, with the accumulation of design skills, data coming from ships built several years ago may show patterns different from the data of relatively new ships. Such factors may affect the samples as if they contained noise.

Half of the samples are used for the learning set, and the rest for the test set. Learning samples are not randomly chosen. Instead, they are carefully selected to be distributed in the variable space as evenly as possible, in the hope that the learning samples provide the GP system with the major characteristics of the entire group of samples without leaving any out. The best results of LM/PLM-GP from 5 runs are shown in Fig. 5, where the graphs are plotted against deadweight only. In Table 2, the results of LM/PLM-GP are summarized with those of NN and MARS.

Fig. 5. The results of two linear models for estimating LBP: (a) the learning results of LM-GP; (b) the learning results of PLM-GP; (c) the test results of LM-GP; (d) the test results of PLM-GP. Each panel plots LBP (m) against Deadweight (ton).

Table 2. The estimation of the principal dimensions of bulk carriers
Model (for LBP)    Learning error    Test error
LM-GP              5.36341           5.24520
PLM-GP             5.66651           5.15973
NN                 5.16004           6.09057
MARS               5.21958           5.66843
For LBP, the PLM-GP model shows a larger learning error than the LM-GP model because it does not pass through the last learning sample, as shown in Fig. 5(b). Because the output of the PLM-GP model at the last test sample point closely approaches the target value, as shown in Fig. 5(d), PLM-GP gives a smaller test error than LM-GP. Both LM-GP and PLM-GP outperform NN and MARS by a large margin in terms of the test error.
5 Conclusions

In this paper, a data approximation/prediction tool is developed to assist the ship design process when learning samples are insufficient. Both LM-GP and PLM-GP give consistent results with a limited number of learning samples, regardless of whether or not the samples contain noise. The validation test and the application of the developed method to the ship design process showed that the method is well suited to non-linear function approximation with a limited amount of learning data, without overfitting.
Acknowledgement

This work is supported by the Advanced Ship Engineering Research Center (R11-2002104-08002-0).
References

1. Yeun, Y.S. et al.: Smooth Fitting with a Method for Determining the Regularization Parameters under the Genetic Programming Algorithm, Information Sciences 133 (2001), 175-194
2. Koza, J.R.: Genetic programming: on the programming of computers by means of natural selection, The MIT Press (1992)
3. Lee, K.H., Yeun, Y.S., Ruy, W.S. and Yang, Y.S.: Polynomial genetic programming for response surface modeling, Proc. of 4th International Workshop on Frontiers in Evolutionary Algorithms (FEA2002), in conjunction with Sixth Joint Conference on Information Sciences (2002)
4. Yeun, Y.S., Suh, J.C. and Yang, Y.S.: Function approximation by superimposing genetic programming trees: with application to engineering problems, Information Sciences, vol. 122, issue 2-4 (2000)
5. Barron, A., Rissanen, J. and Yu, B.: The minimum description length principle in coding and modeling, IEEE Trans. Information Theory, vol. 44, no. 6, pp. 2743-2760 (1998)
6. Iba, H. and Nikolaev, N.: Inductive genetic programming of polynomial learning networks, Proc. of the First IEEE Sym. on Combination of Evolutionary Computation and Neural Networks, pp. 158-167 (2000)
7. Yeun, Y. et al.: Implementing Linear Models in Genetic Programming, IEEE Trans. on Evolutionary Computation, vol. 8, no. 6, pp. 542-566 (2004)