Learning Algorithm of Neural Networks on Spherical Cap

Zhixiang Chen and Jinjie Hu*
Department of Mathematics, Shaoxing University, Shaoxing, China
*Corresponding author, Email: {czx, jinjiehu}@usx.edu.cn

Feilong Cao
Department of Mathematics, China Jiliang University, Hangzhou, China
Email: [email protected]

Abstract—This paper investigates learning algorithms for neural networks on a spherical cap. First, we construct the inner weights and biases from the sample data such that the network has the interpolation property at the sampling points. Second, we construct the BP network and the BP learning algorithm. Finally, we analyze the generalization ability of the constructed networks and give numerical experiments.

Index Terms—Spherical Cap; Neural Network; Learning; Interpolation; BP Algorithm
I. INTRODUCTION
The $(d-1)$-dimensional unit sphere $\mathbb{S}^{d-1}$ in $\mathbb{R}^{d}$ is defined by
$$\mathbb{S}^{d-1} = \big\{(x_1, x_2, \dots, x_d) \in \mathbb{R}^{d} : x_1^2 + x_2^2 + \cdots + x_d^2 = 1\big\}.$$
In recent years, the construction and approximation of spherical functions have attracted the attention of a large number of scholars. As the main approximation tools on the unit sphere, spherical polynomials are fundamental, and many results have been explored [1]. Spherical thin-plate splines, as natural analogs of the classical thin-plate splines, have also been constructed for interpolation and approximation on the unit sphere [2]. Moreover, in many applications in geophysics and meteorology, we usually need to find functional models to fit the scattered data collected over the surface of the earth via satellites or ground stations. Recently, a class of so-called spherical positive definite radial basis functions has been used to tackle this problem by interpolating the samples, and many results have been obtained; we refer the reader to [3] and the references therein.

On the other hand, it is well known that feed-forward neural networks are universal approximators. There has been a lot of research devoted to this topic on compact subsets of the Euclidean space $\mathbb{R}^{d}$, for example, Cybenko [4], Funahashi [5], Chen and Chen [6], Barron [7], Chen [8], Cao, Xie, and Xu [9], and Chen and Cao [10]. On the unit sphere, Mhaskar et al. [11] introduced the following zonal function network
$$\sum_{k=1}^{n} a_k \phi(\xi_k \cdot x),$$
to deal with spherical scattered data approximation, where $\xi_k, x \in \mathbb{S}^{d-1}$, $a_k \in \mathbb{R}$, and $\phi$ is a real function defined on $[-1,1]$. In [12] [13] [14] [15], the following feed-forward networks defined on the unit sphere were considered and some approximation properties were studied:
$$\sum_{k=1}^{n} a_k \sigma(\xi_k \cdot x + c_k),$$
where $a_k, c_k \in \mathbb{R}$, $x \in \mathbb{S}^{d-1}$, $\xi_k \in \mathbb{R}^{d}$, and the activation function $\sigma$ is defined on $\mathbb{R}$. Since sigmoidal functions are an important class of activation functions, they are usually used as the activation functions in the hidden layer of neural networks. In fact, a sigmoidal function $\sigma : \mathbb{R} \to \mathbb{R}$ is a bounded function satisfying
$$\lim_{x\to+\infty} \sigma(x) = 1, \qquad \lim_{x\to-\infty} \sigma(x) = 0.$$
Recently, [16] [17] [18] studied the error of approximation by networks with sigmoidal activation functions. Since we are usually concerned with target functions defined on a local area of $\mathbb{S}^{d-1}$, and our samples are derived from a cap of $\mathbb{S}^{d-1}$, we will discuss the approximation by networks on a spherical cap. It is well known that interpolation is a popular and important approximation method once samples are obtained. Therefore we construct interpolation networks to approximate the target function. This kind of network can save much training time; naturally, its construction is difficult. On the other hand, although there have been many results concerning spherical network approximation, results about learning algorithms are relatively few. In fact, the basic BP (back propagation) learning algorithm has raised the interest of many scholars and has been applied to a variety of disciplines; see [19] [20] [21]. Hence we will also discuss the BP algorithm in the application of network approximation on a spherical cap. Considering that $\sigma(x) = \frac{1}{1+e^{-x}}$ is a typical sigmoidal function, we will
use $\sigma$ to construct the activation function of the interpolation network, and we will also discuss the BP algorithm of network approximation, where the target functions are defined on a spherical cap.

The paper is organized as follows. Section 2 discusses the existence of spherical interpolation networks, where a constructive method based on scattered (random) sampling data is utilized. In Section 3, we use the classical gradient method to derive the adjusting formulas for the weights, and the BP learning algorithm on the spherical cap is established. In Section 4, we analyze the generalization ability of the constructed networks and give numerical experiments. Finally, some conclusions are presented in Section 5.
II. CONSTRUCTION OF INTERPOLATION NETWORKS
Let $(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)$ be samples, where $X_i \in \mathbb{R}^{d}$, $y_i \in \mathbb{R}$, $i = 1, 2, \dots, n$. Let $N(x)$ be a network; if it has the property $N(X_i) = y_i$, $i = 1, 2, \dots, n$, we say that $N(x)$ is an interpolation network. For the given samples above, we will choose the inner weights $\xi_k$ and thresholds $c_k$ properly such that the network
$$\sum_{k=1}^{n} a_k \varphi(\xi_k \cdot x + c_k)$$
has the interpolation property, where the weights $a_k$ are derived from the linear system
$$\sum_{k=1}^{n} a_k \varphi(\xi_k \cdot X_i + c_k) = y_i, \quad i = 1, 2, \dots, n.$$
Thus, when the sample points $X_i \in \mathbb{S}^{d-1}$, the construction of the interpolation network is not trivial. So we first discuss the general case, that is, the case of $x \in \mathbb{R}^{d}$.

For two distinct vectors $X_1, X_2$ with $X_i = (x_{i1}, x_{i2}, \dots, x_{id}) \in \mathbb{R}^{d}$, $i = 1, 2$, if there exists $j_0 \in \{1, 2, \dots, d\}$ such that $x_{1j_0} < x_{2j_0}$ and $x_{1j} = x_{2j}$ for $j = j_0+1, j_0+2, \dots, d$, we say $X_1 \prec X_2$. To prove the main result of this section, we first give a lemma.

Lemma 1. For $n$ distinct vectors $X_1 \prec X_2 \prec \cdots \prec X_n$, there exists a vector $W$ such that $W \cdot X_1 < W \cdot X_2 < \cdots < W \cdot X_n$.

Proof. For the $n$ distinct vectors $X_i = (x_{i1}, x_{i2}, \dots, x_{id})$ $(i = 1, 2, \dots, n)$, we set
$$w_j^1 = \frac{1}{1 + \max_i \{|x_{ij}|\}}, \quad j = 1, 2, \dots, d.$$
Then $x_{ij}^1 := w_j^1 x_{ij} \in [-1, 1]$, $i = 1, \dots, n$, $j = 1, \dots, d$. Let $y_{ij}^1 := |x_{i+1,j}^1 - x_{ij}^1|$, $i = 1, \dots, n-1$, $j = 1, \dots, d$. For given $j$ $(1 \le j \le d)$, if $y_{ij}^1 = 0$ for all $i = 1, \dots, n-1$, then let $n_j = 2$; otherwise, let
$$n_j = \frac{4}{\min_{y_{ij}^1 \ne 0} \{y_{ij}^1\}}.$$
It is easy to see that $n_j \ge 2$, $j = 1, \dots, d$. Now we define $w_j^2 := w_j^1 \prod_{i=1}^{j} n_i$, $j = 1, \dots, d$, and $W := (w_1^2, w_2^2, \dots, w_d^2)$.

For fixed $i$ $(1 \le i \le n-1)$, from $X_i \prec X_{i+1}$ it follows that there exists $k_0$ such that $x_{ik_0} < x_{i+1,k_0}$ and $x_{ij} = x_{i+1,j}$ for $j = k_0+1, \dots, d$. So we have
$$W \cdot X_{i+1} - W \cdot X_i = \sum_{k=1}^{k_0} (x_{i+1,k}^1 - x_{i,k}^1) \prod_{s=1}^{k} n_s = \sum_{k=1}^{k_0-1} (x_{i+1,k}^1 - x_{i,k}^1) \prod_{s=1}^{k} n_s + (x_{i+1,k_0}^1 - x_{i,k_0}^1) \prod_{s=1}^{k_0} n_s =: I_1 + I_2.$$
Since
$$|I_1| \le 2\big(n_1 + n_1 n_2 + \cdots + n_1 n_2 \cdots n_{k_0-1}\big) = 2 \prod_{s=1}^{k_0-1} n_s \Big(1 + \frac{1}{n_{k_0-1}} + \cdots + \frac{1}{n_2 \cdots n_{k_0-1}}\Big) \le 2 \prod_{s=1}^{k_0-1} n_s \Big(1 + \frac{1}{2} + \frac{1}{2^2} + \cdots\Big) \le 4 \prod_{s=1}^{k_0-1} n_s,$$
and
$$I_2 = \prod_{s=1}^{k_0-1} n_s \cdot n_{k_0} \big(x_{i+1,k_0}^1 - x_{i,k_0}^1\big) = \prod_{s=1}^{k_0-1} n_s \cdot n_{k_0}\, y_{i,k_0}^1 \ge 4 \prod_{s=1}^{k_0-1} n_s \ge |I_1|,$$
we obtain $W \cdot X_{i+1} - W \cdot X_i > 0$, i.e., $W \cdot X_i < W \cdot X_{i+1}$. The proof of Lemma 1 is complete.
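To make the construction in Lemma 1 concrete, the following is a minimal Python/NumPy sketch (the function name and the random test data are illustrative, not from the paper); it assumes the rows of `X` are distinct and already sorted by the ordering $\prec$ defined above.

```python
import numpy as np

def separating_direction(X):
    """Construct W with W.X_1 < W.X_2 < ... < W.X_n as in the proof of Lemma 1.
    Assumes the rows of X are distinct and sorted by the ordering defined above."""
    n, d = X.shape
    w1 = 1.0 / (1.0 + np.abs(X).max(axis=0))      # w_j^1 = 1 / (1 + max_i |x_ij|)
    X1 = X * w1                                   # x_ij^1 = w_j^1 x_ij, lies in [-1, 1]
    Y1 = np.abs(np.diff(X1, axis=0))              # y_ij^1 = |x_{i+1,j}^1 - x_ij^1|
    nj = np.empty(d)
    for j in range(d):
        nonzero = Y1[:, j][Y1[:, j] > 0]
        nj[j] = 2.0 if nonzero.size == 0 else 4.0 / nonzero.min()
    return w1 * np.cumprod(nj)                    # w_j^2 = w_j^1 * n_1 n_2 ... n_j

# quick check on random points, sorted so that the last coordinate is most significant
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X = X[np.lexsort(X.T)]                            # realizes the ordering above
W = separating_direction(X)
print(np.all(np.diff(X @ W) > 0))                 # expected: True
```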
For $\sigma(x) = \frac{1}{1+e^{-x}}$ we set
$$\varphi(x) := \frac{1}{2}\big(\sigma(x+1) - \sigma(x-1)\big),$$
and we will prove the following theorem.

Theorem 1. For $n$ distinct vectors $X_1 \prec X_2 \prec \cdots \prec X_n$, there exist vectors $W_i \in \mathbb{R}^{d}$ and $b_i \in \mathbb{R}$ $(i = 1, 2, \dots, n)$ such that the matrix
$$G = \begin{pmatrix} \varphi(W_1 \cdot X_1 + b_1) & \varphi(W_2 \cdot X_1 + b_2) & \cdots & \varphi(W_n \cdot X_1 + b_n) \\ \varphi(W_1 \cdot X_2 + b_1) & \varphi(W_2 \cdot X_2 + b_2) & \cdots & \varphi(W_n \cdot X_2 + b_n) \\ \vdots & \vdots & & \vdots \\ \varphi(W_1 \cdot X_n + b_1) & \varphi(W_2 \cdot X_n + b_2) & \cdots & \varphi(W_n \cdot X_n + b_n) \end{pmatrix}$$
is nonsingular.

Proof. By simple calculation, we know that $\varphi$ satisfies
$$\lim_{x\to+\infty} \varphi(x) = \lim_{x\to-\infty} \varphi(x) = 0,$$
and that $\varphi$ is even and strictly decreasing on $[0, \infty)$. Clearly,
$$\varphi(x) = \frac{(e - e^{-1})\, e^{x}}{2\,(1+e^{x-1})(1+e^{x+1})}, \qquad \varphi(0) = \frac{e-1}{2(e+1)}.$$
Now we want to prove that there holds
$$\varphi(x) = \frac{(e - e^{-1})\, e^{x}}{2\,(1+e^{x-1})(1+e^{x+1})} < \frac{e-1}{2n(e+1)}, \tag{1}$$
that is,
$$\frac{e^{x}}{(1+e^{x-1})(1+e^{x+1})} < \frac{e}{n(e+1)^2}.$$
Since
$$\frac{e^{x}}{(1+e^{x-1})(1+e^{x+1})} < \frac{1}{e^{x}},$$
when
$$e^{x} \ge \frac{n(e+1)^2}{e}, \tag{2}$$
the inequality (1) is true. Obviously, when $x$ satisfies $e^{x} \ge 4ne \ge n(e+1)^2/e$, the inequality (2) is valid. So we can choose $x \ge \ln 4n + 1$, and thus when $|x| \ge \ln 4n + 1$ there holds
$$\varphi(x) < \frac{\varphi(0)}{n}.$$
By Lemma 1, there exists $W$ such that $W \cdot X_1 < W \cdot X_2 < \cdots < W \cdot X_n$. Therefore, we fix $a = \ln 4n + 1$, set
$$k_i = \frac{2a}{\min\{W \cdot X_{i+1} - W \cdot X_i,\; W \cdot X_i - W \cdot X_{i-1}\}} \quad (i = 2, 3, \dots, n-1),$$
and choose $W_i = k_i W$ $(i = 2, 3, \dots, n-1)$, $W_1 = W_2$, $W_n = W_{n-1}$, and $b_i = -W_i \cdot X_i$ $(i = 1, 2, \dots, n)$. Then $\varphi(W_i \cdot X_i + b_i) = \varphi(0)$. On the other hand, for $i, j = 1, 2, \dots, n$, $j \ne i$, we have
$$|(W_i \cdot X_j + b_i) - (W_i \cdot X_i + b_i)| = k_i\, |W \cdot X_j - W \cdot X_i| \ge 2a,$$
and
$$W_i \cdot X_1 + b_i < W_i \cdot X_2 + b_i < \cdots < W_i \cdot X_{i-1} + b_i \le -a < W_i \cdot X_i + b_i = 0 < a \le W_i \cdot X_{i+1} + b_i < \cdots < W_i \cdot X_n + b_i.$$
So when $i \ne j$, it follows that
$$\varphi(W_i \cdot X_j + b_i) < \frac{\varphi(0)}{n},$$
which leads to
$$\varphi(0) = g_{ii} > \sum_{j \ne i} |g_{ij}|$$
and illustrates that the matrix $G$ is strictly diagonally dominant. Therefore $G$ is nonsingular. This finishes the proof of Theorem 1.

From Theorem 1 we know that for samples $(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)$ there exists a feed-forward neural network
$$N(X) = \sum_{i=1}^{n} c_i\, \varphi(W_i \cdot X + b_i)$$
such that $N(X_i) = y_i$ $(i = 1, 2, \dots, n)$, which shows that the network $N(x)$ can be an interpolation function for the samples $(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)$. Since $\mathbb{S}^{d-1}$ is a subset of $\mathbb{R}^{d}$, we obtain immediately the following.

Corollary 1. For $n$ samples $(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)$ with $X_i \in \mathbb{S}^{d-1}$ $(i = 1, 2, \dots, n)$, there exists a feed-forward neural network
$$N(X) = \sum_{i=1}^{n} c_i\, \varphi(W_i \cdot X + b_i)$$
such that $N(X_i) = y_i$ $(i = 1, 2, \dots, n)$.
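To make the construction above concrete, here is a minimal Python/NumPy sketch (all function names are illustrative and not from the paper): it forms the inner weights $W_i = k_i W$, the biases $b_i = -W_i \cdot X_i$, and the matrix $G$, and solves $Gc = y$ for the outer weights. The separating direction `W` is assumed to be supplied, for instance by the Lemma 1 construction sketched earlier, and at least three samples are assumed.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def varphi(x):
    # varphi(x) = (sigma(x+1) - sigma(x-1)) / 2, the activation used in Theorem 1
    return 0.5 * (sigma(x + 1.0) - sigma(x - 1.0))

def build_interpolation_network(X, y, W):
    """X: (n,d) distinct samples with W.X_1 < ... < W.X_n (n >= 3 assumed);
    y: (n,) target values; W: a separating direction, e.g. from Lemma 1."""
    n = X.shape[0]
    t = X @ W                                    # strictly increasing projections
    a = np.log(4 * n) + 1.0
    gaps = np.diff(t)                            # W.X_{i+1} - W.X_i > 0
    k = np.empty(n)
    k[1:-1] = 2 * a / np.minimum(gaps[1:], gaps[:-1])   # k_i for i = 2, ..., n-1
    k[0], k[-1] = k[1], k[-2]                    # W_1 = W_2 and W_n = W_{n-1}
    Ws = k[:, None] * W                          # inner weights W_i = k_i W
    b = -(Ws * X).sum(axis=1)                    # biases b_i = -W_i . X_i
    G = varphi(X @ Ws.T + b)                     # G_ij = varphi(W_j . X_i + b_j)
    c = np.linalg.solve(G, y)                    # outer weights from G c = y
    return Ws, b, c

def interpolation_net(x, Ws, b, c):
    """Evaluate N(x) = sum_i c_i varphi(W_i . x + b_i) at the rows of x."""
    return varphi(np.atleast_2d(x) @ Ws.T + b) @ c
```

For the experiments in Section 4, `X` would hold points on the spherical cap; printing `np.linalg.cond(G)` for growing $n$ is one way to observe the instability of the interpolation network discussed there.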
Although the function interpolation algorithm is an effective approach for fitting scattered data, the construction of the weights is difficult. As an important feature of feed-forward neural networks, the BP algorithm can iteratively adjust the network weights to minimize the least-squares objective function over the training samples, and thus the networks have a certain generalization ability.

III. SPHERICAL BP NEURAL NETWORK

Figure 1. Neural network with single-hidden layer

The multilayer perceptron trained with the BP algorithm is one of the most widely used neural networks. The basic idea of the BP algorithm consists of two processes: the feed-forward propagation of signals and the back-propagation of errors (see [22] for details). In this paper we use the spherical feed-forward neural network with a single hidden layer shown in Figure 1, which can be modeled mathematically as
$$N_n(x_1, x_2, x_3) = \sum_{i=1}^{n} w_i\, \varphi\Big(\sum_{j=0}^{3} w_{ij} x_j\Big),$$
where $x_0 = 1$ and $(x_1, x_2, x_3) \in \mathbb{S}^{2}$.

Now we give the definition of the network error and the idea of adjusting the weights. When the output $o$ is not equal to the sample value $y$, there is an error $E$, defined as
$$E := \frac{1}{2}(y - o)^2 = \frac{1}{2}\Big(y - \sum_{i=1}^{n} w_i\, \varphi\Big(\sum_{j=0}^{3} w_{ij} x_j\Big)\Big)^2.$$
Since the principle of adjusting the weights is that the error should become smaller and smaller, the adjustment of each weight should be proportional to the negative gradient of the error, that is,
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \quad i = 1, 2, \dots, n, \qquad \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}, \quad i = 1, \dots, n;\ j = 0, 1, 2, 3,$$
where $\eta \in (0,1)$ is the learning rate. By standard calculation, we have
$$\frac{\partial E}{\partial w_i} = (o - y)\, \varphi\Big(\sum_{j=0}^{3} w_{ij} x_j\Big), \quad i = 1, 2, \dots, n,$$
$$\frac{\partial E}{\partial w_{ij}} = (o - y)\, w_i\, \varphi'\Big(\sum_{j=0}^{3} w_{ij} x_j\Big)\, x_j, \quad i = 1, \dots, n;\ j = 0, 1, 2, 3.$$
When $\sigma(x) = \frac{1}{1+e^{-x}}$ we have $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, and
$$\varphi'(x) = \frac{1}{2}\big(\sigma'(x+1) - \sigma'(x-1)\big) = \frac{1}{2}\big[\sigma(x+1)(1-\sigma(x+1)) - \sigma(x-1)(1-\sigma(x-1))\big] = \varphi(x)\big[1 - \sigma(x+1) - \sigma(x-1)\big].$$
Thus, we derive the following computation formulas:
$$\Delta w_i = \eta\, (y - o)\, \varphi\Big(\sum_{j=0}^{3} w_{ij} x_j\Big), \quad i = 1, 2, \dots, n,$$
$$\Delta w_{ij} = \eta\, (y - o)\, w_i\, \varphi\Big(\sum_{j=0}^{3} w_{ij} x_j\Big)\Big[1 - \sigma\Big(\sum_{j=0}^{3} w_{ij} x_j + 1\Big) - \sigma\Big(\sum_{j=0}^{3} w_{ij} x_j - 1\Big)\Big] x_j, \quad i = 1, \dots, n;\ j = 0, 1, 2, 3.$$
Now we can give the BP algorithm as follows (see Table I).

TABLE I. ALGORITHM 1: BP
Step 1: For $i = 1, 2, \dots, n$; $j = 0, 1, 2, 3$, initialize $w_i$, $w_{ij}$ with random values; initialize the counters $p$ (sample index) and $q$ (training passes) to 1; set the error $E = 0$; and set the learning rate $\eta$ and the training accuracy $E_{\min}$ to decimals in $(0,1)$.
Step 2: Using the sample data $X_p$, set $x = X_p$ and calculate the output $o$.
Step 3: Using the sample data $y_p$, calculate the error $E$.
Step 4: Adjust the weights $w_i$, $w_{ij}$, $i = 1, \dots, n$; $j = 0, 1, 2, 3$.
Step 5: If $p < n$, increase the counters $p$ and $q$ by 1 and return to Step 2; otherwise go to Step 6.
Step 6: If $E < E_{\min}$, the training is over; otherwise set $E = 0$, $p = 1$, and go to Step 2.
Step 7: Output $w_i$, $w_{ij}$, $i = 1, \dots, n$; $j = 0, 1, 2, 3$. Thus we obtain a network.
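As a companion to Table I, the following is a minimal Python/NumPy sketch of the training loop with the update formulas derived above; the hidden-layer size, learning rate, stopping threshold, initialization range, and maximum number of passes are illustrative choices, not values reported in the paper.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def varphi(x):
    return 0.5 * (sigma(x + 1.0) - sigma(x - 1.0))

def train_bp(X, y, n_hidden=10, eta=0.1, e_min=1e-3, max_passes=20000, seed=0):
    """BP training of N_n(x) = sum_i w_i varphi(sum_j w_ij x_j) with x_0 = 1.
    X: (N,3) points on the cap, y: (N,) target values."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant input x_0 = 1
    w = rng.uniform(-0.5, 0.5, n_hidden)            # outer weights w_i
    Wh = rng.uniform(-0.5, 0.5, (n_hidden, 4))      # inner weights w_ij, j = 0,...,3
    for _ in range(max_passes):
        E = 0.0
        for x, t in zip(Xb, y):                     # Steps 2-5: one pass over the samples
            u = Wh @ x                              # hidden pre-activations
            h = varphi(u)
            o = w @ h                               # network output
            E += 0.5 * (t - o) ** 2                 # Step 3: accumulate the error
            dphi = h * (1.0 - sigma(u + 1.0) - sigma(u - 1.0))     # varphi'(u)
            dw = eta * (t - o) * h                  # dw_i  = eta (y-o) varphi(u_i)
            dWh = eta * (t - o) * np.outer(w * dphi, x)  # dw_ij = eta (y-o) w_i varphi'(u_i) x_j
            w, Wh = w + dw, Wh + dWh                # Step 4: adjust the weights
        if E < e_min:                               # Step 6: stopping test
            break
    return w, Wh

def bp_net(X, w, Wh):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return varphi(Xb @ Wh.T) @ w
```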
IV. NUMERICAL EXPERIMENTS AND ANALYSES
Now we apply numerical experiments to examine the generalization capacity of the above two kinds of networks on a spherical cap. We choose two functions defined on the cap
$$\big\{x = (x_1, x_2, x_3) : d(x, (0,0,1)) \le \pi/4\big\}$$
as follows:
$$f_1(x_1, x_2, x_3) = x_1 x_2 x_3, \qquad f_2(x_1, x_2, x_3) = x_1^2 x_2 x_3,$$
where $x_1^2 + x_2^2 + x_3^2 = 1$. To describe the distribution of the samples, we use the polar coordinate representation
$$x_1 = \sin\theta \sin\phi, \quad x_2 = \sin\theta \cos\phi, \quad x_3 = \cos\theta,$$
where $\theta \in [0, \pi/4]$ and $\phi \in [0, 2\pi]$. We randomly choose $N$ samples $(X_1, y_1), (X_2, y_2), \dots, (X_N, y_N)$ to train the interpolation network and the BP network respectively, and we test the generalization capacity on stochastic points $(TX_1, ty_1), (TX_2, ty_2), \dots, (TX_M, ty_M)$. The distribution of the sample and test points used in the experiments is described in Figure 2 and Figure 3.
Figure 2. The distribution of sample and test points for function f1
Figure 3. The distribution of sample and test points for function f 2
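For the experiments just described, the following is a small sketch of how such sample and test sets could be generated with the polar parametrization above, with $f_1$ and $f_2$ as reconstructed there. Drawing the angles uniformly is an assumption; the paper only states that the points are chosen randomly.

```python
import numpy as np

def f1(x1, x2, x3):
    return x1 * x2 * x3

def f2(x1, x2, x3):
    return x1 ** 2 * x2 * x3

def random_cap_points(n, seed=None):
    """n random points on the cap {x in S^2 : d(x, (0,0,1)) <= pi/4}."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, np.pi / 4, n)      # polar angle on the cap
    phi = rng.uniform(0.0, 2 * np.pi, n)        # azimuth angle
    return np.column_stack([np.sin(theta) * np.sin(phi),
                            np.sin(theta) * np.cos(phi),
                            np.cos(theta)])

# e.g. N = 20 training samples and M = 20 test points for f_1
X_train = random_cap_points(20, seed=1)
y_train = f1(*X_train.T)
X_test = random_cap_points(20, seed=2)
y_test = f1(*X_test.T)
```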
We define the sample test error $E_1$ and the generalization test error $E_2$ as follows:
$$E_1 = \frac{\sum_{i=1}^{N} |Y_i - y_i|}{N}, \qquad E_2 = \frac{\sum_{i=1}^{M} |TY_i - ty_i|}{M},$$
where $Y_i$ and $TY_i$ denote the output values of the network at the sample points and at the test points respectively. From the experimental results we find that the sample size affects the performance of the interpolation network. This network performs well when the sample size is between 10 and 30, so we call such sample sizes the valid size. Furthermore, we illustrate the relation between the errors and the sample sizes in Figure 6 and Figure 7.

Figure 7. Interpolation network errors of function f2 (for valid sample size)
Figure 4. Interpolation network errors of function f1
Figure 8. BP network errors of function f1
Figure 5. Interpolation network errors of function f 2
Figure 9. BP network errors of function f 2
Figure 6. Interpolation network errors of function f1 (for valid sample size)
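For completeness, the two test errors defined above are simply mean absolute deviations over the sample and test sets; a small sketch (the array names are illustrative):

```python
import numpy as np

def test_errors(Y, y, TY, ty):
    """E1 and E2 from the definitions above: Y, y over the N samples; TY, ty over the M test points."""
    E1 = np.mean(np.abs(np.asarray(Y) - np.asarray(y)))
    E2 = np.mean(np.abs(np.asarray(TY) - np.asarray(ty)))
    return E1, E2
```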
From all the numerical results above (Figure 4 - Figure 9), we can see the following.
1. When the sample size $N$ is relatively small ($N \le 10$), both kinds of networks generalize badly. This is easily understood, since less sample information naturally leads to a worse learning effect.
2. For the interpolation network, when the sample size $N$ is relatively large ($N \ge 30$), the network is in an unstable state and hence the generalization capacity is undesirable. The reason for this is that the interpolation matrix $G$ is close to singular or badly scaled.
3. For the BP network, the error becomes smaller and the generalization capacity becomes better as the sample size $N$ increases. However, in order to achieve the same performance, the BP network spends a longer time.
4. When the sample size $N$ is of valid size, especially when $18 \le N \le 24$, the interpolation network performs very well, and the generalization has very high accuracy.

From the numerical experiments above we conclude that when the number of sample points on the cap is appropriate, the performance (error and time cost) of the interpolation network excels that of the BP network. However, if the number of samples becomes relatively large, the interpolation network easily becomes unstable. In this case the BP network still works, but the time that the network spends increases rapidly.

To compare the learning effects of the two networks further, we draw the 3-dimensional curved surfaces of $f_1$ and $f_2$ so that we can see more clearly the coincidence of the tested points. We use the polar coordinate representation above to transform $f_1$ and $f_2$ into $g_1(\theta, \phi)$ and $g_2(\theta, \phi)$. Choosing 20 sample points, we obtain the experimental results shown in Figure 10 - Figure 13.
Figure 12. Simulating with interpolation network for $f_2$ ($g_2(\theta, \phi)$)

Figure 13. Simulating with BP network for $f_2$ ($g_2(\theta, \phi)$)
It is not difficult to see that, for both networks, almost all the points marked with "+" lie inside the five-pointed stars, which indicates that the testing effect at the original sample points is ideal. As for the generalization testing, all the points marked with "*" lie exactly at the centers of the "o" marks for the interpolation network, which indicates that its generalization is very good. In the case of the BP network, most points marked with "*" are at the centers of the "o" marks, and only a few points have fairly large biases.

Figure 10. Simulating with interpolation network for $f_1$ ($g_1(\theta, \phi)$)

Figure 11. Simulating with BP network for $f_1$ ($g_1(\theta, \phi)$)

V. CONCLUSIONS

Interpolation is an important method of data fitting and numerical approximation, so we construct the interpolation network, and the numerical experiments show that it can reach the results we want. However, obtaining the interpolant of a target function requires constructing the weights and solving large-scale linear systems, and when the number of interpolation points is very large the interpolation matrix may become ill-conditioned. Hence, we also study the BP learning algorithm of neural networks. By properly choosing the learning rate and the number of samples, the network can have an ideal learning effect and generalization capacity.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (Nos. 61179041, 61272023).
REFERENCES
[1] K. Y. Wang and L. Q. Li, Harmonic Analysis and Approximation on the Unit Sphere. Beijing: Science Press, 2000.
[2] W. Freeden, T. Gervens, and M. Schreiner, Constructive Approximation on the Sphere. Oxford: Clarendon Press, 1998.
[3] H. Wendland, Scattered Data Approximation. Cambridge: Cambridge University Press, 2005.
[4] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. of Control Signals and Systems, vol. 2, no. 4, pp. 303-314, 1989.
[5] K. I. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, no. 3, pp. 183-192, 1989.
[6] T. P. Chen and H. Chen, "Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to a dynamic system," IEEE Trans. Neural Networks, vol. 6, no. 4, pp. 911-917, 1995.
[7] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inform. Theory, vol. 39, no. 3, pp. 930-945, 1993.
[8] D. B. Chen, "Degree of approximation by superpositions of a sigmoidal function," Approx. Theory & Appl., vol. 9, no. 3, pp. 17-28, 1993.
[9] F. L. Cao, T. F. Xie, and Z. B. Xu, "The estimate for approximation error of neural networks: A constructive approach," Neurocomputing, vol. 71, no. 4-6, pp. 626-630, 2008.
[10] Z. X. Chen and F. L. Cao, "The approximation operators with sigmoidal functions," Computers and Mathematics with Applications, vol. 58, no. 4, pp. 758-765, 2009.
[11] H. N. Mhaskar, F. J. Narcowich, and J. D. Ward, "Approximation properties of zonal function networks using scattered data on the sphere," Adv. Comput. Math., vol. 11, no. 2-3, pp. 121-137, 1999.
[12] F. L. Cao and S. B. Lin, "The capability of approximation for neural networks interpolant on the sphere," Math. Meth. Appl. Sci., vol. 34, no. 4, pp. 469-478, 2011.
[13] S. B. Lin, F. L. Cao, and Z. B. Xu, "Essential rate for approximation by spherical neural networks," Neural Networks, vol. 24, no. 7, pp. 752-758, 2011.
[14] F. L. Cao, H. Z. Wang, and S. B. Lin, "The estimate for approximation error of spherical neural networks," Math. Meth. Appl. Sci., vol. 34, pp. 1888-1895, 2011.
[15] S. B. Lin, F. L. Cao, and Z. B. Xu, "The essential rate of approximation for radial function manifold," Science in China, Mathematics A, vol. 54, no. 9, pp. 1985-1994, 2011.
[16] G. A. Anastassiou, "Multivariate sigmoidal neural network approximation," Neural Networks, vol. 24, pp. 378-386, 2011.
[17] D. Costarelli and R. Spigler, "Multivariate neural network operators with sigmoidal activation functions," Neural Networks, vol. 48, pp. 72-77, 2013.
[18] D. Costarelli and R. Spigler, "Convergence of a family of neural network operators of the Kantorovich type," J. Approx. Theory, vol. 185, pp. 80-90, 2014.
[19] Q. Wang, "Research and application of artificial neural network on function approximation," Computer Development & Applications, vol. 25, no. 7, pp. 85-86, 2012.
[20] M. Hou and X. Han, "Constructive approximation to multivariate function by decay RBF neural network," IEEE Transactions on Neural Networks, vol. 21, no. 9, pp. 1517-1523, 2010.
[21] T. T. Zhu, H. K. Wei, and K. J. Zhang, "Handwritten digit recognition based on AP and BP neural network algorithm," China Science Paper, vol. 9, no. 4, pp. 479-482, 2014.
[22] L. Q. Han, Theories, Designs and Applications of Artificial Neural Networks, 2nd ed. (in Chinese). Beijing: Chemical Industry Press, 2007.
Zhixiang Chen, male, was born in Jiangsu Province, China, in March 1966. He received the M.S. degree in Mathematics from Hangzhou University, China, in 1995, and the Ph.D. degree in Mathematics from Zhejiang University, China, in 2002. He is now a Professor at Shaoxing University. His current research interests include neural networks and approximation theory.

Jinjie Hu received the B.S. degree in Computer Software from Lanzhou University, China, in 1999, and the M.S. degree in Computational Mathematics from Shanghai University, China, in 2011. He is now a lecturer at Shaoxing University, China. His current research interests include neural networks and scientific computing.

Feilong Cao received the M.S. degree in Applied Mathematics from Ningxia University, China, in 1998, and the Ph.D. degree from the Institute for Information and System Science, Xi'an Jiaotong University, China, in 2003. He is now a Professor at China Jiliang University. His current research interests include neural networks and approximation theory.