JOURNAL OF COMPUTERS, VOL. 6, NO. 9, SEPTEMBER 2011
A Method for Building Partially Connected Neural Networks

Gang Li, Xingsan Qian, Chunming Ye
Management Department, Shanghai University for Science and Technology, Shanghai, China
Email: [email protected], [email protected], [email protected]

Lin Zhao
HP China, Shanghai, China
Email: [email protected]

Abstract - This paper focuses on the application of partially connected back-propagation neural networks (PCBP) in place of the typical fully connected network (FCBP), since a PCBP with fewer connections learns faster than an FCBP. The initial network is fully connected; after training with sample data, a clustering method is applied to the weights between the input and hidden layers and between the hidden and output layers, and relatively unnecessary connections are deleted, turning the initial network into a PCBP. The PCBP can then be used for prediction or data mining by training it with data drawn from a database. At the end of the paper, several experiments on a submersible pump repair data set illustrate the effects of PCBP.

Index Terms - Neural Network; FCBP; PCBP; pruning
I. INTRODUCTION

Artificial neural networks have proved to be a useful tool for pattern recognition and classification tasks in diverse areas such as data mining, where millions of databases are in use for business, scientific, and engineering data management [1]. The most widely used network is trained with the standard Back Propagation (SBP) algorithm [2]. Indeed, SBP has emerged as the standard algorithm for training multilayer networks, and hence the one against which other learning algorithms are usually benchmarked. The standard network is fully connected (here called FCBP) and is commonly used because it needs no a priori information about the data. Unfortunately, FCBP has several drawbacks, as reported by researchers [3]: it is extremely slow; training performance is sensitive to the initial conditions; it may become trapped in local minima before converging to a solution; oscillations may occur during learning (usually when users increase the learning rate in an unfruitful attempt to speed up convergence); and, if the error function is shallow, the gradient is very small
© 2011 ACADEMY PUBLISHER doi:10.4304/jcp.6.9.1949-1954
leading to small weight changes. Moreover, because of the way it learns, a trained FCBP usually contains unnecessary connections, which increases the complexity of the network and slows training, especially for large networks. This complexity problem has attracted the interest of researchers because of the advantages to be gained by solving it; one critical advantage is that the simpler a system is, the better [4]. If these unnecessary connections can be removed from the network, training time can be greatly reduced. This is especially important for data mining, where a database may contain from millions to billions of records; without fast training, data mining with neural networks is impractical. One way to reduce the complexity of a network is to reduce the number of redundant connections, nodes [5], or input features. Connections or nodes can be removed by deleting the weights that contribute least to the network outputs. To the best of our knowledge, most reduction methods operate during training, and the key question is which connections are redundant.

II. BUILDING A PCBP NETWORK

As mentioned earlier, FCBP requires more training time than PCBP (see Fig. 1).
Fig.1 Example of PCBPs
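The idea behind Fig. 1 — a PCBP is simply an FCBP in which some connections are absent — can be sketched as a weight matrix multiplied element-wise by a binary mask. This is a hypothetical minimal illustration; the layer sizes, weights, and inputs are invented, and the mask anticipates the connection-status vectors defined later in the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(x, weights, mask):
    """Forward pass through one layer; mask[j][l] == 0 means the
    connection from input l to node j has been pruned."""
    return [sigmoid(sum(w * m * xi for w, m, xi in zip(wrow, mrow, x)))
            for wrow, mrow in zip(weights, mask)]

x = [0.5, 1.0, 0.25]                      # sample input
W = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]  # 2 hidden nodes, 3 inputs
full = [[1, 1, 1], [1, 1, 1]]             # FCBP: all connections present
partial = [[1, 0, 1], [0, 1, 1]]          # PCBP: two connections pruned

print(layer_forward(x, W, full))
print(layer_forward(x, W, partial))
```

Pruned connections simply contribute nothing to a node's weighted sum, so the pruned network needs fewer multiplications per pass.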
Generally, there are two ways to build a PCBP: manually, or automatically, by starting from an FCBP and pruning it to remove the unnecessary connections. The manual way requires deep insight into the data patterns involved; if the network structure is not set properly, the network may need more training time than an FCBP. The automatic way requires no user participation and determines the connections to be removed automatically; the process is illustrated in Fig. 2:

Construct an FCBP → Train FCBP with sample data → Prune the FCBP → Satisfy accuracy? — No: prune again; Yes: the final PCBP

Fig.2. Build a PCBP by pruning FCBP

III. TRAINING FCBP

Before training a network, several things must be defined, namely the network structure: the numbers of input, hidden, and output nodes. Generally speaking, the numbers of input and output nodes depend on the sample data, while the number of hidden nodes is usually determined by experience; some researchers have reported that a small number of hidden nodes is enough. As for the error function, the typical SBP employs the Mean Squared Error (MSE):

Error = (1/2) Σ_{i=1..k} Σ_{p=1..o} (S_p^i − t_p^i)^2    (1)

where S_p^i stands for the actual output of output node p for sample i, t_p^i for the corresponding expected output, k is the number of samples, and o the number of output nodes. Although MSE is the most widely used error function, it requires more training time and may become trapped in local minima before converging to a solution. Several authors, for example Lang [6] and Ooyen [7], have suggested that the cross-entropy error function improves the convergence of the training process and can significantly reduce training time. The cross-entropy error function is:

Error = −Σ_{i=1..k} Σ_{p=1..o} ( t_p^i log S_p^i + (1 − t_p^i) log(1 − S_p^i) )    (2)

During our experiments we also found that the cross-entropy error function of equation 2 performs well, so we employ cross entropy as the error function for both FCBP and PCBP. Before deriving the equations, we introduce the notation used in the mathematical expressions of FCBP and PCBP training.

Notation:
n       number of input nodes
h       number of hidden nodes
o       number of output nodes
x       sample input vector
t       sample output vector
w_l^j   weight between input node l and hidden node j
v_p^j   weight between hidden node j and output node p
σ()     activation function of the hidden and output layers (here assumed to be the sigmoid)
wc_l^j  connection status between input node l and hidden node j (see Definition 1)
vc_p^j  connection status between hidden node j and output node p (see Definition 1)

For FCBP, the components of the gradient of the cross-entropy error function are given by equations 3 and 4:

∂Error/∂v_p^j = ∂Error/∂S_p^i × ∂S_p^i/∂v_p^j
              = (S_p^i − t_p^i) / (S_p^i × (1 − S_p^i)) × S_p^i × (1 − S_p^i) × σ(Σ_l x_l^i w_l^j)
              = (S_p^i − t_p^i) × σ(Σ_l x_l^i w_l^j)    (3)

∂Error/∂w_l^j = ∂Error/∂S_p^i × ∂S_p^i/∂σ(Σ_l x_l^i w_l^j) × ∂σ(Σ_l x_l^i w_l^j)/∂w_l^j
              = Σ_p ((S_p^i − t_p^i) × v_p^j) × σ(Σ_l x_l^i w_l^j) × (1 − σ(Σ_l x_l^i w_l^j)) × x_l^i    (4)

So the adjustments of the input-to-hidden and hidden-to-output weights can be calculated from equations 3 and 4 together with a learning rate.

IV. PRUNING FCBP

Before going on, we have to introduce a new definition.

Definition 1 (Connection Status): a vector that
represents how one node connects with its adjacent nodes in the following layer. From a macro point of view, the connection status represents the network structure; from a micro point of view, it is simply a vector of binary elements, zeros and ones. For example, if wc_2^3 = 1, then the connection between the second node in the input layer and the third node in the hidden layer exists, while if wc_2^3 = 0, there is no such connection. Actually, an FCBP can be viewed as a particular PCBP whose connection status vectors are all ones. Hence, by creating and maintaining such connection status vectors, an FCBP can be easily defined and implemented.

After the trained FCBP has achieved a predetermined accuracy, say 0.98, unnecessary connections should be removed from the network in order to obtain a simple but efficient PCBP. The pruning process consists of five steps:

Step 1: cluster the weights between adjacent layers, starting from the first node to the last one in the hidden or output layer, respectively.
Step 2: automatically determine a pruning bias such that, if all connections whose absolute clustered weights fall below the bias are deleted, the pruning ratio is met.
Step 3: remove all connections below the bias.
Step 4: if the network accuracy falls far below expectation, roll back the pruning, set another pruning ratio, and go to Step 2.
Step 5: update the connection status.

A. Algorithm for clustering weights

The notation used in the algorithm is:

β             clustering distance between connections
w             weight vector of a node, containing all the connections to it from the previous layer (either input-to-hidden or hidden-to-output)
num           length of w
ClusterType   vector recording which cluster each element of w belongs to
ClusterValue  vector of clustered values
ClusterSum    sum of the weights belonging to a specific cluster
ClusterCount  vector containing the number of members of each cluster
Count         number of clusters, i.e., the length of ClusterValue

For each node of the hidden or output layer, do the following:

Step 1: initially, set
ClusterSum(1) = w(1); ClusterValue(1) = w(1); Count = 1; ClusterCount(1) = 1; ClusterType(1) = 1

Step 2: for each i = 2 to num, if there exists an index j (1 ≤ j ≤ Count) that satisfies
|w(i) − ClusterValue(j)| < β,
then w(i) should be clustered with ClusterValue(j), so set
ClusterSum(j) = ClusterSum(j) + w(i); ClusterCount(j) = ClusterCount(j) + 1; ClusterType(i) = j
Otherwise, w(i) starts a new cluster, so set
Count = Count + 1; ClusterSum(Count) = w(i); ClusterValue(Count) = w(i); ClusterType(i) = Count

Step 3: calculate the average clustered value of each cluster:
for i = 1 to Count, set ClusterValue(i) = ClusterSum(i) / ClusterCount(i)

Step 4: update w to the relevant cluster values:
for i = 1 to num, set w(i) = ClusterValue(ClusterType(i))
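A minimal sketch of the clustering steps above, under two stated assumptions: variable names follow the paper's notation but the function itself is hypothetical, and the membership test is read as "join the nearest existing cluster if it lies within distance β":

```python
def cluster_weights(w, beta):
    """Greedy single-pass clustering of a weight vector w.
    A weight within distance beta of an existing cluster value joins
    that cluster; otherwise it starts a new one. Returns the vector
    with every weight replaced by its cluster average, plus the
    cluster values and member counts."""
    cluster_value = [w[0]]      # current representative of each cluster
    cluster_sum = [w[0]]
    cluster_count = [1]
    cluster_type = [0]          # 0-based cluster index of each weight
    for wi in w[1:]:
        # find the nearest existing cluster
        j = min(range(len(cluster_value)),
                key=lambda c: abs(wi - cluster_value[c]))
        if abs(wi - cluster_value[j]) < beta:
            cluster_sum[j] += wi
            cluster_count[j] += 1
            cluster_type.append(j)
        else:                   # start a new cluster
            cluster_value.append(wi)
            cluster_sum.append(wi)
            cluster_count.append(1)
            cluster_type.append(len(cluster_value) - 1)
    # Step 3: replace each cluster value by the average of its members
    cluster_value = [s / c for s, c in zip(cluster_sum, cluster_count)]
    # Step 4: map every weight onto its cluster average
    clustered_w = [cluster_value[t] for t in cluster_type]
    return clustered_w, cluster_value, cluster_count

w, values, counts = cluster_weights([0.10, 0.12, 0.80, 0.11, 0.79], beta=0.05)
print(w)       # each weight replaced by its cluster average
print(values)  # one averaged value per cluster
```

With the sample vector above, the five weights collapse into two clusters (around 0.11 and 0.795), so only two distinct magnitudes remain to be compared against the pruning bias.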
B. Deleting unnecessary connections

We define the following criterion for which connections are unnecessary: connections with relatively small weights. "Small" is too vague to fix exactly, especially since the distribution of the weights after training is unpredictable (the initial weights are random numbers, usually ranging from zero to one), so it is not practical to set the pruning bias manually. To solve this problem, we propose a heuristic method that automatically generates a pruning bias depending on the distribution of the weights. The algorithm is based on the pruning ratio, which is defined as:

Definition 2 (Pruning Ratio): the number of pruned connections divided by the total number of connections of the previous FCBP; it ranges from zero to one.

The algorithm is as follows:

Step 1: let µ (0 < µ < 1) be a predetermined pruning ratio that indicates how many connections should be pruned; let ϖ be the best pruning bias; num is the accumulated number of connections.
Step 2: sort ClusterValue in ascending order, changing ClusterCount accordingly so that the two vectors stay consistent; set
index = 1; num = ClusterCount(1); ϖ = ClusterValue(1)
Step 3: do the following loop:

WHILE (index < Count) do
    IF (num / Count < µ) THEN
        index = index + 1;
        num = num + ClusterCount(index);
        Continue
    ELSE
        ϖ = ClusterValue(index);
        Exit
end

After Step 3 we obtain the best pruning bias. To delete the unnecessary connections, we simply set the corresponding elements of the vectors wc and vc to zero; by doing this, an FCBP becomes a PCBP network.

When pruning an FCBP, one thing deserves special attention that may be ignored by other researchers. To illustrate it, look at Fig. 3: on the left is a pruned network with three hidden nodes marked A, B, and C. Notice that node A has two connections to nodes in the output layer (in bold lines) but none to the input layer, as those were deleted as unnecessary connections in the pruning process above. Although the probability of this happening is small, if it does happen it must be handled properly, or errors will occur: for any input pattern, the output of node A is constant, since its weighted input sum is zero (σ(0) = 0.5 for the sigmoid), and if node A has a bias term, its output can be any constant value depending on the bias; either way, the actual outputs of the output nodes are affected. For this kind of condition, we propose that node A be deleted as well, that is, the connections between A and the output layer are also deleted, so that A is entirely removed from the network, as illustrated on the right of Fig. 3. Likewise, if a node has connections to its previous layer but none to the following layer, its connections to the previous layer are deleted too. In this way, pruning of connections and nodes is handled together.

Fig.3 Example of inconsistent connections

V. TRAINING PCBP

When applying a PCBP in a specific domain such as data mining, the PCBP should usually be trained first, just like an FCBP. The training process of a PCBP differs slightly from that of an FCBP: taking a hidden node as an example, not all input nodes connect to it, so the actual input of hidden node j is calculated as

Σ_{l=1..n} ( w_l^j × x_l × wc_l^j )

Notice that an additional term, the connection status between the input and hidden layers, has been added. Similarly, equations 3 and 4 become equations 5 and 6:

∂Error/∂v_p^j = (S_p^i − t_p^i) × σ( Σ_l x_l^i w_l^j wc_l^j )    (5)

∂Error/∂w_l^j = Σ_p ((S_p^i − t_p^i) × v_p^j) × σ( Σ_l x_l^i w_l^j wc_l^j ) × (1 − σ( Σ_l x_l^i w_l^j wc_l^j )) × x_l^i    (6)

VI. EXPERIMENTS

[Fig. 4 shows an enterprise data warehouse architecture: sources (operational databases and data files) feed a sourcing area, staging area, integrated area, and history, which serve finance, sales, marketing, and enterprise-report users, with metadata and security managed on an MPP (Massively Parallel Processing) platform.]

Figure 4: data warehouse architecture
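Returning to the PCBP training rule of Section V, equations 5 and 6 can be sketched as one masked gradient step. This is a hypothetical single-sample illustration: the network sizes, weights, and learning rate are invented, and the hidden-to-output mask vc is applied to the weights v as the connection-status definition implies (equation 6 itself does not write vc explicitly):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pcbp_step(x, t, w, wc, v, vc, lr=0.5):
    """One cross-entropy gradient step for a one-hidden-layer PCBP.
    w[j][l], wc[j][l]: input->hidden weights and 0/1 connection status.
    v[p][j], vc[p][j]: hidden->output weights and 0/1 connection status.
    Returns the network output before the update."""
    h = [sigmoid(sum(wj[l] * wcj[l] * x[l] for l in range(len(x))))
         for wj, wcj in zip(w, wc)]
    s = [sigmoid(sum(vp[j] * vcp[j] * h[j] for j in range(len(h))))
         for vp, vcp in zip(v, vc)]
    # back-propagated term of equation 6, using the pre-update v
    back = [sum((s[p] - t[p]) * v[p][j] * vc[p][j] for p in range(len(v)))
            for j in range(len(h))]
    # equation 5: dError/dv_p^j = (S_p - t_p) * h_j, only where vc is 1
    for p in range(len(v)):
        for j in range(len(h)):
            v[p][j] -= lr * (s[p] - t[p]) * h[j] * vc[p][j]
    # equation 6: dError/dw_l^j = back_j * h_j (1 - h_j) * x_l, where wc is 1
    for j in range(len(w)):
        for l in range(len(x)):
            w[j][l] -= lr * back[j] * h[j] * (1 - h[j]) * x[l] * wc[j][l]
    return s

x = [1.0, 0.0, 0.5]
t = [1.0]
w = [[0.1, 0.2, -0.1], [0.3, -0.2, 0.4]]
wc = [[1, 0, 1], [1, 1, 1]]       # one pruned input->hidden connection
v = [[0.5, -0.3]]
vc = [[1, 1]]

s1 = pcbp_step(x, t, w, wc, v, vc)
s2 = pcbp_step(x, t, w, wc, v, vc)
print(s1, s2)  # the output moves toward the target across steps
```

Because every update is multiplied by the connection status, pruned weights are simply never touched, which is what makes PCBP training cheaper than FCBP training.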
In the data warehouse architecture, everything from the staging area to enterprise reporting is considered part of the enterprise data warehouse, because each component is an integral part of the warehouse and together they satisfy the current and future needs of business users across the enterprise. It is commonly understood that the data warehouse is the basis for BI and DSS applications, and that implementing a successful data warehouse requires not only technology but also methodology, as well as culture and cooperation across the enterprise.

The experiment data set, which records submersible pump repair history, contains four attribute classification codes: single rotor electric power (kW per rotor), cable temperature level (℃), casing size (inch), and protector length (m). In the following experiments, we use this data set to train FCBP and PCBP.
First, we need to encode the numeric data into binary, as illustrated in Table 1:

Table 1. The data attribute encoding table

Attribute: Single rotor electric power (kW Per
  Range: Single rotor electric power > 8      Encoded input: 1 1
  Range: 4 < Single rotor electric power      Encoded input: 0 1
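The visible rows of Table 1 suggest a thermometer-style code: each numeric attribute is compared against a set of thresholds, and each bit records whether one threshold is exceeded. A minimal sketch under that assumption (the function name is hypothetical, and the thresholds 4 and 8 are taken from the rows shown above; the remaining rows of the table are not visible here):

```python
def thermometer_encode(value, thresholds):
    """Encode a numeric value as one bit per threshold:
    a bit is 1 if the value exceeds that threshold, else 0."""
    return [1 if value > th else 0 for th in thresholds]

# Single rotor electric power against the thresholds visible in Table 1:
print(thermometer_encode(9.0, [8, 4]))  # > 8           -> [1, 1]
print(thermometer_encode(6.0, [8, 4]))  # between 4 and 8 -> [0, 1]
```

Each encoded bit then becomes one binary input node of the FCBP, which matches the paper's use of binary input vectors.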