Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Analysis

Marc Strickert, Udo Seiffert (Pattern Recognition Group, Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben)
Nese Sreenivasulu, Winfriede Weschke (Gene Expression Group, IPK Gatersleben)
Thomas Villmann (University Leipzig, Clinic for Psychotherapy)
Barbara Hammer (Institute of Computer Science, Technical University of Clausthal)

Email addresses: {stricker,seiffert}@ipk-gatersleben.de (Marc Strickert, Udo Seiffert), {srinivas,weschke}@ipk-gatersleben.de (Nese Sreenivasulu, Winfriede Weschke), [email protected] (Thomas Villmann), [email protected] (Barbara Hammer).
Abstract

A correlation-based similarity measure is derived for generalized relevance learning vector quantization (GRLVQ). The resulting GRLVQ-C classifier makes Pearson correlation available in a classification cost framework in which data prototypes and global attribute weighting terms are adapted into directions of minimum cost function values. In contrast to the Euclidean metric, the Pearson correlation measure makes input vector processing invariant to shifting and scaling transforms, which is a valuable feature for dealing with functional data and with intensity observations like gene expression patterns. Two types of data measures are derived from Pearson correlation in order to make its benefits for data processing available in compact prototype classification models. Fast convergence and high accuracies are demonstrated for cDNA-array gene expression data. Furthermore, the automatic attribute weighting of GRLVQ-C is successfully used to rate the functional relevance of analyzed genes.

Key words: Prototype-based learning, adaptive metrics, correlation measure, Learning Vector Quantization, GRLVQ, gene expression analysis.
Preprint submitted to Elsevier Science, 11 October 2005
1 Introduction
Pattern classification is a key technology for solving tasks in diagnostics, automation, information fusion, and forecasting. The backbone of pattern classification is the underlying distance metric: it defines how data items are compared, and it controls the grouping of data. Thus, depending on the definition of the distance, a data set can be viewed and processed from different perspectives. Unsupervised clustering with a specific similarity measure, for example visualized as the result of a self-organizing map (SOM), provides first hints about the appropriateness of the chosen metric for meaningful data grouping [6]. In prototype-based models like the SOM, a data item can be compared with an 'average' data prototype in various ways, for example, according to the Euclidean distance or the Manhattan block distance. Different physical and geometric interpretations are then obtained, because the former measures diagonally across the vector space, while the latter sums up distances along each dimension axis. In any case, the specific structure of the data space can and should be accounted for by selecting an appropriate metric. Once a suitable metric is identified, it can be further utilized for the design of good classifiers. In supervised scenarios, auxiliary class information can be used for adapting parameters that improve the specificity of data metrics during processing, as proposed by Kaski for (semi-)supervised extensions of the SOM [5]. Another metric-adapting classification architecture is the generalized relevance learning vector quantization (GRLVQ) developed by Hammer and Villmann [4]. Data metrics in the mathematical sense, however, might be too restrictive for some applications in which a relaxation to more general similarity measures would be useful. For example, in the biological sciences, functional aspects of the collected data often play an important role: general spatio-temporal patterns in time series, intensity fields, or observation sequences might be more inter-related than patterns that are just spatially close in the Euclidean sense. This applies to the aim of the present work, the analysis of gene expression patterns, for which the Pearson correlation is commonly used. Since recent technological achievements allow probing of thousands of gene expression levels in parallel, fast and accurate methods are required to deal with the resulting large data sets. Thereby, the definition of genetic similarity in terms of Pearson correlation should be possible, and the curse of dimensionality, related to the few available experiments in high-dimensional gene expression space, should be reduced to a minimum.
Many commercial and freely available bioinformatics tools, such as ArrayMiner, GeneSpring, J-Express Pro, and Eisen's Gene Cluster, use Pearson correlation for analysis. The common goal of these programs is the identification of key regulators and clusters of coexpressed genes that determine metabolic functions in developing organisms. Usually, only the metric of algorithms initially designed for processing Euclidean data is exchanged for a 'one minus correlation' term. Here, GRLVQ-C is proposed, a classifier that is mathematically derived from scratch for correlation-based classification. Its foundations are the generic update rules of generalized relevance learning vector quantization (GRLVQ) [3,4]. This allows the incorporation of auxiliary information for genetic distinction, such as the developmental stage of the probed tissues, or the stress factors applied to the growing organisms. Using the GRLVQ approach with its rigid classification cost function, a fast, prototype-based, and intuitive classification model with very good generalization properties is derived. Both data attribute relevances and prototype locations are obtained as a result of optimizing Pearson correlationships. The specific requirements of gene expression analysis are met in two ways: firstly, the implemented correlation measure accounts for the nature of gene expression experiments which, for physico-chemical reasons, tend to differ in their overall intensities and dynamic ranges, but not in the general structure of the expressed patterns. Secondly, automatic relevance weighting attenuates the curse of high dimensionality. The properties and benefits of the proposed GRLVQ-C classifier are demonstrated for real-life data sets.
2 Generalized Relevance LVQ (GRLVQ) and extensions
Let $X = \{(\mathbf{x}^i, y^i) \in \mathbb{R}^d \times \{1, \dots, c\} \mid i = 1, \dots, n\}$ be a training data set with $d$-dimensional elements $\mathbf{x}^k = (x^k_1, \dots, x^k_d)$ to be classified and $c$ classes. A set $W = \{\mathbf{w}^1, \dots, \mathbf{w}^K\}$ of prototypes in data space with class labels is used for data representation, $\mathbf{w}^i = (w^i_1, \dots, w^i_d, y^i) \in \mathbb{R}^d \times \{1, \dots, c\}$. The classification cost function to be minimized is given in the generic form [4]:

$$E_{\mathrm{GRLVQ}} := \sum_{i=1}^{n} g\bigl(q_\lambda(\mathbf{x}^i)\bigr) \quad\text{with}\quad q_\lambda(\mathbf{x}^i) = \frac{d^+_\lambda(\mathbf{x}^i) - d^-_\lambda(\mathbf{x}^i)}{d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)}\,, \qquad d_\lambda(\mathbf{x}) := d_\lambda(\mathbf{x}, \mathbf{w})\,.$$
The classification costs of all patterns are summed up, whereby $q_\lambda(\mathbf{x}^i)$ serves as a quality measure of the classification depending on the degree of fit between the presented pattern $\mathbf{x}^i$ and the two closest prototypes, $\mathbf{w}^{i+}$ representing the same label as $\mathbf{x}^i$ and $\mathbf{w}^{i-}$ a different label. A sigmoid transfer function $g(x) = \mathrm{sgd}(x) = 1/(1 + \exp(-x)) \in (0, 1)$ is used [9]. Implicit degrees of freedom of the cost minimization are the prototype locations in the weight space and a set of adaptive parameters $\lambda$ connected to the measure $d_\lambda(\mathbf{x}) = d_\lambda(\mathbf{x}, \mathbf{w})$ comparing pattern and prototype. In prior work, $d_\lambda(\mathbf{x})$ was supposed to be a metric in the mathematical sense, i.e. taking only non-negative values, conforming to the triangle inequality, and yielding a distance of $d = 0$ only for $\mathbf{w} = \mathbf{x}$. These conditions enable intuitive interpretations of prototype relationships. However, if just a well-performing classifier invariant to certain features is wanted, the distance conditions may be relaxed to a mere similarity measure to be plugged into the algorithm. Overall similarity maximization can be expressed in the GRLVQ framework by flipping the sign of the measure and then just keeping the minimization of $E_{\mathrm{GRLVQ}}$. Since the iterative GRLVQ update implements a gradient descent on $E$, $d$ must be differentiable almost everywhere, no matter whether it acts as a distance or as a similarity measure. Partial derivatives of $E_{\mathrm{GRLVQ}}$ yield the generic update formulas for the closest correct prototype, the closest wrong prototype, and the metric weights:

$$\Delta\mathbf{w}^{i+} = -\gamma^+ \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \mathbf{w}^{i+}} = -\gamma^+ \cdot g'\bigl(q_\lambda(\mathbf{x}^i)\bigr) \cdot \frac{2 \cdot d^-_\lambda(\mathbf{x}^i)}{\bigl(d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)\bigr)^2} \cdot \frac{\partial d^+_\lambda(\mathbf{x}^i)}{\partial \mathbf{w}^{i+}}$$

$$\Delta\mathbf{w}^{i-} = -\gamma^- \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \mathbf{w}^{i-}} = \gamma^- \cdot g'\bigl(q_\lambda(\mathbf{x}^i)\bigr) \cdot \frac{2 \cdot d^+_\lambda(\mathbf{x}^i)}{\bigl(d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)\bigr)^2} \cdot \frac{\partial d^-_\lambda(\mathbf{x}^i)}{\partial \mathbf{w}^{i-}}$$

$$\Delta\lambda = -\gamma^\lambda \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \lambda} = -\gamma^\lambda \cdot g'\bigl(q_\lambda(\mathbf{x}^i)\bigr) \cdot \frac{2 \cdot \partial d^+_\lambda(\mathbf{x}^i)/\partial\lambda \cdot d^-_\lambda(\mathbf{x}^i) - 2 \cdot d^+_\lambda(\mathbf{x}^i) \cdot \partial d^-_\lambda(\mathbf{x}^i)/\partial\lambda}{\bigl(d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)\bigr)^2}$$
Learning rates are $\gamma^\lambda$ for the metric parameters $\lambda_j$, all initialized equally by $\lambda_j = 1/d$, $j = 1 \dots d$; $\gamma^+$ and $\gamma^-$ describe the update amounts for the prototypes. Their choice depends on the used measure; generally, they should be chosen according to the relation $0 \le \gamma^\lambda \ll \gamma^- \le \gamma^+ \le 1$ and decreased within these constraints during training. Metric adaptation should be realized slowly, as a reaction to the quasi-stationary solutions for the prototype positions. The above set of equations is a convenient starting point for testing different concepts of similarity by just inserting the denoted partial derivatives of $d_\lambda(\mathbf{x})$, as sketched below.
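To make the generic scheme concrete, the following Python sketch performs one stochastic GRLVQ update step for an arbitrary differentiable measure passed in together with its partial derivatives. It is an illustration using our own function and variable names, not the authors' released implementation:

    import numpy as np

    def sgd_prime(x):
        # derivative of the sigmoid g(x) = 1 / (1 + exp(-x))
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)

    def grlvq_step(x, y, protos, labels, lam, measure, d_dw, d_dlam,
                   g_plus, g_minus, g_lam):
        """One update for pattern x with class y (labels is a numpy array).
        measure(x, w, lam) evaluates d_lambda(x, w); d_dw and d_dlam return
        its partial derivatives w.r.t. prototype and relevance components."""
        d = np.array([measure(x, w, lam) for w in protos])
        i_p = np.argmin(np.where(labels == y, d, np.inf))  # closest correct
        i_m = np.argmin(np.where(labels != y, d, np.inf))  # closest wrong
        dp, dm = d[i_p], d[i_m]
        gp = sgd_prime((dp - dm) / (dp + dm))              # g'(q_lambda(x))
        xi_p = 2.0 * dm / (dp + dm) ** 2
        xi_m = 2.0 * dp / (dp + dm) ** 2
        # evaluate all derivatives before moving anything
        dw_p, dw_m = d_dw(x, protos[i_p], lam), d_dw(x, protos[i_m], lam)
        dl_p, dl_m = d_dlam(x, protos[i_p], lam), d_dlam(x, protos[i_m], lam)
        protos[i_p] -= g_plus * gp * xi_p * dw_p
        protos[i_m] += g_minus * gp * xi_m * dw_m
        lam -= g_lam * gp * (xi_p * dl_p - xi_m * dl_m)
        lam = np.maximum(lam, 0.0)   # normalization as advised in the text
        lam /= lam.sum()
        return protos, lam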
3 Metrics and similarity measures
The missing ingredient for carrying out comparisons is either a distance metric or a more general similarity measure $d_\lambda(\mathbf{x}, \mathbf{w})$. In contrast to metrics, for which small values indicate a good match, similarity measures are maximum for the best match; to stress this opposition, metrics are sometimes called dis-similarity measures. For reference, the formulas for the weighted Euclidean distance will be revisited first. Then, by relaxing the conditions of metrics, two types of measures are derived from the Pearson correlation coefficient, both of which inherit the invariance to component offsets and amplitude scaling. This invariance, implemented by the presented update dynamic, is desirable in situations where mainly frequency information and the matching of simple curve shapes are of interest. More details on functional shape representations and functional data processing with neural networks are given by Biau et al. [1], with a focus on prototype-based SOM by Rossi et al. [7,8], and for use with support vector machines by Villa and Rossi [12].
3.1 Weighted Euclidean metric
The weighted Euclidean metric yields the following set of equations [13]:

$$d^{\mathrm{Euc}}_\lambda(\mathbf{x}, \mathbf{w}^i) = \sum_{j=1}^{d} \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w}\,, \quad \text{integers } b_\lambda, b_w \ge 0\,,\ b_w \text{ even}$$

$$\Rightarrow\quad \frac{\partial d^{\mathrm{Euc}}_\lambda(\mathbf{x}, \mathbf{w}^i)}{\partial w^i_j} = -b_w \cdot \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w - 1}\,, \qquad \frac{\partial d^{\mathrm{Euc}}_\lambda(\mathbf{x}, \mathbf{w}^i)}{\partial \lambda_j} = b_\lambda \cdot \lambda_j^{b_\lambda - 1} \cdot (x_j - w^i_j)^{b_w}\,.$$
For simplicity, roots have been omitted. In the squared case with $b_w = 2$, the derivative for the prototype update, $2 \cdot (x_j - w^i_j)$, contains the well-known Hebbian learning term. In other cases, large $b_w$ tend to focus on dimensions with large differences, and small $b_w$ on dimensions with small differences. Proven values for the exponents of the relevance factors are $b_\lambda \in \{1, 2\}$. Normalization to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is necessary after each update step to prevent the parameters from divergence and collapse.
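A direct transcription of these formulas into Python might read as follows; the function names are our own, and the defaults $b_\lambda = 2$, $b_w = 2$ correspond to the squared weighted Euclidean case:

    import numpy as np

    def d_euc(x, w, lam, b_lam=2, b_w=2):
        # weighted Euclidean measure (roots omitted, as in the text)
        return np.sum(lam ** b_lam * (x - w) ** b_w)

    def d_euc_dw(x, w, lam, b_lam=2, b_w=2):
        # partial derivatives w.r.t. the prototype components w_j
        return -b_w * lam ** b_lam * (x - w) ** (b_w - 1)

    def d_euc_dlam(x, w, lam, b_lam=2, b_w=2):
        # partial derivatives w.r.t. the relevance factors lambda_j
        return b_lam * lam ** (b_lam - 1) * (x - w) ** b_w

Plugged into the generic update step of Section 2, the default $b_w = 2$ recovers the Hebbian term $2 \cdot (x_j - w^i_j)$ mentioned above.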
3.2 Correlation-based measures
In the following, a correlation-based classification is derived from the term

$$r = d_r(\mathbf{x}, \mathbf{w}^i) = \frac{\sum_{l=1}^{d} (w^i_l - \mu_{w^i}) \cdot (x_l - \mu_x)}{\sqrt{\sum_{l=1}^{d} (w^i_l - \mu_{w^i})^2 \cdot \sum_{l=1}^{d} (x_l - \mu_x)^2}} \in [-1, 1] \qquad (1)$$
which is the Pearson correlation coefficient; therein, $\mu_y$ denotes the mean value of vector $\mathbf{y}$. As illustrated in Fig. 1, this correlation possesses fundamentally different properties than the Euclidean distance: depending on the applied similarity function, two patterns compared with a reference pattern can yield opposite relations. Simple data preprocessing cannot transform a correlation-based classification problem into an equivalent one solvable with the Euclidean metric. As a rough rule of thumb: if a prototype with 'sufficient' variance is similar to input points in the Euclidean sense, then it is very likely also highly correlated to them.
Fig. 1. Data patterns compared with different similarity functions. Relation characterizations for the squared Euclidean metric differ from those for Pearson correlation: $d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P1}) = 0.82 < d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P2}) = 1.81$, so P1 is closer to RS than P2; but $d_r(\mathrm{RS}, \mathrm{P1}) = -0.53 < d_r(\mathrm{RS}, \mathrm{P2}) = 0.89$, so P2 is more similar to RS (highly correlated) than P1 (anti-correlated).
The other direction does not hold: if high correlation exists, there might still be a large Euclidean distance. Thus, potentially fewer prototypes are necessary for representations based on correlation, and sparser data models can be realized.
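The invariance to shifting and scaling claimed here is easy to verify numerically; a small illustration with made-up patterns (not the data of Fig. 1):

    import numpy as np

    x = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
    w = 3.0 * x + 7.0                 # same shape, shifted and rescaled

    print(np.sum((x - w) ** 2))       # large squared Euclidean distance
    print(np.corrcoef(x, w)[0, 1])    # Pearson r = 1.0: shape match only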
The straightforward definition of Pearson correlation by Eq. 1, however, is not suitable for implementation in GRLVQ: firstly, the required cost function minimization conflicts with the desired maximization of Pearson similarity between data and prototypes; secondly, only the small range of values $-1 \le r \le 1$ is available for expressing best match versus worst match, which yields sub-optimum convergence of $E_{\mathrm{GRLVQ}}$. As a first approach, one might think of a version of Fisher's Z, $Z = 0.5 \cdot (\ln(1 + r) - \ln(1 - r))$, as a standard transformation of the Pearson correlation. This, however, leads to unstable behavior, because almost perfect (anti-)correlation is mapped to arbitrarily large absolute values. Therefore, inverse fractions of appropriately reshaped functions of $r$ are considered in the following. The derivations presented here unify and improve the transformations given in the authors' prior work [10].
Since metric adaptivity is a very beneficial property for rating individual data attributes, free parameters are added to the Pearson correlation in such a way that the meaning of correlation is fully maintained. Then, paying attention to the current prototype $\mathbf{w} := \mathbf{w}^i$, the numerator of Eq. 1 becomes

$$H := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x)\,, \quad \text{focusing on component } j:$$

$$H = \lambda_j^2 \cdot (w_j - \mu_w) \cdot (x_j - \mu_x) + \bar{H}_j(\mu_w)\,, \qquad \bar{H}_j(\mu_w) = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x)\,;$$

$$\mu_w = \mu_w(w_j) = \frac{1}{d} \cdot w_j + \frac{1}{d} \cdot \sum_{l=1,\, l \ne j}^{d} w_l\,.$$
The focus on component $j$ will be a convenient abbreviation for deriving the formulas for prototype update and relevance adaptation. Each of the mean-subtracted weight and pattern components, $(w_j - \mu_w)$ and $(x_j - \mu_x)$, has its own relevance factor $\lambda_j$. This is reflected in both rewritten denominator factors of Eq. 1, again with a focus on weight vector component $j$:

$$W := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2\,, \qquad X := \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x)^2\,,$$

$$W(w_j, \lambda_j) = \lambda_j^2 \cdot (w_j - \mu_w)^2 + \bar{W}_j\,, \qquad \bar{W}_j = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2\,.$$
Using the defined shortcuts, the adaptive Pearson correlation can be written as $r_\lambda = H / \sqrt{W \cdot X}$. Two types of measures are obtained by a unified transform:

$$R = \left(\frac{1}{C + r_\lambda(\mathbf{x}, \mathbf{w})}\right)^{k} - R_{\min} = \left(C + \frac{H}{\sqrt{W \cdot X}}\right)^{-k} - R_{\min} =: \hat{R}^{-k} - R_{\min} \qquad (2)$$
The resulting classifiers are characterized as follows:

C0: One type of classifier is obtained for $C = 0$, even integer exponents $k \ge 2$, and minimum $R_{\min} = 1$. This classifier separates both correlation and anti-correlation from mere dis-correlation. The minimum value $R_{\min}$ is subtracted in order to obtain sharp zeros for perfect matches. In computer implementations, a special treatment of the unlikely case of extreme dis-correlation might be considered in order to avoid division by (near-)zero values. C0-prototypes match both correlated patterns and their inverted, anti-correlated counterparts, which allows the realization of very compact classifiers. For specific data, however, this type of classification might lead to an undesired intermingling of data profile shapes, such as occurs in gene expression analysis.

C1: The other type of classifier separates correlated patterns from anti-correlated ones. The C1 model is realized by $C = 1$, an integer exponent $k \ge 1$, and $R_{\min} = 2^{-k}$; here, the rare occurrence of extreme anti-correlation might be worth a special treatment of singular values in computer realizations. The C1 setup allows classification with intuitive Pearson correlationships, a feature that is well suited for co-expression analysis of gene intensities. Both types are illustrated in the sketch below.
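As a numerical illustration of Eq. 2, the two measure types differ only in the choices of $C$, $k$, and $R_{\min}$. The following sketch uses our own function names and omits the relevance factors (i.e. all $\lambda_j$ equal):

    import numpy as np

    def pearson(x, w):
        xc, wc = x - x.mean(), w - w.mean()
        return np.dot(wc, xc) / np.sqrt(np.dot(wc, wc) * np.dot(xc, xc))

    def R_c0(x, w, k=6):
        # C = 0, even integer k, R_min = 1: correlation and anti-correlation
        # both yield small values, dis-correlation yields large ones
        return (0.0 + pearson(x, w)) ** (-k) - 1.0

    def R_c1(x, w, k=5):
        # C = 1, integer k, R_min = 2**(-k): only positive correlation matches
        return (1.0 + pearson(x, w)) ** (-k) - 2.0 ** (-k)

    x = np.array([1.0, 3.0, 2.0, 4.0])
    print(R_c1(x, 2.0 * x + 5.0))             # ~0: shift/scale invariant match
    w_anti = np.array([4.0, 2.5, 3.0, 1.0])   # nearly inverted shape
    print(R_c1(x, w_anti))                    # huge: anti-correlation penalized
    print(R_c0(x, w_anti))                    # small: C0 accepts inverted shapes

As discussed above, exact extreme (anti-)correlation would make the inverse fraction singular, so degenerate inputs deserve a special treatment in real implementations.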
For calculating the derivatives of $R$, the constant expression $R_{\min}$ can be omitted. Solutions can be obtained manually or by using computer algebra systems. In the latter case, after some rearrangements, the following equations are found:

$$\frac{\partial R}{\partial w_j} = \frac{\partial R(w_j, \lambda_j)}{\partial w_j} = \frac{\partial}{\partial w_j} \left(C + \frac{H(w_j, \lambda_j)}{\sqrt{W(w_j, \lambda_j) \cdot X(\lambda_j)}}\right)^{-k} = F \cdot \bigl(H'(w_j) \cdot W \cdot X - H \cdot W'(w_j) \cdot X\bigr)$$

$$\frac{\partial R}{\partial \lambda_j} = F \cdot \bigl(H'(\lambda_j) \cdot W \cdot X - H \cdot W'(\lambda_j) \cdot X - H \cdot W \cdot X'(\lambda_j)\bigr)$$

using the factor

$$F = -k \cdot \hat{R}^{-k-1} \cdot (W \cdot X)^{-\frac{3}{2}}\,.$$

The missing derivatives are

$$H'(w_j) = \lambda_j^2 \cdot (x_j - \mu_x) - \frac{1}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x)\,, \qquad W'(w_j) = 2 \cdot \lambda_j^2 \cdot (w_j - \mu_w) - \frac{2}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w)\,,$$

$$H'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w) \cdot (x_l - \mu_x)\,, \qquad W'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w)^2\,, \qquad X'(\lambda_l) = 2 \cdot \lambda_l \cdot (x_l - \mu_x)^2\,.$$

These formulas contain plausible Hebb terms, adapting $w_j$ into the direction of $(x_j - \mu_x)$ and away from $(w_l - \mu_w)$ in case of correct classification, whereby further scaling factors come from the cost function. Similarly, $\lambda_l$ is adapted according to the correlation $(w_l - \mu_w) \cdot (x_l - \mu_x)$ in comparison to the variances of these terms. Note that very efficient computer implementations can be realized, because most calculations can be done during the pattern presentation phase already: the similarity measure $R$ and all its constituents $H$, $W$, $X$, and some other terms are computed then, and they can be stored for each prototype for later reuse in the update phase. Again, the normalization of the relevance factors to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is advised in order to avoid numerical instabilities and to make different training runs comparable.

Two final practical issues concern the handling of singular expressions and the choice of the exponent $k$. Singular states occur if the variance of one of the vectors vanishes. If, for example, a pattern vector $\mathbf{x}$ is affected, then the simplified limit correlation

$$\lim_{x_l \to \mu_x} \frac{\sum_{l=1}^{d} (x_l - \mu_x) \cdot (w_l - \mu_w)}{\sqrt{\sum_{l=1}^{d} (x_l - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} \approx \lim_{x \to \mu_x} \frac{(x - \mu_x) \cdot \sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot (x - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \frac{\sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \frac{0}{\sqrt{d \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = 0$$
is of interest. In this case, all prototypes would end up with zero correlations and impracticable terms $F$. Analogous reasoning holds in the rare case of all-equal prototype components, which, in practice, would occur by inappropriate initialization rather than by the update dynamic; here, prototypes are assumed to be initialized by data instances. By skipping the degenerate constant patterns or prototypes, such unpleasant situations can be effectively avoided. Alternatively, a single randomly picked component can be set to a different value, which, on average, produces the desired state of uncorrelatedness, even for two vectors with simultaneously equal components subject to that modification. The free parameter $k$ influences the speed of convergence and the generalization ability. Integer values in the range $1 \le k < 20$ have been found to be reasonable choices in experiments. Too high values lead to fast adaptation, but sometimes also to over-fitting or to unstable convergence, unless a very small learning rate $\gamma$ is chosen. Good initial exponents are $k = 7$ or $k = 8$, odd or even according to the desire for a C1 or a C0 type classifier. For the presented experiments, training is only a matter of one or two minutes; therefore, systematic parameter searches can be realized.
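Putting the pieces together, the adaptive measure and its derivatives might be sketched as follows. This is a direct transcription of the printed formulas with our own names (not the authors' released code, and not independently re-derived); as noted above, $H$, $W$, and $X$ can be computed once per pattern presentation and reused in the update phase:

    import numpy as np

    def correlation_parts(x, w, lam):
        # shortcuts H, W, X of the relevance-weighted Pearson correlation
        xc, wc = x - x.mean(), w - w.mean()
        l2 = lam ** 2
        return (np.sum(l2 * wc * xc), np.sum(l2 * wc ** 2),
                np.sum(l2 * xc ** 2), xc, wc)

    def R_measure(x, w, lam, C=1.0, k=5):
        # Eq. 2; C = 1 gives the C1 type, C = 0 (with even k) the C0 type
        H, W, X, _, _ = correlation_parts(x, w, lam)
        R_min = 2.0 ** (-k) if C == 1.0 else 1.0
        return (C + H / np.sqrt(W * X)) ** (-k) - R_min

    def R_derivatives(x, w, lam, C=1.0, k=5):
        # gradient vectors dR/dw_j and dR/dlambda_j, transcribed from the text
        H, W, X, xc, wc = correlation_parts(x, w, lam)
        d, l2 = x.size, lam ** 2
        R_hat = C + H / np.sqrt(W * X)
        F = -k * R_hat ** (-k - 1) / (W * X) ** 1.5
        dH_dw = l2 * xc - np.sum(l2 * xc) / d
        dW_dw = 2.0 * l2 * wc - 2.0 * np.sum(l2 * wc) / d
        dR_dw = F * (dH_dw * W * X - H * dW_dw * X)
        dH_dl = 2.0 * lam * wc * xc
        dW_dl = 2.0 * lam * wc ** 2
        dX_dl = 2.0 * lam * xc ** 2
        dR_dl = F * (dH_dl * W * X - H * dW_dl * X - H * W * dX_dl)
        return dR_dw, dR_dl

These two functions play the roles of d_dw and d_dlam in the generic update step sketched in Section 2; since $R$ is already small for good matches, it plugs in like a distance without any further sign flip.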
4 Experiments
Three experiments underline the usefulness of correlation-based classification. First, a proof of concept is given for the Tecator benchmark data, a set of absorbance spectra. Then the focus is put on cDNA array classification: in the second experiment, a distinction between two types of leukemia is sought from gene spotting intensities; this involves the classification of 72 complete cDNA microarrays, i.e. 7129-dimensional gene expression intensity vectors, and the rating of these genes for their relevance to classification. The last experiment detects systematic differences between two series of gene expression records by analyzing two corresponding sets of 7-dimensional expression patterns with 1421 genes each.
4.1 Tecator spectral data
The first data set, publicly available at http://lib.stat.cmu.edu/datasets/tecator, contains 215 samples of 100-dimensional infrared absorbance spectra. The task is to predict the binary fat content, low or high, of meat by spectrum classification, using random data partitions of 120 training patterns and 95 test patterns, as suggested in [12]. It is known that the Euclidean distance is not appropriate for raw spectrum comparison, and the question of interest is whether Pearson correlation yields any benefit.
Fig. 2. Upper panel: sample spectra from Tecator data set. Lower panel: GRLVQ-C1 relevance profiles.
Figure 2, top panel, shows some of the spectra with their corresponding classes. Apart from a tendency towards dints around channel 41 for high fat content, a substantial visual overlap of the two classes can be stated. This is reflected in the results for Euclidean-based classifiers: k-nearest neighbor (k-NN) reaches its best classification results of about 80% accuracy for $k = 3$; GRLVQ with squared Euclidean metric reaches 88% on the test set at 94% training accuracy. With correlation-based classification, however, the problem gets easier, and results above 97% are obtained. For comparison, k-NN with maximum-correlation neighborhood is taken, k-NN-C for short. Table 1 contains the average numbers of misclassifications, each classifier trained 25 times. In contrast to k-NN-C, which utilizes all available training data for classification, GRLVQ-C requires only 20 prototypes per class, trained for 500 epochs. The learning rates for both types C0 and C1 without relevance learning are $\gamma^\lambda = 0$ and $\gamma^+ = \gamma^- = 10^{-8}$, and the exponents are $k = 6$ and $k = 5$, respectively. In additional runs, relevance learning is switched on by choosing a relatively large non-zero learning rate $\gamma^\lambda = 10^{-7}$ while the prototype adaptation rates are kept at $\gamma^+ = \gamma^- = 10^{-8}$.
         GRLVQ-C1        GRLVQ-C0        1-NN-C          3-NN-C          7-NN-C
         no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.
         3.32     2.04   2.4      2.16   3.96     2.52   5.04     2.84   6.4      3.24

Table 1. Tecator correlation-based classification results for the test set. Average numbers of misclassifications are shown for 25 runs of each classifier. k-NN-C utilizes the maximum-correlation neighborhood. Relevance utilization is indicated by 'rel.', denoting metric adaptation for GRLVQ-C and application of relevance-rescaled data for k-NN-C.
To summarize Tab. 1, GRLVQ-C classification is superior to k-NN-C. The differences between C0 and C1 accuracies become visible with respect to relevance learning: while C0 does not profit much from relevance adaptation, C1 clearly does, as some of its runs ended up with no misclassification at all. The bottom panel of Fig. 2 shows individual and average relevance profiles for the C1-classifier. As an intuitive result, the apparent discriminators at particular channels of the data, such as channel 41, get amplified, while less important channels are suppressed. Although k-NN-C decreases in performance for larger k-neighborhoods, its results can be improved by transforming the input data according to the scaling factors shown in the relevance plots. The GRLVQ-C scaling weights can thus be used to boost k-NN-C classification accuracies, which underlines a more general validity of the found data scaling properties. To conclude, sparse and accurate GRLVQ-C classifiers are obtained for the Tecator data set without further data preprocessing. The built-in relevance detection yields highly interpretable results, which, in the following, helps to identify key regulators in gene expression experiments.
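The relevance-rescaled k-NN-C used for comparison can be sketched in a few lines; 'lam' denotes relevance factors taken from a trained GRLVQ-C model, and the function name is our own:

    import numpy as np

    def knn_c_predict(X_train, y_train, X_test, lam=None, k=3):
        # k-NN with maximum-correlation neighborhood (k-NN-C); optionally
        # rescale all inputs by GRLVQ-C relevance factors first
        if lam is not None:
            X_train, X_test = X_train * lam, X_test * lam
        Xc = X_train - X_train.mean(axis=1, keepdims=True)
        Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)  # non-constant rows assumed
        preds = []
        for x in X_test:
            xc = x - x.mean()
            r = Xc @ (xc / np.linalg.norm(xc))       # Pearson correlations
            votes = y_train[np.argsort(-r)[:k]]      # k most correlated samples
            vals, counts = np.unique(votes, return_counts=True)
            preds.append(vals[np.argmax(counts)])
        return np.array(preds)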
4.2 Leukemia cancer type detection
The second task is gene expression analysis, where the GRLVQ-C property of automatic attribute weighting is used for gene ranking. Data is taken from cDNA arrays, which are powerful tools for probing in parallel the expression levels of thousands of genes extracted from organic tissue cells. A very important issue in gene expression analysis is the identification of functionally relevant genes. Particularly, medical diagnostics and therapies profit from the isolation of small sets of candidate genes responsible for defective or mutative operations. In cancer research, many well-documented data sets and publications are available online. One of the discussed problems is the differentiation between two types of leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Background information is provided by Golub et al. [2]. The corresponding data sets and further online material are found at http://www.broad.mit.edu/. The available data contains real-valued expressions of 7129 genes (some redundant) for each of the 38 training samples (27 ALL, 11 AML) and of the 34 test samples (20 ALL, 14 AML). In order to distinguish correlation from anti-correlation, GRLVQ-C1 is considered. Training has been carried out for a minimalistic GRLVQ-C1 model, using only one prototype per class, which prevents over-fitting in the 7129-dimensional space. Learning rates are $\gamma^+ = \gamma^- = 2.5 \cdot 10^{-3}$, the exponent $k = 2$ is taken, and 1000 epochs are trained. Table 3 shows that the average results of 100 runs with only prototype adaptation are rather poor in contrast to the neighborhood analysis method of Golub et al.; however, allowing relevance adaptation at $\gamma^\lambda = 5 \cdot 10^{-8}$, the GRLVQ-C1 accuracies are drastically improved.
Gene-#   Found   ID       Name
1745     Yes     M16038   LYN Yamaguchi sarc. vir. rel. oncog. homolog
1834     Yes     M23197   CD33 antigen (differentiation antigen)
1882     Yes     M27891   CST3 Cystatin C (amyloid cerebral hemorrhage)
2354     Yes     M92287   CCND3 Cyclin D3
4190     No      X16706   FOS-related antigen 2
4211     No      X51521   VIL2 Villin 2 (ezrin)
4847     Yes     X95735   Zyxin
5954     No      Y00339   CA2 Carbonic anhydrase II
6277     No      M30703   Amphiregulin (AR) gene
6376     Yes     M83652   PFC Properdin P factor, complement

Table 2. Leukemia list of candidate genes for differentiating between types AML and ALL. The 'Found' column indicates whether the specific gene is confirmed as identified by the team of Golub et al.
Thus, the obtained relevance factors must explain correlative differences between AML and ALL. Yet, this statement does not claim biological truth. For validation purposes, the genes have been ranked according to their relevance values for those 19 of 100 independently trained classifiers that showed perfect results on training and test set. A list of the top ten genes with the highest sum of their 19 ranks has been extracted and matched against a longer list of fifty genes given by Golub and his group. The results are given in Tab. 2. Remarkably, 6 of the identified genes are consistent with the reference list. The event of finding at least 6 matches has a vanishing probability of

$$P = \sum_{k=6}^{10} \binom{50}{k} \cdot \binom{7129 - 50}{10 - k} \Big/ \binom{7129}{10} \approx 1.8 \cdot 10^{-11}$$

for randomly selected genes. This means that, although only ten instead of fifty genes are considered for brevity here, these genes are confirmed as being of special importance.
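The quoted value is the upper tail of a hypergeometric distribution (10 genes drawn from 7129, at least 6 hitting the 50-gene reference list) and is easy to re-check numerically:

    from math import comb

    # P(at least 6 of 10 randomly drawn genes fall into the 50-gene list)
    P = sum(comb(50, k) * comb(7129 - 50, 10 - k)
            for k in range(6, 11)) / comb(7129, 10)
    print(P)  # approx. 1.8e-11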
Set     GRLVQ-C1 (rel.)   GRLVQ-C1 (no rel.)   Neigh.-Analysis
train   0.19              5.96                 2
test    2.1               7.14                 5

Table 3. Leukemia data set classification results. Average numbers of misclassifications are shown for 100 runs of each GRLVQ-C1 classifier. Relevance utilization is indicated by 'rel.'. The results for the neighborhood analysis (done a single time) are taken from Golub et al. [2].
Fig. 3. Leukemia data embedded by correlation-based multi-dimensional scaling. Left: original data set. Right: data set after application of GRLVQ-C1 relevance factors. Class prototypes are indicated by large symbols.
The functional role of the other 4 identified genes might still be of interest from a biological point of view, but this remains open for investigation in extra studies. Finally, the grouping of all 72 data samples is visualized together with the GRLVQ-C prototypes by correlation-based multi-dimensional scaling (HiT-MDS) [11] in Fig. 3. For the original data, shown in the left panel, there is a rough unsupervised separation of the 7129-dimensional gene expression vectors according to their type, AML or ALL. The corresponding GRLVQ-C1 prototypes define a tight data separation boundary which is still imperfect due to noisy data grouping. However, after the rescaling transform utilizing the GRLVQ-C1 relevance factors, much clearer data clusters are obtained, as shown in the right panel of Fig. 3. This visual aid re-emphasizes how the curse of dimensionality is effectively circumvented by using an adaptive metric driven by the available auxiliary class information.
4.3 Validation of gene expression experiments
The third study is connected to developmental gene expression series obtained from macroarrays. Expression patterns of 1421 genes were collected from filial tissues of barley seeds during 7 developmental stages between 0 and 12 days after flowering, in steps of two days. For control purposes, each experiment has been repeated using 2 sets of independently grown plant material. The question of interest is whether a systematic difference can be found in the gene expression profiles resulting from the two experimental series. Thus, 1421 data vectors in 7 dimensions are considered for each of the two classes related to series 1 and series 2.
Set     GRLVQ-C0 (3 pt.)   GRLVQ-C1 (5 pt.)   GRLVQ-Euc (4 pt.)   GRLVQ-Euc (40 pt.)
train   68.07%             66.91%             53.14%              68.38%
test    64.95%             66.44%             49.66%              58.32%

Table 4. GRLVQ classification accuracies for differentiating between 2 series of macroarray experiments. Numbers of used prototypes are given in brackets.
Random partitions into 50% training and 50% test sets are trained for 2500 epochs in 25 runs for each classification model: GRLVQ-C0 with $k = 8$, GRLVQ-C1 with $k = 7$, and GRLVQ-Euc. The exponents have been determined in a number of runs as a compromise between speed of convergence, related to small exponents, and over-fitting, observed for high values. Table 4 contains the average results of the classifiers with optimum model size. GRLVQ-C0 uses only 3 prototypes, 1 for series one and 2 for series two. This asymmetry has proved to be beneficial for classification. Likewise, GRLVQ-C1 makes use of 2 and 3 prototypes for series one and two, respectively. The squared Euclidean GRLVQ-Euc yields near-chance results for 2 prototypes per class; accuracies get better for 20 prototypes per class, but then the generalization is rather poor. All classifiers but the small Euclidean one indicate a detectable difference between the two series of experiments. However, the GRLVQ-C1-classifier that maintains the opposition of correlation and anti-correlation is a good choice with respect to model size and generalization ability. The relevance profiles for the three classifiers are given in Fig. 4. They show a rough correspondence in identifying relevant developmental stages within a range of 4–8 days. However, the details must be considered with care, because different configurations of the relevance profiles are found to lead to $E_{\mathrm{GRLVQ}}$ cost function minimization, especially for the C1 type. Thus, the data space is too homogeneously filled to emphasize specific dimensions clearly. This is also reflected in the small variability of the relevance factors $\lambda_i$; in this case, larger relevance learning rates produce unstable solutions. Nevertheless, a biologically consistent interpretation of the relevance profiles has been found: further biological investigations have supported a slight shift in assigning developmental stages between the two sets of independent experiments. In the conducted gene expression experiments, a robust transcriptional reprogramming occurred during the intermediate stage related to days 4–8 of filial tissue development. Although the overall expression data between the two sets of experiments are hardly distinguishable in practice, the slight systematic influence depending on the assignment of developmental stages affects gene expression during this intermediate phase. These slight differences were detected and could be well exploited by the GRLVQ classifiers, which confirms their use for processing biological observations.
Fig. 4. GRLVQ relevance profiles over the developmental stages that enhance the distinction of the two experimental gene expression series. From top to bottom: Euclidean, C0, C1. Different characteristics occur depending on the underlying similarity measure.
5 Conclusions
Adaptive correlation-based similarity measures have been successfully derived and integrated into the existing mathematical framework of GRLVQ learning. The experiments with the GRLVQ-C classifiers show that there is much potential in using non-Euclidean similarity measures. GRLVQ-C1 maintains the opposition of correlation and anti-correlation, while GRLVQ-C0 opposes both characteristics to dis-correlation, which leads to structurally different classifiers. The GRLVQ-C0 pattern matching is somewhat analogous to the property of Hopfield networks that do not distinguish a pattern from an inverted copy of it. By the utilization of Pearson correlation, no preprocessing is required to obtain independence of data contrast related to scaling and shifting. As a consequence, in comparison to the Euclidean metric, fewer prototypes are necessary to represent certain types of data: it has been shown that the functional Tecator data and the gene expression classification profit from using correlation measures. High sensitivity to specific differences in the data is realized, and very good classification results are obtained. Many other areas of GRLVQ-C application can be thought of, ranging from image processing to mass spectroscopy, areas which profit from relaxed pattern matching in contrast to strict metric-based classification. A very important property of the proposed types of correlation measures, C0 and C1, is their adaptivity for enhanced data discrimination from a global data perspective. As has been shown by the experiments, relevance scaling helps to find interesting data attributes and thereby drastically increases the classification accuracies for high-dimensional data. Even for standard methods, like the demonstrated k-NN-C and the MDS visualization technique, the expressiveness can be much improved if the data is preprocessed with the GRLVQ-C scaling factors. Yet, an interesting theoretical problem remains, apart from the practical benefits: to what extent can the large margin properties of GRLVQ-Euc be transferred to the new correlation measures of GRLVQ-C? This and a number of further issues will be addressed in future work. GRLVQ-C is available online as an instance of SRNG-GM at http://srng.webhop.net/.
Acknowledgments
Thanks to Dr. Volodymyr Radchuk for macroarray hybridization experiments. Gratitude to Prof. Wolfgang Stadje, University of Osnabrück, for his solution to the combinatorial probability of accidentally identified relevant genes. Many thanks also for the precise statements of the anonymous reviewers. The present work is supported by BMBF grant FKZ 0313115, GABI-SEED-II.
References

[1] G. Biau, F. Bunea, and M. Wegkamp. Functional Classification in Hilbert Spaces. IEEE Transactions on Information Theory, 51(6):2163–2172, 2005.

[2] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, Oct 1999.

[3] B. Hammer, M. Strickert, and T. Villmann. On the Generalization Ability of GRLVQ Networks. Neural Processing Letters, 21:109–120, 2005.

[4] B. Hammer and T. Villmann. Generalized Relevance Learning Vector Quantization. Neural Networks, 15:1059–1068, 2002.

[5] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12:936–947, 2001.

[6] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition, 2001.

[7] F. Rossi, B. Conan-Guez, and A. El Golli. Clustering functional data with the SOM algorithm. In Proceedings of ESANN 2004, pages 305–312, Bruges, Belgium, April 2004.

[8] F. Rossi, N. Delannay, B. Conan-Guez, and M. Verleysen. Representation of functional data in neural networks. Neurocomputing, 64:183–210, March 2005.

[9] A. Sato and K. Yamada. Generalized Learning Vector Quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS), volume 7, pages 423–429. MIT Press, 1995.

[10] M. Strickert, N. Sreenivasulu, W. Weschke, U. Seiffert, and T. Villmann. Generalized Relevance LVQ with Correlation Measures for Biological Data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks (ESANN), pages 331–338. D-side Publications, 2005.

[11] M. Strickert, S. Teichmann, N. Sreenivasulu, and U. Seiffert. High-Throughput Multi-Dimensional Scaling (HiT-MDS) for cDNA-Array Expression Data. In W. Duch, J. Kacprzyk, E. Oja, and S. Zadrożny, editors, Artificial Neural Networks: Biological Inspirations – ICANN 2005, Springer Lecture Notes in Computer Science, pages 625–633, 2005.

[12] N. Villa and F. Rossi. Support vector machine for functional data classification. In Proceedings of ESANN 2005, pages 467–472, Bruges, Belgium, April 2005.

[13] T. Villmann, F. Schleif, and B. Hammer. Supervised Neural Gas and Relevance Learning in Learning Vector Quantization. In T. Yamakawa, editor, Proc. of the Workshop on Self-Organizing Networks (WSOM), pages 47–52, Kyushu Institute of Technology, 2003.