Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Analysis

Marc Strickert, Udo Seiffert (Pattern Recognition Group, Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben)
Nese Sreenivasulu, Winfriede Weschke (Gene Expression Group, IPK Gatersleben)
Thomas Villmann (University Leipzig, Clinic for Psychotherapy)
Barbara Hammer (Institute of Computer Science, Technical University of Clausthal)

Email addresses: {stricker,seiffert}@ipk-gatersleben.de (Marc Strickert, Udo Seiffert), {srinivas,weschke}@ipk-gatersleben.de (Nese Sreenivasulu, Winfriede Weschke), [email protected] (Thomas Villmann), [email protected] (Barbara Hammer).
Abstract

A correlation-based similarity measure is derived for generalized relevance learning vector quantization (GRLVQ). The resulting GRLVQ-C classifier makes Pearson correlation available in a classification cost framework in which data prototypes and global attribute weighting terms are adapted into directions of minimum cost function values. In contrast to the Euclidean metric, the Pearson correlation measure makes input vector processing invariant to shifting and scaling transforms, which is a valuable feature for dealing with functional data and with intensity observations like gene expression patterns. Two types of data measures are derived from Pearson correlation in order to make its benefits for data processing available in compact prototype classification models. Fast convergence and high accuracies are demonstrated for cDNA-array gene expression data. Furthermore, the automatic attribute weighting of GRLVQ-C is successfully used to rate the functional relevance of analyzed genes.

Key words: Prototype-based learning, adaptive metrics, correlation measure, Learning Vector Quantization, GRLVQ, gene expression analysis.
Preprint submitted to Elsevier Science, 11 October 2005
1 Introduction
Pattern classification is a key technology for solving tasks in diagnostics, automation, information fusion, and forecasting. The backbone of pattern classification is the underlying distance metric: it defines how data items are compared, and it controls the grouping of data. Thus, depending on the definition of the distance, a data set can be viewed and processed from different perspectives. Unsupervised clustering with a specific similarity measure, for example visualized as the result of a self-organizing map (SOM), provides first hints about the appropriateness of the chosen metric for meaningful data grouping [6]. In prototype-based models like the SOM, a data item can be compared with an 'average' data prototype in various ways, for example, according to the Euclidean distance or the Manhattan block distance. Different physical and geometric interpretations are then obtained, because the former measures diagonally across the vector space, while the latter sums up distances along each dimension axis. In any case, the specific structure of the data space can and should be accounted for by selecting an appropriate metric. Once a suitable metric is identified, it can be further utilized for the design of good classifiers. In supervised scenarios, auxiliary class information can be used for adapting parameters that improve the specificity of data metrics during processing, as proposed by Kaski for (semi-)supervised extensions of the SOM [5]. Another metric-adapting classification architecture is the generalized relevance learning vector quantization (GRLVQ) developed by Hammer and Villmann [4]. Data metrics in the mathematical sense, however, might be too restrictive for some applications in which a relaxation to more general similarity measures would be useful. For example, in the biological sciences, functional aspects of the collected data often play an important role: general spatio-temporal patterns in time series, intensity fields, or observation sequences might be more inter-related than patterns that are just spatially close in the Euclidean sense. This applies to the aim of the present work, the analysis of gene expression patterns, for which the Pearson correlation is commonly used. Since recent technological achievements allow probing of thousands of gene expression levels in parallel, fast and accurate methods are required to deal with the resulting large data sets. Thereby, the definition of genetic similarity in terms of Pearson correlation should be possible, and the curse of dimensionality, related to the few available experiments in high-dimensional gene expression space, should be reduced to a minimum.
Many commercial and freely available bioinformatics tools, such as ArrayMiner, GeneSpring, J-Express Pro, and Eisen's Gene Cluster, use Pearson correlation for analysis. The common goal of these programs is the identification of key regulators and clusters of coexpressed genes that determine metabolic functions in developing organisms. Usually, only the metric of algorithms initially designed for processing Euclidean data is exchanged for a 'one minus correlation' term. Here, GRLVQ-C is proposed, a classifier that is mathematically derived from scratch for correlation-based classification. Its foundations are the generic update rules of generalized relevance learning vector quantization (GRLVQ) [3,4]. This allows the incorporation of auxiliary information for genetic distinction, such as the developmental stage of the probed tissues, or the stress factors applied to the growing organisms. Using the GRLVQ approach with its rigid classification cost function, a fast, prototype-based, and intuitive classification model with very good generalization properties is derived. Both data attribute relevances and prototype locations are obtained as a result of optimizing Pearson correlationships. The specific requirements of gene expression analysis are met in two ways: firstly, the implemented correlation measure accounts for the nature of gene expression experiments which, for physico-chemical reasons, tend to differ in their overall intensities and dynamic ranges, but not in the general structure of the expressed patterns. Secondly, automatic relevance weighting attenuates the curse of high dimensionality. The properties and benefits of the proposed GRLVQ-C classifier are demonstrated for real-life data sets.
2 Generalized Relevance LVQ (GRLVQ) and extensions
Let $X = \{(\mathbf{x}^i, y^i) \in \mathbb{R}^d \times \{1, \dots, c\} \mid i = 1, \dots, n\}$ be a training data set with $d$-dimensional elements $\mathbf{x}^k = (x^k_1, \dots, x^k_d)$ to be classified and $c$ classes. A set $W = \{\mathbf{w}^1, \dots, \mathbf{w}^K\}$ of prototypes in data space with class labels is used for data representation, $\mathbf{w}^i = (w^i_1, \dots, w^i_d, y^i) \in \mathbb{R}^d \times \{1, \dots, c\}$. The classification cost function to be minimized is given in the generic form [4]:

$$E_{\mathrm{GRLVQ}} := \sum_{i=1}^{n} g\bigl(q_\lambda(\mathbf{x}^i)\bigr) \quad\text{with}\quad q_\lambda(\mathbf{x}^i) = \frac{d^+_\lambda(\mathbf{x}^i) - d^-_\lambda(\mathbf{x}^i)}{d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)}\,, \qquad d_\lambda(\mathbf{x}) := d_\lambda(\mathbf{x}, \mathbf{w})\,.$$
The classification costs of all patterns are summed up, whereby $q_\lambda(\mathbf{x}^i)$ serves as a quality measure of the classification depending on the degree of fit between the presented pattern $\mathbf{x}^i$ and the two closest prototypes, $\mathbf{w}^{i+}$ representing the same label as $\mathbf{x}^i$ and $\mathbf{w}^{i-}$ a different label. A sigmoid transfer function $g(x) = \mathrm{sgd}(x) = 1/(1 + \exp(-x)) \in (0, 1)$ is used [9]. Implicit degrees of freedom of the cost minimization are the prototype locations in the weight space and a set of adaptive parameters $\lambda$ connected to the measure $d_\lambda(\mathbf{x}) = d_\lambda(\mathbf{x}, \mathbf{w})$ comparing pattern and prototype. In prior work, $d_\lambda(\mathbf{x})$ was supposed to be a metric in the mathematical sense, i.e. taking only non-negative values, conforming to the triangle inequality, and yielding a distance of $d = 0$ only for $\mathbf{w} = \mathbf{x}$. These conditions enable intuitive interpretations of prototype relationships. However, if just a well-performing classifier invariant to certain features is wanted, the distance conditions may be relaxed to a mere similarity measure to be plugged into the algorithm. Overall similarity maximization can be expressed in the GRLVQ framework by flipping the sign of the measure and then just keeping the minimization of $E_{\mathrm{GRLVQ}}$. Since the iterative GRLVQ update implements a gradient descent on $E$, $d$ must be differentiable almost everywhere, no matter whether it acts as a distance or as a similarity measure. Partial derivatives of $E_{\mathrm{GRLVQ}}$ yield the generic update formulas for the closest correct prototype, the closest wrong prototype, and the metric weights:

$$\Delta\mathbf{w}^{i+} = -\gamma^+ \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \mathbf{w}^{i+}} = -\gamma^+ \cdot g'\bigl(q_\lambda(\mathbf{x}^i)\bigr) \cdot \frac{2 \cdot d^-_\lambda(\mathbf{x}^i)}{\bigl(d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)\bigr)^2} \cdot \frac{\partial d^+_\lambda(\mathbf{x}^i)}{\partial \mathbf{w}^{i+}}$$

$$\Delta\mathbf{w}^{i-} = -\gamma^- \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \mathbf{w}^{i-}} = \gamma^- \cdot g'\bigl(q_\lambda(\mathbf{x}^i)\bigr) \cdot \frac{2 \cdot d^+_\lambda(\mathbf{x}^i)}{\bigl(d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)\bigr)^2} \cdot \frac{\partial d^-_\lambda(\mathbf{x}^i)}{\partial \mathbf{w}^{i-}}$$

$$\Delta\lambda = -\gamma^\lambda \cdot \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \lambda} = -\gamma^\lambda \cdot g'\bigl(q_\lambda(\mathbf{x}^i)\bigr) \cdot \frac{2 \cdot \partial d^+_\lambda(\mathbf{x}^i)/\partial\lambda \cdot d^-_\lambda(\mathbf{x}^i) - 2 \cdot d^+_\lambda(\mathbf{x}^i) \cdot \partial d^-_\lambda(\mathbf{x}^i)/\partial\lambda}{\bigl(d^+_\lambda(\mathbf{x}^i) + d^-_\lambda(\mathbf{x}^i)\bigr)^2}$$
Learning rates are $\gamma^\lambda$ for the metric parameters $\lambda_j$, all initialized equally by $\lambda_j = 1/d$, $j = 1 \dots d$; $\gamma^+$ and $\gamma^-$ describe the update amounts for the prototypes. Their choice depends on the used measure; generally, they should be chosen according to the relation $0 \le \gamma^\lambda \ll \gamma^- \le \gamma^+ \le 1$ and decreased within these constraints during training. Metric adaptation should be realized slowly, as a reaction to the quasi-stationary solutions for the prototype positions. The above set of equations is a convenient starting point for testing different concepts of similarity by just inserting the denoted partial derivatives of $d_\lambda(\mathbf{x})$, as sketched below.
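To make the generic scheme concrete, the following Python sketch performs one stochastic GRLVQ update step for an arbitrary differentiable measure passed in together with its partial derivatives. It is an illustration using our own function and variable names, not the authors' released implementation:

    import numpy as np

    def sgd_prime(x):
        # derivative of the sigmoid g(x) = 1 / (1 + exp(-x))
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)

    def grlvq_step(x, y, protos, labels, lam, measure, d_dw, d_dlam,
                   g_plus, g_minus, g_lam):
        """One update for pattern x with class y (labels is a numpy array).
        measure(x, w, lam) evaluates d_lambda(x, w); d_dw and d_dlam return
        its partial derivatives w.r.t. prototype and relevance components."""
        d = np.array([measure(x, w, lam) for w in protos])
        i_p = np.argmin(np.where(labels == y, d, np.inf))  # closest correct
        i_m = np.argmin(np.where(labels != y, d, np.inf))  # closest wrong
        dp, dm = d[i_p], d[i_m]
        gp = sgd_prime((dp - dm) / (dp + dm))              # g'(q_lambda(x))
        xi_p = 2.0 * dm / (dp + dm) ** 2
        xi_m = 2.0 * dp / (dp + dm) ** 2
        # evaluate all derivatives before moving anything
        dw_p, dw_m = d_dw(x, protos[i_p], lam), d_dw(x, protos[i_m], lam)
        dl_p, dl_m = d_dlam(x, protos[i_p], lam), d_dlam(x, protos[i_m], lam)
        protos[i_p] -= g_plus * gp * xi_p * dw_p
        protos[i_m] += g_minus * gp * xi_m * dw_m
        lam -= g_lam * gp * (xi_p * dl_p - xi_m * dl_m)
        lam = np.maximum(lam, 0.0)   # normalization as advised in the text
        lam /= lam.sum()
        return protos, lam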
3 Metrics and similarity measures
The missing ingredient for carrying out comparisons is either a distance metric or a more general similarity measure $d_\lambda(\mathbf{x}, \mathbf{w})$. In contrast to metrics, for which small values indicate a good match, similarity measures are maximum for the best match; to stress this opposition, metrics are sometimes called dis-similarity measures. For reference, the formulas for the weighted Euclidean distance will be revisited first. Then, by relaxing the conditions of metrics, two types of measures are derived from the Pearson correlation coefficient, both of which inherit the invariance to component offsets and amplitude scaling. This invariance, implemented by the presented update dynamic, is desirable in situations where mainly frequency information and the matching of simple curve shapes are of interest. More details on functional shape representations and functional data processing with neural networks are given by Biau et al. [1], with a focus on prototype-based SOM by Rossi et al. [7,8], and for use with support vector machines by Villa and Rossi [12].
3.1 Weighted Euclidean metric
The weighted Euclidean metric yields the following set of equations [13]:

$$d^{\mathrm{Euc}}_\lambda(\mathbf{x}, \mathbf{w}^i) = \sum_{j=1}^{d} \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w}\,, \quad \text{integers } b_\lambda, b_w \ge 0\,,\ b_w \text{ even}$$

$$\Rightarrow\quad \frac{\partial d^{\mathrm{Euc}}_\lambda(\mathbf{x}, \mathbf{w}^i)}{\partial w^i_j} = -b_w \cdot \lambda_j^{b_\lambda} \cdot (x_j - w^i_j)^{b_w - 1}\,, \qquad \frac{\partial d^{\mathrm{Euc}}_\lambda(\mathbf{x}, \mathbf{w}^i)}{\partial \lambda_j} = b_\lambda \cdot \lambda_j^{b_\lambda - 1} \cdot (x_j - w^i_j)^{b_w}\,.$$
For simplicity, roots have been omitted. In the squared case with $b_w = 2$, the derivative for the prototype update, $2 \cdot (x_j - w^i_j)$, contains the well-known Hebbian learning term. In other cases, large $b_w$ tend to focus on dimensions with large differences, and small $b_w$ on dimensions with small differences. Proven values for the exponents of the relevance factors are $b_\lambda \in \{1, 2\}$. Normalization to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is necessary after each update step to prevent the parameters from divergence and collapse.
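A direct transcription of these formulas into Python might read as follows; the function names are our own, and the defaults $b_\lambda = 2$, $b_w = 2$ correspond to the squared weighted Euclidean case:

    import numpy as np

    def d_euc(x, w, lam, b_lam=2, b_w=2):
        # weighted Euclidean measure (roots omitted, as in the text)
        return np.sum(lam ** b_lam * (x - w) ** b_w)

    def d_euc_dw(x, w, lam, b_lam=2, b_w=2):
        # partial derivatives w.r.t. the prototype components w_j
        return -b_w * lam ** b_lam * (x - w) ** (b_w - 1)

    def d_euc_dlam(x, w, lam, b_lam=2, b_w=2):
        # partial derivatives w.r.t. the relevance factors lambda_j
        return b_lam * lam ** (b_lam - 1) * (x - w) ** b_w

Plugged into the generic update step of Section 2, the default $b_w = 2$ recovers the Hebbian term $2 \cdot (x_j - w^i_j)$ mentioned above.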
3.2 Correlation-based measures
In the following, a correlation-based classification is derived from the term

$$r = d_r(\mathbf{x}, \mathbf{w}^i) = \frac{\sum_{l=1}^{d} (w^i_l - \mu_{w^i}) \cdot (x_l - \mu_x)}{\sqrt{\sum_{l=1}^{d} (w^i_l - \mu_{w^i})^2 \cdot \sum_{l=1}^{d} (x_l - \mu_x)^2}} \in [-1, 1] \qquad (1)$$
which is the Pearson correlation coefficient; therein, $\mu_y$ denotes the mean value of vector $\mathbf{y}$. As illustrated in Fig. 1, this correlation possesses fundamentally different properties than the Euclidean distance: depending on the applied similarity function, two patterns compared with a reference pattern can yield opposite relations. Simple data preprocessing cannot transform a correlation-based classification problem into an equivalent one solvable with the Euclidean metric. As a rough rule of thumb: if a prototype with 'sufficient' variance is similar to input points in the Euclidean sense, then it is very likely also highly correlated to them.
Fig. 1. Data patterns compared with different similarity functions. Relation characterizations for the squared Euclidean metric differ from those for Pearson correlation: $d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P1}) = 0.82 < d^{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P2}) = 1.81$, so P1 is closer to RS than P2; but $d_r(\mathrm{RS}, \mathrm{P1}) = -0.53 < d_r(\mathrm{RS}, \mathrm{P2}) = 0.89$, so P2 is more similar to RS (highly correlated) than P1 (anti-correlated).
The other direction does not hold: if high correlation exists, there might still be a large Euclidean distance. Thus, potentially fewer prototypes are necessary for representations based on correlation, and sparser data models can be realized.
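The invariance to shifting and scaling claimed here is easy to verify numerically; a small illustration with made-up patterns (not the data of Fig. 1):

    import numpy as np

    x = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
    w = 3.0 * x + 7.0                 # same shape, shifted and rescaled

    print(np.sum((x - w) ** 2))       # large squared Euclidean distance
    print(np.corrcoef(x, w)[0, 1])    # Pearson r = 1.0: shape match only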
The straightforward definition of Pearson correlation by Eq. 1, however, is not suitable for implementation in GRLVQ: firstly, the required cost function minimization conflicts with the desired maximization of Pearson similarity between data and prototypes; secondly, only the small range of values $-1 \le r \le 1$ is available for expressing best match versus worst match, which yields sub-optimum convergence of $E_{\mathrm{GRLVQ}}$. As a first approach, one might think of a version of Fisher's Z, $Z = 0.5 \cdot (\ln(1 + r) - \ln(1 - r))$, as a standard transformation of the Pearson correlation. This, however, leads to unstable behavior, because almost perfect (anti-)correlation is mapped to arbitrarily large absolute values. Therefore, inverse fractions of appropriately reshaped functions of $r$ are considered in the following. The derivations presented here unify and improve the transformations given in the authors' prior work [10].
Since metric adaptivity is a very beneficial property for rating individual data attributes, free parameters are added to the Pearson correlation in such a way that the meaning of correlation is fully maintained. Then, paying attention to the current prototype $\mathbf{w} := \mathbf{w}^i$, the numerator of Eq. 1 becomes

$$H := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x)\,, \quad \text{focusing on component } j:$$

$$H = \lambda_j^2 \cdot (w_j - \mu_w) \cdot (x_j - \mu_x) + \bar{H}_j(\mu_w)\,, \qquad \bar{H}_j(\mu_w) = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w) \cdot (x_l - \mu_x)\,;$$

$$\mu_w = \mu_w(w_j) = \frac{1}{d} \cdot w_j + \frac{1}{d} \cdot \sum_{l=1,\, l \ne j}^{d} w_l\,.$$
The focus on component $j$ will be a convenient abbreviation for deriving the formulas for prototype update and relevance adaptation. Each of the mean-subtracted weight and pattern components, $(w_j - \mu_w)$ and $(x_j - \mu_x)$, has its own relevance factor $\lambda_j$. This is reflected in both rewritten denominator factors of Eq. 1, again with a focus on weight vector component $j$:

$$W := \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2\,, \qquad X := \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x)^2\,,$$

$$W(w_j, \lambda_j) = \lambda_j^2 \cdot (w_j - \mu_w)^2 + \bar{W}_j\,, \qquad \bar{W}_j = \sum_{l=1,\, l \ne j}^{d} \lambda_l^2 \cdot (w_l - \mu_w)^2\,.$$
Using the defined shortcuts, the adaptive Pearson correlation can be written as $r_\lambda = H / \sqrt{W \cdot X}$. Two types of measures are obtained by a unified transform:

$$R = \left(\frac{1}{C + r_\lambda(\mathbf{x}, \mathbf{w})}\right)^{k} - R_{\min} = \left(C + \frac{H}{\sqrt{W \cdot X}}\right)^{-k} - R_{\min} =: \hat{R}^{-k} - R_{\min} \qquad (2)$$
The resulting classifiers are characterized as follows:

C0: One type of classifier is obtained for $C = 0$, even integer exponents $k \ge 2$, and minimum $R_{\min} = 1$. This classifier separates both correlation and anti-correlation from mere dis-correlation. The minimum value $R_{\min}$ is subtracted in order to obtain sharp zeros for perfect matches. In computer implementations, a special treatment of the unlikely case of extreme dis-correlation might be considered in order to avoid division by (near-)zero values. C0-prototypes match both correlated patterns and their inverted, anti-correlated counterparts, which allows the realization of very compact classifiers. For specific data, however, this type of classification might lead to an undesired intermingling of data profile shapes, such as occurs in gene expression analysis.

C1: The other type of classifier separates correlated patterns from anti-correlated ones. The C1 model is realized by $C = 1$, an integer exponent $k \ge 1$, and $R_{\min} = 2^{-k}$; here, the rare occurrence of extreme anti-correlation might be worth a special treatment of singular values in computer realizations. The C1 setup allows classification with intuitive Pearson correlationships, a feature that is well suited for co-expression analysis of gene intensities. Both types are illustrated in the sketch below.
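As a numerical illustration of Eq. 2, the two measure types differ only in the choices of $C$, $k$, and $R_{\min}$. The following sketch uses our own function names and omits the relevance factors (i.e. all $\lambda_j$ equal):

    import numpy as np

    def pearson(x, w):
        xc, wc = x - x.mean(), w - w.mean()
        return np.dot(wc, xc) / np.sqrt(np.dot(wc, wc) * np.dot(xc, xc))

    def R_c0(x, w, k=6):
        # C = 0, even integer k, R_min = 1: correlation and anti-correlation
        # both yield small values, dis-correlation yields large ones
        return (0.0 + pearson(x, w)) ** (-k) - 1.0

    def R_c1(x, w, k=5):
        # C = 1, integer k, R_min = 2**(-k): only positive correlation matches
        return (1.0 + pearson(x, w)) ** (-k) - 2.0 ** (-k)

    x = np.array([1.0, 3.0, 2.0, 4.0])
    print(R_c1(x, 2.0 * x + 5.0))             # ~0: shift/scale invariant match
    w_anti = np.array([4.0, 2.5, 3.0, 1.0])   # nearly inverted shape
    print(R_c1(x, w_anti))                    # huge: anti-correlation penalized
    print(R_c0(x, w_anti))                    # small: C0 accepts inverted shapes

As discussed above, exact extreme (anti-)correlation would make the inverse fraction singular, so degenerate inputs deserve a special treatment in real implementations.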
For calculating the derivatives of $R$, the constant expression $R_{\min}$ can be omitted. Solutions can be obtained manually or by using computer algebra systems. In the latter case, after some rearrangements, the following equations are found:

$$\frac{\partial R}{\partial w_j} = \frac{\partial R(w_j, \lambda_j)}{\partial w_j} = \frac{\partial}{\partial w_j} \left(C + \frac{H(w_j, \lambda_j)}{\sqrt{W(w_j, \lambda_j) \cdot X(\lambda_j)}}\right)^{-k} = F \cdot \bigl(H'(w_j) \cdot W \cdot X - H \cdot W'(w_j) \cdot X\bigr)$$

$$\frac{\partial R}{\partial \lambda_j} = F \cdot \bigl(H'(\lambda_j) \cdot W \cdot X - H \cdot W'(\lambda_j) \cdot X - H \cdot W \cdot X'(\lambda_j)\bigr)$$

using the factor

$$F = -k \cdot \hat{R}^{-k-1} \cdot (W \cdot X)^{-\frac{3}{2}}\,.$$

The missing derivatives are

$$H'(w_j) = \lambda_j^2 \cdot (x_j - \mu_x) - \frac{1}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (x_l - \mu_x)\,, \qquad W'(w_j) = 2 \cdot \lambda_j^2 \cdot (w_j - \mu_w) - \frac{2}{d} \cdot \sum_{l=1}^{d} \lambda_l^2 \cdot (w_l - \mu_w)\,,$$

$$H'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w) \cdot (x_l - \mu_x)\,, \qquad W'(\lambda_l) = 2 \cdot \lambda_l \cdot (w_l - \mu_w)^2\,, \qquad X'(\lambda_l) = 2 \cdot \lambda_l \cdot (x_l - \mu_x)^2\,.$$

These formulas contain plausible Hebb terms, adapting $w_j$ into the direction of $(x_j - \mu_x)$ and away from $(w_l - \mu_w)$ in case of correct classification, whereby further scaling factors come from the cost function. Similarly, $\lambda_l$ is adapted according to the correlation $(w_l - \mu_w) \cdot (x_l - \mu_x)$ in comparison to the variances of these terms. Note that very efficient computer implementations can be realized, because most calculations can be done during the pattern presentation phase already: the similarity measure $R$ and all its constituents $H$, $W$, $X$, and some other terms are computed then, and they can be stored for each prototype for later reuse in the update phase. Again, the normalization of the relevance factors to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is advised in order to avoid numerical instabilities and to make different training runs comparable.

Two final practical issues concern the handling of singular expressions and the choice of the exponent $k$. Singular states occur if the variance of one of the vectors vanishes. If, for example, a pattern vector $\mathbf{x}$ is affected, then the simplified limit correlation

$$\lim_{x_l \to \mu_x} \frac{\sum_{l=1}^{d} (x_l - \mu_x) \cdot (w_l - \mu_w)}{\sqrt{\sum_{l=1}^{d} (x_l - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} \approx \lim_{x \to \mu_x} \frac{(x - \mu_x) \cdot \sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot (x - \mu_x)^2 \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \frac{\sum_{l=1}^{d} (w_l - \mu_w)}{\sqrt{d \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = \frac{0}{\sqrt{d \cdot \sum_{l=1}^{d} (w_l - \mu_w)^2}} = 0$$
is of interest. In this case, all prototypes would end up with zero correlations and impracticable terms $F$. Analogous reasoning holds in the rare case of all-equal prototype components, which, in practice, would occur by inappropriate initialization rather than by the update dynamic; here, prototypes are assumed to be initialized by data instances. By skipping the degenerate constant patterns or prototypes, such unpleasant situations can be effectively avoided. Alternatively, a single randomly picked component can be set to a different value, which, on average, produces the desired state of uncorrelatedness, even for two vectors with simultaneously equal components subject to that modification. The free parameter $k$ influences the speed of convergence and the generalization ability. Integer values in the range $1 \le k < 20$ have been found to be reasonable choices in experiments. Too high values lead to fast adaptation, but sometimes also to over-fitting or to unstable convergence, unless a very small learning rate $\gamma$ is chosen. Good initial exponents are $k = 7$ or $k = 8$, odd or even according to the desire for a C1 or a C0 type classifier. For the presented experiments, training is only a matter of one or two minutes; therefore, systematic parameter searches can be realized.
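Putting the pieces together, the adaptive measure and its derivatives might be sketched as follows. This is a direct transcription of the printed formulas with our own names (not the authors' released code, and not independently re-derived); as noted above, $H$, $W$, and $X$ can be computed once per pattern presentation and reused in the update phase:

    import numpy as np

    def correlation_parts(x, w, lam):
        # shortcuts H, W, X of the relevance-weighted Pearson correlation
        xc, wc = x - x.mean(), w - w.mean()
        l2 = lam ** 2
        return (np.sum(l2 * wc * xc), np.sum(l2 * wc ** 2),
                np.sum(l2 * xc ** 2), xc, wc)

    def R_measure(x, w, lam, C=1.0, k=5):
        # Eq. 2; C = 1 gives the C1 type, C = 0 (with even k) the C0 type
        H, W, X, _, _ = correlation_parts(x, w, lam)
        R_min = 2.0 ** (-k) if C == 1.0 else 1.0
        return (C + H / np.sqrt(W * X)) ** (-k) - R_min

    def R_derivatives(x, w, lam, C=1.0, k=5):
        # gradient vectors dR/dw_j and dR/dlambda_j, transcribed from the text
        H, W, X, xc, wc = correlation_parts(x, w, lam)
        d, l2 = x.size, lam ** 2
        R_hat = C + H / np.sqrt(W * X)
        F = -k * R_hat ** (-k - 1) / (W * X) ** 1.5
        dH_dw = l2 * xc - np.sum(l2 * xc) / d
        dW_dw = 2.0 * l2 * wc - 2.0 * np.sum(l2 * wc) / d
        dR_dw = F * (dH_dw * W * X - H * dW_dw * X)
        dH_dl = 2.0 * lam * wc * xc
        dW_dl = 2.0 * lam * wc ** 2
        dX_dl = 2.0 * lam * xc ** 2
        dR_dl = F * (dH_dl * W * X - H * dW_dl * X - H * W * dX_dl)
        return dR_dw, dR_dl

These two functions play the roles of d_dw and d_dlam in the generic update step sketched in Section 2; since $R$ is already small for good matches, it plugs in like a distance without any further sign flip.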
4 Experiments
Three experiments underline the usefulness of correlation-based classification. First, a proof of concept is given for the Tecator benchmark data, a set of absorbance spectra. Then the focus is put on cDNA array classification: in the second experiment, a distinction between two types of leukemia is sought from gene spotting intensities; this involves the classification of 72 complete cDNA microarrays, i.e. 7129-dimensional gene expression intensity vectors, and the rating of these genes for their relevance to classification. The last experiment detects systematic differences between two series of gene expression records by analyzing two corresponding sets of 7-dimensional expression patterns with 1421 genes each.
4.1 Tecator spectral data
The first data set, publicly available at http://lib.stat.cmu.edu/datasets/tecator, contains 215 samples of 100-dimensional infrared absorbance spectra. The task is to predict the binary fat content, low or high, of meat by spectrum classification, using random data partitions of 120 training patterns and 95 test patterns, as suggested in [12]. It is known that the Euclidean distance is not appropriate for raw spectrum comparison, and the question of interest is whether Pearson correlation yields any benefit.
Fig. 2. Upper panel: sample spectra from Tecator data set. Lower panel: GRLVQ-C1 relevance profiles.
Figure 2, top panel, shows some of the spectra with their corresponding classes. Apart from a tendency towards dints around channel 41 for high fat content, a substantial visual overlap of the two classes can be stated. This is reflected in the results for Euclidean-based classifiers: k-nearest neighbor (k-NN) reaches its best classification results of about 80% accuracy for $k = 3$; GRLVQ with squared Euclidean metric reaches 88% on the test set at 94% training accuracy. With correlation-based classification, however, the problem gets easier, and results above 97% are obtained. For comparison, k-NN with maximum-correlation neighborhood is taken, k-NN-C for short. Table 1 contains the average numbers of misclassifications, each classifier trained 25 times. In contrast to k-NN-C, which utilizes all available training data for classification, GRLVQ-C requires only 20 prototypes per class, trained for 500 epochs. The learning rates for both types C0 and C1 without relevance learning are $\gamma^\lambda = 0$ and $\gamma^+ = \gamma^- = 10^{-8}$, and the exponents are $k = 6$ and $k = 5$, respectively. In additional runs, relevance learning is switched on by choosing a relatively large non-zero learning rate $\gamma^\lambda = 10^{-7}$ while the prototype adaptation rates are kept at $\gamma^+ = \gamma^- = 10^{-8}$.
         GRLVQ-C1        GRLVQ-C0        1-NN-C          3-NN-C          7-NN-C
         no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.   no rel.  rel.
         3.32     2.04   2.4      2.16   3.96     2.52   5.04     2.84   6.4      3.24

Table 1. Tecator correlation-based classification results for the test set. Average numbers of misclassifications are shown for 25 runs of each classifier. k-NN-C utilizes the maximum-correlation neighborhood. Relevance utilization is indicated by 'rel.', denoting metric adaptation for GRLVQ-C and application of relevance-rescaled data for k-NN-C.
To summarize Tab. 1, GRLVQ-C classification is superior to k-NN-C. The differences between C0 and C1 accuracies become visible with respect to relevance learning: while C0 does not profit much from relevance adaptation, C1 clearly does, as some of its runs ended up with no misclassification at all. The bottom panel of Fig. 2 shows individual and average relevance profiles for the C1-classifier. As an intuitive result, the apparent discriminators at particular channels of the data, such as channel 41, get amplified, while less important channels are suppressed. Although k-NN-C decreases in performance for larger k-neighborhoods, its results can be improved by transforming the input data according to the scaling factors shown in the relevance plots. The GRLVQ-C scaling weights can thus be used to boost k-NN-C classification accuracies, which underlines a more general validity of the found data scaling properties. To conclude, sparse and accurate GRLVQ-C classifiers are obtained for the Tecator data set without further data preprocessing. The built-in relevance detection yields highly interpretable results, which, in the following, helps to identify key regulators in gene expression experiments.
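The relevance-rescaled k-NN-C used for comparison can be sketched in a few lines; 'lam' denotes relevance factors taken from a trained GRLVQ-C model, and the function name is our own:

    import numpy as np

    def knn_c_predict(X_train, y_train, X_test, lam=None, k=3):
        # k-NN with maximum-correlation neighborhood (k-NN-C); optionally
        # rescale all inputs by GRLVQ-C relevance factors first
        if lam is not None:
            X_train, X_test = X_train * lam, X_test * lam
        Xc = X_train - X_train.mean(axis=1, keepdims=True)
        Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)  # non-constant rows assumed
        preds = []
        for x in X_test:
            xc = x - x.mean()
            r = Xc @ (xc / np.linalg.norm(xc))       # Pearson correlations
            votes = y_train[np.argsort(-r)[:k]]      # k most correlated samples
            vals, counts = np.unique(votes, return_counts=True)
            preds.append(vals[np.argmax(counts)])
        return np.array(preds)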
4.2 Leukemia cancer type detection
The second task is gene expression analysis, where the GRLVQ-C property of automatic attribute weighting is used for gene ranking. Data is taken from cDNA arrays, which are powerful tools for probing in parallel the expression levels of thousands of genes extracted from organic tissue cells. A very important issue in gene expression analysis is the identification of functionally relevant genes. Particularly, medical diagnostics and therapies profit from the isolation of small sets of candidate genes responsible for defective or mutative operations. In cancer research, many well-documented data sets and publications are available online. One of the discussed problems is the differentiation between two types of leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Background information is provided by Golub et al. [2]. The corresponding data sets and further online material are found at http://www.broad.mit.edu/. The available data contains real-valued expressions of 7129 genes (some redundant) for each of the 38 training samples (27 ALL, 11 AML) and of the 34 test samples (20 ALL, 14 AML). In order to distinguish correlation from anti-correlation, GRLVQ-C1 is considered. Training has been carried out for a minimalistic GRLVQ-C1 model, using only one prototype per class, which prevents over-fitting in the 7129-dimensional space. Learning rates are $\gamma^+ = \gamma^- = 2.5 \cdot 10^{-3}$, the exponent $k = 2$ is taken, and 1000 epochs are trained. Table 3 shows that the average results of 100 runs with only prototype adaptation are rather poor in contrast to the neighborhood analysis method of Golub et al.; however, allowing relevance adaptation at $\gamma^\lambda = 5 \cdot 10^{-8}$, the GRLVQ-C1 accuracies are drastically improved.
Gene-#   Found   ID       Name
1745     Yes     M16038   LYN Yamaguchi sarc. vir. rel. oncog. homolog
1834     Yes     M23197   CD33 antigen (differentiation antigen)
1882     Yes     M27891   CST3 Cystatin C (amyloid cerebral hemorrhage)
2354     Yes     M92287   CCND3 Cyclin D3
4190     No      X16706   FOS-related antigen 2
4211     No      X51521   VIL2 Villin 2 (ezrin)
4847     Yes     X95735   Zyxin
5954     No      Y00339   CA2 Carbonic anhydrase II
6277     No      M30703   Amphiregulin (AR) gene
6376     Yes     M83652   PFC Properdin P factor, complement

Table 2. Leukemia list of candidate genes for differentiating between types AML and ALL. The 'Found' column indicates whether the specific gene is confirmed as identified by the team of Golub et al.
Thus, the obtained relevance factors must explain correlative differences between AML and ALL. Yet, this statement does not claim biological truth. For validation purposes, the genes have been ranked according to their relevance values for those 19 of 100 independently trained classifiers that showed perfect results on training and test set. A list of the top ten genes with the highest sum of their 19 ranks has been extracted and matched against a longer list of fifty genes given by Golub and his group. The results are given in Tab. 2. Remarkably, 6 of the identified genes are consistent with the reference list. The event of finding at least 6 matches has a vanishing probability of

$$P = \sum_{k=6}^{10} \binom{50}{k} \cdot \binom{7129 - 50}{10 - k} \Big/ \binom{7129}{10} \approx 1.8 \cdot 10^{-11}$$

for randomly selected genes. This means that, although only ten instead of fifty genes are considered for brevity here, these genes are confirmed as being of special importance.
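The quoted value is the upper tail of a hypergeometric distribution (10 genes drawn from 7129, at least 6 hitting the 50-gene reference list) and is easy to re-check numerically:

    from math import comb

    # P(at least 6 of 10 randomly drawn genes fall into the 50-gene list)
    P = sum(comb(50, k) * comb(7129 - 50, 10 - k)
            for k in range(6, 11)) / comb(7129, 10)
    print(P)  # approx. 1.8e-11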
Set     GRLVQ-C1 (rel.)   GRLVQ-C1 (no rel.)   Neigh.-Analysis
train   0.19              5.96                 2
test    2.1               7.14                 5

Table 3. Leukemia data set classification results. Average numbers of misclassifications are shown for 100 runs of each GRLVQ-C1 classifier. Relevance utilization is indicated by 'rel.'. The results for the neighborhood analysis (done a single time) are taken from Golub et al. [2].
Fig. 3. Leukemia data embedded by correlation-based multi-dimensional scaling. Left: original data set. Right: data set after application of GRLVQ-C1 relevance factors. Class prototypes are indicated by large symbols.
The functional role of the other 4 identified genes might still be of interest from a biological point of view, but this remains open for investigation in extra studies. Finally, the grouping of all 72 data samples is visualized together with the GRLVQ-C prototypes by correlation-based multi-dimensional scaling (HiT-MDS) [11] in Fig. 3. For the original data, shown in the left panel, there is a rough unsupervised separation of the 7129-dimensional gene expression vectors according to their type, AML or ALL. The corresponding GRLVQ-C1 prototypes define a tight data separation boundary which is still imperfect due to noisy data grouping. However, after the rescaling transform utilizing the GRLVQ-C1 relevance factors, much clearer data clusters are obtained, as shown in the right panel of Fig. 3. This visual aid re-emphasizes how the curse of dimensionality is effectively circumvented by using an adaptive metric driven by the available auxiliary class information.
4.3 Validation of gene expression experiments
The third study is connected to developmental gene expression series obtained from macroarrays. Expression patterns of 1421 genes were collected from filial tissues of barley seeds during 7 developmental stages between 0 and 12 days after flowering, in steps of two days. For control purposes, each experiment has been repeated using 2 sets of independently grown plant material. The question of interest is whether a systematic difference can be found in the gene expression profiles resulting from the two experimental series. Thus, 1421 data vectors in 7 dimensions are considered for each of the two classes related to series 1 and series 2.
Set     GRLVQ-C0 (3 pt.)   GRLVQ-C1 (5 pt.)   GRLVQ-Euc (4 pt.)   GRLVQ-Euc (40 pt.)
train   68.07%             66.91%             53.14%              68.38%
test    64.95%             66.44%             49.66%              58.32%

Table 4. GRLVQ classification accuracies for differentiating between 2 series of macroarray experiments. Numbers of used prototypes are given in brackets.
Random partitions into 50% training and 50% test sets are trained for 2500 epochs in 25 runs for each classification model: GRLVQ-C0 with $k = 8$, GRLVQ-C1 with $k = 7$, and GRLVQ-Euc. The exponents have been determined in a number of runs as a compromise between speed of convergence, related to small exponents, and over-fitting, observed for high values. Table 4 contains the average results of the classifiers with optimum model size. GRLVQ-C0 uses only 3 prototypes, 1 for series one and 2 for series two. This asymmetry has proved to be beneficial for classification. Likewise, GRLVQ-C1 makes use of 2 and 3 prototypes for series one and two, respectively. The squared Euclidean GRLVQ-Euc yields near-chance results for 2 prototypes per class; accuracies get better for 20 prototypes per class, but then the generalization is rather poor. All classifiers but the small Euclidean one indicate a detectable difference between the two series of experiments. However, the GRLVQ-C1-classifier that maintains the opposition of correlation and anti-correlation is a good choice with respect to model size and generalization ability. The relevance profiles for the three classifiers are given in Fig. 4. They show a rough correspondence in identifying relevant developmental stages within a range of 4–8 days. However, the details must be considered with care, because different configurations of the relevance profiles are found to lead to $E_{\mathrm{GRLVQ}}$ cost function minimization, especially for the C1 type. Thus, the data space is too homogeneously filled to emphasize specific dimensions clearly. This is also reflected in the small variability of the relevance factors $\lambda_i$; in this case, larger relevance learning rates produce unstable solutions. Nevertheless, a biologically consistent interpretation of the relevance profiles has been found: further biological investigations have supported a slight shift in assigning developmental stages between the two sets of independent experiments. In the conducted gene expression experiments, a robust transcriptional reprogramming occurred during the intermediate stage related to days 4–8 of filial tissue development. Although the overall expression data between the two sets of experiments are hardly distinguishable in practice, the slight systematic influence depending on the assignment of developmental stages affects gene expression during this intermediate phase. These slight differences were detected and could be well exploited by the GRLVQ classifiers, which confirms their use for processing biological observations.
Fig. 4. GRLVQ relevance profiles over the developmental stages that enhance the distinction of the two experimental gene expression series. From top to bottom: Euclidean, C0, C1. Different characteristics occur depending on the underlying similarity measure.
5 Conclusions
Adaptive correlation-based similarity measures have been successfully derived and integrated into the existing mathematical framework of GRLVQ learning. The experiments with the GRLVQ-C classifiers show that there is much potential in using non-Euclidean similarity measures. GRLVQ-C1 maintains the opposition of correlation and anti-correlation, while GRLVQ-C0 opposes both characteristics to dis-correlation, which leads to structurally different classifiers. The GRLVQ-C0 pattern matching is somewhat analogous to the property of Hopfield networks that do not distinguish a pattern from an inverted copy of it. By the utilization of Pearson correlation, no preprocessing is required to obtain independence of data contrast related to scaling and shifting. As a consequence, in comparison to the Euclidean metric, fewer prototypes are necessary to represent certain types of data: it has been shown that the functional Tecator data and the gene expression classification profit from using correlation measures. High sensitivity to specific differences in the data is realized, and very good classification results are obtained. Many other areas of GRLVQ-C application can be thought of, ranging from image processing to mass spectroscopy, areas which profit from relaxed pattern matching in contrast to strict metric-based classification. A very important property of the proposed types of correlation measures, C0 and C1, is their adaptivity for enhanced data discrimination from a global data perspective. As has been shown by the experiments, relevance scaling helps to find interesting data attributes and thereby drastically increases the classification accuracies for high-dimensional data. Even for standard methods, like the demonstrated k-NN-C and the MDS visualization technique, the expressiveness can be much improved if the data is preprocessed with the GRLVQ-C scaling factors. Yet, an interesting theoretical problem remains, apart from the practical benefits: to what extent can the large margin properties of GRLVQ-Euc be transferred to the new correlation measures of GRLVQ-C? This and a number of further issues will be addressed in future work. GRLVQ-C is available online as an instance of SRNG-GM at http://srng.webhop.net/.
Acknowledgments
Thanks to Dr. Volodymyr Radchuk for macroarray hybridization experiments. Gratitude to Prof. Wolfgang Stadje, University of Osnabrück, for his solution to the combinatorial probability of accidentally identified relevant genes. Many thanks also for the precise statements of the anonymous reviewers. The present work is supported by BMBF grant FKZ 0313115, GABI-SEED-II.
References

[1] G. Biau, F. Bunea, and M. Wegkamp. Functional Classification in Hilbert Spaces. IEEE Transactions on Information Theory, 51(6):2163–2172, 2005.

[2] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, Oct 1999.

[3] B. Hammer, M. Strickert, and T. Villmann. On the Generalization Ability of GRLVQ Networks. Neural Processing Letters, 21:109–120, 2005.

[4] B. Hammer and T. Villmann. Generalized Relevance Learning Vector Quantization. Neural Networks, 15:1059–1068, 2002.

[5] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12:936–947, 2001.

[6] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition, 2001.

[7] F. Rossi, B. Conan-Guez, and A. El Golli. Clustering functional data with the SOM algorithm. In Proceedings of ESANN 2004, pages 305–312, Bruges, Belgium, April 2004.

[8] F. Rossi, N. Delannay, B. Conan-Guez, and M. Verleysen. Representation of functional data in neural networks. Neurocomputing, 64:183–210, March 2005.

[9] A. Sato and K. Yamada. Generalized Learning Vector Quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS), volume 7, pages 423–429. MIT Press, 1995.

[10] M. Strickert, N. Sreenivasulu, W. Weschke, U. Seiffert, and T. Villmann. Generalized Relevance LVQ with Correlation Measures for Biological Data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks (ESANN), pages 331–338. D-side Publications, 2005.

[11] M. Strickert, S. Teichmann, N. Sreenivasulu, and U. Seiffert. High-Throughput Multi-Dimensional Scaling (HiT-MDS) for cDNA-Array Expression Data. In W. Duch, J. Kacprzyk, E. Oja, and S. Zadrożny, editors, Artificial Neural Networks: Biological Inspirations – ICANN 2005, Springer Lecture Notes in Computer Science, pages 625–633, 2005.

[12] N. Villa and F. Rossi. Support vector machine for functional data classification. In Proceedings of ESANN 2005, pages 467–472, Bruges, Belgium, April 2005.

[13] T. Villmann, F. Schleif, and B. Hammer. Supervised Neural Gas and Relevance Learning in Learning Vector Quantization. In T. Yamakawa, editor, Proc. of the Workshop on Self-Organizing Networks (WSOM), pages 47–52, Kyushu Institute of Technology, 2003.