A conditional density error model for the statistical analysis of ...

Report 2 Downloads 26 Views
Vol. 18 no. 8 2002 Pages 1064–1072

BIOINFORMATICS

A conditional density error model for the statistical analysis of microarray data Brad Love 1,∗, David R. Rank 1,2, Sharron G. Penn 1,2, David A. Jenkins 1,3 and Russell S. Thomas 1,3 1 Aeomica,

928 East Arques Avenue, Sunnyvale, CA 94085, USA

Received on October 22, 2001; revised on January 30, 2002; accepted on February 15, 2002

ABSTRACT Motivation: In many microarray experiments, relatively few intra- and inter-array replicate measurements are made due to significant cost limitations and sample availability. Compounding this problem is a lack of robust statistical methods for analyzing gene expression data with limited experimental replicates. As a result, the interpretation of the results of these experiments are difficult with little understanding of the probability of type I and type II errors. Results: The variability in a series of replicate microarray measurements was modelled using a combination of parametric and non-parametric methods. A 3-dimensional surface was created for the conditional distribution of the variability given the mean signal intensity in both the Cy3 and Cy5 channels. The results were used as the basis for developing statistical methods for analyzing gene expression data with limited experimental replicates. Availability: The statistical analysis scripts are available upon request. Contact: [email protected]; [email protected]

INTRODUCTION The development of DNA microarray technology has significantly altered our view of biology and the way we approach both research and clinical challenges. We now have the capability to monitor genome-wide expression changes and use this information to understand fundamental cellular processes, identify targets for drug intervention, and aid in the diagnosis and treatment of disease (Iyer et al., 1999; Golub et al., 1999; Zanders, 2000). However, cost constraints and limited sample sizes reduce the number of experimental replicates that can be performed in many microarray experiments. In addition, ∗ To whom correspondence should be addressed at Informax Inc., 7600 Wisconsin Avenue, Suite 1100, Bethesda, MD 20814, USA 2 Pressent address: Kalypsys, Inc., 11099 N. Torrey Pines Road, Suite 200, La Jolla, CA 92037, USA 3 Pressent address: Amersham Biosciences, 928 East Arques Avenue, Sunnyvale, CA 94085, USA

1064

many of the current procedures for handling microarray data do not adequately address issues of quality control or statistical inference. The limited replicates and lack of quality control reduce the ability to derive statistical significance for observed changes and restricts the interpretation of the results. If this technology is to have widespread experimental and clinical applications, then sound statistical approaches must be integrated into the data analysis process. To date, robust statistical methods for analyzing microarray data are still in the process of being developed (Zhang, 1999; Dudoit et al., 2000; Hughes et al., 2000; Kerr et al., 2000; Lee et al., 2000; Baldi and Long, 2001; Tusher et al., 2001). Methods for the analysis of microarrays with little replicate data have started to be introduced in the literature (Long et al., 2001; Nadon et al., 2001; Kepler et al., 2001). In this study, we attempt to address some of these microarray analysis issues by taking a fundamentally different approach. First, the intra and interarray variability in gene expression measurements was characterized with a series of replicate experiments. Second, a conditional density error model was constructed based on the replicate experiments in order to relate variability to the magnitude of expression. Finally, various statistical methods were developed for applying the conditional density error model to subsequent microarray experiments thereby allowing analysis of data sets with limited replicate measurements. An example is presented using a typical microarray experiment analyzed for quality control thresholds, gene expression confidence intervals, significant expression, and determining statistical power.

METHODS Microarray construction and hybridization To construct the DNA microarrays, potential genes within the human genome were identified, amplified by PCR, and arrayed on glass slides using previously published techniques (Penn et al., 2000). Each PCR product was spotted twice on each slide, one on the left half of the slide and the other on the right half, for c Oxford University Press 2002 

Conditional density error model and microarray data

intra-array replicate analysis. Poly-A selected RNAs were purchased from Clontech (Palo Alto, CA) and used to synthesize labelled cDNA probe with a standard reverse transcriptase reaction with the incorporation of Cy3- or Cy5-dCTP (Amersham Pharmacia, Piscataway, NJ Penn et al., 2000). After hybridization, the slides were scanned with a microarray scanner (Avalanche model, Molecular Dynamics, Sunnyvale, CA) and the fluorescence data was analyzed using the ArrayVision software package (Version 6.0, Imaging Research, St. Catharines, Ontario, Canada). Analysis of the gene expression measurements was performed on background subtracted integrated signal in relative flourescent units (RFUs), which are relative measurements of photonic energy. For characterizing the intra- and inter-array variability and developing the conditional density surface, a series of 8 hybridizations were performed. These 8 hybridizations consisted of 4 replicate experiments of Cy3 labelled brain cDNA versus Cy5 labelled liver cDNA and 4 replicate experiments of Cy3 labelled liver cDNA versus Cy5 labelled brain cDNA. Subsequent application of the analysis was performed on an example microarray hybridization of Cy3 labelled heart cDNA versus Cy5 labelled skeletal muscle cDNA. In addition, the replicate agreement thresholds were evaluated using additional hybridizations of varying image qualities. The high quality hybridization was Cy3 labelled thymus cDNA versus Cy5 labelled skeletal muscle cDNA. The normal hybridization was Cy3 labelled ovary cDNA versus Cy5 labelled skeletal muscle cDNA. The poor hybridization was Cy3 labelled fibroblast cDNA versus Cy5 labelled lung cDNA.

Statistical analysis platform The statistical analysis was implemented in Matlab (Version 6.1 Release 12.1; MathWorks Inc., Natick, MA) in the Windows 2000 environment. The software ran on a Dell workstation with dual Xeon 1.7 GHz processors and 1 GB of RAM. A typical complete analysis took approximately 1.5 hours to perform. Conditional density surface Within the replicate data set, a total of 8 observations were available per gene in each tissue and each dye. Half of the 8 observations were interarray replicates from separate hybridizations and half were intra-array replicates from duplicate spots. For example, 4 microarrays with 2 spots per array were hybridized with brain Cy3 labelled cDNA. The raw signal data for each gene was adjusted for background signal by subtraction, checked for intra-array consistency, and normalized to median signal. For verifying intra-array consistency, the signal from the duplicate spot on each microarray was checked for consistent signal using a strict intra-array agreement statistic. The relative

strictness of the intra-array threshold corresponded to a relatively ideal hybridization with minimal variability due to dust, scratches, background and other anomalies. Normalization of the data was performed by multiplying the observed background subtracted signal by the median of all background subtracted signal for all experiments in that channel divided by the median of the background subtracted signal for the microarray it was observed on. To check whether the subset of gene expression signals for which at least 4 observations passed the intra-array threshold can be assumed to come from a normal distribution with unknown mean and variance a Lilliefors’ test with an α = 0.05 was applied (see Sheskin, 2000 for more details). The basis of the conditional density approach relies on estimating the conditional relationship of the standard deviation, σgene given the mean expression of a gene, µgene . This assumes that the standard deviation of the expression is not only due to the underlying sequence being investigated, but also due to the signal magnitude. These two quantities result in the observed signal intensity of a gene following hybridization. To define this relationship, the conditional distribution was estimated nonparametrically using a second order Epanechnikov kernel for the standard deviation given the mean expression of each gene using the data from the inter- and intra-array replicate measurements. The nonparametric method used to estimate the conditional density surface allowed for minimal assumptions to be made about the resulting shape of the density and the Epanechnikov kernel function was chosen for its optimal asymptotic properties (M¨uller, 1984; Fan and Gijbels, 1996). The Epanechnikov kernel estimated the conditional density at the point  σ given  µ using all the observed data points (µi , σi ) for i = 1, 2, . . . , n, within a window as given by b1 and b2 , as      µ−µi  σ −σi 1 n K K i=1 nb1 b2 b1 b2   f σ |µ ( µ,  σ) = (1)   µ−µi n 1 K i=1 nb1 b1 where, 3 (1 − z 2 )1{|z|1} 4  1, if |z|  1 = 0, otherwise.

K (z) =

(2)

1{|z|≤1}

(3)

and

Here K (z) is the second order Epanechnikov kernel function and b1 and b2 are the bandwidth parameters for  µ and  σ respectively. The bandwidth parameters for  µ and  σ are smoothing variables that can either be chosen by the user or selected based on the data. There are several automated methods for choosing the bandwidths such as minimizing mean squared error, but the choice 1065

B.Love et al.

of smoothing parameters will influence the resulting estimate. For this analysis, bandwidths were chosen to be 1/30th of the range of its corresponding variable, where the range is the difference between largest observed value and the smallest. To prevent potential boundary effects, we restricted the estimation to a square grid incorporating the minimum and maximum of the observed data (in both variables), i.e. max( µ) = max(µ1 , . . . , µn ), etc.. . .

Gene expression confidence intervals and significant expression Using the conditional density surface, gene expression confidence intervals can be calculated for subsequent microarray experiments of various microarray design, i.e. microarrays with different genes laid down can have confidence intervals calculated for them using the standard deviation information contained in the surface. In our example, the background subtracted signals in each dye channel were used to sample a standard deviation from the corresponding conditional distribution. Given the signal intensity and standard deviation, a Gaussian distribution was sampled for each simulated intra-array replicate observation. This process was repeated for every gene on the microarray and each microarray simulation was repeated k times. Typically, k should be a large number, at least 200, to obtain an adequate measure of variability. In our case, each microarray was simulated 2000 times. This results in nk observations for every spot on the microarray where n is the number of intra-array replicates and is theoretically similar to having repeated the microarray k times to estimate the gene expression value. The k replicates of the microarray simulation were then used to calculate the 95% confidence interval for every µgene on the microarray. This was achieved by estimating the 2.5 and 97.5 percentiles of the k simulations for each gene. It is useful to note that these intervals will be conservative, i.e. wider, since variation will be coming from 2 places, the simulated standard deviation and the simulated observations. For determining differential expression, a Bonferroni adjustment for multiple comparisons can be applied to the associated confidence intervals. In our example, the Bonferroni adjustment was   applied to the percentiles α˜ α˜ at 100 2s and 100 1 − 2s where s is the number of genes tested and α˜ is one minus the desired level of confidence. After adjustment, differential gene expression was determined by looking for nonoverlapping confidence intervals. It should be noted that as s increases, the number of simulations, k, should also increase to keep the estimate of the adjusted confidence interval as reliable as possible. Clearly, these types of comparisons should be performed for only a subset of interesting genes, since as s gets larger α˜ 2s gets smaller. As a rough guide, we suggest that the 1066

number of simulations needed for s expressions should be at least 60s α˜ . For determination of significant expression relative to negative controls, the expression of the negative controls was subtracted from the genes of interest and the percentiles adjusted using the same approach. Genes with confidence intervals that did not contain 0 were determined to be significantly expressed. If confidence intervals for gene expression ratios are required, this can be achieved by simulating each treatment in its corresponding dye channel, calculating the ratio, and then repeating the simulation a total of k times. After the k simulations, the 2.5 and 97.5 percentiles can be estimated and adjusted for multiple comparisons. For ratios, the confidence intervals are evaluated to see if they include 1.

Intra-array replicate agreement thresholds To define intra-array replicate agreement thresholds, the conditional density surface can be used to determine the probability that the intra-array replicates come from the same distribution with a defined mean and standard deviation. For the duplicate pair of observations in our experiment, this is expressed as X 1 , X 2 ∼ N(µ, σ (µ)) where X 1 and X 2 denote the paired replicate expression of gene X , where E(X 1 ) = E(X 2 ) = µ, i.e. the expected value of X 1 and X 2 is µ and σ (µ) is the standard deviation of the gene expression values determined by µ. The minimum and maximum values of the duplicate observations are defined as X min = min(X 1 , X 2 ) and X max = max(X 1 , X 2 ), respectively. The probability distribution functions (pdf) were calculated to determine probabilistically if X max has the same parameters that gave rise to X min . The pdf for X min was first calculated as, f X min (x) = 2 f (x) (1 − F(x)) ,

(4)

the pdf for X max was calculated using, f X max (x) = 2 f (x)F(x).

(5)

The joint distribution for X min and X max can be calculated using the equation, f X min ,X max (xmin , xmax ) = 2 f (xmin ) f (xmax ),

(6)

where f (x) is the pdf of a Gaussian distribution, f (x) = √

1 2π σ (µ)

e

(x−µ)2 2σ 2 (µ)

(7)

and F(x) is the cumulative distribution function (cdf) represented by, 1 F(x) = √ 2π σ (µ)



x

−∞

(y−µ)2

e 2σ 2 (µ) dy.

(8)

Conditional density error model and microarray data

It is useful to note the following functional relationships,  ∞ (y−µ)2 1 µ= √ ye 2σ 2 (µ) dy (9) 2πσ (µ) −∞ and





µˆ 3 = (y−µ)2 2σ 2 (µ)

1 σ 3 (µ) = √ (y − µ)2 e dy (10) 2π −∞ The conditional distribution of X max given X min was calculated as, f X max |X min (xmax | xmin ) =

Finally, using a first order Taylor’s expansion of the function σ in an attempt to improve the estimate of µ, we have,

f (xmax ) . 1 − F(xmin )

(11)

Given the limited number of intra-array replicates, the signal intensity represented by µ is not actually observed for a given gene and must be estimated based on the data. In the case of intra-array duplicate measurements, µ must be estimated based on two observations, either X max or X min . In this case, we chose to estimate µ based on X min since the minimum is less likely to be affected by microarray anomalies that cause a large increase in signal (e.g. dust). However, it should be noted that if the background estimate for one of the replicate pairs is artificially elevated then the maximum would provide the best estimate for µ or if both observations are effected by anomalies, then this method will also fail. In practice, the latter possibility is unlikely since it is assumed that microarray anomalies do not happen often and they affect replicate genes independently. To calculate this relationship, the expectation of X min given µ can be defined as  ∞ x f X min (x)dx E X min |µ (X min ) = −∞  ∞ = 2x f (x) (1 − F(x)) dx −∞

= µ − cσ (µ)

(12)

where c ≈ 0.5641 is estimated using 4-point cubic polynomial integration method (Press et al., 1996). Neither the expectation of X min nor µ are observed experimentally, thereby forcing an estimation of µ using the observed xmin . The simplest estimator is given by, µˆ 1 = xmin .

(13a)

Although this is the simplest estimator of µ since it does not require any additional calculations, it is clearly not the best estimator since it does not use the result in equation (12). The problem with the simple solution is that the input for the function σ is µ which is unknown and trying to be estimated. Plugging in xmin for the argument for σ and solving for µ gives rise to the second estimator, µˆ 2 = xmin + cσ (xmin ).

(13b)

xmin + c(σ (xmin ) − xmin σ  (xmin )) 1 − cσ  (xmin )

(13c)

Note that this estimator can only be used if σ  (xmin ) = d dx σ (x)|x=xmin  = 1/c). Since we derived an estimate of µ based on xmin of any paired replicate of genes, we can now determine the probability that xmax has the same parameters as xmin . Simply put, we can now determine if xmax came from a normal distribution with the same parameters as estimated from xmin . The probability that xmax has parameters µ and σ (µ) is equivalent to, P (X max > xmax | xmin )  ∞ f X |X min (x | xmin )dx = xmax  ∞ f (x) = dx xmax 1 − F(x min ) ∞ f (x)d x x = max 1 − F(xmin ) 1 − F(xmax ) = 1 − F(xmin )

(14)

Furthermore, since the distribution is assumed to be Gaussian, the cdf can be normalized to use the standard normal cdf typically denoted as (z). As a result, we have, 1 − F(xmax ) 1 − F(xmin )   −µˆ i − c 1 − xmax σ (µˆ i )   = xmin −µˆ i 1 − σ (µˆ ) − c

P (X max > xmax | xmin ) =

(15)

i

We can then estimate this p-value by using the previously derived estimator for µ. By using µˆ 2 we have, (X max > xmax | xmin ) P   −xmin −cσ (xmin ) − c 1 − xmax σ (xmin +cσ (xmin ))   = −cσ (xmin ) 1 − σ (xmin − c +cσ (xmin ))

(16)

For short hand, we will refer to this as the PRP value (for Paired Replicate P-value). Critical values can be determined prior to data analysis based on the level of comfort by the user. For example, a typical critical value used is α = 0.05. The test could then calculate the PRP value for each replicate pair of gene expression values and compare it to the critical value. If the PRP value is less 1067

B.Love et al.

than α, we would reject the hypothesis that xmax came from a normal distribution with parameters µ and σ (µ) as estimated from xmin (i.e. reject the hypothesis of replicate agreement). Otherwise, we would fail to reject replicate agreement.

where α is the probability of a type I error, β is the probability of a type II error and z k is the kth percentile of the standard normal distribution. The equation utilizes the conditional density surface to calculate the minimal detectable change, , for a given µ and sample size.

2.5 2 Probability

Power analysis By applying the conditional density surface, statistical power can be estimated for subsequent experiments and gene expression changes. Based on Lilliefors’ test, the observed gene expression signals were normally distributed. Therefore, to detect a difference between two true averages, µ1 and µ2 , the required sample size is given by the standard normality formula,  2 z α2 + z β (σ (µ))2 n (17) 2

_5

x 10 3

1.5 1 0.5 0 0 0.5 1 1.5

5

x 10

2 2.5 3

Standard Deviation

14

16

12

_2

0

2

4

6

8

10

5

x 10 Mean Expression

Fig. 1. The Cy3 conditional distribution for the standard deviation, σ (µ), given the mean gene expression value, µ.

_5

x 10

1068

3 2.5 2 Probability

RESULTS AND DISCUSSION Conditional density surface Plots of the conditional density surfaces for Cy3 and Cy5 are shown in Figures 1 and 2, respectively. Although the bivariate densities were similar between Cy3 and Cy5, the conditional densities showed significant differences due to different distributional properties of the average signal. Specifically, the conditional surfaces were similar between Cy3 and Cy5 until approximately 1 000 000 RFUs where a lack of observations, here observations are the integrated signal of the gene expression, produced an artificial truncation of the Cy3 surface (Fig. 1 and 2). Out of the 5939 observations in Cy3 only 2 had more than 1 000 000 RFUs while in Cy5 out of 5019 observations, 42 were more than 1 000 000 RFUs. It should be noted that the observations above 1 000 000 RFUs were not saturated. Additional observations in this range should help overcome this problem. One assumption of the conditional density surface approach is that the surface remains relatively constant over time. In reality, process changes in the laboratory, modifications in microarray construction, or significant alterations in the hybridization protocol could change the shape of the surface. To overcome this potential limitation, the surface should be periodically checked with additional replicate experiments. If the surface does not change appreciably, the additional replicate information could be added to the existing surface thereby increasing the overall sample size.

1.5 1 0.5 0 0 0.5 1 5

x 10

1.5 2 2.5

Standard Deviation

3

16

14

12

10

8

6

4

2

_2

0 5

x 10

Mean Expression

Fig. 2. The Cy5 conditional distribution for the standard deviation, σ (µ), given the mean gene expression value, µ.

A second assumption of the conditional density approach is that the signal associated with microarray expression studies is normally distributed. For the replicate series of hybridizations, 94.96% of the Liver Cy3 observations, 95.63% of the Brain Cy3 observations, 95.38% of Liver Cy5 observations, and 95.82% of Brain Cy5 observations failed to reject the null hypothesis that the data came from a Gaussian distribution with unknown mean and variance. The small number that did reject the null hypothesis could easily be attributed to random error

Conditional density error model and microarray data

Intra-array replicate agreement thresholds Depending on the type and design of a particular microarray, intra-array replicates are sometimes included to alleviate potential problems associated with dust, local background, or other microarray anomalies. Typically, a quality control threshold is set for replicate agreement and gene expression values which meet this threshold are averaged for further analysis. Unfortunately, many of the thresholds for replicate agreement are set arbitrarily and have a different impact depending on expression (i.e. low expressing genes are more sensitive to random fluctuations). In deriving a statistically uniform threshold for replicate agreement, we chose to determine the probability that the intra-array replicates came from the same distribution with a defined mean and standard deviation. To do this, µ was estimated based on duplicate observations. Three potential estimators, µˆ i for i = 1, 2, 3, were evaluated based on their bias and mean squared error using a simulation-based approach. Comparisons were made at both the single gene level and across multiple genes within a typical hybridization. Results are summarized in Table 1. At the single gene level, a prototypical gene expression measurement was simulated 10 000 times with a µ = 24 616.13 and σ (µ) = 6825.84. The µˆ 2 estimator had the smallest bias of the three and also had the

700

600

500 Number of observations

(i.e. the failure percentage is approximately equal to α). For each replicate hybridization, we analyzed a total of 3744 gene expression signals. Due to the large number of gene expression values analyzed (i.e. approximately 30 000 total) in both dye channels, this is relatively strong evidence for the normality assumption under our experimental conditions. However, it should be noted that this assumption is not essential for the specific application of our approach. The assumption can be weakened as long as the distribution of observations can be characterized through parametric or nonparametric means. The primary requirement is the ability to randomly sample from the resulting distribution using one of a variety of methods (e.g. inverse cdf, acceptance/rejection sampling, Box–M¨uller, etc.). As can be seen in Figures 1 and 2, as the average expression increases its corresponding standard deviation increases as a result. Additionally, as can be seen by the height of the surface, the spread of the variability is larger when the mean is high. Furthermore, in the Cy3 channel, variability can be low across the whole spectrum of signals, while this is not the case for Cy5. This can be seen in Cy5 by the density getting further from zero as the mean expression increases. This can be interpreted as high signals in Cy5 will be consistently more variable, while as signals in Cy3 can, but not necessarily, have a very low amount of variability.

400

300

200

100

0 5,000

10,000

15,000

20,000 25,000 30,000 Expression Level, in RFUs

35,000

40,000

45,000

Fig. 3. A histogram of 10 000 estimated values of µ using the µˆ 2 estimator. The vertical axis shows the frequency of observations. For the single gene simulation µ = 24 616.13 and σ (µ) = 6825.84.

 µˆ ≈ −cσ (µ) as shown in Table 1. Note that Bias 1

smallest MSE (Table 1). As shown in Figure 3, µˆ 2 also provides a relatively unbiased estimate of µ. For multiple genes, a range of expression values related to a typical hybridization were simulated 10 000 times and showed



  that MSE µˆ 2 was significantly smaller than MSE µˆ 1

 or MSE µˆ 3 . All three estimators have approximately the same standard deviation when compared across multiple genes, but µˆ 2 shows a slightly lower standard deviation (Table 1). Therefore, differences in the MSE are largely due to the bias and not the standard deviation of the estimator. Based on Figure 4, estimating µ using µˆ 2 appears relatively robust with very few outliers. The average error for estimating µ based on the multiple gene simulation was only 59 RFUs. When applying this approach, the nature of σ (x) should determine which estimator to use. If σ (x) is either known or σ (x) is expected to be small relative to µ then we would suggest using µˆ 3 . Otherwise, based on this simulation, we would suggest using µˆ 2 . The impact and nature of the bias associated with estimating µ using µˆ 2 was assessed by simulating three single genes with a different µ and σ (µ) and across multiple genes within a typical hybridization. For the three individual genes, the observed PRP values were plotted against the corresponding percentiles (Figure 5). Since these values were based on simulation, they should roughly be equal and form a straight line. In the important rejection region between 0.0 and 0.1, a small divergence from the expected uniform distribution was observed 1069

B.Love et al.

5

2

1

x 10

0.9 1.5

0.8 1

Observed PRP value

0.5

2

Observed Bias of

0.7

0

0.5

0.6

0.5

0.4

0.3 1

0.2 1.5

2

0.1

0 0

10

20

30

40 50 60 Observed Percentile

70

80

90

100

Fig. 4. The error associated with estimating µ using µˆ 2 across multiple genes in a simulated microarray hybridization. The error is represented as µˆ 2 − µ for each gene. Not that this plot is symmetric at 0 for the vertical axis and 50 on the horizontal axis. This shows that estimation errors are unbiased and evenly distributed across a microarray chip and that large errors infrequently happen. Table 1. Simulation-based evaluation of three proposed estimators of µ across a single gene and a range of gene expression values in a typical microarray hybridization. Note that a low expressing gene has an expression of approximately 25 000 RFUs

1

Bias µˆ 1 µˆ 2 µˆ 3

Multiple genes4   Bias Stdev MSE

Single gene 2 3   Stdev MSE

−3775 61 −5166

5708 5729 5711

−5807 59 −6886

46 830 672 32 828 982 59 299 311

13 597 12 865 13 825

218 598 109 165 503 759 238 562 499



1 The Bias is calculated as Bias = E µ ˆ i − µ, where µ is the true

mean and E µˆ i is the expected value of the estimator based on the

the simulation. 2 The Standard Deviation is calculated as



2  Stdev = E µˆ i − E µˆ i . 3 The Mean Squared Error is calculated as 

2  MSE = E µˆ i − µ = Bias2 + Stdev2 . 4 For multiple genes, the Bias, Stdev and MSE were calculated for

each gene across a range of gene expression values in a typical microarray hybridization. The results were then averaged across all values.

for each simulated gene (Figure 5). A divergence from the straight line in this region indicates the relationship between the PRP value and the probability of type 1 error. The impact of the systematic differences can be seen using a critical value of α = 0.05 (i.e. probability of a type 1 error). In this case, we would reject 5.91% 1070

0

0.1

0.2

0.3

0.4

0.5 Percentile

0.6

0.7

0.8

0.9

1

Fig. 5. The impact of estimating µ using µˆ 2 on the probability of type 1 error based on simulating three genes with a different µ and σ (µ). With no bias, the three simulations should form a straight line with slope 1 (solid line). The three genes are represented by a dashed line (µ = 24 616.13; σ (µ) = 6825.84), a dot-dashed line (µ = 184 276.62; σ (µ) = 11 491.38), and a dotted line (µ = 431 213.82; σ (µ) = 33 863.42). The rejection region is located between 0.0 and 0.1 on both axes.

for gene 1, 4.01% for gene 2 and 1.96% for gene 3. If the model was strictly correct, these values should equal approximately 5%. We hypothesize that the reason for this difference is due to estimating µ versus actually observing µ. Clearly, estimating µ from one observation, xmin , has a certain amount of variability associated with it and that variability affects the calculation of the PRP value. We looked at 3 different expression levels and determined that α should be set as follows to correct for systematic bias of expression level: an α = 0.0441 for low expressing genes, an α = 0.0577 for medium expressing genes and an α = 0.0791 for high expressing genes. These αs are the estimated type I error probability of approximately 0.05 based on the 3 simulated genes. For multiple genes across a microarray hybridization, a similar trend was observed. At an α = 0.05 the type I error probability was 0.0559, very close to α, suggesting that over the microarray the type I errors made by different expression levels even out. To set the type I error at 5%, we estimate α = 0.0462 to be the used threshold. After validating the replicate agreement method using a simulation-based approach, the method was applied to a series of three hybridizations selected for a range of image qualities. The image from one hybridization is relatively high quality with little or no visible anomalies. The image on the second hybridization is a typical hybridization with only a limited number of anomalies. The image

Conditional density error model and microarray data

5

1 2

x 10

0.9 1.8

0.8 1.6

0.7 that is Detectable

0.5

0.4

Minimal

Observed PRP value

1.4

0.6

0.3

1.2

1

0.8

0.6

0.2 0.4

0.1 0.2

0

0

0.1

0.2

0.3

0.4

0.5 Percentile

0.6

0.7

0.8

0.9

1

0

0

1

2

3

4

5

6

7

8

9

10 5

x 10

Fig. 6. Application of the intra-array replicate threshold method to three microarray hybridizations of different qualities. The dotdashed line represents the relatively high quality hybridization with little or no visible anomalies (human thymus, Cy3). The dotted line represents a typical hybridization with a limited number of anomalies (human ovaries, Cy3). The dashed line represents a relatively low quality hybridization with a large number of anomalies (human fibroblasts, Cy3).

Fig. 7. Power analysis of four different sample sizes using the conditional density surface for Cy3 (α = 0.05 and β = 0.2). Sample sizes of 2, 4, 8, and 12 are represented by the solid, dashed, dotted, and dot-dashed lines, respectively. Note that both  and µ are in RFUs.

5

2

Power analysis Calculating power is a useful tool in gene expression analysis to understand the number of replicates necessary to statistically discern various alterations in gene expression. If experimental replicates are feasible, the conditional density surface can be used to estimate the power associated with a particular experimental design. A summary of the power analysis is provided in Figures 7 and 8 and shows the relationship between  and µ for sample sizes 2, 4, 8, and 12 in Cy3 and Cy5. Note that with each slide we

1.8

1.6

that is Detectable

1.4

Minimal

of the third hybridization is relatively low quality with a large number of scratches and dust spots. As shown in Figure 6, the percentage of PRP values that indicate replicate agreement is significantly higher in the high quality hybridization while the low quality hybridizations showed a much lower percentage. The typical hybridization with only limited anomalies showed an intermediate replicate agreement profile. At an α = 0.0462, the PRP method removed 5.15%, 12.54%, and 21.47% of the replicates for the relatively high quality, normal, and low quality hybridizations, respectively. Notably, removing 5.15% of the data in the high quality image is approximately equal to the chosen accepted probability of type I error, suggesting that those removed are likely due to randomness versus microarray anomalies.

x 10

1.2

1

0.8

0.6

0.4

0.2

0

0

1

2

3

4

5

6

7

8

9

10 5

x 10

Fig. 8. Power analysis of four different sample sizes using the conditional density surface for Cy5 (α = 0.05 and β = 0.2). Sample sizes of 2, 4, 8, and 12 are represented by the solid, dashed, dotted, and dot-dashed lines, respectively. Note that both  and µ are in RFUs.

have two intra-array observations of the gene expression value in each dye. Large drops in differential detection are achieved when the sample size is increased by a relatively small number of replicate hybridizations. 1071

B.Love et al.

Conclusions In this paper, we present a new framework for the analysis of microarray data that is based on a conditional density surface f σ |µ (µ, σ ) from an initial series of replicate hybridizations. Theoretically, the series of replicate hybridizations only needs to be done once and the conditional density surface derived from this experiment will provide statistical information on the results of future experiments without additional replicate measurements. As a result, this approach to microarray analysis is well suited for experiments with few intra- and inter-array replicates due to cost limitations or sample availability. In addition, if cost or sample size is not a limiting factor, the power analysis techniques can provide sample size estimates prior to performing an experiment so that the desired results can be inferred from the data. It should be noted that this technique is independent of the normalization method used, although clearly better normalization methods will result in better results. Furthermore this method is not restricted to only paired replicates, µ2 can be estimated for as many replicates as provided assuming that Equations (4) and (5) are derived for the number of replicates then Equation (12) would be recalculated for µ2 and Equation (14) would be redone to derive the new PRP value. Potentially, other statistical analysis techniques could be developed using the conditional density approach to further advance the mining of microarray data. ACKNOWLEDGEMENTS The efforts of Aeomica scientists and technical staff to develop the microarrays and perform the hybridizations are greatly appreciated. The comments, questions and patience of the anonymous referees made this a better paper. REFERENCES Baldi,P. and Long,A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509– 519. Dudoit,S., Yang,Y.H., Callow,M.J. and Speed,T.P. (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.html Fan,J. and Gijbels,I. (1996) Local Polynomial Modelling and its Applications. CRC Press, London. Golub,T.R., Slonim,D.K., Tamayo,T., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molecular classification of cancer: class discovery and class pediction by gene expression monitoring. Science, 286, 531–537. Hughes,T.R., Matron,M.J., Jones,A.R., Roberts,C.J. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126. Iyer,V.R., Eisen,M.B., Ross,D.T., Schuler,G., Moore,T., Lee,J.C.F., Trent,J.M., Staudt,L.M., Hudson,J.Jr., Boguski,M.S.,

1072

Lashkari,D., Shalon,D., Botstein,D. and Brown,P.O. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 83–87. Kepler,T.B., Crosny,L. and Morgan,K.T. (2001) Normalization and analysis of DNA microarray data by self-consistency and local regression. http://www.santafe.edu/sfi/publications/00wplist. html Kerr,M.K., Martin,M. and Churchill,G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837. Lee,M.L., Kuo,F.C., Whitmore,G.A. and Sklar,J. (2000) Importance of replication in microarry gene expression studies: statistical method and evidence from repetitive cDNA hybridizations. Proc. Natl Acad. Sci. USA, 97, 9834–9839. Long,A.D., Mangalam,H.J., Chan,B.Y.P., Tolleri,L. et al. (2001) Improved statistical inferences from DNA microarray data using analysis of variance and a bayesian statistical framework— analysis of global gene expression in Escherichia coli K12. J. Biol. Chem., 276, 19937–19944. M¨uller,H.G. (1984) Smooth optimum kernel estimators of regression curves, densities and modes. Annals of Statistics, 12, 766– 774. Nadon,R., Shi,P., Skandalis,A., Woody,E., Hubschle,H., Susko,E., Rghei,N. and Ramm,P. (2001) Statistical inference methods for gene expression arrays. In Bittner,M.L., Chen,Y., Dorsel,A.N. and Dougherty,E.R. (eds), Micorarrays: Optical Technologies and Informatics, Proceedings of SPIE Vol. 4266, San Jose, CA, pp. 46–55. Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Tsui,K.W. (2001) On differential variability of expression rations: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol., 8, 37–52. Penn,S.G., Rank,D.R., Hanzel,D.K. and Barker,D.L. (2000) Mining the human genome using microarrays of open reading frames. Nature Genet., 26, 315–318. Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1996) Numerical Recipes in C. Cambridge University Press, Cambridge. Schuchhardt,J., Beule,D., Malik,A., Wolski,E., Eickerhoff,H., Leharch,H. and Herzel,H. (2000) Normalization strategies for cDNA microarrays. Nucleic Acids Res., 28, e47. Sheskin,D.J. (2000) Handbook of Parametric and Nonparametric Statistical Procedures. Chapman Hall / CRC, New York. Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121. Wildsmith,S.E., Archer,G.E., Winkley,A.J., Lane,P.W. and Bugelski,P.J. (2001) Maximization of signal derived from cDNA microarrays. Biotechniques, 30, 202–208. Worley,J., Bechtol,K., Penn,S., Roach,D., Hanzel,D., Trounstine,M. and Barker,D. (2000) A systems approach to fabricating and analyzing DNA microarrays. In Schena,M. (ed.), Microarray Biochip Technology. Biotechniques Books, Natick, MA, pp. 65– 86. Zanders,E.D. (2000) Gene expression as an aid to the identification of drug targets. Pharmacogenomics, 1, 375–384. Zhang,M.Q. (1999) Large scale gene expression data analysis: a new challenge to computational biologists. Genome Res., 9, 681– 688.