B RIEFINGS IN BIOINF ORMATICS . VOL 7. NO 1. 37^ 47
doi:10.1093/bib/bbk003
Propagating uncertainty in microarray data analysis Magnus Rattray, Xuejun Liu, Guido Sanguinetti, Marta Milo and Neil D. Lawrence Submitted: 22nd July 2005; Received (in revised form): 23rd November 2005
Abstract Microarray technology is associated with many sources of experimental uncertainty. In this review we discuss a number of approaches for dealing with this uncertainty in the processing of data from microarray experiments. We focus here on the analysis of high-density oligonucleotide arrays, such as the popular Affymetrix GeneChipÕ array, which contain multiple probes for each target. This set of probes can be used to determine an estimate for the target concentration and can also be used to determine the experimental uncertainty associated with this measurement. This measurement uncertainty can then be propagated through the downstream analysis using probabilistic methods. We give examples showing how these credibility intervals can be used to help identify differential expression, to combine information from replicated experiments and to improve the performance of principal component analysis. Keywords: microarray; Affymetrix GeneChipÕ; probabilistic model; gene expression; bayesian inference; principal component analysis
INTRODUCTION Microarrays provide a practical method for measuring the expression level of thousands of genes simultaneously [1, 2]. This technology is associated with many significant sources of experimental uncertainty, which must be considered in order to make confident inferences from the data. In this review we focus on high-density oligonucleotide arrays, such as the popular Affymetrix GeneChipÕ array [2], which have multiple probes associated with
each target. The probe-set can be used to measure the target concentration and this measurement is then used in the downstream analysis to achieve the biological aims of the experiment, e.g. to detect significant differential expression between conditions, or for the visualisation, clustering or supervised classification of data [3, 4]. Most currently popular methods for the probelevel analysis of Affymetrix arrays only provide a
Corresponding author. Magnus Rattray, School of Computer Science, University of Manchester, Manchester, M13 9PL, UK. Tel: þ44 161 275 6187; Fax: þ44 161 275 6204. E-mail:
[email protected] Magnus Rattray gained his PhD in 1996 from the Department of Computer Science at the University of Manchester, where he is now Senior Lecturer. He works on the theory and application of statistical learning and probabilistic modelling, with current applications in phylogenetics, microarray data analysis and comparative genomics. Xuejun Liu gained her MSc in Computer Science from Nanjing University of Aeronautics and Astronautics in 2002. She is now studying for a PhD in the School of Computer Science at the University of Manchester. Her research is on the application of probabilistic models for microarray data analysis. Guido Sanguinetti gained a DPhil in Mathematics from Wadham College, Oxford in 2003. He is currently a Research Associate at the Department of Computer Science at the University of Sheffield and his research focuses on probabilistic modelling and on the propagation of uncertainty through microarray analysis. Marta Milo gained her PhD in 2000 from the University of Naples ‘Federico II’. Dr Milo is currently a Wellcome Trust Research Fellow jointly in the Department of Biomedical Science and the Department of Computer Science at the University of Sheffield. Her main interests are understating regulatory gene networks in mammalian biological systems by merging microarray data analysis with biological data. Neil Lawrence gained his PhD in 2000 from the Computer Laboratory at the University of Cambridge. Dr Lawrence is currently a Senior Lecturer in the Department of Computer Science at the University of Sheffield. His main interest is machine learning algorithms based on probabilistic models and their application in speech, graphics, vision and bioinformatics. ß The Author 2006. Published by Oxford University Press. For Permissions, please email:
[email protected] 38
Rattray et al.
single point estimate that summarises the target concentration [5–8]. Yet the probe-set also contains much useful information about the uncertainty associated with this measurement. By using probabilistic methods for probe-level analysis it is possible to associate gene expression levels with credibility intervals that quantify the measurement uncertainty associated with the estimate of target concentration within a sample [9, 10]. This within-sample variance is a very significant source of uncertainty in microarray experiments, especially for relatively weakly expressed genes, and we argue that this information should not be discarded after the probelevel analysis. In this review we describe some recent developments in probabilistic probe-level analysis and discuss methods for propagating measurement uncertainty through various stages of the downstream microarray data analysis. The article is structured as follows. We first review some recent developments in probabilistic probe-level analysis that allow an estimate of target concentration to be associated with a level of measurement uncertainty. We then describe some methods for propagating this measurement uncertainty through a probabilistic downstream analysis. In particular, we show how to identify differentially expressed genes, combine biological replicates and improve the performance of a probabilistic version of principal component analysis (PCA).
PROBABILISTIC PROBE-LEVEL ANALYSIS Affymetrix chips have short specific oligonucleotide probes that are tethered to the surface of the array. Target cRNA is fluorescently labelled and hybridised to the array so that the amount of target binding to a particular probe can be measured by analysing the resulting two-dimensional image. Each probe is 25 bases long and each gene is typically represented by 11–20 probe-pairs referred to as a probe-set. A probe pair consists of a perfect match (PM) sequence identical to the target sequence and a mismatch (MM) sequence with a complementary middle base. The MM probe is designed to measure non-specific hybridization, which occurs when target sequences bind to probes that are not perfectly complementary. Probe-level analysis is concerned with measuring the expression level of a gene given intensity levels from its associated probe-set.
There are a large number of methods that have been developed for probe-level analysis. The most popular are MAS 5.0 [5], MBEI [6], RMA [7] and GCRMA [8], while many other methods can be found listed on the AffyComp website [11]. Most of the currently popular methods do not provide a level of uncertainty in the measured expression levels. The MBEI model implemented in the dChip software [12] was one of the first to include methods for attaching errors or uncertainties to gene expression measurements. The model assumes a Gaussian distribution for the difference between PM and MM intensities, and the authors show how the standard error of the parameter estimate can be obtained and used in downstream analyses [12]. By only modelling the difference between PM and MM one loses some potentially important information. The MAS 5.0 software, provided by Affymetrix [5], uses a non-parametric test to determine P-values indicating whether or not a probe-set is measuring a significant amount of signal. However, it is not clear how to use this quantity in the downstream analysis, except to discard results from probe-sets that are considered insignificant. More recently, probabilistic models have been developed which describe the joint distribution of PM and MM intensities. The BGX (Bayesian Gene Expression) group [13] have developed a hierarchical Bayesian model for probe-level analysis. They use Markov chain Monte Carlo (MCMC) methods in order to sample from the posterior distribution of the mean expression level associated with each probe-set [10] (Appendix 1 provides a brief overview of Bayesian inference methods). This method can provide a good approximation to the posterior distribution under the assumed model, although the MCMC algorithm can be prohibitively slow for large data sets [9]. The PUMA (propagating uncertainty in microarray analysis) group [14] have developed gMOS [15] (gamma model of oligonucleotide signal) and more recent modified versions mgMOS and multimgMOS [9] which are described in Appendix 2. These models use Gamma distributions to model the positive probe intensities and a computationally efficient method is used to approximate the posterior distribution of expression level. The probe-specific parameters are integrated out of the likelihood so that the number of parameters appearing in the final likelihood expression does not scale with the size of the probe-set, resulting in a computationally efficient
Propagating uncertainty in microarray data analysis
Distribution of
(a)
(b)
(c)
0.2
2
8
0.15
1.5
6
0.1
1
4
0.05
0.5
2
0
−40
−20
39
0
0 4
5
6
0 7.5
7
8
8.5
Figure 1: The posterior distribution of mean logged expression level for a spike-in gene 38734 at in the Affymetrix U95 Latin Square data set, estimated using multi-mgMOS. The concentrations are increasing from left to right: (a) 0 pM, (b) 8 pM and (c) 256 pM.
Posterior Signal Distribution In Figure 1 we show the inferred posterior distribution for the logged target concentration measured by one probe-set using a spike-in data set where gene concentrations are controlled. In the left plot there is no target RNA in the sample and the posterior has a long left-hand tail indicating that most of the probability is associated with a very low predicted expression level (note that small concentrations correspond to negative logged values). Due to the large experimental uncertainty associated with weakly expressed genes, one cannot completely exclude the possibility that the target is present at a low concentration and the breadth of the posterior probability distribution reflects this uncertainty. The plots in the middle and on the right show the results for increasing amounts of target. As the concentration is increased, the expression level is measured with greater certainty and the posterior approaches a peaked symmetrical distribution.
Credibility intervals In Figure 2 we show the estimated expression level and 5–95% credibility intervals associated with one gene from a data set derived from mouse back skin
(a) Normalised expression level
optimisation scheme. Although we only present results using multi-mgMOS in this review, the methods for downstream analysis that we describe later can make use of any method that provides reasonable credibility intervals for measured expression levels.
(b)
2
2
1
1
0
0
−1
−1
−2
−2
−3
−3
−4
0
10
20
−4 0
10
20
Days after birth
Figure 2: Temporal profile of gene Dab2 in the dataset from Lin et al. [16]. On the left we show the mean posterior estimate and 5^95% credibility intervals calculated using multi-mgMOS for a single replicate at each time point. On the right we show the mean posterior estimate and 5^95% credibility intervals after combining three biological replicates. The dashed line is the result from quantitative real-time PCR. The results show that time-points with low expression levels are typically associated with less confidence and that biological replicates can be used to reduce large credibility intervals.
during hair-follicle morphogenesis [16] (credibility intervals are defined in Appendix 1). The results in the left-hand plot are obtained using a single replicate for each time-point. The credibility intervals contain 90% of the posterior probability and the data points are the median posterior estimates of logged expression level. These results are for one of eight genes that were observed to be hair-cycle associated. The dashed line shows the results from quantitative
40
Rattray et al.
real-time PCR (q-PCR) validation experiments. Notice that the credibility intervals tend to be larger for experiments in which the gene is relatively weakly expressed. Although the pattern is consistent with the q-PCR results, the credibility intervals suggest that some care is required in interpreting these results.
propagating measurement uncertainties through a probabilistic model. We give two examples of how to apply this general framework. In the first example we consider how measurement uncertainties can be included in the analysis of biological replicates. In the second example we describe a modified version of PCA that can include information about measurement uncertainties.
PROPAGATING MEASUREMENT UNCERTAINTY
Identifying Changes in Concentration between Individual Samples
To make the most accurate inferences, in principle one should combine all stages of analysis within a single probabilistic model. The high density of probes on modern chips makes this approach impractical for typical sized data sets. Therefore, different stages of the analysis as usually carried out separately. Below we consider different levels of analysis making use of measurement uncertainties derived from a probabilistic probe-level analysis. First we consider the most basic task of identifying changes in target concentration between individual chips. Then we describe a general framework for
In Figure 3 we show results of applying multimgMOS to two chips from an artificial spike-in data set where a large number of genes have been spiked into the sample at known concentrations [17] (chips S1 and C1 were used). On the left of Figure 3 the genes are ranked according to log-ratio of estimated mean expression level and the horizontal lines show the 5–95% credibility intervals. Genes that genuinely differ in concentration have been marked with a box. It can be observed that only two of the top 100 genes identified solely in terms of log-ratio are correctly identified as differing in concentration
(b) 1
20
20
Probabilistic ranking
Log-ratio ranking
(a) 1
40
60
40
60
80
80
100
−10
0
10
100
Log-ratio
−2
0
2
Figure 3: Data from chips C1and S1in the spike-in data set from Choe et al. [17] were processed using multi-mgMOS in order to obtain the mean posterior estimate of the log-ratio of expression level between chips for each gene, along with the 5^95% credibility intervals.On the left is the ranking of genes (top100 shown) based only on the point estimate of log-ratio. On the right is the ranking of genes based on the posterior probability that the sign of the estimated log-ratio is correct.Genes that have genuinely different concentrations are indicated with a box.
Propagating uncertainty in microarray data analysis
41
essential to take the credibility intervals into account when using a probabilistic probe-level analysis method to identify differential expression. One can balance the sensitivity and specificity of the probabilistic ranking by changing the size of credibility intervals. A receiver operator characteristic (ROC) curve provides a useful way to visualise the trade-off. ROC curves are often used in the assessment of probe-level analysis methods [17, 18]. The area under the ROC curve (AUC) provides a useful measure of performance. On the left of Figure 4 we show a ROC curve for all pairs of comparisons between different conditions for the same spike-in data set used in Figure 3 [17]. The results using a simple fold-change criterion to determine differential expression are clearly inferior to results using the probabilistic approach. This means that evaluations based simply on point estimates of fold change, such as those used by the current version of the popular AffyComp benchmark [18], are not suitable for evaluating probabilistic probe-level analysis methods. We return to this issue in the discussion. In practice one is usually more interested in either comparing replicated conditions or in using the probe-level results for further downstream analysis. Next we give examples of how the results from a probabilistic probe-level analysis can be propagated through these sorts of analysis.
using a simple log-ratio score. Using this score would therefore result in many false positives. The credibility intervals of 98 genes intersect the zero line, indicating that there is no strong confidence that any of these genes have different concentrations in the two samples. Notice that the credibility intervals of two of the genes with genuinely differing concentration also cross the zero line, indicating that these would be considered false negatives using a probabilistic criterion with a cut-off chosen at this level of credibility. Similarly, the two non-spiked-in genes with credibility intervals that do not cross the zero lines are false positives using this criterion. Identifying differential expression without replicates is difficult and errors will be made by any method. On the right of Figure 3 we use a probabilistic ranking of the genes that takes the credibility intervals into account. The genes are ranked according to how much posterior probability there is, that the sign of the predicted log-ratio is correct. This is equivalent to gradually reducing the credibility intervals and ranking genes in the order with which credibility intervals no longer include the zero line. A similar ranking is obtained by using the posterior variance estimates to construct a z-score (the two methods are equivalent for Gaussian posterior probabilities). The top genes in this ranking are all true positives. These results show that it is 1
1
(b)
(a)
0.9
0.95
0.8
True positive rate
0.9 0.7 0.6
0.85
0.5
0.8
0.4
0.75
0.3 0.7
mmgMOS+Bayesian (0.9431) mmgMOS+Cyber–T (0.9310) GCRMAqn+Cyber–T (0.9367) GCRMA+Cyber–T (0.8811)
0.2 mmgMOS+PPLR (0.9127) mmgMOS foldchange (0.6012)
0.1 0
0.65
0
0.2
0.4
0.6
False positive rate
0.8
1
0.6
0
0.2
0.4
0.6
0.8
1
False positive rate
Figure 4: ROC curves are shown for the spike-in data set from Choe et al. [17] with AUC values given in the key. On the left we show results from comparing all pairs of chips from different conditions using a simple fold-change ranking and a probabilistic ranking (PPLR). On the right we show results using the three replicates for each condition. We compare results using multi-mgMOS and GCRMA for probe-level analysis, either using probe-level variance information as described in the main text (mgMOS þ Bayesian) or using a standard regularised t-test (Cyber-T). Results for GCRMA are shown with and without quantile normalization.
42
Rattray et al.
Propagating Uncertainty through a Probabilistic Model In order to keep useful information from the probelevel analysis one can use probabilistic models for downstream analysis that allow for the inclusion of a measurement variance term for each probe-set. For tractability and convenience we typically assume a Gaussian distribution for the measurement uncertainty fitted to the mean and variance of the posterior distribution derived from the probe-level analysis. The Gaussian approximation may break down for very weakly expressed genes, as observed in Figure 1, but these genes will be associated with very large variances and will generally not be expected to play an important role in any downstream analysis of microarray data. Let y^ in be the measured logged expression level for gene n in sample i. This can be modelled as, y^ in ¼ yin þ in
with
in Nð0; 1 in Þ;
where yin is a probabilistic model and vin is the zeromean Gaussian measurement noise with variance 1 in (written as an inverse precision) determined by the probe-level analysis. Next we give some examples of how this measurement noise affects performance for different choices of model. The main technical difficulty in dealing with the additional measurement error term is that the noise has a chip and gene specific variance.
Combining Information from Biological Replicates The simplest possible model is an independent univariate Gaussian for each gene yin Nðn ; n2 Þ. A useful application of this model is for combining biological replicates. The variance term n2 captures the variance of the expression level of gene n across biological replicate samples. This parameter may be shared across groups if the number of replicates per group is very low. After propagating measurement noise through the model one obtains, y^ in Nðn ; n2 þ 1 in Þ:
The parameters n and n2 then have to be estimated from the data. In fact we are mainly interested in the posterior distribution of the mean n, which can be used to summarise knowledge about the mean expression level of replicates. The posterior
variance in this mean summarises our measurement uncertainty about this quantity. This variance is influenced both by the measurement uncertainty from the probe-level analysis and by the inherent additional variability of the replicate samples due to other biological and experimental factors. There are a number of possible priors that one can choose for the parameters. Here, we choose a simple hierarchical scheme, with an inverse-Gamma prior on the variance and a Gaussian prior on the mean, with hyperparameters optimised using a variational approximation [19] (details and code are provided on the PUMA website [14]). One can then obtain the mean and variance of the posterior distribution of n. On the right of Figure 2 we show the effect on the 5–95% credibility intervals in the mouse data example when three biological replicates are combined. There is now greater confidence in the predicted temporal profile, which is still consistent with the q-PCR result. It is straightforward to identify differentially expressed genes between experiments from different classes of sample. For technical replicates within each class this can be done at the probe-level analysis stage by sharing the signal parameter between replicates and then computing the posterior distribution for the log-ratio of expression level between classes [9, 10]. For biological replicates it is more reasonable to model each set of replicates using the above Gaussian model. The resulting model can be used to compute the posterior probability for the log-ratio between classes. On the right of Figure 4 we show the result of using this approach on a spike-in data set with three replicates for each condition [17]. The AUC is clearly improved in comparison to not using replicates (left of Figure 4). For comparison, we have included results using a regularised t-test as provided by the popular Cyber-T software [20]. We have also included results using the popular GCRMA software for probe-level analysis [8] with or without quantile normalisation [21]. All methods were subjected to the same form of probe-set level loess normalisation as used by Choe et al. [17], in order to make results comparable. It can be observed that the Bayesian approach that takes probe-level variances into account performs well and has the largest AUC. It performs particularly well when the number of false positives is low, which is important when providing an accurate ranked list of significant genes.
Propagating uncertainty in microarray data analysis
Improved Principal Component Analysis PCA is a very popular method for reducing the dimensionality of high-dimensional data, either to aid visualisation or as a pre-processing stage before further clustering or classification. It has been applied extensively in microarray data analysis [22, 23]. One perceived advantage of PCA is that it can deal effectively with noise. However, in its standard form PCA implicitly assumes that all measurements are associated with similar levels of noise. This assumption is clearly violated for microarray data and this may explain why pre-processing by PCA can result in the degraded performance of clustering methods [24]. Variance-stabilising transformations of microarray data aim to bring variances onto a comparable scale for all genes [25], yet such transformations treat all experiments equally and therefore cannot account for differences in measurement uncertainty for the same gene in different samples. An alternative approach is to recast PCA as a probabilistic model and propagate measurement uncertainty through the model [26]. A probabilistic interpretation of PCA provides the model through which the measurement uncertainty can be propagated. The model is known as probabilistic PCA (PPCA) and is defined by [27], yin ¼ i þ
q X
Wik zkn þ "in
with
k¼1
zn Nð0; IÞ; "n Nð0; IÞ:
The maximum likelihood solution for the parameter matrix W has columns that span the principal subspace defined by the eigenvectors of the sample covariance associated with the q largest eigenvalues, where q is typically much smaller than the original data dimensionality. The conditional mean for the latent variables zn provide an equivalent q-dimensional projection of the data as for standard PCA. An advantage of the probabilistic interpretation is that it allows several extensions of the basic model. For example, Bayesian PPCA has been used for the estimation of missing values in microarray data [28]. By including measurement noise in PPCA it is possible to effectively denoise a microarray data set. This has been shown to improve the performance of subsequent hierarchical clustering [26]. Although
43
the maximum likelihood (ML) parameters can no longer be obtained in closed form, an iterative EM-algorithm can be used to find the ML parameters. An advantage of this modified version of PCA is that the influence of consistently noisy genes is automatically down-weighted by the associated large measurement uncertainties obtained from the probe-level analysis. Here we show results from applying this method to the visualisation of a data set containing samples from different types of tumour [29]. In the original study, PCA was applied to 1000 genes that were considered to exhibit significant variation across the data set. The results showed poor clustering in the three-dimensional principal subspace identified by PCA. A subset of 50 genes were subsequently selected that were highly informative about each tumour type in the data set. PCA analysis of these 50 genes showed a clear clustering in the threedimensional principal subspace. It is unlikely that any unsupervised method (i.e. a method that does not use class labels) could outperform this approach since the knowledge of sample classification was used to select the 50 informative genes. We can therefore consider PCA on this subset as a ‘gold-standard’ with which to compare an alternative approach. We reprocessed these 50 genes and found that 90% of samples are closest to a sample of the same class in the three-dimensional subspace obtained by standard PCA. In order to compare the performance of PCA with our modified version of PPCA we randomly selected 1000 genes from the data set (we did not have access to the original 1000 genes). Under standard PCA, only 43% of samples are closest to a sample of the same class in the three-dimensional subspace, while for modified PCA the number increases to 71%, indicating a much more coherent grouping of similar samples. The modified version of PCA therefore greatly improves the clustering within the three-dimensional subspace, although still falling short of the gold standard obtained by only considering the 50 most informative genes. In Figure 5 we compare the projections obtained by each method. The modified PCA model provides a better grouping for sample classes and the 3rd principal component in particular seems more informative in discriminating between medulloblastoma (þ symbol) and malignant glioma (s symbol), while standard PCA has these two groups mixing in the three-dimensional projection.
Rattray et al. 2
3rd Principal component
2nd Principal component
44
Standard PCA 1 0 −1 −2 1
2
3
4
2
Standard PCA
1
0 −1
5
1
2 PUMA PCA 1 0 −1 −2
2
2.5
3
2
3
4
5
1st Principal component 3rd Principal component
2nd Principal component
1st Principal component
3.5
1st Principal component
1 PUMA PCA 0.5 0 −0.5 −1
2
2.5
3
3.5
1st Principal component
Figure 5: Visualisation of microarray data from a study of different classes of embryonic tumours of the central nervous system. Different sample classes are represented by different symbols: medulloblastoma (þ), malignant glioma (s), normal cerebella (x), supratentorial primitive neuroectodermal tumour () and atypical teratoid/rhabdoid tumour () (see Pomeroy et al. for details [29]). Projections onto the first three principal components are shown for data containing the expression levels of1000 randomly selected genes.The top row shows results from standard PCA and the bottom row shows results from PCA modified to take measurement uncertainty into account.
DISCUSSION We have described some methods for determining the measurement uncertainty associated with highdensity oligonucleotide arrays, and how these uncertainties can be propagated through the downstream analysis of microarray data. We believe that other popular analysis tools can also be adapted to include information about measurement uncertainties using a similar approach. Many microarray data analysis methods already make use of probabilistic models [16, 30–32] and new probabilistic models are being developed for other popular methods such as agglomerative clustering [33]. It has been shown that replicate variance can be used to improve the performance of probabilistic clustering methods [16, 34] and we believe that the inclusion of probe-level uncertainties will further improve the performance of clustering algorithms when the number of replicates is limited. An important distinguishing feature of Bayesian inference methods is that they are not limited to determining a single point estimate for a quantity of interest. This has important implications when considering how to evaluate their performance and
not taking this feature into account can result in misleading conclusions [9]. Different point estimates are associated with varying levels of bias and variance, and which point estimate is most appropriate depends on the biological question being addressed. For example, we might choose the posterior mean of the logged target concentration as a point estimate. This estimate would be associated with very low bias, which means that it provides a good estimate of logged concentration averaged over replicates. However, the same estimate may be associated with a large variance for weakly expressed genes, in the sense that variation across replicates is large. If this point-estimate is used to compute logratios for weakly expressed genes, then these logratios may have a large magnitude. However, they will also be associated with broad credibility intervals and cannot be taken as strong evidence for significant fold-changes. Bayesian methods can easily be adapted to determine more useful point estimates for identifying significant log-ratios, such as the probabilistic rank that was used in Figure 3. AffyComp is a popular benchmark for Affymetrix probe-level analysis [11, 18], but this benchmark
Propagating uncertainty in microarray data analysis only currently allows for methods that provide a single point estimate of expression level. We would therefore argue that this benchmark does not currently provide an appropriate evaluation of probabilistic predictions. New criteria could be introduced to address this limitation, for example by allowing the submitter to decide on their own criteria for ranking differential expression.
APPENDIX 1: PROBABILISTIC INFERENCE Standard textbooks provide an introduction to probabilistic inference and its applications in bioinformatics [4, 35]. Here we briefly review some of the key concepts. Bayes theorem provides us with the fundamental tool for carrying out probabilistic inference. We wish to apply a probabilistic model with parameters to a data set D. The data likelihood P(D|) is the probability of the data given the parameters. The prior probability P() is the probability we assign to different choices of model parameters before viewing the data. The posterior probability P(|D) is the probability we assign to different choices of the model parameters after viewing the data. Bayes theorem is an equation that relates these three quantities, PðjDÞ ¼ R
PðDjÞPðÞ : PðDjÞPðÞd
The denominator ensures that the probability is properly normalised and integrates to one over the allowed parameter range. The posterior probability assigns a probabilistic weighting to all alternative choices for the model parameters given the data. In a hierarchical scheme, the prior may itself have some free parameters and these are referred to as hyperparameters. A popular point estimator for the model parameters is the maximum a posteriori (MAP) estimate, or posterior mode, which is the value of that maximises P(|D). This is the most likely specific choice of parameters given the available data. As well as being a sensible choice, this is also a convenient choice because it can be obtained by simple maximization of the numerator in Bayes theorem and this can often be carried out using standard optimization algorithms. If the prior probability is equal for all choices of parameters then this is equivalent to the ML estimator, which is another
45
popular choice. In some cases the mode may not be representative of the bulk of the posterior distribution, e.g. it may be at a boundary of the allowed parameter range for a very asymmetrical distribution. Other point estimators, such as the posterior mean or median (in one dimension), might then be preferred. Of course, one always loses some information by summarizing the posterior distribution by a point estimate. In order to compute the MAP parameter estimate it is not necessary to evaluate the integral in the denominator of Bayes theorem. However, in order to compute other probabilistic quantities of interest, it will be necessary to evaluate this integral and other closely related integrals. For example, in order to summarise a unimodal posterior distribution for a scalar parameter we may want to compute the 5-95% credibility interval 2 ½a; b of the parameter. This requires a solution to the following two equations, Z
Z
a
PðjDÞd ¼ 0:05 1
1
PðjDÞd ¼ 0:05: b
Evaluating integrals of this type is the technical problem that makes Bayesian inference difficult in practice. For many useful models these integrals cannot be evaluated exactly and one must resort to approximate methods. Luckily, with increases in computational resources and the development of improved methods this is becoming feasible for increasingly large inference problems. The approximate inference method used in BGX is MCMC using a combination of Gibbs [36] and Metropolis [37, 38] sampling. MCMC is an iterative method for providing samples from a posterior distribution. A sequence of parameters is generated by a Markov process and eventually provides a correlated sample from the posterior distribution. An advantage of this approach is that it is very flexible and can be applied to a broad range of problems. The main drawback with the method is that many samples may be required in order to get an accurate representation of the posterior distribution, so that the computational cost can be prohibitively high. It may also be difficult to assess whether the chain has properly converged. The approximate inference method used in mgMOS and multi-mgMOS [9] is a variant of the Laplace approximation [39]. The posterior is approximated by a Gaussian distribution with mean equal to the MAP estimate. The covariance is
46
Rattray et al.
obtained from the curvature of the logged posterior at this point. The signal parameter is constrained to be positive and therefore a truncated Gaussian is used. The Laplace approximation often works well for low-dimensional parameter spaces as long as the amount of data is not too small relative to the dimension. Other approximate inference methods have also been developed, such as variational methods [19] which approximate the posterior by a simpler distribution that allows the inference problem to become more easily tractable. This is the approximation scheme used for hyperparameter estimation when combining replicates that is discussed in the main text.
APPENDIX 2: multi-mgMOS In the original gMOS model, MM and PM are a combination of two Gamma distributed variables representing signal and background with a shared scale parameter [15]. The modified model mgMOS includes a Gamma prior over the scale parameter so that it becomes a latent variable. The multi-mgMOS model described here shares the latent scale parameter across corresponding probe-pairs in different chips in the data set [9]. Let ygjc and mgjc represent the PM and MM intensities respectively, for probe j in probe-set g on chip c. The model is defined by, ygjc Gaðagc þ gc ; bgj Þ; mgjc Gaðagc þ gc ; bgj Þ; bgj Gaðcg ; dg Þ:
The parameter gc determines the specific hybridization signal component, agc represents the background and non-specific hybridization, while models the effect of specific binding to the MM probes. The scale parameters bgj capture probespecific effects. Using the same notation as Appendix 1, the data are D ¼ {ygjc, mgjc} and the parameters are ¼ {agc, gc, , cg, dg}. The latent variables bgj are not considered parameters because they are integrated out of the likelihood P(D|). This means that the number of parameters does not scale with the number of probes in the probe-set, so that the MAP solution for the parameters can be found by fast gradient-based optimisation. The expected logged expression level can be written as a simple monotonic function of gc and it is therefore sufficient to find the percentiles of the posterior distribution of this parameter. The percentiles of P(gc|D) can be
used to find the percentiles of the logged expression level and sampling can be used to find the mean and variance. Key Points Probe-level analysis of Affymetrix GeneChipÕ arrays can be used to associate gene expression measurements with credibility intervals. This measurement uncertainty can be propagated through a probabilistic downstream analysis. Examples given included identification of differential expression and PCA. The approach can be used for many other types of analysis if appropriate probabilistic models are developed.
Acknowledgements MR, GS and NDL gratefully acknowledge a BBSRC award ‘‘Improved processing of microarray data with probabilistic models’’. MM is supported by an Advanced Training Fellowship from the Wellcome Trust.
References 1.
Schena M, Shalon D, Davis RW, et al. Quantitative monitoring of gene-expression patterns with a complementary-DNA microarray. Science 1995;270: 467–70. 2. Lockhart DJ, Dong HL, Byrne MC, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996;14:1675–80. 3. Slonim DK. From patterns to pathways: gene expression data analysis comes of age. Nat Genet 2002;32:502–8. 4. Baldi P, Brunak S. Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press, 2001. 5. Affymetrix. Microarray Suite User Guide Version 5.0. Affymetrix Inc, Santa Clara, CA, 2001. 6. Li C, Wong W. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Pro Nat Acad Sci USA 2001;98:31–6. 7. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4:249–64. 8. Wu Z, Irizarry RA, Gentleman R, et al. A model based background adjustment for oligonucleotide expression arrays. J Amer Statistical Assoc 2004;99:909–17. 9. Liu X, Milo M, Lawrence ND, etal. A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics 2005;21:3637–44. 10. Hein A.-MK, Richardson S, Causton HC, et al. BGX: a fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data. Biostatistics 2005;6:349–73. 11. AffyComp. A Benchmark for Affymetrix GeneChip Expression Measures, http://affycomp.biostat.jhsph.edu/ [21 November 2005, date last accessed]. 12. Li C, Wong W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2001;2:0032.1–0032.11.
Propagating uncertainty in microarray data analysis 13. BGX. Flexible Bayesian Clustering and Partition Models for Gene Expression Data, http://www.bgx.org.uk/ [21 November 2005, date last accessed]. 14. PUMA. Propagating Uncertainty in Microarray Analysis, http://www.bioinf.man.ac.uk/resources/puma [21 November 2005, date last accessed]. 15. Milo M, Fazeli A, Niranjan M, et al. A probabilistic model for extracting of expression levels from oligonucleotide arrays. Biochem SocTrans 2003;31:1510–12. 16. Lin KK, Chudova D, Hatfield GW, et al. Identification of hair cycle-associated genes from time-course gene expression profile data by using replicate variance. Proc Natl Acad Sci USA 2004;101:15955–60. 17. Choe SE, Boutros M, Michelson AM, et al. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control set. Genome Biology 2005;6:R16. 18. Cope LM, Irizarry RA, Jaffee HA, et al. A benchmark for affymetrix GeneChip expression measures. Bioinformatics 2004;20:323–31. 19. Jordan MI (ed.) Learning in Graphical Models. MIT Press, MA: Cambridge, 1998. 20. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes’. Bioinformatics 2001;17:509–19. 21. Bolstad BM, Irizarry RA, Astrand M, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19: 185–93. 22. Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Nat Acad Sci USA 2000;97:10101–6. 23. Holter NS, Mitra M, Maritan A, et al. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc Nat Acad Sci USA 2000;97:8409–14. 24. Yeung KY, Ruzzo WL. Principal component analysis for clustering gene expression data. Bioinformatics 2001;17: 763–74. 25. Huber W, von Heydebreck A, Sultmann H, et al. Variance stabilization applied to microarray data calibration and to
26.
27. 28.
29.
30. 31.
32.
33.
34.
35.
36.
37.
38. 39.
47
the quantification of differential expression. Bioinformatics 2002;18:S96–S104. Sanguinetti G, Milo M, Rattray M, et al. Accounting for probe-level noise in principal component analysis of microarray data. Bioinformatics 2005;21:3748–54. Tipping ME, Bishop CM. Probabilistic principal component analysis. J Roy Stat Soc B-Stat Methodol 1999;61:611–22. Oba S, Sato M, Takemasa I, et al. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003;19:2088–96. Pomeroy SL, Tamayo P, Gaasenbeek M, etal. ‘Prediction of central nervous system embryonal tumour outcome based on gene expression’. Nature 2002;415:436–42. Girolami M, Breitling R. Biologically valid linear factor models of gene expression. Bioinformatics 2004;20:3021–33. Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. J Comput Biol 2000;7: 601–20. Schliep A, Steinhoff C, Schonhuth A. Robust inference of groups in gene expression time-courses using mixtures of HMMs. Bioinformatics 2004;20:i283–i289. Heller KA, Ghahramani Z. Bayesian hierarchical clustering. Proc of the 22nd Int Conf on Machine Learning, Bonn, Germany 2005. Medvedovic M, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics 2004;20:1222–32. Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis. Cambridge University Press, UK: Cambridge, 1998. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. J Am Stat Assoc 1990;85: 398–409. Metropolis N, Rosenbluth AW, Rosenbluth MN, et al. ‘Equation of state calculations by fast computing machines’. J Chem Phys 1953;21:1087–92. Hastings WK. ‘Monte Carlo sampling methods using Markov chains and their applications’. Biometrika 1970;57:97–109. Kass R, Raftery A. Bayes factors. JAm Stat Soc 1995;90:773–95.