Technical Report no. 471, Department of Statistics, University of Washington

Probabilistic Segmentation and Intensity Estimation for Microarray Images

Raphael Gottardo, Julian Besag, Alejandro Murua and Matthew Stephens

Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195-4322
[email protected]

December 1, 2004

Abstract

Algorithms for image segmentation and intensity estimation are crucial to the successful analysis of cDNA microarray experiments, but most procedures allow little or no flexibility. The methods described here are probabilistic and use Bayesian hierarchical modelling. Our segmentation of images into foreground objects (spots) and background is more realistic than previous approaches in allowing much wider variation in spots, including those with doughnut shapes, which are observed quite frequently in practice. We are also able to penalize the occurrence of small artefacts. Our intensity estimation is robust and avoids the common logical error that estimates of foreground may be less than those of the corresponding background. Markov chain Monte Carlo provides an integrated approach to segmentation and estimation, sampling from the full posterior distribution of all the parameters, which enables us to focus on any quantity of interest. We demonstrate our methods on two publicly available datasets and show that they lead to improved segmentation in comparison with other approaches. Our measures of uncertainty are also useful in flagging low quality spots.

Keywords: Bayesian estimation; cDNA microarrays; Gene expression; Hierarchical-t; Image analysis; Markov chain Monte Carlo; Markov random fields; Measures of uncertainty; Quality filtering; Segmentation; Spatial models

1 Introduction

Recent technological advances enable biomedical investigators to monitor the expression levels of thousands of genes under the same experimental conditions. This technology, known as microarrays, can be used to characterize and classify diseases, design new drugs, and so on. In a microarray experiment, two biological (or experimental) conditions are usually compared in terms of the expression levels of thousands of genes. Two samples, one representing each condition, are labelled with red and green dyes to distinguish them. The experiment produces one or more slides with thousands of spots, with varying degrees of red and green dye quantifying the expression levels of the corresponding genes. The slides are then scanned to provide two grey-scale images, one for the red and one for the green fluorescent dye. Figure 1 shows an example of such an image for one of the data sets explored in this paper. Further details about the microarray technology are given in the next section.

Image analysis is required to provide estimates of the foreground and background intensities of both the red and green dyes for each gene, and is the initial step of any analysis. In order to estimate the intensities, one needs to locate the spots on the images and classify each pixel as part of a spot or as a background pixel. An early statistical treatment of microarray images can be found in Chen et al. (1997). The effect of different image analysis methods was first pointed out by Yang et al. (2002). The estimates of the red and green intensities are the starting point of any statistical analysis, such as testing for differential expression (Newton et al. 2001; Dudoit et al. 2002; Gottardo et al. 2003), discriminant analysis, and clustering (Yeung et al. 2001). There are three main issues in the analysis of microarray images:

• Addressing or gridding, which consists of locating the spots on the array.

• Segmentation, which consists of classifying the pixels either as foreground, i.e. part of a spot, or as background.

• Intensity estimation, which consists of estimating the foreground and background intensities of each spot on the array in each sample.

Estimation of the background intensity is usually considered necessary in order to accurately estimate the amount of cDNA hybridized. It is motivated by the fact that a spot's measured fluorescence intensity includes a contribution that is not due to the hybridization of the mRNA samples to the spotted DNA.
In our experience, gridding is relatively straightforward, so we consider only the segmentation and estimation steps. Some of the newer arrays provide reference marks to better locate the spots, which make gridding even easier.

Most existing segmentation methods are deterministic, in the sense that a pixel is classified either as foreground or as background. Fixed circle segmentation methods fit a circle of fixed radius to all the spots (Eisen 1999). Such methods are usually ineffective, as the spots are rarely all of the same size. In adaptive circle segmentation, the circle's radius is estimated separately for each spot (Buhler et al. 2000; Axon Instruments Inc. 2003). These methods are more flexible and fit the data better; however, some spots have non-circular shapes. Adaptive segmentation methods include seeded region growing (Yang et al. 2002), mathematical morphology (Angulo and Serra 2003), and histogram based methods (GSI Lumonics 1999). Histogram based methods have the disadvantage of not using any spatial information. While adaptive, seeded region growing and mathematical morphology are not flexible enough: since these methods require the background to be connected, they cannot properly segment doughnut shaped spots (Figure 1). Such spots are rather common and are caused by evaporation of the dyes, which induces a flow towards the edge of the spot to replenish liquid lost by evaporation while keeping the contact line in place; the dye material is carried along with this flow. This phenomenon is well known and is sometimes referred to as capillary flow (Deegan et al. 1997).

Once the segmentation has been completed, most methods estimate the foreground and background intensities separately. First the foreground is estimated by computing either the mean or the median intensity of the pixels classified as foreground. Then the background is usually estimated locally for each spot. Some estimation methods use the median intensity

value of neighbouring pixels (Eisen 1999; GSI Lumonics 1999; Imaging Research Inc. 2001; Axon Instruments Inc. 2003). Others form background images (Yang et al. 2002; Brändle et al. 2003) from the raw images and use them to extract background information for each spot. One problem with these approaches is that they can produce negative background-corrected estimates, which are not only physically impossible but also cause problems when computing log ratios. One is often interested in the log ratio of the intensity estimates from the two samples, as it measures the difference (on the log scale) between the two samples. A log (base 2) ratio of one is sometimes used as a rule of thumb for selecting differentially expressed genes (Schena et al. 1995; Yang et al. 1999). It would seem more satisfactory to have a method that estimates the foreground and background intensities at the same time and constrains the difference to be positive; the log estimates and log ratios would then be well defined.

In this paper, we present a hierarchical Bayesian model for the analysis of microarray images, in which both segmentation and estimation, and the associated errors, are modelled simultaneously. The model is more flexible than existing approaches, allowing the proper segmentation of spots of widely varying shapes. Estimation is made robust by the use of hierarchical-t errors, and the estimated background-corrected intensities are guaranteed to be positive. In addition to the full posterior distribution of the parameters, we obtain the full posterior distribution, and hence measures of uncertainty, for derived quantities of interest such as log ratios. Using experimental data, we demonstrate how these measures of uncertainty, combining information from both segmentation and estimation, can be useful in filtering out low quality spots.

The paper is organized as follows. Section 2 describes the microarray technology and the creation of the images. Section 3 introduces the data structure and the notation.
In Section 4 we present the Bayesian hierarchical model used to segment the images and estimate the intensities. In Section 6, we apply our model to experimental data, illustrate the methodology and introduce two methods for quality filtering. In Section 7, we compare our method to three popular software packages for the analysis of microarray images. Finally, in Section 8 we discuss our results, some possible extensions and the limitations of our methodology.
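The log-ratio problem described above can be made concrete with a small numerical sketch. The values below are purely illustrative: when foreground and background are estimated separately, the background estimate can exceed the foreground estimate, leaving a negative corrected intensity for which the log ratio is undefined.

```python
import numpy as np

# Hypothetical per-spot summaries (foreground and local background means)
# for the red and green channels; the numbers are illustrative only.
red_fg, red_bg = 820.0, 150.0
green_fg, green_bg = 240.0, 260.0   # background exceeds foreground here

red = red_fg - red_bg       # positive corrected intensity
green = green_fg - green_bg # negative: log2 is undefined for this spot

# Separate estimation can therefore leave the log (base 2) ratio
# M = log2(red / green) uncomputable, which is what motivates modelling
# foreground and background jointly with a positivity constraint.
M = np.log2(red / green) if red > 0 and green > 0 else np.nan
```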

2 From Microarray Slides to Images

cDNA microarrays (Schena et al. 1995) consist of thousands of individual DNA sequences printed in a high density array on a glass microscope slide by a robotic arrayer. A microarray works by exploiting the ability of a given labelled cDNA molecule to bind specifically to, or hybridize to, a complementary sequence on the array. By using an array containing many DNA samples, scientists can measure, in a single experiment, the expression levels of hundreds or thousands of genes within a cell by measuring the amount of labelled cDNA bound to each site on the array. In a typical two-colour microarray experiment, two mRNA samples, from control and treatment situations, are compared for gene expression; treatment is taken in a broad sense to mean any condition different from the control. Both mRNA samples, or targets, are used to produce cDNA, which is labelled using different fluorescent dyes (red and green), then mixed and hybridized with the arrayed DNA sequences. When hybridization is complete, the slide is washed and laser excitation of the slide is used to yield a luminous emission that is then measured by a scanning microscope, which converts the luminous emission into electric current. One common type of detector is a photomultiplier tube (PMT). The amount of amplification can be adjusted by varying the PMT's voltage input, which is typically set so that the brightest pixels are just below the level of saturation (Yang et al. 2002). The PMT voltage input can have a large impact on the final results: if it is set too low, the lowest intensity spots will not be detected, and if it is set too high, the highest intensity spots will be saturated. Fluorescence measurements are made with a microscope that illuminates each spot and





Figure 1: Three blocks of one of the HIV raw images. The whole image contains 12 blocks; each block is formed by 16 × 40 spots. At the bottom we have enlarged two portions of the image containing several artefacts not caused by hybridization of the probes to the slide. Some spots are doughnut shaped, with larger intensity on the perimeter of the spot.


Figure 2: Illustration of our basic gridding algorithm. The image is projected onto the x-axis and y-axis. The troughs in the two projections define the lines of the grid.

measures fluorescence for each dye separately, thus providing a measure of the relative mRNA abundance of the gene in the two varieties. For each experiment, the scanning process provides two 16-bit TIFF grey-scale images, one for each dye, containing several million pixels with raw intensities ranging from 0 to 65535. The raw intensities in each pixel represent the total fluorescence in the region corresponding to that pixel. See Yang et al. (2002) and Sebastiani et al. (2003) for further details about the technology.
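As a minimal illustration of this representation, a 16-bit grey-scale scan can be modelled as an unsigned 16-bit integer array whose values span 0 to 65535. The array below is synthetic, not microarray data; it only demonstrates the dynamic range and the notion of saturated pixels mentioned above.

```python
import numpy as np

# Synthetic stand-in for one 16-bit grey-scale TIFF channel.
rng = np.random.default_rng(0)
image = rng.integers(0, 2**16, size=(64, 64), dtype=np.uint16)

# Raw intensities are constrained to [0, 65535] by the 16-bit encoding.
lo, hi = int(image.min()), int(image.max())

# Pixels at the 16-bit ceiling are saturated and carry no usable
# intensity information; a high PMT voltage increases their number.
saturated = int(np.sum(image == 65535))
```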

3 Data Structure and Notation

A microarray slide generally consists of several gene blocks corresponding to the printing mechanism. An image will generally contain several rectangular blocks of 100-1000 spots: those considered here contain 12 blocks of 16 × 40 spots. Because the blocks are distant from each other, we can analyze them separately (Figure 1).

3.1 Gridding

For each block, we first obtain a rough estimate of the position of each spot by finding a rectangular grid such that each rectangle is of about the same size and contains one spot. The method we use is similar to that of Yang et al. (2002). As this is a straightforward task, we only describe the main idea behind the algorithm. Gridding is done independently for each block. For a given block, we need to locate its upper left hand and lower right hand corners. These positions need only be approximate, and in our case were found manually. The image portion representing the block is then projected onto the x-axis and the y-axis. Each projection will usually look like a series of peaks (representing the spots) separated by troughs (regions of low intensity between spots). Finally, the grid is formed by drawing a line in each trough. Figure 2 gives an illustration of the algorithm.
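The projection idea can be sketched as follows. This is a simplified version of the procedure described above, not the authors' exact implementation: it uses the mean of the projection profile as the trough threshold (an assumption) and places a grid line at the centre of each low-intensity run.

```python
import numpy as np

def grid_lines(profile, threshold=None):
    """Return indices of trough centres in a 1-D projection profile.

    A trough is a maximal run of positions whose value falls below the
    threshold (the profile mean by default, an illustrative choice);
    a grid line is drawn at the centre of each such run.
    """
    if threshold is None:
        threshold = profile.mean()
    low = profile < threshold
    lines, start = [], None
    for i, flag in enumerate(low):
        if flag and start is None:
            start = i                         # a trough begins
        elif not flag and start is not None:
            lines.append((start + i - 1) // 2)  # centre of run [start, i-1]
            start = None
    if start is not None:                     # trough running to the edge
        lines.append((start + len(low) - 1) // 2)
    return lines

# Synthetic block: four bright "spots" separated by dark troughs.
img = np.zeros((20, 20))
img[2:8, 2:8] = img[2:8, 12:18] = img[12:18, 2:8] = img[12:18, 12:18] = 100.0

cols = grid_lines(img.sum(axis=0))  # vertical lines from the x-axis projection
rows = grid_lines(img.sum(axis=1))  # horizontal lines from the y-axis projection
```

The outermost lines fall on the block border; the interior line (index 9 here) is the one separating adjacent spots.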


3.2 Notation

Having a rectangular grid, we can introduce the notation for the pixel intensities and the parameters of interest. From now on, we use the word spot to refer to a single rectangle of the grid since, assuming that gridding was successful, each rectangle should contain one spot. We use the word variety, with symbol v (= 1, 2), to refer to the two samples (one labelled red, the other green) that are hybridized to a single slide. After gridding, the data take the form

y_{svp} : s = 1, . . . , S; v = 1, 2; p = 1, . . . , P_s,  (1)

where y_{svp} is the intensity of pixel p of the sth spot in variety v. The notation is illustrated in Figure 3 with a block of size 2 × 2. The goal of segmentation is to estimate the true unknown hybridization regions, shown in Figure 3 for illustration purposes. We denote by x_{sp} the classification label of pixel p of spot s, taking the value 0 (background) or 1 (foreground); it does not depend on v because we assume that the classification of a given pixel is the same in the two images. This is a natural assumption since the two images are generated from the same slide scanned twice, at least as long as the images are registered, which was the case for the data used here. The classification labels are latent variables and need to be estimated. For a given spot, the foreground (resp. background) is formed by the pixels having classification label one (resp. zero). The foreground is represented by the pixels inside the thick black curve in Figure 3; the background is made of the pixels outside that same region. For clarity, the regions displayed in Figure 3 are connected, but in practice they need not be. In addition, for each variety of each spot we wish to estimate the hybridization effect, denoted by φ_{sv}, representing the amount of hybridization of spot s in variety v. We assume that the hybridization effect is additive to a background effect, β_{sv}, quantifying a contribution that is not due to hybridization. In the literature, the hybridization effect is sometimes called the hybridization intensity or background-corrected intensity. We use a variable without its subscripts to denote the collection of all such variables; for example, y denotes the collection of all pixel intensities.
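A minimal sketch of this data layout may help fix the indexing in (1). The dimensions below are illustrative; the point is that the number of pixels P_s varies by spot, and that the classification labels carry no variety index.

```python
# Sketch of the indexing in (1): y[s][v][p] is the intensity of pixel p
# of spot s in variety v. Spots may contain different numbers of pixels
# (P_s varies with s), so a ragged structure is natural.
S, V = 4, 2            # a 2 x 2 block, two varieties (red and green)
P = [30, 28, 35, 30]   # illustrative P_s: pixels per spot after gridding

y = [[[0.0] * P[s] for v in range(V)] for s in range(S)]

# Classification labels x[s][p] are shared across varieties (no v index),
# since both images come from the same slide scanned twice.
x = [[0] * P[s] for s in range(S)]
```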

4 Image Segmentation and Intensity Estimation via Hierarchical Bayesian Modelling

Segmentation and intensity estimation are carried out concurrently via hierarchical Bayesian modelling. We use a Bayesian linear model (Lindley and Smith 1972) with hierarchical-t sampling errors to allow for outliers (Besag and Higdon 1999).

4.1 The model

We assume that the intensity of a given spot s in a given variety v can be described as the sum of a background effect βsv , a hybridization effect φsv and an additive noise component. The additive hybridization effect only contributes to the pixels that are classified as foreground. In order to capture uncertainty about the classification of all the pixels, we treat



Figure 3: Example of a block of 2 × 2 genes. We have two images, one for each variety. A spot refers to a single rectangle of the rectangular grid. The foreground is represented by the pixels inside the thick black curve; the background is made of the pixels outside that region. The truth is unknown and is shown here for illustration only.

the classification of a given pixel, x_{sp}, as a random variable taking the value 0 (background) or 1 (foreground). Conditionally on the parameters (β, φ, x), we assume the (y_{s1p}, y_{s2p}) are independent and can be written as

(y_{s1p}, y_{s2p})′ = (β_{s1}, β_{s2})′ + x_{sp} (φ_{s1}, φ_{s2})′ + (ε_{s1p}, ε_{s2p})′ / √w_{sp},  (2)

where

(ε_{s1p}, ε_{s2p})′ | V_s ∼ iid N₂(0, V_s),  (3)

with precision matrix

V_s⁻¹ | ρ, λ_{s1}, λ_{s2} = 1/(1 − ρ²) [ λ_{s1}, −√(λ_{s1} λ_{s2}) ρ ; −√(λ_{s1} λ_{s2}) ρ, λ_{s2} ],  (4)

λ_{sv} ∼ iid Ga(a_v²/b_v, a_v/b_v),  (5)

w_{sp} ∼ iid Ga(ν/2, ν/2),  (6)

where w_{sp} and (ε_{s1p}, ε_{s2p})′ are independent. Since the w's are independent of the ε's, we have (ε_{s1p}, ε_{s2p})′ / √w_{sp} ∼ t_ν(0, V_s), i.e. the (bivariate) errors have a bivariate t distribution with ν degrees of freedom and covariance matrix V_s. The advantage of writing the model this way is that, conditioning on the w_{sp}, the sampling errors are again normal, but with different precisions, and computation reduces to a weighted least squares problem.

In general, background intensities tend to vary spatially within a slide and it is common to estimate them locally for each spot. The background intensity of spot s in the image corresponding to variety v is represented by β_{sv}, which we model as a first order Gaussian intrinsic autoregression (Besag and Kooperberg 1995), as follows:

(β_{sv} | β_{∂s,v}, λ_{βv}) ∼ N( Σ_{i∈∂s} β_{iv} / n_s , (n_s λ_{βv})⁻¹ ),  (7)

where ∂s denotes the first order neighbours of s on the lattice formed by the rectangular grid and n_s = #{i : i ∈ ∂s}. The hyperparameter λ_{βv} is also treated as random, with prior specified in the next section. Typically, the number n_s of neighbours will be 4, except on the boundary of the lattice where it is 2 or 3. To diminish any possible edge effects, we augmented each block with an additional row of blank spots on each side. A block is usually surrounded by an area of background where no solution has been spotted (Figure 1), and we can use these data to augment the block.

The hybridization effect of spot s in variety v, denoted by φ_{sv}, is modelled as a random effect,

(φ_{sv} | ξ_{φv}, λ_{φv}) ∼ log-Normal(ξ_{φv}, λ_{φv}⁻¹),  (8)

where the priors for ξ_{φv} and λ_{φv} are specified in the next section.
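A short simulation from the sampling model (2)-(6) may clarify the roles of the parameters: given β, φ and the labels x, each pixel's bivariate error is normal with covariance V_s, rescaled by a Ga(ν/2, ν/2) weight to give heavy-tailed t errors. All numerical values below are illustrative, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(1)

nu = 4.0                       # degrees of freedom of the t errors
rho = 0.5                      # correlation between the two varieties
lam = np.array([1e-4, 1e-4])   # precisions lambda_s1, lambda_s2

# Covariance V_s, i.e. the inverse of the precision matrix in (4).
sd = 1.0 / np.sqrt(lam)
V = np.array([[sd[0] ** 2,        rho * sd[0] * sd[1]],
              [rho * sd[0] * sd[1], sd[1] ** 2]])

beta = np.array([200.0, 180.0])    # background effects beta_s1, beta_s2
phi = np.array([900.0, 400.0])     # hybridization effects phi_s1, phi_s2
x = np.array([0, 0, 1, 1, 1, 0])   # classification labels for 6 pixels

w = rng.gamma(nu / 2.0, 2.0 / nu, size=x.size)   # w_sp ~ Ga(nu/2, nu/2)
eps = rng.multivariate_normal(np.zeros(2), V, size=x.size)
y = beta + x[:, None] * phi + eps / np.sqrt(w)[:, None]
# Foreground pixels (x = 1) are centred at beta + phi; background at beta.
```

Note that NumPy's gamma sampler is parameterized by shape and scale, so a Ga(ν/2, ν/2) rate parameterization uses scale 2/ν.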

where the priors for ξφv and λφv are specified in the next section. The binary random variable xsp represents the class of pixel p in spot s taking the value zero (background) and one (foreground). We use a first order symmetric Ising model for the xsp , as follows, X (xsp |xs∂p , cs ) ∝ exp(γ 1[xsi =xsp ] )1[kp−cs k