Combinatorial image analysis of DNA microarray

Report 8 Downloads 120 Views
Vol. 19 no. 2 2003 Pages 194–203

BIOINFORMATICS

Combinatorial image analysis of DNA microarray features C. A. Glasbey 1,∗ and P. Ghazal 2 1 Biomathematics

and Statistics Scotland, JCMB, King’s Buildings, Edinburgh EH9 3JZ, Scotland, UK and 2 Genomic Technology and Informatics Centre, University of Edinburgh, Little France Crescent, Edinburgh EH16 4SB, Scotland, UK

Received on May 2, 2002; revised on August 27, 2002; accepted on September 9, 2002

1 INTRODUCTION Microarrays are a powerful new set of genomic methods for monitoring genetic content or expression activity of thousands of genes simultaneously (see, for example, Chipping Forecast, 1999). One of the most commonly ∗ To whom correspondence should be addressed.

194

used microarray platforms in academia involves printing genetic material using pin transfer methodology. Typically, DNA corresponding to specific genes are placed as spots with a pitch of approximately 250 µm on a glass slide, which is then hybridized with either one or two samples, labelled with fluorescent dyes, and the microarray is digitally scanned at the corresponding wavelengths of the fluors. Quantifying the amount of fluorescent sample hybridized to specific microarray elements is subject to a degree of uncertainty due to inherent deficiencies of the technology that can result in small- and large-scale intensity variations within spots, highly variable and nonlinear background fluorescence, fabrication inconsistencies and post-print artefacts. For these reasons, and because of the large quantities of data generated, it is well appreciated that statistical and computational tools are essential at all stages of a microarray-based experiment for extracting reliable and meaningful information. The first stage in the analysis of such data is the use of image analysis to summarize the information in each spot, and to estimate the level of expression of each gene. To assess image quality, Brown et al. (2001) proposed measures of variability of pixels within a spot; and to improve quality, Schadt et al. (1999) used an adaptive pixel selection algorithm to remove pixels contaminated by noise and Yang et al. (2002) used a top-hat filter to correct for trends in background values. For segmentation, computer packages have implemented a range of methods: QuantArray (GSI Luminomics, 1999) applies thresholds to the histogram of pixel values in a target region around a spot, ScanAlyse (Eisen, 1999) uses a circle of fixed radius, GenePix (Axon Instruments Inc., 1999) allows the circle radius to vary from spot to spot, and UCSF Spot (Jain et al., 2002) uses histogram information within a circle. Steinfath et al. (2001) fitted a scaled bivariate Gaussian density function to pixel values, as did Br¨andle et al. (2000), but using a robust fitting method. Schadt et al. (1999) proposed an adaptive pixel selection algorithm to remove pixels contaminated by noise. Bozinov and Rahnenf¨uhrer (2002) used k-means clustering, Kim et al. c Oxford University Press 2003; all rights reserved. Bioinformatics 19(2) 

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

ABSTRACT Motivation: DNA and protein microarrays have become an established leading-edge technology for large-scale analysis of gene and protein content and activity. Contactprinted microarrays has emerged as a relatively simple and cost effective method of choice but its reliability is especially susceptible to quality of pixel information obtained from digital scans of spotted features in the microarray image. Results: We address the statistical computation requirements for optimizing data acquisition and processing of digital scans. We consider the use of median filters to reduce noise levels in images and top-hat filters to correct for trends in background values. We also consider, as alternative estimators of spot intensity, discs of fixed radius, proportions of histograms and k-means clustering, either with or without a square-root intensity transformation and background subtraction. We identify, using combinatoric procedures, optimal filter and estimator parameters, in achieving consistency among the replicates of a gene on each microarray. Our results, using test data from microarrays of HCMV, indicate that a highly effective approach for improving reliability and quality of microarray data is to apply a 21 by 21 top-hat filter, then estimate spot intensity as the mean of the largest 20% of pixel values in the target region, after a square-root transformation, and corrected for background, by subtracting the mean of the smallest 70% of pixel values. Availability: Fortran90 subroutines implementing these methods are available from the authors, or at http://www. bioss.ac.uk/∼chris Contact: [email protected]

Combinatorial image analysis of microarrays

(b)

Fig. 1. Green colour band for small region of one microarray, displayed using a reversed grey-scale, to illustrate the effect of a median filter to reduce speckle noise: (a) original image, (b) the result of applying a 5 × 5 median filter (i.e. m MEDIAN = 2).

(2001) used an edge detection method and Yang et al. (2002) proposed the use of seeded region growing (Adams and Bischof, 1994). For reviews of image segmentation including many more methods, see Haralick and Shapiro (1985) or Glasbey and Horgan (1995, chap. 4), and for a survey of constrained classification, see Gordon (1996). The diversity and complexities of these approaches clearly underscores the significance of image analysis, but methodology for optimization involving a systematic evaluation of estimators and parameters is not defined. In this paper we systematically consider, through a combinatorial approach, a range of image analysis methods. In Section 2 we present some test data. Then, in Section 3 we consider the use of median filters to reduce noise levels in images and top-hat filters to correct for trends in background values. In Section 4 we present several alternative estimators of spot intensity and in Section 5 we identify those estimators which perform best for our test microarrays, in achieving the most consistency among the 3 replicates of each gene. Finally, in Section 6 we discuss the significance of these results for diagnostic microarray applications in genotyping and expression profiling and for reliability of microarray databases.

2 MICROARRAY TEST DATA As a test case of representative data, we analyzed 17 microarrays, each containing the entire genome of a human pathogen (Human cytomegalovirus HCMV; Chambers et al., 1999). The microarrays were fabricated using contact printing (GMS417—Affymetrix printer) and are composed of long (75 mer) oligonucleotide probes, an approach that is gaining popularity. In this study the HCMV microarrays were used for a comparative genotype profiling study of laboratory and clinical strains, to

compare AD169(HB5) with strains Town, Towne(long), GU, LE, Fiala and AD169(long). The arrays were scanned at 10 µm resolution, resulting in digital images approximately 2000 × 2000 pixels in size. For these microarrays a standard receiver–operator analysis score a value of approximately 0.99 (data not shown). Importantly, the advantage of these data is that the false positives and negatives have been well characterized for the system. Each microarray contained 1080 spots, consisting of 3 replicates of each of 234 ORFs, that are potentially present in HCMV strains, and 103 human oligonucleotides, included for control purposes, and the remaining 69 spots were SSC. On each array the test strain was labelled green (Cy3) and compared with AD169(HB5) labelled red (Cy5). Therefore, when the digital image of a microarray is displayed in colour, red spots reveal genes only expressed by HB5, green spots show genes only expressed by the test strain, yellow spots show genes expressed by both strains and dark spots by neither.

3 FILTERS Figure 1a shows the green colour band of a small region of one microarray, displayed using a reversed grey-scale. Black specks one or two pixels in size are randomly observed. They are produced by some contamination of the slide, and will affect the estimates of spot and background intensity. Filters (see, for example, Glasbey and Horgan, 1995, chap. 3) are a set of image analysis methods that can be used to reduce noise in images. One of the simplest non-linear filters is the median filter. For a filter of size m MEDIAN , the output is z i1 ,i2 = median {yi1 + j1 ,i2 + j2 : j1 , j2 = −m MEDIAN , . . . , for all i 1 , i 2 , m MEDIAN } 195

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

(a)

C.A.Glasbey and P.Ghazal

(b)

(c)

Fig. 2. Application of top-hat filter to Figure 1b, to illustrate identification and subtraction of background trend: (a) section of array, displayed with a stretched grey-scale, (b) trend estimated by morphological opening using 25 × 25 square structuring element (i.e. m TOP-HAT = 12), (c) section of array after subtraction of trend.

where yi1 ,i2 (or, more succinctly, yi ) denotes the original pixel value in row i 1 , column i 2 . Therefore, z i is set to the median of the values of y in a (2m MEDIAN +1)×(2m MEDIAN +1) square of pixels centred on location i, and we repeat this for all locations in the image. We require that m MEDIAN ≥ 0 and, if m MEDIAN = 0, then z i ≡ yi and the filter has no effect. Figure 1b shows the result of applying a 5×5 median filter (i.e. m MEDIAN = 2) to Figure 1a, from which we see that most of the contamination has been removed. We consider the best choice of m MEDIAN in Section 5. Another problem with microarrays is that the brightness in the background varies. For example, Figure 2a shows Figure 1b, but with a stretched grey-scale to reveal background trend. This is a common problem in image analysis, and makes comparison of similar features in different parts of an image difficult. A ‘top-hat filter’ can be used to remove the trend, as proposed by Yang et al. (2002). The background trend is estimated using a ‘morphological opening’ (see, for example, Glasbey and Horgan, 1995, chap. 5), obtained by first replacing each pixel by the minimum local intensity in a region and then performing a similar operation on the resulting image, using the local maximum. For a region, we use a square of size (2m TOP-HAT + 1) × (2m TOP-HAT + 1) centered on each pixel, where m TOP-HAT is a non-negative integer used to specify the size of the top-hat filter. Mathematically, the pixels, z i , in the opened image will be given by z i = max xi+ j , j

where xk = min yk+ j , j

for | j1 |, | j2 | ≤ m TOP-HAT , with y again denoting the original pixel values. If m TOP-HAT is set to a very large value (i.e. m TOP-HAT = ∞), then 196

z i ≡ yi and the filter has no effect. If the top-hat filter is applied to Figure 2a using a structuring element which is larger than the spots, then only the pixels in the spots will be substantially changed from yi to z i , as shown in Figure 2b using a 25 × 25 square (i.e. m TOP-HAT = 12). Then, by subtracting z from y, these spots will be made more distinct, as shown in Figure 2c. We consider the best choice of m TOP-HAT also in Section 5.

4 ESTIMATORS OF SPOT INTENSITY We wish to estimate the spot intensities of the red and green colour bands, and their ratio, possibly taking into account the background values. We superimpose a lattice on each microarray image, adjusted for location shifts and small rotations. This is termed gridding, and automatic and semi-automatic methods are discussed by, among others, Schadt et al. (1999), Yang et al. (2002), Steinfath et al. (2001) and Jain et al. (2002). In our case, it locates a 41×41 square of pixels containing the spot approximately at its centre, where 41 corresponds to the inter-spot spacing. We commence this section by considering two well established estimators of spot intensities, those based on discs and proportions. Extending the notation of Section 3, let y1,i and y2,i denote the intensities in channels 1 and 2, respectively, for the pixel at location (i 1 , i 2 ), which we abbreviate to yi . We need to calculate y¯ (S) , the 2-dimensional vector of the mean spot intensities in the two channels, and y¯ (B) , the mean background intensity in the two channels. If we identify the spot centre as being at pixel location ν, we can

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

(a)

Combinatorial image analysis of microarrays

use the disc estimator (S) y¯DISC = (B) y¯DISC =

1 # 1 #



yi

and

i−νR (B)

(S)

y¯PROP

(B)

N 1  = y(i) # (S)

and

(B)

y¯PROP

i=N p

Np 1  = y(i) # i=1 (S)

for specified proportions p (S) and p (B) . So, y¯PROP is the (B) average value of pixels greater than p (S) % and y¯PROP is the average of those less than p (B) %. For either disc or proportion estimators, we then estimate the rate of expression of channel 2 relative to channel 1, denoted c, by either (S)

cˆ =

y¯2

(S)

y¯1

or

cˆ =

(S)

(B)

(S)

(B)

y¯2 − y¯2 y¯1 − y¯1

,

i.e. without or with background subtraction. As Brown et al. (2001) point out, background subtraction can lead to negative estimates, which we have set to small positive values before log-transforming. In Section 5 we consider the choice between disc or proportion estimators and choices of R (S) , R (B) , p (S) and p (B) . We also consider √ whether it is better to average y or y, and whether to use background subtraction. To develop further non-spatial estimators, we consider the distribution of pixel values. Figure 3 shows scatter plots of the square-root-transformed pixel values for 41 × 41 squares of pixels centered on each of four spots on one microarray, chosen to exemplify black, red, yellow and green spots. By using a square-root transformation we have obtained distributions that approximate mixtures of two bivariate Gaussian distributions, with the lower component in each plot representing the background pixels.

(S)

µ2 =



(S)

c µ1

and

(S)

(S)

V22 = c V11 ,

(3)

where, as previously, c denotes the rate of expression of channel 2 relative to channel 1 before application of the square-root transformation. (For alternative models, see Chen et al., 1997; Newton et al., 2001). The likelihood is a product of the probabilities of all the pixels, yi ,   L= p f (S) (yi ) + (1 − p) f (B) (yi ) i

where f (S) and f (B) are the probability density functions for spot and background bivariate normal distributions,  1 √ 1 T −1 √ exp − ( y − µ) V ( y − µ) . f (y) = 1 2 2π |V | 2 We can estimate the unknown parameters ( p, µ, V ) by maximizing L numerically. Results for the data plotted in Figure 3 are given in Table 1. As we would expect, parameters for the background distribution are very similar for all four spots and the two channels are almost uncorrelated. Approximately 20% of pixels are spot pixels, except in the case of spot 1, the black spot, where it can be seen in Figure 3a that the distinction between spot and background pixels is somewhat arbitrary. Estimated values of cˆ are in the correct order, with the red spot having the lowest value and the green spot the highest value. Finally, we note that correlation between channels is positive for all four spots. Reasons for positive and negative correlations are discussed by Brown et al. (2001). If constraint (3) is omitted, then we can simplify parameter estimation by using the EM algorithm (Dempster et al., 1977). From initial estimates of the parameters, we alternate between computing the probability that a pixel is a spot pixel, conditional on y, given by pi =

p f (S) (yi ) p f (S) (yi ) + (1 − p) f (B) (yi )

197

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

where # denotes the number of terms in each sum, and we (S) specify radii R (S) and R (B) . So, y¯DISC is the average value of those pixels located inside a disc of radius R (S) centred (B) on ν, and y¯DISC is the average pixel value for locations outside a disc of radius R (B) also centred on ν and inside the 41 × 41 square. Alternatively, we can ignore the spatial information and use the proportion estimator. We rank yi (i = 1, . . . , N = 412 ), for each channel independently, giving y(1) ≤ y(2) ≤ . . . ≤ y(N ) , and use

We propose as a model for data in each of the 41 × 41 squares:    √ N µ(S) , V (S)  with probability p y∼ (B) (B) N µ ,V otherwise, (2) where µ(S) and V (S) denote the 2-vector mean and 2 × 2 variance matrix of spot pixels, µ(B) and V (B) similarly denote the mean and variance of background pixels, and p is the proportion of pixels in the 41 × 41 square which are spot pixels. Also, because of the relationship between expression in channel 2 relative to channel 1,

C.A.Glasbey and P.Ghazal

(b)

(c)

(d)

Fig. 3. Plots of square-root-transformed pixel values for 41 × 41 squares of pixels centered on each of four spots on one microarray, chosen to exemplify: (a) black, (b) red, (c) yellow and (d) green spots.

Table 1. Maximum likelihood estimates of parameters in model specified by Equation (2), for pixels from each of four 41 × 41 squares, plotted in Figure 3

Spot

(B) µˆ 1

(B) µˆ 2

1 2 3 4

17.0 17.3 17.8 16.2

12.8 13.2 14.2 12.9

198



(B) Vˆ11

4.8 4.6 4.6 4.5



(B) Vˆ22

4.5 4.3 4.4 4.2

ˆ (B)

V12 / (B) ˆ (B) Vˆ V

(S) µˆ 1

0.09 0.05 0.09 0.04

34.8 95.1 113.2 74.9

11

22



(S) Vˆ11

12.4 21.4 21.2 13.7

ˆ (S)

V12 / (S) (S) Vˆ Vˆ

cˆMLE



0.85 0.37 0.80 0.83

0.602 0.172 0.396 0.571

0.07 0.19 0.21 0.23

11

22

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

(a)

Combinatorial image analysis of microarrays

and estimating new parameter values as weighted sums of pixel values, using p as weights, √ i pi yi (S) , µˆ = i pi √ ˆ (S) )2 pi yi (S) 2 i pi ( yi − µ (S) ˆ V = = i − µˆ i pi i pi

1

Another standard estimator is given by the ratio of means of the untransformed data or, equivalently, the ratio of √ sums of squares of y,

cˆMIX-SS



2 ˆ (S) + µˆ (S) V 22 2 pi y2,i = i =

. (S) (S) 2 i pi y1,i Vˆ11 + µˆ 1

There is evidence in Figure 3 of pixels lying between the two component distributions. These are probably mixed pixels, as discussed by Schadt et al. (1999), and after filtering there are likely to be more of them, violating the above assumptions. Therefore, for robustness, and also as a computationally faster alternative, we consider also the k-means clustering algorithm proposed by Bozinov and Rahnenf¨uhrer (2002). For alternative clustering methods, see Gordon (1999). For 2 clusters, from initial estimates of µ(S) and µ(B) , we alternate between computing  √ √ 1 if  yi − µ(S)  <  yi − µ(B)  pi = 0 otherwise, and µˆ

(S)

√ i pi yi = i pi

µˆ

(B)

√ i (1 − pi ) yi = . i (1 − pi )

So, we allocate each pixel to the cluster with more similar mean, then recalculate the means. After a number of iterations, the estimates stabilise. On convergence, we again have two estimators for c:  √ 2 pi y2,i i pi y2,i . cˆCLUS-SS = i . cˆCLUS-MEAN = √ i pi y1,i i pi y1,i (4) To compare the 5 estimators of c, we simulated 1000 independent datasets from model (2) for each of the 4

Spot

cˆMLE

cˆMIX-MEAN

cˆMIX-SS

cˆCLUS-MEAN

cˆCLUS-SS

1 2 3 4

100.0 100.0 100.0 100.0

55.3 93.5 96.3 95.1

86.3 99.5 99.9 99.9

46.7 97.5 92.8 95.6

60.4 100.4 100.1 99.6

sets of parameters in Table 1. We summarize the results as percentage efficiencies of the estimators in comparison with the maximum likelihood estimator, which should be fully efficient in this case as we are fitting the same model that we simulated from. Efficiencies, defined as the ratios of sums of squares of errors between the methods, are given in Table 2. We see that estimators cˆMIX-SS and cˆCLUS-SS are as efficient as cˆMLE , except for spot 1, the black spot, for which c is ill-defined, as can be seen in Figure 3a. Therefore, as cˆCLUS-SS is also likely to be more robust than the other estimators, which are all more explicitly modelbased alternatives, and quicker to compute, we restrict attention to this approach in Section 5.

5

COMBINATORIC PARAMETER OPTIMIZATION In Sections 3 and 4 we have presented a range of methods, all involving parameters whose values need specifying. We propose to use the consistency among the 3 replicates of each gene on a single microarray as our measure of success, as did Steinfath et al. (2001). Consider possibly the simplest method, where we omit filters (i.e. m MEDIAN = 0 and m TOP-HAT = ∞), and estimate µ(S) using (1) with R (S) = 10. Figure 4a shows the 1011 estimated spot intensities (1080, minus the 69 SSC spots) for channel 1 plotted against those for channel 2 for one microarray. We can use analysis of variance to assess what proportion of the variability is unexplained by the gene labels. Residuals, after fitting gene effects, are plotted against estimated gene effects in Figure 4b– d, respectively for channels 1, 2 and the ratio. We have conducted these analyses on log-transformed data, because this approximately standardizes the variances, as can be seen in the figures, and thereby makes for a better summary of the whole dataset. Unexplained variability is 6.0, 6.6 and 6.8% for channels 1, 2 and the ratio, respectively, averaging to 6.5%. We chose this microarray and 8 further microarrays at random from our set of 17, and applied the method to each of them. On average, the unexplained variability was 7.8%. We take this as the standard against which to assess alternative methods. 199

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

and similarly for µ(B) and V (B) , but with pi replaced by (1 − pi ). After a number of iterations, the estimates stabilize and we can estimate the relative rate of expression between the two channels from the ratio of the means  (S) 2  √ 2 µˆ 2 i pi y2,i = √ . cˆMIX-MEAN = (S) µˆ i pi y1,i

Table 2. Percentage efficiencies of estimators of c, in comparison with the maximum likelihood estimator, cˆMLE , obtained by simulating 1000 independent datasets from model (2) for each of the 4 sets of parameter values given in Table 1

C.A.Glasbey and P.Ghazal

(b)

(c)

(d)

Fig. 4. Plots of spot intensities and analysis of variance residuals for each of 1011 spots on one microarray, estimated using the initial method described in Section 5: (a) channel 1 intensity plotted against channel 2 intensity; (b) plot of residuals against fitted values for analysis of variance of gene effects for channel 1 log-intensities; (c) similarly for channel 2 log-intensities; (d) similarly for the ratio of channel 1 and channel 2 log-intensities.

To optimize the parameters, we conducted a systematic search of all combinations of values. For the median filter, we considered as possible values, m MEDIAN = 0, 1, 2, 3. For the top-hat filter we considered m TOP-HAT = 10, 11, . . . , 14, ∞. We considered three segmentation methods: discs of fixed radius, with R (S) = 7 12 , 8 12 ,. . . , 12 12 , and R (B) = R (S) , R (S) + 1, . . . , 16 12 ; fixed proportions, with p (S) % = 70, 75, . . . , 90, and p (B) % = 50, 55, . . . , p (S) ; and k-means clustering on square-root transformed data. 200

Intensities were estimated both with and without a squareroot transformation (in the case of k-means clustering, we thus use either cˆCLUS-MEAN or cˆCLUS-SS in equation (4)), and with and without background subtraction. In total, there were 4464 combinations of parameters. We measure success by how much we have been able to reduce unexplained variability, in comparison with the initial method. Table 3 summarizes the results, by identifying the best combinations of parameter values for each of the three segmentation methods, and the range of parameter

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

(a)

Combinatorial image analysis of microarrays

Table 3. Result of a combinatorial search of 4464 combinations of parameters, to optimize the consistency among gene replicates of estimated spot intensities and ratios of intensities, averaged over 9 microarrays. Unexplained variability is expressed as a percentage of that achieved by the initial method. For each of the 3 segmentation methods, the optimal set of parameters is given, together with the range of parameter values that achieved results within 2.5% of the optimal one

Segmentation method Disc Initial Optimal Near optimal

Unexplained variability (normalized %)

m MEDIAN



m TOP-HAT

0



93.6

1

10

12 12

10–∞

7 12 –12 12 p(S) %

< 93.6 + 2.5

0–3

Proportion Optimal Near optimal

89.9 < 89.9 + 2.5

0 0

10 10–∞

Clustering Optimal Near optimal

93.0 < 93.0 + 2.5

3 2-3

10 10–∞

Fig. 5. Histogram of log-transformed spot intensities for channel 1 obtained using the initial estimator applied to one of the validation microarrays, together with fitted mixture of two normal distributions (—) and the two separate components (- - -). (The overlap between the two mixture components is 3.0%.)

values which were nearly optimal (unexplained variability increased by 2.5% of that for the initial method). As can be seen, the fixed-proportions method performed best, and with optimal parameter values, the unexplained variability was 89.9% of that for the initial method. This was obtained

80 70–80

Background subtraction

R(B) ∗

×

×



×

×



×/

×

p(B) % 70 50–80

 

 

 ×/

× ×/

by applying a 21 × 21 top-hat filter, then estimating spot intensity as the mean of the largest 20% of pixel values in the target region, after a square-root transformation, and correcting for background, by subtracting the mean of the smallest 70% of pixel values. When the same method was applied to the 8 microarrays in the validation dataset, the unexplained variability was 93.5% of that for the initial method. The improvement is not as great as with the training data, probably because of the overfitting involved in considering so many possible sets of parameters. However, it is still sufficient to have a biologically-significant effect on misclassification error rates. For example, Figure 5 shows the histogram of log-transformed spot intensities for channel 1, obtained using the initial estimator applied to a typical validation microarray. If we model this distribution as a mixture of two normal distributions (also shown in Figure 5) then the overlap between the two mixture components shows how many spots we expect to be misclassified by a binary decision rule. In this case it is 3.0%. If we instead used the optimal estimator, the misclassification rate decreases to 2.7%—so 0.3% more spots (i.e. 3 spots, on average in an array with 1011 spots) would be accurately classified by using the optimal estimator. For channel 2, misclassification decreases from 4.9% for the initial estimator to 4.3% for the optimal estimator, resulting in 6 more spots being correctly classified. However, for this array, expression ratios were clearly identifiable, and the misclassification rate was unaffected at 0.4%.

6 DISCUSSION We have addressed the first stage in the analysis of digitally scanned microarrays, the use of image analysis to 201

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

100.0

R(S) 10 12

y

C.A.Glasbey and P.Ghazal

202

a human pathogen, it could be applied to any microarray experiment, including protein arrays, with replicate spots. Also, although this study combinatorially explored 4464 methods, it is far from exhaustive (see review in Section 1) with future refinement and development of new statistical approaches still to be explored. Indeed, integration and implementation of this methodology will be part of an image analysis toolbox of a relational genomic micro database (see, http://www.gti.ed.ac.uk). Fortran90 subroutines implementing these methods are available from the authors, or at http://www.bioss.ac.uk/∼chris

ACKNOWLEDGEMENTS This work was supported by funds from the Scottish Executive Environment and Rural Affairs Department, the Scottish Higher Education Funding Council and BBSRC. REFERENCES Adams,R. and Bischof,L. (1994) Seeded region growing. IEEE Trans. Pattern Analysis Machine Intelligence, 16, 641–647. Axon Instruments Inc. (1999) GenePix 400A User’s Guide. Bozinov,D. and Rahnenf¨uhrer,J. (2002) Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering. Bioinformatics, 18, 747–756. Br¨andle,N., Chen,H.-Y., Bischof,H. and Lapp,H. (2000) Robust parametric and semi-parametric spot fitting for spot array images. In ISMB’00—8th International Conference on Intellgent Systems for Molecular Biology. pp. 46–56. Brown,C.S., Goodwin,P.C. and Sorger,P.K. (2001) Image metrics in the statistical analysis of DNA microarray data. Proc. Natl Acad. Sci. USA, 98, 8944–8949. Chambers,J., Angulo,A., Amaratunga,D., Guo,H.Q., Jiang,Y., Wan,J.S., Bittner,A., Frueh,K., Jackson,M.R., Peterson,P.A., Erlander,M.G. and Ghazal,P. (1999) DNA microarrays of the complex human cytomegalovirus genome: profiling kinetic class with drug sensitivity of viral gene expression. J. Virol., 73, 5757– 5766. Chen,Y., Dougherty,E.R. and Bittner,M.L. (1997) Ratio based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 2, 364–374. Chipping Forecast (1999) The Chipping Forecast, volume 21 of Supplement to Nature Genetics. Dempster,A.P., Laird,N.M. and Rubin,D.B. (1997) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc., Series B, 39, 1–38. Eisen,M.B. (1999) ScanAlyse, (Available at http://rana/Stanford. EDU/software/). Glasbey,C.A. and Horgan,G.W. (1995) Image Analysis for the Biological Sciences. Wiley, Chichester, UK. Gordon,A.D. (1996) A survey of constrained classification. Computational Statistics and Data Analysis, 21, 17–29. Gordon,A.D. (1999) Classification, 2nd edition, Chapman and Hall, London. GSI Luminomics (1999) QuantArray Analysis Software, Operator’s Manual. Haralick,R.M. and Shapiro,L.G. (1985) Image segmentation

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

summarize the information in each spot. In Section 3 we considered the use of median filters to reduce noise levels in images and top-hat filters to correct for trends in background values. Then, in Section 4 we considered as alternative estimators of spot intensity, discs of fixed radius, proportions of histograms and k-means clustering, either with or without a square-root intensity transformation and background subtraction. Finally, for a genotyping experiment involving 7 strains of the Human Cytomegalovirus (HCMV), in Section 5 we identified the combination of filters and estimator which was best, in achieving the most consistency among the 3 replicates of each gene on each microarray. Our strategy has been to systematically consider a large number of methods, which include approximate versions of published methods, rather than to explicitly assess this handful of published methods. Significantly, an improvement of accurately scoring 0.3% of a typical 10 000 spot microarray using the optimal image analysis estimator would account for 30 genes that would have been misclassified using the initial method alone. With regard to a biological experiment, identification of a single gene (spot) could be a highly meaningful result, although, this requires a minimum of 5 independent experiments to account separately for biological variability. Thus, the implementation of image analysis methodology represents an often overlooked but important first step toward the statistical processing and optimization of DNA microarray experiments. Determining the non-specific fluorescence and reducing noise as a result of spot inconsistencies of DNA microarrays are critical image analysis issues for obtaining reliable and high quality data. In the work presented here, we propose a number of simple, sequential image analysis methods for acquiring precise measurements of gene activity. These have a number of important and broad applications. An important application of these methods would be for improving the accuracy and sensitivity of diagnostics approaches. Indeed, the profiling of genotypes as illustrated in our model microarray data has direct relevance for screening clinical material. In addition, implementation of image analysis methodology as standard procedures would significantly increase the quality of microarray data used in subsequent steps involving univariate or multivariate analyses. We propose that image analysis should be part of a fully developed scheme for processed data storage, not only for DNA but also for emerging protein array platforms, and could be potentially applied to archived microarray image data for improving existing database quality. In conclusion, we describe a number of statistical image analysis tools that can be effectively applied to microarrays for raising the precision of measurements of gene and protein activity. While the methodology was evaluated using a well-defined microarray test system based on

Combinatorial image analysis of microarrays

techniques. Computer Vision, Graphics and Image Processing, 29, 100–132. Jain,A.N., Tokuyasu,T.A., Snijders,A.M., Segraves,R., Albertson,D.G. and Pinkel,D. (2002) Fully automatic quantification of microarray image data. Genome Res., 12, 325–332. Kim,J.H., Kim,H.Y. and Lee,Y.S. (2001) A novel method using edge detection for signal extraction from cDNA microarray image analysis. Experimental and Molecular Medicine, 33, 83–88. Newton,M.A., Kenziorski,C.M., Richmond,C.S., Blattner,F.R. and Tsui,K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes

from microarray data. J. Comput. Biol., 8, 37–52. Schadt,E.E., Li,C., Ellis,B. and Wong,W.H. (1999) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. Technical Report 303, Department of Statistics, UCLA. Steinfath,M., Wruck,W., Seidel,H., Lehrach,H., Radelof,U. and O’Brien,J. (2001) Automated image analysis for array hybridization experiments. Bioinformatics, 17, 634–641. Yang,Y.H., Buckley,M.J., Dudoit,S. and Speed,T.P. (2002) Comparison of methods for image analysis on cDNA microarray data. J. Comp. Graph. Stat., 11, 108–136.

Downloaded from http://bioinformatics.oxfordjournals.org/ at Pennsylvania State University on February 23, 2013

203