Bioinformatics Advance Access published July 1, 2004 Bioinformatics © Oxford University Press 2004; all rights reserved.
“Normalization of Microarray Data Using a Spatial Mixed Model analysis which includes Splines” David Baird1*, Peter Johnstone2, and Theresa Wilson3. 1
AgResearch, Lincoln, New Zealand;
2
AgResearch, Invermay Research Centre, Mosgiel, New Zealand;
3
AgResearch Molecular Biology Unit, Department of Biochemistry, University of Otago, Dunedin, New Zealand.
*To whom correspondence should be addressed
Running Head: “Mixed Models using Splines for microarray data”
1
ABSTRACT Motivation: Microarray experiments with thousands of genes on a slide and multiple slides used in any experimental set represent a large body of data with many sources of variation. The identification of such sources of variation within microarray experimental sets is critical for correct deciphering of desired gene expression differences. Results: We describe new methods for the normalisation using spatial mixed models which include splines and analysis of two-colour spotted microarrays for within slide variation and for a series of slides. The model typically explains 45 - 85% of the variation on a slide with only about 1% of the total degrees of freedom. Results from our methods compare favourably with those from intensity dependent normalisation loess methods of Yang et al., (2002) where we accounted for twice as much uncontrolled and unwanted variation on the slides. We have also developed an index for each EST that combines the various measures of the differential response into a single value that researchers can use to rapidly assess the genes of interest. Availability: GenStat code is available from the first author Contact:
[email protected] 1. INTRODUCTION Microarray technology has been used extensively to survey patterns of gene expression in a range of biological models. Using our own collection of bovine ESTs we have constructed large cDNA arrays (up to 22,000 ESTs) for use in several of our research projects. For such large arrays it is essential to identify sources of variation and correct for them to allow for robust use of this technology. Through normalization procedures, such variations can be identified and removed to obtain data for follow on research. The analysis of the microarrays, is a two-step analysis; with a within slide analysis aimed at normalization and if required standardisation, and then a between slide analysis to estimate the differences between targets, and their consistency. Various techniques for normalisation have been suggested, including linear regression (Hedenfalk et al. 2001), ratio statistics (Chen et al., 1997), local smoothing (Yang et al., 2002) and analysis of variance (Kerr et al., 2000 and Chu et al., 2002). Yang et al., (2002) compare these approaches and suggest a method which allows for differences induced by different print tips. We extend this idea to model the rows and columns over the whole slide and within the print tips and also autocorrelation in the printing order. This differs from other methodology in that we able to correct for unwanted variation arising from unevenness of the slide surface and scanning efficiency. The usual statistical modelling approach is taken where all possible sources of noise are jointly fitted in one model, with the need for each term being assessed using statistical significance of the reduction in remaining unexplained variation. Model terms can be added or removed as required. The fitted model then indicates where useful modification of our protocols and equipment would help minimise variation in future experiments.
2
2. METHODS Amplification of ESTs The AgResearch bovine EST collection represents ESTs from a variety of tissues that are in the vector pBK-CMV or pSPORT6. The inserts are amplified by PCR in a 50 µl reaction. A small aliquot of the PCR products were run on a 1% agarose gel to verify the PCR reaction. When required for printing the PCR products were resuspended in water, transferred to 384 well plates, dried and then resuspended in 3 X SSC printing solution (0.45 M sodium chloride, 0.045 M sodium citrate). Microscope slides (Gold Seal) were coated with a polylysine solution as described (Eisen and Brown, 1999). The amplified ESTs were printed using ESI array robot with up to 32 split pinheads depositing 0.6 nl with a 100 um spot size. After spotting, the slides were UV fixed and stored in an airtight container. Labelling was carried out following the original protocol available from www.microarrays.org with minor modifications. Labelled targets were hybridised overnight with ULTRAhyb (Ambion) at 42oC, washed for 5 minutes each in (1) 2 X SSC, 0.1%SDS (2) 1 x SSC and (3) 0.1 X SSC, centrifuged at 500 g for 5 min, dried and scanned. Allocation of Probes to Slides Randomization is a well-known device used to ensure the valid application of significance tests and confidence intervals (Fisher, 1951). Randomization also disarms critics who suggest an allocation of experimental units has been chosen which is favourable to an author’s hypothesis (Cox, 1992). Because of these properties, it is routine in traditional experiments to randomly allocated treatments to the experimental units. In microarray experiments the physical constraints imposed by the storage of probes in 96 well plates and by the microarray printing robots, ensure that a fully randomized layout is not possible. However, printing the 96 well plates in random order is possible and is justified in that some randomization is better than the alternative of no randomization.
3. ANALYSIS Measure of Differential Expression in Probes The slides were scanned using ScanArray 5000 (GSI Luminomics) and the visual images were converted to data files using GenePix 3.0 (Axon Instruments, 2002). The GenePix package substitutes a large constant ±C for M (the log2 ratio of the background corrected intensities) when the foreground expression level is greater than the background in just one dye, with the sign depending on whether this occurs for the red or green dye. However, the use of a constant in all cases can bias the results, as information in the other colour's value is lost. In the case where one dye has a negative corrected intensity, we substitute a minimum positive value for the negative value. This maintains the information in the other channel and down weights the influence of these probes relative to probes with well-defined ratios. The minimum value used is different for each dye and is a function of the standard deviation of the background, so that a smaller minimum value is used for more consistent background levels. The foreground and background levels of the Cy3 and Cy5 dyes are denoted by Rf, Rb 3
and Gf and Gb respectively. If R and B are the standard deviation of the backgrounds for Cy3 and Cy5 respectively, then M is now
M = log 2
max(Rf max(Gf
Rb, k Gb, k
) G) R
(1)
We have used a value of k of 0.5, but have tried values between 0.1 and 1.0. The value of k controls how much the information on the probe is down weighted, with larger values reducing the value of M towards 0. If both dyes have negative corrected intensities, then there is no information in the probe, and M is set to be a missing value. It is expected that the majority of probes in the sample will show no differential expression, and that the value M will be randomly distributed around a mean value of 0. Other approaches to handling values close or below background can be used. One option is to make no background correction, which will shrink all values of M towards zero, with large reductions for spots of low intensity, and minimal reductions on spots with high intensity. This has the advantage of reducing the variation of low intensity spots, but the disadvantage of reducing sensitivity of identifying differentially expressed ESTs with low expression levels. Any spatial trends not eliminated in the log ratios due to trends in the background can be estimated and removed as part of the spatial model, as explained later in this paper. Another alternative is suggested by Durbin and Rocke (2003), in the context of transforming the single channel’s expression, add a constant to all values in each channel as part of a more complex transformation. The constant to be added in the Durbin and Rocke approach is estimated as that giving the best stabilized error variance. For large expression values, these approaches have virtually no effect on the log ratio, but for values just below and above the minimum cut off, the relative differences between the approaches may be substantial. The advantage of using logs over more complicated transforms is that the resulting values are more naturally interpreted by the experimenter. Which approach is best, it terms of giving unbiased results can only be ascertained by a uniformity study, which is not available in our current data sets. Within Slide Dye Bias It is typically found that the mean of M at a certain level of log intensity depends on the level of intensity of the probe. If we define A, the average log intensity of the probe as A=
1 [log2 (max(Rf 2
Rb, k
R
)) + log 2 (max(Gf
Gb, k
G
))] (2)
then a plot of M vs A (an MA plot (Dudoit et al., 2002)), often shows a departure from the zero reference line. It is expected that the level of differential expression is independent of the brightness of the probes. Figure 1 shows the MA plot for one of our microarrays. It can be seen that the mean location of the M values is below zero for A between 8 and 11 and above zero for A greater than 11, falling back to zero as A approaches 16. The points falling on the two lines at the left of the plot are due to the truncation of the intensity of the dyes to the
4
minimum values. Figure 3 shows an MA plot of another microarray, and the difference in the shape of the M-A relationship is notable. Spatial Effects The colour readings for each probe on the slide are the result of a complicated process which involves variability at each step, including slide preparation, probe preparation, probe printing, hybridization, scanning and image analysis. The accumulated effects of error from each step add to the error in the final values. These effects may be correlated spatially on the slide and may differentially affect the two colours. We have observed carry over of signal between spots due to incomplete washing of the pins. This was graphically illustrated in one series of arrays where one cell on a plate became infected by yeast which caused the printed spot to express as bright green. The printing process spread this infection to down stream cells, causing a line of green spots on the microarrays, starting with the initially infected spot, and slowly fading back to the background. A number of authors (e.g. Spruill et al. (2002) and Chu et al.) report significant pin effects on slides. We have typically found that from 15% 85% of the variation in the log ratios can be accounted for by the combination of spatial and dye-bias effects. Spatial Linear Mixed Model for Normalisation Mixed models have been used previously by Wolfinger et al. (2001) for the analysis of microarray data, but these have used simple models where the spot to spot components on a slide have been modelled as coming from an independently and identically distributed normal distribution. It does not seem that mixed models have been previously used to model the spot to spot variation using the spatial structure on the slide. The approach used in the analysis of the red-green log ratio data from the microarray follows the approach used by Gilmour et al., (1997). They apply a linear mixed model with spatial autocorrelation on r × c rectangular array of plots in the context of a small-plot field experiment used to test cereal varieties. If y is the vector of log ratios in array order (column by column), the model for y is: y = X + Zu +
(3)
where is the vector of fixed effects with model matrix X , u is the vector of random effects with another model matrix Z , and is a vector of random errors. Here is further model as the sum of two components, with = + , where is a spatially dependent random error vector, and is a zero mean random vector whose elements are pairwise independent. It is also assumed that u , and are pairwise independent. This model is general, and potential sources of variation may be added into it. The components of variation that we have found important to add to the model for our microarray data are the pins that printed the slides the row and columns of the microarray layout, and the dye bias. Other potential effects that could be added are the short rows and short columns within pins, and batch or 384 well plate effects amongst the probes. These effects can be added in as either fixed terms or random
5
effects. For example, if the pins were entered as a fixed effect, then a separate mean would be estimated for each pin, whereas if this were entered as a random effect, it would be assumed that the pin effects came from a normal distribution with a variance that would be estimated from the data. The model used for is that of a first-order separable autoregressive model (Cullis and Gleeson (1991)), denoted AR1 × AR1. This model will allow for small scale or local spatial trends within array as well as signal carry over from one spot to another. The AR1 model assumes that there is a simple one step correlation, r, between adjacent spots in a row, and another, c, between adjacent spots in a column. The separable process specifies that diagonal correlations are a simple combination of the corresponding row and column correlations, so that a spot diagonally adjacent to another has correlation of r c. The appropriateness of this error model can be assessed by examination of the sample variogram of the residuals (Gilmour et al., 1997). The estimation of the variance parameters associated with the random terms is done by restricted maximum likelihood (REML) (Patterson and Thompson, 1971). The algorithm used to estimate the variance parameters is the average information REML (AIREML) algorithm (Gilmour et al., 1995). The AIREML algorithm is implemented in GenStat (GenStat Committee, 2002) and is many times faster than the standard REML algorithm that uses Fisher scoring. The significance of the random terms can be tested using a likelihood ratio test (Welham and Thompson, 1997). The significance of the fixed effect terms can be tested using the Wald ) ) ) statistic. To test a null hypothesis that = 0, the Wald statistic is defined as '[Var( )] 1 2
and is distributed with an exact distribution, with 1 degree of freedom, if the variance parameters were known. In practice they are estimated, it is only asymptotically distributed as 2 . The reason that the estimated model effects should be tested for significance, is so that important terms in the model be clearly identified and thought given to the possible causes of these terms. Apart from the significance, the relative sizes of these effects should be considered when looking for where effort should be put into controlling these noise terms. Once estimated, all these terms are then removed from the log ratios of each spot giving corrected log ratios. These log ratios have then been normalised for the estimated biases within each slide, and should provide a more accurate estimate of an EST’s up or down regulation. Spatial Terms In our analysis, we have fitted as fixed, the spatial terms for rows, columns and print groups (pins). If these terms were random, the REML analysis would shrink these effects towards zero, if the variance parameters estimated for these were small. As the number of probes in a row and column are large (> 80 in our microarrays), there will be only a small percentage of the statistical information on probes confounded with the row and column terms, so the effect of changing these terms from fixed to random will be negligible. We could also impose a 6
smooth curve on the row and column effects by using cubic spline terms for these. However, we do not do this, as we often find abrupt changes in the sequence of estimated row or column effects. Example of abrupt changes are pin tips becoming stuck with the print head, so from that time onwards the pin does not make good contact with the slide, or changes in between 384 well plates during printing (as found by Spruill et al., 2002). Not enforcing smooth behaviour on the row and column terms provides a more robust model at the small cost of a loss of efficiency to that if a smooth row/column processes were operating. If it was expected that there were separate spatial effects within each print group, then spatial terms for short rows and columns within each print group could be added to the model, as fixed or random effects. Due to the much larger numbers of these short rows and columns, it may better to add this term in as a random effect, so that these effects could be shrunk to zero when they were estimated as being small relative to the underlying variability. If the probes are duplicated side by side, the spatial model used for will need to be different from that of a straight AR1 × AR1 defined on the rows and columns. Since the correlation between two adjacent spots that have the same probe will normally be much higher than two adjacent probes with different probes, the spatial variance structure will need to reflect this. Creating either a new row or column term (depending on which dimension the duplicate probes lie in) allows for this. For example, with the duplicate probes lying side by side within rows, a new column term is created that groups together pairs of the original columns so that the identical probes will have the same value on this term. Then the AR1 component in the column direction is defined using this double column term, and an extra variance term is added into the model to allow for a different level of variation between pairs of duplicate probes. Modelling Dye Bias The dye bias curve can be fitted in a number of ways. The LOESS (Cleveland and Devlin, 1988) regression technique used by Yang et al., (2002) does not naturally fit within the linear mixed model framework. Linear models such as a polynomial regression or a cubic spline with fixed knot points can be included in the fixed effects model. However, these models require that the degrees of freedom associated with the polynomial or the number and position of knots be specified by the user. Cubic smoothing splines (de Boor, 1978) can be specified in a manner that allows these to be included within a linear mixed model (Verbyla, 1995; Verbyla et al., 1997). This involves setting up a set of basis functions for the cubic spline terms (Cox, 1972) and adding these as a set of random terms with a common variance parameter. The estimation of this variance parameter from the data will set the degree of smoothing used by the cubic smoothing spline. However, with large arrays, the full set of basis functions will involve an n × n matrix, so in practice a reduced set of basis functions is calculated, as with more than 100000 points no PC can allocate his much memory. French et al. (2001) provide an S-PLUS formula for the calculation of the basis functions for given knot points. If the reduced set of basis functions has s terms, then this will limit the smoothing cubic spline to the equivalent of at most s degrees of freedom. However, we expect the fitted curve to be much smoother than the maximum degrees of freedom specified in s, so this should not be a limitation. Figure 2 shows the set of basis functions for an array
7
with 8448 probes with 24 terms. Any smooth curve will be a weighted combination of these functions. Variance of Log Ratios The usual analysis of variance assumes a constant variance in the response variable. However, our experience is that the variance of the log ratio M depends on the intensity A. Durbin et al., (2002), Durbin and Rocke (2003) and Huber et al., (2002) have suggested transformations for M to solve this problem. However, in our experience there is not a single transformation that will work on all data sets. Further a transformation alters both the log ratio intensity relationship and variance at the same time. Our preference is to model these separately with non-parametric relationships that don’t impose a particular functional form on the data. The M-A variance relationship also seems to be a function of the image analysis program used. In particular, if the background level of the two dyes is over or under estimated, then this will severely affect the variance in M for low intensity values. If we consider the effect of adding a constant to both dye readings, as below,
max(Rf ~ M = log2 max(Gf
Rb + C , k Gb + C , k
) G) R
(4)
~
we can see that if C is less than zero the variance of M increases, and decreases if C is greater than zero. It is our impression that the GenePix software that we use has higher background estimates than the Spot software package (Buckley, 2000). The MA plots from Spot analyses tend to show the variance decreasing for low values of A, whereas ours using GenePix show it increasing. Between Slide Target Analysis Once the corrected log ratios, Mij (where i indexes the slides and j the probes), have been obtained, the analysis of the differences between targets can be done. We use analysis of variance (ANOVA) with a pooled error across treatments for each probe. The normalization of each slide has effectively removed the block effects, so the Mij values now reflect the differences between treatments on each slide and we have one measurement per slide rather than two. Thus the design matrix, X, for the ANOVA is not standard. If the t target effects are given by T = [T1, T2, … Tt]’, then the model in matrix notation is:
M =X
µ T
+e
(5)
1, if Ti assigned to Cy3 where X 1 = [1…1]’ and X i+1 = + 1, if Ti assigned to Cy5 0, otherwise
8
The constant term µ estimates the dye swap effect for the probe. The usual restriction on T is required is obtain estimates, as one parameter is redundant. If we impose the t
restriction i =1
µ T
Ti = 0 , then we have the usual least squares solution:
= [X ' X + R ' R ] X ' M 1
(6)
where R is the (t+1)-vector [0, 1…] that applies the restriction above.
4. RESULTS To demonstrate the types of effects picked up by our model, a single microarray experiment will be analysed. This was the first of our experiments in which 98% of the variation was identified as unwanted experimental variation. The methods we describe have been used to identify sources of variation in this and subsequent experiments and now this unwanted variation has been reduced this to less than 30%. The results are from a set of 3505 ESTs applied to slide containing 8448 probes printed by a 4 × 4 array of print heads giving probes in a 96 × 88 grid. The ESTs were printed in duplicate, side-by-side within rows, so column terms that grouped duplicates probes together were used in the analysis, giving 44 column terms. The MA plot for this data set (slide 14-9) is shown in Figure 3, and the false colour plot of the M values is shown in Figure 4. In Figure 3, it can be seen that there is a strong dye bias effect within the slide with green dominating at low intensities, despite the background corrections. This is an indication that the non specific background binding for the green or red dyes may be operating differently on the glass surface as to that on the surface covered with cDNA. Also in Figure 4, there are large spatial trends across the slide. The estimated effects are shown in Figures 5-7. In Figure 5, the estimated print head effects show a strong linear trend, from one corner of the slide to the diagonally opposite corner. In Figure 6, the row and column effects show pronounced patterns, with a very strong edge effect on the right hand side of the slide. The estimated dye bias effect is shown in Figure 7, and this picks up the tendency towards the green dye at low intensities. The estimated AR1 autocorrelation coefficients were 0.20 (se 0.02) and 0.15 (se 0.02) for rows and columns respectively, and both were highly statistically significant. The larger autocorrelation in the column direction than the row direction is expected, as the print heads printed sequentially down columns. When these estimated effects and spatial autocorrelation were removed from the M values, we get a set of adjusted M values that are displayed in the MA plot shown in Figure 8. The variance of the M values has been considerably reduced (the estimated effects accounted for 98% of the original variation in M) and there is now no longer an indication of a relationship between M and A. There is a small indication that the variance in M is not constant across values of A. After the analysis was performed and the estimated effects examined, this caused us to re-examine our procedures. It was found that the laser was not tracking correctly and this was subsequently corrected.
9
Most of the microarray slides we have analysed have had strong relationships between the variance of the adjusted M values and A. Figure 9 displays the mean variance relationship between M and A for another typical slide, 18-33. It can be seen that the variance of M is much larger for small values of A and is stable for A values greater than 11. The slight reduction in variance at the upper bound of A, 16, is enforced by the maximum intensity that can be recorded (216). We estimate the M-A variance relationship by smoothing the squared M values with a spline. The square root of this smoothed value gives an approximate standard deviation for values of A. This can be used to standardize M to constant variance over the range of A. Figure 10 displays the standardized M verses A plot for slide 18-33. About 160 microarray slides have now been put through this analysis, and the model typically explains 45 - 85% of the variation on a slide with only about 1% of the total degrees of freedom used by the model. With experience through examination of the estimated effects and feedback to the microarray process, we have been able to bring down the percentage of variation accounted for by these error components, increasing the signal to noise ratio in the results. An index for ranking ESTs has been constructed and a value is assigned to data from each probe. The index (I) is calculated as a weighted sum of a number of statistics measuring the differential expression. The statistics used are the mean log2 ratio (m), the t value of the log2 ratios (t), the mean of the standardized log2 ratios (sm). Further, the index is reduced in proportion to the number of slides that contained missing values for the log ratios. If p is the proportion of slides with non-missing log ratios, the index is calculated as:
I = p x [a × m + b × t + c × sm]
(7)
The weights a, b, and c are chosen to make the contributions of the 3 statistics to the index comparable. Although the statistics m, t and sm are not independent, they do carry extra information over using any one term on its own. We also added an optional extra adjustment of setting I to 0 when the log intensity i is less than a lower bound. The lower bound is set by judging where most of the variation in the log ratios is due to background noise (for our slides we used a value of log2(512) = 9). ESTs with large negative or positive values of I are then of the most interest to the researcher, as they have large consistent effects, and would be the points in the top left or right corners of a volcano plot. In fact, the ranking on the index is similar to ranking the points along a 45 degree axis through the origin in each half of the volcano plot. Comparison with Intensity-Dependent Normalisation using Loess smoothing The spatial model proposed here was compared with the intensity-dependent normalisation of Yang et al. (2002). This intensity-dependent normalisation fits either a loess smoother to a scatter plot of M vs A for all the data points or separate smoothers to the data from each print tip group. These two loess based models were fitted to a 24 slide series of microarrays, along with the spatial model proposed in this paper. To examine the importance of the components of the spatial model, various sub-models were also fitted that excluded some of the model terms. The average percentage variance explained through the stepwise addition of model terms are shown in Table 1.
10
A common loess smooth explained on average 4.5% of the total variation over the 24 slides and within print tip loess smooth explained 9.5%. Simply fitting a different mean per print tip accounted for 5.1% of variation. Fitting a common spline with a different mean per print tip accounted for 10.6% and when per print tip splines are fitted it explains 11.3% (cf 9.5% above). When the row and column components were added to the splines, these explained an extra 5.8% of the variation, and when the autocorrelation between spots in rows and column were added, a further 1.6% was accounted for. The full spline model which included fixed and random effects explained 18.7% of the variation. Therefore in this example our method accounted for twice as much uncontrolled and unwanted variation on the slides than the intensity-dependent normalisation using loess smoothing. The Bioconductor software (Gentleman et al., 2004) marrayNorm function does allow a combination of the intensity, row and column loess fits, and this would be expected to perform similarly to our Pin × Dye Bias + Row + Column model, although the model coefficients are estimated sequentially rather than jointly.
5. DISCUSSION We have used a mixed model approach to analyse many microarray experiments and have found that it can correct for many sources of variation and provide a more robust set of gene expression data. The index number enables the molecular biologists to access quickly the relative changes in expression between the samples tested and thus the index for ranking ESTs is used to help select probes for further study. Follow up verification work using Northern analysis to examine gene expression differences has confirmed agreement with the index number from the microarray data. We have generally found that microarray data tends to underestimate the fold differences in expression relative to the Northern analysis (data not shown). The development of the analysis approach described in this paper has prompted us to include further quality control measures as part of the whole analysis procedure. Because we are working with large data sets we cross check the final analysis spreadsheets with the initial clone tracking files to ensure that the EST name and locations are still matching. Our integrated database also compares the analysed data with the raw jpeg images for each EST as a final quality control check. It is essential to have such quality control checks when analysing multiple slides as part of the larger experimental sets. Initially with many microarray data sets, only those with two or three fold differences in expression were seen as reliable. We have found using our analysis methods that even down to 1.5 fold can be statistically significant, and changes of this order are likely to be biologically important. For many of our animal model systems we do not see large changes in gene expression. These gene expression differences would often be removed in other types of analyses as they would fall into the noise region. The experimental design however, is critical to have confidence in these lower fold differences in gene expression. Loop or
11
balanced incomplete block designs may be essential to control for animal variability as well as within slide and between slide differences.
6. CONCLUSION The benefit of this method where by we use spatial mixed models to estimate pin effects, row and column effects on the slides, autocorrelation and correct dye bias using splines has the advantage over other published methods in that it identifies additional sources of variation and corrects for them as part of the analysis procedure. This results in greater experimental efficiency in that the standard errors of the comparisons of interest are smaller. The visual display of data and fitted model components considerably enhances identification of the various sources of variation. The advantage of the modelling approach we have adopted is that all effects are estimated with full efficiency, and are adjusted for each other. Furthermore, the model can be easily added to or adapted if other effects are found in the data. Acknowledgements We thank Dianne Hyndman for producing the microarray slides and carrying out the hybridizations and Alan McCulloch for development of our microarray database and Harold Henderson for helpful discussions. Sue Welham kindly made available code for calculating the reduced set of spline basis functions. We gratefully acknowledge the helpful suggestions made by the reviewers. References Axon Instruments (2002). http://www.axon.com.
GenePix
Pro
microarray
and
analysis
software.
Buckley, M. J. (2000). Spot User's Guide. CSIRO Mathematical and Information Sciences, Sydney, Australia. http:///www.csiro/au/ipa/Spot/spotmanual.htm. Chen, Y., Dougherty, E.R. & Bittner, M.L. (1997) Ratio based decisions and the quantitative analysis of cDNA microarray images. Journal Biomedical Optics 24:364 - 374 Chu, T., Weir, B. and Wolfinger, R. (2002). A systematic statistical linear modelling approach to oligionucleotide array experiments. Mathematical Biosciences, 176, 35-51. Cleveland, W. S. and Devlin S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of American Statistical Association, 83, 596-610. Cox, D. R. (1992). Planning of Experiments. New York: Wiley. Cox, M. G. (1972). The numerical evaluation of B-splines. Journal of the Institute of Mathematics Applications, 10 134-149.
12
Cullis, B.R. & Gleeson, A.C. (1991). Spatial analysis of field experiments - an extension to two dimensions. Biometrics 47, 1449-1460. de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York. Dudoit, S., Yang, Y. H., Speed, T. P. and Callow, M. J. (2002). Statistical methods for detecting differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-140. Durbin, B. P., Hardin, J. S., Hawkins, D. M. and Rocke, D. M. (2002). A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, 18, S105 - S110. Durbin, B. P. and Rocke, D. M. (2003). Estimation of transformation parameters for microarray data. Bioinformatics, 19, 1360-1367. Eisen, M.B. and Brown, P.O. (1999). DNA arrays for analysis of gene expression. Methods Enzymology, 303, 179-205. Fisher, R. A. (1951). The Design of Experiments. 6th Edition. London: Oliver and Boyd. French, J. L., Kamman, E. E. and Wand M. P. (2001). Comment on Ke and Wang. Journal of American Statistical Association, 96, 1285 – 1288. GenStat Committee (2002). The Guide to GenStat Release 6.1 - Part 2: Statistics. VSN International, Oxford. Gentleman, R., Rossini, A., Dudoit, S. and Hornik, K. (2003), "The Bioconductor FAQ". http://www.bioconductor.org. Gilmour, A.R., Thompson, R. & Cullis, B.R. (1995). Average Information REML, an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51, 1440-1450. Gilmour, A.R., Cullis, B.R. & Verbyla, A.P. (1997). Accounting for natural extraneous variation in the analysis of field experiments. JABES 2, 269-273. Hedenfalk, I., Duggan, D. Chen, Y & 23 others (2001). Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 244, 539-548. Huber W., von Heydebreck, A., Sultmann H., Poustka, A. and Vingron M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics Suppl 1:S96-S104. Kerr, M. K., Martin M. and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7, 819
13
Patterson, H.D. & Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545-554. Spruill, S. E., Lu, J., Hardy, S. and Weir, B. (2002). Assessing sources of variability in microarray gene expression data. BioTechiques, 33, 916 – 923. Verbyla, A.P. (1995). A mixed model formulation of smoothing splines and testing linearity in generalised linear models. Department of Statistics Research Report 95/5, University of Adelaide. Verbyla, A.P., Cullis, B.R., Kenward, M.G. & Welham, S.J. (1997). The analysis of designed experiments and longitudinal data using smoothing splines (with discussion). Applied Statistics 48, 269-311. Welham, S.J. and Thompson, R. (1997) Likelihood ratio tests for fixed model terms using residual maximum likelihood. Journal of the Royal Statistical Society B 56, 701 - 714 Williams, E. R. (1986). Row and column designs with contiguous replicates. Australian Journal of Statistics, 28, 154-163. Wolfinger, R. D., Gibson G., Wolfinger, E. D., Bennet, L., Hamadeh, H., Bushel, P., Afshari, C. and Paules, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology, 8, 625 – 637. Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Pend, V., Ngai, J. and Speed, T. P. (2002). Normalization of cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30, e15:1-e15:11.
14
Table legends Table 1.
Percentage Variance explained by Loess and Spatial models
Figure Legends Figure 1.
The MA plot of microarray results from slide 18-33.
Figure 2.
Natural spline basis functions for 24 knot-points for a set of 8448 observations.
Figure 3.
The MA plot of microarray results from slide 14-9.
Figure 4.
Spatial plot of log ratios over the slide 14-9. Cells shaded in red and a large positive value of log2(red/green) and cells in blue have a large negative value.
Figure 5.
Estimated pin adjustments for microarray slide 14-9 shown as a 3-dimensional bar chart.
Figure 6.
Estimated (a) row and (b) column adjustments plotted against row or column position respectively.
Figure 7.
Estimated dye bias adjustment for microarray 14-9.
Figure 8.
Adjusted M values after removing estimated spatial and dye bias effects for microarray 14-9. Plotted curves are approximate 95% confidence limits (in blue) around the smoothed mean line (in red).
Figure 9.
Adjusted M values after removing estimated spatial and dye bias effects for microarray 18-33. Plotted curves are approximate 95% confidence limits (in blue) around the smoothed mean line (in red).
Figure 10.
Standardized M values after removing estimated spatial and dye bias effects for microarray 18-33. Plotted curves are approximate 95% confidence limits (in blue) around the smoothed mean (in red).
15
Table 1. Model Fitted Pins + Dye Bias + Pin × Dye Bias + Rows + Columns + AR1 × AR1
Loess 5.1% 7.9% 9.5% -
Cubic Spline 5.1% 10.6% 11.3% 17.1% 18.7%
16
log2(ratio)
Figure 1
log2(intensity)
17
Basis function value
Figure 2
log2(intensity)
18
log2(ratio)
Figure 3
log2(intensity)
19
Figure 4
20
Figure 5
21
Figure 6 a)
b)
22
Bias adjustment
Figure 7
log2(intensity)
23
Residual log2(ratio)
Figure 8
log2(Intensity)
24
Residual log2(ratio)
Figure 9.
log2(Intensity)
25
Figure 10.
Standardized residual log2(ratio)
log2(Intensity)
26