Normalization of Dye Bias in Microarray Data Using the ... - CiteSeerX

Comment

Report 1 Downloads 80 Views

Normalization of Dye Bias in Microarray Data Using the Mixture of Splines Model Y. Joo,1,∗ G. Casella,2 J. Booth,3 K. Lee2 and S. Enkemann4 1

Department of Statistics, College of Medicine, University of Florida, Gainesville FL 32610.

2 3

Department of Statistics, University of Florida, Gainesville FL 32611.

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853.

4

H. Lee Moffitt Cancer Center and Research Institution, Tampa, Fl 33612 November 3, 2004

Summary. This paper discusses characteristics of dye biases in microarray data that the conventional normalization methods do not handle. A new normalization method is proposed involving a mixture of spines model. We also develop a test for between-group comparisons of each gene that is designed to be used with the proposed method. Key words: Mixture of splines; Microarray data; Dye bias. 1.

Introduction

Microarray technology has been around for less than ten years. The first reported use utilized cDNA aliquots spotted onto glass slides in a defined array (Schena et al., 1995). Shortly thereafter an alternative format was reported ∗

email: [email protected] 1

in which short oligonucleotides were synthesized directly on a substrate to create the array (Lockhart et al., 1996). Since these initial experiments there have been many papers reporting the promise offered by microarray technology. Despite its widespread acceptance there are still many issues that have not been satisfactorily resolved. Many of the uncertainties are caused by the desire to coordinate the measurement of thousands of genes without taking the time to optimize the measurement of each gene (Halgren et al., 2001; Kothapalli et al., 2002). There are other aspects of a microarray experiment that still require optimization and some aspects that may never achieve a resolution that fits all applications. One example is the many mechanisms used for labeling nucleic acids prior to array hybridization. Some microarray technologies directly incorporate a fluorescent dye into the nucleic acid used for hybridization. This provides a direct measure of the amount of nucleic acid hybridized to each spot but it is limited by the amount of dye that can be incorporated by the enzymatic activity of the RNA polymerase or reverse transcriptase enzymes (Foldes-Papp and Rigler, 2001; Yu et al., 1994). For this reason intensely fluorescent dyes must be used which can unfortunately cause a high amount of background fluorescence. An alternative is to incorporate a non-fluorescent label, such as biotin, during transcription and to attach a fluorescent dye later in the process using molecules with an affinity for biotin (Brock and Potgieter, 1989; Warrington et al., 2000). This indirect labeling allows for the amplification of the signal but introduces a second step that can contribute variation to the process. The variety of microarray formats is almost paralleled by the variety of methods for fluorescently labeling the hybridizing 2

nucleic acid. The choice of labeling method and the efficiency of this process varies from experiment to experiment. Thus it is a primary source of variation in the microarray process. Any microarray samples must be normalized to account for this kind of technical variation prior to analysis. If this is not done properly then some of the observed differences may be due to the technical aspects of performing a microarray experiment and not due to the biology that one wishes to study. Statisticians working in genetics applications have been familiar with two types of normalization: experimental normalization to remove systematic bias in microarray experiments and statistical normalization to construct pivotal statistics. Sometimes these two notions of normalization can coincide. There have been numerous papers (Bolstad et al 2003, Dudoit et al 2002 and the references contained therein) focusing on experimental normalization. In this paper, we will focus on normalization of dye bias with a cDNA microarray data example. Since both the cDNA array and the spotted long oligonucleotide array (Barczak et al. 2003) use two color hybridizations with Cy3 (green) and Cy5 (red) dyes, the same normalization techniques are applicable for both array types. In comparing genes in the control and the treatment groups, one of the simplest methods is a t-test: w¯1j· − w¯2j·

t= q

s21j /n1 + s22j /n2

,

(1)

where wijk is a measurement with or without normalization for the gene at spot j on the array of subject k. The same gene is always placed on spot j 3

for all subjects, but can be replicated in more than one spot. The subject is in the control group if i = 1, and in the treatment group if i = 2. The index j = 1, . . . , J, J is the total number of spots on one array, k = 1, . . . , ni and ni is the number of subjects in the control or the treatment group. Summary Pni Pni 2 statistics are w ¯ij· = ¯ij· )2 /(ni − 1). k=1 wijk /ni and sij = k=1 (wijk − w Often the t-statistic in (1) is compared to a standard normal or t reference distribution. However, other techniques have been suggested which more accurately determine significance using (1); for example, by accounting for difference variances in each print-tip group, and for multiple comparisons (Dudoit et al. 2002). Yang et al. (2001) and Dudoit et al. (2002) showed that there are significantly different systematic trends between responses of two dyes depending on print-tips, and suggested corrections using normalization. Yang et al. (2001) suggested removing the mean trend on an m vs. a plot using lowess and estimating a constant standard deviation, for each print-tip group (see Section 2 for detailed description on cDNA microarray experiment and m vs. a plot): wijk =

mijk − Bl (aijk ) , cl σ

where Bl (aijk ) is the lowess fit on the m vs. a plot of subject k in group i using print-tip l, σ 2 is the overall variance of mijk , and c2l is the scale factor for the lth print-tip. The standard deviation σ is assumed to be independent of a. In this method, it is assumed that mijk in each print-tip group has the same location-scale family distribution with means that depends on aijk , and a constant variance. The mean trend is the indication of dye bias, which requires experimental normalization. However, Figure 1 in Yang et 4

al. (2001) shows that mijk given aijk has an asymmetric distribution and non-constant variance particularly where aijk ∈ (11,13), even though it is not pointed in Yang et al. (2001). The shape of distribution also changes with aijk . Therefore, the normalization needs to consider the change of the distributional family depending on aijk . In this paper, we suggest using, instead of (1),: F (m1j· |a1j· ) − F (m2j· |a2j· ) p → N (0, 1), (1/n1 + 1/n2 )/12 where F (mij· |aij· ) =

Pni k=1

(2)

F (mijk |aijk )/ni and F (mijk |aijk ) is the cumulative

distribution function of mijk |aijk . This test is derived from the fact that the cumulative distribution function of any continuous random variable has the uniform distribution, U (0, 1). If an analyst wants higher precision for the reference distribution, the Bates’s distribution can be calculated, (See page 296 in Johnson, Kotz and Balakrishnan 1995 for details) or the sum of uniform distributions can be simulated. However, the convergence to the normal distribution is fast enough to be used in many practical situations. We propose a new method, the mixture of splines model, to estimate or approximate the cumulative distribution function F in (2). At each cross section of aijk , it has a mixture of normal distributions that performs well in approximating various distributions. A similar model to ours was studied in a Bayesian framework by Wood, Jiang and Tanner (2002). The paper is arranged as follows. In Section 2, we give a detailed description of microarray data sets that will be used for the demonstration of the mixture of splines model. In Section 3, we propose the mixture of splines model and the ECM algorithm for estimation. In Sections 4 and 5, a simula5

tion study and an example are given. Finally, a summary is given in Section 6. 2.

Microarray Experiment: Spotted cDNA Arrays

The apolipoprotein AI micro array data (Yang et al. 2001) is used to demonstrate our normalization method. The apo AI experiment includes eight normal C57B1/6 mice for the control group (mouse id=1,2, . . . , 8) and another eight mice with the apo AI knocked-out for the treatment group (mouse id=1,2, . . . , 8). The target cDNA of all sixteen mice was obtained by reverse transcription and labelled using a red-fluorescent dye (Cy5). To create reference samples, the pooled-cDNA sample from eight control mice was obtained and labelled with a green fluorescent dye (Cy3). Then, 16 hybridized samples are generated with the green-dyed pooled-cDNA and red-dyed cDNA from each mouse in the treatment or control groups. A micro array slide with 6384 spots was prepared for a hybridized sample of each mouse. Sixteen print-tips are used to deliver hybridized samples into spots. There are 399 (=6384/16) spots on each slide that are covered by one print-tip. From each spot, the intensity of green, G, and red fluorescent dye, R, was measured. In an ideal case, R and G for the control mice should be the same since both red- and green-dyed samples are made from the control group mice. Even for treatment group mice, R and G should be the same or similar for most of genes (spots) because usually only a few genes have different expression by treatment. If the difference between R and G is independent and identically distributed, the observations will be evenly scattered along the 45o line on the plot of log2 (G) vs. log2 (R) (See Yang et al. 2001 for more details). However, the 6

apo AI data show unexpected nonlinear trends over a in terms of mean and variance. Yang et al. (2001) suggests analyzing the data after a 45o rotation, √ m = log2 (R/G) vs. a = log2 ( RG). This idea has some advantages. First, regardless of whether R or G is the vertical or horizontal axis, the same m vs. a plot is produced. When a lowess or regression type normalization method is applied to find the trend of the mean, we will have different results depending on whether a log2 (G) vs. log2 (R) or a log2 (R) vs. log2 (G) plot is used. Second, the vertical axis, m, is a discrepancy measure between log2 (R) and log2 (G) , which will show the existence of the dye bias. The horizontal axis, a, is the average of log2 (R) and log2 (G). Then it is easy to interpret as the distribution of the bias depends on the average gene abundance. 3.

Mixture of Splines Model

For notational convenience, let (ai , mi ) denotes the i-th observation in the data set of interest, instead of (aijk , mijk ). Since the conventional approach fails to capture distributional variation of m|a, we propose to use the mixture of cubic regression splines model: J X

2

2

e−{mi −µ(ai |βµj )} /{2σ (ai |βσj )} √ f (mi |ai ) = pj 2πσ(ai |βσj ) j=1

(3)

where xTiµj = {1, ai , a2i , a3i , (ai −τ1µ )3+ , (ai −τ2µ )3+ , · · · }, xTiσj = {1, ai , a2i , a3i , (ai − def

τ1σ )3+ , (ai − τ2σ )3+ , · · · }, (C)+ = max(0, C), µ(ai |βµj ) = xTiµj βµj , τdµ and τdµ are knots, and log{σ(ai |βσj )} = xTiσj βσj . Given ai , mi has a normal mixture model, which is flexible in fitting data or in approximating many other distributions. The spline components of µ(ai |βµj ) and σ(ai |βσj ) make these mixture of normal distributions of m|a connect smoothly. To eliminate possible confusion, we will strictly distinguish two terms, spline and 7

spline component. The term spline will be used for a normal distribution e

−{mi −µ(ai |βµj )}2 /{2σ 2 (ai |βσj )}

√

2πσ(ai |βσj )

, of which mean and variance have spline compo-

nents, xTiµj βµj and xTiσj βσj . The model (3) is the mixture of J splines. A similar model in the Bayesian framework appeared on Wood, Jiang and Tanner(2002): J X

e−{mi −µ(ai |βµj )} √ f (mi |ai ) = pj (ai |βpj ) 2πσj j=1

2 /(2σ 2 ) j

(4)

where µ(ai |βµj ) and pj (ai |βpj ) are the spline functions of ai , while σ is a constant over ai . Wood et al (2002) proposed this model for capturing abrupt changes in mean trends and demonstrated its high performance. However, we found from our simulation studies that (3) performs better than (4) in modelling the data that have distributional changes over ai in the second or higher moments. The mixture of two splines model is estimated using ECM (Meng and Rubin 1993) as follows (See Appendix for details). :

Step 1. To get random initial values, randomly divide observations into two (almost) equal size groups. Then, estimate a single (normal) heteroscedastic regression spline model (5) for each group: 2

2

e−{mi −µ(ai |βµ )} /{2σ(ai |βσ ) } √ . f (mi |ai ) = 2πσ(ai |βσ ) (0)

(0)

(0)

(0)

(5) (0)

This step provides initial values, β (0) = (βµ1 , βµ2 , βσ1 , βσ2 ). Also set p1 = (0)

p2 = 1/2 and k = 1.

8

Step 2. (k−1)

(k−1) Pj (mi )

=

pj

(k−1)

φ(mi |βµj

(k−1) (k−1) (k−1) p1 φ(mi |βµ1 , βσ1 )

+

(k−1)

, β σj

)

(k−1) (k−1) (k−1) p2 φ(mi |βµ2 , βσ2 )

where φ(mi |βµj , βσj ) is the normal distribution with mean µ(ai |βµj ) and σ(ai |βσj ). Step 3. βµ(k) = j

( n X Pj(k−1) (mi ) (k−1) 2 T ) i=1 (xiσj βσj (k−1)

(0)

Step 4. Let γσj = βσj

)−1 ( xiµj xTiµj

(k−1) n X Pj (mi )mi i=1

(k−1) 2 )

(xTiσj βσj

) xiµj

and l = 1. (

− γσ(l)j = γσ(l−1) j

(l−1)

∂h(γσj ) ∂γσj

)−1 ) h(γσ(l−1) j

where # (k) {mi − µ(ai |βµj )}2 h(γσj ) = −1 , σ 2 (ai |γσj ) i=1 " # n (k) X ∂h(γσj ) {mi − µ(ai |βµj )}2 (k−1) = −2 Pj (mi ) xiσj xTiσj 2 (a |γ ) ∂γσj σ i σ j i=1 n X

"

(k−1) Pj (mi )xiσj

(l)

Iterate (6) until γσj converges. Then, set βσ(k) = γσ(l)j j . Step 5. (k) p1

Pn

(k−1)

i=1

= Pn

P1

(k−1) (mi ) i=1 P1

(mi ) Pn (k−1) (mi ) + i=1 P2

and (k)

(k)

p2 = 1 − p1 . 9

(6)

Step 6. Repeat Step 2-5 until convergence.

Lange (1995) suggested only one Newton-Raphson iteration in the ECM algorithm, which can be applied to Step 4. However, we found that ECM for the mixture of splines model is sometimes unstable if the Newton-Raphson method is used only once in Step 4. Since ECM algorithm does not guarantee the convergence to the global maximum, Steps 1-6 need to be repeated to explore local modes. Step 1 generates random initial values for this exploration. In our various simulation studies, we obtained reliable results after 50-100 runs. Compared to a single (normal) spline model, this method fits significantly better if m|a has an asymmetric or bimodal distribution. The performance of the mixture of splines model is demonstrated through a simulation study in Section 4. 4.

Simulation study for the mixture of splines model

For the purpose of the simulation studies, consider a true model: g(ai ) =

1 I{a ∈(0,4)} 4 i

and 2

2

2

2

e−{mi −µ(ai |βµ1 )} /{2σ(ai |βσ1 ) } e−{mi −µ(ai |βµ2 )} /{2σ(ai |βσ2 ) } √ √ f (mi |ai ) = 0.4× +0.6× 2πσ(ai |βσ1 ) 2πσ(ai |βσ2 ) (7) where µ(ai |βµ1 ) = 5 + ai − 3a2i + a3i − 3(ai − 2)3+ , σ(ai |βσ2 ) = exp{1 + ai − 0.5a2i + 0.5(ai − 2)3+ }, µ(ai |βµ2 ) = −1 + ai + a2i − (ai − 2)3+ and σ(ai |βσ2 ) = exp{0.1 + 0.5ai − 0.1a2i + 0.1(ai − 2)3+ }. 10

[Figure 1 about here.] Estimating g(ai ) is not of interest. It is used just to simulate the data. We simulated 300 observations from (7) and drew a scatter plot in Figure 1(a). The probability contour plot of the true model (7) in Figure 1-(b) shows that the random variable m|a has an asymmetric unimodal distribution when ai ∈ (0,2) and a bimodal distribution when ai ∈ (3,4). The figure indicates that a single homoscedastic spline model, e−{mi −µ(ai |βµ )} √ f (mi |ai ) = 2πσ

2 /(2σ 2 )

,

(8)

or even a single heteroscedastic spline model (5) can not explain the data properly. The means of the two splines are drawn with a solid line and the mean ± 2× standard deviations are drawn in dash lines. For each estimate of the mixture of splines model, the ECM algorithm ran 100 times as described in Section 3. One hundred ECM’s usually found 4-6 local modes and 10-40 out of 100 runs visited the highest local mode. Let “Model [nmix , nµ , nσ ]” denote a model with nmix splines to be mixed, common nµ knots for the spline component of nmix mean functions, and common nσ knots for the spline component of nmix variance functions. We used equally spaced knots. For example, Model[2,1,0] is 2 X

2

2

e−{mi −µ(ai |βµj )} /{2σ (ai |βσj )} √ f (mi |ai ) = pj 2πσ(ai |βσj ) j=1 where xTiµj = {1, ai , a2i , a3i , (ai − τ1µ )3+ }, xTiσj = (1, ai , a2i , a3i ) and τ1µ is the average of the maximum and minimum of ai . The true model belongs to Model [2,1,1]. 11

We applied the single heteroscedastic spline model (Figure 1-(c)), the mixture of two splines (Figure 1-(d)) and the mixture of three splines models (Figure 1-(e)). A single spline model assigns high probability to the sparse area around (ai ,mi )=(0.5, 0.0), while the mixture of two splines model (Figure 1-(d)) assigns low probability there. The model estimates in Figure 1-(d) and the true model in Figure 1-(b) are very close. The mixture of three splines model in Figure 1-(e) is also similar, but has an unstable pattern in the right tail area (a > 3.5). This can be an indication of over-parametrization. 5.

Normalization of the Apolipoprotein AI data

The relationship between m and a in the Apo AI data is estimated separately for each mouse in each print-tip group using the mixture of splines models. For control group mice, all observations should be scattered around m=0 line when dye bias does not exit, because R is measured from a single control group mouse and G is from a pooled control group sample. For treatment group mice, there can be a few observations away from the m=0 line that distinguish control and treatment group tissues. However, since the number of these observations is typically very low, we assume that their effect on the mixture of splines model is negligible. The same assumption is also used in Yang et al (2001). For example, in Figure 2, we applied the mixture of two splines models to the gene expression on spots that is from mouse 1 in the control group and measured by the 16th print-tip. A single heteroscedastic spline model in Figure 2-(a) performs poorly in capturing asymmetric distribution of m|a around a ∈ (12, 13), while the mixture of splines models do well in Figure 2-(b) and (c). However, Model [3,2,1] has unstable patterns such as extremely high probability in small areas. This is often an indication 12

of overparametrization. Based on the probability contour plots Model [2,2,1] is most reasonable. In Figure 2-(d), -(e) and -(f), CDF plots are drawn for Models [1,2,1], [2,2,1] and [3,2,1]: a CDF plot has the cumulative probability on the vertical axis and a on the horizontal axis. If the model fits well, points within a small window of a-values should be scattered uniformly. Figure 2(d), -(e) and -(f) show that, as more splines are added to the model (from Model [1,2,1] to [3,2,1]), more uniformity is obtained. Particularly, there is obvious improvement between Model [1,2,1] and [2,2,1] where a ∈ (12, 13). However, Figure 2-(e) and -(f) shows negligible improvement between Model [2,2,1] and [3,2,1]. [Figure 2 about here.] 6.

Discussion

Conventional dye normalization methods has been developed based on the assumptions that mean behavior depends on a, but changes in variance depend only on the print-tips (Dudoit et al. 2002 and Yang et al. 2001). However, in practice the entire distribution of m can vary with a (Figure 2). Changes in shape or higher moments of the distribution have not previously been considered. In this paper we suggest using the mixture of splines model to approximate the distributions of m|a. Incorporating the characteristics of data, the average cumulative probability test was applied to compare control and treatment groups. Even though we used cDNA data as an example, this method can also be applied to spotted long oligonucleotide arrays.

13

References Barczak, A., Rodriguez, M. W., Hanspers, K., Koth, L. L., Tai, Y. C., Bolstad, B. M., Speed, T. P., and Erle, D. J. (2003), Spotted Long Oligonucleotide Arrays for Human Gene Expression Analysis Genome Research http://www.genome.org/cgi/content/ abstract/1048803v1 Bolstad B., Irizarry R., Astrand M., Speed T. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 2, 185-193. Brock, K.V. and Potgieter, L.N. (1989) Rapid fluorescence detection of in situ hybridization with biotinylated bovine herpesvirus-1 DNA probes. J. Vet. Diagn. Invest., 1, 34-38. Dudoit, S., Yang, Y., Callow, M. and Speed, T. (2002) Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Statistica Sinica 12, 111-139. Foldes-Papp, Z. and Rigler, R. (2001) Quantitative two-color fluorescence cross-correlation spectroscopy in the analysis of polymerase chain reaction. Biol Chem, 382, 473-478. Halgren, R.G., Fielden, M.R., Fong, C.J. and Zacharewski, T.R. (2001) Assessment of clone identity and sequence fidelity for 1189 IMAGE cDNA clones. Nucleic Acids Res, 29, 582-588. Johnson, N., Kotz, S. and Balakrishnan, N. (1994) Continuous univariate distributions Vol 1. 14

Johnson, N., Kotz, S. and Balakrishnan, N. (1995) Continuous univariate distributions Vol 2. Kane, M.D., Jatkoe, T.A., Stumpf, C.R., Lu, J., Thomas, J.D. and Madore, S.J. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res, 28, 4552-4557. Kothapalli, R., Yoder, S.J., Mane, S. and Loughran, T.P., Jr. (2002) Microarray results: how accurate are they? BMC Bioinformatics, 3, 22. Lange, K. (1995) A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, Series B, 57 ,425-437. Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown, E.L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol, 14, 1675-1680. Meng, X. and Rubin, D. (1993). Maximum likelihood estimation via the ECM algorithm : A general framework, Biometrika, 80, 267-278. Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470. Warrington, J.A., Nair, A., Mahadevappa, M. and Tsyganskaya, M. (2000) Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics, 2, 143-147.

15

Wood, S., Jiang, W. and Tanner, M. (2002) Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika 89, 3, pp513528. Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. Normalization for cDNA microarray data. (2001) In M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R. Dougherty (eds), Microarrays: Optical Technologies and Informatics, Proceedings of SPIE, 4266. Yu, H., Chao, J., Patek, D., Mujumdar, R., Mujumdar, S. and Waggoner, A.S. (1994) Cyanine dye dUTP analogs for enzymatic labeling of DNA probes. Nucleic Acids Res, 22, 3226-3232.

Appendix ECM algorithm for the mixture of two splines Here is detailed description of the ECM (Expectation-Conditional Maximization) algorithm for the mixture of splines model (Meng, 1993). Let (ai , mi ) be the observed values and z an indicator variable for a group (spline). Then we have the following algorithm:

Initialization : Assign initial values for estimates (0)

(0)

(0)

(0)

Set k = 1. Also, assign initial values β (0) = (βµ1 , βµ2 , βσ1 , βσ2 ), p(0) = (0)

(0)

(0)

(0)

(p1 , p2 ) and z (0) = (z1 , z2 ). E-step : Calculate the expectation of log complete data likelihood.

16

Let zij = 0 or 1 and

P2

j=1 zij

= 1 for any i. Also, suppose mi ∼

φ(mi |βµj , βσj ) when zij = 1. When zi = (zi1 , zi2 ) is considered as the missing variable, the log complete data likelihood of m and z is log f (m, z|β, p) n X = [zi1 · log {φ(mi |βµ1 , βσ1 ) · p1 } + zi2 · log {φ(mi |βµ2 , βσ2 ) · p2 }] , i=1

where β = (βµ1 , βµ2 , βσ1 , βσ2 ), p = (p1 , p2 ), (k−1)

Pj

(k−1)

(mi ) =

pj

(k−1)

φ(mi |βµj

(k−1)

,βσj

)

(k−1) (k−1) (k−1) (k−1) (k−1) (k−1) p1 φ(mi |βµ1 ,βσ1 )+p2 φ(mi |βµ2 ,βσ2 )

, and φ(mi |βµj , βσj )

is the normal distribution with mean µ(ai |βµj ) = xTiµ βµj and σ(ai |βσj ) = exp(xTiσ βσj ). The expectation is Q(β, p|β (k−1) , p(k−1) , m) = Ez {log f (m, z|β, p)} n h X (k−1) P1 (mi ) · log {φ(mi |βµ1 , βσ1 ) · p1 } = i=1 (k−1) (mi ) +P2

i · log {φ(mi |βµ2 , βσ2 ) · p2 }

CM-step : Maximize Q(·) with respect to βµ = (βµ1 , βµ2 ), βσ = (βσ1 , βσ2 )and pj . (k)

To calculate βµj , set ∂Q(β, p|β (k−1) , p(k−1) , m) ∂Q(β, p|β (k−1) , p(k−1) , m) ∂µ(ai |βµj ) = =0 ∂βµj ∂µ(ai |βµj ) ∂βµj ½ ¾ n X mi − µ(ai |βµj ) ∂µ(ai |βµj ) (k−1) (mi ) ⇔ Pj = 0, 2 (a |β ) σ ∂β i σ µ j j i=1 ⇔

(k−1) n X Pj (mi ) · mi i=1

⇔ βµ(k) = j

(k−1) n X Pj (mi )

xiµj xTiµj βµj (k−1) 2 ) i=1 σ (ai |βσj ) ) ( ( n −1 (k−1) n X X Pj(k−1) (mi ) P (m ) · m i i j xiµj xTiµj xiµj (k−1) (k−1) 2 2 ) ) i=1 σ (ai |βσj i=1 σ (ai |βσj

(k−1) σ 2 (ai |βσj )

xiµj =

17

(k)

To calculate βσj , set ∂Q(β, p|β (k−1) , p(k−1) , m) ∂Q(β, p|β (k−1) , p(k−1) , m) ∂σ(ai |βσj ) = =0 ∂βσj ∂σ(ai |βσj ) ∂βσj · ¸ n X {mi − µ(ai |βµj )}2 ∂σ(ai |βσj ) 1 (k−1) ⇔ Pj (mi ) =0 − σ 2 (ai |βσj ) σ(ai |βσj ) ∂βσj i=1 ¸ · n X {mi − µ(ai |βµj )}2 (k−1) ⇔ Pj (mi ) · xiσj −1 =0 (A.1) 2 (a |β ) σ i σ j i=1 Because the solution for βσj is not in a closed form, we use the NewtonRaphson algorithm: ( − γσ(l)j = γσ(l−1) j

(l−1)

∂h(γσj ) ∂γσj

)−1 ) h(γσ(l−1) j

where # (k) {mi − µ(ai |βµj )}2 h(γσj ) = · xiσj − 1 , from (A.1), σ 2 (ai |γσj ) i=1 " # n (k) X ∂h(γσj ) {mi − µ(ai |βµj )}2 (k−1) = −2 Pj (mi ) · xiσj xTiσj 2 (a |γ ) ∂γσj σ i σ j i=1 n X

"

(k−1) Pj (mi )

After the Newton-Raphson algorithm converges, set = γσ(l)j βσ(k) j . Finally, we need to estimate p1 and p2 : ∂Q(β, p|β (k−1) , p(k−1) , m) =0 ∂p1 ( ) n (k−1) (k−1) X (mi ) P1 (mi ) P1 ⇔ − =0 p 1 − p 1 1 i=1 Pn (k−1) (mi ) (k) i=1 P1 ⇒ p1 = Pn Pn (k−1) (k−1) (mi ) (mi ) + i=1 P2 i=1 P1 18

and (k)

(k)

p2 = 1 − p1 . The algorithm can be easily extended to the mixture of more than two splines.

19

Figure 1. Simulated 300 observations from the model (7) and fits. The contour plots are drawn based on the estimation of a heteroscedastic single spline and mixture of splines models. Dots on all graphs are the observations. Means of splines are drawn in solid lines, and mean ± 2× standard deviation are in dashed lines.

20

Figure 2. Mouse 1 in control group: Probability contour plots of (a) Model [1,2,1]: Single spline, (b) Model [2,2,1]: Mixture of two splines and (c) Model [3,2,1]: Mixture of three splines, and corresponding CDF plots in (d), (e) and (f).

21