Estimation of the Mixture Transition Distribution Model

André Berchtold¹

Technical Report no. 352
Department of Statistics
University of Washington
Seattle, WA 98195-4322

April 1999

¹ Email: [email protected], [email protected]

Abstract

This paper introduces a new iterative algorithm for the estimation of the Mixture Transition Distribution (MTD) model. It does not require any specific external optimization procedure and can therefore be programmed in any computing language. Comparisons with previously published results show that this new algorithm performs at least as well as, or better than, other methods. The choice of initial values is also discussed. The MTD model was designed for the modeling of high-order Markov chains and has already proved to be a useful tool for the analysis of different types of time-series, such as wind speeds and wind directions. In this paper, we also propose to use it for the modeling of one-dimensional spatial data. An application using a DNA sequence shows that this approach can lead to better results than the classical Potts model.

Keywords: Mixture Transition Distribution model (MTD), Markov chain, Optimization, Iterative procedure, Initial values, Time-series, Spatial data, Potts model, DNA.

Contents

1 Introduction
2 Algorithm
3 Relation with standard optimization methods
4 Choice of initial values
5 Examples: Time-series
  5.1 Song of the wood pewee
  5.2 Other time-series
6 Analysis of one-dimensional spatial data
  6.1 Motivation
  6.2 Analysis of DNA
7 Conclusion
Acknowledgements
References

List of Tables

1 Different models for the wood pewee song
2 Different models for the mouse αA-crystallin gene

1 Introduction

The Mixture Transition Distribution (MTD) model was introduced by Raftery (1985a) for the modeling of high-order Markov chains in discrete time. Let $X_t$ be a random variable taking values in the finite set $\{1, \ldots, m\}$. In an $\ell$-th order Markov chain, the present depends on the $\ell$ previous periods. As the number of possible outputs $m$ and the order $\ell$ increase, the number of parameters of the Markov chain becomes extremely large and prevents its estimation. In Raftery's modeling, the contribution of each lag $X_{t-1}, \ldots, X_{t-\ell}$ to the present is considered separately. The MTD model is defined as

$$P(X_t = i_0 \mid X_{t-1} = i_1, \ldots, X_{t-\ell} = i_\ell) = \sum_{g=1}^{\ell} \varphi_g \, P(X_t = i_0 \mid X_{t-g} = i_g) = \sum_{g=1}^{\ell} \varphi_g \, q_{i_g i_0} \qquad (1)$$

where $i_0, \ldots, i_\ell \in \{1, \ldots, m\}$, $Q = [q_{i_g i_0}]$ is an $m \times m$ row transition matrix, and $\varphi = \{\varphi_1, \ldots, \varphi_\ell\}$ is a vector of lag parameters subject to the following constraints:

$$\sum_{g=1}^{\ell} \varphi_g = 1 \qquad (2)$$

$$\varphi_g \geq 0 \qquad (3)$$

Constraints (3) can be removed (Raftery & Tavaré, 1994), but in this case the model can produce results which are no longer probabilities, and it becomes necessary to impose the new set of constraints

$$T q_i^- + (1 - T) q_i^+ \geq 0, \qquad \forall i \in \{1, \ldots, m\} \qquad (4)$$

where $T = \sum_{\varphi_g \geq 0} \varphi_g$ is the sum of the non-negative lag parameters, and

$$q_i^- = \min_{1 \leq j \leq m} q_{ji}, \qquad q_i^+ = \max_{1 \leq j \leq m} q_{ji}$$

are the smallest and largest elements of the $i$-th column of $Q$.
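To fix ideas, the following minimal sketch (Python, with illustrative names; not the GMTD implementation) evaluates the MTD conditional probability of equation (1) from a transition matrix $Q$ and a lag vector $\varphi$:

```python
import numpy as np

def mtd_prob(i0, past, Q, phi):
    """P(X_t = i0 | X_{t-1} = past[0], ..., X_{t-l} = past[l-1])
    under the MTD model of equation (1); states are 0-indexed."""
    # Each lag g contributes phi_g * q_{i_g, i_0}.
    return sum(phi[g] * Q[past[g], i0] for g in range(len(phi)))

# Illustrative values with m = 2 states and l = 2 lags.
Q = np.array([[0.7, 0.3],
              [0.4, 0.6]])
phi = np.array([0.8, 0.2])
p = mtd_prob(0, past=[1, 0], Q=Q, phi=phi)  # 0.8*0.4 + 0.2*0.7 = 0.46
```

Since each row of $Q$ and the vector $\varphi$ sum to one, the resulting conditional distribution over $i_0$ is itself a probability distribution whenever constraints (2) and (3) hold.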

The MTD model is far more parsimonious than the fully parameterized Markov chain since, in addition to the $m(m-1)$ independent parameters of the transition matrix $Q$, it requires only one additional parameter for each supplementary lag. An interesting development of the MTD model consists in the use of a different transition matrix for each lag (Raftery, 1985b; Berchtold, 1996). This model, called MTDg, is defined as

$$P(X_t = i_0 \mid X_{t-1} = i_1, \ldots, X_{t-\ell} = i_\ell) = \sum_{g=1}^{\ell} \varphi_g \, q^{(g)}_{i_g i_0} \qquad (5)$$

where $q^{(g)}_{i_g i_0}$ is the probability of observing $X_t = i_0$ given $X_{t-g} = i_g$. For each lag $g$, we have a different row transition matrix $Q_g = [q^{(g)}_{i_g i_0}]$. The MTDg model is less parsimonious than the MTD model, since it requires $m(m-1)+1$ additional parameters for each supplementary lag, instead of 1. However, it has the ability to represent situations with a completely different relation between each lag and the period $t$. Sections 5.1 and 6.2 present practical examples. In addition to the use of a different matrix for each lag, the MTD model has been developed in several ways, including infinite-lag models and missing data (Mehran, 1989a,b), and continuous spaces (Martin & Raftery, 1987; Le et al., 1996). Other contributions related to the MTD model include Adke & Deshmukh (1988), Schimert (1992), Raftery (1993), Raftery & Tavaré (1994), MacDonald & Zucchini (1997) and Berchtold (1997, 1998). Moreover, the model presented by Pegram (1980) is a special case of the MTD. Although the MTD model is a potentially powerful representation of Markov chains, it is difficult to use on account of the lack of a convenient method of estimation. Formally, we have to estimate the transition matrix $Q$ and the vector of lag parameters $\varphi$. This can be done by maximizing the log-likelihood

$$\log(L) = \sum_{i_0, \ldots, i_\ell = 1}^{m} n_{i_0 \ldots i_\ell} \log \left( \sum_{g=1}^{\ell} \varphi_g \, q_{i_g i_0} \right) \qquad (6)$$

where $n_{i_0 \ldots i_\ell}$ is the number of sequences of the form $X_{t-\ell} = i_\ell, \ldots, X_t = i_0$ in the data. The log-likelihood must be maximized with respect to constraints (2) and (3) or (4).
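As an illustration, the log-likelihood (6) can be computed by first tabulating the counts $n_{i_0 \ldots i_\ell}$; a minimal sketch, with illustrative names and 0-indexed states:

```python
import numpy as np
from collections import Counter

def mtd_loglik(data, Q, phi):
    """Log-likelihood (6) of an MTD model for a 0-indexed state sequence."""
    l = len(phi)
    # n_{i0...il}: counts of the observed windows (X_{t-l}, ..., X_t).
    counts = Counter(tuple(data[t - l:t + 1]) for t in range(l, len(data)))
    ll = 0.0
    for window, n in counts.items():
        i0 = window[-1]                   # present state
        past = window[-2::-1]             # past[g] = state at lag g+1
        p = sum(phi[g] * Q[past[g], i0] for g in range(l))
        ll += n * np.log(p)
    return ll
```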

The main difficulties lie in the non-linearity of (6) and in the large number of constraints. A number of different methods were proposed in previous papers, but they all require the use of optimization procedures which are not broadly and freely available. The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is an alternative method, used once to estimate a continuous-space version of the MTD model (Le et al., 1996), but again we do not know of a published workable version of the exact procedure used. Clearly, there is a demand for an algorithm which can be programmed in any computing language, without the use of particular toolboxes or optimization routines. Such a method is described in Sections 2 and 3. The choice of initial values is discussed in Section 4, and comparisons with previously published results are provided in Section 5. Finally, in Section 6, we propose to use the MTD model for the analysis of one-dimensional spatial data. A program called GMTD implementing our algorithm can be freely downloaded from the following Web page:

http://www.unige.ch/ses/ecoth/sta/berchtol/softwares.html

2 Algorithm

Note first that there is no algebraic solution to the maximization of the log-likelihood of the MTD model. Therefore, our approach is based on an iterative procedure. One of the main weaknesses of iterative methods is that they converge to a local maximum rather than to the global maximum of the log-likelihood. The choice of the starting values is therefore crucial; we discuss this point in Section 4. The whole set of parameters of the MTD model can be divided into $m+1$ subsets: the $m$ rows of the transition matrix $Q$ and the vector $\varphi$. It is possible to increase the log-likelihood by reevaluating only one of these sets, the $m$ others being fixed. As a first approximation, we suppose that the matrix $Q$ is constant and we seek to increase the log-likelihood by reevaluating the vector $\varphi$. We also activate constraints (3). Since the sum of the lag parameters is equal to one, the idea is to balance an increase in one of these parameters with an equal decrease in another parameter. The problem lies in the choice of the two parameters to modify. In other words, we must determine which parameter allows, by its increase, a maximum increase of the log-likelihood, and which parameter implies, by its decrease, a minimum decrease of the log-likelihood. Let

$$\frac{\partial \log(L)}{\partial \varphi_k} = \sum_{i_0, \ldots, i_\ell = 1}^{m} n_{i_0 \ldots i_\ell} \, \frac{q_{i_k i_0}}{\sum_{g=1}^{\ell} \varphi_g \, q_{i_g i_0}}, \qquad k = 1, \ldots, \ell \qquad (7)$$

be the partial derivative of the log-likelihood with respect to the $k$-th lag parameter. The partial derivative gives a measure of the local impact upon the log-likelihood of a change in one parameter. Since all parameters belong to the continuous space $[0, 1]$, all derivatives are non-negative. The best solution is then to increase the parameter corresponding to the largest derivative and to decrease the parameter corresponding to the smallest derivative. If we denote by $\varphi^+$ the parameter to increase, by $\varphi^-$ the parameter to decrease, and by $\delta$ the amount of change, the reestimation of $\varphi$ is achieved by replacing $\varphi^+$ with $\varphi^+ + \delta$ and $\varphi^-$ with $\varphi^- - \delta$. Four special situations may occur:

1. If $\varphi^+ = 1$, this parameter cannot be increased and the algorithm cannot improve the log-likelihood through a reestimation of the vector $\varphi$.

2. If $\varphi^+ + \delta > 1$, the quantity $\delta$ is too large and it must be set to $\delta = 1 - \varphi^+$.

3. If $\varphi^- = 0$, this parameter cannot be decreased. Then, we decrease the parameter corresponding to the smallest strictly positive derivative.

4. If $\varphi^- - \delta < 0$, the quantity $\delta$ is too large and it must be set to $\delta = \varphi^-$.

Once the new vector $\varphi$ is known, we can compute the new log-likelihood of the model. If it is larger than the previous value, the new vector $\varphi$ is accepted and the procedure stops. Otherwise, $\delta$ is divided by 2 and the procedure iterates. When $\delta$ becomes smaller than a fixed threshold, we stop the procedure, even if $\varphi$ was not reevaluated. The log-likelihood achieved through this procedure is greater than or equal to the previous value. If we activate constraints (4) instead of (3), we do not have to consider the four special situations listed above, but we must verify a posteriori that all results of the model are probabilities. The same method is applied to each row of the transition matrix $Q$, and the corresponding partial derivatives are

$$\frac{\partial \log(L)}{\partial q_{i_k i_0}} = \sum_{i_0, \ldots, i_\ell = 1}^{m} n_{i_0 \ldots i_\ell} \, \frac{\varphi_k}{\sum_{g=1}^{\ell} \varphi_g \, q_{i_g i_0}} \qquad (8)$$
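The derivatives (7) and (8) can be accumulated from the same table of counts as the log-likelihood; a minimal sketch (names illustrative, states 0-indexed):

```python
import numpy as np
from collections import Counter

def mtd_gradients(data, Q, phi):
    """Partial derivatives (7) and (8) of the MTD log-likelihood.

    Returns (d_phi, d_Q), where d_phi[k] approximates the derivative with
    respect to the lag parameter of lag k+1, and d_Q accumulates the
    contributions of each tuple of states to d log(L) / d q_{ij}."""
    l, m = len(phi), Q.shape[0]
    counts = Counter(tuple(data[t - l:t + 1]) for t in range(l, len(data)))
    d_phi = np.zeros(l)
    d_Q = np.zeros((m, m))
    for window, n in counts.items():
        i0 = window[-1]
        past = window[-2::-1]             # past[k] = state at lag k+1
        denom = sum(phi[g] * Q[past[g], i0] for g in range(l))
        for k in range(l):
            d_phi[k] += n * Q[past[k], i0] / denom     # equation (7)
            d_Q[past[k], i0] += n * phi[k] / denom     # equation (8)
    return d_phi, d_Q
```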

Now that we have a method to increase the log-likelihood of the model through a reestimation of any of the $m+1$ sets of parameters, we can solve iteratively the problem of the maximization of the log-likelihood with respect to all parameters. The procedure can be decomposed into the following three steps (a sketch of the resulting loop is given below):

1. Initialization
   - Choose initial values for all parameters.
   - Choose a value for $\delta$ and a criterion to stop the algorithm.

2. Iterations
   - Reestimate the vector $\varphi$ by modifying two of its elements.
   - Reestimate the transition matrix $Q$ by modifying two elements of each row.

3. End criterion
   - If the increase of the log-likelihood since the last iteration is greater than the stop criterion, go back to step 2.
   - Otherwise, end the procedure.
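The following sketch renders one such balanced update for a probability vector (the vector $\varphi$, or one row of $Q$) under constraints (3). All names are illustrative, and the GMTD program may differ in its details:

```python
import numpy as np

def reestimate_distribution(p, grads, delta, loglik, min_delta=1e-8):
    """One balanced reestimation of a probability vector p.

    grads  : partial derivatives of log(L) w.r.t. the elements of p
    loglik : callable returning log(L) for a candidate vector
    """
    base = loglik(p)
    up = int(np.argmax(grads))                # largest derivative: increase
    if p[up] == 1.0:                          # special case 1: nothing to do
        return p
    positive = np.flatnonzero(p > 0)          # special case 3: skip zeros
    down = positive[np.argmin(grads[positive])]
    while delta >= min_delta:
        step = min(delta, 1.0 - p[up], p[down])   # special cases 2 and 4
        cand = p.copy()
        cand[up] += step
        cand[down] -= step
        if loglik(cand) > base:               # accept the first improving move
            return cand
        delta /= 2.0                          # otherwise halve delta and retry
    return p                                  # delta below threshold: keep p
```

One full pass of step 2 applies this update to $\varphi$ and then to each row of $Q$, reevaluating the derivatives (7) and (8) before each update.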

This method calls for the following remarks:

1. In some situations, especially when the initial values are ill-chosen or when the lag parameters are allowed to take values outside the range $[0, 1]$, the convergence of the procedure can be slow. It is then useful to fix a maximum number of iterations.

2. At the beginning of the optimization process, it is useful to have a large $\delta$ in order to allow important changes of the parameters. But, as we approach the optimum, changes become smaller and a large $\delta$ implies a great number of useless calculations. It is then suitable to modify the value of $\delta$ dynamically during the procedure. We chose the following approach. We use a different $\delta$ for each of the $m+1$ sets of parameters. All $\delta$ are initialized to the same value (say 0.1). Then, after the reestimation of each distribution, the corresponding $\delta$ is modified as follows:

   - If the distribution was not reevaluated because the parameter to increase was already set to 1, $\delta$ is not changed.
   - If the distribution was reevaluated with the original value of $\delta$ (that is, the value of $\delta$ at the beginning of step 2), $\delta$ is set to $2\delta$.
   - If the distribution was reevaluated with a value of $\delta$ smaller than its value at the beginning of step 2, $\delta$ keeps its present value.
   - If the algorithm was completed without reestimation, $\delta$ is set to twice the value reached at the end of step 2.

Even if in some particular situations this method can slow down the algorithm, it proved empirically to increase substantially the average speed of convergence. The same method is used to estimate the MTDg model, in which a different matrix represents the transition probabilities between each lag and the present. In this case, $\ell$ different transition matrices have to be reevaluated during the second step of the algorithm instead of one.
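These adaptation rules translate into a small dispatch; the status flags below are assumptions of this sketch, not names from the original program:

```python
def update_delta(delta_start, delta_end, status):
    """Adapt the step size of one parameter set after its reestimation.

    delta_start : value of delta at the beginning of step 2
    delta_end   : value of delta when the reestimation routine returned
    status      : 'blocked'  - the parameter to increase was already 1
                  'accepted' - an improving move was accepted
                  'rejected' - no move improved the log-likelihood
    """
    if status == 'blocked':
        return delta_start            # delta is not changed
    if status == 'accepted':
        # Accepted at the original step: double it; after shrinking: keep it.
        return 2 * delta_end if delta_end == delta_start else delta_end
    return 2 * delta_end              # no reestimation: twice the final value
```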

3 Relation with standard optimization methods

In this section, we compare the principle of our optimization algorithm with Newton's method for constrained optimization. For a complete review of standard optimization methods, see e.g. Gill et al. (1981) and Nash & Sofer (1996). The parameters of the MTD model can be classified into $m+1$ sets: the lag parameters $\varphi_1, \ldots, \varphi_\ell$ and the $m$ rows of the transition matrix $Q$. For the purpose of our demonstration, we denote by $\theta_g$ the $g$-th set of parameters, by $\theta = \{\theta_1, \ldots, \theta_{m+1}\}$ the set of all parameters, and by $\theta^*$ the solution of the algorithm. A first difference between our method and the others is the reestimation of only two parameters at a time. This can imply a zigzag path from the starting values to the solution. On the other hand, we can easily ensure that the iterates remain feasible with respect to all equality and inequality constraints of the model. The main difference between this algorithm and Newton's method is in the use of the partial derivatives. The analogy for unconstrained maximization can be described as follows. Consider the log-likelihood function $\log(L(\theta))$. Newton's method searches for a zero $\theta^*$ of $\nabla \log(L(\theta))$ through successive iterations defined as

$$\theta^{t+1} = \theta^t + \frac{\nabla \log(L(\theta^t))}{\nabla^2 \log(L(\theta^t))} \qquad (9)$$

and the first-order stationarity condition is

$$\nabla \log(L(\theta^*)) = 0 \qquad (10)$$

Our method follows the same general scheme, that is, an iterative process, but by contrast, the first-order partial derivatives are not used directly inside the reestimation process as in equation (9), and the second-order partial derivatives are not used at all. Instead, we can write our algorithm as

$$\theta^{t+1} = \theta^t + P^{t+1} \, f\{\nabla \log(L(\theta^t))\} \qquad (11)$$

where $P^{t+1}$ is an $(m^2 + \ell) \times (m^2 + \ell)$ diagonal matrix whose elements are the reestimation values $\delta$ associated with each of the $m+1$ sets of parameters,

$$P^{t+1} = \mathrm{diag}\big(\delta^{t+1}(1), \ldots, \delta^{t+1}(1), \;\ldots,\; \delta^{t+1}(m+1), \ldots, \delta^{t+1}(m+1)\big)$$

each $\delta^{t+1}(g)$ being repeated once for each parameter of the $g$-th set,

and $f$ is a function of the gradient of the log-likelihood:

$$f\{\nabla \log(L(\theta^t))\} = \big[\, h\{\nabla \log(L(\theta^t_1))\}, \;\ldots,\; h\{\nabla \log(L(\theta^t_{m+1}))\} \,\big] \qquad (12)$$

with

$$h\{\nabla \log(L(\theta^t_g(k)))\} = \begin{cases} 1 & \text{if } k = \arg\max_k \, \partial \log(L(\theta^t_g)) / \partial \theta_g(k) \\ -1 & \text{if } k = \arg\min_k \, \partial \log(L(\theta^t_g)) / \partial \theta_g(k) \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

Note that since the reestimation is made successively for each of the $m+1$ sets of parameters, $P$ and $f$ are completely known only at the end of each iteration. This algorithm automatically satisfies all constraints and is therefore a feasible-point method.
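In code, the function $h$ of equation (13) is a one-liner; the balanced update of Section 2 is then $\delta$ times this direction (illustrative sketch):

```python
import numpy as np

def update_direction(grads):
    """Map the gradient of one parameter set to the vector h of equation (13):
    +1 for the component with the largest derivative, -1 for the smallest,
    0 elsewhere."""
    h = np.zeros(len(grads))
    h[np.argmax(grads)] = 1.0
    h[np.argmin(grads)] = -1.0
    return h
```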

4 Choice of initial values

Like most iterative optimization methods, our algorithm does not ensure convergence to the global maximum of the log-likelihood. In order to maximize the probability of reaching this global maximum, the choice of initial values is of critical importance. In the MTD model, each element $q_{ij}$ of $Q$ represents simultaneously all of the following situations:

$$P(X_t = j \mid X_{t-1} = i), \;\; \ldots, \;\; P(X_t = j \mid X_{t-\ell} = i)$$

and the parameters $\varphi_1, \ldots, \varphi_\ell$ weight $q_{ij}$ according to the lag. To find initial values for $\varphi_1, \ldots, \varphi_\ell$, we suppose that these parameters are related to the strength of the relation between each lagged period and the present. Let

$$C_g = \begin{pmatrix} C_g(1,1) & \cdots & C_g(1,m) \\ \vdots & \ddots & \vdots \\ C_g(m,1) & \cdots & C_g(m,m) \end{pmatrix}$$

be the crosstable between the lag $g$ (rows) and the present (columns), and let

$$Q_g = [P(X_t = j \mid X_{t-g} = i_g)] = \begin{pmatrix} \frac{C_g(1,1)}{C_g(1,\cdot)} & \cdots & \frac{C_g(1,m)}{C_g(1,\cdot)} \\ \vdots & \ddots & \vdots \\ \frac{C_g(m,1)}{C_g(m,\cdot)} & \cdots & \frac{C_g(m,m)}{C_g(m,\cdot)} \end{pmatrix} = \begin{pmatrix} Q_g(1,1) & \cdots & Q_g(1,m) \\ \vdots & \ddots & \vdots \\ Q_g(m,1) & \cdots & Q_g(m,m) \end{pmatrix}$$

be the corresponding row transition matrix, with $C_g(i,\cdot) = \sum_{j=1}^{m} C_g(i,j)$. The idea is to evaluate the strength of the relation between each lag and the present period in the matrices $C_g$ and $Q_g$. To define a measure, we have to take into account the following elements:

- Even if in some situations the states of the process can be ordered or have a numerical significance, we generally deal with nominal data. Therefore, the measure must not use the numerical label assigned to each state and must be invariant to a change in the order of the modalities. With these requirements, a measure like the autocorrelation is disqualified.

- The measure must increase with the probability of knowing the value of the variable at time $t$ given its value at time $t-g$.

- A uniform transition matrix, in which every row is a uniform distribution, acts like a zero in the MTD model. If such a matrix is put into the model, it assigns the same probability to each situation and thus provides no information. Moreover, a transition matrix whose rows and columns are independent means that there is no relation between the corresponding periods and is also of little interest. On the other hand, a probability equal to 1 in a matrix $Q_g$ means a perfect association between the corresponding values at times $t-g$ and $t$.

- Since in $Q_g$ the column variable (time $t$) depends on the row variable (time $t-g$), the measure should take into account the direction of this relation.

- In general, we restrict the lag parameters $\varphi_g$ to the space $[0, 1]$. So it would be preferable to use a measure taking values in the same space.

Given all of these requirements, the association measures defined for crosstables (see for instance Bishop et al. (1975) and Ott et al. (1987)) are a perfect choice. Among all the nominal asymmetrical measures, the measure $u$ defined by Theil (1971) seems the best choice. This measure, based on Shannon's entropy, takes values in the space $[0, 1]$, 0 denoting independence between rows and columns of the crosstable, and 1 a perfect association. It fulfills all of the above requirements and, contrary to a measure like Goodman & Kruskal's $\tau$, it is not influenced by special structures of the crosstable. For the crosstable $C_g$, the measure $u_g$ is defined as

$$u_g = \frac{\displaystyle \sum_{i=1}^{m} \sum_{j=1}^{m} C_g(i,j) \log_2 \frac{C_g(i,\cdot) \, C_g(\cdot,j)}{C_g(i,j) \, T_{C_g}}}{\displaystyle \sum_{j=1}^{m} C_g(\cdot,j) \log_2 \frac{C_g(\cdot,j)}{T_{C_g}}} \qquad (14)$$

where

$$C_g(i,\cdot) = \sum_{j=1}^{m} C_g(i,j), \qquad C_g(\cdot,j) = \sum_{i=1}^{m} C_g(i,j), \qquad T_{C_g} = \sum_{i=1}^{m} \sum_{j=1}^{m} C_g(i,j)$$

We can then define the weight $\varphi_g$ associated with the $g$-th lag as

$$\varphi_g = \frac{u_g}{\sum_{k=1}^{\ell} u_k} \qquad (15)$$
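A sketch of this initialization, computing Theil's $u$ for each lag and normalizing as in (15) (helper names are assumptions of this note):

```python
import numpy as np

def theil_u(C):
    """Theil's u (equation (14)) for a crosstable C, with the lagged
    variable in rows and the present variable in columns."""
    C = np.asarray(C, dtype=float)
    T = C.sum()
    row = C.sum(axis=1)                   # C_g(i, .)
    col = C.sum(axis=0)                   # C_g(., j)
    outer = np.outer(row, col)
    nz = C > 0                            # convention: 0 log 0 = 0
    num = np.sum(C[nz] * np.log2(outer[nz] / (C[nz] * T)))
    den = np.sum(col[col > 0] * np.log2(col[col > 0] / T))
    return num / den

def initial_phi(data, l, m):
    """Initial lag vector (15): phi_g proportional to u_g."""
    u = np.empty(l)
    for g in range(1, l + 1):
        C = np.zeros((m, m))
        for t in range(g, len(data)):
            C[data[t - g], data[t]] += 1  # crosstable between lag g and present
        u[g - 1] = theil_u(C)
    return u / u.sum()
```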

In this way, the lag parameters are proportional to the strength of the relation between each lag and the present. The choice of the initial matrix $Q$ is somewhat more difficult. An easy solution would be to initialize $Q$ as the transition matrix of the first-order Markov chain. But, as we will see later with an example (Section 5.1), some time-series do not give the most important role to the first lag, and this strategy can lead to aberrant results. Another possibility is to define $Q$ as a weighted sum of the matrices $Q_1, \ldots, Q_\ell$. Let

$$\tilde{Q} = \sum_{k=1}^{\ell} \varphi_k C_k \qquad (16)$$

Then the element $(i,j)$ of $Q$ is

$$q_{ij} = \frac{\tilde{q}_{ij}}{\tilde{q}_{i\cdot}} \qquad (17)$$

where

$$\tilde{q}_{i\cdot} = \sum_{j=1}^{m} \tilde{q}_{ij} \qquad (18)$$

Finally, a third possibility is to set $Q = Q_k$, where $k = \arg\max_{g=1,\ldots,\ell} u_g$. In practice, the results achieved with the third solution are always equal to or better than those obtained through the first two possibilities. Therefore, we used it in the examples of Sections 5 and 6. In the case of the MTDg model, with a different matrix for each lag, the initial vector $\varphi$ is determined as above and $Q_g$ is taken as the initial transition matrix for lag $g$.
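Reusing the helpers above, the third initialization rule can be sketched as follows (again with illustrative names):

```python
def initial_Q(data, l, m):
    """Initialize Q as the empirical transition matrix Q_k of the lag k
    with the largest Theil's u (the third possibility above)."""
    best_u, best_Q = -1.0, None
    for g in range(1, l + 1):
        C = np.zeros((m, m))
        for t in range(g, len(data)):
            C[data[t - g], data[t]] += 1       # crosstable C_g
        u = theil_u(C)
        if u > best_u:
            # Normalize rows to obtain Q_g (rows with zero counts would
            # need special handling in practice).
            best_u, best_Q = u, C / C.sum(axis=1, keepdims=True)
    return best_Q
```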

5 Examples: Time-series

In this section, we apply our estimation method to several time-series analyzed with the MTD model in previous articles. All comparisons between models are carried out with the Bayesian Information Criterion (BIC). The application of this criterion to Markov chains is discussed in Katz (1981). It is defined as

$$BIC = -2 \, LL + p \log(N) \qquad (19)$$

where $LL$ is the log-likelihood of the model, $p$ is the number of independent parameters, and $N$ is the number of data points. The model achieving the lowest BIC is chosen. Section 5.1 gives an example where the use of our method leads to a better solution than the previously published one. Section 5.2 reports examples for which our procedure gives the same results as previous methods.
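In code form, the criterion used throughout Sections 5 and 6 is simply (helper name illustrative):

```python
import math

def bic(loglik, n_params, n_data):
    """Bayesian Information Criterion (19); the lowest value wins."""
    return -2.0 * loglik + n_params * math.log(n_data)
```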

5.1 Song of the wood pewee

Raftery & Tavaré (1994) applied the MTD model to a time-series of the twilight song of the wood pewee (these data were originally described by Craig (1943) and reanalyzed in both Chatfield & Lemon (1970) and Bishop et al. (1975)). We consider a time-series of length 1327, with three states corresponding to the three distinct phrases of the wood pewee song. We give here a sample of the data; the complete time-series appears in Table 12 of Raftery & Tavaré (1994).

2121121121 3121312121 3121312131 2131213121

Raftery & Tavaré fitted Markov chains of different orders and found that the data are best modeled with an order-2 Markov chain whose BIC is equal to 866.6. Then, they applied the MTD model of orders 2 and 3, but they did not achieve good results. Finally, they tried

several other modelings using the pattern structure of the time-series, and their final model has a BIC of 827. These results are summarized in Table 1, where MC stands for Markov Chain and Pattern for Raftery & Tavaré's best modeling. Using our algorithm, we reestimated the MTD models of orders 2 and 3, and we computed MTD models with a different matrix for each lag (MTDg in Table 1). In some cases, we dropped the first data points of the sequence in order to always have 1323 elements in the log-likelihood. For comparison purposes with the results published in Raftery & Tavaré (1994), the number of structural zeros was not subtracted from the number of parameters for the computation of BIC.

Table 1. Different models for the wood pewee song. Note that we obtained the same MTD model for orders 2 and 3, since $\varphi_3 = 0$ in the third-order model. The difference in BIC is the consequence of the supplementary lag parameter of the third-order model. Results reported for the MTD model were achieved using the best $Q_g$ matrix as initial value for $Q$. When using an average of $Q_1$, $Q_2$ and $Q_3$, the procedure converges to another local maximum corresponding to a BIC of 1442.6.

model               number of parameters   log-likelihood      BIC
Raftery & Tavaré
  Independence                2               -1349.4        2713.3
  MC 1                        6                -694.1        1431.4
  MC 2                       18                -368.6         866.6
  MC 3                       54                -354.0        1096.1
  MTD 2                       7                -644.3        1338.9
  MTD 3                       8                -643.1        1343.6
  Pattern                     4                -399.1         827.0
Berchtold
  MTD 2                       7                -568.0        1186.3
  MTD 3                       8                -568.0        1193.5
  MTDg 2                     13                -486.4        1066.2
  MTDg 3                     20                -484.0        1111.7

We observe that the MTD model achieves a better result than what was advanced by Raftery & Tavaré. This implies that their procedure did not converge to the global maximum of the log-likelihood. The best model in the MTD class is the MTDg 2 model, with lag parameters $\varphi_1 = 0.269$ and $\varphi_2 = 0.731$, and transition matrices

$$Q_1 = \begin{pmatrix} 0.097 & 0.739 & 0.164 \\ 0.980 & 0 & 0.020 \\ 0.987 & 0.013 & 0 \end{pmatrix} \qquad \text{and} \qquad Q_2 = \begin{pmatrix} 0.996 & 0 & 0.004 \\ 0.152 & 0.020 & 0.828 \\ 0.003 & 0.997 & 0 \end{pmatrix}$$

It is interesting to note that the most important lag in the explanation of the present is lag 2 rather than lag 1. As can be observed by looking at the data, the succession of the three different phrases of the song is governed by some very strong rules. For instance, there are never two successive occurrences of phrase 3, and phrases 2 and 3 are generally followed by phrase 1. This is reflected in the transition matrices by four probabilities very close to 1. While our procedure achieves a better MTD model than the method of Raftery & Tavaré, our analysis reaches the same conclusions as theirs: a model based upon the pattern structure of the data is better than a Markov chain in this case.

5.2 Other time-series

The original article on the MTD model (Raftery, 1985a) gives three examples. The first is a time-series of 672 consecutive hourly wind speeds classified into 4 categories. The second is a set of 3-step transitions showing the relationships between 25 students, for a total of 300 data points. These data were previously presented in Katz & Proctor (1959) and in Bishop et al. (1975). The last dataset was first used in Logan (1981). This is the observation of the primary work activity (management, research or teaching) of 9170 American physicists in the years 1960, 1962 and 1964. Raftery fitted both Markov chains and MTD models to each of these three datasets and found that the best model is in each case an MTD model (of order 3 in the first case and of order 2 in the two other cases). We applied our procedure to the same datasets. We also tried modelings with a different matrix for each lag, but our conclusion is in each case identical to Raftery's. Moreover, we determined that these solutions are almost surely strict local maxima. We also applied our procedure to the data provided in Mehran (1989a) concerning the employment status of a sample of 9124 Americans during a four-month period, from December 1981 to March 1982. We obtained the same MTD 3 model as Mehran, and an MTDg 3 model proved to be useless. The solution is almost surely a strict local maximum. MacDonald & Zucchini (1997) used the MTD model on different datasets. The first is a time-series of length 299 of the durations of successive eruptions of the Old Faithful geyser (Yellowstone Park, USA). These durations are classified into the categories short and long. We agree with MacDonald & Zucchini that the best model is a second-order Markov chain and that neither an MTD nor an MTDg model can improve the results. Our conclusion is similar in the case of the evapotranspiration dataset, but with a first-order Markov chain. Finally, the MTD 2 model used by MacDonald & Zucchini to represent the hourly wind direction at Koeberg (South Africa) cannot be improved by another MTD model.

6 Analysis of one-dimensional spatial data

In this section, we discuss the use of the Mixture Transition Distribution model for the analysis of one-dimensional spatial data.

6.1 Motivation

The MTD model was designed for the modeling of time-series and has not previously been applied in another context. We propose here to suppress the time reference and to use this model for the analysis of one-dimensional spatial data. In the discussion of Besag et al. (1991), Raftery & Banfield refer to this possibility, but without providing concrete examples. This proposition is directly related to Markov Random Field (MRF) theory (Besag, 1974; Kinderman & Snell, 1980). Consider a sequence of length $N$. To each site $i$ corresponds a random variable $X_i$ taking values in the finite set $\{1, \ldots, m\}$. We want to calculate the probability of observing a particular realization $x_i$ of $X_i$ given the values of its neighbors. If we define, for instance, the neighborhood of $i$ as the set of sites $\{i-2, i-1, i+1, i+2\}$, and if we make the assumption that the realization of $i$ depends only on its neighbors, we can write

$$P(X_i \mid X \setminus \{X_i\}) = P(X_i \mid X_{i-2}, X_{i-1}, X_{i+1}, X_{i+2}) \qquad (20)$$

where $X$ denotes the entire set of random variables over the whole sequence. For the purpose of our demonstration, we write $X_0 = X_i$ and we number the neighbors from 1 to 4. Then, using a principle similar to the MTD model of equation (1), we approximate the conditional probability of equation (20) as

$$P(X_i \mid X \setminus \{X_i\}) = \sum_{g=1}^{4} \varphi_g \, P(X_0 \mid X_g) = \sum_{g=1}^{4} \varphi_g \, q_{x_g x_0} \qquad (21)$$

where the $q_{x_g x_0}$ are probabilities of an $m \times m$ transition matrix $Q$ representing the conditional relation between the site $i$ and each of its neighbors. The same principle can be extended to different sets of neighbors. It is also possible to use a different transition matrix for each neighbor. As outlined by Raftery & Banfield, the main difficulty with the use of the MTD model in the spatial context is that it does not satisfy the Hammersley-Clifford theorem (Grimmett, 1973; Besag, 1974), since (21) does not define a correct joint probability over the whole set of random variables. Nevertheless, this method can provide a good approximation of the local structure of the distribution. In his reply to Raftery & Banfield, Besag agrees with this possibility. The estimation of this spatial MTD model requires some explanation. If the summation of equation (6) is taken over the entire set of variables $X_i$, this equation is no longer the log-likelihood of the model, but a pseudo-log-likelihood (Besag, 1975, 1977). Another possibility is to take the summation over a subset $\tilde{X}$ of $X$ defined such that no variables in $\tilde{X}$ are neighbors. Unfortunately, this method does not take into account the whole information provided by the data, and in most situations it is preferable to maximize the pseudo-log-likelihood of the model. It is interesting to compare the MTD model with the classical Potts model (Domb & Potts, 1951; Kinderman & Snell, 1980; Baxter, 1982). In the Potts model, the probability of observing a given realization of the variable $X_i$ is a function of the number of neighbors taking the same value. For instance, the model corresponding to equation (20) is

$$P(X_0 = x_0 \mid X_1, X_2, X_3, X_4) = \frac{\exp\{-\beta N_0\}}{\sum_{j=1}^{m} \exp\{-\beta N_j\}} \qquad (22)$$

where $N_0$ denotes the number of neighbors taking the same value as $X_0$, and $N_j$, $j = 1, \ldots, m$, is the number of neighbors taking the value $j$. To estimate this model, we define $m-1$ logistic regressions having the same parameter $\beta$:

$$\eta_j = \log \frac{P(X_0 = j \mid X_1, X_2, X_3, X_4)}{P(X_0 = m \mid X_1, X_2, X_3, X_4)} = \beta (N_m - N_j), \qquad j = 1, \ldots, m-1 \qquad (23)$$

and we compute the maximum likelihood estimate of $\beta$. For further details on logistic regression, see Agresti (1990). We identify two steps in the construction of this model. First, each neighbor $X_k$ is recoded as a binary variable in the following way:

$$\kappa_k = \begin{cases} 1 & \text{if } x_k = x_0 \\ 0 & \text{otherwise} \end{cases}$$

Then, we suppose that the relative position of each neighbor with respect to the variable $X_0$ has no importance, and that only the sum $N_0 = \sum_{k=1}^{4} \kappa_k$ matters. Consider first the binary recoding of the neighbors. If the original variables $X_i$ are binary ($m = 2$), they are not affected by the recoding and no information is lost. On the other hand, if $m > 2$, the recoding suppresses a part of the information about the real value of each neighbor of $X_0$. By contrast, this information is not lost with the MTD model. Hence, in some situations, the MTD model could achieve better results than the Potts model. The importance of the relative position of each neighbor must also be considered. Again, the Potts model suppresses this information, while the MTD model keeps it. Hence, we can say that if the data are binary and the relative position of each neighbor does not matter, then the Potts model should generally achieve better results than the corresponding MTD model, since it has fewer parameters. On the other hand, if $m > 2$ or if the relative position of each neighbor is important, the MTD model is interesting. The following section provides an illustration of the usefulness of the MTD model in the one-dimensional spatial context.
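To make the spatial version concrete, here is a minimal sketch of the conditional probability (21) and the pseudo-log-likelihood used for its estimation (illustrative names; a symmetric window of two neighbors on each side):

```python
import numpy as np

def spatial_mtd_prob(x, i, Q, phi, offsets=(-2, -1, 1, 2)):
    """Conditional probability (21) of site i given its neighbors."""
    return sum(phi[g] * Q[x[i + d], x[i]] for g, d in enumerate(offsets))

def pseudo_loglik(x, Q, phi, offsets=(-2, -1, 1, 2)):
    """Pseudo-log-likelihood (Besag, 1975): the sum of the log conditional
    probabilities over all sites whose neighborhood lies inside the data."""
    lo, hi = -min(offsets), len(x) - max(offsets)
    return sum(np.log(spatial_mtd_prob(x, i, Q, phi, offsets))
               for i in range(lo, hi))
```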

6.2 Analysis of DNA

In this example, we consider a one-dimensional spatial situation. Raftery & Tavaré (1994) analysed the Markovian structure of the mouse αA-crystallin gene, a set of data previously used in Avery (1987). This is a sequence of length 1307, each element of which is one of the four bases $\{A, C, G, T\}$. We give here the beginning of the sequence; the complete dataset is provided in Table 7 of Raftery & Tavaré (1994).

3411314321 1221212433 4422422413 2432421313

Raftery & Tavaré tried to find the optimal order of the Markov chain representing these data, considering an ordering from left to right. For comparison purposes, we chose to drop the first 4 and last 3 observations in order to have exactly 1300 elements in the log-likelihood of each model. Since Raftery & Tavaré dropped the first 5 observations, our results

for the same model are slightly different. Moreover, according to the convention introduced in Bishop et al. (1975), we do not take into account the number of parameters equal to zero in the computation of BIC. Table 2 summarizes the main results. MC denotes a Markov chain. MTD and MTDg are the Mixture Transition Distribution models with respectively one and several transition matrices. The other models appearing in Table 2 are described below.

Table 2. Different models for the mouse αA-crystallin gene. Note that the MTD 2 and MTD 3 models are identical, since $\varphi_3 = 0$ at the optimum of the MTD 3 model. A similar situation occurs between the SMTD 1 and SMTD 2 models.

model          number of parameters   log-likelihood     BIC
Independence            3                -1796.9        3615.4
MC 1                   12                -1734.3        3554.6
MC 2                   48                -1704.7        3753.7
MC 3                  173                -1580.4        4401.3
MC 4                  370                -1259.5        5172.0
MTD 2                  13                -1734.0        3561.2
MTD 3                  13                -1734.0        3561.2
MTDg 2                 25                -1720.1        3619.4
MTDg 3                 34                -1710.6        3665.0
MC 1r                  12                -1734.7        3555.4
MC 2r                  48                -1705.1        3754.3
MC 3r                 173                -1580.3        4400.9
SMC 1                  48                -1656.1        3656.4
SMC 2                 370                -1185.7        5024.3
SMTD 1                 13                -1727.2        3547.6
SMTD 2                 13                -1727.2        3547.6
SMTDg 1                23                -1673.5        3511.8
SMTDg 2                39                -1649.3        3578.3
Potts 1                 1                -1796.2        3599.6
Potts 2                 1                -1802.3        3611.8

Among the Markovian models and their MTD and MTDg modelings, the best model is the first-order Markov chain, whose BIC is 3554.6. This result is consistent with Raftery & Tavaré (1994), and the corresponding transition matrix is

$$Q_{ACGT} = \begin{pmatrix} 0.2251 & 0.2508 & 0.3441 & 0.1801 \\ 0.2968 & 0.3449 & 0.0588 & 0.2995 \\ 0.2281 & 0.2656 & 0.3156 & 0.1906 \\ 0.1898 & 0.2780 & 0.3051 & 0.2271 \end{pmatrix}$$

As mentioned above, Raftery & Tavaré assumed an ordering of the data from left to right. On the contrary, we suppose that there is no direction in a gene, and that each base of the gene is influenced by both its left and right neighbors. To test this hypothesis, we first computed Markov chains of orders 1 to 3, from the right to the left of the gene. These chains are denoted MC 1r to MC 3r in Table 2. These reverse chains obtain results similar to the Markov chains calculated from left to right, confirming our assumption. The best reverse Markov chain is the first-order one:

$$Q_{1r} = \begin{pmatrix} 0.2258 & 0.3581 & 0.2355 & 0.1807 \\ 0.2080 & 0.3440 & 0.2267 & 0.2213 \\ 0.3323 & 0.0690 & 0.3166 & 0.2821 \\ 0.1892 & 0.3784 & 0.2061 & 0.2264 \end{pmatrix}$$

To avoid the assumption of an ordering, we then built models using neighbors on both sides. The SMC 1 model is a transition matrix with two neighbors, one on the left and one on the right. The SMC 2 model is similar, but with two neighbors on each side. Then, we approximated these matrices using the MTD model. The SMTD 1 model has one neighbor on each side, SMTD 2 has two neighbors on each side, and the SMTDg 1 and SMTDg 2 models are similar, but with a different transition matrix for each neighbor. Finally, we computed the Potts model with the same neighborhoods. Table 2 shows that the best model is the SMTDg 1 model, whose parameters are given below. We denote by L the base on the left and by R the base on the right.

$$\varphi_L = 0.4598, \qquad \varphi_R = 0.5402$$

$$Q_L = \begin{pmatrix} 0.2039 & 0.1311 & 0.5883 & 0.0767 \\ 0.3654 & 0.2674 & 0 & 0.3672 \\ 0.2263 & 0.0981 & 0.5565 & 0.1191 \\ 0.1470 & 0.1324 & 0.5222 & 0.1984 \end{pmatrix} \qquad Q_R = \begin{pmatrix} 0.2136 & 0.5124 & 0.1037 & 0.1703 \\ 0.1878 & 0.5059 & 0.0461 & 0.2602 \\ 0.4086 & 0 & 0.2323 & 0.3591 \\ 0.1386 & 0.5462 & 0.0896 & 0.2256 \end{pmatrix}$$

Although this model has many more parameters than the MC 1, SMTD 1 and Potts 1 models, it yields a better BIC value since its log-likelihood is much larger. It is interesting to note that the more important neighbor is the one on the right, with a parameter $\varphi_R = 0.5402$, and not the one on the left as assumed in Raftery & Tavaré (1994). Even with a better log-likelihood than the SMTDg 1 model, the SMC 1 model is clearly disqualified because it has too many parameters. Finally, despite having only one parameter, the Potts model is absolutely not adequate in this case. As said in Section 6.1, this is a consequence of the loss of information implied by the use of this model on non-binary data. This example shows that the MTD model can be used with success on one-dimensional data which are not time-series. It can then be viewed as a simple and powerful tool for the modeling of different types of interactions.

7 Conclusion

The Mixture Transition Distribution model has already proved its usefulness in several papers, but the lack of a convenient method of estimation was a limitation to its application in practice. Our objective was therefore to propose a new method of estimation which could be adapted to different computing platforms. We defined an iterative procedure that proved to be very convenient to implement, since it does not require any external optimization routine. Although we cannot ensure that this method always converges to the global maximum of the log-likelihood, a careful choice of the starting values, as discussed in Section 4, leads to very good performance. On empirical examples, our algorithm achieves results that are at least as good as, or better than, results obtained through other methods. Finally, we noted that the MTD model is not restricted to the modeling of time-series, but can also be useful in the context of one-dimensional spatial data. Compared to the classical Potts model, it can achieve better results, especially when the data are not binary.

Acknowledgements

This research was supported by a grant from the Swiss National Science Foundation and by Office of Naval Research grant no. N00014-96-1-0192. I would like to thank Chris Fraley for helpful discussions about optimization, and Adrian Raftery for his constant support during the writing of this article.

References

Adke, S. R., S. R. Deshmukh (1988) Limit Distribution of a High-Order Markov Chain. Journal of the Royal Statistical Society B, 50, 105-108.

Agresti, A. (1990) Categorical Data Analysis. John Wiley & Sons, New York.

Avery, P. J. (1987) The analysis of intron data and their use in the detection of short signals. Journal of Molecular Evolution, 26, 335-340.

Bartlett, M. S. (1971) Two-dimensional nearest-neighbor systems and their ecological applications (with discussion). Statistical Ecology, Vol. 1, 180-194. Patil, Pielou, Waters Editors. Pennsylvania State University Press.

Baxter, R. J. (1982) Exactly Solved Models in Statistical Mechanics. Academic Press, London.

Berchtold, A. (1996) Modélisation autorégressive des chaînes de Markov : Utilisation d'une matrice différente pour chaque retard. Revue de Statistique Appliquée, Vol. XLIV (3), 5-25.

Berchtold, A. (1997) Swiss Health Insurance System: Mobility and Costs. Health and System Science, 1 (3-4), 291-306.

Berchtold, A. (1998) Chaînes de Markov et modèles de transition : applications aux sciences sociales. Editions HERMES, Paris.

Besag, J. (1974) Spatial Interaction and the Statistical Analysis of Lattice Systems (with Discussion). Journal of the Royal Statistical Society B, 36, 192-236.

Besag, J. (1975) Statistical analysis of non-lattice data. The Statistician, 24, 179-195.

Besag, J. (1977) Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64, 616-618.

Besag, J., J. York, A. Mollié (1991) Bayesian image restoration, with two applications in spatial statistics (with Discussion). Annals of the Institute of Statistical Mathematics, 43, 1-59.

Bishop, Y. M. M., S. E. Fienberg, P. W. Holland (1975) Discrete Multivariate Analysis. MIT Press, Cambridge.

Chatfield, C., R. E. Lemon (1970) Analysing sequences of behavioral events. Journal of Theoretical Biology, 29, 427-445.

Craig, W. (1943) The Song of the Wood Pewee. Albany, University of the State of New York.

Dempster, A. P., N. M. Laird, D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm (with Discussion). Journal of the Royal Statistical Society B, 39, 1-38.

Domb, C., R. B. Potts (1951) Order-disorder statistics. IV. A two-dimensional model with first and second interactions. Proceedings of the Royal Society of London, Series A, 210, 125-141.

Gill, P. E., W. Murray, M. H. Wright (1981) Practical Optimization. Academic Press, London.

Grimmett, G. R. (1973) A theorem about random fields. Bulletin of the London Mathematical Society, 5, 81-84.

Katz, L., C. H. Proctor (1959) The concept of configuration of interpersonal relations in a group as a time-dependent stochastic process. Psychometrika, 24, 317-327.

Katz, R. W. (1981) On some criteria for estimating the order of a Markov chain. Technometrics, 23, 243-249.

Kinderman, R., J. L. Snell (1980) Markov Random Fields and their Applications. American Mathematical Society, Providence.

Le, N. D., R. D. Martin, A. E. Raftery (1996) Modeling Flat Stretches, Bursts, and Outliers in Time Series Using Mixture Transition Distribution Models. Journal of the American Statistical Association, 91, 1504-1515.

Logan, J. A. (1981) A Structural Model of the High-Order Markov Process Incorporating Reversion Effects. Journal of Mathematical Sociology, 8, 75-89.

MacDonald, I. L., W. Zucchini (1997) Hidden Markov and Other Models for Discrete-valued Time Series. Chapman & Hall, London.

Martin, R. D., A. E. Raftery (1987) Robustness, Computation, and Non-Euclidean Models. Journal of the American Statistical Association, 82, 1044-1050.

Mehran, F. (1989a) Longitudinal Analysis of Employment and Unemployment Based on Matched Rotation Samples. International Labour Office, Bureau of Statistics, Geneva.

Mehran, F. (1989b) Analysis of Discrete Longitudinal Data: Infinite-Lag Markov Models. In Statistical Data Analysis and Inference, 533-541. Dodge Editor, Neuchâtel.

Nash, S. G., A. Sofer (1996) Linear and Nonlinear Programming. McGraw-Hill, New York.

Ott, L., R. F. Larson, W. Mendenhall (1987) Statistics, A Tool for the Social Sciences. Duxbury Press, Boston.

Pegram, G. G. S. (1980) An Autoregressive Model for Multilag Markov Chains. Journal of Applied Probability, 17, 350-362.

Raftery, A. E. (1985a) A model for high-order Markov chains. Journal of the Royal Statistical Society B, 47 (3), 528-539.

Raftery, A. E. (1985b) A new model for discrete-valued time series: autocorrelations and extensions. Rassegna di Metodi Statistici ed Applicazioni, 3-4, 149-162.

Raftery, A. E. (1993) Change point and change curve modeling in stochastic processes and spatial statistics. Journal of Applied Statistical Science, 1 (4), 403-423.

Raftery, A. E., S. Tavaré (1994) Estimation and Modelling Repeated Patterns in High Order Markov Chains with the Mixture Transition Distribution Model. Applied Statistics, 43 (1), 179-199.

Schimert, J. (1992) A high order hidden Markov model. Ph.D. thesis 40908, University of Washington, USA.

Theil, H. (1971) On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76, 103-154.