Parsimonious Gaussian Mixture Models

Paul David McNicholas* and Thomas Brendan Murphy†‡

* Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, Canada, N1G 2W1. E-mail: [email protected].

† School of Mathematical Sciences, University College Dublin, Belfield, Dublin 4, Ireland. E-mail: [email protected].

‡ This research was funded by the SFI Basic Research Grant (04/BR/M0057). Part of this work was completed during a visit to the Center for Statistics and the Social Sciences, which was supported by NIH grant (8 R01 EB002137-02). The authors would like to thank Prof. Adrian Raftery and the members of the Working Group on Model-Based Clustering at the University of Washington for useful suggestions and inputs into this work. The authors would also like to thank an anonymous referee for helpful suggestions that have added to the overall quality of this work. This work was carried out while the authors were at the Department of Statistics, School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland.
Abstract

Parsimonious Gaussian mixture models are developed using a latent Gaussian model which is closely related to the factor analysis model. These models provide a unified modelling framework which includes the mixtures of probabilistic principal component analyzers and mixtures of factor analyzers models as special cases. In particular, a class of eight parsimonious Gaussian mixture models, based on the mixtures of factor analyzers model, is introduced, and the maximum likelihood estimates for the parameters in these models are found using an AECM algorithm. The class includes parsimonious models that have not previously been developed. These models are applied to the analysis of chemical and physical properties of Italian wines and the chemical properties of coffee; the models are shown to give excellent clustering performance.

Keywords:
Mixture models, factor analysis, probabilistic principal components
analysis, cluster analysis, model-based clustering.
1 Introduction
Mixture models assume that data are collected from a finite collection of populations and that the data within each population can be modelled using a standard statistical model (e.g. Gaussian, Poisson or binomial). As a result, mixture models are particularly useful when modelling data collected from multiple sources. Mixture models are the basis of many modern clustering algorithms, and methods based on mixtures offer the advantage of allowing for uncertainties to be quantified using probabilities. The Gaussian mixture model has received particular attention in the statistical literature; this model assumes a Gaussian structure for each population. In particular, the model density is of the form

    f(x) = ∑_{g=1}^G πg φ(x|µg, Σg),    (1)
where φ(x|µg , Σg ) is the density of a multivariate Gaussian with mean µg and covariance Σg . Extensive reviews of mixtures are contained in Titterington et al. (1985), McLachlan and Basford (1988) and McLachlan and Peel (2000); these books examine many different types of mixture models, but emphasis is given to Gaussian mixtures. Additionally, Fraley and Raftery (2002) provide an excellent review of Gaussian mixture models with an emphasis on applications to cluster analysis, discriminant analysis and density estimation. The general Gaussian mixture model (1) has a total of (G − 1) + Gp + Gp(p + 1)/2 parameters of which Gp(p + 1)/2 are used in the group covariance matrices Σg . A simpler form of the mixture assumes that the covariances are constrained to be equal across groups, which reduces to a total of (G − 1) + Gp + p(p + 1)/2 parameters of which p(p + 1)/2 are used for the common group covariance matrix Σg = Σ. Banfield and Raftery (1993), Celeux and Govaert (1995) and Fraley and Raftery (1998, 2002) exploit an eigenvalue decomposition of the group covariance matrices to give a wide range of covariance structures that use between one and Gp(p + 1)/2
parameters. The eigenvalue decomposition of the covariance matrix is of the form Σg = λg Dg Ag Dg', where λg is a constant, Dg is a matrix of eigenvectors of Σg and Ag is a diagonal matrix with entries proportional to the eigenvalues of Σg. The model-based clustering methods that they have developed allow for constraining the components of the eigenvalue decomposition of Σg across components of the mixture model; the components of the eigenvalue decomposition are easily interpreted. Fraley and Raftery (2002) demonstrate that the parsimonious mixture models derived in this model-based clustering framework give excellent results in a variety of applications.

We develop a new class of Gaussian mixture models with parsimonious covariance structure. These models are based on assuming a latent Gaussian model structure for each population; the latent Gaussian model is closely related to the mixture of factor analyzers model (Ghahramani and Hinton 1997). The mixture of factor analyzers model assumes a covariance structure of the form Σg = ΛgΛg' + Ψg, where the loading matrix Λg is a (p × q) matrix of parameters, typically with q ≪ p, and the noise matrix Ψg is a diagonal matrix; a detailed development of this covariance structure is given in Section 2. The loading and noise terms of the covariance matrix can be constrained to be equal or unequal across groups and the noise term can be further restricted to give a collection of eight parsimonious covariance structures. These covariance structures can have as few as pq − q(q − 1)/2 + 1 parameters or as many as G[pq − q(q − 1)/2 + p] parameters, with q = 0, 1, 2, . . . .

The full development of the collection of parsimonious Gaussian mixture models (PGMMs) and methods for fitting these models are given in Section 3. Some computational aspects, including the issue of model selection for the class of parsimonious Gaussian mixture models, are addressed in Section 4. We propose using the BIC as a model selection criterion. In Section 5, we apply the models to the analysis of data recording chemical and physical properties of Italian wines collected from three areas of the Piedmont region. We find that the mixture models can be used to cluster the observations into the three
areas with only one error. The results of the PGMM analysis are compared with other model-based methods of analysis. In Section 6, we apply the models to the analysis of data recording the chemical properties of coffee from a wide range of locations. The mixture models are shown to capture the difference between Arabica and Robusta species of coffee plant. The results are compared with other model-based methods and are shown to give superior performance. We conclude, in Section 7, with a discussion of the methods and results of this work.
2 Latent Gaussian Models
In this section, we briefly review the latent Gaussian model underlying factor analysis and probabilistic principal components analysis; methods for fitting these models are also discussed. Factor analysis (Spearman 1904) is a data reduction technique that aims to find unobserved factors that explain the variability in the data. The model (see Bartholomew and Knott 1999, Chapter 3) assumes that a p-dimensional random vector x is modelled using a q-dimensional vector of unobserved latent factors u, where q ≪ p. The model is x = µ + Λu + ε, where Λ is a p × q matrix of factor loadings, the factors u ∼ N(0, Iq) and ε ∼ N(0, Ψ), where Ψ = diag(ψ1, ψ2, . . . , ψp). It follows from this model that the marginal distribution of x is multivariate Gaussian with mean µ and covariance ΛΛ' + Ψ. The probabilistic principal components analysis (PPCA) model (Tipping and Bishop 1999b) is a special case of the factor analysis model, but it assumes that the distribution of the errors is isotropic, so that Ψ = ψIp. Suppose that x = (x1, x2, . . . , xn) are iid from a factor analysis model; then the
log-likelihood is

    ∑_{i=1}^n log f(xi) = −(np/2) log 2π − (n/2) log|ΛΛ' + Ψ| − (n/2) tr{S(ΛΛ' + Ψ)^{-1}},

where S = (1/n) ∑_{i=1}^n (xi − µ)(xi − µ)' is the sample covariance matrix; in fact the data only appear in the likelihood through S. Clearly, the maximum likelihood estimate of µ is µ̂ = x̄.

The EM algorithm (Dempster et al. 1977) is used to find maximum likelihood estimates for Λ and Ψ. We take x as the observed data and let u be the missing data, the values of u = (u1, u2, . . . , un) being the unobserved latent values. The complete-data consist of (x, u), the observed and missing data. Therefore, the complete-data log-likelihood is

    lc(x, u) = C − (n/2) log|Ψ| − (1/2) tr{Ψ^{-1} ∑_{i=1}^n (xi − µ)(xi − µ)'}
               + ∑_{i=1}^n (xi − µ)' Ψ^{-1} Λ ui − (1/2) tr{Λ' Ψ^{-1} Λ ∑_{i=1}^n ui ui'},

where C is constant with respect to µ, Λ and Ψ. The expected value of ui conditional on xi and the current model parameters is

    E[ui | xi, µ, Λ, Ψ] = Λ'(ΛΛ' + Ψ)^{-1}(xi − µ) = β(xi − µ),

where β = Λ'(ΛΛ' + Ψ)^{-1}, and the expected value of ui ui' conditional on xi and the current model parameters is

    E[ui ui' | xi, µ, Λ, Ψ] = Iq − βΛ + β(xi − µ)(xi − µ)'β'.

Hence, the expected complete-data log-likelihood Q, evaluated with µ = µ̂, is

    Q(Λ, Ψ) = C + (n/2) log|Ψ^{-1}| − (n/2) tr{Ψ^{-1}S} + n tr{Ψ^{-1}Λβ̂S} − (n/2) tr{Λ'Ψ^{-1}ΛΘ},

where Θ = Iq − β̂Λ̂ + β̂Sβ̂' is a symmetric q × q matrix.

Differentiating Q with respect to Λ and Ψ^{-1} respectively, utilizing results from Lütkepohl (1996), and solving the resulting equations gives

    Λ̂ = Sβ̂'Θ^{-1}   and   Ψ̂ = diag{S − Λ̂β̂S}.    (2)

Hence, the maximum likelihood estimates for Λ and Ψ can be obtained by iteratively applying (2) and updating the values of β̂ and Θ as required. For the PPCA model, we find that

    ψ̂ = (1/p) tr{S − Λ̂β̂S}.
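The iteration in (2) is compact enough to sketch directly. The numpy fragment below is a minimal illustration of the updates, assuming the sample covariance matrix S has already been formed; the function and variable names are ours and are not part of any software associated with this paper.

```python
import numpy as np

def factor_analysis_em(S, q, n_iter=200, ppca=False):
    """Iterate the updates in (2) on a sample covariance matrix S.

    S: (p, p) sample covariance; q: number of latent factors.
    With ppca=True the isotropic update psi = tr{S - Lambda beta S} / p is used.
    """
    p = S.shape[0]
    rng = np.random.default_rng(0)
    Lambda = rng.normal(scale=0.1, size=(p, q))   # starting loadings
    Psi = np.diag(np.diag(S))                     # starting noise variances
    for _ in range(n_iter):
        # beta = Lambda'(Lambda Lambda' + Psi)^{-1}
        beta = Lambda.T @ np.linalg.inv(Lambda @ Lambda.T + Psi)
        # Theta = I_q - beta Lambda + beta S beta'
        Theta = np.eye(q) - beta @ Lambda + beta @ S @ beta.T
        # Lambda = S beta' Theta^{-1}; Psi = diag{S - Lambda beta S}
        Lambda = S @ beta.T @ np.linalg.inv(Theta)
        resid = S - Lambda @ beta @ S
        Psi = (np.trace(resid) / p) * np.eye(p) if ppca else np.diag(np.diag(resid))
    return Lambda, Psi
```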
3 Parsimonious Gaussian Mixture Models
Ghahramani and Hinton (1997) extended the factor analysis model (Section 2) by developing the mixtures of factor analyzers model, which assumes a mixture of Gaussian distributions with factor analysis covariance structures; this work was further developed by McLachlan and Peel (2000). Additionally, Tipping and Bishop (1999a) developed a mixtures of probabilistic principal components analyzers model. Under the general mixtures of factor analyzers model, the density of an observation in group g is multivariate Gaussian with mean µg and covariance ΛgΛg' + Ψg, and the probability of membership of group g is πg. Therefore, the mixtures of factor analyzers model has density

    f(xi) = ∑_{g=1}^G [πg / ((2π)^{p/2} |ΛgΛg' + Ψg|^{1/2})] exp{ −(1/2)(xi − µg)'(ΛgΛg' + Ψg)^{-1}(xi − µg) }.

We extend the mixture of factor analyzers model by allowing constraints across groups on the Λg and Ψg matrices and on whether or not Ψg = ψgIp. The full range of possible constraints provides a class of eight different parsimonious Gaussian mixture models (PGMM), which are given in Table 1. Three of the models (UCU, UUC and UUU) given in Table 1 have been developed previously. Ghahramani and Hinton (1997) assume the equal noise model (UCU), whereas McLachlan and Peel (2000) and McLachlan et al. (2003) assume unequal noise (UUU); however, they comment that assuming equal noise (UCU) can give more stable results. In the context of the mixtures of probabilistic principal components analyzers model, Tipping and Bishop (1999a) assume unequal, but isotropic, noise (UUC), i.e. Ψg = ψgIp.
Table 1: Parsimonious covariance structures derived from the mixtures of factor analyzers model.

    Model ID   Loading Matrix   Error Variance   Isotropic        Covariance Parameters
    CCC        Constrained      Constrained      Constrained      [pq − q(q − 1)/2] + 1
    CCU        Constrained      Constrained      Unconstrained    [pq − q(q − 1)/2] + p
    CUC        Constrained      Unconstrained    Constrained      [pq − q(q − 1)/2] + G
    CUU        Constrained      Unconstrained    Unconstrained    [pq − q(q − 1)/2] + Gp
    UCC        Unconstrained    Constrained      Constrained      G[pq − q(q − 1)/2] + 1
    UCU        Unconstrained    Constrained      Unconstrained    G[pq − q(q − 1)/2] + p
    UUC        Unconstrained    Unconstrained    Constrained      G[pq − q(q − 1)/2] + G
    UUU        Unconstrained    Unconstrained    Unconstrained    G[pq − q(q − 1)/2] + Gp
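The covariance-parameter counts in Table 1, together with the (G − 1) mixing proportions and Gp means shared by every model, are easy to tabulate. The helper below is a small sketch following the naming convention of Table 1; it is not part of the paper or of any associated software.

```python
# Covariance-parameter counts from Table 1, plus the (G - 1) + Gp
# mixing-proportion and mean parameters that all eight models share.
def pgmm_covariance_params(model, p, q, G):
    base = p * q - q * (q - 1) // 2            # free loading parameters per matrix
    loading = base if model[0] == "C" else G * base
    if model[2] == "C":                         # isotropic noise
        noise = 1 if model[1] == "C" else G
    else:                                       # diagonal noise
        noise = p if model[1] == "C" else G * p
    return loading + noise

def pgmm_total_params(model, p, q, G):
    return (G - 1) + G * p + pgmm_covariance_params(model, p, q, G)

for m in ["CCC", "CCU", "CUC", "CUU", "UCC", "UCU", "UUC", "UUU"]:
    print(m, pgmm_covariance_params(m, p=27, q=4, G=3))
```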
The other five covariance structures (CCC, CCU, . . . , UCC) are new, and we note that incorporating constraints on the loading matrices dramatically reduces the number of covariance parameters and yields parsimonious models.

The alternating expectation-conditional maximization (AECM) algorithm (Meng and van Dyk 1997) is used for fitting these models. This algorithm is an extension of the EM algorithm that uses different specifications of missing data at each stage. For the PGMM, when estimating πg and µg the missing data are the unobserved group labels z, and when estimating Λg and Ψg the missing data are the group labels z and the unobserved latent factors u.

At the first stage of the algorithm, when estimating πg and µg, we let z = (z1, z2, . . . , zn) be the unobserved group labels, where zig = 1 if observation i belongs to group g and zig = 0 otherwise. Hence, the complete-data likelihood for the mixture model is

    L1(πg, µg, Λg, Ψg) = ∏_{i=1}^n ∏_{g=1}^G [πg f(xi | µg, Λg, Ψg)]^{zig}.
Hence, we find that the expected complete-data log-likelihood is of the form

    Q1(µg, πg) = ∑_{g=1}^G ng log πg − (np/2) log 2π − ∑_{g=1}^G (ng/2) log|ΛgΛg' + Ψg| − ∑_{g=1}^G (ng/2) tr{Sg(ΛgΛg' + Ψg)^{-1}},

where

    ẑig = π̂g φ(xi | µ̂g, Λ̂g, Ψ̂g) / ∑_{g'=1}^G π̂g' φ(xi | µ̂g', Λ̂g', Ψ̂g'),    (3)

ng = ∑_{i=1}^n ẑig and Sg = (1/ng) ∑_{i=1}^n ẑig (xi − µg)(xi − µg)'. Maximizing the expected complete-data log-likelihood with respect to µg and πg yields

    µ̂g = ∑_{i=1}^n ẑig xi / ∑_{i=1}^n ẑig   and   π̂g = ng/n.

At the second stage of the AECM algorithm, when estimating Λg and Ψg, we take the group labels z and the latent factors u to be the missing data. Therefore, the complete-data log-likelihood is

    l(πg, µg, Λg, Ψg) = C + ∑_{g=1}^G [ −(ng/2) log|Ψg| − (ng/2) tr{Ψg^{-1}Sg} + ∑_{i=1}^n zig (xi − µg)'Ψg^{-1}Λg ui − (1/2) tr{Λg'Ψg^{-1}Λg ∑_{i=1}^n zig ui ui'} ],

where C is constant with respect to µg, Λg and Ψg. It follows that the expected complete-data log-likelihood evaluated with µg = µ̂g and πg = π̂g is of the form

    Q(Λg, Ψg) = C + ∑_{g=1}^G ng [ (1/2) log|Ψg^{-1}| − (1/2) tr{Ψg^{-1}Sg} + tr{Ψg^{-1}Λg β̂g Sg} − (1/2) tr{Λg'Ψg^{-1}Λg Θg} ],

where Θg = Iq − β̂gΛ̂g + β̂g Sg β̂g' and the ẑig are computed as in (3) with the estimates of µ̂g and π̂g as calculated in the first stage of the AECM algorithm. The resulting estimates when we impose constraints (Table 1) on the Λg and Ψg matrices can be easily derived from the expression for Q(Λg, Ψg). We illustrate how the maxima are computed for the CCC (Appendix A.1) and CUU (Appendix A.2) cases; the CUU case is included because it is different from the other cases. Outline calculations required to compute the estimates for all of the models are given in Appendix B. The AECM algorithm iteratively updates the parameters until convergence to maximum likelihood estimates of the parameters. The resulting ẑig values at convergence are estimates of the a posteriori probability of group membership for each observation and can be used to cluster observations into groups.
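As a concrete illustration, the sketch below carries out one AECM cycle for the fully unconstrained (UUU) model, using the responsibilities in (3) and the UUU updates listed in Appendix B.8; the constrained models differ only in how Λg and Ψg are pooled across groups. All function and variable names are ours, and scipy is assumed only for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Lambda, Psi):
    dens = np.column_stack([
        pi[g] * multivariate_normal.pdf(
            X, mean=mu[g], cov=Lambda[g] @ Lambda[g].T + Psi[g])
        for g in range(len(pi))])
    return dens / dens.sum(axis=1, keepdims=True)          # z-hat_ig as in (3)

def aecm_cycle_uuu(X, pi, mu, Lambda, Psi):
    """One AECM cycle for the UUU model (a sketch only).

    X: (n, p) data; pi: (G,); mu: (G, p); Lambda: (G, p, q); Psi: (G, p, p) diagonal.
    """
    n, p = X.shape
    G, _, q = Lambda.shape
    # CM-step 1: missing data are the labels z; update pi and mu.
    z = responsibilities(X, pi, mu, Lambda, Psi)
    ng = z.sum(axis=0)
    pi, mu = ng / n, (z.T @ X) / ng[:, None]
    # CM-step 2: missing data are z and the factors u; update Lambda_g, Psi_g.
    z = responsibilities(X, pi, mu, Lambda, Psi)            # recomputed with new pi, mu
    ng = z.sum(axis=0)
    for g in range(G):
        Xc = X - mu[g]
        Sg = (z[:, g, None] * Xc).T @ Xc / ng[g]            # weighted covariance
        beta = Lambda[g].T @ np.linalg.inv(Lambda[g] @ Lambda[g].T + Psi[g])
        Theta = np.eye(q) - beta @ Lambda[g] + beta @ Sg @ beta.T
        Lambda[g] = Sg @ beta.T @ np.linalg.inv(Theta)      # Appendix B.8 updates
        Psi[g] = np.diag(np.diag(Sg - Lambda[g] @ beta @ Sg))
    return pi, mu, Lambda, Psi, z
```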
4 Computational Aspects
4.1 Initialization
The AECM algorithm was run multiple times for each dataset with different random starting values of the zig to prevent the attainment of a local, rather than global, maximum of the log-likelihood. The details of how the algorithm was initialized are outlined below.

To initialize the zig, random values were chosen for zig ∈ {0, 1} so that ∑_g zig = 1 for all i. Then the CCC model was run from these starting values and the resulting ẑig were taken as the starting group membership labels for every other model.

The initial values for the elements of Λg and Ψg were generated from the eigen-decomposition of Sg as follows. The Sg were computed based on the initial values of the zig. The eigen-decomposition of each Sg was computed using Householder reduction and the QL method (details given by Press et al. 1992). Then the initial values of the elements of Λg were set as

    λij = √(dj) ρij,

where dj is the jth largest eigenvalue of Sg, ρij is the ith element of the eigenvector corresponding to the jth largest eigenvalue of Sg, i ∈ {1, 2, . . . , p} and j ∈ {1, 2, . . . , q}. The Ψg were then initialized as Ψg = diag{Sg − ΛgΛg'}.
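A minimal sketch of this initialization, using numpy's symmetric eigensolver in place of the Householder/QL routine of Press et al. (1992); the function name is ours.

```python
import numpy as np

def init_loadings(Sg, q):
    """Eigen-initialisation described above: lambda_ij = sqrt(d_j) * rho_ij."""
    d, V = np.linalg.eigh(Sg)                  # ascending eigenvalues of Sg
    order = np.argsort(d)[::-1][:q]            # indices of the q largest
    Lambda0 = V[:, order] * np.sqrt(d[order])  # column j scaled by sqrt(d_j)
    Psi0 = np.diag(np.diag(Sg - Lambda0 @ Lambda0.T))
    return Lambda0, Psi0
```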
4.2 Aitken’s Acceleration
Aitken’s acceleration procedure, described in McLachlan and Krishnan (1997), was used to estimate the asymptotic maximum of the log-likelihood. This allowed a decision to be made on whether or not the EM algorithm had converged. The Aitken acceleration at iteration k is given by

    a^(k) = (l^(k+1) − l^(k)) / (l^(k) − l^(k−1)),

where l^(k+1), l^(k) and l^(k−1) are the log-likelihood values from iterations k + 1, k and k − 1 respectively. Then the asymptotic estimate of the log-likelihood at iteration k + 1 is given by

    l∞^(k+1) = l^(k) + (l^(k+1) − l^(k)) / (1 − a^(k)).

Böhning et al. (1994) contend that the algorithm can be considered to have converged when

    |l∞^(k+1) − l∞^(k)| < ε,

where ε is a ‘small’ number. Lindsay (1995) gives an alternative stopping criterion: the algorithm can be stopped when

    |l∞^(k) − l^(k)| < ε.

The latter criterion is used for the analysis herein, with ε = 10^{-2}.
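The stopping rule is straightforward to compute from the last three log-likelihood values. The sketch below is one reasonable reading of Lindsay's criterion, comparing the asymptotic estimate with the current log-likelihood; it is ours, not taken from any published implementation.

```python
def aitken_converged(loglik, eps=1e-2):
    """Aitken-based stopping rule from the last three log-likelihoods (a sketch)."""
    if len(loglik) < 3:
        return False
    l_km1, l_k, l_kp1 = loglik[-3], loglik[-2], loglik[-1]
    if l_k == l_km1:                     # log-likelihood no longer changing
        return True
    a_k = (l_kp1 - l_k) / (l_k - l_km1)
    l_inf = l_k + (l_kp1 - l_k) / (1.0 - a_k)   # asymptotic estimate
    return abs(l_inf - l_kp1) < eps
```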
4.3 Model Selection & Performance
The Bayesian information criterion (Schwarz 1978) is proposed for selecting an appropriate PGMM. For a model with parameters θ, the Bayesian information criterion (BIC) is given by

    BIC = 2l(x, θ̂) − m log n,

where l(x, θ̂) is the maximized log-likelihood, θ̂ is the maximum likelihood estimate of θ, m is the number of free parameters in the model and n is the number of observations. The use of the BIC can be motivated through an asymptotic approximation of the log posterior probability of the models (Kass and Raftery 1995). The usual regularity conditions for the asymptotic approximation used in the development of the BIC are not generally satisfied by mixture models. However, Keribin (1998, 2000) shows that the BIC gives consistent estimates of the number of components of a mixture model. In addition, Fraley and Raftery (1998, 2002) provide practical evidence that the BIC performs well as a model selection criterion for mixture models.

We propose using the BIC to choose the model type (CCC, CCU, . . . , UUU), the number of latent factors q and the number of components G for the parsimonious Gaussian mixture models. In addition, the BIC can be used to choose between the PGMMs and model-based clustering as implemented by mclust. Note that there are alternatives to the BIC for mixture model selection; for example, the integrated completed likelihood (ICL) (Biernacki et al. 2000) penalises the BIC by subtracting the estimated entropy of the group memberships. Although not described in the applications herein, the use of the ICL as an alternative to the BIC often gave comparable clustering performance.

The performance of the PGMMs in revealing group structures in data is demonstrated using a cross tabulation of the maximum a posteriori (MAP) classification of the observations with the true group memberships. From the cross tabulations, it is clear that the PGMM has excellent clustering performance in the applications considered (Sections 5 and 6). In cases where the results are less clear, the clustering performance of various methods can be quantified using the Rand index (Rand 1971) or the adjusted Rand index (Hubert and Arabie 1985).
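In practice the selection amounts to a grid search over the model type, G and q, keeping the fit with the largest BIC. The sketch below assumes a hypothetical routine fit_pgmm that fits a single PGMM and returns its maximised log-likelihood and number of free parameters; neither that routine nor its interface comes from the paper.

```python
import itertools
import numpy as np

def bic(loglik, m, n):
    """BIC = 2 l(x, theta-hat) - m log n; larger values are preferred."""
    return 2.0 * loglik - m * np.log(n)

def select_pgmm(X, fit_pgmm,
                models=("CCC", "CCU", "CUC", "CUU", "UCC", "UCU", "UUC", "UUU"),
                G_range=range(1, 6), q_range=range(1, 6)):
    n, p = X.shape
    best = None
    for model, G, q in itertools.product(models, G_range, q_range):
        loglik, m = fit_pgmm(X, model, G, q)   # hypothetical fitting routine
        score = bic(loglik, m, n)
        if best is None or score > best[0]:
            best = (score, model, G, q)
    return best   # (BIC, model type, G, q) of the selected model
```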
5 Application: Italian Wines
5.1 The Data
Forina et al. (1986) reported twenty-eight chemical and physical properties of three types of wine (Barolo, Grignolino, Barbera) from the Piedmont region of Italy. We report results on the analysis of twenty-seven of the variables from the original study (Table 2); the sulphur measurements from the original data were not available for this analysis. A subset of thirteen variables (Table 2) is commonly available in the UCI Machine Learning data repository and as part of the gclus library (Hurley 2004) for R (R Development Core Team 2004). For completeness we also analyze this more common thirteen-variable subset of the data.
Table 2: Twenty-seven of the twenty-eight chemical and physical properties of the Italian wine reported by Forina et al. (1986). The variables marked with an asterisk are contained in the UCI Machine Learning data repository version of the data.

    Alcohol∗                      Sugar-free extract           Fixed acidity
    Tartaric acid                 Malic acid∗                  Uronic acids
    pH                            Ash∗                         Alcalinity of ash∗
    Potassium                     Calcium                      Magnesium∗
    Phosphate                     Chloride                     Total phenols∗
    Flavonoids∗                   Nonflavonoid phenols∗        Proanthocyanins∗
    Color intensity∗              Hue∗                         OD280/OD315 of diluted wines∗
    OD280/OD315 of flavonoids     Glycerol                     2-3-butanediol
    Total nitrogen                Proline∗                     Methanol
Forina et al. (1986), who originally analyzed these data, describe the data collection process in some detail. It is worth observing that the wines in this study were collected over the ten-year period 1970–1979 (Table 3) and that the Barbera wines are predominantly from a later period than the Barolo or Grignolino wines.
Table 3: The year of production of the wine samples in the data.

                              Year of Production
                  70   71   72   73   74   75   76   77   78   79
    Barolo             19        20   20
    Grignolino     9    9    7    9   16    9   12
    Barbera                             9         5        29    5
The parsimonious Gaussian mixture models are demonstrated on the analysis of these Italian wines in Sections 5.2.1 and 5.3.1. Alternative model-based methods of analysis are discussed in Sections 5.2.2, 5.2.3, 5.3.2 and 5.3.3, and the methods are compared in Sections 5.2.4 and 5.3.4.
5.2 Twenty-Seven Variables

5.2.1 Parsimonious Gaussian Mixture Models
All eight PGMMs were fitted to the data for G = 1, 2, . . . , 5 and q = 1, 2, . . . , 5. The BIC value for each model was computed and the model with the highest BIC value (−11454.32) was selected; this was the CUU model with G = 3 and q = 4. Figure 1 shows the maximum BIC for the eight parsimonious models for each pair (G, q). While q = 0 is possible, the PGMMs with q = 0 correspond to the mclust model-based clustering models (EII, EEI, VII and VVI), so the case where q = 0 is not considered at this point; it is considered in Section 5.2.2.
Figure 1: A heat map representation of the maximum BIC value for each (G, q), where the maximum is taken over all eight models.
A cross tabulation of the MAP classifications from the best PGMM versus the true wine type is given in Table 4; this model correctly classifies all but one of the wines.
5.2.2 Model-Based Clustering (mclust)
Model-based clustering was also completed on the wine data using the mclust software (Fraley and Raftery 1999, 2003) for R (R Development Core Team 2004). The model with the highest BIC value (−12119.3) was a three component mixture with the VVI covariance structure; detailed explanations of the covariance structures are given in Fraley and Raftery (2002). The VVI covariance structure is a diagonal covariance structure which implies an elliptical group structure with ellipses of varying shape and size that are aligned with the axes. The MAP classifications were calculated for the three component VVI model; a cross tabulation of the classifications and the wine types is shown in Table 5.
Table 4: A classification table for the PGMM with highest BIC value on the wine data.

                   Cluster
                    1    2    3
    Barolo         59
    Grignolino      1   70
    Barbera                  48
Table 5: The classification table for the best model found using mclust on the wine data.

                   Cluster
                    1    2    3
    Barolo         58    1
    Grignolino      4   66    1
    Barbera                  48
While model-based clustering found three groups, it is clear that it did less well than the PGMMs in separating the wines by type.
5.2.3 Variable Selection
Variable selection was carried out using clustvarsel (Dean and Raftery 2006), with the maximum number of groups preset at eight. Nineteen variables were selected and they are given in Table 6. A VVI model with four groups was chosen and the classifications for this model are given in Table 7.
Table 6: Variables selected for the wine dataset.

    Chloride                Malic acid                      Flavanoids
    Proline                 Colour intensity                Uronic acids
    Magnesium               2-3-butanediol                  Tartaric acid
    Total nitrogen          Calcium                         Alcalinity of ash
    Hue                     OD280/OD315 of diluted wines    Methanol
    Nonflavanoid phenols    Alcohol                         Sugar-free extract
    Phosphate
Table 7: Classification table for variable selection on the wine dataset.

                   Cluster
                    1    2    3    4
    Barolo         52    7
    Grignolino          17   54
    Barbera              1        47

5.2.4 Model Comparison
A comparison of the clustering results on the wine data indicates that all of the methods are good at discovering the group structure in these data. The PGMM and mclust methods both find the correct number of groups. However, the PGMM method is more accurate. The Rand and adjusted Rand indices (Table 8) indicate that the PGMMs are better at finding the wine types than mclust or variable selection. Notably, mclust outperformed the variable selection method on these data.
Table 8: Rand and adjusted Rand indices for the models that were applied to the wine data.

    Model                 Rand Index    Adjusted Rand Index
    PGMM                  0.99          0.98
    mclust                0.95          0.90
    Variable Selection    0.91          0.78
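The indices in Table 8 can be reproduced directly from the classification tables above. The sketch below computes the Rand and adjusted Rand index from a types-by-clusters contingency table; applied to the PGMM table (Table 4) it returns approximately 0.99 and 0.98. The function name is ours.

```python
import numpy as np
from scipy.special import comb

def rand_indices(table):
    """Rand and adjusted Rand index from a contingency table (a sketch)."""
    table = np.asarray(table)
    n = table.sum()
    sum_cells = comb(table, 2).sum()
    sum_rows = comb(table.sum(axis=1), 2).sum()
    sum_cols = comb(table.sum(axis=0), 2).sum()
    total = comb(n, 2)
    rand = 1.0 + (2.0 * sum_cells - sum_rows - sum_cols) / total
    expected = sum_rows * sum_cols / total
    ari = (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)
    return rand, ari

# Table 4 (PGMM on the 27-variable wine data): rows are wine types, columns clusters.
print(rand_indices([[59, 0, 0], [1, 70, 0], [0, 0, 48]]))
```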
The BIC values of the PGMM and mclust results can be compared; the BIC is higher for the PGMM method. Interestingly, the model with the higher BIC also had better clustering performance; while this is not guaranteed to happen, the phenomenon has been observed previously by Fraley and Raftery (2002).
5.3 Thirteen Variables

5.3.1 Parsimonious Gaussian Mixture Models
All eight PGMMs were fitted to the subset of the wine data for G = 1, 2, . . . , 5 and q = 1, 2, . . . , 5. The model with the highest BIC (−5294.68) was selected; this was the CUU model with G = 4 and q = 2. Figure 2 shows the maximum BIC for the eight PGMMs for each pair (G, q).
Figure 2: A heat map representation of the maximum BIC value for each value of (G, q), where the maximum is taken over the eight models.
Again, the cases where q = 0 are considered in Section 5.3.2. A cross tabulation of the MAP classifications from the best PGMM versus the true wine type is given in Table 9. The Grignolino wines are spread mostly across clusters 2 and 3.
Table 9: A classification table for the PGMM with highest BIC value for the subset of the wine data.

                   Cluster
                    1    2    3    4
    Barolo         59
    Grignolino          38   31    2
    Barbera                       48
A cross tabulation of the MAP classification results for the PGMM versus wine type and year is given in Table 10. The table shows that Cluster 2 consists primarily of Grignolino wines from 1970–1974 and Cluster 3 consists primarily of Grignolino wines from 1974–1976. Therefore, the mixture model appears to be dividing the Grignolino wines into two groups according to year of production. Incidentally, due to the data collection process, it is not possible to completely determine if the correspondence of the Barbera wines and Cluster 4 is due to the wine type or the year of production. The Barbera wines were primarily from 1978 and 1979 whereas the other wine types were from 1970–1976.
5.3.2 Model-Based Clustering (mclust)
Analysis of the subset of the wine data using the mclust software yielded eight groups with a VEI covariance structure and a BIC value of −5469.95. The VEI covariance structure corresponds to equal shaped clusters which are aligned with the axes but with variable volume.
Table 10: A table of the MAP classifications from the PGMM fitted to the subset of the wine data versus the wine type and year.

                      Barolo            Grignolino                        Barbera
                  71   73   74    70   71   72   73   74   75   76    74   76   78   79
    Cluster 1     19   20   20
    Cluster 2                       7    8    4    8    8    2    1
    Cluster 3                       2    1    2    1    8    7   10
    Cluster 4                                 1                   1     9    5   29    5
Classifications for this model are given in Table 11.
Table 11: The classification table for the best model found using mclust on the subset of the wine data.

                   Cluster
                    1    2    3    4    5    6    7    8
    Barolo         40   18    1
    Grignolino              21   22        27    1
    Barbera                            4        17   27

5.3.3 Variable Selection
The clustvarsel package, with the maximum number of groups preset at eight, was used to analyze the subset of the wine data. The variables selected were Malic acid, Proline, Flavanoids, Colour intensity and OD280/OD315 of diluted wines. The selected model was a VEV model with three groups and the classifications for this model are given in Table 12.
Table 12: Classification table for variable selection on the subset of the wine data.

                   Cluster
                    1    2    3
    Barolo         51    8
    Grignolino      3   67    1
    Barbera              1   47

5.3.4 Model Comparison
A comparison of the clustering results on the subset of the wine data indicates that all of the methods are very good at discovering the group structure in these data. The variable selection method finds the correct number of groups, while the PGMM, which finds four groups, is slightly more accurate in terms of the Rand and adjusted Rand indices. The mclust software finds eight groups. The Rand and adjusted Rand indices for all three techniques are given in Table 13.
Table 13: Rand and adjusted Rand indices for the models that were applied to the subset of the wine data.

    Model                 Rand Index    Adjusted Rand Index
    PGMM                  0.91          0.79
    mclust                0.80          0.48
    Variable Selection    0.90          0.78
Comparing the BIC values of the PGMM and mclust results, the BIC is higher for the PGMM. Again, the model with the higher BIC also had better clustering performance.
6 Application: Coffee Data

6.1 The Data
Streuli (1973) reports on the chemical composition of coffee samples collected from around the world. A total of 43 samples, covering both the Arabica and Robusta species, were collected from 29 countries. Twelve of the thirteen chemical constituents reported in the study are listed in Table 14 and used in our analysis; the omitted Total Chlorogenic Acid measurement is the sum of the Chlorogenic, Neochlorogenic and Isochlorogenic acid values.
Table 14: Twelve of the thirteen chemical constituents of coffee reported by Streuli (1973).

    Water                 Bean Weight             Extract Yield
    pH Value              Free Acid               Mineral Content
    Fat                   Caffeine                Trigonelline
    Chlorogenic Acid      Neochlorogenic Acid     Isochlorogenic Acid

6.2 Parsimonious Gaussian Mixture Models
All eight PGMMs were fitted to the data for G = 1, 2, . . . , 5 and q = 1, 2, . . . , 5. The model with the highest BIC was the CCU model with G = 2 and q = 4. Figure 3 shows the maximum BIC for the eight parsimonious models for each pair (G, q). A cross tabulation of the classifications from the best PGMM versus the coffee species is given in Table 15; all of the coffee samples are clustered into the two coffee species.
Figure 3: A heat map representation of the maximum BIC value for each value of (G, q) where the maximum is taken over the eight models (CCC,CCU,. . . , UUU).
Table 15: A classification table for the PGMM with highest BIC value for the coffee data.

                 Cluster
                  1    2
    Arabica      36
    Robusta            7

6.3 Model-Based Clustering (mclust)
The mclust software chose three groups with a VEI covariance structure; this corresponds to equal shaped clusters which are aligned with the axes but with variable volume. The model divides the Arabica coffees into two groups; these groups have no discernible geographical basis.
Table 16: A classification table for the best mclust results for the coffee data.

                 Cluster
                  1    2    3
    Arabica      22   14
    Robusta                 7

6.4 Variable Selection
The clustvarsel package, with the maximum number of groups preset at eight, was used to analyze the coffee data. Nine variables were selected; the only variables not selected were extract yield, free acid and chlorogenic acid. The selected model was an EEV model with five groups and the classifications for this model are given in Table 17.
Table 17: Classification table for variable selection on the coffee data.

                 Cluster
                  1    2    3    4    5
    Arabica      10    9    9    8
    Robusta                           7

6.5 Model Comparison
A comparison of the clustering results on the coffee data indicates that the PGMM is the best method at discovering the group structure in these data. The PGMM was the only method to find two groups, corresponding to the two varieties of coffee. The mclust software found three groups and variable selection found five. The Rand and adjusted Rand indices for all three methods that were applied to the coffee data are given in Table 18.
Table 18: Rand and adjusted Rand indices for the models that were applied to the coffee data.

    Model                 Rand Index    Adjusted Rand Index
    PGMM                  1.00          1.00
    mclust                0.66          0.38
    Variable Selection    0.46          0.16
Comparing the BIC values of the PGMM and mclust results, the BIC is higher for the PGMM. Once again, the model with the higher BIC also had better clustering performance.
7 Discussion
A new family of parsimonious Gaussian mixture models has been proposed. These models are closely related to the mixtures of factor analyzers and mixtures of probabilistic principal components analyzers models and include them as special cases. The number of covariance parameters in the PGMM family grows linearly with the data dimension, which is especially important in high-dimensional situations. In model-based clustering as implemented in mclust, the number of parameters in any of the diagonal covariance structures is linear or constant in the data dimension; however, the number of parameters in the non-diagonal covariance structures is quadratic in the data dimension. This means that PGMMs may offer a more flexible modelling structure for high-dimensional data than mclust. The PGMMs work particularly well in high-dimensional settings because of the parsimonious nature of the covariance structure. Additionally, the PGMMs appear to be particularly good at modelling situations where some of the variables are highly correlated. The PGMMs allow for model comparison with other model-based methods like mclust, which allows the methods to be used in conjunction with each other.
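For the twenty-seven-variable wine analysis (p = 27, with the selected CUU model using G = 3 and q = 4), the contrast is stark. The figures below follow directly from the formulas in Table 1 and are included only as an illustration of the linear-versus-quadratic growth; they are not quoted from the paper.

```python
# Covariance-parameter counts for the wine example: selected CUU PGMM versus
# a full-covariance and a diagonal-covariance Gaussian mixture.
p, q, G = 27, 4, 3
cuu  = (p * q - q * (q - 1) // 2) + G * p      # 183, linear in p
full = G * p * (p + 1) // 2                    # 1134, quadratic in p
diag = G * p                                   # 81, linear in p
print(cuu, full, diag)
```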
Chang (1983) shows that the principal components corresponding to the highest eigenvalues do not necessarily contain the most group information. Directly modelling data using PGMMs avoids the problems of implementing data reduction using principal components analysis before clustering. The application of the PGMMs to the wine and coffee datasets indicates that the models give excellent clustering performance. The clusters found using the model showed greater ability to capture the group structure of the data than other techniques.
A Calculations for AECM Algorithm
We adopt the following notation:

    S̃ = ∑_{g=1}^G π̂g Sg   and   Θ̃ = Iq − β̂Λ̂ + β̂S̃β̂'.
A.1 Model CCC
For Model CCC we have Λg = Λ and Ψg = Ψ = ψIp. Therefore, β̂ = Λ̂'(Λ̂Λ̂' + ψ̂Ip)^{-1} and it follows that S̃ takes the place of Sg and so Θg is replaced by Θ̃. Therefore, Q can be written as

    Q(Λ, ψ) = C + (n/2) [ p log ψ^{-1} − ψ^{-1} tr{S̃} + 2ψ^{-1} tr{Λβ̂S̃} − ψ^{-1} tr{Λ'ΛΘ̃} ].

Differentiating Q with respect to Λ and ψ^{-1} respectively gives the score functions below:

    S1(Λ, ψ) = ∂Q(Λ, ψ)/∂Λ = ψ^{-1} n (S̃β̂' − ΛΘ̃),
    S2(Λ, ψ) = ∂Q(Λ, ψ)/∂ψ^{-1} = (n/2) [ pψ − tr{S̃} + 2 tr{Λβ̂S̃} − tr{Λ'ΛΘ̃} ].

Solving S1(Λ̂^new, ψ̂) = 0 for Λ̂^new gives

    Λ̂^new = S̃β̂'Θ̃^{-1},

and solving S2(Λ̂^new, ψ̂^new) = 0 for ψ̂^new we obtain

    ψ̂^new = (1/p) tr{S̃ − Λ̂^new β̂S̃}.
A.2 Model CUU
For Model CUU we have Λg = Λ. Therefore, β̂g = Λ̂'(Λ̂Λ̂' + Ψ̂g)^{-1} and Θg = Iq − β̂gΛ̂ + β̂g Sg β̂g', so that Q can be written as

    Q(Λ, Ψg) = C + ∑_{g=1}^G ng [ (1/2) log|Ψg^{-1}| − (1/2) tr{Ψg^{-1}Sg} + tr{Ψg^{-1}Λβ̂g Sg} − (1/2) tr{Λ'Ψg^{-1}ΛΘg} ].

Differentiating Q with respect to Λ and Ψg^{-1} respectively gives the score functions

    S1(Λ, Ψg) = ∂Q(Λ, Ψg)/∂Λ = ∑_{g=1}^G ng [ Ψg^{-1}Sgβ̂g' − Ψg^{-1}ΛΘg ],
    S2(Λ, Ψg) = ∂Q(Λ, Ψg)/∂Ψg^{-1} = ng [ (1/2)Ψg − (1/2)Sg + Sgβ̂g'Λ' − (1/2)ΛΘgΛ' ].

Now, solving S1(Λ̂^new, Ψ̂g) = 0 gives

    ∑_{g=1}^G ng Ψ̂g^{-1} Λ̂^new Θg = ∑_{g=1}^G ng Ψ̂g^{-1} Sg β̂g',    (4)

which must be solved for Λ̂^new in a row-by-row manner. Write λi = (λi1, λi2, . . . , λiq) to represent the ith row of the matrix Λ and let ri represent the ith row of the matrix on the right-hand side of (4). Now, row i of (4) can be written

    λ̂i^new ∑_{g=1}^G (ng/ψ̂g(i)) Θg = ri,

where ψ̂g(i) is the ith entry along the diagonal of Ψg. Therefore,

    λ̂i^new = ri [ ∑_{g=1}^G (ng/ψ̂g(i)) Θg ]^{-1},

for i = 1, 2, . . . , p. Unfortunately, the row-by-row nature of this solution slows the fitting of the CUU model to a great degree.

Solving diag{S2(Λ̂^new, Ψ̂g^new)} = 0 for Ψ̂g^new gives

    Ψ̂g^new = diag{ Sg − 2Λ̂^new β̂g Sg + Λ̂^new Θg (Λ̂^new)' }.
B Estimation Procedure for Eight Covariance Structures

The resulting equations for β̂, Λ̂ and Ψ̂ for each of the eight parsimonious models are given in the following sections.
B.1 Model CCC

    β̂ = Λ̂'(Λ̂Λ̂' + ψ̂Ip)^{-1},   Λ̂^new = S̃β̂'Θ̃^{-1},   (ψ̂)^new = (1/p) tr{S̃ − Λ̂^new β̂S̃}.
B.2 Model CCU

    β̂ = Λ̂'(Λ̂Λ̂' + Ψ̂)^{-1},   Λ̂^new = S̃β̂'Θ̃^{-1},   Ψ̂^new = diag{S̃ − Λ̂^new β̂S̃}.
B.3 Model CUC

    β̂g = Λ̂'(Λ̂Λ̂' + ψ̂gIp)^{-1},   Λ̂^new = [ ∑_{g=1}^G (ng/ψ̂g) Sg β̂g' ] [ ∑_{g=1}^G (ng/ψ̂g) Θg ]^{-1},

    (ψ̂g)^new = (1/p) tr{ Sg − 2Λ̂^new β̂g Sg + Λ̂^new Θg (Λ̂^new)' }.
B.4 Model CUU

In this case the loading matrix must be solved in a row-by-row manner, which slows the fitting of this model to a great degree.

    β̂g = Λ̂'(Λ̂Λ̂' + Ψ̂g)^{-1},   λ̂i^new = ri [ ∑_{g=1}^G (ng/ψ̂g(i)) Θg ]^{-1},

    Ψ̂g^new = diag{ Sg − 2Λ̂^new β̂g Sg + Λ̂^new Θg (Λ̂^new)' },

where λ̂i^new is the ith row of the matrix Λ̂^new, ψ̂g(i) denotes the ith element along the diagonal of Ψ̂g, ri represents the ith row of the matrix ∑_{g=1}^G ng Ψ̂g^{-1} Sg β̂g' and i = 1, 2, . . . , p.
B.5 Model UCC

    β̂g = Λ̂g'(Λ̂gΛ̂g' + ψ̂Ip)^{-1},   Λ̂g^new = Sg β̂g' Θg^{-1},   (ψ̂)^new = (1/p) ∑_{g=1}^G π̂g tr{Sg − Λ̂g^new β̂g Sg}.
B.6 Model UCU

    β̂g = Λ̂g'(Λ̂gΛ̂g' + Ψ̂)^{-1},   Λ̂g^new = Sg β̂g' Θg^{-1},   Ψ̂^new = ∑_{g=1}^G π̂g diag{Sg − Λ̂g^new β̂g Sg}.
B.7 Model UUC

    β̂g = Λ̂g'(Λ̂gΛ̂g' + ψ̂gIp)^{-1},   Λ̂g^new = Sg β̂g' Θg^{-1},   (ψ̂g)^new = (1/p) tr{Sg − Λ̂g^new β̂g Sg}.
B.8 Model UUU

    β̂g = Λ̂g'(Λ̂gΛ̂g' + Ψ̂g)^{-1},   Λ̂g^new = Sg β̂g' Θg^{-1},   Ψ̂g^new = diag{Sg − Λ̂g^new β̂g Sg}.
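As an illustration of how the constrained updates pool information across groups, the sketch below implements the UCU case (B.6), where the loadings are group-specific but the common diagonal noise matrix is updated as a π̂g-weighted average. The names are ours and this is not taken from any published implementation.

```python
import numpy as np

def update_ucu(S_list, Lambda_list, Psi, pi_hat):
    """UCU updates of Appendix B.6 (group loadings, common diagonal noise).

    S_list[g]: (p, p); Lambda_list[g]: (p, q); Psi: (p, p) diagonal;
    pi_hat[g]: estimated mixing proportions.
    """
    p = Psi.shape[0]
    q = Lambda_list[0].shape[1]
    new_Lambda, new_Psi = [], np.zeros((p, p))
    for g, (Sg, Lg) in enumerate(zip(S_list, Lambda_list)):
        beta = Lg.T @ np.linalg.inv(Lg @ Lg.T + Psi)            # beta_g
        Theta = np.eye(q) - beta @ Lg + beta @ Sg @ beta.T      # Theta_g
        Lg_new = Sg @ beta.T @ np.linalg.inv(Theta)             # Lambda_g update
        new_Lambda.append(Lg_new)
        new_Psi += pi_hat[g] * np.diag(np.diag(Sg - Lg_new @ beta @ Sg))
    return new_Lambda, new_Psi
```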
References

Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.

Bartholomew, D. J. and M. Knott (1999). Latent Variable Models and Factor Analysis (2nd ed.). London: Edward Arnold.

Biernacki, C., G. Celeux, and G. Govaert (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(7), 719–725.

Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46, 373–388.

Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.

Chang, W.-C. (1983). On using principal components before separating a mixture of two multivariate normal distributions. Journal of the Royal Statistical Society, Series C 32(3), 267–275.

Dean, N. and A. E. Raftery (2006). The clustvarsel package. R package version 0.2-4.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39(1), 1–38.

Forina, M., C. Armanino, M. Castino, and M. Ubigli (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.

Fraley, C. and A. E. Raftery (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41(8), 578–588.

Fraley, C. and A. E. Raftery (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification 16, 297–306.

Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.

Fraley, C. and A. E. Raftery (2003). Enhanced model-based clustering, density estimation and discriminant analysis software: MCLUST. Journal of Classification 20(2), 263–296.

Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto.

Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.

Hurley, C. (2004). Clustering visualizations of multivariate data. Journal of Computational and Graphical Statistics 13(4).

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

Keribin, C. (1998). Estimation consistante de l'ordre de modèles de mélange. Comptes Rendus de l'Académie des Sciences, Série I Mathématique 326(2), 243–248.

Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā, Series A 62(1), 49–66.

Lindsay, B. (1995). Mixture Models: Theory, Geometry and Applications. Hayward, CA: Institute of Mathematical Statistics.

Lütkepohl, H. (1996). Handbook of Matrices. Chichester: John Wiley & Sons.

McLachlan, G. J. and K. E. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker.

McLachlan, G. J. and T. Krishnan (1997). The EM Algorithm and Extensions. New York: John Wiley & Sons.

McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. New York: John Wiley & Sons.

McLachlan, G. J., D. Peel, and R. W. Bean (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis 41(3–4), 379–388.

Meng, X.-L. and D. van Dyk (1997). The EM algorithm — an old folk-song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B 59, 511–567.

Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992). Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge: Cambridge University Press.

R Development Core Team (2004). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology 15, 72–101.

Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In Association Scientifique Internationale du Café, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61–72.

Tipping, M. E. and C. M. Bishop (1999a). Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443–482.

Tipping, M. E. and C. M. Bishop (1999b). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622.

Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985). Statistical Analysis of Finite Mixture Distributions. Chichester: John Wiley & Sons.