Modelling and Analysing Interval Data

Paula Brito
Faculdade de Economia / NIAAD-LIACC, Universidade do Porto
Rua Dr. Roberto Frias, 4200-464 Porto, Portugal
[email protected]

Abstract. In this paper we discuss some issues which arise when applying classical data analysis techniques to interval data, focusing on the notions of dispersion, association and linear combinations of interval variables. We present some methods that have been proposed for analysing this kind of data, namely for clustering, discriminant analysis, linear regression and interval time series analysis.
1 Introduction

In classical data analysis, data is represented in an n × p matrix where n individuals (in rows) take exactly one value for each variable (in columns). However, this model is too restrictive to represent data with more complex information. Symbolic data analysis has extended the classical tabular model by allowing multiple, possibly weighted, values for each variable. New variable types have been introduced - interval, categorical multi-valued and modal variables - which allow taking into account the variability and/or uncertainty that is often inherent to the data.

In this paper we focus on the analysis of interval data, that is, data where individuals are described by variables whose values are intervals of ℝ. We discuss some issues which arise when trying to apply classical data analysis techniques to interval data, and present some methods which have been proposed for analysing this kind of data.

De Carvalho et al. (to appear) have proposed a partitioning clustering method following the dynamic clustering approach and using an L2 distance. The result of any clustering method depends heavily on the scales used for the variables; natural clustering structures can sometimes only be detected after an appropriate rescaling of the variables. In this context, the standardization problem has been addressed, and three standardization techniques for interval-type variables have been proposed. Numerous methods have now been proposed for clustering interval data. Bock (2002) has proposed several clustering algorithms for symbolic data described by interval variables, based on
a clustering criterion, thereby generalizing similar approaches in classical data analysis. Chavent and Lechevallier (2002) proposed a dynamic clustering algorithm for interval data where the class representatives are defined by an optimality criterion based on a modified Hausdorff distance. Souza and De Carvalho (2004) have proposed partitioning clustering methods for interval data based on city-block distances, also considering adaptive distances. Various new techniques will be described in the forthcoming monograph (Diday and Noirhomme (2006)).

Based on one of the dispersion measures proposed for the standardization process, it has been possible to obtain a covariance measure and to define a sample correlation coefficient r for interval data. Following this line, a linear regression model for interval data, which has r² as determination coefficient, has been derived. Various approaches to linear regression on interval variables have been investigated by Billard and Diday (2003) and Neto et al. (2004).

In a recent work, Duarte Silva and Brito (to appear) discuss the problem of linear combinations of interval variables, and compare different approaches to linear discriminant analysis of interval data. A first approach is based on the measures proposed by Bertrand and Goupil (2000) and Billard and Diday (2003), assuming a uniform distribution in each interval. Another approach consists in considering all the vertices of the hypercube representing each of the n individuals in the p-dimensional space, and then performing a classical discriminant analysis on the resulting n × 2^p by p matrix. This follows previous work by Chouakria et al. (2000) for Principal Component Analysis. A third approach is to represent each variable by the midpoints and ranges of its interval values, perform two separate classical discriminant analyses on these values and combine the results in some appropriate way, or else analyse midpoints and ranges conjointly.
This follows similar work on Regression Analysis by Neto et al. (2004), and by Lauro and Palumbo (2005) on Principal Component Analysis. Perspectives for future work include modelling interval time-series data, a problem which is addressed by Teles and Brito (2005) using ARMA models.

The remainder of the paper is organized as follows: In Section 2 we define interval data precisely. Section 3 presents the dynamic clustering method and the three standardization techniques proposed in De Carvalho et al. (to appear). In Section 4 we derive a linear regression model from one of the proposed dispersion measures. Section 5 discusses alternative definitions for the concepts of dispersion, association and linear combinations of interval variables, following Duarte Silva and Brito (to appear). In Section 6, the three alternative discriminant analysis approaches investigated by Duarte Silva and Brito (2005) are presented. Section 7 introduces recent work on the modelling of interval time-series data. Finally, Section 8 concludes the paper, raising the main questions that remain open to future research.
2 Interval Data

Given a set of individuals Ω = {ω1, . . . , ωn}, an interval variable is defined by an application Y : Ω → T such that ωi ↦ Y(ωi) = [li, ui], where T is the set of intervals of an underlying set O ⊆ ℝ. Let I be an n × p matrix representing the values of p interval variables on Ω. Each ωi ∈ Ω is represented by a p-tuple of intervals, Ii = (Ii1, . . . , Iip), i = 1, . . . , n, with Iij = [lij, uij], j = 1, . . . , p (see Table 1).

        Y1          ...  Yj          ...  Yp
ω1      [l11, u11]  ...  [l1j, u1j]  ...  [l1p, u1p]
...     ...              ...              ...
ωi      [li1, ui1]  ...  [lij, uij]  ...  [lip, uip]
...     ...              ...              ...
ωn      [ln1, un1]  ...  [lnj, unj]  ...  [lnp, unp]

Table 1. Matrix I of interval data
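In practice, the matrix I can be stored as a three-way numeric array holding the lower and upper bounds of each cell. A minimal sketch in Python (the array layout, toy values and variable names are illustrative assumptions, not part of the original formulation):

```python
import numpy as np

# Hypothetical toy data: n = 3 individuals, p = 2 interval variables.
# Each cell stores [l_ij, u_ij], so I has shape (n, p, 2).
I = np.array([
    [[1.0, 3.0], [10.0, 12.0]],   # omega_1
    [[2.0, 5.0], [11.0, 14.0]],   # omega_2
    [[0.0, 1.0], [ 9.0, 10.0]],   # omega_3
])

lower, upper = I[..., 0], I[..., 1]   # n x p matrices of bounds
midpoints = (lower + upper) / 2       # midpoint of each interval
ranges = upper - lower                # range (width) of each interval
```

Midpoints and ranges computed this way reappear in several of the approaches discussed below (e.g. the midpoint/range representation for discriminant analysis).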
Interval data may occur in many different situations. We may have 'native' interval data, describing ranges of variable values, for instance daily stock prices or monthly temperature ranges; imprecise data, coming from repeated measures or confidence interval estimation; or symbolic data, such as descriptions of biological species or technical specifications. Interval data may also arise from the aggregation of huge databases, when real values are generalized by intervals. In this context, mention should also be made of Interval Calculus (Case (1999), Moore (1966)), a discipline that has derived rules for dealing with interval values. Given two intervals I1 and I2, any arithmetical operation 'op' between them is defined by I1 op I2 = {x op y : x ∈ I1, y ∈ I2}. That is, the result of a given operation between the intervals I1 and I2 is an interval containing all possible outcomes of the operation between values of I1 and values of I2.
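The definition I1 op I2 = {x op y : x ∈ I1, y ∈ I2} reduces, for the basic operations, to simple rules on the endpoints. A minimal sketch of these rules, using a hypothetical Interval class (not a real library):

```python
# Sketch of Interval Calculus endpoint rules (Moore): the result of an
# operation on two intervals contains every outcome x op y, x in I1, y in I2.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # [a, b] + [c, d] = [a + c, b + d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # [a, b] - [c, d] = [a - d, b - c]
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # [a, b] * [c, d]: min and max over the four endpoint products
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

I1, I2 = Interval(1, 2), Interval(-1, 3)
print(I1 + I2)  # [0, 5]
print(I1 * I2)  # [-2, 6]
```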
3 Dynamical Clustering De Carvalho, Brito and Bock (to appear) have proposed a partitioning clustering method following the dynamic clustering approach and using an L2 distance. Dynamic clustering (Diday and Simon (1976)), generally known as (generalized) ‘k-means clustering’, is a clustering method that determines a partition
P = (P1, . . . , Pk) of a given data set Ω = {ω1, . . . , ωn} of objects into a fixed number k of clusters P1, . . . , Pk, and a set L = (ℓ1, . . . , ℓk) of cluster prototypes, by optimizing a criterion W(P, L) that evaluates the fit between the clusters and the cluster prototypes. Starting from an initial system of class representatives, or from an initial partition, the method iteratively applies an assignment function, which determines a new partition, followed by a cluster representation function, which defines optimum prototypes, until convergence is attained.

Let Pk be the family of all partitions P = (P1, . . . , Pk) of Ω into the given number k of non-empty clusters. Also, let L be the set of 'admissible' class representatives or prototypes (dependent on the given data type), and denote by Lk = (L)^k the set of all systems of k prototypes L = (ℓ1, . . . , ℓk) (one for each cluster). In our case, the 'representation space' L is the set I^p of all p-tuples of finite intervals of ℝ. Consider a function D(Ph, ℓh) that measures how well the prototype ℓh represents the objects in class Ph (a low value indicates a good fit between ℓh and Ph). The clustering problem consists in finding a pair (P*, L*) ∈ Pk × Lk that minimizes the criterion

  W(P, L) = Σ_{h=1}^{k} D(Ph, ℓh) :  W(P*, L*) = min_{P,L} W(P, L)      (1)
In the proposed method, we use the L2 distance between interval-vectors,

  ϕ(Ii, ℓ) := Σ_{j=1}^{p} [ |lij − lℓj|² + |uij − uℓj|² ].      (2)
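Distance (2) can be computed directly from the bound matrices. A small sketch, assuming each interval-vector is stored as a p × 2 array of [lower, upper] bounds (an illustrative layout, not prescribed by the paper):

```python
import numpy as np

# Squared L2 distance between two interval-vectors: sum over variables of
# the squared differences between lower bounds and between upper bounds.
def phi(Ii, ell):
    Ii, ell = np.asarray(Ii, float), np.asarray(ell, float)
    return float(np.sum((Ii[:, 0] - ell[:, 0]) ** 2
                        + (Ii[:, 1] - ell[:, 1]) ** 2))

# e.g. phi([[1,3],[10,12]], [[2,5],[11,14]]) = 1 + 4 + 1 + 4 = 10
print(phi([[1, 3], [10, 12]], [[2, 5], [11, 14]]))  # 10.0
```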
Then D(Q, ℓ) is typically obtained as the sum over all objects ωi from a subset (class) Q ⊂ Ω:

  D(Q, ℓ) := Σ_{ωi ∈ Q} ϕ(Ii, ℓ)   for ℓ ∈ L, Q ⊂ Ω,

such that the criterion to minimize is given by

  W(P, L) = Σ_{h=1}^{k} Σ_{ωi ∈ Ph} Σ_{j=1}^{p} [ (lij − l_j^(h))² + (uij − u_j^(h))² ] → min_{P,L}      (3)

where ℓh = ([l_1^(h), u_1^(h)], . . . , [l_p^(h), u_p^(h)]) ∈ I^p is the prototype of cluster Ph.
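The alternation between the assignment and representation steps under this L2 criterion can be sketched as follows. For a sum-of-squares criterion of this form, the representation step has a closed form: the optimal prototype of a cluster is the coordinate-wise mean of its members' lower and upper bounds. Function and variable names are illustrative, not taken from the cited paper:

```python
import numpy as np

# Sketch of dynamic clustering for interval data: alternate assignment
# (nearest prototype under the squared L2 distance) and representation
# (coordinate-wise mean of bounds) until the partition is stable.
def dynamic_cluster(I, k, n_iter=100, seed=0):
    I = np.asarray(I, float)                   # shape (n, p, 2): bounds per cell
    n = len(I)
    rng = np.random.default_rng(seed)
    L = I[rng.choice(n, size=k, replace=False)].copy()  # initial prototypes
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # assignment step: distance of every object to every prototype
        d = ((I[:, None] - L[None]) ** 2).sum(axis=(2, 3))  # shape (n, k)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                               # partition unchanged: converged
        labels = new_labels
        # representation step: mean of bounds within each non-empty cluster
        for h in range(k):
            members = I[labels == h]
            if len(members) > 0:
                L[h] = members.mean(axis=0)
    return labels, L
```

A usage note: on two well-separated groups of interval-vectors, the iteration recovers the groups after a few assignment/representation rounds.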
By the rules of Interval Calculus, a linear combination of interval variables with coefficients βjℓ yields, for each ωi, an interval whose lower and upper bounds are

  Σ_{βjℓ > 0} βjℓ lij + Σ_{βjℓ < 0} βjℓ uij   and   Σ_{βjℓ > 0} βjℓ uij + Σ_{βjℓ < 0} βjℓ lij.
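Computationally, the bounds of a linear combination of interval variables follow from pairing positive coefficients with like bounds and negative coefficients with opposite bounds. A sketch under the assumption that the bounds are stored as arrays (the function name and layout are illustrative):

```python
import numpy as np

# Interval Calculus bounds of sum_j beta_j * [l_j, u_j]: positive
# coefficients send lower bounds to the lower bound of the result;
# negative coefficients send upper bounds there (and vice versa).
def linear_combination(lower, upper, beta):
    lower, upper, beta = (np.asarray(a, float) for a in (lower, upper, beta))
    lo = np.where(beta > 0, beta * lower, beta * upper).sum(axis=-1)
    hi = np.where(beta > 0, beta * upper, beta * lower).sum(axis=-1)
    return lo, hi

# e.g. 2*[1,3] + (-1)*[2,4]: lower = 2*1 - 1*4 = -2, upper = 2*3 - 1*2 = 4
lo, hi = linear_combination([1, 2], [3, 4], [2, -1])
```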