Set estimation and nonparametric detection

Amparo BAÍLLO, Antonio CUEVAS and Ana JUSTEL

Key words and phrases: false alarm probability, McDiarmid's inequality, nonparametric quality control, support estimation.

AMS 1991 subject classifications: Primary 62G07; secondary 62G20.

ABSTRACT

This paper considers the estimation of a set $S \subset \mathbb{R}^d$ from a random sample of points. [...]

$\{\hat f_n > 0\}$ (where $\hat f_n$ is a nonparametric density estimator of $f$), although the estimators of type $\{\hat f_n > c_n\}$ with $c_n \downarrow 0$ provide a richer and more flexible alternative. Convergence rates (with respect to both $d_\mu$ and $d_H$) for these estimators are given in Cuevas and Fraiman (1997). A similar plug-in approach to the estimation of level sets is considered in Molchanov (1998), where a result on the asymptotic distribution of $d_H(\{\hat f_n \ge c\} \cap K_0,\ \{f \ge c\} \cap K_0)$ ($K_0$ compact) is given.

The main applications of set estimation are by now related to its use as an auxiliary tool in different computer-intensive statistical methods. Thus, in cluster analysis, given a density $f$ on $\mathbb{R}^d$, the population clusters may be defined as the connected components of a level set $\{f > c\}$. Hence the estimation of this level set appears as a natural intermediate step. This is done in Polonik (1995) by using the so-called excess mass approach. See also Cuevas, Febrero and Fraiman (1999).

Apart from this application in cluster analysis, level sets also have an obvious interpretation in terms of confidence sets. Das Gupta, Ghosh and Zen (1995) provide a method for constructing multivariate confidence sets under some shape restrictions (the level set is assumed to be star-shaped). They give some examples of applications, frequentist and Bayesian, within a parametric framework. In econometrics there is another interesting application to the problem of estimating the "efficient frontier"; see Simar (1996) and references therein.

Let us finally mention that the support estimation problem could be placed in the general framework of stereology if we understand that this term refers, in a wide sense, to the reconstruction of a body from lower-dimensional sampled information. For instance, in the three-dimensional case one could think of estimating a body from random sections of dimension two (hyperplanes), one (straight lines) or zero

(points).

An interesting application of the estimator (1) to image analysis is given by Bertholet, Rasson and Lissoir (1998), who use set estimation for the detection of homogeneous areas on satellite images.

In addition to the just-mentioned applications, set estimation techniques can be used in the context of nonparametric statistical quality control. Since this application is the main target of this paper, we comment on it separately in the following subsection.

1.2. Nonparametric detection. The Devroye-Wise method

Suppose that we have a sample of iid observations $X_1, \dots, X_n$, drawn from an unknown density $f$ on $\mathbb{R}^d$. [...]

...then
$$\sum_{n=1}^{\infty} P\{a_n P_n > \epsilon\} < \infty, \quad \text{for all } \epsilon > 0.$$
In particular, if $\varepsilon_n$ is of exact order $(\log n/n)^{1/d}$, then the complete convergence $n^{\beta} P_n \to 0$ holds for all $\beta \in [0, 1/2)$.
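Although the definitions of the estimator (1) and of the quantity $P_n$ fall in the gap above, the estimator is the union of sample balls, $\hat S_n = \cup_{i=1}^n B(X_i, \varepsilon_n)$ (cf. (13) below), and the associated detection rule raises an alarm when a new observation falls outside it. A minimal sketch in Python follows; the uniform training sample, the test points and the constant 1 in $\varepsilon_n$ are illustrative assumptions (Theorem 2 fixes only the rate):

    import numpy as np

    def dw_alarm(x_new, sample, eps):
        # x_new lies in S_n = union_i B(X_i, eps) iff its nearest-neighbour
        # distance to the sample is <= eps; alarm when it falls outside S_n.
        return np.linalg.norm(sample - x_new, axis=1).min() > eps

    rng = np.random.default_rng(0)
    n, d = 100, 2
    sample = rng.uniform(-1.0, 1.0, size=(n, d))   # training sample, S = [-1, 1]^2
    eps_n = (np.log(n) / n) ** (1.0 / d)           # (log n / n)^(1/d) rate; constant 1 is arbitrary
    print(dw_alarm(np.array([0.1, -0.2]), sample, eps_n))   # inside the support: False
    print(dw_alarm(np.array([3.0, 3.0]), sample, eps_n))    # far outside: True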

The proof of this theorem is based on an exponential inequality, due to McDiarmid (1989), concerning the difference between $P_n$ and $E(P_n)$. A short proof as well as some nice applications of this inequality can be found in Devroye (1991). The precise statement of McDiarmid's result is as follows: Let $X_1, \dots, X_n$ be independent random vectors taking values in a set $A \subset \mathbb{R}^d$, and let $g : A^n \to \mathbb{R}$ satisfy the bounded-differences condition
$$\sup_{x_1, \dots, x_n,\, x_i'} \big| g(x_1, \dots, x_n) - g(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n) \big| \le c_i, \quad 1 \le i \le n. \tag{8}$$
Then, for all $t > 0$,
$$P\big\{ |g(X_1, \dots, X_n) - E\,g(X_1, \dots, X_n)| \ge t \big\} \le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right).$$
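As a quick numerical illustration of the inequality, take the simplest bounded-differences example: the sample mean of $[0,1]$-valued variables, for which $c_i = 1/n$ and the bound reduces to Hoeffding's $2\exp(-2nt^2)$. The sketch below (sample size, number of replications and $t$ chosen arbitrarily) compares the empirical exceedance frequency with the bound:

    import numpy as np

    # McDiarmid for g = sample mean of [0,1] variables: c_i = 1/n, so the
    # bound is 2 exp(-2 t^2 / (n * (1/n)^2)) = 2 exp(-2 n t^2).
    rng = np.random.default_rng(1)
    n, reps, t = 200, 20_000, 0.05
    g = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
    empirical = np.mean(np.abs(g - 0.5) >= t)            # E g = 1/2 here
    bound = 2.0 * np.exp(-2.0 * t**2 / (n * (1.0 / n) ** 2))
    print(empirical, bound)                              # empirical frequency <= bound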

This inequality can be applied to $g(X_1, \dots, X_n) = P_n$, since assumption (8) clearly holds in this case with $c_i = \max\{\mu(B(x_i, \varepsilon_n)),\ \mu(B(x_i', \varepsilon_n))\}$ and $c_i \le C \varepsilon_n^d$ for some constant $C > 0$. We thus obtain that, for all $t > 0$,
$$P\big\{ |P_n - E P_n| \ge t \big\} \le 2 \exp\left( -\frac{2t^2}{n C^2 \varepsilon_n^{2d}} \right),$$

and, if $a_n = o(\exp\{c\, n\, \varepsilon_n^d\})$,
$$P\big\{ a_n |P_n - E P_n| \ge \epsilon \big\} \le 2 \exp\left( -\frac{2\epsilon^2}{n\, a_n^2\, C^2\, \varepsilon_n^{2d}} \right), \tag{9}$$
for all $\epsilon > 0$. Thus,
$$\sum_{n=1}^{\infty} P\big\{ a_n |P_n - E P_n| \ge \epsilon \big\} < \infty \tag{10}$$
whenever
$$\sum_{n=1}^{\infty} 2 \exp\left( -\frac{2\epsilon^2}{n\, a_n^2\, C^2\, \varepsilon_n^{2d}} \right) < \infty. \tag{11}$$

Note also that, by Theorem 1, (10) implies the complete (and a.s.) convergence $a_n P_n \to 0$, since $a_n E(P_n) \to 0$. A sufficient condition for (11) is
$$n\, a_n^2\, \varepsilon_n^{2d} \log n \longrightarrow 0.$$

Remark. The standardness assumption in Theorem 1 may be weakened by assuming that, for every $\varepsilon > 0$, there exists some $\delta(\varepsilon) > 0$ such that $n \varepsilon_n^d\, \delta(\varepsilon_n) \to \infty$ and
$$\mu(B(x, \varepsilon) \cap S) \ge \delta(\varepsilon)\, \mu_L(B(x, \varepsilon)), \quad \forall x \in S. \tag{12}$$
When $\varepsilon \to 0$, typically $\delta(\varepsilon)$ will decrease to zero at the same rate $f$ does. As an example, for the triangular density, $\delta(\varepsilon) = \varepsilon$. This affects the rate of convergence of $E(P_n)$ since, under (12),
$$E(P_n) \le C \exp\big( -n\, \delta(\varepsilon_n)\, w_d\, \varepsilon_n^d \big).$$
In other words, the slower $f$ decreases to zero, the slower the rates we get for $E(P_n)$.

3. ESTIMATION OF A CONNECTED SET

The assumption of connectedness is one of the mildest shape restrictions that can be imposed on $S$. If, for instance, the random vector $X$ corresponds to measurements drawn from an industrial process, a lack of connectedness in $S$ means that, in fact, there are several "disjoint" processes working in parallel. We show here that the naive estimator (1) is flexible enough to incorporate the connectedness assumption and give a consistent estimate of $S$. In many quality control problems, this estimator is therefore a natural choice on which to base the nonparametric change-point detection method (5). In Section 4 we will address the implementation aspects and consider the control of the false alarm probability.

The basic idea is to consider the estimator (1) with the smoothing parameter $\varepsilon_n = \varepsilon_n(X_1, \dots, X_n)$ defined by

$$\varepsilon_n = \inf\big\{ \varepsilon > 0 : \hat S_n(\varepsilon) := \cup_{i=1}^n B(X_i, \varepsilon) \text{ is a connected set} \big\}. \tag{13}$$
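The infimum in (13) need not be computed by trial and error: as noted in the Appendix, $\varepsilon_n$ equals half the length $M_n$ of the longest edge of a minimal spanning tree on $X_1, \dots, X_n$. A minimal sketch exploiting that identity (SciPy's MST routine is used as a convenience; the sample is illustrative):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree

    def min_connection_radius(sample):
        # eps_n of (13): half the longest MST edge, since the union of balls
        # of radius eps becomes connected exactly when 2*eps reaches M_n.
        dists = squareform(pdist(sample))
        mst = minimum_spanning_tree(dists)
        return 0.5 * mst.toarray().max()

    rng = np.random.default_rng(2)
    sample = rng.uniform(-1.0, 1.0, size=(50, 2))
    print(min_connection_radius(sample))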

The following consistency results are in the spirit of Theorem 1 in Devroye and Wise (1980). The main difference is that now $\varepsilon_n$ is a stochastic sequence of smoothing parameters instead of a deterministic one. The proof can be found in the Appendix. It relies on results of Tabakis (1996) (for part (a)) and of Penrose (1999) (for part (b)) on random graphs.

Theorem 3. Let $X_1, X_2, \dots$ be a sequence of iid random vectors in $\mathbb{R}^d$ [...]

...an alarm is raised when $T_n > c_\alpha$, where $c_\alpha$ is the critical value of the distribution of $T_n$: $P(T_n > c_\alpha) = \alpha$. This probability is calculated under the null hypothesis that no change has occurred (i.e., $X_1, \dots, X_{n+1}$ are iid). An exact calculation of $c_\alpha$ is impossible since the distribution of $T_n$ is unknown. However, we can compute a bootstrap approximation of $c_\alpha$ as follows:

Smoothed bootstrap computation of $c_\alpha$ (SB)

1. Generate a large number, $B$, of bootstrap samples $X_1^*, \dots, X_n^*, X_{n+1}^*$ by ordinary resampling from the original data $X_1, \dots, X_n$.

2. Smooth the $X_i^*$'s in the form $X_i' = X_i^* + Z_i'$, where the $Z_i'$'s are iid from a uniform distribution on $B(0, \varepsilon_n)$. Note that these artificial observations $X_1', \dots, X_{n+1}'$ are in fact a "smoothed bootstrap sample" drawn from a kernel density estimator (based on $X_1, \dots, X_n$) with uniform kernel on $B(0,1)$ and bandwidth $\varepsilon_n$.

3. For each bootstrap sample $X_1', \dots, X_{n+1}'$, compute the statistic $T_n^{(i)}$ given by (15). Then estimate the distribution of $T_n$ by the empirical distribution associated with $T_n^{(1)}, \dots, T_n^{(B)}$, and the percentile $c_\alpha$ by the corresponding empirical percentile.

The stopping rule $T_n > c_\alpha$ is equivalent to raising an alarm whenever $X_{n+1} \notin \hat S_n(c_\alpha \varepsilon_n)$. Thus, $c_\alpha$ represents an expansion/contraction of the minimum radius for connection.
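A minimal sketch of steps 1-3. Since the statistic (15) is not reproduced in this excerpt, the code assumes the form $T_n = \min_i \|X_{n+1} - X_i\| / \varepsilon_n$, which is consistent with the equivalence just noted ($T_n > c_\alpha$ iff $X_{n+1} \notin \hat S_n(c_\alpha \varepsilon_n)$) but is an assumption rather than the paper's definition:

    import numpy as np

    def sb_critical_value(sample, eps_n, alpha, B=10_000, rng=None):
        # Steps 1-3 of the SB method; T_n is ASSUMED to be
        # min_i ||X'_{n+1} - X'_i|| / eps_n (statistic (15) is not shown here).
        rng = rng or np.random.default_rng()
        n, d = sample.shape
        t_vals = np.empty(B)
        for b in range(B):
            boot = sample[rng.integers(0, n, size=n + 1)]       # step 1: resample n+1 points
            direc = rng.standard_normal((n + 1, d))
            direc /= np.linalg.norm(direc, axis=1, keepdims=True)
            radii = eps_n * rng.uniform(0.0, 1.0, size=(n + 1, 1)) ** (1.0 / d)
            smooth = boot + radii * direc                       # step 2: uniform noise on B(0, eps_n)
            t_vals[b] = np.linalg.norm(smooth[:n] - smooth[n], axis=1).min() / eps_n
        return np.quantile(t_vals, 1.0 - alpha)                 # step 3: empirical percentile c_alpha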

A second possibility for computing $c_\alpha$ is to select the radius $c_\alpha \varepsilon_n$ directly by a leave-one-out device, conceptually related to cross-validation smoothing procedures (see, e.g., Simonoff 1996).

Cross-validation smoothing (CV)

1. For each $\varepsilon$ in a sufficiently fine grid, evaluate
$$\hat P_n(\varepsilon) = \frac{\#\{X_i \notin \hat S_{n,-i}(\varepsilon)\}}{n},$$
where $\hat S_{n,-i}(\varepsilon)$ is the estimator (1) based on the sub-sample $X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$.

2. Select $\varepsilon_n$ as the value of $\varepsilon$ in the grid which minimizes the distance $|\hat P_n(\varepsilon) - \alpha|$.

We compare these two proposals for selecting the smoothing parameter in a simulation study; a code sketch of the CV selection is given after Table 1.

4.2 A simulation study

We compare the SB and CV methods by using them to evaluate the critical values $c_\alpha$ for radii $\varepsilon_n$ corresponding to the false alarm probabilities $\alpha$ = 0.1, 0.05 and 0.01. We consider two underlying distributions, uniform on the unit circle and standard bivariate normal, with sample sizes $n$ = 50 and 100. In each case, $c_\alpha$ is approximated by Monte Carlo (using 10000 samples $X_1, \dots, X_n, X_{n+1}$), so that the performance of both smoothing methods should be measured by their closeness to the corresponding Monte Carlo result (see Table 1). The smoothed bootstrap (with $B$ = 10000) and cross-validation columns are average choices of $c_\alpha$ over 300 independent runs.

Table 1: Smoothed bootstrap and cross-validation choices of $c_\alpha$. The true values are approximated by Monte Carlo.

Uniform case
                       Monte Carlo   Smoothed Bootstrap   Cross-Validation
  n = 50    alpha=0.10    1.2157           1.0624              1.2225
            alpha=0.05    1.4260           1.2802              1.4080
            alpha=0.01    1.9162           1.8435              1.7841
  n = 100   alpha=0.10    1.1612           1.0016              1.1741
            alpha=0.05    1.3519           1.1989              1.3529
            alpha=0.01    1.7854           1.6764              1.7344

Normal case
                       Monte Carlo   Smoothed Bootstrap   Cross-Validation
  n = 50    alpha=0.10    0.9962           0.9710              1.0106
            alpha=0.05    1.3326           1.2371              1.3226
            alpha=0.01    2.2484           2.0808              1.9433
  n = 100   alpha=0.10    0.7845           0.7827              0.7797
            alpha=0.05    1.0569           1.0148              1.0457
            alpha=0.01    1.9156           1.7257              1.7536

The conclusions from Table 1 are clearly favorable to the cross-validation procedure, since it is more accurate than the smoothed bootstrap in nine out of twelve cases.
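The CV choice admits a direct vectorized implementation, since $X_i \notin \hat S_{n,-i}(\varepsilon)$ exactly when the nearest-neighbour distance of $X_i$ within the rest of the sample exceeds $\varepsilon$. A sketch (the grid and the normal sample are illustrative):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def cv_radius(sample, alpha, grid):
        # Leave-one-out proportion P_n(eps) = #{X_i outside S_{n,-i}(eps)} / n,
        # minimized in |P_n(eps) - alpha| over the grid.
        dists = squareform(pdist(sample))
        np.fill_diagonal(dists, np.inf)           # X_i is excluded from its own estimator
        nn = dists.min(axis=1)                    # dist(X_i, sample \ {X_i})
        p_hat = np.array([(nn > eps).mean() for eps in grid])
        return grid[np.argmin(np.abs(p_hat - alpha))]

    rng = np.random.default_rng(3)
    sample = rng.standard_normal((100, 2))
    grid = np.linspace(0.05, 2.0, 200)
    print(cv_radius(sample, alpha=0.05, grid=grid))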

These results have been confirmed by the power study summarized in Tables 2 and 3. The power we consider is the probability of detecting a change at the first stage, obtained with both methods. We also evaluate the relative frequencies of false alarms observed in practice. The values in Tables 2 and 3 are sample means (over 300 independent runs) when the smoothing parameter is selected with SB (Table 2) and CV (Table 3). The values in brackets are the standard deviations. In the "false alarm" rows we show the mean of the observed relative frequencies of false alarm, when the underlying distribution actually does not change for $X_{n+1}$. The last column corresponds to the case where the false alarm probability is not fixed in advance and the minimum radius for connection $\varepsilon_n$ is used. The "power" rows are the averages of the relative frequencies of alarm, observed when the distribution changes at stage $n+1$. The "out-of-control" distribution for the uniform case is uniform on the circle with radius 1.35 centered at (0,1). In the normal case it is Gaussian with marginal means $\mu_1 = \mu_2 = 2$ and covariance matrix given by $\sigma_1^2 = \sigma_2^2 = 2$ and $\sigma_{12} = 0$ (this example is also considered in Liu 1995). In all cases the relative frequency of the alarms $\{X_{n+1} \notin \hat S_n\}$ is evaluated from 3000 additional observations $X_{n+1}$ generated from the "under-control" distribution, for the "false alarm" rows, or from the "out-of-control" distribution, for the "power" rows.

We observe in Tables 2 and 3 that in the normal case the empirical false alarm rate is very close to the theoretical one, using both CV and SB. For the uniform case CV clearly outperforms SB.

The best way to evaluate the power results in Tables 2 and 3 is perhaps to compare them with those obtained in the ideal, unrealistic situation in which both distributions (before and after the change-point) are completely known. Thus, for example, denote by $f_0$ the standard Gaussian bivariate density and by $f_1$ the alternative ("out-of-control") normal density considered in the above simulations. We have to decide, from a single observation $X$, between the hypotheses
$$H_0 : X \sim f_0 \quad \text{and} \quad H_1 : X \sim f_1,$$
with a fixed type I error (false alarm) probability $\alpha$. Notice that no training sample is required here, since the "under-control" distribution is known. Of course, the solution is to reject $H_0$ when $f_1(X)/f_0(X) > k_\alpha$, where $k_\alpha$ is given by $P_{f_0}\{f_1(X)/f_0(X) > k_\alpha\} = \alpha$, and the corresponding "ideal power" is $IP(\alpha) = P_{f_1}\{f_1(X)/f_0(X) > k_\alpha\}$. We have obtained Monte Carlo approximations to $IP(\alpha)$, based on 200000 observations, for $\alpha = 0.01, 0.05$ and 0.1. The results are in Table 4.

If we compare these ideal powers with those provided by our nonparametric method (see Tables 2 and 3) with small or moderate sample sizes, we observe that the differences between the real and the ideal powers range between 0.1 and 0.15. This seems a reasonable price to pay for working in a completely nonparametric setup. The analogous comparison for the uniform case is even more favorable. In fact, if both distributions are completely known we can use an obvious detection procedure, $X \notin B((0,0), 1)$, with null probability of false alarm. The corresponding power is approximately 0.6512. Thus the differences between the ideal error probability (0.3488) and the sum of both error probabilities (including false alarm) of our method are smaller than 0.1 for $n = 100$ and only slightly larger than 0.1 when $n = 50$.
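The Monte Carlo computation of $IP(\alpha)$ for the normal example takes a few lines and should approximately reproduce Table 4; the densities follow the description above, while the seed is arbitrary:

    import numpy as np
    from scipy.stats import multivariate_normal

    # Ideal power for the normal example: f0 = N(0, I), f1 = N((2,2), 2I).
    f0 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
    f1 = multivariate_normal(mean=[2.0, 2.0], cov=2.0 * np.eye(2))
    rng = np.random.default_rng(4)
    m = 200_000
    lr = lambda x: f1.pdf(x) / f0.pdf(x)             # likelihood ratio
    lr0 = lr(f0.rvs(size=m, random_state=rng))       # LR sampled under f0
    lr1 = lr(f1.rvs(size=m, random_state=rng))       # LR sampled under f1
    for alpha in (0.01, 0.05, 0.1):
        k_alpha = np.quantile(lr0, 1.0 - alpha)      # P_f0{LR > k_alpha} = alpha
        print(alpha, (lr1 > k_alpha).mean())         # IP(alpha) = P_f1{LR > k_alpha}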

Table 2: Power study with smoothed bootstrap.

Uniform case
                           alpha = 0.1     alpha = 0.05    alpha = 0.01    c_alpha = 1
  n = 50    false alarm   0.1692 (.075)   0.0878 (.057)   0.0152 (.020)   0.2020 (.067)
            power         0.6743 (.041)   0.6269 (.040)   0.5442 (.044)   0.6913 (.038)
  n = 100   false alarm   0.1708 (.062)   0.0871 (.045)   0.0143 (.015)   0.1706 (.054)
            power         0.6852 (.032)   0.6443 (.029)   0.5850 (.026)   0.6861 (.029)

Normal case
                           alpha = 0.1     alpha = 0.05    alpha = 0.01    c_alpha = 1
  n = 50    false alarm   0.1073 (.061)   0.0598 (.043)   0.0152 (.018)   0.0993 (.058)
            power         0.7352 (.069)   0.6802 (.086)   0.5374 (.128)   0.7274 (.073)
  n = 100   false alarm   0.0980 (.050)   0.0550 (.034)   0.0147 (.015)   0.0577 (.035)
            power         0.7332 (.051)   0.6800 (.063)   0.5525 (.101)   0.6817 (.068)

Table 3: Power study with cross-validation.

Uniform case
                           alpha = 0.1     alpha = 0.05    alpha = 0.01    c_alpha = 1
  n = 50    false alarm   0.1059 (.057)   0.0591 (.044)   0.0197 (.025)   0.1995 (.064)
            power         0.6401 (.037)   0.6062 (.036)   0.5525 (.048)   0.6889 (.038)
  n = 100   false alarm   0.099  (.042)   0.0515 (.029)   0.0137 (.016)   0.1742 (.056)
            power         0.6531 (.024)   0.6251 (.023)   0.5826 (.028)   0.6885 (.030)

Normal case
                           alpha = 0.1     alpha = 0.05    alpha = 0.01    c_alpha = 1
  n = 50    false alarm   0.1026 (.050)   0.0554 (.038)   0.0193 (.022)   0.1022 (.057)
            power         0.7367 (.057)   0.6750 (.078)   0.5626 (.113)   0.7299 (.069)
  n = 100   false alarm   0.0986 (.035)   0.0516 (.025)   0.0128 (.011)   0.0554 (.030)
            power         0.7400 (.040)   0.6828 (.052)   0.5569 (.081)   0.6824 (.062)

Table 4: Ideal powers in the normal case

              alpha = 0.01   alpha = 0.05   alpha = 0.1
  IP(alpha)      0.6504         0.8105         0.8730

Three final remarks:

(a) The comparative results of the smoothed bootstrap and cross-validation choices are globally favorable to the latter. We think, however, that the smoothed bootstrap procedure still deserves some attention. Indeed, the possibility of changing the bandwidth used for the auxiliary density estimator in the generation of the bootstrap samples provides an additional flexibility which could motivate further research for improvements.

(b) A possible heuristic explanation for the better results of cross-validation when compared with the smoothed bootstrap is that the latter does worse in an external band of the support. The point is that, in this external band, there is little overlapping of the balls which constitute the support of the distribution used to draw the artificial samples. This makes the estimated density much lower in the external band than in the inner part of the support. Thus, the bootstrap tends to underestimate the support and to give smaller values of $c_\alpha$. In the uniform case, the contrast between the quality of the estimation in the external band and that in the inner part tends to be even sharper when $n$ increases. On the contrary, in the normal case the difference is not so important, since the underlying density is not bounded away from zero.

(c) Whereas the theoretical results given in the previous sections concern only the case of compact support, the proposals for selecting $\varepsilon_n$ make sense in general and, in fact, have shown a good performance in the normal case. A reasonable conjecture in this case is that the bootstrap and/or the cross-validation choice of $\varepsilon_n$ for a false alarm probability $\alpha$, say $\varepsilon_n(\alpha)$, leads (when replaced in (1)) to a consistent estimator of the $\alpha$-level set $\{f > c_\alpha\}$ such that $P_f\{f > c_\alpha\} = 1 - \alpha$. In Figure 1 we show the level sets (the solid lines), for $\alpha = 0.01, 0.05, 0.1$, and the corresponding set estimates (1), when $\varepsilon_n = \varepsilon_n(\alpha)$. The darker the shadowed area, the larger $\alpha$. The distributions are: (a) standard bivariate normal; and (b) correlated bivariate normal, with $\rho = 0.4$.

Figure 1: Level set estimates in the normal case: (a) standard bivariate normal; (b) correlated bivariate normal with $\rho = 0.4$.

4.3. A real-data example

The book by Flury and Riedwyl (1988) includes a data set of measurements taken on 100 genuine and 100 forged Swiss thousand-franc bills. Six variables ($X_1, \dots, X_6$) have been measured for each bill. They correspond, respectively, to

the length of the bill ($X_1$), the width of the bill measured on the left and on the right ($X_2$ and $X_3$), the width of the margin at the bottom and at the top ($X_4$ and $X_5$), and the length of the image diagonal of the bill ($X_6$). All measurements are given in millimeters.

If we look at these data within the change-point setup, we can consider the genuine bills as a training sample where the distribution of $(X_1, \dots, X_6)$ is under control. Then we estimate the support and detect the presence of a forged bill by an "alarm" in our detection procedure. In order to reduce the dimension of the problem, we first perform a principal components analysis (with the data of the genuine bills). The first two principal components are

$$Y_1 = -0.133 X_1 + 0.439 X_2 + 0.712 X_3 + 0.746 X_4 + 0.553 X_5 - 0.867 X_6,$$
$$Y_2 = 0.835 X_1 + 0.545 X_2 + 0.419 X_3 - 0.154 X_4 - 0.194 X_5 + 0.236 X_6.$$

We apply our detection method to the variables $(Y_1, Y_2)$, assuming that the training sample consists of only $n = 90$ genuine bills. We check for a possible false alarm (with $\alpha = 0.05$) on the remaining genuine bills (i.e., using the values $(Y_{1n}, Y_{2n})$ for $n = 91, \dots, 100$) and for a change point in the first forged bills (from stage $n = 101$). We consider a cross-validation choice of $\varepsilon_n$ and sequentially change the

set estimator after every new observation which does not raise an alarm. The alarm is given for $n = 102$. Hence, a change-point has been detected coinciding with the second appearance of a forged bill.

Figure 2 shows a control chart for this problem. We plot the sequentially obtained values of the statistic $T_n$ defined in (15) (solid line), and the corresponding critical values $c_{0.05}$, for $n = 90, \dots, 101$ (dashed line). As in a typical control chart we have two areas, the "under-control" and the "out-of-control" areas. They are separated by the $c_\alpha$ values, which act as a confidence band. It can be seen that all the observations $X_{90}, \dots, X_{100}$ remain neatly within the under-control area, whereas the first forged bill is close to the alarm area and the second forged bill corresponds to an out-of-control observation.
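A sketch of this sequential scheme, reusing cv_radius from the sketch in Section 4.2; the arrays train (the principal-component scores of the 90 genuine training bills) and stream (the subsequent observations) are hypothetical names assumed to be available:

    import numpy as np

    def monitor(train, stream, alpha, grid):
        # Sequential detection: each new observation that raises no alarm is
        # appended to the training set and the radius is re-selected by CV
        # (cv_radius is defined in the sketch of Section 4.2).
        sample = np.asarray(train, dtype=float)
        for t, x in enumerate(stream):
            radius = cv_radius(sample, alpha, grid)
            if np.linalg.norm(sample - x, axis=1).min() > radius:
                return t                          # index of the first alarm
            sample = np.vstack([sample, x])
        return None                               # no alarm raised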


Figure 2: $T_n$ control chart for the Swiss bills.

4.4. Software

An MS-DOS program which performs the set-based change-point detection method, the principal components of the Swiss thousand-franc bills data set, and the code of the Fortran 90 subroutines used in this section can be downloaded from the web

site http://www.adi.uam.es/~ajustel/bcj.html

APPENDIX

Proof of Lemma 1. Let us prove (7). Given $\varepsilon_n > 0$, let us take a minimal covering of $S$ by balls $B_j := B(Z_j, \varepsilon_n/2)$ (with $j = 1, \dots, R_n$) centered at points $Z_j \in S$. We have
$$E(P_n) = E\,\mu\big( \hat S_n^c \mid X_1, \dots, X_n \big) = E\,\mu\big( \hat S_n^c \cap \cup_{j=1}^{R_n} B_j \mid X_1, \dots, X_n \big) \le \sum_{j=1}^{R_n} E\,\mu\big( \hat S_n^c \cap B_j \mid X_1, \dots, X_n \big). \tag{16}$$

Define
$$A_{n,j} := \Big\{ \omega : \sum_{i=1}^n I_{B_j}(X_i(\omega)) > 0 \Big\}.$$
It is clear that, by definition,
$$A_{n,j} \subset \big\{ \omega : \hat S_n^c(\omega) \cap B_j = \emptyset \big\}.$$
Hence, if $X$ is a random vector with distribution $\mu$ and independent of $X_1, \dots, X_n$, we have $X^{-1}(\hat S_n^c \cap B_j) \subset A_{n,j}^c \cap X^{-1}(B_j)$ and
$$E\big( \mu(\hat S_n^c \cap B_j \mid X_1, \dots, X_n) \big) = E\big( P(X^{-1}(\hat S_n^c \cap B_j) \mid X_1, \dots, X_n) \big) \le E\big( P(A_{n,j}^c \cap X^{-1}(B_j) \mid X_1, \dots, X_n) \big).$$
Thus, from (16),

$$\begin{aligned}
E(P_n) &\le \sum_{j=1}^{R_n} E\big( P(A_{n,j}^c \cap X^{-1}(B_j) \mid X_1, \dots, X_n) \big)
= \sum_{j=1}^{R_n} E\big( E(I_{A_{n,j}^c} I_{X^{-1}(B_j)} \mid X_1, \dots, X_n) \big) \\
&= \sum_{j=1}^{R_n} E\big( I_{A_{n,j}^c}\, E(I_{X^{-1}(B_j)} \mid X_1, \dots, X_n) \big)
= \sum_{j=1}^{R_n} \mu(B_j)\, E\big( I_{A_{n,j}^c} \big) \\
&= \sum_{j=1}^{R_n} \mu(B_j)\, P\Big\{ \sum_{i=1}^n I_{B_j}(X_i) = 0 \Big\}
= \sum_{j=1}^{R_n} \mu(B_j)\, \big( 1 - \mu(B_j) \big)^n.
\end{aligned}$$

Now, by applying the inequality $(1-x)^n \le e^{-nx}$, valid for $0 \le x \le 1$, we get
$$E(P_n) \le \sum_{j=1}^{R_n} \mu(B_j) \exp\{ -n\, \mu(B_j) \}. \tag{17}$$

In order to get an upper bound for (17), let us observe that, since $\mu$ has a bounded density (say $f \le C_1$), $\mu(B_j) \le C_1 w_d \varepsilon_n^d$, where $w_d$ is the Lebesgue measure of the unit ball in $\mathbb{R}^d$. [...] $> 0$ (depending on $S$), (7) follows directly from (18). □

Proof of Theorem 3. We have
$$\mu\big( \hat S_n(\varepsilon_n) \,\triangle\, S \big) = \mu\big( \hat S_n(\varepsilon_n) \setminus S \big) + \mu\big( S \setminus \hat S_n(\varepsilon_n) \big). \tag{19}$$

Let us first prove that $\varepsilon_n \to 0$ a.s. Indeed, take any $\delta > 0$ and consider a covering $\mathcal{C}$ of $S$ consisting of closed cubes with side $\delta/\sqrt{d}$ and disjoint interiors. With probability one, every $C \in \mathcal{C}$ such that $\mu_L(C \cap S) > 0$ will eventually contain an observation $X_i$. This shows that $\varepsilon_n < \operatorname{diam}(C) = \delta$, eventually a.s. By the dominated convergence theorem, this implies the a.s. convergence to zero of the first term on the right-hand side of (19).

To study the second term in (19), let us note that, in the setup of graph theory, $\varepsilon_n$ is half the length, $M_n$, of the longest edge in a minimal spanning tree (see, e.g., Lebart, Morineau and Warwick 1984, p. 122) with vertices $\{X_1, \dots, X_n\}$. This follows from the iterative calculation of $\varepsilon_n$, since this iterative procedure in fact provides a minimal spanning tree whose longest edge has length $2\varepsilon_n$. Indeed, note that, in a spanning tree, the length of any edge with vertex $X_1$ must be at least $R_1$; also, the length of any edge joining $X_2$ and another point different from $X_1$ must be at least $R_2$; etc.

Under the hypothesis in (a) and the assumptions of the theorem, Tabakis (1996) has proved that

$$\lim_{n \to \infty} P\left\{ M_n \ge \left( \frac{k \log n}{A\, w_d\, n} \right)^{1/d} \right\} = 1,$$
for every $k < 1$, where $A = \sup\{f(x) : x \in S\}$ (observe that $f$ is continuous, and hence bounded, on the compact set $S$) and $w_d$ is the volume of the unit ball in $\mathbb{R}^d$.