Statistics and Computing (1997) 7, 115±124
Numerical performance of block thresholded wavelet estimators

PETER HALL1, SPIRIDON PENEV1,2, GÉRARD KERKYACHARIAN1,3 and DOMINIQUE PICARD1,4

1 Centre for Mathematics and its Applications, Australian National University, Canberra, ACT 0200, Australia
2 School of Mathematics, University of NSW, Sydney, NSW 2052, Australia
3 Faculté de Mathématiques et Informatique, Université de Picardie, 33 rue Saint-Leu, 80039 Amiens Cedex 01, France
4 Département de Mathématiques, Université de Paris VII, Paris Cedex 05, France
Received November 1995 and accepted September 1996
Usually, methods for thresholding wavelet estimators are implemented term by term, with empirical coefficients included or excluded depending on whether their absolute values exceed a level that reflects plausible moderate deviations of the noise. We argue that performance may be improved by pooling coefficients into groups and thresholding them together. This procedure exploits the information that coefficients convey about the sizes of their neighbours. In the present paper we show that in the context of moderate to low signal-to-noise ratios, this 'block thresholding' approach does indeed improve performance, by allowing greater adaptivity and reducing mean squared error. Block thresholded estimators are less biased than term-by-term thresholded ones, and so react more rapidly to sudden changes in the frequency of the underlying signal. They also suffer less from spurious aberrations of Gibbs type, produced by excessive bias. On the other hand, they are more susceptible to spurious features produced by noise, and are more sensitive to selection of the truncation parameter.

Keywords: Adaptivity, bias, density estimation, mean squared error, non-parametric regression, smoothing parameter, variance
1. Introduction and summary

Wavelet estimators enjoy unusual adaptivity, and can be used to de-noise complex signals that are composed of a mixture of smooth curves and erratic fluctuations. In their most commonly suggested form (e.g. Donoho and Johnstone, 1995; Donoho et al., 1995), wavelet methods achieve this adaptivity through term-by-term thresholding, in which individual decisions are made about the importance of terms in the empirical wavelet expansion. If the absolute value of a term exceeds a certain threshold, chosen to be higher than plausible moderate deviations of the noise, then it is deemed that the term contains significant information about the signal. In this case it is retained in the expansion; otherwise it is deleted. This approach achieves a degree of trade-off between variance and bias contributions to mean squared error. Without some form of thresholding the estimator would suffer from too much variance, and would not perform nearly so well in processing complex signals.

0960-3174 © 1997 Chapman & Hall

The trade-off achieved by standard wavelet signal estimators is not optimal, however. It removes too many terms from the empirical wavelet expansion, with the result that the estimator is too biased and has a sub-optimal mean square convergence rate (and sub-optimal rates in other $L^p$ metrics as well). These problems appear to be unavoidable in term-by-term thresholding, since decisions about individual terms are necessarily based on a relatively low level of information, and so are more prone to error. The difficulty emerges through the need to guard against 'false positive' decisions about the presence of irregularities of $g$, and introduces a logarithmic factor into both the threshold and the convergence rate. As a result, the estimator is oversmoothed, and so does not react to relatively subtle changes in the target function, of the type that 'adaptive' bandwidth selection rules are designed to track in more conventional estimators.
(For the latter, see for example Gasser et al., 1991 and Fan and Gijbels, 1995.) Details will be given in Section 2.2. Recent work has shown that by combining wavelet coefficients into groups, and using information about each one to assist in assessing its neighbours, one can increase the accuracy with which coefficients may be estimated, and thereby improve the adaptivity of wavelet methods. Theoretical accounts of this idea have shown that it has merit; see for example Efroimovitch (1985), who discussed it in the context of general estimation by orthogonal series; Kerkyacharian et al. (1994), who considered pooling wavelet coefficients across all spatial locations, for a given level of resolution; and Hall et al. (1995), who suggested pooling coefficients at locally chosen spatial positions, for a given resolution level. The purpose of the present paper is to explore numerical properties of the latter procedure, which we call 'block thresholding', and to compare it with term-by-term thresholding. We summarize an extensive simulation study, addressing the performance of thresholding for a variety of smooth and discontinuous curves. Our results lead to the following conclusions. In the context of moderate to low signal-to-noise ratios, block thresholding provides greater adaptivity than term-by-term thresholding. It reacts more rapidly to jump discontinuities or corners in curves, and produces curve estimators that reach more deeply into troughs or peaks in smooth functions. It suffers less from Gibbs' phenomenon, since it produces less biased estimators (Gibbs' phenomenon results from oscillatory wavelet coefficients in bias terms); and it better adjusts stochastic error relative to bias, producing estimators with lower mean squared error.
On the other hand, its higher level of stochastic variability can lead it to add spurious features that are not present in the true curve; and it is more susceptible than term-by-term thresholding to the choice of truncation level for the series of thresholded terms. The effect of truncation is only of second order, but that is a drawback as well as a blessing, since it makes the truncation point awkward to select by purely objective, empirical means. In the context of signals with relatively high signal-to-noise ratios, term-by-term thresholding already performs well (e.g. Donoho et al., 1995), and block thresholding has relatively little to offer. Section 2 will introduce and discuss methodology for wavelet thresholding, in the case of regression and density estimation. Section 3 will summarize the results of our numerical study, and draw the conclusions noted above.
2. Methodology for block thresholding

2.1. Introduction

Let $\phi$, the scaling function, be given by the dilation equation $\phi(x) = \sum_k c_k \phi(2x - k)$, where the constants $c_k$ satisfy $\sum_k c_k = 2$ and $\sum_k (-1)^k k^i c_k = 0$ for $0 \le i \le s - 1$. Suppose $\phi$ is normalized so that $\int \phi = 1$, and has orthonormal translates: $\int \phi(x)\phi(x - k)\,dx = \delta_{0k}$, the Kronecker delta. Define the wavelet function $\psi$ by

$$\psi(x) = \sum_k (-1)^k c_{k+1}\,\phi(2x + k).$$

The functions $\phi$ and $\psi$ are often called 'father' and 'mother' wavelets, respectively. We assume that both functions are compactly supported; this is almost invariably the case in practice. They necessarily satisfy the orthonormality relations $\int \psi(x)\psi(x - k)\,dx = \delta_{0k}$, $\int \psi(x)\phi(x - k)\,dx = 0$ and $\int \psi(x)\psi(2x - k)\,dx = 0$; and also the 'zero moments condition', $\int x^i \psi(x)\,dx = 0$ for $0 \le i \le s - 1$. Define $\kappa = (s!)^{-1} \int x^s \psi(x)\,dx$, $\phi_k(x) = \phi(x - k)$ and $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$. For appropriate choice of the $c_k$'s the latter functions form a complete orthonormal basis, so that for any square-integrable function $g$,

$$g = \sum_k a_k \phi_k + \sum_{j=0}^{\infty} \sum_k b_{jk} \psi_{jk} \qquad (2.1)$$

where $a_k = \int g\phi_k$, $b_{jk} = \int g\psi_{jk}$, and the series converges in a mean square sense. See for example Daubechies (1992, p. 130) and Meyer (1992, p. 67). Sections 2.2 and 2.3 will treat estimation of $g$ from data $Y = \{Y_1, \ldots, Y_n\}$ generated by the model

$$Y_m = g(x_m) + \epsilon_m, \qquad 1 \le m \le n, \qquad (2.2)$$

where $x_m = m/n$ and the independent and identically distributed errors $\epsilon_m$ have zero mean and variance $\sigma^2 > 0$. Thus, it is assumed that $g$ vanishes outside the interval $I = [0, 1]$. Section 2.4 will discuss wavelet methods in the context of density estimation.

2.2. Term by term thresholding

The coefficients $a_k$ and $b_{jk}$ may be estimated by $\hat a_k = n^{-1} \sum_m Y_m \phi_k(x_m)$ and $\hat b_{jk} = n^{-1} \sum_m Y_m \psi_{jk}(x_m)$, respectively, and substituted into the expansion at (2.1) to obtain an estimator of $g$:

$$\tilde g = \sum_k \hat a_k \phi_k + \sum_{j=0}^{q} \sum_k \hat b_{jk} \psi_{jk}\, I(|\hat b_{jk}| > \delta). \qquad (2.3)$$

Here, $\delta$ denotes the threshold and $q$ is a truncation point, without which the series in $j$ would not converge. The purpose of thresholding the terms in $\psi_{jk}$ is to capture irregularities, such as discontinuities, in $g$. An irregularity in the vicinity of $k/2^j$ is indicated by an unduly large value of the wavelet coefficient $b_{jk}$, which would typically be identifiable by a large value of its estimate $\hat b_{jk}$. In the vicinity of substantial irregularities (such as jump discontinuities), thresholding has the effect of reducing bias without appreciably increasing variance. To appreciate the operation of thresholding, observe that the asymptotic variance of $\hat b_{jk}$ equals $n^{-1}\sigma^2$, even at arbitrarily high levels $j$:
$$(n/\sigma^2)\,\mathrm{var}(\hat b_{jk}) = n^{-1} \sum_{m=1}^{n} 2^j \psi(2^j x_m - k)^2 \approx \int 2^j \psi(2^j x - k)^2\,dx = 1.$$
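The variance calculation above can be checked numerically. The following sketch is our own illustration, not part of the paper: it uses the Haar mother wavelet and an arbitrary smooth test signal, simulates model (2.2) repeatedly, and confirms that $n\,\mathrm{var}(\hat b_{jk})/\sigma^2$ is close to 1 even at a fairly high resolution level $j$.

```python
import numpy as np

# Haar mother wavelet: psi(x) = 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere.
def psi_haar(x):
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def psi_jk(x, j, k):
    # psi_jk(x) = 2^(j/2) psi(2^j x - k)
    return 2 ** (j / 2) * psi_haar(2 ** j * x - k)

def b_hat(y, x, j, k):
    # Empirical wavelet coefficient: n^{-1} sum_m Y_m psi_jk(x_m)
    return np.mean(y * psi_jk(x, j, k))

rng = np.random.default_rng(0)
n, sigma, j, k = 1024, 1.0, 3, 2
x = np.arange(1, n + 1) / n          # design points x_m = m/n
g = np.sin(2 * np.pi * x)            # arbitrary smooth test signal

# Monte Carlo estimate of var(b_hat_jk) over repeated noise realizations
reps = 4000
coeffs = np.array([b_hat(g + sigma * rng.standard_normal(n), x, j, k)
                   for _ in range(reps)])
print(np.var(coeffs) * n / sigma ** 2)   # should be close to 1
```

The choice $j = 3$, $k = 2$ keeps the support of $\psi_{jk}$ inside $[0, 1]$, so the Riemann-sum approximation to $\int 2^j \psi(2^j x - k)^2\,dx$ is exact for the Haar case.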
To ensure an overall rate of convergence that is at least as fast as $n^{-c_1}$ for some $c_1 > 0$, the level of 'false positive' detections of irregularities should be kept to no more than $n^{-c_2}$, for a suitable positive $c_2$. Supposing for simplicity that the error distribution is normal, so too is the distribution of $\hat b_{jk}$. As a result, if irregularities of $g$ are not present near $k/2^j$ then the mean of $\hat b_{jk}$ is approximately zero, and the probability that $|\hat b_{jk}|$ exceeds $\delta$ (i.e. the probability of a false positive detection of an irregularity) can be of order $n^{-c_2}$ only if $\delta$ is approximately $c_3 \sigma (n^{-1}\log n)^{1/2}$, where $c_3 = (2c_2)^{1/2}$. (Recall that the variance of $\hat b_{jk}$ is approximately $n^{-1}\sigma^2$, and that the probability that a normal $N(0, n^{-1}\sigma^2)$ variable exceeds $c_3 \sigma (n^{-1}\log n)^{1/2}$ is asymptotic to a constant multiple of $n^{-c_3^2/2}(\log n)^{-1/2}$.) There is a little arbitrariness in the choice of $c_2$ and $c_3$, but taking $c_2 = 1$ and $c_3 = 2^{1/2}$ produces a level of false positives that is adequate for most purposes. Thus, the threshold is often taken to be $\delta = (2\sigma^2 n^{-1}\log n)^{1/2}$, in which $\sigma^2$ is replaced by an estimator for practical application. The logarithm term here produces a logarithmic factor in convergence rates; see for example Theorems 2 and 3 of Donoho et al. (1995). This is the factor which renders those rates inferior to ones that are optimal in a minimax sense, uniformly over large function classes. It produces a degree of oversmoothing (see for example Hall and Patil 1996). As a result the mean squared error is dominated by bias, to first order, and the estimator does not react quite as readily to relatively subtle fluctuations in the curve (such as changes in curvature near a smooth peak or trough, as distinct from more dramatic fluctuations such as discontinuities) as it might if it had a slightly lower degree of smoothing.
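The term-by-term rule just described can be sketched in a few lines. This is our illustration (the function names and the example coefficients are ours): a coefficient is retained only if its absolute value exceeds $\delta = (2\sigma^2 n^{-1}\log n)^{1/2}$.

```python
import numpy as np

def universal_threshold(sigma, n):
    # delta = (2 sigma^2 n^{-1} log n)^{1/2}
    return np.sqrt(2.0 * sigma ** 2 * np.log(n) / n)

def threshold_term_by_term(b_hat, sigma, n):
    # Keep b_hat_jk only where |b_hat_jk| > delta; otherwise set it to zero.
    delta = universal_threshold(sigma, n)
    return np.where(np.abs(b_hat) > delta, b_hat, 0.0)

# Example: with n = 1024 and sigma = 1, delta is about 0.116, so
# noise-level coefficients are killed while large ones survive.
n, sigma = 1024, 1.0
b_hat = np.array([0.02, -0.01, 0.5, 0.03, -0.4])
print(threshold_term_by_term(b_hat, sigma, n))
```

In practice $\sigma$ is unknown; a common device (due to Donoho and Johnstone, not specific to this paper) is to estimate it from the median absolute deviation of the finest-level empirical coefficients.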
2.3. Block thresholding in non-parametric regression

The logarithmic factor cannot be removed from $\delta$ without seriously impairing the convergence rate by introducing substantial additional stochastic variability. However, it may be shown that if each $b_{jk}$ were known, and could be used to replace $\hat b_{jk}$ in the indicator function $I(|\hat b_{jk}| > \delta)$ in the definition of $\tilde g$, then the appropriate value of $\delta$ would be a constant multiple of $n^{-1/2}$, the constant being chosen to optimize a measure of performance. This would allow the offending logarithmic factor in convergence rates, noted in the discussion above, to be removed. We suggest overcoming the problem of excessive variability of $\hat b_{jk}$ by estimating not $b_{jk}$ but its average over neighbouring $k$'s. Specifically, partition the integers into consecutive, non-overlapping blocks of length $l = l(n)$, the $b$th block being $B_b = \{k : (b-1)l + m + 1 \le k \le bl + m\}$ for $-\infty < b < \infty$ and an arbitrary integer $m$, representing translation of the blocks. (In our numerical work we shall average over all possible translations.) Put

$$B_{jb} = l^{-1} \sum_{(b)} b_{jk}^2$$

where $\sum_{(b)}$ denotes summation over $k \in B_b$. This represents our approximation to $b_{jk}^2$ for those values of $k$ in $B_b$. We could estimate $B_{jb}$ as

$$\tilde B_{jb} = l^{-1} \sum_{(b)} \hat b_{jk}^2.$$

However, $\tilde B_{jb}$ suffers excessively from bias. Instead, we suggest employing the estimator

$$\hat B_{jb} = l^{-1} \sum_{(b)} \hat c_{jk} \qquad (2.4)$$

where

$$\hat c_{jk} = 2\{n(n-1)\}^{-1} \sum_{1 \le m_1 < m_2 \le n} Y_{m_1} Y_{m_2}\, \psi_{jk}(x_{m_1})\, \psi_{jk}(x_{m_2}) + n^{-2} \sum_{m=1}^{n-1} Y_m Y_{m+1}\, \psi_{jk}(x_m)^2$$

is an estimator of $b_{jk}^2$. (This quantity is constructed to have particularly low bias. As an estimator of $B_{jb}$, $\hat B_{jb}$ enjoys bias of smaller order than $n^{-1} + B_{jb}$. It is related to second-order $U$-statistics (Serfling 1980, Chapter 5), in that the double series in the definition of $\hat c_{jk}$ is a sum of products of pairs of independent random variables.) Replacing the threshold indicator function $I(|\hat b_{jk}| > \delta)$, in the definition of $\tilde g$, by $I(\hat B_{jb} > \delta^2)$, we obtain the block thresholded estimator

$$\hat g = \sum_k \hat a_k \phi_k + \sum_{j=0}^{q} \sum_{b=-\infty}^{\infty} I(\hat B_{jb} > \delta^2) \sum_{(b)} \hat b_{jk} \psi_{jk}.$$
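The keep-or-kill decision at block level can be sketched as follows. This is our own simplified illustration, not the paper's procedure: for brevity it uses the naive block average $\tilde B_{jb} = l^{-1}\sum_{(b)} \hat b_{jk}^2$ rather than the low-bias estimator $\hat c_{jk}$ of (2.4), and it does not average over block translations.

```python
import numpy as np

def block_threshold(b_hat, l, delta, shift=0):
    """Zero out each block of l consecutive coefficients whose mean
    squared coefficient is at most delta^2; `shift` translates the
    block boundaries (the paper's integer m)."""
    out = np.zeros_like(b_hat)
    k = np.arange(len(b_hat))
    blocks = (k - shift) // l          # block index of each coefficient
    for b in np.unique(blocks):
        idx = blocks == b
        B_jb = np.mean(b_hat[idx] ** 2)    # naive analogue of B_jb
        if B_jb > delta ** 2:
            out[idx] = b_hat[idx]          # keep the whole block
    return out

# Example: a block containing one strong coefficient survives whole,
# while a block of purely noise-level coefficients is removed whole.
b_hat = np.array([0.01, -0.02, 0.01, 0.6, 0.02, 0.01, -0.01, 0.02])
print(block_threshold(b_hat, l=4, delta=0.1))
```

Note how the small coefficients adjacent to the strong one are retained: this is the mechanism by which block thresholding reduces bias near irregularities, at the price of occasionally keeping noise that happens to share a block with signal.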