Minimax Optimal Level Set Estimation

Rebecca M. Willett and Robert D. Nowak
University of Wisconsin, Madison, WI, USA

ABSTRACT

Tree-structured partitions provide a natural framework for rapid and accurate extraction of the level sets of a multivariate function $f$ from noisy data. In general, a level set $S$ is the set on which $f$ exceeds some critical value (e.g. $S = \{x : f(x) \ge \gamma\}$). Boundaries of such sets typically constitute manifolds embedded in the high-dimensional observation space. The identification of these boundaries is an important theoretical problem with applications to digital elevation maps, medical imaging, and pattern recognition. Because set identification is intrinsically simpler than function denoising or estimation, explicit set extraction methods can achieve higher accuracy than more indirect approaches (such as extracting a set of interest from an estimate of the function). The trees underlying our method are constructed by minimizing a complexity-regularized data-fitting term over a family of dyadic partitions. Using this framework, problems such as the simultaneous estimation of multiple (non-intersecting) level lines of a function can be readily solved from both a theoretical and a practical perspective. Our method automatically adapts to spatially varying regularity of both the boundary of the level set and the function underlying the data. Level set extraction using multiresolution trees can be implemented in near linear time and specifically aims to minimize an error metric sensitive to both the error in the location of the level set and the distance of the function from the critical level. Translation-invariant "voting-over-shifts" set estimates can also be computed rapidly using an algorithm based on the undecimated wavelet transform.
1. LEVEL SET ESTIMATION

In this paper, we address the problem of recovering level sets from noisy multi-dimensional data. Set estimation may be more desirable than complete signal reconstruction for a variety of reasons. In many applications, the locations of boundaries or level sets are of principal importance, while the amplitude of the function away from any boundary is secondary, if not irrelevant. For example, doctors may wish to identify regions where uptake of a pharmaceutical exceeds some critical level, or mapping agencies may want to extract networks of roads from satellite images. As described later in this paper, however, plug-in estimators based on signal estimates with very low mean amplitude errors over the entire function may not be suitable for accurate level set extraction.

Because set estimation is intrinsically simpler than function estimation, explicit level set extraction methods can potentially achieve higher accuracy than more indirect approaches (such as estimating a set from an estimate of the function). This distinction manifests itself in the selection of tree-pruning criteria. Tree-based signal estimation frameworks (e.g. [1; 2; 3]) determine whether nodes should be pruned from the tree based on the size of the tree. Recent work in tree-based methods for binary classification [4], however, has revealed that optimal pruning decisions must exhibit a spatial adaptivity not possible with traditional, tree-size-based criteria. This difference implies that one critical aspect of tree-based set estimation must be the selection of the tree-pruning rule most appropriate for set estimation.
1.1. Problem formulation and unique challenges

The main focus of this paper will be what is typically called regression level set estimation. The framework presented here can also be used for density level set estimation, as described later in the paper. The regression level set recovery problem can be described as follows. Let $f : [0,1]^d \to [C_\ell, C_u]$ be a function of bounded amplitude, and let $S \equiv \{x \in [0,1]^d : f(x) \ge \gamma\}$ for some $\gamma \in [C_\ell, C_u]$. Given $n$ noisy observations $(x_i, y_i) \in [0,1]^d \times [C_\ell, C_u]$, $i = 1, \ldots, n$, where $E[y_i] = f(x_i)$, our goal is to learn an estimate of $S$. Let $P_X$ be the probability measure for $X$, which determines the marginal distribution of observations on the support of $f$.

One approach to this problem might be to first compute an estimate of $f$, denoted $\hat{f}$, and then let $\hat{S} = \{x \in [0,1]^d : \hat{f}(x) \ge \gamma\}$; this is frequently called a "plug-in" estimator. Plug-in estimators can be effective under stringent global smoothness assumptions placed on $f$, but such restrictions limit the applicability of level set estimation. To see why plug-in approaches may not be as effective as a more direct approach, consider the following sketch. For the moment, assume $f$ is a Lipschitz function such as the one displayed in Figure 1(a); i.e. $|f(x) - f(y)| \le L\|x - y\|$ for some constant $L$. Further assume that the boundary of $S$ (denoted $\partial S$) is in a box-counting class (for example, as displayed in Figure 1(b)); that is, if $[0,1]^d$ were divided into $m^d$ uniform hypercubes (each with volume $m^{-d}$ and sidelength $1/m$), then there exists some constant $C > 0$, independent of $m$, such that $\partial S$ would pass through at most $Cm^{d-1}$ of the hypercubes.

Next, consider the partition-based estimators described in papers such as [1; 5; 6], where we form a partition adapted to the observations and fit a constant to each cell in the partition. As described in [5; 6], this approach has many advantages in the context of signal estimation, including near minimax optimal rates of convergence for Lipschitz functions. However, these estimators attempt to minimize a global error metric and are unable to hone in on the boundary of a level set when it does not correspond to a boundary or edge in the function itself. In particular, to effectively balance fidelity to the observations against the complexity of the estimate, the partition-based method will partition the function into roughly $m^d$ cells and compute the sample average in each cell. In fact, since $f$ does not contain any edges or boundaries in this example, all $m^d$ cells will be roughly the same size; for example, see Figure 1(c). The optimal number of cells minimizes the upper bound on the sum of the squared $L^2$ approximation error, $O(m^{-2})$, and the estimation error, $O(m^d/n)$; minimizing this sum, we find that $m^d$ should scale like $O(n^{d/(d+2)})$. Of these $m^d$ cells in the estimate, $O(m^{d-1})$ will intersect $\partial S$ because of the box-counting condition. This means that the volume of the region where we cannot accurately approximate $\partial S$ is $m^{d-1} \cdot m^{-d} = m^{-1} = n^{-1/(d+2)}$; for example, see Figure 1(d). As we will see later, accuracy this low will not facilitate minimax optimal error decay rates for this problem. In contrast, we develop a methodology which partitions $f$ in a way that hones in on the boundary of the level set, as displayed in Figure 1(e), to obtain more accurate set estimates, as displayed in Figure 1(f).
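To spell out the rate calculation (a back-of-the-envelope restatement of the argument above, using only the quantities already defined): balancing the squared approximation error against the estimation error gives

$$ m^{-2} \asymp \frac{m^d}{n} \quad \Longrightarrow \quad m \asymp n^{1/(d+2)} \quad \Longrightarrow \quad m^d \asymp n^{d/(d+2)}, $$

and since $O(m^{d-1})$ of these cells straddle $\partial S$, the volume over which the plug-in estimate cannot resolve the boundary is

$$ m^{d-1} \cdot m^{-d} = m^{-1} \asymp n^{-1/(d+2)}. $$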
Figure 1. Unbalanced partitions for level set estimation. (a) Smooth function. (b) Level set of the function. (c) Partition induced on the function by denoising. (d) Thresholding the denoised function results in a large error. (e) Partition induced on the function by an explicit set estimation method. (f) Direct set estimation can result in a smaller error.
An alternative approach might consist of thresholding all the observations ($z_i = I_{\{y_i \ge \gamma\}}$) and then applying a binary classification method to the result. However, this approach immediately disregards important information about the distance of each observation from $\gamma$, the level of interest. In a sense, this distance is an indication of our confidence that the observation is in $S$.

That said, the problem of level set estimation is very closely related to classification. In particular, trees have been studied very carefully in the context of classification [4], and we will be using a number of analysis techniques and results from that domain for this problem. Related work was conducted by Mammen and Tsybakov in [7], but their work focused on the estimation of a boundary between two constant-valued regions, an edge detection problem which is a special case of the more general level set estimation problem presented in this paper. The advantage of the regression level set estimation method proposed in this paper is that it is capable of utilizing the additional information available from non-binary observations. Cavalier [8] examined a problem similar to the one discussed here, building on the work of Tsybakov [9] on density level set estimation using piecewise polynomials. The estimators proposed in these works, however, are not computable and place stronger assumptions on $S$ (specifically, it must be "star-shaped" about the origin).
1.2. Error metrics for set estimation

Careful selection of an error metric is the first step in designing a level set estimator. While the goal of signal estimation is typically to minimize the mean squared error between the true signal and the estimate, in level set estimation it is more appropriate to minimize the symmetric difference between the true set of interest and its estimate, weighted by the severity of the error over the symmetric difference. An appropriate error metric can be designed as follows. For a given set $F$, we define the loss function to be

$$ e_F(x) = \frac{\gamma - f(x)}{2(C_u - C_\ell)} \left[ I_{\{x \in F\}} - I_{\{x \in F^c\}} \right] + \frac{1}{2}, \quad (1) $$

where $I$ is the indicator function. The normalization ensures that $e_F \in [0,1]$ and does not impact the result. From here, we define the risk function as

$$ R(F) = \int e_F(x)\, dP_X; \quad (2) $$

this measures the distance between the signal, $f$, and the threshold, $\gamma$, and weights the distance at each location $x$ by plus or minus one according to whether $x \in F$. Thus regions where $x \in F$ but $f(x) < \gamma$ (that is, $x \in S^c$) will contribute positively to the risk function. Given the risk function, we can define the "excess risk" as $R(F) - R(S)$, which measures the difference between the risk of an estimate and the risk of the true level set, $S$. Using the definition of the risk, the excess risk can be written as

$$ R(F) - R(S) = \frac{1}{C_u - C_\ell} \int_{\Delta(S,F)} |\gamma - f(x)|\, dP_X, \quad (3) $$

where $\Delta(S,F) \equiv \{x : x \in (S \cap F^c) \cup (F \cap S^c)\}$ denotes the symmetric difference and $S^c$ is the complement of $S$. The excess risk gives a weighted measure of the symmetric difference between $S$ and $F$, as desired. Note that minimizing (3) is equivalent to minimizing $R(F)$, since $R(S)$ is a constant.

The effect of such a metric is demonstrated in Figure 2. On the left is drawn a contour outlining the true level set $S$. The center and rightmost figures show the boundaries of two different candidate level set estimates. There is only a small symmetric difference between the set in the center image and the truth, but the distance of the function from the level $\gamma$ is large in this region. In contrast, there is a large symmetric difference between the set in the rightmost image and the truth, but the distance of the function from the level $\gamma$ is relatively small in that region.

An additional advantage of the proposed metric is that it is simple to define an empirical loss for a candidate level set estimate $F$ as

$$ \hat{e}_F(x_i) = \frac{\gamma - y_i}{2(C_u - C_\ell)} \left[ I_{\{x_i \in F\}} - I_{\{x_i \in F^c\}} \right] + \frac{1}{2}, \quad (4) $$

resulting in the empirical risk function

$$ \hat{R}_n(F) = \frac{1}{n} \sum_{i=1}^n \hat{e}_F(x_i), \quad (5) $$

which is both computable and constructed so that $E[\hat{R}_n(F) - R(S)] = R(F) - R(S)$. This metric is distinctly different from the global $L^p$ norms typically encountered in signal estimation.
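To make (4) and (5) concrete, the following is a minimal sketch (ours, not from the paper; the function and argument names are hypothetical) of how the empirical risk of a candidate set could be evaluated:

    import numpy as np

    def empirical_risk(y, in_F, gamma, C_lo, C_hi):
        """Empirical risk (5) of a candidate level set F, assembled from
        the per-observation losses (4).

        y          : (n,) array of noisy responses with E[y_i] = f(x_i)
        in_F       : boolean (n,) array; True where x_i is estimated to be in F
        gamma      : critical level
        C_lo, C_hi : amplitude bounds C_ell and C_u
        """
        # I_{x_i in F} - I_{x_i in F^c} is +1 on F and -1 on F^c.
        sign = np.where(in_F, 1.0, -1.0)
        losses = (gamma - y) / (2.0 * (C_hi - C_lo)) * sign + 0.5  # eq. (4), in [0, 1]
        return losses.mean()                                       # eq. (5)

By construction $E[\hat{R}_n(F)] = R(F)$, so any candidate set representable on the data (a thresholded plug-in estimate, a pruned tree, and so on) can be scored this way.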
Figure 2. Behavior of the level set error metric. (a) Function $f$ and true level set $S$. (b) Level set estimate (solid line) with a small symmetric difference but large errors within the symmetric difference region. (c) Second level set estimate (solid line) with a large symmetric difference but small errors within the symmetric difference region. Despite these differences, the two set estimates could have the same weighted symmetric difference risk.
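A toy one-dimensional example (our own construction, not the paper's) makes the point of Figure 2 numerically: a small symmetric difference where $f$ is steep can cost exactly as much as a symmetric difference twice as large where $f$ is shallow:

    import numpy as np

    # Piecewise-linear "elevation" with a shallow slope left of 0.5 and a
    # steep slope right of 0.5; the gamma-level set is S = [0.5, 1].
    def f(x):
        return 0.5 + np.where(x >= 0.5, 2.0, 0.5) * (x - 0.5)

    gamma = 0.5
    C_lo, C_hi = 0.25, 1.5            # f(0) and f(1): the amplitude bounds
    x = np.linspace(0.0, 1.0, 10**6)  # fine grid standing in for P_X = uniform
    in_S = f(x) >= gamma

    def excess_risk(in_F):
        """Numerical version of (3): average of |gamma - f| over Delta(S, F)."""
        return np.mean(np.abs(gamma - f(x)) * (in_F != in_S)) / (C_hi - C_lo)

    # Boundary pushed right by 0.1 (steep side) vs. left by 0.2 (shallow side):
    print(excess_risk(x >= 0.6))  # ~0.008
    print(excess_risk(x >= 0.3))  # ~0.008 as well, despite twice the symmetric difference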
2. LEVEL SET ESTIMATION PROCEDURE

We propose to estimate the level set of a function from noisy observations by using a tree-pruning method akin to CART [1] or dyadic decision trees [4]. In the two-dimensional problems highlighted earlier, a quadtree structure was effective. In general, $2^d$-ary trees are appropriate in regimes where the number of observations is much larger than the dimensionality of the observations; for large $d$, however, growing a $2^d$-ary tree by even one level results in $2^d$ cells, which leads to several computational and statistical problems. To avoid this problem, we will focus on binary trees, where each internal node in the tree is assigned a label corresponding to the coordinate direction in which the dyadic split takes place.

Let $\pi(T)$ denote the partition induced on $[0,1]^d$ by the binary tree $T$. That is, if $T$ has $k$ leaf nodes ($|T| = k$), then $\pi(T) = \{A_1, A_2, \ldots, A_k\}$, where each $A_i \in \pi(T)$ is a hyperrectangular subset of $[0,1]^d$ determined by the sequence of coordinate directions assigned to the ancestors of the corresponding leaf node $v_i$ in $T$. Note that $\bigcup_{i=1}^k A_i = [0,1]^d$ and $A_i \cap A_j = \emptyset$ for $i \ne j$. A zero or one is assigned to each leaf node of $T$ (equivalently, to each $A \in \pi(T)$) to indicate whether that cell of the partition is estimated to be in $\hat{S}$. A sample tree and corresponding partition is displayed in Figure 3. We will consider all binary trees in which each coordinate direction is dyadically split at most $J$ times, for some nonnegative integer $J$; setting $M = 2^J$, we denote this collection $\mathcal{T}_M$. Note that the sidelength of each cell must then necessarily be no less than $2^{-J}$.
Figure 3. Sample partition and corresponding tree.
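For illustration (our own sketch; none of these names come from the paper), such a recursive dyadic partition can be represented by a simple node structure in which internal nodes record their split coordinate and leaves carry the 0/1 set label:

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Node:
        """Node of a dyadic binary tree over [0, 1]^d.  Internal nodes halve
        their cell along coordinate `split_dim`; leaves carry the 0/1 label
        indicating membership in the set estimate."""
        cell: List[Tuple[float, float]]   # per-coordinate interval of this cell
        split_dim: Optional[int] = None   # None marks a leaf
        left: Optional["Node"] = None
        right: Optional["Node"] = None
        label: int = 0                    # used only at leaves

    def split(node: Node, dim: int) -> None:
        """Dyadically split a leaf cell in half along coordinate `dim`."""
        lo, hi = node.cell[dim]
        mid = 0.5 * (lo + hi)
        left_cell, right_cell = list(node.cell), list(node.cell)
        left_cell[dim], right_cell[dim] = (lo, mid), (mid, hi)
        node.split_dim = dim
        node.left, node.right = Node(left_cell), Node(right_cell)

    def leaves(node: Node) -> List[Node]:
        """The partition pi(T): the list of leaf cells of the tree."""
        if node.split_dim is None:
            return [node]
        return leaves(node.left) + leaves(node.right)

For example, root = Node([(0.0, 1.0), (0.0, 1.0)]); split(root, 0) produces the two-cell partition obtained by one vertical split of the unit square.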
Our goal is to find a tree based on noisy observations, $\hat{T}_n$, so that $R(\hat{T}_n)$ is as small as possible. We will choose an estimate according to

$$ \hat{T}_n = \arg\min_{T \in \mathcal{T}_M} \left\{ \hat{R}_n(T) + \Phi_n(T) \right\} \quad (6) $$

and explore how the choice of the penalty term $\Phi_n(T)$ impacts $R(\hat{T}_n)$.
2.1. A square root penalty

One way to determine how well $\hat{R}_n(T)$ predicts $R(T)$ is through Hoeffding's inequality, resulting in the following lemma:

LEMMA 2.1. Let $\delta \in [0,1]$. With probability at least $1 - \delta$,

$$ R(T) < \hat{R}_n(T) + \sqrt{\frac{\log(1/\delta) + |T|(\log_2 d + 1)\log 2}{2n}} \quad \forall\, T \in \mathcal{T}_M. $$

This lemma is proved in Appendix A.1. Lemma 2.1 can be used (in a manner similar to that in the following subsection) to define

$$ \Phi_n''(T) \equiv \sqrt{\frac{\log(1/\delta) + |T|(\log_2 d + 1)\log 2}{2n}}, $$

compute the associated expected excess risk, and derive rates of convergence. This process yields an expected excess risk decay rate which is slower than the minimax rate, which implies that this approach based on Hoeffding's inequality is suboptimal. Because of the suboptimality of the result and the similarity of the derivation to that of the optimal rate, we do not present the details here.
2.2. A spatially adaptive penalty

To avoid the suboptimal rates described in the preceding subsection, we must use a spatially adaptive penalty similar to that described in [4]. First, note that it is possible to express the difference between the true risk and the empirical risk as

$$ R(T) - \hat{R}_n(T) = \sum_{A \in \pi(T)} \left[ R(A) - \hat{R}_n(A) \right], $$

where

$$ R(A) = \int e_A(x)\, I_{\{x \in A\}}\, dP_X, $$

$$ e_A(x) = \frac{\gamma - f(x)}{2(C_u - C_\ell)} \left( I_{\{\mathrm{label}(A)=1\}} - I_{\{\mathrm{label}(A)=0\}} \right) + \frac{1}{2}, $$

$$ \hat{R}_n(A) = \frac{1}{n} \sum_{i=1}^n \hat{e}_A(x_i)\, I_{\{x_i \in A\}}, $$

$$ \hat{e}_A(x_i) = \frac{\gamma - y_i}{2(C_u - C_\ell)} \left( I_{\{\mathrm{label}(A)=1\}} - I_{\{\mathrm{label}(A)=0\}} \right) + \frac{1}{2}. $$

These definitions are simply local counterparts of the risk and loss expressions defined in equations (1), (2), (4), and (5). We can now bound the difference between $R(A)$ and $\hat{R}_n(A)$ and sum over all $A \in \pi(T)$ to obtain a tighter bound leading to near-optimal error decay rates. In particular, we can use the relative form of Hoeffding's inequality [10] to show:

LEMMA 2.2. Let $\delta_A \in [0,1]$. With probability at least $1 - \delta_A$, for any $T \in \mathcal{T}_M$ and any $A \in \pi(T)$,

$$ R(A) - \hat{R}_n(A) < \sqrt{\frac{2 R(A) \log(1/\delta_A)}{n}}. $$

This is proved in Appendix A.2. Now let $\mathcal{A}$ be the collection of all $A$ such that $A \in \pi(T)$ for some $T \in \mathcal{T}_M$. We wish to show that a similar bound holds for all $A \in \mathcal{A}$ simultaneously. To do so, let $\delta_A = \delta 2^{-([\![A]\!]+1)}$, where $[\![A]\!]$ denotes the number of bits required to encode the position of $A$. Specifically, consider the prefix code proposed in [4] for $A \in \pi(T)$. If $A$ is at level $j$ in the binary tree $T$, then $j+1$ bits must be used to describe the depth of $A$, $j$ bits must be used to describe whether each branch is a left or right branch, and $j \log_2 d$ bits must be used to describe the coordinate direction of each of the $j$ branches. This results in a total of $j(\log_2 d + 2) + 1$ bits, and this quantity is denoted $[\![A]\!]$. From here we can derive the following lemma:
LEMMA 2.3. Let

$$ \Phi_n'(T) \equiv \sum_{A \in \pi(T)} \sqrt{\frac{2 R(A) \left( \log(2/\delta) + [\![A]\!] \log 2 \right)}{n}}. $$

With probability at least $1 - \delta$,

$$ R(T) \le \hat{R}_n(T) + \Phi_n'(T) \quad \forall\, T \in \mathcal{T}_M. $$

This is proved in Appendix A.2. This lemma implies that if we were to select a tree which minimized the sum $\hat{R}_n(T) + \Phi_n'(T)$, then our estimator would have low risk with high probability. However, since $R(A)$ must be known to compute $\Phi_n'(T)$, such a strategy is not practically feasible. We can, however, bound $R(A)$ by a computable quantity as follows: recall that $R(A) = \int_A e_A(x)\, dP_X$ and that $e_A \in [0,1]$. Thus we can bound $R(A)$ by $p_A \equiv \int_A dP_X$. This is useful because of the lemma below, which was proved in [4]. First, define the empirical estimate of $p_A$ to be

$$ \hat{p}_A \equiv \frac{1}{n} \sum_{i=1}^n I_{\{x_i \in A\}} $$

and, for $\delta \in [0,1]$,

$$ \hat{p}_A'(\delta) \equiv 4 \max\left\{ \hat{p}_A,\; \frac{[\![A]\!] \log 2 + \log(1/\delta)}{n} \right\} $$

and

$$ p_A'(\delta) \equiv 4 \max\left\{ p_A,\; \frac{[\![A]\!] \log 2 + \log(1/\delta)}{2n} \right\}. $$

LEMMA 2.4. [4] Let $\delta \in [0,1]$. Then with probability at least $1 - \delta$, $p_A \le \hat{p}_A'(\delta)$ for all $A \in \mathcal{A}$, and with probability at least $1 - \delta$, $\hat{p}_A \le p_A'(\delta)$ for all $A \in \mathcal{A}$.

From here we can define the penalty term to be

$$ \Phi_n(T) = \sum_{A \in \pi(T)} \sqrt{\frac{2 \hat{p}_A'(\delta) \left( \log(2/\delta) + [\![A]\!] \log 2 \right)}{n}}. \quad (7) $$
Intuitively, this penalty is designed to favor unbalanced trees which hone in on the location of the manifold defining the boundary of the level set; the same reasoning explains why spatially adaptive penalties are so effective for binary classification problems using dyadic decision trees [4]. To see this, note that $[\![A]\!] \asymp j$, while $\hat{p}_A'(\delta) \asymp 2^{-j}$. This implies that deep nodes contribute less to $\Phi_n(T)$ than shallow nodes, and so, for two trees with the same number of leaves, $\Phi_n(T)$ will be smaller for the more unbalanced tree, similar to the partition displayed in Figure 1(e). As shown in the following section, the estimator defined by (6) and (7) is nearly minimax optimal.

Furthermore, the estimator is rapidly computable; fast methods for binary classification using dyadic decision trees are described in [4; 11] and can easily be extended to the level set estimation problem. Specifically, let $L = \log_2 M$ be the maximum number of dyadic refinements along any coordinate used to form a tree $T$. Then $\hat{T}_n$ can be computed in $O(ndL^d \log(nL^d))$ operations using a dynamic programming algorithm. To counteract grid effects which result from the use of binary trees, it is possible to perform a "voting over shifts" routine similar to the "averaging over shifts" or "cycle spinning" methods commonly used in signal estimation. Rather than averaging the results of shifted estimators, we determine whether each of the $M$ initial cells is more often in $\hat{T}_n$ or $\hat{T}_n^c$ and make the corresponding set assignment. A naive implementation of this approach would increase the complexity by a factor of $n$, but clever algorithms such as the one described in [12] take advantage of redundant tree-pruning decisions and thus only increase the complexity by a factor of $\log n$.
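A minimal sketch (ours; the paper does not give code, and summarizing each leaf by its depth and empirical mass is our assumption) of how the penalty (7) could be evaluated for a given tree:

    import numpy as np

    def code_length(j, d):
        """Prefix codelength [[A]] (in bits) of a cell at level j of a
        d-dimensional dyadic binary tree: j (log2 d + 2) + 1."""
        return j * (np.log2(d) + 2.0) + 1.0

    def penalty(leaf_depths, leaf_masses, n, d, delta):
        """Spatially adaptive penalty Phi_n(T) of equation (7).

        leaf_depths : level j of each leaf cell A in pi(T)
        leaf_masses : empirical mass p-hat_A of each leaf (fraction of the
                      n observations that fall in A)
        """
        phi = 0.0
        for j, p_hat in zip(leaf_depths, leaf_masses):
            bits = code_length(j, d)
            # p-hat'_A(delta) = 4 max{ p-hat_A, ([[A]] log 2 + log(1/delta)) / n }
            p_prime = 4.0 * max(p_hat, (bits * np.log(2.0) + np.log(1.0 / delta)) / n)
            phi += np.sqrt(2.0 * p_prime * (np.log(2.0 / delta) + bits * np.log(2.0)) / n)
        return phi

A deep leaf (large $j$) has small empirical mass, so its summand shrinks with depth even though its codelength grows; summing over leaves therefore charges an unbalanced, boundary-hugging tree less than a balanced tree with the same number of leaves.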
3. PERFORMANCE ANALYSIS

Many of the key points in the following theoretical analysis were derived using the error bounding techniques developed by Scott and Nowak [4] in the context of binary classification. The proposed level set estimation method hinges on the following main result:

THEOREM 3.1. Let $\Phi_n(T)$ be defined as in (7). Then with probability at least $1 - 2\delta$,

$$ R(T) \le \hat{R}_n(T) + \Phi_n(T) \quad (8) $$

for all $T \in \mathcal{T}_M$. This follows trivially from Lemmas 2.3 and 2.4. Not only does this framework give us a principled way to choose a good level set estimator, but it also allows us to bound the expected risk for a collection of $n$ observations. In particular, we have the following theorem:

THEOREM 3.2. Let $\hat{T}_n$ be as in (6) with $\Phi_n(T)$ as in (7) and $\delta = 1/n$. With probability at least $1 - 4/n$,

$$ R(\hat{T}_n) - R(S) \le \min_{T \in \mathcal{T}_M} \left\{ R(T) - R(S) + 2\Phi_n(T) \right\}. $$
As a consequence,

$$ E\left[ R(\hat{T}_n) - R(S) \right] \le \min_{T \in \mathcal{T}_M} \left\{ R(T) - R(S) + 2\Phi_n(T) \right\} + \frac{4}{n}. $$

This can be proved by closely following the proof of Theorem 3 in [4] and using the above lemmas and theorems. The bound on the expected error in Theorem 3.2 allows us to analyze the proposed method in terms of rates of error convergence. In particular, because this problem has been posed as a generalization of the binary classification problem, we may draw extensively from the results of [4] to highlight several advantageous features of the error convergence of the proposed method. In this analysis, for sequences $a_n$ and $b_n$, let the notation $a_n \lesssim b_n$ imply that there exists some $C > 0$ such that $a_n \le C b_n$ for all $n$, and let $a_n \asymp b_n$ mean that both $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We will examine rates of convergence for two classes of functions $f$ and distributions $P_X$ to demonstrate that the proposed method both (a) adapts to the regularity of $f$ in the vicinity of $\partial S$ and (b) adapts to the regularity of the curve $\partial S$. Define $D_{BOX}(\kappa, \gamma, c_0, c_1, c_2)$ to be the set of all $(f, P_X)$ pairs such that

1. $P_X(A) \le c_0 \lambda(A)$ for all measurable $A \subseteq [0,1]^d$;

2. for $S = \{x \in [0,1]^d : f(x) \ge \gamma\}$, $\partial S$ is in a box-counting class; i.e. if $[0,1]^d$ is partitioned into $m^d$ equally sized cells, each with sidelength $1/m$ and volume $m^{-d}$, and $N_S(m)$ denotes the number of such cells intersected by the boundary of $S$, then $N_S(m) \le c_1 m^{d-1}$ for all $m$; and

3. for all dyadic $m$, there exists some $T_m' \in \mathcal{T}_m$ such that

$$ \left( R(T_m') - R(S) \right)^{1/\kappa} \le c_2 P_X(\Delta(T_m', S)). $$
Next note that the first and second conditions together mean $P_X(\Delta(T_m', S)) \le c_0 c_1 / m$, so that the third condition implies $\left( R(T_m') - R(S) \right)^{1/\kappa} \le c_0 c_1 c_2 / m$. The role of condition III is to provide a means of exploring how the estimator behaves for various amplitudes of $f$ in the vicinity of $\partial S$. Note that

$$ R(T_m') - R(S) = \frac{1}{C_u - C_\ell} \int_{\Delta(T_m', S)} |\gamma - f(x)|\, dP_X \le \int_{\Delta(T_m', S)} dP_X = P_X(\Delta(T_m', S)). $$
From here it is clear that the third condition holds trivially for κ = 1. However, for large κ, the third condition will only hold when f is close to γ in the vicinity of ∂S. In this case, it will be very difficult to estimate S accurately, but the estimate
could be wrong over a large volume and still only incur a small error, resulting in faster rates of convergence. In contrast, small $\kappa$ allows a jump in $f$ near $\partial S$ and so corresponds to a relatively "easy" problem; however, an estimate which is incorrect on even a small volume will incur a large error, resulting in slower rates of convergence. Condition III allows us to examine the performance of the proposed estimator under these different conditions. This definition, when combined with Theorem 3 and Lemma 8 of [4], yields the following result:

THEOREM 3.3. Choose $M$ such that $M \asymp (n/\log n)^{1/d}$. For $d \ge 2$ we have

$$ \sup_{D_{BOX}(\kappa,\gamma,c_0,c_1,c_2)} E\left[ R(\hat{T}_n) - R(S) \right] \lesssim \min_m \left\{ m^{-\kappa} + m^{d/2-1} \sqrt{\frac{\log n}{n}} \right\} \asymp \left( \frac{\log n}{n} \right)^{\kappa/(d+2\kappa-2)}. \quad (9) $$
This may be proved using an argument very similar to the one used to prove Theorem 6 of [4]. In particular, condition III above leads to the $m^{-\kappa}$ term in (9), and Lemma 8 of [4] leads to the second term in (9). Careful examination of the minimax lower bounds derived in the context of binary classification (Theorem 5, [4]) indicates that the same lower bounds should hold for the level set estimation problem posed here. Specifically, recall that the excess risk in this setting is

$$ R(T) - R(S) = \frac{1}{C_u - C_\ell} \int_{\Delta(S,T)} |f(x) - \gamma|\, dP_X. $$

The excess risk in the classification setting has the same form, except that $\gamma = 1/2$, $f \ge 0$, and $\int f(x)\, dP_X = 1$. The derivation of the lower bounds presented in [4] should carry over to the regression level set estimation context with only minor modifications, which would mean that the rate in Theorem 3.3 is within a log factor of the minimax lower bound and hence near optimal.

It is also possible to show that the proposed method adapts to the regularity of $\partial S$. In particular, define the boundary fragment class $D_{BF}(\alpha, \gamma, c_0, c_1)$ for $\alpha < 1$ to be the set of all $(f, P_X)$ such that

1. $P_X(A) \le c_0 \lambda(A)$ for all measurable $A \subseteq [0,1]^d$; and

2. one coordinate of $\partial S$ is a function of the others, where the function has Hölder smoothness $\alpha < 1$ and constant $c_1$.

The analysis here would also apply to compositions of members of $D_{BF}$, even if the "orientation" (which coordinate is a function of the others) varies from member to member. The orientation does not need to be known. Following the argument of Theorem 9 in [4], it is straightforward to prove the following theorem:

THEOREM 3.4. Choose $M$ such that $M \asymp (n/\log n)^{1/(d-1)}$. For $d \ge 2$ and $\alpha < 1$, we have

$$ \sup_{D_{BF}(\alpha,\gamma,c_0,c_1)} E\left[ R(\hat{T}_n) - R(S) \right] \lesssim \min_m \left\{ m^{-\alpha} + m^{(d-\alpha-1)/2} \sqrt{\frac{\log n}{n}} \right\} \asymp \left( \frac{\log n}{n} \right)^{\alpha/(\alpha+d-1)}. $$

As before, an examination of the lower bounds derived in [4] indicates that this rate is within a log factor of the minimax optimal rate. Note that $\alpha < 1$ implies fractal-like boundaries, which are important for a variety of scientific applications and similar to the ones displayed in the simulation section later in this paper.
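To verify the exponent in (9) (a short check of the balancing step, not part of the original argument): setting $m^{-\kappa} = m^{d/2-1}\sqrt{\log n / n}$ gives

$$ m^{\kappa + d/2 - 1} = \sqrt{\frac{n}{\log n}} \quad \Longrightarrow \quad m \asymp \left( \frac{n}{\log n} \right)^{1/(d + 2\kappa - 2)}, $$

so the minimized bound scales as

$$ m^{-\kappa} \asymp \left( \frac{\log n}{n} \right)^{\kappa/(d + 2\kappa - 2)}. $$

The same computation with $\kappa \to \alpha$ and $m^{d/2-1} \to m^{(d-\alpha-1)/2}$ yields the exponent $\alpha/(\alpha + d - 1)$ in Theorem 3.4.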
4. EXTENSIONS

4.1. Observations on the grid

The above analysis assumes that the locations of the observations are random and distributed according to $P_X$. In many domains, however, the locations of the observations are dictated by the measurement device and lie on a grid (e.g. pixels in an image or voxels in a volume). In this case, the analysis above must be modified slightly, as described in this subsection.
Without loss of generality, assume $n = 2^{dk}$ for some $k$, so that the $n$ points can be distributed uniformly on $[0,1]^d$. Specifically, divide $[0,1]^d$ into $n$ equally sized non-intersecting hypercubes, $A_1, \ldots, A_n$, where the volume of $A_i$ is $1/n$ and its sidelength is $n^{-1/d}$. Let

$$ f_i \equiv \frac{1}{|A_i|} \int_{A_i} f(x)\, dx $$

and $y_i = f_i + \epsilon_i$. The empirical loss and risk metrics remain the same as they were in the random observation location case, and the terms $e_T(x_i)$ remain independent and lie within the interval $[0,1]$. We redefine the risk in this case to be $R(T) = \int e_T(x)\, dx$, and the quantities $R(A)$ and $\hat{R}_n(A)$ are defined accordingly. It now becomes necessary to derive a corollary to Lemma 2.2 and bound the difference $R(A) - \hat{R}_n(A)$. Using this setup, we have

$$ R(T) - \hat{R}_n(T) = \left( R(T) - \frac{1}{n}\sum_{i=1}^n e_T(x_i) \right) + \left( \frac{1}{n}\sum_{i=1}^n e_T(x_i) - \hat{R}_n(T) \right) $$

$$ = \left( \int e_T(x)\, dx - \frac{1}{n}\sum_{i=1}^n e_T(x_i) \right) + \left( \frac{1}{n}\sum_{i=1}^n e_T(x_i) - \hat{R}_n(T) \right) $$

$$ = \sum_{i=1}^n \left[ \int_{A_i} \frac{\gamma - f(x)}{2(C_u - C_\ell)} \left( I_{\{A_i \in T\}} - I_{\{A_i \in T^c\}} \right) dx - \frac{\gamma - f_i}{2n(C_u - C_\ell)} \left( I_{\{A_i \in T\}} - I_{\{A_i \in T^c\}} \right) \right] + \left( \frac{1}{n}\sum_{i=1}^n e_T(x_i) - \hat{R}_n(T) \right) $$

$$ = \sum_{i=1}^n \int_{A_i} \frac{f_i - f(x)}{2(C_u - C_\ell)} \left( I_{\{A_i \in T\}} - I_{\{A_i \in T^c\}} \right) dx + \left( \frac{1}{n}\sum_{i=1}^n e_T(x_i) - \hat{R}_n(T) \right) $$

$$ = \frac{1}{n}\sum_{i=1}^n e_T(x_i) - \hat{R}_n(T), $$

where the remaining sum vanishes because $\int_{A_i} f(x)\, dx = f_i/n$ by the definition of $f_i$. From here it is sufficient to apply Hoeffding's inequality and the analysis above to the final expression to arrive at the same upper bounds on the excess risk as were derived in the case of random observation locations.
4.2. Multiple simultaneous level sets

In some applications one may wish to extract a collection of level sets, e.g. for the generation of a contour plot. This has proven to be very useful in the context of digital elevation map storage and retrieval [13]. Simultaneous (rather than sequential) extraction of multiple level sets is important because it ensures that the estimated level sets are nested. Consider the case in which we are interested in a collection of critical levels, $\{\gamma_k\}_{k=1}^K$, where $C_\ell \le \gamma_1 < \gamma_2 < \cdots < \gamma_K \le C_u$. For each $\gamma_k$, let $S_k \equiv \{x \in [0,1]^d : f(x) \ge \gamma_k\}$, so that $[0,1]^d \supseteq S_1 \supseteq S_2 \supseteq \cdots \supseteq S_K$, and denote the collection of level sets as $\mathcal{S} \equiv \{S_k\}$. If $\mathcal{F} \equiv \{F_k\}$ is a candidate collection of level set estimates, then let the loss be computed as

$$ e_{\mathcal{F}}(x) = \sum_{k=1}^K \frac{\gamma_k - f(x)}{2K(C_u - C_\ell)} \left[ I_{\{x \in F_k\}} - I_{\{x \in F_k^c\}} \right] + \frac{1}{2}. $$

This metric is bounded ($e_{\mathcal{F}}(x) \in [0,1]$) and admits an easily computable empirical estimate. Furthermore, mass is added to the loss for each incorrect level set in which $x$ is placed by $\mathcal{F}$. In this case, the excess risk is simply

$$ R_{\mathcal{F}} - R_{\mathcal{S}} \equiv \sum_k \left( R_{F_k} - R_{S_k} \right) = \frac{1}{K} \cdot \frac{1}{C_u - C_\ell} \sum_{k=1}^K \int_{\Delta(S_k, F_k)} |\gamma_k - f(x)|\, dP_X. $$

These metrics, combined with the above objective function, method, and analysis, result in a practical, principled estimator for the simultaneous extraction of multiple level sets.
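To illustrate how the corresponding empirical loss could be computed (our sketch; by analogy with (4), the observation $y_i$ stands in for $f(x_i)$, and the names are hypothetical):

    import numpy as np

    def multi_level_loss(y_i, in_F, gammas, C_lo, C_hi):
        """Empirical analogue of the multi-level loss above for a single
        observation (x_i, y_i).

        y_i    : noisy observation with E[y_i] = f(x_i)
        in_F   : boolean array of length K; in_F[k] is True iff x_i is in
                 the k-th candidate set F_k
        gammas : increasing array of critical levels gamma_1 < ... < gamma_K
        """
        K = len(gammas)
        sign = np.where(np.asarray(in_F), 1.0, -1.0)  # I_{x in F_k} - I_{x in F_k^c}
        terms = (np.asarray(gammas) - y_i) * sign / (2.0 * K * (C_hi - C_lo))
        return float(terms.sum() + 0.5)               # lies in [0, 1]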
4.3. Density level set extraction

The above framework can also be used for density level set extraction. Specifically, given $n$ iid observations of some density $f : [0,1]^d \to [C_\ell, C_u]$, where $C_\ell \ge 0$, we wish to identify the set $S \equiv \{x \in [0,1]^d : f(x) \ge \gamma\}$ as before. This problem is important in a variety of applications because of its close ties to anomaly detection. For example, researchers studying computer networks may wish to identify worms or malicious agents by identifying packets with "unusual" routing behavior. By estimating the level set of a density, we estimate patterns of behavior whose likelihoods are below some critical level.

This problem has been explored in other contexts, notably [9]. Tsybakov's approach is based on maximizing an "excess mass" metric, $M(F)$. Specifically, he defines $M(F) \equiv \int_F f(x)\, dx - \gamma \lambda(F)$, where $\lambda(F)$ is the Lebesgue measure of $F$. Note that $M(S) \ge M(F)$ for all $F$. The excess mass function and the risk function proposed here are related as follows: $R(F) = -M(F) + M(F^c)$. Tsybakov's work exhibits similar rates of error convergence for estimation of density level sets, and in fact can nearly optimally estimate much smoother level set boundaries than those considered here by fitting polynomials to the boundary. However, his problem is formulated in the plane ($d = 2$), and his approach neither exhibits the spatial adaptivity accompanying the proposed tree-based method nor admits a computationally tractable estimator.

The formulation of this problem is slightly different from the one above because we do not make noisy observations of the amplitude of $f$. However, we will see that the above method and analysis hold after a change of loss metric. Let $\lambda(F)$ denote the Lebesgue measure of $F$, and let

$$ \hat{e}_F(x_i) = \frac{\gamma \left( \lambda(F) - \lambda(F^c) \right)}{2(C_u + 1)} - \frac{I_{\{x_i \in F\}} - I_{\{x_i \in F^c\}}}{2(C_u + 1)} + \frac{1}{2}. $$

We will see that this definition of empirical loss (a) leads to a risk function equivalent to $-M(F) + M(F^c)$ (up to an affine normalization), (b) leads to an excess risk proportional to a weighted symmetric difference function as above, and (c) allows us to apply the same analysis techniques as above, since $\hat{e}_F(x_i)$ is normalized to lie in $[0,1]$. Indeed, since $\gamma \in [C_\ell, C_u]$, $\lambda(F) \le 1$, and $\lambda(F^c) = 1 - \lambda(F)$, we have $\hat{e}_F(x_i) \in [0,1]$. Also, if

$$ \hat{R}_n(F) = \frac{1}{n} \sum_{i=1}^n \hat{e}_F(x_i), $$

then

$$ R(F) = E\left[ \hat{R}_n(F) \right] = \frac{\gamma \left( \lambda(F) - \lambda(F^c) \right)}{2(C_u + 1)} - \frac{\int_F f(x)\, dx - \int_{F^c} f(x)\, dx}{2(C_u + 1)} + \frac{1}{2} = \int \frac{\gamma - f(x)}{2(C_u + 1)} \left( I_{\{x \in F\}} - I_{\{x \in F^c\}} \right) dx + \frac{1}{2}, $$

and the excess risk is

$$ R(F) - R(S) = \frac{1}{C_u + 1} \int_{\Delta(F,S)} |\gamma - f(x)|\, dx. $$

From here it is possible to apply the above analysis to derive an analogous objective function, method, and performance characterizations to those described for the alternative formulation.
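As a final illustration (again our own sketch of the displayed formula, not the paper's code; $\lambda(F)$ is assumed to be supplied, e.g. as the total volume of the leaf cells labeled one):

    def density_loss(x_in_F, lam_F, gamma, C_hi):
        """Empirical density-level-set loss for a single observation x_i,
        following the displayed formula (C_ell >= 0 is implicit).

        x_in_F : True iff x_i falls in the candidate set F
        lam_F  : Lebesgue measure lambda(F) of the candidate set
        """
        lam_Fc = 1.0 - lam_F            # lambda(F^c) on [0, 1]^d
        sign = 1.0 if x_in_F else -1.0  # I_{x in F} - I_{x in F^c}
        return (gamma * (lam_F - lam_Fc) - sign) / (2.0 * (C_hi + 1.0)) + 0.5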
5. SIMULATION RESULTS

To test the practical effectiveness of the proposed method, we simulated observations of the elevation of St. Louis, where the true elevations, normalized to lie between zero and 255, were obtained from the U.S. Geological Survey website and are displayed in Figure 4(a). Organizations such as the USGS are often interested in identifying flood plains, which shift as a result of plate tectonics and erosion. The true flood plain (the level set of interest) is displayed in Figure 4(b); note that it encompasses low-lying regions outside the river, distinguishing this problem from an edge detection problem. Our goal is to extract the flood plain from the set of noisy observations displayed in Figure 4(c). The noisy observations were obtained by adding zero-mean uniform noise with a variance of three thousand to the true image. (Note that this implies that the $x_i$'s are deterministic.) As shown in Figure 4(d), simply thresholding the observations to obtain the level set $\hat{S}_{thresh}$ is highly inaccurate in the presence of noise. In contrast, the application of the proposed method to this data results in an accurate estimate of the level set, $\hat{T}_n$, as displayed in Figure 4(e). Because the number of dimensions in this example (two) is small relative to the number of observations, we employ a quadtree structure instead of the binary tree described above. While this would not be feasible for much higher dimensions, in two dimensions it is both easy to implement and subject to all the performance characterizations derived earlier in this paper. This estimate was formed by weighting the penalty $\Phi_n(T)$ to minimize the average empirical risk and objectively highlight the difference between the proposed approach and a wavelet-based approach (which was given the same advantage, as described below). Furthermore, we employed "voting over shifts", a process analogous to averaging over shifts or using an undecimated wavelet transform. Careful thought reveals that voting over shifts can be accomplished in $O(n \log n)$ time, as described in [12].

Compare this result with the result of a more indirect approach: namely, performing wavelet denoising and then thresholding the denoised image to obtain a level set estimate, $\hat{S}_{wavelet}$, producing the image in Figure 4(f). We used undecimated Haar wavelet denoising and set the hard threshold to minimize the average empirical risk. After empirically selecting an appropriate weight on $\Phi_n(T)$ and the wavelet threshold, we observed the following mean risks over one hundred noise realizations:

$R(\hat{S}_{thresh}) - R(S) = 8{,}595$
$R(\hat{S}_{wavelet}) - R(S) = 1{,}469$
$R(\hat{T}_n) - R(S) = 1{,}101$
Roughly speaking, wavelet denoising is analogous to choosing a partition with a penalty proportional to the size of the tree or partition, as opposed to the spatially adaptive penalty employed in this method. This example demonstrates that, as expected, the spatially adaptive penalty results in a partition which hones in on the location of the boundary; the wavelet-based approach, in contrast, appears to oversmooth the boundary. Furthermore, since the level set of interest does not correspond to an edge in the image, we would not expect curvelets or wedgelets to significantly outperform wavelets in this context. We see similar performance in the case of multiple simultaneous level set extraction, as displayed in Figure 5.
6. CONCLUSIONS This paper demonstrates that tree-pruning based approaches to regression and density level set estimation result in nearly optimal estimates and can be computed rapidly to produce effective practical estimates. The introduction of a new error metric allows us to bound the weighted symmetric difference between the true level set and the estimate using the relative form of Hoeffding’s inequality. This has proven to yield much more accurate results than those achievable with a plug-in method, such as denoising followed by computing the level set from the denoised data. However, the derived penalization term must be weighted significantly in practice to yield reasonable results, suggesting that the bounds used are not tight. Removing this slack is an area of ongoing research, as is a more detailed investigation of how performance scales with noise variance.
Figure 4. Simulation results. (a) True function $f : [0,1]^2 \to [0, 255]$. (b) True level set $S = \{x \in [0,1]^2 : f(x) > 120\}$. (c) Noisy observations, $y_i \in [-95, 350]$, $i = 1, \ldots, 512^2$. (d) Level set of the observations, $\hat{S}_{thresh} = \{x_i : y_i > 120\}$; $R(\hat{S}_{thresh}) - R(S) = 8{,}598$. (e) Level set estimated with the proposed method; $R(\hat{T}_n) - R(S) = 1{,}104$. (f) Level set estimated by TI Haar wavelet denoising followed by thresholding; $R(\hat{S}_{wavelet}) - R(S) = 1{,}472$.
APPENDIX A. PROOFS OF RESULTS

A.1. Proof of error bounds based on a square root penalty

Proof of Lemma 2.1. First, recall Hoeffding's inequality:

THEOREM A.1. [14] Let the random variables $X_1, X_2, \ldots, X_n$ be independent, with $0 \le X_k \le 1$ for each $k$. Let $S_n = \sum_k X_k$ and let $\mu = E[S_n]$. Then for any $\epsilon > 0$,

$$ P[\mu - S_n \ge \epsilon] \le e^{-2\epsilon^2/n}. $$

Applying Hoeffding's inequality to this problem with $X_i = e_T(x_i)$, $S_n = n\hat{R}_n(T)$, $\mu = nR(T)$, and $\epsilon = \sqrt{(n/2)\log(1/\delta_T)}$ for some $\delta_T \in [0,1]$, we have that for a given tree $T$,

$$ P\left[ nR(T) - n\hat{R}_n(T) \ge \sqrt{\frac{n \log(1/\delta_T)}{2}} \right] = P\left[ R(T) - \hat{R}_n(T) \ge \sqrt{\frac{\log(1/\delta_T)}{2n}} \right] \le \delta_T. $$

Next let $\delta_T \equiv \delta 2^{-|T|(\log_2 d + 1)}$. We will see that $|T|(\log_2 d + 1)$ is the length of a good prefix code for $T$, and so we will be able to use the Kraft inequality to state

$$ \sum_{T \in \mathcal{T}_M} \delta 2^{-|T|(\log_2 d + 1)} \le \delta. $$

In particular, note that to prefix encode a binary tree structure, $|T| \log_2 d$ bits would be required, since (a) a binary tree with $|T|$ leaf nodes will have $|T| - 1$ internal nodes, and (b) each binary split can take place in one of $d$ coordinate directions. An additional $|T|$ bits would be required to label each leaf node of $T$. (Note that this analysis holds for trees where the coordinate direction of each split is a free parameter. If "cyclic trees", as described in [4], were employed instead, the prefix code would need to be different but the rest of this proof would remain the same.)
Figure 5. Simulation results. (a) True function $f : [0,1]^2 \to [0, 255]$. (b) True level sets with $\gamma_1 = 90$ and $\gamma_2 = 180$. (c) Noisy observations, $y_i \in [-95, 350]$, $i = 1, \ldots, 512^2$. (d) Level sets of the observations; $R(\hat{S}_{thresh}) - R(S) = 7{,}537$. (e) Level sets estimated with the proposed method; $R(\hat{T}_n) - R(S) = 783$. (f) Level sets estimated by TI Haar wavelet denoising followed by thresholding; $R(\hat{S}_{wavelet}) - R(S) = 1{,}055$.
Let $E_T$ denote the event that

$$ R(T) \ge \hat{R}_n(T) + \sqrt{\frac{\log(1/\delta) + |T|(\log_2 d + 1)\log 2}{2n}}. $$

By the union bound,

$$ P\left[ \bigcup_{T \in \mathcal{T}_M} E_T \right] \le \sum_{T \in \mathcal{T}_M} \delta 2^{-|T|(\log_2 d + 1)} \le \delta. $$
A.2. Proof of error bounds based on a spatially adaptive penalty

Proof of Lemma 2.2. First, recall the relative form of Hoeffding's inequality:

THEOREM A.2. [10] Let the random variables $X_1, X_2, \ldots, X_n$ be independent, with $0 \le X_k \le 1$ for each $k$. Let $S_n = \sum_k X_k$ and let $\mu = E[S_n]$. Then for any $\epsilon > 0$,

$$ P[S_n \le (1 - \epsilon)\mu] \le e^{-\epsilon^2 \mu / 2}. $$

Applying this inequality to this problem with $X_i = e_T(x_i)$ for $x_i \in A$, $S_n = n\hat{R}_n(A)$, $\mu = nR(A)$, and $\epsilon = \sqrt{2\log(1/\delta_A)/\mu}$, we have

$$ P\left[ n\hat{R}_n(A) \le \left( 1 - \sqrt{\frac{2\log(1/\delta_A)}{nR(A)}} \right) nR(A) \right] = P\left[ nR(A) - n\hat{R}_n(A) \ge nR(A)\sqrt{\frac{2\log(1/\delta_A)}{nR(A)}} \right] = P\left[ R(A) - \hat{R}_n(A) \ge \sqrt{\frac{2\log(1/\delta_A)R(A)}{n}} \right] \le \delta_A. $$
Proof of Lemma 2.3. (This closely follows the proof of Theorem 2 in [4], but since the risk cannot be expressed in terms of probability measures in this context, some technical details are different.) Applying Lemma 2.2 with $\delta_A = \delta 2^{-([\![A]\!]+1)}$, so that $\log(1/\delta_A) = ([\![A]\!]+1)\log 2 + \log(1/\delta) = [\![A]\!]\log 2 + \log(2/\delta)$, we have that for a particular $T$ and $A$,

$$ R(A) - \hat{R}_n(A) \ge \sqrt{\frac{2R(A) \left( [\![A]\!]\log 2 + \log(2/\delta) \right)}{n}} $$

with probability not exceeding $\delta 2^{-([\![A]\!]+1)}$. We now wish to show that a similar bound holds uniformly for all $T \in \mathcal{T}_M$ and all $A \in \pi(T)$ by applying the union bound. However, we want to avoid summing over redundant $(T, A)$ pairs, since this would introduce slack into the bound. Note that for a given $A \in \mathcal{A}$, $T$ has the sole effect of assigning a label to $A$, indicating whether it is estimated to be in $S$. Thus we can sum over all $A \in \mathcal{A}$ and, for each $A$, over just two trees which assign different labels to $A$. Applying the union bound in this manner, and letting $E_A$ denote the event that

$$ R(A) - \hat{R}_n(A) \ge \sqrt{\frac{2R(A) \left( [\![A]\!]\log 2 + \log(2/\delta) \right)}{n}}, $$

we have

$$ P\left[ \bigcup_{T \in \mathcal{T}_M} \bigcup_{A \in \pi(T)} E_A \right] \le \sum_{A \in \mathcal{A}} \sum_{\mathrm{label} \in \{0,1\}} \delta 2^{-([\![A]\!]+1)} = \sum_{A \in \mathcal{A}} \delta 2^{-[\![A]\!]} \le \delta, $$

where the last inequality follows from the Kraft inequality, which is applicable since $[\![A]\!]$ is a prefix codelength.
References

[1] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1983.
[2] R. Willett and R. Nowak, "Platelets: a multiscale approach for recovering edges and surfaces in photon-limited medical imaging," IEEE Transactions on Medical Imaging 22(3), pp. 332–350, 2003.
[3] D. Donoho, "Wedgelets: Nearly minimax estimation of edges," Ann. Statist. 27, pp. 859–897, 1999.
[4] C. Scott, Dyadic Decision Trees. PhD thesis, Rice University, 2004.
[5] R. Willett and R. Nowak, "Multiscale Poisson intensity and density estimation." Submitted to IEEE Trans. Info. Th., 2005.
[6] R. Willett and R. Nowak, "Platelets: a multiscale approach for recovering edges and surfaces in photon-limited medical imaging," IEEE Transactions on Medical Imaging 22(3), pp. 332–350, 2003.
[7] E. Mammen and A. Tsybakov, "Asymptotical minimax recovery of sets with smooth boundaries," Annals of Statistics 23(2), pp. 502–524, 1995.
[8] L. Cavalier, "Nonparametric estimation of regression level sets," Statistics 29, pp. 131–160, 1997.
[9] A. Tsybakov, "On nonparametric estimation of density level sets," Annals of Statistics 25(3), pp. 948–969, 1997.
[10] C. McDiarmid, "Concentration," in Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195–248, Springer, Berlin, 1998.
[11] G. Blanchard, C. Schäfer, and Y. Rozenholc, "Oracle bounds and exact algorithm for dyadic classification trees," in Proceedings of COLT: The Annual Workshop on Learning Theory, pp. 378–392, Banff, Canada, 2004.
[12] R. Willett and R. Nowak, "Fast multiresolution photon-limited image reconstruction," in Proc. IEEE Int. Sym. Biomedical Imaging — ISBI '04, Arlington, VA, USA, 15–18 April 2004.
[13] A. Sole, V. Caselles, G. Sapiro, and F. Arandiga, "Morse description and geometric encoding of digital elevation maps," IEEE Trans. on Im. Proc. 13(9), pp. 1245–1262, 2004.
[14] W. Hoeffding, "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc. 58, pp. 713–721, 1963.